Solved: error when restoring some files

Rami_Nasser1 · ‎06-10-2012

Please find below the error we get while trying to restore some files (we tried this about 3 times, from a backup taken from the same server and from another server). Please help ASAP:

6/10/2012 8:37:12 PM - begin Restore

6/10/2012 8:37:22 PM - 1 images required

6/10/2012 8:37:22 PM - media 041AIA required

6/10/2012 8:37:26 PM - restoring image dbprod_1339183543

6/10/2012 8:37:28 PM - requesting resource 041AIA

6/10/2012 8:37:29 PM - connecting

6/10/2012 8:37:32 PM - connected; connect time: 00:00:03

6/10/2012 8:37:32 PM - started process bptm (10028)

6/10/2012 8:37:32 PM - mounting 041AIA

6/10/2012 8:37:32 PM - granted resource 041AIA

6/10/2012 8:37:32 PM - granted resource IBMULT3580-TD22

6/10/2012 8:38:01 PM - mounted; mount time: 00:00:29

6/10/2012 8:38:01 PM - positioning 041AIA to file 1

6/10/2012 8:38:01 PM - positioned 041AIA; position time: 00:00:00

6/10/2012 8:38:01 PM - begin reading

6/10/2012 8:42:32 PM - positioning 041AIA to file 2

6/10/2012 8:42:32 PM - positioned 041AIA; position time: 00:00:00

6/10/2012 8:46:24 PM - Error bptm(pid=8960) cannot write data to socket, 10053

6/10/2012 8:46:24 PM - Error bptm(pid=8960) The following files/folders were not restored:

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/applsysd06.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/akx01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/amfx01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/asfd01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/applsysd02.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/arx01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/astx01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/azd01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/ahmd01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) UTF - /u04/oracledb/prdn/prdndata/bicx01.dbf

6/10/2012 8:46:24 PM - Error bptm(pid=8960) more than 10 files were not restored, remaining ones are shown in the progress log.

6/10/2012 9:16:58 PM - restored image dbprod_1339183543 - (socket write failed(24)); restore time 00:39:32

6/10/2012 9:16:58 PM - Warning bprd(pid=8648) Restore must be resumed prior to first image expiration on 9/9/2012 10:25:43 PM

6/10/2012 9:16:58 PM - end Restore; elapsed time: 00:39:46

the restore failed to recover the requested files(5)

mph999 · ‎06-10-2012

Here is just about every possible cause of Status 24 that I am aware of. Apologies, but from the NBU side it is virtually impossible to troubleshoot, as we have no details of what has happened apart from the fact that the network is unavailable. The big clue is that the network is unavailable, so this is not likely to be a NetBackup issue.

Really, all we can do is a 'process of elimination'.

1. How many clients have this error

2. Did this client previously work

3. What was changed

4. Does it write some data then fail

5. Does it always fail at the same point

6. Operating system of client

7. Operating system of media server

8. NetBackup version

9. Logs from media server - bptm and bpbrm, from client tar, bpcd
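For questions 4 and 5, the job details already carry timestamps, so one rough way to compare attempts is to measure how long the restore was reading before the first error appeared. A minimal Python sketch, assuming the job-detail text format shown in the original post (the function name is only illustrative, not part of NetBackup):

```python
from datetime import datetime

STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. 6/10/2012 8:38:01 PM


def seconds_until_first_error(job_lines):
    """Return seconds between 'begin reading' and the first 'Error' line, or None."""
    started = None
    for line in job_lines:
        # Job-detail lines look like "<timestamp> - <message>".
        stamp, sep, message = line.partition(" - ")
        if not sep:
            continue  # not a timestamped job-detail line
        try:
            when = datetime.strptime(stamp.strip(), STAMP_FORMAT)
        except ValueError:
            continue  # text before " - " was not a timestamp
        if started is None and message.startswith("begin reading"):
            started = when
        elif started is not None and message.startswith("Error"):
            return (when - started).total_seconds()
    return None
```

If the figure comes out nearly the same on every attempt, that may hint at some kind of timeout; if it varies a lot, something intermittent is more likely. Either way it is only a hint.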

In my experience, Status 24 is hardly ever NBU (in fact, I don't think I have ever seen a status 24 failure caused by NetBackup myself)

Something below normally fixes it ... Yes, it is a lot to read, and will probably take a number of hours to go through.

If this is a Windows client, a very common cause is the TCP Chimney settings - http://www.symantec.com/docs/TECH55653

I have given a number of technotes below (the odd one may be 'internal' only), and have shown a summary of the solutions, as well as the odd extra note.

http://www.symantec.com/docs/TECH124766

TCP Windows scaling was disabled (Operating system setting)

http://www.symantec.com/docs/TECH76201

Possible solution to Status 24 by increasing TCP receive buffer space

http://www.symantec.com/docs/TECH34183

This technote, although written for Solaris, shows how TCP tunings can cause status 24s. I am sure your system admins will be aware of the corresponding setting for the Windows operating system.

http://www.symantec.com/docs/TECH55653

This technote is very important. It covers many issues that can occur when either TCP Chimney Offload, TCP/IP Offload Engine (TOE) or TCP Segmentation Offload (TSO) are enabled. It is recommended to disable these, as per the technote.

I also understand that we have previously seen MS Patch KB92222 resolve status 24 issues.

http://www.symantec.com/docs/TECH150369

A write operation to a socket failed. These are possible causes for this issue:

A high network load.

Intermittent connectivity.

Packet reordering.

Duplex Mismatch between client and master server NICs.

Small network buffer size

http://support.microsoft.com/kb/942861

SOLUTION/WORKAROUND:

Contact the hardware vendor of the NIC for the latest updates for their product as a resolution.

This problem occurs when the TCP Chimney Offload feature is enabled on the NetBackup Windows 2003 client. Disable this feature to workaround this problem.

To do this, at a command prompt, enter the following:

Netsh int ip set chimney DISABLED

http://www.symantec.com/docs/TECH127930

The above messages almost always indicate a networking issue of some sort. In this case it was due to a faulty switch. There are rare occasions when the above messages are not caused by a networking issue, such as those addressed in http://www.symantec.com/docs/TECH72115.

(TECH72115 is not relevant to you, this was an issue with a SAN client, fixed in 6.5.4)

But note, the technote says the issue is 'almost always' network related; this can also include operating system settings.

http://www.symantec.com/docs/TECH145223

The issue was with the idle timeout setting on the firewall, which was too low to allow backups and/or restores to complete. With the DMZ media server backing up a DMZ client, the media server sends only the occasional metadata update back to the master server in order to update the images catalog. If that TCP socket connection between the media server and master server is idle for longer than the firewall's idle timeout, the firewall breaks the connection between the media server and master server, and thus the media server breaks the connection to the client, producing the socket error.

Increasing the idle timeout setting on the firewall to a value larger than the amount of time a typical backup takes to complete should resolve the issue.

Increasing the frequency of the TCP keepalive packets from the server's defaults can also help maintain the socket during idle periods.

Although you may not have a firewall between the client and the media server, this solution is another demonstration that the issue is network related, as opposed to NetBackup.
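To illustrate what "TCP keepalive" means at the socket level, here is a small Python sketch. It is not a NetBackup setting: NetBackup relies on the operating system's keepalive tunables, as described in the technote. The function name and the timer values are assumptions chosen only for the example.

```python
import socket


def enable_keepalive(sock, idle_seconds=60, interval_seconds=10, probes=5):
    """Turn on TCP keepalive; tune the timers only where the platform exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* constants are platform-dependent, so each is set only if present.
    for name, value in (
        ("TCP_KEEPIDLE", idle_seconds),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_seconds),  # gap between probes
        ("TCP_KEEPCNT", probes),              # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

The point is that the idle time before the first probe has to be shorter than the firewall's idle timeout, otherwise the firewall drops the connection before any keepalive packet is sent.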

http://www.symantec.com/docs/S:TECH130343 (Internal technote)

The issue was found to be due to NIC card Network congestion (that is, network overloaded)

http://www.symantec.com/docs/TECH135924 (I think this one I sent previously, shows the MS fix for the issue)

In this instance, the problem was isolated to this single machine making the point of failure isolated to the problematic new host.

If the problem is due to an unidentified corruption / misconfiguration in the new media server's TCP Stack and Winsock environment (as was the case in this example), executing these two commands, followed by a reboot will resolve the problem:

netsh int ip reset resetlog.txt Microsoft Reference: http://support.microsoft.com/kb/299357

netsh winsock reset catalog Microsoft Reference: http://technet.microsoft.com/en-us/library/cc759700(WS.10).aspx

NOTE: The above two commands will reset the Windows TCP Stack as well as the Windows Winsock environment back to the default values. This means that if the host is configured with a static IP Address and other customized TCP settings, they will be lost and will need to be re-entered after the reboot. The default TCP setting is to use DHCP and the host will be using DHCP upon booting up.

There are 2 possible issues that could be NBU related that could cause this:

1. Client NBU version is higher than the media server

2. Make sure the communications buffer is not too high (http://www.symantec.com/docs/TECH60570)

What to do next:

http://www.symantec.com/docs/TECH135924 (mentioned before, MS suggested fix)

http://www.symantec.com/docs/TECH60570 (communications buffer, mentioned above)

http://www.symantec.com/docs/TECH60844

If these do not resolve the situation, I would recommend you talk with the Operating system vendor. In summary, apart from the Client version of software and the communication buffer size (set in host properties) I can find no other cause that could be NBU. However, from the very detailed research I have done, I can find many many causes that are the network or operating system.

Martin

Rami_Nasser1 · ‎06-10-2012

I appreciate your response. Working through your proposed solutions will really take me many hours.

Rami_Nasser1 · ‎06-10-2012

Please advise.

Don't you think it looks like a write permission issue? Should we initiate the restore as a user having admin/DB rights?

Marianne · ‎06-10-2012

A lot of status 24 errors are related to specific OS.

Please share all of the following:

OS on master, media server and client
NBU version on master, media server and client
Type of restore - normal filesystem restore of database files?
Restore to Source client or different client?
Restore to same or different location?

Please ensure all of following log folders exist before trying another restore:

On master: bprd (restart NBU to enable this log)
On media server: bptm and bpbrm
On client: bpcd and tar

Please rename the log files to reflect the process name (e.g. bprd.txt) and post as attachments.
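The legacy log folders listed above have to exist before the processes will write to them. A minimal Python sketch for creating them, assuming the usual UNIX log location /usr/openv/netbackup/logs on the media server and client (on a Windows master the path depends on the install directory); the helper name is hypothetical:

```python
import os


def ensure_log_dirs(base_dir, process_names):
    """Create one log folder per process name; return the ones newly created."""
    created = []
    for name in process_names:
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

For example, on the media server: ensure_log_dirs("/usr/openv/netbackup/logs", ["bptm", "bpbrm"]), and on the client the same call with ["bpcd", "tar"].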

Handy NetBackup Links

Rami_Nasser1 · ‎06-10-2012

OS on master is Windows 2008; media server and all clients are AIX 6.1

NBU version 6.5.6 for all

Type of files: DB files (Oracle)

Same location

mph999 · ‎06-10-2012

"appreciate your response . from your proposed solution i need really many hours to try solving this issue"

Yes, I am sorry that this is the case.

Almost certainly, Status 24 is not NetBackup, it is a network problem. As Marianne shows, many status 24 errors are related to the operating system (OS network settings etc ...) - and as you will see, many of the details I posted are OS related.

You will not find the reason for the problem in the NBU logs; all you will see is 'cannot write to socket' - NBU does not know what the problem is, as it is outside NBU.

BUT, I have shown two things that can cause the issue: wrong client version and network comms buffer size (but I have never actually seen this cause a status 24 myself)

So, you are really back to working through the details I posted.

You could wait, someone will come along and suggest do xxx - this might be in the list of details I have posted and you might get lucky and it is the solution, but how long do you want to wait ?

Martin

rookie11 · ‎06-11-2012

Please create a tar directory on the client system under /usr/openv/netbackup/logs/

Try to restore only 1 .dbf file from the requested set of files.

Also make sure there is enough disk space at the client end.
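The disk space check can be scripted if you want to repeat it before each attempt. A rough Python sketch, where the function name and the 10% safety margin are my own assumptions:

```python
import shutil


def has_room_for_restore(target_dir, bytes_needed, safety_margin=0.10):
    """True if target_dir's filesystem has bytes_needed plus a margin free."""
    free = shutil.disk_usage(target_dir).free
    return free >= bytes_needed * (1 + safety_margin)
```

For the restore in this thread, target_dir would be /u04/oracledb/prdn/prdndata and bytes_needed the total size of the selected .dbf files. When restoring over existing files the estimate is conservative, since the files being replaced already occupy part of that space.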

Marianne · ‎06-11-2012

All I am expecting to see in the logs is where the break in communication is - master, media server or client.

If we can pinpoint the break in communication, you can take it up with relevant server owner.
This error looks like comms problem somewhere between media server and client:
Error bptm(pid=8960) cannot write data to socket, 10053

Are you doing db backups using Standard policy, or Oracle policy with RMAN?

One more thing - double-check the exact W2008 version. If R2, it is only supported as from 7.x.

Please schedule upgrade of your environment ASAP - NBU 6.x is nearing EOSL.

Rami_Nasser1 · ‎06-11-2012

We are using a Standard policy, but the issue occurs on restore! Please advise.

Rami_Nasser1 · ‎06-11-2012

Windows 2008 enterprise Service Pack 2

Marianne · ‎06-11-2012

As per my post earlier today:

Please ensure all of following log folders exist before trying another restore:

On master: bprd (restart NBU to enable this log)
On media server: bptm and bpbrm
On client: bpcd and tar

Please rename the log files to reflect the process name (e.g. bprd.txt) and post as attachments.

You will need logs to pinpoint break in communication.

At this point it seems like media server -> client comms problem.

In what state is Oracle when doing file-level backups? Hopefully down?

In what state is Oracle when attempting restore to same location? Hopefully down?

PraveenCH · ‎06-11-2012

As the files you are trying to restore are Oracle database files going to the original location, there is a chance Oracle is still using the existing versions of those files.

Try restoring to an alternate location once. If it works, you can move the files manually later when Oracle is not using these files or when you take the database down.

As you took a file-level backup instead of an agent backup, NetBackup does not have permission to write the dbf files in the original location.

--Praveen

Mark_Solutions · ‎06-11-2012

Your job details show the following two lines:

6/10/2012 8:37:32 PM - connected; connect time: 00:00:03

6/10/2012 8:46:24 PM - Error bptm(pid=8960) cannot write data to socket, 10053

This indicates that it can connect to the client but cannot write data to the specified location - the 10053 network error is just spurious so ignore it - this is a data write issue.

It will be caused by one of 2 things:

1. rights to place files in that location

2. Inability to overwrite the files - as Praveen says, they may be live and locked database files - so either restore them to an alternate location or, as long as you are totally sure about what you are doing, take down Oracle first to release the lock and then restore them. But be very careful - this sounds like a Production database!
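A quick way to rule out point 1 is to check the target paths from the client before retrying. The Python sketch below is only an illustration (the helper name is made up). It has to be run as the same account the restore writes with, and it cannot tell whether Oracle has a file open; for that you would need an OS tool such as fuser on the client.

```python
import os


def restore_target_problems(file_paths):
    """Map each path to a reason the current user may be unable to restore it."""
    problems = {}
    for path in file_paths:
        folder = os.path.dirname(path) or "."
        if not os.path.isdir(folder):
            problems[path] = "target directory does not exist"
        elif not os.access(folder, os.W_OK | os.X_OK):
            problems[path] = "no write access to target directory"
        elif os.path.exists(path) and not os.access(path, os.W_OK):
            problems[path] = "existing file is not writable"
    return problems
```

An empty result only means the basic permissions look fine for that user. Note that for root these access checks normally pass regardless of file modes, so a locked or in-use file remains a possibility.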

Hope this helps

Rami_Nasser1 · ‎06-11-2012

I appreciate you all, and will work on all the details ASAP.
