sivachandraredd
12 years ago
Level 3
Backup jobs fail with different status codes: 13, 24 and 42
Hi,
I need help.
In my backup environment, backup jobs for some clients fail with different status codes: 13, 24 and 42.
Master server OS = Solaris 10
Media server OS = Solaris 10
...
- 12 years ago

Here is just about every possible cause of Status 24 that I am aware of. Apologies, but from the NBU side it is virtually impossible to troubleshoot, as we have no details of what has happened apart from the fact that the network is unavailable. The big clue is that the network is unavailable, so this is not likely to be a NetBackup issue.

Often, all we can do is a 'process of elimination'.

We cannot begin to help without proper details of the issue:

1. How many clients have this error
2. Did this client previously work
3. What was changed
4. Does it write some data then fail
5. Does it fail at the very beginning of the job
6. Does it always fail at the same point
7. Operating system of client
8. Operating system of media server
9. NetBackup version
10. Logs from media server - bptm and bpbrm, from client bpbkar, bpcd

In my experience, Status 24 is hardly ever NBU (in fact, I don't think I have ever seen a status 24 failure caused by NetBackup myself).

Something below normally fixes it ... Yes, it is a lot to read, and will probably take a number of hours to go through.

If this is a Windows client, a very common cause is the TCP Chimney settings - http://www.symantec.com/docs/TECH55653

I have given a number of technotes below (the odd one may be 'internal' only), and have shown a summary of the solutions, as well as the odd extra note.

http://www.symantec.com/docs/TECH124766
TCP window scaling was disabled (operating system setting)

http://www.symantec.com/docs/TECH76201
Possible solution to Status 24 by increasing TCP receive buffer space

http://www.symantec.com/docs/TECH34183
This technote, although written for Solaris, shows how TCP tunings can cause status 24s. I am sure your system admins will be aware of the corresponding setting for the Windows operating system.

http://www.symantec.com/docs/TECH55653
This technote is very important.
It covers many, many issues that can occur when either TCP Chimney Offload, TCP/IP Offload Engine (TOE) or TCP Segmentation Offload (TSO) are enabled. It is recommended to disable these, as per the technote.

I also understand that we have previously seen MS Patch KB92222 resolve status 24 issues.

http://www.symantec.com/docs/TECH150369
A write operation to a socket failed. These are possible causes for this issue:
A high network load.
Intermittent connectivity.
Packet reordering.
Duplex mismatch between client and master server NICs.
Small network buffer size.

http://support.microsoft.com/kb/942861
SOLUTION/WORKAROUND:
Contact the hardware vendor of the NIC for the latest updates for their product as a resolution.
This problem occurs when the TCP Chimney Offload feature is enabled on the NetBackup Windows 2003 client. Disable this feature to work around this problem. To do this, at a command prompt, enter the following:

Netsh int ip set chimney DISABLED

http://www.symantec.com/docs/TECH127930
The above messages almost always indicate a networking issue of some sort. In this case it was due to a faulty switch. There are rare occasions when the above messages are not caused by a networking issue, such as those addressed in http://www.symantec.com/docs/TECH72115.
(TECH72115 is not relevant to you; this was an issue with a SAN client, fixed in 6.5.4.)
But note, the technote says the issue is 'almost always' network related; this can also include operating system settings.

http://www.symantec.com/docs/TECH145223
The issue was with the idle timeout setting on the firewall, which was too low to allow backups and/or restores to complete. With the DMZ media server backing up a DMZ client, the media server sends only occasional metadata updates back to the master server in order to update the images catalog.
If that TCP socket connection between the media server and master server is idle for longer than the firewall's idle timeout, the firewall breaks the connection between the media server and master server, and thus the media server breaks the connection to the client, producing the socket error.
Increasing the idle timeout setting on the firewall to a value larger than the time a typical backup takes to complete should resolve the issue.
Increasing the frequency of the TCP keepalive packets from the server's defaults can also help maintain the socket during idle periods.
Although you may not have a firewall between the client and the media server, this solution is another demonstration that the issue is network related, as opposed to NetBackup.

http://www.symantec.com/docs/S:TECH130343 (Internal technote)
The issue was found to be due to NIC card network congestion (that is, network overloaded)

http://www.symantec.com/docs/TECH135924 (I think this one I sent previously, shows the MS fix for the issue)
In this instance, the problem was isolated to this single machine, making the point of failure the problematic new host.
If the problem is due to an unidentified corruption / misconfiguration in the new media server's TCP stack and Winsock environment (as was the case in this example), executing these two commands, followed by a reboot, will resolve the problem:

netsh int ip reset resetlog.txt
Microsoft Reference: http://support.microsoft.com/kb/299357

netsh winsock reset catalog
Microsoft Reference: http://technet.microsoft.com/en-us/library/cc759700(WS.10).aspx

NOTE: The above two commands will reset the Windows TCP stack as well as the Windows Winsock environment back to the default values. This means that if the host is configured with a static IP address and other customized TCP settings, they will be lost and will need to be re-entered after the reboot.
The default TCP setting is to use DHCP, and the host will be using DHCP upon booting up.

Unix / Linux

If the error in the bptm log shows:

22:32:44.968 [35717358] <16> write_to_out: cannot write data to socket, There is no process to read data written to a pipe.

Check the ulimit -a output. nofiles should be set to at least 8192.

There are 2 'common' issues that could be NBU related that could cause this:
1. Client NBU version is higher than the media server
2. Make sure the communications buffer is not too high (http://www.symantec.com/docs/TECH60570)

What to do next:
http://www.symantec.com/docs/TECH135924 (mentioned before, MS suggested fix)
http://www.symantec.com/docs/TECH60570 (communications buffer, mentioned above)
http://www.symantec.com/docs/TECH60844

If these do not resolve the situation, I would recommend you talk with the operating system vendor.

In summary, apart from the client version of software and the communication buffer size (set in host properties), I can find no other cause that could be NBU. However, from the very detailed research I have done, I can find many, many causes that are the network or operating system.

Martin
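
For the Unix / Linux check mentioned above (nofiles of at least 8192), here is a minimal sketch in Python that reads the open-files limit of the current process and compares it with that figure. The function name nofiles_ok and the constant MIN_NOFILES are made up for illustration and are not part of NetBackup. It also assumes the script is run as the same user, and from the same kind of shell, that starts the NetBackup processes, since limits can differ between sessions.

```python
import resource

# Threshold quoted in the post: nofiles should be at least 8192.
MIN_NOFILES = 8192


def nofiles_ok(soft_limit, minimum=MIN_NOFILES):
    """Return True if the soft open-files limit meets the minimum.

    An unlimited setting (resource.RLIM_INFINITY) always passes.
    """
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= minimum


if __name__ == "__main__":
    # getrlimit returns a (soft, hard) pair for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft} hard={hard}")
    if nofiles_ok(soft):
        print(f"OK: soft limit is at least {MIN_NOFILES}")
    else:
        print(f"Too low: raise the soft limit to at least {MIN_NOFILES}")
```

This only reports the limit; it does not change it. The soft value is the one that corresponds to the nofiles line in ulimit -a, and how to raise it permanently depends on the operating system.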