Deb_Techie
11 years ago

NetBackup Status Code 24 - Possible Parameters to Check

Hi Friends

Status Code 24 is becoming very common in our infrastructure, where we back up almost 7000 boxes using 10 different master servers. I know there is a technote available from Symantec for troubleshooting Status Code 24 failures: http://www.symantec.com/business/support/index?page=content&id=HOWTO50801, but most of the time it is not enough to fix the issue.

Can someone please help me understand which parameters need to be checked on the client, media and master servers, and what tuning can be done to fix the issue? Also, is there any tool available to determine the root cause of a Status 24 failure?

 

Help would be highly appreciated.

  • Nicolai and Marianne are right: timeouts can sometimes work around issues, but for status 24 the root cause is almost always outside NBU.

    Generally, if you try to solve status 24 inside NBU, you are likely never to fix it. It is rare for NBU to be the cause of a 24 (not impossible, but very unlikely).

    These are the notes I sent out a while back for a status 24.  As you will see, all the TNs and case examples I gave are outside NBU.  I personally have never seen a 24 caused by NBU, but I have heard about the odd issue, hence my comment that it is possible but unlikely.

    ***************

    If this is a Windows client, a very common cause is the TCP Chimney settings - http://www.symantec.com/docs/TECH55653
     
    I have given a number of technotes below (the odd one may be 'internal' only), with a summary of each solution and the odd extra note.
     
     
    http://www.symantec.com/docs/TECH76201
     
    Possible solution to Status 24 by increasing TCP receive buffer space 
     
    http://www.symantec.com/docs/TECH34183 
    This technote, although written for Solaris, shows how TCP tunings can 
    cause status 24s. I am sure your system admins will be aware of the 
    corresponding settings for the Windows operating system. 
     
    http://www.symantec.com/docs/TECH55653 
    This technote is very important. It covers the many issues that can 
    occur when TCP Chimney Offload, TCP/IP Offload Engine (TOE) or TCP 
    Segmentation Offload (TSO) is enabled. It is recommended to disable 
    these, as per the technote. 
     
    I also understand that we have previously seen MS Patch KB92222 resolve status 24 issues.
     
     
     
    http://www.symantec.com/docs/TECH150369
    A write operation to a socket failed. Possible causes for this issue:
     
    A high network load.
    Intermittent connectivity.
    Packet reordering.
    Duplex Mismatch between client and master server NICs.
    Small network buffer size
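
To make the "small network buffer size" cause a bit more concrete, here is a minimal Python sketch that reads the operating system's default TCP receive buffer for a new socket and then asks for a larger one. It only illustrates the OS-level socket option (SO_RCVBUF); it is not a NetBackup setting, the function name is made up for illustration, and the OS is free to clamp or adjust the size you request.

```python
import socket


def receive_buffer_sizes(requested_bytes):
    """Return (default, effective) SO_RCVBUF sizes for a fresh TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Size the OS hands out when nothing has been tuned.
        default = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        # Ask for a different size; the OS may clamp or round the request.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return default, effective


if __name__ == "__main__":
    default, effective = receive_buffer_sizes(262144)
    print(f"default={default} bytes, after request={effective} bytes")
```

If the value after the request comes back much smaller than what was asked for, a system-wide limit is probably capping it, and that is a conversation for the OS admins rather than something to chase inside NBU.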
     
     
    http://support.microsoft.com/kb/942861 
    SOLUTION/WORKAROUND:
    Contact the hardware vendor of the NIC for the latest updates for their product as a resolution.
     
    This problem occurs when the TCP Chimney Offload feature is enabled on the NetBackup Windows 2003 client.  Disable this feature to work around the problem.
     
    To do this, at a command prompt, enter the following:
    Netsh int ip set chimney DISABLED
     
     
     
    http://www.symantec.com/docs/TECH127930
    The above messages almost always indicate a networking issue of some sort. In this case it was due to a faulty switch. 
    There are rare occasions when the above messages are not caused by a networking issue, such as those addressed in http://www.symantec.com/docs/TECH72115. 
     
    Note that the technote says the issue is 'almost always' network related; this can also include operating system settings.
     
     
    http://www.symantec.com/docs/TECH145223
    The issue was that the idle timeout setting on the firewall was too low to allow backups and/or restores to complete. With a DMZ media server 
    backing up a DMZ client, the media server sends only occasional metadata updates back to the master server in order to update the image catalog. 
    If that TCP socket connection between the media server and master server is idle for longer than the firewall's idle timeout, the firewall breaks 
    the connection between the media server and master server, and the media server in turn breaks the connection to the client, producing the socket error.
    Increasing the idle timeout setting on the firewall to a value larger than the time a typical backup takes to complete should resolve the issue.
    Increasing the frequency of the TCP keepalive packets above the server's defaults can also help maintain the socket during idle periods.
     
    Although you may not have a firewall between the client and the media server, this solution is another demonstration that the issue is network related, as opposed to NetBackup.
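
As a rough illustration of the keepalive idea in that technote, the Python sketch below switches on TCP keepalive for a socket and, where the platform exposes the timers, sets them so probes go out well before a firewall idle timeout would expire. The function name and the 300/60/5 values are assumptions for the example only. NetBackup itself relies on the operating system's keepalive settings, so treat this as a picture of the mechanism rather than something you configure in NBU.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Turn on TCP keepalive and tune timers where the platform allows it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}
    # These per-socket timer options do not exist on every platform,
    # so each one is looked up and applied only if it is available.
    wanted = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the drop
    )
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
        except OSError:
            pass  # option is defined but rejected by this OS
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(enable_keepalive(s))
```

The point to take away is the relationship between the numbers: the idle time before the first probe has to be shorter than the firewall's idle timeout, otherwise the firewall still drops the connection first.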
     
    http://www.symantec.com/docs/S:TECH130343  (Internal technote)
     
    The issue was found to be due to NIC network congestion (that is, the network was overloaded).
     
     
    http://www.symantec.com/docs/TECH135924  
     
    In this instance, the problem was isolated to a single machine, which made the problematic new host the point of failure.
     
    If the problem is due to an unidentified corruption / misconfiguration in the new media server's TCP stack and Winsock environment (as was the case in this example), 
    executing these two commands, followed by a reboot, will resolve the problem:
     
    netsh int ip reset resetlog.txt   Microsoft Reference:  http://support.microsoft.com/kb/299357 
    netsh winsock reset catalog    Microsoft Reference:  http://technet.microsoft.com/en-us/library/cc759700(WS.10).aspx 
     
    NOTE: The above two commands will reset the Windows TCP Stack as well as the Windows Winsock environment back to the default values.  This means that if the host
    is configured with a static IP Address and other customized TCP settings, they will be lost and will need to be re-entered after the reboot.  The default TCP setting 
    is to use DHCP and the host will be using DHCP upon booting up.
     
     
    There are 2 possible issues inside NBU that could cause this:
     
    1.  The client NBU version is higher than the media server's
    2.  The communications buffer is set too high (http://www.symantec.com/docs/TECH60570)
     
    What to do next:
     
    http://www.symantec.com/docs/TECH135924  (mentioned before, MS suggested fix)
    http://www.symantec.com/docs/TECH60570  (communications buffer, mentioned above)
    http://www.symantec.com/docs/TECH60844  (network connectivity tuning to avoid network read/write failures and increase performance; covers TCP Chimney)
     
     
    If these do not resolve the situation, I would recommend you talk with the operating system vendor.  In summary, apart from the client software version 
    and the communication buffer size (set in host properties), I can find no other cause that could be NBU.  However, from the very detailed research I have done, 
    I can find many causes that lie in the network or operating system.
  • There can be a ton of reasons for this error code, and I don't think one setting will fix them all.

    One setting I do want you to check, however, is CLIENT_READ_TIMEOUT. If it is not configured, set it to 1800 seconds (½ hour) on master/media and clients. Try it out on the frequent failures and see if it makes any difference.

    http://www.symantec.com/docs/HOWTO86358
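
If you want to check many UNIX/Linux hosts quickly, something like the Python sketch below can report whether CLIENT_READ_TIMEOUT is present in bp.conf-style text. It assumes the usual `KEY = value` layout; the function name and the `master01` host name in the sample are made up for illustration. On Windows the setting lives in host properties rather than a bp.conf file, so this would not apply there as written.

```python
def read_timeout_from_bpconf(text, key="CLIENT_READ_TIMEOUT"):
    """Return the integer value of `key` in bp.conf-style text, or None."""
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        # Only lines of the form "KEY = value" are considered.
        if sep and name.strip() == key:
            return int(value.strip())
    return None


# Hypothetical bp.conf content, for illustration only.
sample = "SERVER = master01\nCLIENT_READ_TIMEOUT = 1800\n"
print(read_timeout_from_bpconf(sample))  # 1800
```

A return value of None means the entry is not configured in that text, which is the case where setting it explicitly is worth a try.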

  • Status 24 is normally caused by issues outside of NBU.

    So trying to solve it from within NBU is basically impossible.

    The best you can do is understand the process flow and work out where the failure is occurring:
    On the client side
    On the media server 
    On the master server

    Then start working with that server's owner to determine whether this is an OS/NIC/network issue...
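
As a first, very rough check of each leg (client, media server, master server), a small Python probe like the one below can confirm basic TCP reachability. The port numbers are the commonly documented NetBackup ones (PBX 1556, vnetd 13724, bpcd 13782), but confirm them against your own environment; the function names are made up for illustration. A successful connect only shows that the port is reachable at that moment. It says nothing about whether a long-running transfer will survive, which is where status 24 usually shows up.

```python
import socket

# Commonly documented NetBackup ports; confirm against your own environment.
NBU_PORTS = {"pbx": 1556, "vnetd": 13724, "bpcd": 13782}


def probe(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def probe_all(host, ports=NBU_PORTS):
    """Probe every named port on host and return {name: reachable}."""
    return {name: probe(host, port) for name, port in ports.items()}
```

Run it from each side in turn (for example from the media server towards the client, then from the media server towards the master) so you know which leg to take to which server owner.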

  • Deb, log a call with Symantec and ask them to send you the AppCritical tool. This will check the network between a source host and a target and tell you if any network issues are seen.

     

  • You are welcome. Apologies, the post is copy/paste from my 'personal' notes, so some of it may not apply to you.  It really was meant as a demonstration, following on from Marianne's post, that 24 can be more or less anything.

    NetBackup has very little control over the network; it sits above the network and pretty much uses what is available. If there are issues, you start to see network-related error codes.

    Revaroo makes a good point (as always): AppCritical will evaluate the network and highlight anything it finds. On average, I would say its results are very accurate; I've never personally seen it be 'wrong' in about 6 years of using it.

    I believe there is a free version of a tool similar to AppCritical that has good reviews; if I can find out what it is I'll let you know.  I saw it mentioned somewhere ages ago, but foolishly didn't make a note of what it was.