Solved: Netbackup failure with code 24

Planters · ‎09-10-2012

Hi everyone,

One of my backups is failing with error code 24 (socket write failed). Backups of this server ran fine before without any issues, but now they fail with error 24. I have seen similar errors before, and they resolved themselves. Now I want to find the root cause of this issue, because I will receive the same error if we restart this backup now. I hope this will be resolved by tomorrow and that there won't be any issues with this client server afterwards.

The job writes some data and then we receive error 24.

Version: Netbackup 6.5.6

Please advise.

mph999 · ‎09-10-2012

Just ignore the questions and sections that are for the wrong OS for your client. The issue is usually something in here (not always, but usually ...)

We cannot begin to help without proper details of the issue :

1. How many clients have this error

2. Did this client previously work

3. What was changed

4. Does it write some data then fail

5. Does it fail at the very beginning of the job

6. Does it always fail at the same point

7. Operating system of client

8. Operating system of media server

9. NetBackup version

10. Logs from media server - bptm and bpbrm, from client bpbkar, bpcd

In my experience, Status 24 is hardly ever NBU (in fact, I don't think I have ever seen a status 24 failure caused by NetBackup myself)

Something below normally fixes it ... Yes, it is a lot to read, and will probably take a number of hours to go through.

If this is a Windows client, a very common cause is the TCP Chimney settings - http://www.symantec.com/docs/TECH55653

I have given a number of technotes below (the odd one may be 'internal' only), and have shown a summary of the solutions, as well as the odd extra note.

http://www.symantec.com/docs/TECH124766

TCP window scaling was disabled (operating system setting)

http://www.symantec.com/docs/TECH76201

Possible solution to Status 24 by increasing TCP receive buffer space

http://www.symantec.com/docs/TECH34183

This technote, although written for Solaris, shows how TCP tunings can cause status 24s. I am sure your system admins will be aware of the corresponding setting for the Windows operating system.

http://www.symantec.com/docs/TECH55653

This technote is very important. It covers many, many issues that can occur when either TCP Chimney Offload, TCP/IP Offload Engine (TOE) or TCP Segmentation Offload (TSO) are enabled. It is recommended to disable these, as per the technote.

I also understand that we have previously seen MS Patch KB92222 resolve status 24 issues.

http://www.symantec.com/docs/TECH150369

A write operation to a socket failed. These are possible causes of this issue:

A high network load.

Intermittent connectivity.

Packet reordering.

Duplex Mismatch between client and master server NICs.

Small network buffer size

http://support.microsoft.com/kb/942861

SOLUTION/WORKAROUND:

Contact the hardware vendor of the NIC for the latest updates for their product as a resolution.

This problem occurs when the TCP Chimney Offload feature is enabled on the NetBackup Windows 2003 client. Disable this feature to work around this problem.

To do this, at a command prompt, enter the following:

Netsh int ip set chimney DISABLED

http://www.symantec.com/docs/TECH127930

The above messages almost always indicate a networking issue of some sort. In this case it was due to a faulty switch. There are rare occasions when the above messages are not caused by a networking issue, such as those addressed in http://www.symantec.com/docs/TECH72115.

(TECH72115 is not relevant to you, this was an issue with a SAN client, fixed in 6.5.4)

But note, the technote says the issue is 'almost always' network related, this can also include operating system settings.

http://www.symantec.com/docs/TECH145223

The issue was that the idle timeout setting on the firewall was too low to allow backups and/or restores to complete. With a DMZ media server backing up a DMZ client, the media server sends only occasional metadata updates back to the master server in order to update the image catalog. If the TCP socket connection between the media server and master server is idle for longer than the firewall's idle timeout, the firewall breaks the connection between the media server and master server. The media server then breaks its connection to the client, producing the socket error.

Increasing the idle timeout setting on the firewall to a value larger than the amount of time a typical backup takes to complete should resolve the issue.

Increasing the frequency of the TCP keepalive packets from the server's defaults can also help maintain the socket during idle periods.

Although you may not have a firewall between the client and the media server, this solution is another demonstration that the issue is network related, as opposed to NetBackup.
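
To make the keepalive idea above concrete, here is a minimal Python sketch of what "sending keepalives more often" means at the level of a single TCP socket. It is not NetBackup code and not how NetBackup is configured (the technote is about operating system defaults); the function name and the 600/60/5 values are assumptions chosen purely for illustration.

```python
import socket

def enable_keepalive(sock, idle_seconds=600, interval_seconds=60, probes=5):
    """Turn on TCP keepalive for one socket and, where the platform exposes
    the options, shorten the idle time before the first probe.

    The default values are illustrative assumptions, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are not available on every
    # platform, so each one is applied only if Python exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle_seconds),       # idle time before first probe
        ("TCP_KEEPINTVL", interval_seconds),  # gap between probes
        ("TCP_KEEPCNT", probes),              # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print("keepalive enabled:", keepalive_on)
```

The point is only that the idle time before the first probe has to be shorter than the firewall's idle timeout, otherwise the firewall drops the connection before any keepalive is sent.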

http://www.symantec.com/docs/S:TECH130343 (Internal technote)

The issue was found to be due to network congestion on the NIC (that is, the network was overloaded)

http://www.symantec.com/docs/TECH135924 (I think this one I sent previously, shows the MS fix for the issue)

In this instance, the problem was isolated to this single machine making the point of failure isolated to the problematic new host.

If the problem is due to an unidentified corruption / misconfiguration in the new media server's TCP Stack and Winsock environment (as was the case in this example), executing these two commands, followed by a reboot will resolve the problem:

netsh int ip reset resetlog.txt
(Microsoft Reference: http://support.microsoft.com/kb/299357)

netsh winsock reset catalog
(Microsoft Reference: http://technet.microsoft.com/en-us/library/cc759700(WS.10).aspx)

NOTE: The above two commands will reset the Windows TCP Stack as well as the Windows Winsock environment back to the default values. This means that if the host is configured with a static IP Address and other customized TCP settings, they will be lost and will need to be re-entered after the reboot. The default TCP setting is to use DHCP and the host will be using DHCP upon booting up.

Unix/ Linux

If the error in bptm log shows :

22:32:44.968 [35717358] <16> write_to_out: cannot write data to socket, There is no process to read data written to a pipe.

Check the ulimit -a output. nofiles should be set to at least 8192.
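
As a rough illustration of the nofiles check above, the following Python sketch reads the open-file limit of the current process and compares it with the 8192 minimum mentioned in the post. It only reports the limit inherited by the Python process itself; the NetBackup processes may run with different limits, so treat it as a sketch rather than a definitive check. The function names are made up for this example.

```python
import resource  # Unix/Linux only

REQUIRED_NOFILE = 8192  # minimum suggested in the post above

def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def nofile_is_sufficient(soft_limit, required=REQUIRED_NOFILE):
    """True when the soft limit meets the requirement or is unlimited."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required

soft, hard = nofile_limits()
print(f"nofile soft={soft} hard={hard} ok={nofile_is_sufficient(soft)}")
```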

There are 2 'common' issues that could be NBU related that could cause this :

1. Client NBU version is higher than the media server

2. Make sure the communications buffer is not too high (http://www.symantec.com/docs/TECH60570)

What to do next:

http://www.symantec.com/docs/TECH135924 (mentioned before, MS suggested fix)

http://www.symantec.com/docs/TECH60570 (communications buffer, mentioned above)

http://www.symantec.com/docs/TECH60844

If these do not resolve the situation, I would recommend you talk with the Operating system vendor. In summary, apart from the Client version of software and the communication buffer size (set in host properties) I can find no other cause that could be NBU. However, from the very detailed research I have done, I can find many many causes that are the network or operating system.

Shadow Copy Components

**********************

Just to remind us of the error from the log file :

5:02:03.215 AM: [5112.5620] <2> ov_log::V_GlobalLogEx: ERR - beds_base::V_OpenForRead():FS_OpenObj() Failed! (0xE000FED1:A failure occurred querying the Writer status

vssadmin list providers

vssadmin list writers

From non-Symantec documentation, I found this possible cause of the error:

"The issue may be caused by an invalid entry inside the following registry sub tree.

HKey_Local_Machine\Software\Microsoft\Windows NT\CurrentVersion\ProfileList

Please open the registry editor with regedit.

Expand and local to the subtree, check if there is an entry that has a ".bak" value appended. If so, this may be cause the failure when trying to resolve the SID of the writer.

Please backup the registry key first, and then delete that entry with the extra ".bak"

Then you may reboot the problematic server to check if it the issue can be fixed."

Here are a few other Technotes / MS articles I have found that cover the above error message ;

http://www.symantec.com/docs/TECH42921

http://support.microsoft.com/kb/903234/en-us

http://support.microsoft.com/kb/913648/en-us

For Windows server 2003/2003R2:

If the error occurs on a Windows Server 2003 machine, install Microsoft Patch 940349.

As per this TN, it references the Windows RSM Service, is this running ? You could try stopping this, I believe we normally recommend this.

http://www.symantec.com/docs/TECH37208

Please look at this (IBM) Technote :

https://www-304.ibm.com/support/docview.wss?rs=663&uid=swg21304106&wv=1

Download the vshadow 'software' as per the instructions (link is in the technote).

Then, run the commands as shown in the technote :

From a MS-DOS prompt, issue the following commands using the vshadow utility (contact Microsoft for a version appropriate for the version of Windows being used):

vshadow -p c: > vshadow_p.out

vshadow -wm2 > vshadow_wm2.out

vshadow -ws > vshadow_ws.out

vshadow -nw c: > vs_nw.out

vshadow -nw -p c: > vs_p.out

Status 13

*********

{Troubleshooting NetBackup connectivity and timeout issues with NetBackup 6.x and 7.x}:

1 - Ensure that BOTH the NIC and the switch ports (that the servers connect to) are set to the same Link Speed and Duplex. (Some gigabit adapters can run at 10 Mbps, 100 Mbps, or 1000 Mbps and can be hard coded to 1000 Mbps)

2 - Ensure the latest available NIC firmware and device drivers are installed on all servers.

3 - Review the following document on network interface card (NIC) tuning:

TECH60844: Network connectivity tuning to avoid network read/write failures and increase performance

-> http://www.symantec.com/business/support/index?page=content&id=TECH60844

4 - Disable autotuning and chimney features (per TECH60844), from command prompt run:

Windows 2008

-> netsh int tcp show global

-> netsh int tcp set global autotuning=disabled

-> netsh int tcp set global chimney=disabled

Windows 2003

-> netsh int ip show offload

-> netsh int ip set chimney DISABLED

Note: On Windows 2003 servers also download and install 'Scalable Networking Pack'

The Microsoft Windows Server 2003 Scalable Networking Pack release

-> http://support.microsoft.com/kb/912222

5 - Ensure name resolution is working properly, both forward and reverse, for the client. For improved performance and to ensure that name resolution is not an issue, we request that you implement a "hosts" file on the NetBackup Master, Media, and (important) Client servers.

The /etc/hosts file contains information regarding the known hosts on the network. For each host, a single line should be present with the following information:

internet_address official_host_name aliases

Items are separated by any number of spaces or tabs, or both; however, spaces or tabs are not allowed before the IP address. A # indicates the beginning of a comment. Any characters after a #, up to the end of the line, are not interpreted by routines that search the file.

Example:

[internet_address] [official_host_name] [aliases]

127.0.0.1 localhost

10.66.16.140 Master Master.domain.com

10.66.16.141 Media1 Media1.domain.com

10.66.14.90 Client1 Client1.domain.com
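
The hosts file format described above (address, official name, aliases, `#` comments, any mix of spaces and tabs) can be sketched in Python as follows. This is only an illustration for sanity-checking a file by hand; `parse_hosts` and `lookup` are hypothetical helper names, and unlike the rule quoted above this parser tolerates leading whitespace before the IP address.

```python
def parse_hosts(text):
    """Parse hosts-file text into (address, official_name, aliases) tuples."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0]  # drop everything after a '#'
        fields = line.split()             # any mix of spaces and tabs
        if len(fields) < 2:
            continue                      # blank, comment-only, or no host name
        address, official_name, *aliases = fields
        entries.append((address, official_name, aliases))
    return entries

def lookup(entries, name):
    """Return the address for a host name or alias (case-insensitive)."""
    wanted = name.lower()
    for address, official_name, aliases in entries:
        names = [official_name.lower()] + [alias.lower() for alias in aliases]
        if wanted in names:
            return address
    return None

# Same entries as the example above, with one comment added.
SAMPLE = """\
127.0.0.1    localhost
10.66.16.140 Master  Master.domain.com   # master server
10.66.16.141 Media1  Media1.domain.com
10.66.14.90  Client1 Client1.domain.com
"""

entries = parse_hosts(SAMPLE)
print(lookup(entries, "media1.domain.com"))
```

Running the same lookup for the master, media server and client names on each of the three machines is a quick way to confirm that all hosts files agree.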

6 - Increase the 'Client connect timeout' and the 'Client read timeout' settings [found under NetBackup Management > Host Properties > Master Server Properties > Timeouts] to 9600 seconds and cycle the NetBackup services.

7 - Change the Communication buffer size from 32 KB to 128 KB. Go to Host Properties / Clients / Client Properties / Windows Client / Client Settings / Communication buffer size = 128

8 - In case there is an Antivirus running, turn it off for troubleshooting purposes.

9 - If the client has multiple NICs, use trace route (tracert for Windows) and ensure there is only one path from the NetBackup server and back from the client.

- From the Master to the client, perform the tracert to the client name specified in the NetBackup policy.

- From the client, use the tracert with the Master server hostname used in the NetBackup configuration.

10 - Look at the network statistics for the server (netstat -s), and the switch port.

11 - If the above steps do not resolve the Status Code 41 errors when performing backups of the Windows Client servers, then the next step would be to increase the TCP 'KeepAliveTime' parameter on the client server.

The KeepAliveTime parameter controls how often TCP attempts to verify that an idle connection is still intact by sending a keep-alive packet. If the remote system is still reachable and functioning, it acknowledges the keep-alive transmission. Keep-alive packets are not sent by default. This feature may be enabled on a connection by an application.

Hive: HKEY_LOCAL_MACHINE

Key: SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Name: KeepAliveTime

Type: REG_DWORD

Default Value: 300,000

Recommended Value: 7,200,000 (two hours)
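
The KeepAliveTime values above are easier to read as durations. The short Python sketch below assumes the registry value is expressed in milliseconds, which is consistent with "7,200,000 (two hours)" in the listing; it only converts the two numbers quoted above and does not read or write the registry.

```python
from datetime import timedelta

def keepalive_ms_to_duration(milliseconds):
    """Convert a KeepAliveTime value (assumed to be in ms) to a timedelta."""
    return timedelta(milliseconds=milliseconds)

listed_default = keepalive_ms_to_duration(300_000)   # "Default Value" above
recommended = keepalive_ms_to_duration(7_200_000)    # "Recommended Value" above

print("listed default:", listed_default)
print("recommended:   ", recommended)
print("recommended as hex DWORD:", hex(7_200_000))
```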

12 - Enable TCP to release the TCP ports sooner. Adjust the TcpTimedWaitDelay parameter in the Windows NT registry on the client [in this case \\saptst2]. If the parameter is not there, add it in. After the registry change, a reboot may be required.

Hive: HKEY_LOCAL_MACHINE\

Key: SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Name: TcpTimedWaitDelay

Type: REG_DWORD - Time in seconds

Valid Range: 30-300 (decimal)

Default: 0xF0 (240 decimal)

This parameter determines the length of time that a connection will stay in the TIME_WAIT state when being closed. While a connection is in the TIME_WAIT state, the socket pair cannot be re-used.

By default, this value is 240 seconds (4 minutes). It is recommended that this be changed to 30 seconds. (hex = 0x00001e) TcpTimedWaitDelay (new in Windows NT versions 3.51 SP5 and later)
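
As a small worked example of the numbers in step 12, this Python sketch checks a proposed TcpTimedWaitDelay against the 30-300 second range given above and formats it as a hexadecimal DWORD, which is how regedit displays it. The helper name is hypothetical and nothing here touches the registry.

```python
VALID_SECONDS = range(30, 301)  # 30-300 seconds inclusive, per the text above

def timed_wait_delay_dword(seconds):
    """Validate a TcpTimedWaitDelay value and return it as a hex DWORD string."""
    if seconds not in VALID_SECONDS:
        raise ValueError(f"{seconds} is outside the valid range 30-300 seconds")
    return f"0x{seconds:08x}"

print(timed_wait_delay_dword(240))  # the default quoted above
print(timed_wait_delay_dword(30))   # the recommended value
```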

Further information for steps 11 and 12 can be found in Microsoft Q Article: Q120642 from their web site at:

-> http://www.microsoft.com

Martin

Will_Restore · ‎09-10-2012

Check these technotes for possible solution:

Article URL http://www.symantec.com/docs/TECH188129

Article URL http://www.symantec.com/docs/TECH182268

epsilon22222 · ‎09-10-2012

I have had this issue in the past and it turned out to be a host name resolution issue. I would check which media server is being assigned to the job and verify connectivity between the two using bpclntcmd -hn servername and bptestbpcd -host servername -verbose

I would recommend these because you said it was working before but suddenly it isn't. Your policy may be set to use any media server, and that would cause it to use a different media server if the one it normally used wasn't available.
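
For a quick look at forward and reverse name resolution outside of NetBackup, a minimal Python sketch along these lines may help. It only asks the operating system's resolver, so it is not a replacement for the NetBackup commands mentioned above; `check_name_resolution` is a hypothetical helper name.

```python
import socket

def check_name_resolution(hostname):
    """Resolve hostname to an address, resolve that address back to a name,
    and report whether the original name is among the names returned."""
    result = {"hostname": hostname, "forward": None,
              "reverse": None, "consistent": False}
    try:
        result["forward"] = socket.gethostbyname(hostname)
    except OSError as exc:  # covers socket.gaierror
        result["error"] = f"forward lookup failed: {exc}"
        return result
    try:
        reverse_name, aliases, _addresses = socket.gethostbyaddr(result["forward"])
    except OSError as exc:  # covers socket.herror
        result["error"] = f"reverse lookup failed: {exc}"
        return result
    result["reverse"] = reverse_name
    known_names = {reverse_name.lower()} | {alias.lower() for alias in aliases}
    result["consistent"] = hostname.lower() in known_names
    return result

print(check_name_resolution("localhost"))
```

Run it on the master, the media server and the client with the exact host names used in the policy; a mismatch between the forward and reverse results on any of them is worth following up.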

mph999 · ‎09-10-2012