
ERR - Cannot write to STDOUT. Errno = 32: Broken pipe

Arshad_Khateeb
Level 5
Certified

The backups are failing with "socket write failed (24)". inetd was restarted and the issue was fixed, but it popped up again this morning with the same error, i.e. ERR - Cannot write to STDOUT. Errno = 32: Broken pipe.

Here are the job details for one of the affected clients. The job initially writes data, then stops, and finally fails.

Mar 21, 2013 2:55:43 PM - estimated 27182054 kbytes needed
Mar 21, 2013 2:55:48 PM - connecting
Mar 21, 2013 2:55:48 PM - connected; connect time: 0:00:00
Mar 21, 2013 2:55:48 PM - begin writing
Mar 21, 2013 3:07:19 PM - Error bpbrm (pid=9652) from client xxx: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
Mar 21, 2013 3:07:23 PM - end writing; write time: 0:11:35
socket write failed  (24)


StefanosM
Level 6
Partner    VIP    Accredited Certified
Have you tried searching the support technotes? Check these: http://www.symantec.com/docs/TECH76201 http://www.symantec.com/docs/TECH143964

Arshad_Khateeb
Level 5
Certified
Thanks, Stefanos, for the suggestions. We too believe the issue is not within NetBackup, so we have routed it to the network team; we'll see what they find. I will keep you posted. Would a small workaround be possible, other than making major changes to the NetBackup configuration?

mph999
Level 6
Employee Accredited

 

I think you are the first person I have ever seen who has stated it is probably not NBU ...

You are most likely correct; the issue is unlikely to be NBU.

I am sure there is a case or two somewhere where status 24 has been due to NBU somehow, but in 5 years, having worked quite a few status 24 cases, I have never seen it caused by NBU, and searching the database, I have never found a status 24 caused by NBU.

The problem with 24s in general is that most people (but NOT your good self) believe that, because NBU has very politely informed you that there is some issue, it must be the fault of NBU.

The problem from our side is that, as we didn't cause the fault, we can't really tell you what did - yes, we know the network failed, but not why.

I will however try and be at least slightly helpful.

1.  You say the issue got fixed - do you know what was done to fix it?

2.  I presume this was working previously?

3.  Is it all clients, just one client, clients on a certain network, clients on a certain OS, clients that have just been patched, etc.?  If you can narrow it down to a group it helps, a lot.

4.  If it is a single client, big bonus, as it's most likely something on the client.

5.  Does it always fail after approx 12 mins?

6.  Is any data backed up?

7.  Is it a certain filesystem that always fails?

8.  Does it fail after a certain amount of data is backed up?

9.  Any recent changes?

10. Any firewalls involved?

Customers generally hate questions - but asking the right questions, and ensuring you get exactly the correct answers at the right level of detail, very often provides a very big step toward the solution, and sometimes even provides the solution.

The way I look at issues is that I try to blame NBU - however, in this case I honestly don't know a way to make NBU fail with a 24. We do not control the network; we only use what we are given. There are a few settings (net buffer size for Windows clients, for example) that will certainly affect performance, but I have never managed to create a 24 by changing them (and I have tried). In fact, I can't think of any other settings in NBU that 'control' the network (apart from required interface / preferred interface, but these won't cause a 24).

Here are my status 24 notes ... sorry, it's a big list, but it contains just about every cause I have seen.

 

In my experience, Status 24 is hardly ever NBU (in fact, I don't think I have ever seen a status 24 failure caused by NetBackup myself)
 
Something below normally fixes it ... Yes, it is a lot to read, and will probably take a number of hours to go through.
 
If this is a Windows client, a very common cause is the TCP Chimney settings - http://www.symantec.com/docs/TECH55653
 
I have given a number of technotes below (the odd one may be 'internal' only), and have shown a summary of the solutions, as well as the odd extra note.
 
 
http://www.symantec.com/docs/TECH124766  
 
TCP window scaling was disabled (operating system setting)
 
http://www.symantec.com/docs/TECH76201
 
Possible solution to Status 24 by increasing TCP receive buffer space 
 
http://www.symantec.com/docs/TECH34183 
This technote, although written for Solaris, shows how TCP tunings can cause status 24s. I am sure your system admins will be aware of the corresponding settings for the Windows operating system.
 
http://www.symantec.com/docs/TECH55653 
This technote is very important. It covers the many issues that can occur when TCP Chimney Offload, TCP/IP Offload Engine (TOE), or TCP Segmentation Offload (TSO) is enabled. It is recommended to disable these, as per the technote.
 
I also understand that we have previously seen MS Patch KB92222 resolve status 24 issues.
 
 
 
http://www.symantec.com/docs/TECH150369
A write operation to a socket failed; these are possible causes for this issue:
 
A high network load.
Intermittent connectivity.
Packet reordering.
Duplex Mismatch between client and master server NICs.
Small network buffer size
 
 
http://support.microsoft.com/kb/942861 
SOLUTION/WORKAROUND:
Contact the hardware vendor of the NIC for the latest updates for their product as a resolution.
 
This problem occurs when the TCP Chimney Offload feature is enabled on the NetBackup Windows 2003 client.  Disable this feature to workaround this problem.
 
To do this, at a command prompt, enter the following:
Netsh int ip set chimney DISABLED
 
 
 
http://www.symantec.com/docs/TECH127930
The above messages almost always indicate a networking issue of some sort. In this case it was due to a faulty switch. There are rare occasions when the above messages are not caused by a networking issue, such as those addressed in http://www.symantec.com/docs/TECH72115. 
 
(TECH72115 is not relevant to you, this was an issue with a SAN client, fixed in 6.5.4)
 
But note, the technote says the issue is 'almost always' network related; this can also include operating system settings.
 
 
http://www.symantec.com/docs/TECH145223
The issue was the idle timeout setting on the firewall, which was too low to allow backups and/or restores to complete. With a DMZ media server backing up a DMZ client, the media server sends only occasional metadata updates back to the master server in order to update the images catalog. If that TCP socket connection between the media server and master server is idle for longer than the firewall's idle timeout, the firewall breaks the connection between the media server and master server, and thus the media server breaks the connection to the client, producing the socket error.
Increasing the idle timeout setting on the firewall to a value larger than the amount of time a typical backups takes to complete should resolve the issue.
Also increasing the frequency of the TCP keepalive packets can also help maintain the socket during idle periods from the server's defaults.
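The keepalive tuning mentioned above can be applied at the OS level so idle control connections survive a firewall's idle timeout. A hedged sketch, not from the thread: the 30-minute interval is an illustrative value, and the Solaris tunable applies to Solaris 10, the OS of the affected clients here.

```shell
# Send TCP keepalive probes after 30 minutes idle instead of the default
# 2 hours, so the firewall sees traffic before its idle timeout expires.
# (Illustrative values; set them below your firewall's actual timeout.)

# Solaris 10 (value in milliseconds; not persistent across reboots):
ndd -set /dev/tcp tcp_keepalive_interval 1800000

# Linux equivalent (value in seconds):
sysctl -w net.ipv4.tcp_keepalive_time=1800
```

Both commands require root and take effect immediately; make them persistent (e.g. an init script on Solaris, /etc/sysctl.conf on Linux) if the fix works.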
 
 
Although you may not have a firewall between the client and the media server, this solution is another demonstration that the issue is network related, as opposed to NetBackup.
 
 
http://www.symantec.com/docs/S:TECH130343  (Internal technote)
 
The issue was found to be due to NIC card network congestion (that is, the network was overloaded)
 
 
 
http://www.symantec.com/docs/TECH135924  (I think this one I sent previously, shows the MS fix for the issue)
 
In this instance, the problem was isolated to this single machine, making the problematic new host the point of failure.
 
If the problem is due to an unidentified corruption / misconfiguration in the new media server's TCP Stack and Winsock environment (as was the case in this example), executing these two commands, followed by a reboot will resolve the problem:
 
netsh int ip reset resetlog.txt   Microsoft Reference:  http://support.microsoft.com/kb/299357 
netsh winsock reset catalog    Microsoft Reference:  http://technet.microsoft.com/en-us/library/cc759700(WS.10).aspx 
 
NOTE: The above two commands will reset the Windows TCP Stack as well as the Windows Winsock environment back to the default values.  This means that if the host is configured with a static IP Address and other customized TCP settings, they will be lost and will need to be re-entered after the reboot.  The default TCP setting is to use DHCP and the host will be using DHCP upon booting up.
 
 
 
 
Unix / Linux
 
If the error in bptm log shows :
 
22:32:44.968 [35717358] <16> write_to_out: cannot write data to socket, There is no process to read data written to a pipe. 
 
Check the ulimit -a output.  nofiles should be set to at least 8192.
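The nofiles check above can be scripted; a minimal sketch, where the 8192 threshold is the minimum stated above and `check_nofiles` is a hypothetical helper name (the /etc/system hint applies to Solaris):

```shell
# Compare the current open-file limit against the 8192 minimum noted above.
check_nofiles() {
  # $1 = current limit, $2 = required minimum; succeeds if the limit is enough
  [ "$1" -ge "$2" ]
}

if check_nofiles "$(ulimit -n)" 8192; then
  echo "nofiles OK"
else
  echo "raise nofiles to at least 8192 (ulimit -n, or rlim_fd_max in /etc/system on Solaris)"
fi
```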
 
 
There are 2 'common' issues that could be NBU related and could cause this:
 
1.  The client NBU version is higher than the media server's.
2.  Make sure the communications buffer is not too high (http://www.symantec.com/docs/TECH60570).
 
 
What to do next:
 
 
 
http://www.symantec.com/docs/TECH135924  (mentioned before, MS suggested fix)
http://www.symantec.com/docs/TECH60570  (communications buffer, mentioned above)
http://www.symantec.com/docs/TECH60844
 
 
If these do not resolve the situation, I would recommend you talk with the operating system vendor. In summary, apart from the client version of software and the communications buffer size (set in host properties), I can find no other cause that could be NBU. However, from the very detailed research I have done, I can find many causes in the network or operating system.
 

Arshad_Khateeb
Level 5
Certified
Thanks, Martin! You have replied with a lot of detail on how to troubleshoot or fix the status 24 failures, and I feel this covers all the areas in which an admin can troubleshoot. It will be interesting to see what our network team says, as we have asked them to check for issues on their side. Here are the answers to your questions, and to be honest I am very happy to provide them...

1. You say the issue got fixed - do you know what was done to fix it?
Ans: inetd was restarted on these UNIX hosts and it seemed to be fixed, but it popped up again the next morning.

2. I presume this was working previously?
Ans: Yes, the backups were working previously, but recently we moved our master server from one place to another within the data centre. The new place is just adjacent to where it was before.

3. Is it all clients, just one client, clients on a certain network, clients on a certain OS, clients that have just been patched, etc.?
Ans: There are around 8 clients from a couple of policies, and they are on two different networks. All are Solaris 10 64-bit except one, which is Solaris 9 64-bit. 4 of them were patched recently.

4. If it is a single client, big bonus, as it's most likely something on the client.
Ans: There are around 8 clients: 7 from one policy and 1 from another.

5. Does it always fail after approx 12 mins?
Ans: Yes, since this issue started, approx 5-10 mins after it starts writing data.

6. Is any data backed up?
Ans: Yes, data is backed up when the job kicks off.

7. Is it a certain filesystem that always fails?
Ans: The backup selection is ALL_LOCAL_DRIVES for all these clients.

8. Does it fail after a certain amount of data is backed up?
Ans: Yes, initially data is backed up for about 5 mins, then it stops and finally fails.

9. Any recent changes?
Ans: Some of them were patched, but not all.

10. Any firewalls involved?
Ans: Not sure

mph999
Level 6
Employee Accredited

 

I've added my comments ...

1. You say the issue got fixed - do you know what was done to fix it?
Ans: inetd was restarted on these UNIX hosts and it seemed to be fixed, but it popped up again the next morning.

>> inetd restarts the network services, so I can see why it could be a potential fix (even if only temporary).

2. I presume this was working previously?
Ans: Yes, the backups were working previously, but recently we moved our master server from one place to another within the data centre. The new place is just adjacent to where it was before.

>> Moving the master server is 'common' to all the servers that fail. However, I would expect this to have affected all the clients.

3. Is it all clients, just one client, clients on a certain network, clients on a certain OS, clients that have just been patched, etc.?
Ans: There are around 8 clients from a couple of policies, and they are on two different networks. All are Solaris 10 64-bit except one, which is Solaris 9 64-bit. 4 of them were patched recently.

4. If it is a single client, big bonus, as it's most likely something on the client.
Ans: There are around 8 clients: 7 from one policy and 1 from another.

>> So, not common to a single client, and not common to a single network.

5. Does it always fail after approx 12 mins?
Ans: Yes, since this issue started, approx 5-10 mins after it starts writing data.

>> It varies, then, which perhaps suggests it's not a timeout happening somewhere.

6. Is any data backed up?
Ans: Yes, data is backed up when the job kicks off.

7. Is it a certain filesystem that always fails?
Ans: The backup selection is ALL_LOCAL_DRIVES for all these clients.

8. Does it fail after a certain amount of data is backed up?
Ans: Yes, initially data is backed up for about 5 mins, then it stops and finally fails.

9. Any recent changes?
Ans: Some of them were patched, but not all.

10. Any firewalls involved?
Ans: Not sure

So overall, nothing here really narrows it down much, which is a shame.

I wonder if it could be load related; would you be able to test a single client that previously failed, when no other backups are running?

We could also take a look in the logs; perhaps there may be some clue, even though I don't expect to see an exact answer.

Media server - bptm and bpbrm

Client - bpbkar and bpcd

I would recommend VERBOSE = 5 and please also include the activity monitor details that match the logs.

On the client, add an empty file (it will give more detail in the bpbkar log):

/usr/openv/netbackup/bpbkar_path_tr
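The client-side setup above (the log directories, VERBOSE = 5, and the touch file) can be sketched as one small script. A hedged sketch: `enable_nbu_debug` is a hypothetical helper name, and the standard /usr/openv/netbackup install root is assumed when calling it on a real client.

```shell
# Hedged sketch of the client-side debug setup described above.
enable_nbu_debug() {
  nbu=$1   # NetBackup install root, normally /usr/openv/netbackup

  # NetBackup only writes legacy debug logs if the directories exist
  mkdir -p "$nbu/logs/bpbkar" "$nbu/logs/bpcd"

  # Raise logging verbosity (append only if not already set)
  grep -q '^VERBOSE' "$nbu/bp.conf" 2>/dev/null || echo 'VERBOSE = 5' >> "$nbu/bp.conf"

  # Empty touch file for extra path detail in the bpbkar log
  touch "$nbu/bpbkar_path_tr"
}

# On a real client: enable_nbu_debug /usr/openv/netbackup
```

Remember to lower VERBOSE and remove the touch file again afterwards, as level 5 logging is noisy.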

As the issue is easy to reproduce, run the backup twice for the same client; we can then see whether, for any one client, it fails in exactly the same place.

Thanks,

Martin

 

Arshad_Khateeb
Level 5
Certified
Here is what the network guys said :) Currently none of the DC switches are being monitored for historical traffic trending. I am in the process of trying to get them added to one of the monitoring systems. I did look at each system's Ethernet port counters on each switch and do not see any immediate errors or buffer overruns on any system except one, which has a few minor input/output errors, i.e. 0.00000962% errors on this interface (which correlates to dropped packets), but clearly nothing to be concerned about. I'm still researching VLAN trunking between the new and old switches to confirm their performance.

mph999
Level 6
Employee Accredited

If you log a call with Symantec and ask very nicely, we'll run AppCritical for you, which will analyze the network.

Martin

Arshad_Khateeb
Level 5
Certified

Looking at the workaround in the technote http://www.symantec.com/docs/TECH76201

We have increased the TCP buffer size and tested the backups. They failed again after writing a certain amount of data, with the same error, i.e. ERR - Cannot write to STDOUT. Errno = 32: Broken pipe.

Do we need to restart the NetBackup services/daemons on the master server after the TCP buffer size is increased?

Apart from this, the failures are increasing every day, and it is only happening with UNIX boxes.


Arshad_Khateeb
Level 5
Certified

Just an update on this issue.

Finally, we installed backup NICs in the affected servers to test the backups, and they were successful.

BTW, we tried almost everything to fix this issue, except for a couple of things that required changing values on the master and media servers, which we avoided so as not to cause other issues.

Ankit_Maheshwar
Level 5

OK .. so this seems to be a load issue, as after installing the backup NIC it's working fine.