03-28-2013 10:48 AM
Backups are failing with "socket write failed (24)". Restarting inetd appeared to fix the issue, but it came back this morning with the same error, i.e. ERR - Cannot write to STDOUT. Errno = 32: Broken pipe.
Here are the job details for one of the affected clients. The job initially writes data, then stops, and finally fails.
Mar 21, 2013 2:55:43 PM - estimated 27182054 kbytes needed
Mar 21, 2013 2:55:48 PM - connecting
Mar 21, 2013 2:55:48 PM - connected; connect time: 0:00:00
Mar 21, 2013 2:55:48 PM - begin writing
Mar 21, 2013 3:07:19 PM - Error bpbrm (pid=9652) from client xxx: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
Mar 21, 2013 3:07:23 PM - end writing; write time: 0:11:35
socket write failed (24)
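For anyone comparing several failed jobs, the elapsed time between "begin writing" and the first error can be pulled out of the job details with a small script. This is only a sketch: the function names are made up for illustration, and the timestamp format is assumed from the lines pasted above.

```python
from datetime import datetime

# Timestamp layout assumed from the job details above, e.g. "Mar 21, 2013 2:55:48 PM".
STAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"

JOB_DETAILS = """\
Mar 21, 2013 2:55:43 PM - estimated 27182054 kbytes needed
Mar 21, 2013 2:55:48 PM - connecting
Mar 21, 2013 2:55:48 PM - connected; connect time: 0:00:00
Mar 21, 2013 2:55:48 PM - begin writing
Mar 21, 2013 3:07:19 PM - Error bpbrm (pid=9652) from client xxx: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
Mar 21, 2013 3:07:23 PM - end writing; write time: 0:11:35
socket write failed (24)
"""


def parse_job_details(text):
    """Return (timestamp, message) pairs for every line that starts with a timestamp."""
    events = []
    for line in text.splitlines():
        stamp, sep, message = line.partition(" - ")
        if not sep:
            continue  # no "timestamp - message" separator on this line
        try:
            when = datetime.strptime(stamp.strip(), STAMP_FORMAT)
        except ValueError:
            continue  # the text before " - " is not a timestamp
        events.append((when, message.strip()))
    return events


def seconds_until_first_error(events):
    """Seconds between 'begin writing' and the first 'Error' line, or None if either is missing."""
    start = next((t for t, m in events if m.startswith("begin writing")), None)
    error = next((t for t, m in events if m.startswith("Error")), None)
    if start is None or error is None:
        return None
    return (error - start).total_seconds()


if __name__ == "__main__":
    print(seconds_until_first_error(parse_job_details(JOB_DETAILS)))
```

Running the same function over the job details of each failing client makes it easy to see whether the failures cluster around one elapsed time or are spread out.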
06-04-2013 12:44 PM
Just thought I'd post an update on this issue.
We finally installed backup NICs in the affected servers to test the backups, and they were successful.
BTW, we tried almost everything to fix this issue, except for a couple of things that required changing values on the master and media servers, which we left alone to avoid causing any other issue.
03-28-2013 10:57 AM
03-28-2013 01:32 PM
03-28-2013 03:28 PM
I think you are the first person I have ever seen who has stated it is probably not NBU ...
I am sure there is a case or two somewhere where status 24 has been due to NBU somehow, but in 5 years of handling quite a few status 24 cases I have never seen it caused by NBU, and searching the database, I have never found a status 24 caused by NBU.
The problem with 24s in general is that most people, but NOT your good self, believe that because NBU has very politely informed you that there is some issue, it must be the fault of NBU.
The problem from our side is that, as we didn't cause the fault, we can't really tell you what did. Yes, we know the network failed, but not why.
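As background on what the client is actually reporting: errno 32 (EPIPE, "Broken pipe") is what a writer gets back from the operating system when the other end of the pipe or socket it is writing to has gone away. The sketch below reproduces that with a local pipe on a POSIX system. It is not NetBackup code, and the function name is made up for illustration. It only shows that the error describes the symptom (the reader disappeared), not the cause.

```python
import errno
import os


def write_after_reader_closed():
    """Write to a pipe whose read end is already closed and return the resulting errno."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the "other side" goes away
    try:
        os.write(write_fd, b"backup data")
    except BrokenPipeError as exc:
        # Python ignores SIGPIPE by default, so the failure surfaces as an exception.
        return exc.errno
    finally:
        os.close(write_fd)
    return None


if __name__ == "__main__":
    code = write_after_reader_closed()
    print(code, errno.errorcode.get(code))
```

The writer has no way of knowing why the reader went away, which is why the message on its own does not point at a root cause.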
I will however try and be at least slightly helpful.
1. You say the issue got fixed, do you know what was done to fix it
2. I presume this was working previously
3. Is it all clients, just one client, clients on a certain network, clients on a certain OS, clients that have just been patched etc ... If you can narrow it down to a group it helps, a lot
4. If it is a single client, big bonus, as it's most likely something on the client
5. Does it always fail after approx 12 mins
6. Is any data backed up
7. Is it a certain filesystem that always fails
8. Does it fail after a certain amount of data backed up
9. Any recent changes
10. Any firewalls involved
Customers generally hate questions - but asking the right questions, and ensuring you get exactly the correct answers at the right level of detail very often provides a very big step in the solution, and sometimes even provides a solution.
The way I look at issues is I try to blame NBU. However, in this case I honestly don't know a way to make NBU fail with a 24. We do not control the network, we only use what we are given. There are a few settings (net buffer size for Windows clients, for example) that will certainly affect performance, but I never managed to create a 24 by changing it (and I have tried). In fact, I can't think of any other settings in NBU that 'control' the network (apart from required interface / preferred interface, but these won't cause a 24).
Here are my status 24 notes ... sorry, big list, but it contains just about every cause I have seen.
03-29-2013 07:32 AM
03-29-2013 12:05 PM
So overall, nothing here really narrows it down much, which is a shame.
I wonder if it could be load related. Would you be able to test a single client that previously failed, when no other backups are running?
We could also take a look in the logs. Perhaps there may be some clue, even though I don't expect to see an exact answer.
Media server - bptm and bpbrm
Client - bpbkar and bpcd
I would recommend VERBOSE = 5 and please also include the activity monitor details that match the logs.
On the client, add an empty file with the following name (it will give more detail in the bpbkar log):
/usr/openv/netbackup/bpbkar_path_tr
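The log setup described above could be scripted roughly as follows. This is a sketch under assumptions: it assumes the legacy log directories live under `logs/` inside the NetBackup install directory (`/usr/openv/netbackup` on Unix), the function name is made up for illustration, and it does not touch the VERBOSE setting, which still has to be set separately.

```python
from pathlib import Path

# Process names taken from the lists above.
MEDIA_SERVER_LOGS = ("bptm", "bpbrm")
CLIENT_LOGS = ("bpbkar", "bpcd")


def prepare_logging(netbackup_root, processes, touch_file=False):
    """Create one log directory per process and, optionally, the empty bpbkar_path_tr file.

    netbackup_root is assumed to be the install directory, e.g. /usr/openv/netbackup.
    Returns the list of paths that now exist.
    """
    root = Path(netbackup_root)
    prepared = []
    for name in processes:
        log_dir = root / "logs" / name  # assumed legacy log location
        log_dir.mkdir(parents=True, exist_ok=True)
        prepared.append(log_dir)
    if touch_file:
        marker = root / "bpbkar_path_tr"
        marker.touch(exist_ok=True)  # empty file; only its presence matters
        prepared.append(marker)
    return prepared


if __name__ == "__main__":
    # Demonstrated against a scratch directory rather than a real install.
    import tempfile

    with tempfile.TemporaryDirectory() as scratch:
        for path in prepare_logging(scratch, CLIENT_LOGS, touch_file=True):
            print(path)
```

On a real client the call would be `prepare_logging("/usr/openv/netbackup", CLIENT_LOGS, touch_file=True)`, run as a user with write access to that directory. On the media server it would use `MEDIA_SERVER_LOGS` without the touch file.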
As the issue is easy to reproduce, run the backup twice for the same client. We can then see whether, for any one client, it fails in exactly the same place.
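One way to decide whether two attempts failed "in the same place" is to compare how much data each had written when it stopped. The sketch below is illustrative only: the function name, the 2% tolerance and the sample figures are assumptions, not values taken from this case.

```python
def same_failure_point(first_kb, second_kb, tolerance=0.02):
    """Return True if two failed attempts stopped within `tolerance` (a fraction) of each other."""
    larger = max(first_kb, second_kb)
    if larger == 0:
        return True  # neither attempt wrote anything
    return abs(first_kb - second_kb) / larger <= tolerance


if __name__ == "__main__":
    # Hypothetical kilobytes written before each failure.
    print(same_failure_point(1_000_000, 1_005_000))
    print(same_failure_point(1_000_000, 2_500_000))
```

If the two runs stop at nearly the same point, a particular file or filesystem becomes a suspect. If they stop at very different points, something outside the client data (network, load) looks more likely.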
Thanks,
Martin
04-02-2013 10:43 AM
04-02-2013 02:30 PM
If you log a call with Symantec and ask very nicely, we'll run AppCritical for you, which will analyze the network.
Martin
04-18-2013 10:12 AM
Looking at the workaround in the technote http://www.symantec.com/docs/TECH76201
We increased the TCP buffer size and tested the backups. They failed again after writing a certain amount of data, with the same error, i.e. ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
Do we need to restart the NetBackup services/daemons on the master server after the TCP buffer size is increased?
Apart from this, the failures are increasing every day, and it is only happening on Unix boxes.
06-05-2013 12:37 AM
OK, so this looks like a load issue, as it is working fine after installing the backup NICs.
I've added my comments ...
1. You say the issue got fixed, do you know what was done to fix it
Ans: inetd was restarted on these Unix hosts and that seemed to fix it, but the issue popped up again the next morning.
>> inetd restarts the network services, so I can see why it could be a potential fix (even if only temporary).
2. I presume this was working previously
Ans: Yes, the backups were working previously, but recently we moved our master server from one place to another within the data centre. The new place is just adjacent to where it was before.
>> Moving the master server is 'common' to all the servers that fail. However, I would expect this to have affected all the clients.
3. Is it all clients, just one client, clients on a certain network, clients on a certain OS, clients that have just been patched etc ... If you can narrow it down to a group it helps, a lot
Ans: There are around 8 clients from a couple of policies, and they are on two different networks. All are Solaris 10 64-bit except one, which is Solaris 9 64-bit. Four of them were patched recently.
4. If it is a single client, big bonus, as it's most likely something on the client
Ans: There are around 8 clients: 7 from one policy and 1 from another.
>> So, not common to a single client, and not common to a single network
5. Does it always fail after approx 12 mins
Ans: Yes, since this issue started, it fails after approx. 5-10 mins of writing data.
>> It varies then, which perhaps suggests it's not a timeout happening somewhere.
6. Is any data backed up
Ans: Yes, the data is getting backed up when the job kicks off.
7. Is it a certain filesystem that always fails
Ans: The backup selection is ALL_LOCAL_DRIVES for all these clients.
8. Does it fail after a certain amount of data backed up
Ans: Yes, initially data is backed up for about 5 mins, then it stops and finally fails.
9. Any recent changes
Ans: Some of them are patched but not all.
10. Any firewalls involved
Ans: Not sure
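The answers above can be tabulated to check whether any single attribute is shared by every failing client. The sketch below uses only the counts given in the answers (8 clients; 7 Solaris 10 and 1 Solaris 9; 7 in one policy and 1 in another; 4 recently patched). The policy labels and function names are placeholders, and the network split is left out because it was not stated.

```python
from collections import Counter

TOTAL_FAILING = 8

# Counts taken from the answers above; "policy A" / "policy B" are placeholder labels.
ATTRIBUTES = {
    "os": Counter({"Solaris 10": 7, "Solaris 9": 1}),
    "policy": Counter({"policy A": 7, "policy B": 1}),
    "recently patched": Counter({"yes": 4, "no": 4}),
}


def shared_by_all(attributes, total):
    """Return (attribute, value) pairs that apply to every failing client."""
    return [
        (name, value)
        for name, counts in attributes.items()
        for value, count in counts.items()
        if count == total
    ]


def dominant_values(attributes):
    """Return the most common value (and its count) for each attribute."""
    return {name: counts.most_common(1)[0] for name, counts in attributes.items()}


if __name__ == "__main__":
    print(shared_by_all(ATTRIBUTES, TOTAL_FAILING))
    print(dominant_values(ATTRIBUTES))
```

With these counts no attribute value covers all 8 clients, which matches the observation that the failures are not tied to one client, one policy or one OS level.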