Solaris 10 clients fail with status 24

Monu_Puri
Level 4

CLIENT:  qatlbs015 (sun4v/SunOS 5.10 Generic_142900-07)
VERSION: NetBackup-Solaris10 6.5.4

 

Master server: SunOS pronbu03 5.10 Generic_142900-06 sun4u sparc SUNW,Sun-Fire-V490

Netbackup version: NetBackup-Solaris10 6.5.4

The backup fails with status 24. I tried dividing it into multiple streams, and the backups were successful only for the empty file systems (it backs up the directory only, no files inside it). For the file systems containing data it still fails with 24.

Telnet (bpcd), bpcoverage and bpclntcmd are fine from the client to the master and vice versa.
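
For reference, the checks were along these lines (the hostnames are just the client and master from this post):

# from the master, confirm the client answers on the bpcd port (13782)
$ telnet qatlbs015 13782
# from the client, check name resolution of the master and how the master resolves the client
$ /usr/openv/netbackup/bin/bpclntcmd -hn pronbu03
$ /usr/openv/netbackup/bin/bpclntcmd -pn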

16 REPLIES

rizwan84tx
Level 6
Certified

Hi Monu,

Can you upload the bpbrm (media server) and bpcd & bpbkar (client) logs? That will be helpful to check the cause.

Monu_Puri
Level 4

Please check the attached logs.

Eric_Zhang
Level 5

Please refer to the following TN:

 

http://www.symantec.com/business/support/index?page=content&id=TECH76201&key=15143&basecat=TROUBLESHOOTING&actp=LIST

 

Eric

Eric_Zhang
Level 5

Remove the /usr/openv/netbackup/NET_BUFFER_SZ file and then retry.
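
In case it helps, a quick way to check for the touch file and remove it on both the media server and the client would be something like:

# see whether the touch file exists and what value it holds
$ ls -l /usr/openv/netbackup/NET_BUFFER_SZ
$ cat /usr/openv/netbackup/NET_BUFFER_SZ
# remove it so NetBackup falls back to the OS socket buffer defaults, then retry the backup
$ rm /usr/openv/netbackup/NET_BUFFER_SZ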

 

rizwan84tx
Level 6
Certified

23:12:03.697 [3207] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
23:12:03.697 [3207] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed
23:12:03.698 [3207] <4> bpbkar Exit: INF - EXIT STATUS 24: socket write failed
23:12:03.698 [3207] <2> bpbkar Exit: INF - Close of stdout complete

As per the TN article http://www.symantec.com/docs/TECH76201:

The TCP stack on the media server is not in sync with that of the client.

Workaround:

Remove NET_BUFFER_SZ file.

                  (OR)

Increase the TCP incoming buffer value to something greater than NET_BUFFER_SZ.
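
On Solaris the incoming TCP buffer can be raised with ndd; a rough sketch (262144 is just an example - use a value larger than whatever NET_BUFFER_SZ holds, and note that ndd changes do not survive a reboot unless added to a startup script):

# current receive buffer and the ceiling it can be raised to
$ /usr/sbin/ndd -get /dev/tcp tcp_recv_hiwat
$ /usr/sbin/ndd -get /dev/tcp tcp_max_buf
# raise the receive buffer above the NET_BUFFER_SZ value
$ /usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 262144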


Monu_Puri
Level 4

Hi,

I have tried to make the changes as mentioned by you guys.

The settings earlier were:

$ ndd -get /dev/tcp tcp_recv_hiwat

49152

$ ndd -get /dev/tcp tcp_xmit_hiwat

49152

$ /usr/sbin/ndd -get /dev/tcp tcp_max_buf

102400000

Current settings are:

$  /usr/sbin/ndd -get /dev/tcp tcp_recv_hiwat

262144

$ /usr/sbin/ndd -get /dev/tcp tcp_xmit_hiwat

262144

$ /usr/sbin/ndd -get /dev/tcp tcp_max_buf

102400000

However, the backups still fail with status 24, and below is the bpbkar log from today's backup:

21:58:14.303 [25128] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
21:58:14.303 [25128] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed
21:58:14.303 [25128] <4> bpbkar Exit: INF - EXIT STATUS 24: socket write failed
21:58:14.304 [25128] <2> bpbkar Exit: INF - Close of stdout complete
21:58:14.304 [25128] <4> bpbkar Exit: INF - setenv FINISHED=0

Moreover, the file /usr/openv/netbackup/NET_BUFFER_SZ doesn't exist on the client. Please suggest whether I should create it and set a lower value.

rizwan84tx
Level 6
Certified

You can create NET_BUFFER_SZ with a value of 262144, equal to the TCP network buffer.

It is recommended that if NET_BUFFER_SZ is used, the same value should be set on all the NetBackup media servers and clients.
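
If you do create it, it is just a one-line touch file holding the buffer size in bytes; something along these lines on each media server and client:

# write the buffer size (bytes) into the touch file; NetBackup picks it up when the backup connection is set up
$ echo "262144" > /usr/openv/netbackup/NET_BUFFER_SZ
$ cat /usr/openv/netbackup/NET_BUFFER_SZ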

Monu_Puri
Level 4

We don't use NET_BUFFER_SZ anywhere in the environment; I have checked on the master and media servers too.

Anton_Panyushki
Level 6
Certified

I wonder if the backup job writes any data to tape?

Please issue

# pgrep bpbkar on the client

There might be a bunch of stale bpbkar processes that can't read the file system.
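
Something like the following on the client will show whether stale bpbkar processes are left over and let you clear them before retrying (only kill them if no backup job is currently active on the client):

# list any bpbkar processes still running on the client
$ pgrep -l bpbkar
$ ps -ef | grep bpbkar | grep -v grep
# if they are stale, clear them and retry the backup
$ pkill bpbkar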

Andy_Welburn
Level 6

(as we have literally just started to encounter this with one of our Solaris 10 servers that we recently updated):

What Solaris release (/etc/release)? Ours has just been upped to 10/08 u6, and only since then have we had intermittent 24's.

We subsequently upped NB from 6.5.4 to 6.5.6 but saw issues with both. Issues were with one f/s yesterday (then we upped NB) & a different f/s this morning - so nothing consistent.

Do you have performance issues on the client? We do, intermittently, with the app that's running, which could be the reason for us - just re-running our failed job from this morning & it's now going through OK (touch wood), as did yesterday's when retrying, & no failures again until this morning.

Monu_Puri
Level 4

In my case, the moment the backup goes from queued to active state it fails with 24. The job details from the Activity Monitor are as follows:

Apr 21, 2011 9:51:25 PM - requesting resource Nexus_disk
Apr 21, 2011 9:51:25 PM - requesting resource pronbu01.unix.gsm1900.org.NBU_CLIENT.MAXJOBS.qatlbs015
Apr 21, 2011 9:51:25 PM - requesting resource pronbu01.unix.gsm1900.org.NBU_POLICY.MAXJOBS.nx_sunqat_core_3m
Apr 21, 2011 9:51:38 PM - Error bpbrm (pid=23408) from client qatlbs015: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
Apr 21, 2011 9:51:32 PM - granted resource  pronbu01.unix.gsm1900.org.NBU_CLIENT.MAXJOBS.qatlbs015
Apr 21, 2011 9:51:32 PM - granted resource  pronbu01.unix.gsm1900.org.NBU_POLICY.MAXJOBS.nx_sunqat_core_3m
Apr 21, 2011 9:51:32 PM - granted resource  MediaID=@aaaa3;DiskVolume=/backup_disk/disk03;DiskPool=pronbu21_ad_pool;Path=/backup_disk/disk03;Sto...
Apr 21, 2011 9:51:32 PM - granted resource  pronbu21_dsu
Apr 21, 2011 9:51:33 PM - estimated 0 kbytes needed
Apr 21, 2011 9:51:34 PM - started process bpbrm (pid=23408)
Apr 21, 2011 9:51:35 PM - connecting
Apr 21, 2011 9:51:36 PM - connected; connect time: 0:00:00
Apr 21, 2011 9:51:43 PM - Error bptm (pid=23415) media manager terminated by parent process
Apr 21, 2011 9:51:44 PM - end writing
socket write failed (24)
 

Monu_Puri
Level 4

Hi Marianne,

Would you please provide your valuable thoughts on this? The issue is still not resolved.

Marianne
Level 6
Partner    VIP    Accredited Certified

I don't have 1st-hand experience of this - seems Andy is currently experiencing similar problems... Hopefully he will find something useful quite soon.

Please see if changing tcp_time_wait_interval to something like 1000 helps. (ndd -set /dev/tcp tcp_time_wait_interval 1000)
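
For reference, to check the current value first and then apply the change (the value is in milliseconds, and like other ndd settings it is lost on reboot unless scripted at boot):

# show the current TIME_WAIT interval, then lower it to 1000 ms
$ /usr/sbin/ndd -get /dev/tcp tcp_time_wait_interval
$ /usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 1000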

 

The one log that I'd like to see is bpbrm on the media server for April 21.

Apr 21, 2011 9:51:38 PM - Error bpbrm (pid=23408) from client qatlbs015: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
...

 

*** EDIT ***

bptm on media server as well, please.

What is the OS and NBU version on the media server?
 

rizwan84tx
Level 6
Certified

Earlier we had similar problems with jobs exiting with 24 on a Windows server; we had to disable the TCP Offload Engine on the NIC to fix it.

Reference TN: http://www.symantec.com/business/support/index?page=content&id=TECH60844

I'm not sure if this can help on UNIX machines. Can anyone clarify this part?

Marianne
Level 6
Partner    VIP    Accredited Certified

Just found another TN: http://www.symantec.com/docs/TECH143964:

Backups intermittently fail with status 24 or 42 within a few seconds of the start of data transfer.

Error

The job details display one of two symptoms depending on the timing.

Typically the failure will appear as a status 24 because the client will report the failure to bpbrm while bptm is still getting the media ready.

10/21/2010 11:29:21 - started process bpbrm (pid=3462)
10/21/2010 11:29:29 - Error bpbrm (pid=3462) from client myclient: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
10/21/2010 11:29:25 - connecting
10/21/2010 11:29:25 - connected; connect time: 0:00:00
10/21/2010 11:29:34 - Error bptm (pid=3464) media manager terminated by parent process
10/21/2010 11:29:35 - end writing
socket write failed (24)

(....Lots of troubleshooting info.........)

Solution

This problem was resolved on multiple client hosts by stopping and restarting the inetd process via the Service Management Facility.  On two of the client hosts, the entire host needed to be rebooted to resolve the problem.

The precise root cause was not available, but it is likely that at the time inetd was last started, there was an unusual and incompatible combination of TCP and/or kernel tunables in place. One or more of the settings must have been changed to a compatible value before the problem was noticed, because all settings appeared normal during debugging. However, the restarted instance of inetd clearly picked up a better environment than the prior instance.
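
On Solaris 10 that restart is done through the Service Management Facility, roughly:

# restart inetd via SMF and confirm it comes back online
$ svcadm restart svc:/network/inetd:default
$ svcs -p svc:/network/inetd:default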

Monu_Puri
Level 4

I tried changing tcp_time_wait_interval and refreshed inetd, but the problem was still there.

A reboot has resolved the issue.

Thanks to all for the help and support. I really appreciate all the efforts made by you guys.

Special Thanks to Marianne.