Backups failing for Linux clients with status 24 and 41
NetBackup 7.5.0.4, Windows 2008 R2 master and media servers.
Hello all,
I have two Linux clients that started failing with status 41 and, later, status 24.
****History -
These were running fine. The Networking team moved these clients, and a few others, to a new switch, and they were still fine. Later, they moved a few clients (including these two) back to the original switch, which is when this issue started.
****What I have tried -
1) I spoke with the networking team. They told me they connected these back to the original ports, and they see no communication issues between the clients, media, or master servers.
2) Under Host Properties > Clients, I right-clicked both clients and hit Connect. They connected immediately, and I can browse the client properties with no issues.
3) Created a test policy with just one of the two clients (noc-edi-102), changed the backup selection from ALL_LOCAL_DRIVES to /etc, and kicked off a backup. It failed.
4) After raising the verbosity on all the relevant logs, I saw timeout errors, so I raised the client read timeout from 5 minutes (300 seconds) to 20 minutes (1200 seconds) on both the client and the media server, raised the client connect timeout on the media server to 20 minutes, and retried. ****This is when the error changed from status 41 to status 24.
5) Ran bpclntcmd -hn between master, media, and client, and all resolved, forward and reverse.
6) I am also working with my technical support vendor on this issue, but they are drawing a blank so far. They thought they saw a timeout at a particular path, so they had me change the backup selection to a location that doesn't include that path. Same error.
7) Confirmed with Unix team that no firewall settings have been touched on these boxes.
8) Added the client IP and name (both long and short) to the hosts file on the master and media servers.
9) Even tried pointing directly to tape, rather than disk, because I'm grasping at straws.
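For anyone who wants to repeat the name resolution and connectivity checks from steps 2 and 5 outside of NetBackup, here is a minimal Python sketch. The ports are the standard NetBackup listeners (1556 for PBX, 13724 for vnetd, 13782 for bpcd); the function names are made up for illustration and are not part of NetBackup. Keep in mind that a quick connect test passing says nothing about whether a long data transfer will survive.

```python
import socket

# Standard NetBackup listener ports; adjust if your site uses others.
NBU_PORTS = {"pbx": 1556, "vnetd": 13724, "bpcd": 13782}


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def forward_reverse(name):
    """Resolve name to an IP address, then resolve that IP back to a name."""
    ip = socket.gethostbyname(name)
    reverse_name = socket.gethostbyaddr(ip)[0]
    return ip, reverse_name


def check_client(host):
    """Return a dict mapping each port label to True/False for one host."""
    return {label: port_open(host, port) for label, port in NBU_PORTS.items()}
```

Calling check_client("noc-edi-102.ohlogistics.com") from the media server returns one True/False per port, and forward_reverse() with the same name shows whether the forward and reverse lookups agree.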
****Reporting
Job detail status -
4/27/2013 7:00:00 PM - requesting resource PDCDD_SU_1
4/27/2013 7:00:00 PM - requesting resource pdc00nbua801w.ohlogistics.com.NBU_CLIENT.MAXJOBS.noc-edi-102.ohlogistics.com
4/27/2013 7:00:00 PM - requesting resource pdc00nbua801w.ohlogistics.com.NBU_POLICY.MAXJOBS.PDC_NOC-EDI-102
4/27/2013 7:00:00 PM - awaiting resource PDCDD_SU_1 - Maximum job count has been reached for the storage unit
4/28/2013 12:26:32 AM - granted resource pdc00nbua801w.ohlogistics.com.NBU_CLIENT.MAXJOBS.noc-edi-102.ohlogistics.com
4/28/2013 12:26:32 AM - granted resource pdc00nbua801w.ohlogistics.com.NBU_POLICY.MAXJOBS.PDC_NOC-EDI-102
4/28/2013 12:26:32 AM - granted resource MediaID=@aaaae;DiskVolume=PDCDisk2;DiskPool=PDCDD_DP;Path=PDCDisk2;StorageServer=pdc00ddma901;MediaServer=pdc00nbua802w
4/28/2013 12:26:32 AM - granted resource PDCDD_SU_1
4/28/2013 12:26:32 AM - estimated 0 Kbytes needed
4/28/2013 12:26:32 AM - Info nbjm(pid=4532) started backup (backupid=noc-edi-102.ohlogistics.com_1367126792) job for client noc-edi-102.ohlogistics.com, policy PDC_NOC-EDI-102, schedule Full on storage unit PDCDD_SU_1
4/28/2013 12:26:33 AM - started process bpbrm (1368)
4/28/2013 12:26:36 AM - Info bpbrm(pid=1368) noc-edi-102.ohlogistics.com is the host to backup data from
4/28/2013 12:26:36 AM - Info bpbrm(pid=1368) reading file list from client
4/28/2013 12:26:37 AM - connecting
4/28/2013 12:26:42 AM - Info bpbrm(pid=1368) starting bpbkar32 on client
4/28/2013 12:26:42 AM - Info bpbkar32(pid=0) Backup started
4/28/2013 12:26:42 AM - Info bptm(pid=1752) start
4/28/2013 12:26:42 AM - Info bptm(pid=1752) using 1048576 data buffer size
4/28/2013 12:26:42 AM - Info bptm(pid=1752) setting receive network buffer to 1048576 bytes
4/28/2013 12:26:42 AM - Info bptm(pid=1752) using 128 data buffers
4/28/2013 12:26:43 AM - connected; connect time: 00:00:06
4/28/2013 12:26:47 AM - Info bptm(pid=1752) start backup
4/28/2013 12:26:48 AM - Info bptm(pid=1752) backup child process is pid 8180.2272
4/28/2013 12:26:48 AM - Info bptm(pid=8180) start
4/28/2013 12:26:48 AM - begin writing
4/28/2013 12:42:33 AM - Error bpbrm(pid=1368) from client noc-edi-102.ohlogistics.com: ERR - Cannot write to STDOUT. Errno = 110: Connection timed out
4/28/2013 12:42:42 AM - Error bpbrm(pid=1368) cannot send mail to etyree@ohl.com,tsmith@ohl.com
4/28/2013 12:42:43 AM - end writing; write time: 00:15:55
socket write failed(24)
****Bpbkar from client
15:23:43.788 [30956] <4> bpbkar PrintFile: /etc/minicom.users
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - cwd = /etc
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - path = modprobe.conf
15:23:43.788 [30956] <4> bpbkar PrintFile: /etc/modprobe.conf
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - cwd = /etc
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - path = modprobe.conf.dist
15:23:43.788 [30956] <4> bpbkar PrintFile: /etc/modprobe.conf.dist
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - cwd = /etc
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - path = modprobe.conf~
15:23:43.788 [30956] <4> bpbkar PrintFile: /etc/modprobe.conf~
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - cwd = /etc
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - path = motd
15:23:43.788 [30956] <4> bpbkar PrintFile: /etc/motd
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - cwd = /etc
15:23:43.788 [30956] <2> bpbkar SelectFile: INF - path = mtab
15:39:37.051 [30956] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 110: Connection timed out
15:39:37.051 [30956] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed
15:39:37.051 [30956] <4> bpbkar Exit: INF - EXIT STATUS 24: socket write failed
15:39:37.051 [30956] <2> bpbkar Exit: INF - Close of stdout complete
15:39:37.051 [30956] <4> bpbkar Exit: INF - setenv FINISHED=0
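The excerpt above shows bpbkar listing files under /etc at 15:23:43 and then logging nothing until the write error at 15:39:37. A small Python sketch to find the longest such stall in a bpbkar log follows. The helper names are my own, and it assumes every line starts with an HH:MM:SS.mmm timestamp and that the log does not cross midnight.

```python
from datetime import datetime, timedelta


def parse_time(line):
    """Parse the leading HH:MM:SS.mmm timestamp of a bpbkar log line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause
    between two consecutive log lines."""
    best = (timedelta(0), None, None)
    for prev, cur in zip(lines, lines[1:]):
        gap = parse_time(cur) - parse_time(prev)
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


# The last line before the stall and the first line after it, from the log above.
sample = [
    "15:23:43.788 [30956] <2> bpbkar SelectFile: INF - path = mtab",
    "15:39:37.051 [30956] <16> flush_archive(): ERR - Cannot write to STDOUT. "
    "Errno = 110: Connection timed out",
]

gap, before, after = largest_gap(sample)
print(gap)  # 0:15:53.263000
```

That gap of just under 16 minutes lines up with the 00:15:55 write time in the job details, so the client appears to sit blocked on the socket for the whole write phase rather than failing part-way through a long transfer.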
Please let me know what other information I can provide. I've been working on this for a while, so it's possible I may have left out something that I tried.
Thanks all,
Todd
Looks like you may be using MSDP/PureDisk. From my recent experience with Linux media servers, be sure that TCP offloading is (still) disabled on the clients and the media servers. There is also a Ring Buffer parameter in the TCP setup on Linux hosts (not sure what the name is); be sure that it is set to the max. You *do* need to reboot the Linux client for these changes to take effect, even if your system doesn't tell you to.
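One way to check the offload state on a Linux client is to look at the output of `ethtool -k <interface>`. A rough Python sketch that lists which offload features are currently on is below. The function names and the interface name eth0 are placeholders, and the sample text only illustrates the output format; it is not taken from the affected clients. The ring buffer mentioned above is presumably the NIC RX/TX ring reported by `ethtool -g`, but that is a guess and the sketch does not cover it.

```python
import subprocess


def enabled_offloads(ethtool_k_output):
    """Return the offload features reported as 'on' in `ethtool -k` output."""
    enabled = []
    for line in ethtool_k_output.splitlines():
        name, sep, state = line.partition(":")
        if not sep:
            continue  # not a "feature: state" line
        name = name.strip()
        words = state.split()
        # Header lines have nothing after the colon, so words is empty there.
        if "offload" in name and words and words[0] == "on":
            enabled.append(name)
    return enabled


def read_offloads(interface):
    """Run `ethtool -k <interface>` (ethtool must be installed) and parse it."""
    result = subprocess.run(
        ["ethtool", "-k", interface],
        capture_output=True, text=True, check=True,
    )
    return enabled_offloads(result.stdout)


# Illustrative output only; real feature lists vary by driver and ethtool version.
sample = """Features for eth0:
rx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off [fixed]
"""

print(enabled_offloads(sample))
# ['tcp-segmentation-offload', 'generic-segmentation-offload']
```

On the client itself, read_offloads("eth0") would run ethtool directly; an empty list means no offload feature is reported as on for that interface.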