Having trouble with 636 status code
I am having some trouble tracking down what is causing the 636 error for some backups. It only happens on backups using the DMZ media server we have here. I know you guys will want to see some logs to help with the issue, so let me know what is needed and I will gladly grab that info and share it.
I am seeing this issue on several clients running various OSes, but they all use the same media server. I am just using the one client below as a baseline. I am seeing the error on Windows and Unix clients alike. Normally when I see this, it's hung bpbkar processes on a Windows Server 2003 machine: I can cycle the client, stop the hung processes, and then it works. With these clients, though, that's not solving the issue for me.
NetBackup version on Master - 7.6.0.2
Master server OS - Solaris 10
Client OS - Solaris 10
NB version on Client - 7.6.0.2
Running processes on client:
root@xp53web0642vz:/usr/openv/netbackup/bin ->./bpps -a
NB Processes
------------
root 24787 1 0 13:42:16 ? 0:01 /usr/openv/netbackup/bin/vnetd -standalone
root 27235 24791 0 07:21:35 ? 0:00 /usr/openv/netbackup/bin/bpcd -standalone
root 24822 1 0 13:42:17 ? 0:04 /usr/openv/netbackup/bin/nbdisco
root 24791 1 0 13:42:16 ? 0:01 /usr/openv/netbackup/bin/bpcd -standalone
root 24830 1 0 13:42:17 ? 0:00 /usr/openv/pdde/pdag/bin/mtstrmd
Job Details:
10/08/2014 05:35:35 - Info nbjm (pid=1354) starting backup job (jobid=2133986) for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental
10/08/2014 05:35:35 - Info nbjm (pid=1354) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=2133986, request id:{D62AAC74-4ED6-11E4-A400-002128A6F97C})
10/08/2014 05:35:35 - requesting resource xp53app034z-hcart-robot-tld-1
10/08/2014 05:35:35 - requesting resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 05:35:35 - requesting resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 05:35:40 - awaiting resource xp53app034z-hcart-robot-tld-1. Maximum job count has been reached for the storage unit.
10/08/2014 06:38:35 - granted resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 06:38:35 - granted resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 06:38:35 - granted resource OM1490
10/08/2014 06:38:35 - granted resource HP.ULTRIUM4-SCSI.005
10/08/2014 06:38:35 - granted resource xp53app034z-hcart-robot-tld-1
10/08/2014 06:38:41 - estimated 0 kbytes needed
10/08/2014 06:38:41 - Info nbjm (pid=1354) started backup (backupid=xp53web0642vz_1412768321) job for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental on storage unit xp53app034z-hcart-robot-tld-1
read from input socket failed (636)
Here is my comment from another thread.
"
I would take a look at your TCP keepalive settings on the media server, master server and any network device in between. Ensure the times match.
What can happen is that with a mismatch, the connection actually can get closed.
Bpbrm tries to send data to NBJM and fails:
"Error bpbrm (pid=11841) could not write FILE ADDED message to OUTSOCK"
Eventually NBJM will check for messages from BPBRM and fail, throwing a 636 error because it determined the socket was already closed.
This is something I have seen happen during NDMP jobs (http://www.symantec.com/docs/TECH214335), and it can also happen during other jobs such as SQL (http://www.symantec.com/docs/TECH197144).
My coworkers have resolved about 90% of the 636 errors they encountered by pointing customers to this keepalive setting and having them look for differences in the values."
Here is a way to check on Windows: http://www.symantec.com/docs/HOWTO56221. Windows and Unix have different defaults. I would imagine the issue for you is that the DMZ firewall most likely has a lower setting than the other servers and ends up closing the connection.
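If it helps to compare the numbers once you have collected them, here is a rough Python sketch of the check I am describing. It is not a NetBackup tool. The host names, keepalive values and the 3600-second firewall timeout are made-up examples, and the function names are my own. It assumes you have already read the values yourself (as far as I recall, on Solaris 10 that is `ndd -get /dev/tcp tcp_keepalive_interval`, reported in milliseconds, and on Windows it is the KeepAliveTime registry value, also in milliseconds). It flags any host that would wait at least as long as the firewall idle timeout before sending its first keepalive probe.

```python
# Sketch only: host names and timeout values below are made-up examples.

MS_PER_SECOND = 1000


def to_seconds(value, unit):
    """Normalize a keepalive value to seconds.

    Assumption: Solaris ndd and the Windows registry report
    milliseconds ("ms"); Linux tcp_keepalive_time reports seconds ("s").
    """
    if unit == "ms":
        return value / MS_PER_SECOND
    if unit == "s":
        return float(value)
    raise ValueError(f"unknown unit: {unit!r}")


def find_keepalive_risks(host_keepalives, firewall_idle_timeout_s):
    """Return (host, idle_seconds) pairs that are at risk, sorted by host.

    A host is flagged when its keepalive idle time is greater than or
    equal to the firewall idle timeout, because the firewall may drop
    the idle connection before the first probe is ever sent.
    """
    risky = []
    for host, (value, unit) in host_keepalives.items():
        idle_s = to_seconds(value, unit)
        if idle_s >= firewall_idle_timeout_s:
            risky.append((host, idle_s))
    return sorted(risky)


if __name__ == "__main__":
    # Hypothetical values gathered by hand from each machine.
    hosts = {
        "master (Solaris)": (7200000, "ms"),
        "dmz-media (Solaris)": (7200000, "ms"),
        "client (Windows)": (900000, "ms"),
    }
    timeout_s = 3600  # hypothetical DMZ firewall idle timeout
    for host, idle_s in find_keepalive_risks(hosts, timeout_s):
        print(f"{host}: keepalive idle {idle_s:.0f}s >= firewall timeout {timeout_s}s")
```

Any host it prints is a candidate for lowering the keepalive time below the firewall's idle timeout, or for raising the timeout on the firewall, whichever your network team prefers.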