10-08-2014 07:31 AM
I am having some trouble tracking down what is causing the 636 error for some backups. It only happens on backups using the DMZ media server we have here. I know you will want to see some logs to help with the issue, so let me know what is needed and I will gladly grab that info and share it.
I am seeing this issue on several clients running different operating systems, both Windows and Unix, but they all use the same media server. I am using the one client below as a baseline. Normally when I see this error, it is caused by hung bpbkar processes on a Windows Server 2003 machine: I can cycle the client, stop the hung processes, and then it will work. With these clients, that is not solving the issue for me.
Netbackup version on Master - 7.6.0.2
Master server OS - Solaris 10
Client OS - Solaris 10
NB version on Client - 7.6.0.2
Running processes on client:
root@xp53web0642vz:/usr/openv/netbackup/bin ->./bpps -a
NB Processes
------------
root 24787 1 0 13:42:16 ? 0:01 /usr/openv/netbackup/bin/vnetd -standalone
root 27235 24791 0 07:21:35 ? 0:00 /usr/openv/netbackup/bin/bpcd -standalone
root 24822 1 0 13:42:17 ? 0:04 /usr/openv/netbackup/bin/nbdisco
root 24791 1 0 13:42:16 ? 0:01 /usr/openv/netbackup/bin/bpcd -standalone
root 24830 1 0 13:42:17 ? 0:00 /usr/openv/pdde/pdag/bin/mtstrmd
Job Details:
10/08/2014 05:35:35 - Info nbjm (pid=1354) starting backup job (jobid=2133986) for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental
10/08/2014 05:35:35 - Info nbjm (pid=1354) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=2133986, request id:{D62AAC74-4ED6-11E4-A400-002128A6F97C})
10/08/2014 05:35:35 - requesting resource xp53app034z-hcart-robot-tld-1
10/08/2014 05:35:35 - requesting resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 05:35:35 - requesting resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 05:35:40 - awaiting resource xp53app034z-hcart-robot-tld-1. Maximum job count has been reached for the storage unit.
10/08/2014 06:38:35 - granted resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 06:38:35 - granted resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 06:38:35 - granted resource OM1490
10/08/2014 06:38:35 - granted resource HP.ULTRIUM4-SCSI.005
10/08/2014 06:38:35 - granted resource xp53app034z-hcart-robot-tld-1
10/08/2014 06:38:41 - estimated 0 kbytes needed
10/08/2014 06:38:41 - Info nbjm (pid=1354) started backup (backupid=xp53web0642vz_1412768321) job for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental on storage unit xp53app034z-hcart-robot-tld-1
read from input socket failed (636)
10-08-2014 07:48 AM
Here is my comment from another thread.
"
I would take a look at your TCP keepalive settings on the media server, master server and any network device in between. Ensure the times match.
What can happen with a mismatch is that the connection actually gets closed.
bpbrm then tries to send data to nbjm and fails:
"Error bpbrm (pid=11841) could not write FILE ADDED message to OUTSOCK"
Eventually NBJM will check for messages from BPBRM and fail throwing a 636 error because it determined the socket was already closed.
This is something I have seen happen during NDMP jobs, http://www.symantec.com/docs/TECH214335 , which can also happen during other jobs such as SQL, http://www.symantec.com/docs/TECH197144
My coworkers have resolved about 90% of the 636 errors they encountered by pointing customers to this keepalive setting, where the customers then found differences in the values."
Here is a way to check on Windows: http://www.symantec.com/docs/HOWTO56221 . Windows and Unix have different defaults. I would imagine the issue for you is that the firewall for the DMZ has a lower setting than the other servers and ends up closing the connection.
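To make the mismatch concrete, here is a minimal Python sketch of the comparison described above. It assumes you have already collected each host's TCP keepalive idle time and the firewall's idle session timeout, both in seconds. The function name, the host labels, and every number in the usage note are hypothetical placeholders, not values from this environment.

```python
def firewall_drops_before_keepalive(keepalive_idle_s, firewall_idle_timeout_s):
    """Return True if an idle connection would be expired by the firewall
    before the host sends its first keepalive probe."""
    # The first probe goes out only after keepalive_idle_s of silence.
    # If the firewall's idle timer is not longer than that, the firewall
    # forgets the session before a probe can refresh it.
    return keepalive_idle_s >= firewall_idle_timeout_s


def hosts_at_risk(keepalive_by_host, firewall_idle_timeout_s):
    """Return the sorted names of hosts whose keepalive idle time is too long."""
    return sorted(
        host
        for host, idle_s in keepalive_by_host.items()
        if firewall_drops_before_keepalive(idle_s, firewall_idle_timeout_s)
    )
```

For example, with a hypothetical firewall timeout of 3600 seconds, `hosts_at_risk({"master": 7200, "media": 900}, 3600)` flags only `master`, because its first probe would be sent an hour after the firewall had already dropped the session.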
10-08-2014 07:55 AM
Where can I find the keepalive setting on the master and media servers? From there I will take that to my firewall team and see what they have on their end.
10-08-2014 07:57 AM
This is an OS setting.
"Here is a way to check via windows. http://www.symantec.com/docs/HOWTO56221 "
For Unix, search for "tcp keepalive" plus the name of your Unix distribution.
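As a rough starting point on Unix, the Python sketch below reads the keepalive idle time and normalises it to seconds. It assumes the two places I know of: Linux exposes the value in seconds in `/proc/sys/net/ipv4/tcp_keepalive_time`, and Solaris 10 reports it in milliseconds through `ndd -get /dev/tcp tcp_keepalive_interval`. Other Unix flavours use different tunables, so treat the function names and the fallback order as assumptions. It needs Python 3.7 or later.

```python
import subprocess
from pathlib import Path

# Linux: keepalive idle time, in seconds.
LINUX_KEEPALIVE_FILE = Path("/proc/sys/net/ipv4/tcp_keepalive_time")
# Solaris 10: keepalive idle time, in milliseconds.
SOLARIS_NDD_COMMAND = ["ndd", "-get", "/dev/tcp", "tcp_keepalive_interval"]


def parse_linux_keepalive(text):
    """Linux reports whole seconds."""
    return int(text.strip())


def parse_solaris_keepalive(text):
    """Solaris reports milliseconds; convert to seconds."""
    return int(text.strip()) / 1000.0


def read_keepalive_seconds():
    """Read the local keepalive idle time in seconds (Linux or Solaris 10 only)."""
    if LINUX_KEEPALIVE_FILE.exists():
        return parse_linux_keepalive(LINUX_KEEPALIVE_FILE.read_text())
    # Assumption: if the Linux file is missing, we are on Solaris with ndd available.
    result = subprocess.run(
        SOLARIS_NDD_COMMAND, capture_output=True, text=True, check=True
    )
    return parse_solaris_keepalive(result.stdout)
```

Run `read_keepalive_seconds()` on the master and on the media server, then hand both numbers to the firewall team so they can compare them with their idle session timeout.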
10-08-2014 12:26 PM
I have found the setting on the master and media server. I am just waiting to hear back from the firewall guy and see what is set on their end.
While I am waiting to hear back, I figured I would share this data from a backup that was trying to run just recently:
10/08/2014 11:23:19 - Info nbjm (pid=1354) starting backup job (jobid=2134414) for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental
10/08/2014 11:23:19 - Info nbjm (pid=1354) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=2134414, request id:{6A0A246C-4F07-11E4-984B-002128A6F97C})
10/08/2014 11:23:19 - requesting resource xp53app034z-hcart-robot-tld-1
10/08/2014 11:23:19 - requesting resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 11:23:19 - requesting resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 11:23:19 - Info nbrb (pid=1307) Limit has been reached for the logical resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 12:30:05 - awaiting resource xp53app034z-hcart-robot-tld-1. Maximum job count has been reached for the storage unit.
10/08/2014 12:30:05 - Info nbrb (pid=1307) Limit has been reached for the logical resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 12:55:33 - awaiting resource xp53app034z-hcart-robot-tld-1. Maximum job count has been reached for the storage unit.
10/08/2014 12:55:34 - granted resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 12:55:34 - granted resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 12:55:34 - granted resource OM1490
10/08/2014 12:55:34 - granted resource HP.ULTRIUM4-SCSI.005
10/08/2014 12:55:34 - granted resource xp53app034z-hcart-robot-tld-1
10/08/2014 12:55:35 - estimated 0 kbytes needed
10/08/2014 12:55:35 - Info nbjm (pid=1354) started backup (backupid=xp53web0642vz_1412790935) job for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental on storage unit xp53app034z-hcart-robot-tld-1
10/08/2014 12:55:36 - started process bpbrm (pid=21339)
10/08/2014 13:13:37 - Info bpbrm (pid=7368) status: FAILED, (44) CONNECT_TIMEOUT; system: (150) Operation now in progress; FROM 0.0.0.0 TO xp53web0642vz 204.44.15.247 bpcd VIA pbx
10/08/2014 13:13:37 - Info bpbrm (pid=7368) status: FAILED, (44) CONNECT_TIMEOUT; system: (150) Operation now in progress; FROM 0.0.0.0 TO xp53web0642vz 204.44.15.247 bpcd VIA vnetd
10/08/2014 13:13:37 - Info bpbrm (pid=7368) status: FAILED, (44) CONNECT_TIMEOUT; system: (145) Connection timed out; FROM 0.0.0.0 TO xp53web0642vz 204.44.15.247 bpcd
10/08/2014 13:13:37 - Error bpbrm (pid=7368) cannot connect to xp53web0642vz, Operation now in progress (150)
10/08/2014 13:14:14 - Info bpbrm (pid=7368) got ERROR 82 from media manager
10-10-2014 09:01 AM
Still waiting on my firewall team. Some of these backups end with a 636, but a lot of them write a portion of the backup. One I am looking at now is hung on /VRTS_IMAGE_SIZE_RECORD.
10-23-2014 01:23 AM
Your post on 8 October shows a connection failure.
In that job, no connection could be made to the client via PBX (port 1556), vnetd (port 13724), or bpcd (port 13782).
Ensure port 1556 is open in both directions between the media server and the client.
Under certain circumstances, port connectivity is also needed between the master and the client (for example, for database or snapshot backups).
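If you want a quick check that does not depend on NetBackup itself, a minimal Python sketch like the one below attempts a plain TCP connection to the three ports named above. The function names and the 5-second timeout are my own choices, and a successful connect only shows that the port is reachable, not that the NetBackup service behind it is healthy.

```python
import socket

# Ports named in this thread: PBX, vnetd and bpcd.
NETBACKUP_PORTS = {"pbx": 1556, "vnetd": 13724, "bpcd": 13782}


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports=NETBACKUP_PORTS, timeout=5.0):
    """Return a dict mapping each service name to its reachability."""
    return {name: can_connect(host, port, timeout) for name, port in ports.items()}
```

Run `check_ports("xp53web0642vz")` on the media server to test the media server to client direction, then run it on the client against the media server's hostname to test the reverse direction.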