Having trouble with 636 status code

backup-botw
Level 6

I am having some trouble tracking down what is causing the 636 error for some backups. It only happens on backups using the DMZ media server we have here. I know you guys will want to see some logs to help with the issue so let me know what all is needed and I will gladly grab that info and share.

I am seeing this issue on several clients running various operating systems, Windows and Unix alike, but they all use the same media server. I am just using the one client below as a baseline. Normally when I see this, it's hung bpbkar processes on a Windows Server 2003 machine; I can cycle the client, stop the hung processes, and then it works. With these clients, that's not solving the issue for me.

NetBackup version on Master - 7.6.0.2

Master server OS - Solaris 10

Client OS - Solaris 10

NB version on Client - 7.6.0.2

Running processes on client:

root@xp53web0642vz:/usr/openv/netbackup/bin ->./bpps -a
NB Processes
------------
    root 24787     1   0 13:42:16 ?           0:01 /usr/openv/netbackup/bin/vnetd -standalone
    root 27235 24791   0 07:21:35 ?           0:00 /usr/openv/netbackup/bin/bpcd -standalone
    root 24822     1   0 13:42:17 ?           0:04 /usr/openv/netbackup/bin/nbdisco
    root 24791     1   0 13:42:16 ?           0:01 /usr/openv/netbackup/bin/bpcd -standalone
    root 24830     1   0 13:42:17 ?           0:00 /usr/openv/pdde/pdag/bin/mtstrmd
 

Job Details:

10/08/2014 05:35:35 - Info nbjm (pid=1354) starting backup job (jobid=2133986) for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental
10/08/2014 05:35:35 - Info nbjm (pid=1354) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=2133986, request id:{D62AAC74-4ED6-11E4-A400-002128A6F97C})
10/08/2014 05:35:35 - requesting resource xp53app034z-hcart-robot-tld-1
10/08/2014 05:35:35 - requesting resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 05:35:35 - requesting resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 05:35:40 - awaiting resource xp53app034z-hcart-robot-tld-1. Maximum job count has been reached for the storage unit.
10/08/2014 06:38:35 - granted resource  xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 06:38:35 - granted resource  xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 06:38:35 - granted resource  OM1490
10/08/2014 06:38:35 - granted resource  HP.ULTRIUM4-SCSI.005
10/08/2014 06:38:35 - granted resource  xp53app034z-hcart-robot-tld-1
10/08/2014 06:38:41 - estimated 0 kbytes needed
10/08/2014 06:38:41 - Info nbjm (pid=1354) started backup (backupid=xp53web0642vz_1412768321) job for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental on storage unit xp53app034z-hcart-robot-tld-1
read from input socket failed  (636)

1 ACCEPTED SOLUTION


mnolan
Level 6
Employee Accredited Certified

Here is my comment from another thread.

"

I would take a look at your TCP keepalive settings on the media server, master server and any network device in between. Ensure the times match.

What can happen with a mismatch is that the connection actually gets closed.

bpbrm tries to send data to nbjm and fails:
"Error bpbrm (pid=11841) could not write FILE ADDED message to OUTSOCK"

Eventually nbjm checks for messages from bpbrm and fails, throwing a 636 error because it determines the socket was already closed.

 

This is something I have seen happen during NDMP jobs, http://www.symantec.com/docs/TECH214335 , which can also happen during other jobs such as SQL, http://www.symantec.com/docs/TECH197144

My coworkers have resolved about 90% of the 636 errors they encountered by pointing customers towards this keepalive setting, where the customers then found differences in the values."

Here is a way to check on Windows: http://www.symantec.com/docs/HOWTO56221 . Windows and Unix have different defaults. I would imagine the issue for you is that the firewall for the DMZ most likely has a lower setting than the other servers and ends up closing the connection.
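
To make the comparison concrete, here is a rough Python sketch (not from the tech notes above) that flags hosts whose keepalive idle time is not shorter than the firewall's idle timeout. All numbers are example values I made up for illustration, and the names `to_seconds` and `hosts_at_risk` are hypothetical; plug in whatever you actually read from each server and whatever the firewall team reports.

```python
# Where the values usually come from (check your own OS documentation):
#   Solaris: ndd -get /dev/tcp tcp_keepalive_interval   (milliseconds)
#   Linux:   sysctl net.ipv4.tcp_keepalive_time         (seconds)
#   Windows: KeepAliveTime under
#            HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters (milliseconds)


def to_seconds(value, unit):
    """Normalize a keepalive value to seconds; unit is 'ms' or 's'."""
    if unit == "ms":
        return value / 1000
    if unit == "s":
        return value
    raise ValueError(f"unknown unit: {unit!r}")


def hosts_at_risk(keepalive_seconds, firewall_idle_timeout_seconds):
    """Return hosts whose first keepalive probe would go out only after
    the firewall may already have dropped an idle connection."""
    return sorted(
        host
        for host, seconds in keepalive_seconds.items()
        if seconds >= firewall_idle_timeout_seconds
    )


if __name__ == "__main__":
    # Example values only (assumptions, not taken from this environment).
    keepalive = {
        "master server": to_seconds(7_200_000, "ms"),
        "media server": to_seconds(7_200_000, "ms"),
    }
    firewall_idle_timeout = 3600  # seconds, hypothetical firewall setting
    print(hosts_at_risk(keepalive, firewall_idle_timeout))
```

Any host the script lists would be a candidate for lowering its keepalive time below the firewall timeout, or for asking the firewall team to raise theirs.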


6 REPLIES 6

backup-botw
Level 6

Where can I find the keepalive setting on the master and media servers? From there I will take that to my firewall team and see what they have on their end.

mnolan
Level 6
Employee Accredited Certified

This is an OS setting.

"Here is a way to check via windows. http://www.symantec.com/docs/HOWTO56221 "

For Unix, you'll need to search for "tcp keepalive" plus the name of your Unix distribution.

backup-botw
Level 6

I have found the setting on the master and media server. I am just waiting to hear back from the firewall guy and see what is set on their end.

 

While I am waiting to hear back, I figured I would share this data from a backup that was trying to run just recently...

 

10/08/2014 11:23:19 - Info nbjm (pid=1354) starting backup job (jobid=2134414) for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental
10/08/2014 11:23:19 - Info nbjm (pid=1354) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=2134414, request id:{6A0A246C-4F07-11E4-984B-002128A6F97C})
10/08/2014 11:23:19 - requesting resource xp53app034z-hcart-robot-tld-1
10/08/2014 11:23:19 - requesting resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 11:23:19 - requesting resource xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 11:23:19 - Info nbrb (pid=1307) Limit has been reached for the logical resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 12:30:05 - awaiting resource xp53app034z-hcart-robot-tld-1. Maximum job count has been reached for the storage unit.
10/08/2014 12:30:05 - Info nbrb (pid=1307) Limit has been reached for the logical resource xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 12:55:33 - awaiting resource xp53app034z-hcart-robot-tld-1. Maximum job count has been reached for the storage unit.
10/08/2014 12:55:34 - granted resource  xp53tape001.NBU_CLIENT.MAXJOBS.xp53web0642vz
10/08/2014 12:55:34 - granted resource  xp53tape001.NBU_POLICY.MAXJOBS.Omaha_DMZ_Unix
10/08/2014 12:55:34 - granted resource  OM1490
10/08/2014 12:55:34 - granted resource  HP.ULTRIUM4-SCSI.005
10/08/2014 12:55:34 - granted resource  xp53app034z-hcart-robot-tld-1
10/08/2014 12:55:35 - estimated 0 kbytes needed
10/08/2014 12:55:35 - Info nbjm (pid=1354) started backup (backupid=xp53web0642vz_1412790935) job for client xp53web0642vz, policy Omaha_DMZ_Unix, schedule Incremental on storage unit xp53app034z-hcart-robot-tld-1
10/08/2014 12:55:36 - started process bpbrm (pid=21339)
10/08/2014 13:13:37 - Info bpbrm (pid=7368)  status: FAILED, (44) CONNECT_TIMEOUT; system: (150) Operation now in progress; FROM 0.0.0.0 TO xp53web0642vz 204.44.15.247 bpcd VIA pbx
10/08/2014 13:13:37 - Info bpbrm (pid=7368)  status: FAILED, (44) CONNECT_TIMEOUT; system: (150) Operation now in progress; FROM 0.0.0.0 TO xp53web0642vz 204.44.15.247 bpcd VIA vnetd
10/08/2014 13:13:37 - Info bpbrm (pid=7368)  status: FAILED, (44) CONNECT_TIMEOUT; system: (145) Connection timed out; FROM 0.0.0.0 TO xp53web0642vz 204.44.15.247 bpcd
10/08/2014 13:13:37 - Error bpbrm (pid=7368) cannot connect to xp53web0642vz, Operation now in progress (150)
10/08/2014 13:14:14 - Info bpbrm (pid=7368) got ERROR 82 from media manager
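
For anyone sifting through a lot of these, the three CONNECT_TIMEOUT lines above each name the connection method that was tried (pbx, vnetd, then bpcd directly). A small Python sketch like the one below can pull those fields out of saved job details. The regular expression is my own guess based only on the lines in this post, not a documented NetBackup log format, and `parse_connect_failure` is a hypothetical helper name.

```python
import re

# Pattern inferred from the job detail lines in this thread (an assumption).
CONNECT_FAILURE = re.compile(
    r"status: FAILED, \((?P<code>\d+)\) (?P<name>\w+); "
    r"system: \((?P<errno>\d+)\) (?P<errtext>[^;]+); "
    r"FROM (?P<src>\S+) TO (?P<host>\S+) (?P<ip>\S+) (?P<service>\S+)"
    r"(?: VIA (?P<via>\S+))?"
)


def parse_connect_failure(line):
    """Return a dict of fields for a connect-failure line, or None."""
    match = CONNECT_FAILURE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Lines without a VIA clause are direct connections to the service port.
    fields["via"] = fields["via"] or "direct"
    return fields


if __name__ == "__main__":
    sample = (
        "10/08/2014 13:13:37 - Info bpbrm (pid=7368)  status: FAILED, "
        "(44) CONNECT_TIMEOUT; system: (150) Operation now in progress; "
        "FROM 0.0.0.0 TO xp53web0642vz 204.44.15.247 bpcd VIA pbx"
    )
    print(parse_connect_failure(sample))
```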

backup-botw
Level 6

Still waiting on my firewall team. Some of these backups end with a 636, but a lot of them do write a portion of the backup first. One I am looking at now is hung on /VRTS_IMAGE_SIZE_RECORD.

Marianne
Moderator
Partner    VIP    Accredited Certified

Your post on 8 October shows a connection failure.
Here no connection could be made to PBX (port 1556), to vnetd (port 13724), or to bpcd (port 13782).

Ensure port 1556 is open in both directions between the media server and the client.
Under certain circumstances, port connectivity is also needed between the master and the client (for example, for database or snapshot backups).
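
If you want a quick check before the firewall team gets back to you, a minimal Python sketch along these lines tests whether a TCP connection can be opened to the three ports mentioned above. It is only an illustration: the function names are hypothetical, it proves nothing beyond the TCP handshake, and it only tests the direction from the machine you run it on, so you would run it on the media server against the client and again on the client against the media server. NetBackup's own `bptestbpcd -client <name>` on the media server is the more thorough test.

```python
import socket
import sys

# Ports named in the reply above.
NETBACKUP_PORTS = {"pbx": 1556, "vnetd": 13724, "bpcd": 13782}


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_client(host, ports=NETBACKUP_PORTS, timeout=5.0):
    """Return a dict mapping each service name to True/False reachability."""
    return {
        name: port_is_reachable(host, port, timeout)
        for name, port in ports.items()
    }


if __name__ == "__main__":
    # Usage: pass the host name to test as the only argument.
    target = sys.argv[1]
    for service, ok in check_client(target).items():
        print(f"{service:6s} {'open' if ok else 'NOT reachable'}")
```

A port reported as not reachable here on all three services, as in the 8 October job details, points at the firewall or routing rather than at NetBackup itself.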