Forum Discussion

ankur1809
Level 5
11 years ago

Backup failure with error code 636

On one server, 4 streams are running (C:\, D:\, shadow copy and all local drives), and the shadow copy stream is failing continuously with error code 636.

10/07/2014 02:08:18 - Info bpbrm (pid=29405) nmcstraining is the host to backup data from
10/07/2014 02:08:19 - Info bpbrm (pid=29405) reading file list from client
10/07/2014 02:10:13 - Info bpbrm (pid=29405) starting bpbkar on client
10/07/2014 02:11:13 - Info bpbkar (pid=1984) Backup started
10/07/2014 02:11:13 - Info bpbrm (pid=29405) bptm pid: 29531
10/07/2014 02:11:14 - Info bptm (pid=29531) start
10/07/2014 02:11:18 - Info bptm (pid=29531) using 262144 data buffer size
10/07/2014 02:11:18 - Info bptm (pid=29531) using 30 data buffers
10/07/2014 02:11:23 - Info bptm (pid=29531) start backup
10/07/2014 02:11:35 - Info bptm (pid=29531) backup child process is pid 29560
10/07/2014 02:17:09 - Error bpbrm (pid=11841) could not write FILE ADDED message to OUTSOCK
10/07/2014 02:17:21 - Error bpbrm (pid=11841) could not write FILE ADDED message to OUTSOCK
10/07/2014 02:44:58 - Info nbjm (pid=10211) starting backup job (jobid=5194250) for client nmcstraining, policy PROD_REMOTE_NM_dalmedia19, schedule DLY_INCR
10/07/2014 02:44:58 - Info nbjm (pid=10211) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=5194250, request id:{D5FC3560-4DF5-11E4-9775-AC38CE536ADD})
10/07/2014 02:44:58 - requesting resource stu_disk_dalmedia19
10/07/2014 02:44:58 - requesting resource nbutx2.NBU_CLIENT.MAXJOBS.nmcstraining
10/07/2014 02:44:58 - requesting resource nbutx2.NBU_POLICY.MAXJOBS.PROD_REMOTE_NM_dalmedia19
10/07/2014 02:44:58 - granted resource  nbutx2.NBU_CLIENT.MAXJOBS.nmcstraining
10/07/2014 02:44:58 - granted resource  nbutx2.NBU_POLICY.MAXJOBS.PROD_REMOTE_NM_dalmedia19
10/07/2014 02:44:58 - granted resource  MediaID=@aaaeS;DiskVolume=PureDiskVolume;DiskPool=dp_disk_dalmedia19;Path=PureDiskVolume;StorageServer=dalmedia19;MediaServer=dalmedia19
10/07/2014 02:44:58 - granted resource  stu_disk_dalmedia19
10/07/2014 02:44:58 - estimated 1265157 kbytes needed
10/07/2014 02:44:58 - Info nbjm (pid=10211) resumed backup (backupid=nmcstraining_1412659253) job for client nmcstraining, policy PROD_REMOTE_NM_dalmedia19, schedule DLY_INCR on storage unit stu_disk_dalmedia19
10/07/2014 02:45:03 - started process bpbrm (pid=29405)
10/07/2014 02:45:09 - connecting
10/07/2014 02:47:01 - connected; connect time: 0:00:00
10/07/2014 02:48:23 - begin writing

 

  • Hi,

     

    Take a look at these older forum discussions.

     

    https://www-secure.symantec.com/connect/forums/status-636-status-42-errors-backing-msdp-pool

     

  • My guess is a timeout, as there are more than 300 seconds between the messages here:

    10/07/2014 02:11:35 - Info bptm (pid=29531) backup child process is pid 29560
    10/07/2014 02:17:09 - Error bpbrm (pid=11841) could not write FILE ADDED message to OUTSOCK

    If that is the case, increasing CLIENT_READ_TIMEOUT & CLIENT_CONNECT_TIMEOUT is the solution.
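
The gap between the two quoted log lines can be checked with a few lines of Python. This is only a sketch of the arithmetic: the timestamps are copied from the job log above, and `THRESHOLD_SECONDS` is a name made up here for the 300-second figure mentioned in this reply.

```python
from datetime import datetime

# Timestamp layout used by the job log lines quoted in this thread.
FMT = "%m/%d/%Y %H:%M:%S"

# Hypothetical name for the 300-second figure mentioned in the reply.
THRESHOLD_SECONDS = 300

last_ok = datetime.strptime("10/07/2014 02:11:35", FMT)      # bptm child process line
first_error = datetime.strptime("10/07/2014 02:17:09", FMT)  # first OUTSOCK error

gap_seconds = int((first_error - last_ok).total_seconds())
print(gap_seconds)                      # 334
print(gap_seconds > THRESHOLD_SECONDS)  # True
```

The 334-second silence is consistent with the timeout guess, but it does not by itself show which timeout expired.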

  • So you want me to increase CLIENT_READ_TIMEOUT & CLIENT_CONNECT_TIMEOUT up to ...?
  • I would take a look at your TCP keepalive settings on the media server, master server and any network device in between. Ensure the times match.

    With a mismatch, the connection can actually get closed.

    bpbrm tries to send data to nbjm and fails:
    "Error bpbrm (pid=11841) could not write FILE ADDED message to OUTSOCK"

    Eventually nbjm will check for messages from bpbrm and fail, throwing a 636 error because it determined the socket was already closed.

     

    This is something I have seen happen during NDMP jobs (http://www.symantec.com/docs/TECH214335), and it can also happen during other jobs such as SQL (http://www.symantec.com/docs/TECH197144).

    My coworkers have resolved about 90% of the 636 errors they encountered by pointing customers to this keepalive setting, after which the customers found differences in the values.

  • In the timeout options I have set Client read timeout to 900, but the other option, File browse timeout, is greyed out because Use OS dependent timeouts is checked.
  • mnolan, do I need to check its value through the registry settings, or through some other means as well? I do not have access to the client as it is a domain server (Windows). How do I check it on the master and media server (Linux)?
  • Both articles I linked reference another, http://www.symantec.com/docs/HOWTO56221, which covers the registry setting for Windows. I do not have any Linux documentation handy. Please google "TCP keepalive timeouts" plus your Linux distro. Then also talk with your network admins to determine the timeout on any device in the route between the two servers.
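
For the Linux master and media servers, the kernel exposes its TCP keepalive settings as files under /proc/sys/net/ipv4. The sketch below is an illustration only, not something taken from the linked articles. It reads `tcp_keepalive_time`, `tcp_keepalive_intvl` and `tcp_keepalive_probes` (the first two are in seconds) so the values from two hosts can be compared. The helper names `read_keepalive` and `mismatches` are made up for this example.

```python
from pathlib import Path

# Standard Linux kernel keepalive tunables (time and intvl are in seconds).
KEEPALIVE_KEYS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive(base="/proc/sys/net/ipv4"):
    """Return the keepalive settings found under `base` as a dict of ints."""
    values = {}
    for key in KEEPALIVE_KEYS:
        path = Path(base) / key
        if path.is_file():  # skip quietly on hosts without this file
            values[key] = int(path.read_text().strip())
    return values


def mismatches(a, b):
    """Return {key: (value_a, value_b)} for settings that differ between two hosts."""
    return {k: (a.get(k), b.get(k)) for k in KEEPALIVE_KEYS if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Run this on each Linux server and compare the printed values by hand.
    for key, value in read_keepalive().items():
        print(f"{key} = {value}")
```

Run it on the master and on the media server, then compare the output. Timeouts on firewalls or other devices in between still have to come from the network admins.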

  • mnolan, I wanted to add that only the shadow copy stream is failing; all the other streams are perfectly fine.