Forum Discussion

AnthonyTsang
11 years ago

Backup failure with error code 636

Media Server: RedHat 2.6 with NBU 7.0

Master Server: Windows 2008 R2 with NBU 7.5.0.5

The policy has two schedules:

1) Full Backup (Full Backup)

2) Cumulative Incremental (Differential Backup)

The yearly full backup works, but the monthly cumulative incremental fails.

Please help.

Log

3/4/2014 5:35:10 PM - Info nbjm(pid=12396) starting backup job (jobid=143114) for client drs1estm1b.intra.whb.net, policy DRS1ESTM1B_ARCHIVE, schedule Monthly_Diff_LES 
3/4/2014 5:35:10 PM - Info nbjm(pid=12396) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=143114, request id:{E5645178-5B4D-4348-998E-95600F02A8A1}) 
3/4/2014 5:35:10 PM - requesting resource drs1estm1b-hcart3-robot-tld
3/4/2014 5:35:10 PM - requesting resource whb1nbum1v-vaa.NBU_CLIENT.MAXJOBS.drs1estm1b.intra.whb.net
3/4/2014 5:35:10 PM - requesting resource whb1nbum1v-vaa.NBU_POLICY.MAXJOBS.DRS1ESTM1B_ARCHIVE
3/4/2014 5:35:11 PM - granted resource whb1nbum1v-vaa.NBU_CLIENT.MAXJOBS.drs1estm1b.intra.whb.net
3/4/2014 5:35:11 PM - granted resource whb1nbum1v-vaa.NBU_POLICY.MAXJOBS.DRS1ESTM1B_ARCHIVE
3/4/2014 5:35:11 PM - granted resource LESM15
3/4/2014 5:35:11 PM - granted resource Drive088
3/4/2014 5:35:11 PM - granted resource DRS1ESTM1B-hcart3-robot-tld-1
3/4/2014 5:35:11 PM - estimated 0 Kbytes needed
3/4/2014 5:35:11 PM - Info nbjm(pid=12396) started backup (backupid=drs1estm1b.intra.whb.net_1393925711) job for client drs1estm1b.intra.whb.net, policy DRS1ESTM1B_ARCHIVE, schedule Monthly_Diff_LES on storage unit DRS1ESTM1B-hcart3-robot-tld-1
3/4/2014 5:35:12 PM - started process bpbrm (13742)
3/4/2014 5:35:14 PM - connecting
3/4/2014 5:35:15 PM - connected; connect time: 00:00:01
3/4/2014 5:35:19 PM - mounting LESM15
3/4/2014 5:35:28 PM - mounted; mount time: 00:00:09
3/4/2014 5:35:28 PM - positioning LESM15 to file 5
3/4/2014 5:35:28 PM - positioned LESM15; position time: 00:00:00
3/4/2014 5:35:28 PM - begin writing
read from input socket failed(636)
3/4/2014 5:47:45 PM - Error bptm(pid=13765) media manager terminated by parent process      
3/4/2014 5:47:49 PM - Error bpbrm(pid=13742) could not write EXIT STATUS to stderr  

 

  • The last time I saw a 636 error it had nothing to do with bpsynth.

    This was the issue, http://www.symantec.com/docs/TECH214335

    Which was the same as http://www.symantec.com/docs/TECH197144

    The root cause here was that the TCP keepalive time on the media server and master server was higher than the firewall's idle timeout.

    NBJM times out and writes the message into the detailed status without a timestamp.

    The only reason I can think of for this failing only on an incremental schedule is that the schedule uses a different storage unit/media server than the full backup. Check whether that pattern is there, then investigate the timeout settings on both operating systems.
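A minimal sketch of that check for the Linux (media server) side, assuming the standard `/proc/sys/net/ipv4/tcp_keepalive_time` setting, which defaults to 7200 seconds on Linux. The function name `keepalive_below_timeout` is hypothetical, and the 3600-second firewall timeout is only a placeholder; the real value has to come from whoever manages the firewall or load balancer. The Windows master server uses a different mechanism that is not shown here.

```shell
#!/bin/sh
# Succeeds (exit 0) when the keepalive idle time is strictly lower than
# the firewall idle timeout, i.e. probes start before the firewall
# would drop the idle connection.
keepalive_below_timeout() {
    keepalive=$1   # seconds, e.g. the value of net.ipv4.tcp_keepalive_time
    firewall=$2    # seconds, idle timeout of the firewall / load balancer
    [ "$keepalive" -lt "$firewall" ]
}

# ASSUMPTION: placeholder value, replace with the real firewall timeout.
FIREWALL_IDLE_TIMEOUT=3600
PROC_FILE=/proc/sys/net/ipv4/tcp_keepalive_time

# Only runs on hosts that expose the Linux setting.
if [ -r "$PROC_FILE" ]; then
    current=$(cat "$PROC_FILE")
    if keepalive_below_timeout "$current" "$FIREWALL_IDLE_TIMEOUT"; then
        echo "keepalive ${current}s is below the firewall timeout ${FIREWALL_IDLE_TIMEOUT}s"
    else
        echo "keepalive ${current}s is NOT below the firewall timeout ${FIREWALL_IDLE_TIMEOUT}s"
    fi
fi
```

If the second message is printed, an idle connection could be dropped by the firewall before the first keepalive probe is sent, which matches the pattern described in the technotes above.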

  • bptm:

    12:00:48.500 [17554] <2> write_data: completed writing backup header, start writing data when first buffer is available, copy 1

    Then nothing... no data from client.

    bpbkar shows no update for 7 minutes (no data sent):

    11:51:04.161 [12833] <4> bpbkar PrintFile: /applications/esmt/archive/
    11:58:32.385 [12833] <16> bpbkar sighandler: ERR - bpbkar killed by SIGPIPE

    The backup is killed after 7 minutes of receiving nothing from the client.

    Weird that the status is not logged as a timeout, but it looks like one.

    What is the Client Read timeout on the media server?

    I want to repeat from my previous post:

    The client has to 'walk' the filesystem to search for files and folders that were modified since the last full backup.
    ...  Have you tried to do the same at OS level?

    Something like: 
    # find /applications/esmt/archive -mtime -<no of days since last full backup>

    Keep an eye on system resources while command is running.

    What is the result?

     

    The reason why UAT is working is probably because there is a lot less data than on production...
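The OS-level test suggested above can be wrapped in a small timing helper. This is only a sketch: the function name `walk_recent` is hypothetical, the 30-day window in the usage line is a placeholder for the real number of days since the last full backup, and `-type f` is added here so that only files are counted (the walk still visits every directory).

```shell
#!/bin/sh
# walk_recent DIR DAYS
# Walks DIR, counts files modified less than DAYS*24 hours ago,
# and prints "<file count> <elapsed seconds>".
walk_recent() {
    dir=$1
    days=$2
    start=$(date +%s)
    # -mtime -N matches entries modified less than N*24 hours ago
    count=$(find "$dir" -type f -mtime "-$days" 2>/dev/null | wc -l | tr -d ' ')
    end=$(date +%s)
    echo "$count $((end - start))"
}

# Usage (path from this thread; 30 is a placeholder):
#   walk_recent /applications/esmt/archive 30
```

If the elapsed time on production comes close to or exceeds the roughly 7 minutes seen in the bpbkar log, that would support the theory that the client sends no data while it is still walking the filesystem.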

  • Hi,

    I will check how long it takes to generate the file list with the command below:

    # find /applications/esmt/archive -mtime -<no of days since last full backup>

    One thing I want to mention: the file system is GFS. Are there any precautions?

    After reading your reply: do you think the failure was caused by a timeout while the client was generating the backup file list?

     

  • The file system is GFS. Are there any precautions?

    How do I check the operating system timeout value that applies while the file list is being generated?

  • Hi,

    I now roughly understand that the timeout is caused by the load balancer. The KB you provided says the solution is to set the master server's keepalive time lower than the load balancer's timeout value.

    However, what will the impact on the master server be if this solution is applied?

  • Absolutely nothing

    All of my master and media servers have this setting configured, and it has not caused any issues at all.

  • If the keepalive time is configured on the master server only, will that solve the problem?

    As I understand it, once the setting is applied, the master server sends keepalive probes to the media server within the firewall/load balancer timeout value, which keeps the connection alive.