Forum Discussion

bbot
Level 4
10 years ago
Solved

Backup hanging in Active state, won't go past "begin writing"

Out of our 100+ clients, one keeps getting stuck: the job shows as "Active" but hangs at "begin writing". The current job has been running for over 36 hours. In the past, its backups completed successfully in around 12-18 hours. This server does have a large data store, ~10 TB of data.

About 6 days earlier, the same client failed with exit status 41 (network timeout). Since then, its backups have been hanging.

Here are the job details:

1/20/2016 11:00:00 PM - Info nbjm(pid=4184) starting backup job (jobid=37269) for client lasfs01.corp.tlcinternal.us, policy LASFS01_SC, schedule Wednesday  
1/20/2016 11:00:00 PM - Info nbjm(pid=4184) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=37269, request id:{4BE50878-5D68-4336-A93E-8D8B41177185})  
1/20/2016 11:00:00 PM - requesting resource LAS
1/20/2016 11:00:00 PM - requesting resource .NBU_CLIENT.MAXJOBS.server#####
1/20/2016 11:00:00 PM - requesting resource .NBU_POLICY.MAXJOBS.LASFS01_SC
1/20/2016 11:00:00 PM - granted resource .NBU_CLIENT.MAXJOBS.server####
1/20/2016 11:00:00 PM - granted resource .NBU_POLICY.MAXJOBS.LASFS01_SC
1/20/2016 11:00:00 PM - granted resource MediaID=@aaaab;DiskVolume=lasbackup;DiskPool=lasbackup;Path=lasbackup;StorageServer=10.64.128.40;MediaServer=masterserver#####
1/20/2016 11:00:00 PM - granted resource LAS
1/20/2016 11:00:01 PM - estimated 17604715 Kbytes needed
1/20/2016 11:00:01 PM - Info nbjm(pid=4184) started backup (backupid=lasfs01.corp.tlcinternal.us_1453359601) job for client server######, policy LASFS01_SC, schedule Wednesday on storage unit LAS
1/20/2016 11:00:02 PM - Info bpbrm(pid=6892) server###### is the host to backup data from     
1/20/2016 11:00:02 PM - Info bpbrm(pid=6892) reading file list for client        
1/20/2016 11:00:02 PM - started process bpbrm (6892)
1/20/2016 11:00:02 PM - connecting
1/20/2016 11:00:03 PM - Info bpbrm(pid=6892) starting bpbkar32 on client         
1/20/2016 11:00:03 PM - connected; connect time: 0:00:01
1/20/2016 11:00:05 PM - Info bpbkar32(pid=4940) Backup started           
1/20/2016 11:00:05 PM - Info bpbkar32(pid=4940) change time comparison:<disabled>          
1/20/2016 11:00:05 PM - Info bpbkar32(pid=4940) archive bit processing:<enabled>          
1/20/2016 11:00:06 PM - Info bptm(pid=9452) start            
1/20/2016 11:00:06 PM - Info bptm(pid=9452) using 262144 data buffer size        
1/20/2016 11:00:06 PM - Info bptm(pid=9452) setting receive network buffer to 1049600 bytes      
1/20/2016 11:00:06 PM - Info bptm(pid=9452) using 30 data buffers         
1/20/2016 11:00:09 PM - Info bptm(pid=9452) start backup           
1/20/2016 11:00:09 PM - Info bptm(pid=9452) backup child process is pid 6296.6936       
1/20/2016 11:00:09 PM - Info bptm(pid=6296) start            
1/20/2016 11:00:09 PM - begin writing

 

On the client, I pulled the bpbkar log, and it shows about 22 hours of the lines below, repeated over and over:

09:28:17.002 [4940.832] <2> dtcp_read: TCP - success: recv socket (580), 4 of 4 bytes
09:28:17.002 [4940.832] <4> bpio::read_string: INF - read non-blocking message of length 1
09:28:17.002 [4940.832] <2> dtcp_read: TCP - success: recv socket (580), 1 of 1 bytes
09:28:17.002 [4940.832] <4> tar_backup::readServerMessage: INF - keepalive message received
09:28:17.002 [4940.832] <4> tar_base::keepaliveThread: INF - sending keepalive
09:28:17.002 [4940.832] <2> dtcp_write: TCP - success: send socket (492), 1 of 1 bytes
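To confirm that a long bpbkar log really contains nothing but this keepalive exchange, a small script can tally the entries per function and list anything else. This is a minimal sketch, not a NetBackup tool: the line layout (time, `[pid.tid]`, `<level>`, function, message) is inferred only from the six lines quoted above, and `summarize` and `KEEPALIVE_FUNCS` are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Layout assumed from the excerpt above: time, [pid.tid], <level>, function, message.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<level>\d+)> (?P<func>[^ ]+?): (?P<msg>.*)$"
)

# Functions that appear in the quoted keepalive loop.
# Assumption: entries from these functions are treated as keepalive noise.
KEEPALIVE_FUNCS = {
    "dtcp_read",
    "dtcp_write",
    "bpio::read_string",
    "tar_backup::readServerMessage",
    "tar_base::keepaliveThread",
}


def summarize(lines):
    """Count entries per function and collect entries outside the keepalive loop."""
    counts = Counter()
    other = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip blank lines or lines in a different layout
        counts[m["func"]] += 1
        if m["func"] not in KEEPALIVE_FUNCS:
            other.append((m["time"], m["func"], m["msg"]))
    return counts, other
```

Passing an open log file to `summarize` returns the per-function counts and a list of non-keepalive entries. If that list is empty for the whole hang window, the client never started sending file data, which would point away from the network and toward something earlier in the backup, such as snapshot creation.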

 

  • It's gonna work fine after a reboot. As soon as you forget about it, VSS will quietly break again, you'll discover it after a few days of missed backups, and the whole cycle will begin again. Welcome to the wonderful world of backing up Microsoft products :)


  • Check VSS first.  Do these all work?

    vssadmin list providers
    
    vssadmin list volumes
    
    vssadmin list shadowstorage
    
    vssadmin list shadows
    
    vssadmin list writers
    
    vssadmin list writers | find /i "last"
    
    vssadmin list writers | find /i "state"
  • @sdo All commands run fine on the master server.



    On the client, all commands work up until "writers." It says responses may be delayed if a shadow copy is being prepared. I currently have 2 active backup jobs on this server, which could be causing the delay. (I wanted to see what happened if I left it running after restarting all the services.)

    The shadowstorage and shadows commands both come back with "no items found that satisfy the query".

  • Sorry - commands were meant for the client.  Did one of the vssadmin commands hang/fail and not complete?

  • I have experienced a similar issue. In that case the problem was a bpbkar32 process hanging while waiting on I/O, but after your description of the vssadmin list writers behaviour, I would guess that a VSS snapshot never completes.

    I would ask for a reboot of the client, and create the bpfis folder under ../NetBackup/logs for future troubleshooting of the snapshots.

    If a reboot is not possible, stop the NetBackup services and kill any lingering bpbkar32 and bpfis processes in Task Manager.

    Wait at least 5 minutes and then start the NetBackup Client Service, which should start all the needed services.

  • Is it possible to back up this client in a separate policy using the FlashBackup-Windows policy type?

  • None of these commands worked, even with all active jobs stopped from the master server. They hang at "Waiting for responses. These may be delayed if a shadow copy is being prepared".

    vssadmin list writers
    
    vssadmin list writers | find /i "last"
    
    vssadmin list writers | find /i "state"

    Stopping the services and killing the processes didn't seem to help.

     

    We're going to reboot this client tonight and try again.

  • Try running the backup in multiple streams and look for a particular drive or folder where the backup gets stuck.
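The vssadmin checks discussed earlier in the thread can be wrapped in a time limit, so a hanging `vssadmin list writers` shows up as a clear result instead of a stuck console. This is a rough sketch under a few assumptions: it is meant to run from an elevated prompt on the client, the 120-second limit is an arbitrary choice, and `probe` and its `run` parameter are hypothetical names (the parameter exists only so the function can be tried without vssadmin present).

```python
import subprocess

# The vssadmin subcommands suggested in the thread.
VSS_COMMANDS = [
    ["vssadmin", "list", "providers"],
    ["vssadmin", "list", "volumes"],
    ["vssadmin", "list", "shadowstorage"],
    ["vssadmin", "list", "shadows"],
    ["vssadmin", "list", "writers"],
]


def probe(commands=VSS_COMMANDS, timeout=120, run=subprocess.run):
    """Run each command with a time limit; return {command string: status}."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        try:
            done = run(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            # The command did not finish within the limit: treat it as hung.
            results[key] = "hung"
        except FileNotFoundError:
            # The executable is not available (for example, not on Windows).
            results[key] = "not found"
        else:
            # Report the exit code as-is; this sketch does not interpret it.
            results[key] = "ok" if done.returncode == 0 else f"exit {done.returncode}"
    return results
```

Calling `probe()` and printing the returned dictionary gives one status per command. A "hung" entry for `vssadmin list writers` alone would match the behaviour described above and supports the suspicion that a VSS snapshot is not completing.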