
What should happen to running jobs when NBU master restarts

mikebounds
Level 6
Partner Accredited

What is the expected behaviour when:

  1. NBU master crashes and reboots (or is brought online on the failover node in a cluster)
  2. NBU master is stopped forcefully (I believe bp.kill is used) and then restarted (on the same node, or on the failover node in a cluster)
  3. NBU master is stopped cleanly (if such an option exists) and then restarted (I don't believe this is applicable to a cluster, as I don't THINK there is any way to configure the VCS NBU agent to cleanly stop NBU master)

Whenever NBU master has been restarted in our environment (a VCS cluster), we see:

a. All jobs are killed so they have to restart from scratch.

b. Jobs are not killed cleanly, so for SAP backups in particular the Oracle database is left in backup mode, meaning that when the backup is restarted it fails.

So I am hoping there is some way to cleanly stop the NBU master (preferably integrated into the VCS cluster) so that:

  1. Ideally the backup jobs do not stop, so that the media servers just buffer information about the files being backed up and transfer it to NBU master when it comes back up (with some timeout, so if NBU master doesn't come back up, the backup job fails).  Alternatively the backup could pause (with a timeout) until NBU master returns
  2. If there is some reason 1 cannot be done, then jobs should be killed cleanly, so that databases are not left in backup mode.

I did look in the "admin volume 1" guide, which says that when shutting down NBU you should "make sure that no jobs are running". That will never be the case in our environment: there are 32 active jobs at the moment and we could potentially have over 100. I couldn't find any guidance in the docs on what to do if you have jobs running.

Mike

8 REPLIES

revarooo
Level 6
Employee

So I am hoping there is some way to cleanly stop the NBU master (preferably integrated into the VCS cluster) so that:

  1. Ideally the backup jobs do not stop, so that the media servers just buffer information about the files being backed up and transfer it to NBU master when it comes back up (with some timeout, so if NBU master doesn't come back up, the backup job fails).  Alternatively the backup could pause (with a timeout) until NBU master returns

 

  This cannot happen - the daemons responsible for backups on the Master have died. One option is to enable checkpointing in your policies, so that restarted backup jobs resume from the last checkpoint.

  2. If there is some reason 1 cannot be done, then jobs should be killed cleanly, so that databases are not left in backup mode

 I do not know about database backups - I'm sure someone else will pitch in here.

mikebounds
Level 6
Partner Accredited

Can you elaborate on:

  This cannot happen - the daemons responsible for backups on the Master have died

It is my understanding that NBU master schedules the backup, but the actual backup runs on the Media server. So once the media server has actually started writing, my understanding is that the NBU master's role is just to record the files backed up in the NBU catalog.

Looking at https://www-secure.symantec.com/connect/sites/default/files/NBU%207.x%20process%20flow%20QRC_1.pdf this says:

11. bpbkar sends information about the backup image to bpbrm, which forwards it to bpdbm on the master server. This stream of metadata is sent throughout the backup and stored in the master server’s Image database. 

12. When mounting and positioning of the media in the drive, or of the disk to be used, have been accomplished, the client backup process, bpbkar, will begin sending backup data to the bptm child process on the media server system. The bptm child process receives the image and stores it block by block into a shared memory segment on the media server. The parent bptm process retrieves the image from shared memory and directs it block by block to the allocated storage media. 

13. When the backup has been completed bptm will notify bpbrm, which in turn will notify the Job Manager nbjm that the job has finished. bptm will also notify nbjm that it is done with the media.

 

So this sounds like the list of files backed up is sent to NBU master in step 11 before the backup starts, and in step 12 there is no communication with the NBU master, except I guess progress updates. So I don't understand why the NBU media server cannot carry on writing if the NBU master is not available for a short period of time.

Mike

revarooo
Level 6
Employee

Can you elaborate on:

  This cannot happen - the daemons responsible for backups on the Master have died

The connections from bpbrm to bpdbm and from bptm to nbjm are gone, and they can't just be re-established on the fly.

bpbrm/bptm is then likely to end on the media server (or end up in a hung state) as it cannot communicate any more.

 

Nicolai
Moderator
Partner    VIP   

You can start/stop NetBackup in a VCS cluster using

hares -offline nbu_server -sys {node}

hares -online nbu_server -sys {node}

If you want to take all resources offline, use

hagrp -offline nbu_grp -sys {node}

hagrp -online nbu_grp -sys {node}

There is no method to save running jobs across a NetBackup restart. You can suspend file system backups before the restart and then resume them afterwards, but this is not possible for database backups.
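
Since running jobs cannot be carried across a restart, one precaution is to check what is active before taking the group offline. Below is a minimal sketch in Python, under stated assumptions: the job data comes from `bpdbjobs -report -most_columns`, and its comma-separated output is assumed to begin with job ID, job type and state, with state 1 meaning active. Verify that layout against the bpdbjobs documentation for your release. The function names are hypothetical; `nbu_grp` is the group name used in the commands above.

```python
import subprocess

ACTIVE_STATE = "1"  # assumption: state code 1 means "active" in bpdbjobs output


def active_job_ids(report_text):
    """Return job IDs whose state column marks them as active.

    Assumes comma-separated lines that begin: jobid,jobtype,state,...
    """
    ids = []
    for line in report_text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        if fields[2] == ACTIVE_STATE:
            ids.append(int(fields[0]))
    return ids


def offline_command(group, node):
    """Build the hagrp command from the post above as an argv list."""
    return ["hagrp", "-offline", group, "-sys", node]


def offline_if_idle(report_text, group, node, run=subprocess.run):
    """Take the group offline only when no job is active.

    Returns the active job IDs that blocked the offline (empty if it ran).
    """
    busy = active_job_ids(report_text)
    if not busy:
        run(offline_command(group, node), check=True)
    return busy
```

In use, `report_text` would be the captured output of the bpdbjobs call, and a non-empty return value tells the operator which jobs to suspend or wait for first.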

 

mikebounds
Level 6
Partner Accredited

Thanks for the replies.

So to clarify:

If NBU media loses connection to NBU master then:

  1. Does the NBU media stop backing up as soon as it loses connection to NBU master, or is there a timeout? If there is a timeout, what is the default and is it configurable?
  2. If NBU master returns and the same processes are running (i.e. the connection was lost due to a network outage), I am assuming backup jobs continue on the NBU media?
  3. If NBU master returns by restarting, so there is a fresh set of daemons, the NBU media connects, and if the job is a filesystem backup AND the job was suspended before the NBU media lost connection to the NBU master, then you can resume the backup
  4. If NBU master returns by restarting, so there is a fresh set of daemons, and you did not suspend the backup job before NBU media lost connection to the NBU master, then the job will cancel and you will need to restart the jobs
  5. If NBU master returns by restarting, so there is a fresh set of daemons, then any database or SAP backups will cancel and you will need to restart the jobs
  6. Under what scenarios will Media server processes hang when NBU master is restarted?
     

I know you can stop NBU master by using hares - this is what I did, and from a ps it looked as though the VCS NBU agent was using bp.kill, but I can't find any documentation to confirm what the VCS NBU agent does, or whether it is configurable (like the VCS Oracle agent, where you can choose a shutdown option to do a clean or forced shutdown)

Mike

revarooo
Level 6
Employee

1. There is a timeout, but if there is a prolonged outage the backup will fail. bpbrm on the media server has to send metadata to bpdbm on the master. Any prolonged delay and the backups will fail.

2. See above: backups will fail if there is a prolonged outage. With a network outage of a few seconds the backups may survive.

3. Yes, I believe so - I have never tried it with restarting a Master server's NBU daemons before though.

4. Yes

5. According to a previous post on here, yes.

6. If you shut down a master server or interrupt comms between master and media, then bptm and bpbrm on the media server MAY hang.
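
The answers above can be condensed into a small decision function. This is only a sketch of what has been said in this thread, in Python, not a description of NetBackup internals: the timeout value is never given here, so `timeout_seconds` is a hypothetical parameter, answer 3 is unconfirmed, and the possible hang from answer 6 is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    is_database: bool        # database or SAP backup (cannot be suspended/resumed)
    suspended_before: bool   # job was suspended before the master went away


def outcome(job, master_restarted, outage_seconds, timeout_seconds):
    """Expected result for one job, following answers 1 to 5 above."""
    if not master_restarted:
        # Same daemons still running: only the length of the outage matters.
        return "may survive" if outage_seconds <= timeout_seconds else "fails"
    # The master came back with a fresh set of daemons.
    if job.is_database:
        return "restart"     # answer 5
    if job.suspended_before:
        return "resume"      # answer 3 (believed, not tested)
    return "restart"         # answer 4
```

For example, `outcome(Job(is_database=True, suspended_before=False), True, 0, 60)` returns `"restart"`, which is the SAP case from the original question.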

 

The VCS NBU Agent is responsible for monitoring NBU processes AND for starting/stopping NBU cluster resources, such as the shared/floating disk that whichever cluster node is active needs to use, the shared/floating cluster IP address, etc.

jim_dalton
Level 6

As for your db backups, I guess you could have a pre-script that checks the db state and modifies it accordingly, but that also comes with its own issues. It would be interesting to know whether there's scope for a post-script when the db backup craters. Jim
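
A pre-script check along these lines could look like the sketch below, in Python. It assumes the script is able to query Oracle's `V$BACKUP` view, where a `STATUS` of `ACTIVE` marks a datafile that is still in backup mode, and that issuing `ALTER DATABASE END BACKUP` is acceptable at your site. Whether the SAP backup tools expect to manage this themselves is not covered in this thread. The function names are hypothetical, and actually running the SQL (for example through sqlplus) is left out.

```python
CHECK_SQL = "SELECT file#, status FROM v$backup"
END_BACKUP_SQL = "ALTER DATABASE END BACKUP"


def files_in_backup_mode(rows):
    """Return file numbers still in backup mode.

    rows: iterable of (file_number, status) tuples as returned by CHECK_SQL.
    """
    return [file_no for file_no, status in rows
            if status.strip().upper() == "ACTIVE"]


def pre_script_statements(rows):
    """Return the SQL statements to run before the backup starts.

    Empty when no datafile was left in backup mode by an earlier, killed job.
    """
    stuck = files_in_backup_mode(rows)
    return [END_BACKUP_SQL] if stuck else []
```

Keeping the check separate from the fix means the script can also just log the stuck files and exit non-zero, if changing database state from a backup script is not wanted.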

Marianne
Moderator
Partner    VIP    Accredited Certified

Re-visiting this post ....

I found this post in an O-L-D article:

https://www-secure.symantec.com/connect/articles/netbackup-and-vcs#comment-2506161 

Probably the best answer to what should happen to running jobs....