
Master server Hung

noazara
VIP

Hi all,

I need urgent help.

On Friday, my environment hung. The master server was hung and no jobs were triggered, so all weekend backups were skipped.

On Monday, when we checked, there were no backups. A disaster.

We restarted all NBU services, after which everything was back to normal and jobs started.

My query:

I need some kind of script or rule that will let me know when no jobs are running on the master server (i.e. it is hung).

NBU : 8.1.1
Linux

9 REPLIES

sclind
Moderator
   VIP   

Perhaps list the backups that have run in the last X hours, then have it send you an email if the number of jobs is less than Y?
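
A minimal sketch of that idea in Python, assuming job start times are available as Unix epoch seconds. The `bpdbjobs -report -all_columns` call, its install path, and the position of the start-time column (assumed here to be the ninth comma-separated field) should be verified against your NetBackup version. `WINDOW_HOURS` and `MIN_JOBS` are placeholder values standing in for X and Y.

```python
import subprocess
import time

WINDOW_HOURS = 4   # "X": look-back window (placeholder value)
MIN_JOBS = 10      # "Y": minimum jobs expected in the window (placeholder value)
START_FIELD = 8    # ASSUMPTION: 0-based index of the start-time column


def parse_start_times(report_text, start_field=START_FIELD):
    """Extract epoch start times from comma-separated job report lines."""
    times = []
    for line in report_text.splitlines():
        fields = line.split(",")
        if len(fields) > start_field and fields[start_field].strip().isdigit():
            times.append(int(fields[start_field]))
    return times


def count_recent(start_times, now, window_hours=WINDOW_HOURS):
    """Count jobs that started within the last window_hours."""
    cutoff = now - window_hours * 3600
    return sum(1 for t in start_times if t >= cutoff)


def too_few_jobs(start_times, now, window_hours=WINDOW_HOURS, min_jobs=MIN_JOBS):
    """True when fewer than min_jobs started inside the window."""
    return count_recent(start_times, now, window_hours) < min_jobs


def fetch_report():
    """ASSUMPTION: bpdbjobs path and flags; adjust for your install."""
    cmd = ["/usr/openv/netbackup/bin/admincmd/bpdbjobs", "-report", "-all_columns"]
    # The timeout matters: a hung master may never answer.
    return subprocess.run(cmd, capture_output=True, text=True, timeout=300).stdout


def master_looks_idle():
    """Run the check against the live job report."""
    return too_few_jobs(parse_start_times(fetch_report()), int(time.time()))
```

Run from cron, `master_looks_idle()` returning True would be the trigger for the email. A `subprocess.TimeoutExpired` from `fetch_report()` is worth treating as an alert too, since it suggests the master is not responding at all.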

sclind
Moderator
   VIP   

OpsCenter also has this alert that may help:

Master Server Unreachable: An alert is generated when OpsCenter loses contact with the master server.

noazara
VIP

Thanks, but no such errors have been received.

sclind
Moderator
   VIP   

You would need to set it up.

jnardello
Moderator
   VIP    Certified

You could always configure a "healthcheck" policy: schedule it to run once an hour, back up something small such as your hosts file or the version file, expire the image immediately (because you don't care about the data, just the success/failure of the job), and send it to the destination of your choice.

Then configure the monitoring solution of your choice (a script, OpsCenter, etc.) to see if the policy has run in the last 2 hours or so (the margin avoids false alarms during peak backup windows, etc.). If it hasn't, cut a ticket or otherwise alert someone.

This can even be expanded slightly to deliberately write to tape as a method of confirming that your tape library is functional too, if you don't have anything else keeping an eye on that.

You just have to watch out for scheduled maintenance windows or you'll end up with a few extra tickets. =)
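
A rough sketch of the monitoring side in Python, assuming you already have some way to look up when the healthcheck policy last completed successfully (that lookup is not shown). The function names and the `MAINTENANCE_WINDOWS` list are hypothetical; only the 2-hour threshold comes from the description above.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=2)  # "in the last 2 hours or so"

# Hypothetical maintenance windows as (start, end) datetime pairs.
MAINTENANCE_WINDOWS = []


def in_maintenance(now, windows):
    """True if now falls inside any (start, end) maintenance window."""
    return any(start <= now < end for start, end in windows)


def healthcheck_overdue(last_success, now, max_age=MAX_AGE, windows=()):
    """Decide whether to raise a ticket for a missing healthcheck run.

    last_success is the end time of the most recent successful run of the
    healthcheck policy, or None if no run was found at all.
    """
    if in_maintenance(now, windows):
        return False  # suppress alerts during planned downtime
    if last_success is None:
        return True
    return now - last_success > max_age
```

For example, `healthcheck_overdue(last_success, datetime.now(), windows=MAINTENANCE_WINDOWS)` could run every 15 minutes from cron. Right after a maintenance window ends the last success will still look old, so a short grace period may be needed to avoid those extra tickets.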

Hamza_H
Moderator
   VIP   
Hello,
You may need to dig into why the policies didn’t start and look for the root cause, because there could be an EEB or a workaround that solves this.
You need to go through the master's logs (nbjm, nbpem, bprd, bpdbm ...) and look for errors that could be behind this.
Configuring an email or alert is certainly a plus, so that you know when there is a problem with your master, but resolving the underlying issue would be better for you.
Also make sure that the master server has enough resources (no memory leak, etc.).
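
For the resource check, one possible sketch in Python reads `/proc/meminfo` on the Linux master and flags low available memory. The 10% threshold is an arbitrary placeholder, and the function names are made up for illustration. A single low reading does not prove a leak; logging the value over time is what would show one.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])  # memory fields are in kB
    return info


def available_fraction(info):
    """Fraction of total memory the kernel reports as available."""
    return info["MemAvailable"] / info["MemTotal"]


def memory_low(info, min_fraction=0.10):
    """True when available memory drops below min_fraction (placeholder)."""
    return available_fraction(info) < min_fraction


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live memory statistics."""
    with open(path) as f:
        return parse_meminfo(f.read())
```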

Good luck.

noazara
VIP

Hi John Nardello1,

sclind
Moderator
   VIP   

I think the healthcheck process would be unique for each organization.

EthanH
Level 4

I would bet that this is related to your nbpem cache filling up.

https://www.veritas.com/support/en_US/article.100015957

I've only seen this happen on Windows masters, but it isn't exclusive to them. You don't receive an alert because none of the processes stop, and the server is functioning as expected...with the obvious exception that none of your backups are running. It's very tricky, as jobs that were active when the cache fills up will continue running, making it appear as though everything is working fine.

There are quite a few documented issues with nbpem, and the below EEB is available (albeit for 8.1.2). Might be worth logging a case to check with Support to see if there is a patch for 8.1.1.

https://www.veritas.com/content/support/en_US/downloads/update.UPD575421