Master server Hung

noazara · ‎04-22-2020

Hi all,

I need urgent help.

On Friday, my environment hung: the master server stopped responding and no jobs were triggered. All weekend backups were skipped.

On Monday, when we checked, there were no backups. A disaster.

We restarted all NBU services, after which everything returned to normal and jobs started running.

My Query:

I need some kind of script or rule that tells me when no jobs are running on the master server (i.e. it is hung).

NBU : 8.1.1
Linux

sclind · ‎04-22-2020

Perhaps list the backups that have run in the last X hours, then send yourself an email if the number of jobs is fewer than Y?
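
A rough sketch of that idea in Python, in case it helps. It assumes the comma-separated layout of `bpdbjobs -report -all_columns`, where the 9th field is the job start time in epoch seconds; please verify that against your NetBackup version. The function names, the default thresholds, and the timeout are my own placeholders, and the binary path is the usual Linux default.

```python
import subprocess
import time

# Usual Linux location of bpdbjobs; adjust if your install differs.
BPDBJOBS = "/usr/openv/netbackup/bin/admincmd/bpdbjobs"


def count_recent_jobs(report_text, hours, now=None):
    """Count jobs whose start time falls within the last `hours` hours.

    Assumes comma-separated lines with the start time (epoch seconds)
    in the 9th field, as in `bpdbjobs -report -all_columns`.
    """
    now = time.time() if now is None else now
    cutoff = now - hours * 3600
    count = 0
    for line in report_text.splitlines():
        fields = line.split(",")
        if len(fields) < 9 or not fields[8].isdigit():
            continue  # skip blank or unexpected lines
        if int(fields[8]) >= cutoff:
            count += 1
    return count


def master_looks_hung(hours=6, min_jobs=1, timeout=120):
    """Return True if too few jobs started recently or bpdbjobs itself fails."""
    try:
        result = subprocess.run(
            [BPDBJOBS, "-report", "-all_columns"],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # A command that stalls or errors out is itself a sign of trouble.
        return True
    return count_recent_jobs(result.stdout, hours) < min_jobs
```

You would run `master_looks_hung()` from cron and send the mail or ticket when it returns True. The timeout matters: on a hung master the command may never return, so a stalled call is treated as an alert rather than left waiting.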

sclind · ‎04-22-2020

OpsCenter also has this alert that may help:

Master Server Unreachable

An alert is generated when OpsCenter loses contact with the master server.

noazara · ‎04-22-2020

Thanks, but no such errors have been received.

sclind · ‎04-22-2020

You would need to set it up.

jnardello · ‎04-23-2020

You could always configure a "healthcheck" policy, schedule it to run once an hour, back up your hosts or version file or such, expire immediately (because you don't care about the data, just the success/failure of the job), and send it to the destination of your choice.

Then configure the monitoring solution of your choice (a script, OpsCenter, etc.) to see if the policy has run in the last 2 hours or such (to avoid issues due to peak backup windows, etc.). If it hasn't, cut a ticket or otherwise alert someone.
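
If you go the script route, the check could look roughly like the sketch below. It parses text captured from `bpdbjobs -report -all_columns` and assumes the documented field order (3rd field state, 4th status, 5th policy, 11th end time in epoch seconds), so confirm that on your version first. The policy name `healthcheck` and the function names are hypothetical.

```python
import time


def last_success(report_text, policy):
    """Return the latest end time (epoch seconds) of a successful,
    completed job for `policy`, or None if there is none."""
    latest = None
    for line in report_text.splitlines():
        f = line.split(",")
        if len(f) < 11:
            continue  # skip blank or unexpected lines
        state, status, job_policy, ended = f[2], f[3], f[4], f[10]
        # State 3 means "done" and status 0 means success.
        if job_policy != policy or state != "3" or status != "0":
            continue
        if not ended.isdigit() or int(ended) == 0:
            continue  # no usable end time recorded
        if latest is None or int(ended) > latest:
            latest = int(ended)
    return latest


def healthcheck_is_stale(report_text, policy="healthcheck",
                         max_age_hours=2, now=None):
    """Return True if `policy` has no successful run within the window."""
    now = time.time() if now is None else now
    ended = last_success(report_text, policy)
    return ended is None or (now - ended) > max_age_hours * 3600
```

Feed it the command output every hour or so and raise the ticket when `healthcheck_is_stale(...)` returns True. Comparing against the last successful end time, rather than just counting jobs, means a healthcheck that keeps failing also trips the alert.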

This can even be expanded slightly to deliberately write to tape as a method of confirming that your tape library is functional too, if you don't have anything else keeping an eye on that.

You just have to watch out for scheduled maintenance windows or you'll end up with a few extra tickets. =)

Hamza_H · ‎04-23-2020

Hello,
You may need to dig into why the policies didn’t start and look for the root cause, since there may be an EEB or a workaround that solves this.
You need to go through the master's logs (nbjm, nbpem, bprd, bpdbm, ...) and look for errors that could be behind this.
Configuring an email or alert is certainly a plus, since it lets you know when there is a problem with your master, but resolving the root cause would be better for you.
Also make sure that the master server has enough resources (no memory leak, etc.).

Good luck.

noazara · ‎04-24-2020

Hi John Nardello1,

sclind · ‎04-24-2020

I think the healthcheck process would be unique for each organization.

EthanH · ‎05-05-2020

I would bet that this is related to your nbpem cache filling up.

https://www.veritas.com/support/en_US/article.100015957

I've only seen this happen on Windows masters, but it isn't exclusive to them. You don't receive an alert because none of the processes stop, and the server is functioning as expected...with the obvious exception that none of your backups are running. It's very tricky, as jobs that were active when the cache fills up will continue running, making it appear as though everything is working fine.

There are quite a few documented issues with nbpem, and the below EEB is available (albeit for 8.1.2). Might be worth logging a case to check with Support to see if there is a patch for 8.1.1.

https://www.veritas.com/content/support/en_US/downloads/update.UPD575421
