
After Status 50 & UNKNOWN JOBS....

Malak
Level 4
Partner

... the jobs simply do nothing.....

Hello All,

Windows 2008 R2 servers (2 Master in cluster, 9 Medias)

NetBackup 7.1.0.3, in all servers.

We now have jobs that do nothing for hours, even days; they never end.

They also never run again. Whether they are hourly or daily jobs, the next run does not start until the previous job ends, and since it never ends, we don't have regular backups.

These jobs cannot be deleted. The only option that I know of is to stop all NetBackup services and delete them from the DB by hand, like this:

• Shut down the NetBackup Master AND Media Server processes
• Verify no active NetBackup processes on the Master server
• Verify no active NetBackup processes on all Media servers
• Run bpjobd -r <jobid> for each ghost jobid.
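
The last step above can be scripted when there are many ghost jobs. The following is only a minimal sketch: it assumes `bpjobd` is on the PATH of the Master server and that all NetBackup processes have already been stopped and verified as described. The helper names `ghost_job_commands` and `remove_ghost_jobs` are made up for this illustration. The only command used is the `bpjobd -r <jobid>` form quoted in the post.

```python
import subprocess


def ghost_job_commands(job_ids):
    """Build one `bpjobd -r <jobid>` command per ghost job id."""
    commands = []
    for job_id in job_ids:
        job_id = int(job_id)  # raises ValueError for anything that is not a number
        if job_id <= 0:
            raise ValueError(f"invalid job id: {job_id}")
        commands.append(["bpjobd", "-r", str(job_id)])
    return commands


def remove_ghost_jobs(job_ids, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    Assumption: NetBackup services are already down on the Master
    and all Media servers, as required by the manual procedure.
    """
    for cmd in ghost_job_commands(job_ids):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `remove_ghost_jobs([...])` with the default `dry_run=True` only prints the commands, which makes it possible to review the list of job ids before anything is removed.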

Will this ever be resolved? Or does anyone know another, less disruptive way to remove these “never ending” jobs?

 

Thank you

32 REPLIES

Mark_Solutions
Level 6
Partner Accredited Certified

Upload the System and Application logs from the Master when you get a chance

snapshot4
Level 3

Sorry, the part I am having trouble with is that something causes the jobs to go to status 50. Once they get to that state, the only real way to handle them is to do what you have been doing. But any time I get a status 50, I have to start figuring out why it happened. For example, if you shut down the NetBackup services forcefully, some jobs may get cancelled cleanly and removed from the jobdb and some may not.

But do you have any hunches as to what is causing these to happen in the first place?

Malak
Level 4
Partner

No, I have no idea what causes those jobs. What NetBackup version are you running?
7.1.0.3 should not report status 50 in these cases. In my case some issues were resolved with these patches (7.1.0.2 and 7.1.0.3, which is the one I'm running now).

Mark_Solutions
Level 6
Partner Accredited Certified

Malak

A combination of 7.1.0.3 and the memory tuning I mentioned earlier solved it in my case, but without seeing your System and Application event logs I cannot identify anything further.

I'm still happy to take a look when you have a chance to export them (in evtx format) and post them here.

Malak
Level 4
Partner

Hello Mark,

Here goes. Please ignore any messages regarding TLD0.

Thank you.

Malak
Level 4
Partner

Done, sorry for the delay. I was away from the office last week. Thank you.

Mark_Solutions
Level 6
Partner Accredited Certified

So ... from what I can see you have a VCS cluster for the Master Server using an EMC CLARiiON somewhere along the line

Lots of events here that would be causing your issues:

The kernel power manager has initiated a shutdown transition

This tends to imply that Windows or some other software is regularly rebooting the nodes

When this happens, jobs are obviously still running, which results in an unclean shutdown

From the blue screen crash that happened on 26th Jan I can see that it did not start up cleanly at all, and this seems to be a recurring theme, as if reboots are happening when the system is very busy

If the reboots are intentional then they should be done taking NetBackup into account and not whilst the system is busy - if they are automatic then you need to change the setting to make them manual and do them at a time when NetBackup is doing nothing.

Also on 26th Jan, when it blue screened, you had errors relating to the VCS Helper service not being able to start due to a logon failure, as well as "The HadHelper service was unable to log on as bdp.pt\Administrator with the currently configured password due to the following error:
Logon failure: unknown user name or bad password."

From what I can see, the system gets regular reboots whilst NetBackup appears to be busy and does not start up cleanly, so you need to have a good look at this to see what is going wrong, or whether your cluster groups and their dependencies are not quite right

In summary, your Status 50 errors and hung jobs appear to be down to the shutdown or failover of the cluster during busy periods - but mainly to shutdowns or reboots

Hope this helps - you just need to look at the system procedures to isolate what is causing the shutdowns - especially the blue screen
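
To help isolate the shutdowns, the event log entries can be counted per day and compared with the backup schedule. The following is only a rough sketch: it assumes the System log has already been exported and parsed into `(datetime, message)` pairs by some other means (reading evtx files is not shown), and the function name `shutdowns_per_day` is made up for this illustration. The message text it searches for is the one quoted earlier in this reply.

```python
from collections import Counter
from datetime import datetime

# Message text quoted from the System event log earlier in this reply.
SHUTDOWN_TEXT = "The kernel power manager has initiated a shutdown transition"


def shutdowns_per_day(events):
    """Count shutdown events per calendar day.

    `events` is an iterable of (datetime, message) pairs; how they are
    extracted from the exported log is an assumption left to the reader.
    """
    counts = Counter()
    for when, message in events:
        # Case-insensitive substring match on the quoted message text.
        if SHUTDOWN_TEXT.lower() in message.lower():
            counts[when.date()] += 1
    return dict(sorted(counts.items()))
```

Days with one or more hits can then be checked against the times the hung jobs started.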

Malak
Level 4
Partner

I can't find a solution or a reason for this.
I'll open a case with Symantec support.

Thank you all.

Mark_Solutions
Level 6
Partner Accredited Certified

So were you able to account for the regular shutdowns / reboots? - these would cause all of the Status 50s but that is not a Symantec issue

Malak
Level 4
Partner

Hello Mark,

We had only one blue screen, and it was because of an FC issue.
On some of the "bpdown" runs, done to delete these hung jobs, we also reboot the server (at least once a week).
No unwanted reboots were detected on this server.

Thank you.

Mark_Solutions
Level 6
Partner Accredited Certified

Ok - if you have no unwanted reboots then this should not be happening - however as this is a cluster you should take the NetBackup Group off line rather than use a bpdown - and only do that when nothing is running.

Do you ensure that you only do a shutdown (off-line the group) of NetBackup when nothing is running?

If so then as you say you should log the call with Symantec

Malak
Level 4
Partner

The cluster group is frozen before any other procedure is done.

This environment is never idle: backups and duplications run 24x7, so it is not possible to have a "do nothing" situation....

We do a "nbpemreq -suspend_scheduling" and we try not to cancel too many jobs before they end, but this is not possible either. Jobs take too long to end and we can't wait for all of them, because other jobs must start within one or two hours; the "suspend_scheduling" state cannot last much longer than that...
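
The "suspend scheduling, then wait a limited time" approach described above can be expressed as a polling loop with a deadline. This is only a sketch under stated assumptions: `count_active_jobs` is a hypothetical callable that returns the number of jobs still running (how that number is obtained, for example from `bpdbjobs` output, is not shown), and the function name `wait_for_drain` is made up for this illustration.

```python
import time


def wait_for_drain(count_active_jobs, deadline_seconds, poll_seconds=60,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until no jobs are active or the deadline has passed.

    Returns True if the job count reached zero, False on timeout.
    `count_active_jobs` is a hypothetical callable supplied by the caller.
    """
    start = clock()
    while True:
        if count_active_jobs() == 0:
            return True  # safe point: nothing is running
        if clock() - start >= deadline_seconds:
            return False  # still busy; the caller decides what to do next
        sleep(poll_seconds)
```

With a two hour window this would be called as `wait_for_drain(count_active_jobs, 2 * 60 * 60)` right after scheduling is suspended. A `False` result means jobs were still running when the window closed, which is exactly the situation in which stopping NetBackup leaves hung jobs behind.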

 

The case is open; we'll see if there's any other solution for this.

 

Thank you so much for all your help.

Malak.
 

 

 

Mark_Solutions
Level 6
Partner Accredited Certified

OK - well if you restart NetBackup or reboot the server whilst things are running then this is the cause of the Status 50 errors.

Any job trying to kick in, or in the middle of its run, when you kill off NetBackup has its process aborted - Status 50 - Client Process Aborted - says exactly what it is.

To shut down NetBackup or reboot the server you need all jobs stopped or suspended and nothing due to run.

If that is not the case then you will get Status 50 errors - which will most likely need the process described earlier to clear them down again

I would be surprised if Symantec can tell you anything more than this, but I look forward to hearing what they say.