Pempersist and Jobs Not Running - A Tale of Woe

Chuck_Stevens · ‎05-04-2006

Just a heads-up for those of you running NBU 6 who have backup jobs that mysteriously do not run: There is apparently a known issue with checkpoint-enabled backups and the "pempersist" file.

The scenario is this: If you have a backup policy with checkpoints enabled, and a job in that policy fails for any reason (even if it retries successfully), that job will be listed in the pempersist file as "active", and it will never again run automatically. Manual starts will work fine.

Two work-arounds:

1. Turn off checkpoints. Or,

2. Delete the pempersist file every day: Make sure no jobs are running (i.e., all jobs are "Done"), stop all NBU services, delete %INSTALLPATH%\VERITAS\NetBackup\bin\bpsched.d\pempersist, then restart all NBU services. Do this every day (or at least every day you had a backup failure).
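
The file-handling part of work-around 2 can be sketched in Python. This is only an illustration, not a vendor-supplied procedure: the function name `set_aside_pempersist` is hypothetical, it renames the file instead of deleting it so the old copy is kept, and it assumes you have already confirmed that all jobs are "Done" and stopped all NBU services yourself.

```python
import time
from pathlib import Path
from typing import Optional


def set_aside_pempersist(pempersist: Path) -> Optional[Path]:
    """Rename the pempersist file out of the way and return the new path.

    Assumption: the caller has already verified that every job is "Done"
    and has stopped all NBU services. This function checks neither.
    Returns None if the file does not exist (nothing to do).
    """
    if not pempersist.is_file():
        return None
    # Timestamp the backup name so repeated runs do not collide.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = pempersist.with_name(f"{pempersist.name}.{stamp}.bak")
    pempersist.rename(backup)
    return backup
```

Pass it the full path to the pempersist file from step 2, then restart the NBU services afterwards.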

My ticket with Symantec is still open; we're collecting logs (including the output of "nbpemreq -tables screen" - another undocumented command), and we may eventually get a fix from engineering (or will have to wait until MP3).

FYI.

AKopel · ‎05-04-2006

Chuck,
We are having the EXACT same problem and are working with Symantec also. We aren't quite as far along in the troubleshooting process yet, but it sounds like you are on to something....

Stumpr2 · ‎05-05-2006

Hmmm. I've heard of people having to remove policies and rebuild them when they migrate to 6.0. I wonder if this is related?

AKopel · ‎05-05-2006

That was us too. It is not related... but a different type of woe :(
We have rebuilt all of our policies and this is still happening.

Chuck_Stevens · ‎05-05-2006

Yes, removing and rebuilding a policy will not fix the problem.

Qiang_Lee · ‎05-05-2006

We used to run into this kind of problem on a 4.5 server, and rebuilding the policy fixed it.

Does the problem you described only happen to the client that failed/retried, and not to the rest of the clients in the same policy?

James.

Chuck_Stevens · ‎05-05-2006

Exactly. Just the client that failed.

Lance_Hoskins · ‎05-05-2006

And on top of that, if you're using multiple data streams, it appears that just that stream will no longer run, which makes it even harder to find missed jobs, since you can't simply compare a client list to the list of clients in the policy!
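
One way to catch a single missing stream is to compare (client, stream) pairs rather than client names. The sketch below is illustrative only: the function name and the sample data are hypothetical, and how you collect the expected and completed lists from your own environment is left open.

```python
def missing_streams(expected, completed):
    """Return the (client, stream) pairs that were expected but never ran."""
    return sorted(set(expected) - set(completed))


# Hypothetical data: clientA has two streams, but only stream 1 ran.
expected = [("clientA", 1), ("clientA", 2), ("clientB", 1)]
completed = [("clientA", 1), ("clientB", 1)]
print(missing_streams(expected, completed))  # [('clientA', 2)]
```

A plain client-name comparison would show clientA as backed up here, which is exactly why the stuck stream is easy to miss.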

AKopel · ‎05-05-2006

Chuck,
Is there something in particular in the pempersist file that tells you which clients 'think' they are still running?

Chuck_Stevens · ‎05-05-2006

There might be, but I haven't studied it enough to find out. This file isn't documented anywhere, so I don't know what all the values mean.

AKopel · ‎05-05-2006

But support did tell you it was safe to delete this file when no jobs are running?
I am noticing TONS of references to old policies that don't exist in our environment anymore.

Chuck_Stevens · ‎05-05-2006

So long as you have no jobs that are Active, Waiting for Retry, Queued, Incomplete, Suspended, or anything other than "Done" then yes, it is safe to delete (after stopping all NBU services first).
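
That rule amounts to "every job state must be exactly Done". As a minimal sketch, assuming you have already gathered the job states as a list of strings (the function name is hypothetical, and nothing here queries NetBackup itself):

```python
def safe_to_remove_pempersist(job_states):
    """Return True only if every job state is exactly "Done".

    Any other state (Active, Waiting for Retry, Queued, Incomplete,
    Suspended, or anything unrecognized) makes removal unsafe.
    """
    return all(state == "Done" for state in job_states)


print(safe_to_remove_pempersist(["Done", "Done"]))    # True
print(safe_to_remove_pempersist(["Done", "Queued"]))  # False
```

Checking for "Done" rather than listing the unsafe states means an unexpected state is treated as unsafe by default.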

What they've been having me do is actually rename the file (just in case). It has the same effect. After you've run some jobs the pempersist file will get recreated.

Oh, and keep in mind this is a one-time work-around. The problem will return the next time a job fails, and you'll have to delete the file again. Or just turn off Checkpoints for that policy (since that's the source of the bug).

AKopel · ‎05-05-2006

Great. This is good information. I will pass this along to my support tech. Maybe he can lookup your ticket and compare notes :)

Brandon_Steili · ‎06-14-2006

http://seer.support.veritas.com/docs/281780.htm

Just FYI ... here's the bug report on this issue.

Chuck_Stevens · ‎06-14-2006

Final update (for now): They've got me running a set of engineering binaries that have fixed the problem. These fixes will be in MP3, which they tell me should be out in early July.

Chuck_Stevens · ‎06-14-2006

> http://seer.support.veritas.com/docs/281780.htm
>
> Just FYI ... here's the bug report on this issue.

Woot! I'm famous! :p

AKopel · ‎06-14-2006

Yep,
We have the same binaries, and they seem to have fixed most of the problems. We are still having problems with multistream jobs restarting on media write errors.

Deepak_Bhalwank · ‎06-15-2006

Hi ALL,
We faced the same problem. Thank you ALL.

Br,
Deepak.

Alasdair_McQuir · ‎06-19-2006

This is amazing; I am having the exact same problems!

Sometimes the child jobs do not start after the parent has started, and sometimes the parent will not close after the child jobs complete or fail.
