Pempersist and Jobs Not Running - A Tale of Woe

Chuck_Stevens
Level 6
Just a heads-up for those of you running NBU 6 who have backup jobs that mysteriously do not run: There is apparently a known issue with checkpoint-enabled backups and the "pempersist" file.

The scenario is this: If you have a backup policy with checkpoints enabled, and a job in that policy fails for any reason (even if it retries successfully), that job will be listed in the pempersist file as "active", and it will never again run automatically. Manual starts will work fine.

Two work-arounds:

1. Turn off checkpoints. Or,

2. Delete the pempersist file every day: Make sure no jobs are running (i.e., all jobs are "Done"), stop all NBU services, delete %INSTALLPATH%\VERITAS\NetBackup\bin\bpsched.d\pempersist, then restart all NBU services. Do this every day (or at least every day you had a backup failure).
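For anyone who wants to script work-around 2, here is a minimal Python sketch of the sequence described above (stop services, get the file out of the way, restart services). It is an illustration under assumptions, not a Symantec-supplied procedure: `reset_pempersist`, `stop_services`, and `start_services` are hypothetical names, and the two callables stand in for whatever you normally use to stop and start the NBU services. The sketch moves the file aside with a timestamp rather than deleting it, so it can be restored, and it does not check for running jobs, so confirm everything is "Done" first.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, Optional


def reset_pempersist(
    pempersist: Path,
    stop_services: Callable[[], None],   # hypothetical: your own "stop NBU" step
    start_services: Callable[[], None],  # hypothetical: your own "start NBU" step
) -> Optional[Path]:
    """Stop services, move pempersist aside, then restart services.

    Returns the path of the moved-aside copy, or None if the file
    did not exist. Assumes the caller has already confirmed that no
    jobs are running.
    """
    stop_services()
    try:
        if not pempersist.exists():
            return None
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = pempersist.with_name(f"{pempersist.name}.{stamp}.bak")
        shutil.move(str(pempersist), str(backup))
        return backup
    finally:
        # Always bring services back up, even if the move failed.
        start_services()
```

The `pempersist` argument would be the full path given in step 2, with %INSTALLPATH% expanded for your own server.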

My ticket with Symantec is still open; we're collecting logs (including the output of "nbpemreq -tables screen" - another undocumented command), and we may eventually get a fix from engineering (or will have to wait until MP3).

FYI.
23 REPLIES

AKopel
Level 6
Chuck,
We are having the EXACT same problem and are working with Symantec also. We aren't quite as far along in the troubleshooting process yet, but it sounds like you are on to something....

Stumpr2
Level 6
Hmmm. I've heard of people having to remove policies and rebuild them when they migrate to 6.0. I wonder if this is related?

AKopel
Level 6
That was us too. It is not related... but a different type of woe :(
We have rebuilt all of our policies and this is still happening.

Chuck_Stevens
Level 6
Right, removing and rebuilding a policy will not fix the problem.

Qiang_Lee
Level 3
We used to run into this kind of problem on a 4.5 server, and rebuilding the policy fixed it.

Does the problem you described only happen to the client that failed/retried, and not to the rest of the clients in the same policy?

James.

Chuck_Stevens
Level 6
Exactly. Just the client that failed.

Lance_Hoskins
Level 6
And on top of that, if you're using multiple data streams, it appears that only the affected stream stops running. That makes missed jobs even harder to find, since you can't simply compare the list of clients that ran against the list of clients in the policy!
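One way around this is to compare at the stream level rather than the client level. The Python sketch below assumes you can produce two lists of (client, stream) pairs yourself: the streams you expect each policy to run, and the streams that actually ran. The function name, the pair format, and the sample data are all hypothetical and are not taken from any NBU report.

```python
from typing import Iterable, Set, Tuple

# (client name, stream identifier); the format is an assumption of this sketch.
Stream = Tuple[str, str]


def missing_streams(expected: Iterable[Stream], ran: Iterable[Stream]) -> Set[Stream]:
    """Return the expected (client, stream) pairs that did not run."""
    return set(expected) - set(ran)


# Hypothetical data: client "alpha" has two streams, but only one ran.
expected = [("alpha", "1"), ("alpha", "2"), ("beta", "1")]
ran = [("alpha", "1"), ("beta", "1")]
print(sorted(missing_streams(expected, ran)))  # [('alpha', '2')]
```

A client-level comparison would show "alpha" as having run and miss the stuck stream, which is exactly the trap described above.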

AKopel
Level 6
Chuck,
Is there anything in particular in the pempersist file that tells you which clients 'think' they are still running?

Chuck_Stevens
Level 6
There might be, but I haven't studied it enough to find out. This file isn't documented anywhere, so I don't know what all the values mean.

AKopel
Level 6
But support did tell you it was safe to delete this file when no jobs are running?
I am noticing TONS of references to old policies that don't exist in our environment anymore.

Chuck_Stevens
Level 6
So long as you have no jobs that are Active, Waiting for Retry, Queued, Incomplete, Suspended, or anything other than "Done" then yes, it is safe to delete (after stopping all NBU services first).
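As a rough illustration of that rule, here is a small Python check. It assumes you already have the job states as plain strings (for example, copied out of the Activity Monitor); how you collect them is up to you, and the function name is made up for this sketch.

```python
from typing import Iterable


def safe_to_remove_pempersist(job_states: Iterable[str]) -> bool:
    """True only if every job state is "Done" (case-insensitive).

    Any other state (Active, Waiting for Retry, Queued, Incomplete,
    Suspended, ...) makes the result False. An empty list counts as
    safe, since there are no jobs at all.
    """
    return all(state.strip().lower() == "done" for state in job_states)


print(safe_to_remove_pempersist(["Done", "Done"]))    # True
print(safe_to_remove_pempersist(["Done", "Queued"]))  # False
```

Even when this returns True, the NBU services still have to be stopped before the file is touched.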

What they've been having me do is actually rename the file (just in case). It has the same effect. After you've run some jobs the pempersist file will get recreated.

Oh, and keep in mind this is a one-time work-around. The problem will return the next time a job fails, and you'll have to delete the file again. Or just turn off Checkpoints for that policy (since that's the source of the bug).

AKopel
Level 6
Great. This is good information. I will pass this along to my support tech. Maybe he can look up your ticket and compare notes :)

Brandon_Steili
Level 4
http://seer.support.veritas.com/docs/281780.htm

Just FYI ... here's the bug report on this issue.

Chuck_Stevens
Level 6
Final update (for now): They've got me running a set of engineering binaries that have fixed the problem. These fixes will be in MP3, which they tell me should be out in early July.

Chuck_Stevens
Level 6
> http://seer.support.veritas.com/docs/281780.htm
>
> Just FYI ... here's the bug report on this issue.

Woot! I'm famous! :p

AKopel
Level 6
Yep,
We have the same binaries, and they seem to have fixed most of the problems. We are still having problems with multistream jobs restarting on media write errors.

Deepak_Bhalwank
Level 5
Hi all,
We faced the same problem. Thank you all.


Br,
Deepak.

Alasdair_McQuir
Level 4
This is amazing. I am having the exact same problems!

Sometimes the child jobs do not start after the parent has started, and sometimes the parent will not close after the child jobs complete or fail.