How to reset schedule after job failures

Michael_Rogers
Level 2

I understand, now, how to manage the number of tries for a backup over a given period (global attributes in host properties of the master server). However, once I have resolved a failure, I can't find a way to reset the policy so it gets back on schedule. I have to wait until the backup attempt interval is up (in our case that's 12 hours).

Here's my example... My master server global attributes are set to 2 tries per 12 hours (I think those are the defaults, but I don't know for certain). I have an archivelog backup of an Oracle database that runs hourly, and it must, or I will fairly quickly run out of space on the volume (let's assume I can't add any disk). Recently this backup failed twice, which is the maximum per 12 hours, so the schedule will not pick back up even after a successful manual run. Since I really can't wait 12 hours for it to pick back up, I'd like to reset the schedule so that it picks back up right now. However, I can't find any command that resets the tries or simply gives the policy a 'clean slate'.
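To make the lockout concrete, here is a minimal sketch of the behaviour described above. It assumes a simple sliding-window model (at most `max_tries` failed attempts per `window`); this is an illustration only, not NetBackup's actual scheduler logic, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def next_allowed_attempt(failures, max_tries=2, window=timedelta(hours=12), now=None):
    """Return the earliest time a new scheduled attempt would be allowed.

    failures: datetimes of failed attempts (assumed model, not NetBackup's code).
    Returns `now` if fewer than `max_tries` failures fall inside the window.
    """
    now = now or datetime.now()
    # Only failures younger than the window count against the limit.
    recent = sorted(t for t in failures if now - t < window)
    if len(recent) < max_tries:
        return now
    # Blocked until the oldest of the last `max_tries` failures ages out.
    return recent[-max_tries] + window


# Two failures at 01:00 and 02:00, checked at 03:00, with 2 tries per 12 hours.
failures = [datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 2, 0)]
print(next_allowed_attempt(failures, now=datetime(2024, 1, 1, 3, 0)))
# prints 2024-01-01 13:00:00
```

Under this model a successful manual run changes nothing, because only the failure timestamps are counted, which matches what I am seeing.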

So... what's the best thing to do here? I really don't want to change the global setting just for the one job that failed, but perhaps that's the best approach?

1 ACCEPTED SOLUTION

Nicolai
Moderator
Partner    VIP   

And as a side comment: it's really bad system design if an Oracle host doesn't have enough archive space to survive 24 hours without archive log backups. What if the NetBackup master has a hardware failure? Are all the Oracle hosts down then?

I recommend a little push back on system owners :)

11 REPLIES

Marianne
Level 6
Partner    VIP    Accredited Certified
Veritas/Symantec have unfortunately never acknowledged this as an issue that needs to be fixed. The only way I have been able to reset the schedule is to manually kick off a backup. If it completes successfully, the hourly backups will kick in again.

Marianne
Level 6
Partner    VIP    Accredited Certified
Oh, another way is to reduce the tries to something like 1 every 1 hour but most people don't like this as it is a Global setting that affects all policies.

mph999
Level 6
Employee Accredited

Nicolai makes a good point. Whilst the backup software shouldn't fail, things break, and if a master server issue takes it offline for hours despite everyone's best attempts, you're going to have a very big problem.

At my previous company, all DB servers had to be able to run without backup for something like 2 days, and some of these were massive systems. I don't recall ever having to fall back on that, as we had multiple backup environments (another good piece of design in my view), so if things really got tight, we could move any critical systems to another backup server. I certainly recall doing that once in the middle of the night... Run a couple of fibre cables, a touch of rezoning and Bob's your uncle...

Marianne
Level 6
Partner    VIP    Accredited Certified

I still feel that users should be logging calls for regular schedules getting broken after a failure and insist on it being escalated all the way to Engineering.

It is dead easy to replicate the issue in a lab.

Do NOT accept the answer of 'the product is working as intended'.

Nicolai
Moderator
Partner    VIP   

While I agree with Marianne, don't hold your breath :)

Even if you succeed through support, it will be a long time before you see the change in production. So please consider your current options.

Michael_Rogers
Level 2

@Marianne - I did try running manually... successfully.. but the policy schedule did not kick back in.  I had to wait the 12 hours.  I agree.. this is something that could and should be addressed.

My example is not a real-life one; it was meant to represent a case where the inability to reset the status would be pretty inconvenient. I did have a series of failures and was annoyed that after I resolved the issue I couldn't get my policy back on schedule.

Marianne
Level 6
Partner    VIP    Accredited Certified
I have in the past managed to get the hourly schedule running again after a manual backup succeeded. If this is not working for you, then it seems the situation has got worse. The only remaining workaround then is to reduce the period between tries: make it 1 try every 1 hour.

revarooo
Level 6
Employee

What about a "backup" policy that you can enable and disable when you need it to run, OR copying the non-working scheduled policy to a new policy until the 12 hours are up and then disabling/deleting that copied policy? I know that may potentially cause issues with reports, but at least the backups happen.

I do think, though, that raising an enhancement request for the ability to reset the policy schedule is a good idea. I understand that if backups are failing you don't want them repeating and failing throughout the day (this ties up resources that other jobs could have used), but a schedule reset is a good idea.

Michael_Rogers
Level 2

I actually did consider copying the policy to a new policy, but realized I was trying to solve a problem that I didn't really have.   I actually do have plenty of space for 2-3 days worth of archivelogs and was fairly certain the policy would re-engage after 12 hours.

I think I will try to submit an enhancement request though.  It just seems like a "no brainer" to me.

 

Thank you to everyone for your thoughts.

areznik
Level 5

I think with this kind of requirement, it would make more sense not to rely on the NetBackup scheduler at all and just have the jobs submitted via a cron job from the client. You can put additional checks into your script to make sure that a backup is needed (i.e. the logs are big enough) and can script alerts if multiple attempts have failed.
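A rough sketch of what such a cron-driven wrapper could look like. Everything here is an assumption for illustration: the function names, the state file, the size threshold, and especially `backup_cmd`, which stands in for whatever command you actually use to submit the archive log backup from the client.

```python
import os
import subprocess
from pathlib import Path


def dir_size_bytes(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total


def read_failures(state_file):
    """Read the consecutive-failure counter; a missing or bad file counts as 0."""
    try:
        return int(Path(state_file).read_text().strip() or 0)
    except (FileNotFoundError, ValueError):
        return 0


def run_once(log_dir, backup_cmd, state_file, min_bytes, alert_after=2, alert=print):
    """One cron invocation. Returns 'skipped', 'ok', or 'failed'."""
    # Check 1: only back up when enough archive logs have accumulated.
    if dir_size_bytes(log_dir) < min_bytes:
        return "skipped"
    # backup_cmd is a placeholder argument list supplied by the caller.
    result = subprocess.run(backup_cmd)
    if result.returncode == 0:
        Path(state_file).write_text("0")  # success resets the counter
        return "ok"
    failures = read_failures(state_file) + 1
    Path(state_file).write_text(str(failures))
    # Check 2: alert once several attempts in a row have failed.
    if failures >= alert_after:
        alert(f"archive log backup failed {failures} times in a row")
    return "failed"
```

Called hourly from cron, this retries on every run regardless of earlier failures, so there is no lockout window to wait out. The `alert` callable defaults to `print`, which cron would typically mail to the job owner; swap in whatever notification you prefer.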