Urgent: Restore Jobs Get Hour-glass for 10 minutes before queuing

ctate · ‎07-27-2009

We are conducting a DR test, and restore jobs show an hourglass for up to 10 minutes before they queue. They then sit queued for 5 minutes or so before running. The environment is Windows 2003 SP2 with NetBackup 6.0 MP5 and an ADIC Scalar i2000 library with LTO-2 tape drives. We have not had this problem before. Any ideas?

thanks

Abesama · ‎07-27-2009

In DR, the NBU master sometimes gets confused by differing host names or IP addresses, and it wastes time working out which one is right for the backup or restore job.

So double-check name resolution on all your master and media servers, and make sure every master, media server, and client involved in the restore resolves correctly with both forward and reverse lookups.
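One way to run that forward and reverse check on each box is a small script like the one below. It is only a rough sketch, not anything NetBackup ships: the function name check_host is made up for illustration, and it relies on nothing more than Python's standard socket module. The two resolver arguments are there so the logic can be tried without a live DNS.

```python
import socket


def check_host(name, forward=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Return a list of problems for one host name (empty list means consistent).

    check_host is a hypothetical helper, not part of NetBackup.
    """
    problems = []
    try:
        # Forward lookup: name -> (canonical name, aliases, IP addresses)
        _, _, addresses = forward(name)
    except OSError as exc:
        return [f"{name}: forward lookup failed ({exc})"]

    for ip in addresses:
        try:
            # Reverse lookup: IP -> (primary name, aliases, IP addresses)
            primary, aliases, _ = reverse(ip)
        except OSError as exc:
            problems.append(f"{name}: reverse lookup of {ip} failed ({exc})")
            continue
        known = {n.lower() for n in [primary, *aliases]}
        if name.lower() not in known:
            # The IP maps back to a different name than the one we started with.
            problems.append(f"{name}: {ip} resolves back to {primary}, not {name}")
    return problems
```

Call check_host("your-master-name") on the master, each media server, and each client, using the exact names you have configured in NetBackup. Note that a short name which reverses only to its FQDN (or the other way round) is reported as a mismatch on purpose, since that is exactly the kind of disagreement worth looking at.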

Deactivating all the backup policies or running nbpemreq -suspend_scheduling is also a good idea, to take some load off the master server.

But restore jobs are supposed to get resources allocated immediately, regardless of the actual resources' availability. So if your restore job stays in queued status in the Activity Monitor (which should not happen even if all drives were busy with backups), we should suspect communication issues between EMM, nbrb, and nbjm.

Again, name resolution is the first place to check. In the EMM-nbrb-nbjm process flow, it is very important that all of them agree on the host names.

I've seen cases where two NBU servers resolve the same IP to different names, and that sort of thing causes a big mess.
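To spot that "same IP, different names" situation, you could collect what each server resolves for the IPs involved and compare the results side by side. The snippet below is a rough sketch of the comparison only. The function name find_conflicts and the input layout (server name mapped to an IP-to-hostname table) are assumptions for illustration, and gathering the per-server tables is left to whatever method suits your site.

```python
from collections import defaultdict


def find_conflicts(views):
    """Find IPs that different servers resolve to different host names.

    views: {server_name: {ip: hostname_as_seen_by_that_server}}
    Returns {ip: {server_name: hostname}} for conflicting IPs only.
    find_conflicts is a hypothetical helper, not a NetBackup tool.
    """
    names_by_ip = defaultdict(dict)
    for server, table in views.items():
        for ip, hostname in table.items():
            # Compare case-insensitively, as host names are not case-sensitive.
            names_by_ip[ip][server] = hostname.lower()

    # Keep only the IPs where at least two servers disagree on the name.
    return {
        ip: seen
        for ip, seen in names_by_ip.items()
        if len(set(seen.values())) > 1
    }
```

An empty result means every server that reported an IP gave it the same name. Anything else lists, per IP, which server sees which name, and that is where to start fixing DNS or hosts files.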

Sometimes cleaning up temp files (ior files) while the NBU services are stopped helps, but this should be done carefully under tech support's direction, so I won't go into detail.

In summary: name resolution, name resolution. Double-check, then double-check again.

So many times we see NBU issues caused by name resolution faults, even though they don't look like the cause at first.

Abe

Anton_Panyushki · ‎07-28-2009

Well,

Once you launch a restore job, NetBackup starts scanning the catalog for the images and media needed for the restore. Please check the bprd log for records pertaining to the restore. As soon as this search completes, the restore job appears in the Activity Monitor.

It looks first for the peername, then the policy, and finally the images. You can narrow the search by specifying a date range, peername, and policy.

ctate · ‎07-30-2009

Thanks for the feedback. Support said it was a known issue in DR and we sent some logs, but no resolution was found. We managed through it and got all of the restores done. I had already verified DNS and so on. I put hosts files on all systems, trying both the short name and the FQDN, but the problem persisted. All policies were deactivated immediately after the initial catalog restore, so that wasn't it either. Once a job became "queued", it would launch within a few minutes.

We did run into other restore problems that we had to work around. Some of the restores failed with various errors despite trying multiple tape drives in the robot, multiple media servers, media from multiple days, and restoring to different locations. We built a standalone master server with some local tape drives, cataloged the tapes, and the restores worked fine. Why? I don't know the answer. I had to do the same thing last year for a couple of jobs. So the lesson is: if a restore won't work despite your best efforts, catalog the tape on a standalone server with a standalone drive and try again.

Abesama · ‎07-30-2009

Unless the DR network is an "exact" replica of the production environment, there's always a chance of something going wrong or running slowly.

IP addresses can be different, and so can NICs, routing tables, the tape library, etc.

But tech support shouldn't really just say "known issue with DR". Their job is to identify the point of failure, not merely to acknowledge that the setup is different.

:-)

Abe
