
Restore problems with a monstrous job?

g_e
Level 3
Hello - Netbackup 5.1 MP4 - 1 master media server, 2 media servers - 1 Storagetek SL500 w/ 4 LTO2's. Windows 2003 all around.

I've inherited the backups of our new file servers (used to be on Novell - backed up using Backup Exec).

The job is so enormous I can only fully back it up on Saturday morning - takes about 30-40some hours to complete. It's 1.2 TB and well over 3.3 million files.

It's live, enterprise-wide data that is volatile, meaning people delete files all the time, so restores are quite commonplace. So I need to back up each individual file and folder (no Flashbackup for me).

The problem is, with a job that takes so long to run, one little failure will completely ruin my weekend job. For the past 2 weekends, the job has written over a terabyte, then failed with like 5-8 hours to go (media write error, bad media, whatever). After all that writing, NOTHING is written to the Backup, Archive, Restore catalog. There is NO record of this job ever running, even though I can see all of the data on each individual tape using the reporting functions.

This is a HUGE problem for us, because it means there is no way to restore any data from this job. In Backup Exec (which we were using before moving to NetBackup), even partial jobs could still be restored (at least the files that were backed up successfully). Why can I not do this in NB? Am I missing something?

Should I change how this job is configured? Maybe break the job up into multiple pieces so that even if 1 part fails, I'll at LEAST have something to restore?

Thank you for any assistance.

Greg.
14 REPLIES

Lance_Hoskins
Level 6
What do you mean that one little failure completely ruins your weekend job? Do you not let the jobs restart after a single failure? Do you use checkpoints so that you don't lose all of the hard work the backup did prior to the failure?

Let me know if I'm understanding you correctly and I'd be happy to give more information if needed.

Stumpr2
Level 6
Hi Greg,

Should I change how this job is configured? Maybe break the job up into multiple pieces so that even if 1 part fails, I'll at LEAST have something to restore?

YES

I know what you mean about one glitch messing up the image.
Even with restarts enabled, it will restart from the beginning.
Also, you cannot use checkpoint restart with Novell :(

Smaller targets, and more of 'em. You are on the right track.

g_e
Level 3
Thanks for the replies so far guys.

We've since moved to Windows 2003, so checkpointing is available. However, as Bob stated, it tries to restart the job all over again... and with a 35-hour job, that's just not within the realm of possibility.

My first question is: if the job wrote to 6 tapes for 30 hours and 'fails' near the very end because of, say, a drive error or a bad block on a tape, why can I find nothing of that job in the Backup, Archive, Restore utility? The job is 97% complete, yet there's NO record of it at all from that night. If a job fails, is the catalog just zeroed out and the restore information blown away? Is there any workaround for this?

My second question: what's an efficient way to set up a 1.3 TB job? It's just a file share - \\servername\e:\Department - Department is the share mapped to all users' desktops - under this, we have a ton of folders with A-Z names. I want to back them all up, so I just have ALL_LOCAL_DRIVES set in the policy - too large a job. Can I do a range like \\servername\e:\Department\A-F, etc.?

Thx again!

Lance_Hoskins
Level 6
GE,

Sounds like you have the same problem we have on a few servers in our environment. If all of that data is on one volume and all of those folders are in one subfolder (i.e. if there are hundreds/thousands of folders making up the bulk of the data), you're in a bind. If this is your case, multiple data streams wouldn't work well because you'd have hundreds/thousands of streams from the same client. I also don't think there's any way to separate by a range (i.e. A-F)--which would be nice.

You could probably do what we do, which is to set multiplexing to 1 (so that the job has its own dedicated drive(s)) and then turn on checkpoint restarts. Checkpoints DO NOT start from the beginning, even though the progress bar may indicate that they do. We have several large clients that we use checkpoints on, and they work flawlessly. In the event of an error, they lose no more than 20 minutes' worth of backup time.

So, if you have them broken down by \\servername\e:\department and you have just a few departments, give multiple data streams a try and pull data from two locations simultaneously. If that's not the case, you should definitely give checkpoint restarts a try and allow the job to restart itself 3 or so times in a 12-hour period (for starters). You could also try FlashBackup, but depending on the total size of the volume, this may waste more tape space than you can afford (since the "free" space on the drive will also be written to tape).

Hope this helps!

Lance

Stumpr2
Level 6
Can I do a range like \\servername\e:\Department\A-F, etc?

YES

I do this for the exchange mailboxes and you should also be able to use this type of strategy:
NEW_STREAM
Microsoft Exchange Mailboxes:\*
NEW_STREAM
Microsoft Exchange Mailboxes:\*
NEW_STREAM
Microsoft Exchange Mailboxes:\*
NEW_STREAM
Microsoft Exchange Mailboxes:\*

but be careful to avoid disk thrashing. You may want to limit the number of streams that can be active. Depends on the hardware setup.
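Applied to Greg's file server, the same NEW_STREAM idea might look like the sketch below. This is only a guess at the layout: the drive letter, share path, and exact bracket-wildcard ranges are assumptions and would need to match the client, and wildcard support in backup selections should be verified for your NetBackup version:

```
NEW_STREAM
E:\Department\[A-F]*
NEW_STREAM
E:\Department\[G-M]*
NEW_STREAM
E:\Department\[N-S]*
NEW_STREAM
E:\Department\[T-Z]*
```

Each NEW_STREAM directive starts a separate job (and a separate catalog image), so a failure in one range should not throw away what the other ranges backed up.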

Stumpr2
Level 6
Lance,

I missed the part about moving away from Novell to a Windows server. You are correct that checkpoint should indeed be used. However, I still like to work with smaller images. I also set my retries to 3 attempts every 4 hours.

Lance_Hoskins
Level 6
Good to know about the ranges Bob, this may come in handy for some of our situations!

Thanks!
Lance

g_e
Level 3
Wow some really nice info in here so far guys - thanks for the help thus far. A few more questions if you would indulge me:

So to set up Checkpointing completely, I make sure it's checked in the policy (default 15 minutes ok? Remember, it typically takes 30+ hours), and then make Restore retries 2 or 3 in Host Properties?

Bob - should I set all those NEW_STREAMs in one policy? What if one of them fails and checkpointing doesn't help? Does each individual stream contain its own catalog information? Remember, my problem is that once my single, huge job fails, I lose all restore information - does this mitigate that by creating a new catalog entry in the Backup, Archive, Restore utility for each stream? If so, it's exactly what I'm looking for... if not, maybe I have to put each individual range stream in its own policy?

Thanks again - we're almost done I think! :)

Lance_Hoskins
Level 6
Correct in setting up checkpoints from w/in the policy. We typically do 20-30 minutes on these large clients to try and minimize overhead. As for the retries, that can be set in the Master Server Properties under "Global Attributes". Everyone has their own theory on this setting and depending on the length of your start window and drive availability, you can set it accordingly.
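If you prefer the CLI to the GUI, the same policy attributes can usually be flipped with bpplinfo. The policy name below is made up, and the flag names are from memory, so treat this as a sketch and check bpplinfo's help or the commands manual for your NetBackup version before running it:

```
REM Enable checkpoint restart on the (hypothetical) policy
REM "FileServer_Weekly" with a 20-minute checkpoint interval.
REM Flag names are an assumption -- verify for your version.
bpplinfo FileServer_Weekly -modify -chkpt 1 -chkpt_intrvl 20
```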

As for the last question, we'll let Bob chime in, but in our experience any stream that finishes should give you a "bubble" in the BAR console. All in all, though, you want to make sure it all backs up, which is what checkpoints and retries are going to do for you.

Hope this helps!

Stumpr2
Level 6
Each stream is an "image" and an entity unto itself, and it can stand alone. It has its own catalog information. If a single stream fails, then just that single stream needs to be restarted; the other streams will not need to be redone.
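One way to confirm that each stream really did land in the catalog as its own image is to list the client's images after the weekend run. The client name and dates below are placeholders; with multiple streams you should see one image line per stream:

```
REM List all catalog images for the client in the given window.
REM "fileserver" and the date range are hypothetical examples.
bpimagelist -client fileserver -d 06/03/2006 -e 06/04/2006 -U
```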

Chuck_Stevens
Level 6
Lots of good answers so far (checkpoints, breaking the job up into pieces, etc.). Here's some more questions for you:

How's your tape performance? Are you getting good speeds to your tape drives?

How are your media servers connected to the tape drives?

What kind of network are you running? Gigabit perhaps?

I can think of some other solutions that would cost money. Get a SAN storage array, with a high-speed connection to your servers, and do Disk Staging. Copying the data to the SAN drives will be much faster than writing directly to tape. Then you can take your time during the week to write that copy to tape.

There's also the solution of breaking all your data up between more servers.

Ankur_Kumar
Level 5
Hi, could you please check the following parameters on the master and the media servers:

- NUMBER_DATA_BUFFERS: The number of buffers used by NetBackup to buffer data prior to sending it to the tape drives. The default value is 16.
- SIZE_DATA_BUFFERS: The size of each buffer; total buffer memory is this size multiplied by the NUMBER_DATA_BUFFERS value. The default value is 65536.
- NET_BUFFER_SZ: The network receive buffer on the media server. It receives data from the client. The default value is 256K.
- Buffer_size: The size of each data packet sent from the client to the media server. The default value is 32K.
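On a Windows media server, the first three tunables are plain "touch" files containing a single number. The sketch below assumes a default install path and example values; as I recall, NET_BUFFER_SZ sits one level up from the other two, so double-check the locations and recommended values for your version before relying on this:

```
REM NUMBER_DATA_BUFFERS and SIZE_DATA_BUFFERS go under db\config
echo 32 > "C:\Program Files\VERITAS\NetBackup\db\config\NUMBER_DATA_BUFFERS"
echo 262144 > "C:\Program Files\VERITAS\NetBackup\db\config\SIZE_DATA_BUFFERS"
REM NET_BUFFER_SZ lives directly under the NetBackup directory
echo 262144 > "C:\Program Files\VERITAS\NetBackup\NET_BUFFER_SZ"
```

As far as I know, the values are read when a job starts, so no service restart should be needed, but verify that for your release.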


Also, could you dedicate a NIC on the client and on the master/media servers, set to 100 Mb FULL DUPLEX, for backup?

Or you could set the client up as a media server (SSO) so that it can run a local backup and the data is not required to travel all the way to a different media server.

Looking forward to your suggestions.

Ciao,
Ankur Kumar

g_e
Level 3
Great help here guys - I think I got it all figured out. This was my first time ever with a job this large, and all you guys have had a hand in teaching/assisting me in getting it set up correctly. Thank you very much for your help!

Greg

Stumpr2
Level 6
You're welcome.
Thanks for the points.
Please come again.