Need your FLASHBACKUP experience for LARGE data volumes

NB_BCE_Adkisson · 08-27-2010

If you have used NetBackup FlashBackup at your company to successfully back up volumes in the terabyte size range, I'd love to hear your experience!

I'd be interested in hearing about:

How much data, and what kind, were you taking a snapshot of?
After the snapshot, where did you send it: over the network, over the SAN, directly to tape, etc.?
If you ran into roadblocks, what were they and what did you do to get around them?

Anything you care to share about your experience is greatly appreciated.

Ed_Wilts · 08-28-2010

We started trying FlashBackups when they didn't actually work from a Solaris master to a Windows client. That's been fixed for many years.

Initially, we were backing up 250GB volumes with about 10 million files on them on Windows directly to SDLT tape and it would take about 24 hours. We switched to DSSUs and cut the time in half. We then went to FlashBackup and cut the time in half again to ~6 hours. We work with a ton of scanned TIFF images and have a few applications with over a billion files.
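For a rough sense of what those timings mean per file, here is a back-of-the-envelope sketch in Python. The 12-hour figure is inferred from "cut the time in half" rather than stated directly, and the function and variable names are only for illustration.

```python
def files_per_second(file_count: int, hours: float) -> float:
    """Average number of files processed per second over a backup window."""
    return file_count / (hours * 3600)


FILES = 10_000_000  # "about 10 million files" on one 250GB volume

# Durations in hours. 24 and ~6 are quoted in the post; 12 is inferred
# from halving the 24-hour direct-to-tape time.
stages = {
    "direct to SDLT tape": 24,
    "DSSU": 12,
    "FlashBackup": 6,
}

rates = {name: files_per_second(FILES, hours) for name, hours in stages.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.0f} files/sec")
```

That works out to roughly 116, 231, and 463 files per second. These are averages over the whole window and say nothing about tape or network speed in bytes.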

Today we routinely back up 1 TB volumes from both Solaris and Windows clients and we still have a lot of volumes with >40M files per TB and hostile directory structures. All of the clients have GigE connections to a media server. We use LTO-3 drives now front-ended by Decru encryption appliances.

We've run into several roadblocks along the way:

System administrators forget (conveniently?) that snapshots are an OS responsibility and not a backup responsibility. So when they get a status 156, they ask the backup team why "our" snapshots failed. "Patch the OS and configure it properly" is a common answer from us...
The parent/child job relationships play havoc with FlashBackups because the snapshots are created by the parent job, not the child job. In many cases, the child may not run for a LONG time after the parent creates the snapshots, so we were forced to create a separate policy per mount point. I've had a single client with over 50 policies because of this.
A large non-FlashBackup restore will usually fail because the master can't enumerate the file list. I haven't tried it lately and we don't have much need to restore an entire volume this way - it's usually small subsets of the data that need to be restored. If I had to do it again, I'd provision a full volume and do an image restore, not a file-based restore.
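With one policy per mount point and 50+ policies on a single client, a predictable naming scheme helps keep things manageable. The following is only a sketch in Python of how such a plan could be generated: the `FB_<client>_<mount>` naming convention, the function names, and the example client are all hypothetical, and nothing here calls NetBackup itself.

```python
import re


def policy_name(client: str, mount_point: str, prefix: str = "FB") -> str:
    """Build a hypothetical policy name from a client and one mount point."""
    # Collapse every run of non-alphanumeric characters into one underscore,
    # then trim underscores left over at either end.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", mount_point).strip("_")
    return f"{prefix}_{client}_{slug or 'root'}"


def plan_policies(client: str, mount_points: list[str]) -> dict[str, str]:
    """Map one policy name to each mount point, refusing name collisions."""
    plan: dict[str, str] = {}
    for mp in mount_points:
        name = policy_name(client, mp)
        if name in plan:
            raise ValueError(f"{mp!r} and {plan[name]!r} both map to {name}")
        plan[name] = mp
    return plan


# Example client and mount points are made up for illustration.
plan = plan_policies("fileserver01", ["E:\\", "F:\\", "/data/vol01"])
for name, mp in plan.items():
    print(f"{name} -> {mp}")
```

The collision check matters because different paths can reduce to the same slug (for example `/data/vol01` and `/data-vol01`), and two mount points silently sharing one policy would defeat the purpose of splitting them.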

I haven't had a chance to work with the SAN client to see if that will improve performance. It's on my TODO list...

In general, we're trying to get away from host-based file serving, so we'll hopefully phase out FlashBackup completely in favor of NDMP backups from our NetApp filers as our data continues to migrate. We're also working towards full application replication to our DR facilities so that the requirements for tape will diminish. I can't imagine how long it would take to roll a billion files back off of tape :(

NB_BCE_Adkisson · 08-28-2010

Excellent information Ed! Thank you for sharing. Two questions come to mind. 1) What is the largest size volume that you've seen successful snapshots occur? 2) Are those GigE connections dedicated for backup?

Interesting finding you made on the mount point per policy scenario. Did you dig into logs or have any luck with a support case to understand why the child job was taking so long to start after the parent created the snapshot?

Thanks again and I look forward to any additional details you can provide!

Ed_Wilts · 08-29-2010

> 1) What is the largest size volume that you've seen successful snapshots occur?

I believe it's 1TB. We try to not present volumes larger than that from a Windows or Unix system.

> 2) Are those GigE connections dedicated for backup?

Yes they are.

> Interesting finding you made on the mount point per policy scenario. Did you dig into logs or have any luck with a support case to understand why the child job was taking so long to start after the parent created the snapshot?

The issue is not the length of time it takes the children to start. With dozens of mount points each getting large backups, obviously you have to restrict the number of simultaneous jobs (which is non-trivial on active/active clusters). On top of that, most of these don't get a typical daily/weekly/monthly type of schedule: they're written to until they're full and then may get monthly, or in some cases semi-annual, backups.

The issue is that I do not believe snapshot creation belongs in the parent job. I lost that argument the first time :)
