Vault backups are taking days to complete.....

Toddman214
Level 6

Windows 2008R2 master and media servers, NetBackup 7.1.0.4. Using Vault only for duplication of backups from the DataDomain to tape. Two Dell ML6020 tape libraries with 6 drives each, but one library is almost entirely dedicated to Oracle backups.

We are contractually obligated to ensure that once data is backed up, it is duplicated to tape and sent offsite the same day.

I personally think we need more tape drives to write to, but I'd like opinions from others.

 

I tried running SLPs a while back for duplication to tape, and even after changing parameters, tuning per the SLP Admin Guide, and having a NetBackup technician come onsite and sit with me for two days, I would still get 1,500 to 2,000 SLP jobs queuing up waiting to write to tape. So I stopped using SLPs and moved to Vault, which appears to be much cleaner, and I can schedule the duplication.

But I appear to be running into issues again, even with Vault, where the duplication cannot keep up with the backups to disk. I just had a Vault duplication job complete after 5 full days of running, and it duplicated 11 TB of data. While that was running, of course, new data continued to write to disk for those 5 days.

Once that job completed, because I have the Vault policy set to go back and grab data from the last 7 days, it grabbed all of that data and has now kicked off three new jobs: two at 11 TB each and one at 23 TB. It's cool that it works that way, but if it took 5 days to duplicate 11 TB of data, 23 TB is going to take way too long. So let's say, conservatively, it takes 8 days to duplicate that 23 TB job... there will then be 8 days of data that was not captured while Vault was writing, so I'll have to adjust the policy to capture 8 days back, and so on, and so on. Am I missing something here? It looks like a vicious cycle is building.
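Just to illustrate the arithmetic I'm worried about, here's a rough sketch - the daily ingest rate is a made-up placeholder, not a measured figure:

```python
# Rough backlog model: if new backups land on disk faster than Vault can
# duplicate them to tape, the data waiting to go offsite grows without bound.
# The ingest rate is an illustrative placeholder; the duplication rate is
# the ~2.2 TB/day observed above (11 TB in 5 days).

ingest_tb_per_day = 5.0       # new backup data written to disk each day (assumed)
duplicate_tb_per_day = 2.2    # observed Vault duplication rate

backlog_tb = 0.0
for day in range(1, 15):
    backlog_tb += ingest_tb_per_day - duplicate_tb_per_day
    print(f"day {day:2d}: data still waiting for tape = {backlog_tb:5.1f} TB")

# As long as ingest exceeds the duplication rate, the backlog (and the
# look-back window the Vault policy needs) keeps growing - the cycle above.
```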

As per our obligations, we pretty much have to duplicate and send offsite all data that is backed up to disk, so I don't think we are duplicating more than we really need. There is never a time when we do not have many backups running, so I guess resource contention could be an issue. I am allowing 3 drives to handle the Vault jobs, and I don't think I can push that any higher.

Any suggestions out there? I'm also reading up on the Vault Admin Guide, but it looks like we may just need more hardware in place.

Thanks again, all. You've been very helpful in the past!

Todd


Toddman214
Level 6

Oh, and the media allocation is quite fast, so there are no issues with the Vault jobs waiting on resources.

mph999
Level 6
Employee Accredited

Hey Todd,

Given your very reasonable approach to this issue, I'll just put it straight to you.

You are probably correct.

...but, you need to make a few checks.

Without making it difficult, I'd consider this approach ...

If the tape drives are running 24/7 while the dups are running AND they are running at a good speed, say 80 MB/s or more, then we can agree that, quite simply, there are not enough drives.

Regards,

Martin

mph999
Level 6
Employee Accredited

Hmm, you have 3 drives for Vault.

3 jobs started, and I presume, therefore, one drive each.

The 11 TB job took 5 days, which is (on average) 2.2 TB/day

= 2,252.8 GB/day

= 2,306,867 MB/day

= 96,119 MB/hr

= 1,601 MB/min

= ~26 MB/s

 

Unless I've got something wrong (apologies, it's late here and I'm tired...), I'd say your drive is running at about a quarter of its potential speed.
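For reference, the same conversion as a quick sketch (binary GB/MB steps, figures as above):

```python
# Quick check of the conversion above: 11 TB duplicated in 5 days -> MB/s.
tb, days = 11, 5
gb_per_day = tb * 1024 / days              # ~2252.8 GB/day
mb_per_day = gb_per_day * 1024             # ~2,306,867 MB/day
mb_per_sec = mb_per_day / (24 * 60 * 60)   # ~26.7 MB/s
print(f"~{mb_per_sec:.1f} MB/s per drive")
```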

Martin

Gautier_Leblanc
Level 5
Partner Accredited

Hello,

 

How is your DataDomain connected? SAN or LAN? Are you using VTL, OST, or simple disk?

What is the bandwidth between your DD and the NetBackup server? 8 Gb/s? 4x 1 Gb/s?

 

Note: it is hard to get good duplication speeds from a deduplicated source (but yours are very low). If you write backups to the same storage while duplications are running, speeds often drop.

 

Thanks

mph999
Level 6
Employee Accredited

OK, I'll run it the other way as a check.

Consider a drive running at 100 MB/s (an acceptable speed):

100 MB/s =

100 x 60 = 6,000 MB/min

6,000 x 60 = 360,000 MB/hour = 360,000 / 1024 ≈ 351 GB/hr

351 GB/hr x 24 = 8,424 GB/day

= 42,120 GB / 5 days

= 42,120 / 1024 ≈ 41 TB / 5 days

Which is about 4 times as much data as your drive is writing, so I reckon my sums are about right.

So, in summary, you need to find the slow point / bottleneck. If the DD is doing dedup, then I suspect it is slow because it has to rebuild all the data before sending it to the tape drive. An easy test is to back up the media server itself to the drives - does that run quickly?

Regards,

Martin

Toddman214
Level 6

Hmmm... I'm watching those duplications today, and as of this writing, each job has only written 1.14 TB in 20 hours. If my math is right, that's 1,140 GB / 20 hrs = 57 GB/hr, or roughly 58,000 MB/hr, or about 16 MB/s. That sounds horrible, and my dups will take forever.

I forgot to provide the drive hardware info. These are LTO3 drives.

My tape drives, of course, run over fibre to the media servers. The media servers have 10Gb connections to the DataDomain. We are using OST.
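Running that arithmetic forward for the three pending jobs gives a rough idea of how long this will take; the 60 MB/s comparison figure below is just an assumed healthy LTO3 rate, not a measurement:

```python
# Rough ETA sketch for the pending duplication jobs (11 TB + 11 TB + 23 TB),
# comparing the observed ~16 MB/s per stream with an assumed ~60 MB/s.
jobs_tb = [11, 11, 23]
total_mb = sum(jobs_tb) * 1024 * 1024      # total data in MB (binary units)

for rate_mb_s in (16, 60):
    days = total_mb / rate_mb_s / 86400    # 86400 seconds in a day
    print(f"{rate_mb_s} MB/s: {days:.1f} drive-days of writing")

# With three drives in parallel, divide by three: roughly 11 days at
# 16 MB/s versus about 3 days at 60 MB/s.
```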

 

mph999
Level 6
Employee Accredited

OK, so my sums are about right.

So, the next step is to test the speed of the drives using the local disks in the media server. This should be 60-70 MB/s or above.

Do you have these files:

/usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS

/usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS

I suspect you do, but if not, I'd recommend a value of 262144 in the SIZE file and 32 in the NUMBER file as a starting point.
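If they are not there, creating them is just a matter of writing the value into a file named after the setting. A minimal sketch, using the Unix-style default path; on a Windows media server the config directory sits under the NetBackup install path, so adjust accordingly:

```python
# Sketch: create the two tuning touch files with the suggested starting values.
# Path is the Unix-style default; on Windows, point config_dir at the
# equivalent directory under the NetBackup install path (an assumption -
# adjust for your environment).
import os

config_dir = "/usr/openv/netbackup/db/config"
settings = {
    "SIZE_DATA_BUFFERS": "262144",    # tape I/O buffer size in bytes (256 KB)
    "NUMBER_DATA_BUFFERS": "32",      # number of shared-memory data buffers
}

for name, value in settings.items():
    with open(os.path.join(config_dir, name), "w") as f:
        f.write(value + "\n")
```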

My feeling is that the delay is caused by the DataDomain rebuilding the data, as I'm going to guess it is deduplicated.

Another test: can you just 'copy' some data from the DD to local disk - how fast is that?

Martin

Toddman214
Level 6

New information... Looking at the duplication jobs in the Activity Monitor, under the media server column I see this: pdc00nbua801w -> pdc00nbua802w, where pdc00nbua801w is my dedicated master server (a VM) and pdc00nbua802w is the media server that wrote the original job. It would appear that the data IS running across the network.

I went into the Vault policy, checked the "Alternate read server" box, and told it to use the media server pdc00nbua802w. I kicked the Vault job off again and monitored it. Now, under the media server column, it shows pdc00nbua802w -> pdc00nbua802w, which is what I wanted. After approximately 22 minutes it has written 70 GB, which works out to about 53.3 MB/s. MUCH better than the 16 MB/s it was running at before.

But I think it should still be able to run even faster, even for LTO3. I see a "SIZE_DATA_BUFFER" touch file on that same media server with its value set to 262144. I've been told that if my hardware will not support that number (which I do not know how to determine), it may fall back to the default, which is down around 65536. It was suggested I set this number to 131072, but I'd like your input. What are you all running this at?

 

Thanks!

Toddman214
Level 6

Sorry Martin, I had not refreshed my screen in a while, so I never saw your suggestions before making my last post above. I need to get a change request approved before I move forward with adjusting the buffers and such.

As I mentioned above, my data buffer size is already set to 262144, and the number of data buffers is 128, so I don't know whether I should adjust up, adjust down, or leave it as is.

mph999
Level 6
Employee Accredited

NP - leave as they are.

262144 works well for LTO2/3/4/5 drives; higher values may be even better, but I know for a fact that this value should allow speeds of at least 80-100 MB/s, maybe quicker. In any case, it should definitely go quicker than the 16 MB/s we have at the moment.

Data buffers at 128 - probably OK; if we can't fill the buffers we already have, more or fewer won't make a difference, I don't think.

So the most important thing is to see if the drives can go quickly - hence a backup of the media server's local disk(s). Just watch the job in the Activity Monitor: does it get a good speed? If so, the issue is on the client (DD) side.

Martin

Toddman214
Level 6

Actually, after stopping the dup from crossing the network, my current speed is 53.75 MB/s; I'm no longer at the 16 MB/s I started with this morning. But I agree... from everything I have read, it should be faster. I've set up a backup to tape of the media server itself (all_local_drives) and will see what that does and report back.

mph999
Level 6
Employee Accredited

Super - all this is doing is just checking where the slow point is. I suspect it will be the DD.

Is the DD performing dedup on the data?

Martin

Toddman214
Level 6

Oh boy, did I just find something interesting. I've looked at this file several times today, but only just now noticed this: the SIZE_DATA_BUFFER file has no 'S' on the end of it. This file was created in May of 2011, so if there is supposed to be an S on it, then this file has probably been doing nothing (or only the bare minimum) since it was created. Geez! LOL

mph999
Level 6
Employee Accredited

Good spot, would you be kind enough to add an S ...

M

mph999
Level 6
Employee Accredited

 

The files should have these names:
 
/usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS 
/usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS

mph999
Level 6
Employee Accredited

It will have been running with the default block size, which is 65536 (64 KB), and that will not give good performance on the drives.

Rename the file and then try again; no need to restart anything - you should (he says...) see an instant jump in performance.
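In case it helps anyone else who hits this, a minimal sketch of the rename (same Unix-style default path as above; on a Windows media server, substitute the config directory under the NetBackup install path):

```python
# Sketch: rename the misnamed touch file so NetBackup actually reads it.
import os

config_dir = "/usr/openv/netbackup/db/config"   # adjust for Windows installs
os.rename(
    os.path.join(config_dir, "SIZE_DATA_BUFFER"),    # misnamed (no trailing S)
    os.path.join(config_dir, "SIZE_DATA_BUFFERS"),   # correct name
)
```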