Very large file share...other solutions?

Toddman214
Level 6

NetBackup master and media servers running Windows 2008 R2, version 7.6.0.4 (one media server running 7.5.0.4), backing up to Data Domain

 

Hi all,

 

I have a very large NetApp file share (it's our primary file share, which also contains all of our users' home directories, etc.), and it is imperative that it gets backed up. The full backup is 13 TB, containing approximately 20 million files. It's been a thorn in my side for quite a while, and it's just getting worse as it grows. My current method for backing it up is NDMP, as that used to allow a backup within an acceptable window. But NDMP has proven to be very inconsistent lately, ranging from about 20 MB/s to 80 MB/s. The last incremental backup of this data was 2.7 TB and took 58 hours to complete! People make heavy changes to this data daily, and I've been bitten because the backup has been running while folks changed things, so no restore is available.

I am also able to use Accelerator, by pointing the backup at a CIFS share and running an MS-Windows policy type against it. But Accelerator doesn't seem to do much good. The job details do indicate that it is enabled, but it cannot use the change journal for this (unsupported for non-local volumes). The benefit of this method is that I can break the job into multiple streams, create an individual policy for each data set, etc., but each one runs so slowly that it negates any benefit. A sketch of the stream layout is below.
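
For illustration, a rough sketch of how the streams get split in a policy's backup selections using the standard NEW_STREAM directive (the filer and share names here are placeholders, not our real paths):

    NEW_STREAM
    \\filer01\home$\
    NEW_STREAM
    \\filer01\groups\
    NEW_STREAM
    \\filer01\projects\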

I don't really want to go back to adding a synthetic schedule into the mix, but I'm willing to try it.

Are any of you dealing with such issues, and can you tell me what works best for you? Not that it will necessarily be best for our environment (details of which I'll gladly provide), but I've tried multiple things, tuning parameters, etc., and may need a new direction here.

 

Thank you!

 

 

Todd

 

 

  

 

 

5 Replies

sdo
Moderator
Partner VIP Certified

I can only empathize. Sounds like not a lot of fun. If you've investigated as many tuning angles as you can, then maybe all you're left with is to try to somehow break it into smaller (possibly concurrent?) backups.

Is everything under the one Qtree?

Are there multiple DNS names pointing at the filer for specific shares? Maybe a leftover from when multiple Windows servers were migrated to it? If so, it may be worth breaking these back out, I guess into separate vfilers?

Are there multiple discrete (i.e. not traversed from above) shares at lower levels? Maybe with only a few users? Maybe these could be broken out too without disrupting too many users?

You say user home drives are on there too? Maybe you could create a new vfiler specifically for home drives, and then point the users' AD records at the new vfiler name (a rough sketch follows). This shouldn't affect the main file server shares, and only breaks shares that users may have set up themselves - which they shouldn't have done anyway.
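
A hedged sketch of what that might look like in 7-Mode - the vfiler name, IP address, and volume path are made up for illustration:

    vfiler create vf_home -i 10.0.0.50 /vol/vol_home    # new vfiler owning the home-drive volume
    vfiler run vf_home cifs setup                       # configure CIFS on it so \\vf_home\... resolves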


Is there a NetApp feature called 'minra'? Is it a volume setting? One where you can disable 'MINimum Read Ahead' (min-r-a = minra) and instead tell ONTAP to perform more read-ahead.
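
If I recall correctly, it's toggled per volume in 7-Mode along these lines - worth verifying against the docs for your ONTAP release, and 'vol0' is just a placeholder name:

    vol options vol0 minra off    # turn minimal read-ahead off, i.e. allow normal read-ahead
    vol options vol0              # list the volume's options to confirm the setting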

And, I know most people think that WAFL does not need to be de-fragmented. If that were true, then why does the reallocate feature even exist? Maybe try a reallocate over a long quiet weekend? You can interrupt a reallocate at any time, and it might take several weekends to complete a first-ever pass, but maybe it will help a little. I have seen some very nice improvements in IO rates after 'reallocation'.
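
A hedged 7-Mode sketch of the commands I have in mind (the volume name is a placeholder; check the man pages for your release):

    reallocate measure /vol/vol0    # gauge how fragmented the volume layout is
    reallocate start -f /vol/vol0   # run a one-off full reallocation pass
    reallocate status               # check progress
    reallocate quiesce /vol/vol0    # pause the pass if the quiet weekend runs out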

sdo
Moderator
Partner VIP Certified

Have a search for SQL or Oracle database files in the shares/paths/trees. I wonder if you've got some 'power users' running databases on your CIFS storage - and consuming valuable IO at backup time.
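
Something quick from any Windows box would do - the filer and share names here are placeholders:

    dir \\filer01\shared\*.mdf /s    # SQL Server data files
    dir \\filer01\shared\*.dbf /s    # Oracle data files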

Are you de-duping (A-SIS) on the NetApp? If the de-dupe rates are not very good, then why bother? Maybe your NetApp multi-pass A-SIS de-dupe reads are conflicting with the IO for backups. Check when they run? Maybe you can avoid each other?
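
A hedged 7-Mode sketch for checking and moving the schedule (the volume name and times are placeholders):

    sis status                          # which volumes have dedupe enabled, and their state
    sis config /vol/vol0                # show the volume's dedupe schedule
    sis config -s sun-sat@23 /vol/vol0  # example: shift the dedupe run to 23:00, away from the backup window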

 

Toddman214
Level 6

Thank you for the valuable input, sdo. No, it's not been a lot of fun. LOL The frustrating part is that the majority of these performance issues started fairly suddenly. Backing up this data using NDMP was completing well within my backup window until recently, but the amount of data has not significantly changed.

The issue right now is that, in our environment, a high percentage of the things you mentioned are outside of my control, but I'll work with the other teams as needed to address them and see what they think of the suggestions. I tried using the CIFS backup method, and even created a policy for each data set, but the full backup took 6 days to complete. That's part of why I've not responded until now. So even using synthetic backups is going to cause an issue each time a normal full needs to run.

A very large chunk of this data is user PST files, which back up in full every time a user breathes on them. That's something we are already in the process of addressing.

sdo
Moderator
Partner VIP Certified
(Accepted Solution)

Do you have a well-established ITIL change control process in place? Why not look back through the change requests from the four weeks before the problems began, and carefully look for anything that may have impacted the storage workload and/or changed the LAN topology. This could be a red herring, though.

Do you have NetApp Operations Manager (aka DFM)? Has it been collecting NetApp performance metrics for some time? Would it be possible to graph before and after your issues began, to see whether latency or the volume of workload suddenly shifted somewhere across the NetApp cluster heads? Maybe you reached the point where the straw broke the camel's back, and disk storage response times suddenly started to tail off. Even storage arrays have a breaking point. Maybe the NetApps have passed a saturation point.
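
Even without DFM, a quick look from the filer console can hint at saturation. A hedged 7-Mode sketch (the one-second interval is just an example):

    sysstat -x 1         # per-second CPU, ops, throughput and disk utilization
    stats show volume    # per-volume counters, including average latency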

Toddman214
Level 6

sdo, I reached back out to our storage team, and they sent me a set of statistics indicating that the volume is 98.2% full. I'm quite sure this is the issue, as that likely doesn't leave enough space for the processing the backup job needs, killing its performance. Thank you for the suggestion. I'll mark you as the solution, since your questions prompted me in the right direction. Also, unknown to me, they had fixed the issue of my only being able to direct my NDMP policy at the root. I can now direct it at individual subdirectories as well, whereas I could not before.
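
For anyone following along, the fullness is easy to confirm from the filer (a hedged 7-Mode sketch; the volume name is a placeholder):

    df -g /vol/vol0    # show used/available space for the volume, in GB

And with subdirectories now usable, the NDMP policy's backup selections can be split into multiple streams along these hypothetical lines:

    NEW_STREAM
    /vol/vol0/users
    NEW_STREAM
    /vol/vol0/groups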