5220 Slow Backup Performance
Hi everyone.
We recently invested in two 5220 36TB appliances to replace the 7.1 Windows media servers we were using. We are moving to (almost) tapeless backups and the appliances seemed a good bet. We will be running the two in alternate sites using AIR and backing up 95% VADP over SAN. We are currently in a transitional period where we are seeding the 5220 that will eventually be based in the remote site, so it is sitting alongside the other appliance in the same datacentre. We are still writing all backups to 6 x LTO-3 drives and need to continue this until we physically move the 2nd appliance. AIR is in operation and both appliances are currently protecting around 170TB.
NBU version is 7.5.5 on masters and 2.5.1 on the 5220s.
If I run an isolated backup (i.e. no other activity on the appliance) it flies, and I mean seriously flies... I'm getting in excess of 120 MB/s in some cases.
Our nightly cincs (cumulative incrementals) also run OK. Nowhere near the above, but acceptable throughput nevertheless.
The issue is our full backups at the weekend. We have 420 VMs to process (approx 20TB of data). I'm lucky if I get more than 5 MB/s over SAN and the backup window is really starting to creak... We do use query-based VM selection and a maximum of 2 connections per datastore.
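To put rough numbers on why the window creaks, here is a back-of-envelope sketch. It assumes the 5 MB/s figure is per backup stream, that around 20 streams run concurrently (the figure mentioned later in this post), decimal units (1 TB = 1,000,000 MB), and a purely hypothetical 48-hour weekend window. The function names are made up for illustration, and dedupe savings are ignored.

```python
# Rough backup-window arithmetic. Decimal units: 1 TB = 1,000,000 MB.
# Stream count and window length are assumptions, not measured values.

def hours_to_move(data_tb: float, streams: int, mb_s_per_stream: float) -> float:
    """Hours needed to move data_tb at streams * mb_s_per_stream MB/s."""
    aggregate_mb_s = streams * mb_s_per_stream
    total_mb = data_tb * 1_000_000
    return total_mb / aggregate_mb_s / 3600


def required_aggregate_mb_s(data_tb: float, window_hours: float) -> float:
    """Aggregate MB/s needed to move data_tb within window_hours."""
    return data_tb * 1_000_000 / (window_hours * 3600)


if __name__ == "__main__":
    # 20 TB at 20 streams x 5 MB/s each
    print(f"{hours_to_move(20, 20, 5):.1f} hours")
    # 20 TB inside a hypothetical 48-hour window
    print(f"{required_aggregate_mb_s(20, 48):.1f} MB/s")
```

Under those assumptions the fulls would need about 55.6 hours, while fitting 20TB into 48 hours needs roughly 115.7 MB/s in aggregate, which is about what a single isolated stream manages on a quiet appliance.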
I've been sweating over this for weeks. We've been updating Brocade switch firmware, swapping out fibre and playing with buffer sizes/numbers, none of which has made any great difference. We have even had a Symantec appliance engineer on site to check things out (array batteries, hardware errors etc.), which was interesting but largely unproductive. Symantec support over email/phone has been beyond disappointing.
After this weekend's slowness, I'm almost certain that it is the AIR replication and tape duplications creating heavy I/O and severely impacting the backup write performance. I had considered this before but was kind of under the illusion that the appliance was built to handle more than I could probably throw at it. I'd say there are probably no more than 20 or so backup streams hitting it at any one time, plus the 6 tape duplications, plus the replications to the remote master. What I saw was a considerable increase in backup throughput when the SLP was suspended and the dups/reps cancelled, and then steady degradation when they were enabled again.
I'm kind of surprised and massively disappointed by this finding. I note a lot of reported issues with slow rehydration to tape, and indeed I suffered this on our Windows media servers with SAN-based MSDPs. In the case of my appliances the rehydration performs great... just at the expense of seriously slow backups!
My question: has anyone experienced similar on the appliances, and how do you handle the I/O? Should I be limiting I/O to the disk pool? What would be the recommended setting?
Any advice greatly appreciated.
Thanks for your time.
Ed
Just to update and close off the post.
Thanks for the helpful info. A number of suggestions have been used and the combination has resulted in a nicely performing environment.
I used the SLP tuning advice, largely taking the same settings, which seem to work well for our environment and keep activity much more organised in the monitor, which I like.
I've suspended SLP processing for the first 8 hours of our full backup window and then start it via a scheduled task. The bulk of the backups complete in this window, so there is very little read/write contention on the disks.
I've used an I/O limit of 35 on the dedupe disk pool. This figure seems optimal for our environment and ensures no I/O overload on the array.
I've used query-based selections and limits of 2 backups per datastore and 15 per ESX host.
I upgraded to v2.5.2.
I used a buffer number of 64.
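For anyone wanting to copy the SLP hold approach above, the timing decision can be sketched as below. This is only an illustration of the logic, not the actual scheduled task: the window start time and the function name are hypothetical, and the real suspend/resume command is deliberately left out, so check the NetBackup commands reference for that part.

```python
from datetime import datetime, timedelta

HOLD_HOURS = 8  # SLP processing stays suspended for the first 8 hours


def slp_should_be_suspended(now: datetime,
                            window_start: datetime,
                            hold_hours: int = HOLD_HOURS) -> bool:
    """True while 'now' falls inside the hold period at the start of the
    full backup window; False before the window and once the hold ends."""
    hold_end = window_start + timedelta(hours=hold_hours)
    return window_start <= now < hold_end


if __name__ == "__main__":
    # Hypothetical window start: Friday 18:00
    start = datetime(2013, 6, 7, 18, 0)
    for hour_offset in (0, 4, 8, 12):
        moment = start + timedelta(hours=hour_offset)
        print(moment, slp_should_be_suspended(moment, start))
```

A scheduled task could run a check like this and resume SLP processing the first time it returns False, so duplications and replications only start once the bulk of the backups has finished.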
We now have a backup environment that is running nicely within the backup window. We have now moved our 2nd appliance to our second datacentre and use a dedicated 300MB link for the replication traffic, and AIR is working fantastically for our 20TB of production data. The real win is that we have now ceased the bulk of daily tape, so we only need to worry about duplications on a monthly basis, and thanks to AIR we are effectively offsiting within a few hours (how things have come on!).
It wasn't easy, but I'm really happy I got there in the end, and the people on this forum are worth 100 x Symantec support.
Thank you!
Ed