Forum Discussion

Ed_Carter
Level 4
12 years ago

5220 Slow Backup Performance

Hi everyone.

We recently invested in two 5220 36TB appliances to replace the 7.1 Windows media servers we were using. We are moving to (almost) tapeless backups and the appliances seemed a good bet. We will be running the two in alternate sites using AIR and backing up 95% VADP over SAN. We are currently in a transitional period where we are seeding the 5220 that will eventually be based in the remote site, so it is sitting alongside the other appliance in the same datacentre. We are still writing all backups to 6 x LTO-3 drives and need to continue this until we physically move the 2nd appliance. AIR is in operation and both appliances are currently protecting around 170TB.

NBU version is 7.5.5 on the masters and 2.5.1 on the 5220s.

If I run an isolated backup (i.e. no other activity on the appliance) it flies, and I mean seriously flies... I'm getting in excess of 120MB/s in some cases.

Our nightly cumulative incrementals (CINCs) also run OK. Nowhere near the above, but acceptable throughput nevertheless.

The issue is our full backups at the weekend. We have 420 VMs to process (approx. 20TB of data). I'm lucky if I get more than 5MB/s over SAN, and the backup window is really starting to creak... We do use query-based VM selection and a maximum of 2 connections per datastore.

I've been sweating over this for weeks. We've been updating Brocade switch firmware, swapping out fibre and playing with buffer sizes/numbers, none of which has made any great difference. We've even had a Symantec appliance engineer on site to check things out (array batteries, hardware errors etc.), which was interesting but largely unproductive. Symantec support over email/phone has been beyond disappointing.

After this weekend's slowness, I'm almost certain that it is the AIR replication and tape duplications creating heavy I/O and severely impacting the backup write performance. I had considered this before, but was kind of under the illusion that the appliance was built to handle more than I could probably throw at it. I'd say there are probably no more than 20 or so backup streams hitting it at any one time, plus the 6 tape duplications, plus the replications to the remote master. What I saw was a considerable increase in backup throughput when the SLP was suspended and the dups/reps cancelled, and then steady degradation once they were re-enabled.

I'm kind of surprised and massively disappointed by this finding. I note a lot of reported issues with slow rehydration to tape, and indeed I suffered this on our Windows media servers with SAN-based MSDPs. In the case of my appliances the rehydration performs great... just at the expense of seriously slow backups!

My question: has anyone experienced similar on the appliances, and how do you handle the I/O? Should I be limiting I/O to the disk pool? What would be the recommended setting?

Any advice greatly appreciated.

Thanks for your time..

Ed

  • Loading both read and write I/O onto the same spindles will hit performance. I would spread the load across the entire week instead of only at the weekend. That may disrupt the backup plan you made, but there is a limit you need to take into account. There are scripts that can control when SLPs are enabled/disabled; use them to kick off SLP processing after the backups have run, e.g. 04:00 in the morning.
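
    A rough sketch of that idea - assuming the SLP is called SLP_Daily_VM (substitute your own SLP name) and that nbstlutil is in the usual admincmd path; adjust the times to your own backup window:

    # root crontab on the master: suspend SLP duplication/replication before the backup window (e.g. 18:00)
    0 18 * * * /usr/openv/netbackup/bin/admincmd/nbstlutil inactive -lifecycle SLP_Daily_VM
    # resume SLP processing once the bulk of the backups have finished (e.g. 04:00)
    0 4 * * * /usr/openv/netbackup/bin/admincmd/nbstlutil active -lifecycle SLP_Daily_VM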

  • Ed,

    Look at performance tuning for your SLPs. Running six duplications at once appears to be a bit excessive, especially when you are running replication at the same time. This link should help:

    http://www.symantec.com/business/support/index?page=content&id=HOWTO33715

    Look at the MIN_GB_SIZE_PER_DUPLICATION_JOB and MAX_GB_SIZE_PER_DUPLICATION_JOB parameters. They are set low by default and need to be turned up to cut down on the number of duplication jobs that run. This is what we use, and we do not move nearly as much data in one evening:

    MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 60
    MIN_GB_SIZE_PER_DUPLICATION_JOB = 128
    MAX_GB_SIZE_PER_DUPLICATION_JOB = 256
    DUPLICATION_SESSION_INTERVAL_MINUTES = 60

    You can see how this will cut down on your overhead greatly.
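
    For reference, on NBU 7.5 these parameters normally go in the LIFECYCLE_PARAMETERS file on the master server. The path and layout below are the commonly documented defaults (one parameter and value per line); check the HOWTO above for the exact syntax on your version:

    /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS (create the file if it does not exist):
    MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 60
    MIN_GB_SIZE_PER_DUPLICATION_JOB 128
    MAX_GB_SIZE_PER_DUPLICATION_JOB 256
    DUPLICATION_SESSION_INTERVAL_MINUTES 60

    nbstserv should pick the changes up on its own, but restarting the NetBackup services on the master is the safe option if in doubt.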

  • My first guess would have been the BIOS battery, but it sounds like that has been checked out. Next would be a disk failure, but again it sounds like that has been eliminated too.

    One other thing worth looking at is the queue processing / rebasing / compaction. By default it runs at midnight and midday, which always seems odd to me: at midnight the systems tend to be pretty busy, so it has an impact on performance, and then it never processes as much of the queue as you would like, which makes it worse the next time. The queue processing is controlled somewhere (not sure where, but possibly in contentrouter.cfg or a cron job), and my feeling is that it would be better to run it not only at different times (to avoid midnight) but also more frequently, so that it keeps the queue right down and makes the appliance run better. It may be worth adding a cron job to run it every two or three hours - sounds extreme, but it will keep everything lean and running well.

    The other thing that may be causing your issues is the backup, replication and duplication all running together and hitting bottlenecks, so maybe release a few more threads for yourself (we do this by default if we are using Accelerator backups). Again in the content router configuration (/disk/etc/puredisk/contentrouter.cfg), look for the worker threads section and increase the number of threads from 64 to 128. This needs a NetBackup restart or reboot to register the change on the appliance.

    Finally... there is a memory leak in all versions apart from 2.5.2 of the appliances when doing VMware backups - supposedly only if you have multiple paths to the datastores mapped. To see if this affects you, run the following (this is a single-line command!):

    for semid in `ipcs -s | awk '/^0x/ {if ($1=="0x00000000") print $2}'`; do ipcs -s -i $semid; done | egrep -v "semaphore|uid|mode|nsems|otime|ctime|semnum" | awk '{if ($5 != "0" && $5 != "") print $5}' | uniq | xargs ps -p

    If it outputs orphaned memory semaphores then you are affected. As long as it is not a Master Server appliance you can deal with these by adding the following to the crontab (crontab -e):

    */15 * * * * /usr/bin/ipcs -s | grep 0x00000000 | awk '{print $2}' | while read sem; do /usr/bin/ipcrm -s $sem; done

    Don't use this on a Master Appliance as it will take EMM down! This leak is fixed in 2.5.2, I am told.

    Hope some of this helps.
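
    A rough sketch of the queue-processing cron idea - assuming crcontrol is in its usual appliance location and that every three hours suits your environment; adjust the schedule to taste:

    # root crontab entry on the appliance: kick off MSDP queue processing every three hours
    0 */3 * * * /usr/openv/pdde/pdcr/bin/crcontrol --processqueue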
  • Hi Ed, 

    Can you contact me offline with the Support case number? I shall ask one of the SMEs to look at it and get back to you.

  • Thank you all very much for your really useful comments. I will digest the info and be sure to post back with my findings.

    Abdul - I have PM'd you the info.

    Cheers,
    Ed
  • Just to update and close off the post.

    Thanks for the helpful info. A number of the suggestions have been used, and the combination has resulted in a nicely performing environment.

    I used the SLP tuning advice, largely taking the same settings, which seem to work well for our environment and keep activity much more organised in the Activity Monitor, which I like.

    I've suspended SLP processing for the first 8 hours of our full backup window; it then starts via a scheduled task. The bulk of the backups complete in this window, so there is very little read/write contention on the disks.

    I've set an I/O limit of 35 on the dedupe disk pool. This figure seems optimal for our environment and ensures no I/O overload on the array.

    I've used query-based selections and limit backups to 2 per datastore and 15 per ESX host.

    I upgraded to v2.5.2.

    I used a buffer number of 64.
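
    (For anyone copying this: the buffer number is the standard NetBackup data-buffer touch file on the appliance/media server. Roughly, assuming the default install path:)

    echo 64 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS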

    We now have a backup environment that is running nicely within the backup window. We have now moved our 2nd appliance to our second datacentre and use a dedicated 300MB link for the replication traffic, and AIR is working fantastically for our 20TB of production data. The real win is that we have now ceased the bulk of the daily tape, so we only need to worry about duplications on a monthly basis, and thanks to AIR we are effectively offsiting within a few hours (how things have come on!).

    It wasn't easy, but I'm really happy I got there in the end, and the people on this forum are worth 100x Symantec support.

    Thank you!

     

    Ed