
SSR 2011 offsite copy hangs at 2%, 5%

mdwhome
Level 3

Hi there,

 

We are currently running SSR 2011, backing up 11 HP ProLiant ML350, IBM eSeries, and Dell PowerEdge servers to a local LaCie 10TB 5big Network 2 NAS device, with an offsite copy over a 10 Mbit Metro-E fiber link to a duplicate NAS device at a remote branch. Six of the 11 servers back up locally and offsite fine; the remaining five do not complete the offsite copy, but do complete their incremental recovery points successfully each day to the local NAS.

The partitions vary from multiple logical drives using RAID 5 to a single mirrored partition on each server. The failures do not appear consistent across similar ML generations with similar partitions/RAID levels.

There are multiple discussions that dance around this issue. Before we resort to moving the offsite array on-site, performing the backups during the day, and taking the offsite device away each night, we want to try one more thread and see if any progress has been made resolving this.

Thanks!

ACCEPTED SOLUTION

mdwhome
Level 3

Resolved!

It's all in the schedule... bottom line is don't try to take too many base and incremental recovery points. We were stuck in a constant loop of offsite "catch-up". Each morning we would come into work and have to restart the Symantec System Recovery service because our 10M-to-100M Metro-E pipe was saturated, bogging down application access for our branches, which are all connected via T1s but need to reach the Citrix apps and other databases across that 10M pipe. When we moved the offsite array to the same location as the primary array, we also adjusted the base recovery points and stopped the incrementals completely. When that failed with the same 3% hang, we analyzed the logs and found there was too much traffic on the arrays; the offsite copies could never complete, so we could not remove the offsite array.

We scaled the base recovery points back to two per week, staggered the servers so that no more than two were running each night, and scheduled the non-base nights to take their incremental points at staggered times. The result was hundreds of incremental recovery points being removed from both arrays, and THEN the remaining recovery points created their corresponding offsite copies. Once we noticed it was "self-healing", we left the arrays together in the same location overnight. Now all incremental recovery points are again completing their backups and offsite copies within minutes.
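The staggering described above is essentially round-robin assignment with a per-night cap. Here is a minimal Python sketch of that idea; the eleven servers and the two-per-night limit come from this thread, while the server names and night labels are purely illustrative:

```python
from itertools import cycle

def stagger(servers, nights, max_per_night=2):
    """Round-robin servers across nights so no night runs more than
    max_per_night base backups at once."""
    schedule = {night: [] for night in nights}
    night_cycle = cycle(nights)
    for server in servers:
        # advance through the nights until one has a free slot
        for _ in range(len(nights)):
            night = next(night_cycle)
            if len(schedule[night]) < max_per_night:
                schedule[night].append(server)
                break
    return schedule

servers = [f"srv{i:02d}" for i in range(1, 12)]   # the 11 servers in this thread
nights = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
plan = stagger(servers, nights)
print(plan)
```

With eleven servers over six nights and a cap of two, every night carries at most two base backups, which is exactly the load-spreading the fix relied on.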

The service was not really hanging; it was just trying to reconcile recovery points with the offsite array. But so many copies had failed, because we kept stopping and restarting the service, that the offsite jobs could never finish the previous copies in addition to the previous night's.

Next step is to move the offsite array permanently to the 100M location and monitor. Worst case is we move the array back to the primary location daily and remove it at COB.

Thanks for the suggestions and links; I hope this helps others in similar situations.


14 REPLIES

Markus_Koestler
Moderator

Have you created a support call for this issue? Are you running SSR 2011 SP2?

mdwhome
Level 3

Hi Markus,

Yeah, I tried to create a case for this, but apparently we are not covered by a service agreement; the licenses were purchased through a VAR.

We are running version 10.0.2.44074; it does not say SP2. Have you heard of the conditions I mentioned?

Markus_Koestler
Moderator

Yep, this is SP2. And if you have a look around these forums, there are a couple of issues related to offsite copy, yes.

TRaj
Level 6
Employee Accredited

Hi mdwhome,

Do you receive any errors while backing up to the NAS on the servers whose offsite copy fails?

Thanks

Markus_Koestler
Moderator

Were you able to resolve the issue?

mdwhome
Level 3

Sorry it's taken so long to respond; I missed the comment from Tripti.

No errors are reported backing up to the offsite NAS, and the local NAS works fine for all local servers. The issue just affects a few servers, all running at least W2K3 SP2 (some are R2, but not all). All but one are HP ProLiant ML350 4th and 5th generation; the last one is an IBM eSeries.

I think at this point there are two ways to try to resolve the issue. The first is to bring the "offsite" NAS device on-site and run the backups during the day, removing the offsite array each evening. Or, more radically, remove and reinstall SSR from the problem servers, re-create the jobs, and rebuild the recovery points.

 

Markus_Koestler
Moderator

Please let us know the outcome.

TRaj
Level 6
Employee Accredited

Hi mdwhome,

I would suggest going with the first option. You can also refer to: http://www.symantec.com/docs/HOWTO48460

mdwhome
Level 3

Here is something. I moved the offsite array on-site, and the servers whose data partition fails to copy offsite are copying extremely slowly, taking hours, while the better-behaved servers finish their offsite copies in seconds or minutes. I'm talking gigabytes in minutes. This particular server that takes so long on the offsite copy backs up over 100GB to the on-site array in minutes.

Any thoughts?
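For context on the timescales in this post, simple bandwidth arithmetic shows why an offsite copy over the 10 Mbit link described earlier can take the better part of a day per server even under ideal conditions. A rough sketch (the ~100GB figure is from this post; the link speeds are the nominal figures mentioned in the thread, and protocol overhead is ignored):

```python
def transfer_hours(gigabytes, link_mbps):
    """Ideal transfer time for a payload over a link, ignoring protocol overhead."""
    bits = gigabytes * 1e9 * 8          # decimal GB -> bits
    seconds = bits / (link_mbps * 1e6)  # link speed in megabits per second
    return seconds / 3600

# ~100 GB recovery point over the 10 Mbit/s offsite link:
print(round(transfer_hours(100, 10), 1))    # roughly 22.2 hours at best
# the same payload at 100 Mbit/s:
print(round(transfer_hours(100, 100), 1))   # about 2.2 hours
```

So even a single 100GB base recovery point saturates a 10 Mbit pipe for nearly a full day, which is consistent with the nightly "catch-up" loop described later in the thread.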

Markus_Koestler
Moderator

Hm, have you tried the SSR 2011 performance registry keys?

TRaj
Level 6
Employee Accredited

You can also check the ports.

 

mdwhome
Level 3

Thanks for sticking with us on this issue!

I'll look for the performance registry keys, apply them, and we'll see. We are charting each server on a spreadsheet to gauge start-to-complete times, and which ones succeed and fail with both arrays on-site, while taking one array offsite each day.

Tripti, can you elaborate on which ports we can check?

 

Thanks again.
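The start-to-complete charting mentioned above can also be automated rather than done by hand. A minimal sketch, assuming a hypothetical CSV layout of server, start, and end timestamps (the column names, format, and sample values are illustrative, not anything SSR produces):

```python
import csv
import io
from datetime import datetime

def job_durations(csv_text):
    """Return {server: duration in minutes} from rows of server,start,end timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    durations = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = datetime.strptime(row["start"], fmt)
        end = datetime.strptime(row["end"], fmt)
        durations[row["server"]] = (end - start).total_seconds() / 60
    return durations

sample = """server,start,end
srv01,2012-05-01 22:00,2012-05-01 22:07
srv02,2012-05-01 22:00,2012-05-02 03:45
"""
print(job_durations(sample))   # srv01 finishes in minutes; srv02 runs for hours
```

Comparing the per-server durations this way makes the well-behaved versus slow servers stand out immediately.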

Markus_Koestler
Moderator

These ports: http://www.symantec.com/business/support/index?page=content&id=TECH54862
