Disk staging to tape causes status 196 on regular policies


Weekly full backups are staged to an EMC Data Domain basic disk staging storage unit.  Staging to a 6-drive LTO4 robot is triggered by a schedule at 3 PM, and it takes longer than the scheduled window to complete.  Tape-only policies scheduled at 11 PM fail with status 196 because all drives are occupied by the staging.  How can I ensure that staging has a lower priority than regular backup policies?  Is this recommended?   I want to avoid tweaking the policy priorities, which are all at 0.  The staging schedule has a priority of 90000, which makes no sense to me.


The bptm logs on the media server do show Kbytes/sec for every copy 2 job, and I see a lot of variation, so I calculated the average Kbytes/sec over ~54 jobs for this week's and last week's duplications.  I get a 25% increase in speed from adjusting the buffer sizes:

grep "successfully wrote backup" log.112117 | grep "copy 2" | awk '{ sum += $17 } END { if (NR > 0) print sum / NR }'
grep "successfully wrote backup" log.111617 | grep "copy 2" | awk '{ sum += $17 } END { if (NR > 0) print sum / NR }'

NR = 53 this week and 54 last week

That's an average of 78 MB/s vs 62 MB/s when duplicating from Data Domain disk to LTO4 tape.
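For anyone who wants to reproduce the measurement, here is a self-contained version of the same grep/awk pass run against fabricated sample lines (the layout mimics my real bptm entries; field 17 is the Kbytes/sec figure, as in the one-liners above), with a KB-to-MB conversion and a job count added:

```shell
# Fabricated sample lines with the same field layout as the real bptm
# entries (hypothetical values; field 17 is the Kbytes/sec figure).
printf '%s\n' \
'00:00:00.000 [1234] <2> write_backup: successfully wrote backup id hostA_1, copy 2, fragment 1, 1000000 Kbytes at 81920.000 Kbytes/sec' \
'00:00:01.000 [1234] <2> write_backup: successfully wrote backup id hostA_2, copy 2, fragment 1, 1000000 Kbytes at 77824.000 Kbytes/sec' \
> log.sample

# Same pipeline as above, plus sum/n in MB/s and the number of jobs averaged.
avg=$(grep "successfully wrote backup" log.sample | grep "copy 2" |
      awk '{ sum += $17; n++ } END { if (n) printf "%.1f MB/s over %d jobs", sum/n/1024, n }')
echo "$avg"   # 78.0 MB/s over 2 jobs
```

Against the real logs you would point this at log.112117 / log.111617 instead of log.sample.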

The media server is running out of memory, and the CPU load has increased from 9 yesterday to 15 today.  Duplications are slow, and backups to the Data Domain are running slow too, at 10 MB/s.

free -m
             total       used       free     shared    buffers     cached
Mem:         64386      63867        519          0        468      59777
-/+ buffers/cache:       3621      60765
Swap:        16383          4      16379
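Side note on reading that free output: on these older kernels the Mem "used" column includes page cache, so the "-/+ buffers/cache" row is the one that reflects real application usage. A quick way to pull that row out (run here against the pasted output so it's reproducible):

```shell
# The `free -m` output pasted above, stored so the parse is reproducible.
free_out='             total       used       free     shared    buffers     cached
Mem:         64386      63867        519          0        468      59777
-/+ buffers/cache:       3621      60765
Swap:        16383          4      16379'

# On the Mem row, "used" counts page cache; the -/+ buffers/cache row
# shows what applications actually consume vs. what is reclaimable.
real=$(printf '%s\n' "$free_out" |
       awk '/buffers\/cache/ { printf "app used: %d MB, available: %d MB", $3, $4 }')
echo "$real"   # app used: 3621 MB, available: 60765 MB
```

So only ~3.6 GB is genuinely in use by processes; the rest is cache the kernel can drop.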


top - 10:51:53 up 189 days, 20:05, 1 user, load average: 14.33, 14.62, 14.70
Tasks: 349 total, 2 running, 346 sleeping, 0 stopped, 1 zombie
Cpu(s): 5.0%us, 11.0%sy, 0.0%ni, 18.6%id, 65.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 65932232k total, 65380156k used, 552076k free, 479428k buffers
Swap: 16777208k total, 4712k used, 16772496k free, 61277148k cached
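The Cpu(s) line from top above is also telling: most of the "load" is I/O wait rather than compute. Extracting the %wa figure from the pasted line:

```shell
# The Cpu(s) summary line pasted from top above.
cpu_line='Cpu(s): 5.0%us, 11.0%sy, 0.0%ni, 18.6%id, 65.4%wa, 0.0%hi, 0.0%si, 0.0%st'

# Walk the fields and strip the %wa suffix off the iowait percentage.
wa=$(printf '%s\n' "$cpu_line" |
     awk '{ for (i = 1; i <= NF; i++) if ($i ~ /%wa/) { sub(/%wa,?/, "", $i); print $i } }')
echo "$wa"   # 65.4 -- roughly two thirds of CPU time spent waiting on I/O
```

A load average of ~14 with 65% iowait points at saturated disk/tape I/O, not a CPU shortage.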

Some observations:

Disk-to-tape duplication has been running for the past few days.  Duplication speed averages 70 MB/s.  This is fine until the concurrent to-disk incrementals kick in at 8 PM; at that point, duplication speeds fall drastically to 10 MB/s.  Note that nightly to-disk backups are limited to 8 jobs per policy, and these run at 10 MB/s.  From that point on, all jobs (duplications and incrementals) run at 10 MB/s or less, the media server's CPU load climbs above 15, and the server runs out of memory.

I've had to cancel the 100 queued duplications and all active to-disk backups in order for the CPU load to fall.

My conclusion is that the buffer size adjustment has had a negative effect; after 40 hours of duplications, there were still 100 or more jobs queued (50%), whereas last week's duplications had completed by this time.

So I'm running out of ideas.  If I re-enable duplications, my system will quickly become overwhelmed.  I don't see a way out of this... If only duplications could be limited to X jobs for Y hours, instead of running to completion for 40 or more hours and eating up all system resources.
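One idea I'm toying with (a sketch only, not something I've run in production): cron a script that counts queued duplication jobs via bpdbjobs and cancels the overflow beyond a cap, automating what I did by hand above. The -all_columns field positions and the type/state codes used here (jobtype 4 = duplication, state 0 = queued) are assumptions from my notes -- verify them against the bpdbjobs documentation for your NetBackup release before trusting this. To keep the parsing testable, the filter reads the report from stdin:

```shell
# Print job IDs of queued duplication jobs beyond a cap, reading
# bpdbjobs -report -all_columns style CSV on stdin (jobid,jobtype,state,...).
# ASSUMPTION: jobtype 4 = duplication, state 0 = queued -- check the
# bpdbjobs documentation for your release before using this for real.
overflow_dup_jobs() {
  awk -F, -v cap="$1" '$2 == 4 && $3 == 0 { if (++n > cap) print $1 }'
}

# Demo with fabricated report lines and a cap of 2 queued duplications:
# jobs 101, 103, 105 are queued duplications, so 105 is the overflow.
extra=$(printf '%s\n' \
  '101,4,0,x' '102,0,0,x' '103,4,0,x' '104,4,1,x' '105,4,0,x' |
  overflow_dup_jobs 2)
echo "$extra"   # 105

# In production, something like (untested):
#   bpdbjobs -report -all_columns | overflow_dup_jobs 20 |
#     while read -r id; do bpdbjobs -cancel "$id"; done
```

That would at least stop the queue from growing unbounded while I figure out the real fix.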