Re: Backups are getting so tight....

Alex-B · ‎07-11-2017

I need a bit of help playing "Tetris" with my backups... With more and more machines to back up in my environment, I'm getting more and more 196 errors (Backup window expired). With multiplexing settings, I know there are only so many concurrent backup streams that I can run at once. Question: What command can I run that would display, in real time, how many streams are active? Also, I'd love a command that would tell me how many tape drives are handling jobs (how many drives are spinning at any given moment). I would use this to figure out where I have 'lulls' in my schedule, so I can move a policy into that gap and fill it. This way, I'm keeping all my drives spinning and using my resources properly. Thanks!

sclind · ‎07-11-2017

I run this once an hour to count the drives in use (we have two silos):

day=$(date "+%Y_%m_%d_%H")

# l700 usage (tape labels are T followed by five digits)
countl700=$(/usr/openv/volmgr/bin/vmoprcmd|grep ULT|grep -e 'T[0-9]\{5\}'|wc -l|awk '{print $1}')
# ts3310 usage (tape labels are R followed by five digits)
countts3310=$(/usr/openv/volmgr/bin/vmoprcmd|grep ULT|grep -e 'R[0-9]\{5\}'|wc -l|awk '{print $1}')

echo $day, $countl700, $countts3310 >> /tmp/tape_drive_usage

(T and R are the tape prefixes on each silo).

So I get output like this:

2017_07_11_12, 5, 1
2017_07_11_13, 5, 1
2017_07_11_14, 4, 3
2017_07_11_15, 6, 1

I then also run this:

day=$(date "+%Y_%m_%d_%H")

# Set the field separator with -F so it also applies to the first record.
/usr/openv/netbackup/bin/admincmd/bpdbjobs -report -verbose -most_columns -ignore_parent_jobs | \
awk -F, '$3 == "1" {print $3, $5, $45}' | sort -k 3 > /tmp/active_job_tapes

/usr/openv/volmgr/bin/vmoprcmd |grep Yes|awk '{print $1,$4,$5}' | sort -k 3 > /tmp/active_tape_drives

join -1 3 -2 3 /tmp/active_job_tapes /tmp/active_tape_drives | awk '{printf "%-8s %-20s %-20s\n", $1,$3,$4}' | while read x ; do
echo $day, $x >> /tmp/active_jobs_drives_and_tapes
done

which gives me output like:

2017_07_11_14, R20266 ORA_psoltp IBM.ULT3580-TD4.005
2017_07_11_14, R20354 ORA_psoltp IBM.ULT3580-TD4.007
2017_07_11_14, R20417 STD_DAILY_ALXPDB01 IBM.ULT3580-TD4.000
2017_07_11_14, R20417 STD_DAILY_ALXPDB02 IBM.ULT3580-TD4.000
2017_07_11_14, R20417 STD_DAILY_AUXPDB07 IBM.ULT3580-TD4.000
2017_07_11_14, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_14, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_14, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_14, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_14, T21235 SAP_DNP HP.ULTRIUM4-SCSI.004
2017_07_11_14, T21348 SAP_DCS HP.ULTRIUM4-SCSI.000
2017_07_11_14, T21348 SAP_DCS HP.ULTRIUM4-SCSI.000
2017_07_11_14, T21348 SAP_DCS HP.ULTRIUM4-SCSI.000
2017_07_11_14, T21348 SAP_DCS HP.ULTRIUM4-SCSI.000
2017_07_11_15, R20417 STD_DAILY_ALXPDB01 IBM.ULT3580-TD4.000
2017_07_11_15, R20417 STD_DAILY_ALXPDB02 IBM.ULT3580-TD4.000
2017_07_11_15, T21053 SAP_PBW HP.ULTRIUM4-SCSI.001
2017_07_11_15, T21053 SAP_PBW HP.ULTRIUM4-SCSI.001
2017_07_11_15, T21053 SAP_PBW HP.ULTRIUM4-SCSI.001
2017_07_11_15, T21053 SAP_PBW HP.ULTRIUM4-SCSI.001
2017_07_11_15, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_15, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_15, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_15, T21073 SAP_PBW HP.ULTRIUM4-SCSI.013
2017_07_11_15, T21235 SAP_DFG HP.ULTRIUM4-SCSI.008
2017_07_11_15, T21235 SAP_QNP HP.ULTRIUM4-SCSI.008
2017_07_11_15, T22066 SAP_PBW HP.ULTRIUM4-SCSI.010
2017_07_11_15, T22066 SAP_PBW HP.ULTRIUM4-SCSI.010
2017_07_11_15, T22066 SAP_PBW HP.ULTRIUM4-SCSI.010
2017_07_11_15, T22066 SAP_PBW HP.ULTRIUM4-SCSI.010
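
To spot lulls, the hourly log could be rolled up by hour of day. This is only a sketch, assuming the /tmp/tape_drive_usage format shown earlier in this post (timestamp, then one busy-drive count per silo). The function name summarize_drive_usage is made up for illustration.

```shell
#!/bin/sh
# Summarise the hourly drive-count log by hour of day.
# Input lines look like: 2017_07_11_14, 4, 3
#   (timestamp, busy drives in silo 1, busy drives in silo 2)
# Output lines: hour, number of samples, average busy drives (both silos combined)
summarize_drive_usage() {
    awk -F', *' '
        NF >= 3 {
            n = split($1, stamp, "_")   # last piece of the timestamp is the hour
            hour = stamp[n]
            total[hour] += $2 + $3
            samples[hour]++
        }
        END {
            for (hour in total)
                printf "%s %d %.1f\n", hour, samples[hour], total[hour] / samples[hour]
        }
    ' "$@" | sort -n
}
```

Run it as `summarize_drive_usage /tmp/tape_drive_usage` after a week or two of samples; hours with a low average are candidates for moving a policy into.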

Genericus · ‎07-13-2017

I found that it only took a few slow systems to totally gum up my drives; multiplexing only gets you so far.

I kept running into issues with drives getting assigned to slow backups, so I ended up moving to an intermediary disk appliance, and I am quite happy now.

Although Veritas has some nice features, we selected Data Domain for the throughput. I have three Data Domains with over 300 "drives" on each one, and I beat the daylights out of them. I am sending over 3600 MB/sec maximum combined ingest to them, and sending out to 18 LTO5 drives at the same time.

I had to go with VTL since we use mostly Fibre Channel. I am starting to use 10G and Boost/Accelerator; when it works well, it screams.

NetBackup 9.1.0.1 on Solaris 11, writing to Data Domain 9800 7.7.4.0
duplicating via SLP to LTO5 & LTO8 in SL8500 via ACSLS

X2 · ‎07-13-2017

While @Genericus has a good solution working for him, @Alex-B, you should give us a better picture of your environment first. Not everyone would go and buy a DD or similar solution because tape drives are getting busy. Even though a disk storage unit like DD over FC would give very good performance, it shouldn't be the first solution to try.

You need to find out where the bottleneck is. Are the systems writing directly to the tape drives? What speeds are you getting? Is it line speed or less than expected?

If your systems are not able to give you enough data even after multiplexing, why not set up a disk staging unit, from which you can be sure of getting the required speed?

The above are only general suggestions. Other experienced users can give better advice once we know more about your environment.

PS: I have a DD9800 as my disk storage solution. SLPs are used to write to tape from some systems. The tape drives are LTO7 and they just fly.

Genericus · ‎07-14-2017

I agree, X2, we have to work with what we have. You are on the money with the concept of determining the bottleneck and finding ways to overcome it.

My issue was I was not sending data to tape fast enough, and I had drives locked for long periods writing slow backups. Sending data to disk first eliminated the drive locks, and the disk array sends to tape much faster - in fact, I was using LTO2 drives because we could not send data fast enough - now I am driving LTO5, so everything is faster.

Break your process down to steps and figure out how you can improve those steps.


Will_Restore · ‎07-17-2017

OpsCenter has two graphical reports which can be useful to visualize library activity.

Tape Drive Throughput

Tape Drive Utilization

Marianne · ‎07-18-2017

Seems @Alex-B has not been back since posting a week ago...

Since you are on a Windows master server, the easiest would be to use the GUI to count active jobs - use the sort or filter functions. (I like to sort by job type, then by job id, highest first.) This will show Queued jobs at the top, then Active jobs.
The command equivalent is bpdbjobs. Pipe it to findstr to extract the Active jobs.

For drive usage, use Device Monitor or vmoprcmd -d from cmd.

These actions will need someone to be present to monitor all the time....
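
The counting does not have to be done by hand. Here is a rough sketch that samples the active-job count on a schedule. It assumes a UNIX-style shell and install path (on a Windows master the path differs and the same idea would need translating), and it relies on field 3 of the -most_columns output being 1 for active jobs, as in the earlier script in this thread. The function names and the log path are made up for illustration.

```shell
#!/bin/sh
# Count active jobs from `bpdbjobs -report -most_columns` output on stdin.
# Assumption: field 3 is the job state and "1" means active.
count_active_jobs() {
    awk -F, '$3 == "1" { n++ } END { print n + 0 }'
}

# Append one timestamped sample to a log; meant to be called from cron.
log_active_jobs() {
    logfile=${1:-/tmp/active_job_count}    # hypothetical log path
    stamp=$(date "+%Y_%m_%d_%H_%M")
    count=$(/usr/openv/netbackup/bin/admincmd/bpdbjobs -report -most_columns -ignore_parent_jobs | count_active_jobs)
    echo "$stamp, $count" >> "$logfile"
}
```

Calling log_active_jobs every few minutes from a scheduler builds up a history of concurrent streams without anyone having to watch the screen.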


Alex-B · ‎07-21-2017

Original poster here: Thanks for all the suggestions -- I can answer the question about what my system looks like, and what my 'problem-statement' is. My system is a Dell PowerEdge T710 server with about 48GB of RAM and an 8-core processor. NetBackup 7.1 is in use. The management server and media server are one and the same. I have four LTO4 drives in a Dell ML6020 library. I don't have OpsCenter installed, so I can't run any reports that require OpsCenter. My backup server is connected to clients via a 10Gb network, so there shouldn't be any bottlenecks there.

My problem-statement: I suspect that there are time periods overnight where I don't have all 4 tape drives spinning, meaning that I have unused resources. I need to know when those times are, so I can shuffle the backup windows around. Also, I'd need something automated that will show me which drives are spinning and when. I believe a suggestion was already made on that.

Thanks!

sdo · ‎07-24-2017

Alex, are you using multiple different pools and multiple different retentions?

Here's a tip... the best tape drive usage job occupancy levels I have ever seen... are... with one pool... and one retention... per backup session. Is that what you have?

Alex-B · ‎07-24-2017

OP here: Our backup jobs use two pools, "on-site" and "off-site", with one retention period for each: on-site is 30 days, and off-site is 2 weeks. These are dictated by policy (and by auditors), so we likely can't change them.

sdo · ‎07-25-2017

If all jobs use either:

- setA - onsite-30days

...or...

- setB - offsite-2weeks

...AND...

only one storage unit for all tape drives, then there is a possibility that at certain times of the night most or all tape drives are occupied by either setA or setB, leaving no tape drives free for the other set... e.g. setA kicks in with eight jobs, and with multiplexing you get an even spread of two jobs per tape drive, because jobs with the same retention multiplex together... and let's imagine four are long-running (l) and four are short-running (s), and the mix is: s+s d1, s+s d2, l+l d3, l+l d4... so only when the four short-running jobs end do two tape drives come free for other jobs and/or other retentions to do their own multiplexing.

So, maybe what you could do is... have two storage units pointing to the same set of 4 tape drives, but make each storage unit use at most two concurrent tape drives... and then point all setA jobs to stuA and all setB jobs to stuB... this way each pool+retention will always have two tape drives available.

You might need to up the multi-plexing a bit.
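
As a back-of-the-envelope check on that last point: if all of a set's jobs should run at once, the multiplexing level has to be at least the job count divided by the drives available to that set, rounded up. A tiny sketch follows (min_mpx is a made-up helper name, and the job counts are just the eight-job example from above):

```shell
#!/bin/sh
# Smallest multiplexing level that lets `njobs` concurrent streams
# fit on `ndrives` tape drives (integer ceiling of njobs / ndrives).
min_mpx() {
    njobs=$1
    ndrives=$2
    echo $(( (njobs + ndrives - 1) / ndrives ))
}

min_mpx 8 4    # eight jobs over all four drives
min_mpx 8 2    # the same eight jobs limited to two drives
```

So the eight setA jobs that fit at MPX 2 across four drives would need MPX 4 once stuA is capped at two drives, otherwise the extra jobs simply queue.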
