We are running NetBackup 7.0.1 on Windows Server 2008 x64.
Yesterday I decided to set up nightly Performance Monitor captures on my master server (which also serves as a media server) so that I could keep an eye on network and tape drive utilization. That would let me tweak multiplexing settings, client allocation, etc., and get feedback on whether I was keeping my tape drives streaming.
I discovered the "NetBackup Disk/Tape" counters and was happy: "Disk/Tape Write Bytes/Sec" seemed to be the perfect counter. However, the docs (Admin Guide for Windows II, 340106) say the following:
The System Monitor displays object instances when NetBackup reads or writes from the disk or tape. The read or write counters are updated depending on the type of NetBackup operation performed. The object instance is removed from the list once the NetBackup operation completes.
OK. So I kept starting and stopping jobs until I managed to get it to display all 4 tape drives in our library simultaneously (they show up as IBM.ULT3580-TD3.000, IBM.ULT3580-TD3.001, etc.). So far, so good. Then I went into Data Collector Sets (this is still in "Reliability and Performance Monitor") and created a new Data Collector Set under User Defined. I adjusted the directory, created a schedule to start it at 9 PM every night and run for 4 hours, etc. I created a single Data Collector for it that captures "Disk/Tape Write Bytes/Sec" on all four IBM.ULT3580-TD3.* instances as well as "Bytes Received/Sec" on each of the four GigE network ports. So far, so good.
I specified a sample interval of 1 second because testing showed that the NetBackup counters are unreliable under load. If you sample every second, you only miss about 1 read in 20. If you sample every 15 seconds, a single miss anywhere in that 15-second window causes a dropout, so reliability falls dramatically and you are left with spotty data. I adjusted the file name to include a time stamp, and even went into Data Manager for the Data Collector Set to define an action that purges data older than 30 days so it won't fill up my disk.
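To put rough numbers on that, here is a quick model. It is my own simplification, not anything from the NetBackup docs: assume each one-second read fails independently about 1 time in 20, and that a longer sample is lost if any second inside its window fails.

```python
# Assumption (my model, not from NetBackup docs): each 1-second counter read
# fails independently with probability p_miss, and a longer sample is lost
# if any one-second read inside its window fails.

def sample_survival(p_miss: float, interval_s: int) -> float:
    """Probability that a sample spanning interval_s seconds has no miss."""
    return (1.0 - p_miss) ** interval_s


P_MISS = 1 / 20  # "miss about 1 read in 20" at a 1-second interval

for interval in (1, 5, 15):
    usable = sample_survival(P_MISS, interval)
    print(f"{interval:>2}s interval: {usable:.0%} of samples usable")
```

Under those assumptions about 95% of 1-second samples survive but only about 46% of 15-second samples do, which is consistent with the spotty data at the longer interval.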
Feeling good. Then when I looked at the captured data this morning, there was nothing for the tape drives. Doh! It took me about 5 seconds to figure out what happened. The first backup started at 9 PM but didn't start writing to the tape drive until a minute or so later. The Data Collector Set triggered at 9 PM sharp, and when it started logging there were no "NetBackup Disk/Tape" instances, so it didn't connect to any. Sure, NetBackup created them later when it actually started writing, but the Data Collector Set didn't notice, so it collected no tape drive data.
Great. This is a royal pain. The only workaround I can think of for now is a real hack: work my fingers to the bone adding a gazillion Start entries to the Schedule (one every 5 minutes, say, from 9 PM until 12:55 AM, or for whatever period) and then set the Stop Condition to stop data collection after 5 minutes. If I set the Data Collector to Append to the log, I should be able to avoid ending up with a bunch of separate logs. But this is way ugly! Waiting until 9:05 PM to start the task doesn't work either, because there might be only one drive in use at that point, and the other drives won't come online until later. You could restart at five minutes past each hour if all of your drive status changes happen on the hour, but if even one of them happens on the half hour, you have to go to half-hour intervals, because the stop condition is a single elapsed time. Argh!
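Just to count how many entries that hack means, here is a throwaway helper (hypothetical, for arithmetic only; the entries would still have to be added to the Schedule tab by hand) that lists every start time in a window:

```python
from datetime import datetime, timedelta

TIME_FMT = "%I:%M %p"  # e.g. "09:00 PM"


def start_times(first: str, last: str, step_minutes: int) -> list[str]:
    """List start times from first to last inclusive, crossing midnight if needed."""
    t = datetime.strptime(first, TIME_FMT)
    end = datetime.strptime(last, TIME_FMT)
    if end < t:  # window crosses midnight, e.g. 9 PM -> 12:55 AM
        end += timedelta(days=1)
    times = []
    while t <= end:
        times.append(t.strftime(TIME_FMT))
        t += timedelta(minutes=step_minutes)
    return times


entries = start_times("09:00 PM", "12:55 AM", 5)
print(len(entries), entries[0], entries[-1])
```

9 PM to 12:55 AM in 5-minute steps works out to 48 Start entries. Half-hour steps cut that to 8, but a drive that comes online just after a restart then goes unlogged for up to 30 minutes.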
Has anyone else messed with data collection? Do I have to write my own collector to get what I want? Another approach would be to develop a service that handles scheduling the Data Collector Set. The service would enumerate the "NetBackup Disk/Tape" instances periodically (every 15 seconds or so) and automatically restart the Data Collector Set whenever it saw a change. That approach might actually be useful, because it could run 24x7 and leave the Data Collector Set running only while it sees "NetBackup Disk/Tape" instances. That way I would have logging every time something was actually happening, but I wouldn't be logging the rest of the time.
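Here is a rough sketch of that watcher, written as a plain polling loop rather than a real Windows service. Assumptions: `typeperf -qx "NetBackup Disk/Tape"` lists counter paths with the IBM.ULT3580-TD3.* instances only while NetBackup has a drive open; the set is controlled with `logman start` / `logman stop`; and "NBU Tape Capture" is a made-up name for the User Defined Data Collector Set.

```python
import re
import subprocess
import sys
import time
from typing import Callable, FrozenSet, Optional

OBJECT_NAME = "NetBackup Disk/Tape"
DCS_NAME = "NBU Tape Capture"  # hypothetical name of the User Defined Data Collector Set

# Matches the instance part of a counter path such as
# \NetBackup Disk/Tape(IBM.ULT3580-TD3.000)\Disk/Tape Write Bytes/Sec
_INSTANCE_RE = re.compile(r"NetBackup Disk/Tape\(([^)]*)\)\\")


def parse_instances(typeperf_output: str) -> FrozenSet[str]:
    """Return the distinct instance names found in 'typeperf -qx' output."""
    return frozenset(m.group(1) for m in _INSTANCE_RE.finditer(typeperf_output))


def list_instances() -> FrozenSet[str]:
    """Ask Windows which NetBackup Disk/Tape instances exist right now."""
    # typeperf -qx <object> prints the counter paths, with instances, for that object.
    # Assumption: with no drive in use there are no instances, so nothing matches.
    result = subprocess.run(["typeperf", "-qx", OBJECT_NAME],
                            capture_output=True, text=True, check=False)
    return parse_instances(result.stdout)


def logman(action: str) -> None:
    """Run 'logman start|stop <set>'; errors (e.g. already stopped) are ignored."""
    subprocess.run(["logman", action, DCS_NAME], check=False)


def watch(list_fn: Callable[[], FrozenSet[str]],
          start_fn: Callable[[], None],
          stop_fn: Callable[[], None],
          sleep_fn: Callable[[float], None] = time.sleep,
          poll_s: float = 15.0,
          max_polls: Optional[int] = None) -> None:
    """Restart the collector whenever the instance set changes; stop it when empty."""
    logged: FrozenSet[str] = frozenset()  # instances present when the collector last started
    polls = 0
    while max_polls is None or polls < max_polls:
        current = list_fn()
        if current != logged:
            if logged:
                stop_fn()   # collector is still bound to the old instance list
            if current:
                start_fn()  # restart so it connects to the instances that exist now
            logged = current
        polls += 1
        sleep_fn(poll_s)


# Only run the real loop on Windows, where typeperf and logman exist.
if __name__ == "__main__" and sys.platform == "win32":
    watch(list_instances, lambda: logman("start"), lambda: logman("stop"))
```

The set's own 9 PM schedule would need to be turned off so the two don't fight, and each restart produces a new time-stamped file (or appends, depending on how the set is configured). With a 15-second poll, up to 15 seconds of a newly active drive is missed, which seems like an acceptable trade.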
Still, this strikes me as a pain.
Thoughts?