I have a NetBackup 5.1 solution running on HP-UX 11i.
As you can see, it's a very old solution. At some stage the applications will be migrated, but we are stuck with it for the foreseeable future and unable to upgrade.
We also have an L180 ATL SCSI-attached tape library dedicated to this system, with 6 x DLT7000 drives.
Periodically, we get slow throughput and I am keen to understand:
where I need to look for any classic errors (which logfiles and what to look for in there)
how to spot tape drive issues
how to spot media issues
ultimately, how to troubleshoot issues
A regular issue that we seem to face is a duplicate job which starts at 8am each day and used to always finish around 5pm.
Nowadays, more often than not, it runs very late into the evening, sometimes past midnight, so I am keen to find out why and resolve it.
Can anyone point me in the direction of any such literature please, or offer some guidance?
It's not my day job, so I am lumbered with keeping it ticking over unfortunately.
Thanks in advance
5.1 is before my time with NBU, so this 'might' be incorrect. If so accept my apologies, and I'm sure Marianne will be along in due course to correct things ...
For a job that completes successfully, look in bptm log (media server) and bpbkar log (client) for lines like:
bptm / bpbkar waited xx times for full / empty buffer, delayed yy times
It's high level, but if bptm is delayed many times waiting for a full buffer, then the data isn't getting from the client to the media server quickly enough (e.g. bptm writes the data from the shared memory buffers and then has to wait for them to refill).
If bpbkar is delayed waiting for an empty buffer, then bptm (ie media server) is not emptying the data from the buffers quickly enough.
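As a rough sketch of how those lines could be pulled out of a log, something like the following Python might do. The two line shapes in the comment are assumptions about the wording (it may differ between versions, 5.1 included), and `parse_buffer_waits` is just an illustrative name, so adjust the pattern to whatever your logs actually contain.

```python
import re

# Assumed line shapes (check against your own logs and adjust):
#   "... waited for full buffer 1234 times, delayed 5678 times"
#   "... waited 12 times for empty buffer, delayed 34 times"
PATTERN = re.compile(
    r"waited\s+(?:(?P<w1>\d+)\s+times\s+for\s+(?P<k1>full|empty)\s+buffer"
    r"|for\s+(?P<k2>full|empty)\s+buffer\s+(?P<w2>\d+)\s+times)"
    r",\s*delayed\s+(?P<d>\d+)\s+times"
)


def parse_buffer_waits(lines):
    """Return (kind, waited, delayed) tuples for each matching log line.

    kind is "full" (typically bptm waiting for data to arrive) or
    "empty" (typically bpbkar waiting for bptm to drain the buffers).
    """
    results = []
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # not a buffer-wait summary line
        kind = m.group("k1") or m.group("k2")
        waited = int(m.group("w1") or m.group("w2"))
        delayed = int(m.group("d"))
        results.append((kind, waited, delayed))
    return results
```

Feed it the lines of a bptm or bpbkar log (for example `parse_buffer_waits(open(path))`) and compare the "full" and "empty" delayed counts to see which side is doing most of the waiting.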
Each delay is 10 milliseconds I think (might be 15, it should say somewhere at the start of the log) - so the delays only become relevant when there are lots of them; we're usually only concerned with many thousands of them. It's also relative to how long the job is. E.g. a small backup that should complete in 10 mins needs fewer delays to have an effect - 5 minutes of delays is 50% of the time. 5 minutes of delays in a 5 hour backup, we wouldn't even look at.
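To put numbers on that, here is a small Python sketch of the arithmetic. It assumes the 10 ms per delay mentioned above (check the start of your log for the real value), and `delay_fraction` is only an illustrative helper name.

```python
def delay_fraction(delayed_count, job_seconds, delay_ms=10):
    """Fraction of the job's elapsed time spent in buffer delays.

    delay_ms defaults to 10 as an assumption; use the value your
    log reports if it differs.
    """
    delay_seconds = delayed_count * delay_ms / 1000.0
    return delay_seconds / job_seconds


# 30,000 delays at 10 ms each is 300 seconds, i.e. 5 minutes.
short_job = delay_fraction(30000, 10 * 60)      # 10 minute backup
long_job = delay_fraction(30000, 5 * 60 * 60)   # 5 hour backup
print("10 min job: %.1f%% of time in delays" % (short_job * 100))
print("5 hour job: %.1f%% of time in delays" % (long_job * 100))
```

The same 5 minutes of delay is half of the short job but under 2% of the long one, which is why the raw count means little without the job duration next to it.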
You're not likely to get 0 delays, there are almost always some - you have to make a judgement as to whether it's enough to worry about.
In my experience, the vast majority of times the delays are bptm waiting for full buffers. The cause is usually either slow disk read speed on the client, or network issues.
If bpbkar is having the most delays, then things are a bit harder. Perhaps buffer tuning settings can be improved, but we could be looking at hardware / SAN etc ...
Tape issues / drive issues are hard to detect. If there is a clear error then yes, that's easy, but poor performance caused by errors is difficult. Modern drives suffer 'recoverable' errors (as does disk) - the drive will automatically re-write data if it needs to, but won't tell anyone - that is, nothing appears in the logs, it's all happening at too low a level. It is detectable, but specialist software is required. Some libraries have this ability built in, to a certain degree. I think this is not that likely; if you're getting enough errors to slow things down noticeably, I would have thought the drive would start throwing out some tape alert (= hardware error). These sorts of errors and their handling are more 'occasional' than frequent. That said, media / drives wear out and the number of 'correctable' errors increases until it hits a level where 'real errors' occur - so if you're unlucky, that could be coming next.
Testing outside NBU is a good troubleshooting technique - FTP files between servers to get a rough idea of network performance. Copy files or use OS commands to check disk IO speed, for example. If Unix, you can write to drives using cpio / tar / dd, the world is your oyster .... (don't pick a tape with data on it, for obvious reasons ...) If Windows, you have my condolences as you're a bit stuffed; I'm not sure there are that many commands to write to tape drives outside NBU - feel free to correct me, I'm from a Unix background, not Windows.
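If the OS commands aren't convenient, a rough disk read check can also be sketched in Python, assuming an interpreter is available on the client (not a given on an old HP-UX box). `read_throughput_mb_s` and the 256 KB block size are illustrative choices, not anything NBU uses, and the OS file cache will inflate the figure if the file was read recently, so pick a large file that hasn't been touched for a while.

```python
import time


def read_throughput_mb_s(path, block_size=256 * 1024):
    """Read a file sequentially and return (bytes_read, MB_per_second).

    This only measures plain sequential reads through the OS, which is
    a rough stand-in for what a backup client has to do.
    """
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break  # end of file
            total += len(chunk)
    elapsed = time.time() - start
    megabytes = total / (1024.0 * 1024.0)
    # Guard against a zero elapsed time on very small or cached files.
    rate = megabytes / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Calling it on a big file from the filesystem being backed up gives a ballpark number to compare against the throughput the backup job reports.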
It mustn't be forgotten that almost all tape I/O is done by the OS, not NBU. Sure, we send a few SCSI commands here and there, but the reads / writes are done by the OS. This is despite popular opinion and very often what you are told by hardware vendors.
Logs to start with, as mentioned: bptm (media server) and bpbkar (client). Although Marianne will be along to tell me off, I like them at VERBOSE = 5 ... OS messages logs can be good, as can volmgr/debug tpcommand and robots, but these last two are more for complete failures than performance.
If it exists, the /usr/openv/netbackup/db/media/errors file can be worth a peep on each media server (apologies, I have no idea if this exists at 5.1).
Hope this helps a bit ...
Thanks mph999 for such an extensive reply.
I will read and digest your comments and advice, take a look around the logfiles, and share with you what I find.