Last week, a new problem started: a tape gets stuck halfway between the tape drive and the robot, which essentially locks the robot. Only manual intervention (pushing the tape all the way into the robot w/ your fingers) frees it.
I assumed that it was a hardware problem based on age, but now it's not clear to me whether this is a hardware problem or something else. The problem seems to return pretty quickly (within hours, or at most a day, of when it's fixed), but this morning I spent about 2 hours moving tapes around in the drives and library using ROBTEST, and I didn't have a single problem. And that is a lot more tape shuffling and movement than the library would see on a regular basis.
The environment has been in place for years, and there definitely was no major change before this started happening:
I've done the following:
I've tried parsing through the logs and can't make too much sense of it. It would help to know:
So, if anyone is good at troubleshooting this kind of problem and reading logs, input is appreciated. I'm attaching some logs from today. The first notice of a problem that I see in the Event Vwr of the Robot Control Host is 10:49AM.
I can say with 99.9% certainty that a tape not moving correctly from robot to drive or drive to robot has absolutely nothing to do with the NetBackup software. NetBackup requests that tapes be mounted or ejected; it is not involved in the mechanics at all and cannot stop a mount or eject midway through the operation.
Have you tried power cycling the robot and the drives? Then reinventory.
Do you have the event logs?
bptm log mentions:
12:14:10.634 [5672.256] <16> mount_open_media: error requesting media, TpErrno = No robot daemon or robotics are unavailable
12:14:10.649 [4424.1324] <16> really_tpunmount: error unloading media, TpErrno = Robot operation failed
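For anyone trying to work out what happened and when, here is a minimal Python sketch that splits bptm lines like the two above into fields and keeps only the error entries. The line layout (time, [pid.tid], <severity>, function, message) is an assumption inferred from these two lines alone, treating the second number in the brackets as a thread ID is a guess, and the names parse_line, errors and BptmEntry are made up for illustration. It is not a NetBackup tool.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed layout, inferred only from the two bptm lines quoted above:
#   HH:MM:SS.mmm [pid.tid] <severity> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\]\s+"
    r"<(?P<severity>\d+)>\s+"
    r"(?P<function>\w+):\s*(?P<message>.*)$"
)


class BptmEntry(NamedTuple):
    """One parsed log line (hypothetical helper type)."""
    time: str
    pid: int
    tid: int  # assumption: second number in the brackets is a thread ID
    severity: int
    function: str
    message: str


def parse_line(line: str) -> Optional[BptmEntry]:
    """Return the parsed entry, or None if the line does not fit the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return BptmEntry(
        m["time"], int(m["pid"]), int(m["tid"]),
        int(m["severity"]), m["function"], m["message"],
    )


def errors(lines: Iterable[str], min_severity: int = 16) -> Iterator[BptmEntry]:
    """Yield entries at or above min_severity (both quoted lines carry <16>)."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.severity >= min_severity:
            yield entry


# The two lines quoted above, used as sample input.
SAMPLE = [
    "12:14:10.634 [5672.256] <16> mount_open_media: error requesting media, "
    "TpErrno = No robot daemon or robotics are unavailable",
    "12:14:10.649 [4424.1324] <16> really_tpunmount: error unloading media, "
    "TpErrno = Robot operation failed",
]

if __name__ == "__main__":
    for e in errors(SAMPLE):
        print(e.time, e.pid, e.function, "->", e.message)
```

To run it against a real log, read the file line by line and pass the lines to errors(); lines that do not fit the assumed layout are skipped rather than raising.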
So what changed last week when the issue started happening?
The following says to me that your hardware is quite old and maybe 'dying' because of 'old age':
- Storagetek L40 w/ 3 SCSI LTO3 Drives.
- SN3300 Fiber/SCSI router connecting all servers to the Tape Drives.
ALL of our customers with this aging infrastructure have in the meantime replaced their hardware with fibre-attached libraries and tape drives.
When last was firmware upgraded on the library, tape drives and router?
Those routers were simply bad news - I've seen way too many issues over the years caused by them.
When you reboot the environment, the router should be restarted first. Wait for it to fully come up before booting the rest.
Thank you for the responses:
Revaroo: I saw those errors in BPTM but couldn't discern what exactly happened and when, only that at some point NetBackup needed the robot and it was unavailable. As I said, there was no major change. I said "major" because there are the usual tape changes, and I'm frequently editing policies, cancelling jobs, and starting jobs. I stated that I had rebooted only the library (which is inclusive of the robot and drives) and the Robot Control Host.
mph999 & ajinBabu: I agree that NBU can't directly cause the tape stop b/t drive and robot, but to say that it's not involved in what's happening is simplistic.
Marianne: I'm well aware of the age and eccentricities of the equipment, and wasn't asking about recommendations for replacement. Which is why I said I "assumed that it was a hardware problem based on age", but I was surprised that 2 hours of ROBTEST couldn't replicate a problem. Firmware, drivers, etc are all the latest available (some of which are very old regardless).
UPDATE (Resolved?, hopefully)
During the last event (yesterday), I had the robot freed by "remote hands" at the CoLo facility. The robot was working, then about a minute later the problem returned. This was the first time I knew exactly when it happened, which allowed me to focus on specific errors in the event viewer. While I still couldn't tell what exactly happened from the events, I was able to go back through the event viewer and see exactly when the problem had occurred previously. These times showed that the problem started after a routine tape swap and, each time it was fixed, came back as soon as NBU resumed control of the library/robot/drives, whether NBU had been out of the picture b/c I had services stopped, was using ROBTEST, or had taken the library offline. This, combined w/ the results of my ROBTEST, still led me to believe this was not simply a hardware problem.
I don't claim to understand everything going on under the hood, but the SCSI errors made me think I should try taking everything down, and bringing it back up in hopes that would fix it. So I stopped everything and powered it down, bringing it up one by one in the appropriate order, etc.
So, it's now been about 20 hours without a problem, with a ton of jobs and tape movement since SLP is trying to catch up on several days' worth of duplication.
Some libraries come with diagnostic and exercising utilities, but I suspect that this library is past EOSL.
Is the HBA shared with disk? People used to worry about tape and disks sharing the same HBA because of throughput and the possibility of devices not functioning correctly when SCSI commands got confused between the different technologies. I think most people ignore the requirement of separating disk and tape on HBAs because all those issues were fixed some time ago.
Some recent patching or driver updates on the media server may have induced some problems. The same applies to recent enclosure firmware updates if the media server is a blade.
Why don't you create the following logs to find out more about the errors when it fails?
In media server:
That shall produce more messages than the usual bptm logs.
Apart from this, I recall a tape backup environment I worked in before. We were always having this "drive going down" issue without apparent reason. No change in software (NetBackup) nor in hardware. After many rounds of checking, the library vendor finally found the issue. It was due to inconsistent firmware across the different drives that interact with the robot. Once the vendor made all drives' firmware consistent, the drives did not go down unexpectedly anymore. Things can be different nowadays, but I'm not sure if that inconsistency of firmware is still a limitation under certain circumstances.
I am not saying the above is your issue, but it's one of those things that won't be easy to trace, and it requires a fair amount of troubleshooting & field testing (for hardware).
Please bear in mind that only ONE utility can control the robot at any point in time - NBU or robtest. NBU mount requests will fail while robtest is running.
See this Note in the robtest TN: http://www.symantec.com/docs/TECH83129