05-25-2020 01:45 AM
Good Day
We have an NBU 5240 Appliance 3.1.1 installed with a Dell TL4000 Library and 4 x Dell LTO 7 drives. The Library and Drives are SAN-attached to the Appliance through a Switch.
The Drives are only used with SLPs to duplicate the Monthly and Yearly images for LTR offsite.
This has been installed for over 12 months. Initially, tape backups were done weekly to catch up on the old retention data, and we did not have this issue. Now that we only do Monthly or Yearly duplication to tape, the Drives go OFFLINE and the SLPs fail.
Veritas support say we need to contact Dell. Dell have run diagnostics on the Library and there are no faults.
Can anyone suggest the correct troubleshooting tasks we need to perform to determine the issue and correct it?
Best Regards
Brian
05-25-2020 07:51 AM
Have you been able to monitor this to get an idea of timelines?
Any kind of OS -> hardware connectivity issues will be logged in /var/log/messages on the appliances, but these logs are recycled on a regular basis.
So, the key is to know when this happened and trace errors in the messages* files.
Switch logs may also give clues?
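As a rough sketch of the "trace errors in messages* files" step, something like the Python below could pull out lines that look tape- or SAN-related. The patterns are illustrative guesses only, not a list of messages the appliance is known to log, and the function names are made up for this example. Adjust the patterns for the HBA driver and devices actually in use.

```python
import re
from typing import Iterable, List

# Illustrative patterns only (assumption): tune these for your HBA driver
# and for the way your tape devices show up in the kernel log.
PATTERNS = [
    r"\bscsi\b",
    r"\bst\d+\b",      # Linux SCSI tape device names such as st0
    r"i/o error",
    r"offline",
    r"link down",
    r"\brport\b",
]
_MATCHER = re.compile("|".join(PATTERNS), re.IGNORECASE)


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that match any of the patterns above."""
    return [ln.rstrip("\n") for ln in lines if _MATCHER.search(ln)]


def scan_files(paths: Iterable[str]) -> List[str]:
    """Scan plain-text log files and prefix each hit with its file name."""
    hits = []
    for path in paths:
        with open(path, errors="replace") as fh:
            hits.extend(f"{path}: {ln}" for ln in suspicious_lines(fh))
    return hits
```

Calling `scan_files(sorted(glob.glob("/var/log/messages*")))` after `import glob` would cover the current and rotated files, provided they are readable and not compressed. The timestamps on the hits can then be compared with the time the drives went down.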
05-25-2020 10:22 PM
Thank you Marianne,
Support could not find any issues in /var/log/messages.
We have requested the SAN team to monitor the Appliance and Library/drive Switch ports for any issues.
Regards
Brian
05-25-2020 11:29 PM
Hi Brian
(Only realised this morning who I'm talking to)
The problem with the messages file is that it gets recycled every 3 days or so.
messages is copied to messages.#, I think up to messages.2, and the oldest gets deleted.
So, if access was lost 2 weeks ago, there will be no more evidence in any of the messages files.
More versions can be saved by editing the relevant crontab.
*** Edit *** my reference to crontab was based on my Solaris history.
It seems that Linux has a 'logrotate' config file: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-lo...
Maybe worth a look?
06-02-2020 10:58 AM
Anything about the entries in /usr/openv/netbackup/db/media/errors that gives you any hints as to a cause ? Clusters of timestamps, TAPE_ALERTs, same 3 barcodes always keep showing up (i.e. bad tapes), etc ?
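To spot repeat-offender barcodes or clusters of dates in that file, a small tally script can help. The sketch below assumes each line starts with whitespace-separated fields in the order date, time, media ID, drive index, error type. That layout is an assumption, so check it against your own file first. The function name is made up for this example.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_media_errors(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count error entries per media ID, drive index, error type and day.

    Assumed field order (verify against your own file):
    date time media_id drive_index error_type ...
    """
    by_media, by_drive, by_type, by_day = Counter(), Counter(), Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or unrecognised lines
        date, _time, media_id, drive_index, error_type = fields[:5]
        by_media[media_id] += 1
        by_drive[drive_index] += 1
        by_type[error_type] += 1
        by_day[date] += 1
    return {"media": by_media, "drive": by_drive, "type": by_type, "day": by_day}
```

Open the errors file, pass the file object to `tally_media_errors`, and look at `result["media"].most_common(5)`. If the same few barcodes dominate, suspect the tapes. If one drive index dominates, suspect that drive or its path.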
Are you able to run successful test (non-SLP) jobs to the drives, or do all write attempts fail now ? Can you use robtest to successfully move a tape from a slot to a drive & back again ?
Off-the-cuff, if all four of the drives are going down, is it a case of them getting tried sequentially, each failing in turn, so you end up with all 4 down after a period of time ? Or do four different SLPs kick off, each attempting to grab a different drive, and each job fails and downs its drive ?
Some "normal" causes for having all library drives go down :
* bad tapes, bad tapes, bad tapes
* a physical blocking issue of some kind fails every mount because the robot cannot move.
* An object (usually a tape) has been dropped to the library floor and is blocking some movement (see above)
* library calibration has shifted so the tapes aren't able to line up and load into the drives.
* drive device paths are no longer correct on the Media Server, so the tapes load up just fine (because the robotic controller path is correct) but nothing gets written to them.
* Attempting to load incompatible media into a drive.
* SAN issues - any errors showing on the switch that might account for the problem ?
* drives aren't being cleaned
Most of the above should've been noticed by Dell if they ran diagnostics but I don't know whether their logs would show media issues or just physical stuff.