03-29-2012 12:25 PM
Over the past 2 nights I have had 4 media frozen with the message "Non NetBackup media found in drive index 0, freezing xxxxx". All 4 media are from the same robot, a Scalar i500 with 4 LTO4 drives. I checked the library and it had logged media write errors on 2 of the drives. I cleaned the 2 drives and unfroze the 4 media. The robot is controlled by a master/media server running 7.1 on Windows 2008. What can I do or look for to capture the results of tonight's run if the media freezing occurs again? Thank you.
03-29-2012 12:48 PM
and the tape library status logs
Same tapes again?
Bad media; set them aside (or dispose of them).
Same tape drive?
Call the hardware vendor for a replacement.
03-29-2012 02:59 PM
The process that freezes the tape is ltid :
These logs are all on the media server
.../volmgr/debug/ltid (dir must exist for the log to exist)
I would also create the dir /usr/openv/volmgr/debug/robots
bptm might also give details:
.../netbackup/logs/bptm
Also put VERBOSE into the file /usr/openv/volmgr/vm.conf
Create the empty files
.../volmgr/DRIVE_DEBUG and
.../volmgr/ROBOT_DEBUG
and then restart ltid (stopltid, then ltid -v). You will then get logging in the system messages file (or the event logs on Windows).
Another good file on the media server is .../netbackup/db/media/errors
If you have a Solaris server available, you can copy the errors file to it and use this script ...
https://www-secure.symantec.com/connect/downloads/tperrsh-script-solaris-only
to see if you can spot any patterns. You will need to run the script like this:
tperr.sh -a -n -f <path/to/file>
Note: NBU logging for these issues can be limited, and may well only say 'write error'. This will not be an NBU issue, as NBU is not actually writing to the tapes at all; the operating system is.
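The logging setup described above could be scripted roughly as follows on a UNIX/Linux media server. This is only a sketch: the install base is passed in by the caller (the post spells out /usr/openv only for the robots directory and vm.conf), the function name is made up for illustration, and on Windows the paths and method will differ.

```shell
#!/bin/sh
# Sketch: prepare the debug logging described above under a given install base.
# The base directory is an assumption supplied by the caller.
enable_media_debug() {
    base=$1
    # Log directories must exist before anything is written to them.
    mkdir -p "$base/volmgr/debug/ltid" \
             "$base/volmgr/debug/robots" \
             "$base/netbackup/logs/bptm" || return 1
    # Add VERBOSE to vm.conf only if the line is not already there.
    conf="$base/volmgr/vm.conf"
    touch "$conf" || return 1
    grep -qx 'VERBOSE' "$conf" || echo 'VERBOSE' >> "$conf"
    # Empty marker files named in the post.
    touch "$base/volmgr/DRIVE_DEBUG" "$base/volmgr/ROBOT_DEBUG"
}
```

Call it with the install base as the only argument, then restart ltid as described above so the settings take effect.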
Regards,
Martin
03-29-2012 09:55 PM
03-29-2012 11:10 PM
Marianne saves the day ... clearly I am a bit blind and managed to miss the "Non NetBackup media" bit ...
Either you have media in the library that has been written by some application other than NBU, or something very nasty has happened to media that 'should have' an NBU header.
We need to see the output of the command suggested by Marianne for one of the 'bad' media.
Martin
03-30-2012 02:34 AM
In addition to the excellent posts above (especially Marianne's, as always): as you are on Windows, the Application event log is really useful for this sort of thing. NetBackup has a habit of writing clearly to the Windows Application log about what is wrong with a tape, so it is worth a look.
03-30-2012 06:32 AM
Hello, thank you for the tips. None of these tapes have been accessed by any other backup application or by manual command-line backups (to my knowledge!). I had 4 more tapes freeze last night: 3 were the same as on the previous night, plus one new one. This is the output from nbemmcmd -listmedia -mediaid xxxxxx. Can you please tell me how to verify which drives were writing to these tapes when they froze?
NBEMMCMD, Version:7.1
====================================================================
Media GUID: 0080ef8f-fc02-411b-aaf3-a54839a0c00a
Media ID: 000208
Partner: -
Media Type: HCART
Volume Group: 000_00000_TLD
Application: Netbackup
Media Flags: 1
Description: Added by Media Manager
Barcode: 000208
Partner Barcode: --------
Last Write Host: hpncpnback01
Created: 07/16/2011 07:28
Time Assigned: 03/24/2012 21:24
First Mount: 07/23/2011 15:37
Last Mount: 03/29/2012 21:22
Volume Expiration: -
Data Expiration: 04/12/2012 18:00
Last Written: 03/29/2012 20:04
Last Read: -
Robot Type: TLD
Robot Control Host: hpncpnback01
Robot Number: 0
Slot: 74
Side/Face: -
Cleanings Remaining: -
Number of Mounts: 80
Maximum Mounts Allowed: 0
Media Status: FROZEN
Kilobytes: 650883203
Images: 40
Valid Images: 40
Retention Period: 1
Number of Restores: 0
Optical Header Size Bytes: 1024
Optical Sector Size Bytes: 0
Optical Partition Size Bytes: 0
Last Header Offset: 10170145
Adamm Guid: 00000000-0000-0000-0000-000000000000
Rsm Guid: 00000000-0000-0000-0000-000000000000
Origin Host: NONE
Master Host: hpncpnback01
Server Group: NO_SHARING_GROUP
Upgrade Conflicts Flag:
Pool Number: 5
Volume Pool: FDSU
Previous Pool Name: -
Vault Flags: -
Vault Container: -
Vault Name: -
Vault Slot: -
Session ID: -
Date Vaulted: -
Return Date: -
====================================================================
Command completed successfully.
03-30-2012 06:48 AM
There is clearly a problem here, as we can see that it has been a valid tape in the past.
The last mount was at 03/29/2012 21:22, so just check your 4 media servers to see which one has a Windows Application event log entry at that time relating to this. It should also tell you which drive was in use when the error happened.
Do the same for the other tapes
You can then check if that server has issues or if it looks like one drive has issues.
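To line up several tapes against the event logs, the last-mount times can be pulled out of saved listings. A minimal sketch, assuming the nbemmcmd -listmedia output for each frozen tape has been saved to a text file with the field labels shown earlier in the thread; the function name and the file are hypothetical, and it should be run wherever a POSIX shell and awk are available.

```shell
#!/bin/sh
# Sketch: print "media id | status | last mount | last write host" for each
# record in saved `nbemmcmd -listmedia` output.
# Relies on "Media Status" appearing after the other three fields in a record,
# as it does in the listing shown above.
media_summary() {
    # $1: text file holding one or more saved listings (hypothetical name)
    awk '
    function value(line) { sub(/^[^:]*:[ \t]*/, "", line); return line }
    /^Media ID:/        { id = value($0) }
    /^Last Write Host:/ { host = value($0) }
    /^Last Mount:/      { mount = value($0) }
    /^Media Status:/    { printf "%s | %s | %s | %s\n", id, value($0), mount, host }
    ' "$1"
}
```

Each output line gives a time window to search for in the Application event log of the listed host.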
Remember that it is good practice to keep drive firmware and tape drivers up to date, and also to use the AutoRun registry key with a value of 0 on all Windows Media Servers (this needs a reboot, but it does need doing):
http://support.microsoft.com/kb/842411
It may be that the headers of those tapes have been damaged, so you may need to run bplabel from the command line before you can use them again. Check everything out first, though, as you don't want to lose backups (and if you do decide to run bplabel, run bpexpdate first to tidy up the catalog).
Hope this helps
03-30-2012 06:52 AM
These messages were in the Windows Application event log of the master/media server that controls this library.
They correspond to the times I got the failures last night. From the NBU TLD Control Daemon:
Cannot read volume header on HP.ULTRIUM4-SCSI.000 (device 0, \\.\Tape3); media may have been written at an incompatible density or is corrupt
From the NBU Tape Manager:
TapeAlert Code: 0x01, Type: Warning, Flag: READ WARNING, from drive HP.ULTRIUM4-SCSI.000 (index 0), Media Id 000208
03-30-2012 07:08 AM
Has anything been changed, such as the data buffer settings (NUMBER_DATA_BUFFERS, SIZE_DATA_BUFFERS, etc.)?
Do all of the ones that get frozen do so on device 0?
Just trying to pin down whether the O/S (the AutoRun keys), the tape or the drive is the issue.
As it seems to be a fairly new tape (first mounted only in July last year) it should be OK, but could I ask about your tape handling procedures and what the weather has been like where you are? (Tapes are very sensitive to changes in temperature and humidity.)
Thanks
03-30-2012 07:12 AM
All the freezing appears to be from the same drive in the same library. I cleaned it yesterday so I am going to down it for now so it won't be used.
TapeAlert Code: 0x01, Type: Warning, Flag: READ WARNING, from drive HP.ULTRIUM4-SCSI.000 (index 0), Media Id 000208
TapeAlert Code: 0x01, Type: Warning, Flag: READ WARNING, from drive HP.ULTRIUM4-SCSI.000 (index 0), Media Id 000056
TapeAlert Code: 0x01, Type: Warning, Flag: READ WARNING, from drive HP.ULTRIUM4-SCSI.000 (index 0), Media Id 000291
TapeAlert Code: 0x01, Type: Warning, Flag: READ WARNING, from drive HP.ULTRIUM4-SCSI.000 (index 0), Media Id 000066
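With more alerts than these, a per-drive tally makes the pattern easier to see. A minimal sketch, assuming the event-log messages have been exported to a plain text file in the format shown above; the function name and file are hypothetical, and it needs a POSIX shell with awk.

```shell
#!/bin/sh
# Sketch: count TapeAlert messages per drive from exported event-log text.
# Assumes each line looks like the samples above:
#   TapeAlert Code: ..., from drive <name> (index N), Media Id <id>
count_alerts_by_drive() {
    # $1: path to the exported text file (hypothetical)
    awk '
    /TapeAlert Code:/ && match($0, /from drive [^ ]+ \(index [0-9]+\)/) {
        # Skip the 11 characters of "from drive " to keep "<name> (index N)".
        key = substr($0, RSTART + 11, RLENGTH - 11)
        count[key]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }
    ' "$1" | sort -rn
}
```

The drive with the highest count is listed first; a single drive dominating the list points at that drive rather than at the media.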
03-30-2012 07:16 AM
OK - looks like a bad drive then. Get your library vendor to sort it out.
One of my customers has had a lot of HP drives replaced in their i500s.
Go to the library GUI and download the library snapshot. They will ask for it as soon as you call them, so have it ready.
Next they will ask you to download the HP Tape Tools to run a test on the drive. Remember that this needs a blank tape in the drive for the tests, so be careful to allocate a blank one for the purpose.
If you are able to do this ahead of time you may save yourself some time and get the drive replaced quicker, although they will probably try a firmware upgrade first, depending on the test results.
03-30-2012 07:18 AM
Think you need to follow my post ... we need to see what is at the beginning of the tape, if it is readable at all.
Also, when these tapes were last used (that is, before they froze), were the backups successful?
I'm wondering if you have had some external SCSI rewind event or something. Is there anything in the NBU errors report?
It could be worth getting the bptm log as well.
Martin
03-30-2012 07:53 AM
Try this:
#!/bin/ksh
# Report media frozen in the last 24 hours (for a UNIX/Linux NetBackup server).
function GetFrozenMedia
{
    BPERROR=/usr/openv/netbackup/bin/admincmd/bperror
    print "### Frozen Media in the last 24 hours ###"
    # Keep only the media/device errors that mention a freeze.
    $BPERROR -s ERROR+ -t MEDIADEV -hoursago 24 | egrep -i "froz|free" > FrozenMedia.log
    FREEZE_COUNT=$(wc -l < FrozenMedia.log)
    cat FrozenMedia.log
    printf "%-30s%d \n\n" "Tapes Count: " $FREEZE_COUNT
}
## MAIN
GetFrozenMedia
# This little script lists the frozen-media events from the last 24 hours, writes them to FrozenMedia.log and counts them, so you get a good report. Add it to cron and you can have it early in the morning and again in the afternoon.
03-30-2012 11:16 AM
Omar, this will work great on a Unix/Linux server - not on Windows....
03-30-2012 12:47 PM
You are right, Marianne. I seem to have banned Windows from my head and cannot even spell it, but to help our friend I think a scheduled task on his Windows box will do the job. The batch file is much the same:
Create a FrozenMedia.cmd file with the following command:
bperror -s ERROR+ -t MEDIADEV -hoursago 24 | find "free" > FrozenMedia.log
Configure a Windows task to run the script every day, add the "send email" option under its actions, and set the path where you store the FrozenMedia.log file so that it is sent as an attachment.
Sorry, it is not as fancy as the Unix one, but it will do the job.
Regards.
04-13-2012 10:46 AM
I had a combination of bad media causing drives to go offline and needing updated firmware on my library and drives. I have been stable for a week with just removing the 1 bad tape and upgrading firmware. No drive replacement was needed. Thanks for all the help.