Solved: DOWN'd tape drive due to very long waiting time

rclegarto · ‎04-24-2009

Hi guys,

I need your help. I have been experiencing an DOWN'd drive for quite some time now. The exit status is 800 and it is due to a down'd drive. The only log I found that is the reason of the down'd drive is this..

Event Type: Error
Event Source: NetBackup TLD Daemon
Event Category: None
Event ID: 14021
Date: 4/18/2009
Time: 4:31:40 AM
User: N/A
Computer: GQ3APSVR03
Description:
TLD(1) [6492] timed out after waiting 860 seconds for ready, drive 1

then followed by this..

Event Type: Error
Event Source: NetBackup Device Manager
Event Category: None
Event ID: 2636
Date: 4/18/2009
Time: 4:32:01 AM
User: N/A
Computer: GQ3APSVR03
Description:
Operator/EMM server has DOWN'ed drive HP.ULTRIUM2-SCSI.000 (device 0)

I run tpautoconf to check for any missing drive but this does not give me any result at all. When I tried to UP the drive, it will UP, then i can use the drive to run several backup. but after sometime, it will be down'd again and the above mentioned error logs present.

I had been searching google for anything that might help on troubleshoot this but no luck.

I am using Netbackup 6.5.3 on a Windows Server 2003 R2 - Standard Edition. All your comments will be very much appreciated.

Anonymous · ‎04-24-2009

I'm on UNIX and sometimes face similar situation but found my solution to be with old files named after the DOWN'd drive in question. I just delete them and the problem goes away.

On Windows same scenario applies

..\Veritas\NetBackup\db\media\drives\

Any files in here that don't reference an active file in the directory ..\tpreq could de deleted
Look at the file in Notepad

See if your drives stay up after this.

------------------

Heres a note I wrote:

When a Media Server performs a backup, a "Drive Device flag" file is written, along with a link file using the NetBackup Drive name to the actual device file at the Operating System level for the drive in use for the backup.

These files could possibly be preventing drives from being released from reservation or being used. ie After UP'ing the Drive it goes back down on a specific Media Server, because the file has been left hung over from a failed backup.

For example below, on Oct 7th you can see that HP.SDLT600.000 is in use. The others listed in the ./drives folder are old and can be deleted, and could be preventing them being used the next time this media server is assigned those resources, and DOWN them if they were previously UP.

./drives:
total 10
drw-r-xr-x   2 root       root          1024 Oct  7 05:38 .
drwx------   4 root       sys             96 Jan 30  2008 ..
-rw-------   1 root       root           263 Oct  7 09:15 drive_HP.SDLT600.000
-rw-------   1 root       root           262 Sep  3 09:22 drive_HP.SDLT600.008
-rw-------   1 root       root           248 Aug 18 08:47 drive_HP.SDLT600.011
-rw-------   1 root       root           248 Aug 18 08:47 drive_HP.SDLT600.012

./tpreq:
total 2
drw-r-xr-x   2 root       root          1024 Oct  7 09:33 .
drwx------   4 root       sys             96 Jan 30  2008 ..
lrwxrwxrwx   1 root       root            22 Oct  7 09:33 drive_HP.SDLT600.000 -> /dev/rmt/c76t0d1BESTnb

View solution in original post

Anonymous · ‎04-24-2009

I'm on UNIX and sometimes face similar situation but found my solution to be with old files named after the DOWN'd drive in question. I just delete them and the problem goes away.

On Windows same scenario applies

..\Veritas\NetBackup\db\media\drives\

Any files in here that don't reference an active file in the directory ..\tpreq could de deleted
Look at the file in Notepad

See if your drives stay up after this.

------------------

Heres a note I wrote:

When a Media Server performs a backup, a "Drive Device flag" file is written, along with a link file using the NetBackup Drive name to the actual device file at the Operating System level for the drive in use for the backup.

These files could possibly be preventing drives from being released from reservation or being used. ie After UP'ing the Drive it goes back down on a specific Media Server, because the file has been left hung over from a failed backup.

For example below, on Oct 7th you can see that HP.SDLT600.000 is in use. The others listed in the ./drives folder are old and can be deleted, and could be preventing them being used the next time this media server is assigned those resources, and DOWN them if they were previously UP.

./drives:
total 10
drw-r-xr-x   2 root       root          1024 Oct  7 05:38 .
drwx------   4 root       sys             96 Jan 30  2008 ..
-rw-------   1 root       root           263 Oct  7 09:15 drive_HP.SDLT600.000
-rw-------   1 root       root           262 Sep  3 09:22 drive_HP.SDLT600.008
-rw-------   1 root       root           248 Aug 18 08:47 drive_HP.SDLT600.011
-rw-------   1 root       root           248 Aug 18 08:47 drive_HP.SDLT600.012

./tpreq:
total 2
drw-r-xr-x   2 root       root          1024 Oct  7 09:33 .
drwx------   4 root       sys             96 Jan 30  2008 ..
lrwxrwxrwx   1 root       root            22 Oct  7 09:33 drive_HP.SDLT600.000 -> /dev/rmt/c76t0d1BESTnb

rclegarto · ‎04-24-2009

Hi, thanks for the tip. I did checked both /tpreq and /drives. And they have identical files. I enabled the Show hidden file on folder options and it does not reveal any other file. Earlier, I was able to UP the drive, now backup is running fine again. However, I am expecting that in 2 or 3 nights, same issue will again visit me. :(

CY · ‎04-24-2009

From past experience, sometime it could be the tape drive's firmware that is causing the problem. I see that you have HP's LTO2 tape drive. Did you have this drive replaced/installed at a different time than the other drives? I'd not be surprised to see a tape drive on slightly different firmware level than other drives - many tape libraries allow this to exist, as wellas some technicians just not good at taking care of everything during work.

Please check with your tape library's vendor what's the most current stable release of the drive's firmware (as well as library's firmware, cause they may have dependency) and see if you need upgrade.

Also, try varying off/on the drive (through the tape library's management console/software) and see if that helps.

rclegarto · ‎05-17-2009

Hi CY,

Tape drive and Tape library had been replaced. After configuring the new tape library and tape drive, backup run smoothly for 3 nights. On the 4th night, same issue happens. Backup exits again with status of 800 due to a down'd drive. And the drive is down'd having an error log of timed out waiting for drive. Can anyone help out please?

CY · ‎05-18-2009

Are your tape library on SAN? Please take a look at this KB document:

http://seer.entsupport.symantec.com/docs/270183.htm

Even though the document addressed to NetBackup 4.5/5.x environment on Windows 2000 SP4/Windows 2003, it may give you some hint as what happened in your place. Check the HBA and tape device drivers on your Windows server as the document suggested.

rclegarto · ‎05-18-2009

Library is called HP StorageWorks MSL2024 Tape Library which is connected thru a SCSI cable.

I am enclosing some logs here from vxlogview command. I had been trying to diciphere this alone but I couldn't .. this is the first part of the log.

5/14/2009 20:00:00.500 [Diagnostic] NB 51216 nbpem 116 PID:4312 TID:1976 File ID:116 [jobid=428 job_group_id=428 client=gq3apsvr03 type=4 server= task=ID:017CB23C policy=INCREMENTAL] 1 V-116-215 [BaseJob::run] jobid=428 submitted to nbjm for processing
5/14/2009 20:00:00.515 [Diagnostic] NB 51216 nbrb 118 PID:4856 TID:4068 File ID:118 [jobid=428] 4 V-118-227 [ResBroker_i::requestResources] received resource request ID {95598D9A-88A5-4AF3-BE93-09A8553CB5AF}, priority 0, secondary priority 26301, description THE_BACKUP_JOB-428-{95598D9A-88A5-4AF3-BE93-09A8553CB5AF}
5/14/2009 20:00:00.515 [Debug] NB 51216 nbrb 118 PID:4856 TID:4068 File ID:118 [jobid=428] 1 [ResBroker_i::requestResources] RequestSeq:
Request=0 provider=MPXProvider resourcename=MpxRequest-428 MPXGroupRequest maxmpx=1
SubRequest=0 provider=DriveOperationProvider resourcename=gq3apsvr03-hcart3-robot-tld-0 userSequence=0 RB::StorageUnitRequest { StorageUnitRequest { storageUnit=gq3apsvr03-hcart3-robot-tld-0 mediaPool=NonFull retentionLevel=10 mustUseLocalMediaServer=no failOnError=no mpxRequired=no mustBeNdmp=no getMaxFreeSpace=no minFreeSpaceKBytes=583437859 usageType=1 client=gq3apsvr03 shareGroup=*ANY* } }
Request=1 provider=NamedResourceProvider resourcename=gq3apsvr03.NBU_CLIENT.MAXJOBS.gq3apsvr03 CountedResourceRequest { max=1 }
Request=2 provider=NamedResourceProvider resourcename=gq3apsvr03.NBU_POLICY.MAXJOBS.INCREMENTAL CountedResourceRequest { max=2147483647 }
5/14/2009 20:00:00.515 [Debug] NB 51216 nbjm 117 PID:5844 TID:1704 File ID:117 [jobid=428 parentid=428] 1 [JobInstance::state] (013ED638) state=0(JobInstance.cpp:313)
5/14/2009 20:00:00.515 [Diagnostic] NB 51216 nbjm 117 PID:5844 TID:1704 File ID:117 [jobid=428 parentid=428] 4 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=428)
5/14/2009 20:00:00.796 [Diagnostic] NB 51216 nbrb 118 PID:4856 TID:3824 File ID:118 [jobid=428] 4 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {95598D9A-88A5-4AF3-BE93-09A8553CB5AF}
5/14/2009 20:00:00.796 [Diagnostic] NB 51216 nbrb 118 PID:4856 TID:3824 File ID:118 [jobid=428] 2 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {95598D9A-88A5-4AF3-BE93-09A8553CB5AF}
5/14/2009 20:00:00.796 [Diagnostic] NB 51216 nbrb 118 PID:4856 TID:3824 File ID:118 [jobid=428] 2 V-118-146 [ProviderManager::allocate] MPXProvider returned Resource is currently in use for request ID {95598D9A-88A5-4AF3-BE93-09A8553CB5AF}
5/14/2009 20:00:00.796 [Application] NB 51216 nbrb 118 PID:4856 TID:3824 File ID:118 [jobid=428] [Error] V-118-171 request ID {95598D9A-88A5-4AF3-BE93-09A8553CB5AF} failed with status 1 (Resource is currently in use); releasing 2 allocated resources
5/14/2009 20:00:00.796 [Debug] NB 51216 nbrb 118 PID:4856 TID:3824 File ID:118 [jobid=428] 1 [ResBroker_i::holdOne] tracking error status (RBErrorInfo: errorStatus=2005001 retryClass="RB Defined Retry" retryReason=25 retryText="AWAITING_RESOURCE 1242298800 gq3apsvr03-hcart3-robot-tld-0 EMM_STATUS 2005001

Does the part I had bold has something to do with the busy resource error?

Anonymous · ‎05-19-2009

Is this still on going?

If so, check these are TRUE:

Removable Storage service is Disabled and Stopped (if not then this causes havoc with NetBackup and devices)
You have downloaded HP Library Tape and Tools (latest v4.7) onto Windows server with library/drive attached and checked that both devices at latest firmware revision
create legacy debug directory <install_path>/netbackup/logs/bptm/ and review that rather than vxlogs
If tape library on a SAN verify you have the library zoned off correctly and not presented to any other server with its Removable Storage service enabled. (This bit me in the backside 2 days ago, deploying a MSL8096 with FC drives, I had forgot I presented it to another Windows server and it was DOWN'ing the drives and holding SCSI reservations. As well as auto-inventorying the library everytime I closed the mail slots)
You are not sharing this device with any other host. If so the you need to revisit your device configuration.

In matters like this it might also be insightful to install the latest 'nbsu' NetBackup Support Utility, and run command and then review all the output, sometimes this can open up holes in your configuration.

Heres is the download:
http://seer.entsupport.symantec.com/docs/323434.htm

VOX

DOWN'd tape drive due to very long waiting time