Tape drive down very often

AmirJabran81 · ‎01-04-2017

Hi All,

My current environment is as follows:

Server hardware: Dell R710 with an external HBA: QLogic (QLE2562)

OS= Windows 2008 R2 Enterprise Edition

Netbackup Server = 7.6.0.4

Tape Library = HP MSL2024 with two FC tape drives

Problem:

After one year of running successfully, my tape drives now go down often, either one at a time or, most of the time, both together.

The physical status of the library and tape drives looks fine. When I log on to the MSL2024 web console, it shows the drives in UP and Ready state.

The only error is the one below, in the Windows Event Viewer (Application log):

"Operator/EMM server has DOWN'ed drive HP.ULTRIUM5-SCSI.001 (device 1)" .

Please advise whether the issue is with NetBackup or with the HP library or tapes. The HP vendor thoroughly checked and tested the drives and said they are fine.

Waiting for your response.

Thanks & Regards

M.Amir

Michael_G_Ander · ‎01-04-2017

Most often this error in NetBackup is an indication that there is another problem with the infrastructure/hardware.

My list of things to check is:

1) Library console/panel (seems you have done that already)

2) SAN connectivity: is the drive logged into the fabric? (There has been a problem with some HP libraries/tape drives where they logged out after a while)

3) HBA tool, can the drives be seen here & is persistent binding still in place

4) Device manager in OS

And of course all the related/relevant logs, like bptm, the media errors file, OS event logs, and the library, tape drive and SAN switch logs

What is the cleaning state of these drives? I have often seen issues with drives that needed to be cleaned

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

Nicolai · ‎01-04-2017

Can you show us the [install_path]/netbackup/db/media/errors file?

AmirJabran81 · ‎01-04-2017

Hello!

There are a number of files. Which one should I send, the latest one?

AmirJabran81 · ‎01-04-2017

Hello!

Thanks for the update. Let me answer your questions!

1) Library console/panel (seems you have done that already)

Yes, it is working fine with no errors.

2) SAN connectivity, is the drive logged into the fabric (There has been a problem with some HP libraries/tape drives that they logged out after a while)

I have connected the tape library directly to the Dell server, without any SAN switch.

3) HBA tool, can the drives be seen here & is persistent binding still in place

Sorry, I could not understand what information you need.

4) Device manager in OS

Yes, the latest drivers for the tape library and tape drives are installed, along with the latest firmware.

And of course all the related/relevant logs like bptm, media error file, OS event logs, library, tape drive, SAN switch log

What is the cleaning state of these drives? I have often seen issues with drives that needed to be cleaned

Once a month or as required. After cleaning, the behaviour is the same.

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

No change has been made, just a minor upgrade from 7.6.0.3 to 7.6.0.4.

Regards

M.Amir

mph999 · ‎01-04-2017

Please send the errors file as previously requested; it covers the most recent instances of the drives going down. Usually there is only one errors file.

...netbackup\db\media\errors

Please also create the ..netbackup\logs\bptm folder.

Via the GUI > Host Properties > Logs you can set the bptm verbose level to 5.

Create these folders:

<install path>\veritas\volmgr\debug\tpcommand

<install path>\veritas\volmgr\debug\robots

Add the word VERBOSE to the file <install path>\veritas\volmgr\vm.conf

Create the empty files:

<install path>\veritas\volmgr\DRIVE_DEBUG

<install path>\veritas\volmgr\ROBOT_DEBUG

(Make sure Windows does not add a file suffix)

Restart NBU

Await a repeat of the issue and collect the above logs, along with the Activity Monitor details showing the issue for the job.
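
The folder and file steps above could be scripted roughly as follows. This is only a sketch: `enable_debug_logging` and its two arguments are hypothetical names, and the caller has to supply the real NetBackup and volmgr directories because the full install path is not given in this thread. Setting the bptm verbose level and restarting NBU are not covered.

```python
from pathlib import Path


def enable_debug_logging(netbackup_dir: Path, volmgr_dir: Path) -> list[Path]:
    """Create the debug folders and flag files listed in the steps above.

    Both directories are assumptions supplied by the caller; nothing about
    the install location is hard-coded here.
    """
    touched = []

    # bptm log folder plus the two volmgr debug folders.
    for folder in (
        netbackup_dir / "logs" / "bptm",
        volmgr_dir / "debug" / "tpcommand",
        volmgr_dir / "debug" / "robots",
    ):
        folder.mkdir(parents=True, exist_ok=True)
        touched.append(folder)

    # Empty flag files, created without any suffix such as ".txt".
    for name in ("DRIVE_DEBUG", "ROBOT_DEBUG"):
        flag = volmgr_dir / name
        flag.touch(exist_ok=True)
        touched.append(flag)

    # Add the word VERBOSE to vm.conf unless a line already contains it.
    vm_conf = volmgr_dir / "vm.conf"
    text = vm_conf.read_text() if vm_conf.exists() else ""
    if "VERBOSE" not in [line.strip() for line in text.splitlines()]:
        separator = "" if text == "" or text.endswith("\n") else "\n"
        vm_conf.write_text(text + separator + "VERBOSE\n")
    touched.append(vm_conf)

    return touched
```

Running it a second time should be harmless, since existing folders and files are left alone and VERBOSE is only added once.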

Marianne · ‎01-04-2017

There should be only one file called 'errors' in the <install_path>\Veritas\NetBackup\db\media folder.

Please copy this file to errors.txt and upload as attachment (use 'Choose Files' link below the Reply screen).

The errors file will show us failure trends - same drive(s) with different media or same media with different/same drives...

Are you seeing many backup failures with status such as 84?
NBU will automatically DOWN a drive after 3 failures within one hour.

If you UP the drive and start another backup (or restart the same backup) and it fails again, the same logic of 3 failures within 1 hour will be applied and the drive will be DOWN'ed.
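
To make the "3 failures within one hour" rule concrete, here is a minimal sketch of that kind of check. It is not NetBackup's actual implementation; the function name, the tuple layout and the inclusive one-hour comparison are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def drives_over_threshold(errors, limit=3, window=timedelta(hours=1)):
    """Return the drives with at least `limit` errors inside any `window`.

    `errors` is an iterable of (timestamp, drive_name) tuples. The layout
    and the defaults are illustrative assumptions, not NetBackup internals.
    """
    by_drive = defaultdict(list)
    for when, drive in errors:
        by_drive[drive].append(when)

    flagged = set()
    for drive, times in by_drive.items():
        times.sort()
        # Compare each error with the one `limit - 1` positions after it.
        for i in range(len(times) - limit + 1):
            if times[i + limit - 1] - times[i] <= window:
                flagged.add(drive)
                break
    return flagged


# Hypothetical data: drive1 fails three times in 50 minutes, drive0 does not.
example = [
    (datetime(2017, 1, 5, 9, 0), "drive1"),
    (datetime(2017, 1, 5, 9, 20), "drive1"),
    (datetime(2017, 1, 5, 9, 50), "drive1"),
    (datetime(2017, 1, 5, 9, 0), "drive0"),
    (datetime(2017, 1, 5, 11, 0), "drive0"),
    (datetime(2017, 1, 5, 13, 0), "drive0"),
]
print(drives_over_threshold(example))
```

The point of the sketch is that a drive which is UP'ed and then fails again simply starts accumulating failures in a new window, so the underlying cause has to be fixed.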

So, we need to see what is causing the backup failures.
Verbose bptm log will be a good start.

In the meantime, please upload 'errors' file.

Have you checked Event Viewer System log for hardware errors?

Handy NetBackup Links

Nicolai · ‎01-04-2017

In [install_path]/netbackup/db/media do a

more errors

The file should look something like:

12/15/16 19:59:53 A37482 6 WRITE_ERROR 0211
12/23/16 19:31:43 A32855 3 POSITION_ERROR 11112
12/23/16 20:00:47 A32855 3 POSITION_ERROR 11112
12/23/16 22:03:54 A32855 3 POSITION_ERROR 11112
12/23/16 22:30:43 A32855 3 POSITION_ERROR 11112
12/25/16 18:16:03 A23704 5 WRITE_ERROR 1110
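
A small sketch of how lines like these could be tallied to spot trends (same media on different drives, or the same drive with different media). It assumes whitespace-separated columns of date, time, media ID, drive index and error type, with anything after that kept as-is; that column layout is an assumption read off the sample, and the function names are hypothetical.

```python
from collections import Counter

SAMPLE = """\
12/15/16 19:59:53 A37482 6 WRITE_ERROR 0211
12/23/16 19:31:43 A32855 3 POSITION_ERROR 11112
12/23/16 20:00:47 A32855 3 POSITION_ERROR 11112
12/23/16 22:03:54 A32855 3 POSITION_ERROR 11112
12/23/16 22:30:43 A32855 3 POSITION_ERROR 11112
12/25/16 18:16:03 A23704 5 WRITE_ERROR 1110
"""


def parse_errors(text):
    """Split each line into fields; the column meanings are assumed."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or truncated lines
        records.append({
            "date": parts[0],
            "time": parts[1],
            "media": parts[2],
            "drive": parts[3],
            "error": parts[4],
            "rest": parts[5:],
        })
    return records


def trends(records):
    """Count errors per media ID and per drive index."""
    by_media = Counter(r["media"] for r in records)
    by_drive = Counter(r["drive"] for r in records)
    return by_media, by_drive


by_media, by_drive = trends(parse_errors(SAMPLE))
print(by_media.most_common())
print(by_drive.most_common())
```

In the sample, one media ID on one drive index accounts for four of the six entries, which is the kind of pattern worth looking for in the real file.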

AmirJabran81 · ‎01-04-2017

Here is the error.txt file, attached.

Marianne · ‎01-04-2017

Hopefully @Nicolai and @mph999 will be along soon with interpretation of the TapeAlerts seen in the errors file.

If this post is correct:
https://vox.veritas.com/t5/NetBackup/TapeAlert-help-netbackup-db-media-errors/td-p/217530
then yours is also a 9.

In this TN: http://www.veritas.com/docs/000005226 we see that 0x9 means 'Cartridge write protected'. Is it possible that someone inserted a bunch of write-protected tapes in the robot?
(Martin and/or Nicolai will correct me if my interpretation is wrong...)

Can you show us Details tab of a failed backup?
According to the errors file there would've been a failed backup at 09:26 this morning.

Have you created bptm log as suggested by Martin yesterday?
With VERBOSE in vm.conf and logging level for bptm increased?


AmirJabran81 · ‎01-04-2017

Hi all

The tapes are not write-protected, and I did not set them that way, but sometimes the drive reports them as write-protected, the backup fails and the tape is frozen. I then unfreeze it and the same tape is written successfully.

I have enabled the logging and will share the logs with you soon.

Regards

M.Amir

Marianne · ‎01-04-2017

My interpretation of the Tape_Alert is probably incorrect.
Verbose bptm log will help.

Can you please show us the job details of this morning's failed backup?

AmirJabran81 · ‎01-04-2017

Hello

Let me clarify one thing: in our scenario the backup is first written to a DSU (disk storage unit) and is then staged to tape.

Here are the job details showing the error that causes the tape drive to go down:

1/4/2017 4:43:21 PM - begin Duplicate
1/4/2017 4:43:22 PM - requesting resource ST-Backup
1/4/2017 4:43:22 PM - awaiting resource ST-Backup - No drives are available
1/5/2017 10:19:29 AM - awaiting resource ST-Backup - Maximum job count has been reached for the storage unit
1/5/2017 10:19:33 AM - awaiting resource ST-Backup - No drives are available
1/5/2017 11:06:18 AM - awaiting resource ST-Backup Reason: Drives are in use, Media Server: srv-avbkp,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Monthly_Offsite, Storage Unit: ST-Backup, Drive Scan Host: N/A

1/5/2017 11:11:29 AM - awaiting resource ST-Backup - No drives are available
1/5/2017 12:22:59 PM - granted resource J326L5
1/5/2017 12:22:59 PM - granted resource HP.ULTRIUM5-SCSI.001
1/5/2017 12:22:59 PM - granted resource ST-Backup
1/5/2017 12:23:00 PM - Info bptm(pid=8108) start
1/5/2017 12:23:00 PM - started process bptm (8108)
1/5/2017 12:23:01 PM - Info bptm(pid=8108) start backup
1/5/2017 12:23:02 PM - Info bpdm(pid=6996) started
1/5/2017 12:23:02 PM - started process bpdm (6996)
1/5/2017 12:23:02 PM - Info bpdm(pid=6996) reading backup image
1/5/2017 12:23:02 PM - Info bpdm(pid=6996) using 30 data buffers
1/5/2017 12:23:02 PM - Info bptm(pid=8108) Waiting for mount of media id J326L5 (copy 3) on server srv-avbkp.
1/5/2017 12:23:02 PM - started process bptm (8108)
1/5/2017 12:23:02 PM - mounting J326L5
1/5/2017 12:23:02 PM - Info bptm(pid=8108) INF - Waiting for mount of media id J326L5 on server srv-avbkp for writing.
1/5/2017 12:23:02 PM - begin reading
1/5/2017 12:24:08 PM - Info bptm(pid=8108) media id J326L5 mounted on drive index 1, drivepath {3,0,0,0}, drivename HP.ULTRIUM5-SCSI.001, copy 3
1/5/2017 12:24:16 PM - Info bptm(pid=8108) INF - Waiting for positioning of media id J326L5 on server srv-avbkp for writing.
1/5/2017 12:24:31 PM - Error bptm(pid=8108) ioctl (MTWEOF) failed on media id J326L5, drive index 1, The media is write protected. (19) (bptm.c.22862)
1/5/2017 12:24:31 PM - Error bptm(pid=8108) FROZE media id J326L5, could not write tape mark to begin new image
1/5/2017 12:24:31 PM - current media J326L5 complete, requesting next resource Any
1/5/2017 12:25:51 PM - current media -- complete, awaiting next media Any Reason: Drives are in use, Media Server: srv-avbkp,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Monthly_Offsite, Storage Unit: ST-Backup, Drive Scan Host: N/A

I have unfrozen that tape, and it is now in use again with the other tape drive, which is currently working.

Once these backups have completed, I will share the bptm logs with you.

regards

M.Amir

Marianne · ‎01-04-2017

The firmware in your tape drive thinks that the media is write-protected:

Error bptm(pid=8108) ioctl (MTWEOF) failed on media id J326L5, drive index 1, The media is write protected. (19) (bptm.c.22862)

If J326L5 is NOT write-protected, then there is something wrong with the firmware.


mph999 · ‎01-05-2017

The tape alert is as Marianne suggests:

0x00800000 0x00000000

Flag 9: Write protect. Severity: Critical

If the tape is 'not' write protected via the little switch, then you have a drive fault, as the drive itself, not NBU or anything else, is reporting that it is. The tape alerts are sent directly from the tape drive firmware.
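
For reading other TAPE_ALERT entries in the errors file, the two hex words can be decoded along these lines. This is a sketch under one assumption: flags are numbered from the most significant bit of the first word, which is consistent with 0x00800000 0x00000000 being flag 9 as stated above. The function names are hypothetical, and only flag names quoted in this thread are included.

```python
# Flag names as quoted by posters in this thread; anything else is left unnamed.
KNOWN_FLAGS = {
    0x09: "Write protect",
    0x0F: "Failure of cartridge memory chip",
    0x12: "Tape directory corrupted on load",
}


def decode_tape_alert(word1: str, word2: str) -> list[int]:
    """Return the flag numbers (1-64) set in the two 32-bit hex words.

    Assumption: flag 1 is the top bit of the first word, flag 64 the
    lowest bit of the second word.
    """
    value = (int(word1, 16) << 32) | int(word2, 16)
    return [n for n in range(1, 65) if value & (1 << (64 - n))]


def describe(word1: str, word2: str) -> list[str]:
    """Human-readable lines for each flag that is set."""
    return [
        f"Flag {n} (0x{n:02X}): {KNOWN_FLAGS.get(n, 'not named in this thread')}"
        for n in decode_tape_alert(word1, word2)
    ]


print(describe("0x00800000", "0x00000000"))
```

For flags that this sketch leaves unnamed, the TapeAlert flag table in the drive vendor's documentation is the place to look.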

Nicolai · ‎01-05-2017

Suspend these tapes and see if the problem persists. This is the list of unique tapes from the db/media/errors file.

e.g: bpmedia -m J332L5 -suspend

J332L5
J334L5
3798L5
0044L5
0043L5
0049L5
A384L5
J326L5
3792L5
J325L5
J333L5

The tape structure of these tapes looks damaged. A tape has a file table just like a disk; if this file table has been damaged, e.g. by a power loss, it could give the unable to space to end of file error. When you suspend a tape, it will be reused once all its images have expired.
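
If suspending eleven tapes by hand is tedious, the commands can be generated from the list. The sketch below only builds and prints the command lines in the same form as the example above; it does not run anything. `suspend_commands` is a hypothetical helper name.

```python
MEDIA_IDS = """
J332L5 J334L5 3798L5 0044L5 0043L5 0049L5
A384L5 J326L5 3792L5 J325L5 J333L5
""".split()


def suspend_commands(media_ids):
    """Build one 'bpmedia -m <id> -suspend' argument list per unique media ID."""
    seen = set()
    commands = []
    for media_id in media_ids:
        if media_id in seen:
            continue  # keep the list unique, preserving order
        seen.add(media_id)
        commands.append(["bpmedia", "-m", media_id, "-suspend"])
    return commands


for command in suspend_commands(MEDIA_IDS):
    # Printed for review; run them on the master/media server once checked.
    print(" ".join(command))
```

Each printed line can be pasted into a command prompt on the NetBackup server, or the argument lists could be passed to `subprocess.run` there once the output has been reviewed.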

mph999 · ‎01-05-2017

Good point Nicolai - though if there are cartridge memory issues that should also show as a tape alert.

Usually

0x0F: 'Failure of cartridge memory chip',
0x12: 'Tape directory corrupted on load',

Certainly agree with suspending the tapes though ...

Marianne · ‎01-05-2017

The 'write-protect' error seems to be intermittent.

@AmirJabran81 says if he unfreezes the tape it will be written fine at the next attempt.

So, probably firmware issue rather than faulty media?

Issue seems to be present for quite a while - even when LTO3 media was loaded in the LTO5 drives (probably for restore?):

03/26/15 17:02:33 0003L3 0 TAPE_ALERT HP.ULTRIUM5-SCSI.000 0x00008000 0x00000000


AmirJabran81 · ‎01-05-2017

Thanks for the update.

How can I repair the damaged tape structure of these tapes?

The tapes mentioned are the only tapes in the library that are free to use.

I will also discuss the tape alerts with the hardware vendor, since you said it may be a firmware issue.

The tapes mentioned are my monthly and yearly backup tapes; they will go offsite next week.

I will try suspending these and check.

regards

M.Amir

Marianne · ‎01-05-2017

You might as well leave the tapes FROZEN rather than unfreeze and then suspend them.

Either way - suspended or frozen tapes can not be written to.
You will have to enter a new set of tapes in the robot.
