01-04-2017 01:56 AM
Hi All,
My current environment is as follows:
Server hardware: Dell R710 with an external HBA: QLogic (QLE2562)
OS = Windows 2008 R2 Enterprise Edition
NetBackup Server = 7.6.0.4
Tape Library = HP MSL2024 with two FC tape drives
Problem:
After one year of running successfully, my tape drives now go down frequently, either one at a time or, most of the time, both together.
The physical status of the library and tape drives looks fine: when I log on to the MSL2024 web console, it shows the drives as Up and Ready.
The only error is the following in the Windows Event Viewer (Application log):
"Operator/EMM server has DOWN'ed drive HP.ULTRIUM5-SCSI.001 (device 1)" .
Please advise whether this is a NetBackup issue or a problem with the HP library or the tapes. The HP vendor thoroughly checked and tested the drives and said they are fine.
Waiting for your response.
Thanks & Regards
M.Amir
01-04-2017 02:51 AM
Most often, this error in NetBackup is an indication of a problem elsewhere in the infrastructure/hardware.
My list of things to check is:
1) Library console/panel (Seems you have done that already)
2) SAN connectivity: is the drive logged into the fabric? (There has been a problem with some HP libraries/tape drives where they logged out after a while)
3) HBA tool: can the drives be seen here, and is persistent binding still in place?
4) Device manager in OS
And of course all the related/relevant logs, like bptm, the media errors file, OS event logs, and the library, tape drive and SAN switch logs.
What is the cleaning state of these drives? I have often seen issues with drives that needed to be cleaned.
01-04-2017 02:53 AM
Can you show us the [install_path]/netbackup/db/media/errors file?
01-04-2017 03:02 AM
Hello!
There are a number of files. Which one should I send, the latest one?
01-04-2017 03:07 AM
Hello!
Thanks for the update. Let me answer your questions.
1) Library console/panel (Seems you have done that already)
Yes, it is working fine with no errors.
2) SAN connectivity: is the drive logged into the fabric? (There has been a problem with some HP libraries/tape drives where they logged out after a while)
I have connected the tape library directly to the Dell server, without any SAN switch.
3) HBA tool: can the drives be seen here, and is persistent binding still in place?
Sorry, I did not understand what information you need.
4) Device manager in OS
Yes, the latest drivers for the tape library and tape drives are installed, along with the latest firmware.
And of course all the related/relevant logs, like bptm, the media errors file, OS event logs, and the library, tape drive and SAN switch logs.
What is the cleaning state of these drives? I have often seen issues with drives that needed to be cleaned.
Once a month or as required. The behaviour is the same after cleaning.
01-04-2017 04:25 AM
Please send the errors file as previously requested; it covers the most recent issues of drives going down. Usually there is only one errors file.
...netbackup\db\media\errors
Please also create the ..netbackup\logs\bptm folder.
Via the GUI > Host Properties > Logs you can set the bptm verbose level to 5.
Create these folders:
<install path>\veritas\volmgr\debug\tpcommand
<install path>\veritas\volmgr\debug\robots
Add the word VERBOSE to the file <install path>\veritas\volmgr\vm.conf
Create these empty files:
<install path>\veritas\volmgr\DRIVE_DEBUG
<install path>\veritas\volmgr\ROBOT_DEBUG
(Make sure Windows does not add any file suffix)
Restart NBU.
Wait for the issue to repeat, then collect the above logs, along with the Activity Monitor details showing the issue for the job.
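If you would rather script the volmgr part of these steps, here is a minimal Python sketch. The function name `enable_volmgr_debug` and its `volmgr_dir` argument are my own (hypothetical); pass the real `<install path>\veritas\volmgr` folder for your installation. It does not create the bptm folder, change the bptm verbose level, or restart NBU.

```python
from pathlib import Path


def enable_volmgr_debug(volmgr_dir: Path) -> None:
    """Create the debug folders, touch files and VERBOSE entry under volmgr_dir."""
    # Debug log folders.
    for name in ("tpcommand", "robots"):
        (volmgr_dir / "debug" / name).mkdir(parents=True, exist_ok=True)

    # Empty marker files; touch() creates them without any suffix such as ".txt".
    for name in ("DRIVE_DEBUG", "ROBOT_DEBUG"):
        (volmgr_dir / name).touch()

    # Add VERBOSE to vm.conf only if it is not already there.
    vm_conf = volmgr_dir / "vm.conf"
    lines = vm_conf.read_text().splitlines() if vm_conf.exists() else []
    if "VERBOSE" not in (line.strip() for line in lines):
        lines.append("VERBOSE")
        vm_conf.write_text("\n".join(lines) + "\n")
```

Running it twice is harmless: existing folders and files are left alone and VERBOSE is not duplicated.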
01-04-2017 04:38 AM
There should be only one file called 'errors' in the <install_path>\Veritas\NetBackup\db\media folder.
Please copy this file to errors.txt and upload as attachment (use 'Choose Files' link below the Reply screen).
The errors file will show us failure trends - same drive(s) with different media or same media with different/same drives...
Are you seeing many backup failures with status such as 84?
NBU will automatically DOWN a drive after 3 failures within one hour.
If you UP the drive and start another backup (or restart the same backup) and it fails again, the same logic of 3 failures within 1 hour will be applied and the drive will be DOWN'ed again.
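To make the "3 failures within one hour" rule concrete, here is a small illustrative sketch of a sliding-window counter. It is not NetBackup's actual implementation; the class and method names are hypothetical, and it assumes failures are recorded in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class DriveErrorWindow:
    """Illustrative only: report when `limit` failures fall within `window`."""

    def __init__(self, limit=3, window=timedelta(hours=1)):
        self.limit = limit
        self.window = window
        self.failures = deque()

    def record_failure(self, when: datetime) -> bool:
        """Record a failure; return True if the drive would be DOWN'ed."""
        self.failures.append(when)
        # Forget failures that are older than the window, measured from the newest one.
        while when - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.limit
```

With failures at 09:00, 09:20 and 09:40, the third call returns True. With failures at 09:00, 09:50 and 10:10, the first one has aged out by the time of the third, so the drive stays UP.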
So, we need to see what is causing the backup failures.
Verbose bptm log will be a good start.
In the meantime, please upload 'errors' file.
Have you checked Event Viewer System log for hardware errors?
01-04-2017 05:43 AM
In [install_path]/netbackup/db/media, run:
more errors
The file should look something like:
12/15/16 19:59:53 A37482 6 WRITE_ERROR 0211
12/23/16 19:31:43 A32855 3 POSITION_ERROR 11112
12/23/16 20:00:47 A32855 3 POSITION_ERROR 11112
12/23/16 22:03:54 A32855 3 POSITION_ERROR 11112
12/23/16 22:30:43 A32855 3 POSITION_ERROR 11112
12/25/16 18:16:03 A23704 5 WRITE_ERROR 1110
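If the file is long, a short script can show the trends (same media across drives, or same drive across media). This is only a sketch: it assumes the whitespace-separated columns are date, time, media ID, drive index and error type, as in the sample above, and the function name `summarize_errors` is hypothetical.

```python
from collections import Counter

# The sample lines from above; in practice, read the text of the errors file instead.
SAMPLE = """\
12/15/16 19:59:53 A37482 6 WRITE_ERROR 0211
12/23/16 19:31:43 A32855 3 POSITION_ERROR 11112
12/23/16 20:00:47 A32855 3 POSITION_ERROR 11112
12/23/16 22:03:54 A32855 3 POSITION_ERROR 11112
12/23/16 22:30:43 A32855 3 POSITION_ERROR 11112
12/25/16 18:16:03 A23704 5 WRITE_ERROR 1110
"""


def summarize_errors(text):
    """Count errors per (media ID, error type) and per (drive index, error type)."""
    by_media, by_drive = Counter(), Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        _date, _time, media_id, drive_index, error_type = fields[:5]
        by_media[(media_id, error_type)] += 1
        by_drive[(drive_index, error_type)] += 1
    return by_media, by_drive


by_media, by_drive = summarize_errors(SAMPLE)
for key, count in by_media.most_common():
    print(key, count)
```

In the sample, all four POSITION_ERROR entries are for media A32855 on drive index 3, which is the kind of pattern to look for.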
01-04-2017 08:30 PM
Here is the error.txt file, attached.
01-04-2017 11:05 PM
Hopefully @Nicolai and @mph999 will be along soon with interpretation of the TapeAlerts seen in the errors file.
If this post is correct :
https://vox.veritas.com/t5/NetBackup/TapeAlert-help-netbackup-db-media-errors/td-p/217530
then yours is also a 9.
In this TN: http://www.veritas.com/docs/000005226 we see that 0x9 means 'Cartridge write protected'. Is it possible that someone inserted a bunch of write-protected tapes in the robot?
(Martin and/or Nicolai will correct me if my interpretation is wrong...)
Can you show us Details tab of a failed backup?
According to the errors file there would've been a failed backup at 09:26 this morning.
Have you created bptm log as suggested by Martin yesterday?
With VERBOSE in vm.conf and logging level for bptm increased?
01-04-2017 11:14 PM
Hi all
The tapes are not write-protected, and I have not write-protected them. Sometimes, though, a tape is reported that way: the backup fails and the tape is frozen. When I unfreeze it, the same tape is then written successfully.
I have enabled the logging and will share the logs with you soon.
Regards
M.Amir
01-04-2017 11:20 PM
My interpretation of the Tape_Alert is probably incorrect.
Verbose bptm log will help.
Can you please show us the job details of this morning's failed backup?
01-04-2017 11:29 PM
Hello
Let me clarify one thing: in our setup, the backup is first written to a DSU (disk storage unit) and then staged to tape.
Here are the job details showing the error after which the tape drive went down:
1/4/2017 4:43:21 PM - begin Duplicate
1/4/2017 4:43:22 PM - requesting resource ST-Backup
1/4/2017 4:43:22 PM - awaiting resource ST-Backup - No drives are available
1/5/2017 10:19:29 AM - awaiting resource ST-Backup - Maximum job count has been reached for the storage unit
1/5/2017 10:19:33 AM - awaiting resource ST-Backup - No drives are available
1/5/2017 11:06:18 AM - awaiting resource ST-Backup Reason: Drives are in use, Media Server: srv-avbkp,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Monthly_Offsite, Storage Unit: ST-Backup, Drive Scan Host: N/A
1/5/2017 11:11:29 AM - awaiting resource ST-Backup - No drives are available
1/5/2017 12:22:59 PM - granted resource J326L5
1/5/2017 12:22:59 PM - granted resource HP.ULTRIUM5-SCSI.001
1/5/2017 12:22:59 PM - granted resource ST-Backup
1/5/2017 12:23:00 PM - Info bptm(pid=8108) start
1/5/2017 12:23:00 PM - started process bptm (8108)
1/5/2017 12:23:01 PM - Info bptm(pid=8108) start backup
1/5/2017 12:23:02 PM - Info bpdm(pid=6996) started
1/5/2017 12:23:02 PM - started process bpdm (6996)
1/5/2017 12:23:02 PM - Info bpdm(pid=6996) reading backup image
1/5/2017 12:23:02 PM - Info bpdm(pid=6996) using 30 data buffers
1/5/2017 12:23:02 PM - Info bptm(pid=8108) Waiting for mount of media id J326L5 (copy 3) on server srv-avbkp.
1/5/2017 12:23:02 PM - started process bptm (8108)
1/5/2017 12:23:02 PM - mounting J326L5
1/5/2017 12:23:02 PM - Info bptm(pid=8108) INF - Waiting for mount of media id J326L5 on server srv-avbkp for writing.
1/5/2017 12:23:02 PM - begin reading
1/5/2017 12:24:08 PM - Info bptm(pid=8108) media id J326L5 mounted on drive index 1, drivepath {3,0,0,0}, drivename HP.ULTRIUM5-SCSI.001, copy 3
1/5/2017 12:24:16 PM - Info bptm(pid=8108) INF - Waiting for positioning of media id J326L5 on server srv-avbkp for writing.
1/5/2017 12:24:31 PM - Error bptm(pid=8108) ioctl (MTWEOF) failed on media id J326L5, drive index 1, The media is write protected. (19) (bptm.c.22862)
1/5/2017 12:24:31 PM - Error bptm(pid=8108) FROZE media id J326L5, could not write tape mark to begin new image
1/5/2017 12:24:31 PM - current media J326L5 complete, requesting next resource Any
1/5/2017 12:25:51 PM - current media -- complete, awaiting next media Any Reason: Drives are in use, Media Server: srv-avbkp,
Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
Volume Pool: Monthly_Offsite, Storage Unit: ST-Backup, Drive Scan Host: N/A
I have unfrozen that tape, and it is now in use again in the other tape drive, which is currently working.
Once these backups complete, I will share the bptm logs with you.
Regards
M.Amir
01-04-2017 11:35 PM
The firmware in your tape drive thinks that the media is write-protected:
Error bptm(pid=8108) ioctl (MTWEOF) failed on media id J326L5, drive index 1, The media is write protected. (19) (bptm.c.22862)
If J326L5 is NOT write-protected, then there is something wrong with the firmware.
01-05-2017 12:13 AM
The tape alert is as Mariane suggests
0x00800000 0x00000000
Flag 9: Write protect. Severity: Critical
If the tape is 'not' write-protected via the little switch, then you have a drive fault, as the drive itself, not NBU or anything else, is reporting that it is. The tape alerts are sent directly from the tape drive firmware.
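For reference, the step from the two hex words to a flag number can be sketched like this. It assumes flag 1 is the most significant bit of the first word, which is consistent with 0x00800000 giving flag 9 here; the function name is hypothetical, and the name table only lists flags mentioned in this thread.

```python
# Flag names quoted in this thread only; not a complete TapeAlert table.
FLAG_NAMES = {
    9: "Write protect",
    15: "Failure of cartridge memory chip",
    18: "Tape directory corrupted on load",
}


def tape_alert_flags(word1: str, word2: str):
    """Return the flag numbers (1-64) set in the two 32-bit hex words."""
    value = (int(word1, 16) << 32) | int(word2, 16)
    # Assumption: flag n is bit (64 - n), counting from the least significant bit.
    return [n for n in range(1, 65) if value & (1 << (64 - n))]


for flag in tape_alert_flags("0x00800000", "0x00000000"):
    print(flag, FLAG_NAMES.get(flag, "(name not listed here)"))
```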
01-05-2017 01:07 AM
Suspend these tapes and see if the problem persists. This is the list of unique tapes from the db/media/errors file.
e.g.: bpmedia -m J332L5 -suspend
J332L5
J334L5
3798L5
0044L5
0043L5
0049L5
A384L5
J326L5
3792L5
J325L5
J333L5
The tape structure of these tapes looks damaged. A tape has a file table just like a disk; if this file table has been damaged, e.g. by a power loss, it could give the "unable to space to end of file" error. When you suspend a tape, it will be reused once all its images have expired.
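To avoid typing the command eleven times, a small sketch like the one below prints one command per tape for you to review and run. It only builds the command text in the form shown above and does not execute anything; the function name is hypothetical.

```python
# Media IDs from the list above.
TAPES = """
J332L5 J334L5 3798L5 0044L5 0043L5 0049L5
A384L5 J326L5 3792L5 J325L5 J333L5
""".split()


def suspend_commands(media_ids):
    """Build one 'bpmedia ... -suspend' command line per unique media ID."""
    unique = list(dict.fromkeys(media_ids))  # drop duplicates, keep order
    return [f"bpmedia -m {media_id} -suspend" for media_id in unique]


for command in suspend_commands(TAPES):
    print(command)
```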
01-05-2017 01:15 AM
Good point Nicolai - though if there are cartridge memory issues that should also show as a tape alert.
Usually
0x0F: 'Failure of cartridge memory chip',
0x12: 'Tape directory corrupted on load',
I certainly agree with suspending the tapes, though ...
01-05-2017 01:22 AM
The 'write-protect' error seems to be intermittent.
@AmirJabran81 says if he unfreezes the tape it will be written fine at the next attempt.
So, probably firmware issue rather than faulty media?
Issue seems to be present for quite a while - even when LTO3 media was loaded in the LTO5 drives (probably for restore?):
03/26/15 17:02:33 0003L3 0 TAPE_ALERT HP.ULTRIUM5-SCSI.000 0x00008000 0x00000000
01-05-2017 01:23 AM
Thanks for the update.
How can I repair the damaged tape structure on these tapes?
The tapes mentioned are the only tapes in the library that are free to use.
I will also discuss the tape alerts with the hardware vendor, since you said it could be a firmware issue.
The tapes mentioned are my monthly and yearly backup tapes; they will go offsite next week.
I will try suspending them and check.
Regards
M.Amir
01-05-2017 02:08 AM
You might as well leave the tapes FROZEN rather than unfreeze and then suspend them.
Either way, suspended or frozen tapes cannot be written to.
You will have to enter a new set of tapes in the robot.