
Netbackup scsi errors on Solaris hw - what do they mean?

afacey
Level 4
Hi All,

NBU 6.5.4
Solaris T2000 master/media server
Qualstar TLS-8332 with 4 LTO2 & 4 LTO3 drives, SCSI connected

Recently I have been getting SCSI errors reported in the messages file on my Solaris server.

1. Where can I find out what they mean? Here is an example:

Feb  4 18:20:36 polaris scsi: [ID 107833 kern.warning] WARNING: /pci@780/pci@0/pci@8/pci@0/scsi@8,1/st@2,0 (st8):
Feb  4 18:20:36 polaris         Error for Command: write file mark         Error Level: Fatal
Feb  4 18:20:36 polaris scsi: [ID 107833 kern.notice]   Requested Block: 5963                      Error Block: 5963
Feb  4 18:20:36 polaris scsi: [ID 107833 kern.notice]   Vendor: IBM                                Serial Number:            
Feb  4 18:20:36 polaris scsi: [ID 107833 kern.notice]   Sense Key: Media Error
Feb  4 18:20:36 polaris scsi: [ID 107833 kern.notice]   ASC: 0x52 (cartridge fault), ASCQ: 0x0, FRU: 0x36

Feb  4 20:21:12 polaris scsi: [ID 107833 kern.warning] WARNING: /pci@780/pci@0/pci@8/pci@0/scsi@8/st@5,0 (st6):
Feb  4 20:21:12 polaris         Error for Command: write                   Error Level: Fatal
Feb  4 20:21:12 polaris scsi: [ID 107833 kern.notice]   Requested Block: 21170                     Error Block: 21170
Feb  4 20:21:12 polaris scsi: [ID 107833 kern.notice]   Vendor: IBM                                Serial Number:            
Feb  4 20:21:12 polaris scsi: [ID 107833 kern.notice]   Sense Key: Aborted Command
Feb  4 20:21:12 polaris scsi: [ID 107833 kern.notice]   ASC: 0x4b (data phase error), ASCQ: 0x0, FRU: 0x30

Feb  8 18:14:25 polaris scsi: [ID 107833 kern.warning] WARNING: /pci@7c0/pci@0/pci@8/pci@0/scsi@8,1/st@4,0 (st5):
Feb  8 18:14:25 polaris         Error for Command: space                   Error Level: Fatal
Feb  8 18:14:25 polaris scsi: [ID 107833 kern.notice]   Requested Block: 1                         Error Block: 1
Feb  8 18:14:25 polaris scsi: [ID 107833 kern.notice]   Vendor: IBM                                Serial Number:            
Feb  8 18:14:25 polaris scsi: [ID 107833 kern.notice]   Sense Key: Media Error
Feb  8 18:14:25 polaris scsi: [ID 107833 kern.notice]   ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x36

2. Two of the errors above look like media problems, but I am not sure which tape had the issue. Should I assume the next entry with a tape dismount is the culprit?
3. And is the data phase error drive related?

4. Also, how do I map /pci@780/pci@0/pci@8/pci@0/scsi@8/st@5,0 (st6) above to the rmt/# shown in iostat -En below?


The output below shows hard errors:

iostat -En
c1t0d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: LSILOGIC Product: Logical Volume   Revision: 3000 Serial No: 
Size: 73.01GB <73012215808 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t2d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: LSILOGIC Product: Logical Volume   Revision: 3000 Serial No: 
Size: 146.56GB <146561286144 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c0t0d0           Soft Errors: 9 Hard Errors: 0 Transport Errors: 1
Vendor: MATSHITA Product: CD-RW  CW-8124   Revision: DZ13 Serial No: 
Size: 1.53GB <1533480960 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 9 Predictive Failure Analysis: 0
rmt/6            Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD2      Revision: 67U1 Serial No: 
rmt/2            Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD2      Revision: 67U1 Serial No: 
rmt/3            Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD2      Revision: 67U1 Serial No: 
rmt/4            Soft Errors: 0 Hard Errors: 15 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD3      Revision: 73P5 Serial No: 
rmt/5            Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD3      Revision: 69U2 Serial No: 
rmt/7            Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD2      Revision: 67U1 Serial No: 
rmt/8            Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD3      Revision: 69U2 Serial No: 
rmt/9            Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD3      Revision: 69U2 Serial No:


5. If a piece of media has an error, the tape is not automatically marked as frozen. NBU tries it again, so the same errors can repeat as NBU reuses the tape. Is this the right way to do this? Or should I set NBU to auto-freeze the tape when a write error is detected? Is this a function of volmgr, and where is it configured?


Any help is greatly appreciated!

4 REPLIES

afacey
Level 4
But the media errors all seem to be coming from incremental tapes, which we do not remove from the library. Some have only a 1-2 week retention and others 2 months, so they get used a lot. In fact, I noticed some of the tapes have first mount dates from 2004 and mount counts of 500~1000. I think I will cycle those out.

Still looking for input on the SCSI mappings, data phase errors, etc. above.

Thanks!


Andy_Welburn
Level 6
ls -l   /dev/rmt
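
The entries under /dev/rmt are symlinks into /devices, so the physical path from the messages file can be matched against the link targets. A minimal Python sketch of that matching follows. It assumes Python 3 is available on the host; the function names and the sample link targets are made up for illustration, and the real targets come from the server itself.

```python
import os
import re

def rmt_for_device(links, device_path):
    """Return the sorted /dev/rmt numbers whose symlink target matches device_path.

    links maps a link name (e.g. "0", "0n", "0cbn") to its symlink target.
    """
    numbers = set()
    for name, target in links.items():
        # Targets look like ../../devices/<physical path>:<minor suffix>
        physical = target.split("/devices", 1)[-1].rsplit(":", 1)[0]
        match = re.match(r"\d+", name)
        if physical == device_path and match:
            numbers.add(int(match.group()))
    return sorted(numbers)

def read_rmt_links(directory="/dev/rmt"):
    """Collect the symlink targets under /dev/rmt (run this on the Solaris host)."""
    return {
        name: os.readlink(os.path.join(directory, name))
        for name in os.listdir(directory)
        if os.path.islink(os.path.join(directory, name))
    }

# Hypothetical link targets, for illustration only.
# On the server, use: links = read_rmt_links()
sample = {
    "0": "../../devices/pci@780/pci@0/pci@8/pci@0/scsi@8/st@5,0:",
    "0n": "../../devices/pci@780/pci@0/pci@8/pci@0/scsi@8/st@5,0:n",
    "1": "../../devices/pci@780/pci@0/pci@8/pci@0/scsi@8,1/st@2,0:",
}
print(rmt_for_device(sample, "/pci@780/pci@0/pci@8/pci@0/scsi@8/st@5,0"))
```

Each drive has several links (0, 0n, 0cbn and so on) that differ only in the suffix after the colon, which is why the sketch strips the suffix and keeps just the leading number.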

There are probably other ways, but I think my dinner's in the oven!

All I could find via a quick Google was:

Data Phase Error - A command could not be completed because too many parity errors occurred during the Data phase.

& this was in relation to IBM tape autoloader manual.
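
To see whether errors follow a particular drive or move around with the tapes, it may help to pull the device, sense key and ASC out of each st warning in the messages file. The sketch below is one rough way to do that in Python. The function name and regular expressions are my own and only cover the message layout posted in this thread; the sample is trimmed from the first two entries above.

```python
import re

WARNING_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+).*WARNING: (?P<path>/\S+) \((?P<inst>st\d+)\):"
)
SENSE_RE = re.compile(r"Sense Key: (?P<key>.+?)\s*$")
ASC_RE = re.compile(r"ASC: (?P<asc>0x[0-9a-fA-F]+) \((?P<text>[^)]*)\)")

def parse_st_errors(lines):
    """Group st driver warnings into one record per error."""
    records = []
    current = None
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            # A WARNING line starts a new record.
            current = {"time": m.group("ts"), "path": m.group("path"),
                       "instance": m.group("inst")}
            records.append(current)
            continue
        if current is None:
            continue
        m = SENSE_RE.search(line)
        if m:
            current["sense_key"] = m.group("key")
        m = ASC_RE.search(line)
        if m:
            current["asc"] = m.group("asc")
            current["asc_text"] = m.group("text")
    return records

# Lines trimmed from the messages file excerpt in the original post.
LOG = """\
Feb  4 18:20:36 polaris scsi: [ID 107833 kern.warning] WARNING: /pci@780/pci@0/pci@8/pci@0/scsi@8,1/st@2,0 (st8):
Feb  4 18:20:36 polaris scsi: [ID 107833 kern.notice]   Sense Key: Media Error
Feb  4 18:20:36 polaris scsi: [ID 107833 kern.notice]   ASC: 0x52 (cartridge fault), ASCQ: 0x0, FRU: 0x36
Feb  4 20:21:12 polaris scsi: [ID 107833 kern.warning] WARNING: /pci@780/pci@0/pci@8/pci@0/scsi@8/st@5,0 (st6):
Feb  4 20:21:12 polaris scsi: [ID 107833 kern.notice]   Sense Key: Aborted Command
Feb  4 20:21:12 polaris scsi: [ID 107833 kern.notice]   ASC: 0x4b (data phase error), ASCQ: 0x0, FRU: 0x30
"""

records = parse_st_errors(LOG.splitlines())
for rec in records:
    print(rec["time"], rec["instance"], rec["sense_key"], rec["asc"], rec["asc_text"])
```

The timestamp on each record can then be compared by hand with the mount and dismount times NetBackup logged for that drive.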

If you've got tapes that old I would start to replace them as you suggest.

Nicolai
Moderator
Partner    VIP   
Technote regarding the media error threshold:

How Veritas NetBackup 6.x (tm) determines if a tape should be frozen or the status of a tape drive s...

A write failure can recur many times because NetBackup contains a feature called "Resume Logic". Resume Logic will retry 5 times before giving up on a read/write operation, which can then throw a status code 84 or 85. If you have the bptm log enabled, you can see this progress.

I prefer to suspend media upon write failures instead of freezing them. Most media-related errors can be traced to tape drive defects rather than actual media failures.

Updated: Link now working

Nicolai
Moderator
Partner    VIP   
Media health: after an error, try running iostat -En. It shows how many errors were detected on the media. If none are shown, you have a SCSI bus problem.

Soft errors: Read or Write error corrected on the fly
Hard errors: Read or Write error corrected by re-reading the media.

rmt/6            Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD2      Revision: 67U1 Serial No: 

The statistics get reset after unloading the media.
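
Because the counters reset on unload, one option is to record them right after an error, while the tape is still in the drive. Here is a small Python sketch that pulls the per-drive counts out of iostat -En output. The function name is made up, the pattern only covers the header-line layout shown in this thread, and the sample lines are copied from the original post.

```python
import re

HEADER_RE = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors: (?P<soft>\d+) Hard Errors: (?P<hard>\d+) "
    r"Transport Errors: (?P<transport>\d+)"
)

def tape_error_counts(iostat_output):
    """Return {device: {"soft": n, "hard": n, "transport": n}} for rmt/* entries."""
    counts = {}
    for line in iostat_output.splitlines():
        m = HEADER_RE.match(line)
        # Keep tape devices only; disks and the CD drive are skipped.
        if m and m.group("dev").startswith("rmt/"):
            counts[m.group("dev")] = {
                key: int(m.group(key)) for key in ("soft", "hard", "transport")
            }
    return counts

# Sample lines copied from the iostat -En output in the original post.
# On the server the text could come from:
#   subprocess.run(["iostat", "-En"], capture_output=True, text=True).stdout
SAMPLE = """\
c1t0d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: LSILOGIC Product: Logical Volume   Revision: 3000 Serial No:
rmt/6            Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD2      Revision: 67U1 Serial No:
rmt/4            Soft Errors: 0 Hard Errors: 15 Transport Errors: 0
Vendor: IBM      Product: ULTRIUM-TD3      Revision: 73P5 Serial No:
"""

counts = tape_error_counts(SAMPLE)
for dev, c in sorted(counts.items()):
    print(dev, c)
```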