Four drives attached to a NAS filer go down every day

Mack_Disouza
Level 4

Hello Friends,

Attached are the BPTM logs at verbosity 5. The drives go down every day.

Actions taken so far:

(1) Cleaned the drives.

(2) Reconfigured the drives through the Device Configuration Wizard.

(3) Stopped and restarted the NB services.

 

Please advise what to do next to resolve this problem.

16 REPLIES

RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified

Hi,

 

You have a lot of errors in there: read errors, position errors, unload errors. This is probably a hardware issue, as the errors are all over the place. Do the library and drives work fine with regular backups?

Mack_Disouza
Level 4

There is no error at the library end. The drives are always up from the library end. No error logs have been found and the library has raised no RAS ticket so far. The drives go down in NetBackup only. I have only one NDMP filer to back up as a client.

The library and drives are attached to and configured on the NAS filer.

[root@RUEH2BKP2 bptm]# nbemmcmd -listhost

NBEMMCMD, Version: 7.6.0.1
The following hosts were found:
server           RUEH2BKP2
master           RUEH2BKP2
ndmp             narue06a
Command completed successfully.

 

mph999
Level 6
Employee Accredited


I'm not convinced this is going to be a NBU fault.

We try to read the media header. It is 1K in size and is read into a 64K buffer (buflen = 65536).
04:51:21.829 [32394] <2> io_read_media_header: drive index 2, reading media header, buflen = 65536, buff = 0x0x23a1820, copy 1

Tape is rewound ...
04:51:21.829 [32394] <2> io_ioctl: command (6)MTREW 1 0x0 from (bptm.c.8311) on drive index 2

Read header
04:51:26.304 [32394] <2> io_read_media_header: ndmp_tape_read_func returned 1024

Skips forward
04:51:26.304 [32394] <2> io_ioctl: command (1)MTFSF 1 0x0 from (bptm.c.8563) on drive index 2

This is successful, so the tape header can be read

Next we try to position to the end of the last backup, so after the 12th image
04:51:26.473 [32394] <2> io_position_for_write: position media id 000053, copy 1, current number images = 12

Try to skip forward 12 tapemarks
04:51:26.473 [32394] <2> io_position_for_write: skip forward 12 tapemarks, copy 1
04:51:26.473 [32394] <2> io_ioctl: command (1)MTFSF 12 0x0 from (bptm.c.7156) on drive index 2

04:52:16.628 [32394] <2> Media_siginfo_print: 0: delay 0 signo SIGHUP:1 code 0 pid 32391

I'm not sure exactly what this line means, but I 'think' it is relevant, although only marked as a <2>
04:52:16.628 [32394] <2> Media_library_signal_poll: 1:Terminate detected

Either way, the tape does not end up positioned where it should be, because we cannot read the header block that we look for to confirm we are in the right place

04:52:16.629 [32394] <2> io_read_block: ndmp_tape_read_func returned 18
04:52:16.629 [32394] <2> set_job_details: Tfile (1386): LOG 1402455136 8 bptm 32394 read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)
04:52:16.629 [32394] <2> send_job_file: job ID 1386, ftype = 3 msg len = 131, msg = LOG 1402455136 8 bptm 32394 read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)
04:52:16.629 [32394] <8> io_read_block: read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)
04:52:16.629 [32394] <2> io_position_for_write: error, rewind and retry
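If you want to pull the same timeline out of a long bptm log yourself, here is a minimal Python sketch. It assumes every line has the shape seen in the excerpts above (time, [pid], <severity>, function name, message), and treating severity 8 and above as "worth a look" is my assumption, not a NetBackup rule. The function names parse_line and seconds_between are made up for this example. It does not handle a log that crosses midnight.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpts quoted in this thread:
# HH:MM:SS.mmm [pid] <severity> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\] "
    r"<(?P<sev>\d+)> "
    r"(?P<func>\w+): "
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for one bptm line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%H:%M:%S.%f")
    rec["pid"] = int(rec["pid"])
    rec["sev"] = int(rec["sev"])
    return rec


def seconds_between(first, second):
    """Elapsed seconds between two parsed records (same day assumed)."""
    return (second["time"] - first["time"]).total_seconds()


# Three lines copied from the log excerpt above.
SAMPLE = """\
04:51:26.473 [32394] <2> io_ioctl: command (1)MTFSF 12 0x0 from (bptm.c.7156) on drive index 2
04:52:16.628 [32394] <2> Media_library_signal_poll: 1:Terminate detected
04:52:16.629 [32394] <8> io_read_block: read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)
"""

records = [r for r in map(parse_line, SAMPLE.splitlines()) if r]
elapsed = seconds_between(records[0], records[1])
errors = [r for r in records if r["sev"] >= 8]  # assumed threshold

print(f"MTFSF 12 to terminate signal: {elapsed:.3f} s")
for r in errors:
    print(r["func"], "->", r["msg"])
```

On these three lines it shows roughly 50 seconds between the MTFSF 12 request and the terminate signal, with the read error logged immediately afterwards. That gap is the thing to raise with the filer vendor.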


The tape positioning is effectively at SCSI/OS level, not NBU.

This is not unlike the issue in TECH159543, where the NDMP NAS vendor took responsibility for the issue in the TN. I appreciate your symptoms are not quite the same, but they are close.
The reason I mention the TN is really to show you that these errors are not always the fault of NBU.

In this case, I would suggest you start troubleshooting with the NDMP vendor.

Mack_Disouza
Level 4

I really want to make sure that it's not a NetBackup problem.

===========

[root@RUEH2BKP2 bin]#  tpautoconf -verify narue06a
-bash: tpautoconf: command not found
[root@RUEH2BKP2 bin]# ./tpautoconf -verify narue06a
Connecting to host "narue06a" as user "root"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
  host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
  host name "narue06a"
  os type "NetApp"
  os version "NetApp Release 8.1.3 7-Mode"
  host id "1573798426"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
Opening SCSI device "mc0"...
Inquiry result is "ADIC    Scalar i500     643G643G.GS002         "

 

mph999
Level 6
Employee Accredited

Issues of any sort have to be investigated. When starting an investigation it is reasonable to look at the error (obviously), consider previous experience, and consider the most likely cause. It would not be wise to start looking in the least likely place; you look in the most likely place first.

So, what I am saying is that you cannot say with 100% certainty that it is or is not NBU; you can only go on what we know. Positioning errors are rarely caused by NBU, and the tools NBU provides to investigate them are fairly limited.

 

As a matter of interest, are these drives shared with other devices (SSO)?

If they are shared, you need to be 100% sure that each device is using the same SCSI reservation type (e.g. SPC-2 or persistent).

Or, in other words:

If a device is accessed by multiple 'hosts' and more than one type of SCSI reservation is in use, then quite simply you're going to have major issues.

Also ...

Does this happen 100% of the time?

When did the issue start?

Did it ever work?

Does it fail if using a 'blank' tape?

Those questions will be useful, but provided the SCSI reservation is consistent, my original advice stands: start with the vendor and see what they say. When speaking with them you need to explain that the tape is positioned, but that when we try to read the tape, we do not appear to be at the point we expect to be.

You could run an mcontents report on the tape (I think this works on NDMP tapes, but I am not 100% sure):

bpmedialist -mcontents -m <media id>

Run this from the media server that you got the bptm log from.

 

Mack_Disouza
Level 4

Hello mph999, the technote TECH159543 is not available.

Marianne
Moderator
Partner    VIP    Accredited Certified

Please check SCSI reservation on Filer as well as media server(s) - as per Martin's excellent advice.

Extract from NDMP Appliance Information: http://www.symantec.com/docs/TECH31885  :

Tips for control and configuration
■ For NDMP devices to share tape drives, tape reservation must be enabled in
the ONTAP software on the filer as well as in NetBackup. You can use either
SCSI persistent reservation or SCSI reservation. To share tape drives, note
that the drive itself must support one of these types of reservation.

To enable SCSI reservation in Data ONTAP, enter either of the following at
the ONTAP command line on the filer:
options tape.reservations scsi
options tape.reservations persistent

To enable SCSI reservation in the NetBackup Administration Console, go to
Host Properties > Media Servers > double click the media server >
Properties > Media. Make sure to select the same type of SCSI reservation as
you set on the filer.
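
The "same type everywhere" rule above can be sketched as a small Python check. This is only an illustration: the host names are the ones from this thread, but the reservation values are hypothetical placeholders you would fill in by hand after looking at `options tape.reservations` on the filer and at the Media host properties in NetBackup. Nothing here queries either product.

```python
from collections import defaultdict


def group_by_reservation(settings):
    """Group host names by their (normalized) reservation type."""
    groups = defaultdict(list)
    for host, kind in settings.items():
        groups[kind.strip().lower()].append(host)
    return dict(groups)


def is_consistent(settings):
    """True when every host uses the same reservation type."""
    return len(group_by_reservation(settings)) <= 1


# Hypothetical values, entered manually for illustration only.
observed = {
    "narue06a": "persistent",  # filer: options tape.reservations
    "RUEH2BKP2": "scsi",       # media server host property
}

groups = group_by_reservation(observed)
if is_consistent(observed):
    print("Reservation type is consistent:", next(iter(groups)))
else:
    print("Mismatch, fix before sharing drives:", groups)
```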

jim_dalton
Level 6

Take NetBackup out of the equation: write the same data to the same drive/tape directly on the filer. Then there is no NetBackup involved at all. To me it looks like it's not a NB issue, but then I would expect a drive error, an error light, or something on the drive display itself. These are LTO6: maybe there's an incompatibility with the drivers being rather new, i.e. is the filer compatible with LTO6 and with the latest drivers? If you have older LTO drives it might be worth attempting the same test with them.

Jim

jim_dalton
Level 6

...and if these are fibre attached via a switch, check the port/s for errors too. Also check how the drives are configured as I see they support media partitioning. Jim 

mph999
Level 6
Employee Accredited

Apologies, the TN must be internal. No worries, all I was demonstrating was that these issues can be caused by the filer.

 

Yasuhisa_Ishika
Level 6
Partner Accredited Certified

NetApp returns NDMP_XDR_DECODE_ERR. This return code is generic, so you should enable debug logging of the NDMP service on NetApp Data ONTAP, reproduce this issue, and look into the debug logs.

filer> options ndmpd.debug.enable on
filer> options ndmpd.debug.filter normal

NDMP log is /etc/log/ndmpd.log on Data ONTAP 8.1.x.

 

Yasuhisa_Ishika
Level 6
Partner Accredited Certified

My mistake.

/etc/log/ndmpd.log -> /etc/log/mlog//ndmpd.log

 

FAQ: What data should be collected to troubleshoot NDMP operations?

https://kb.netapp.com/support/index?page=content&id=3013954

Mack_Disouza
Level 4

The drives are not shared. We have 4 drives in a Quantum library, and the library is directly attached to the NAS filer.

The NDMP host itself acts as the media server; the robotic control host is the master server.

Marianne
Moderator
Partner    VIP    Accredited Certified

Please check ndmp logs as per Yasuhisa's advice.

Mark_Solutions
Level 6
Partner Accredited Certified

Is it direct attached or via a switch?

If direct then fine - as the guys say it is most likely a filer issue - if not then double check the zoning to make sure no other server can get near those 4 drives.

Mark_Solutions
Level 6
Partner Accredited Certified

Very long log file but really only two errors as such in there...

000053 looks to be a new tape and doesn't get written to.

000056 doesn't get positioned and then does not unload either.

Based on 000053 being listed as a new tape, I am guessing that 000056 is too? (How many images are listed on that tape, if any?)

So if it is a nice new LTO6 tape library using new tapes and you have only got as far as tape 53 in your system so far then it may just be a new drive issue (or firmware issue - see later).

Quantum no longer clean and process drives before they are sent out to customers due to costs - it is always worth cleaning a new drive at least 3 times before using it so well worth doing so now.

Next point is what media are you using - some are notorious for leaving dirt on the drive heads - but then if it is actually struggling to even unload 000056 then they may well not be great tapes?

My feeling is that the drives all need a good cleaning (three times each before further use) and it may be combined with poor media

One other thing that helps is to make sure the library and drives are all on the latest firmware. I cannot see what the drives are on, but your library is on 643G and the current release is 646G, so you are behind there.

The latest drive firmware is:

HP LTO-6 FC: J3LZ
HP LTO-6 SAS: O31Z
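
For anyone wanting to read the firmware level off the tpautoconf output rather than the library panel, here is a rough Python sketch. It assumes the quoted inquiry string follows the standard SCSI INQUIRY layout (8 characters of vendor, 16 of product, 4 of revision), which matches the string posted earlier in this thread, and it assumes the library firmware strings sort by their leading number. The helper names split_inquiry and firmware_key are made up for this example.

```python
import re


def split_inquiry(inquiry):
    """Split an inquiry string into (vendor, product, revision).

    Assumes the standard SCSI INQUIRY field widths: 8, 16 and 4 characters.
    """
    return (
        inquiry[0:8].strip(),
        inquiry[8:24].strip(),
        inquiry[24:28].strip(),
    )


def firmware_key(rev):
    """Turn a revision such as '643G' into a sortable (number, suffix) pair.

    Ordering by the leading number is an assumption about this vendor's scheme.
    """
    m = re.fullmatch(r"(\d+)([A-Z]?)", rev)
    if m is None:
        raise ValueError(f"unexpected firmware format: {rev!r}")
    return int(m.group(1)), m.group(2)


# Inquiry string exactly as printed by tpautoconf -verify in this thread.
INQUIRY = "ADIC    Scalar i500     643G643G.GS002         "
LATEST = "646G"  # current library release quoted in the post above

vendor, product, installed = split_inquiry(INQUIRY)
behind = firmware_key(installed) < firmware_key(LATEST)

print(vendor, product, "firmware", installed)
print("Update available" if behind else "Up to date", "(latest:", LATEST + ")")
```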

Hope this helps