06-10-2014 11:23 PM
Hello Friends,
Attached are the BPTM logs at verbosity 5. The drives are going down every day.
Actions taken so far:
(1) Drive cleaning.
(2) Reconfigure drives through device configuration wizard.
(3) Stop/Start NB services.
Please advise what to do next to resolve this problem.
06-10-2014 11:56 PM
Hi,
You have a lot of errors in there: read errors, position errors, unload errors. This is probably a hardware issue, as the errors are all over the place. Do the library and drives work fine with regular backups?
06-11-2014 12:14 AM
There is no error at the library end. The drives are always up from the library end. No error logs were found and the library has raised no RAS ticket so far. The drives go down in NetBackup only. I have only one NDMP filer to back up as a client.
The library and drives are attached to and configured on the NAS filer.
[root@RUEH2BKP2 bptm]# nbemmcmd -listhost
NBEMMCMD, Version: 7.6.0.1
The following hosts were found:
server RUEH2BKP2
master RUEH2BKP2
ndmp narue06a
Command completed successfully.
06-11-2014 01:11 AM
I'm not convinced this is going to be an NBU fault.
We try to read the media header, which is 1K in size and is read into a 64K buffer:
04:51:21.829 [32394] <2> io_read_media_header: drive index 2, reading media header, buflen = 65536, buff = 0x0x23a1820, copy 1
Tape is rewound ...
04:51:21.829 [32394] <2> io_ioctl: command (6)MTREW 1 0x0 from (bptm.c.8311) on drive index 2
Read header
04:51:26.304 [32394] <2> io_read_media_header: ndmp_tape_read_func returned 1024
Skips forward
04:51:26.304 [32394] <2> io_ioctl: command (1)MTFSF 1 0x0 from (bptm.c.8563) on drive index 2
This is successful, so the tape header can be read.
Next we try to position to the end of the last backup, so after the 12th image:
04:51:26.473 [32394] <2> io_position_for_write: position media id 000053, copy 1, current number images = 12
Try to skip forward 12 tapemarks
04:51:26.473 [32394] <2> io_position_for_write: skip forward 12 tapemarks, copy 1
04:51:26.473 [32394] <2> io_ioctl: command (1)MTFSF 12 0x0 from (bptm.c.7156) on drive index 2
04:52:16.628 [32394] <2> Media_siginfo_print: 0: delay 0 signo SIGHUP:1 code 0 pid 32391
I'm not sure exactly what this line means, but I 'think' it is relevant, although only marked as a <2>
04:52:16.628 [32394] <2> Media_library_signal_poll: 1:Terminate detected
Either way, the tape doesn't get positioned where it should be, as we can't read the header on the tape that we look for to confirm we are in the right place:
04:52:16.629 [32394] <2> io_read_block: ndmp_tape_read_func returned 18
04:52:16.629 [32394] <2> set_job_details: Tfile (1386): LOG 1402455136 8 bptm 32394 read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)
04:52:16.629 [32394] <2> send_job_file: job ID 1386, ftype = 3 msg len = 131, msg = LOG 1402455136 8 bptm 32394 read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)
04:52:16.629 [32394] <8> io_read_block: read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)
04:52:16.629 [32394] <2> io_position_for_write: error, rewind and retry
Tape positioning is effectively done at the SCSI/OS level, not by NBU.
This is not unlike the issue in TECH159543, where the NDMP NAS vendor took responsibility for the issue in the TN. I appreciate your symptoms are not quite the same, but they are close.
The reason I mention the TN is really to show you that these errors are not always the fault of NBU.
In this case, I would suggest you start troubleshooting with the NDMP vendor.
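To make a long bptm log like the one walked through above easier to scan, the quoted lines can be tallied per media ID. What follows is only a minimal Python sketch, not a NetBackup tool. The line layout (time, [pid], <severity>, function: message) is inferred from the excerpts in this thread, the helper names parse_line and errors_by_media are made up for illustration, and treating <8> and above as worth attention is an assumption.

```python
import re
from collections import Counter

# Layout inferred from the excerpts quoted in this thread (an assumption):
#   HH:MM:SS.mmm [pid] <severity> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\] "
    r"<(?P<sev>\d+)> "
    r"(?P<func>\w+): "
    r"(?P<msg>.*)$"
)
MEDIA_RE = re.compile(r"media id (\w+)")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def errors_by_media(lines, min_severity=8):
    """Count lines at or above min_severity, grouped by the media ID they mention."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or int(rec["sev"]) < min_severity:
            continue
        media = MEDIA_RE.search(rec["msg"])
        counts[media.group(1) if media else "unknown"] += 1
    return counts


# Three lines copied from the log excerpt above.
sample = [
    "04:51:26.473 [32394] <2> io_position_for_write: position media id 000053, copy 1, current number images = 12",
    "04:52:16.629 [32394] <8> io_read_block: read error on media id 000053, drive index 2 reading header block, error code 18 (NDMP_XDR_DECODE_ERR)",
    "04:52:16.629 [32394] <2> io_position_for_write: error, rewind and retry",
]

print(dict(errors_by_media(sample)))
```

On a full log the same function can be fed an open file object instead of the sample list, which gives a quick view of which tapes the higher-severity lines cluster on.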
06-11-2014 01:25 AM
I really want to be sure that it's not a NetBackup problem.
===========
[root@RUEH2BKP2 bin]# tpautoconf -verify narue06a
-bash: tpautoconf: command not found
[root@RUEH2BKP2 bin]# ./tpautoconf -verify narue06a
Connecting to host "narue06a" as user "root"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
host name "narue06a"
os type "NetApp"
os version "NetApp Release 8.1.3 7-Mode"
host id "1573798426"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
Opening SCSI device "mc0"...
Inquiry result is "ADIC Scalar i500 643G643G.GS002 "
06-11-2014 02:04 AM
Issues of any sort have to be investigated. When starting an investigation it is reasonable to look at the error (obviously), consider previous experience, and consider what the most likely cause is. It would not be wise to start looking for a problem in the least likely place; you would look in the most likely place first.
So what I am saying is that you cannot say with 100% certainty that it is or is not NBU; you can only consider what we know. Positioning errors are rarely caused by NBU, and the NBU tools for investigating them are fairly limited.
As a matter of interest, are these drives shared with other devices (SSO)?
If they are shared, you need to be 100% sure that each device is using the same SCSI reservation type (e.g. SPC-2 or persistent).
Or in other words:
If a device is accessed by multiple hosts and more than one type of SCSI reservation is in use, then quite simply you're going to have major issues.
Also ...
Does this happen 100% of the time?
When did the issue start?
Did it ever work?
Does it fail when using a 'blank' tape?
Those questions will be useful, but provided the SCSI reservation is consistent, my original advice stands: start with the vendor and see what they say. When speaking with them you need to explain that the tape is positioned, but that when we try to read the tape we are apparently not at the point we expect to be.
You could run an mcontents report on the tape (I think this works on NDMP tapes, but I'm not 100% sure):
bpmedialist -mcontents -m <media id>
Run this from the media server that you got the bptm log from.
06-12-2014 03:21 AM
Hello mph999, the technote TECH159543 is not available.
06-12-2014 05:39 AM
Please check the SCSI reservation setting on the filer as well as on the media server(s), as per Martin's excellent advice.
Extract from NDMP Appliance Information: http://www.symantec.com/docs/TECH31885 :
Tips for control and configuration
■ For NDMP devices to share tape drives, tape reservation must be enabled in
the ONTAP software on the filer as well as in NetBackup. You can use either
SCSI persistent reservation or SCSI reservation. To share tape drives, note
that the drive itself must support one of these types of reservation.
To enable SCSI reservation in Data ONTAP, enter either of the following at
the ONTAP command line on the filer:
options tape.reservations scsi
options tape.reservations persistent
To enable SCSI reservation in the NetBackup Administration Console, go to
Host Properties > Media Servers > double click the media server >
Properties > Media. Make sure to select the same type of SCSI reservation as
you set on the filer.
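The "same type on both sides" rule in the extract above can be checked mechanically once the two settings have been read off by hand. This is a rough Python sketch under stated assumptions: the dictionary values are hypothetical and typed in manually (nothing here queries ONTAP or NetBackup), the labels have already been mapped to a common vocabulary, and the function names are invented for illustration.

```python
def group_by_reservation(settings):
    """Group host names by their normalised reservation type."""
    groups = {}
    for host, kind in settings.items():
        groups.setdefault(kind.strip().lower(), []).append(host)
    return groups


def is_consistent(settings):
    """True when every host reports the same reservation type."""
    return len(group_by_reservation(settings)) <= 1


# Hypothetical values for illustration only, not taken from this environment.
observed = {
    "filer": "persistent",    # as read from: options tape.reservations
    "media_server": "scsi",   # as read from: Host Properties > Media
}

print(group_by_reservation(observed))
print(is_consistent(observed))
```

With the mismatched sample values the check reports an inconsistency; setting both entries to the same type makes it pass.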
06-12-2014 06:51 AM
Take NetBackup out of the equation: write the same data to the same drive/tape directly on the filer, so that NetBackup is not involved at all. To me it looks like it's not an NB issue, but then I would expect a drive error, an error light, or something on the drive display itself. These are LTO-6: maybe there's an incompatibility, with the drivers being rather new, i.e. is the filer compatible with LTO-6 and with the latest drivers? If you have older LTO drives, it might be worth attempting the same test with them.
Jim
06-12-2014 07:05 AM
...and if these are fibre attached via a switch, check the port(s) for errors too. Also check how the drives are configured, as I see they support media partitioning. Jim
06-12-2014 01:16 PM
Apologies, the TN must be internal. No worries, all I was demonstrating was that these issues can be caused by the filer.
06-12-2014 04:44 PM
NetApp returns NDMP_XDR_DECODE_ERR. This return code is generic, so you should enable debug logging of the NDMP service on NetApp Data ONTAP, reproduce this issue, and look into the debug logs.
filer> options ndmpd.debug.enable on
filer> options ndmpd.debug.filter normal
NDMP log is /etc/log/ndmpd.log on Data ONTAP 8.1.x.
06-12-2014 04:50 PM
My mistake.
/etc/log/ndmpd.log -> /etc/log/mlog/ndmpd.log
FAQ: What data should be collected to troubleshoot NDMP operations?
https://kb.netapp.com/support/index?page=content&id=3013954
06-13-2014 02:48 AM
The drives are not shared. We have 4 drives in a Quantum library, and the library is directly attached to the NAS filer.
The NDMP host itself acts as the media server; the robotic control host is the master server.
06-13-2014 02:57 AM
Please check ndmp logs as per Yasuhisa's advice.
06-13-2014 03:02 AM
Is it direct attached or via a switch?
If direct then fine; as the guys say, it is most likely a filer issue. If not, double-check the zoning to make sure no other server can get near those 4 drives.
06-13-2014 03:23 AM
Very long log file, but really only two errors as such in there...
000053 looks to be a new tape and doesn't get written to.
000056 doesn't get positioned and then does not unload either.
Based on 000053 being listed as a new tape, I am guessing that 000056 is too? (How many images are listed on that tape, if any?)
So if it is a nice new LTO-6 tape library using new tapes, and you have only got as far as tape 53 in your system so far, then it may just be a new drive issue (or a firmware issue, see later).
Quantum no longer cleans and processes drives before they are sent out to customers, due to costs. It is always worth cleaning a new drive at least 3 times before using it, so it is well worth doing so now.
The next point is what media you are using. Some are notorious for leaving dirt on the drive heads, but if the drive is actually struggling even to unload 000056, then they may well not be great tapes.
My feeling is that the drives all need a good cleaning (three times each before further use), and that this may be combined with poor media.
One other thing that helps is to make sure the library and drives are all on the latest firmware. I cannot see what the drives are on, but your library is on 643G and the current release is 646G, so you are behind there.
The latest drive firmware is:
HP LTO-6 FC: J3LZ
HP LTO-6 SAS: O31Z
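The library firmware comparison above (643G installed, 646G current) can be expressed as a small check. This is a sketch only: it assumes library firmware levels of this style are ordered by their leading number, which is an inference from the two values quoted here, and the function names are invented. It is not meant for the drive firmware codes listed above, which do not start with a number.

```python
import re


def firmware_number(version):
    """Extract the leading numeric part of a level such as '643G'."""
    m = re.match(r"(\d+)", version.strip())
    if m is None:
        raise ValueError(f"no leading number in firmware level: {version!r}")
    return int(m.group(1))


def is_behind(installed, current):
    """True when the installed level has a lower leading number than the current one."""
    return firmware_number(installed) < firmware_number(current)


# Values quoted in this thread for the library.
print(is_behind("643G", "646G"))
```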
Hope this helps