cancel
Showing results for 
Search instead for 
Did you mean: 

NBU 7.1.03 inconsistent failures with 84 on NDMP backups (NetApp)

AV-IT
Level 3
Partner Accredited

Hi,

My customer is experiencing strange failures of certain large filer's volumes. The problem that they are quite inconsistent... If anyone can shere some insight it will be greatly appreciated.

NBU 7.1.03 HP MS6000 library with 3 active drives. 2 HP LTO-3 drives daisy chained with robotic control to a Media Server, and a HP LTO-4 drive directly attached to the NetApp filer. 

Often, but not always, large filer volumes backups to directly attached LTO-4 fail with status 84. These succeed while directed to remotely-attached LTO-3.

in short, the error shows as: 07/11/2012 15:19:48 - Error ndmpagent(pid=5216) MOVER_HALTED media write error - reason = 5 (NDMP_MOVER_HALT_MEDIA_ERROR)   
07/11/2012 15:19:48 - Error ndmpagent(pid=5216) NDMP backup failed, path = /vol/notes_vol      
07/11/2012 15:19:49 - Error bptm(pid=5452) cannot write image to media id B246L4, drive index 3  
07/11/2012 15:23:10 - Info bptm(pid=5452) EXITING with status 84 <----------       
 

Has anyone seen someting similar? Many thanks!

If anyone is interested there are more details/logs below:

============

On the filer:

ndmpd log (debug level 50) shows: "Wed Nov 7 15:23:20 GMT [tape.cmd.chkCondErr:error]: Tape device 2a.3: Check Condition: SCSI Op code Write(06)) (CDB 0x0a: 0x010000 bytes) hard error - Write error (0x3 - 0xc 0x0x0).

/etc/system hardly helpful: "DUMP: Media Error on tape write"

NBU logs are better:

All Log Entries:

07/11/2012 15:15:20 hreuknetflow  Info 0 General cleaning media DB(s)
07/11/2012 15:19:48 hreuknetflow hreuk3210 Error 7515 Backup MOVER_HALTED media write error - reason = 5 (NDMP_MOVER_HALT_MEDIA_ERROR)
07/11/2012 15:19:48 hreuknetflow hreuk3210 Error 7515 Backup NDMP backup failed, path = /vol/notes_vol
07/11/2012 15:19:49 hreuknetflow hreuk3210 Error 7515 Media Device cannot write image to media id B246L4, drive index 3
07/11/2012 15:23:10 hreuknetflow hreuk3210 Info 7515 Backup Status media write error
07/11/2012 15:23:10 hreuknetflow hreuk3210 Error 7515 Backup backup of client hreuk3210 exited with status 84 (media write error)
07/11/2012 15:23:55 hreuknetflow hreuk3210 Info 7516 Backup started backup job for client hreuk3210, policy test-hreukfs1-main, schedule Full on storage unit HREUKNETFLOW-HCART4-ROBOT-TLD-1-HREUK3210
07/11/2012 15:23:55 hreuknetflow hreuk3210 Info 7516 Backup client hreuk3210 handling path /vol/vol0
07/11/2012 15:25:01 hreuknetflow hreuk3210 Info 7516 Media Device begin writing backup id hreuk3210_1352301835, copy 1, fragment 1, to media id B246L4 on drive HP.ULTRIUM4-SCSI.000 (index 3)
07/11/2012 15:25:20 hreuknetflow  Info 0 General cleaning media DB(s)  

job detailed status:

07/11/2012 12:33:34 - Info nbjm(pid=3528) starting backup job (jobid=7515) for client hreuk3210, policy test-hreukfs1-main, schedule Full 
07/11/2012 12:33:35 - estimated 0 Kbytes needed
07/11/2012 12:33:35 - Info nbjm(pid=3528) started backup job for client hreuk3210, policy test-hreukfs1-main, schedule Full on storage unit HREUKNETFLOW-HCART4-ROBOT-TLD-1-HREUK3210
07/11/2012 12:33:36 - Info bpbrm(pid=4708) hreuk3210 is the host to backup data from    
07/11/2012 12:33:36 - Info bpbrm(pid=4708) reading file list from client       
07/11/2012 12:33:36 - started process bpbrm (4708)
07/11/2012 12:33:36 - connecting
07/11/2012 12:33:37 - Info bpbrm(pid=4708) starting ndmpagent on client        
07/11/2012 12:33:37 - Info ndmpagent(pid=5216) Backup started          
07/11/2012 12:33:37 - connected; connect time: 00:00:01
07/11/2012 12:33:37 - Info bptm(pid=5452) start           
07/11/2012 12:33:38 - Info bptm(pid=5452) using 30 data buffers        
07/11/2012 12:33:38 - Info bptm(pid=5452) using 65536 data buffer size       
07/11/2012 12:33:38 - Info bptm(pid=5452) start backup          
07/11/2012 12:33:38 - Info bptm(pid=5452) Waiting for mount of media id B246L4 (copy 1) on server hreuknetflow.
07/11/2012 12:33:38 - mounting B246L4
07/11/2012 12:34:29 - Info bptm(pid=5452) media id B246L4 mounted on drive index 3, drivepath nrst1a, drivename HP.ULTRIUM4-SCSI.000, copy 1
07/11/2012 12:34:29 - mounted; mount time: 00:00:51
07/11/2012 12:34:35 - positioning B246L4 to file 1
07/11/2012 12:34:51 - positioned B246L4; position time: 00:00:16
07/11/2012 12:34:51 - begin writing
07/11/2012 15:19:48 - Error ndmpagent(pid=5216) MOVER_HALTED media write error - reason = 5 (NDMP_MOVER_HALT_MEDIA_ERROR)   
07/11/2012 15:19:48 - Error ndmpagent(pid=5216) NDMP backup failed, path = /vol/notes_vol      
07/11/2012 15:19:49 - Error bptm(pid=5452) cannot write image to media id B246L4, drive index 3  
07/11/2012 15:23:10 - Info bptm(pid=5452) EXITING with status 84 <----------       
07/11/2012 15:23:10 - end writing; write time: 02:48:19
07/11/2012 15:23:15 - Info ndmpagent(pid=0) done. status: 84: media write error      
media write error(84)
07/11/2012 15:33:10 - Info nbjm(pid=3528) starting backup job (jobid=7515) for client hreuk3210, policy test-hreukfs1-main, schedule Full

bptm log:

15:19:49.843 [5452.3576] <2> NdmpAgentSession[0]: [332] Received 10 (MEDIA_ERROR) ""
15:19:49.843 [5452.3576] <2> NdmpAgentSession[0]: [332] Replying error = 0
15:19:49.843 [5452.3576] <16> check_and_process_ndmpagent_backup_tasks: ndmpagent[0] reports media write error
15:19:49.843 [5452.3576] <2> NdmpAgentSession_close_by_index[0]: Saving ndmpagent session (still needed for ndmp media session, index 0)
15:19:49.843 [5452.3576] <2> write_data: ndmp_task = 1
15:19:49.843 [5452.3576] <2> write_data: status_to_return = -8, total_frag_kbytes 5306624
15:19:49.843 [5452.3576] <16> write_data: cannot write image to media id B246L4, drive index 3
15:19:49.859 [5452.3576] <2> send_MDS_msg: DEVICE_STATUS 1 2658 hreuknetflow B246L4 4000503 HP.ULTRIUM4-SCSI.000 2000056 WRITE_ERROR 0 0
15:19:49.921 [5452.3576] <2> log_media_error: successfully wrote to error file - 11/07/12 15:19:49 B246L4 3 WRITE_ERROR HP.ULTRIUM4-SCSI.000
15:19:49.921 [5452.3576] <2> write_backup: write_data() returned, exit_status = 84, CINDEX = 0, TWIN_INDEX = 0, backup_status = -8
15:19:49.921 [5452.3576] <2> write_backup: tp = 1302490921, stp = 1293098234, et = 9392687, mpx_total_kbytes[TWIN_INDEX = 0] = 5306624
15:19:49.937 [5452.3576] <2> io_terminate_tape: writing empty backup header, drive index 3, copy 1
15:19:49.937 [5452.3576] <2> io_terminate_tape: reposition to previous tapemark and rewrite header
15:19:49.937 [5452.3576] <2> io_ioctl: command (2)MTBSF 1 0x0 from (bptm.c.8774) on drive index 3
15:22:41.875 [5452.3576] <2> io_ioctl: command (0)MTWEOF 1 0x1 from (bptm.c.8856) on drive index 3
15:22:57.187 [5452.3576] <2> io_terminate_tape: absolute block position prior to writing empty header is 2, copy 1
15:22:57.187 [5452.3576] <2> io_write_back_header: drive index 3, empty_file, file num = 1, mpx_headers = 0, copy 1
15:22:57.406 [5452.3576] <2> io_write_block: ndmp_tape_write_func returned 1024
15:22:57.406 [5452.3576] <2> send_MDS_msg: MEDIADB 1 2658 B246L4 4000503 *NULL* 6 1352291615 1352291615 1352334815 0 0 0 0 24 5 0 0 1024 0 2 0
15:22:57.421 [5452.3576] <2> io_ioctl: command (2)MTBSF 1 0x0 from (bptm.c.9122) on drive index 3
15:23:04.625 [5452.3576] <2> io_ioctl: command (1)MTFSF 1 0x0 from (bptm.c.9124) on drive index 3
15:23:05.062 [5452.3576] <2> io_close: closing E:\Program Files\VERITAS\NetBackup\db\media\tpreq\drive_HP.ULTRIUM4-SCSI.000, from bptm.c.9152
15:23:05.281 [5452.3576] <2> NdmpMediaSession_close_public_and_ndmpagent[0]: closing public session, force = 0, agent_session_index = 0
15:23:05.281 [5452.3576] <2> NdmpMediaSession[0]: ndmp_public_session_destory: destroying session 0x014C7FE8
15:23:05.281 [5452.3576] <2> NdmpAgentSession[0]: [333] Received 16 (CONNECT_CLOSE_REPLY) ""
15:23:05.281 [5452.3576] <2> NdmpAgentSession[0]: [333] Replying error = 0
15:23:10.281 [5452.3576] <2> NdmpAgentSession_close_by_index[0]: closing ndmpagent session 014B4CA8
15:23:10.281 [5452.3576] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from io_terminate_tape)
15:23:10.281 [5452.3576] <2> send_brm_msg: EXIT hreuk3210_1352291615 84
15:23:10.281 [5452.3576] <2> bptm: EXITING with status 84 <----------
15:23:11.562 [5428.4636] <2> bptm: instance - 1303197343
15:23:11.562 [5428.4636] <2> bptm: INITIATING (VERBOSE = 5): -unload -dn HP.ULTRIUM4-SCSI.000 -dp nrst1a -dk 2000056 -m B246L4 -mk 4000503 -mds 0 -alocid 2658 -nh hreuk3210 -nu root -nk 2b900d52288612d7 -np 764f88d4ba3b3a741959e2b5c58c9e4b59473dba4d130d3ce85c49c9a997fe7a5c695968576bd19fb6d556344b572fff0b9a3d289b695b07349651727b8a9150 -...

and finally vxlogview:

E:\Program Files\VERITAS\NetBackup\bin>vxlogview -p 51216 -o 134 -b "07/11/12 12:00:00" -e "07/11/12 16:10:00"

 


07/11/2012 15:18:14.296 [CtnLogMsgCB] NDMP_LOG_NORMAL 0 DUMP: Wed Nov  7 15:22:01 2012 : We have written 5305862 KB.
07/11/2012 15:19:32.781 [CtnLogMsgCB] NDMP_LOG_NORMAL 0 DUMP: Media error on tape write.
07/11/2012 15:19:32.781 [CtnLogMsgCB] NDMP_LOG_NORMAL 0 DUMP: DUMP IS ABORTED
07/11/2012 15:19:33.968 [CtnLogMsgCB] NDMP_LOG_NORMAL 0 DUMP: Deleting "/vol/notes_vol/../snapshot_for_backup.21437" snapshot.
07/11/2012 15:19:48.515 [CtnLogMsgCB] NDMP_LOG_NORMAL 0 Connection or IO Error.
07/11/2012 15:19:48.515 [Error] V-134-36 MOVER_HALTED media write error - reason = 5 (NDMP_MOVER_HALT_MEDIA_ERROR)
07/11/2012 15:19:48.562 V-134-19 [NdmpAgent::SetErrorAndHalt] NdmpBackupManager.cpp(1173) - error code 84 (media write error)
07/11/2012 15:19:48.562 [CtnLogMsgCB] NDMP_LOG_NORMAL 0 MoveletOutput: Media Error.
07/11/2012 15:19:48.953 [Error] V-134-32 NDMP backup failed, path = /vol/notes_vol
07/11/2012 15:23:10.296 [ConnectionToBrm::SendExitStatus] Sending EXIT STATUS 84: media write error
07/11/2012 15:23:56.484 [MainStartup] ==================== STARTUP ====================
07/11/2012 15:23:56.484 [Info] V-134-3 started process ndmpagent (pid=5824.4864)

07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter -stream_count

07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter 8
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter -stream_numbe
r
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter 2
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter -blks_per_buf
fer
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter 128
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter -use_otm
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter -fso
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter -kl
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter 28
07/11/2012 15:23:56.484 [Warning] V-134-255 Unrecognized parameter -use_ofb
07/11/2012 15:23:56.484 [NdmpAgent::CheckLogLevels] DiagLevel = 1 DebugLevel = 1

07/11/2012 15:23:57.140 [NdmpGlueLogTraceCb] ndmp_connect_to_server: hostname = hreuk3210, portname = 10000

1 ACCEPTED SOLUTION

Accepted Solutions

mph999
Level 6
Employee Accredited

I think this ...

0x3 - 0xc

is the asc/ ascq error.

+ it says 'hard error'

Good enough proof for me.

Either way, it's an issue between the filer and the drive, as it is directly connected.  As the log is on the filer it is for te filer vendor to translate the meaning.

You are correct, it it were density, the drive would spit the tape out straight away.

M

View solution in original post

5 REPLIES 5

mph999
Level 6
Employee Accredited

From the filer log

ndmpd log (debug level 50) shows: "Wed Nov 7 15:23:20 GMT [tape.cmd.chkCondErr:error]: Tape device 2a.3: Check Condition: SCSI Op code Write(06)) (CDB 0x0a: 0x010000 bytes) hard error - Write error (0x3 - 0xc 0x0x0).

That looks like an 'ASC/ ASCQ' error

That means the tape drive is reporting it has an error - therefore hardwrae issue.

From this TN  http://www.symantec.com/docs/TECH169477

The ASC/ ASCQ section shows that all acs/ acsq errors are hardware related

 

 

ASC/ ASCQ
 
SCSI Sense keys describe a 'state',  which are returned when a command requests a 'check condition' status.  
 
The example given is different from your issue, but all of these types of errors are generated from the hardware.
In a similar manner to Tape Alerts, SCSI Sense Keys are produced by the device, not by NetBackup.
 
As ASC /ASQ alerts are sent from the hardware, it is impossible for them to be caused by NetBackup. 
It has been seen that a power cycle of the drive (not soft reset) can sometimes clear ASC/ ASCQ errors.
 
 
 
 
 
 
Martin

AV-IT
Level 3
Partner Accredited

Thanks Martin!

It does look like a HW error, but I would have expected an ASC/ASCQ code somewhere... I will have to look at the full logs again.

Not much evidence to present to the HP support either. Any suggestions?

Just a thought, could it be a wrong tape density? The same library shares 2xLTO-3 (attached to Media Server) and LTO-4 (directly attached to NDMP). May be tapes got mixed up... But then, I assume, Ultrium-4 drive would fail at the very begining? Just thinking aloud...

Thanks. 

mph999
Level 6
Employee Accredited

I think this ...

0x3 - 0xc

is the asc/ ascq error.

+ it says 'hard error'

Good enough proof for me.

Either way, it's an issue between the filer and the drive, as it is directly connected.  As the log is on the filer it is for te filer vendor to translate the meaning.

You are correct, it it were density, the drive would spit the tape out straight away.

M

AV-IT
Level 3
Partner Accredited

Thanks Martin!

(Hopefully) I will have an opportunity to change a suspected tape drive to an identical LTO4 one and try same backup policies. Will update. Thanks for your input.

Alex

jim_dalton
Level 6

Dodgy tape / dodgy firmware / dodgy drive /  dodgy connection....thats where I'd be looking. Its a write error.

If its a combo thats been working for a while and unmodified then firmware is unlikely.

Jim