01-20-2015 11:35 AM
We have been using NDMP to backup NetApp snapshot for about 2 years now. Since yesterday a new issue as arise. All ndmp backup are failing, either on status 86, media position or on status 24, socket write failed(24).
20/01/2015 11:43:19 AM - Info nbjm(pid=5220) starting backup job (jobid=175381) for client dick-san01, policy ddc_tld0_ndmp_nfsdsprd_01, schedule incr_daily
20/01/2015 11:43:19 AM - Info nbjm(pid=5220) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=175381, request id:{2E765ED3-9D2A-4DDD-8082-799E8E1C4CF0})
20/01/2015 11:43:19 AM - requesting resource smtlhq143-hcart2-robot-tld-0-dick-san01
20/01/2015 11:43:19 AM - requesting resource smtlhq143.NBU_CLIENT.MAXJOBS.dick-san01
20/01/2015 11:43:19 AM - requesting resource smtlhq143.NBU_POLICY.MAXJOBS.ddc_tld0_ndmp_nfsdsprd_01
20/01/2015 11:43:19 AM - Info bpbrm(pid=9432) dick-san01 is the host to backup data from
20/01/2015 11:43:19 AM - Info bpbrm(pid=9432) reading file list for client
20/01/2015 11:43:19 AM - Info bpbrm(pid=9432) starting ndmpagent on client
20/01/2015 11:43:19 AM - Info ndmpagent(pid=6756) Backup started
20/01/2015 11:43:19 AM - Info bptm(pid=7096) start
20/01/2015 11:43:19 AM - granted resource smtlhq143.NBU_CLIENT.MAXJOBS.dick-san01.viarail.intra
20/01/2015 11:43:19 AM - granted resource smtlhq143.NBU_POLICY.MAXJOBS.ddc_tld0_ndmp_nfsdsprd_01
20/01/2015 11:43:19 AM - granted resource 015257
20/01/2015 11:43:19 AM - granted resource ddc-tld0-lto5-003
20/01/2015 11:43:19 AM - granted resource smtlhq143-hcart2-robot-tld-0-dick-san01
20/01/2015 11:43:19 AM - estimated 3222944486 Kbytes needed
20/01/2015 11:43:19 AM - Info nbjm(pid=5220) started backup (backupid=dick-san01_1421772199) job for client dick-san01, policy ddc_tld0_ndmp_nfsdsprd_01, schedule incr_daily on storage unit smtlhq143-hcart2-robot-tld-0-dick-san01
20/01/2015 11:43:19 AM - started process bpbrm (9432)
20/01/2015 11:43:19 AM - connecting
20/01/2015 11:43:19 AM - connected; connect time: 0:00:00
20/01/2015 11:43:20 AM - Info bptm(pid=7096) using 64 data buffers
20/01/2015 11:43:20 AM - Info bptm(pid=7096) using 65536 data buffer size
20/01/2015 11:43:20 AM - Info bptm(pid=7096) start backup
20/01/2015 11:43:20 AM - Info bptm(pid=7096) Waiting for mount of media id 015257 (copy 1) on server smtlhq143.
20/01/2015 11:43:20 AM - mounting 015257
20/01/2015 11:44:15 AM - Info bptm(pid=7096) media id 015257 mounted on drive index 0, drivepath nrst1a, drivename ddc-tld0-lto5-003, copy 1
20/01/2015 11:44:15 AM - mounted; mount time: 0:00:55
20/01/2015 11:44:19 AM - Info ndmpagent(pid=6756) dick-san01: SCSI: TAPE READ: short read for nrst1a
20/01/2015 11:44:19 AM - positioning 015257 to file 73
20/01/2015 11:45:19 AM - Error bptm(pid=7096) io_ioctl_ndmp (MTFSF) failed on media id 015257, drive index 0, return code 18 (NDMP_XDR_DECODE_ERR) (bptm.c.7157)
20/01/2015 11:45:19 AM - Error ndmpagent(pid=6756) eof is set, connection 000000000101AF80
20/01/2015 11:45:19 AM - Error ndmpagent(pid=6756) NDMP backup failed, path = UNKNOWN
20/01/2015 11:45:20 AM - Error ndmpagent(pid=6756) connection 0x101b360 ndmp_message_process_one_failed, status = NDMP_ILLEGAL_STATE_ERR
20/01/2015 11:45:20 AM - Error ndmpagent(pid=6756) eof is set - connection 0x101b360
20/01/2015 11:45:20 AM - Info bptm(pid=7096) EXITING with status 86 <----------
20/01/2015 11:45:20 AM - Info ndmpagent(pid=0) done. status: 86: media position error
20/01/2015 11:45:20 AM - end writing
media position error(86)
Error 2 (start writing then fail)
20/01/2015 9:58:47 AM - Info nbjm(pid=5220) starting backup job (jobid=175367) for client dick-san01, policy ddc_tld0_ndmp_nfsdsprd_05, schedule incr_daily
20/01/2015 9:58:47 AM - Info nbjm(pid=5220) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=175367, request id:{A7DF3C66-7600-4CE0-B086-CD43C24F91D5})
20/01/2015 9:58:47 AM - requesting resource smtlhq143-hcart2-robot-tld-0-dick-san01
20/01/2015 9:58:47 AM - requesting resource smtlhq143.NBU_CLIENT.MAXJOBS.dick-san01
20/01/2015 9:58:47 AM - requesting resource smtlhq143.NBU_POLICY.MAXJOBS.ddc_tld0_ndmp_nfsdsprd_05
20/01/2015 9:58:48 AM - Info bpbrm(pid=7036) dick-san01 is the host to backup data from
20/01/2015 9:58:48 AM - Info bpbrm(pid=7036) reading file list for client
20/01/2015 9:58:48 AM - Info bpbrm(pid=7036) starting ndmpagent on client
20/01/2015 9:58:48 AM - Info ndmpagent(pid=9764) Backup started
20/01/2015 9:58:48 AM - Info bptm(pid=1392) start
20/01/2015 9:58:48 AM - Info bptm(pid=1392) using 64 data buffers
20/01/2015 9:58:48 AM - Info bptm(pid=1392) using 65536 data buffer size
20/01/2015 9:58:48 AM - granted resource smtlhq143.NBU_CLIENT.MAXJOBS.dick-san01
20/01/2015 9:58:48 AM - granted resource smtlhq143.NBU_POLICY.MAXJOBS.ddc_tld0_ndmp_nfsdsprd_05
20/01/2015 9:58:48 AM - granted resource 015176
20/01/2015 9:58:48 AM - granted resource ddc-tld0-lto5-002
20/01/2015 9:58:48 AM - granted resource smtlhq143-hcart2-robot-tld-0-dick-san01
20/01/2015 9:58:48 AM - estimated 1340055948 Kbytes needed
20/01/2015 9:58:48 AM - Info nbjm(pid=5220) started backup (backupid=dick-san01_1421765928) job for client dick-san01, policy ddc_tld0_ndmp_nfsdsprd_05, schedule incr_daily on storage unit smtlhq143-hcart2-robot-tld-0-dick-san01
20/01/2015 9:58:48 AM - started process bpbrm (7036)
20/01/2015 9:58:48 AM - connecting
20/01/2015 9:58:48 AM - connected; connect time: 0:00:00
20/01/2015 9:58:49 AM - Info bptm(pid=1392) start backup
20/01/2015 9:58:49 AM - Info bptm(pid=1392) Waiting for mount of media id 015176 (copy 1) on server smtlhq143.
20/01/2015 9:58:49 AM - mounting 015176
20/01/2015 9:59:45 AM - Info bptm(pid=1392) media id 015176 mounted on drive index 2, drivepath nrst0a, drivename ddc-tld0-lto5-002, copy 1
20/01/2015 9:59:45 AM - mounted; mount time: 0:00:56
20/01/2015 9:59:49 AM - Info ndmpagent(pid=9764) dick-san01: SCSI: TAPE READ: short read for nrst0a
20/01/2015 9:59:49 AM - positioning 015176 to file 1
20/01/2015 9:59:53 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: Using inowalk incremental dump algorithm for qtree
20/01/2015 9:59:53 AM - positioned 015176; position time: 0:00:04
20/01/2015 9:59:53 AM - begin writing
20/01/2015 10:00:04 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: Date of this level 2 dump: Tue Jan 20 02:39:03 2015.
20/01/2015 10:00:04 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: Date of last level 1 dump: Mon Jan 19 02:40:09 2015.
20/01/2015 10:00:04 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: Dumping /vol/v_nfsds_prd_05/q_nfsds_prd_05/.snapshot/smvi_backup_datastore_prod_05_mfm_only_novmsnap-daily_recent to NDMP connection
20/01/2015 10:00:04 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: mapping (Pass I)[regular files]
20/01/2015 10:00:40 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: mapping (Pass II)[directories]
20/01/2015 10:00:40 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: estimated 848756807 KB.
20/01/2015 10:00:40 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: dumping (Pass III) [directories]
20/01/2015 10:00:40 AM - Info ndmpagent(pid=9764) dick-san01: DUMP: dumping (Pass IV) [regular files]
20/01/2015 10:01:40 AM - Info ndmpagent(pid=9764) dick-san01: MOVER: signaling backup EOW at 10485760000 (160000 records)
20/01/2015 10:02:49 AM - Info ndmpagent(pid=9764) dick-san01: MOVER: signaling backup EOW at 20971520000 (320000 records)
20/01/2015 10:02:50 AM - Error ndmpagent(pid=9764) terminated by parent process
20/01/2015 10:02:50 AM - Info ndmpagent(pid=0) done
20/01/2015 10:03:03 AM - Info bptm(pid=1392) EXITING with status 24 <----------
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) Connection was closed but has not yet been destroyed.
20/01/2015 10:03:03 AM - Info ndmpagent(pid=0) done. status: 24: socket write failed
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) Connection was closed but has not yet been destroyed.
20/01/2015 10:03:03 AM - end writing; write time: 0:03:10
socket write failed(24)
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) MoverGetState called with no session
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) NDMP backup failed, path = /vol/v_nfsds_prd_05/q_nfsds_prd_05/.snapshot/smvi_backup_datastore_prod_05_mfm_only_novmsnap-daily_recent
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) Connection was closed but has not yet been destroyed.
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) Connection was closed but has not yet been destroyed.
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) Connection was closed but has not yet been destroyed.
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) Connection was closed but has not yet been destroyed.
20/01/2015 10:03:03 AM - Error ndmpagent(pid=9764) Connection was closed but has not yet been destroyed.
I was able to write a dump 0f of Netapp snapshot to tape using NepApp command line.
Robot has been restarted.
still no luck. any help will be appreciated. Thank you.
01-20-2015 02:44 PM
Status 86 means The system's device driver returned an I/O error while NetBackup was positioning the tape media. (HOWTO104014). Please check your media and drive:
Check for the following:
A defective or a dirty drive. Clean it or have it repaired (see the tpclean command for cleaning).
Incorrect drive configuration. Verify the Media and Device Management and system configuration for the drive.
For example, on UNIX, the drive may be configured for fixed mode when it must be variable mode.
See the NetBackup Device Configuration Guide for more information.
Defective media. In this case, some data may be lost. Use the bpmedia command to set the volume to the FROZEN state so it is not used for future backups.
See "Frozen media troubleshooting considerations" in the Troubleshooting Guide.
The wrong media type. Verify that the media matches the drive type you use.
01-21-2015 04:08 AM
Ive had a few of these, not found the root cause, suspect ndmpd on filer. Interested to know the answer! Jim
01-21-2015 08:21 AM
I did suspect also the ndmpd daemon. we did a stop/start but it didn't fix the issue.
But, we found out that someone from my team decrease the client read timeout on the master server to 60 seconds to debug an other problem a few hours before the issue start. Setting back the timeout to 300 seconds fix the issue.
if after 60 seconds the tape wasn't positionned, the timeout kill the process.
case close.
thank you
01-21-2015 09:10 AM
Interesting! Jim