NDMP Backups failing

Rwigg
Level 3
Greetings,

We recently moved two of our NetApp filers to a new datacenter, and now when I try to back up some of the larger volumes I receive the error "NDMP backup failure(99)". The detailed status shows the following lines:

12/15/2009 2:24:21 PM - Error ndmpagent(pid=3704) MOVER_HALTED unexpected reason = 3 (NDMP_MOVER_HALT_INTERNAL_ERROR)      
12/15/2009 2:24:23 PM - Error ndmpagent(pid=3704) NDMP backup failed, path = /vol/epvol5/
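
For anyone collecting these from several failed jobs, here is a minimal sketch of pulling the mover halt reason out of the detailed status text. It assumes the lines look exactly like the two above; the names `ERROR_LINE`, `HALT_REASON` and `mover_halts` are made up for this example and are not part of NetBackup.

```python
import re

# Matches a job-detail error line of the form shown in this post, e.g.
# 12/15/2009 2:24:21 PM - Error ndmpagent(pid=3704) <message>
ERROR_LINE = re.compile(
    r"^(?P<when>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) - Error "
    r"(?P<proc>\w+)\(pid=(?P<pid>\d+)\) (?P<msg>.*\S)\s*$"
)
# Matches the halt reason inside the message part of such a line.
HALT_REASON = re.compile(
    r"MOVER_HALTED unexpected reason = (?P<code>\d+) \((?P<name>\w+)\)"
)


def mover_halts(detail_text):
    """Return (timestamp, process, code, name) for each MOVER_HALTED error line."""
    found = []
    for line in detail_text.splitlines():
        m = ERROR_LINE.match(line.strip())
        if not m:
            continue  # not an Error line in the expected layout
        h = HALT_REASON.search(m.group("msg"))
        if h:
            found.append(
                (m.group("when"), m.group("proc"), int(h.group("code")), h.group("name"))
            )
    return found


sample = """\
12/15/2009 2:24:21 PM - Error ndmpagent(pid=3704) MOVER_HALTED unexpected reason = 3 (NDMP_MOVER_HALT_INTERNAL_ERROR)
12/15/2009 2:24:23 PM - Error ndmpagent(pid=3704) NDMP backup failed, path = /vol/epvol5/
"""

if __name__ == "__main__":
    for halt in mover_halts(sample):
        print(halt)
```

Only the first sample line carries a halt reason, so the second line is skipped rather than reported.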

I have run successful backups from these two filers since the move.  Also, I am not seeing any errors in the NetApp syslogs.  Has anyone else run into this problem, or does anyone have any ideas?
1 ACCEPTED SOLUTION

Accepted Solutions

MattS
Level 6
That does sound tape drive related... maybe try performing a large/long standard backup on that drive to see if any errors pop up?  Is it SAN connected? If so you might want to check for errors on your fiber switch ports.

9 REPLIES

lu
Level 6
Can you run "tpautoconf -verify yournas"?

Rwigg
Level 3

Yes, I can run tpautoconf -verify yournas for both filers on the media server.  The output is below.

C:\Program Files\Veritas\Volmgr\bin>tpautoconf -verify filer301c
Connecting to host "filer301c" as user "root"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
  host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
  host name "filer301c"
  os type "NetApp"
  os version "NetApp Release 7.2.6.1P2"
  host id "0151703198"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
Host has SnapVault Secondary license installed

C:\Program Files\Veritas\Volmgr\bin>tpautoconf -verify filer300c
Connecting to host "filer300c" as user "root"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
  host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
  host name "filer300c"
  os type "NetApp"
  os version "NetApp Release 7.2.6.1P2"
  host id "0151703082"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
Host has SnapVault Secondary license installed
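
If you need to compare this output across several filers, a rough sketch like the one below can reduce each transcript to the few facts that matter. It assumes the output text looks exactly like the transcripts above; `summarize_verify` is a made-up helper, not a NetBackup tool, and it only parses text you have already captured.

```python
import re


def summarize_verify(output):
    """Pull the key facts out of one `tpautoconf -verify` transcript."""

    def first(pattern):
        # Return the first capture group of the first match, or None.
        m = re.search(pattern, output)
        return m.group(1) if m else None

    version = first(r"successful with NDMP protocol version (\d+)")
    return {
        "host": first(r'host name "([^"]+)"'),
        "os_version": first(r'os version "([^"]+)"'),
        "ndmp_version": int(version) if version else None,
        "login_ok": "Login was successful" in output,
        "three_way": "Host supports 3-way backup/restore" in output,
    }


sample = """\
Connecting to host "filer301c" as user "root"...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
Host info is:
  host name "filer301c"
  os type "NetApp"
  os version "NetApp Release 7.2.6.1P2"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
"""

if __name__ == "__main__":
    print(summarize_verify(sample))
```

Both filers here report the same release and protocol version and log in cleanly. As far as I understand it, that result only speaks to the NDMP control connection and credentials, not to the tape data path.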

g_man1
Level 3
I cannot speak to NetApp, but this could be similar to the EMC Celerra. With my Celerra, I think I can only run 4 concurrent backups per data mover. If another backup kicks off while 4 are running, it will end in a status 99.

Are your backups by chance taking longer to run, so that you have more concurrent backups running now than you used to?
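
One way to answer that question is to take the start and end times of the jobs against one filer and count the peak overlap. This is only a sketch: the job windows below are invented for illustration, `peak_concurrency` is a hypothetical helper, and the limit of 4 is what I see on my Celerra, not a documented NetApp number.

```python
from datetime import datetime


def peak_concurrency(jobs):
    """Return the largest number of jobs running at the same instant.

    `jobs` is an iterable of (start, end) datetime pairs.
    """
    events = []
    for start, end in jobs:
        events.append((start, 1))   # a job begins
        events.append((end, -1))    # a job finishes
    # At equal timestamps, process finishes (-1) before starts (+1) so that
    # back-to-back jobs are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def at(hhmm):
    # Hypothetical helper: build a datetime on an arbitrary example day.
    return datetime.strptime("2009-12-15 " + hhmm, "%Y-%m-%d %H:%M")


# Invented job windows, for illustration only.
example_jobs = [
    (at("10:00"), at("12:00")),
    (at("10:30"), at("11:00")),
    (at("11:00"), at("13:00")),
    (at("11:30"), at("12:30")),
]

if __name__ == "__main__":
    print(peak_concurrency(example_jobs))
```

If the peak after the move is higher than before and sits above whatever session limit the filer enforces, that would support the concurrency theory.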

Manoj_Siricilla
Level 4
Certified
Start a backup and then, at the filer prompt, run the command

>ndmpd status

Check the session and verify that ndmpd is on.

Also, review the backup log on the NAS filer by running the command

>rdfile /etc/log/backup

99 is a generic error code.

You can get a 99 even if the filer cannot create a snapshot to dump the backup.


MattS
Level 6
Rwigg,

Did you get this issue resolved?  I was having this same issue with the same errors in the logs.

After much troubleshooting I found that I was using the e0m interface on the NetApp for NetBackup media/master communication.  I went over this with the SAN guys, and they weren't sure why that interface had a DNS entry with the NetApp's hostname.  It should have been something like hostname-e0m.

So I edited the hosts files on the master/media to use the IP address of the vif-master interface instead, and it seems to have resolved the issue.
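
A quick way to confirm which address the filer's name resolves to from the master/media server is a small resolution check like this sketch. It uses the operating system's resolver, which normally honours the hosts file; `resolved_addresses` and `resolves_to` are names invented for this example, and you would substitute your own filer name and vif address for `localhost`.

```python
import socket


def resolved_addresses(hostname):
    """Return the sorted IPv4 addresses that `hostname` resolves to on this machine."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolves_to(hostname, expected_ip):
    """True if `expected_ip` is among the addresses `hostname` resolves to."""
    return expected_ip in resolved_addresses(hostname)


if __name__ == "__main__":
    # Replace "localhost" with the filer's hostname as used in the NetBackup
    # policy, and compare the result against the intended backup interface.
    print(resolved_addresses("localhost"))
```

If the name comes back with the e0m address rather than the vif address, the backup traffic is probably going over the wrong interface.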

Let me know if this works for you.

Matt

lu
Level 6
Can you try creating the file /usr/openv/netbackup/db/config/ndmp.cfg and putting the following keyword in it: NDMP_MOVER_CLIENT_DISABLE

Rwigg
Level 3
Thanks, MattS. I did check the hosts file on the master/media, and we are already using the vif-master interface.  I had thought we fixed the issue by upgrading the tape drive firmware, but I received the error again today.  After rerunning the job I received a new error, which leads me to believe this is still related to the tape drive.  Below is the status from the failed job.  I am going to follow up with IBM again.

1/26/2010 10:47:38 AM - requesting resource ares05c-hcart-robot-tld-3-Filer300c
1/26/2010 10:47:38 AM - requesting resource ares02.NBU_CLIENT.MAXJOBS.filer300c
1/26/2010 10:47:38 AM - requesting resource ares02.NBU_POLICY.MAXJOBS.NDMP_Vol_Dotnextvirt01
1/26/2010 10:47:38 AM - granted resource ares02.NBU_CLIENT.MAXJOBS.filer300c
1/26/2010 10:47:38 AM - granted resource ares02.NBU_POLICY.MAXJOBS.NDMP_Vol_Dotnextvirt01
1/26/2010 10:47:38 AM - granted resource P00045
1/26/2010 10:47:38 AM - granted resource IBM.ULT3580-TD4.016
1/26/2010 10:47:38 AM - granted resource ares05c-hcart-robot-tld-3-Filer300c
1/26/2010 10:47:38 AM - estimated 0 kbytes needed
1/26/2010 10:47:40 AM - started process bpbrm (3520)
1/26/2010 10:47:41 AM - connecting
1/26/2010 10:47:41 AM - connected; connect time: 00:00:00
1/26/2010 10:47:46 AM - mounting P00045
1/26/2010 10:49:00 AM - mounted; mount time: 00:01:14
1/26/2010 10:49:00 AM - positioning P00045 to file 9
1/26/2010 10:50:59 AM - positioned P00045; position time: 00:01:59
1/26/2010 10:50:59 AM - begin writing
1/26/2010 11:38:14 AM - current media P00045 complete, requesting next resource Any
1/26/2010 11:38:15 AM - current media -- complete, awaiting next media Any Reason: Drives are in use, Media Server: ares05c,
  Robot Number: 3, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
  Volume Pool: NDMPFiler01, Storage Unit: ares05c-hcart-robot-tld-3-Filer300c, Drive Scan Host: N/A
 
1/26/2010 11:39:02 AM - granted resource P00079
1/26/2010 11:39:02 AM - granted resource IBM.ULT3580-TD4.016
1/26/2010 11:39:02 AM - granted resource ares05c-hcart-robot-tld-3-Filer300c
1/26/2010 11:39:02 AM - mounting P00079
1/26/2010 11:40:15 AM - mounted; mount time: 00:01:13
1/26/2010 11:40:16 AM - positioning P00079 to file 1
1/26/2010 11:40:19 AM - positioned P00079; position time: 00:00:03
1/26/2010 11:40:19 AM - begin writing
1/26/2010 11:48:31 AM - Error ndmpagent(pid=1452) NDMP backup failed, path = /vol/dotnextvirt01      
1/26/2010 11:48:32 AM - Error bptm(pid=2632) io_ioctl_ndmp (MTBSF) failed on media id P00079, drive index 2, return code 7 (NDMP_IO_ERR) (bptm.c.21479)
1/26/2010 11:48:33 AM - end writing; write time: 00:08:14
1/26/2010 11:48:37 AM - Error ndmpagent(pid=1452) terminated by parent process        
1/26/2010 11:48:38 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:40 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
media position error(86)
1/26/2010 11:48:41 AM - Error ndmpagent(pid=1452) MoverGetState called with no session       
1/26/2010 11:48:42 AM - Error ndmpagent(pid=1452) NDMP backup failed, path = /vol/dotnextvirt01      
1/26/2010 11:48:43 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:45 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:46 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:48 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:49 AM - Error ndmpagent(pid=1452) MoverGetState called with no session       
1/26/2010 11:48:50 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:52 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
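
Most of the error lines above are repeats, so it can help to collapse them before reading. The sketch below counts distinct (process, message) pairs among the Error lines; it assumes the line layout shown in this job status, and `error_summary` is a made-up helper rather than anything NetBackup provides.

```python
import re
from collections import Counter

# Captures the reporting process and the message from an Error line.
ERROR = re.compile(r"- Error (?P<proc>\w+)\(pid=\d+\) (?P<msg>.*\S)")


def error_summary(detail_text):
    """Count distinct (process, message) pairs among the Error lines, in first-seen order."""
    counts = Counter()
    for line in detail_text.splitlines():
        m = ERROR.search(line)
        if m:
            counts[(m.group("proc"), m.group("msg"))] += 1
    return counts


sample = """\
1/26/2010 11:48:31 AM - Error ndmpagent(pid=1452) NDMP backup failed, path = /vol/dotnextvirt01
1/26/2010 11:48:32 AM - Error bptm(pid=2632) io_ioctl_ndmp (MTBSF) failed on media id P00079, drive index 2, return code 7 (NDMP_IO_ERR) (bptm.c.21479)
1/26/2010 11:48:33 AM - end writing; write time: 00:08:14
1/26/2010 11:48:37 AM - Error ndmpagent(pid=1452) terminated by parent process
1/26/2010 11:48:38 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.
1/26/2010 11:48:40 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.
"""

if __name__ == "__main__":
    for (proc, msg), n in error_summary(sample).items():
        print(f"{n:>2} x {proc}: {msg}")
```

Collapsed this way, the only line from bptm is the MTBSF failure with NDMP_IO_ERR on media P00079. MTBSF is the tape "backspace file" operation, so that line points at the drive or media, and the ndmpagent messages after it look like fallout from the session being torn down.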

Rwigg
Level 3
The issue was with the tape drive.  IBM replaced it and now all is well.  Thanks everyone for your suggestions.