
Netbackup 6.5.4 and Netapp Ontap 7.2.6.1 NDMP backup problems

AdN
Level 4

Hi,


We're using NetBackup 6.5.4 on a Solaris 10 host (master and media server) to back up our NetApp FAS3140 running ONTAP 7.2.6.1 to a Quantum PX506 LTO tape library. Both the media server and the NetApp appliance are connected to the tape library via Fibre Channel. Two drives are assigned to the NetApp appliance.


For the past few days we've been having problems with incremental backups of one of the volumes (/vol/home) on the NetApp filer. The incremental backups of the other volumes run fine. /vol/home, the volume causing the problems, is our biggest: it contains about 30 million files and 1.6 TB of data (1.2 TB after NetApp block deduplication):

adnnfs03> df -h home                                           
Filesystem               total       used      avail capacity  Mounted on
/vol/home/              1440GB     1238GB      201GB      86%  /vol/home/
/vol/home/.snapshot      360GB      253GB      106GB      70%  /vol/home/.snapshot

adnnfs03> df -i home         
Filesystem               iused      ifree  %iused  Mounted on
/vol/home/            30047941    4952040     86%  /vol/home/

adnnfs03> df -hs home                                                                  
Filesystem                used      saved       %saved                                 
/vol/home/              1239GB      426GB          26%                    

Are there any known limitations or problems with this configuration and that amount of data? We didn't change our setup in the last few months, and backing up the smaller volumes works fine. The backup process for /vol/home stalls for hours and then aborts with this error message:

03/15/2010 09:32:31 - requesting resource adnnfs03-ndmp
03/15/2010 09:32:31 - requesting resource adnbackup.NBU_CLIENT.MAXJOBS.adnnfs03
03/15/2010 09:32:31 - requesting resource adnbackup.NBU_POLICY.MAXJOBS.adnnfs03
03/15/2010 09:32:31 - granted resource  adnbackup.NBU_CLIENT.MAXJOBS.adnnfs03
03/15/2010 09:32:31 - granted resource  adnbackup.NBU_POLICY.MAXJOBS.adnnfs03
03/15/2010 09:32:31 - granted resource  LM2727
03/15/2010 09:32:31 - granted resource  HP.ULTRIUM3-SCSI.003
03/15/2010 09:32:31 - granted resource  adnnfs03-ndmp
03/15/2010 09:32:32 - estimated 1905263392 kbytes needed
03/15/2010 09:32:32 - started process bpbrm (pid=29229)
03/15/2010 09:32:32 - connecting
03/15/2010 09:32:32 - connected; connect time: 0:00:00
03/15/2010 09:32:35 - mounting LM2727
03/15/2010 09:33:27 - mounted LM2727; mount time: 0:00:52
03/15/2010 09:33:31 - positioning LM2727 to file 39
03/15/2010 09:33:47 - positioned LM2727; position time: 0:00:16
03/15/2010 09:33:47 - begin writing
03/15/2010 17:34:18 - Error ndmpagent (pid=29234) aborting operation - no mover progress
03/15/2010 17:34:18 - Error ndmpagent (pid=29234) NDMP backup failed, path = /vol/home
03/15/2010 17:35:13 - end writing; write time: 8:01:26
NDMP backup failure (99)

During that time the NDMP debug logfile of the Netapp filer shows these messages:

Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: Error code: NDMP_NO_TAPE_LOADED_ERR                           
Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: Device name: nrst0a                                           
Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: Mode: 0                                                       
Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: IOException: Device cannot be opened. Device may have no tape.
Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: NDMP message type: NDMP_CONNECT_CLOSE                         
Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: NDMP message replysequence: 7                      

I doubt that this message is true. There is a tape in the drive, backing up other volumes works, and a second drive is unused. In the meantime we have asked our users to clean up their home directories to reduce the number of files and the data size, hoping this will help.
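
To double-check from the filer side, something like the following should show whether ONTAP sees the drives and a loaded tape (standard 7-mode commands; the device name is the one from the NDMP log and may need adjusting):

adnnfs03> sysconfig -t
adnnfs03> mt -f nrst0a status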

Thanks in advance for any hint.

 

Best regards,

Bernd

1 ACCEPTED SOLUTION

AdN
Level 4

Hi,

I got a solution from Symantec support. By default the NDMP progress timeout is 8 hours. Starting with patch levels 6.0 MP7 and 6.5.4, a higher timeout can be defined. I increased it to 24 hours (1440 minutes):

echo 1440 > /usr/openv/netbackup/db/config/NDMP_PROGRESS_TIMEOUT

http://support.veritas.com/docs/249241

Best regards,
Bernd


15 REPLIES

Marianne
Level 6
Partner    VIP    Accredited Certified

Nothing in NDMP logs between 09:30 and 17:35?

The log that you've posted is from 23:59 and seems to be correct: the NBU job failed at 17:35 due to "no mover progress" and would have unloaded the tape within 5 minutes of the failure.

AdN
Level 4

Hi Marianne,


Yes, there is NDMP log output on the NetApp, but it just repeats the following sequence. It looks normal to me, except for the "IOException: Device cannot be opened. Device may have no tape." line. Those lines appear for both drives (nrst0a and nrst1a).

According to [1], error code 99 in this case most likely means "An Incremental backup is run, but no files have changed since the last backup." But files have changed. I assume the incremental backup process takes too long to find all the files that have changed and aborts after 8 hours. The last full backup ran on 8 March and finished fine. Our backup operator says that the incremental backups fail sometimes. I have to check whether a full backup works now.


Best regards,

Bernd

 

[1] http://seer.entsupport.symantec.com/docs/275330.htm

Netapp NDMP debug log:

Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Created an NDMP server connection
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Message NDMP_NOTIFY_CONNECTION_STATUS sent
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Sequence 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Timestamp 1268670843
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Msgtype 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Method 1282
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: ReplySequence 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Reason: 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: version: 4
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Text:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message type: NDMP_CONNECT_OPEN
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message replysequence: 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Sequence 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Timestamp 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Msgtype 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Method 2304
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: ReplySequence 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Request version: 4
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error code: NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message type: NDMP_CONFIG_GET_SERVER_INFO
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message replysequence: 2
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Sequence 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Timestamp 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Msgtype 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Method 264
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: ReplySequence 2
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error code: NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Vendor:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Product:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Revision:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message type: NDMP_CONFIG_GET_AUTH_TYPE_ATTR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message replysequence: 3
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Sequence 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Timestamp 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Msgtype 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Method 259
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: ReplySequence 3
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Auth type: 2
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error code: NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Challenge: [B@fac8d3da
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message type: NDMP_CONNECT_CLIENT_AUTH
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message replysequence: 4
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Sequence 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Timestamp 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Msgtype 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Method 2305
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: ReplySequence 4
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Auth Type: 2
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error code: NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message type: NDMP_CONFIG_GET_HOST_INFO_V3
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: NDMP message replysequence: 5
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Sequence 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Timestamp 0
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Msgtype 1
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Method 256
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: ReplySequence 5
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:03 GMT+01:00 [ndmpd:129]: Error code: NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: NDMP message type: NDMP_TAPE_OPEN
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: NDMP message replysequence: 6
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Sequence 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Timestamp 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Msgtype 1
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Method 768
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: ReplySequence 6
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Error code: NDMP_NO_TAPE_LOADED_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Device name: nrst0a
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Mode: 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: IOException: Device cannot be opened. Device may have no tape.
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: NDMP message type: NDMP_CONNECT_CLOSE
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: NDMP message replysequence: 7
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Sequence 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Timestamp 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Msgtype 1
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Method 2306
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: ReplySequence 7
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Cleaning up connection
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Message NDMP_NOTIFY_CONNECTION_STATUS sent
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Message Header:
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Sequence 8
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Timestamp 1268670844
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Msgtype 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Method 1282
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: ReplySequence 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Error NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Reason: 1
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: version: 4
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Text: Connection shutdown
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Ndmpd session closed successfully
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Calling NdmpServer.kill
Mar 15 17:34:04 GMT+01:00 [ndmpd:129]: Error code: NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: NDMP message type: NDMP_DATA_GET_STATE_V2
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: NDMP message replysequence: 1840
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Message Header:
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Sequence 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Timestamp 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Msgtype 1
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Method 1024
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: ReplySequence 1840
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Error NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Error code: NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: NDMP message type: NDMP_MOVER_GET_STATE
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: NDMP message replysequence: 1841
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Message Header:
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Sequence 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Timestamp 0
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Msgtype 1
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Method 2560
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: ReplySequence 1841
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Error NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Error code: NDMP_NO_ERR
Mar 15 17:34:04 GMT+01:00 [ndmpd:113]: Mover state: NDMP_MOVER_STATE_ACTIVE

AdN
Level 4

OK. Full backups are not affected - at least, a full backup of /vol/home has now been running for 5.5 hours and has written 105 GB to tape. Only the incremental is failing.

Omar_Villa
Level 6
Employee
I think the media errors are caused by a mount timeout. Are all your drives healthy? Most of the time, status 99 errors occur because the qtrees are not available.
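
You can check the drive state from the master/media server with the standard Media Manager commands, e.g.:

# show drive status (up/down, in use) as seen by the device daemon
/usr/openv/volmgr/bin/vmoprcmd -d
# list the configured drives, including the NDMP host device paths
/usr/openv/volmgr/bin/tpconfig -d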

Omar_Villa
Level 6
Employee
Also, you can always enable unified logging to see if there is any issue with the NDMP agent.
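
For example, something along these lines on the master/media server should pull the recent ndmpagent entries out of unified logging (this assumes ndmpagent's originator ID is 134 on 6.5.x; check the originator list in the troubleshooting guide if it differs for your release):

# show ndmpagent unified-logging messages from the last 10 hours
# (originator ID 134 for ndmpagent is an assumption - verify it for your release)
/usr/openv/netbackup/bin/vxlogview -o 134 -t 10:00:00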

Marianne
Level 6
Partner    VIP    Accredited Certified

Omar, there is no media mount timeout here:
03/15/2010 09:32:35 - mounting LM2727
03/15/2010 09:33:27 - mounted LM2727; mount time: 0:00:52


Also - unified logging is enabled by default

AdN
Level 4
Hi, as I wrote before: a full backup of that volume runs fine, while an incremental backup does nothing for 8 hours and then terminates. Therefore I assume it is not a media or drive error. To confirm that, I ran a full backup yesterday. Here is the log output of that full backup job, which worked:
03/16/2010 11:34:48 - estimated 0 kbytes needed
03/16/2010 11:34:49 - started process bpbrm (pid=21050)
03/16/2010 11:34:49 - connecting
03/16/2010 11:34:49 - connected; connect time: 0:00:00
03/16/2010 11:34:52 - mounting MD7569
03/16/2010 11:36:01 - mounted MD7569; mount time: 0:01:09
03/16/2010 11:36:05 - positioning MD7569 to file 1
03/16/2010 11:36:08 - positioned MD7569; position time: 0:00:03
03/16/2010 11:36:08 - begin writing
03/16/2010 17:05:00 - current media MD7569 complete, requesting next media Any
03/16/2010 17:05:01 - granted resource  MD7578
03/16/2010 17:05:01 - granted resource  HP.ULTRIUM3-SCSI.002
03/16/2010 17:05:01 - granted resource  adnnfs03-ndmp
03/16/2010 17:05:02 - mounting MD7578
03/16/2010 17:05:53 - mounted MD7578; mount time: 0:00:51
03/16/2010 17:05:55 - positioning MD7578 to file 1
03/16/2010 17:05:58 - positioned MD7578; position time: 0:00:03
03/16/2010 17:05:58 - begin writing
03/16/2010 19:53:17 - current media MD7578 complete, requesting next media Any
03/16/2010 19:53:18 - granted resource  MD7567
03/16/2010 19:53:18 - granted resource  HP.ULTRIUM3-SCSI.003
03/16/2010 19:53:18 - granted resource  adnnfs03-ndmp
03/16/2010 19:53:19 - mounting MD7567
03/16/2010 19:54:10 - mounted MD7567; mount time: 0:00:51
03/16/2010 19:54:13 - positioning MD7567 to file 1
03/16/2010 19:54:16 - positioned MD7567; position time: 0:00:03
03/16/2010 19:54:16 - begin writing
03/16/2010 23:08:08 - end writing; write time: 3:13:52
the requested operation was successfully completed (0)

Best regards,
Bernd

AdN
Level 4
Hi, our users have deleted some files in their home directories, and the number of inodes in that volume has dropped from 30 million to 27 million:
adnnfs03> df -h /vol/home
Filesystem               total       used      avail capacity  Mounted on
/vol/home/              1440GB     1171GB      268GB      81%  /vol/home/
/vol/home/.snapshot      360GB      341GB       18GB      95%  /vol/home/.snapshot
adnnfs03> df -i /vol/home
Filesystem               iused      ifree  %iused  Mounted on
/vol/home/            27748657    7251324     79%  /vol/home/
The incremental backup now ran fine:
03/17/2010 08:07:01 - estimated 0 kbytes needed
03/17/2010 08:07:02 - started process bpbrm (pid=6977)
03/17/2010 08:07:02 - connecting
03/17/2010 08:07:02 - connected; connect time: 0:00:00
03/17/2010 08:07:05 - mounting LM2727
03/17/2010 08:08:02 - mounted LM2727; mount time: 0:00:57
03/17/2010 08:08:06 - positioning LM2727 to file 45
03/17/2010 08:09:51 - positioned LM2727; position time: 0:01:45
03/17/2010 08:09:51 - begin writing
03/17/2010 11:17:35 - end writing; write time: 3:07:44
the requested operation was successfully completed (0)
So there must be a bug when a NetApp volume contains more than 30 million inodes and is backed up with NDMP. Is there a known limitation with NetBackup 6.5.4 and Data ONTAP 7.2.6.1? Restoring files to that volume also fails:
03/17/2010 08:16:49 - begin Restore
03/17/2010 09:49:43 - media needed: KVB802
03/17/2010 09:49:43 - media needed: KZ8690
03/17/2010 09:49:43 - media needed: KV6769
03/17/2010 09:49:43 - restoring from image adnnfs03_1265416147
03/17/2010 09:49:44 - connecting
03/17/2010 09:49:44 - connected; connect time: 0:00:00
03/17/2010 09:49:44 - started process bptm (pid=13279)
03/17/2010 09:49:47 - started process bptm (pid=13279)
03/17/2010 09:49:47 - mounting KZ8690
03/17/2010 09:49:45 - requesting resource KZ8690
03/17/2010 09:49:45 - granted resource  KZ8690
03/17/2010 09:49:45 - granted resource  HP.ULTRIUM3-SCSI.003
03/17/2010 09:50:44 - mounted KZ8690; mount time: 0:00:57
03/17/2010 09:50:44 - positioning KZ8690 to file 3
03/17/2010 09:51:29 - positioned KZ8690; position time: 0:00:45
03/17/2010 09:51:29 - begin reading
03/17/2010 10:51:29 - Error bpbrm (pid=13266) socket read failed: errno = 62 - Timer expired
03/17/2010 11:27:35 - Error bptm (pid=13279) media manager terminated by parent process
03/17/2010 11:27:35 - Error ndmpagent (pid=13278) NDMP restore failed from path /vol/home
03/17/2010 11:27:43 - Error bpbrm (pid=13266) client restore EXIT STATUS 13: file read failed
03/17/2010 11:27:43 - restored from image adnnfs03_1265416147; restore time: 1:38:00
03/17/2010 11:27:43 - end Restore; elapsed time 3:10:54
the restore failed to recover the requested files (5)
I am currently testing a restore of a /vol/home backup to another volume on the same NetApp filer with fewer used inodes.

Best regards,
Bernd

MattS
Level 6

Hi,

Can you turn on logging on your NetApp and post the results here?

I've never had to turn it on, but from the documentation the command is:

options backup.log.enable [on | off]

From the documentation (sorry, I don't have a link):

"Data ONTAP® 7.3
Data Protection
Tape Backup and Recovery Guide"
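
Something like this on the filer console should do it (assuming 7-mode writes the dump log to the default location, /etc/log/backup):

adnnfs03> options backup.log.enable on
adnnfs03> rdfile /etc/log/backup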



Also, are there any firewalls between your master and media server?


AdN
Level 4

Hi Matt,

The master and media server are the same host, and there is no firewall in between. The backup server and the NetApp filer are both directly attached to the tape library via Fibre Channel. The backup log is enabled by default. Here's the content of the backup log from the failed incremental backup on 15 March:

dmp Mon Mar 15 09:58:54 CET /vol/home/(0) Start (Level 1, NDMP)
dmp Mon Mar 15 09:58:54 CET /vol/home/(0) Options (b=63, u)
dmp Mon Mar 15 09:58:54 CET /vol/home/(0) Snapshot (snapshot_for_backup.6050, Mon Mar 15 09:58:53 CET)
dmp Mon Mar 15 09:58:54 CET /vol/home/(0) Base_dump (Level 0, Sun Mar  7 03:15:51 CET)
dmp Mon Mar 15 09:58:57 CET /vol/home/(0) Tape_open (ndmp)
dmp Mon Mar 15 09:58:57 CET /vol/home/(0) Phase_change (I)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Tape_close (ndmp)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Abort (0 MB)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (reg inodes: 15870475 other inodes: 831105 dirs: 3850221 nt dirs: 1364 nt inodes: 1324 acls: 0)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Phase 3: directories dumped: 0)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Phase 3: wafl directory blocks read: 0)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Phase 3 throughput (MB sec): read 0 write 0)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Percent of phase3 time spent for: reading inos 0% dumping ino 0%)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Percent of phase3 dump time spent for: convert-wafl-dirs 0% lev0-ra 0%)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Phase 3 averages (usec): wafl load buf time 0 level 0 ra time 0)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Phase 4: inodes dumped: 0)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Phase 4: wafl data blocks read: 0)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Percent of phase4 time spent for: reading inos 0% dumping inos 0%)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Percent of phase4 dump time spent for: wafl read iovec: 0% lev0-ra 0%)
dmp Mon Mar 15 17:59:25 CET /vol/home/(0) Log_msg (Phase 4 averages (usec): wafl read iovec time 0 level 0 ra time 0)

The log of the successful full backup from 16 March:

dmp Tue Mar 16 12:01:28 CET /vol/home/(0) Start (Level 0, NDMP)
dmp Tue Mar 16 12:01:28 CET /vol/home/(0) Options (b=63, u)
dmp Tue Mar 16 12:01:28 CET /vol/home/(0) Snapshot (snapshot_for_backup.6227, Tue Mar 16 12:01:18 CET)
dmp Tue Mar 16 12:01:33 CET /vol/home/(0) Tape_open (ndmp)
dmp Tue Mar 16 12:01:33 CET /vol/home/(0) Phase_change (I)
dmp Tue Mar 16 12:13:56 CET /vol/home/(0) Phase_change (II)
dmp Tue Mar 16 12:14:12 CET /vol/home/(0) Phase_change (III)
dmp Tue Mar 16 14:39:48 CET /vol/home/(0) Phase_change (IV)
dmp Tue Mar 16 22:12:28 CET /vol/home/(0) Phase_change (V)
dmp Tue Mar 16 22:12:30 CET /vol/home/(0) Tape_close (ndmp)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) End (1602458 MB)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (reg inodes: 22542131 other inodes: 891108 dirs: 5013545 nt dirs: 1368 nt inodes: 1330 acls: 0)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 3: directories dumped: 5014914)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 3: wafl directory blocks read: 5204184)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 3: average wafl directory blocks per inode: 1)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 3: average tape blocks per inode: 2)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 3 throughput (MB sec): read 3 write 1)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Percent of phase3 time spent for: reading inos 0% dumping ino 99%)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Percent of phase3 dump time spent for: convert-wafl-dirs 94% lev0-ra 1%)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 3 averages (usec): wafl load buf time 14 level 0 ra time 12)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 4: inodes dumped: 23434569)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 4: wafl data blocks read: 415358068)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 4: average wafl data blocks per inode: 17)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 4: average tape data blocks per inode: 69)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 4 throughput (MB sec): read 61 write 59)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Percent of phase4 time spent for: reading inos 0% dumping inos 98%)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Percent of phase4 dump time spent for: wafl read iovec: 19% lev0-ra 2%)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Phase 4 averages (usec): wafl read iovec time 112 level 0 ra time 25)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Tape write times (msec): average: 0 max: 69030)
dmp Tue Mar 16 22:12:31 CET /vol/home/(0) Log_msg (Tape changes: 1)

The clocks of the NetApp filer and the backup server differ by 24 minutes because the NTP daemon wasn't running. We also sometimes have problems restoring files onto that volume. I'm currently testing this iteratively to find out under which conditions it happens. A restore takes many hours before it finds the requested file.

Best regards,
Bernd

Andy_Welburn
Level 6
What stands out in your backup log is the length of time taken for Phase I before the failure.

From the NetApp site:
"...
Phase I : In this phase, we generate the list of files that need to be backed  up. The output of this phase is what is called an inode map. The inode map contains an entry for every inode on the volume. The inode that is to be backed up has its corresponding entry set to 1 whereas the inode that is not supposed to be backed up has its entry set to 0. 

With Data Ontap 6.4 onwards, for backups that are initiated by NDMP with file history turned on, we need to do some processing to estimate the amount of space each file (inode) will take on tape. This processing is invaluable during Phase IIIb (Once again introduced in Data ONTAP 6.4) where we write the offset map to the backup image. This adds some  cost to NDMP backups w/ file history in this Phase. However, this  allows us to provide some extraordinary features (DAR Enhancements) during restore.
..."

It then goes on to say:
"...
Phase I of dump behaves differently based on whether the user initiated the backup of a Volume/Qtree or a sub-directory.

With Sub-directory dumps, each directory and file below the specified root of the backup must be examined to determine whether it should be included in the backup.
This is a time-consuming operation especially if the sub-directory contains several million inodes (files).

With Volume and Qtree based backups, it is not necessary to go through the entire directory tree to determine what needs to be backed up. Instead we just go through the inode file for the backup snapshot. By just looking at an inode entry, it is possible to determine whether the inode exists and whether it is a part of a given qtree/volume.
This is a much faster and an efficient process as compared to the directory traversal. However, the inode file does not contain enough information to be able to do the same for sub-directory dumps. So we cannot apply this efficient process for sub-directory dumps.
..."

Looking more closely at your log, can I ask what your backup selection looks like for this volume? Comparing it with the log for one of my own, it looks like it is /vol/home/. If it has that trailing /, is it being treated as a "sub-directory dump" rather than a "volume or qtree based backup", hence the length of time it's taking to do incrementals? Or am I just reading too much into this and it's simply due to the number of inodes, as you've already said?

AdN
Level 4

Hi Andy,

In the backup policy we have /vol/home (without a trailing slash) together with 19 other volumes, so it's a volume dump without directory traversal. We don't have qtrees in the home volume, which holds the home directories of about 200 users. That policy ran fine for about a year with full and incremental backups. We then upgraded to NetBackup 6.5.4 because of a backup problem caused by Unicode file names.

In the meantime the number of inodes has dropped from 30 million to 27 million. An incremental backup ran fine yesterday, and a restore from a full backup with fewer than 30 million inodes also went through. I'm currently testing restores from a full backup taken some weeks ago, when the volume had more than 30 million inodes; that one failed previously.

30 million is an odd number, not close to any boundary of common data types; I would understand it if we had crossed 2^31 inodes. It looks more like a timing issue: calculating the difference takes longer than a certain timeout. The backup and media server is this model:

[root@adnbackup ~]$ prtdiag | more
System Configuration:  Sun Microsystems  sun4us Fujitsu Siemens PRIMEPOWER450 2x SPARC64 V
System clock frequency: 220 MHz
Memory size: 4096 Megabytes

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      1869     3.0   5        5.1
 0     1     1      1869     3.0   5        5.1

Not the most powerful machine by today's standards, but it has run fine for years.

Best regards,
Bernd

AdN
Level 4

Hi,

I got a solution from Symantec support. By default the NDMP progress timeout is 8 hours. Starting with patch levels 6.0 MP7 and 6.5.4, a higher timeout can be defined. I increased it to 24 hours (1440 minutes):

echo 1440 > /usr/openv/netbackup/db/config/NDMP_PROGRESS_TIMEOUT

http://support.veritas.com/docs/249241
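
To check or revert the setting later (the value is in minutes; removing the touch file should restore the 8-hour default):

# confirm the current override
cat /usr/openv/netbackup/db/config/NDMP_PROGRESS_TIMEOUT
# revert to the default 8-hour timeout
rm /usr/openv/netbackup/db/config/NDMP_PROGRESS_TIMEOUT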

Best regards,
Bernd

Marianne
Level 6
Partner    VIP    Accredited Certified
Great stuff! Thanks for sharing this info!

MattS
Level 6

Thanks for the update, AdNovum!

Putting this on my internal wiki, as I'm sure I'll run into this someday.

Thanks again!
Matt