Troubleshoot Restore Problems

manmountain · ‎07-16-2013

Hi there, I am attempting to do a restore of a directory but it is taking ages so i am trying to troubleshoot the problem now.

Sometimes when I attepmpt the restore netbackup does not get as far as mounting the tape in the tape drive. When it does get by this part it just begings to hang.

I am getting this error in the bptm log where the tape is not mounting:

12:57:01.757 [24238] <2> drivename_close: Called for file HP.ULTRIUM5-SCSI.001

12:57:11.758 [24238] <2> drivename_open: Called with Create 1, file HP.ULTRIUM5-SCSI.001

12:57:11.758 [24238] <2> drivename_lock: lock failed

Any help much appreciated!

manmountain · ‎07-16-2013

I am now getting this error... Any ideas on what is going on here:

13:53:15.348 [27704] <2> child_wait: waitpid returned zero, no children waiting (mpxrestore.c:2945)

13:53:15.669 [263] <2> bptm: INITIATING (VERBOSE = 5): -rptdrv -jobid -1373979192 -jm

13:53:15.669 [263] <2> bptm: PORT_STATUS = 0x00000000

13:53:15.670 [263] <2> drivename_open: Called with Create 0, file HP.ULTRIUM5-SCSI.001

13:53:15.670 [263] <2> drivename_checklock: Called

13:53:15.670 [263] <2> drivename_checklock: PID 27704 has lock

13:53:15.670 [263] <2> report_drives: DRIVE = HP.ULTRIUM5-SCSI.001 LOCK = TRUE CURTIME = 1373982795

13:53:15.670 [263] <2> report_drives: MODE = 0

13:53:15.670 [263] <2> report_drives: TIME = 1373980682

13:53:15.670 [263] <2> report_drives: MASTER = attembak5

13:53:15.670 [263] <2> report_drives: SR_KEY = 0 1

13:53:15.670 [263] <2> report_drives: PATH = /dev/rmt/1cbn

13:53:15.670 [263] <2> report_drives: MEDIA = Q930L5

13:53:15.670 [263] <2> report_drives: REQID = 45131

13:53:15.670 [263] <2> report_drives: ALOCID = 824889

13:53:15.670 [263] <2> report_drives: RBID = {2449F58C-EE1A-11E2-B996-3CD92BECCAA4}

13:53:15.670 [263] <2> report_drives: PID = 27704

13:53:15.670 [263] <2> report_drives: FILE = /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM5-SCSI.001

13:53:15.670 [263] <2> main: Sending [EXIT STATUS 0] to NBJM

13:53:15.670 [263] <2> bptm: EXITING with status 0 <----------

13:53:16.870 [27704] <2> child_wait: waitpid returned zero, no children waiting (mpxrestore.c:2945)

13:53:18.392 [27704] <2> child_wait: waitpid returned zero, no children waiting (mpxrestore.c:2945)

13:53:19.913 [27704] <2> child_wait: waitpid returned zero, no children waiting (mpxrestore.c:2945)

Mark_Solutions · ‎07-16-2013

A few reasons for this but could do with knowing about your system - O/S, NBU Version and patch level, Shared Drives? etc.

If drive should be free but you cannot get a lock on them then there may be orphaned bptm processes locking them or there could be an orphaned reservation on the drive

If you run nbrbutil -dump it will show all of the reservations

If you see one that should not be there then release it using nbrbutil -releaseMDS id

If no jobs are running at all you could use nbrbutil -resetAll to clear everything down and then try again.

This tech note covers it as well as removing old lock files: http://www.symantec.com/docs/TECH126575

Hope this helps

Mark_Solutions · ‎07-16-2013

Ok, now you have got the lock and it is running - it looks to be from a MPX backup so by default will wait incase there are any other restores to be submitted

The time it waits is set in the Master Servers Host Propeties - General Server - Delay on multiplexed restores

If you dont do multiplexed restores you can reduce thi svalue but will need a service re-start to register the change

manmountain · ‎07-16-2013

Hi Mark, I have checked the Master Server Host Properties - General Server and it is set to 30 seconds. Do you recemmend I lower this vlaue?

I am going to attach output of bpbrm logfile:

manmountain · ‎07-16-2013

I reduced this parameter down to 0. I restarted all services. I ran command nbrbutil -resetAll. There is nothing from output from command nbrbutil -dump. I am getting this error.

5:27:52.602 [27704] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)
15:27:53.602 [27704] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)
15:27:53.609 [7635] <2> drivename_open: Called with Create 1, file HP.ULTRIUM5-SCSI.001
15:27:53.609 [7635] <2> drivename_lock: lock failed
15:27:53.609 [7635] <2> drivename_close: Called for file HP.ULTRIUM5-SCSI.001
15:27:54.603 [27704] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)
15:27:55.603 [27704] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)

Mark_Solutions · ‎07-16-2013

A few questions:

1. So what O/S and NBU Version etc are you on? (looks like UNIX of some sort)

2. Where are you running the restore from?

3. Is the restore going to the original client? (From the log it looks like you are re-directing it to the Master Server?)

Just need to double check exactly what you are doing

manmountain · ‎07-16-2013

1. So what O/S and NBU Version etc are you on? (looks like UNIX of some sort)

Solaris 10 Netbackup 7.0.1

2. Where are you running the restore from?

Running Restore from Master Server

3. Is the restore going to the original client? (From the log it looks like you are re-directing it to the Master Server?)

Yes it's going to the original client

manmountain · ‎07-16-2013

The restore just stalls now for some reason, not sure why. I've attached latest message from bprm.

Also on the client this is what I am getting from tar file.

13:05:33 (45131.001) INF - TAR STARTED
13:05:33 (45131.001) Setting network receive buffer size to 32032 bytes
15:25:27 (45131.001) Write interrupted by SIGPIPE.
15:25:27 (45131.001) INF - TAR EXITING WITH STATUS = 40
15:25:27 (45131.001) INF - TAR RESTORED 0 OF 9 FILES SUCCESSFULLY
15:25:27 (45131.001) INF - TAR KEPT 9 EXISTING FILES
15:25:27 (45131.001) INF - TAR PARTIALLY RESTORED 0 FILES

15:26:30 (45132.001) INF - TAR STARTED
15:26:30 (45132.001) Setting network receive buffer size to 32032 bytes

Mark_Solutions · ‎07-16-2013

OK - looking through the log again it seems to be working through the file system for files to restore but its progress is fairly slow and it finds the original files so keeps then (I assume overwrite is not selected)

But after 5 minutes it all goes wrong

Based on thi splease increase the Client Read and Client Connect timouts on your Master and Media Servers to a higher figure (i am guessing they are set to 300 - so try increasing them to 1800 or 3600)

manmountain · ‎07-16-2013

bptm log file attached

manmountain · ‎07-16-2013

On the Media Server Properties Client Connect was set to 300 I changed this to 3600 while Client read timeout was set to 8400 so I left it at this vlaue. After changing these parameters do I need to restart netbackup processes?

manmountain · ‎07-16-2013

Hi Mark,

Just wondering if there are backups jobs about to begin then if I change the priority of the restore higher will restore job continue and queue up the backups. Earlier when I was attempting to do restore backups were continuting as normal and restore job was queued even though I had restore job priority as higher priority.

Thanks

Mark_Solutions · ‎07-16-2013

Once jobs are actually running they will carry one and if mutiplexed the restore will not kick in until all jobs running to a drive have completed

If you start the restore before the backups kick in then it will stay running and the backups will have to queue.

The value changes do normally need a service re-start. As your client read timeout was pretty high then I am surprised that it seems to drop out after 5 minutes

It is worth checking the client read timeout on the clients host properties and the Master Server too and set them all high

Mark_Solutions · ‎07-16-2013

OK - so can you tell me your version and patch level of NetBackup please

How long did you wait before re-running the restore after the service re-start

Please confirm all names involved here - i am seeing different names to those in the earlier log you posted

#edit# also how old is this backup - was it done using the same media sever and tape drives - has anything changed recently in your configuration (san switch / connections etc etc)

manmountain · ‎07-16-2013

I have changed the Client parameters for Client read timeout to be 3600. I have stopped netabckup service and restarted. I am getting this now contsantly in bptm log.

16:59:40.537 [8766] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)
16:59:40.906 [13466] <2> drivename_open: Called with Create 1, file HP.ULTRIUM5-SCSI.001
16:59:40.906 [13466] <2> drivename_lock: lock failed
16:59:40.907 [13466] <16> drivename_wait_lock: unable to access drive HP.ULTRIUM5-SCSI.001, drive is locked
16:59:40.907 [13466] <2> drivename_close: Called for file HP.ULTRIUM5-SCSI.001
16:59:40.907 [13466] <2> send_MDS_msg: OP_STATUS 0 824894 attembak5 8198 5 0 0 0 0 0 0 *NULL* 0
16:59:40.907 [13466] <2> EndpointSelector_R2::updateConfMembers: tardy lookup log time value set to 5 seconds(Endpoint_Selector.cpp:1169)
16:59:40.908 [13466] <2> EndpointSelector_R2::updateConfMembers: USE_VXSS = PROHIBITED(Endpoint_Selector.cpp:1180)
16:59:40.908 [13466] <2> EndpointSelector_R2::endpoint_from_profile: Evaluating 3 endpoints(Endpoint_Selector.cpp:1356)
16:59:40.908 [13466] <2> ConnectList::search: host master5-mgmt not found in cache(ConnectList.cpp:142)
16:59:40.908 [13466] <2> ConnectList::search: host 10.248.0.7 found GOOD(ConnectList.cpp:116)
16:59:40.908 [13466] <2> EndpointSelector_R2::performCallUsingEndpoint: insecure invocation->perform_call(...) to host 10.248.0.7 SUCCESS(Endpoint_Selector.cpp:1712)
16:59:40.909 [13466] <2> EndpointSelector_R2::endpoint_from_profile: Evaluating 3 endpoints(Endpoint_Selector.cpp:1356)
16:59:40.909 [13466] <2> ConnectList::search: host master5-mgmt not found in cache(ConnectList.cpp:142)
16:59:40.909 [13466] <2> ConnectList::search: host 10.248.0.7 found GOOD(ConnectList.cpp:116)
16:59:40.909 [13466] <2> EndpointSelector_R2::performCallUsingEndpoint: insecure invocation->perform_call(...) to host 10.248.0.7 SUCCESS(Endpoint_Selector.cpp:1712)
16:59:40.918 [13466] <2> Orb::setDebugLevelFromVxul: Orb logging configuration level set to 0(Orb.cpp:2284)
16:59:40.918 [13466] <2> emmlib_handleMessage: (0) CORBA call returned 2005023
16:59:40.918 [13466] <16> send_MDS_msg: Error from emmlib_handleMessage, Master master5, type 12, returned error 2005023
16:59:40.918 [13466] <2> send_operation_error: Decoded status = 6 from 5
16:59:40.918 [13466] <2> main: Sending [EXIT STATUS 174] to NBJM
16:59:40.918 [13466] <2> bptm: EXITING with status 174 <----------
16:59:41.537 [8766] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)
16:59:42.537 [8766] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)
16:59:43.537 [8766] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5438)
16:59:44.537 [8766] <2> child_wait: waitpid

manmountain · ‎07-16-2013

Netbackup 7.0.1

I would say i waited approx 2 mins.

Names i prefer not to put up the real names. Apologies if this is causing some confusion.

It has restored some of the files. Very few of the files but a few none the less.

Backup is from back in April. The configuration did not change in the meantime. Not on the masrer/media server anyway. On the client there may have been changes but not sure what. So media server and tape drive have remained the same.

Appreciate your time here btw!

Mark_Solutions · ‎07-17-2013

There is clearly something odd going on with the tape drive but when it does restore it seems to spend a lot of time working through all of the existing files to determine whether it should restore any or not.

It may be worth restoring everything to an alternate location for now so that it can just get on with doing the restore.

The path also look slike a web site of some sort - have you shut down all web services / daemons before doing the restore as the files may all be locked.

I would run it to another location and then copy the files across to where you need them

manmountain · ‎07-17-2013

Hi Mark,

After leaving restore job overnight it comes up as incomplete and I get this when I come back in the Activity monitor > Job Details > Detailed Status

In this instance I am restoring job to different location but to the same Client. Maybe I will restore to a different client?

Jul 16, 2013 5:52:28 PM - Error bptm (pid=19069) error requesting media, TpErrno = Robot operation failed
Jul 16, 2013 5:52:27 PM - requesting resource Q930L5
Jul 16, 2013 5:52:27 PM - granted resource Q930L5
Jul 16, 2013 5:52:27 PM - granted resource HP.ULTRIUM5-SCSI.000
Jul 16, 2013 5:52:28 PM - current media Q930L5 complete, requesting next media HP.ULTRIUM5-SCSI.000:NetBackup:Q930L5
Jul 16, 2013 5:54:44 PM - started process bptm (pid=19069)
Jul 16, 2013 5:54:44 PM - mounting Q930L5
Jul 16, 2013 5:54:44 PM - granted resource Q930L5
Jul 16, 2013 5:54:44 PM - granted resource HP.ULTRIUM5-SCSI.001
Jul 16, 2013 5:55:55 PM - mounted Q930L5; mount time: 0:01:11
Jul 16, 2013 5:55:55 PM - positioning Q930L5 to file 3
Jul 16, 2013 5:57:46 PM - positioned Q930L5; position time: 0:01:51
Jul 16, 2013 5:59:41 PM - begin reading
Jul 17, 2013 5:17:07 AM - Error bpbrm (pid=19074) socket read failed: errno = 62 - Timer expired
Jul 17, 2013 5:17:07 AM - Error bpbrm (pid=19074) could not close progress file /usr/openv/netbackup/logs/user_ops/root/logs/jbp-13610373992852443412000000034-2qaqLA.log on master5
Jul 17, 2013 5:17:07 AM - Error bpbrm (pid=19074) client close errno = 1
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) cannot write data to socket, Interrupted system call
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) The following files/folders were not restored:
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/netsim_pm_setup_st.tar
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/netsimlin157_rollout.log
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-0.110.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-0.111.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-0.112.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-0.113.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-0.114.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-0.115.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-0.116.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) UTF - /var/www/html/scripts/automation_wran/rollout-unstable.sh
Jul 17, 2013 5:17:07 AM - Error bptm (pid=19073) more than 10 files were not restored, remaining ones are shown in the progress log.
Jul 17, 2013 5:18:53 AM - restored from image Client-bup_1366475354; restore time: 11:26:27
Jul 17, 2013 5:18:53 AM - Warning bprd (pid=14411) Restore must be resumed prior to first image expiration on Thu Oct 31 00:00:00 2013

Thanks, M

Marianne · ‎07-17-2013

We need to see what is happening in this period:

Jul 16, 2013 5:59:41 PM - begin reading
Jul 17, 2013 5:17:07 AM - Error bpbrm (pid=19074) socket read failed: errno = 62 - Timer expired

ALL of these logs are needed:

On media server: bptm and bpbrm logs for 16 & 17 July

On client: tar

Please copy log files to .txt files (e.g. bpbrm16Jul.txt) and upload as File attachments.

Handy NetBackup Links

VOX

Troubleshoot Restore Problems