Forum Discussion

Doctorski's avatar
Doctorski
Level 5
12 years ago

RMAN clone/restore problem

Hi,

     I am seeing intermittent issues when restoring an Oracle database using RMAN.

The backup is done to a 5220 at our live site, with the database going to dedupe, and the archive logs going to advanced disk.

The database is optimised duplicated to a remote 5520 (same NBU domain).

The archive logs are duplicated to the same remote 5220, then duplicated to tape.

We identify the images required, and make the relevant offsite copy the primary. (copy 2 (dedupe) for database, copy 3 (tape) for arcs)

When we start the clone there appears to be a delay for netbackup to send the data and RMAN thinks the pieces are unavailable. See error below.

The pieces for restore are available, and if we restart it, it usually works. If we rehydrate to tape and use the tape copies, it works.

The strange thing is, if I manually duplicate (rehydrate) the offending images to tape, there appears to be a delay there as well. The tape mounts and sits there for up to 15 mins or so, and then kicks into life. See second section below. (9 minute delay this time)

So its not RMAN, is it a rehydration problem. These databases are small, about 25Gb in size.

CLIENT_READ_TIMEOUT settings etc are set to 3600.

Output from RMAN fail.

Starting restore at 02.10.13 22:38:25
allocated channel: ORA_AUX_DISK_1
channel ORA_AUX_DISK_1: SID=353 device type=DISK
allocated channel: ORA_AUX_DISK_2
channel ORA_AUX_DISK_2: SID=3 device type=DISK
allocated channel: ORA_AUX_DISK_3
channel ORA_AUX_DISK_3: SID=51 device type=DISK
allocated channel: ORA_AUX_SBT_TAPE_1
channel ORA_AUX_SBT_TAPE_1: SID=103 device type=SBT_TAPE
channel ORA_AUX_SBT_TAPE_1: Veritas NetBackup for Oracle - Release 7.5 (2012091610)
allocated channel: ORA_AUX_SBT_TAPE_2
channel ORA_AUX_SBT_TAPE_2: SID=153 device type=SBT_TAPE
channel ORA_AUX_SBT_TAPE_2: Veritas NetBackup for Oracle - Release 7.5 (2012091610)

channel ORA_AUX_SBT_TAPE_1: starting datafile backup set restore
channel ORA_AUX_SBT_TAPE_1: restoring control file
channel ORA_AUX_SBT_TAPE_1: reading from backup piece DBNAME_HOT_FULL_DAILY_m8ol6ckq_827536026
channel ORA_AUX_SBT_TAPE_1: ORA-19870: error while restoring backup piece DBNAME_HOT_FULL_DAILY_m8ol6ckq_827536026
ORA-19507: failed to retrieve sequential file, handle="DBNAME_HOT_FULL_DAILY_m8ol6ckq_827536026", parms=""
ORA-27029: skgfrtrv: sbtrestore returned error
ORA-19511: Error received from media manager layer, error text:
   Failed to open backup file for restore.

 

Output from manual duplication to tape

 

03/10/2013 10:36:47 - begin Duplicate
03/10/2013 10:36:53 - requesting resource nbu5520-LTO4-1
03/10/2013 10:36:53 - requesting resource @aaaab
03/10/2013 10:36:53 - reserving resource @aaaab
03/10/2013 10:36:55 - reserved resource @aaaab
03/10/2013 10:36:55 - granted resource LT4811
03/10/2013 10:36:55 - granted resource tapelib-DRIVE06
03/10/2013 10:36:55 - granted resource nbu5520-LTO4-1
03/10/2013 10:36:55 - granted resource MediaID=@aaaab;DiskVolume=PureDiskVolume;DiskPool=dp_disk_nbu5520;Path=PureDiskVolume;StorageServer=nbu5520;MediaServer=nbu5520
03/10/2013 10:36:56 - Info bptm(pid=24450) start            
03/10/2013 10:36:56 - started process bptm (24450)
03/10/2013 10:36:58 - Info bptm(pid=24450) start backup           
03/10/2013 10:36:58 - Info bpdm(pid=24455) started            
03/10/2013 10:36:58 - started process bpdm (24455)
03/10/2013 10:36:58 - Info bpdm(pid=24455) reading backup image          
03/10/2013 10:36:59 - Info bpdm(pid=24455) using 64 data buffers         
03/10/2013 10:36:59 - Info bpdm(pid=24455) requesting nbjm for media         
03/10/2013 10:36:59 - Info bptm(pid=24450) Waiting for mount of media id LT4811 (copy 3) on server nbu5520.
03/10/2013 10:36:59 - started process bptm (24450)
03/10/2013 10:36:59 - mounting LT4811
03/10/2013 10:36:59 - Info bptm(pid=24450) INF - Waiting for mount of media id LT4811 on server nbu5520 for writing.
03/10/2013 10:38:02 - Info bptm(pid=24450) media id LT4811 mounted on drive index 6, drivepath /dev/nst3, drivename tapelib-DRIVE06, copy 3

XXXXX 9 minute gap here before we write to tape XXXXX


03/10/2013 10:47:15 - begin reading
03/10/2013 10:47:26 - Info bptm(pid=24450) waited for full buffer 149 times, delayed 34112 times    
03/10/2013 10:47:30 - end reading; read time: 00:00:15
03/10/2013 10:47:30 - Info bpdm(pid=24455) completed reading backup image

 

We are using NBU 7.5.4 and OS is hp-ux. Oracle is 11.2.03

 

Many thanks in anticipation. - Darren

 

  • We opened a call with Symantec and sent them full verbose logs from the client and the media server.

    It turns out our problem is related to ET3016768. 

    We ran a script on the appliance that shows the nbMSDPpoCounts totals.

    Our offending images were all in one policy, but with different schedules.

    The counts for this policy were many thousands higher than any other policy. See below

    The issue is how the appliance "paths" the images, and thousands of backups under the same "path" means the appliance has to traverse the paths to find the images required.

    Count   Client/Policy

        ------- ------------------------------------------

          65907 clienta-b/CLIENTA-B-B_ORA_BKUP

           5269 clientb-b/CLIENTB-B_BKUP

           4060 clientc-b/CLIENTC-B_ORA_BKUP

           4019 clientd-b/CLIENTD-B_BKUP

        . . . SNIP . . .

    Symantec recommend using more policies to spread the load on the "paths".

    We have rehydrated lots of long-term retention backups off the appliance to tape, and reduced the above counts significantly. We have also changed our SLP's to do the same thing moving forward.

    The above actions have fixed our optimised duplication / rehydration delays, and timeouts on RMAN restores.

    There is currently no fix for this, but may be fixed in 7.6.1.

    Thanks for your help.

     

     

     

  • Also within RMAN we set NB_ORA_DISK_MEDIA_SERVER=nbu5520 to use the remote 5520.

  • I am not sure that the delay in duplication is related to the restore problem.

    You will need logs to troubleshoot each of the 2 issues.

    Restore:
    bprd on the master
    bptm and bpbrm on media server
    bpcd and dbclient on destination client

    duplication:
    admin on master server
    bptm bpdm and bpbrm on remote media server.

  • Thanks Marianne,

    I will look into the duplication issue first as we are working around the restore issue by rehydrating to tape via SLP and using the tape copy. This works first time every time.

    I will update when I have something and then move on to the restore from dedupe issue.

    - Darren

  • We opened a call with Symantec and sent them full verbose logs from the client and the media server.

    It turns out our problem is related to ET3016768. 

    We ran a script on the appliance that shows the nbMSDPpoCounts totals.

    Our offending images were all in one policy, but with different schedules.

    The counts for this policy were many thousands higher than any other policy. See below

    The issue is how the appliance "paths" the images, and thousands of backups under the same "path" means the appliance has to traverse the paths to find the images required.

    Count   Client/Policy

        ------- ------------------------------------------

          65907 clienta-b/CLIENTA-B-B_ORA_BKUP

           5269 clientb-b/CLIENTB-B_BKUP

           4060 clientc-b/CLIENTC-B_ORA_BKUP

           4019 clientd-b/CLIENTD-B_BKUP

        . . . SNIP . . .

    Symantec recommend using more policies to spread the load on the "paths".

    We have rehydrated lots of long-term retention backups off the appliance to tape, and reduced the above counts significantly. We have also changed our SLP's to do the same thing moving forward.

    The above actions have fixed our optimised duplication / rehydration delays, and timeouts on RMAN restores.

    There is currently no fix for this, but may be fixed in 7.6.1.

    Thanks for your help.