RMAN clone/restore problem
Hi,
I am seeing intermittent issues when restoring an Oracle database using RMAN.
The backup is done to a 5220 at our live site, with the database going to dedupe, and the archive logs going to advanced disk.
The database is optimised duplicated to a remote 5520 (same NBU domain).
The archive logs are duplicated to the same remote 5220, then duplicated to tape.
We identify the images required, and make the relevant offsite copy the primary. (copy 2 (dedupe) for database, copy 3 (tape) for arcs)
When we start the clone there appears to be a delay for netbackup to send the data and RMAN thinks the pieces are unavailable. See error below.
The pieces for restore are available, and if we restart it, it usually works. If we rehydrate to tape and use the tape copies, it works.
The strange thing is, if I manually duplicate (rehydrate) the offending images to tape, there appears to be a delay there as well. The tape mounts and sits there for up to 15 mins or so, and then kicks into life. See second section below. (9 minute delay this time)
So its not RMAN, is it a rehydration problem. These databases are small, about 25Gb in size.
CLIENT_READ_TIMEOUT settings etc are set to 3600.
Output from RMAN fail.
Starting restore at 02.10.13 22:38:25
allocated channel: ORA_AUX_DISK_1
channel ORA_AUX_DISK_1: SID=353 device type=DISK
allocated channel: ORA_AUX_DISK_2
channel ORA_AUX_DISK_2: SID=3 device type=DISK
allocated channel: ORA_AUX_DISK_3
channel ORA_AUX_DISK_3: SID=51 device type=DISK
allocated channel: ORA_AUX_SBT_TAPE_1
channel ORA_AUX_SBT_TAPE_1: SID=103 device type=SBT_TAPE
channel ORA_AUX_SBT_TAPE_1: Veritas NetBackup for Oracle - Release 7.5 (2012091610)
allocated channel: ORA_AUX_SBT_TAPE_2
channel ORA_AUX_SBT_TAPE_2: SID=153 device type=SBT_TAPE
channel ORA_AUX_SBT_TAPE_2: Veritas NetBackup for Oracle - Release 7.5 (2012091610)
channel ORA_AUX_SBT_TAPE_1: starting datafile backup set restore
channel ORA_AUX_SBT_TAPE_1: restoring control file
channel ORA_AUX_SBT_TAPE_1: reading from backup piece DBNAME_HOT_FULL_DAILY_m8ol6ckq_827536026
channel ORA_AUX_SBT_TAPE_1: ORA-19870: error while restoring backup piece DBNAME_HOT_FULL_DAILY_m8ol6ckq_827536026
ORA-19507: failed to retrieve sequential file, handle="DBNAME_HOT_FULL_DAILY_m8ol6ckq_827536026", parms=""
ORA-27029: skgfrtrv: sbtrestore returned error
ORA-19511: Error received from media manager layer, error text:
Failed to open backup file for restore.
Output from manual duplication to tape
03/10/2013 10:36:47 - begin Duplicate
03/10/2013 10:36:53 - requesting resource nbu5520-LTO4-1
03/10/2013 10:36:53 - requesting resource @aaaab
03/10/2013 10:36:53 - reserving resource @aaaab
03/10/2013 10:36:55 - reserved resource @aaaab
03/10/2013 10:36:55 - granted resource LT4811
03/10/2013 10:36:55 - granted resource tapelib-DRIVE06
03/10/2013 10:36:55 - granted resource nbu5520-LTO4-1
03/10/2013 10:36:55 - granted resource MediaID=@aaaab;DiskVolume=PureDiskVolume;DiskPool=dp_disk_nbu5520;Path=PureDiskVolume;StorageServer=nbu5520;MediaServer=nbu5520
03/10/2013 10:36:56 - Info bptm(pid=24450) start
03/10/2013 10:36:56 - started process bptm (24450)
03/10/2013 10:36:58 - Info bptm(pid=24450) start backup
03/10/2013 10:36:58 - Info bpdm(pid=24455) started
03/10/2013 10:36:58 - started process bpdm (24455)
03/10/2013 10:36:58 - Info bpdm(pid=24455) reading backup image
03/10/2013 10:36:59 - Info bpdm(pid=24455) using 64 data buffers
03/10/2013 10:36:59 - Info bpdm(pid=24455) requesting nbjm for media
03/10/2013 10:36:59 - Info bptm(pid=24450) Waiting for mount of media id LT4811 (copy 3) on server nbu5520.
03/10/2013 10:36:59 - started process bptm (24450)
03/10/2013 10:36:59 - mounting LT4811
03/10/2013 10:36:59 - Info bptm(pid=24450) INF - Waiting for mount of media id LT4811 on server nbu5520 for writing.
03/10/2013 10:38:02 - Info bptm(pid=24450) media id LT4811 mounted on drive index 6, drivepath /dev/nst3, drivename tapelib-DRIVE06, copy 3
XXXXX 9 minute gap here before we write to tape XXXXX
03/10/2013 10:47:15 - begin reading
03/10/2013 10:47:26 - Info bptm(pid=24450) waited for full buffer 149 times, delayed 34112 times
03/10/2013 10:47:30 - end reading; read time: 00:00:15
03/10/2013 10:47:30 - Info bpdm(pid=24455) completed reading backup image
We are using NBU 7.5.4 and OS is hp-ux. Oracle is 11.2.03
Many thanks in anticipation. - Darren
We opened a call with Symantec and sent them full verbose logs from the client and the media server.
It turns out our problem is related to ET3016768.
We ran a script on the appliance that shows the nbMSDPpoCounts totals.
Our offending images were all in one policy, but with different schedules.
The counts for this policy were many thousands higher than any other policy. See below
The issue is how the appliance "paths" the images, and thousands of backups under the same "path" means the appliance has to traverse the paths to find the images required.
Count Client/Policy
------- ------------------------------------------
65907 clienta-b/CLIENTA-B-B_ORA_BKUP
5269 clientb-b/CLIENTB-B_BKUP
4060 clientc-b/CLIENTC-B_ORA_BKUP
4019 clientd-b/CLIENTD-B_BKUP
. . . SNIP . . .
Symantec recommend using more policies to spread the load on the "paths".
We have rehydrated lots of long-term retention backups off the appliance to tape, and reduced the above counts significantly. We have also changed our SLP's to do the same thing moving forward.
The above actions have fixed our optimised duplication / rehydration delays, and timeouts on RMAN restores.
There is currently no fix for this, but may be fixed in 7.6.1.
Thanks for your help.