Oracle RMAN Backup job on N5020 considerably slower than on tape library

Alberto_Colombo
Level 4
Partner Accredited

 

Hi Everybody,
 
I have a question: why is a deduplication backup of an Oracle database (using RMAN) considerably slower than the same backup job sent to a tape library?
 
Here is my setup:
 
One NetBackup master server 7.1.0.3 on RHEL 5.6
One NetBackup media server 7.1.0.3 on Windows 2008
One N5020 running 1.4.1.1
Oracle Database 10.2.0.3 on RHEL 5.6 x64 (1 TB) with NetBackup Client Agent 7.1.0.3
 
We ran this comparison:
We took a full Oracle backup (RMAN) to a tape library (FC), and it took approx. 5 hrs (which, by the way, was the average time we had always experienced).
The very same day we took a full Oracle backup (RMAN) to the N5020, and it took more than 9 hrs.
In the following days, these backup jobs (with the same spec) on the N5020 never took less than approx. 8 hrs, and often took more than 10-12 hrs.
 
We raised the number of active channels in the RMAN backup script to 2 and set FILESPERSET=1, but without any benefit.
 
Can someone give me some hints/tips to help solve this issue?
 
Thanks in advance,
Alberto
1 ACCEPTED SOLUTION


21 REPLIES

Efi_G
Level 4

Our backup to VTL was much faster than to the 5020.

I even tested backing up to a 5220 AdvancedDisk, and it was slower than the VTL.

I have a case open with support; they ran an app-critical network test, and I am waiting to see what they come up with.

Alberto_Colombo
Level 4
Partner Accredited

That's quite sad :(

Please let me know if you find out anything more about it.

I can tell you that we've opened a TAR with Oracle Support (I hoped they would have some case history to help us), but they dismissed it quite quickly, telling us that the appliance is the only thing to blame.

Mark_Solutions
Level 6
Partner Accredited Certified

A few things here which may assist and are worth testing (one at a time)...

1. If you are doing client-side dedupe, then try using media-server side

2. Try using dedupe compression if you are not already (client and media-server side: in /usr/openv/lib/ost-plugins/pd.conf change to COMPRESSION = 1)

3. Reduce the default fragment size of the storage unit to 5000MB

4. SIZE and NUMBER_DATA_BUFFERS - see if they exist and what values are used - I have found that performance can sometimes be better without them.

5. Check that the appliance has been tuned - not sure if this is used on the 5020, but on the 5200 there is a tuning script that should run when first configured - if it has one, it may be located at: /opt/NBUAppliance/scripts/bin/tune.pl

If it exists, check with Symantec whether it is OK to run it - if so, do the following:

cd to /opt/NBUAppliance/scripts/ and type the following once in that directory:
../bin/perl tune.pl

Hope this helps
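For reference, items 2 and 4 above are just one-line touch files and a one-line config change. A minimal sketch, assuming the standard NetBackup layout; the buffer values are only examples, and NBU_ROOT defaults to a demo directory so the sketch can be dry-run (set NBU_ROOT=/usr/openv on a real media server):

```shell
# Sketch of items 2 and 4 above. NBU_ROOT defaults to a demo directory
# so this can be dry-run; use NBU_ROOT=/usr/openv on a real media server.
NBU_ROOT="${NBU_ROOT:-/tmp/nbu-demo}"

# Item 4: the tape data-buffer touch files (values illustrative, not mandates)
mkdir -p "$NBU_ROOT/netbackup/db/config"
echo 262144 > "$NBU_ROOT/netbackup/db/config/SIZE_DATA_BUFFERS"
echo 128    > "$NBU_ROOT/netbackup/db/config/NUMBER_DATA_BUFFERS"

# Item 2: enable dedupe compression in the OST plug-in config
PD_CONF="$NBU_ROOT/lib/ost-plugins/pd.conf"
mkdir -p "$(dirname "$PD_CONF")"
touch "$PD_CONF"
if grep -q '^COMPRESSION' "$PD_CONF"; then
  # replace an existing COMPRESSION line in place
  sed -i 's/^COMPRESSION.*/COMPRESSION = 1/' "$PD_CONF"
else
  echo 'COMPRESSION = 1' >> "$PD_CONF"
fi
```

NetBackup reads these touch files at job start, so no service restart is needed for the buffer values to take effect on the next backup.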

Alberto_Colombo
Level 4
Partner Accredited

Hi Mark,

Thanks for your reply; here is my setup:

  1. I'm already using media-server side dedup
  2. COMPRESSION is already set to 1
  3. my max fragment size is 51200MB - I'll try to reduce it to 5000MB
  4. cat SIZE_DATA_BUFFERS: 132096 ; cat NUMBER_DATA_BUFFERS: 128
  5. tune.pl does not exist in /opt

I'll try varying the fragment size; I've read that modifying SIZE and NUMBER_DATA_BUFFERS should be done really carefully.

thank you!

Alberto

Sebastian_Baszc
Level 5
Partner Accredited Certified

Hi Alberto,

 

Could you send me the RMAN script and config you are using for your backup, please?

Also: SIZE_DATA_BUFFERS should be one of the standard power-of-two values, so it should be 131072, not 132096.

 

Could you also make sure that the following is configured in RMAN:

 

 

CONFIGURE BACKUP OPTIMIZATION OFF;
CONFIGURE ENCRYPTION FOR DATABASE OFF;
CONFIGURE DEVICE TYPE SBT_TAPE BACKUP TYPE TO BACKUPSET;
 
A few things to check/verify:
 
The number of allocated channels depends on the size of the database and the number/type of the file systems on which the Oracle data files are located. If your disks are fast, or the data files are spread across several disks, use even more channels.
You can also try enabling client-side deduplication if you have spare cycles on that server.
 
Hope it helps.
 
S

Alberto_Colombo
Level 4
Partner Accredited

 

Hi Sebastian,
I'll reply to your answer below, but first I need to clarify our environment better:
 
One NetBackup master server 7.1.0.3 on RHEL 5.6 (which also acts as a media server when it has to use our FC tape library)
One NetBackup media server 7.1.0.3 on Windows 2008
One N5020 running 1.4.1.1
Oracle Database 10.2.0.3 on RHEL 5.6 x64 (1 TB) with NetBackup Client Agent 7.1.0.3
 
 
The values I gave for SIZE and NUMBER_DATA_BUFFERS are from the RHEL master/media server - our Windows media server doesn't have any such config.
 
I've looked at the config values in the RMAN catalog; optimization and encryption are already set as you wrote.
About the third parameter, I have this one:
CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET
but in the RMAN script, as you can see, "device type disk" is overridden.
 
Maybe we will need some more channels (the DB is more than 1TB, and the data files are spread across multiple LUNs with OCFS2), while client-side dedup is not an eligible choice for us.
 
thank you very much,
Alberto
 

drpaine10
Level 3

We are having the same issue. Did any of these suggestions improve your Oracle backups?

Has this issue been resolved? Thanks

jandersen1
Level 4
Partner Accredited

Hi,

What does the detailed job info say about waiting for empty buffers on the PureDisk jobs?

What are the contents of SIZE_DATA_BUFFERS_DISK and NUMBER_DATA_BUFFERS_DISK, which are the files used for disk pools? - not SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS, which are used for tape communication (physical and virtual - VTL).

--jakob;

drpaine10
Level 3

No waiting for empty buffers is reported, but rather, for example: waited for full buffer 12625 times, delayed 16416 times.

These do not currently exist -- SIZE_DATA_BUFFERS_DISK and NUMBER_DATA_BUFFERS_DISK

SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS are set to 262144 and 256 respectively.
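Those counters come from the bptm process log; a tiny helper can tally them from a saved log file. The message text below matches the "waited for full buffer" line quoted above, but the exact wording can vary by NetBackup version, so treat this as a sketch:

```shell
# Count producer/consumer buffer-wait lines in a saved bptm log file.
# "waited for full buffer"  -> the writer is starved for data (source/network slow)
# "waited for empty buffer" -> the storage side can't drain fast enough
count_buffer_waits() {
  grep -c 'waited for \(full\|empty\) buffer' "$1"
}
```

A high "waited for full buffer" count, as reported here, means the process writing to storage is starved for incoming data - i.e. the bottleneck is on the client/network side, not in the disk pool itself.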

chashock
Level 6
Employee Accredited Certified

You mention you're going to a library.  How many drives and of what type?  I have seen cases where streaming to multiple tape units in a library can outperform certain disk setups. 

Are you using SAN Client?

Alberto_Colombo
Level 4
Partner Accredited

Hi Sebastian, 

I've found this note from Symantec:

http://www.symantec.com/business/support/index?page=content&id=TECH35968

so I've added these parameters to my media servers:

 

[root@srvnbms01 config]# cat NUMBER_DATA_BUFFERS_DISK
64
[root@srvnbms01 config]# cat SIZE_DATA_BUFFERS_DISK
1048576
 
but I think my RMAN backup is now even slower :(
 
Do you have any more ideas/advice?
 
Alberto

 

Chad_Wansing2
Level 5
Employee Accredited Certified

OK, so to clear up a couple of things: in the absence of SIZE_DATA_BUFFERS_DISK and the associated "NUMBER" file, the standard SIZE_DATA_BUFFERS is used.  I would highly recommend 262144 as your size data buffer, with the recognition that if you exceed your hardware device's buffer size you will get a buffer-overrun condition that lets you write the backups but not restore them.  Provided you're using at least LTO3 drives, this value will be fine.  If you want more performance out of the back end of the process (media server writing to tape), feel free to change these files.  HOWEVER, where I really see the biggest bang for my buck is the client-side communications buffer.  Windows clients have a default of something like 16KB.  Sure, that keeps old Windows NT boxes from breaking, but it's not so hot for performance.  The max value is 32767, i.e. 32MB of RAM that will be reserved for NBU during the backup.  This lets NBU buffer more data on the client side of the backup operation before flushing to the network.  Your traffic profile will get a LOT more bursty, but I've seen SIGNIFICANT improvements in throughput.
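On a UNIX client the equivalent knob is the NET_BUFFER_SZ touch file (value in bytes); on Windows it is the "Communications buffer size" in the client's host properties. A hedged sketch - the value is only an example, and NBU_ROOT defaults to a demo directory so it can be dry-run:

```shell
# Raise the client-side network buffer on a UNIX client (value in bytes).
# NBU_ROOT defaults to a demo directory; use NBU_ROOT=/usr/openv on a real client.
NBU_ROOT="${NBU_ROOT:-/tmp/nbu-demo}"
mkdir -p "$NBU_ROOT/netbackup"
echo 262144 > "$NBU_ROOT/netbackup/NET_BUFFER_SZ"
```

As with the data-buffer touch files, the value is read at job start, so it applies to the next backup without a restart.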

Additionally, for Oracle backups to dedupe storage the best practice is to leverage an RMAN capability called "proxy copy" to make sure the data always comes out of the DB the same way.  This results in higher dedupe rates which in turn will give you better performance.

Another note: backups of a single client to tape will MOST times be faster than to dedupe, particularly on data sets with very low dedupe rates.  Deduplication inserts a non-trivial amount of CPU processing into the backup process, whereas backup to tape is a pure I/O operation: take the data in on this Ethernet port, send it out on this FC port.  Of course tape is going to be faster.  NOW, the REAL difference shows when you start running 100, 200 or more jobs simultaneously.  If you even have enough tape devices to handle that kind of load, you're probably using multiplexing settings that will kill you in a DR event.  Disk-based backups, though, can cope with that level of parallelization, leading to shorter backup windows.  While each stream may not be as fast as the same stream sent to tape, the real question is how long your backup window is, right?  THAT'S where dedupe helps; it's all about parallelizing the workload... well, that and saving you a ton of disk space you would have needed to store all those backups.  :)

 

Alberto_Colombo
Level 4
Partner Accredited

 

Hi Chad,

As I wrote above, I've set these parameters on my media servers:

 

[root@srvnbms01 config]# cat NUMBER_DATA_BUFFERS_DISK
64
[root@srvnbms01 config]# cat SIZE_DATA_BUFFERS_DISK
1048576
 
As a result, the RMAN backup to the N5020 took 18 hrs (while the same backup to tape usually takes 5 hrs, and with the previous parameters:
SIZE_DATA_BUFFERS: 132096
NUMBER_DATA_BUFFERS: 128
it usually lasted from 9 to 12 hrs).
 
Now I've removed all the SIZE_DATA_BUFFERS* and NUMBER_DATA_BUFFERS* parameters, and I'm checking whether the defaults are better.
I've also asked my Oracle DBA to add "proxy copy" to the RMAN script.
Is there anything more I can do to get better performance from the N5020?
Is there any chance of backups to the N5020 taking the same time as to FC tape?
 
regards,
Alberto

Chad_Wansing2
Level 5
Employee Accredited Certified

So I would look at doing a couple of things:

  1. Increase your client comm buffer size to at LEAST 64K - I used to set it considerably higher, but find the right performance balance for your environment
  2. Increase your number of data buffers - what I generally do is watch RAM utilization on my media server and keep pushing up my client communications and number of data buffers until I'm at 80% RAM utilization
  3. Set your size data buffers to 262144 - use the NUMBER of data buffers, more than the SIZE, to push up performance
  4. Ask the support folks to run the standalone sequencer utility (sas.exe) between your clients and your master and media servers to make sure your network is providing a good foundation.  NBU behaves more like a streaming protocol than most applications, and if your network connectivity isn't smooth it can cause problems that even the network guys often won't detect.
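For step 2, the shared-memory cost of the buffers is roughly SIZE_DATA_BUFFERS x NUMBER_DATA_BUFFERS bytes per concurrent stream, which makes the 80%-RAM ceiling easy to estimate before touching anything. A back-of-envelope sketch using the values suggested above (stream count is an arbitrary example):

```shell
# Rough shared-memory estimate: SIZE * NUMBER bytes per concurrent stream.
size_data_buffers=262144     # step 3's suggested value, in bytes
number_data_buffers=256      # candidate NUMBER_DATA_BUFFERS value to test
streams=4                    # e.g. four concurrent drives/streams (example)
per_stream_mb=$(( size_data_buffers * number_data_buffers / 1024 / 1024 ))
echo "per-stream: ${per_stream_mb} MB, total: $(( per_stream_mb * streams )) MB"
# prints "per-stream: 64 MB, total: 256 MB"
```

Comparing that total against the media server's free RAM tells you how far you can push NUMBER_DATA_BUFFERS before hitting the 80% ceiling.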

Altimate1
Level 6
Partner Accredited

Hi, I'm wondering whether the "Maximum concurrent jobs" setting on the N5020 STU could be playing a part in the performance.

Regards

Chad_Wansing2
Level 5
Employee Accredited Certified

It's possible.  I generally set the max concurrent jobs to 100... it's a nice round number, and I haven't had problems with it, although I generally don't run a ton of simultaneous backups since most of what I'm doing is for POCs.  That said, the key is to push the envelope for as many simultaneous jobs as you possibly can.  The good thing with dedupe is that for about 90% of the traffic, the data is thrown out at either the client or the media server, so the dedupe box shouldn't have to do much other than answer the question "have you ever seen this chunk of data before?"

So for first backups I might start the max concurrent jobs setting a bit lower as you're actually going to be moving real data, but after that I would keep bumping it up until you start to see stream throughput being impacted.

As for the above references with the Oracle backups, there are a lot of other factors that go into how many streams are really created and allowed to run concurrently, almost all of which is controlled by the RMAN parameters in the script file on the client.

Altimate1
Level 6
Partner Accredited

Hi, it would be great to have Alberto's feedback on the concurrent-jobs settings above.
Regards

Alberto_Colombo
Level 4
Partner Accredited

Hi Bernard, hi Chad,

The great news is that by removing just one parameter from the RMAN script, my Oracle backup (which on Wednesday lasted almost 23 hrs) yesterday took seven and a half hours to complete!!!! :)

That parameter is FILESPERSET=1 (we had added it because it is a best practice taken from the Symantec admin guide).

With it in the RMAN script, we saw almost 350 NetBackup jobs (each taking an average of 4 mins, even though some jobs were backing up just some 50MB); without it, the number of NetBackup jobs dropped to about 100, each job with a 70% dedup rate.

I just don't understand why the Symantec and Oracle admin guides don't point out clearly that implementing it can dramatically affect backup times.

This morning I've also changed NUMBER_DATA_BUFFERS to 128 and added one more channel to the RMAN scripts (now there are 3); next Monday I'll see whether these changes have improved the backup time any further.

regards,

Alberto
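For reference, a sketch of the script shape Alberto describes: three SBT channels and no FILESPERSET clause, so RMAN packs multiple data files per backup set and NetBackup sees fewer, larger jobs. Channel names and the BACKUP command are illustrative, not Alberto's actual script:

```
RUN {
  ALLOCATE CHANNEL ch1 DEVICE TYPE 'SBT_TAPE';
  ALLOCATE CHANNEL ch2 DEVICE TYPE 'SBT_TAPE';
  ALLOCATE CHANNEL ch3 DEVICE TYPE 'SBT_TAPE';
  # No FILESPERSET=1 here: RMAN packs many data files per backup set,
  # yielding ~100 larger NetBackup jobs instead of ~350 tiny ones.
  BACKUP DATABASE;
  RELEASE CHANNEL ch1;
  RELEASE CHANNEL ch2;
  RELEASE CHANNEL ch3;
}
```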

Altimate1
Level 6
Partner Accredited

Thanks for the technical feedback you provided.

Regards