
Netbackup 7.5 - Duplication / Replication Performance

SYMAJ
Level 6
Partner Accredited

I have a site with a 'strange' requirement (considering multiple 5220 appliances are in place) whereby data is required to be backed up initially to an advanced disk 'landing zone' on the appliance - and subsequently duplicated to de-dup disk on the same appliance and also to SAN attached tape.  In addition the de-dup copy is then replicated to a remote 5220 appliance - and further onto tape from there.  All very complicated, and some may say unnecessary - but that is the requirement.

After initially 'seeding' the remote appliance on site it was relocated to the DR site, and connected via an 800Mb link.  All appeared well.

SLPs are being utilised to manage the backup/duplication/replication of the data.

Approximately 160 servers are being backed up totalling around 28TB of data for a full backup.  Full backups are running mainly at weekends, with incrementals running during the week.

Using the standard LIFECYCLE parameters initially there were a high number of duplication and replication jobs running in order to satisfy the requirements.  There is a 3 x LTO5 tape library attached via SAN to both of the local appliances - with two paths from each appliance to the 'tape SAN'.  One major issue I came across was the fact that each image being duplicated 'single streams' to a tape drive, and multiplexing is not provided even when configured within storage units.

This resulted in many duplication jobs queueing for tape drives, some for many hours, and jobs failing with various errors including 83/84/190/191.
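
For a sense of scale, here is a rough sketch of the best-case time needed to stream a 28TB full backup to tape. The 140 MB/s figure is the published native LTO-5 rate and is an assumption here, not a measurement from this site; the sketch also assumes decimal units, no compression, and drives that never wait for a mount or for data.

```python
# Rough lower bound on the time needed to stream a full backup to tape.
# Assumptions (not from the thread): decimal units, LTO-5 native rate of
# 140 MB/s per drive, no compression, drives never starved or waiting.

LTO5_NATIVE_MB_PER_S = 140  # published native LTO-5 transfer rate


def tape_window_hours(total_tb: float, drives: int,
                      mb_per_s: float = LTO5_NATIVE_MB_PER_S) -> float:
    """Hours to write total_tb terabytes across `drives` drives in parallel."""
    total_mb = total_tb * 1_000_000          # 1 TB = 1,000,000 MB (decimal)
    seconds = total_mb / (drives * mb_per_s)
    return seconds / 3600


if __name__ == "__main__":
    for drives in (2, 3, 4):
        print(f"{drives} drives: {tape_window_hours(28, drives):.1f} h")
```

Under these assumptions three drives need roughly 18.5 hours of uninterrupted streaming, so any time a drive spends idle, mounting, or being handed between media servers comes straight out of the weekend window.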

I have a support case open at present and they recommended changing LIFECYCLE parameters to 'batch' the jobs.  This was done by setting the following parameters:

MIN_GB_SIZE_PER_DUPLICATION_JOB  512

MAX_GB_SIZE_PER_DUPLICATION_JOB 1024

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 180

This appeared to result in duplications / replications overrunning in a big way, so I subsequently reduced these to 128GB / 512GB / 60mins.

When I look into the failing duplication logs I see that the jobs have been waiting for a long time for the 'logical' tape drive resources, and when they get the 'logical' resource they then wait for the 'physical' tape resource.  When they eventually get a drive, it appears not to be allocated correctly and NBU is unable to mount the resource to the tpreq path and therefore when it comes to write to the tpreq location it gets the 83 error.  My feeling is that this is a device allocation / SSO issue.

Based on the above, is there any option for 'streamlining' the duplication process ?  The duplication step from 'landing zone' to 'de-dup' disk works fine every time - it is the 'landing zone' to tape which has all of the issues.  Failing jobs do retry and eventually succeed, but this could take days....

I understand that the data flow is out of the norm, with the advanced disk landing zone (copy 1) duplicating to de-dup disk (copy 2) and tape (copy 3) - then replication to a remote appliance (copy 1 remote) and duplicating to remote tape (copy 2 remote) - but as above this is the specific customer requirement.

Both Netbackup Masters are at 7.5.0.4, with all 3 appliances at 2.5.1B.

Any thoughts / input appreciated......

AJ.

 

23 REPLIES 23

Umair_Hussain
Level 4
Partner Accredited

What are the possibilities on upgrading the appliance to 2.5.2 ??

SYMAJ
Level 6
Partner Accredited

Will 2.5.2 help, or just a statement ?

I have held off doing this 2.5.2 upgrade until things settle down a little more (people seem to be bombarded with alerts following the upgrade etc.).  I am always wary of jumping in too early unless there is a good reason to do so !

If this will help me then I will plan for the upgrade (Masters x 2 to 7505 then appliances x 3 to 2.5.2).

AJ

RonCaplinger
Level 6

Excuse me if I ask basic questions you've already tried:

  1. In your SLPs, are you specifying the correct Alternate Read Server for every duplication step?  I have seen similar behavior in my systems if I leave that blank; NBU will choose a media server that doesn't have tape drive connectivity for a duplication to tape.
  2. I've never seen duplications to tape use multiplexing.  I was under the impression that only backups from the clients would ever multiplex.  Have you used multiplexing with your duplications before?
  3. Have you considered spacing your full backups during the week?  Maybe create separate policies and schedules to run full backups on 20 servers on Sunday and incremental for the others, then another policy for 20 more servers to be backed up in full on Monday, etc. 

SYMAJ
Level 6
Partner Accredited

Ron,

1. Appliance A duplicates from its own advanced disk partition to its de-dup disk partition, and to its own tape drives - so no Alternate Read Server is specified.  The same appliance is the media server for all of its operations.

2. Neither have I, unless you are duplicating a tape which is already multiplexed - this functionality would be in my view a good enhancement request.

3. A consideration, but I only want to produce tapes (and have to move one set of tapes) once per week - not every day.

AJ

Mark_Solutions
Level 6
Partner Accredited Certified

OK - first off you cannot multiplex a tape duplication - it has never been possible but I have heard that it may be coming.

I could understand if de-dupe to tape was slower due to re-hydration but if your advanced disk to tape is slow then the first place I would look is your data buffer settings on the appliance.

You can do this in the O/S (Support - Maintenance) or via the CLISH under the Settings section for NetBackup (Settings - NetBackup DataBuffers Number Tape 64   etc..)

What you will find is that you need to set both disk and tape buffer sizes, because if you set only the tape sizes then the disk will also use them (don't know why!)

So I use 64 for the number of disk and tape buffers, 262144 for the size of tape buffers and 1048576 for the size of disk buffers.

If that still doesn't help then you may simply need more tape drives, but 3 x LTO5 on a single media server is more than I would normally recommend anyway, so you may need a second appliance.

Hope this helps

SYMAJ
Level 6
Partner Accredited

Mark,

Thanks for the input.  The buffers on all appliances were set to the values below from installation time:

SETTINGS/NETBACKUP/DATABUFFERS SIZE TAPE 262144 – done all 3 appliances

SETTINGS/NETBACKUP/DATABUFFERS NUMBER TAPE 128 – done all 3 appliances

SETTINGS/NETBACKUP/DATABUFFERS SIZE DISK 1048576 – done all 3 appliances

SETTINGS/NETBACKUP/DATABUFFERS NUMBER DISK 64 – done all 3 appliances
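
As a quick sanity check on what these values cost in memory, here is a minimal sketch. It assumes the usual rule of thumb that each active stream uses about NUMBER x SIZE bytes of shared memory; the function name and the labels are made up for illustration.

```python
# Estimate shared memory per active stream from the data buffer settings.
# Assumption: memory per stream is roughly NUMBER * SIZE bytes.

def buffer_mib(number: int, size_bytes: int, streams: int = 1) -> float:
    """Approximate shared memory in MiB for `streams` concurrent streams."""
    return number * size_bytes * streams / (1024 * 1024)


# (number of buffers, buffer size in bytes) as quoted in this thread
SETTINGS = {
    "tape, as installed (128 x 262144)": (128, 262144),
    "tape, suggested earlier (64 x 262144)": (64, 262144),
    "disk (64 x 1048576)": (64, 1048576),
}

if __name__ == "__main__":
    for label, (number, size) in SETTINGS.items():
        print(f"{label}: {buffer_mib(number, size):.0f} MiB per stream")
```

On that basis the installed tape setting works out to 32 MiB per stream and the disk setting to 64 MiB per stream, so buffer memory is unlikely to be the constraint with three drives.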

On primary site we have two appliances, both configured the same way with 18TB of advanced disk and 22TB of de-dup disk.  Each appliance has access to all 3 tape drives.

The three LTO5 drives are not on a single media server, as they are shared between the two appliances (which are media servers) and the master server (which doesn't use tape but has access to them and also controls the robot).

As we end up with literally a hundred or more duplications to tape queueing, the addition of a couple of tape drives would not help us here.  I don't mind jobs queueing, as long as they do not fail when they do come to run.

As you correctly say, there is no re-hydration going on here so that should not be slowing things down.

I am prepared to live with the queuing of jobs (to some extent) knowing that multiplexing is not an option when duplicating disk images to tape, but the jobs failing should not be happening.

We have the de-dup pools on the two local appliances running at approx 74% and 80% of capacity - is this an issue ?

As a point of interest concerning Global De-dup - not related to this issue - we have two local appliances running at 74% and 80% of the 22TB pool size, and when both of these replicate ALL their content to the single remote 5220 appliance which has a 39TB de-dup pool the capacity of the remote appliance pool is approx 84% !!  Shows the savings of Global De-Dup.......

Any other ideas.....

AJ 

Mark_Solutions
Level 6
Partner Accredited Certified

OK - so you have 2 appliances sharing three tape drives and I am guessing that each storage unit shows that there are three drives available.

So think about this scenario ....

Appliance1 has lots of duplications to do and puts them all in the queue and can have three active.

At this point it is trying to write to 3 x LTO5 drives which may be too much for it (but it may cope?) .... Appliance2 meantime has no drives to use and is doing nothing!

So you add one more drive to the tape library giving you 4 shared drives - then reduce the number of drives in each storage unit from 4 to 2 - so that each appliance can only use 2 drives at a time.

Now you will always have both appliances able to be duplicating to 2 tape drives (which is optimal with LTO5)

That has to be more efficient - up to double the duplication speed by adding one drive (so it halves your duplication window)
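
The scenario above can be sketched with a toy first-come allocator. This is only an illustration: the function and its arguments are hypothetical, and the real NetBackup resource broker takes priorities, media and drive paths into account.

```python
# Toy model of shared tape drive allocation between media servers.
# Assumption: drives are granted first come, first served, limited by the
# drive count configured on each storage unit. Not how NetBackup's
# resource broker actually decides; it only illustrates the scenario.
from typing import List


def allocate(total_drives: int, demands: List[int], cap_per_unit: int) -> List[int]:
    """Return drives granted to each media server, in request order."""
    free = total_drives
    granted = []
    for wanted in demands:
        take = min(wanted, cap_per_unit, free)
        granted.append(take)
        free -= take
    return granted


if __name__ == "__main__":
    queued = [100, 100]  # both appliances have plenty of duplications queued
    print("3 shared drives, storage units set to 3:", allocate(3, queued, 3))
    print("4 shared drives, storage units set to 2:", allocate(4, queued, 2))
```

With three drives and no cap the first appliance can take all of them, while four drives capped at two per storage unit keeps both appliances writing.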

Now if they can cope with writing to 3 drives you could add an extra 3 drives to the library and set the storage units to 3 - so each appliance constantly writes its duplications to 3 tape drives and never has to wait for the other one - that should really shorten your duplication window

The other issue you may be having is that the allocations are constantly swapping from one server to the other - appliance1 writes a duplication, finishes, dismounts the tape, gives appliance2 the drive, which then mounts a tape etc. This would be even worse if you have not set up a media sharing group.

So if you had 6 drives in the library then just give 3 to each media server and stop the drive sharing to remove all contention - do still use media sharing though!

Hope this helps

lmosla
Level 6

Hi SYMAJ,

I'd like to assist with this. DM me and send me your contact information.

Thanks,

 

 

smakovits
Level 6

Curious, in this document it states that we need to use "=" when defining a parameter.  However, you seem to be on 7.5 and say support said to use  MIN_GB_SIZE_PER_DUPLICATION_JOB 512 or is it MIN_GB_SIZE_PER_DUPLICATION_JOB = 512?
 

SYMAJ
Level 6
Partner Accredited

I will double check this for you and come back tomorrow.

AJ

SYMAJ
Level 6
Partner Accredited

Just Checked - there is no '=' symbol required in the LIFECYCLE_PARAMETERS file.

AJ

Mark_Solutions
Level 6
Partner Accredited Certified

Prior to V7.5 the LIFECYCLE_PARAMETERS file did not have an equals sign

From 7.5 onwards it must have an equals sign to be used

http://www.symantec.com/docs/HOWTO68315

SYMAJ
Level 6
Partner Accredited

Mark - I went through this confusion previously, but I am using it without the = sign and changes are being effected when I change the values......... (7505).

AJ

Mark_Solutions
Level 6
Partner Accredited Certified

OK - Thanks for that - best tell Symantec!

smakovits
Level 6

Thanks guys, now the second part of my question would be, do I create the file on my master server only or do I also create it on my media server (5220)?  My assumption is that SLPs are executed from the master, so that is where it belongs, but I want to be sure.  Thanks

Umair_Hussain
Level 4
Partner Accredited

Guys,

 

If you are making changes from the CLISH on the appliance there is no "=" sign, but in the actual LIFECYCLE_PARAMETERS file you need to put "=".

smakovits - you only need to create the file on the master server. If you have a 5220 as your master server, change your SLP settings through the CLISH, because on the appliance the parameters file is linked to another location.

Umair_Hussain
Level 4
Partner Accredited

Just saw this new update in the NetBackup 7.5 admin guide on page 584 (configuring SLP): the new format is with "=", the old format was without the equals sign.
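
Since the posts above disagree about the "=", a small checking script can at least accept both spellings and show that they carry the same values. This is a hypothetical helper for reviewing the file by hand, not a NetBackup tool, and it says nothing about which form a given release actually honours.

```python
# Tolerant reader for LIFECYCLE_PARAMETERS-style text.
# Accepts "NAME VALUE" and "NAME = VALUE"; hypothetical helper, not part
# of NetBackup, and it does not tell you which form NetBackup accepts.
import re
from typing import Dict

LINE_RE = re.compile(r"^([A-Z_]+)(?:\s*=\s*|\s+)(\d+)$")


def parse_lifecycle_parameters(text: str) -> Dict[str, int]:
    """Return a {parameter: integer value} mapping from the file text."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognised line: {raw!r}")
        params[match.group(1)] = int(match.group(2))
    return params


if __name__ == "__main__":
    old_style = "MIN_GB_SIZE_PER_DUPLICATION_JOB 512\nMAX_GB_SIZE_PER_DUPLICATION_JOB 1024\n"
    new_style = "MIN_GB_SIZE_PER_DUPLICATION_JOB = 512\nMAX_GB_SIZE_PER_DUPLICATION_JOB = 1024\n"
    print(parse_lifecycle_parameters(old_style) == parse_lifecycle_parameters(new_style))
```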

Tanmoy1
Level 4
Partner Accredited

Please update if you got any solution already. I am also having similar kind of issue. Thanks.

smakovits
Level 6

I worked with support and they confirmed that the "=" is not needed.  Here is what I have now

 

DUPLICATION_SESSION_INTERVAL_MINUTES 20
MIN_GB_SIZE_PER_DUPLICATION_JOB 256
MAX_GB_SIZE_PER_DUPLICATION_JOB 512
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 180

 

They told me the session interval is needed to change the default 5 minute interval, because otherwise it will still kick off every 5 minutes.  They also noted that changing the settings only means the system will "try" to group systems per SLP, but I still see lots of SLP jobs with one system and outside my size requirements.

 

In the end, I am not sure what these settings are actually doing for me, if anything at all.
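
One way to reason about what the batching parameters do is a simplified model: images are grouped per SLP, a batch is capped at the MAX size, and a batch is released once it reaches the MIN size or its oldest image has waited past the force-small timer. The names below are hypothetical and the real SLP manager applies more criteria than this, so treat it as a sketch only.

```python
# Simplified model of SLP duplication batching (an assumption-laden sketch,
# not NetBackup's actual algorithm). Images are grouped per SLP only.
from collections import defaultdict
from typing import List, NamedTuple, Tuple


class Image(NamedTuple):
    slp: str           # lifecycle policy the image belongs to
    size_gb: float     # image size in GB
    age_minutes: int   # time since the backup completed


def plan_batches(images: List[Image], min_gb: float = 256, max_gb: float = 512,
                 force_minutes: int = 180) -> Tuple[List[List[Image]], List[List[Image]]]:
    """Return (released, held) batches under the simplified rules."""
    by_slp = defaultdict(list)
    for img in images:
        by_slp[img.slp].append(img)

    released, held = [], []
    for imgs in by_slp.values():
        imgs.sort(key=lambda i: -i.age_minutes)      # oldest first
        batches, batch, total = [], [], 0.0
        for img in imgs:
            if batch and total + img.size_gb > max_gb:
                batches.append(batch)                # cap reached, start a new batch
                batch, total = [], 0.0
            batch.append(img)
            total += img.size_gb
        if batch:
            batches.append(batch)
        for b in batches:
            big_enough = sum(i.size_gb for i in b) >= min_gb
            too_old = max(i.age_minutes for i in b) >= force_minutes
            (released if big_enough or too_old else held).append(b)
    return released, held


if __name__ == "__main__":
    sample = [Image("slp_a", 200, 30), Image("slp_a", 100, 20),
              Image("slp_b", 10, 200), Image("slp_c", 50, 10)]
    released, held = plan_batches(sample)
    print("released:", [[i.slp for i in b] for b in released])
    print("held:", [[i.slp for i in b] for b in held])
```

In this model a lone 10GB image still goes out as its own job once it is older than the force-small timer, and images from different SLPs are never combined, which would be consistent with seeing many single-system jobs below the configured minimum.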