
Netbackup 7.5 - Duplication / Replication Performance

SYMAJ
Level 6
Partner Accredited

I have a site with a 'strange' requirement (considering multiple 5220 appliances are in place) whereby data must initially be backed up to an advanced disk 'landing zone' on the appliance, then duplicated both to de-dup disk on the same appliance and to SAN-attached tape.  In addition, the de-dup copy is replicated to a remote 5220 appliance and from there duplicated on to tape.  All very complicated, and some may say unnecessary - but that is the requirement.

After initially 'seeding' the remote appliance on site it was relocated to the DR site, and connected via an 800Mb link.  All appeared well.

SLPs are being utilised to manage the backup/duplication/replication of the data.

Approximately 160 servers are being backed up totalling around 28TB of data for a full backup.  Full backups are running mainly at weekends, with incrementals running during the week.

Using the standard LIFECYCLE parameters, there were initially a high number of duplication and replication jobs running in order to satisfy the requirements.  A 3 x LTO5 tape library is attached via SAN to both of the local appliances, with two paths from each appliance to the 'tape SAN'.  One major issue I came across is that each image being duplicated 'single streams' to a tape drive - multiplexing is not honoured for duplications even when configured within the storage units.

This resulted in many duplication jobs queueing for tape drives, some for many hours, and jobs failing with various errors including status 83/84/190/191.

I have a support case open at present and they recommended changing LIFECYCLE parameters to 'batch' the jobs.  This was done by setting the following parameters:

MIN_GB_SIZE_PER_DUPLICATION_JOB  512

MAX_GB_SIZE_PER_DUPLICATION_JOB 1024

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 180

This appeared to result in duplications / replications overrunning in a big way, so I subsequently reduced these to 128GB / 512GB / 60mins.
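For reference, these entries live in the LIFECYCLE_PARAMETERS file on the master server (on a UNIX master that is /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS).  A sketch of the file with the reduced values - note that 7.5.x expects the 'name = value' form, and the file is re-read at the next session rather than needing a service restart:

```
MIN_GB_SIZE_PER_DUPLICATION_JOB = 128
MAX_GB_SIZE_PER_DUPLICATION_JOB = 512
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 60
```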

When I look into the failing duplication logs I see that the jobs wait a long time for the 'logical' tape drive resource, and once they get the 'logical' resource they then wait for the 'physical' tape resource.  When they eventually get a drive, it appears not to be allocated correctly: NBU is unable to mount the media on the tpreq path, so when the job comes to write to the tpreq location it fails with a status 83.  My feeling is that this is a device allocation / SSO issue.

Based on the above, is there any option for 'streamlining' the duplication process?  The duplication step from 'landing zone' to 'de-dup' disk works fine every time - it is the 'landing zone' to tape step which has all of the issues.  Failing jobs do retry and eventually succeed, but this can take days...

I understand that the data flow is out of the norm, with the advanced disk landing zone (copy 1) duplicating to de-dup disk (copy 2) and tape (copy 3) - then replication to a remote appliance (copy 1 remote) and duplication to remote tape (copy 2 remote) - but as above this is the specific customer requirement.

Both NetBackup masters are at 7.5.0.4, with all 3 appliances at 2.5.1B.

Any thoughts / input appreciated......

AJ.

 

23 REPLIES

SYMAJ
Level 6
Partner Accredited

I have a support case open at present and they recommended changing LIFECYCLE parameters to 'batch' the jobs. This was done by setting the following parameters:

MIN_GB_SIZE_PER_DUPLICATION_JOB 512

MAX_GB_SIZE_PER_DUPLICATION_JOB 1024

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 180

This appeared to result in duplications / replications overrunning in a big way, so I subsequently reduced these to 128GB / 512GB / 60mins.

 

When I applied the above it did indeed have an effect - too much of an effect, hence my reducing the numbers provided by support.

I don't recall having to restart any services to effect these changes - I believe the file is re-read every time the session interval is triggered.

Did the changes have any effect?  If not - just to be pedantic - did you try with the = sign in?  Mine are working without it, but hey, worth a try if you're seeing no effect at present!

AJ

smakovits
Level 6

What exactly do you mean by overrunning, like running way longer than they ever used to before?

watsons
Level 6

At 7.5.0.x you definitely need the "=" sign in all the entries of LIFECYCLE_PARAMETERS; not sure why support told you it's not needed. If you ever upgrade from a previous version to 7.5.x you will notice that file is modified to include the "=" sign on every entry.

It's easy to test whether they're working: run a small backup and check if the duplication starts after 5 minutes with the following setting:

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 5
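To make the check concrete, here is a minimal sketch that writes that setting in the "=" form and flags any entry missing it.  It writes to ./LIFECYCLE_PARAMETERS purely for illustration - on a UNIX master the real file is /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS:

```shell
#!/bin/sh
# Sketch only: writes to ./LIFECYCLE_PARAMETERS for illustration.
# On a UNIX master the real file is:
#   /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS
LP_FILE=./LIFECYCLE_PARAMETERS

cat > "$LP_FILE" <<'EOF'
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 5
EOF

# On 7.5.x every entry should use the "name = value" form;
# flag any line that lacks an "=" sign.
if grep -qv '=' "$LP_FILE"; then
    echo "WARNING: found entries without '=' - they may be silently ignored"
else
    echo "all entries use the '=' form"
fi
```

Then run a small backup into the SLP and watch the Activity Monitor - with this setting the duplication should fire within about 5 minutes.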

smakovits
Level 6

OK, I think I can confirm that the "=" is needed after all.

Essentially I saw no change in the way my SLPs were running whether the rule was in place or not, so I disabled it.  However, the real reason I disabled it was that my SLPs were staying queued for long periods and the number of jobs grew out of control - I wanted to make sure the parameters file was not the issue, so I renamed it.

Well, after that, things cleared up a bit and I thought maybe I didn't need it after all - until I moved some more jobs around.  Still without the parameters file, I moved some jobs and suddenly new SLPs started queueing again.  Essentially I had 400+ SLP jobs queued and waiting for tape.  I have 20 tape drives for those SLPs and they were running, but whenever new jobs start they sit for hours waiting for a tape, which is probably what leads to the large queue.

Regardless, here is the important part...

As a test to reduce the 400+ SLP jobs being queued I returned to the parameters file, but this time I added the "=":

DUPLICATION_SESSION_INTERVAL_MINUTES = 20
MIN_GB_SIZE_PER_DUPLICATION_JOB = 256
MAX_GB_SIZE_PER_DUPLICATION_JOB = 512
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 180

This morning my development VMs started and all but 4 finished before I finally saw an SLP duplication kick off.  It contained essentially all of my development VMs, with a size of 296GB, so batching is definitely working with the "=".  Before, when the "=" was not in there, I saw no change: dev VMs would start and the SLP jobs would contain only a few machines each, telling me that my original parameters file was doing nothing and had no impact on how my SLPs were being queued.
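For what it's worth, the batching behaviour I'm seeing can be sketched as a toy model (my own simplification, not the actual NetBackup logic, and the image sizes are made-up examples): completed images accumulate until MIN_GB is reached, a batch never exceeds MAX_GB, and the MAX_MINUTES timer would force out whatever remains:

```shell
#!/bin/sh
# Toy model of SLP duplication batching - an illustration only,
# not the real NetBackup implementation.
MIN_GB=256
MAX_GB=512

batch=0
submitted=""
for img_gb in 40 120 100 60 90 150; do
    # If adding this image would blow past MAX_GB, submit the current batch first.
    if [ $((batch + img_gb)) -gt "$MAX_GB" ] && [ "$batch" -gt 0 ]; then
        echo "submit duplication job: ${batch}GB"
        submitted="$submitted ${batch}"
        batch=0
    fi
    batch=$((batch + img_gb))
    # Once MIN_GB is reached the batch is eligible to run.
    if [ "$batch" -ge "$MIN_GB" ]; then
        echo "submit duplication job: ${batch}GB"
        submitted="$submitted ${batch}"
        batch=0
    fi
done
if [ "$batch" -gt 0 ]; then
    echo "residual ${batch}GB held until the MAX_MINUTES timer forces it out"
fi
```

With my 256/512 values that model produces a couple of ~300GB jobs rather than hundreds of tiny ones, which matches the single 296GB duplication I saw this morning.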

My master is on UNIX and is running 7.5.0.6.