TAPE_RESOURCE_MULTIPLIER seems not to work

William_Brown · 05-20-2013

Platform: Solaris 10 SPARC, NetBackup 6.5.6 (yes I know it is out of support, and no, we will not upgrade). We don't have the ET2073042 EEB and will not be able to apply it.

I've read the other threads about why not all drives are in use for duplications, and while we have the same symptoms I'm confident that we don't have the same causes.

We have 30 LTO-4 drives across 8 media servers. Each media server has its own private disk pool. All the media servers are in a media server load balancing STUG for the disks, and also for the tapes. There is one SLP that is used to do the backups to disk and then duplicate to tape. We back up about 1.5 times the total disk pool capacity every 24 hours, so we need all tape drives running duplications to keep ahead of the backups.

Duplications do not go cross-media server, so to empty a media server's disk pool its own tape drives are used.

We consistently have 30 duplications 'Active' and several hundred queued. The queued jobs are waiting for the LCM_-prefixed resource for the tape STUG. The active jobs each hold one of the 30 'slots' for that resource; we can see that in the nbrbutil output. All this is expected behaviour.

The problem arises partly because one of the media servers has 2 drives while the others have 4, so it has half the bandwidth. Naturally it accumulates a disproportionate share of the duplication backlog. Over time, more of the 'Active' jobs are for that media server; they acquire the LCM_ resource but cannot actually do any work because its drives are already in use.

For each Active job that cannot run, one of the 30 'slots' is taken, which means that on another media server there is an idle tape drive, because there is no corresponding Active job for it. Over time we can end with 2 drives in use on the media server with 2 drives, 28 'Active' jobs all for that same media server, 28 idle drives, and 7 other media servers whose disk pools are full. At which point 'Status 129' errors begin to get out of hand.
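To make the arithmetic of that starvation concrete, here is a minimal toy model in Python. It is only a sketch of the counting argument above, not of how the Resource Broker actually works; the server names ms1 to ms8 are made up, and the drive counts are the ones described in this post.

```python
from collections import Counter

# Drive counts per media server as described in the post:
# one server with 2 drives, seven with 4. Server names are hypothetical.
DRIVES = {"ms1": 2, **{f"ms{i}": 4 for i in range(2, 9)}}

def busy_drives(active_jobs, drives=DRIVES):
    """Count drives that can be writing, given the server each active job needs."""
    per_server = Counter(active_jobs)
    # A server can never use more drives than it has, however many jobs are active for it.
    return sum(min(per_server[server], count) for server, count in drives.items())

total = sum(DRIVES.values())

# Balanced case: every active job has a drive on its own server.
balanced = [server for server, count in DRIVES.items() for _ in range(count)]

# Worst case from the post: all 30 active slots held by jobs for the 2-drive server.
skewed = ["ms1"] * total

print("balanced busy/idle:", busy_drives(balanced), total - busy_drives(balanced))
print("skewed busy/idle:  ", busy_drives(skewed), total - busy_drives(skewed))
```

In the skewed case only the 2 drives on ms1 are busy and the other 28 sit idle, even though all 30 slots are taken.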

As far as I can tell from the documentation, TAPE_RESOURCE_MULTIPLIER is designed for exactly this problem.

"This parameter helps administrators ensure a balance in the following situation:

•To overload the Resource Broker with jobs it can't run is not prudent.

•Make sure that there's enough work that is queued so that the devices won't become idle. The TAPE_RESOURCE_MULTIPLIER entry lets administrators tune the amount of work that the Resource Broker can evaluate for a particular storage unit.

For example, a particular storage unit contains three write drives. If the TAPE_RESOURCE_MULTIPLIER parameter is set to two, then the limit on concurrently active jobs is six. Other duplication jobs requiring the storage unit remain queued"
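Read literally, the documented limit is just drives times multiplier. The small Python sketch below restates that arithmetic; the function name is my own, and applying it to all 30 drives assumes they count as write drives of a single storage unit, which may not be how the limit is evaluated for a storage unit group.

```python
def active_job_limit(write_drives, multiplier):
    """Documented rule as I read it: concurrently active jobs = write drives * multiplier."""
    return write_drives * multiplier

# The example quoted from the documentation: three write drives, multiplier of two.
documented_example = active_job_limit(3, 2)
print("documented example:", documented_example)

# Our 30 drives under a few multiplier values (assumes one storage unit, see above).
for multiplier in (1, 2, 4):
    print(f"multiplier {multiplier}: limit {active_job_limit(30, multiplier)}")
```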

I set TAPE_RESOURCE_MULTIPLIER 4, expecting the number of Active jobs to increase to 4 * 30, but nothing at all happened. As far as I can tell no restart is required. I can see in the nbstserv log (OID=226) that it is picking up the contents of LIFECYCLE_PARAMETERS, because for a while we had TAPE_RESOURCE_MULTIPLIER set to 1 before setting it to 4 (the default is 2):

/usr/openv/netbackup/bin/vxlogview -p 51216 -i 226 | grep TAPE_RESOURCE_MULTIPLIER

.

05/15/13 09:11:05.195 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 1

05/15/13 09:35:01.281 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 1

05/15/13 10:00:24.618 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 4

05/15/13 10:25:45.591 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 4
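For anyone wanting to check the same thing on a longer log, a short Python sketch like the following pulls out the points where the logged value changes. It assumes the lines look exactly like the four above (the sample text is copied from them); the pattern and function name are my own.

```python
import re
from datetime import datetime

# Sample lines copied from the vxlogview output above.
LOG = """\
05/15/13 09:11:05.195 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 1
05/15/13 09:35:01.281 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 1
05/15/13 10:00:24.618 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 4
05/15/13 10:25:45.591 [SessionParameters::getConfgInfo] TAPE_RESOURCE_MULTIPLIER = 4
"""

# Timestamp, bracketed source, then the parameter and its integer value.
PATTERN = re.compile(r"^(\S+ \S+) \[.*?\] TAPE_RESOURCE_MULTIPLIER = (\d+)$")

def value_changes(text):
    """Return (timestamp, value) for the first line and for each line where the value differs."""
    changes, last = [], None
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if not match:
            continue  # skip blank or unrelated lines
        when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S.%f")
        value = int(match.group(2))
        if value != last:
            changes.append((when, value))
            last = value
    return changes

changes = value_changes(LOG)
for when, value in changes:
    print(when.strftime("%H:%M:%S"), "->", value)
```

On the sample it reports the value 1 first seen at 09:11:05 and the change to 4 at 10:00:24, so the new setting was being read from that session onward.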

I also did "/usr/openv/netbackup/bin/nbstserv new_session -force" after the change to be sure it was immediately picked up.

It is mildly frustrating that jobs are selected to be Active regardless of how many other jobs that must run on the same media server are already Active. For the Active jobs I can tell which media server they are for from the disk volumes they have reserved, but for the 'Queued' jobs I have yet to find a way to see what they will duplicate when they go Active, as they have made no reservations. However, I think that if I could make TAPE_RESOURCE_MULTIPLIER work it would not matter.

My workaround (which also explains why I am confident the issue is not the same as others in this forum) is to cancel the Active-but-Not-Working jobs; I can see what media server they are for and if as is often the case all are for the 2-drive media server, another set of Queued jobs becomes Active, and there is a reasonable chance some will be for Media Servers whose drives are idle. If I do this I can get almost all drives busy, but it requires constant oversight.
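The selection step of that workaround amounts to the following Python sketch. It is illustrative only: the job IDs and server names are invented, the job-to-server mapping would still have to come from the disk volume reservations as described, and it assumes the jobs that went Active first on a server are the ones holding its drives.

```python
from collections import defaultdict

def surplus_jobs(active, drives):
    """Return ids of active jobs that hold a slot but have no free drive on their server.

    active: list of (job_id, server) in the order the jobs went Active.
    drives: mapping of server name to number of tape drives.
    """
    seen = defaultdict(int)
    surplus = []
    for job_id, server in active:
        seen[server] += 1
        # Anything beyond the server's drive count is Active but cannot be writing.
        if seen[server] > drives.get(server, 0):
            surplus.append(job_id)
    return surplus

# Hypothetical snapshot: ms1 has 2 drives but four Active jobs.
drives = {"ms1": 2, "ms2": 4}
active = [(101, "ms1"), (102, "ms1"), (103, "ms2"), (104, "ms1"), (105, "ms1")]

candidates = surplus_jobs(active, drives)
print("cancel candidates:", candidates)
```

Here jobs 104 and 105 are the ones I would cancel, in the hope that the slots they free go to jobs for servers with idle drives.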

The fact that one media server has fewer drives is not the cause; it just aggravates the situation. If any media server had very large duplications, it would create a backlog. We also use MSEO, so some tape writes are encrypted, which reduces throughput to tape; if any server had 4 encrypted duplications it too could become a bottleneck. The real issue is not being able to keep at least 4 Active duplications for each media server.

Have I misread what TAPE_RESOURCE_MULTIPLIER is supposed to do? Do I need to restart NetBackup on the Master Server? Has anyone else seen it change the number of 'Active' duplications?
