I recently upgraded the NetBackup Master from 6.5.5 to 7.1.0.2, and since the upgrade I have been running into this issue.

Issue: When the 'High Water Mark' is reached on the AdvancedDisk pool, nbjm fails new jobs. This only occurs for a short period (i.e., < 30 mins) while the image cleanup frees some space. But during that short period, user-defined RMAN jobs executed from cron fail and do not rerun.

I have only noticed this since the upgrade. Before the upgrade we never had this issue, so I am assuming that nbjm previously paused new jobs while the image cleanup was active, instead of failing them outright.

Is there a way to pause new jobs, instead of having them fail, while the image cleanup is active?

NetBackup Environment:

Master (x1) - Solaris 10 on SPARC hardware, running NetBackup 7.1.0.2
Media Servers (x2) - Solaris 10 on SPARC hardware, running NetBackup 6.5.5 (I will be upgrading them shortly.)
Tape Storage Units: SL500 tape library with 5 drives
Disk Storage Unit (used for staging): 22 TB LUN from Sun Storage, configured as an AdvancedDisk type.

AdvancedDisk Details:

Disk Pool Name : AdvanceDiskPool2
Disk Pool Id : AdvanceDiskPool2
Disk Type : AdvancedDisk
Status : UP
Flag : Patchwork
Flag : Visible
Flag : OpenStorage
Flag : AdminUp
Flag : InternalUp
Flag : SpanImages
Flag : LifeCycle
Flag : CapacityMgmt
Flag : FragmentImages
Flag : Cpr
Flag : RandomWrites
Flag : FT-Transfer
Flag : CapacityManagedRetention
Flag : CapacityManagedJobQueuing
Raw Size (GB) : 22339.27
Usable Size (GB) : 22339.27
Num Volumes : 1
High Watermark : 90
Low Watermark : 40
Max IO Streams : -1
Comment :
Storage Server : Media2 (UP)

- Amar

Thanks for a very detailed description of your problem. I have not seen this behaviour in 7.x, but then we have always insisted that our customers upgrade Master and Media servers all on the same day. (Clients are upgraded in a phased approach). No mention either of this type of problem in LBN or 7.1.0.3 Release Notes. I know that different NBU versions are perfectly supported, but experience has taught us that certain features just work better with everything on the same NBU level. Suggestion: Upgrade media servers. If problem persists, open a support ticket with Symantec.

Hi Marianne,

Thanks for taking the time to provide a suggestion. I will try to stick to this rule in my next upgrade process.

Due to a limited change window, I had to take a 3-phased approach for the Master and media server upgrades. As part of the upgrade, I am also migrating from Veritas Volume Manager to ZFS, hence the post-upgrade preparation alone took me around 7 hours. I also had to allow time for the upgrade, testing (liaising with the app and server teams) and rollback, without affecting the scheduled backup times. :-(

As a temporary fix (until the issue is resolved), I am planning to implement the below. I guess this should remove the duplicated images and stop the disk from hitting the high water mark?

During a quiet time, run a cronjob to force image cleanups:

nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 0 -lwm 0
nbdelete -allvolumes -force
bpimage -cleanup -allclients
nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 90 -lwm 40

I would appreciate any comments/suggestions on the above approach.

- Amar
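One risk with a sequence like this is that a failure halfway through leaves the pool with the temporary 0/0 watermarks. Below is a minimal Python sketch of a wrapper that always restores the normal values. The four commands are copied from the post; the function names, the `run` callback, and the dry-run default are hypothetical, and nothing here checks that NetBackup accepts these options or values.

```python
import subprocess

POOL = "AdvanceDiskPool2"   # disk pool name from the post
STYPE = "AdvancedDisk"      # storage type from the post


def set_watermarks(hwm, lwm):
    """Build the nbdevconfig argv from the post with the given watermarks."""
    return ["nbdevconfig", "-changedp", "-dp", POOL, "-stype", STYPE,
            "-hwm", str(hwm), "-lwm", str(lwm)]


# Cleanup commands exactly as written in the post.
CLEANUP = [
    ["nbdelete", "-allvolumes", "-force"],
    ["bpimage", "-cleanup", "-allclients"],
]


def dry_run(argv):
    """Default runner: only print the command, execute nothing."""
    print(" ".join(argv))


def really_run(argv):
    """Alternative runner that executes the command and raises on failure."""
    subprocess.run(argv, check=True)


def forced_cleanup(run=dry_run, normal=(90, 40)):
    """Lower the watermarks, run the cleanup, and always restore them."""
    run(set_watermarks(0, 0))
    try:
        for argv in CLEANUP:
            run(argv)
    finally:
        # Restore the normal watermarks even if a cleanup step raised.
        run(set_watermarks(*normal))


if __name__ == "__main__":
    forced_cleanup()  # dry run: prints the four commands in order
```

Calling `forced_cleanup(run=really_run)` would execute the commands instead of printing them; that path is untested here and is only an illustration of the restore-in-`finally` idea.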

Were you aware of the following when you decided to migrate to ZFS?

Best practice for configuring AdvancedDisk: http://www.symantec.com/docs/TECH158427

p.9: File System Restrictions

For some file system types, notably NFS and ZFS, the lack of full posix compliance of the file system means that full file system conditions cannot be detected correctly, leading to problems when spanning volumes. To avoid problems where these file systems are used, each disk pool should be comprised of only one volume so that no spanning occurs.

This lack of full posix compliance of the file system is probably the reason for high water mark not being detected correctly.

Yes! I am aware of this. I am not using the disk pooling capabilities of AdvancedDisk. I am using ZFS (previously VxFS) to combine multiple LUNs into a single volume, and that single volume (mount point) is presented to the AdvancedDisk pool. VxFS/ZFS can also spread the IO load better across the underlying LUNs, which should help performance. -Amar

You really cannot compare VxFS with ZFS. (I can provide a comparative doc if you are interested). VxFS is fully posix compliant. The Best Practice Guide says that filesystems not fully posix compliant (like ZFS) have problems detecting full filesystem condition. The implication here is that reporting of filesystem capacity will be affected. Seems your problem has nothing to do with NBU upgrade but with migration to ZFS. You did not have the same problem when you were using VxVM volumes with VxFS filesystem.

Advance disk, High Water Mark and New jobs

18 Replies

Marianne
Level 6
13 years ago
APOLOGIES!!!

I totally missed the section in your previous post where you said that the media servers were still on VxFS..... (being a Storage Foundation fanatic I only read through your motivation for choosing ZFS.)

Then back to the advice in my first reply - upgrade at least one Media Server to 7.1.x and see if NBU 'behaves' better with the same levels.

To troubleshoot what is happening with disk cleanup on the media server, create bpdm log on media servers. See what kind of info is logged with default level 0 log. Increase logging level (5) if level 0 doesn't give enough info....
Mark_Solutions
Level 6
13 years ago
Just like to clarify a couple of things please ....

When you say "new" jobs do you mean brand new jobs that have never run before (new clients and policies) or just jobs that start to run after the disk has hit its high watermark?

Do I take it that it fails with 129 errors?

Is there a reason you have a 90% high water mark? On a 22TB disk this leaves 2TB free when it hits the high water mark.

Also your low watermark is 40% - this is the point NetBackup needs to get to to finish its image cleanup - so you are asking it to clear down 8TB of space to be usable again.

Perhaps settings of 96% for high and 85% for low would be more suitable and make things run a little smoother.
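To make the watermark arithmetic concrete, here is a small Python sketch using the usable size from the disk pool listing (22339.27 GB). The function names are made up for illustration, and the figures are plain percentages of the pool, not anything reported by NetBackup itself.

```python
POOL_GB = 22339.27  # "Usable Size (GB)" from the disk pool listing


def free_at_hwm(hwm_pct, size_gb=POOL_GB):
    """Free space (GB) left at the moment usage reaches the high water mark."""
    return size_gb * (100 - hwm_pct) / 100


def cleanup_span(hwm_pct, lwm_pct, size_gb=POOL_GB):
    """Space (GB) between the high and low water marks."""
    return size_gb * (hwm_pct - lwm_pct) / 100


# Current settings, the suggested settings, and the documented defaults.
for hwm, lwm in [(90, 40), (96, 85), (98, 80)]:
    print(f"HWM {hwm}% / LWM {lwm}%: "
          f"{free_at_hwm(hwm):8.1f} GB free at HWM, "
          f"{cleanup_span(hwm, lwm):8.1f} GB between HWM and LWM")
```

With 90/40 the gap between the two marks is half the pool, so a cleanup that runs all the way to the low water mark has a lot to expire; with 96/85 both the unused headroom and the gap shrink considerably.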

Extract from this tech note: http://www.symantec.com/docs/HOWTO32794

The High water mark setting is a threshold that triggers the following actions:

When an individual volume in the disk pool reaches the High water mark, NetBackup considers the volume full. NetBackup chooses a different volume in the disk pool to write backup images to.

When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.

NetBackup begins image cleanup when a volume reaches the High water mark; image cleanup expires the images that are no longer valid. For a disk pool that is full, NetBackup again assigns jobs to the storage unit when image cleanup reduces any disk volume's capacity to less than the High water mark.

If the storage unit for the disk pool is in a capacity-managed storage lifecycle policy, other factors affect image cleanup.

The default is 98%.

The Low water mark is a threshold at which NetBackup stops image cleanup.

The Low water mark setting cannot be greater than or equal to the High water mark setting.

The default is 80%.

Hope this helps
mph999
Level 6
13 years ago
The low water mark won't be an issue.

NBU will try to clear down to the LWM if it can. If it cannot (let's say it can only clear down to 55%) then it will just carry on from there.

The LWM is a 'limit', not a target.

Martin
errrlog
Level 4
13 years ago
There doesn't seem to be any issue with the Disk Cleanup.

The image cleanup job kicks in within 10 minutes after the high water mark is hit, successfully clears the images and brings usage back below the high water mark.

I am guessing some step is being missed by either mds (EMM media and device selection) or nbrb (resource broker). I would expect one of them to perform a check along these lines:

Check whether the ‘Image cleanup’ has run since the ‘High water mark’ was hit. <possibilities: YES/NO>

Upon NO -> Initiate an image cleanup, or wait for an image cleanup to run.

Then check the result.

Then process the resource for the Jobid=1320574

And get back to nbjm with the resource or an error.

Upon YES -> Check the image cleanup result.

Then process the resource for the Jobid=1320574

And get back to nbjm with the resource or an error.
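The check described above can be sketched in Python as follows. This only illustrates the flow proposed in this post; it is not how mds or nbrb are implemented, and the `Pool` class, its fields, and `request_resource` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical stand-in for a disk pool; not a NetBackup object."""
    used_pct: float
    hwm_pct: float
    reclaimable_pct: float            # space that an image cleanup could free
    cleanup_ran_since_hwm: bool = False

    def run_cleanup(self):
        """Pretend image cleanup: free whatever is reclaimable."""
        self.used_pct -= self.reclaimable_pct
        self.reclaimable_pct = 0.0
        self.cleanup_ran_since_hwm = True


def request_resource(job_id, pool):
    """Proposed flow: run (or wait for) cleanup first, then decide."""
    over_hwm = pool.used_pct >= pool.hwm_pct
    if over_hwm and not pool.cleanup_ran_since_hwm:
        pool.run_cleanup()            # the "upon NO" branch
    # Both branches then check the cleanup result before answering nbjm.
    if pool.used_pct < pool.hwm_pct:
        return ("GRANTED", job_id)
    return ("FAILED", 129)


print(request_resource(1320574, Pool(used_pct=91, hwm_pct=90, reclaimable_pct=30)))
```

The point of the sketch is the ordering: the job only fails with 129 after a cleanup has had a chance to run, rather than as soon as the high water mark is seen.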

Example Event Sequence:

With comments and unified logs of nbpem, nbjm, nbrb, mds for JobID 1320574.

0800 hrs -> STU hits the Highwater mark (we still have lot of free space on STU)

0805 hrs -> A new backup job kicks in (jobid=1320574) and fails in 2 seconds with status 129.

## Comments

## 1. nbpem submits the new job to nbjm

log>> 3/02/12 08:05:13.616 V-116-215 [BaseJob::run] jobid=1320574 submitted to nbjm for processing

## 2. nbjm is sending a resource request to nbrb

Log>> 3/02/12 08:05:13.672 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=1320574)

## 3. nbrb checks for the resources.

Log start >>>

3/02/12 08:05:13.678 V-118-227 [ResBroker_i::requestResources] received resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, priority 0, secondary priority 25,554, description THE_BACKUP_JOB-1320574-{BED68FB4-4DFA-11E1-997E-00144F970E6E}

3/02/12 08:05:13.978 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

3/02/12 08:05:14.005 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

3/02/12 08:05:14.014 [allocateTwin] INITIATING:

3/02/12 08:05:14.014 [allocateTwin] masterServer = Master01, client = client2, jobType = 1, capabilityFlags = 256, fatPipePreference = 0, statusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0

3/02/12 08:05:14.014 [allocateTwin] Twin_Record: STUIdentifier = AdvanceDiskPool2, STUIdentifierType = 1, PoolName = NetBackup, MediaSharingGroup = *ANY*, RetentionLevel = 10, RequiredMediaServer = , PreferredMediaServer = , RequiredDiskVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, CkptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocalMediaServer = 0

Log END >>>

## 4. High water mark limit detected and validated by EMM

Log Start >>>

3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded high_water_mark, name = /nbu/pool1, free_space = 17843059695616, free_space_limit = 19189287249510

3/02/12 08:05:14.120 V-143-1546 [validate_stu] Disk volume is Full or down @aaaat

3/02/12 08:05:14.148 [allocateTwin] EXIT INFO:

3/02/12 08:05:14.148 [allocateTwin] EXIT STATUS = 2005030 (EMM_ERROR_MDS_InsufficientDiskSpace, Insufficient disk space or High Water Mark would be exceeded)

3/02/12 08:05:14.154 V-118-146 [ProviderManager::allocate] MPXProvider returned Not Enough Valid Resources for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

3/02/12 08:05:14.155 V-118-108 [ResBroker_i::failOne] failing resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, status 2005030

3/02/12 08:05:14.166 V-118-255 [CorbaCall_requestFailed::execute] sending failure of request to nbjm for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, error code 2,005,030, reason not enough valid resources

3/02/12 08:05:14.195 [BackupJob::ERMEvent] (1044b4430) initial resource request failed, copy#=-1, EMM status=Insufficient disk space or High Water Mark would be exceeded, NBU status=129(../BackupJob.cpp:333)

Log End >>>

## 5. Followed by nbjm failing the job.

Log >> 3/02/12 08:05:14.209 [Error] V-117-131 NBU status: 129, EMM status: Insufficient disk space or High Water Mark would be exceeded

0810 hrs -> Image Cleanup job kicks in and runs for 16 minutes, freeing up space to bring the STU usage below High Water Mark.
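To put the check_disk_space figures from step 4 into perspective, here is a small Python sketch that parses that log line and compares the two values. The pool size in bytes is an assumption: it takes the "Raw Size (GB)" value from the disk pool listing and treats GB as 2^30 bytes. The function name is made up for illustration.

```python
import re

# The check_disk_space line from step 4, copied from the log above.
LINE = ("3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded "
        "high_water_mark, name = /nbu/pool1, free_space = 17843059695616, "
        "free_space_limit = 19189287249510")

# Assumption: "Raw Size (GB) : 22339.27" uses GB = 2**30 bytes.
POOL_BYTES = 22339.27 * 2**30


def parse_check_disk_space(line):
    """Return (free_space, free_space_limit) in bytes from the log line."""
    fields = dict(re.findall(r"(\w+) = (\d+)", line))
    return int(fields["free_space"]), int(fields["free_space_limit"])


free, limit = parse_check_disk_space(LINE)
print("free_space below free_space_limit:", free < limit)
print(f"free_space       = {free / POOL_BYTES:.1%} of the pool")
print(f"free_space_limit = {limit / POOL_BYTES:.1%} of the pool")
```

Under that assumption, free space is roughly 74% of the pool and the limit roughly 80%, which fits the comment that there was still a lot of free space on the STU when the job failed. Why the limit sits there with a 90% high water mark configured is something the log alone does not explain.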

Mark,

>> When you say "new" jobs do you mean

New Job => Jobs that have started after the disk has hit HWM.

>> Do I take it fails with 129 errors

Yes, the job fails with status code 129.

>> Is there a reason you have a 90% high water mark? On a 22TB disk this leaves 2TB free when it hits the high water mark.

We run full backups over the weekend, which amount to 17 TB, so the HWM is mostly reached at peak times during the weekend.

On average an image cleanup job takes 30 mins – 2 hours to complete.

The storage unit can provide 550 – 800 MB/sec of throughput.

Assuming an average throughput of 350 MB/sec, this gives us around 2 hours before the existing jobs start to fail.

Setting the LWM that low is an attempt to avoid running image cleanup multiple times.
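The "around 2 hours" figure can be checked with a short Python calculation. The free space at a 90% HWM comes from the 22339.27 GB pool size in the original post; the function name and the 1024 MB per GB conversion are assumptions for illustration.

```python
POOL_GB = 22339.27                 # usable size from the disk pool listing
FREE_AT_HWM_GB = POOL_GB * 0.10    # space left once usage reaches a 90% HWM


def hours_until_full(free_gb, mb_per_sec, mb_per_gb=1024):
    """Hours of writing at a steady rate before the free space is used up."""
    return free_gb * mb_per_gb / mb_per_sec / 3600


# The assumed average rate, then the quoted range of the storage unit.
for rate in (350, 550, 800):
    print(f"{rate} MB/sec: {hours_until_full(FREE_AT_HWM_GB, rate):.2f} hours")
```

At 350 MB/sec this comes out a little under 2 hours; at the quoted peak rates of 550 – 800 MB/sec the margin drops to roughly an hour or less.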

- Amar
Mark_Solutions
Level 6
13 years ago
As per the tech note extract above:

When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.

This just seems to be how NetBackup deals with things, but your high watermark does waste a lot of disk space.

Also, any new clients that have never been seen before have their possible backup size calculated using a formula that relates to the disk size and High Water Mark, so that can also affect things.

I still recommend increasing the HWM and, if possible, using groups of storage units - even splitting your 22TB volume into two 11 TB volumes could help, as it would use the least full one whilst the other was clearing down, so it would bounce between the two rather than have nowhere else to go and just fail.

Hope this helps
errrlog
Level 4
13 years ago

>>> When all volumes in the disk pool reach the High water mark, the disk pool is considered full.NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full.

I have never experienced NetBackup (on 6.5.5) failing the assigned jobs upon hitting the high water mark. Assigned jobs continue until the filesystem gets to 100% and then fail.

>>> NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.

Yes, it does not assign the new jobs. In addition to that, (prior to the upgrade on the Master Server) it NEVER USED TO FAIL THE NEW JOBS (with 129 status) BEFORE running the Image Cleanup job.

I am 100% sure of this behaviour, as I have been using this setup for almost 2 years.

>>> This just seems to be how NetBackup deals with things, but your high watermark does waste a lot of disk space.

If assigned jobs are to be failed at the HWM, then what is the use of a HWM?? The optimal HWM would then be 100%.

In my situation (Before and after NBU upgrade), assigned Jobs never failed at HWM.

After upgrade, new jobs started to fail as soon as the disk hits HWM and before Image Cleanup job kicks in.

>> I still reccomend increasing the HWM and if possible use groups of storage units - even splitting your 22TB volume into two 11 TB volumes could help as it would use the least full one whilst the other was clearing down so would bounce between the two rather than have no where else to go and just fail

If I split the STU, I cannot use the load-balancing model, as it will put me in the same situation. In the priority model, I will lose half the spindles, in turn affecting performance and possibly the backup window as well.

Forcing image cleanups using a cronjob before the full backup window seems like the best option until the issue is resolved.

I have logged a case with support. Will see how it goes.

- Amar
Mark_Solutions
Level 6
13 years ago
I agree with you - at 6.5.6 jobs did not seem to fail - in V7+ they do seem to, so some of the logic has changed - perhaps that is why that tech note was released
errrlog
Level 4
13 years ago
The tech note which you referred to says (2nd point in the 'High water mark' section of the table):

-------------------------------

When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full

-------------------------------

In other words: on a disk-full condition, it FAILS ASSIGNED backups and does NOT ASSIGN NEW BACKUPS.

With version 7.1.0.2, the above does not hold true for ASSIGNED JOBS, because...

1. The ASSIGNED jobs do not fail after hitting the disk-full condition (i.e., reaching the high water mark).

2. Also, the TN does not fully explain how 'NEW JOBS' are handled at the 'High water mark' condition.

During my research, I stumbled upon the tech note below, which explains that previous NB versions (5/6) were mature enough to handle new jobs by TEMPORARILY PAUSING them and running a cleanup, instead of FAILING them.

REF: http://www.symantec.com/business/support/index?page=content&id=TECH66149

Extract from the above reference:

----------------------------------------------------

It is important to understand that NetBackup 6.5 staging and high water mark processing is very different from 5.x and 6.0.

In NetBackup 5.x and 6.0, High Water Mark value does not apply to disk staging storage units. The only condition that triggered staged (but not expired) image cleanup was the disk full condition reached during a backup. If this was encountered, all disk backups to that media server (not only to that storage unit) were temporarily paused and a cleanup process was launched to clean the oldest 10 images that had been staged.

----------------------------------------------------

Even in the older versions of NetBackup, 5.x and 6.0, the ‘DISK FULL CONDITION’ was handled EFFICIENTLY, by PAUSING the NEW JOBS and NOT FAILING them.

Shouldn't the latest versions (7 or later) be expected to handle new jobs similarly, or more efficiently? I have put the same question to support. Curiously waiting on the feedback. :-)

- Amar
