
AdvancedDisk, High Water Mark and New Jobs

errrlog
Level 4

I recently upgraded the NetBackup master server from 6.5.5 to 7.1.0.2, and after the upgrade I started running into this issue.

Issue:

When the 'High Water Mark' is reached on the 'AdvancedDisk' pool, nbjm fails new jobs. This only occurs for a short period (< 30 minutes) while the image cleanup creates some space, but during that short period the user-defined RMAN jobs executed from cron fail and do not rerun.

I noticed this issue only after the upgrade. Before the upgrade we never had this issue, so I assume that nbjm previously used to pause new jobs while the image cleanup was active, instead of failing them outright.

Is there a way I can pause new jobs, instead of having them fail, while the image cleanup is active?

Netbackup Environment:

1 x Master - Solaris 10 on SPARC hardware, running NetBackup 7.1.0.2

Media Servers (x2) - Solaris 10 on SPARC hardware, running NetBackup 6.5.5 (I will be upgrading them shortly.)

Tape Storage Units: SL500 Tape library with 5 Drives

Disk Storage Unit (used for staging): 22 TB LUN from Sun storage, configured as an AdvancedDisk type.

AdvancedDisk Details:

Disk Pool Name   : AdvanceDiskPool2
Disk Pool Id     : AdvanceDiskPool2
Disk Type        : AdvancedDisk
Status           : UP
Flag             : Patchwork
Flag             : Visible
Flag             : OpenStorage
Flag             : AdminUp
Flag             : InternalUp
Flag             : SpanImages
Flag             : LifeCycle
Flag             : CapacityMgmt
Flag             : FragmentImages
Flag             : Cpr
Flag             : RandomWrites
Flag             : FT-Transfer
Flag             : CapacityManagedRetention
Flag             : CapacityManagedJobQueuing
Raw Size (GB)    : 22339.27
Usable Size (GB) : 22339.27
Num Volumes      : 1
High Watermark   : 90
Low Watermark    : 40
Max IO Streams   : -1
Comment          :
Storage Server   : Media2 (UP)

 

 - Amar


Marianne
Moderator
Partner    VIP    Accredited Certified

Thanks for a very detailed description of your problem.

I have not seen this behaviour in 7.x, but then we have always insisted that our customers upgrade Master and Media servers all on the same day. (Clients are upgraded in a phased approach).

No mention either of this type of problem in LBN or 7.1.0.3 Release Notes.

I know that different NBU versions are perfectly supported, but experience has taught us that certain features just work better with everything on the same NBU level.

Suggestion:
Upgrade media servers.
If problem persists, open a support ticket with Symantec.

errrlog
Level 4

Hi Marianne,

Thanks for taking the time and providing a suggestion. I will try to stick to this rule in my next upgrade.

Due to a limited change window, I had to take a 3-phased approach for the master and media server upgrades. As part of the upgrade, I am also migrating from Veritas Volume Manager to ZFS, so the post-upgrade preparation alone took me around 7 hours. And I also had to allow time for the upgrade, testing (liaising with the app and server teams), and rollback without affecting the scheduled backup times. :(

As a temporary fix (till the issue is resolved), I am planning on implementing the below. I guess this should remove the duplicated images and stop the disk from hitting the high water mark?

During a quiet time, execute a cronjob to force image cleanups:

nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 0 -lwm 0

nbdelete -allvolumes -force
bpimage -cleanup -allclients

nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 90 -lwm 40

I would appreciate any comments/suggestions on the above approach.
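One caveat with running that sequence unattended: if nbdelete or bpimage fails part-way, the pool is left with the watermarks at 0. Below is a sketch of the same sequence that always restores the production watermarks (pool name, type, and watermark values taken from this thread; the `echo` argument is just a dry-run guard):

```shell
#!/bin/sh
# Forced-cleanup sequence with the production watermarks always restored,
# even if one of the cleanup steps fails. Adjust names/values to suit.
DP=AdvanceDiskPool2
STYPE=AdvancedDisk

forced_cleanup() {
    run=$1    # pass "echo" to preview the commands, "" to run them for real

    # drop the watermarks so cleanup can reclaim everything eligible
    $run nbdevconfig -changedp -dp "$DP" -stype "$STYPE" -hwm 0 -lwm 0 || return 1

    # remove expired fragments and clean up the image catalog
    $run nbdelete -allvolumes -force
    $run bpimage -cleanup -allclients

    # always restore the production watermarks
    $run nbdevconfig -changedp -dp "$DP" -stype "$STYPE" -hwm 90 -lwm 40
}

forced_cleanup echo    # preview only; call with "" from cron to execute
```

Calling it with `echo` first lets you confirm the exact commands before arming it in cron.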

 

- Amar

Marianne
Moderator
Partner    VIP    Accredited Certified

Were you aware of the following when you decided to migrate to ZFS?

Best practice for configuring AdvancedDisk:

http://www.symantec.com/docs/TECH158427

p.9:

File System Restrictions
For some file system types, notably NFS and ZFS, the lack of full posix compliance of the file system means that full file system conditions cannot be detected correctly, leading to problems when spanning volumes. To avoid problems where these file systems are used, each disk pool should be comprised of only one volume so that no spanning occurs.
 

 

This lack of full POSIX compliance in the file system is probably the reason the high water mark is not being detected correctly.
 

errrlog
Level 4

Yes! I am aware of this. I am not using the disk pooling capabilities of AdvancedDisk.

I am using ZFS (previously VxFS) to combine multiple LUNs into a single volume, and that single volume (mount point) is presented to the AdvancedDisk pool.

Also, VxFS/ZFS can spread the IO load across the underlying LUNs, which is better for performance.

-Amar

Marianne
Moderator
Partner    VIP    Accredited Certified

You really cannot compare VxFS with ZFS. (I can provide a comparative doc if you are interested).

VxFS is fully posix compliant.

The Best Practice Guide says that filesystems not fully posix compliant (like ZFS) have problems detecting full filesystem condition. The implication here is that reporting of filesystem capacity will be affected.

Seems your problem has nothing to do with NBU upgrade but with migration to ZFS.

You did not have the same problem when you were using VxVM volumes with VxFS filesystem.

mph999
Level 6
Employee Accredited

 

First, as mentioned by Marianne - you did an excellent job describing your issue, and included enough detail (the ZFS filesystem) to allow some suggestions to be made. If only everyone could provide this level of detail when describing a problem - good job.
 
Second, I agree with Marianne - I've not seen this exact issue before, but I have certainly seen strange things when ZFS is used ...
 
Many people use DSSUs etc. at 7.x with no problems, but not many people upgrade and switch to ZFS at the same time ... I suspect ZFS is causing the issue ...
 
Here are a couple of previous issues we have seen ... when zfs has been used
 
Backup size shown in the Activity Monitor differs from the actual size of the filesystem/stream.
 
NetBackup shows more space than is actually available when using sub file systems created on a ZFS file system.
 
 
Martin

mph999
Level 6
Employee Accredited

Some more details ...

 

In general, NetBackup can use any POSIX-compliant file system as a storage destination for storage units of the type BasicDisk and AdvancedDisk.

ZFS has two major POSIX Compliance Issues that will not be addressed by Sun in any release of ZFS. 

The details are documented here:  http://docs.sun.com/app/docs/doc/820-5245/gcukw?l=en&a=view 

These two issues affect NetBackup as follows:

Issue 1: Inability to commit ZFS file system capacity statistics atomically

    This affects the capacity managed storage units like AdvancedDisk. As capacity statistics are not available to operating system monitoring tools instantaneously, capacity managed storage units like AdvancedDisk cannot efficiently manage backup images and predict space availability. 
 
I would suggest you DO NOT use ZFS. As the above snippet of the TN explains, it will NOT be addressed by Sun, so there is nothing Symantec will be able to do.
 
I agree you see the problem in NetBackup, but NetBackup is not the cause of the problem.
 
Martin

errrlog
Level 4

Point 1:

Additional details to "Netbackup Environment:" section in my first post:

NBU upgrade and ZFS:

  1. Only on the Master Server.
  2. Master Server is not used as a media server.
  3. There are no Storage Units on the Master Server.
  4. ZFS is used for the '/usr/openv' volume. This cannot in any way be the cause of the issue I am experiencing.

Media Servers: (Still at 6.5.5, using VxFS)

  1. We have two media servers (identical configuration), but let's only consider one, as the other is at a DR site and doesn't take much load.
  2. Media server in question: Media2
  3. NBU Version : 6.5.5
  4. Staging Unit Filesystem : VxFS

 

So considering that I am still using VxFS on the staging unit (AdvancedDisk), the issue can only be caused by NBU.

Point 2:

NBU behaviour at HIGH WATER MARK:

  1. Existing jobs will continue running: TRUE in my case.
  2. New jobs will not be run: partially true in my case, with a slight issue.
    1. At the high water mark, image cleanup kicks in and tries to remove the staged or expired images.
    2. If the image cleanup clears images and drops STU usage below the high water mark, it will start executing new jobs.
    3. If the image cleanup cannot clear images and STU usage stays above the high water mark, it will start failing the jobs.

 

Assumption: even if the staging unit were on ZFS, the issue I am seeing would still have to be NBU-related. Because...

                ZFS filesystem issue: “Inability to commit ZFS file system capacity statistics atomically”.

(ref: http://www.symantec.com/business/support/index?page=content&id=TECH66993)

 

This means that if we use ZFS, there will be a LITTLE EXTRA DELAY before NBU DETECTS the high water mark (again, depending on the frequency at which NBU makes its checks) and kicks off the image cleanup. The image cleanup job then follows the sequence above. But the jobs should not fail straight away.

 

How frequently does NetBackup monitor STU capacity? And how quickly does it need to?

Below is the explanation for “Disk storage unit capacity check frequency”, found under ‘General Server’ options in the master server host properties.

______________________________________________________________

This property determines how often NetBackup checks disk storage units for available capacity. If checks occur too frequently, then system resources are wasted. If checks do not occur often enough, too much time elapses and backup jobs are delayed.

The default is 300 seconds (5 minutes).

Note: This property applies to the disk storage units of 6.0 media servers only. Subsequent releases use internal methods to monitor disk space more frequently.

______________________________________________________________
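For 6.x media servers where the above polling interval applies (or simply as an external sanity check), a cron-side check of the staging mount point is easy to sketch. This is illustrative only - the mount point /nbu/pool1 comes from the MDS log extract later in this thread, and the threshold mirrors the pool's 90% HWM:

```shell
#!/bin/sh
# Cron-side high-water-mark check, independent of NBU's own polling.
# Mount point and HWM value are assumptions based on this thread.

above_hwm() {
    # usage: above_hwm USED_PCT HWM_PCT ; true if at or over the mark
    [ "$1" -ge "$2" ]
}

MOUNT=${1:-/nbu/pool1}
HWM=90

# column 5 of df -k is the used percentage on both Solaris and Linux
used=$(df -k "$MOUNT" 2>/dev/null | awk 'NR==2 { sub("%", "", $5); print $5 }')
if [ -n "$used" ] && above_hwm "$used" "$HWM"; then
    echo "$MOUNT at ${used}% - above HWM ${HWM}%, expect new jobs to queue or fail"
fi
```

Hooked into cron, this could alert (or trigger a forced cleanup) before NBU itself reacts to the mark.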

 

In Regards to Migrating to ZFS:

I was not trying to compare VxVM to ZFS; rather, I compared the “disk pooling capability” of AdvancedDisk to ZFS and VxVM.

I only considered ZFS based upon my requirements.

Requirements in the order of importance:

  1. Logical Volume management (pooling multiple luns). As there is a 2 TB lun size limitation on our Storage Array. (ZFS and VxVM)
  2. Data integrity (zfs)
  3. Performance (zfs with large block, highly sequential workloads)
  4. Cost (zfs)
  5. Ease in Management (zfs)
  6. Scalability (VxVM has some advantages but zfs fits fine in my case)

Cons with Netbackup: (in my situation)

VxVM:None

 ZFS:            ZFS filesystem Issue: “Inability to commit ZFS files system capacity statistics atomically”.

(ref: http://www.symantec.com/business/support/index?page=content&id=TECH66993)

I am only affected in terms of the time DELAY between a) the actual filesystem usage hitting the ‘high water mark’ and b) NetBackup detecting that the filesystem has hit the high water mark.

 

What will be the actual impact?

STU size                       : 22 TB
High water mark                : 90% (i.e. 19.8 TB)
Room before jobs start to fail : 10% (i.e. 2.2 TB)
Average throughput             : 350 MB/sec (i.e. ~1.2 TB per hour)

According to the above figures, I have a minimum of around 2 hours before the disk fills up and backups start to fail (and only if NetBackup cannot detect that the filesystem has reached the high water mark at all).

So considering that 2-hour window, I don’t think I should be worrying about “delayed capacity statistics” for ZFS.

I guess this issue only bites when the storage units are smaller and the gap between the high water mark and the disk-full condition is small, which is not the case in my situation.
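For what it's worth, the headroom arithmetic above can be scripted (pool size, HWM, and throughput figures as given; POSIX integer arithmetic, so it rounds down):

```shell
#!/bin/sh
# Rough time-to-fill from the high water mark to 100%,
# using the figures quoted in this post.
usable_gb=22339    # usable pool size in GB (from the disk pool details)
hwm_pct=90         # high water mark
rate_mb_s=350      # average ingest throughput in MB/sec

headroom_mb=$(( usable_gb * (100 - hwm_pct) / 100 * 1024 ))
secs=$(( headroom_mb / rate_mb_s ))
mins=$(( secs / 60 ))

echo "headroom: ${headroom_mb} MB, time to fill: ~${mins} minutes"
```

With these figures it comes out just under two hours, consistent with the estimate above.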

Marianne
Moderator
Partner    VIP    Accredited Certified

At least we agree that your problem is not a result of NBU Master server upgrade but a result of implementing ZFS on media server.

As per the TN that you reference above, I fail to see how the following is a NetBackup problem:

ZFS has two major POSIX Compliance Issues that will not be addressed by Sun in any release of ZFS.

You need to decide how you want to deal with this going forward......

errrlog
Level 4

Media Server is still at NB version 6.5.5.

Storage Unit on the Media Server is still using VxFS filesystem.

Only the Master Server has been upgraded, to 7.1.0.3. We DO NOT use the master server as a media server.

I started noticing the ISSUE after upgrading the Master to NB 7.1.0.3.

 

Forcing the image cleanup seems like a temporary workaround for now. Will see how it goes after the media server upgrade.

-Amar

Marianne
Moderator
Partner    VIP    Accredited Certified

APOLOGIES!!!  

I totally missed the section in your previous post where you said that the media servers were still on VxFS..... (being a Storage Foundation fanatic I only read through your motivation for choosing ZFS.)

Then back to the advice in my first reply - upgrade at least one media server to 7.1.x and see if NBU 'behaves' better with everything on the same level.

To troubleshoot what is happening with disk cleanup on the media server, create bpdm logs on the media servers. See what kind of info is logged with the default level 0 log. Increase the logging level (to 5) if level 0 doesn't give enough info....

Mark_Solutions
Level 6
Partner Accredited Certified

Just like to clarify a couple of things please ....

When you say "new" jobs, do you mean brand-new jobs that have never run before (new clients and policies), or just jobs that start to run after the disk has hit its high water mark?

Do I take it that the jobs fail with status 129 errors?

Is there a reason you have a 90% high water mark? On a 22 TB disk this leaves about 2.2 TB free when it hits the high water mark.

Also, your low water mark is 40% - this is the point NetBackup needs to get down to to finish its image cleanup - so you are asking it to clear down to around 8.8 TB used (freeing roughly 11 TB) to be usable again.

Perhaps settings of 96% for high and 85% for low would be more suitable and make things run a little smoother.
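If those values look right, they can be applied with the same nbdevconfig command used elsewhere in this thread (pool name assumed from the original post - verify it with nbdevquery -listdp first):

```shell
# Sketch only - confirm the disk pool name before running.
nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 96 -lwm 85
```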

Extract from this tech note: http://www.symantec.com/docs/HOWTO32794

The High water mark setting is a threshold that triggers the following actions:

When an individual volume in the disk pool reaches the High water mark, NetBackup considers the volume full. NetBackup chooses a different volume in the disk pool to write backup images to.

When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.

NetBackup begins image cleanup when a volume reaches the High water mark; image cleanup expires the images that are no longer valid. For a disk pool that is full, NetBackup again assigns jobs to the storage unit when image cleanup reduces any disk volume's capacity to less than the High water mark.

If the storage unit for the disk pool is in a capacity-managed storage lifecycle policy, other factors affect image cleanup.

The default is 98%.

The Low water mark is a threshold at which NetBackup stops image cleanup.

The Low water mark setting cannot be greater than or equal to the High water mark setting.

The default is 80%.

Hope this helps

mph999
Level 6
Employee Accredited

The low water mark won't be an issue.

NBU will try and clear down to the LWM if it can. If it cannot (let's say it can only clear down to 55%) then it will just carry on from there.

The LWM is a 'limit', not a target.

Martin

errrlog
Level 4

There doesn't seem to be any issue with the disk cleanup.

The image cleanup job kicks in within 10 minutes after the high water mark is hit and successfully clears images, bringing usage back below the high water mark.

I am guessing some step is being missed by either MDS (EMM media and device selection) or nbrb (the resource broker). I guess one of them should perform a check like this:

  1. Check whether an image cleanup has run since the high water mark was hit. <possibilities YES/NO>
    1. Upon NO -> initiate an image cleanup, or wait for one to run.
      1. Then check the result.
      2. Then process the resources for jobid=1320574.
      3. And get back to nbjm with the resources or an error.
    2. Upon YES -> check the image cleanup result.
      1. Then process the resources for jobid=1320574.
      2. And get back to nbjm with the resources or an error.

 

Example Event Sequence:

With comments and unified logs of nbpem, nbjm, nbrb, mds for JobID 1320574.

0800 hrs -> STU hits the high water mark (we still have a lot of free space on the STU)

0805 hrs -> A new backup job kicks in (jobid=1320574) and fails in 2 seconds with status 129.

 

## Comments

## 1. nbpem submits the new job to nbjm

log>> 3/02/12 08:05:13.616 V-116-215 [BaseJob::run] jobid=1320574 submitted to nbjm for processing

 

## 2. nbjm is sending a resource request to nbrb

Log>> 3/02/12 08:05:13.672 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=1320574)

 

## 3. nbrb checks for the resources.

Log start >>>

3/02/12 08:05:13.678 V-118-227 [ResBroker_i::requestResources] received resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, priority 0, secondary priority 25,554, description THE_BACKUP_JOB-1320574-{BED68FB4-4DFA-11E1-997E-00144F970E6E}

 3/02/12 08:05:13.978 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

 

 3/02/12 08:05:14.005 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

 3/02/12 08:05:14.014 [allocateTwin] INITIATING:

 3/02/12 08:05:14.014 [allocateTwin] masterServer = Master01, client = client2, jobType = 1, capabilityFlags = 256, fatPipePreference = 0, statusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0

 3/02/12 08:05:14.014 [allocateTwin] Twin_Record: STUIdentifier = AdvanceDiskPool2, STUIdentifierType = 1, PoolName = NetBackup, MediaSharingGroup = *ANY*, RetentionLevel = 10, RequiredMediaServer = , PreferredMediaServer = , RequiredDiskVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, CkptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocalMediaServer = 0

Log END >>>

 

## 4. High water mark limit detected and validated by EMM

Log Start >>>

 3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded high_water_mark, name = /nbu/pool1, free_space = 17843059695616, free_space_limit = 19189287249510

 3/02/12 08:05:14.120 V-143-1546 [validate_stu] Disk volume is Full or down @aaaat

 3/02/12 08:05:14.148 [allocateTwin] EXIT INFO:

 3/02/12 08:05:14.148 [allocateTwin] EXIT STATUS = 2005030 (EMM_ERROR_MDS_InsufficientDiskSpace, Insufficient disk space or High Water Mark would be exceeded)

 3/02/12 08:05:14.154 V-118-146 [ProviderManager::allocate] MPXProvider returned Not Enough Valid Resources for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

 3/02/12 08:05:14.155 V-118-108 [ResBroker_i::failOne] failing resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, status 2005030

 3/02/12 08:05:14.166 V-118-255 [CorbaCall_requestFailed::execute] sending failure of request to nbjm for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, error code 2,005,030, reason not enough valid resources

 3/02/12 08:05:14.195 [BackupJob::ERMEvent] (1044b4430) initial resource request failed, copy#=-1, EMM status=Insufficient disk space or High Water Mark would be exceeded, NBU status=129(../BackupJob.cpp:333)

Log End >>>

 

## 5. followed by nbjm failing the job.

Log >> 3/02/12 08:05:14.209 [Error] V-117-131  NBU status: 129, EMM status: Insufficient disk space or High Water Mark would be exceeded

 

0810 hrs -> Image Cleanup job kicks in and runs for 16 minutes, freeing up space to bring the STU usage below High Water Mark.

 

Mark,

>> When you say "new" jobs do you mean

New Job => Jobs that have started after the disk has hit HWM.

>> Do I take it fails with 129 errors

Yes, the job fails with status code 129.

>> Is there a reason you have a 90% high water mark? On a 22TB disk this leaves 2TB free when it hits the high water mark.

  • We run full backups over the weekend, which amount to 17 TB, so most of the time the HWM is reached at peak times during the weekend.
  • On average, an image cleanup job takes 30 minutes to 2 hours to complete.
  • The storage unit can provide 550 - 800 MB/sec of throughput.
  • Considering an average throughput of 350 MB/sec gives us around 2 hours before existing jobs start to fail.
  • Setting the LWM that low is to try to avoid running image cleanup multiple times.

- Amar

Mark_Solutions
Level 6
Partner Accredited Certified

As per the tech note extract above:

When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.

This just seems to be how NetBackup deals with things, but your high water mark does waste a lot of disk space.

Also, any new clients that have never been seen before have their likely backup size calculated using a formula that relates to the disk size and high water mark, so that can also affect things.

I still recommend increasing the HWM and, if possible, using groups of storage units - even splitting your 22 TB volume into two 11 TB volumes could help, as NetBackup would use the least-full one while the other was clearing down, so it would bounce between the two rather than have nowhere else to go and just fail.

Hope this helps

errrlog
Level 4

 

>>> When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full.
 
I have never experienced NetBackup (on 6.5.5) failing the ASSIGNED jobs upon hitting the high water mark. Assigned jobs continue running until the filesystem gets to 100% and then fail.
 
>>> NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.
 
Yes, it does not assign the new jobs. And in addition to that (prior to the upgrade on the Master Server), it NEVER USED TO FAIL THE NEW JOBS (with status 129) BEFORE running the image cleanup job.
I am 100% sure of this behaviour, as I have been using this setup for almost 2 years.
 
>>> This just seems to be how NetBackup deals with things, but your high watermark does waste a lot of disk space.
 
If ASSIGNED jobs are to be failed at the HWM, then what is the use of a HWM?? The optimal HWM would then be 100%.
 
In my situation (before and after the NBU upgrade), assigned jobs never failed at the HWM.
After the upgrade, new jobs started to fail as soon as the disk hit the HWM and before the image cleanup job kicked in.
 
>> I still recommend increasing the HWM and if possible use groups of storage units - even splitting your 22 TB volume into two 11 TB volumes could help, as it would use the least full one whilst the other was clearing down, so would bounce between the two rather than have nowhere else to go and just fail
 
If I split the STU, I cannot use the load-balancing model, as it will put me in the same situation. In the priority model, I will lose half the spindles, in turn affecting performance, and it may also affect the backup window.
 
Forcing image cleanups using a cronjob before the full backup window seems like the best option till the issue is resolved.
I have logged a case with support. Will see how it goes.
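As another stop-gap for the cron-driven RMAN scripts that currently fail and never rerun, a small retry wrapper keeps one transient status 129 from losing the whole run. A sketch - the retry count, delay, and RMAN invocation are placeholders:

```shell
#!/bin/sh
# Re-run a command until it succeeds or the attempts are exhausted.
retry() {
    max=$1; delay=$2; shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1   # give up after $max attempts
        n=$(( n + 1 ))
        sleep "$delay"
    done
}

# Hypothetical cron entry body: retry up to 6 times, 10 minutes apart,
# which comfortably spans the < 30-minute cleanup window described above.
# retry 6 600 rman target / cmdfile /scripts/daily_rman_backup.cmd
```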
 
- Amar

Mark_Solutions
Level 6
Partner Accredited Certified

I agree with you - at 6.5.6 jobs did not seem to fail - in v7+ they do seem to, so some of the logic has changed - perhaps that is why that tech note was released

errrlog
Level 4

The tech note which you referred to says (2nd point in the 'High water mark' section of the table):

-------------------------------

When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full

-------------------------------

In other words: on a disk-full condition, it fails ASSIGNED backups and does NOT ASSIGN NEW BACKUPS.

With version 7.1.0.2, the above does not hold true for ASSIGNED JOBS. Because...

1. The ASSIGNED jobs do not fail after hitting the disk-full condition (i.e. reaching the high water mark).

2. Also, the TN does not fully explain how 'NEW JOBS' are handled at the 'High water mark' condition.

During my research, I stumbled onto the tech note below, which explains that the previous NB versions (5.x/6.0) were mature enough to handle new jobs by TEMPORARILY PAUSING them and running cleanup instead of FAILING them.

 

REF: http://www.symantec.com/business/support/index?page=content&id=TECH66149

Extract from the above reference:

 ----------------------------------------------------

It is important to understand that NetBackup 6.5 staging and high water mark processing is very different from 5.x and 6.0.

In NetBackup 5.x and 6.0, High Water Mark value does not apply to disk staging storage units. The only condition that triggered staged (but not expired) image cleanup was the disk full condition reached during a backup. If this was encountered, all disk backups to that media server (not only to that storage unit) were temporarily paused and a cleanup process was launched to clean the oldest 10 images that had been staged.

 ----------------------------------------------------  

Even in the older versions of NetBackup (5.x and 6.0), the ‘DISK FULL CONDITION’ was handled EFFICIENTLY, by PAUSING the NEW JOBS and NOT FAILING them.

The latest versions (7 or later) should be expected to handle new jobs similarly or more efficiently?? I have put the same question to support. Curiously awaiting the feedback. :)

- Amar