Forum Discussion

errrlog
Level 4
13 years ago

Advance disk, High Water Mark and New jobs

I recently upgraded the NetBackup master from 6.5.5 to 7.1.0.2, and since the upgrade I have been running into this issue.

Issue:

When the 'High Water Mark' is reached on the 'AdvanceDisk' pool, nbjm fails new jobs. This only occurs for a short period (i.e., < 30 mins) while image cleanup frees some space. But during that short period, user-defined RMAN jobs executed from cron fail and do not rerun.

I have only noticed this issue since the upgrade. Before the upgrade we never had it, so I assume nbjm previously paused new jobs while image cleanup was active, instead of failing them straight away.

Is there a way to pause new jobs, instead of having them fail, while image cleanup is active?

Netbackup Environment:

1xMaster - Solaris 10 on sparc hardware, Running Netbackup 7.1.0.2

Media Servers (x2)  - Solaris 10 on sparc hardware. Running Netbackup 6.5.5 ( I will be upgrading them shortly. )

Tape Storage Units: SL500 Tape library with 5 Drives

Disk Storage Unit (used for staging): 22 TB LUN from Sun Storage, configured as an AdvancedDisk type.

Advance Disk Details:

Disk Pool Name   : AdvanceDiskPool2
Disk Pool Id     : AdvanceDiskPool2
Disk Type        : AdvancedDisk
Status           : UP
Flag             : Patchwork
Flag             : Visible
Flag             : OpenStorage
Flag             : AdminUp
Flag             : InternalUp
Flag             : SpanImages
Flag             : LifeCycle
Flag             : CapacityMgmt
Flag             : FragmentImages
Flag             : Cpr
Flag             : RandomWrites
Flag             : FT-Transfer
Flag             : CapacityManagedRetention
Flag             : CapacityManagedJobQueuing
Raw Size (GB)    : 22339.27
Usable Size (GB) : 22339.27
Num Volumes      : 1
High Watermark   : 90
Low Watermark    : 40
Max IO Streams   : -1
Comment          :
Storage Server   : Media2 (UP)

 

 - Amar

18 Replies

  • APOLOGIES!!!  

    I totally missed the part of your previous post where you said that the media servers were still on VxFS..... (being a Storage Foundation fanatic I only read through your motivation for choosing ZFS.)

    Then back to the advice in my first reply - upgrade at least one Media Server to 7.1.x and see if NBU 'behaves' better with the same levels.

    To troubleshoot what is happening with disk cleanup on the media server, create bpdm log on media servers. See what kind of info is logged with default level 0 log. Increase logging level (5) if level 0 doesn't give enough info....

  • Just like to clarify a couple of things please ....

    When you say "new" jobs do you mean brand new jobs that have never run before (new clients and policies) or just jobs that start to run after the disk has hit its high watermark?

    Do I take it it fails with 129 errors?

    Is there a reason you have a 90% high water mark? On a 22TB disk this leaves 2TB free when it hits the high water mark.

    Also your low watermark is 40% - this is the point NetBackup needs to get to to finish its image cleanup - so you are asking it to clear down roughly 11TB of space (from 90% to 40% of 22TB) to be usable again.

    Perhaps settings of 96% for high and 85% for low would be more suitable and make things run a little smoother.
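
To compare the two pairs of settings, here is a minimal Python sketch of the arithmetic. It assumes the "Usable Size (GB)" value from the disk pool listing in the original post and treats the water marks as plain percentages of that size; the function name is made up for illustration and is not a NetBackup API.

```python
# Sketch: capacity implied by high/low water mark settings.
POOL_GB = 22339.27  # "Usable Size (GB)" from the disk pool listing in the question


def watermark_space(pool_gb, high_pct, low_pct):
    """Return (GB still free at the HWM, GB between the HWM and the LWM)."""
    if not 0 <= low_pct < high_pct <= 100:
        raise ValueError("low water mark must be below the high water mark")
    free_at_hwm = pool_gb * (100 - high_pct) / 100
    between_marks = pool_gb * (high_pct - low_pct) / 100
    return free_at_hwm, between_marks


# Current settings (90/40) versus the suggested ones (96/85).
for high, low in [(90, 40), (96, 85)]:
    free_gb, clear_gb = watermark_space(POOL_GB, high, low)
    print(f"HWM {high}% / LWM {low}%: {free_gb:,.0f} GB free at HWM, "
          f"{clear_gb:,.0f} GB between HWM and LWM")
```

The second figure is only an upper bound on what cleanup would remove, since cleanup can only expire images that are eligible.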

    Extract from this tech note: http://www.symantec.com/docs/HOWTO32794

    The High water mark setting is a threshold that triggers the following actions:

    When an individual volume in the disk pool reaches the High water mark, NetBackup considers the volume full. NetBackup chooses a different volume in the disk pool to write backup images to.

    When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.

    NetBackup begins image cleanup when a volume reaches the High water mark; image cleanup expires the images that are no longer valid. For a disk pool that is full, NetBackup again assigns jobs to the storage unit when image cleanup reduces any disk volume's capacity to less than the High water mark.

    If the storage unit for the disk pool is in a capacity-managed storage lifecycle policy, other factors affect image cleanup.

    The default is 98%.

    The Low water mark is a threshold at which NetBackup stops image cleanup.

    The Low water mark setting cannot be greater than or equal to the High water mark setting.

    The default is 80%.

    Hope this helps

  • The low water mark won't be an issue.

    NBU will try to clear down to the LWM if it can.  If it cannot (let's say it can only clear down to 55%) then it will just carry on from there.

    The LWM is a 'limit', not a target.

    Martin

  • There doesn't seem to be any issue with the Disk Cleanup.

    The image cleanup job kicks in within 10 minutes after the high water mark is hit, successfully clears the images, and brings usage back below the high water mark.

    My guess is that some step is being missed by either mds (EMM media and device selection) or nbrb (resource broker). I would expect one of them to perform a check like this:

    1. Check whether ‘Image cleanup’ has run since the ‘High water mark’ was hit. <possibilities: YES/NO>
      1. On NO -> Initiate an image cleanup, or wait for an image cleanup to run.
        1. Then check the result.
        2. Then process the resource for Jobid=1320574.
        3. And get back to nbjm with the resource or an error.
      2. On YES -> Check the image cleanup result.
        1. Then process the resource for Jobid=1320574.
        2. And get back to nbjm with the resource or an error.
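
The check proposed above can be sketched in Python as follows. This is only an illustration of the suggested ordering (cleanup first, then grant or fail). `DiskPool`, its fields, and `request_resource` are hypothetical names invented for the sketch, and nothing here reflects how mds or nbrb is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class DiskPool:
    """Hypothetical stand-in for a disk pool; not a NetBackup object."""
    used_pct: float
    high_water_mark: float
    reclaimable_pct: float            # space that image cleanup could free
    cleanup_ran_since_hwm: bool = False

    def over_hwm(self) -> bool:
        return self.used_pct >= self.high_water_mark

    def run_cleanup(self) -> None:
        # Simulate image cleanup: free whatever is reclaimable.
        self.used_pct -= self.reclaimable_pct
        self.reclaimable_pct = 0.0
        self.cleanup_ran_since_hwm = True


def request_resource(pool: DiskPool, job_id: int) -> str:
    """Answer a resource request in the order proposed in the list above."""
    if not pool.over_hwm():
        pool.cleanup_ran_since_hwm = False   # below the HWM: nothing to check
        return f"job {job_id}: resource granted"
    if not pool.cleanup_ran_since_hwm:
        pool.run_cleanup()                   # "On NO": initiate a cleanup first
    # Both branches: check the cleanup result, then get back to nbjm.
    if pool.over_hwm():
        return f"job {job_id}: status 129 (still over HWM after cleanup)"
    return f"job {job_id}: resource granted"


pool = DiskPool(used_pct=91.0, high_water_mark=90.0, reclaimable_pct=30.0)
print(request_resource(pool, 1320574))
```

With this ordering, a job only fails with 129 when cleanup has already run and could not bring the pool back under the high water mark.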

     

    Example Event Sequence:

    With comments and unified logs of nbpem, nbjm, nbrb, mds for JobID 1320574.

    0800 hrs -> STU hits the high water mark (we still have a lot of free space on the STU)

    0805 hrs -> A new backup job kicks in (jobid=1320574) and fails in 2 seconds with status 129.

     

    ## Comments

    ## 1. nbpem submits the new job to nbjm

    log>> 3/02/12 08:05:13.616 V-116-215 [BaseJob::run] jobid=1320574 submitted to nbjm for processing

     

    ## 2. nbjm is sending a resource request to nbrb

    Log>> 3/02/12 08:05:13.672 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=1320574)

     

    ## 3. nbrb checks for the resources.

    Log start >>>

    3/02/12 08:05:13.678 V-118-227 [ResBroker_i::requestResources] received resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, priority 0, secondary priority 25,554, description THE_BACKUP_JOB-1320574-{BED68FB4-4DFA-11E1-997E-00144F970E6E}

     3/02/12 08:05:13.978 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

     

     3/02/12 08:05:14.005 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

     3/02/12 08:05:14.014 [allocateTwin] INITIATING:

     3/02/12 08:05:14.014 [allocateTwin] masterServer = Master01, client = client2, jobType = 1, capabilityFlags = 256, fatPipePreference = 0, statusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0

     3/02/12 08:05:14.014 [allocateTwin] Twin_Record: STUIdentifier = AdvanceDiskPool2, STUIdentifierType = 1, PoolName = NetBackup, MediaSharingGroup = *ANY*, RetentionLevel = 10, RequiredMediaServer = , PreferredMediaServer = , RequiredDiskVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, CkptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocalMediaServer = 0

    Log END >>>

     

    ## 4. High water mark limit detected and validated by EMM

    Log Start >>>

     3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded high_water_mark, name = /nbu/pool1, free_space = 17843059695616, free_space_limit = 19189287249510

     3/02/12 08:05:14.120 V-143-1546 [validate_stu] Disk volume is Full or down @aaaat

     3/02/12 08:05:14.148 [allocateTwin] EXIT INFO:

     3/02/12 08:05:14.148 [allocateTwin] EXIT STATUS = 2005030 (EMM_ERROR_MDS_InsufficientDiskSpace, Insufficient disk space or High Water Mark would be exceeded)

     3/02/12 08:05:14.154 V-118-146 [ProviderManager::allocate] MPXProvider returned Not Enough Valid Resources for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}

     3/02/12 08:05:14.155 V-118-108 [ResBroker_i::failOne] failing resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, status 2005030

     3/02/12 08:05:14.166 V-118-255 [CorbaCall_requestFailed::execute] sending failure of request to nbjm for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, error code 2,005,030, reason not enough valid resources

     3/02/12 08:05:14.195 [BackupJob::ERMEvent] (1044b4430) initial resource request failed, copy#=-1, EMM status=Insufficient disk space or High Water Mark would be exceeded, NBU status=129(../BackupJob.cpp:333)

    Log End >>>
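
For anyone wanting to pull the numbers out of the `check_disk_space` line, here is a small Python sketch. The regular expression is written against the single line quoted above, so the field layout is an assumption for any other release. Reading `free_space < free_space_limit` as the "exceeded" condition is likewise an inference from this one log entry, not documented behaviour.

```python
import re

# The check_disk_space line from the log excerpt above, copied verbatim.
LINE = ("3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded "
        "high_water_mark, name = /nbu/pool1, free_space = 17843059695616, "
        "free_space_limit = 19189287249510")

PATTERN = re.compile(
    r"name = (?P<name>\S+), free_space = (?P<free>\d+), "
    r"free_space_limit = (?P<limit>\d+)"
)


def parse_check_disk_space(line):
    """Return (name, free_bytes, limit_bytes), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("name"), int(match.group("free")), int(match.group("limit"))


name, free, limit = parse_check_disk_space(LINE)
print(f"{name}: free {free / 1e12:.2f} TB, limit {limit / 1e12:.2f} TB, "
      f"short by {(limit - free) / 1e12:.2f} TB")
```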

     

    ## 5. Followed by nbjm failing the job.

    Log >> 3/02/12 08:05:14.209 [Error] V-117-131  NBU status: 129, EMM status: Insufficient disk space or High Water Mark would be exceeded

     

    0810 hrs -> Image Cleanup job kicks in and runs for 16 minutes, freeing up space to bring the STU usage below High Water Mark.

     

    Mark,

    >> When you say "new" jobs do you mean

    New Job => Jobs that have started after the disk has hit HWM.

    >> Do I take it fails with 129 errors

    Yes, the job fails with status code 129.

    >> Is there a reason you have a 90% high water mark? On a 22TB disk this leaves 2TB free when it hits the high water mark.

    • We run full backups over the weekend, which account for 17TB. So most of the time the HWM is reached at peak times during the weekend.
    • On average, an image cleanup job takes 30 mins – 2 hours to complete.
    • The storage unit can provide 550 – 800 MB/sec of throughput.
    • Assuming an average throughput of 350MB/sec, that gives us around 2 hours before the existing jobs start to fail.
    • Setting the LWM that low is to try to avoid running image cleanup multiple times.
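
The headroom estimate in the list above can be checked with a few lines of Python. This is a rough sketch. It assumes 1 GB = 1024 MB, that all writes land on this one pool at a steady rate, and that cleanup frees nothing in the meantime; the function name is made up for illustration.

```python
def hours_until_full(pool_gb, high_pct, throughput_mb_s):
    """Hours for writes at throughput_mb_s to fill the space above the HWM."""
    headroom_mb = pool_gb * (100 - high_pct) / 100 * 1024  # GB -> MB
    return headroom_mb / throughput_mb_s / 3600


# 350 MB/sec is the assumed average; 550 and 800 are the stated peak range.
for rate in (350, 550, 800):
    print(f"{rate} MB/sec: {hours_until_full(22339.27, 90, rate):.1f} h")
```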

    - Amar

  • As per the tech note extract above:

    When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.

    This just seems to be how NetBackup deals with things, but your high watermark does waste a lot of disk space.

    Also, any new clients that have never been seen before have their possible backup size calculated using a formula that relates to the disk size and High Water Mark, so that can also affect things.

    I still recommend increasing the HWM and if possible use groups of storage units - even splitting your 22TB volume into two 11 TB volumes could help as it would use the least full one whilst the other was clearing down so would bounce between the two rather than have nowhere else to go and just fail

    Hope this helps

  •  

    >>> When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full.
     
    I have never experienced Netbackup (on 6.5.5) Failing the assigned Jobs upon hitting the Highwater mark. Assigned jobs will continue till filesystem gets to 100% and then fail.
     
    >>> NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.
     
    Yes, it does not assign the new jobs. In addition to that (prior to the upgrade on the Master Server), it NEVER USED TO FAIL THE NEW JOBS (with 129 status) BEFORE running the Image Cleanup job.
    I am 100% sure of this behaviour, as I have been using this setup for almost 2 years.
     
    >>> This just seems to be how NetBackup deals with things, but your high watermark does waste a lot of disk space.
     
    If Assigned Jobs are to be failed at HWM, then what can be the use of a HWM?? Then optimal HWM will be 100%.
     
    In my situation (Before and after NBU upgrade), assigned Jobs never failed at HWM.
    After upgrade, new jobs started to fail as soon as the disk hits HWM and before Image Cleanup job kicks in. 
     
    >> I still recommend increasing the HWM and if possible use groups of storage units - even splitting your 22TB volume into two 11 TB volumes could help as it would use the least full one whilst the other was clearing down so would bounce between the two rather than have nowhere else to go and just fail
     
    If I split the STU, I cannot use the load-balancing model, as it will put me in the same situation. In the priority model, I will lose half the spindles, in turn affecting performance, and that may also affect the backup window.
     
    Forcing image cleanups using a cronjob, before the full backup window, seems like the best option until the issue is resolved.
    I have logged a case with the support. Will see how it goes.
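
A minimal sketch of what such a cron-driven cleanup could look like, wrapped in Python. It assumes `bpimage -cleanup -allclients` is the command used to trigger image cleanup manually and that the admin commands live in the default UNIX location; both should be checked against the Commands Reference for the installed release. The wrapper functions are made up for illustration, and the script defaults to a dry run that only prints the command.

```python
import subprocess

# Assumed default location of the admin commands on a UNIX master server.
BPIMAGE = "/usr/openv/netbackup/bin/admincmd/bpimage"


def cleanup_command(bpimage=BPIMAGE):
    """Build the command line assumed to trigger a manual image cleanup."""
    return [bpimage, "-cleanup", "-allclients"]


def run_cleanup(dry_run=True):
    """Run the cleanup command, or just print it when dry_run is True."""
    cmd = cleanup_command()
    if dry_run:
        print("would run:", " ".join(cmd))
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    run_cleanup(dry_run=True)
```

Scheduling it shortly before the full backup window starts would give cleanup time to finish before the weekend load arrives.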
     
    - Amar
  • I agree with you - at 6.5.6 jobs did not seem to fail - in V7+ they do seem to, so some of the logic has changed - perhaps that is why that tech note was released

  • The tech note which you referred to says: (2nd point in the 'High water mark' section of the table).

    -------------------------------

    When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full

    -------------------------------

    "On a disk full condition, it fails ASSIGNED backups and does NOT ASSIGN NEW BACKUPS."

    With 7.1.0.2 version, the above does not stand true for ASSIGNED JOBS. Because...

    1. The ASSIGNED jobs do not fail after hitting the disk full condition (i.e., reaching the high water mark)

    2. Also, the TN does not fully explain how 'NEW JOBS' are handled at the 'High water mark' condition.

    During my research, I stumbled on the tech note below, which explains that previous NetBackup versions (5/6) were mature enough to handle new jobs by TEMPORARILY PAUSING them and running cleanup instead of FAILING them.

     

    REF: http://www.symantec.com/business/support/index?page=content&id=TECH66149

    Extract from the above reference:

     ----------------------------------------------------

    It is important to understand that NetBackup 6.5 staging and high water mark processing is very different from 5.x and 6.0.

    In NetBackup 5.x and 6.0, High Water Mark value does not apply to disk staging storage units. The only condition that triggered staged (but not expired) image cleanup was the disk full condition reached during a backup. If this was encountered, all disk backups to that media server (not only to that storage unit) were temporarily paused and a cleanup process was launched to clean the oldest 10 images that had been staged.

     ----------------------------------------------------  

    Even in the older versions of NetBackup (5.x and 6.0), the ‘DISK FULL CONDITION’ was handled EFFICIENTLY, by PAUSING the NEW JOBS and NOT FAILING them.

    Shouldn't the latest versions (7 or later) be expected to handle new jobs similarly, or more efficiently? I have put the same question to support and am curiously waiting on the feedback. :-)

    - Amar