There doesn't seem to be any issue with the Disk Cleanup.
The image cleanup job kicks in within 10 minutes after the High Water Mark is hit, successfully clears the images, and brings usage back below the High Water Mark.
My guess is that a step is being missed by either mds (EMM media and device selection) or nbrb (resource broker). One of them should perform a check along these lines:
- Check whether the ‘Image cleanup’ has run after the ‘High water mark is hit’. <possibilities: YES/NO>
- Upon NO -> Initiate an image cleanup or wait for an image cleanup to run.
  - Then check the result.
  - Then process the resource for Jobid=1320574.
  - And get back to nbjm with the resource or an error.
- Upon YES -> Check the image cleanup result.
  - Then process the resource for Jobid=1320574.
  - And get back to nbjm with the resource or an error.
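As a rough illustration of the check proposed above, here is a minimal Python sketch. It is not actual mds or nbrb code: `CleanupState`, `handle_request`, `run_cleanup`, and the returned strings are all hypothetical names invented for this example.

```python
from enum import Enum


class CleanupState(Enum):
    NOT_RUN = "not_run"      # no image cleanup since the high water mark was hit
    SUCCEEDED = "succeeded"  # cleanup ran and freed enough space
    FAILED = "failed"        # cleanup ran but space is still insufficient


def handle_request(job_id, cleanup_state, run_cleanup):
    """Hypothetical allocation flow for a job that arrives after HWM is hit.

    run_cleanup is a callable (an assumption of this sketch) that starts or
    waits for an image cleanup and returns the resulting CleanupState.
    """
    if cleanup_state is CleanupState.NOT_RUN:
        # "Upon NO": initiate a cleanup or wait for one, then check its result.
        cleanup_state = run_cleanup()

    # "Upon YES" (or after the cleanup above): act on the cleanup result
    # and get back to nbjm with either the resource or an error.
    if cleanup_state is CleanupState.SUCCEEDED:
        return (job_id, "RESOURCE_GRANTED")
    return (job_id, "ERROR_129")
```

For example, `handle_request(1320574, CleanupState.NOT_RUN, lambda: CleanupState.SUCCEEDED)` would hand the resource back instead of failing the job immediately.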
Example Event Sequence:
With comments and unified logs of nbpem, nbjm, nbrb, mds for JobID 1320574.
0800 hrs -> STU hits the High Water Mark (we still have a lot of free space on the STU)
0805 hrs -> A new backup job kicks in (jobid=1320574) and fails in 2 seconds with status 129.
## Comments
## 1. nbpem submits the new job to nbjm
log>> 3/02/12 08:05:13.616 V-116-215 [BaseJob::run] jobid=1320574 submitted to nbjm for processing
## 2. nbjm sends a resource request to nbrb
Log>> 3/02/12 08:05:13.672 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=1320574)
## 3. nbrb checks for the resources.
Log start >>>
3/02/12 08:05:13.678 V-118-227 [ResBroker_i::requestResources] received resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, priority 0, secondary priority 25,554, description THE_BACKUP_JOB-1320574-{BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:13.978 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:14.005 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:14.014 [allocateTwin] INITIATING:
3/02/12 08:05:14.014 [allocateTwin] masterServer = Master01, client = client2, jobType = 1, capabilityFlags = 256, fatPipePreference = 0, statusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0
3/02/12 08:05:14.014 [allocateTwin] Twin_Record: STUIdentifier = AdvanceDiskPool2, STUIdentifierType = 1, PoolName = NetBackup, MediaSharingGroup = *ANY*, RetentionLevel = 10, RequiredMediaServer = , PreferredMediaServer = , RequiredDiskVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, CkptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocalMediaServer = 0
Log END >>>
## 4. High water mark limit detected and validated by EMM
Log Start >>>
3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded high_water_mark, name = /nbu/pool1, free_space = 17843059695616, free_space_limit = 19189287249510
3/02/12 08:05:14.120 V-143-1546 [validate_stu] Disk volume is Full or down @aaaat
3/02/12 08:05:14.148 [allocateTwin] EXIT INFO:
3/02/12 08:05:14.148 [allocateTwin] EXIT STATUS = 2005030 (EMM_ERROR_MDS_InsufficientDiskSpace, Insufficient disk space or High Water Mark would be exceeded)
3/02/12 08:05:14.154 V-118-146 [ProviderManager::allocate] MPXProvider returned Not Enough Valid Resources for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:14.155 V-118-108 [ResBroker_i::failOne] failing resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, status 2005030
3/02/12 08:05:14.166 V-118-255 [CorbaCall_requestFailed::execute] sending failure of request to nbjm for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, error code 2,005,030, reason not enough valid resources
3/02/12 08:05:14.195 [BackupJob::ERMEvent] (1044b4430) initial resource request failed, copy#=-1, EMM status=Insufficient disk space or High Water Mark would be exceeded, NBU status=129(../BackupJob.cpp:333)
Log End >>>
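To see how far past the limit the volume was when the request failed, the `check_disk_space` line above can be parsed with a short Python sketch. The regular expression and the names `PATTERN`, `parse_hwm_line`, and `shortfall` are assumptions made for this example, and the log does not state the units of the two values.

```python
import re

# The check_disk_space line copied from the log excerpt above.
LINE = ("3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded "
        "high_water_mark, name = /nbu/pool1, free_space = 17843059695616, "
        "free_space_limit = 19189287249510")

# Hypothetical pattern matching only this message layout.
PATTERN = re.compile(
    r"\[check_disk_space\].*?name = (?P<name>[^,]+), "
    r"free_space = (?P<free>\d+), free_space_limit = (?P<limit>\d+)"
)


def parse_hwm_line(line):
    """Return the volume name, both values, and their difference, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    free = int(match.group("free"))
    limit = int(match.group("limit"))
    return {
        "name": match.group("name"),
        "free_space": free,
        "free_space_limit": limit,
        "shortfall": limit - free,  # how far free_space is below the limit
    }


info = parse_hwm_line(LINE)
print(info["name"], info["shortfall"])
```

For the line above, the difference between `free_space_limit` and `free_space` is 1346227553894.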
## 5. Followed by nbjm failing the job.
Log >> 3/02/12 08:05:14.209 [Error] V-117-131 NBU status: 129, EMM status: Insufficient disk space or High Water Mark would be exceeded
0810 hrs -> Image Cleanup job kicks in and runs for 16 minutes, freeing up space to bring the STU usage below High Water Mark.
Mark,
>> When you say "new" jobs do you mean
New Job => Jobs that have started after the disk has hit HWM.
>> Do I take it fails with 129 errors
Yes, the job fails with status code 129.
>> Is there a reason you have a 90% high water mark? On a 22TB disk this leaves 2TB free when it hits the high water mark.
- We run full backups over the weekend, which account for 17TB, so the HWM is mostly reached at weekends during peak times.
- On average, an image cleanup job takes 30 minutes to 2 hours to complete.
- The storage unit can provide 550 to 800 MB/sec of throughput.
- Assuming an average throughput of 350MB/sec, this gives us around 2 hours before the existing jobs start to fail.
- The LWM is set that low to try to avoid running image cleanup multiple times.
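The headroom estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it takes the 2TB of free space quoted in the question, assumes decimal units (1 TB = 1,000,000 MB), and `hours_until_full` is a name made up for this example.

```python
def hours_until_full(headroom_tb, throughput_mb_per_s):
    """Hours until the remaining headroom is consumed at a steady write rate."""
    headroom_mb = headroom_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return headroom_mb / throughput_mb_per_s / 3600


# Average rate from the post, plus the quoted storage unit range.
for rate in (350, 550, 800):
    print(f"{rate} MB/sec -> {hours_until_full(2, rate):.1f} h")
```

At 350 MB/sec this works out to roughly 1.6 hours, in line with the "around 2 hours" figure, and to about 1.0 and 0.7 hours at 550 and 800 MB/sec.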
- Amar