01-30-2012 06:58 PM
I recently upgraded the NetBackup Master from 6.5.5 to 7.1.0.2, and after the upgrade I have started running into this issue.
Issue:
When the 'High Water Mark' is reached on the 'AdvancedDisk' pool, nbjm fails new jobs. This only occurs for a short period (i.e. < 30 mins) while the image cleanup creates some space, but during that short period any user-defined RMAN jobs executed from cron fail and do not rerun.
I only started noticing this issue after the upgrade. Before the upgrade we never had this problem, so I am assuming that previously nbjm used to pause new jobs while the image cleanup was active, instead of failing them outright.
Is there a way I can pause new jobs while the image cleanup is active, instead of having them fail?
Netbackup Environment:
1x Master - Solaris 10 on SPARC hardware, running NetBackup 7.1.0.2
Media Servers (x2) - Solaris 10 on SPARC hardware, running NetBackup 6.5.5 (I will be upgrading them shortly.)
Tape Storage Units: SL500 Tape library with 5 Drives
Disk Storage Unit (used for staging): 22 TB LUN from Sun storage, configured as an AdvancedDisk type.
AdvancedDisk Details:
Disk Pool Name : AdvanceDiskPool2
Disk Pool Id : AdvanceDiskPool2
Disk Type : AdvancedDisk
Status : UP
Flag : Patchwork
Flag : Visible
Flag : OpenStorage
Flag : AdminUp
Flag : InternalUp
Flag : SpanImages
Flag : LifeCycle
Flag : CapacityMgmt
Flag : FragmentImages
Flag : Cpr
Flag : RandomWrites
Flag : FT-Transfer
Flag : CapacityManagedRetention
Flag : CapacityManagedJobQueuing
Raw Size (GB) : 22339.27
Usable Size (GB) : 22339.27
Num Volumes : 1
High Watermark : 90
Low Watermark : 40
Max IO Streams : -1
Comment :
Storage Server : Media2 (UP)
- Amar
01-31-2012 12:02 AM
Thanks for a very detailed description of your problem.
I have not seen this behaviour in 7.x, but then we have always insisted that our customers upgrade Master and Media servers all on the same day. (Clients are upgraded in a phased approach).
There is also no mention of this type of problem in the LBN or the 7.1.0.3 Release Notes.
I know that different NBU versions are perfectly supported, but experience has taught us that certain features just work better with everything on the same NBU level.
Suggestion:
Upgrade media servers.
If problem persists, open a support ticket with Symantec.
01-31-2012 03:41 PM
Hi Marianne,
Thanks for taking time and providing a suggestion. Will try to stick to this rule in my next upgrade process.
Due to a limited change window, I had to take a 3-phased approach for the Master and media server upgrades. As part of the upgrade I am also migrating from Veritas Volume Manager to ZFS, so just the post-upgrade preparation itself took me around 7 hours. I also had to allow time for the upgrade, testing (liaising with the app and server teams) and rollback without affecting the scheduled backup times. :(
As a temporary fix (until the issue is resolved), I am planning on implementing the below. I guess this should remove the already-duplicated images and stop the disk from hitting the high water mark?
During a quiet time, executing a cronjob to force image cleanups:
nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 0 -lwm 0
nbdelete -allvolumes -force
bpimage -cleanup -allclients
nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 90 -lwm 40
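The four commands above could also be wrapped in a small guard script so cron only forces a cleanup when usage is actually near the mark. This is just a sketch under my assumptions: the volume is mounted at /nbu/pool1 (the path that shows up in the MDS log later in this thread) and the NetBackup binaries are on the PATH.

```shell
#!/bin/sh
# Sketch: only run the forced-cleanup sequence when the staging volume
# is close to the high water mark. Mount point /nbu/pool1 is an assumption.

# True (exit 0) when usage_pct >= threshold_pct.
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Print the Use% column for a mount point, without the '%' sign.
usage_of() {
    df -k "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# The forced-cleanup sequence from above: drop the watermarks so cleanup
# starts immediately, clean up, then restore the normal watermarks.
force_cleanup() {
    nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 0 -lwm 0
    nbdelete -allvolumes -force
    bpimage -cleanup -allclients
    nbdevconfig -changedp -dp AdvanceDiskPool2 -stype AdvancedDisk -hwm 90 -lwm 40
}

# The cron entry would call something like:
# if over_threshold "$(usage_of /nbu/pool1)" 85; then force_cleanup; fi
```

Restoring the watermarks at the end matters: if the script dies between the two nbdevconfig calls, the pool is left with hwm/lwm at 0, so a trap or a final sanity check may be worth adding.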
I would appreciate any comments/suggestions on the above approach.
- Amar
01-31-2012 08:25 PM
Were you aware of the following when you decided to migrate to ZFS?
Best practice for configuring AdvancedDisk:
http://www.symantec.com/docs/TECH158427
p.9:
File System Restrictions
For some file system types, notably NFS and ZFS, the lack of full posix compliance of the file system means that full file system conditions cannot be detected correctly, leading to problems when spanning volumes. To avoid problems where these file systems are used, each disk pool should be comprised of only one volume so that no spanning occurs.
This lack of full posix compliance of the file system is probably the reason for high water mark not being detected correctly.
01-31-2012 11:19 PM
Yes! I am aware of this. I am not using the disk pooling capabilities of AdvancedDisk.
I am using ZFS (previously VxFS) to combine multiple LUNs into a single volume, and that single volume (mount point) is presented to the AdvancedDisk pool.
Also, VxFS/ZFS can spread the I/O load across the underlying LUNs, which is better for performance.
-Amar
01-31-2012 11:41 PM
You really cannot compare VxFS with ZFS. (I can provide a comparative doc if you are interested).
VxFS is fully posix compliant.
The Best Practice Guide says that filesystems that are not fully posix compliant (like ZFS) have problems detecting the full-filesystem condition. The implication is that reporting of filesystem capacity will be affected.
It seems your problem has nothing to do with the NBU upgrade but with the migration to ZFS.
You did not have the same problem when you were using VxVM volumes with VxFS filesystem.
02-01-2012 01:37 AM
Some more details ...
02-01-2012 04:58 PM
Point 1:
Additional details for the "Netbackup Environment:" section in my first post:
NBU upgrade and ZFS:
Media Servers: still at 6.5.5, using VxFS.
So considering that I am still using VxFS on the staging unit (AdvancedDisk), the issue can only be caused by NBU.
Point 2:
NBU behaviour at HIGH WATER MARK:
Assumption: even if the staging unit were on ZFS, the issue I am seeing would still have to be NBU-related, because...
ZFS filesystem issue: “Inability to commit ZFS file system capacity statistics atomically”.
(ref: http://www.symantec.com/business/support/index?page=content&id=TECH66993)
This means that if we use ZFS, there will be a LITTLE EXTRA DELAY before NBU DETECTS the high water mark (again, depending on the frequency at which NBU makes its checks) and kicks off the image cleanup, which then follows the usual sequence. But the jobs should not fail straight away.
How frequently does NetBackup monitor the STU capacity? How quickly does it need to know?
Below is the explanation for “Disk storage unit capacity check frequency”, found in the ‘General Server’ options of the master server host properties.
______________________________________________________________
This property determines how often NetBackup checks disk storage units for available capacity. If checks occur too frequently, then system resources are wasted. If checks do not occur often enough, too much time elapses and backup jobs are delayed.
The default is 300 seconds (5 minutes).
Note: This property applies to the disk storage units of 6.0 media servers only. Subsequent releases use internal methods to monitor disk space more frequently.
______________________________________________________________
In Regards to Migrating to ZFS:
I was not trying to compare VxVM to ZFS; I compared the “disk pooling capability” of AdvancedDisk against ZFS and VxVM.
I only considered ZFS based upon my requirements.
Requirements in the order of importance:
Cons with NetBackup (in my situation):
VxVM: none.
ZFS: the ZFS filesystem issue: “Inability to commit ZFS file system capacity statistics atomically”.
(ref: http://www.symantec.com/business/support/index?page=content&id=TECH66993)
I am only affected in terms of the time DELAY between a) the actual file system usage hitting the ‘High water mark’ and b) NetBackup detecting that the file system has hit the High water mark.
What will be the actual impact?
STU size : 22 TB
High water mark : 90% (i.e. 19.8 TB)
Room before jobs start to fail : 10% (i.e. 2.2 TB)
Average throughput : 350 MB/sec (i.e. ~1.2 TB per hour)
According to the above figures, I have close to a 2-hour window before the disk fills up and backups start to fail (and that only if NetBackup cannot detect that the file system has reached the High Water Mark).
So considering that ~2-hour margin, I don’t think I should be worrying about “delayed capacity statistics” for ZFS.
I guess this issue only matters when the storage units are smaller and the gap between the high water mark and the disk-full condition is narrower, which is not the case in my situation.
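As a sanity check of the arithmetic above (my own figures, using awk for the floating-point math):

```shell
#!/bin/sh
# Back-of-the-envelope check of the headroom figures above: 22 TB STU,
# 90% high water mark, ~350 MB/sec (~1.2 TB/hour) average throughput.
awk 'BEGIN {
    stu_tb     = 22      # STU size in TB
    hwm        = 0.90    # high water mark as a fraction
    tput_tb_hr = 1.2     # ~350 MB/sec is roughly 1.2 TB/hour

    headroom_tb = stu_tb * (1 - hwm)     # space left once HWM is hit
    hours       = headroom_tb / tput_tb_hr

    printf "headroom: %.1f TB, time to disk-full: %.1f hours\n", headroom_tb, hours
}'
```

This prints a headroom of 2.2 TB and roughly 1.8 hours, in line with the "close to 2 hours" figure above.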
02-01-2012 09:40 PM
At least we agree that your problem is not a result of the NBU Master server upgrade but a result of implementing ZFS on the media server.
As per the TN that you reference above, I fail to see how the following is a NetBackup problem:
ZFS has two major POSIX Compliance Issues that will not be addressed by Sun in any release of ZFS.
You need to decide how you want to deal with this going forward......
02-01-2012 10:32 PM
The Media Servers are still at NB version 6.5.5.
The Storage Unit on the media server is still using the VxFS filesystem.
Only the Master Server has been upgraded to 7.1.0.3. We DO NOT use the master server as a media server.
I started noticing the ISSUE after upgrading the Master to NB 7.1.0.3.
Forcing the image cleanup seems like a temporary work around for now. Will see how it goes after the Media Server Upgrade.
-Amar
02-02-2012 01:46 AM
APOLOGIES!!!
I totally missed the part of your previous post where you said that the media servers were still on VxFS..... (being a Storage Foundation fanatic, I only read through your motivation for choosing ZFS.)
Then back to the advice in my first reply - upgrade at least one Media Server to 7.1.x and see if NBU 'behaves' better with matching levels.
To troubleshoot what is happening with disk cleanup on the media server, create the bpdm log directory on the media servers. See what kind of info is logged with the default level 0 log; increase the logging level (to 5) if level 0 doesn't give enough info....
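Legacy daemons such as bpdm only write a log if the matching directory exists under the NetBackup logs root. A minimal sketch (the default Unix path /usr/openv/netbackup/logs is assumed; the function takes the root as a parameter so it can be pointed elsewhere):

```shell
#!/bin/sh
# Create a legacy log directory (e.g. bpdm) on a media server.
# NetBackup legacy processes only log when their directory exists.
enable_legacy_log() {
    daemon=$1
    log_root=${2:-/usr/openv/netbackup/logs}   # default Unix install path
    mkdir -p "$log_root/$daemon"
    chmod 755 "$log_root/$daemon"
}

# Example (as root on each media server):
# enable_legacy_log bpdm
# Then raise verbosity with "VERBOSE = 5" in /usr/openv/netbackup/bp.conf
# (or Host Properties > Logging) if level 0 is not detailed enough.
```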
02-02-2012 02:08 AM
Just like to clarify a couple of things please ....
When you say "new" jobs, do you mean brand new jobs that have never run before (new clients and policies) or just jobs that start to run after the disk has hit its high water mark?
Do I take it it fails with 129 errors?
Is there a reason you have a 90% high water mark? On a 22 TB disk this leaves 2.2 TB free when it hits the high water mark.
Also, your low water mark is 40% - this is the point NetBackup needs to get down to to finish its image cleanup - so you are asking it to clear down to roughly 8.8 TB used before it is considered usable again.
Perhaps settings of 96% for high and 85% for low would be more suitable and make things run a little smoother.
Extract from this tech note: http://www.symantec.com/docs/HOWTO32794
The High water mark setting is a threshold that triggers the following actions:
When an individual volume in the disk pool reaches the High water mark, NetBackup considers the volume full. NetBackup chooses a different volume in the disk pool to write backup images to.
When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.
NetBackup begins image cleanup when a volume reaches the High water mark; image cleanup expires the images that are no longer valid. For a disk pool that is full, NetBackup again assigns jobs to the storage unit when image cleanup reduces any disk volume's capacity to less than the High water mark.
If the storage unit for the disk pool is in a capacity-managed storage lifecycle policy, other factors affect image cleanup.
The default is 98%.
The Low water mark is a threshold at which NetBackup stops image cleanup.
The Low water mark setting cannot be greater than or equal to the High water mark setting.
The default is 80%.
Hope this helps
02-02-2012 02:22 AM
The low water mark won't be an issue.
NBU will try and clear down to the LWM if it can. If it cannot (let's say it can only clear down to 55%), then it will just carry on from there.
The LWM is a 'limit', not a target.
Martin
02-02-2012 05:59 PM
There doesn't seem to be any issue with the disk cleanup itself.
The image cleanup job kicks in within 10 minutes after the high water mark is hit and successfully clears images, bringing usage back below the high water mark.
I am guessing some step is being missed by either MDS (EMM media and device selection) or nbrb (the resource broker); either one of them should be performing a check.
Example Event Sequence:
With comments and unified logs of nbpem, nbjm, nbrb, mds for JobID 1320574.
0800 hrs -> STU hits the high water mark (we still have a lot of free space on the STU)
0805 hrs -> A new backup job kicks in (jobid=1320574) and fails within 2 seconds with status 129.
## Comments
## 1. nbpem submits the new job to nbjm
log>> 3/02/12 08:05:13.616 V-116-215 [BaseJob::run] jobid=1320574 submitted to nbjm for processing
## 2. nbjm sends a resource request to nbrb
Log>> 3/02/12 08:05:13.672 V-117-56 [BackupJob::sendRequestToRB] requesting resources from RB for backup job (jobid=1320574)
## 3. nbrb checks for the resources.
Log start >>>
3/02/12 08:05:13.678 V-118-227 [ResBroker_i::requestResources] received resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, priority 0, secondary priority 25,554, description THE_BACKUP_JOB-1320574-{BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:13.978 V-118-226 [ResBroker_i::evaluateOne] Evaluating request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:14.005 V-118-146 [ProviderManager::allocate] NamedResourceProvider returned Allocation Granted for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:14.014 [allocateTwin] INITIATING:
3/02/12 08:05:14.014 [allocateTwin] masterServer = Master01, client = client2, jobType = 1, capabilityFlags = 256, fatPipePreference = 0, statusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0
3/02/12 08:05:14.014 [allocateTwin] Twin_Record: STUIdentifier = AdvanceDiskPool2, STUIdentifierType = 1, PoolName = NetBackup, MediaSharingGroup = *ANY*, RetentionLevel = 10, RequiredMediaServer = , PreferredMediaServer = , RequiredDiskVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, CkptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocalMediaServer = 0
Log END >>>
## 4. High water mark limit detected and validated by EMM
Log Start >>>
3/02/12 08:05:14.120 [check_disk_space] disk storage unit has exceeded high_water_mark, name = /nbu/pool1, free_space = 17843059695616, free_space_limit = 19189287249510
3/02/12 08:05:14.120 V-143-1546 [validate_stu] Disk volume is Full or down @aaaat
3/02/12 08:05:14.148 [allocateTwin] EXIT INFO:
3/02/12 08:05:14.148 [allocateTwin] EXIT STATUS = 2005030 (EMM_ERROR_MDS_InsufficientDiskSpace, Insufficient disk space or High Water Mark would be exceeded)
3/02/12 08:05:14.154 V-118-146 [ProviderManager::allocate] MPXProvider returned Not Enough Valid Resources for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}
3/02/12 08:05:14.155 V-118-108 [ResBroker_i::failOne] failing resource request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, status 2005030
3/02/12 08:05:14.166 V-118-255 [CorbaCall_requestFailed::execute] sending failure of request to nbjm for request ID {BED68FB4-4DFA-11E1-997E-00144F970E6E}, error code 2,005,030, reason not enough valid resources
3/02/12 08:05:14.195 [BackupJob::ERMEvent] (1044b4430) initial resource request failed, copy#=-1, EMM status=Insufficient disk space or High Water Mark would be exceeded, NBU status=129(../BackupJob.cpp:333)
Log End >>>
## 5. Followed by nbjm failing the job.
Log >> 3/02/12 08:05:14.209 [Error] V-117-131 NBU status: 129, EMM status: Insufficient disk space or High Water Mark would be exceeded
0810 hrs -> The image cleanup job kicks in and runs for 16 minutes, freeing up enough space to bring the STU usage below the high water mark.
Mark,
>> When you say "new" jobs do you mean
New Job => Jobs that have started after the disk has hit HWM.
>> Do I take it fails with 129 errors
Yes, the job fails with status code 129.
>> Is there a reason you have a 90% high water mark? On a 22TB disk this leaves 2TB free when it hits the high water mark.
- Amar
02-03-2012 01:57 AM
As per the tech note extract above:
When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full.
This just seems to be how NetBackup deals with things, but your high water mark does waste a lot of disk space.
Also, any new clients that have never been seen before have their estimated backup size calculated using a formula that relates to the disk size and High Water Mark, so that can also affect things.
I still recommend increasing the HWM and, if possible, using storage unit groups - even splitting your 22 TB volume into two 11 TB volumes could help, as NetBackup would use the least-full volume whilst the other was clearing down, bouncing between the two rather than having nowhere else to go and simply failing.
Hope this helps
02-06-2012 08:51 AM
I agree with you - at 6.5.6 jobs did not seem to fail, but in v7+ they do, so some of the logic has changed - perhaps that is why that tech note was released.
02-07-2012 05:05 PM
The tech note you referred to says (2nd point in the 'High water mark' section of the table):
-------------------------------
When all volumes in the disk pool reach the High water mark, the disk pool is considered full. NetBackup fails any backup jobs that are assigned to a storage unit in which the disk pool is full. NetBackup also does not assign new jobs to a storage unit in which the disk pool is full
-------------------------------
So on a disk-full condition, it fails ASSIGNED backups and does NOT ASSIGN NEW BACKUPS.
With version 7.1.0.2, the above does not hold true for ASSIGNED JOBS, because...
1. The ASSIGNED jobs do not fail after hitting the disk-full condition (i.e. reaching the high water mark).
2. Also, the TN does not fully explain how NEW JOBS are handled at the 'High water mark' condition.
During my research I stumbled onto the tech note below, which explains that previous NB versions (5.x/6.0) were mature enough to handle new jobs by TEMPORARILY PAUSING them and running cleanup, instead of FAILING them.
REF: http://www.symantec.com/business/support/index?page=content&id=TECH66149
Extract from the above reference:
----------------------------------------------------
It is important to understand that NetBackup 6.5 staging and high water mark processing is very different from 5.x and 6.0.
In NetBackup 5.x and 6.0, High Water Mark value does not apply to disk staging storage units. The only condition that triggered staged (but not expired) image cleanup was the disk full condition reached during a backup. If this was encountered, all disk backups to that media server (not only to that storage unit) were temporarily paused and a cleanup process was launched to clean the oldest 10 images that had been staged.
----------------------------------------------------
Even in the older versions of NetBackup (5.x and 6.0), the ‘DISK FULL CONDITION’ was handled EFFICIENTLY, by PAUSING the NEW JOBS and NOT FAILING them.
The latest versions (7 or later) should be expected to handle new jobs just as well, or better. I have put the same question to support and am curiously awaiting feedback. :)
- Amar