HWM - and Error 129

Dirk_Mueller · ‎05-16-2014

Hi,

We have NBU 7.1.0.4.

Master + Media Servers: Win2008 R2

I think I misunderstand the High Water Mark (HWM) for my AdvancedDisk pool.

If the HWM is reached, do all backups end with Error 129 until image cleanup has freed up further space?

Is that right?

Kind regards

Dirk

Marianne · ‎05-16-2014

Extract from NBU Admin Guide I:

The High water mark setting (default 98%) is a threshold that triggers the
following actions:
■ When an individual disk volume of the underlying storage reaches the High
water mark, NetBackup considers the volume full. NetBackup chooses a
different volume in the underlying storage to write backup images to.
■ When all volumes in the underlying storage reach the High water mark, the
BasicDisk storage is considered full. NetBackup fails any backup jobs that are
assigned to a storage unit in which the underlying storage is full. NetBackup
also does not assign new jobs to a BasicDisk storage unit in which the
underlying storage is full.
■ NetBackup begins image cleanup when a volume reaches the High water mark;
image cleanup expires the images that are no longer valid. NetBackup again
assigns jobs to the storage unit when image cleanup reduces any disk volume's
capacity to less than the High water mark.
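
The rules in this extract can be read as a small decision procedure. The following Python sketch only illustrates that reading. The names `Volume`, `is_full` and `pick_volume` are hypothetical and not part of any NetBackup API, and the tie-break (prefer the least-filled volume) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

HIGH_WATER_MARK = 98.0  # percent; the default quoted in the extract


@dataclass
class Volume:
    name: str
    capacity_gb: float
    used_gb: float

    @property
    def used_percent(self) -> float:
        return 100.0 * self.used_gb / self.capacity_gb


def is_full(volume: Volume, hwm: float = HIGH_WATER_MARK) -> bool:
    """A volume counts as full once its usage reaches the high water mark."""
    return volume.used_percent >= hwm


def pick_volume(volumes: List[Volume], hwm: float = HIGH_WATER_MARK) -> Optional[Volume]:
    """Return a volume below the HWM, or None if the storage is considered full."""
    candidates = [v for v in volumes if not is_full(v, hwm)]
    if not candidates:
        return None  # every volume is at or above the HWM: new jobs would fail
    # Assumption for illustration: prefer the least-filled volume.
    return min(candidates, key=lambda v: v.used_percent)
```

With one volume at 99% and another at 50%, `pick_volume` returns the second one; if every volume is at or above the mark it returns `None`, which corresponds to the "storage is considered full" case.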

Hope this helps.


Mark_Solutions · ‎05-16-2014

That rather depends: is your AdvancedDisk set to stage data to tape, or is it using a capacity managed SLP?

If not, and you use fixed retentions, then once it is full it stays full until data expires.

If it does stage but has not been staging off to tape, then the images are protected until they have been staged, so again they cannot be removed to make space.

The ideal is to use capacity managed / staged storage and make sure that everything that should be staged / duplicated off to tape has been.

Having said that, I have seen jobs occasionally fail because NetBackup cannot clear the disk down fast enough, or cannot clear enough space for the backup it is attempting. This happens especially with a new client's first backup, whose size can be estimated too high (an internal NetBackup thing).

Hope that helps?

Carlo_Palmieri · ‎05-16-2014

Yes, this is correct. You can also:

- free some space

- add more space to the filesystem for this STU

- lower the high water mark for this disk

- configure the policies to access it through a storage unit group that provides alternative storage to use when this storage unit fills up

Regards

Carlo

loori · ‎05-17-2014

It depends on how exactly you use the disks. I suppose, partly because you mention "AdvancedDisk", that you use SLP (no BasicDisk there). When the HWM is reached, image cleanup checks whether there are images eligible for deletion and purges them. That takes a bit of time, so the jobs end with status 129. That usually causes no further problem in a functioning environment, as they are retried automatically; nevertheless they might run past the backup window.

Lowering the HWM only helps in so far as the cleanup process is started earlier. You can just as well raise the HWM and let the jobs run. Getting more disks saves trouble.

You shouldn't mix "capacity managed" and "fixed" retention; that can cause trouble. Besides, my experience is that NetBackup doesn't use the image on disk for recovery when there is an image on tape and you use "capacity managed". So with "capacity managed", don't use overly long retentions; they are not needed.

In rare cases something goes wrong and you have orphaned images on disk which are not cleaned up. If they don't have an EMM entry, you can clean them up manually. I had that several times when I decommissioned disk pools or servers. Better open a case then.

In an emergency you could select the images on disk and expire them, if you are 100% positive that they are duplicated and/or not needed. These games did cause quite a few white hairs on my head.

Dirk_Mueller · ‎05-19-2014

All backups run with SLP, all are capacity managed, and all backups are duplicated to two tape libraries.

I have two SUGs, each with two disks.

On every disk I have backups of the last 5-6 days; the oldest backups with an active lifecycle are 10-12 h old.

LWM: 55, HWM: 90

Every disk pool: 3 TB

I thought NetBackup has no problem until a disk pool reaches 90%. Then NetBackup starts an image cleanup. At that point I have 300 GB free on this disk, so NetBackup can use this. If I want to back up more than 300 GB before the cleanup has finished, I will get error 129. But all backups below 300 GB can run normally.
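
The figures in this post can be checked with a little arithmetic. The sketch below assumes 1 TB = 1000 GB; the helper names `headroom_gb` and `cleanup_target_gb` are made up for the example, and the 85% fill level is an arbitrary sample value. Note that the 300 GB is the space above the HWM. According to the replies in this thread, new jobs cannot count on that space, so the more useful number is the headroom below the HWM.

```python
def headroom_gb(capacity_gb: float, used_percent: float, hwm_percent: float) -> float:
    """Space left before the pool reaches the high water mark (never negative)."""
    return max(0.0, capacity_gb * (hwm_percent - used_percent) / 100.0)


def cleanup_target_gb(capacity_gb: float, lwm_percent: float) -> float:
    """Usage level that cleanup aims for if it runs down to the low water mark."""
    return capacity_gb * lwm_percent / 100.0


capacity = 3000.0  # 3 TB pool, taking 1 TB = 1000 GB

print(capacity * (100 - 90) / 100)        # space above the HWM: 300.0
print(headroom_gb(capacity, 85.0, 90.0))  # room left at 85% used: 150.0
print(cleanup_target_gb(capacity, 55.0))  # usage at the LWM: 1650.0
```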

Mark_Solutions · ‎05-19-2014

I have seen this behaviour, especially with storage unit groups, and it does cause issues.

The logic is covered here - not sure how helpful it is http://www.symantec.com/docs/HOWTO35008

This may be more appropriate to your issue: http://www.symantec.com/docs/TECH87559

Hope these help

loori · ‎05-19-2014

You seem to believe that NetBackup starts the cleanup when the HWM is reached, and that in the meantime you can go on with the backup. That would be nice, but it's just a wish. As soon as you reach the HWM you get a 129. You could set the HWM a bit higher; I used 98%.

gbeyken · ‎05-19-2014

Hi Dirk, as Marianne already said, the behaviour of an AdvDsk SU is different from that of a BasicDisk SU. The HWM is measured per disk volume (in your case, per Windows drive that is a member of the disk pool). If the HWM is reached, the cleanup job deletes expired images (in your case, also images that have already been copied) on the affected volume. If other volumes have enough space to hold the image, everything is fine; otherwise the backup job ends with "129". So I would recommend reorganizing your AdvDsk pool(s) to hold several smaller volumes instead of one or two big ones. This is also an advantage in case of a filesystem check :-)

Regards, Gerrit

loori · ‎05-19-2014

It doesn't make much sense if your volumes are smaller than the physical LUN. Who would need such flimsy things nowadays anyway?

If you have more disks within a pool, NetBackup starts the cleanup as soon as one of them reaches the HWM. That doesn't help much, as NetBackup tries its very best to fill the members of the pool to the same degree, which even costs time. I'd consider it better to have more DSTUs in a load-balancing group. In case of a migration, for example, it's much easier to take a DSTU out of the group than a volume out of the pool. A filesystem check is a problem with NTFS; that's the price you have to pay for not using a decent OS (you use 64k as block size, I hope?).

gbeyken · ‎05-19-2014

Who said that the Windows drive (FS) should be smaller than the "physical" LUN? Of course it's better not to put more than one FS on a LUN in this scenario. BTW, an FS check of an x-terabyte filesystem is no fun on Unix systems either :-) Believe it or not, this is the way we do it, and it works :-) It's correct that NBU tries to fill the volumes evenly, but because the backups are mostly somewhat bigger, this is not entirely possible ;-)

Dirk_Mueller · ‎05-19-2014

Hmmm, so if my AdvancedDisk pool reaches the HWM, all backups fail.

But I have two AdvancedDisk pools in one SUG. When one AdvDsk pool reached the HWM, my backups failed with 129. But my SUG is set to load balancing. Shouldn't the backups have run to the other pool of the SUG?

And why should I use the HWM if NetBackup treats it like the physical limit?

Old_Remove · ‎05-26-2014

Hello!

(Originally posted in German, as it was too complicated for me to write in English :)

So: the HWM marks the point at which NBU starts to examine the images in the pool and deletes all images that are flagged "Capacity Managed" and have already been duplicated. NBU then deletes images until the LWM is reached, provided there are enough images in this category. NBU does not delete images if:

1. The images are flagged "FIXED".

2. The images are flagged "Capacity Managed" or "Expire after Duplication" but have not yet been duplicated.

Status 129 can therefore occur when:

1. NBU cannot delete any images for the reasons described above, or when, for example, the HWM is set so high (e.g. 95%) that backups write into the pool faster than NBU can delete. If NBU also has to clean up while backups are running, this can sometimes take a while. During that time the remaining space (e.g. just the 5%) fills up, and then you get 129s as well.

2. "Capacity Managed" or "Expire after Duplication" is used, but too few tape drives are available to write the duplicates.

3. The B2D storage is too slow on reads, so the duplications cannot deliver enough throughput to the tapes.
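
The eligibility rules described in this post can be summarised as a small predicate. This is only a sketch of the rules as stated in this thread, not NetBackup's actual implementation. The `DiskImage` type, its field names and the retention-type strings are hypothetical, and treating a fixed-retention image as removable once its retention has passed is an assumption taken from a later reply.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DiskImage:
    backup_id: str
    retention_type: str      # "fixed", "capacity_managed" or "expire_after_duplication"
    duplications_done: bool  # True once every configured copy has been written
    expires_at: datetime     # only meaningful for fixed retention


def eligible_for_cleanup(image: DiskImage, now: datetime) -> bool:
    """Return True if cleanup may delete this image from the disk pool."""
    if image.retention_type == "fixed":
        # Fixed retention: protected until the retention period has passed.
        return now >= image.expires_at
    # Capacity managed / expire after duplication: protected until duplicated.
    return image.duplications_done
```

If no image in the pool satisfies this predicate, cleanup cannot free any space, which matches the first cause of status 129 listed in the post.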

Regards

Johannes

loori · ‎05-26-2014

Sorry,

but that is not quite right. NetBackup doesn't write beyond the HWM. With staging without SLP, something along those lines does happen: running jobs pause, NetBackup purges images from the staging disk where possible, and the jobs then resume without error. When using SLP you always get an SC 129 when you "bump" into the HWM. That causes NetBackup to start an unscheduled image cleanup, independent of any other schedules, which purges the images that are eligible (all duplications of the particular image are done, or its fixed retention has passed).

loori · ‎05-26-2014

It would be nice to have two limits: One which starts netbackup's cleanup, the other to stop the backups/duplications.
Unfortunately we don't have that.
Without HWM we would cause an exception, with HWM we have a SC 129 and the backup/duplication is repeated automatically – consistent behaviour.

Just calculate your retention for the DSTU with some margin.
It would probably be better to use a failover STUG to avoid the SC 129, but in my opinion that has disadvantages for the administrator, e.g. unbalanced usage of the DSTUs/media servers.

My experience is that it's better to use short retentions for capacity managed DSTUs. The primary copy seems to switch to the first copy with fixed retention as soon as that exists; the copy on disk is then only used for duplications, if configured correctly (and maybe for synthetic backups, I'm not sure about that).

Old_Remove · ‎05-26-2014

Hello Loori,

you are "partly" right. The point about status 129 and the HWM applies to new jobs, i.e. jobs that have not yet started. Once the HWM is reached, these get a 129 immediately. All jobs that have already started keep writing, no matter how full the disk is. I've just checked it again.

Regards

Johannes

loori · ‎05-26-2014

True,
but in terms of volume that is practically insignificant, since NetBackup checks before the backup starts whether it will fit (and if not, throws a 129 right away).
Depending on the size of the disk pool, this therefore usually causes no problem with a full filesystem.

I liked the earlier behaviour without SLP better.

Dirk_Mueller · ‎06-04-2014

Hello,

thanks for your answers.

So that means that in continuous operation I will always get 129s on reaching the HWM, regardless of whether it is set to 90% or 60%. Only running backups continue up to the physical end of the pool; new ones fail.

And I can't prevent these failures, can I?

And controlling things via a SUG with two pools then gains me nothing at all; we have it set to media server load balancing. Apparently that doesn't prevent these failures either.

Regards

Dirk

loori · ‎06-04-2014

I think the only option is appropriate sizing. The volume that I can write out within 12 hours, divided by the number of media servers minus one, per media server, would be a pretty good rule of thumb if I set the retention to one day; otherwise correspondingly more. It also depends on the storage: with fast local disks more may be possible, while inexpensive SAN storage cannot handle very many parallel I/O streams.
We had two media servers attached to one MSA, with the volumes configured as RAID 6 and no time windows for the SLPs, and achieved a cumulative throughput (backup and duplication) of up to one gigabyte per second. In total four servers with ~14 TB each, a front-end volume of ~30 TB, one to seven days' retention (capacity managed), everything duplicated twice to tape. As long as no tape drives failed, there were no 129s. These days the balancing act with MSDP is a bit more exciting, as the performance is worse; in return we keep much more data in the pools for longer.
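
The sizing rule of thumb in this post can be written out as a short calculation. The sketch below assumes one reading of the rule: the data volume that can be written in a 12-hour window, divided by the number of media servers minus one, gives the disk capacity to plan per media server for one day of retention. The function name and the 500 MB/s throughput used in the example are hypothetical values, not figures from the thread.

```python
def disk_per_media_server_tb(throughput_mb_s: float,
                             media_servers: int,
                             window_hours: float = 12.0) -> float:
    """Rule-of-thumb disk capacity per media server, in TB (1 TB = 1,000,000 MB)."""
    if media_servers < 2:
        raise ValueError("the rule divides by (media_servers - 1), so it needs at least 2")
    volume_tb = throughput_mb_s * 3600 * window_hours / 1_000_000
    return volume_tb / (media_servers - 1)


# Hypothetical example: 500 MB/s sustained across 4 media servers.
size = disk_per_media_server_tb(500.0, 4)
print(f"{size:.1f} TB per media server")
```

For these sample inputs the 12-hour volume is 21.6 TB, so the rule suggests roughly 7.2 TB per media server; longer retentions would scale that up accordingly.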

gbeyken · ‎06-04-2014

Hello Dirk, I can only say once more that we have no jobs ending with 129 in our configuration. Just try it if you don't want to believe it ;-) Regards and good luck, Gerrit
