Image Cleanup not cleaning up. Disk Staging Storage Unit full.

Stan_Dalone
Level 3

Hi out there!

One of our DSSUs frequently fills up to 100%, resulting in jobs failing with status 129 (disk storage unit is full).

Settings are as follows:

bpstulist -label DSSU_olha5046 -L

Label:                DSSU_olha5046
Storage Unit Type:    Disk
Media Subtype:        Basic (1)
Host Connection:      olha5046
Concurrent Jobs:      25
On Demand Only:       yes
Path:                 "/dssu/"
Robot Type:           (not robotic)
Max Fragment Size:    51200
Max MPX:              1
Stage data:           yes
Block Sharing:        no
File System Export:   no
High Water Mark:      60
Low Water Mark:       40
Ok On Root:           no
 

I took a glance at the bpdm log and found a great number of errors like these:

00:00:00.741 [4155] <2> image_db: Q_IMAGE_GET_CCNAME
00:00:00.748 [4155] <2> LOCAL CLASS_ATT_DEFS: Product ID = 6
00:00:00.750 [4155] <16> read_info_file: EXCHANGE_SOURCE corrupt: expected range 0-3 got -1
00:00:00.751 [4155] <16> read_info_file: EXCHANGE_SOURCE corrupt: expected range 0-3 got -1
00:00:00.752 [4155] <16> read_info_file: EXCHANGE_SOURCE corrupt: expected range 0-3 got -1
00:00:00.753 [4155] <16> read_info_file: EXCHANGE_SOURCE corrupt: expected range 0-3 got -1
(... the same read_info_file error repeats many more times ...)
 

I can't see any errors in the cleanup jobs. They all finish without any problems. However, this one DSSU remains uncleared.

Has anybody experienced this?

Any ideas?

Thanks and best regards,

Gerald

 

ACCEPTED SOLUTION

Stan_Dalone
Level 3

After clearing the filesystem manually, I made screenshots of the configuration of the affected storage unit "DSSU_olha5046". Then I deleted it.

After that, I re-created it with the same parameters as before.

Since then, the storage unit has been cleaned up just like all the others.
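For anyone who wants to script the same fix, the delete/re-create can also be done from the CLI. A rough sketch using the parameters from the bpstulist output above - I used the GUI myself, and the exact bpstuadd options (especially anything staging-related) vary by NBU version, so check the Commands guide first:

# Delete the misbehaving storage unit (note its config down beforehand!)
/usr/openv/netbackup/bin/admincmd/bpstudel -label DSSU_olha5046

# Re-create it as a basic disk STU with the same parameters
/usr/openv/netbackup/bin/admincmd/bpstuadd -label DSSU_olha5046 \
    -path /dssu -host olha5046 -cj 25 -odo 1 \
    -mfs 51200 -maxmpx 1 -hwm 60 -lwm 40

# Staging ("Stage data: yes") and the relocation schedule were
# re-enabled from the Administration Console afterwards.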

Thanks for all your advice ;)

Gerald


23 REPLIES

Mark_Solutions
Level 6
Partner Accredited Certified

What version of NBU are you running?

Are your duplications all up to date?

Depending on your version this may well be the issue - it applies to V7:

http://www.symantec.com/docs/TECH141441

Stan_Dalone
Level 3

TECH141441 seems to relate to a lower version.

And yes, duplications are up-to-date.
I've built a script that verifies that most of the images in the filesystem /dssu have been duplicated and should be ready to be cleared.

Moreover, I tried nbstlutil on those images, and they have the state "ELIGIBLE_FOR_EXPIRATION"

So far, I really don't see why they aren't being cleared.
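For reference, the core of my check looks roughly like this (a sketch - it assumes 'nbstlutil list -U' prints a State line per image copy, which may differ between NBU versions):

# Count the lifecycle states across all images; the
# ELIGIBLE_FOR_EXPIRATION copies are the ones cleanup should remove
/usr/openv/netbackup/bin/admincmd/nbstlutil list -U \
    | grep 'State' | sort | uniq -c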

Any help will be appreciated.

Mark_Solutions
Level 6
Partner Accredited Certified

Thanks for that.

How large is your disk staging area?

What disk is it located on? I see the path in the output above as "/dssu/" - what is that?

Is this on the Master or a Media Server, and does this happen only to this one of your DSSUs?

You need to look either in the All Log Entries report or the bpdbm logs for more clues.

Mark_Solutions
Level 6
Partner Accredited Certified

Also see this for more information on where to look for clues:

http://www.symantec.com/docs/TECH70241

Mark_Solutions
Level 6
Partner Accredited Certified

One other thing worth trying, if possible: run bpdown on all of your servers and then see whether any processes are left behind, especially a bpdbm on the Master Server.

If so, run bpdown again or reboot the Master to ensure everything is clean.

If a bpdbm process gets orphaned it can prevent the clear-down of images.
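A minimal sketch of that check on the Master (assuming a standard UNIX install; on some versions bpdown lives in bin/goodies rather than bin):

# Stop NetBackup
/usr/openv/netbackup/bin/bpdown -f

# Anything still running now is orphaned - look for bpdbm in particular
ps -ef | grep '[b]pdbm'

# Kill any leftover bpdbm (substitute the real PID), then restart
# kill <pid>
/usr/openv/netbackup/bin/bpup -f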

Nicolai
Moderator
Partner VIP

Do you have an "Image Cleanup" job in the Activity Monitor that does not exit?

When an Image Cleanup job hangs, duplications will continue but images will not get deleted to free up space. Often you can't even cancel the hung Image Cleanup job.

Restart NetBackup to solve the issue.
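A quick way to spot such a job from the CLI (a sketch - the exact job-type label in the bpdbjobs output varies a little between versions):

# List all jobs and filter for cleanup jobs; a cleanup job that has
# been Active for many hours is the suspect
/usr/openv/netbackup/bin/admincmd/bpdbjobs -report | grep -i clean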

Stan_Dalone
Level 3

Only DSSU_olha5046 on Media Server olha5046 is affected.

/dssu is a filesystem on a basic disk, 6TB in size. This is the one that fills up if I don't intervene manually.

/dssu/redo is another filesystem on the same machine, serving as DSSU_REDO_olha5046. This one behaves completely normally.

The high water mark is 40% and the low water mark is 20% on all of our DSSUs.

No, there are no hanging cleanups.

The only irregularity in the cleanups is that some Media Servers are cleaned very often - they don't have any problems. olha5046, however, is cleaned up rarely (roughly once a day). That doesn't seem to be enough to prevent the filesystem from filling up.

@Mark_Solutions:
Thanks for giving me the clue about REM and DPS. They weren't on my radar so far.
Is there any way to check what they actually do?
I would like to find out what REM "thinks" about olha5046, and where it differs from the other servers.

Mark_Solutions
Level 6
Partner Accredited Certified

Stan

If you set up the logging as per the tech note, you can then check out the logs.

I do find your water mark settings interesting - they may be causing a block in some way if this one writes very large images.

So you have a 6TB volume with an HWM of 40%, which means it can only write 2.4TB of data before it is considered full.

If a single backup pass gets near this it will encounter a disk-full condition but not be able to do anything about it, as no images are eligible for expiration - it hasn't had a chance to duplicate them yet - so it may put the process on hold.

My initial recommendation would be to set the HWM to 95% (5.7TB, so still 300GB of free space) and the LWM to 75% (4.5TB).

This may give it enough room to play with and allow the cleanup and the disk-full handling to do something about it.
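If you prefer the CLI over the GUI for that change, something along these lines should do it (a sketch - I believe bpsturep takes the same -hwm / -lwm options as bpstuadd, but verify against your version's Commands guide):

# Raise the water marks on the affected STU
/usr/openv/netbackup/bin/admincmd/bpsturep -label DSSU_olha5046 -hwm 95 -lwm 75

# Confirm the new values
/usr/openv/netbackup/bin/admincmd/bpstulist -label DSSU_olha5046 -U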

Hope this helps

Stan_Dalone
Level 3

I'm not saying that NetBackup just MARKS the DSSU as full (that would only mean no more jobs are scheduled for the DSSU - fine, no problem). The problem is that the filesystem REALLY fills up. It sits at 90%, and no cleanup ever deals with the media server olha5046. Other DSSUs are dealt with perfectly, with the same settings for HWM and LWM.

And, as I stated above, there are plenty of images in the state "ELIGIBLE_FOR_EXPIRATION" - the problem is that they aren't being cleared.

I had considered that there might be very large images. That's why I lowered the HWM and LWM settings (they were 60/40 before): to give NetBackup more time to do the cleanup.

And if there were very large images, that would apply to all DSSUs in the same way. It's only this one DSSU that isn't being cleared.


Mark_Solutions
Level 6
Partner Accredited Certified

Just re-reading this thread to see what I had missed, and I want to clarify something:

You said:

/dssu is a filesystem on a basic disk, 6TB in size. This one fills up if I don't intervene manually.

/dssu/redo is another filesystem on the same machine, serving as DSSU_REDO_olha5046. This one behaves absolutely normal.

I'm no expert in how the file systems are used, but my brain tells me that /dssu/redo is a subset of /dssu, so it may be causing issues here.

How are your other DSSUs set up?

Could you post the output from bpstulist -U -M <masterservername>, just so that I can see what you have, and also an admin log if possible?

Thanks

Mark_Solutions
Level 6
Partner Accredited Certified

Hi again

 

Just realised that I haven't asked whether you have rebooted the Master Server or checked its processes.

If there is an orphaned bpdbm process it will stop the clear-down from running.

Try either a reboot of the Master, or shut down NetBackup and see whether you still have a bpdbm process running.

If you do, kill it off - this could also account for why the regular image cleanups are not running.

I would still set the HWM and LWM as I suggested, though.

Marianne
Level 6
Partner    VIP    Accredited Certified

Please create a bpdm log folder on the media server, then run 'nbdelete -allvolumes -force' on the master.

Check bpdm log on media server for clues/errors.
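In practice that looks something like this (standard UNIX paths assumed - adjust for Windows or a non-default install):

# On the media server: create the bpdm legacy log folder
mkdir -p /usr/openv/netbackup/logs/bpdm

# On the master: force deletion of expired disk images
/usr/openv/netbackup/bin/admincmd/nbdelete -allvolumes -force

# Back on the media server: pull the error-severity (<16>) lines from bpdm
grep '<16>' /usr/openv/netbackup/logs/bpdm/log.*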

Stan_Dalone
Level 3

As I stated in my first posting, the bpdbm log was the first place I looked.

And yes, there are error messages (see above)

If only someone could give me a clue what they mean ;)

Mark_Solutions
Level 6
Partner Accredited Certified

Stan

Have you checked for hung or orphaned bpdbm processes as per my earlier message?

Can you post a bpdbm and a bpdm log so that we can see all of it, please?

Thanks

#edit#

On one of our managed sites we have a Friday afternoon check: all Media Servers are checked for bpbrm processes while no backups are running, and any found are killed off. The Master Server either has NetBackup taken down and any orphaned processes killed off (it quite often needs bpdown running twice), or it is rebooted.

Orphaned processes are quite common and can cause a lot of issues, so regular checks and cleanups save a lot of problems.
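The media server part of that sweep is essentially this (only safe once you have confirmed no backups are running):

# Any bpbrm still alive while no backups run is a leftover
ps -ef | grep '[b]pbrm'

# Kill the leftovers (substitute the real PIDs)
# kill <pid> ...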

Stan_Dalone
Level 3

My latest (and smallest) bpdbm log is 600 MB in size.

I don't think I can paste that here - or should I?
I pasted the lines that looked the most striking (see above).
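For a log that size, filtering on the severity field is more practical than pasting; this is roughly how I pulled those lines out (assuming the standard UNIX log location):

# <16> marks error-severity entries in legacy NetBackup logs;
# strip the timestamp/PID prefix so identical messages collapse
grep '<16>' /usr/openv/netbackup/logs/bpdbm/log.* \
    | sed 's/^[^<]*//' | sort | uniq -c | sort -rn | head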

But thanks, Mark - the hint about hanging bpdbm processes is worth looking into. I'll try to sort those processes out and will report back after that.

As to "Friday afternoon check": Congratulations! That sounds quite comfortable.
We perform backups 7*24*365 (enterprise environment with a great number of class A customers) .
Many backups run for 3 or more days. So, there's no way of scheduling a downtime just for troubleshooting.
Pity!

Marianne
Level 6
Partner    VIP    Accredited Certified
I really meant bpdm (the disk manager) on the media server. That log contains the disk cleanup info; errors/problems with cleanup will be logged in bpdm.

Mark_Solutions
Level 6
Partner Accredited Certified

OK - if your log is that large and you run 24x7 with no downtime, then you more than likely have orphaned processes by now, which would definitely cause your issue.

My customer only backs up 70TB with 6000 jobs per day (7 days per week), but that is easily done over a weekend and duplicated by Tuesday or Wednesday, with the daily ones done and duplicated by lunchtime, so by Friday afternoon things are generally quiet.

Marianne
Level 6
Partner    VIP    Accredited Certified

About the errors in bpdbm:

Are you using GRT for Exchange?

Have you disabled cataloging for duplications of Exchange backups?

See this section in NBU for Exchange Guide:

Disabling the cataloging for duplications of Exchange backups using Granular Recovery Technology (GRT)

Unlike a duplication of a backup that uses Granular Recovery Technology (GRT) from tape to disk, duplication of the same backup from disk to tape takes extra time. NetBackup requires this extra time to catalog the granular Exchange information. You can choose not to catalog the granular information so that the duplication is performed more quickly. However, then users are not able to browse for individual items on the image that was duplicated to tape if the disk copy expires.

During the duplication process, NetBackup writes log entries periodically to show the progress of the job.

To disable the cataloging of Exchange backups using Granular Recovery Technology:

1. On the master server, open the NetBackup Administration Console.
2. In the left pane, expand Host Properties.
3. Click Master Servers.
4. In the right pane, right-click the master server and click Properties.
5. Click General Server.
6. Uncheck "Enable message-level cataloging when duplicating Exchange images that use Granular Recovery Technology".
7. Click OK.
7 Click OK.