Staging area not clearing down when hitting high water mark.

Guinch · ‎06-14-2012

Hi and hope you can help.

First of all some details:

Master server:

Client/Master = Master

NetBackup Client Platform = PC, Windows2000

NetBackup Client Protocol Level = 6.5.4

Product = NetBackup

Version Name = 6.5

Version Number = 650000

Client OS/Release = Windows2003 5

Media Server (one example of, there are a few.):

Client/Master = Media Host

NetBackup Client Platform = RS6000, AIX5

NetBackup Client Protocol Level = 6.5.4

Product = NetBackup

Version Name = 6.5

Version Number = 650000

Client OS/Release = AIX 5.3

Now the problem is the DSSU does not clear down when the disk hits the high water mark. I need to go and manually identify the images that have been duplicated to tape and then expire them from disk.

It also doesn't seem to report a "disk full condition" when it should. For example, the disk was 100% full when I checked it first thing this morning, but it only reported a "disk full condition" a minute or so after I manually expired some of the images.

I have just taken over from someone so I have no idea how long this has been going on.

I have googled everything I can think of but can find no reference to anything even similar.

Has anyone experienced anything like this or does anyone have any ideas as to the cause/solution?

Thanks in advance.

Marianne · ‎06-14-2012

Can you verify that staging to tape is successful? How often are staging jobs scheduled?

Cleanup can only happen when images have been successfully staged or when image expiration is reached.

See this TN for DSSU cleanup behaviour: http://www.symantec.com/docs/TECH66149

Guinch · ‎06-14-2012

Yes, before manually expiring images I check the catalog and make sure the images that I'm expiring are on tape.

Staging is scheduled once every day.

watsons · ‎06-14-2012

The main problem with DSSU is usually this cleanup process: some images may not get duplicated properly, for various reasons, and are left there to fill up your storage. You then have to clean them up manually, which is quite frustrating.

Check out SLP and Vault duplication to see if they suit you better. Note that SLP does not support BasicDisk and Vault requires an additional license.

Marianne · ‎06-14-2012

Please show us the output of bpstulist -label <storage_unit_label> -L

Please also confirm that the DSU location is a dedicated volume used for disk backup only and not shared with anything else (e.g. software repository).

Is once a day sufficient to duplicate all disk backups before the next backup window? We normally recommend scheduling staging multiple times per day (e.g. every 2 hours).

Ensure bptm, bpdm and admin log folders exist on the media server to verify duplication and troubleshoot cleanup.
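
A quick way to confirm those folders exist is a small script. This is only a minimal Python sketch: it assumes the default UNIX legacy log location /usr/openv/netbackup/logs, and the function name missing_log_dirs is made up for illustration. Adjust LOG_ROOT if your install path differs.

```python
import os

# Assumption: default legacy log location on a UNIX media server.
LOG_ROOT = "/usr/openv/netbackup/logs"
NEEDED = ("bptm", "bpdm", "admin")


def missing_log_dirs(log_root=LOG_ROOT, needed=NEEDED):
    """Return the names in `needed` that are not directories under log_root."""
    return [name for name in needed
            if not os.path.isdir(os.path.join(log_root, name))]


if __name__ == "__main__":
    missing = missing_log_dirs()
    if missing:
        print("Missing log folders:", ", ".join(missing))
    else:
        print("All log folders present under", LOG_ROOT)
```

Any folder it reports as missing would need to be created before the corresponding process writes a legacy log.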

mph999 · ‎06-14-2012

If this turns out to be more than a config issue (unlikely) or just a misunderstanding ...

These would be the logs required to start with (there could be others, depending on what is found):

vxlogcfg -a -p 51216 -o 226 -s DebugLevel=4 -s DiagnosticLevel=6 (nbstserv / master)

vxlogcfg -a -p 51216 -o 272 -s DebugLevel=6 -s DiagnosticLevel=6 (expmgr / master)

vxlogcfg -a -p 51216 -o 222 -s DebugLevel=6 -s DiagnosticLevel=6 (nbrmms / media)

Basically, I would want to know if a line like this is received in the 272 log.

16/05/2011 14:50:22.215 [Diagnostic] NB 51216 expmgr 272 PID:10551 TID:10 File ID:226 [No context] 4 V-272-6 [ExpMgrEventConsumer::notify] High Water Mark event received

There should be something in nbemm (111) prior to this time ... For example :

The lines after this line are important (from emm)

[No context] 5 [RemNotifier::checkForHWM] stu_disk:Internal_16 DSM::freeSpace :: 1660561510400 : DSM::totalCapacity :: 10219510488576 :: Highwatermark 83

If you have this line it 'might' show an issue (not 100% sure).

[No context] 1 V-219-8 [RemNotifier::checkForHWM] Significant DB change detected
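
To make the numbers in the checkForHWM line easier to read, here is a minimal Python sketch that pulls out the free space, total capacity and high water mark and works out the percentage used. It assumes used space is simply total minus free and that this is what gets compared against the high water mark; the regular expression and the function name used_percent are illustrative and only written against the one sample line above.

```python
import re

# Sample line copied from the nbemm log excerpt above.
LINE = ("[No context] 5 [RemNotifier::checkForHWM] stu_disk:Internal_16 "
        "DSM::freeSpace :: 1660561510400 : DSM::totalCapacity :: "
        "10219510488576 :: Highwatermark 83")

PATTERN = re.compile(
    r"DSM::freeSpace\s*::\s*(\d+)\s*:\s*DSM::totalCapacity\s*::\s*(\d+)"
    r"\s*::\s*Highwatermark\s+(\d+)"
)


def used_percent(line):
    """Return (used_pct, high_water_mark), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    free, total, hwm = (int(g) for g in m.groups())
    # Assumption: used space = total capacity - free space.
    return 100.0 * (total - free) / total, hwm


used, hwm = used_percent(LINE)
print(f"used {used:.2f}% vs high water mark {hwm}% -> over: {used >= hwm}")
```

For the sample line this works out to roughly 83.75% used against a high water mark of 83, so by this arithmetic the disk in that example was just over the mark.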

There was an nbstserv bundle released for 6.5.6. I would upgrade to at least this version and apply the eTrack as listed in http://www.symantec.com/docs/TECH125272 (will require a support call).

As 6.x is nearing EOL, I would recommend upgrading to 7.x.

I've seen this exact issue, at NBU 7.0, which was resolved by eTrack 1878699

This issue was fixed in 7.1 onwards, and also in 6.5.6, which might suggest it can happen in 6.5.4

Martin

Guinch · ‎06-14-2012

Thanks Marianne,

bpstulist:

Label: XXXX-disk-staging_1

Storage Unit Type: Disk

Media Subtype: Basic (1)

Host Connection: cairdp1b

Concurrent Jobs: 150

On Demand Only: yes

Path: "/nbu/stage1"

Robot Type: (not robotic)

Max Fragment Size: 524288

Max MPX: 16

Stage data: yes

Block Sharing: no

File System Export: no

High Water Mark: 90

Low Water Mark: 80

Ok On Root: no
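
For anyone comparing several storage units, output like the above can be checked with a short script. This is a minimal Python sketch, assuming the output is a series of "Key: value" lines as posted; parse_stu and check_water_marks are hypothetical helper names, and the rule that the low mark should sit below the high mark is a plain sanity check rather than documented NetBackup behaviour.

```python
# Sample text: selected fields from the bpstulist -L output posted above.
SAMPLE = """\
Label: XXXX-disk-staging_1
Media Subtype: Basic (1)
Path: "/nbu/stage1"
Stage data: yes
High Water Mark: 90
Low Water Mark: 80
"""


def parse_stu(text):
    """Turn 'Key: value' lines into a dict, stripping quotes around values."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields


def check_water_marks(fields):
    """Return a list of problems found; an empty list means nothing obvious."""
    problems = []
    high = int(fields["High Water Mark"])
    low = int(fields["Low Water Mark"])
    if fields.get("Stage data") != "yes":
        problems.append("storage unit is not configured for staging")
    # Assumption: a sane config has 0 < low < high <= 100.
    if not 0 < low < high <= 100:
        problems.append(f"water marks look wrong: low={low}, high={high}")
    return problems


stu = parse_stu(SAMPLE)
print(stu["Path"], check_water_marks(stu) or "water marks look sane")
```

With the values posted here (staging enabled, high 90, low 80) the check finds nothing wrong, which fits the view that the basic storage unit settings are not the cause.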

I'll have a look at increasing the number of times we run staging but how would that cause the problem we are experiencing?

Guinch · ‎06-14-2012

I've just had a quick look through the bpdm log and found a lot of these:

03:45:45.089 [344266] <16> emmlib_ImageQueryFetchRaw: (0) fetchImages failed, emmError = 3000004, nbError = 0

03:45:45.089 [344266] <16> volume_cleanup: (-) Translating EMM_ERROR_CorbaException(3000004) to 25 in the NetBackup context

03:45:45.089 [344266] <2> volume_cleanup: emmlib_ImageQueryFetchRaw failed 25, attempting to get new image query handle

03:45:45.112 [344266] <16> emmlib_ImageQueryFetchRaw: (0) CORBA call threw exception. <system exception, ID 'IDL:omg.org/CORBA/OBJECT_NOT_EXIST:1.0' TAO exception, minor code = 0 (unknown location; unspecified errno), completed = NO

And when I say a lot, I mean about 20 to 30 are being logged every second. Today's log file is approx 0.5 GB so far.
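
To put a number on the error rate rather than eyeballing a half-gigabyte file, the failures can be counted per second. This is a minimal Python sketch, assuming each log line starts with an HH:MM:SS.mmm timestamp as in the excerpt above; errors_per_second is an illustrative name, and the search string is taken from the first error line.

```python
from collections import Counter

# Sample lines copied from the bpdm excerpt above.
SAMPLE = """\
03:45:45.089 [344266] <16> emmlib_ImageQueryFetchRaw: (0) fetchImages failed, emmError = 3000004, nbError = 0
03:45:45.089 [344266] <16> volume_cleanup: (-) Translating EMM_ERROR_CorbaException(3000004) to 25 in the NetBackup context
03:45:45.089 [344266] <2> volume_cleanup: emmlib_ImageQueryFetchRaw failed 25, attempting to get new image query handle
"""


def errors_per_second(lines, needle="fetchImages failed"):
    """Count lines containing `needle`, keyed by the HH:MM:SS part of the timestamp."""
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line[:8]] += 1  # first 8 characters are HH:MM:SS
    return counts


counts = errors_per_second(SAMPLE.splitlines())
for second, n in sorted(counts.items()):
    print(second, n)
```

The function accepts any iterable of lines, so an open file object for the day's bpdm log can be passed in directly without loading the whole file into memory.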

I'm off to google this but if anyone can point me in the right direction with this it would be appreciated.

Mark_Solutions · ‎06-14-2012

Sounds like your servers may be struggling and need some tuning or patching.

For the Windows ones you can try these:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

add a DWORD – TcpTimedWaitDelay - Decimal Value of 30

add a DWORD – MaxUserPort – Decimal Value 65534

HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\

add or edit DWORD - PoolUsageMaximum - Decimal value of 40

add or edit DWORD - PagedPoolSize Hex value of FFFFFFFF (this is 8 x F)

Also check your desktop heap settings. I have seen this cause the CORBA errors:

HKLM\System\CurrentControlSet\Control\Session Manager\SubSystems

There is a Windows key with a long string - see what values you have for the part reading similar to: Windows SharedSection=1024,12288,512

If it ends in 512 like the above then change it to 1024

All of the above apply to Master and Media Servers and all need a reboot.
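
The SharedSection change is easy to get wrong by hand in such a long string, so here is a minimal Python sketch of the edit. It only rewrites a string and does not touch the registry; the function name raise_third_shared_section is made up, and VALUE is just the short fragment quoted above, not a full registry value.

```python
import re

# Example fragment from the post above; the real value is a much longer string.
VALUE = "Windows SharedSection=1024,12288,512"


def raise_third_shared_section(value, minimum=1024):
    """Return `value` with the third SharedSection number raised to `minimum` if lower."""
    def fix(match):
        first, second, third = match.group(1).split(",")
        if int(third) < minimum:
            third = str(minimum)
        return "SharedSection=" + ",".join((first, second, third))
    return re.sub(r"SharedSection=(\d+,\d+,\d+)", fix, value)


print(raise_third_shared_section(VALUE))
```

The rest of the string is left untouched, and a value whose third number is already 1024 or higher comes back unchanged.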

See how it goes after the reboot - though you should go to at least 6.5.6 anyway for now to ensure you have the bug fixes applied

Hope this helps

mph999 · ‎06-14-2012

A similar but not quite exact error is listed in internal TN TECH69705

Eg

12:43:23.739 [1928.6180] <16> emmlib_ImageQueryFetchOneSeq: (0) CORBA call threw exception. <system exception, ID 'IDL:omg.org/CORBA/OBJECT_NOT_EXIST:1.0'TAO exception, minor code = 0 (unknown location; unspecified errno), completed = NO>
12:43:23.739 [1928.6180] <16> emmlib_ImageQueryFetchOneSeq: (0) fetchImages failed, emmError = 3000004, nbError = 0
12:43:23.739 [1928.6180] <16> emmlib_ImageQueryFetch: (0) emmlib_ImageQueryFetch failed, emmError = 3000004, nbError = 0
12:43:23.739 [1928.6180] <16> volume_cleanup: (-) Translating EMM_ERROR_CorbaException(3000004) to 25 in the NetBackup context
12:43:23.739 [1928.6180] <2> volume_cleanup: emmlib_ImageQueryFetch failed 25

This was fixed with eTrack 1542212

Also fixed in 6.5.6 and 7.0.1 and 7.1 onwards

Martin

mph999 · ‎06-14-2012

This is also along the same lines ...

http://www.symantec.com/docs/TECH74989

Following on from Mark's point, let's consider tuning/performance.

Between when this was working and now, when it fails, has there been a change in the amount of work the server has to do?

Martin
