
Staging area not clearing down when hitting high water mark.

Guinch
Level 3

Hi and hope you can help.

First of all some details:

Master server:

 

Client/Master = Master
NetBackup Client Platform = PC, Windows2000
NetBackup Client Protocol Level = 6.5.4
Product = NetBackup
Version Name = 6.5
Version Number = 650000
Client OS/Release = Windows2003 5 
 
 
 
Media Server (one example of, there are a few.):
 
Client/Master = Media Host
NetBackup Client Platform = RS6000, AIX5
NetBackup Client Protocol Level = 6.5.4
Product = NetBackup
Version Name = 6.5
Version Number = 650000
Client OS/Release = AIX 5.3

 

Now the problem is the DSSU does not clear down when the disk hits the high water mark. I need to go and manually identify the images that have been duplicated to tape and then expire them from disk.

It also doesn't seem to report a "disk full condition" when it should. For example, the disk was 100% full when I checked it first thing this morning, but it only reported a "disk full condition" a minute or so after I manually expired some of the images.

I have just taken over from someone so I have no idea how long this has been going on.

I have googled everything I can think of but can find no reference to anything even similar.

Has anyone experienced anything like this or does anyone have any ideas as to the cause/solution?

 

Thanks in advance.

10 REPLIES

Marianne
Level 6
Partner    VIP    Accredited Certified

Can you verify that staging to tape is successful? How often are staging jobs scheduled?

Cleanup can only happen when images have been successfully staged or when image expiration is reached.

See this TN for DSSU cleanup behaviour: http://www.symantec.com/docs/TECH66149
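
As a rough illustration of the rule stated above (a disk image only becomes a cleanup candidate once it has been staged successfully or has reached its expiration), here is a minimal Python sketch. The `DiskImage` class and its fields are hypothetical and not part of NetBackup; real image data would have to come from the catalog.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DiskImage:
    # Hypothetical record; the field names are illustrative only.
    backup_id: str
    staged_to_tape: bool   # a successful duplicate exists on tape
    expires_at: datetime   # expiration time of the disk copy


def cleanup_candidates(images, now):
    """Return the images that may be removed from disk under the stated rule."""
    return [
        img for img in images
        if img.staged_to_tape or img.expires_at <= now
    ]
```

An image that is neither staged nor expired never appears in the result, which is why failed staging jobs leave the disk full.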

 

Guinch
Level 3

Yes, before manually expiring images I check the catalog and make sure the images that I'm expiring are on tape.

Staging is scheduled once every day.

watsons
Level 6

The main problem with DSSU is usually this cleanup process: some images may not get duplicated properly for various reasons and are left there to fill up your storage. You then have to clean them up manually, which is quite frustrating.

Check out SLP and Vault duplication to see if they suit you better for duplication. Note that SLP does not support BasicDisk and Vault requires an additional license.

Marianne
Level 6
Partner    VIP    Accredited Certified

Please show us output of bpstulist -label <storage_unit_label> -L

Please also confirm that DSU location is a dedicated volume used for disk backup only and not shared with anything else (e.g. software repository).

Is once a day sufficient to duplicate all disk backups before next backup window? We normally recommend to schedule multiple times per day (e.g. every 2 hours).

Ensure bptm, bpdm and admin log folders exist on the media server to verify duplication and troubleshoot cleanup.

mph999
Level 6
Employee Accredited

 

If this turns out to be more than a config issue (unlikely) or just a misunderstanding ...

These would be the logs required to start with (there could be others, depends what is found)

vxlogcfg -a -p 51216 -o 226 -s DebugLevel=4 -s DiagnosticLevel=6  (nbstserv / master)

 

vxlogcfg -a -p 51216 -o 272 -s DebugLevel=6 -s DiagnosticLevel=6  (expmgr / master)
vxlogcfg -a -p 51216 -o 222 -s DebugLevel=6 -s DiagnosticLevel=6  (nbrmms / media)
 
Basically, I would want to know if a line like this is received in the 272 log.
 
16/05/2011 14:50:22.215 [Diagnostic] NB 51216 expmgr 272 PID:10551 TID:10 File ID:226 [No context] 4 V-272-6 [ExpMgrEventConsumer::notify] High Water Mark event received
 
There should be something in nbemm (111) prior to this time ... For example :
 
The lines after this line are important (from emm)
 
[No context] 5 [RemNotifier::checkForHWM] stu_disk:Internal_16 DSM::freeSpace :: 1660561510400 : DSM::totalCapacity :: 10219510488576 ::  Highwatermark 83
 
If you have this line it 'might' show an issue (not 100% sure).
 
[No context] 1 V-219-8 [RemNotifier::checkForHWM] Significant DB change detected              
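
To compare the numbers in a checkForHWM line like the one quoted above against the high water mark, something along these lines could be used. This is only a sketch: it assumes the used percentage is simply (total - free) / total, which may not be exactly how NetBackup calculates it, and the regular expression is based solely on the one sample line shown here.

```python
import re

# Pattern derived from the single sample line in this thread (assumption).
HWM_LINE = re.compile(
    r"DSM::freeSpace\s*::\s*(?P<free>\d+)\s*:\s*"
    r"DSM::totalCapacity\s*::\s*(?P<total>\d+)\s*::\s*"
    r"Highwatermark\s+(?P<hwm>\d+)"
)


def used_percent(line):
    """Return (used_percent, high_water_mark), or None if the line does not match."""
    m = HWM_LINE.search(line)
    if m is None:
        return None
    free, total, hwm = int(m["free"]), int(m["total"]), int(m["hwm"])
    return 100.0 * (total - free) / total, hwm


sample = (
    "[No context] 5 [RemNotifier::checkForHWM] stu_disk:Internal_16 "
    "DSM::freeSpace :: 1660561510400 : DSM::totalCapacity :: "
    "10219510488576 ::  Highwatermark 83"
)
used, hwm = used_percent(sample)
print(f"used {used:.2f}% vs high water mark {hwm}%")
```

For the sample line the used figure comes out just above the 83% mark, so a High Water Mark event would be expected shortly afterwards.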
 
There was an nbstserv bundle released for 6.5.6 - I would upgrade to at least this version and apply the eTrack as listed in http://www.symantec.com/docs/TECH125272  (will require a support call).
 
As 6.x is nearing EOL, I would recommend upgrading to 7.x.
 
I've seen this exact issue, at NBU 7.0, which was resolved by eTrack 1878699
 
This issue was fixed in 7.1 onwards, and also in 6.5.6, which might suggest it can happen in 6.5.4
 
Martin

Guinch
Level 3

 

Thanks Marianne,
 
bpstulist:
 
Label:                XXXX-disk-staging_1
Storage Unit Type:    Disk
Media Subtype:        Basic (1)
Host Connection:      cairdp1b
Concurrent Jobs:      150
On Demand Only:       yes
Path:                 "/nbu/stage1"
Robot Type:           (not robotic)
Max Fragment Size:    524288
Max MPX:              16
Stage data:           yes
Block Sharing:        no
File System Export:   no
High Water Mark:      90
Low Water Mark:       80
Ok On Root:           no
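
If the bpstulist -L output needs checking on several media servers, the "Key: value" layout shown above is easy to read programmatically. A minimal Python sketch, assuming the output always has this one-field-per-line layout (only the fields from the listing above are used; the function name is made up for illustration):

```python
def parse_bpstulist(text):
    """Parse 'Key: value' lines into a dict, stripping surrounding quotes."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or lines without a colon
        fields[key.strip()] = value.strip().strip('"')
    return fields


sample = """\
Label:                XXXX-disk-staging_1
Storage Unit Type:    Disk
Path:                 "/nbu/stage1"
Stage data:           yes
High Water Mark:      90
Low Water Mark:       80
"""

stu = parse_bpstulist(sample)
high = int(stu["High Water Mark"])
low = int(stu["Low Water Mark"])
print(stu["Label"], stu["Path"], high, low)
```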
 
 
I'll have a look at increasing the number of times we run staging but how would that cause the problem we are experiencing?
 
 

Guinch
Level 3

 

I've just had a quick look through the bpdm log and found a lot of these:
 
03:45:45.089 [344266] <16> emmlib_ImageQueryFetchRaw: (0) fetchImages failed, emmError = 3000004, nbError = 0
03:45:45.089 [344266] <16> volume_cleanup: (-) Translating EMM_ERROR_CorbaException(3000004) to 25 in the NetBackup context
03:45:45.089 [344266] <2> volume_cleanup: emmlib_ImageQueryFetchRaw failed 25, attempting to get new image query handle
03:45:45.112 [344266] <16> emmlib_ImageQueryFetchRaw: (0) CORBA call threw exception. <system exception, ID 'IDL:omg.org/CORBA/OBJECT_NOT_EXIST:1.0'
TAO exception, minor code = 0 (unknown location; unspecified errno), completed = NO
 
And when I say a lot, I mean about 20 to 30 are getting logged every second. Today's logfile is approx. 0.5 GB so far.
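
To put a number on the logging rate, the failures can be counted per second straight from the bpdm log. This is a minimal Python sketch that assumes the timestamp and severity layout of the lines quoted above; it only counts severity <16> lines containing "emmError = 3000004".

```python
import re
from collections import Counter

# HH:MM:SS.mmm [pid] <16> ... as in the bpdm lines quoted above.
ERROR_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2})\.\d{3} \[[\d.]+\] <16> ")


def errors_per_second(lines):
    """Count <16> lines containing 'emmError = 3000004', keyed by HH:MM:SS."""
    counts = Counter()
    for line in lines:
        m = ERROR_LINE.match(line)
        if m and "emmError = 3000004" in line:
            counts[m.group(1)] += 1
    return counts


sample = [
    "03:45:45.089 [344266] <16> emmlib_ImageQueryFetchRaw: (0) fetchImages failed, emmError = 3000004, nbError = 0",
    "03:45:45.089 [344266] <16> volume_cleanup: (-) Translating EMM_ERROR_CorbaException(3000004) to 25 in the NetBackup context",
    "03:45:45.089 [344266] <2> volume_cleanup: emmlib_ImageQueryFetchRaw failed 25, attempting to get new image query handle",
]
for second, count in sorted(errors_per_second(sample).items()):
    print(second, count)
```

For a real log, pass an open file object instead of the sample list; the log path depends on the installation, so it is left out here.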
 
I'm off to google this but if anyone can point me in the right direction with this it would be appreciated.

Mark_Solutions
Level 6
Partner Accredited Certified

Sounds like your servers may be struggling and need some tuning or patching.

For the Windows ones you can try these:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

add a DWORD – TcpTimedWaitDelay  - Decimal Value of 30

add a DWORD – MaxUserPort – Decimal Value 65534

HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\

 add or edit DWORD - PoolUsageMaximum  - Decimal value of 40

 add or edit DWORD - PagedPoolSize Hex value of FFFFFFFF (this is 8 x F)

Also check your desktop heap settings - I have seen this cause the CORBA errors:

HKLM\System\CurrentControlSet\Control\Session Manager\SubSystems

There is a Windows key with a long string - see what values you have for the part reading similar to: Windows SharedSection=1024,12288,512

If it ends in 512 like the above then change it to 1024
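
For anyone scripting this check, the third SharedSection number can be inspected and adjusted as plain text before the value is written back. A minimal Python sketch that only manipulates the string and does not touch the registry; the function name is made up, and it raises any third value below 1024 rather than only an exact 512.

```python
import re


def bump_desktop_heap(value, new_kb=1024):
    """Raise the third SharedSection number to new_kb if it is currently lower."""
    def repl(m):
        third = max(int(m.group(3)), new_kb)
        return f"SharedSection={m.group(1)},{m.group(2)},{third}"
    return re.sub(r"SharedSection=(\d+),(\d+),(\d+)", repl, value)


print(bump_desktop_heap("Windows SharedSection=1024,12288,512"))
```

A value that is already 1024 or higher is returned unchanged, so the function is safe to run repeatedly.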

All of the above apply to Master and Media Servers and all need a reboot.

See how it goes after the reboot - though you should go to at least 6.5.6 anyway for now to ensure you have the bug fixes applied

Hope this helps

mph999
Level 6
Employee Accredited

A similar but not quite exact error is listed in internal TN TECH69705

Eg

 

12:43:23.739 [1928.6180] <16> emmlib_ImageQueryFetchOneSeq: (0) CORBA call threw exception. <system exception, ID 'IDL:omg.org/CORBA/OBJECT_NOT_EXIST:1.0'TAO exception, minor code = 0 (unknown location; unspecified errno), completed = NO>
12:43:23.739 [1928.6180] <16> emmlib_ImageQueryFetchOneSeq: (0) fetchImages failed, emmError = 3000004, nbError = 0
12:43:23.739 [1928.6180] <16> emmlib_ImageQueryFetch: (0) emmlib_ImageQueryFetch failed, emmError = 3000004, nbError = 0
12:43:23.739 [1928.6180] <16> volume_cleanup: (-) Translating EMM_ERROR_CorbaException(3000004) to 25 in the NetBackup context
12:43:23.739 [1928.6180] <2> volume_cleanup: emmlib_ImageQueryFetch failed 25

This was fixed with eTrack 1542212

Also fixed in 6.5.6 and 7.0.1 and 7.1 onwards

Martin

mph999
Level 6
Employee Accredited

This is also along the same lines  ...

http://www.symantec.com/docs/TECH74989

 

Following on from Mark's point - let's consider tuning/performance.

Between when this was working and now, when it fails, has there been a change in the amount of work the server has to do?

Martin