
NBU 7.5.0.3 - Replication Fail(Status 84)

lsaldana1201
Level 3

Hi Forum,

 

Just wondering if the experts can help me troubleshoot this issue. I have logged a case, but support takes a long time to respond and they don't seem sure how to approach it.

Here is the scenario:

Two master servers (Local and DR), with replication set up one way (Local to DR). Replication from the local master to the DR master fails with status 84.

Details of the error, from the Local Master:

 

17/05/2013 10:34:38 - requesting resource LCM_STU-MSDP-Source
17/05/2013 10:34:38 - granted resource LCM_STU-MSDP-Source
17/05/2013 10:34:38 - started process RUNCMD (13884)
17/05/2013 10:34:39 - requesting resource @aaaac
17/05/2013 10:34:39 - reserving resource @aaaac
17/05/2013 10:34:39 - reserved resource @aaaac
17/05/2013 10:34:39 - granted resource MediaID=@aaaac;DiskVolume=PureDiskVolume;DiskPool=xxxxxxx-dp;Path=PureDiskVolume;StorageServer=xxxxx...
17/05/2013 10:34:42 - Info bpdm(pid=13900) started           
17/05/2013 10:34:42 - started process bpdm (13900)
17/05/2013 10:34:47 - Info xxxxxxbak01(pid=13900) Using OpenStorage to replicate backup id xxxxx-nfs01xxxxx.xxxxxx.local_1368795600, media id @aaaac, storage server xxxxxx-bak01, disk volume PureDiskVolume
17/05/2013 10:34:47 - Info xxxxxxx-bak01(pid=13900) Replicating images to target storage server xxxxxxbak01, disk volume PureDiskVolume  
17/05/2013 10:35:18 - Error bpdm(pid=13900) wait failed: error 2060014        
17/05/2013 10:35:18 - Error bpdm(pid=13900) <async> cancel failed: error 2060001: one or more invalid arguments  
17/05/2013 10:35:18 - Error bpdm(pid=13900) copy cancel failed: error 2060001       
17/05/2013 10:35:18 - Info xxxxxxxbak01(pid=13900) StorageServer=PureDisk:xxxxxxx-bak01; Report=PDDO Stats for (xxxxxbak01): scanned: 4 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 100.0%
17/05/2013 10:35:18 - Error (pid=13884) ReplicationJob::Replicate: Replication failed for backup id xxxxxxxnfs01.xxxxxxxx_1368795600: media write error (84) 
17/05/2013 10:35:18Replicate failed for backup id xxxxnfs01.xxxxxxxx_1368795600 with status 84
17/05/2013 10:35:18 - end operation

From the DR Master:

5/17/2013 2:47:27 AM - requesting resource @aaaac
5/17/2013 2:47:27 AM - Error nbjm(pid=6492) NBU status: 2074, EMM status: Disk volume is down   
Disk volume is down(2074)
5/17/2013 2:47:27 AM - Error nbjm(pid=6492) NBU status: 2074, EMM status: Disk volume is down
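With job details this long, it can help to pull out only the Error entries before comparing the two sites. Below is a minimal Python sketch, assuming the job details have been copied into a text string; the function name extract_errors and the regular expression are illustrative and not part of NetBackup.

```python
import re

# Matches job-detail lines such as:
#   17/05/2013 10:35:18 - Error bpdm(pid=13900) wait failed: error 2060014
# The process name may be empty, as in " - Error (pid=13884) ...".
ERROR_LINE = re.compile(r" - Error (?P<who>\S*?)\(pid=(?P<pid>\d+)\) (?P<msg>.*)")

def extract_errors(job_details):
    """Return (process, pid, message) for every Error line in the text."""
    found = []
    for line in job_details.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            found.append(
                (match.group("who"), int(match.group("pid")), match.group("msg").strip())
            )
    return found

# Three lines copied from the job details above.
sample = """\
17/05/2013 10:34:42 - started process bpdm (13900)
17/05/2013 10:35:18 - Error bpdm(pid=13900) wait failed: error 2060014
5/17/2013 2:47:27 AM - Error nbjm(pid=6492) NBU status: 2074, EMM status: Disk volume is down
"""

for who, pid, msg in extract_errors(sample):
    print(who, pid, msg)
```

Read this way, the local master's status 84 lines up with the DR master's status 2074 (disk volume is down), which is why the replies focus on the DR site.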

On the DR Master, under Media and Device Mgmt > Credentials > Storage Servers, opening either Properties or Replication gives the following error:

"Internal application error. Please contact your administrator. invalid comman parameter. RDSM has encountered an issue with STS where the server was not found: getDiskVolumeInfoList"

 

I have tried setting the disk volume to UP from the command prompt, but the same issue continues. Let me know which logs would be useful for you experts reading this.

 

Thanks in advance,

 

Luis S.

 

19 REPLIES

Jaykullar
Level 5

I take it you're using MSDP and AIR?

For an MSDP volume that is down, spoold and storaged are the logs you want to look at, located in <install_path>\msdp folder\logs\spoold.

Mark_Solutions
Level 6
Partner Accredited Certified

Just concentrate on the DR site - it looks to be down and cannot even be connected to

Tell us about the MSDP setup on this site so that we know where to start - O/S, NetBackup Versions etc.

Does the rest of NetBackup on that site look OK?

lsaldana1201
Level 3

Mark,

DR Master/Media Server:

Windows Server 2008 R2

NBU 7.5.0.3

I am able to ping the Source from the Target, and vice versa.

Everything else, such as backups, is working fine. It's just the replication that seems to be the issue.

 

Mark_Solutions
Level 6
Partner Accredited Certified

How is the DR Master for disk space on the de-dupe drive? If it has hit its high watermark it may have filled up

Run an All Log Entries Report for the last few hours (since just before it started failing) to see what it started to report.

If in doubt, shut down NetBackup, make sure all services are stopped, and then reboot (this is the DR Master / Media) - but check through the logs first
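As a rough way to check the disk-space question from the operating system side, here is a minimal Python sketch. It only looks at filesystem usage, which is not necessarily the same figure NetBackup reports for the disk pool. The function names and the watermark percentage passed in are hypothetical; the real high watermark is whatever is configured on the disk pool.

```python
import shutil

def percent_used(total_bytes, free_bytes):
    """Percentage of the volume that is in use."""
    return 100.0 * (total_bytes - free_bytes) / total_bytes

def at_or_above_watermark(total_bytes, free_bytes, high_watermark_pct):
    """True if usage has reached the given high watermark percentage."""
    return percent_used(total_bytes, free_bytes) >= high_watermark_pct

def check_volume(path, high_watermark_pct):
    """Check a live volume, e.g. the drive holding the de-dupe data."""
    usage = shutil.disk_usage(path)
    return at_or_above_watermark(usage.total, usage.free, high_watermark_pct)

# Example with made-up numbers: 1000 units total, 10 free, watermark 98 percent.
print(at_or_above_watermark(1000, 10, 98.0))
```

On the DR server, check_volume would be called with the drive letter of the de-dupe volume and the watermark value taken from the disk pool properties.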

lsaldana1201
Level 3

More info:

 

Here are some commands we have tried, but they have not worked.

 

C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevquery.exe -listdv -stype PureDisk -dp suslv-pwl-bak01-dp

V7.5 suslv-pwl-bak01-dp PureDisk PureDiskVolume @aaaac 6246.10 2202.37 64 0 0 1 0 0 54

C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevquery.exe -listdv -stype PureDisk -dp suslv-pwl-bak01-dp -U

 

Disk Pool Name      : suslv-pwl-bak01-dp
Disk Type           : PureDisk
Disk Volume Name    : PureDiskVolume
Disk Media ID       : @aaaac
Total Capacity (GB) : 6246.10
Free Space (GB)     : 2202.37
Use%                : 64
Status              : DOWN
Flag                : ReadOnWrite
Flag                : AdminUp
Flag                : InternalDown
Flag                : ReplicationSource
Flag                : ReplicationTarget
Num Read Mounts     : 0
Num Write Mounts    : 1
Cur Read Streams    : 0
Cur Write Streams   : 0
Num Repl Sources    : 1
Num Repl Targets    : 1
Replication Source  : sustp-pwl-bak01:PureDiskVolume
Replication Target  : suksh-pwl-bak01.tgptrading.glb.corp.local:PureDiskVolume

 

-----------------------------------------------------------------------------

 

C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevconfig -changestate -stype PureDisk -dp suslv-pwl-bak01-dp -dv PureDiskVolume -state UP

 

successfully changed the state of disk volume

 

 

-----------------------------------------------------------------------------

 

 

C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevquery.exe -listdv -stype PureDisk -dp suslv-pwl-bak01-dp -U

 

Disk Pool Name      : suslv-pwl-bak01-dp
Disk Type           : PureDisk
Disk Volume Name    : PureDiskVolume
Disk Media ID       : @aaaac
Total Capacity (GB) : 6246.10
Free Space (GB)     : 2202.37
Use%                : 64
Status              : DOWN
Flag                : ReadOnWrite
Flag                : AdminUp
Flag                : InternalDown
Flag                : ReplicationSource
Flag                : ReplicationTarget
Num Read Mounts     : 0
Num Write Mounts    : 1
Cur Read Streams    : 0
Cur Write Streams   : 0
Num Repl Sources    : 1
Num Repl Targets    : 1
Replication Source  : sustp-pwl-bak01:PureDiskVolume
Replication Target  : suksh-pwl-bak01.tgptrading.glb.corp.local:PureDiskVolume
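The useful detail in the -U listing is the combination of flags: the volume still shows AdminUp alongside InternalDown after the state change. Below is a minimal Python sketch that parses a saved copy of this output and reports that combination. The function names are illustrative, and the reading of the flags (AdminUp as the administrator-set state, InternalDown as the state NetBackup set itself) is my assumption from the flag names.

```python
def parse_listdv(text):
    """Parse 'Key : Value' lines from nbdevquery -listdv -U output.

    Repeated 'Flag' lines are collected into a list under 'flags'.
    """
    info = {"flags": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a colon
        key, value = key.strip(), value.strip()
        if key == "Flag":
            info["flags"].append(value)
        else:
            info[key] = value
    return info

def diagnose(info):
    """Summarise the volume state in one sentence."""
    if info.get("Status") != "DOWN":
        return "volume is not reported as DOWN"
    if "AdminUp" in info["flags"] and "InternalDown" in info["flags"]:
        return "admin state is UP, but the volume is internally marked DOWN"
    return "volume is DOWN"

# A few lines copied from the listing above.
sample = """\
Disk Volume Name    : PureDiskVolume
Status              : DOWN
Flag                : AdminUp
Flag                : InternalDown
Replication Source  : sustp-pwl-bak01:PureDiskVolume
"""

print(diagnose(parse_listdv(sample)))
```

If that reading of the flags is right, it would explain why the -changestate command reports success while the status stays DOWN: the administrative state was already UP, and whatever is holding the volume down sits underneath it.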

lsaldana1201
Level 3

Actually, yes. The issue started right after we ran out of space on the dedupe backup volume. Replications failed, so we expanded the volume (added 2 TB) and rebooted both servers. Since that incident, replications have been failing constantly.

Mark_Solutions
Level 6
Partner Accredited Certified

You need to take a look at the spoold and storaged logs

I am guessing (if you haven't sorted this out by now) that queue processing was stopped and won't resume because of the disk full condition - this just stops the site working! - you can correct this by manually running the queue processing, but it may need a reboot first.

It could also be a corrupt file causing the blockage - again the logs should point to something here.

So it would help to see those two logs, plus the output of:

crcontrol --getmode

Which will tell us its current state

lsaldana1201
Level 3

Thanks for your reply, Mark. Here is part of the logs you requested. I only copied part of them because there was a lot of information in there. Hope that helps.

Let me know if there is anything else you need from me.

 

-Luis

huanglao2002
Level 6

May 22 14:59:59 ERR [00000000006E9040]: 25002: _storeCheckContainers: container index file F:\dedup_data\data/68.bhd is missing
 

Can you call a Symantec engineer for support?

I think it's a serious problem
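To see whether 68.bhd is the only index file missing, a read-only check of the data directory may help before talking to support. Below is a minimal Python sketch, assuming each container in the data directory is stored as a pair of files with the same number, one .bin and one .bhd. That pairing is my assumption about the layout, and the function name is illustrative. The sketch only lists names and changes nothing on disk.

```python
import os

def missing_index_files(filenames):
    """Return the .bhd names that have a .bin file but no matching .bhd."""
    names = {name.lower() for name in filenames}
    missing = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        if ext == ".bin" and stem + ".bhd" not in names:
            missing.append(stem + ".bhd")
    return missing

# Example with made-up directory contents.
print(missing_index_files(["67.bin", "67.bhd", "68.bin"]))

# On the DR server this could be run against the path from the log line:
#   missing_index_files(os.listdir(r"F:\dedup_data\data"))
```

The result is only a list to hand to the support engineer; it does not repair anything.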

lsaldana1201
Level 3

Would I be able to copy that missing file from the Source server to the Target server (where the file is missing) and then restart the server to see if that fixes the issue?

Jaykullar
Level 5

Error : Connection failed connection actively refused. Note that the content router needs to be running to get a connection.

Is spoold running? I would think not; this missing file may be causing an inconsistency. I would advise raising a support call.

If you knew the location the file was missing from (there could be a few), you could always create the bhd file, then try starting the deduplication engine / manager, then run recoverCR to check for problems and inconsistencies, BUT please call support.

Mark_Solutions
Level 6
Partner Accredited Certified

I have seen a similar issue to this - and you really should get support involved to be on the safe side - I wouldn't want to give you the wrong information

Basically you have a little corruption involved and this has shut down the PureDisk system - so it won't work - hence the getmode command not giving any output

The processes to get it going cannot run as that file is missing - once that has been dealt with the processes will be able to start and then it should fire itself into life within about 30 minutes or so (it will have a lot of catching up to do)

What needs cleaning up is the actual reference to that missing file, so that it doesn't keep looking for it and can start up - but get support to do this for you.

The storaged log doesn't go far enough to get anything out of it - I was looking for the last part before it all stopped working - but I guess support will sort this for you now

Keep us updated

Mark_Solutions
Level 6
Partner Accredited Certified

It may be possible to use the RecoverCR Tool - but see what support recommend

lsaldana1201
Level 3

Cannot get Spoold to start. I get the following error.

C:\Program Files\Veritas\pdde>spoold.exe --start
Error: 67: Database: connection to database crdb at localhost:10085 failed (could not connect to server: Connection refused (0x0000274D/10061)
        Is the server running on host "localhost" and accepting
        TCP/IP connections on port 10085?
)
Error: 53: Database Manager: could not access storage database (connection actively refused).
Error: 4: Failed to run database class
Error: 4: Failed to read startup mode
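The error says nothing is accepting connections on localhost port 10085. A quick way to confirm that independently of spoold is a plain TCP connect test. Below is a minimal Python sketch; the function name port_is_open is illustrative, and the host and port are taken from the error message above.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False

if __name__ == "__main__":
    # Host and port as reported in the spoold error.
    print(port_is_open("localhost", 10085))
```

If this prints False, the database that spoold depends on is not listening, which matches the "connection actively refused" message. It does not say why the database is down.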

 

 

Mark_Solutions
Level 6
Partner Accredited Certified

Nothing will start until it has been repaired

The referenced missing file shows that you have corruption in the system and that will stop anything from working

As I said, the RecoverCR tool may help you out, but I would log a call with support and show them the missing file line - an engineer may be able to remove the reference to it and get you up and running, or it may need the RecoverCR tool - but take advice from support first

inn_kam
Level 6
Partner Accredited

Yes, as Mark_Solutions suggested, take advice from a Symantec engineer before running the recovery tool.

 

huanglao2002
Level 6

Aravind_babu
Not applicable

I have been facing a similar issue for the last 4 days. Are there any updates on the above issue?

Was tech support able to fix the issue for you?

Mark_Solutions
Level 6
Partner Accredited Certified

Aravind - please start your own thread with full details so that we can try and help you there