05-17-2013 08:48 AM
Hi Forum,
Just wondering if the experts can help me troubleshoot this issue. I have logged a case, but support is taking a long time to respond and it seems they aren't sure how to approach this.
Here is the scenario, as seen from the Local Master:
Two master servers (Local and DR), with replication set up one way (Local to DR). Replication from the Local Master to the DR Master fails with status 84.
Details of error:
17/05/2013 10:34:38 - requesting resource LCM_STU-MSDP-Source
17/05/2013 10:34:38 - granted resource LCM_STU-MSDP-Source
17/05/2013 10:34:38 - started process RUNCMD (13884)
17/05/2013 10:34:39 - requesting resource @aaaac
17/05/2013 10:34:39 - reserving resource @aaaac
17/05/2013 10:34:39 - reserved resource @aaaac
17/05/2013 10:34:39 - granted resource MediaID=@aaaac;DiskVolume=PureDiskVolume;DiskPool=xxxxxxx-dp;Path=PureDiskVolume;StorageServer=xxxxx...
17/05/2013 10:34:42 - Info bpdm(pid=13900) started
17/05/2013 10:34:42 - started process bpdm (13900)
17/05/2013 10:34:47 - Info xxxxxxbak01(pid=13900) Using OpenStorage to replicate backup id xxxxx-nfs01xxxxx.xxxxxx.local_1368795600, media id @aaaac, storage server xxxxxx-bak01, disk volume PureDiskVolume
17/05/2013 10:34:47 - Info xxxxxxx-bak01(pid=13900) Replicating images to target storage server xxxxxxbak01, disk volume PureDiskVolume
17/05/2013 10:35:18 - Error bpdm(pid=13900) wait failed: error 2060014
17/05/2013 10:35:18 - Error bpdm(pid=13900) <async> cancel failed: error 2060001: one or more invalid arguments
17/05/2013 10:35:18 - Error bpdm(pid=13900) copy cancel failed: error 2060001
17/05/2013 10:35:18 - Info xxxxxxxbak01(pid=13900) StorageServer=PureDisk:xxxxxxx-bak01; Report=PDDO Stats for (xxxxxbak01): scanned: 4 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 100.0%
17/05/2013 10:35:18 - Error (pid=13884) ReplicationJob::Replicate: Replication failed for backup id xxxxxxxnfs01.xxxxxxxx_1368795600: media write error (84)
17/05/2013 10:35:18 Replicate failed for backup id xxxxnfs01.xxxxxxxx_1368795600 with status 84
17/05/2013 10:35:18 - end operation
From the DR Master:
5/17/2013 2:47:27 AM - requesting resource @aaaac
5/17/2013 2:47:27 AM - Error nbjm(pid=6492) NBU status: 2074, EMM status: Disk volume is down Disk volume is down(2074)
5/17/2013 2:47:27 AM - Error nbjm(pid=6492) NBU status: 2074, EMM status: Disk volume is down
When I look on the DR Master under Media and Device Management > Credentials > Storage Servers and open Properties or Replication, I get the following error:
"Internal application error. Please contact your administrator. invalid command parameter. RDSM has encountered an issue with STS where the server was not found: getDiskVolumeInfoList"
I have tried to set the disk volume to UP from the command prompt, but the same issue continues. Let me know which logs might be useful for you experts reading this.
Thanks in advance,
Luis S.
05-17-2013 09:06 AM
I take it you're using MSDP and AIR?
The spoold and storaged logs are the ones you want to look at when MSDP is down, located at <install_path>\msdp folder\logs\spoold.
05-17-2013 09:07 AM
Just concentrate on the DR site. It looks to be down and cannot even be connected to.
Tell us about the MSDP setup on this site so that we know where to start: O/S, NetBackup versions, etc.
Does the rest of NetBackup on that site look OK?
05-17-2013 09:12 AM
Mark,
DR Master/Media Server:
Windows Server 2008 R2
NBU 7.5.0.3
I am able to ping the Source from the Target, and vice versa.
Everything else, such as backups, is working fine. It's just the replication that seems to be the issue.
05-17-2013 09:15 AM
How is the DR Master for disk space on the de-dupe drive? If it has hit its high watermark, it may have filled up.
Run an All Log Entries report for the last few hours (since just before it started failing) to see what it started to report.
If in doubt, shut down NetBackup, make sure all services are stopped, and then reboot (this is the DR Master / Media server), but check through the logs first.
05-17-2013 09:27 AM
More info:
Here are some commands we have tried, but they have not worked.
C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevquery.exe -listdv -stype PureDisk -dp suslv-pwl-bak01-dp
V7.5 suslv-pwl-bak01-dp PureDisk PureDiskVolume @aaaac 6246.10 2202.37 64 0 0 1 0 0 54
C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevquery.exe -listdv -stype PureDisk -dp suslv-pwl-bak01-dp -U
Disk Pool Name : suslv-pwl-bak01-dp
Disk Type : PureDisk
Disk Volume Name : PureDiskVolume
Disk Media ID : @aaaac
Total Capacity (GB) : 6246.10
Free Space (GB) : 2202.37
Use% : 64
Status : DOWN
Flag : ReadOnWrite
Flag : AdminUp
Flag : InternalDown
Flag : ReplicationSource
Flag : ReplicationTarget
Num Read Mounts : 0
Num Write Mounts : 1
Cur Read Streams : 0
Cur Write Streams : 0
Num Repl Sources : 1
Num Repl Targets : 1
Replication Source : sustp-pwl-bak01:PureDiskVolume
Replication Target : suksh-pwl-bak01.tgptrading.glb.corp.local:PureDiskVolume
-----------------------------------------------------------------------------
C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevconfig -changestate -stype PureDisk -dp suslv-pwl-bak01-dp -dv PureDiskVolume -state UP
successfully changed the state of disk volume
-----------------------------------------------------------------------------
C:\Program Files\Veritas\NetBackup\bin\admincmd>nbdevquery.exe -listdv -stype PureDisk -dp suslv-pwl-bak01-dp -U
Disk Pool Name : suslv-pwl-bak01-dp
Disk Type : PureDisk
Disk Volume Name : PureDiskVolume
Disk Media ID : @aaaac
Total Capacity (GB) : 6246.10
Free Space (GB) : 2202.37
Use% : 64
Status : DOWN
Flag : ReadOnWrite
Flag : AdminUp
Flag : InternalDown
Flag : ReplicationSource
Flag : ReplicationTarget
Num Read Mounts : 0
Num Write Mounts : 1
Cur Read Streams : 0
Cur Write Streams : 0
Num Repl Sources : 1
Num Repl Targets : 1
Replication Source : sustp-pwl-bak01:PureDiskVolume
Replication Target : suksh-pwl-bak01.tgptrading.glb.corp.local:PureDiskVolume
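For anyone who wants to check this state from a script rather than by eye, here is a minimal Python sketch. It only parses text already captured from nbdevquery -listdv -U (the sample is trimmed from the listing above). The function name parse_listdv is hypothetical, and reading AdminUp together with InternalDown as "the admin state change was accepted but the volume is still held down internally" is an assumption based on the flag names, not on NetBackup documentation.

```python
def parse_listdv(text):
    """Parse 'Key : Value' lines from captured nbdevquery -listdv -U output.

    Repeated 'Flag' lines are collected into a list under 'Flags'.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if not sep:
            continue  # skip separators, prompts, and blank lines
        key, value = key.strip(), value.strip()
        if key == "Flag":
            info.setdefault("Flags", []).append(value)
        else:
            info[key] = value
    return info


# Sample trimmed from the output posted above.
SAMPLE = """\
Disk Pool Name : suslv-pwl-bak01-dp
Disk Volume Name : PureDiskVolume
Status : DOWN
Flag : ReadOnWrite
Flag : AdminUp
Flag : InternalDown
"""

volume = parse_listdv(SAMPLE)
flags = volume.get("Flags", [])
# Assumption: AdminUp + InternalDown with Status DOWN means the -state UP
# request was accepted, but the storage server still reports the volume down.
stuck = (volume.get("Status") == "DOWN"
         and "AdminUp" in flags and "InternalDown" in flags)
print(volume["Disk Volume Name"], volume["Status"], "stuck:", stuck)
```

If that reading is right, it would explain why nbdevconfig -changestate reports success while the status stays DOWN: the problem sits on the storage server side, not in the admin state.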
05-17-2013 09:30 AM
Actually, yes. The issue started right after we ran out of space on the DeDupe Backup Volume. Replications failed, so we expanded the volume (added 2 TB) and rebooted both servers. Since that incident the replications have been failing constantly.
05-23-2013 04:03 AM
You need to take a look at the spoold and storaged logs.
I am guessing (if you haven't sorted this out by now) that queue processing was stopped and won't resume because of the disk-full condition, which stops the whole site working. You can correct this by manually running queue processing, but it may need a reboot first.
It could also be a corrupt file causing the blockage. Again, the logs should point to something here.
So it would help to see those two logs, and also the output of:
crcontrol --getmode
which will tell us its current state.
05-23-2013 07:30 AM
Thanks for your reply, Mark. Here is part of the logs you requested. I only copied a part because there was a lot of info in there. Hope that helps.
Let me know if there is anything else you need from me.
-Luis
05-23-2013 07:40 AM
May 22 14:59:59 ERR [00000000006E9040]: 25002: _storeCheckContainers: container index file F:\dedup_data\data/68.bhd is missing
Can you call a Symantec engineer for support?
I think it's a serious problem.
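Since the spoold log is large, a small script can pull out just the lines like this one. Below is a minimal Python sketch. The line layout in LINE_RE is an assumption inferred from the single line quoted above, the function name missing_containers is hypothetical, and the path pattern will not match paths that contain spaces.

```python
import re

# Assumed layout, inferred from the one line quoted above:
# "<Mon DD HH:MM:SS> <LEVEL> [<thread>]: <code>: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<level>\w+) "
    r"\[(?P<thread>[0-9A-Fa-f]+)\]: (?P<code>\d+): (?P<message>.*)$"
)
MISSING_RE = re.compile(r"container index file (?P<path>\S+) is missing")


def missing_containers(lines):
    """Return the container index paths reported missing in ERR lines."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERR":
            continue  # not a recognised log line, or not an error
        hit = MISSING_RE.search(m.group("message"))
        if hit:
            found.append(hit.group("path"))
    return found


sample = [
    r"May 22 14:59:59 ERR [00000000006E9040]: 25002: _storeCheckContainers: "
    r"container index file F:\dedup_data\data/68.bhd is missing",
]
print(missing_containers(sample))
```

To run it against a real log, pass an open file object instead of the sample list, for example missing_containers(open(path_to_spoold_log)). A list of every missing container file is useful to hand to support.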
05-23-2013 07:52 AM
Would I be able to copy that missing file from the Source server to the Target server (where the file is missing) and then restart the server to see if that fixes the issue?
05-23-2013 08:23 AM
Error : Connection failed connection actively refused. Note that the content router needs to be running to get a connection.
Is spoold running? I would think not. This missing file may be causing an inconsistency. I would advise raising a support call.
If you knew the location the file was missing from (there could be a few), you could always create the bhd file, then try starting the deduplication engine / manager, and then run recoverCR to check for problems and inconsistencies, BUT please call support.
05-23-2013 08:40 AM
I have seen a similar issue to this, and you really should get support involved to be on the safe side. I wouldn't want to give you the wrong information.
Basically, you have some corruption, and this has shut down the PureDisk system, so it won't work. That is why the getmode command gives no output.
The processes that get it going cannot run while that file is missing. Once that has been dealt with, the processes will be able to start, and it should then fire itself into life within about 30 minutes or so (it will have a lot of catching up to do).
What needs cleaning up is the actual reference to that missing file, so that it doesn't keep looking for it and can start up, but get support to do this for you.
The storaged log doesn't go far enough to get anything out of it. I was looking for the last part before it all stopped working, but I guess support will sort this out for you now.
Keep us updated.
05-23-2013 09:00 AM
It may be possible to use the RecoverCR tool, but see what support recommends.
05-23-2013 09:30 AM
Cannot get Spoold to start. I get the following error.
C:\Program Files\Veritas\pdde>spoold.exe --start
Error: 67: Database: connection to database crdb at localhost:10085 failed (could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" and accepting
TCP/IP connections on port 10085?
)
Error: 53: Database Manager: could not access storage database (connection actively refused).
Error: 4: Failed to run database class
Error: 4: Failed to read startup mode
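The first error boils down to nothing accepting TCP connections on localhost port 10085. A quick way to confirm that independently of spoold is a plain socket test. This is a minimal Python sketch: the function name port_accepts_connections is hypothetical, and the port number is simply the one from the error message above. It only shows whether something is listening, not why the database is down.

```python
import socket


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (Windows error 10061) and timeouts.
        return False


if __name__ == "__main__":
    # Port 10085 is taken from the spoold error message above.
    print("crdb listening:", port_accepts_connections("localhost", 10085))
```

A False result here matches the "Connection refused" in the error, which points at the database behind spoold not running rather than at spoold itself.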
05-24-2013 01:42 AM
Nothing will start until it has been repaired.
The referenced missing file shows that you have corruption in the system, and that will stop anything from working.
As I said, the RecoverCR tool may help you out, but I would log a call with support and show them the missing file line. An engineer may be able to remove the reference to it and get you up and running, or RecoverCR may need to be run, but take advice from support first.
05-24-2013 06:01 AM
Yes, as Mark's solution suggested, I took advice from a Symantec engineer before running the recovery tool.
05-25-2013 05:51 AM
09-20-2013 04:55 AM
I have been facing a similar issue for the last 4 days. Are there any updates on the above issue?
Was tech support able to fix the issue for you?
09-20-2013 05:00 AM
Aravind - please start your own thread with full details so that we can try and help you there