06-12-2012 09:12 AM
We have two Exchange clusters running CCR replication. On one of them, backups of eight mailbox stores are failing, and nothing is being logged in the event log on Exchange itself. When the failure occurs I am seeing errors like the following in the NetBackup logs. Right now we are okay, but these stores will rapidly fill our log drive if we cannot capture a full backup.
09:48:16.433 [16344.8960] <2> bpcd exit_bpcd: exit status 0 ----------->exiting
09:48:16.526 [16492.14496] <2> process_requests: BPCD_READ_HOST_CONFIG_RQST
09:48:16.916 [16492.14496] <2> process_requests: BPCD_DISCONNECT_RQST
09:48:16.916 [16492.14496] <2> bpcd exit_bpcd: exit status 0 ----------->exiting
09:48:17.088 [4692.4696] <2> vnet_pbxAcceptSocket: Accepted sock[492] from 150.228.21.26:3181
09:48:17.088 [4692.4696] <2> bpcd main: accept sock = 492
09:48:17.166 [12492.11244] <2> setup_debug_log: switched debug log file for bpcd
09:48:17.166 [12492.11244] <2> bpcd main: VERBOSE = 0
09:48:17.166 [12492.11244] <2> logparams: C:\Program Files\VERITAS\NetBackup\bin\bpcd.exe -standalone
09:48:17.166 [12492.11244] <2> process_requests: offset to GMT 21600
09:48:17.166 [12492.11244] <2> logconnections: BPCD ACCEPT FROM 150.228.21.26.3181 TO 10.228.12.1.1556 fd = 492
09:48:17.166 [12492.11244] <2> process_requests: setup_sockopts complete
09:48:17.182 [12492.11244] <2> bpcd peer_hostname: Connection from host stptape2ms.adminsys.mrll.com (150.228.21.26) port 3181
09:48:17.182 [12492.11244] <2> bpcd valid_server: comparing master and stptape2ms.adminsys.mrll.com
09:48:19.490 [12492.11244] <2> hosts_equal: gethostbyname failed for master: The requested name is valid, but no data of the requested type was found.
(0)
09:48:19.490 [12492.11244] <2> bpcd valid_server: comparing media2 and stptape2ms.adminsys.mrll.com
09:48:21.799 [12492.11244] <2> hosts_equal: gethostbyname failed for media2: The requested name is valid, but no data of the requested type was found.
(0)
09:48:21.799 [12492.11244] <2> bpcd valid_server: comparing media3 and stptape2ms.adminsys.mrll.com
09:48:24.108 [12492.11244] <2> hosts_equal: gethostbyname failed for media3: The requested name is valid, but no data of the requested type was found.
(0)
09:48:24.108 [12492.11244] <2> bpcd valid_server: comparing media4 and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> hosts_equal: gethostbyname failed for media4: The requested name is valid, but no data of the requested type was found.
(0)
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing osiris.bck.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing osiris.stp.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing bertape2ms.bck.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing lax1pbe02.bck.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing stp2pbe01.bck.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing stp2pbe01.stp.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing stptape1ms.bck.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing stptape1ms.stp.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing stptape2ms.bck.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <2> bpcd valid_server: comparing stptape2ms.stp.mrll.com and stptape2ms.adminsys.mrll.com
09:48:26.417 [12492.11244] <4> bpcd valid_server: hostname comparison succeeded
09:48:26.417 [12492.11244] <2> process_requests: output socket port number = 1
09:48:26.448 [12492.11244] <2> process_requests: Duplicated vnetd socket on stderr
09:48:26.448 [12492.11244] <2> process_requests: <---- NetBackup 7.0 0 ------------initiated
09:48:26.448 [12492.11244] <2> process_requests: VERBOSE = 0
09:48:26.448 [12492.11244] <2> process_requests: Not using VxSS authentication with stptape2ms.adminsys.mrll.com
09:48:27.150 [12492.11244] <2> process_requests: BPCD_GET_VERSION_RQST
09:48:27.353 [12492.11244] <2> process_requests: BPCD_GET_VERSION_RQST
09:48:27.556 [12492.11244] <2> process_requests: BPCD_READ_USERS_CONFIG_RQST
09:48:27.758 [12492.11244] <2> process_requests: BPCD_GET_VERSION_RQST
09:48:27.961 [12492.11244] <2> process_requests: BPCD_READ_HOST_CONFIG_RQST
09:48:28.164 [12492.11244] <2> process_requests: BPCD_DISCONNECT_RQST
09:48:28.164 [12492.11244] <2> bpcd exit_bpcd: exit status 0 ----------->exiting
09:49:32.249 [4692.4696] <2> vnet_pbxAcceptSocket: Accepted sock[492] from 150.228.21.26:3323
09:49:32.249 [4692.4696] <2> bpcd main: accept sock = 492
06-12-2012 09:26 AM
So, has anything changed recently (active node, DNS, patches, etc.)?
Have you checked that the NetBackup client service is running on all nodes and that replication is up to date?
The early errors indicate name resolution or network problems, so check all of those things.
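If you want to check the lookups yourself outside of NetBackup: the hosts_equal failures in your log come from gethostbyname calls that bpcd makes while walking the client's server list (the error text looks like Winsock's "valid name, no data of the requested type" result, i.e. the name half-resolves). A rough Python sketch of that loop, where the server names are placeholders for whatever is in your own list:

```python
import socket

def check_resolution(names):
    """Try a forward lookup on each name, roughly what bpcd's
    hosts_equal does, and report which ones fail to resolve."""
    results = {}
    for name in names:
        try:
            # socket.gethostbyname wraps the same gethostbyname()
            # call that the bpcd log shows failing
            results[name] = socket.gethostbyname(name)
        except socket.gaierror as err:
            results[name] = "FAILED: %s" % err
    return results

if __name__ == "__main__":
    # "master" and "media2" are placeholders from the log above;
    # substitute the entries from your client's server list.
    for name, result in sorted(check_resolution(["master", "media2", "localhost"]).items()):
        print("%-10s -> %s" % (name, result))
```

Any name that comes back FAILED here is a name bpcd will stall on for a couple of seconds per entry, which matches the ~2.3 second gaps between the valid_server lines in your log.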
06-12-2012 09:41 AM
The active nodes have not changed, and I am not aware of any DNS changes. There were some patches installed on the 2nd, which is when the issues started on these particular stores. These patches were installed on all four of the mailbox servers, and the stores having issues are not the only stores on the cluster.
Get-StorageGroupCopyStatus indicates all stores are healthy. We have also checked the network cards to ensure they are not experiencing errors. I have included the information on the last backups of the problematic stores.
StorageGroupName : Ar-Az
LatestFullBackupTime : 6/5/2012 1:29:11 PM
LatestIncrementalBackupTime :
SnapshotBackup : True

StorageGroupName : Ba-Bo
LatestFullBackupTime : 6/1/2012 6:10:28 AM
LatestIncrementalBackupTime : 6/2/2012 10:01:05 AM
SnapshotBackup : True

StorageGroupName : Bp-Bz
LatestFullBackupTime : 6/1/2012 6:13:31 AM
LatestIncrementalBackupTime : 6/6/2012 8:03:28 PM
SnapshotBackup : True

StorageGroupName : Ca-Ck
LatestFullBackupTime : 6/5/2012 1:28:46 PM
LatestIncrementalBackupTime :
SnapshotBackup : True

StorageGroupName : Cl-Cz
LatestFullBackupTime : 5/29/2012 12:03:39 PM
LatestIncrementalBackupTime : 6/2/2012 6:25:41 AM
SnapshotBackup : True

StorageGroupName : Dj-Dz
LatestFullBackupTime : 5/29/2012 6:42:48 AM
LatestIncrementalBackupTime : 6/2/2012 6:49:59 AM
SnapshotBackup : True
06-12-2012 10:03 AM
06-12-2012 10:47 AM
Here are the bpcd and bpbkar logs. The storage team is the only group that has access to the job details; my knowledge is more on the Exchange side. When I got them engaged on the issue, they said the following. I had suspected it had something to do with the client on this particular box.
"The list of Master/Media servers for both nodes in EXHA1 contained several bogus entries, including one in the 1st position which is reserved for the Master server.
I have removed these bogus entries and I’m going to try one of the backups again. I am not certain, but the discrepancy in server lists could have been a contributing factor."
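As I understand it from them, the list they cleaned up is the client's server list, which is exactly what the valid_server loop in the bpcd log above is walking entry by entry, and the first entry must be the master server. Roughly, with hypothetical host names, a healthy list would look like:

```
SERVER = masterserver.example.com
SERVER = mediaserver1.example.com
SERVER = mediaserver2.example.com
CLIENT_NAME = exchnode1.example.com
```

The "master", "media2", etc. names in our log that failed to resolve were apparently the bogus entries they removed.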
After these changes we are 8% done with a backup on one of the stores, so we will see how that goes. Before, they were failing instantly.
06-13-2012 01:40 AM
The bpcd log is huge! I'm not even going to try to open it.
This does imply something was going mad on this client, which could have been caused by your bogus entries.
Update us after the backups finish, and then see if it starts to create smaller bpcd logs, indicating that things have settled a little.