
Exchange GRT backup and bpclntcmd -pn

Alexis_Jeldrez
Level 6
Partner    VIP    Accredited Certified

Summary:
Our Exchange DAG consists of four Mailbox nodes; two of them are being backed up fine with GRT backups, but the other two are giving us trouble.
I did some troubleshooting and found out that basic commands like "bpclntcmd -pn" and "bplist" don't work properly on the two bad DAG nodes, while the same commands work as expected on the two good DAG nodes.
We tested basic connectivity by backing up "C:\Windows\system32\drivers\etc\" to a basic disk: the two good DAG nodes finished the backup in less than ten minutes, while the other two took about 35 minutes; I have no idea what took them so long (it was only a couple of kilobytes of data).


In detail:

This is not a new implementation, but a three-year-old platform that was somehow working. Recent changes in the network required a revision of the Exchange backups.

  • All servers (NetBackup Master, NetBackup Media and Exchange Mailbox) are Windows 2008 R2 SP1 with NetBackup 7.5.0.6.
  • All Exchange backups are going to a disk with Deduplication in the Media Server.
  • Full database backups without GRT work fine: all jobs end with status 0.
  • Full database backups with GRT work fine on two DAG nodes; on the other two, the backup ends with status 1.
  • I tried backing up both passive and active copies, but the problem didn't seem to be related in any way to the mail databases (which are small, less than 100 GB each).
  • I don't believe there's a problem in the Media Server either, because it's already receiving the Exchange GRT backups from the other DAG nodes without errors.
  • So, I believe the problem is in the two "bad" DAG nodes.
  • I followed the guide in the NetBackup for Exchange manual and checked the NFS services. I verified that the Exchange account is being used for the three NetBackup services on all the DAG nodes.

I didn't find anything strange, so I started checking the logs.

According to the detailed status of the backup job:

5/9/2014 3:13:39 PM - Info bpbrm(pid=4260) DB_BACKUP_STATUS is 0          
5/9/2014 3:13:52 PM - Info bpbrm(pid=4260) from client dag1: TRV - Starting granular backup processing for (/Microsoft Exchange Database Availability Groups/dag1\Microsoft Information Store:\DB012\). This may take a while...
5/9/2014 3:17:26 PM - Error bpbrm(pid=4260) from client dag1: ERR - Unable to obtain image list. Confirm this client has access to images on server 'svwin002' for client 'dag1'
5/9/2014 3:17:26 PM - Error bpbrm(pid=4260) from client dag1: ERR - Refer to the 'Configuring the hosts for backups that use Granular Recovery Technology (GRT) in virtual environments' section of the Exchange Administrator's Guide for more information
5/9/2014 3:17:26 PM - Info bpbrm(pid=4260) from client dag1: TRV - Granular processing failed!     
5/9/2014 3:17:26 PM - Error bpbrm(pid=4260) from client dag1: ERR - Error encountered while attempting to get additional files for Microsoft Information Store:\DB012\Logs_1399658938\
5/9/2014 3:17:26 PM - Error bpbrm(pid=4260) from client dag1: ERR - Exchange granular restore from this image may not work.
5/9/2014 3:17:33 PM - Info bptm(pid=5856) waited for full buffer 15994 times, delayed 83299 times    
5/9/2014 3:17:47 PM - Info bptm(pid=5856) EXITING with status 0 <----------        
5/9/2014 3:17:48 PM - Info svwin003.uai.cl(pid=5856) StorageServer=PureDisk:svwin003.uai.cl; Report=PDDO Stats for (svwin003.uai.cl): scanned: 22043160 KB, CR sent: 15286366 KB, CR sent over FC: 0 KB, dedup: 30.7%
5/9/2014 3:17:51 PM - Info bpbrm(pid=4260) validating image for client dag1        
5/9/2014 3:17:55 PM - Info bpbkar32(pid=7280) done. status: 1: the requested operation was partially successful   

I noticed that a "status = 25" appeared in the parent job that creates the snapshot, so I checked the BPBRM log in the Media Server:

13:58:40.247 [6620.3856] <2> get_bpfis_msg: bpfis buff: ERR - Send bpfis state to svwin002 failed. status = 25
13:58:40.247 [6620.3856] <16> get_bpfis_msg: from client VMWIN209: ERR - Send bpfis state to svwin002 failed. status = 25
13:58:40.247 [6620.3856] <2> set_job_details: Tfile (144817): LOG 1400608720 16 bpbrm 6620 from client VMWIN209: ERR - Send bpfis state to svwin002 failed. status = 25
13:58:40.247 [6620.3856] <2> send_job_file: job ID 144817, ftype = 3 msg len = 106, msg = LOG 1400608720 16 bpbrm 6620 from client VMWIN209: ERR - Send bpfis state to svwin002 failed. status = 25

The BPFIS log in the client:

13:58:40.388 [9044.6016] <8> async_connect: [vnet_connect.c:1653] getsockopt SO_ERROR returned 10060 0x274c
13:58:40.388 [9044.6016] <16> connect_to_service: connect failed STATUS (18) CONNECT_FAILED
    status: FAILED, (42) CONNECT_REFUSED; system: (10060) Connection timed out.; FROM 10.2.1.159 TO svwin002 192.168.5.10 bprd VIA pbx
    status: FAILED, (42) CONNECT_REFUSED; system: (10060) Connection timed out.; FROM 10.2.1.159 TO svwin002 192.168.5.10 bprd VIA vnetd
    status: FAILED, (42) CONNECT_REFUSED; system: (10060) Connection timed out.; FROM 10.2.1.159 TO svwin002 192.168.5.10 bprd
13:58:40.388 [9044.6016] <8> vnet_connect_to_service: [vnet_connect.c:215] connect_to_service() failed 18 0x12
13:58:40.388 [9044.6016] <2> bprd_connect: vnet_connect_to_service(svwin002) failed: 18
13:58:40.388 [9044.6016] <2> bprd_bpfis_state_xfer_to_server_rqst: bprd_connect(svwin002) failed, cannot connect on socket (25)
13:58:40.388 [9044.6016] <16> store_bpfis_state_on_master: ERR - Send bpfis state to svwin002 failed. status = 25

Only then did I realize that basic commands like "bpclntcmd -pn" or "bplist" were not working properly on these two DAG nodes.
The command "bpclntcmd -pn" gives no output, and "bplist" returns a "status 25" error.
I then checked the client BPCD log, and it showed the same error seen in the BPFIS log:

14:04:08.772 [1324.5188] <16> connect_to_service: connect failed STATUS (18) CONNECT_FAILED
    status: FAILED, (42) CONNECT_REFUSED; system: (10060) Connection timed out.; FROM 10.2.1.156 TO svwin002.uai.cl 192.168.5.10 bprd VIA pbx
    status: FAILED, (42) CONNECT_REFUSED; system: (10060) Connection timed out.; FROM 10.2.1.156 TO svwin002.uai.cl 192.168.5.10 bprd VIA vnetd
    status: FAILED, (42) CONNECT_REFUSED; system: (10060) Connection timed out.; FROM 10.2.1.156 TO svwin002.uai.cl 192.168.5.10 bprd
14:04:08.772 [1324.5188] <8> vnet_connect_to_service: [vnet_connect.c:215] connect_to_service() failed 18 0x12
14:04:08.772 [1324.5188] <2> bprd_connect: vnet_connect_to_service(svwin002.uai.cl) failed: 18
14:04:08.772 [1324.5188] <16> send_request: Can't connect to host svwin002.uai.cl: cannot connect on socket (25)
14:04:08.772 [1324.5188] <4> bplist:exit(): EXIT STATUS 25: cannot connect on socket

I used to believe that if "bpclntcmd -pn" didn't work, then backups wouldn't work either.
But these two servers had their Exchange databases backed up fine every day, and if it weren't for the errors during the GRT cataloging, we would never have found this issue.

So I checked the hostnames of everything, checked the servers registered in the GUI and in the registry, checked NSLOOKUP and "bpclntcmd" with "-hn" and "-ip", checked the hosts tables, and refreshed everything with "ipconfig /flushdns" and "bpclntcmd -clear_host_cache".
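For anyone repeating these checks, the forward/reverse lookup consistency that "bpclntcmd -hn" and "bpclntcmd -ip" report can also be sketched with a short script. This is only a generic DNS sanity check, not a replacement for the NetBackup commands, and "svwin002" below is just a placeholder for your server name:

```python
# Sketch of a forward/reverse DNS consistency check, similar in spirit to
# comparing "bpclntcmd -hn <name>" against "bpclntcmd -ip <addr>".
# "svwin002" is a placeholder; substitute your Master/Media server name.
import socket

def check_name(hostname):
    """Resolve a name, then reverse-resolve each address and compare."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    except socket.gaierror as err:
        print(f"forward lookup of {hostname} failed: {err}")
        return False
    ok = True
    for addr in sorted({info[4][0] for info in infos}):
        try:
            rev_name, _, _ = socket.gethostbyaddr(addr)
            print(f"{hostname} -> {addr} -> {rev_name}")
            # Compare short names; a mismatch hints at stale DNS/hosts entries.
            if hostname.split('.')[0].lower() != rev_name.split('.')[0].lower():
                print("  WARNING: reverse lookup does not match the name")
                ok = False
        except socket.herror as err:
            print(f"reverse lookup of {addr} failed: {err}")
            ok = False
    return ok

check_name("svwin002")  # placeholder name
```

If forward and reverse lookups disagree between the public and backup networks, that mismatch is exactly the kind of thing that confuses client-to-master connections.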
I got nothing.
The "Windows Firewall" service is disabled on all DAG nodes, and I couldn't find a problem doing telnet to ports 13782, 13720, 1556, etc.
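As an alternative to manual telnet tests, the same port checks can be scripted. The sketch below only confirms that a TCP connection can be opened to the standard NetBackup ports (1556 PBX, 13720 bprd, 13724 vnetd, 13782 bpcd); it does not speak any NetBackup protocol, and "svwin002" is again a placeholder:

```python
# TCP reachability probe for the standard NetBackup ports, as an
# alternative to manual telnet tests. This only confirms that a TCP
# connection can be opened; it does not perform a NetBackup handshake.
import socket

NBU_PORTS = {1556: "pbx", 13720: "bprd", 13724: "vnetd", 13782: "bpcd"}

def probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe_all(host):
    results = {}
    for port, service in NBU_PORTS.items():
        ok = probe(host, port)
        results[port] = ok
        print(f"{host}:{port} ({service}) -> {'open' if ok else 'FAILED'}")
    return results

probe_all("svwin002")  # placeholder for the Master/Media server name
```

Run it from each DAG node toward the Master and Media servers; a node where a port fails only over one network would point at the routing/interface problem.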
Also, there are at least two networks: one is public and the other is dedicated to backups. Both the NetBackup servers and the Exchange servers can see both networks, and the hostnames are associated with the public IPs through DNS.
To force the data traffic through the backup network without disrupting the Exchange service, we added entries to the clients' hosts tables associating the Master and Media server names with their backup IPs.
For the other direction, we used the "required interface" setting in Host Properties on each of the DAG nodes. Exchange backups are indeed using the backup network for data traffic.
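For reference, the client-side hosts entries look roughly like this. The 192.168.5.10 address for svwin002 matches what appears in the logs above, but the svwin003 address and the exact aliases are illustrative, not our real values:

```
# C:\Windows\System32\drivers\etc\hosts on each DAG node:
# map the NetBackup server names to their backup-network IPs.
# (svwin003's address is illustrative; adjust to your environment)
192.168.5.10    svwin002    svwin002.uai.cl
192.168.5.11    svwin003    svwin003.uai.cl
```

Both the short name and the FQDN are listed on each line, since the logs show NetBackup using both forms.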

I'm running out of ideas. I'm thinking of reinstalling the NetBackup client software on these two nodes and hoping for a miracle. I could also ask the VMware team to check the DAG nodes' interfaces again.
Any ideas are welcome.


PS: Sorry if the times, dates and hostnames don't seem coherent; I ran a lot of tests and only saved some of them. I can test and collect data again if needed.

3 REPLIES

Linette_V
Level 4
Employee Accredited Certified

How frustrating!!

I'm thinking you are right: it probably is a name resolution/routing issue somewhere, since we see the connection being refused. I also notice that some of the logs use the short name and others the FQDN. Do you have both names defined in the hosts files you mentioned?

It is possible for "-pn" to fail and backups to still work, depending on which server actually initiates the backup. The bprd log may help you find out why "-pn" is failing on the client. This technote is really detailed about the steps involved in resolving a "bpclntcmd -pn" call, and I'd recommend having a look at it from the client side: http://www.symantec.com/docs/TECH50198

I hope that helps, and good luck!

Alexis_Jeldrez
Level 6
Partner    VIP    Accredited Certified

Reinstalling the client software solved the problem with "bpclntcmd -pn" and "bplist".

Apparently, this virtual machine was cloned and the NetBackup installation got corrupted somehow.

I noticed that after uninstalling 7.5.0.6, the version downgraded to 7.5.0.4 (and not directly to 7.5.0.0, as I was expecting). I thought you could only do that on UNIX/Linux. Anyway, I uninstalled everything related to NetBackup, deleted the C:\Program Files\Veritas\ folder, and reinstalled.

Alexis_Jeldrez
Level 6
Partner    VIP    Accredited Certified

OK, the problem was actually resolved when I deleted the entry in "Use specified network interface" on the two problematic DAG servers. Immediately, the "bpclntcmd -pn" and "bplist" errors were gone and the GRT Exchange backup worked fine. Reinstalling the software only helped because it deleted that network interface entry.

I still don't know why this was producing an error. I guess it was because the NetBackup servers and the clients were both using a production network and a backup-only network, and DNS was publishing the production network IPs. But only two DAG nodes were affected; the other two never had any trouble, and they were using "Use specified network interface" without any issues.

Backups worked well and they used the backup-only network, so everything's fine now.