Inter Site SLP fails but Intra Site SLP succeeds

Alun
Level 4

I have two datacenters, with a clustered master server node, media servers and an SSO connected tape library in each.

I migrated our server infrastructure from older hardware running Windows Server 2008 R2 to newer servers running Windows Server 2012 R2 (master server cluster nodes) or 2016 (media servers). The IP addresses from the old media servers were reused for the new servers.

Since the re-platform, existing SLP duplications between the two datacenters fail.

The required media is loaded into the tape drives in each site, the server hosting the images to be duplicated queues the restore, the target server queues the backup job but the two servers never successfully initiate the communication channel required for the duplication to proceed.

If I create an SLP to duplicate from server 1 to server 2 in the same site, the duplication completes successfully.

The required media is loaded into the tape drives in the site, the server hosting the images to be duplicated queues the restore, the target server queues the backup job, the servers establish communication and the duplication successfully completes.

Can anyone explain the actual processes that initiate SLP duplication, what the process flow is and what to look for when comparing the differences between the successful intra-site and the unsuccessful inter-site duplications?

Thanks,

Alun

1 ACCEPTED SOLUTION

Alun
Level 4

We were never able to resolve this issue satisfactorily. Instead, we chose to duplicate within each datacenter and completed all of the outstanding SLPs in that fashion.

Thanks for your advice and assistance, apologies for not replying sooner.

18 REPLIES

davidmoline
Level 6
Employee

Hi @Alun 

Can you confirm you are looking for the differences between SLP controlled duplication within one site (which is working) and AIR replication between sites (which is not working)?

If nothing else has changed other than the re-platform, then I'd first look at firewall configurations (the local Windows firewall). How did you go about the re-platform work, i.e. what process did you follow for both the master server cluster and the media servers?

Have you used the nbstlutil command to determine what the state of the SLP managed image is?

David

Hi David,

I'm referring to SLP controlled duplication between media servers in the same physical location and also SLP controlled duplication between media servers in two physical locations (we don't use AIR).

In theory, the networking is no different between the new servers in both datacenters. Disabling the Windows firewall makes no difference to the success or failure of the SLP duplications.

New servers were built on new hardware with newer OSs and new IP addresses. Once the old media servers were removed, the new media servers were allocated the IP addresses from the old media servers.

The master server cluster nodes were replaced by performing in-place OS upgrades on the old nodes, adding the two new nodes into the cluster, installing NetBackup on them, failing over between each to confirm that they were all working as anticipated, and then finally removing the old nodes from the cluster.

Alun

Marianne
Moderator
Partner    VIP    Accredited Certified

@Alun 

Can you please show us all text in Job Details of a failed duplication?

This will tell us which processes and PIDs on master and media servers to troubleshoot.

If you do not want to display hostnames, please replace real names with generic names, e.g.
master, media1, media2.

Ensure that the following legacy log folders exist:
On the master: admin (I don't think more legacy logs are needed on the master).
On the media servers: bpbrm, bptm, bpdm.
Increase the logging level to 3 (level 3 is sufficient for this forum; if you log a call with Veritas Support, they will ask for level 5).

Depending on what we see in Job Details, we will know which logs to check.

@Marianne 

Job Details for two duplications are attached. Job01 reported a different error from Job02 (and Job03, which I omitted).

The errors reported in Job02 are caused by me terminating the bpduplicate.exe processes on the master server after almost 60 minutes of inactivity in the job log (no conflicting jobs are running that would prevent the duplications from accessing their requested tapes or drives).

This is the behaviour that is repeatable whenever SLP secondary operations are enabled that require media servers in different datacenters.

Marianne
Moderator
Partner    VIP    Accredited Certified

@Alun 

This is the only error I see in the 1st log:

 Error bptm (pid=4284) cannot connect to the writing side process for duplication

This looks like comms failure between the 2 media servers.

What I do not understand is where PID 4284 is coming from - probably a child process of a parent bptm process.
We will need full bptm log on at least the source media server.

Please check and verify comms between the 2 media servers with bptestbpcd.
Ensure that the bpcd log folder exists on both media servers to troubleshoot connectivity issues.

Do this on each media server:
bptestbpcd -client <remote-mediaserver> -verbose -debug
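
A minimal sketch of how the test above could be scripted so that it is run the same way from each media server. It assumes Python is installed on the media servers and that `bptestbpcd` is on the PATH (otherwise pass the full path as `exe`). The function names and the "exit code 0 means success" check are assumptions made for illustration, so read the returned output rather than relying on the boolean alone.

```python
import subprocess


def build_bptestbpcd_command(remote_media_server, exe="bptestbpcd"):
    """Return the argument list for the connectivity test quoted above."""
    return [exe, "-client", remote_media_server, "-verbose", "-debug"]


def run_bptestbpcd(remote_media_server, exe="bptestbpcd", timeout=300):
    """Run the test and return (ok, combined_output).

    Assumption: a zero exit code is treated as success.
    """
    cmd = build_bptestbpcd_command(remote_media_server, exe)
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Binary not found, not executable, or the test hung.
        return False, f"{type(exc).__name__}: {exc}"
    return done.returncode == 0, done.stdout + done.stderr


# Example (run on media1, then run the reverse on media2):
#   ok, output = run_bptestbpcd("media2")
#   print(ok)
#   print(output)
```

Running it on both servers and comparing the two outputs side by side makes a one-directional failure easier to spot.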

Another way to test is to have each media server perform a small backup of the other, to check 'regular' client-server comms between the 2 media servers.

@Marianne 

BPTESTBPCD failed.

I checked the server that generated the error. It had an invalid IP address for the server in the other datacenter; I've corrected this and cleared the cache.

BPTESTBPCD succeeded.

I re-enabled secondary operations and the jobs ran in the same fashion: tapes were loaded and then communication ceased.

I have uploaded the job logs and bptm logs from all three servers as well as the Admin log from the Master server. I have anonymised the IP addresses, replacing them with SOURCESERVER1IPADDRESS, SOURCESERVER2IPADDRESS, TARGETSERVERIPADDRESS, MASTERSERVERIPADDRESS, MASTERSERVERNODEIPADDRESS and MASTERSERVERCLUSTERIPADDRESS as appropriate.

As before, the duplication jobs ran for over an hour before erroring. At this point I terminated the bpduplicate.exe processes, which triggered the tapes to be unloaded and deallocated.

Marianne
Moderator
Partner    VIP    Accredited Certified

@Alun 

I will look at logs when time permits (hectic work day...)

Hopefully someone else will have a look.
The alternative is to log a Support call with Veritas...

Marianne
Moderator
Partner    VIP    Accredited Certified

@Alun 

I'm getting error 404 when trying to access attachment...

Could you please try to attach the logs individually, without zipping them?

Marianne
Moderator
Partner    VIP    Accredited Certified

@JustineVelcich 

Could you please ask someone to investigate?

All VOX attachments are now producing Error 404.

Attachments could be viewed / downloaded last week.

Thanks!

@Marianne, apologies, I've been battling some other issues today.

I tried to upload these individually last week, but the duplication job logs caused an extension / content conflict and were rejected. They've done it again and have been stripped from the attachments list...

Duplication job logs

davidmoline
Level 6
Employee

Hi @Alun 

So you have two media servers, gbs0015652 & gbs1500491. Can you confirm that you can run bptestbpcd from each to the other successfully?

What appears to be happening is that the duplication is set up correctly by the master and each media server then gets ready for the next steps, but for some reason the inter-media-server communication doesn't seem to happen. The first apparent error (connection reset, error 10054) in the logs occurs at 16:27, which I gather is when you terminated the duplication.

I recall you indicating there was an IP address problem at one end - have you confirmed two way NetBackup comms from each end?

David

Hi @davidmoline,

BPTESTBPCDs are successful (attached).

"I recall you indicating there was an IP address problem at one end - have you confirmed two way NetBackup comms from each end?"

The IP address issue was resolved, although duplications have continued to fail. What is the best way to test two-way NetBackup comms apart from using BPTESTBPCD?

Thanks, Alun

Marianne
Moderator
Partner    VIP    Accredited Certified

@Alun 

I would try to perform small backups in each direction:
the target media server performs a small backup of the source media server as the client, and vice versa.

Ensure that the bpcd (for client-side comms) and bpbrm (for media server comms) log folders exist on both servers.

I've started to look at logs.
Long story short - target media server mounted and positioned the tape, and then sat and waited for data to arrive:

14:56:23.739 [9548.844] <2> io_write_media_header: drive index 2, writing media header
14:56:28.535 [9548.844] <4> write_backup: begin writing backup id gbvm019993_1612555220, copy 2, fragment 1, to media id 200354 on drive KSP-LTO5-03 (index 2)
14:56:28.535 [9548.844] <2> process_brm_msg: no pending message from bpbrm
14:56:28.535 [9548.844] <4> write_backup: waiting for client data or brm message

I cannot see evidence in bptm that any data arrived...
14:56 was the last entry written by PID 9548.
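
A minimal sketch of how the "last entry written by a PID" check could be automated across a large bptm log. It assumes every log line has the same layout as the excerpt above (time, `[pid.tid]`, `<level>`, function name, message) and that lines appear in chronological order. Reading the second number in the brackets as a thread ID is an assumption; the function name `last_entry_per_pid` is illustrative only.

```python
import re

# Layout assumed from the excerpt: HH:MM:SS.mmm [pid.tid] <level> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<level>\d+)> "
    r"(?P<func>\w+): (?P<msg>.*)$"
)

SAMPLE = [
    "14:56:23.739 [9548.844] <2> io_write_media_header: drive index 2, writing media header",
    "14:56:28.535 [9548.844] <4> write_backup: begin writing backup id gbvm019993_1612555220, "
    "copy 2, fragment 1, to media id 200354 on drive KSP-LTO5-03 (index 2)",
    "14:56:28.535 [9548.844] <2> process_brm_msg: no pending message from bpbrm",
    "14:56:28.535 [9548.844] <4> write_backup: waiting for client data or brm message",
]


def last_entry_per_pid(lines):
    """Map each PID to the (time, function, message) of its last matching line."""
    last = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the assumed layout are skipped
            last[int(match.group("pid"))] = (
                match.group("time"),
                match.group("func"),
                match.group("msg"),
            )
    return last


if __name__ == "__main__":
    for pid, (time, func, msg) in last_entry_per_pid(SAMPLE).items():
        print(pid, time, func, msg)
```

Comparing the last timestamp per PID on the source and target media servers shows which side stopped logging first.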

*** EDIT ***

One more thing - SLP to duplicate from tape to tape between media servers is not the norm.
Most SLP duplications between media servers are from dedupe storage to dedupe storage to ensure optimized (only changed blocks) duplication.
I have tried to follow the SLP process flow in the NBU Logging Guide to see what other media server processes are involved. I could only find bptm (for tape). My logic says that bpbrm should be involved as well for initial comms.

Hopefully there will be evidence of initial 'handshake' between media servers in bpbrm logs.
Or maybe bpbrm (source) -> bpcd  (target) -> bpbrm (target)

Or maybe @davidmoline has an idea of where and how to track source and target media server comms...

The alternative is to log a Support call where they will need level 5 logs...

Hi @Alun 

Using bptestbpcd to test the two-way NBU comms should be sufficient (it's certainly a good starting point). First, a question from the bptestbpcd results: on the test from gbs1500491->gbs0015652, the peer name (i.e. what it thinks its hostname is) is displayed as GBS0036961.lloyds.net. Can you explain what host this is and why it is coming up?

The result in general though seems to be okay, with each side identifying the other media server's certificate correctly (i.e. the media server's certificates match in each direction). 

As for troubleshooting further, one way forward would be to identify the JobID of one of the failed duplication jobs (that you terminated) and use it with the Logging Assistant to set up the relevant debug logging on the master and media servers. With the JobID, the assistant will analyse the job and suggest relevant logging to set up (you can add more if desired). Once the logging is set up, initiate a new duplication, wait for some time and then use the Logging Assistant again to collect the logs for the relevant time frame (i.e. from just before you enabled the logging until you terminated the job). You will be given an opportunity to select a location (on the master) to save the logs from the various servers. These can then be analysed (you will also need the job details of this new duplication job to correlate the various log files).

Alternately as @Marianne suggested - log a support call with Veritas and get them to help. 

To answer your original question though: from a NetBackup point of view, there is no difference between running a duplication within a data center and between two data centers. The only real difference is networking, which NetBackup has no control over. Locally, the servers are probably all on the same network (although not necessarily); between DCs, the various servers need to be able to route correctly to each other. The other thing to be sure of is a working name resolution system, whether DNS/BIND (which is preferred) or local hosts files (which are prone to errors).
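
A minimal sketch of a basic routing check between the two data centers, independent of NetBackup itself. It only tests whether a TCP connection can be opened to a port on the other media server. The port numbers used here (1556 for PBX and 13724 for vnetd) are assumptions based on common NetBackup defaults and should be verified for your version and configuration; the function names are illustrative. A successful connect does not prove that the duplication data path works, only that the route and any firewalls allow the connection.

```python
import socket

# Assumed default NetBackup ports; verify against your own configuration.
ASSUMED_PORTS = {"pbx": 1556, "vnetd": 13724}


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports=ASSUMED_PORTS, timeout=5.0):
    """Return {service_name: reachable} for each port in the mapping."""
    return {name: can_connect(host, port, timeout) for name, port in ports.items()}


# Example (run on media1 against media2, then the reverse):
#   print(check_ports("media2"))
```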

Cheers
David

@davidmoline, GBS0036961.lloyds.net was the media server that was directly replaced by GBS1500491.lloyds.net. The IP address seems to be resolving back to the old name via a stale PTR record in DNS.

I'm currently waiting to talk to my DNS engineer to locate and purge the offending PTR record.

davidmoline
Level 6
Employee

Hi @Alun 

Does the forward lookup of the name GBS0036961.lloyds.net resolve to the IP address of the media server GBS1500491.lloyds.net or some other IP? If some other IP, then that may be the problem.

(host) GBS1500491 (resolves to IP) MediaServerIP (e.g. 10.10.10.10)

(IP) MediaServerIP (resolves to host) GBS0036961

What does (host) GBS0036961 (resolve to) IP???
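
The three lookups above can be sketched as a single forward/reverse consistency check. This is a minimal sketch using the system resolver through Python's `socket` module, so it reflects whatever DNS or hosts-file configuration the server it runs on is using. The function names and the "short names must match" rule are assumptions made for illustration.

```python
import socket


def short_name(name):
    """Lower-cased host part of a name, e.g. 'GBS1500491.lloyds.net' -> 'gbs1500491'."""
    return name.split(".")[0].lower()


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve host -> IP -> PTR name -> IP and report whether they agree."""
    report = {"hostname": hostname, "ip": None, "ptr_name": None,
              "ptr_forward_ip": None, "consistent": False}
    try:
        report["ip"] = forward(hostname)                      # host -> IP
        report["ptr_name"] = reverse(report["ip"])[0]         # IP -> host (PTR)
        report["ptr_forward_ip"] = forward(report["ptr_name"])  # PTR host -> IP
    except OSError:
        # A failed lookup leaves the remaining fields as None.
        return report
    report["consistent"] = (
        short_name(report["ptr_name"]) == short_name(hostname)
        and report["ptr_forward_ip"] == report["ip"]
    )
    return report


# Example (run on each media server and on the master):
#   print(check_name_resolution("GBS1500491"))
```

A stale PTR record shows up as a `ptr_name` that differs from the hostname that was queried.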

David
