Auto Image Replication not working properly

kimchooi
Level 3

Hi,

I am doing a project to set up 2 MSDP servers at 2 locations. At the start I put the servers side by side and everything ran perfectly, even with Accelerator enabled.

Then I moved 1 MSDP to the DR site and replication stopped working. When I disabled Accelerator, replication started working again. I believe the setup is correct; I am just not sure which ports need to be open for Accelerator to communicate.

Do I need to open other ports for Accelerator to communicate?

Ports open between the 2 sites:

HQ --> DRC

1556, 10102, 10082

 

DRC --> HQ

1556

NetBackup version 7.6.1.1 

MSDP on Windows 2012 R2 servers

Can anyone please help?

1 ACCEPTED SOLUTION

Accepted Solutions

watsons
Level 6

When you moved the target MSDP to the DR site, did its IP address change?

I am not sure what the impact would be if the IP address did change; possibly it is causing the replication error?

Personally, I don't think it has something to do with Accelerator. I will try to test something similar to see if I can reproduce what you experienced (Accelerator + AIR).


kimchooi
Level 3

More testing

I created a new policy and ran a backup; the replication to DR worked fine. I re-ran the backup without Accelerator, and then the replication job failed.

+++++++++++++++++++++++++++++++
May 19 13:49:02 INFO [0000000001BB0090]: Starting replication with a bandwidth limit of 0 KB/s
May 19 13:49:02 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
May 19 13:49:03 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
May 19 13:49:03 ERR [0000000001BB0090]: 36: __process_refop_batch: could not send reference message(s): connection reset by peer
May 19 13:49:03 ERR [0000000001BB0090]: -1:  __process_refop_batch: Could not send reference message: connection reset by peer
May 19 13:49:03 ERR [0000000001BB0090]: 36: CRReplicate: reference operation batch failed at DO 5a19697264c5d9d96e484c8cb4c037f6:6147
May 19 13:49:03 INFO [0000000001BB0090]: -----------Replication Last Image Cache Report-----------------
May 19 13:49:03 INFO [0000000001BB0090]: Total cache entry: 224997
May 19 13:49:03 INFO [0000000001BB0090]: Cache Hits       : 173850
May 19 13:49:03 INFO [0000000001BB0090]: Cache Miss       : 1725
May 19 13:49:03 INFO [0000000001BB0090]: Cache Rebase     : 0
May 19 13:49:03 INFO [0000000001BB0090]: -----------------------End-------------------------------------
May 19 13:49:03 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardFingerprintBatch: CRReplicate of batch failed (connection reset by peer).
May 19 13:49:03 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardDataList: forward fingerprints failed (connection reset by peer).
May 19 13:49:03 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
May 19 13:49:03 ERR [0000000001BB0090]: 36: Synchronization to pkpsbackup_drc failed (connection reset by peer).
May 19 13:49:03 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
May 19 13:49:03 WARNING [0000000001BB0090]: 36: Pd_ReplicationClass::ClientTask: failed at the 1 time for job[1d] (connection reset by peer)
May 19 13:49:08 INFO [0000000001BB0090]: Puredisk::Replication::Engine::ReplicationEngine::forward: beginning replication for Data Protection Application job id 288
+++++++++++++++++++++++++++

The logs above are from replication.log. Why do the initial backup and replication work, but not subsequent backups?

None of this happened when the servers had no firewall rules between them. What other ports do the 2 MSDP servers require to communicate?

sdo
Moderator

You need tcp/1556 and tcp/10082 and tcp/10102 all bi-directional for NetBackup AIR to work between MSDP storage servers.
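Comparing that requirement with the firewall rules listed in the original post shows which direction is short of ports. The sketch below only restates the numbers already in this thread; the dictionary layout and the `missing_rules` helper are made up for illustration and are not part of NetBackup.

```python
# Ports named in the reply above as required in both directions for AIR
# between MSDP storage servers.
REQUIRED_PORTS = {1556, 10082, 10102}

# Firewall rules exactly as listed in the original post.
opened = {
    ("HQ", "DRC"): {1556, 10102, 10082},
    ("DRC", "HQ"): {1556},
}


def missing_rules(opened_rules, required=REQUIRED_PORTS):
    """Return {(src, dst): sorted missing ports} for each direction with a gap."""
    gaps = {}
    for direction in (("HQ", "DRC"), ("DRC", "HQ")):
        missing = required - opened_rules.get(direction, set())
        if missing:
            gaps[direction] = sorted(missing)
    return gaps


if __name__ == "__main__":
    for (src, dst), ports in missing_rules(opened).items():
        listed = ", ".join(f"tcp/{p}" for p in ports)
        print(f"{src} --> {dst}: missing {listed}")
```

With the rules from the first post, the only gap reported is DRC --> HQ for tcp/10082 and tcp/10102.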

kimchooi
Level 3

I just opened the ports bidirectionally, and it is still not fully working.

I created a backup job with a few paths, 128 GB in total, with dedupe at almost 98%.

On the first try it replicated over to the DR MSDP server.

I re-ran the backup, and replication failed with the message above.

sdo
Moderator

Have you confirmed that the ports are open, using six separate telnet commands?

source-machine# telnet target-ip port

...i.e. test three ports, each way.
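If telnet is not installed on the Windows 2012 R2 servers, the same check can be scripted. This is a minimal sketch using only the Python standard library; `port_open` and `check` are hypothetical helper names, and the host passed in would be whatever name or address each side uses to reach the other. Note that a successful connect only shows that the TCP handshake completes through the firewall or port forward; it says nothing about whether a longer transfer survives.

```python
import socket

# The three ports discussed in this thread.
PORTS = (1556, 10082, 10102)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check(host, ports=PORTS):
    """Return {port: reachable} for every port in ports."""
    return {port: port_open(host, port) for port in ports}
```

Run `check("target-ip")` once on the HQ server against the DR server and once on the DR server against HQ, which gives the same six tests as the telnet commands.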

kimchooi
Level 3

I tested all the ports and they are working (via port forwarding).

As I said, it can replicate to the DR MSDP, but only the first time.

I also created another backup policy with 1 path and 2 files, around 6 GB with 80% dedupe.

For this backup, replication always works. I ran it at least 3 times, and every time it replicated to DR.

But my previous backup policy, with 4 paths, a lot of files and a backup size of 120+ GB, only replicates successfully the first time and fails after that.

Has anyone seen this kind of weird behaviour?

kimchooi
Level 3

Logs for a successful replication

++++++++++++++++++++

May 19 18:06:58 INFO [0000000001BB0090]: WSRequestExt: submitting &request=9&login=admindrc&passwd=********&action=getSDKVersion
May 19 18:06:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
May 19 18:06:58 INFO [0000000001BB0090]: WSRequestExt: submitting &request=6&login=admindrc&passwd=********&action=getlastimage&dsid=2&client=pkpssaga_hq&policy=Level0_Backup
May 19 18:06:58 INFO [0000000001BB0090]: Empty source FP f1450306517624a57eafbbf81266a67a met, skip it.
May 19 18:06:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
May 19 18:06:58 INFO [0000000001BB0090]: Starting replication with a bandwidth limit of 0 KB/s
May 19 18:06:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
May 19 18:07:00 INFO [0000000001BB0090]: This replication job is currently using a bandwidth limit of 0 KB/s, 11 SOs has been sent in current batch.
May 19 18:07:44 INFO [0000000001ABE4E0]: ===  Replication Monitor  ===
May 19 18:07:44 INFO [0000000001ABE4E0]: Job ID: 10, Target: pdde://pkpsbackup_drc, Segments Processed: 205145, Age: 45.937 seconds
May 19 18:07:44 INFO [0000000001ABE4E0]: There are 1 replication jobs currently running and 0 jobs waiting in the queue
May 19 18:07:44 INFO [0000000001ABE4E0]: =============================
May 19 18:08:03 INFO [0000000001BB0090]: This replication job is currently using a bandwidth limit of 0 KB/s, 1 SOs has been sent in current batch.
May 19 18:08:03 INFO [0000000001BB0090]: ---------------------------------------------------
May 19 18:08:03 INFO [0000000001BB0090]: Transfer time                                      : 64.52 sec
May 19 18:08:03 INFO [0000000001BB0090]: Transfer rate                                      : 402.33 MB/sec
May 19 18:08:03 INFO [0000000001BB0090]: De-dup percentage                                  : 100.00
May 19 18:08:03 INFO [0000000001BB0090]: Total number of DOs to replicate                   : 11
May 19 18:08:03 INFO [0000000001BB0090]: Total size of all DOs to replicate                 : 27217764585 bytes
May 19 18:08:03 INFO [0000000001BB0090]: Total size of all segments that were sent          : 905315 bytes
May 19 18:08:03 INFO [0000000001BB0090]: Number of segments                                 : 208509
May 19 18:08:03 INFO [0000000001BB0090]: Number of segments sent                            : 12
May 19 18:08:03 INFO [0000000001BB0090]: Details:
May 19 18:08:03 INFO [0000000001BB0090]: -----------Replication Last Image Cache Report-----------------
May 19 18:08:03 INFO [0000000001BB0090]: Total cache entry: 0
May 19 18:08:03 INFO [0000000001BB0090]: Cache Hits       : 0
May 19 18:08:03 INFO [0000000001BB0090]: Cache Miss       : 208629
May 19 18:08:03 INFO [0000000001BB0090]: Cache Rebase     : 0
May 19 18:08:03 INFO [0000000001BB0090]: -----------------------End-------------------------------------

++++++++++++++++++++++++++++
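A quick bit of arithmetic on the summary above shows how little data actually crossed the link in the successful run. The figures are copied from the log; treating "MB" as 2**20 bytes is an assumption on my part, and the function names are only for illustration.

```python
# Figures copied from the successful replication summary above.
TOTAL_BYTES = 27_217_764_585   # "Total size of all DOs to replicate"
SENT_BYTES = 905_315           # "Total size of all segments that were sent"
TRANSFER_SECONDS = 64.52       # "Transfer time"


def dedup_percent(total, sent):
    """Share of the logical data that did not have to be sent, in percent."""
    return 100.0 * (1 - sent / total)


def logical_rate_mb_per_s(total, seconds):
    """Logical data covered per second (assumption: 1 MB = 2**20 bytes)."""
    return total / seconds / 2**20


def wire_rate_kb_per_s(sent, seconds):
    """Data actually sent per second, in KB (1 KB = 1024 bytes)."""
    return sent / seconds / 1024


if __name__ == "__main__":
    print(f"dedup: {dedup_percent(TOTAL_BYTES, SENT_BYTES):.2f} %")
    print(f"logical rate: {logical_rate_mb_per_s(TOTAL_BYTES, TRANSFER_SECONDS):.1f} MB/s")
    print(f"wire rate: {wire_rate_kb_per_s(SENT_BYTES, TRANSFER_SECONDS):.1f} KB/s")
```

Under that assumption the logical rate comes out at about 402.3 MB/s, close to the logged 402.33 MB/sec, so the logged rate appears to describe the logical image size rather than traffic on the wire. Less than 1 MB was really sent.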

Logs for a failed replication

++++++++++++++++++++++++++++

May 19 18:26:58 INFO [0000000001BB0090]: WSRequestExt: submitting &request=6&login=admindrc&passwd=********&action=getlastimage&dsid=2&client=pkpssaga_hq&policy=Level0_Backup
May 19 18:26:58 INFO [0000000001BB0090]: Empty source FP f1450306517624a57eafbbf81266a67a met, skip it.
May 19 18:26:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
May 19 18:26:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
May 19 18:27:44 INFO [0000000001ABE4E0]: ===  Replication Monitor  ===
May 19 18:27:44 INFO [0000000001ABE4E0]: Job ID: 11, Target: pdde://pkpsbackup_drc, Segments Processed: 0, Age: 46.578 seconds
May 19 18:27:44 INFO [0000000001ABE4E0]: There are 1 replication jobs currently running and 0 jobs waiting in the queue
May 19 18:27:44 INFO [0000000001ABE4E0]: =============================
May 19 18:28:53 INFO [0000000001BB0090]: CRAnalyzeDoLocality: rebasing has been disable.
May 19 18:28:53 INFO [0000000001BB0090]: CRAnalyzeDoLocality: FP number in image: 208497
May 19 18:28:53 INFO [0000000001BB0090]: Starting replication with a bandwidth limit of 0 KB/s
May 19 18:28:53 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
May 19 18:28:54 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
May 19 18:28:54 ERR [0000000001BB0090]: 36: __process_refop_batch: could not send reference message(s): connection reset by peer
May 19 18:28:54 ERR [0000000001BB0090]: -1:  __process_refop_batch: Could not send reference message: connection reset by peer
May 19 18:28:54 ERR [0000000001BB0090]: 36: CRReplicate: reference operation batch failed at DO 90f72f1bac130fbd44bf8954435e9d47:65
May 19 18:28:54 INFO [0000000001BB0090]: -----------Replication Last Image Cache Report-----------------
May 19 18:28:54 INFO [0000000001BB0090]: Total cache entry: 208121
May 19 18:28:54 INFO [0000000001BB0090]: Cache Hits       : 208497
May 19 18:28:54 INFO [0000000001BB0090]: Cache Miss       : 1
May 19 18:28:54 INFO [0000000001BB0090]: Cache Rebase     : 0
May 19 18:28:54 INFO [0000000001BB0090]: -----------------------End-------------------------------------
May 19 18:28:54 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardFingerprintBatch: CRReplicate of batch failed (connection reset by peer).
May 19 18:28:54 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardDataList: forward fingerprints failed (connection reset by peer).
May 19 18:28:54 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
May 19 18:28:54 ERR [0000000001BB0090]: 36: Synchronization to pkpsbackup_drc failed (connection reset by peer).
May 19 18:28:54 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
May 19 18:28:54 WARNING [0000000001BB0090]: 36: Pd_ReplicationClass::ClientTask: failed at the 1 time for job[11d] (connection reset by peer)

+++++++++++++++++++++++++++++++++

I see that it tries to request the last image that was transferred. If no image has previously been replicated, the replication succeeds, but when there is a valid last image, the replication fails.

What does the target MSDP try to send back to the source MSDP, so that the full communication works and the replication can proceed?

Anyone?
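To compare runs without reading the whole of replication.log, the error count and the "Last Image Cache Report" values can be pulled out with a small script. This is only a sketch: the regular expressions are inferred from the excerpts pasted in this thread, not from any documented log format, and `summarize` is a hypothetical helper.

```python
import re

# Line layout inferred from the excerpts above:
# "May 19 18:28:54 ERR [0000000001BB0090]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<level>INFO|ERR|WARNING) "
    r"\[(?P<thread>[0-9A-Fa-f]+)\]: (?P<msg>.*)$"
)
CACHE_RE = re.compile(
    r"^(Total cache entry|Cache Hits|Cache Miss|Cache Rebase)\s*:\s*(\d+)$"
)

# A few lines copied verbatim from the failed run above.
SAMPLE = """\
May 19 18:28:54 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
May 19 18:28:54 INFO [0000000001BB0090]: Total cache entry: 208121
May 19 18:28:54 INFO [0000000001BB0090]: Cache Hits       : 208497
May 19 18:28:54 INFO [0000000001BB0090]: Cache Miss       : 1
May 19 18:28:54 INFO [0000000001BB0090]: Cache Rebase     : 0
"""


def summarize(log_text):
    """Count ERR lines and collect the cache report counters from log_text."""
    errors = 0
    cache = {}
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip separators such as "+++++" and blank lines
        if m["level"] == "ERR":
            errors += 1
        c = CACHE_RE.match(m["msg"])
        if c:
            cache[c[1]] = int(c[2])
    return {"errors": errors, "cache": cache}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In the excerpts above, the successful run reports 0 cache entries and 0 hits, while the failed run reports 208121 entries and 208497 hits, which matches the observation that the failure only appears once a last image exists on the target.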

watsons
Level 6

When you moved the target MSDP to the DR site, did its IP address change?

I am not sure what the impact would be if the IP address did change; possibly it is causing the replication error?

Personally, I don't think it has something to do with Accelerator. I will try to test something similar to see if I can reproduce what you experienced (Accelerator + AIR).

kimchooi
Level 3

Yes, the IP did change, but the hostname remains the same.

The connection between the 2 sites is without VPN; it uses port forwarding. I am not sure whether that has an effect.

I did another test and found that if I choose only 2-3 files for my backup policy, it is able to replicate. But if I choose a path with a lot of files, only the first replication after a backup works; on subsequent backups, replication fails with the error above.
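Since the IP changed while the hostname stayed the same, one thing worth ruling out is a stale name lookup on either side. A minimal sketch with the Python standard library follows; `resolve_ipv4` and `resolves_to` are hypothetical helper names, and the expected address is whatever each site should be using to reach the other through the port forward.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses that hostname resolves to on this machine."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def resolves_to(hostname, expected_ip):
    """Return True if hostname resolves to expected_ip, False otherwise."""
    try:
        return expected_ip in resolve_ipv4(hostname)
    except socket.gaierror:
        # The name could not be resolved at all.
        return False
```

For example, on the HQ server one could call `resolves_to("pkpsbackup_drc", expected_ip)`, with `expected_ip` set to the forwarded address, and run the equivalent check on the DR server for the HQ hostname.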