Forum Discussion

kimchooi
Level 3
10 years ago

Auto Image Replication not working properly

Hi,

I am working on a project to set up two MSDP servers at two locations. Initially I placed the servers side by side and everything ran perfectly, even with Accelerator enabled.

Then I moved one MSDP server to the DR site and replication stopped working. When I disabled Accelerator, replication started working again. I believe the setup is correct; I am just not sure which ports need to be open for Accelerator to communicate.

Do I need to open other ports for Accelerator to communicate?

Ports open between the two sites:

HQ --> DRC

1556, 10102, 10082

DRC --> HQ

1556

NetBackup version: 7.6.1.1

MSDP servers: Windows Server 2012 R2

Can anyone please help?

8 Replies

  • More testing

    I created a new policy and ran a backup; replication to DR worked fine. I then re-ran the backup without Accelerator, and the replication job failed.

    +++++++++++++++++++++++++++++++
    May 19 13:49:02 INFO [0000000001BB0090]: Starting replication with a bandwidth limit of 0 KB/s
    May 19 13:49:02 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
    May 19 13:49:03 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
    May 19 13:49:03 ERR [0000000001BB0090]: 36: __process_refop_batch: could not send reference message(s): connection reset by peer
    May 19 13:49:03 ERR [0000000001BB0090]: -1:  __process_refop_batch: Could not send reference message: connection reset by peer
    May 19 13:49:03 ERR [0000000001BB0090]: 36: CRReplicate: reference operation batch failed at DO 5a19697264c5d9d96e484c8cb4c037f6:6147
    May 19 13:49:03 INFO [0000000001BB0090]: -----------Replication Last Image Cache Report-----------------
    May 19 13:49:03 INFO [0000000001BB0090]: Total cache entry: 224997
    May 19 13:49:03 INFO [0000000001BB0090]: Cache Hits       : 173850
    May 19 13:49:03 INFO [0000000001BB0090]: Cache Miss       : 1725
    May 19 13:49:03 INFO [0000000001BB0090]: Cache Rebase     : 0
    May 19 13:49:03 INFO [0000000001BB0090]: -----------------------End-------------------------------------
    May 19 13:49:03 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardFingerprintBatch: CRReplicate of batch failed (connection reset by peer).
    May 19 13:49:03 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardDataList: forward fingerprints failed (connection reset by peer).
    May 19 13:49:03 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
    May 19 13:49:03 ERR [0000000001BB0090]: 36: Synchronization to pkpsbackup_drc failed (connection reset by peer).
    May 19 13:49:03 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
    May 19 13:49:03 WARNING [0000000001BB0090]: 36: Pd_ReplicationClass::ClientTask: failed at the 1 time for job[1d] (connection reset by peer)
    May 19 13:49:08 INFO [0000000001BB0090]: Puredisk::Replication::Engine::ReplicationEngine::forward: beginning replication for Data Protection Application job id 288
    +++++++++++++++++++++++++++

    The logs above are from replication.log. Why do the initial backup and replication work, but not subsequent ones?

    None of this happened when the servers had no firewall rules between them. What other ports do the two MSDP servers need in order to communicate?

  • You need tcp/1556 and tcp/10082 and tcp/10102 all bi-directional for NetBackup AIR to work between MSDP storage servers.

  • I just opened the ports bidirectionally, but it is still not fully working.

    I created a backup job with a few paths, 128 GB in total, with a dedupe rate of almost 98%.

    On the first try it replicated to the DR MSDP server.

    When I re-ran the backup, replication failed with the message above.

  • Have you confirmed that the ports are open, using six different telnet commands:

    source-machine# telnet target-ip port

    ...i.e. test three ports, each way?
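If telnet is not installed on the Windows servers, the same six checks can be scripted. The following is a minimal sketch, assuming Python 3 is available on both MSDP servers; the function names `check_port` and `check_ports` are made up for this example, and the port list is simply the one given in this thread. It only tests whether a TCP connection can be opened, so a passing result does not prove that a long-running replication stream will survive a firewall or port forward.

```python
import socket

# Ports named in this thread for AIR between MSDP storage servers.
AIR_PORTS = (1556, 10082, 10102)


def check_port(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports=AIR_PORTS, timeout=5.0):
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: check_port(host, port, timeout) for port in ports}
```

Run `check_ports("target-ip")` on the HQ server against the DR server, then on the DR server against the HQ server, which gives the three ports tested each way.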

  • I tested all the ports and they work (via port forwarding).

    As I said, it can replicate to the DR MSDP server, but only the first time.

    I also created another backup policy with 1 path and 2 files, around 6 GB with 80% dedupe.

    For this backup, replication always works. I ran it at least 3 times, and every time it replicated to DR.

    But for my previous backup policy, with 4 paths, a lot of files and a backup size of 120+ GB, replication succeeds only the first time and fails after that.

    Has anyone seen this kind of weird behaviour?

  • Logs for a successful replication

    ++++++++++++++++++++

    May 19 18:06:58 INFO [0000000001BB0090]: WSRequestExt: submitting &request=9&login=admindrc&passwd=********&action=getSDKVersion
    May 19 18:06:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
    May 19 18:06:58 INFO [0000000001BB0090]: WSRequestExt: submitting &request=6&login=admindrc&passwd=********&action=getlastimage&dsid=2&client=pkpssaga_hq&policy=Level0_Backup
    May 19 18:06:58 INFO [0000000001BB0090]: Empty source FP f1450306517624a57eafbbf81266a67a met, skip it.
    May 19 18:06:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
    May 19 18:06:58 INFO [0000000001BB0090]: Starting replication with a bandwidth limit of 0 KB/s
    May 19 18:06:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
    May 19 18:07:00 INFO [0000000001BB0090]: This replication job is currently using a bandwidth limit of 0 KB/s, 11 SOs has been sent in current batch.
    May 19 18:07:44 INFO [0000000001ABE4E0]: ===  Replication Monitor  ===
    May 19 18:07:44 INFO [0000000001ABE4E0]: Job ID: 10, Target: pdde://pkpsbackup_drc, Segments Processed: 205145, Age: 45.937 seconds
    May 19 18:07:44 INFO [0000000001ABE4E0]: There are 1 replication jobs currently running and 0 jobs waiting in the queue
    May 19 18:07:44 INFO [0000000001ABE4E0]: =============================
    May 19 18:08:03 INFO [0000000001BB0090]: This replication job is currently using a bandwidth limit of 0 KB/s, 1 SOs has been sent in current batch.
    May 19 18:08:03 INFO [0000000001BB0090]: ---------------------------------------------------
    May 19 18:08:03 INFO [0000000001BB0090]: Transfer time                                      : 64.52 sec
    May 19 18:08:03 INFO [0000000001BB0090]: Transfer rate                                      : 402.33 MB/sec
    May 19 18:08:03 INFO [0000000001BB0090]: De-dup percentage                                  : 100.00
    May 19 18:08:03 INFO [0000000001BB0090]: Total number of DOs to replicate                   : 11
    May 19 18:08:03 INFO [0000000001BB0090]: Total size of all DOs to replicate                 : 27217764585 bytes
    May 19 18:08:03 INFO [0000000001BB0090]: Total size of all segments that were sent          : 905315 bytes
    May 19 18:08:03 INFO [0000000001BB0090]: Number of segments                                 : 208509
    May 19 18:08:03 INFO [0000000001BB0090]: Number of segments sent                            : 12
    May 19 18:08:03 INFO [0000000001BB0090]: Details:
    May 19 18:08:03 INFO [0000000001BB0090]: -----------Replication Last Image Cache Report-----------------
    May 19 18:08:03 INFO [0000000001BB0090]: Total cache entry: 0
    May 19 18:08:03 INFO [0000000001BB0090]: Cache Hits       : 0
    May 19 18:08:03 INFO [0000000001BB0090]: Cache Miss       : 208629
    May 19 18:08:03 INFO [0000000001BB0090]: Cache Rebase     : 0
    May 19 18:08:03 INFO [0000000001BB0090]: -----------------------End-------------------------------------

    ++++++++++++++++++++++++++++

    Logs for a failed replication

    ++++++++++++++++++++++++++++

    May 19 18:26:58 INFO [0000000001BB0090]: WSRequestExt: submitting &request=6&login=admindrc&passwd=********&action=getlastimage&dsid=2&client=pkpssaga_hq&policy=Level0_Backup
    May 19 18:26:58 INFO [0000000001BB0090]: Empty source FP f1450306517624a57eafbbf81266a67a met, skip it.
    May 19 18:26:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
    May 19 18:26:58 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
    May 19 18:27:44 INFO [0000000001ABE4E0]: ===  Replication Monitor  ===
    May 19 18:27:44 INFO [0000000001ABE4E0]: Job ID: 11, Target: pdde://pkpsbackup_drc, Segments Processed: 0, Age: 46.578 seconds
    May 19 18:27:44 INFO [0000000001ABE4E0]: There are 1 replication jobs currently running and 0 jobs waiting in the queue
    May 19 18:27:44 INFO [0000000001ABE4E0]: =============================
    May 19 18:28:53 INFO [0000000001BB0090]: CRAnalyzeDoLocality: rebasing has been disable.
    May 19 18:28:53 INFO [0000000001BB0090]: CRAnalyzeDoLocality: FP number in image: 208497
    May 19 18:28:53 INFO [0000000001BB0090]: Starting replication with a bandwidth limit of 0 KB/s
    May 19 18:28:53 INFO [0000000001BB0090]: sessionStartAgent: Server is Version 8.0101.0015.013, Protocol Version 6.6.1.1
    May 19 18:28:54 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
    May 19 18:28:54 ERR [0000000001BB0090]: 36: __process_refop_batch: could not send reference message(s): connection reset by peer
    May 19 18:28:54 ERR [0000000001BB0090]: -1:  __process_refop_batch: Could not send reference message: connection reset by peer
    May 19 18:28:54 ERR [0000000001BB0090]: 36: CRReplicate: reference operation batch failed at DO 90f72f1bac130fbd44bf8954435e9d47:65
    May 19 18:28:54 INFO [0000000001BB0090]: -----------Replication Last Image Cache Report-----------------
    May 19 18:28:54 INFO [0000000001BB0090]: Total cache entry: 208121
    May 19 18:28:54 INFO [0000000001BB0090]: Cache Hits       : 208497
    May 19 18:28:54 INFO [0000000001BB0090]: Cache Miss       : 1
    May 19 18:28:54 INFO [0000000001BB0090]: Cache Rebase     : 0
    May 19 18:28:54 INFO [0000000001BB0090]: -----------------------End-------------------------------------
    May 19 18:28:54 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardFingerprintBatch: CRReplicate of batch failed (connection reset by peer).
    May 19 18:28:54 ERR [0000000001BB0090]: 36: Puredisk::Replication::Engine::ReplicationThread::_forwardDataList: forward fingerprints failed (connection reset by peer).
    May 19 18:28:54 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
    May 19 18:28:54 ERR [0000000001BB0090]: 36: Synchronization to pkpsbackup_drc failed (connection reset by peer).
    May 19 18:28:54 ERR [0000000001BB0090]: 36: _crBinaryMessageSend2: Error sending data: connection reset by peer
    May 19 18:28:54 WARNING [0000000001BB0090]: 36: Pd_ReplicationClass::ClientTask: failed at the 1 time for job[11d] (connection reset by peer)

    +++++++++++++++++++++++++++++++++

    I see that it requests the last image that was transferred. If no image has been replicated previously, the replication succeeds, but when there is a valid last image, the replication fails.

    What is the target MSDP server trying to send back to the source MSDP server? What is needed for the full communication to work so that replication can proceed?

    Anyone?
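To compare a successful and a failed run side by side, the cache report and the ERR lines can be pulled out of a replication.log excerpt with a short script. This is a rough sketch, assuming the line layout matches the excerpts pasted above; the function name `summarize` and the regular expressions are my own and are not part of NetBackup. It expects the lines of a single job, since later cache values overwrite earlier ones.

```python
import re

# Layout assumed from the excerpts in this thread:
# "May 19 18:28:54 ERR [0000000001BB0090]: <message>"
LINE_RE = re.compile(
    r"^\s*(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<level>INFO|ERR|WARNING) "
    r"\[(?P<thread>[0-9A-Fa-f]+)\]: (?P<msg>.*)$"
)
CACHE_RE = re.compile(
    r"^(Total cache entry|Cache Hits|Cache Miss|Cache Rebase)\s*: (\d+)$"
)


def summarize(lines):
    """Collect ERR messages and Last Image Cache Report counters."""
    errors = []
    cache = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip separators and blank lines
        msg = match.group("msg").strip()
        if match.group("level") == "ERR":
            errors.append(msg)
        counter = CACHE_RE.match(msg)
        if counter:
            cache[counter.group(1)] = int(counter.group(2))
    return {"errors": errors, "cache": cache}
```

Applied to the two excerpts above, this makes the difference easy to see: the successful run reports a total cache entry count of 0, while the failed run has a populated cache and is reset right after replication starts.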

  • When you moved the target MSDP to the DR site, did its IP address change?

    I am not sure what the impact would be if the IP address did change; it could possibly cause a replication error.

    Personally I don't think it has anything to do with Accelerator. I will try to test something similar to see if I can reproduce what you experienced (Accelerator + AIR).

  • Yes, the IP did change, but the hostname remains the same.

    The connection between the two sites does not use a VPN; it uses port forwarding. I am not sure whether that affects anything.

    I did another test and found that if I choose only 2-3 files for my backup policy, it is able to replicate. But if I choose a path with a lot of files, only the first replication after a backup works; after subsequent backups, replication fails with the error above.
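Since the target's IP changed while its hostname stayed the same, it may be worth confirming what each server actually resolves the other's name to. Below is a small sketch, assuming Python 3 on the servers; `resolve_ipv4` and `resolves_to` are illustrative names, not NetBackup tools.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def resolves_to(hostname, expected_ip):
    """Return True if hostname resolves to expected_ip on this machine."""
    try:
        return expected_ip in resolve_ipv4(hostname)
    except socket.gaierror:
        return False  # name could not be resolved at all
```

For example, `resolves_to("pkpsbackup_drc", "203.0.113.10")` run on the HQ server would show whether the DR storage server name from the logs points at the address you expect; the IP here is only a placeholder for whatever address the port forward exposes.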