Puredisk Replication failure

baloonga
Level 3

 

PureDisk 6.6.0.3 running on SUSE Linux.

I have a single SPA at PROD and one at DR. Backups complete successfully. Replication was successful for a few days and then started failing with the error below:

[2010-Nov-20 16:20:11 EST]Sleeping 30 second(s) before next check.
[2010-Nov-20 16:20:11 EST]Checking the execution status of each remote MBImport Job.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26117 has stopped. (SUCCESS)
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26118 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26119 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26120 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26121 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26122 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26123 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26124 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26125 is still running.
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26126 is still running.
***ERROR***
9999
severity: bug
server:
source:
description: Webservice could not be queried. URL: https://10.7.1.134/spa/ws/ws_job.php (couldn't connect to host).
 
***DONE***
Agent Jobstep analysis: exitcode 1, status 3, progress 82.
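When many replication jobs fail, it can help to summarize each job log rather than read it line by line. The following is a minimal sketch that counts the remote MBImport jobs by their last reported state and collects any webservice URLs that could not be queried. The regular expressions are modelled only on the excerpt above and are an assumption; other PureDisk versions may word these lines differently.

```python
import re
from collections import Counter

# Patterns modelled on the log excerpt in this thread (an assumption, not an official format).
JOB_LINE = re.compile(r"Remote MBImport Job with ID: (\d+) (has stopped|is still running)")
URL_LINE = re.compile(r"Webservice could not be queried\. URL: (\S+)")

def summarize(log_text):
    """Return (last seen state per job ID, list of URLs that could not be queried)."""
    states = {}
    urls = []
    for line in log_text.splitlines():
        job = JOB_LINE.search(line)
        if job:
            states[job.group(1)] = job.group(2)  # later lines overwrite earlier ones
            continue
        url = URL_LINE.search(line)
        if url:
            urls.append(url.group(1))
    return states, urls

sample = """\
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26117 has stopped. (SUCCESS)
[2010-Nov-20 16:20:11 EST]Remote MBImport Job with ID: 26118 is still running.
description: Webservice could not be queried. URL: https://10.7.1.134/spa/ws/ws_job.php (couldn't connect to host).
"""
states, urls = summarize(sample)
print(Counter(states.values()))
print(urls)
```

Running this over the logs of the 15 failing jobs would show whether they all fail against the same URL and at a similar point in the import.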

Out of 55 replication jobs, only 15 fail with this error.

The WAN link between the SPAs is 100 Mbps.

Tried following options:

Changed the replication schedule.

Increased nap time, sleep time, and maxstreams.

For one job, the data selection is 300 GB. I split it into 100 GB data selections, and 2 of the 3 replications were successful.

Checked the network for packet drops.

Storage is iSCSI.

Created a new data selection. The backup is successful, but replication fails.
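The error message says the webservice "could not be queried" because the client "couldn't connect to host", which reads like a failed connection attempt rather than packet loss in the middle of a transfer. A minimal sketch of a probe that repeatedly tries to open a TCP connection and counts the failures is shown here. The function names, the port 443 default (inferred from the https URL in the error), and the timing values are all assumptions for illustration and are not part of PureDisk.

```python
import socket
import time

def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe(host, port=443, attempts=10, interval=30.0):
    """Try to connect `attempts` times, `interval` seconds apart; return the failure count."""
    failures = 0
    for i in range(attempts):
        ok = can_connect(host, port)
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "ok" if ok else "FAILED")
        if not ok:
            failures += 1
        if i < attempts - 1:
            time.sleep(interval)  # wait before the next attempt
    return failures
```

For example, calling `probe("10.7.1.134", attempts=120)` from the source SPA during the replication window would log about an hour of connection attempts, which could then be compared with the timestamps of the failed jobs.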

2 REPLIES

f25
Level 4

Hi,

Perhaps try running reverification.

Go to policy management, find the replication policy that fails, and select the "Reverify All" option (in the advanced settings or parameters). Do it before the weekend, as this may use all the WAN and PD capacity you have (though 100 Mbit/s seems more than sufficient for a 300 GB replication; 5 Mbit/s would do for PD).

Another option is to create a brand-new data selection for this and, after a manual backup completes, replicate it. Then disable replication for the old data selection, as it is not going to change anymore.
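To put the bandwidth remark in numbers, here is a rough back-of-the-envelope calculation of how long a full 300 GB transfer would take at 100 Mbit/s and at 5 Mbit/s. It assumes decimal gigabytes, a fully used link, and no protocol overhead, and it ignores deduplication, so it is an upper bound on the data moved rather than a prediction of real replication time.

```python
def transfer_hours(size_gb, link_mbps):
    """Hours to move size_gb (decimal gigabytes) over a link of link_mbps megabits per second."""
    megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB = 8000 megabits
    return megabits / link_mbps / 3600

for mbps in (100, 5):
    print(f"300 GB at {mbps} Mbit/s: {transfer_hours(300, mbps):.1f} h")
# 300 GB at 100 Mbit/s: 6.7 h
# 300 GB at 5 Mbit/s: 133.3 h
```

So a full, undeduplicated 300 GB copy fits in a night at 100 Mbit/s, while the 5 Mbit/s figure is only realistic if replication sends a small fraction of the selection.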

Good luck!

Michał

S_Williamson
Level 6

Hi

Personally, I'd upgrade to 6.6.1, as it includes a lot of replication improvements.

Also, it appears the issue is one SPA timing out while talking to the other SPA. This could be because one side is too busy at that moment. It doesn't look like you have set up any retries either. I'd suggest configuring a minimum of 5 retries to help prevent the job from failing.

vi /etc/puredisk/agent.cfg

find

# Number of times replication should be retried in case of error
# Default value is 1 (no retry)
maxretrycount=1

and change it from 1 to 5 or even 10.
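If the same change has to be made on more than one machine, the edit can be scripted instead of done by hand in vi. The sketch below rewrites the `maxretrycount` line in the text of the config file. The function name is hypothetical, and it assumes the setting appears as a plain `maxretrycount=<number>` line, as in the excerpt above.

```python
import re

def set_max_retry(config_text, retries):
    """Return config_text with the maxretrycount setting changed to `retries`."""
    # Match only the setting itself, not the comment lines that start with '#'.
    pattern = re.compile(r"^([ \t]*maxretrycount[ \t]*=[ \t]*)\d+[ \t]*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{retries}", config_text)
    if count == 0:
        raise ValueError("no maxretrycount line found")
    return new_text

sample = "# Default value is 1 (no retry)\nmaxretrycount=1\n"
print(set_max_retry(sample, 5))
```

To apply it, read /etc/puredisk/agent.cfg, keep a backup copy, and write the returned text back. Whether the agent has to be restarted to pick up the new value is not stated in this thread.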

Simon