Error: 110 and Error: 42 on Replication

Liliana_Windver · 02-26-2010

Hello,

We run PureDisk 6.6 with two SPAs. We replicate the data from SPA1 to SPA2 every day, for all agents.
The replication succeeded several times, but now one run has failed, complaining that the spool directory is out of space, even though "df -h" shows there is enough space.
The "Job Log" output for one of the replicated agents and the df output are below:

*** Start: Replication Prepare ***
The remote dataselection mirror for source dataselection 8 is: 6
*** Stop: Replication Prepare ***
Agent Jobstep analysis: exitcode 0, status 2, progress 100.

*** Supportability Summary ***
jobid = 894
jobstepid = 3998
agentid = 1106000000
hostname = shpd01
starttimejobstep = February 25, 2010, 4:00 am
endtimejobstep = February 25, 2010, 4:00 am
workflowstepname = Prepare Replication
status = SUCCESS

[2010-Feb-25 04:09:35 IST]Starting Replication.
[2010-Feb-25 04:09:35 IST]Start to create the Replication Task
[2010-Feb-25 04:09:35 IST]Replication Task created.
Source Application : PUREDISK Remote Office
Policy Id : 123
Source DSID : 8
Destination DSID : 6
Source AgentID : 6
Type of Replication : INCREMENTAL
Remote ContentRouter port : 10082
Delayed DO Max Queue Size : 4194304 bytes
Encryption enabled : YES
MBFind Statement : <?xml version="1.0" encoding="UTF-8"?><MBFindCollection/>
Bandwidth limit : <not defined>
JobId : 894
JobStep : 4011
URL Remote SPA : 1.12.61.86
Destination StoragepoolID : 884
Local StoragepoolID : 1106
[2010-Feb-25 04:09:35 IST]Init Replication Engine.
[2010-Feb-25 04:09:36 IST]ReplicationEngine initialized.
[2010-Feb-25 04:09:36 IST]Start of Replication init step.
[2010-Feb-25 04:09:36 IST]Stop of Replication init step.
[2010-Feb-25 04:09:36 IST]Updating Agent Mirror Data Lock Password if needed.
[2010-Feb-25 04:09:36 IST]Updating Agent Mirror Data Lock Password finished.
[2010-Feb-25 04:09:36 IST]Start forwarding actual content.
[2010-Feb-25 04:09:37 IST]Forwarding batchNumbers (Incremental):707-731
[2010-Feb-25 04:09:37 IST]Destination current routingtables 0000 ffff shpd02 0 are written to file /Storage/var/rt/884_894.current
[2010-Feb-25 04:09:37 IST]Destination recommended routingtables 0000 ffff shpd02 0 are written to file /Storage/var/rt/884_894.recommended
[2010-Feb-25 04:09:37 IST]Executing MBFind to batchnumber : 731
[2010-Feb-25 04:09:37 IST]Using MBFind <?xml version="1.0" encoding="UTF-8"?><MBFindCollection/>
[2010-Feb-25 04:09:37 IST]Using DSFind -i 8
[2010-Feb-25 04:10:01 IST]Starting multi-stream replication.
[2010-Feb-25 04:10:02 IST]Starting multi-stream replication with 4 stream(s)
[2010-Feb-25 04:10:02 IST]Successfully started stream 0
[2010-Feb-25 04:10:02 IST]Successfully started stream 1
[2010-Feb-25 04:10:02 IST]Successfully started stream 2
[2010-Feb-25 04:10:02 IST]Successfully started stream 3
[2010-Feb-25 4:10:07 IST][stream2] Forwarding data (NUMBER OF FINGERPRINTS in this batch:24)
[2010-Feb-25 4:10:07 IST][stream2] Info: Server is Version 6.6.0.29164, Protocol Version 6.6
[2010-Feb-25 4:10:07 IST][stream2] Error: 110 : Received an abort message: spool directory out of space: Could not store reference operation
[2010-Feb-25 4:10:07 IST][stream2] Error: 42 : __replicate_DO_refop_batch_process: could not receive reference reply message(s): aborted
[2010-Feb-25 4:10:07 IST][stream2] Error: 42 : __replicate_DO_refop_batch: Could not process reference operation batch for replication batch entry 0-23, cache: aborted
[2010-Feb-25 4:10:07 IST][stream2] Error: 42 : Could not send reference add operations for source DOs to destination storage pool: aborted
[2010-Feb-25 4:10:07 IST][stream2]
[2010-Feb-25 4:10:07 IST][stream2] Fatal error: zif_cr_replicate: could not process the replication batch: aborted in /opt/pdmbe/mgmtclass/ReplicationStream.php on line 210
[2010-Feb-25 04:10:07 IST]Stream 2 completed with exit value 255
[2010-Feb-25 04:10:08 IST]Replication will retry sending data for attempt number: 1 after sleeping 10 second(s).

.........................
The same errors are shown for stream 0 and stream 1, for a total of 10 retries.
Stream 3 never started.
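
For anyone comparing streams in a long job log like this one, a minimal Python sketch (not part of PureDisk) that tallies the "Error: <code>" lines per stream might help. The regular expression is an assumption based only on the log lines quoted in this post, and `tally_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern is an assumption derived from the lines quoted in this post, e.g.
# "[2010-Feb-25 4:10:07 IST][stream2] Error: 110 : Received an abort message..."
ERROR_LINE = re.compile(r"\[stream(\d+)\]\s+Error:\s+(\d+)\s*:")


def tally_errors(log_text):
    """Count (stream number, error code) pairs found in a job log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            stream = int(match.group(1))
            code = int(match.group(2))
            counts[(stream, code)] += 1
    return counts


# Three lines copied from the job log above; the last one is not an error line.
sample = """\
[2010-Feb-25 4:10:07 IST][stream2] Error: 110 : Received an abort message: spool directory out of space: Could not store reference operation
[2010-Feb-25 4:10:07 IST][stream2] Error: 42 : __replicate_DO_refop_batch_process: could not receive reference reply message(s): aborted
[2010-Feb-25 04:10:07 IST]Stream 2 completed with exit value 255
"""

for (stream, code), n in sorted(tally_errors(sample).items()):
    print(f"stream {stream}: error {code} seen {n} time(s)")
```

Feeding the full job log text to `tally_errors` instead of `sample` would show whether every stream reports the same 110/42 pair on each retry.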

......

Any ideas?

Regards,

Liliana

[2010-Feb-25 04:18:34 IST]Replication has tried 10 time(s) to replicate data, but was not successful.
[2010-Feb-25 04:18:34 IST]Checking the execution status of each remote MBImport Job.
[2010-Feb-25 04:18:34 IST]The batchnumber could not be increased, Failing the replication Job.
[2010-Feb-25 04:18:35 IST]Statistics on SOURCE connection:
uptime = 0
bytes_transferred = 0
bytes_received = 0
messages_transferred = 0
messages_received = 0
seconds_in_transfer = 0
seconds_in_receive = 0
data_bytes_transferred = 0
data_bytes_received = 0
data_seconds_in_transfer = 0
data_seconds_in_receive = 0
message_bytes_transferred = 0
message_bytes_received = 0
message_seconds_in_transfer = 0
message_seconds_in_receive = 0
[2010-Feb-25 04:18:35 IST]Statistics on DESTINATION connection:
uptime = 0
bytes_transferred = 0
bytes_received = 0
messages_transferred = 0
messages_received = 0
seconds_in_transfer = 0
seconds_in_receive = 0
data_bytes_transferred = 0
data_bytes_received = 0
data_seconds_in_transfer = 0
data_seconds_in_receive = 0
message_bytes_transferred = 0
message_bytes_received = 0
message_seconds_in_transfer = 0
message_seconds_in_receive = 0
[2010-Feb-25 04:18:35 IST]Statistics on from Meta Data (PO-objects):
po_replicated_success = 0
po_new_source = 0
po_deleted_source = 0
po_modified_source = 0
po_bytes_replicated_success = 0
po_bytes_new_source = 0
po_bytes_deleted_source = 0
po_bytes_modified_source = 0
[2010-Feb-25 04:19:00 IST]Stop forwarding actual content.
[2010-Feb-25 04:19:00 IST]Start finalizing Replication.
[2010-Feb-25 04:19:00 IST]Stop finalizing Replication.
[2010-Feb-25 04:19:00 IST]Stopping Replication.

Agent Jobstep analysis: exitcode 1, status 3, progress 0.
*** Supportability Summary ***
jobid = 894
jobstepid = 4011
agentid = 1106000000
hostname = shpd01
starttimejobstep = February 25, 2010, 4:09 am
endtimejobstep = February 25, 2010, 4:19 am
workflowstepname = Forward Data
status = ERROR
Execute WFAction: Mark Error

*** Supportability Summary ***
jobid = 894
jobstepid = 4026
agentid = 1106000000
hostname = shpd01
starttimejobstep = February 25, 2010, 4:19 am
endtimejobstep = February 25, 2010, 4:19 am
workflowstepname = Error
status = SUCCESS
Execute WFAction: Exit
Job exited with 1 errors, 0 warnings, 3 successes
*** Supportability Summary ***
jobid = 894
jobstepid = 4027
agentid = 1106000000
hostname = shpd01
starttimejobstep = February 25, 2010, 4:19 am
endtimejobstep = February 25, 2010, 4:19 am
workflowstepname = Exit
status = SUCCESS

shpd01

Last login: Mon Feb 22 12:34:46 2010 from eliaf-7.clalit.org.il
shpd01:~ # df  -h
Filesystem            Size  Used Avail Use% Mounted on
rootfs                123G  2.3G  115G   2% /
udev                  2.0G  144K  2.0G   1% /dev
/dev/disk/by-id/cciss-3600508b1001030364537413446300a00-part3
                      123G  2.3G  115G   2% /
tmpfs                 4.0K     0  4.0K   0% /dev/vx
/dev/disk/by-id/cciss-3600508b1001030364537413446300a00-part1
                       99M   14M   80M  15% /boot
/dev/disk/by-id/cciss-3600508b1001030364537413446300b00-part1
                      932G  319G  613G  35% /Storage
shpd01:~ #


shpd02

shpd02:~ # df -h
Filesystem            Size  Used Avail Use% Mounted on
rootfs                 56G  2.2G   51G   5% /
udev                  2.0G  144K  2.0G   1% /dev
/dev/disk/by-id/cciss-3600508b1001037373020202020200002-part3
                       56G  2.2G   51G   5% /
tmpfs                 4.0K     0  4.0K   0% /dev/vx
/dev/disk/by-id/cciss-3600508b1001037373020202020200002-part1
                       99M   14M   80M  15% /boot
/dev/disk/by-id/cciss-3600508b1001037373020202020200003-part1
                      932G  291G  642G  32% /Storage
shpd02:~ #
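
The df -h listings above report block usage only. A filesystem can also refuse writes when it has no free inodes left, and a directory can sit on a different mount than expected. Whether either applies to this failure is not established. As a minimal sketch, the Python below reports both free bytes and free inodes for whatever filesystem holds a given path. The real spool directory path is not given in this post, so the path used here is only a placeholder, and `space_report` is a hypothetical helper name.

```python
import os


def space_report(path):
    """Return free bytes and inode counts for the filesystem holding path."""
    st = os.statvfs(path)  # available on Unix-like systems
    return {
        # Space available to unprivileged users, in bytes.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Inodes available to unprivileged users.
        "free_inodes": st.f_favail,
        # Total inodes on the filesystem (may be 0 on some filesystems).
        "total_inodes": st.f_files,
    }


# "/" is a placeholder only; substitute the content router's spool
# directory on the destination node once its location is known.
report = space_report("/")
for name, value in report.items():
    print(f"{name} = {value}")
```

Running this against the spool location on the destination node (shpd02 here) would show whether free bytes and free inodes both look healthy on the filesystem that actually holds the spool.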
