InfoScale 8.0 VVR (Volume Replicator): once DCM mode activates, it never stops updating the DCM bitmaps

martinfrancis
Level 3

InfoScale 8.0 on Red Hat 8.7

I am observing that even when the SRL is not overflowing, the primary and secondary volumes are writing into the DCM bitmaps, which is not how it should work.
The DCM should only be used during an initial sync, a resync, or an SRL overflow, and should not be written to during normal operation.
To narrow this down, I separated the data volume, the DCO volume (the DCM lives inside the DCO), and the SRL onto separate disks so I can use iostat to look at the IO.
Each disk holds only the data volume, the SRL, or the DCO volume (with the DCM) and absolutely nothing else, which lets me isolate the IO to the culprit (see the monitoring sketch below).
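For reference, the monitoring itself is roughly this (sdb, sdc and sdd are placeholders for whatever OS devices back disk1, disk2 and disk3 in your configuration):
#vxdisk -e list [maps VxVM disk names to the underlying OS device names; column layout varies by release]
#iostat -xm 2 sdb sdc sdd [watch per-device write IOPS/throughput while the test runs]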
--here is my test case--

Primary:
#vxassist -g dg1 make sourcevol2 1g layout=concat logtype=dco drl=off dcoversion=20 ndcomirror=1 regionsz=256 init=active alloc=disk1,disk2
#vxassist -g dg1 make dg1_srl 1g layout=concat init=active alloc=disk3
#vxprint -g dg1 -ht [Verified that sourcevol2, srl and dco volume are on separate disks as intended]
#/opt/VRTS/bin/mkfs -t vxfs  /dev/vx/rdsk/dg1/sourcevol2
#mount /dev/vx/dsk/dg1/sourcevol2 /sourcevol2
#vradmin -g dg1 createpri dg1_rvg sourcevol2 dg1_srl 

Secondary:
#vxassist -g dg1 make sourcevol2 1g layout=concat logtype=dco drl=off dcoversion=20 ndcomirror=1 regionsz=256 init=active alloc=disk1,disk2
#vxassist -g dg1 make dg1_srl 1g layout=concat init=active alloc=disk3
#vxprint -g dg1 -ht [Verified that sourcevol2, srl and dco volume are on separate disks as intended]


Primary:
#vradmin -g dg1 addsec dg1_rvg primarynode secondarynode
#vradmin -g dg1 -a startrep dg1_rvg secondarynode

After replication completes [verified rlink and rvg status], I run an fio workload against sourcevol2 on the primary.
Primary:
#vradmin -g dg1 -l repstatus dg1_rvg |egrep "Data status| Replication status"
Data status: consistent, up-to-date
Replication status: replicating (connected)
#vxprint -g dg1 -VPl dg1_rvg |grep flags
flags: closed primary enabled attached bulktransfer dcm_in_dco
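To double-check whether the rlink itself reports any DCM activity while fio runs, I also poll it; rlk_secondarynode_dg1_rvg below is the default-style rlink name and just a placeholder for whatever vxprint shows in your setup:
#vxrlink -g dg1 -i 5 status rlk_secondarynode_dg1_rvg [re-displays rlink status every 5 seconds]
#vxprint -g dg1 -Pl rlk_secondarynode_dg1_rvg |grep flags [rlink flags]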

Primary:
#fio --name=8krandomwrites --rw=randwrite --direct=1 --ioengine=libaio --bs=8k --numjobs=1 --iodepth=1 --size=800M --runtime=60 --group_reporting --overwrite=0 --rate_iops=80 --filename=/sourcevol2/8krandomwrites.0.0

Observed Behavior:
iostat on both the primary and the secondary shows almost the same level of write IO going against the DCO-DCM disk.

Expected Behavior:
There should NOT be constant writes going against the DCO-DCM . However not the case. I see constant writes happening on DCO-DCM disk indicating that DCM bitmaps are constantly being updated even when there is no SRL overflow.
This DCM writes happens on both primary & secondary.
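The same pattern shows up at the VxVM object level with vxstat; here sourcevol2_dcl is the default name given to the DCO volume, so substitute whatever vxprint -ht reports in your configuration:
#vxstat -g dg1 -i 2 -c 30 sourcevol2 sourcevol2_dcl dg1_srl [per-object read/write counts every 2 seconds, 30 samples]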

Further investigation 1:
What is even more odd is that I was able to fix the issue merely by unmounting the mount point, stopping the primary-side RVG, stopping the volume, then restarting the volume and the RVG and remounting, and repeating the same on the secondary. From that point on, no more DCM writes happen (the sequence I use is sketched below).
This basically points to a bug in the DCM behavior (I think). Stopping the RVG and the volume is not a practical solution.
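The workaround sequence is roughly the following (command keywords from memory, so verify them against vxrvg(1M) and vxvol(1M) on your release; the secondary needs the equivalent stop/start but has no mount point):
#umount /sourcevol2
#vxrvg -g dg1 stop dg1_rvg [stop the RVG]
#vxvol -g dg1 stop sourcevol2 [stop the data volume]
#vxvol -g dg1 start sourcevol2
#vxrvg -g dg1 start dg1_rvg
#mount /dev/vx/dsk/dg1/sourcevol2 /sourcevol2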
Further investigation 2:
I created an SRL overflow condition manually by pausing the rlink and writing enough data on the primary to overflow the SRL. I can see DCM writes happening via iostat, as expected.
I then resumed the rlink, did a resync, and waited for the SRL to drain completely.
I verified that the rlink returned to a consistent, up-to-date status.
However, the problem then happens again: writes on the primary volume trigger DCM writes, same as before.
To fix it, I repeat the same steps: stop the RVG, stop the volume, restart the volume, restart the RVG. (The commands for the overflow/resync cycle are sketched below.)
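For completeness, the overflow/resync cycle was driven with roughly these commands (again, rlk_secondarynode_dg1_rvg is a placeholder for the actual rlink name):
#vradmin -g dg1 pauserep dg1_rvg [pause replication so writes accumulate in the SRL]
[run fio against /sourcevol2 until the SRL overflows and DCM logging activates]
#vradmin -g dg1 resumerep dg1_rvg [resume replication]
#vradmin -g dg1 resync dg1_rvg [replay the DCM to the secondary]
#vxrlink -g dg1 status rlk_secondarynode_dg1_rvg [repeat until the rlink reports up to date]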

So essentially I am able to recreate this issue consistently. Once DCM bitmap logging starts, it never stops, even though replication has caught up and the SRL is not overflowing.
None of the status commands show this happening. The only way I was able to spot the oddity was by isolating the data volume, the SRL and the DCO-DCM onto separate disks.
The only way to stop it is to stop the volume and the RVG and restart them, until the issue happens again, which is not a practical solution.

4 REPLIES

sdighe1
Level 3
Employee

Hi Martin,

This looks like a product bug, and you have already described how to reproduce it consistently. I have logged a defect on your behalf and tagged it for VVR resiliency. Thanks for the great explanation; this should help in further debugging of the problem.

Here is the defect link - https://engtools.engba.veritas.com/Etrack/readonly_inc.php?sid=etrack&incident=4118233

You can subscribe to it for updates and add more details if needed.

martinfrancis
Level 3

I am more than happy to collaborate on testing and any debug effort. I am evaluating InfoScale's suitability for a major ERP migration from on-prem to Azure, leveraging InfoScale off-host backup, multiple snapshots, VVR and VCS.
Every bit of storage performance is critical, and that is how I noticed the storage consuming more IO operations than expected, which made me dig deeper and establish a clear, repeatable test case. If there is anything I can do to assist, let me know.

Is that link publicly accessible? It doesn't open for me.

martinfrancis
Level 3

The link you posted for the Etrack defect is inaccessible. Is there a public view of the same link, or any other way to track updates or progress?