
CFS umount abnormally


Hi everyone!
I have ten SFCFS nodes, all on the same AIX OS version, running SFRAC 5.1SP1RP3. This SFCFS cluster has two CFS resources running on all nodes. Recently I hit a very strange problem with service group sg-cfs-mount01 on one specific node (db2). The resource cfsmount1 belonging to service group sg-cfs-mount01 was taken offline abnormally. I checked engine_A.log, which shows an I/O test failure reported by the CVMVolDg agent. I also checked the system logs, dmpevent.log, etc., but there are no related errors. The problem occurs only on node db2 with SG sg-cfs-mount01; the same SG on the other nodes, and SG sg-cfs-mount02 on all nodes (including db2), are working normally. (This is what I find strange.)
I attached engine_A.log  CFSMount_A.log CVMVolDg_A.log etc.. in logs.tar
Thanks in advance.


3 Replies


I have seen CVMVolDg resources time out before, and this is due to busy systems, where the "dd" read on the volumes in the "CVMVolume" attribute does not have enough time to complete; this is made worse when several volumes are specified in the "CVMVolume" attribute. However, your resource is not experiencing timeouts — rather, the "dd" read itself seems to be failing. See this extract from your engine log:

VCS ERROR V-16-10011-1097 (db2) CVMVolDg:cvmvoldg1:monitor:Device /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 could not be read at offset 0

I can think of 3 causes:

  1. There is a problem with the underlying storage access from db2 so reads are failing
  2. There is a problem in the CVM stack preventing reads
  3. There is a problem with the CVMVolDg agent which is incorrectly reporting that volumes cannot be read

The dd command the agent is running is:

dd if=/dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 of=/dev/null count=1 skip=$_cdi_offset bs=1024

where $_cdi_offset is randomly generated. This may be failing for some other reason, though, as all of the errors say the read is at offset 0.

So you could try running an independent "dd" from cron every minute and see whether it fails, to help determine where the issue lies.
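A minimal standalone check along those lines might look like the sketch below. The wrapper function and log path are my own invention; the device path in the usage example is the one from your engine log, so adjust it for your system:

```shell
#!/bin/sh
# Hypothetical standalone read check, mirroring the agent's dd test.
# check_read <device> <logfile> -- appends a timestamped result line.
check_read() {
    dev=$1
    log=$2
    date >> "$log"
    if dd if="$dev" of=/dev/null count=1 bs=1024 >> "$log" 2>&1; then
        echo "READ OK: $dev" >> "$log"
    else
        echo "READ FAILED: $dev" >> "$log"
    fi
}

# Example (device path taken from the engine log extract above):
# check_read /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 /tmp/ddcheck.out
```

Run from cron every minute, the log would show whether the raw volume reads ever fail outside the agent.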

As a workaround for your current issue, you could set a non-zero ToleranceLimit on the CVMVolDg type so that a certain number of monitor failures are ignored. The downside is that if there is a real failure, failover could be delayed. You can set ToleranceLimit to 2, for example, using:

hatype -modify CVMVolDg ToleranceLimit 2
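If you want the change persisted to the cluster configuration, wrap it in the same open/dump sequence used when enabling debug logging (a sketch; run on one node while the cluster is up):

```shell
# haconf -makerw
# hatype -modify CVMVolDg ToleranceLimit 2
# haconf -dump -makero
```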





Thanks Mike.

Today I ran some tests with the dd command, and I also enabled debug mode for the CVMVolDg agent and the HAD daemon, as advised by Symantec Support in China. I don't think the storage hardware access from db2 is the problem, because the volume /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol002 on db2 is working fine and, as I mentioned above, the problematic volume cfsdg01-vol001 works fine on the other nine nodes. So I don't believe it is a physical problem. The test methods and logs are attached below:


Test for  cfsdg01-vol001


# more /tmp/ddvol

while [ 1 ]; do
/bin/sleep 1
date >> /tmp/ddvol.out
dd if=/dev/vx/dsk/yyzc-cfsdg01/cfsdg01-vol001 of=/dev/null count=1024 bs=512 >> /tmp/ddvol.out 2>&1
done

# /tmp/ddvol &


Test for a Hard Disk



# more /tmp/dddisk

while [ 1 ]; do
/bin/sleep 1
date >> /tmp/dd.disk
dd if=/dev/hdisk98 of=/dev/null count=1024 bs=512 skip=65791 >> /tmp/dd.disk 2>&1
done

# /tmp/dddisk &


Opening debug mode for the CVMVolDg agent and HAD daemon



# haconf -makerw

# /opt/VRTSvcs/bin/hatype -modify CVMVolDg LogDbg -add DBG_1 DBG_2 DBG_3 DBG_4 DBG_5 DBG_AGDEBUG DBG_AGINFO DBG_AGTRACE

# /opt/VRTSvcs/bin/haconf -dump -makero



# haconf -makerw

# halog -addtags DBG_POLICY

# halog -addtags DBG_TRACE

# halog -addtags DBG_AGTRACE

# halog -addtags DBG_AGINFO

# halog -addtags DBG_AGDEBUG

# haconf -dump -makero


Finally, the symptom appeared again at 2012/12/11 11:46:16.

I have attached all the log output from these tests, and I have also uploaded a VRTSexplorer report which I hope is useful for analyzing my problem.




As the dd's in your test were successful, there must be an issue with the agent. The agent code is located in /opt/VRTSvcs/bin/CVMVolDg, and so the "monitor" file runs:

cvmvoldg_do_iotest 0 _cvm_res


Function cvmvoldg_do_iotest comes from the file cvmvoldg.lib, which says:


# cvmvoldg_do_iotest : do IO to a set of 10 volumes.
# Takes two(2) arguments. The first is 'quiet' - whether to print a message
# about the exact IO command that failed. The second is the result, which
# will be non-zero in case of any error.
So the "monitor" file is hard-coded to run in quiet mode (zero). If you modify this file and change the 0 to a 1 (I don't know why Symantec doesn't set this to 1 in the code when debug is enabled!), you should get the exact dd command that is run, along with its output.

I suspect the offset being generated may be out of range, causing the dd command to fail.
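One way to make that one-character change safely is sketched below. The helper function name is my own; the monitor path is the one given above, and a backup copy is kept so the stock agent script can be restored afterwards:

```shell
#!/bin/sh
# Hypothetical helper: flip the quiet flag (0 -> 1) passed to
# cvmvoldg_do_iotest in the agent's monitor script, keeping a backup.
unquiet_monitor() {
    mon=$1
    cp "$mon" "$mon.orig"   # backup so the change can be reverted
    sed 's/cvmvoldg_do_iotest 0/cvmvoldg_do_iotest 1/' "$mon.orig" > "$mon"
}

# Example (path from the reply above; verify it on your system):
# unquiet_monitor /opt/VRTSvcs/bin/CVMVolDg/monitor
```

To revert, copy the `.orig` backup over the modified file once debugging is finished.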