12-10-2012 08:24 AM
12-10-2012 09:58 AM
I have seen CVMVolDg resources time out before on busy systems, where the "dd" read of the volumes in the "CVMVolume" attribute does not have enough time to complete; this is made worse when several volumes are specified in the "CVMVolume" attribute. However, your resource is not experiencing timeouts; rather, the "dd" read itself seems to be failing. See this extract from your engine log:
VCS ERROR V-16-10011-1097 (db2) CVMVolDg:cvmvoldg1:monitor:Device /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 could not be read at offset 0
I can think of 3 causes:
The dd command the agent is running is:
dd if=/dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 of=/dev/null count=1 skip=$_cdi_offset bs=1024
where $_cdi_offset is randomly generated, but this may be failing, as all the errors say they are trying to read from offset 0.
So you could try running an independent "dd" from cron every minute to see if it fails, to help determine where the issue lies.
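A minimal sketch of such a probe follows. The script name (ddprobe) is hypothetical, the device path is taken from the engine log above, and the process ID is used as a crude stand-in for the agent's randomly generated $_cdi_offset:

```shell
#!/bin/sh
# probe_volume DEVICE LOGFILE
# Reads 1 KB from DEVICE at a pseudo-random offset (the same kind of read the
# CVMVolDg agent performs) and appends a timestamped OK/FAIL line to LOGFILE.
probe_volume() {
    dev=$1
    log=$2
    offset=$(( $$ % 100 ))   # crude stand-in for the agent's random $_cdi_offset
    if dd if="$dev" of=/dev/null count=1 skip="$offset" bs=1024 2>/dev/null; then
        echo "`date`: OK   $dev offset=$offset" >> "$log"
    else
        echo "`date`: FAIL $dev offset=$offset" >> "$log"
    fi
}

probe_volume /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 /tmp/ddprobe.log
```

Saved as, say, /tmp/ddprobe, a cron entry like `* * * * * /tmp/ddprobe` would run it every minute; a FAIL line in the log that coincides with an agent error would point at the storage path rather than the agent.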
As a workaround for your current issue, you could set a non-zero ToleranceLimit on the CVMVolDg type so that a certain number of monitor failures are ignored. The downside is that if there is a real failure, failover could be delayed. You can set ToleranceLimit to 2, for example, using:
hatype -modify CVMVolDg ToleranceLimit 2
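To illustrate the trade-off, here is a toy model of how consecutive monitor failures are counted against ToleranceLimit. This is only an illustration of the documented behaviour, not the actual HAD code: the resource is declared faulted once more than ToleranceLimit consecutive monitor cycles report OFFLINE, and any clean cycle resets the count.

```shell
# Toy model of ToleranceLimit counting (illustration only, not real HAD code).
# Args: the limit, then one monitor result per argument (ONLINE or OFFLINE).
# Prints FAULTED once more than LIMIT consecutive OFFLINE results are seen,
# otherwise prints ONLINE.
tolerance_check() {
    limit=$1; shift
    count=0
    for result in "$@"; do
        if [ "$result" = "OFFLINE" ]; then
            count=$((count + 1))
            if [ "$count" -gt "$limit" ]; then
                echo FAULTED
                return 0
            fi
        else
            count=0             # a clean monitor cycle resets the counter
        fi
    done
    echo ONLINE
}
```

With ToleranceLimit set to 2, two transient dd failures in a row are absorbed, but a genuine outage takes two extra monitor cycles to surface.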
Mike
12-11-2012 12:59 AM
Thanks Mike.
Today I ran some tests using the dd command, and I also enabled debug mode for the CVMVolDg agent and the HAD daemon, as advised by Symantec Support in China. I don't believe access to the storage hardware layer from db2 is the problem, because the /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol002 volume on db2 is working nicely. As I mentioned above, the problematic volume cfsdg01-vol001 is working fine on the other nine nodes, so I don't think it is a physical problem. The test methods and logs are attached below:
Test for cfsdg01-vol001
=====================================================================
#more /tmp/ddvol
while [ 1 ]; do
/bin/sleep 1
date >> /tmp/ddvol.out
dd if=/dev/vx/dsk/yyzc-cfsdg01/cfsdg01-vol001 of=/dev/null count=1024 bs=512 >> /tmp/ddvol.out 2>&1
done
#/tmp/ddvol &
Test for a Hard Disk
=====================================================================
#more /tmp/dddisk
while [ 1 ]; do
/bin/sleep 1
date >> /tmp/dd.disk
dd if=/dev/hdisk98 of=/dev/null count=1024 bs=512 skip=65791 >> /tmp/dd.disk 2>&1
done
#/tmp/dddisk &
Enabling debug mode for the CVMVolDg agent and HAD daemon
====================================================================
# haconf -makerw
# /opt/VRTSvcs/bin/hatype -modify CVMVolDg LogDbg -add DBG_1 DBG_2 DBG_3 DBG_4 DBG_5 DBG_AGDEBUG DBG_AGINFO DBG_AGTRACE
# /opt/VRTSvcs/bin/haconf -dump -makero
# haconf -makerw
# halog -addtags DBG_POLICY
# halog -addtags DBG_TRACE
# halog -addtags DBG_AGTRACE
# halog -addtags DBG_AGINFO
# halog -addtags DBG_AGDEBUG
# haconf -dump -makero
Finally, the symptom appeared again at 2012/12/11 11:46:16.
I have attached all the log output from these tests, and I have also uploaded VRTSexplorer logs, which I hope will be useful for analyzing my problem.
THX.
12-11-2012 02:03 AM
As the dd runs in your test were successful, there must be an issue with the agent. The agent code is located in /opt/VRTSvcs/bin/CVMVolDg, and the file "monitor" runs:
cvmvoldg_do_iotest 0 _cvm_res
Function cvmvoldg_do_iotest is from the file cvmvoldg.lib, which says:
# cvmvoldg_do_iotest : do IO to a set of 10 volumes.
# Takes two(2) arguments. The first is 'quiet' - whether to print a message
# about the exact IO command that failed. The second is the result, which
# will be non-zero in case of any error.
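For readers without the product installed, the behaviour that comment describes can be sketched roughly as follows. This is an approximation, not the actual cvmvoldg.lib code: the real function stores its result in the named variable passed as its second argument, whereas this sketch simply uses the exit status, and the offset here is a stand-in for the agent's random $_cdi_offset.

```shell
# Rough sketch of the iotest behaviour (NOT the real cvmvoldg_do_iotest).
# $1 is 'quiet' (1 = suppress the message about the failed IO command);
# remaining args are the volume device paths to probe.
cvm_iotest_sketch() {
    quiet=$1; shift
    rc=0
    for vol in "$@"; do
        offset=$(( $$ % 100 ))   # stand-in for the agent's random $_cdi_offset
        if ! dd if="$vol" of=/dev/null count=1 skip="$offset" bs=1024 2>/dev/null
        then
            if [ "$quiet" -eq 0 ]; then
                echo "IO failed: dd if=$vol count=1 skip=$offset bs=1024" >&2
            fi
            rc=1
        fi
    done
    return $rc
}
```

If a volume that reads cleanly by hand still fails inside the agent's version of this loop, that would localize the fault to the agent's environment (offset generation, device path, or the raw-device read) rather than the storage itself.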