VCS disk group resource shows as offline, but everything is fine at the Veritas level.

Hi,

We have one RAC cluster with two nodes. On the slave node, all of a sudden, the application went down. According to engine_A.log, VCS recognized the disk group resource as offline, but all the volumes and the disk group are actually accessible on both nodes.

We were able to bring the application back only after restarting the service group (offline and online). The VRTSvcs and VRTSvxvm versions are listed below. Please let us know what could be causing this issue.


engine_A.log entry:
2011/02/20 11:04:14 VCS INFO V-16-1-10307 Resource oratruescp-vdg (Owner: unknown, Group: oratruescp-psg) is offline on HOST 2 (Not initiated by VCS)

 $ pkginfo -l VRTSvcs
   PKGINST:  VRTSvcs
      NAME:  Veritas Cluster Server by Symantec
  CATEGORY:  system
      ARCH:  sparc
   VERSION:  5.0
   BASEDIR:  /
    VENDOR:  Symantec Corporation
      DESC:  Veritas Cluster Server by Symantec
    PSTAMP:  Veritas-5.0MP1-11/29/06-17:15:00
  INSTDATE:  May 29 2008 11:08
    STATUS:  completely installed
     FILES:      160 installed pathnames
                  22 shared pathnames
                   2 linked files
                  45 directories
                  83 executables
              142180 blocks used (approx)

 $ pkginfo -l VRTSvxvm
   PKGINST:  VRTSvxvm
      NAME:  Binaries for VERITAS Volume Manager by Symantec
  CATEGORY:  system
      ARCH:  sparc
   VERSION:  5.0,REV=05.11.2006.17.55
   BASEDIR:  /
    VENDOR:  Symantec Corporation
      DESC:  Virtual Disk Subsystem
    PSTAMP:  Veritas-5.0_MP1_RP3.2:2007-08-28
  INSTDATE:  May 29 2008 12:03
   HOTLINE:  http://support.veritas.com/phonesup/phonesup_ddProduct_.htm
     EMAIL:  support@veritas.com
    STATUS:  completely installed
     FILES:      965 installed pathnames
                  30 shared pathnames
                  13 linked files
                 106 directories
                 407 executables
              391560 blocks used (approx)

Accepted Solution!
The 5.0 CVMVolDg agent does a "dd" read of the volumes specified by the CVMVolume attribute to determine whether the resource is online (as opposed to just checking that the disk group is imported, as the DiskGroup agent does), so if the read of any of these volumes fails, the resource will fault.  I have not seen the "dd" read fail outright, but I have seen it time out - if this is the issue, you will see something in the engine_A.log like:

Monitor timed out (you will see this 4 times at 1-minute intervals, assuming default type attributes)

Then I think you will see something like "Monitor timed out 4 times, so as FaultOnMonitorTimeout=4, resource is faulting"

Then you will see "Resource offline - Not initiated by VCS"
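To make the check concrete, here is a rough, illustrative sketch of the kind of read the monitor performs - not the agent's actual code. A real CVM volume device lives under /dev/vx/rdsk/<diskgroup>/<volume>; /dev/zero stands in here so the snippet is runnable anywhere:

```shell
#!/bin/sh
# Illustrative sketch only -- not the CVMVolDg agent's actual code.
# The monitor reads a small amount of data from each volume listed in
# CVMVolume to prove the volume is usable. A real device path would be
# /dev/vx/rdsk/<diskgroup>/<volume>; /dev/zero is a runnable stand-in.
VOL=${VOL:-/dev/zero}

if dd if="$VOL" of=/dev/null bs=1024 count=1 2>/dev/null; then
    echo "volume readable: ONLINE"
else
    echo "volume read failed: FAULT"
fi
```

If such a read hangs (for example, because the array is saturated by a backup), the monitor entry point exceeds MonitorTimeout and VCS counts a timed-out monitor cycle.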

I have seen this happen when a backup kicked in and degraded performance so much that the dd reads timed out - in particular when there are lots of volumes specified by the CVMVolume attribute, as the agent doesn't have time to read all the volumes within the monitor timeout.

If you have more than one volume specified in the CVMVolume attribute, I would recommend changing it so it contains just one volume.
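For example, using the resource name from the original post (oratruescp-vdg) and a placeholder volume name (vol01 - substitute one of your own volumes), the change would look something like:

```shell
# Hypothetical example: "oratruescp-vdg" is the resource name from the
# post above; "vol01" is a placeholder -- use one of your own volumes.
haconf -makerw
hares -modify oratruescp-vdg CVMVolume vol01
haconf -dump -makero
```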

Mike

 

4 Replies

Hi Mike,

Thanks for the immediate response. You pointed out exactly the problem I am facing. In my setup, all the shared volumes are mentioned in the CVMVolume attribute.

I am not sure why all the volumes were added to CVMVolume. I will check other setups as well before removing volumes from CVMVolume.

Use all the critical volumes in the CVMVolume attribute

Hi,

 

As explained by Mike, the CVMVolDg agent does a dd test to make sure the shared volumes are active.

We normally suggest adding all the critical volumes to this list, so that any fault on a critical volume is acted on immediately.

We can increase the MonitorTimeout value of the CVMVolDg type if this issue keeps recurring.

The commands below increase the values from 1 minute to 2 minutes:

 

# haconf -makerw

# hatype -modify CVMVolDg MonitorTimeout 120

# hatype -modify CVMVolDg MonitorInterval 120

# haconf -dump -makero
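To confirm the change took effect, you can query the current type attribute values (standard hatype usage - verify the syntax against your VCS version's man page):

```shell
# Display the current values after the change.
hatype -value CVMVolDg MonitorTimeout
hatype -value CVMVolDg MonitorInterval
```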

 

 

 

Regards

Srini


Hi Srini,

In our setup, all the volumes are mentioned in the attribute, and they are around 10 TB in size. Do you still suggest that just changing the monitor interval and timeout will be sufficient?

Mike, Srini,

Could you please share any document that gives a detailed description of the "dd on volumes" check performed on the volumes listed in the CVMVolume attribute?

Thanks in Advance.