Forum Discussion

lukaskison
11 years ago

vxdisk list shows errors on multiple disks, and I am unable to start the cluster on the slave node.

Hello,

If anybody has had the same experience and can help me, I would be very thankful.

I am using Solaris 10 (x86, 141445-09) + EMC PowerPath (5.5.P01_b002) + VxVM (5.0,REV=04.15.2007.12.15) on a two-node cluster.

This is a fileserver cluster.

I've added a couple of new LUNs, and when I tried to scan for the new disks, the "vxdisk scandisks" command hung. From that point on I was unable to do any VxVM job on that node; every command hung.

I've rebooted the server in a maintenance window (before the reboot I switched all SGs to the 2nd node).

After that reboot I am unable to join the cluster, with this reason:

2014/04/13 01:04:48 VCS WARNING V-16-10001-1002 (filesvr1) CVMCluster:cvm_clus:online:CVMCluster start failed on this node.
2014/04/13 01:04:49 VCS INFO V-16-2-13001 (filesvr1) Resource(cvm_clus): Output of the completed operation (online)
ERROR:
2014/04/13 01:04:49 VCS ERROR V-16-10001-1005 (filesvr1) CVMCluster:???:monitor:node - state: out of cluster
reason: Cannot find disk on slave node: retry to add a node failed 

 

Apr 13 01:10:09 s_local@filesvr1 vxvm: vxconfigd: [ID 702911 daemon.warning] V-5-1-8222 slave: missing disk 1306358680.76.filesvr1
Apr 13 01:10:09 s_local@filesvr1 vxvm: vxconfigd: [ID 702911 daemon.warning] V-5-1-7830 cannot find disk 1306358680.76.filesvr1
Apr 13 01:10:09 s_local@filesvr1 vxvm: vxconfigd: [ID 702911 daemon.error] V-5-1-11092 cleanup_client: (Cannot find disk on slave node) 222

 

Here is the output from the 2nd node (working fine):

 

Disk:   emcpower33s2
type:   auto
flags:  online ready private autoconfig shared autoimport imported
guid:   {665c6838-1dd2-11b2-b1c1-00238b8a7c90}
udid:   DGC%5FVRAID%5FCKM00111001420%5F6006016066902C00915931414A86E011
site:    -
diskid: 1306358680.76.filesvr1
dgname: fileimgdg
dgid:   1254302839.50.filesvr1
clusterid: filesvrvcs
info:   format=cdsdisk,privoffset=256,pubslice=2,privslice=2

and here is the output from the node where I see the problems:

 

Device:    emcpower33s2
devicetag: emcpower33
type:      auto
flags:     error private autoconfig
pubpaths:  block=/dev/vx/dmp/emcpower33s2 char=/dev/vx/rdmp/emcpower33s2
guid:      {665c6838-1dd2-11b2-b1c1-00238b8a7c90}
udid:      DGC%5FVRAID%5FCKM00111001420%5F6006016066902C00915931414A86E011
site:      -
errno:     Configuration request too large
Multipathing information:
numpaths:   1
emcpower33c     state=enabled

 

Can anybody help me?

I am not sure what "Configuration request too large" means.

 

  • 1. Run "vxdisk -o alldgs list" on the second (problem) node to identify the diskgroups that currently
    have one or more disks in error state.

    2. Execute "vxdg -g <dgname> flush" from the CVM master for the dgs identified in step 1.

    3. Try to online the cvm SG via VCS (hagrp -online cvm -sys node2).
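    The three steps above can be sketched as a small shell session. The `vxdisk -o alldgs list` sample below is invented for illustration (real device and diskgroup names will differ), and the step 2/3 commands appear only as comments, since they are meaningful only on the cluster itself:

```shell
# Hypothetical sample of `vxdisk -o alldgs list` output, mimicking the
# situation in this thread; device and dg names are made up.
vxdisk_out='DEVICE       TYPE            DISK         GROUP        STATUS
emcpower32s2 auto:cdsdisk    fileimgdg01  fileimgdg    online shared
emcpower33s2 auto            -            (fileimgdg)  error
emcpower34s2 auto            -            (fileimgdg)  error'

# Step 1: list the diskgroups that have one or more disks in error state.
# Not-imported dgs are shown in parentheses; strip them before printing.
echo "$vxdisk_out" | awk '$NF == "error" { gsub(/[()]/, "", $4); print $4 }' | sort -u

# Step 2 (on the CVM master; not runnable here):  vxdg -g fileimgdg flush
# Step 3 (via VCS):                               hagrp -online cvm -sys node2
```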

  • I think that vxfen is causing all my problems, because I can access the error disks via the format command. I can see the disk size, layout, partitioning, everything. But I have no idea how to confirm that vxfen is really the cause.

  • Well, we can't say that I/O fencing is the cause; if a reservation were the issue, even format would have problems reading the disks. Can you get the outputs below to confirm the I/O fencing bit? (The commands below won't cause any harm; they just read keys from the disks.)

    Create a file with some error disks in it

    # cat /tmp/diskfile
    /dev/rdsk/c5t15d64s0
    /dev/rdsk/c5t15d65s0
    /dev/rdsk/c5t15d66s0   (you can get cxtxdx names with vxdisk -e list)

     

    # vxfenadm -s all -f /tmp/diskfile

    # vxfenadm -r all -f /tmp/diskfile

     

    G

  • OK, my bad ... -s came post-5.1.

    try

    # vxfenadm -g all -f /tmp/diskfile    ( -g for registration keys)

    # vxfenadm -r all -f /tmp/diskfile     (-r for reservation keys)

     

    G
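    Building /tmp/diskfile by hand gets tedious with many disks; a sketch of automating it from `vxdisk -e list` output follows. The sample output and device names here are invented (on a real system the last column holds the cXtXdX native name), so treat this only as a pattern:

```shell
# Hypothetical `vxdisk -e list` output; the last column is the OS native
# name that vxfenadm expects. Names are invented for illustration.
vxdisk_e_out='DEVICE       TYPE           DISK        GROUP        STATUS     OS_NATIVE_NAME
emcpower33s2 auto           -           (fileimgdg)  error      c5t15d64s2
emcpower34s2 auto           -           (fileimgdg)  error      c5t15d65s2'

# Build the disk file: take the native name of each error disk, swap the
# partition suffix for s0, and prefix the raw-device path.
echo "$vxdisk_e_out" \
  | awk '$5 == "error" { sub(/s[0-9]+$/, "s0", $6); print "/dev/rdsk/" $6 }' \
  > /tmp/diskfile

cat /tmp/diskfile
# Then read the keys (harmless, read-only):
#   vxfenadm -g all -f /tmp/diskfile   # registration keys
#   vxfenadm -r all -f /tmp/diskfile   # reservation keys
```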

  • Hello all,

    "vxdg -g <disk_group> flush" from the CVM master node fixed my problem.

    All disks were visible after a rescan and the node successfully joined the cluster.

    Again, thanks for your time guys!

  • Just to explain why `vxdg flush` worked and why the joiner node initially failed to join the cluster (i.e., the solution provided by rsharma1).

    Details:

    As part of the join protocol:

    - The CVM master sends the joiner the list of online disks (from imported shared disk groups) that it can see.

    - The joiner then checks whether it can see those disks. If not, it creates a dummy entry by fetching some basic properties of the disks from the other nodes of the cluster.

     

    Why the node join failed in this case:

    - There were some disks on the master that were in online state but actually missing connectivity globally (on all nodes of the cluster), and hence were potential candidates to be detached (error state). But we don't proactively detach disks (for some valid reasons); a disk only gets detached as part of I/O on that disk.

    - So any joiner is expected to have connectivity to these disks as well (they are online on the master), or the join will fail.

     

    How `vxdg flush` solved this problem:

    - vxdg flush triggers some private-region I/Os on the disks as part of refreshing their contents.

    - And, as mentioned above, these I/Os detach the globally (all nodes of the cluster) disconnected disks on the master.

    NOTE: We have handled some of these (likely) scenarios proactively, but not all.
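    The join check described above can be illustrated with a toy comparison (not VxVM code; disk IDs are invented): the joiner fails when the master's online list contains a disk the joiner cannot see.

```shell
# Disks the master reports as online in the imported shared dg.
master_online='1306358680.74.filesvr1
1306358680.75.filesvr1
1306358680.76.filesvr1'

# Disks the joiner can actually see after its scan.
joiner_sees='1306358680.74.filesvr1
1306358680.75.filesvr1'

echo "$master_online" > /tmp/master.txt
echo "$joiner_sees"   > /tmp/joiner.txt

# Lines only in the master's list: online on the master, missing on the
# joiner -> exactly the "Cannot find disk on slave node" situation.
missing=$(comm -23 /tmp/master.txt /tmp/joiner.txt)
echo "$missing"
```

    After `vxdg flush` detaches the globally disconnected disk on the master, it drops out of the master's online list and the comparison comes up empty, so the join succeeds.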

     

    Thanks & Regards,

    Pankaj Tandon (CVM Team)