05-13-2014 04:29 AM
I am testing FSS (Flexible Shared Storage) with SF 6.1 on RHEL 5.5 in VirtualBox VMs, and when I try to start CVM on the remote node I get:
VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster reason: Disk for disk group not found: retry to add a node failed
Here is my setup:
Node A is the master, with a local disk (sdd) and a remote disk (B_sdd):
[root@r55v61a ~]# vxdctl -c mode
mode: enabled: cluster active - MASTER
master: r55v61a
[root@r55v61a ~]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
B_sdd        auto:cdsdisk    -            -            online remote
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    -            -            online exported
Node B is the slave, and sees a local disk (sdd) and a remote disk (A_sdd):
[root@r55v61b ~]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
A_sdd        auto:cdsdisk    -            -            online remote
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    -            -            online exported
On node A, I create an FSS diskgroup, so on node A the disk is local:
[root@r55v61a ~]# vxdg -s -o fss=on init fss-dg fd1_La=sdd
[root@r55v61a ~]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
B_sdd        auto:cdsdisk    -            -            online remote
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    fd1_La       fss-dg       online exported shared
And on node B the disk in fss-dg is remote
[root@r55v61b ~]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
A_sdd        auto:cdsdisk    fd1_La       fss-dg       online shared remote
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    -            -            online exported
I then stop and start VCS on node B which is when I see the issue:
2014/05/13 12:05:23 VCS INFO V-16-2-13716 (r55v61b) Resource(cvm_clus): Output of the completed operation (online)
==============================================
ERROR:
==============================================
2014/05/13 12:05:24 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
If I destroy the fss-dg diskgroup on node A, then CVM will start on node B. So the issue lies with the FSS diskgroup: it seems CVM cannot find the remote disk in the diskgroup.
I can also get round the issue by stopping VCS on node A, after which CVM will start on node B:
[root@r55v61b ~]# hagrp -online cvm -sys r55v61b
[root@r55v61b ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    -            -            online exported
If I then start VCS on node A, then B is able to see the FSS diskgroup:
[root@r55v61b ~]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
A_sdd        auto:cdsdisk    fd1_La       fss-dg       online shared remote
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    -            -            online exported
I can stop and start VCS on each node when the disks are just exported, and each node is able to see the disk from the other node; but once I create the FSS diskgroup, CVM won't start on the system that has the remote disk. Does anybody have any ideas as to why?
Mike
05-21-2014 05:15 AM
The issue in this post, which was that CVM would not start if an FSS diskgroup was present, giving the error message:
CVMCluster:cvm_clus:monitor:node - state: out of cluster reason: Disk for disk group not found: retry to add a node failed
was resolved by recreating a separate diskgroup that was purely CVM (no exported disks). The likely cause was UDID mismatches or conflicts. With non-FSS failover and CVM diskgroups, it would appear that all that is required is that VxVM can read the private region; but with FSS diskgroups, my theory is that the UDID must be used to ensure that an exported disk only shows as a remote disk on other systems if the same disk can NOT be seen on the SAN of the remote system, and VxVM needs the UDID to determine this.
Hence in VirtualBox, the same disk will normally show as having a different UDID when viewed from different systems, and when the disk was shared I did indeed see a single disk presented on one server BOTH via the SAN and as a remote disk. But when I made the UDIDs the same, by changing the hostname of one of the nodes so both nodes had the same hostname and hence the same constructed UDID, VxVM correctly identified that the remote disk was available via the SAN, and hence ONLY showed the disk as a SAN-attached disk and not also as a remote disk.
Although in my opening post I was not exporting any shared SAN disks (only local disks), I believe the UDID checking performed when autoimporting the diskgroups caused the issue.
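To illustrate the theory, here is a quick decode of the UDID string from the vxdisk output earlier in the thread. Note the second UDID is a hypothetical example of what node B might construct for the same VirtualBox disk, and the comparison rule is my assumption about VxVM's behaviour, not its actual code:

```shell
#!/bin/bash
# VxVM stores UDIDs percent-encoded; decode the real one from
# 'vxdisk list sdd' earlier in the thread for readability.
udid_a='ATA%5FVBOX%20HARDDISK%5FOTHER%5FDISKS%5Fr55v61a.localdomain%5F%2Fdev%2Fsdd'
decoded=$(printf '%b' "${udid_a//%/\\x}")
echo "$decoded"
# -> ATA_VBOX HARDDISK_OTHER_DISKS_r55v61a.localdomain_/dev/sdd

# Hypothetical UDID the same disk might get on node B: the constructed
# UDID embeds the local hostname, so the two nodes disagree.
udid_b='ATA%5FVBOX%20HARDDISK%5FOTHER%5FDISKS%5Fr55v61b.localdomain%5F%2Fdev%2Fsdd'
if [ "$udid_a" = "$udid_b" ]; then
    # Matching UDIDs: VxVM can tell it is the same disk, so it shows
    # only as SAN-attached and no remote disk is created.
    echo "same disk: SAN-attached only"
else
    # Mismatched UDIDs: the duplicate check fails and the one disk is
    # presented both via the SAN and as a remote disk.
    echo "UDID mismatch: disk also presented as a remote disk"
fi
```

You can see the hostname and device path sitting inside the decoded UDID, which is why changing the hostname was enough to make the two UDIDs collide (or differ).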
Mike
05-13-2014 04:46 AM
Can you please paste the output of the following commands?
From Node A :
# vxdisk list sdd
# lltstat -nvv|more
# vxprint
From Node B :
# vxdisk list A_sdd
Can you try deporting and then importing the dgs with VCS out of the picture:
# hastop -all -force
# vxdg deport fss-dg
# vxdg -s -o fss=on import fss-dg
05-13-2014 05:21 AM
Node A:
[root@r55v61a ~]# vxdctl -c mode
mode: enabled: cluster active - SLAVE
master: r55v61b
[root@r55v61a ~]# vxdisk list sdd
Device:    sdd
devicetag: sdd
type:      auto
clusterid: r55v61c1
disk:      name=fd1_La id=1399977098.91.r55v61a
group:     name=fss-dg id=1399978977.116.r55v61a
info:      format=cdsdisk,privoffset=256,pubslice=3,privslice=3
flags:     online ready private autoconfig exported shared autoimport imported
pubpaths:  block=/dev/vx/dmp/sdd3 char=/dev/vx/rdmp/sdd3
guid:      {c4025770-da89-11e3-a9f1-d5ec014b3593}
udid:      ATA%5FVBOX%20HARDDISK%5FOTHER%5FDISKS%5Fr55v61a.localdomain%5F%2Fdev%2Fsdd
site:      -
version:   3.1
iosize:    min=512 (bytes) max=1024 (blocks)
public:    slice=3 offset=2304 len=517888 disk_offset=0
private:   slice=3 offset=256 len=2048 disk_offset=0
update:    time=1399982408 seqno=0.32
ssb:       actual_seqno=0.0
headers:   0 240
configs:   count=1 len=1280
logs:      count=1 len=192
Defined regions:
 config   priv 000048-000239[000192]: copy=01 offset=000000 enabled
 config   priv 000256-001343[001088]: copy=01 offset=000192 enabled
 log      priv 001344-001535[000192]: copy=01 offset=000000 enabled
 lockrgn  priv 001536-001679[000144]: part=00 offset=000000
Multipathing information:
numpaths:   1
sdd  state=enabled
connectivity: r55v61a
[root@r55v61a ~]# lltstat -nvv
LLT node information:
    Node        State    Link  Status  Address
   * 0 r55v61a  OPEN     eth1  UP      08:00:27:C9:3A:39
                         eth2  UP      08:00:27:1A:E8:66
     1 r55v61b  OPEN     eth1  UP      08:00:27:2F:39:51
                         eth2  UP      08:00:27:15:CD:E8
     2          CONNWAIT eth1  DOWN
                         eth2  DOWN
     3          CONNWAIT eth1  DOWN
                         eth2  DOWN
[root@r55v61a ~]# vxprint -g fss-dg
TY NAME    ASSOC   KSTATE  LENGTH  PLOFFS  STATE  TUTIL0  PUTIL0
dg fss-dg  fss-dg  -       -       -       -      -       -
dm fd1_La  sdd     -       517888  -       -      -       -
[root@r55v61a ~]#
Node B:
[root@r55v61b config]# vxdctl -c mode
mode: enabled: cluster active - MASTER
master: r55v61b
[root@r55v61b config]# vxdisk list A_sdd
Device:    A_sdd
type:      auto
clusterid: r55v61c1
disk:      name=fd1_La id=1399977098.91.r55v61a
group:     name=fss-dg id=1399978977.116.r55v61a
info:      format=cdsdisk,privoffset=256,pubslice=3,privslice=3
flags:     online ready private autoconfig remote exported shared autoimport imported
guid:      {c4025770-da89-11e3-a9f1-d5ec014b3593}
udid:      ATA%5FVBOX%20HARDDISK%5FOTHER%5FDISKS%5Fr55v61a.localdomain%5F%2Fdev%2Fsdd
site:      -
version:   3.1
iosize:    min=512 (bytes) max=1024 (blocks)
public:    slice=3 offset=2304 len=517888 disk_offset=0
private:   slice=3 offset=256 len=2048 disk_offset=0
update:    time=1399982408 seqno=0.32
ssb:       actual_seqno=0.0
headers:   0 240
configs:   count=1 len=1280
logs:      count=1 len=192
Defined regions:
 config   priv 000048-000239[000192]: copy=01 offset=000000 enabled
 config   priv 000256-001343[001088]: copy=01 offset=000192 enabled
 log      priv 001344-001535[000192]: copy=01 offset=000000 enabled
 lockrgn  priv 001536-001679[000144]: part=00 offset=000000
connectivity: r55v61a
[root@r55v61b config]# hastop -all -force
[root@r55v61b config]# vxdg deport fss-dg
[root@r55v61b config]# vxdg -s -o fss=on import fss-dg
The import hangs, so after a minute I pressed <ctrl><c>, and then I see:
[root@r55v61b config]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
A_sdd        auto:cdsdisk    fd1_La       fss-dg       online shared remote
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    -            -            online exported
[root@r55v61b config]# vxprint -g fss-dg
VxVM vxprint ERROR V-5-1-582 Disk group fss-dg: No such disk group
[root@r55v61b config]# vxdg deport fss-dg
VxVM vxdg ERROR V-5-1-2275 vxdg: Disk group fss-dg: No such disk group
[root@r55v61b config]#
So vxdisk shows fss-dg is imported, but vxprint and vxdg think it is deported.
Node A shows:
[root@r55v61a ~]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
B_sdd        auto:cdsdisk    -            -            online remote
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
sdc          auto:cdsdisk    -            -            online
sdd          auto:cdsdisk    -            -            online exported
Mike
05-13-2014 09:08 AM
Mike ,
Did you observe any errors or messages in syslog while importing the dg?
What are the CVM protocol and DG versions? Please paste the output of the commands below.
# vxdctl protocolversion
# vxdg list fss-dg
05-13-2014 09:40 AM
In /var/log/messages on node B (there are no logs on node A at the time of the deport/import) I see:
May 13 16:39:53 r55v61b vxvm:vxconfigd: V-5-1-16252 Disk group deport of fss-dg succeeded.
May 13 16:40:07 r55v61b vxvm:vxconfigd: V-5-1-16765 Selecting configuration database copy from A_sdd from disks: A_sdd
May 13 16:40:07 r55v61b vxvm:vxconfigd: V-5-1-16766 Trying to import the disk group fss-dg using configuration database copy from A_sdd
This is a fresh install of 6.1 with newly created diskgroups, so the versions are the latest:
[root@r55v61b log]# vxdctl protocolversion
Cluster running at protocol 130
[root@r55v61b log]# vxdg list fss-dg
Group:     fss-dg
dgid:      1399978977.116.r55v61a
import-id: 33792.135
flags:     shared cds
version:   190
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: r55v61b=sw r55v61a=sw
ssb:            on
autotagging:    on
detach-policy: local
dg-fail-policy: obsolete
ioship: on
fss: on
storage-sources: r55v61a
copies:    nconfig=default nlog=default
config:    seqno=0.1039 permlen=1280 free=1277 templen=2 loglen=192
config disk A_sdd copy 1 len=1280 state=clean online
log disk A_sdd copy 1 len=192
[root@r55v61b log]#
I am using fencing in disabled mode and I have a normal CVM diskgroup (I truncated the output of vxdisk list). Before creating the FSS diskgroup, CVM would start normally and mount the CFS mount on both nodes.
Mike
05-13-2014 12:05 PM
I noticed that after I pressed <ctrl><c> on the import, the diskgroup eventually imported, so I checked the messages log, which showed:
May 13 16:39:53 r55v61b vxvm:vxconfigd: V-5-1-16252 Disk group deport of fss-dg succeeded.
May 13 16:40:07 r55v61b vxvm:vxconfigd: V-5-1-16765 Selecting configuration database copy from A_sdd from disks: A_sdd
May 13 16:40:07 r55v61b vxvm:vxconfigd: V-5-1-16766 Trying to import the disk group fss-dg using configuration database copy from A_sdd
May 13 16:41:02 r55v61b Had[624]: VCS CRITICAL V-16-1-50086 Mem usage on r55v61b is 91%
May 13 16:43:02 r55v61b Had[624]: VCS CRITICAL V-16-1-50086 CPU usage on r55v61b is 100%
May 13 16:43:38 r55v61b vxvm:vxconfigd: V-5-1-16254 Disk group import of fss-dg succeeded.
So it is eventually importing, and it may be taking so long due to insufficient CPU power. Having said that, I have installed a very lightweight O/S without X Windows, which runs at 99% idle without VCS running.
With VCS running (just CVM/CFS stuff) the CPU runs at 90% idle
If I online the cvm service group on node B first then it takes 20 seconds for cvm_clus resource to online (this would be importing the other normal diskgroup on just node B) and it also takes 20 seconds for cvm_clus resource to then online on Node A (this would be importing the other normal diskgroup on node A and the fss diskgroup on both systems)
I did an import again on node B and watched the CPU: it was maxed out for nearly 5 mins with vxconfigd taking 90%. To me this indicates that vxconfigd is doing something wrong, as normal operations complete in a reasonable time.
If it really does take this much CPU to import a diskgroup containing a remote disk, then are there any timeouts I can set for the cvm_clus resource? The only timeouts I can see are the type-level OnlineTimeout, which defaults to 400 seconds, and the resource-level CVMTimeout, which is 200 seconds; but the resource seems to be timing out a lot earlier than this:
2014/05/13 19:04:23 VCS NOTICE V-16-1-10301 Initiating Online of Resource cvm_clus (Owner: Unspecified, Group: cvm) on System r55v61b
2014/05/13 19:04:46 VCS WARNING V-16-20006-1002 (r55v61b) CVMCluster:cvm_clus:online:CVMCluster start failed on this node.
2014/05/13 19:04:47 VCS INFO V-16-2-13716 (r55v61b) Resource(cvm_clus): Output of the completed operation (online)
==============================================
ERROR:
==============================================
2014/05/13 19:04:48 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:05:48 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:06:48 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:06:48 VCS ERROR V-16-2-13066 (r55v61b) Agent is calling clean for resource(cvm_clus) because the resource is not up even after online completed.
2014/05/13 19:06:49 VCS INFO V-16-2-13068 (r55v61b) Resource(cvm_clus) - clean completed successfully.
2014/05/13 19:06:50 VCS INFO V-16-2-13072 (r55v61b) Resource(cvm_clus): Agent is retrying online (attempt number 1 of 2).
2014/05/13 19:07:13 VCS WARNING V-16-20006-1002 (r55v61b) CVMCluster:cvm_clus:online:CVMCluster start failed on this node.
2014/05/13 19:07:13 VCS INFO V-16-2-13716 (r55v61b) Resource(cvm_clus): Output of the completed operation (online)
==============================================
ERROR:
==============================================
2014/05/13 19:07:14 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:08:13 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:09:10 VCS INFO V-16-10031-20903 (r55v61b) CFSfsckd:vxfsckd:imf_register:/opt/VRTSamf/bin/amfregister -ipf -ouid=0,euid=0,gid=0,egid=0 -r CFSfsckd -g vxfsckd "/usr/lib/fs/vxfs/vxfsckd" -- "-p /var/adm/cfs/vxfsckd-pid"
2014/05/13 19:09:13 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:09:14 VCS ERROR V-16-2-13066 (r55v61b) Agent is calling clean for resource(cvm_clus) because the resource is not up even after online completed.
2014/05/13 19:09:15 VCS INFO V-16-2-13068 (r55v61b) Resource(cvm_clus) - clean completed successfully.
2014/05/13 19:09:15 VCS INFO V-16-2-13072 (r55v61b) Resource(cvm_clus): Agent is retrying online (attempt number 2 of 2).
2014/05/13 19:09:38 VCS WARNING V-16-20006-1002 (r55v61b) CVMCluster:cvm_clus:online:CVMCluster start failed on this node.
2014/05/13 19:09:39 VCS INFO V-16-2-13716 (r55v61b) Resource(cvm_clus): Output of the completed operation (online)
==============================================
ERROR:
==============================================
2014/05/13 19:09:39 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:10:39 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:11:39 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:11:40 VCS ERROR V-16-2-13066 (r55v61b) Agent is calling clean for resource(cvm_clus) because the resource is not up even after online completed.
2014/05/13 19:11:41 VCS INFO V-16-2-13068 (r55v61b) Resource(cvm_clus) - clean completed successfully.
2014/05/13 19:11:41 VCS INFO V-16-2-13071 (r55v61b) Resource(cvm_clus): reached OnlineRetryLimit(2).
2014/05/13 19:11:42 VCS ERROR V-16-20006-1005 (r55v61b) CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
2014/05/13 19:11:42 VCS ERROR V-16-1-54031 Resource cvm_clus (Owner: Unspecified, Group: cvm) is FAULTED on sys r55v61b
So you can see here that I get a "CVMCluster start failed on this node" error after 23 seconds, and then "node - state: out of cluster reason: Disk for disk group not found" repeated at 60-second intervals.
Mike
05-13-2014 01:55 PM
I have done some more investigation:
If I try to start CVM manually on node B with FSS diskgroup imported on node A, then I get:
# ./vxclustadm nodestate; ./vxclustadm -m vcs -t gab startnode; while true
> do
> date
> ./vxclustadm nodestate
> sleep 1
> done
state: out of cluster
reason: Disk for disk group not found: user initiated stop
VxVM vxclustadm INFO V-5-2-9687 vxclustadm: Fencing driver is in disabled mode
Tue May 13 20:40:50 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:40:51 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:40:52 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:40:54 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:40:56 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:40:58 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:40:59 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:41:00 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:41:01 BST 2014
state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
Tue May 13 20:41:02 BST 2014
state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
So it is in the joining state for just over 10 seconds and then complains "Disk for disk group not found".
If I then destroy FSS diskgroup on node A and rerun manual CVM start again as above I get:
[root@r55v61b bin]# ./vxclustadm nodestate; ./vxclustadm -m vcs -t gab startnode; while true; do date; ./vxclustadm nodestate; sleep 1; done
state: out of cluster
reason: Disk for disk group not found: retry to add a node failed
VxVM vxclustadm INFO V-5-2-9687 vxclustadm: Fencing driver is in disabled mode
Tue May 13 20:46:21 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:46:22 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:46:23 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:46:24 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:46:25 BST 2014
state: joining
reconfig: initialized
Tue May 13 20:46:27 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:46:28 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:46:29 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:46:30 BST 2014
state: joining
reconfig: vxconfigd in join
Tue May 13 20:46:31 BST 2014
state: cluster member
Tue May 13 20:46:33 BST 2014
state: cluster member
So this again is in joining state for about 10 seconds and then successfully joins CVM.
So it seems CVM quite quickly determines that, with an FSS diskgroup present, it can't see a disk (presumably the remote disk).
Do you know why this might be happening?
Thanks
Mike
05-13-2014 08:57 PM
Mike,
Please refer to the SFHA 6.1 Solutions Guide (http://www.symantec.com/docs/DOC6982) for the FSS limitations, and check whether your configuration complies with the FSS requirements. Is your storage SCSI-3 compliant?
Here are the limitations:
■ FSS is only supported on clusters of up to 8 nodes.
■ Disk initialization operations should be performed only on nodes with local
connectivity to the disk.
■ FSS does not support the use of boot disks, opaque disks, and non-VxVM disks
for network sharing.
■ Hot-relocation is disabled on FSS disk groups.
■ The vxresize operation is not supported on volumes and file systems from the
slave node.
■ FSS does not support non-SCSI3 disks connected to multiple hosts.
■ Dynamic LUN Expansion (DLE) is not supported.
■ FSS only supports instant data change objects (DCO), created using the vxsnap
operation or by specifying "logtype=dco dcoversion=20" attributes during volume
creation.
■ By default, creating a mirror between an SSD and an HDD is not supported through
vxassist, as the underlying mediatypes are different. To work around this issue,
you can create a volume with one mediatype, for instance the HDD, which is
the default mediatype, and then later add a mirror on the SSD.
For example:
# vxassist -g diskgroup make volume size init=none
# vxassist -g diskgroup mirror volume mediatype:ssd
# vxvol -g diskgroup init active volume
See the "Administering mirrored volumes using vxassist" section in the Symantec
Storage Foundation Cluster File System High Availability Administrator's Guide or
the Symantec Storage Foundation for Oracle RAC Administrator's Guide.
Also check your private interconnects, as all the metadata exchanges happen over them in an FSS configuration. I recommend you check that your LLT links and their settings (speed, autoneg, MTU, switch settings, etc.) are as desired.
Please paste the output of:
# ifconfig -a
05-14-2014 12:35 AM
Hi Novonil,
My disks are not SCSI3, but they are also not "connected to multiple hosts". I guess this is the point of FSS: you don't need disks in an array; you can use local disks (which probably are not going to be SCSI3).
Just to clarify, as in my opening post, I am testing FSS in VirtualBox, so the 2 VMs are running on my laptop using virtual networks and virtual disks. So I am not expecting this to be supported, as it is not supportable from a redundancy point of view (my laptop is a SPOF) or a data-protection point of view (I am using fencing in disabled mode, which gives me no protection against split brain, so in a real environment I should be using I/O fencing or CPS).
FSS was demonstrated to me at Vision running on virtual hardware, which I believe was VMware ESX, and which I presume was using vmdks, which are not SCSI3.
I installed SFCFS HA on Node A and then cloned it to create Node B, so they should be identically configured - below is output of ifconfig -a:
Node A:
[root@r55v61a log]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 08:00:27:7A:FF:1E
          inet addr:192.168.56.51  Bcast:192.168.56.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe7a:ff1e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9744 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7611 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:806345 (787.4 KiB)  TX bytes:1783862 (1.7 MiB)

eth0:0    Link encap:Ethernet  HWaddr 08:00:27:7A:FF:1E
          inet addr:192.168.56.55  Bcast:192.168.56.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1      Link encap:Ethernet  HWaddr 08:00:27:C9:3A:39
          inet6 addr: fe80::a00:27ff:fec9:3a39/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:113984 errors:0 dropped:0 overruns:0 frame:0
          TX packets:121992 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:15288510 (14.5 MiB)  TX bytes:18173014 (17.3 MiB)

eth2      Link encap:Ethernet  HWaddr 08:00:27:1A:E8:66
          inet6 addr: fe80::a00:27ff:fe1a:e866/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:114602 errors:0 dropped:0 overruns:0 frame:0
          TX packets:121423 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:15335104 (14.6 MiB)  TX bytes:17804072 (16.9 MiB)

eth3      Link encap:Ethernet  HWaddr 08:00:27:B8:54:AE
          inet addr:192.168.57.51  Bcast:192.168.57.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:feb8:54ae/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:720 (720.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:17 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1796 (1.7 KiB)  TX bytes:1796 (1.7 KiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Node B:
[root@r55v61b bin]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 08:00:27:68:CC:26
          inet addr:192.168.56.52  Bcast:192.168.56.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe68:cc26/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:23233 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21826 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1899658 (1.8 MiB)  TX bytes:10908440 (10.4 MiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:2F:39:51
          inet6 addr: fe80::a00:27ff:fe2f:3951/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:140787 errors:0 dropped:0 overruns:0 frame:0
          TX packets:132047 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21324780 (20.3 MiB)  TX bytes:19628051 (18.7 MiB)

eth2      Link encap:Ethernet  HWaddr 08:00:27:15:CD:E8
          inet6 addr: fe80::a00:27ff:fe15:cde8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:140200 errors:0 dropped:0 overruns:0 frame:0
          TX packets:132810 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21113691 (20.1 MiB)  TX bytes:19811643 (18.8 MiB)

eth3      Link encap:Ethernet  HWaddr 08:00:27:00:F7:01
          inet addr:192.168.57.52  Bcast:192.168.57.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe00:f701/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:360 (360.0 b)  TX bytes:720 (720.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:38 errors:0 dropped:0 overruns:0 frame:0
          TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4040 (3.9 KiB)  TX bytes:4040 (3.9 KiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
FSS and CVM are working in the sense that I can mount a filesystem on both nodes and write to it from both nodes; it is just the starting of CVM that is not working for FSS. See below:
Node A:
[root@r55v61a ~]# df
Filesystem                    1K-blocks    Used Available Use% Mounted on
/dev/sda2                       8022104 1531008   6077024  21% /
/dev/sda1                        101086   11792     84075  13% /boot
tmpfs                            206120       0    206120   0% /dev/shm
/dev/vx/dsk/fss-dg/fssvol1        16384    5488     10478  35% /fss1
/dev/vx/dsk/cvm-dg/volshare1       5120    3820      1300  75% /share1
[root@r55v61a ~]# echo test > /fss1/created_from_a
[root@r55v61a ~]# echo test > /share1/created_from_a
Node B:
[root@r55v61b /]# df
Filesystem                    1K-blocks    Used Available Use% Mounted on
/dev/sda2                       8022104 1534184   6073848  21% /
/dev/sda1                        101086   11792     84075  13% /boot
tmpfs                            190632       0    190632   0% /dev/shm
/dev/vx/dsk/fss-dg/fssvol1        16384    5488     10478  35% /fss1
/dev/vx/dsk/cvm-dg/volshare1       5120    3820      1300  75% /share1
[root@r55v61b /]# echo test > /fss1/created_from_b
[root@r55v61b /]# echo test > /share1/created_from_b
[root@r55v61b /]# ls -l /fss1 /share1
/fss1:
total 2
-rw-r--r-- 1 root root  5 May 13 23:25 created_from_a
-rw-r--r-- 1 root root  5 May 13 23:26 created_from_b
drwxr-xr-x 2 root root 96 May 13 22:55 lost+found

/share1:
total 2
-rw-r--r-- 1 root root  5 May 13 23:25 created_from_a
-rw-r--r-- 1 root root  5 May 13 23:26 created_from_b
drwxr-xr-x 2 root root 96 May 10 14:13 lost+found
Node A:
[root@r55v61a ~]# ls -l /fss1 /share1
/fss1:
total 2
-rw-r--r-- 1 root root  5 May 13 23:25 created_from_a
-rw-r--r-- 1 root root  5 May 13  2014 created_from_b
drwxr-xr-x 2 root root 96 May 13 22:55 lost+found

/share1:
total 2
-rw-r--r-- 1 root root  5 May 13 23:25 created_from_a
-rw-r--r-- 1 root root  5 May 13  2014 created_from_b
drwxr-xr-x 2 root root 96 May 10 14:13 lost+found
Mike
05-14-2014 10:54 PM
Mike,
Please follow this deployment guide for configuring SFHA on VMware vmdks, just to ensure you are not missing anything:
https://www-secure.symantec.com/connect/articles/storage-foundation-cluster-file-system-ha-vmware-vmdk-deployment-guide
05-15-2014 01:32 AM
I have had a look at this, but it is mostly not applicable, because FSS, in terms of writing to the disks, is actually working too; what is not working is CVM membership. How is this meant to work with FSS? Here is my understanding of what happens when a system tries to join the membership:
In regular CVM, the joining system asks the CVM master what disks it has in its shared disk groups, and the join is only successful if the joining system sees all of these disks.
For FSS with exported disks on the master but NO shared diskgroups: when the joining system asks the CVM master what disks it has in its shared disk groups, the master replies that it has no shared diskgroups, so the joiner joins, and ONLY after it joins does it see the exported disks.
For FSS with exported disks on the master in a shared diskgroup: when the joining system asks the CVM master what disks it has in its shared disk groups, the master replies that it has a disk (an exported disk), but the joining system won't see this disk as it is not a member yet. So what appears to be happening is that the joining system reports it can't see this disk and won't join.
Clearly this can't be how FSS works, but this appears to be what is happening, as the joining system IS reporting "Disk for disk group not found". Can you elaborate on how the joining process works?
Mike
05-15-2014 03:47 AM
I have made some progress:
If you look at my opening post you will see I only have one disk in the fss-dg diskgroup, which is from node A; so I added the exported /dev/sdd disk on node B to the fss-dg diskgroup, so that the diskgroup now contains a disk from each system.
Initially this did not help: when restarting CVM on node B, it still couldn't join. I then stopped CVM on node A. When I restarted CVM on node A, I expected it not to start, because fss-dg had an exported disk from node B, but CVM did start without any errors; however it doesn't import fss-dg, as I guess it does not have all the disks (it is missing the exported disk from node B). Then I started CVM on node B and it started, as I guess fss-dg was not imported. After CVM started on node B, both systems could see all disks in fss-dg, and fss-dg auto-imported. I then tried restarting CVM on node B and it won't start, as before.
I then stopped CVM on both nodes and started CVM on node A; the diskgroup does not import, as above. I then tried to online the CFS service group (created using cfsmntadm) and this does not online, as I expected, since the CVMVolDg resource only activates the shared diskgroup; it does not import it. This is an issue for a campus cluster: if only one node comes up when the cluster starts, you won't be able to access your storage. But FSS is supposed to work with campus clusters, as per the Solutions Guide:
FSS lets you configure an Active/Active campus cluster configuration with nodes
across the site. Network sharing of local storage and mirroring across sites provides
a disaster recovery solution without requiring the cost and complexity of Fibre
Channel connectivity across sites.
So I need to understand how this is meant to work so I can figure out what is going wrong in my set-up. I'll start a new post to find out how campus clusters work with FSS.
I've opened a new post:
https://www-secure.symantec.com/connect/forums/how-does-fss-work-campus-cluster
Mike
05-15-2014 05:43 AM
Here is how FSS works in the case where node B is joining and it earlier had a remote disk formed from an exported disk on node A.
Step 1: During the join process on node B, node B receives the exported-disk list from node A. (If there are more nodes in the cluster, the master consolidates the lists from all other nodes and sends the result to the joiner.)
Step 2: Node B then forms remote disks on itself for the exported disks sent by node A.
Step 3: The other nodes in the cluster (apart from the joiner, i.e. node B) also form remote disks, if there are any pre-exported disks from node B.
Step 4: Node B then proceeds with the regular import process. As it has the remote disks by the time it reaches this stage, the requirement for the join is satisfied and the disk group is imported.
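The steps above can be sketched as a toy script (names are illustrative; the real work happens inside CVM/vxconfigd over the cluster interconnect):

```shell
#!/bin/sh
# Toy sketch of the FSS join flow in steps 1-4 above.
# Step 1: the exported-disk list the master (node A) sends the joiner.
master_host="A"
exported_list="sdd"

# Step 2: the joiner forms a remote disk for each exported disk,
# named <exporting-host>_<device>, matching the A_sdd device seen
# in "vxdisk list" on node B earlier in the thread.
remote_disks=""
for dev in $exported_list; do
    remote_disks="$remote_disks ${master_host}_${dev}"
done

# Step 4: with the remote disks present, the dg import can proceed.
echo "remote disks formed on joiner:$remote_disks"
```

If the join fails with "Disk for disk group not found", the implication is that step 2 (remote-disk formation) did not happen for some disk in the dg.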
----
What you are seeing on this setup does not look like expected behaviour; node B's join should succeed in your case.
The suspected reason the join is failing is that remote-disk creation on node B is somehow not happening.
05-16-2014 01:01 AM
This is now working, but not really sure why:
I took the system down, removed all shared disks and booted with just the local disks, and the problem went away. I then took the system down again and added back just the single shared disk containing the "normal" cvm disk group, and the problem came back.
I then destroyed the cvm disk group and the problem went away. I re-created the cvm disk group and the problem did not come back. I then took the system down, added all the disks back in, and the problem still did not come back.
So re-creating the "normal" cvm disk group (not the fss disk group) seems to have fixed the problem. I suspect it is related to https://www-secure.symantec.com/connect/forums/udidmismatch-using-rhel5u5-sf61-virtual-box where VirtualBox has a device-specific (not host-specific) udid, which can cause duplicate udids if the disk controllers are discovered in a different order after a reboot, so maybe the disk in the fss disk group got the same udid as the cvm disk group.
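My suspicion can be sketched as follows (illustrative only; the real udid is URL-encoded, e.g. %5F for "_", and the helper below is hypothetical):

```shell
#!/bin/sh
# Sketch of the suspected duplicate-udid scenario: OTHER_DISKS udids
# are built from hostname + device path, so the udid follows the
# /dev name, not the physical disk.
make_udid() {
    # mimics the VBOX_HARDDISK_OTHER_DISKS_<host>_<path> pattern
    # seen in the vxdisk output later in this thread
    echo "VBOX_HARDDISK_OTHER_DISKS_${1}_${2}"
}

# Before the reboot: the fss disk is /dev/sdd, and this udid gets
# stamped into its private region.
stamped_fss_udid=$(make_udid r55v61a.localdomain /dev/sdd)

# After the reboot the controllers enumerate in a different order and
# a cvm-dg disk now appears as /dev/sdd, so discovery constructs for
# it the very udid the fss disk already carries:
cvm_disk_udid=$(make_udid r55v61a.localdomain /dev/sdd)

[ "$stamped_fss_udid" = "$cvm_disk_udid" ] && echo "duplicate udid"
```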
Mike
05-16-2014 02:39 AM
Hi Mike,
I did some testing and have now read your latest post. When using VMware, one thing to do is to set the enableUUID flag to true. It is clear in your case it is not set, because the disks are listed as sdb instead of disk_0. Please set disk.EnableUUID to "TRUE" and give it another try.
I have been using that kind of environment very frequently with no issues. In fact, it is similar to what you used last week in the FSS lab at VISION. As you stated, there is no need for SCSI3 here, and no need for the multi-write flag at all. But there is a need for enableUUID.
I tried to reproduce the issue in my lab just in case, but I had no success. This is what I did:
Down is my CVM master node (this should be irrelevant, but just in case):
down~> vxdctl -c mode
mode: enabled: cluster active - MASTER
master: down
And I have a three node cluster:
down~> vxclustadm nidmap
Name      CVM Nid    CM Nid    State
down      1          0         Joined: Master
strange   2          2         Joined: Slave
up        0          1         Joined: Slave
I am going to export one of the local disks from the down server:
down~> vxdisk export disk_18
It is exported. I also have visibility of it from the other nodes:
down~> vxdisk list | grep disk_18
disk_18      auto:cdsdisk    -    -    online exported
up_disk_18   auto:cdsdisk    -    -    online remote
down~>
I am going to create a DG using only that disk as you did:
down~> vxdg -s -o fss init mikedg fd1_La=disk_18
down~>
Here the DG:
down~> vxdisk -g mikedg list
DEVICE       TYPE            DISK     GROUP    STATUS
disk_18      auto:cdsdisk    fd1_La   mikedg   online exported shared
down~>
I also have visibility from the up node:
up~> vxdisk -g mikedg list
DEVICE         TYPE            DISK     GROUP    STATUS
down_disk_18   auto:cdsdisk    fd1_La   mikedg   online shared remote
up~>
Now I stop VCS on the up node:
up~> hastop -local
This is the current situation:
down~> vxclustadm nidmap
Name      CVM Nid    CM Nid    State
down      1          0         Joined: Master
strange   2          2         Joined: Slave
up        0          1         Out of Cluster
down~>
Because up is now out of the cluster, it only has local storage visibility:
up~> vxdisk list
DEVICE        TYPE            DISK      GROUP    STATUS
disk_0        auto:cdsdisk    -         -        online exported shared
disk_1        auto:cdsdisk    -         -        online exported shared
disk_2        auto:cdsdisk    -         -        online exported shared
disk_3        auto:cdsdisk    -         -        online exported shared
disk_4        auto:cdsdisk    -         -        online exported shared
disk_5        auto:cdsdisk    -         -        online exported shared
disk_6        auto:cdsdisk    -         -        online exported shared
disk_7        auto:cdsdisk    -         -        online exported shared
disk_8        auto:cdsdisk    -         -        online exported shared
disk_9        auto:cdsdisk    -         -        online exported
disk_10       auto:cdsdisk    -         -        online exported
disk_11       auto:cdsdisk    -         -        online exported
disk_12       auto:cdsdisk    -         -        online exported
disk_13       auto:cdsdisk    -         -        online exported
disk_14       auto:cdsdisk    -         -        online exported
disk_15       auto:cdsdisk    -         -        online exported
disk_16       auto:cdsdisk    -         -        online exported
disk_17       auto:cdsdisk    -         -        online exported
disk_18       auto:cdsdisk    -         -        online exported
disk_19       auto:cdsdisk    -         -        online exported
disk_20       auto:cdsdisk    -         -        online exported
disk_21       auto:cdsdisk    disk_21   gold02   online
disk_22       auto:cdsdisk    disk_22   gold02   online
disk_24       auto:cdsdisk    disk_24   gold02   online
fusionio0_0   auto:cdsdisk    -         -        online ssdtrim exported
sda           auto:none       -         -        online invalid
up~>
Now we start the cluster again, and the up node gets visibility of mikedg again with no issues:
up~> vxdisk -g mikedg list
DEVICE         TYPE            DISK     GROUP    STATUS
down_disk_18   auto:cdsdisk    fd1_La   mikedg   online shared remote
up~>
And I can create a volume with no issues:
up~> vxassist -g mikedg make vol1 100m
up~>
If you can still reproduce the issue after setting the UUID, please send me an email so we can collect some debug logs from vxconfigd.
05-16-2014 04:02 AM
Hi Carlos,
I can't reproduce the issue since recreating the "normal" cvm disk group. I am using VirtualBox, not VMware, and I don't know if there is an equivalent to disk.EnableUUID in VirtualBox - I can't find one in the user manual. I am also confused by the different identifiers - I have created a shared disk for cvm as follows:
<HardDisk uuid="{5b05f304-093f-4631-b360-2989daee7985}" location="/media/mike/Data/VM_Images/Shared_disks/fixr61s8X_16MB.vmdk" format="VMDK" type="Shareable"/>
This disk is shown in the .vbox configuration file for the host (the equivalent of a .vmx file) for both cluster nodes:
<AttachedDevice type="HardDisk" port="8" device="0">
<Image uuid="{5b05f304-093f-4631-b360-2989daee7985}"/>
So this is showing a UUID and the vbox manual says:
VirtualBox assigns a unique identity number (UUID) to each disk image, which is also stored inside the image.
newer Linux distributions identify the boot hard disk from the ID of the drive. The ID VirtualBox reports for a drive is determined from the UUID of the virtual disk image
I am using RH 5.5, so I don't know if this counts as a "newer Linux distribution", as 5.5 is quite old.
In Linux, I can't find this UUID anywhere, but I do have a GUID and a UDID:
From Node A:
[root@r55v61a ~]# vxdisk list sdk | grep id
clusterid: r55v61c1
disk:      name=cd1_X id=1400188876.60.r55v61a
group:     name=cvm-dg id=1400194712.39.r55v61a
flags:     online ready private autoconfig udid_mismatch shared autoimport imported clone_disk
guid:      {d95c0d0c-dc76-11e3-a7b3-9a5a5203d998}
udid:      VBOX%5FHARDDISK%5FOTHER%5FDISKS%5Fr55v61a.localdomain%5F%2Fdev%2Fsdk
[root@r55v61a ~]# vxprint -l cd1_X
Disk group: cvm-dg
Disk:      cd1_X
info:      diskid=1400188876.60.r55v61a
assoc:     device=sdk type=auto
flags:     autoconfig
device:    path=/dev/vx/dmp/sdk3
devinfo:   publen=27392 privlen=1024
mediatype: hdd
udid:      VBOX%5FHARDDISK%5FOTHER%5FDISKS%5Fr55v61b.localdomain%5F%2Fdev%2Fsdk
Node B:
[root@r55v61b ~]# vxdisk list sdk | grep id
clusterid: r55v61c1
disk:      name=cd1_X id=1400188876.60.r55v61a
group:     name=cvm-dg id=1400194712.39.r55v61a
guid:      {d95c0d0c-dc76-11e3-a7b3-9a5a5203d998}
udid:      VBOX%5FHARDDISK%5FOTHER%5FDISKS%5Fr55v61b.localdomain%5F%2Fdev%2Fsdk
[root@r55v61b ~]# vxprint -l cd1_X
Disk group: cvm-dg
Disk:      cd1_X
info:      diskid=1400188876.60.r55v61a
assoc:     device=sdk type=auto
flags:     autoconfig
device:    path=/dev/vx/dmp/sdk3
devinfo:   publen=27392 privlen=1024
mediatype: hdd
udid:      VBOX%5FHARDDISK%5FOTHER%5FDISKS%5Fr55v61b.localdomain%5F%2Fdev%2Fsdk
So I have a UUID defined in the host vbox config file, which is the same for a given shared disk on both hosts, and I have a GUID shown on the host, which is also the same for a shared disk, but the UUID and GUID are different ids. The UDID is yet another identifier, but as in post https://www-secure.symantec.com/connect/forums/udidmismatch-using-rhel5u5-sf61-virtual-box this is a tuple consisting of: Vendor ID, Product ID, Cabinet Serial Number, LUN Serial Number, so I wouldn't expect it to be the same as the UUID. In the above output you can see that the UDID in the private region (shown by vxprint) is the UDID for node B, which is why node A has the "udid_mismatch" flag.
How is the UUID related to the UDID - i.e. how does enabling the UUID in VMware make the Cabinet Serial Number and LUN Serial Number unique for a given disk viewed from different hosts? I looked at a "vxdisk list" from VMware Workstation (not ESX) and this too had a UDID containing the hostname and the disk device path, so I don't understand how enabling a randomly generated UUID will change the Cabinet Serial Number from host-associated to enclosure-associated, and the LUN Serial Number from the disk device path to the LUN id.
Do you know of any Linux or Veritas commands to view the UUID (not UDID) for a disk on the host?
Thanks
Mike
05-16-2014 08:54 AM
I've found out that, at least for disks on the SATA controller in my VirtualBox VM, the disk UUID is being presented to the host, as it is used to construct the disk serial number on the host, but vxdisk list does not use this serial number in the UDID - see https://www-secure.symantec.com/connect/forums/udidmismatch-using-rhel5u5-sf61-virtual-box#comment-1...
Mike
05-19-2014 12:54 AM
Hi Mike,
I am not an expert on UUIDs and UDIDs at all. What I got from some folks is that with VirtualBox the disks are claimed in the OTHER_DISKS category, for which we generate a fake UDID value using the hostname and device name. Since on Linux the device name can change across reboots, this UDID is not reliable and can change.
When you set enableUUID=true on VMware ESXi, the virtual disk gets assigned a SCSI3 unique ID.
Carlos.
05-20-2014 12:45 PM
Hi Mike, Carlos
Regarding the UDID aspect, I think this is due to the OTHER_DISKS classification.
When VxVM does its discovery process, it tries to claim devices against the various ASLs and classify them under the relevant enclosure. If none of the ASLs are applicable (rare these days), it will claim any SCSI3 devices under the JBOD DISKS classification. If it cannot detect SCSI3 devices, it will use the OTHER_DISKS classification.
In the OTHER_DISKS classification, the UDID that is constructed uses the hostname of the system, as you indicated. As the device is being shared, each node will have a different perspective on it (different node names).
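That construction can be sketched like this (hypothetical helper; the pattern is taken from the vxdisk output earlier in the thread, where the real UDID is URL-encoded, e.g. VBOX%5FHARDDISK%5FOTHER%5FDISKS%5F...):

```shell
#!/bin/sh
# Sketch of the OTHER_DISKS udid construction: hostname + device path.
make_udid() {
    echo "VBOX_HARDDISK_OTHER_DISKS_${1}_${2}"
}

# The same shared disk, claimed on each node of the cluster:
udid_node_a=$(make_udid r55v61a.localdomain /dev/sdk)
udid_node_b=$(make_udid r55v61b.localdomain /dev/sdk)

echo "$udid_node_a"
echo "$udid_node_b"
# Different hostnames -> different udids for one physical disk, hence
# the udid_mismatch flag on one of the nodes.
```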
Is it possible to configure the virtual machines to have SCSI3-compliant disks?
cheers
tony
05-20-2014 01:30 PM
Thanks Tony - this is what I thought, as I have seen OTHER_DISKS on several live production servers for the internal disks. With VirtualBox, even if you use VMDKs, they get presented to the server as VBOX disks, and so they are not recognised by any ASL. I don't believe you can configure SCSI3-compliant virtual disks in VirtualBox or VMware. Anyway, despite the cosmetic udid_mismatch and clone-disk flags, everything seems to work, unless you get udid conflicts when you reboot and the disks come up in a different order.
One problem I have seen is that you can't export a shared disk properly if the udid doesn't match for a given disk that is shared between two hosts, so I had to change the hostname of a server so they were both the same and the disks got the same udid. As VCS uses /etc/VRTSvcs/conf/sysname, this doesn't affect VCS.
So it looks like FSS is an alternative to an RDC using VVR in synchronous mode, which is quite neat - Carlos, is this a valid use case?
Mike