We are looking for a way to recover a two-node VCS cluster at our DR location for a DR test.
Prod: Dell hardware with 24 cores and 64 GB RAM, running SUSE 11.4 Linux booted from SAN, with four 1 Gb NICs.
DR: HP hardware with 48 cores and 128 GB RAM, with SUSE 11.4 installed locally on hard drives. (The hardware changes with each DR test, but it will always be equal to or better than Prod, and it has four 1 Gb NICs as well. The OS version will not change; it will be exactly the same.)
In DR we had trouble booting the OS from the replicated SAN boot LUN; we believe the problems are caused by the change in hardware. So we installed the OS locally and then mounted the replicated boot LUN on the server under /tmp/ so that we could copy the required directories and files from it.
We then used the "cp -Rpnv <source> <destination>" command to copy all the files from the directories below on the boot LUN to the same directories of the newly installed OS on the local disk drives. (This was done without overwriting existing files in the installed, running OS's directories.)
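A minimal sketch of a pre-check for the no-clobber copy described above: it lists the files under a source directory that already exist in the target, which are the ones "cp -n" leaves untouched. The function name list_skipped and the paths in the usage note are hypothetical, not taken from the original setup.

```shell
#!/bin/sh
# list_skipped SRC DST
# Print the relative path of every regular file under SRC that already
# exists under DST. These are the files "cp -Rpn SRC/. DST/" would skip.
list_skipped() {
    src=$1
    dst=$2
    ( cd "$src" >/dev/null 2>&1 && find . -type f ) | sort |
    while IFS= read -r rel; do
        rel=${rel#./}                 # strip the leading "./" added by find
        if [ -e "$dst/$rel" ]; then
            printf '%s\n' "$rel"
        fi
    done
}
```

For example, "list_skipped /tmp/bootlun/etc /etc" (with /tmp/bootlun standing in for wherever the boot LUN is mounted) would show which Prod files in /etc were not carried over by the copy, and so may still need to be reviewed by hand.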
We then backed up the files below and overwrote them with the ones from the boot LUN, to match the Prod environment.
We also set up the local environment using the .profile files for the root user and the /etc/sktn directories, then reset the root password and rebooted the system.
(Note: We use the same network and VLANs in DR as in Prod. The only difference is that the Prod network and the DR network are not connected anywhere.)
After all this the OS comes up fine and we can log in, but the VCS cluster refuses to start and throws many errors.
Is there a way to recover the cluster by reinstalling it, or by any other method, and then just restoring the main.cf, types.cf and llttab files? (We do not have a backup solution, as everything in Prod, including the OS, is on SAN and all SAN LUNs are replicated to the DR location.)
Please help!! Thanks in advance. :)
VCS usually has heartbeat issues when the hardware changes. Start by checking that the LLT configuration still matches your hardware configuration. Once the LLT heartbeats are working, VCS should start up.
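A rough sketch of that LLT check, assuming the usual Linux llttab layout where each "link" line names its device either as an interface name or in the "eth-<MAC address>" form. The function name check_llt_links is hypothetical; it only reads files and does not touch the running LLT configuration.

```shell
#!/bin/sh
# check_llt_links LLTTAB [NETDIR]
# For each "link" line in LLTTAB, report whether the device it names exists
# on this host. NETDIR defaults to /sys/class/net (Linux sysfs).
check_llt_links() {
    llttab=$1
    netdir=${2:-/sys/class/net}
    # Field 2 is the link tag, field 3 the device (name or eth-<MAC>).
    awk '$1 == "link" { print $2, $3 }' "$llttab" |
    while read -r tag dev; do
        found=no
        case $dev in
            eth-*)
                # MAC form: compare case-insensitively with every NIC's MAC.
                mac=$(printf '%s' "${dev#eth-}" | tr 'A-F' 'a-f')
                for f in "$netdir"/*/address; do
                    [ -r "$f" ] || continue
                    if [ "$(tr 'A-F' 'a-f' < "$f")" = "$mac" ]; then
                        found=yes
                        break
                    fi
                done
                ;;
            *)
                # Name form: the interface must exist under NETDIR.
                [ -e "$netdir/$dev" ] && found=yes
                ;;
        esac
        if [ "$found" = yes ]; then
            echo "OK      $tag $dev"
        else
            echo "MISSING $tag $dev"
        fi
    done
}
```

Running "check_llt_links /etc/llttab" on a DR node would flag link lines that still point at the Prod NICs. Any line reported as MISSING would need its device updated to match the DR hardware before LLT is restarted; "lltstat -nvv" and "gabconfig -a" can then confirm that the heartbeats are up.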
This seems to be a long way to go to get DR working. You might want to look into the Global Cluster Option (GCO) for future DR needs. But keep in mind that GCO is for failing over your application with its data, not your entire OS.
To add to Wally's point:
Replication of whole clusters, regardless of the vendor, is fraught with pain. It rarely works well, as all cluster software is configured specifically for the host it runs on. If you need resiliency at your DR site, you need to automate the failover process. If your Recovery Time Objective is less than one day, you will never meet it in a time of crisis with your current setup. Way too many manual steps. Automate :)
Also, VCS has no DR entitlement, which means servers must be licensed at all times, so you might as well configure global clusters. The only cost difference is that the DR cluster is booted (electricity/HVAC costs). VCS can manage the state and direction of your array replication. In this case, only replicate the data LUNs. Having global clusters also allows for constant, automated DR testing of the application. I recommend your network team provide an extra VLAN (or firewall every server) so you can start up the application in DR without impacting production.
Based on the issue description in the post title and this line in the post:
"In DR we were facing challenges to recover boot from SAN OS with replicated LUN, we believe those changes are because of change in hardware."
the first issue you need to resolve is making the systems at the DR site bootable. Once you can boot the OS on the DR systems, you should be able to diagnose the issue further by checking the items below, in the order listed.
1. Is the SAN storage accessible? (vxdisk -o alldgs list)
2. Can the nodes heartbeat? (gabconfig -a)
3. Is VCS up and has it formed a 2-node cluster? (hastatus -sum)
4. Is the network for VVR up? (vxprint -lPV)
5. Are any VxVM objects in a disabled/error state? (vxprint -ht | egrep -i "err|fail|detach")
6. What is the state of the rvg and rlink? (vxprint -ht | egrep "^rv|^rl")
7. Run the command below on one of the nodes at the Prod site:
haclus -display | grep -i vers
Once I see the outputs above, I will have a rough picture of your environment and the status of the systems.
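The checks above could be gathered in one pass with something like the sketch below. The helper names run_check and collect_dr_checks are hypothetical; the commands themselves are the ones listed in the checklist, run unfiltered so the full output can be posted.

```shell
#!/bin/sh
# run_check LABEL COMMAND [ARGS...]
# Print a header, then the command's output. A missing or failing command
# is noted instead of aborting, so the remaining checks still run.
run_check() {
    label=$1
    shift
    echo "=== $label: $* ==="
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "(command not found: $1)"
        return 0
    fi
    "$@" 2>&1 || echo "(exit status $?)"
}

# Checklist items 1 to 6, intended to be run on each DR node.
collect_dr_checks() {
    run_check "1 storage"   vxdisk -o alldgs list
    run_check "2 heartbeat" gabconfig -a
    run_check "3 cluster"   hastatus -sum
    run_check "4 VVR"       vxprint -lPV
    run_check "5-6 VxVM"    vxprint -ht   # filter with egrep afterwards
}
```

Calling "collect_dr_checks > /tmp/dr_checks.txt" on each DR node (the output path is only an example) would produce a single file to attach to a reply. Item 7, "haclus -display | grep -i vers", still has to be run separately on a Prod node.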