Inter cluster communication

S_Herdejurgen
Level 4
Accredited Certified

I am trying to automate scripts that we run across multiple clusters during a Disaster Recovery scenario.  We have an HTC (Hitachi TrueCopy) resource on the database cluster that makes the local storage a P-VOL before importing the database disk group.  The replicated storage includes database storage on the local cluster as well as application storage located on another CFS cluster.  We currently run one script to fail over the database to the DR site.  Once the storage is failed over, we run another script on the CFS cluster to unfreeze (persistent) the service groups we need to run, 'vxdg -Cs import' the shared disk group, and run fsck on the shared file systems we are importing; then we start the shared mount points that have CVMVolDg / CFSMount resources.
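For reference, the CFS-cluster script boils down to something like this (the group, disk group and volume names are placeholders for our real ones, and the real script has error checking):

# sketch of what we currently run by hand on the CFS cluster
haconf -makerw
hagrp -unfreeze app_sg -persistent           # clear the persistent freeze
haconf -dump -makero
vxdg -Cs import appdg                        # clear host locks, import as a shared DG
fsck -F vxfs -y /dev/vx/rdsk/appdg/appvol    # check each file system before mounting
hagrp -online app_sg -sys appnode1           # bring up the CVMVolDg / CFSMount groups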

I am looking for a way to tie these two scripts together, but we have security restrictions in our environment such that there is no root-to-root communication between the clusters.  TCP port 14141 is enabled between the clusters.

Does anyone have any suggestions for kicking off the script on the second CFS cluster?  I have been able to kick off a trigger from the database cluster to the application cluster, and was thinking about invoking a preonline trigger to import the shared disk group and run fsck.  I was also thinking of invoking a postoffline trigger to deport the shared disk group, but the postoffline trigger only accepts two arguments, <system> and <group>.  One issue with triggers is that I need to make sure they only run on one node.  I would kick off the trigger on the application cluster using:

export VCS_HOST=cluster-vip
halogin admin
hatrigger -preonline 0 <CVM master> <group> IMPORT

The preonline script would then check the 4th argument and see that it is IMPORT and run the 'vxdg -Cs import' and 'fsck' commands.
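Something along these lines is what I had in mind for the trigger (a sketch only; the argument positions would need to match what hatrigger actually passes through, and the disk group / volume names are placeholders):

#!/bin/sh
# /opt/VRTSvcs/bin/triggers/preonline  (sketch)
SYS=$1
GROUP=$2
shift 2
case "$*" in
  *IMPORT*)
    # extra argument passed from the database cluster: import and check the shared storage
    vxdg -Cs import appdg
    fsck -F vxfs -y /dev/vx/rdsk/appdg/appvol
    ;;
esac
exit 0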

I guess another way to do this would be to set UserIntGlobal to 1 for the service group, which the preonline and postoffline triggers could check before importing/deporting the shared disk group.  But then I end up trying to make sure that only one system (the CVM master) runs the import command, and that only one system runs the deport command after all the other CFS service groups are offline.  In the case of these clusters, the CFS mount points will not necessarily all be mounted on the CVM master, and the CVM master won't always be the last node to offline the shared mount point.
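One way to keep the import on a single node would be for the trigger to check whether it is running on the CVM master, e.g. (disk group name is a placeholder):

# only the CVM master performs the shared import
if vxdctl -c mode | grep -q MASTER; then
    vxdg -Cs import appdg
fi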

Does anyone have any suggestions?

8 REPLIES

mikebounds
Level 6
Partner Accredited

This should all be handled by the HTC agent - you need to perform the following steps from the HTC agent guide:

 

To configure the agent in a Storage Foundation for Oracle RAC or SFCFS environment:
1. Configure the SupportedActions attribute for the CVMVolDg resource.
2. Add the following keys to the list: import, deport, vxdctlenable.
3. Use the following commands to add the entry points to the CVMVolDg resource:

   haconf -makerw
   hatype -modify CVMVolDg SupportedActions import deport vxdctlenable
   haconf -dump -makero

Note that SupportedActions is a resource type attribute and defines a list of action tokens for the resource.
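A quick way to check that the attribute took is:

hatype -value CVMVolDg SupportedActions

which should list the three action tokens.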
 
These actions will be run by the HTC agent against the CVMVolDg resources so that diskgroups are deported and imported when necessary.  The fsck will be run if necessary by the CFSMount resource.
 
You have to configure the CVMVolDg actions in order for the HTC agent to work in a CFS environment.
No cross-cluster communication is required for the agent to deport and import, but preferably you want to be using GCO or an RDC (i.e. be using the VCS DR license) so that you don't get into split-brain.  I think if you use the HTC agent you are supposed to have a DR license anyway, so you should be using a GCO or RDC setup.
 
Mike

S_Herdejurgen
Level 4
Accredited Certified

The database cluster is a GCO cluster that uses the HTC agent, but the application cluster does not run the HTC agent.  The application disks are replicated using the HTC resource on the database cluster.  I saw the mention of vxdctlenable in the HTC agent guide, but I could not find any additional information on what it actually does, so we didn't enable it.  Do you have any pointers to additional documentation about adding the vxdctlenable action to the SupportedActions attribute?  Is there any risk to adding vxdctlenable to the SupportedActions attribute at the primary site?

joseph_dangelo
Level 6
Employee Accredited

Sean,

If I am understanding your post correctly, are you managing the HORCM/TC pairs for more than one node from a single host?  i.e. does your horcm##.conf on your database cluster contain entries for LDEVs that belong to both the DB and the application/CFS cluster?  (The DB Service Groups are global and the application Service Groups are local.)

Ideally each Service Group would be configured as global and they would switch over independently of one another with separate HORCM configurations.  However, if that is not possible, the coordination between the two clusters would be best served by the RemoteGroup agent.  In theory you should be able to have the application Service Group contain a RemoteGroup resource that verifies that your database and HTC pair status are available before coming online.  You would then have to create the same configuration (resource dependency) at your remote site, as there would be no concurrency violation protection between the application Service Groups at either site (due to the fact that they are local and not global).
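As a rough illustration, the RemoteGroup resource on the application cluster could look something like the following in main.cf (all names, addresses and the encrypted password are placeholders; check the RemoteGroup agent documentation for the full attribute list):

RemoteGroup db_check (
    IpAddress = "10.1.1.10"          // virtual IP of the database cluster
    Port = 14141
    Username = dr_operator
    Password = <encrypted password>  // encrypted, not clear text
    GroupName = db_sg                // global database service group to watch
    VCSSysName = ANY
    ControlMode = MonitorOnly
    )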

The all-"global" route, however, is by far the better approach.

 

Joe D

mikebounds
Level 6
Partner Accredited

Sean,

If Joe's understanding is correct, then you would be better off having DB and App use different HORCM pairs, as Joe says, and if this is not possible then you may be able to use the RemoteGroup agent, also as Joe says.  However, you still need to address import and deport.  The way this normally works is that when you online an HTC resource in a CVM cluster, the HTC agent calls the import action to import the disk group.

This should work if you put the HTC agent in the app cluster - if you do this then the HTC agent should be able to tell that horcm has already done the takeover and will still import the CVM disk group - see the extract below from the HTC agent online script:

 if ($ret == 0) {
     VCSAG_LOG_MSG("N", "devices in group $groupname are all read/write enabled; no action is required", 18, "$groupname");
     $res->create_lockfile();
     $res->cvm_import();
     exit(0);
 }
Here you would want to use the RemoteGroup agent as Joe explains, to make sure that the DB cluster has done the horcm takeover first, as you don't want to be in a situation where App runs the horcm takeover at the DR site while DB is still running at the primary site.
But it sounds like you want to trigger the App failover from the DB cluster.  You could use the RemoteGroup agent for this too, as it can be used to online and offline remote groups as well as monitor them - however, I assume DB is a failover SG and App is a parallel SG, so you would need a separate RemoteGroup resource in the DB cluster for each system in the App cluster.
 
If you need to go down the scripting route, here are a few pointers:
  1. You can run VCS commands across clusters using halogin, as you know, but I would not use the "admin" account.  I would use an operator account that is narrowed down to just the SGs you need.  Then run halogin manually as a one-off task on the DB cluster, as root, against the App cluster using this VCS user; this creates a .vcspwd file in root's home directory on the DB cluster (do this on each node).  When you subsequently use VCS ha commands (hagrp, hares etc) with VCS_HOST set, the credentials in .vcspwd will be used.  By not using the admin account this is safer, as you are not giving the VCS user more privileges than necessary, and if the admin password is changed it won't break your scripts.
  2. You can run VCS actions across clusters to run any script you like - I have done this to run a Windows command from a UNIX cluster against a Windows VCS cluster.  To do this, on your remote cluster place the script(s) you want to run in /opt/VRTSvcs/bin/FileOnOff/actions (you will need to create the actions directory), create a FileOnOff resource and add the script names to the FileOnOff SupportedActions attribute.  So, for example, you could create a script which onlines a service group on all systems (see the sketch after this list):
    Create a FileOnOff resource in the App SG (containing the CVMVolDg and CFSMount resources and maybe HTC)
    Create a script online_all in /opt/VRTSvcs/bin/FileOnOff/actions which uses the resource name that VCS passes to the action script to work out the SG name (hares -value resname Group) and then onlines the group on all systems
    Modify the FileOnOff type: "hatype -modify FileOnOff SupportedActions online_all"

    Now from the DB cluster you can run "hares -action FileOnOff_res_name online_all" (after setting the VCS_HOST variable)
     
  3. You could write your own scripts which use the cvm_import function from the HTC agent without doing a horcm takeover, and I assume the cvm_import function deals with running the import on the CVM master.
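For illustration, the online_all action script from point 2 could look something like this (a sketch only; the resource name is passed in by VCS as the first argument, and error handling is left out):

#!/bin/sh
# /opt/VRTSvcs/bin/FileOnOff/actions/online_all  (sketch)
RES=$1
GROUP=`hares -value $RES Group`       # service group the FileOnOff resource belongs to
# SystemList comes back as "sysA 0 sysB 1 ...", so take every other field
for SYS in `hagrp -value $GROUP SystemList | awk '{for (i = 1; i <= NF; i += 2) print $i}'`
do
    hagrp -online $GROUP -sys $SYS
done
exit 0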

Mike

S_Herdejurgen
Level 4
Accredited Certified

Joe & Mike, I like the ideas you are coming up with, especially the 'hares -action' idea.  Here's a little more background.  We are in a very secure environment, so there is no hardcoding of admin or operator passwords allowed.  We will probably never allow an automatic failover between data centers.  Currently we have a three-tier architecture (web->app->db), but only the app and db tiers are clustered using VCS.  In the future we will also cluster the web tier.  We support multiple business applications which are composed of infrastructure components located in each of these tiers.  For example, application A has components on the dbcluster, appcluster1 and appcluster2, while application B has components on the dbcluster, appcluster1 and appcluster3.  Replication for all of application A's storage is controlled using a single HORCM consistency group.  Replication needs to be handled as a single consistency group and not as multiple HORCM groups.

I looked into enabling the import, deport and vxdctlenable actions as part of the CVMVolDg agent.  These actions are only defined if you have the HTC agent installed.  Since we need an HA/DR license to use the HTC agent, it isn't cost effective for us to upgrade to SFCFS HA/DR on two dozen boxes.  This would cost several hundred thousand dollars in software licensing to replace running a single script by hand.  So, this isn't a viable solution for us.

Since DR failovers are controlled by hand, I envision the solution being a VCS administrator logging into the database cluster and initiating the failover with a script.  The script would then offline service groups on appcluster1,2,3, deport the shared disk groups, and fail the database over to the remote data center using GCO.  Next, the script would import the disk groups and online the service groups on appcluster1,2,3.  The script would prompt for a VCS administrator password, so passwords won't be stored in a script.
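In rough outline, I picture the driver script looking something like this (all cluster, group and user names are placeholders, offline_all/online_all would be FileOnOff action scripts along the lines Mike describes above, and the exact GCO command will depend on the failover scenario):

#!/bin/sh
# DR driver script, run by hand on the database cluster (sketch)
stty -echo; printf "VCS operator password: "; read PASSWD; stty echo; echo

# 1. offline the application tiers and deport their shared disk groups
for CL in appcluster1 appcluster2 appcluster3
do
    VCS_HOST=${CL}-vip; export VCS_HOST
    halogin dr_operator "$PASSWD"
    hares -action ${CL}_fileonoff offline_all
done

# 2. fail the database over to the DR site via GCO (run locally as root)
unset VCS_HOST
hagrp -switch db_global_sg -any -clus dr_cluster

# 3. import the shared disk groups and online the application tiers again
for CL in appcluster1 appcluster2 appcluster3
do
    VCS_HOST=${CL}-vip; export VCS_HOST
    hares -action ${CL}_fileonoff online_all
done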

I am going to do some research with the 'hares -action' command to see if I can make it do what we need it to do.

mikebounds
Level 6
Partner Accredited

With regards to security, the passwords are encrypted in the .vcspwd file, and the file is owned by root and can be changed as often as you want (if you have to change passwords every month), so security-wise this really is no different to encrypted passwords in main.cf.  If you are really serious about security then you should use a secure cluster, if you are not already.  If you have not set up a halogin session, then you will be prompted for a password when you use ha commands in your script - if you are going to do this then I would run a halogin command, run your ha command, and then run a halogin -endsession.
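i.e. something along these lines inside the script (user, group and host names are placeholders; check the exact halogin syntax for ending a session on your VCS version):

VCS_HOST=appcluster1-vip; export VCS_HOST
halogin dr_operator <password>        # set up the session (writes .vcspwd)
hagrp -online app_sg -sys appnode1    # ha commands now run against the app cluster
halogin -endsession appcluster1-vip   # end the session again afterwards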

I wouldn't have thought you would need a DR license to use the action scripts, as these are actions of the CVMVolDg agent, even though the scripts get installed with the HTC agent.  So I would just copy the action scripts from the DB cluster, so that the App cluster contains just the CVMVolDg actions and not the HTC agent.  It would also be useful to copy the cvm_import and cvm_deport functions, but I don't know where these are located.

Note that if you use hares -action, then the -actionargs option can be useful too.

Mike

S_Herdejurgen
Level 4
Accredited Certified

Secure Cluster doesn't scale for us.  Our largest cluster is a 13-node CFS cluster with 40 users and 160 service groups (and growing).  Because Secure Cluster defines access using (node, user, group) tuples, that gives us 13 x 40 x 160 = 83,200 possible security identifiers to manage.  In reality, we would probably need to define about 2000 security identifiers for this cluster, but even that number is unmanageable.

We currently use halogin for accessing VCS from non-root accounts.

I did some testing using 'hares -action' and it looks very promising.  It's not interactive, but it solves the basic problem of running a script as root on a remote cluster.  Now I have some scripting to do.  Thanks Mike for the idea of using agent action scripts.

joseph_dangelo
Level 6
Employee Accredited

Sean,

Ultimately you may want to consider Virtual Business Services with VERITAS Operations Manager as a means to control all of these functions.  VOM will give you a centralized authentication point as well as orchestrate all the tiered actions you are considering.  VOM 4.0 should support all of what you are looking to achieve.

Joe D