Solved: pause between one resource and another

bazi · ‎04-17-2012

Hey All,

I am looking for a solution to the following problem.

I have a group controlling a Sybase dataserver instance, which starts its resources in the following order:

1) disk group

2) ip and volumes

3) mount points

4) dataserver

5) backupserver

The catch is that the Sybase interfaces file tells Sybase to start the listener on a WLB address. This WLB address/URL is dynamic and can resolve to the prod or the COB IP, depending on which one is up. Once the IP is brought online, the F5 load balancers set DNS to resolve the WLB address to the correct IP. This lets us handle COB scenarios transparently from a connectivity point of view. The load balancers, however, take a few seconds to pick up the IP change, so if the Sybase dataserver resource starts onlining the instance before the WLB URL resolves, the dataserver start fails. Hence I am looking for a way to tell VCS to hold off starting the dataserver until the WLB address resolves, i.e. some sort of condition. Any ideas?

So far I have been thinking of an Application type resource. I wrote a Perl script that verifies whether the WLB address already resolves to an IP in the same subnet, and I can easily use it for StartProgram and StopProgram. The problem is that I have no way to actually offline the resource, because it is only a monitor program that makes sure the WLB address resolves to the correct IP. So what I am really looking for is some sort of MonitorOnly type of resource...

Any clues much appreciated

Wojtek

jstucki · ‎04-17-2012

Wojtek,

Create a StartProgram script which loops, sleeping 5 seconds between attempts, until the WLB address resolves. Then, in your StartProgram script, touch a lockfile (in /var/run, if on Solaris) and exit. It's a good idea to touch the file in a directory that gets wiped when a reboot occurs. Make the lockfile permissions such that only root can remove it.
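A minimal sketch of such a StartProgram, written as shell functions. The WLB name, the expected IP, and the lockfile name are placeholders that do not come from this thread, and the check is simplified to an exact IP match rather than the same-subnet test described in the original question. It assumes `getent` is available on the node.

```shell
#!/bin/sh
# StartProgram sketch. WLB_NAME, EXPECTED_IP and the lockfile name are
# placeholders, not values from this thread.
WLB_NAME="${WLB_NAME:-wlb.example.com}"
EXPECTED_IP="${EXPECTED_IP:-192.0.2.10}"
LOCKFILE="${LOCKFILE:-/var/run/wlbwait.lck}"
SLEEP_SECS="${SLEEP_SECS:-5}"

# Print the first address the WLB name currently resolves to (nothing if none).
resolve_wlb() {
    getent hosts "$WLB_NAME" | awk '{ print $1; exit }'
}

# Poll until the name resolves to the expected address, then create the lockfile.
wait_for_wlb() {
    while [ "$(resolve_wlb)" != "$EXPECTED_IP" ]; do
        sleep "$SLEEP_SECS"
    done
    touch "$LOCKFILE"
}

# A real StartProgram would finish with:  wait_for_wlb
```

The loop has no upper bound of its own, so the agent's OnlineTimeout is what ends the wait if the name never resolves.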

Create a MonitorProgram script which checks for the lockfile. If the lockfile exists, exit with 110. If it doesn't exist, exit with 100.
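The monitor side could look roughly like this; the lockfile name is again a placeholder and has to match whatever the StartProgram creates.

```shell
#!/bin/sh
# MonitorProgram sketch; the lockfile name is a placeholder.
LOCKFILE="${LOCKFILE:-/var/run/wlbwait.lck}"

# 110 reports the resource as online, 100 reports it as offline.
monitor_status() {
    if [ -f "$LOCKFILE" ]; then
        return 110
    fi
    return 100
}

# A real MonitorProgram would finish with:  monitor_status; exit $?
```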

Create a StopProgram script which removes the lockfile. The CleanProgram script will also remove the lockfile, probably with a -f to force the removal.
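A sketch of the two removal scripts, shown as functions with a placeholder lockfile name:

```shell
#!/bin/sh
# The lockfile name is a placeholder.
LOCKFILE="${LOCKFILE:-/var/run/wlbwait.lck}"

# StopProgram: a plain rm, so a failure to remove the file is reported.
stop_wlbwait() {
    rm "$LOCKFILE"
}

# CleanProgram: rm -f also succeeds when the lockfile is already gone.
clean_wlbwait() {
    rm -f "$LOCKFILE"
}
```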

With these scripts, your resource can go online and offline, and it will give you the results you want.

Set the OnlineTimeout attribute of the Application Agent to a high value, so that your resource has plenty of time to wait until the WLB address resolves, and your resource can go online.

You'll need to copy your scripts to all hosts in the cluster. Create a directory called something like /opt/VRTSvcs/bin/WlbWait, and put the scripts in the directory.
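For reference, wiring the resource into the group from the command line might look something like the sketch below. The group name (sybase_grp), the resource names (wlbwait, dataserver, sybase_ip), the script file names, and the 600-second timeout are all made up for illustration. The two links are what produce the pause: dataserver waits for wlbwait, and wlbwait waits for the IP.

```shell
#!/bin/sh
# Hypothetical names throughout: group sybase_grp, resources wlbwait,
# dataserver and sybase_ip, script names start/stop/clean/monitor.
DIR=/opt/VRTSvcs/bin/WlbWait

# Adds the Application resource, raises its OnlineTimeout, links it between
# the IP and the dataserver, and enables it last, once all values are set.
add_wlbwait() {
    haconf -makerw &&
    hares -add wlbwait Application sybase_grp &&
    hares -modify wlbwait StartProgram "$DIR/start" &&
    hares -modify wlbwait StopProgram "$DIR/stop" &&
    hares -modify wlbwait CleanProgram "$DIR/clean" &&
    hares -modify wlbwait MonitorProgram "$DIR/monitor" &&
    hares -override wlbwait OnlineTimeout &&
    hares -modify wlbwait OnlineTimeout 600 &&
    hares -link wlbwait sybase_ip &&
    hares -link dataserver wlbwait &&
    hares -modify wlbwait Enabled 1 &&
    haconf -dump -makero
}

# Run once, as root, on one node of the cluster:  add_wlbwait
```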

-John

arangari · ‎04-18-2012

You could also create an 'ononly' resource type which doesn't do anything in the online entry point (EP), and which checks in the monitor EP whether the WLB address resolves.

Here, the assumption is that the WLB address is required continuously for Sybase.

bazi · ‎04-18-2012

Hey John,

This sounds very good, though I am thinking of a scenario in which I am left with both nodes holding the lock file. That could lead to a Concurrency Violation condition, in which VCS will try to offline both groups. But I will definitely try to implement it. It is quick to add that lock logic to the script. Thanks for that.

Regards,

Wojtek

bazi · ‎04-18-2012

Hey Amit,

That is also a very good tip, though it does not meet one requirement that I forgot to mention in my initial problem description (apologies for that). Basically, I would not be happy with the program running all the time, as I want to avoid a scenario in which it dies or someone accidentally kills it and the resource goes offline. ProcessOn has only PID monitoring, unfortunately.

Thanks a lot for the suggestion.

Best regards,

Wojtek

jstucki · ‎04-18-2012

Wojtek,

You shouldn't have issues with a concurrency violation. I've created many custom agents which use this approach. It works well. Putting the lockfile in /tmp or /var/run will ensure that it disappears when a reboot occurs.

If, by some remote chance, the lockfile gets created on a node where the resource is not online, VCS will immediately run the clean script and remove the lockfile. You can test this by manually creating the lockfile on a second node, to see what happens.

Let us know if the scripts are successful in doing what you intended. It's always nice to know that the solution worked well.

-John

Satish_K__Pagar · ‎04-18-2012

You may want to increase the MonitorInterval for the extra script resource, so that the monitor check for the lockfile does not run at the default interval of 60 seconds. You can raise it to a sufficiently high value (the maximum is 2147483647 seconds), which delays each subsequent monitor invocation by that many seconds. Along with this, you may want to increase the ToleranceLimit for the resource, so that the resource is not reported as OFFLINE until the monitor has failed as many consecutive times as the ToleranceLimit attribute defines. Since both of these are type-level attributes, you will have to override them for the resource, as below:

eg.

# hares -override <ResourceName> MonitorInterval

# hares -modify <ResourceName> MonitorInterval 1500
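The same override-then-modify pair applies to ToleranceLimit. A small helper, sketched here with a hypothetical resource name and value, keeps the two steps together; it assumes the configuration has already been made writable (haconf -makerw).

```shell
#!/bin/sh
# Override a type-level attribute for one resource, then set its value.
# Arguments: resource name, attribute name, new value.
override_attr() {
    hares -override "$1" "$2" &&
    hares -modify "$1" "$2" "$3"
}

# Example with a hypothetical resource name and value:
#   override_attr wlbwait ToleranceLimit 2
```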

bazi · ‎04-19-2012

Hey John,

Works like a charm. I implemented a few monitoring modes: one that only checks for the lck file, one that also verifies that the WLB address remains resolvable, and one that checks for the lck file, WLB availability, and whether the IP is up on the node it should be on. The last one may be useful if I see that someone keeps creating the lck file in /var/run on the other nodes, although I would find that very bizarre given the naming convention I chose for the lck file and the criticality of the servers. Then again, who knows :) For now I just use the lck file check, as I only care about the pause before the dataserver starts. If I need extra functionality in the future, I can easily move to it by simply changing the argument the MonitorProgram is executed with. Since the presence of the lck file is a must in all modes, I can control offline/online as you suggested.

The solution is in QA now (already tested against the concurrency violation condition that I was worried about) and will shortly be put into PROD.

A quick word on why I worried about the CV condition. Recently I was adding Volume type resources on Solaris machines with VCS 4.1. I was in something of a hurry and was doing it by copy/paste, not worrying about the order I was doing it in. Suddenly I saw groups going down, although I could not find a reason for such behavior. I logged a call with Symantec and it turned out that VCS 4.1 has a bug. If you add a Volume resource and set it to enabled before filling in its DiskGroup and Volume attributes, VCS goes crazy and considers the resource to be online on all nodes in the SystemList. Moreover, it took my groups down instead of just bringing them down on the nodes that were standby for the group. Although it sounds reasonable to enable the Volume resource only after all values are put in, I would expect such an enterprise tool to protect me from such mistakes. People learn every day, don't they?
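The actual Perl script is not posted in this thread. As a rough shell sketch of the idea, a single monitor could take the mode as its argument; the mode names (lck, dns, ip), the WLB name, the IP, and the lockfile name below are all invented for illustration. The IP check assumes an ifconfig whose output contains lines of the form `inet <address> ...`.

```shell
#!/bin/sh
# Mode names (lck, dns, ip) and all default values are placeholders.
WLB_NAME="${WLB_NAME:-wlb.example.com}"
EXPECTED_IP="${EXPECTED_IP:-192.0.2.10}"
LOCKFILE="${LOCKFILE:-/var/run/wlbwait.lck}"

lock_present() { [ -f "$LOCKFILE" ]; }

# True when the WLB name resolves to the expected address.
wlb_resolves() {
    [ "$(getent hosts "$WLB_NAME" | awk '{ print $1; exit }')" = "$EXPECTED_IP" ]
}

# True when the expected address is configured on a local interface.
ip_is_local() {
    ifconfig -a | awk -v ip="$EXPECTED_IP" \
        '$1 == "inet" && $2 == ip { found = 1 } END { exit !found }'
}

# Returns 110 (online) or 100 (offline); every mode requires the lockfile.
monitor_mode() {
    case "$1" in
        lck) lock_present || return 100 ;;
        dns) { lock_present && wlb_resolves; } || return 100 ;;
        ip)  { lock_present && wlb_resolves && ip_is_local; } || return 100 ;;
        *)   return 100 ;;   # unknown mode is reported as offline
    esac
    return 110
}

# A real MonitorProgram would finish with:  monitor_mode "$1"; exit $?
```

Because every mode starts with the lockfile check, removing the lockfile still takes the resource offline regardless of which mode is configured.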

Again, thanks for your help.

Wojtek

bazi · ‎04-19-2012

Hey Satish,

Thanks for your input.

Yeah, I can consider that; however, the lck file solution seems stable and I do not expect anyone to go and create/remove random files in /var/run. Also, since I implemented more monitoring modes (see my reply to John), one day I may want to actually start monitoring the WLB URL continuously. Requirements change every day... :)

Thanks,

Wojtek
