Solved: Application Resource failing while doing SG Fail-over test

Ashish_C · ‎03-23-2014

Hi,

I'm facing an issue while doing a failover test of an application that is configured in VCS as a service group.

The Application resource faults on the standby node during the failover.

For testing purposes, I took the SG offline on all nodes and brought the resources online one by one on the secondary node.

After this, all the resources were detected as online in the VCS console except the Application resource. As per the application team, the service that is supposed to start once the application starts was running on the node.

VCS was not throwing any error at this time. ** (The support team suggested that I check with the application team, as VCS is not logging any error in the engine log.)

When I do the same test on the active node, the Application resource is detected as online and the same is reflected in the console.

Please suggest.

Regards,

Ashish C

mikebounds · ‎03-24-2014

The logs don't show the issue occurring on PmsProd26, but VCS is only reporting what your scripts are doing, so this is a script/environment issue.

Try the following on each node:

Bring all resources up, except application, then

# sh
# /etc/VRTSvcs/conf/HPPI/Run_OVPI.ksh
# /etc/VRTSvcs/conf/HPPI/Monitor_OVPI.ksh
# echo $?

If /etc/VRTSvcs/conf/HPPI/Monitor_OVPI.ksh is not returning exit code 110, then you need to fix the script.
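A minimal sketch of how the exit code from the check above could be read, assuming the convention stated in this thread (110 = online, 100 = offline). The function names `classify_monitor_rc` and `check_monitor` are hypothetical, and treating every other exit code as "unknown" is an assumption of this sketch, not something confirmed in the thread.

```shell
#!/bin/sh
# Hypothetical helper: map a monitor script exit code to a state name.
# Assumes the convention from this thread: 110 = online, 100 = offline.
# Any other value is reported as "unknown" (an assumption of this sketch).
classify_monitor_rc() {
    case "$1" in
        110) echo "online" ;;
        100) echo "offline" ;;
        *)   echo "unknown" ;;
    esac
}

# Run a monitor script and report how its exit code would be read.
# $1 = path to the monitor script
check_monitor() {
    "$1" >/dev/null 2>&1
    rc=$?
    echo "exit code $rc: $(classify_monitor_rc "$rc")"
}
```

On a node it might be called as `check_monitor /etc/VRTSvcs/conf/HPPI/Monitor_OVPI.ksh` after the start script has run.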

Mike

Gaurav_S · ‎03-23-2014

Hi Ashish,

Can you paste the contents of engine_A.log and the resource definition from main.cf?

Also, is the monitor script for the Application resource the same on both nodes? It is quite possible that the monitor script on the active node is different from the one on the passive node, which is why the resource is not coming online on node B.

VCS needs the monitor program to exit with 110 (resource online) or 100 (resource offline) when monitoring a resource. Compare the monitor program on both nodes and see the results.

G

Ashish_C · ‎03-24-2014

Hi Gaurav,

To check the functionality of the start script, I copied it from the active node to the standby node. Its exit status was 110, as you mentioned, when it started. For your reference, I'm attaching the engine logs from when I started the failover test to the standby node, and the main.cf entries for the SG on that node.

Regards,

Ashish C

Setu_Gupta · ‎03-24-2014

Hi Ashish,

This typically happens when there are setup differences between the two nodes.

Please check and compare the setup on both nodes (the one where the Application resource comes online and the one where it does not). You will have to check the following for differences between the two nodes: MonitorProgram / MonitorProcesses / PidFiles (whichever you have configured).
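One way to do this comparison is sketched below, assuming a copy of the file from the other node has already been fetched to the local node. The function name `compare_copies` is hypothetical; it only wraps the standard `cmp` and `diff` tools.

```shell
#!/bin/sh
# Hypothetical helper: compare a local file with a copy of the same file
# taken from the other node, and print the differing lines if there are any.
# $1 = local file, $2 = copy fetched from the other node
# Returns 0 when the files are identical, non-zero otherwise.
compare_copies() {
    if cmp -s "$1" "$2"; then
        echo "identical"
        return 0
    fi
    echo "different"
    # diff exits non-zero when the files differ; that becomes our return code.
    diff "$1" "$2"
}
```

For example, after copying Monitor_OVPI.ksh from PmsProd25 into /tmp on PmsProd26 (the /tmp location is only an illustration), `compare_copies /etc/VRTSvcs/conf/HPPI/Monitor_OVPI.ksh /tmp/Monitor_OVPI.ksh` would list any lines that differ.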

HTH.

Setu_Gupta · ‎03-24-2014

Hi Ashish,

From the engine log, it seems that no attempt was ever made to bring the Application resource online on the system PmsProd26. Online was fired only for the resources up to the IP resource.

If online was attempted for the Application resource on PmsProd26, can you please provide the relevant section of the engine logs?

Thanks.

Ashish_C · ‎03-24-2014

Hi Setu,

I've checked the monitoring scripts on both nodes, and they look the same. Attaching them for your reference. I haven't yet tried replacing the script on PmsProd26, my standby node, with the one from the active node.

** Are there any issues if the PATH for VCS and the application is not set in /root/.bash_profile?

It was not there previously; I have only just set it for root.

Regards,

Ashish

Setu_Gupta · ‎03-24-2014

Hi Ashish,

The diff of your two monitor scripts reveals some differences in the commands executed during online. Please copy the working file to the non-working node and check the result of the online operation.

There might be issues if the PATH for VCS or the application is not set in /root/.bash_profile and your scripts do not use full paths everywhere.
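A rough way to check for this kind of PATH dependency is sketched below. The function name `missing_in_path` is hypothetical, and the idea of testing against a reduced PATH is an assumption standing in for the environment a script gets when the profile has not been read.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that do not resolve under a given PATH.
# $1 = PATH value to test; remaining arguments = command names used by a script
# Note: shell builtins (echo, test, ...) resolve regardless of PATH.
missing_in_path() {
    test_path=$1
    shift
    for cmd in "$@"; do
        # Look the command up in a subshell so the caller's PATH is untouched.
        if ! ( PATH=$test_path; command -v "$cmd" >/dev/null 2>&1 ); then
            echo "$cmd"
        fi
    done
}
```

For example, `missing_in_path "/usr/bin:/bin" hares trendtimer` would print whichever of those commands a script could not find without the profile; any command it prints needs either a full path in the script or an explicit PATH setting there.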

Thanks,

Setu.

mikebounds · ‎03-24-2014

Which node is the active node: PmsProd25 or PmsProd26?

Which version of VCS are you using?

Mike

Ashish_C · ‎03-24-2014

Hi Mike,

PmsProd25 is active node.

VCS version is

Engine Version 5.1
Join Version 5.1.10.0
Build Date Fri 01 Oct 2010 12:00:00 PM IST
PSTAMP 5.1.100.000-5.1SP1GA-2010-09-30_23.30.00

OS version is

Red Hat Enterprise Linux Server release 5.6 (Tikanga)

Kernel version is

Linux PmsProd26.IDEACONNECT.COM 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

Regards,

Ashish C

mikebounds · ‎03-24-2014

In 5.1 there is the following attribute for the Application agent:

UseSUDash
When the value of this attribute is 0, the agent performs an su User command before it executes the StartProgram, the StopProgram, the MonitorProgram, or the CleanProgram agent functions.
When the value of this attribute is 1, the agent performs an su - User command before it executes the StartProgram, the StopProgram, the MonitorProgram or the CleanProgram agent functions.
Type and dimension: boolean-scalar
Default: 0

You have not set this, which means it takes the default of 0, so the profile will not be run. You will need to either include any environment variables you need in your scripts or change the attribute to 1.

If this does not fix your issue, then please explain it further, as the logs do not match your description. In your opening post you say you:

"made the resources online one by one on the secondary node", but the logs do not show you onlining resource HPPI_Appl on PmsProd26.

Also the logs show:

2014/03/21 16:23:32 VCS ERROR V-16-2-13067 (PmsProd25) Agent is calling clean for resource(HPPI_Appl) because the resource became OFFLINE unexpectedly, on its own.

Was this a test you did killing the processes on PmsProd25, or do you have an issue on PmsProd25 too?

Mike

Gaurav_S · ‎03-24-2014

Looking at the script on both nodes, in the "online" function,

on PmsProd25

#############  # FUNCTION: Called by main() when ($MODE = "Run_OVPI.ksh")
OVPI_online()  # RETURN: 0 if success, else 1.
#############  # Start piweb, and optionally trendtimer if not WAS-ONLY/
{
 typeset Res

 if [ ${OVPI_SCENARIO} -ne ${WAS_ONLY} ] ; then

    # Don't start trendtimer if another copy of it is already running:
    #
    Res=`ps -ef | grep "trendtimer" | grep -v grep`
    if [ ! "$Res" ] ; then
       #
       # Implement same mechanism used by "/etc/init.d/ovpi_timer start" :
       #
       logit I "OVPI_online() STARTING trendtimer ...."

       /bin/su - $PIUSER -c "${PIHOME}/bin/trendtimer -s \
         ${PIHOME}/lib/trendtimer.sched"  >StartTtimerOUT  2>StartTtimerERR
       sleep 5
       /bin/su - $PIUSER -c "${PIHOME}/bin/trendtimer -s \
         ${PIHOME}/lib/trendtimer_IDEA.sched"  >StartTtimerOUT  2>StartTtimerERR
       sleep 5
    fi
 fi
 #THIS CODE CHECKS THE APPROPRIATE OS AND STARTS PIWEB ACCORDINGLY #
 logit I "OVPI_online() STARTING piweb ...."
 startjboss
 logit I "OVPI_online() WAITING 3 seconds after piweb start ...."
 sleep 3

 logit I "OVPI_online() EXIT 0"
 return 110
}

While on PmsProd26

#############  # FUNCTION: Called by main() when ($MODE = "Run_OVPI.ksh")
OVPI_online()  # RETURN: 0 if success, else 1.
#############  # Start piweb, and optionally trendtimer if not WAS-ONLY/
{
 typeset Res

 if [ ${OVPI_SCENARIO} -ne ${WAS_ONLY} ] ; then

    # Don't start trendtimer if another copy of it is already running:
    #
    Res=`ps -ef | grep "trendtimer" | grep -v grep`
    if [ ! "$Res" ] ; then
       #
       # Implement same mechanism used by "/etc/init.d/ovpi_timer start" :
       #
       logit I "OVPI_online() STARTING trendtimer ...."

       /bin/su - $PIUSER -c "${PIHOME}/bin/trendtimer -s \
         ${PIHOME}/lib/trendtimer.sched"  >StartTtimerOUT  2>StartTtimerERR
       sleep 5
    fi
 fi
 #THIS CODE CHECKS THE APPROPRIATE OS AND STARTS PIWEB ACCORDINGLY #
 logit I "OVPI_online() STARTING piweb ...."
 startjboss
 logit I "OVPI_online() WAITING 3 seconds after piweb start ...."
 sleep 3

 logit I "OVPI_online() EXIT 0"
 return 110
}

As you can see above, node 25 brings up "trendtimer" twice: once for trendtimer.sched and a second time for trendtimer_IDEA.sched, while node 26 starts it only for trendtimer.sched.

This should be the same on both nodes, right?

G

Ashish_C · ‎03-24-2014

Hi Gaurav,

I had even tried copying the start and monitoring scripts from PmsProd25 (active node) to PmsProd26. It didn't work.

Regards,

Ashish C

Ashish_C · ‎03-24-2014

Hi Mike,

I haven't tried changing the parameter yet, as the application support team is working on that.

When I bring the SG from offline to online on PmsProd25 (active node), it works. But when I do the same on PmsProd26, it faults.

My query is: if I change the parameter, it will apply on all the nodes, but in my case it is already working on one of the nodes. Will the parameter change help me?

Below are the commands which I'm going to execute for the suggested change.

# haconf -makerw
# hares -modify HPPI_Appl UseSUDash 1
# haconf -dump -makero

Regards,

Ashish C

Ashish_C · ‎03-25-2014

Hi Mike,

The application team made some changes after checking their logs. After that, we were able to bring the application online manually on the second node, which was not happening before.

Failover tests are yet to be completed. If it doesn't work, we will change the UseSUDash parameter as you suggested.

Hopefully it will fail over, as it came online on the secondary node.

Regards,

Ashish C

Ashish_C · ‎03-25-2014

Hi All,

Thanks to everyone who guided and helped me in this discussion.

Sorry for taking up everyone's time. It was an application-related issue; it worked after the support team made changes to the application scripts. It was a good experience to work with all of you, and I learned a lot more about VCS while discussing my issue here.

Again, thanks to all for your support and responses.

Regards,

Ashish C

Application Resource failing while doing SG Fail-over test