
Veritas Cluster Server, Resource Application failed to start

cedric_tours
Level 3

Hello,

On 2 servers running Red Hat Linux 6.3, I have Veritas Cluster Server 6.0.

For one Application resource, when I take it offline and then bring it online on the same server (with all other resources online on this server), it starts fine.

But when I test a 'switch to' of the whole service group to the other node, it doesn't start properly (although all the other resources do start).

The application is linked with 3 Mount resources and 1 IP resource.

We set the Critical attribute to false, and we set UseSUDash to true.

The StartProgram script is supposed to start several processes. With an offline/online action, all the processes are started; with a 'switch to' action, only half of them are started.

No interesting log on the Application side.

Any suggestions for debugging will be appreciated.

7 REPLIES

kjbss
Level 5
Partner Accredited

What do you have the Application attribute 'User' set to?  If this is not set, it defaults to the root user.

Specifying UseSUDash causes the agent to start the shell as a login shell with an environment similar to a real login for the User specified (user root if not specified).  Therefore, make sure that the user's profile files (depending upon that user's SHELL) are identical on both systems.  Also make sure to check if any other files are sourced into the environment by the user's profile file and also make sure that both systems have them too (and they need to be identical also).  There is probably something that is set, like an environment variable, (OR there is some action that is performed) on the system that is working that is not set (or performed) on the system where it does not work. 
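A rough way to check this, assuming you can run commands as the configured user on both nodes: dump the login-shell environment on each node, copy one dump across, and compare them. The env_diff helper and the /tmp file names below are made up for illustration; they are not something VCS provides.

```shell
# Capture on each node first (hypothetical file names, user1 as an example user):
#   su - user1 -c env > /tmp/env.node1     # run on node1
#   su - user1 -c env > /tmp/env.node2     # run on node2
# then copy one file to the other node and compare.

# Print only the lines that differ between two environment dumps.
# "<" marks a line found only in the first file, ">" only in the second.
env_diff() {
    a=$(mktemp) || return 1
    b=$(mktemp) || return 1
    sort "$1" > "$a"
    sort "$2" > "$b"
    diff "$a" "$b" | grep '^[<>]'
    rm -f "$a" "$b"
    return 0
}
```

Calling env_diff /tmp/env.node1 /tmp/env.node2 prints nothing when the two environments match. Any line printed is a variable that is set differently, or set on one node only.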

If you have confirmed that these are identical, then you should start the application manually on the system where it currently is unable to start via VCS.  If you can start it manually, then we should simply be able to "do the same" by integrating the steps into the Application agent (via the attribute settings). 

If you cannot start it manually, then this is not a VCS problem and you need to debug your application environment in order to identify why it cannot start on the failing system.

-HTH

Kevin

mikebounds
Level 6
Partner Accredited

If on BOTH nodes:

"when I put it offline and then online on the same server (with all other resources online on this server) it start well"

but if you online the group with all resources offline first, it fails, then you probably have a VCS dependency problem, so that your application is starting before another resource it requires is online.

If, with all other resources online on the server, you can online on node1, but doing the same on node2 fails, then most likely there is an issue with your app, in which case you need to try and start it manually as Kevin says.

 

Mike

cedric_tours
Level 3

Thank you for your replies.

The User attribute is set to 'user1', the user that needs to start the processes; the application can start with or without UseSUDash.

We were very careful to configure the 2 nodes identically, and the symptom is the same on both nodes.

I was looking more in the direction of a VCS dependency issue.

How can I confirm (and fix) that?

My app is not yet in production, so I can stop and start it without trouble.

thanks

Cedric

mikebounds
Level 6
Partner Accredited

So as I understand, if you online resources one at a time, then your app resource starts, but if you online the whole group (using hagrp -online or using Java GUI), then your app does not start - or does "hagrp -online" work and only "hagrp -switch" fails?

The issue is likely dependencies or timings.  It sounds like your app should be the last to come up so it should be at the top of the dependency tree - please provide main.cf if you are unsure.
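To see what VCS currently has linked, a small sketch along these lines could help. hares -dep is the standard command for listing resource dependencies; the show_deps wrapper is only a convenience invented here, and it needs to run on a cluster node with the VCS commands in PATH.

```shell
# Print the dependency links VCS has recorded for each resource named
# on the command line (one "hares -dep" call per resource).
show_deps() {
    for res in "$@"; do
        echo "== $res"
        hares -dep "$res"
    done
}
```

Run it with the name of your Application resource. The app should be listed as the parent of each Mount resource and of the IP resource. If a link is missing, it can be added with hares -link <parent> <child> after opening the configuration with haconf -makerw, and saved afterwards with haconf -dump -makero.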

If all you have is Mounts, DG, IP and your single app resource, then timings is unlikely to be the issue as when a Mount or IP is up, it is ready, but if you have other App resources, they may report they are up, before they are ready.  If you have another resource that your app needs, say "app_dep" , then run:

hares -online app_dep -sys node1;hares -wait app_dep State ONLINE -sys node1; hares -online app_with_issue -sys node1

and if this fails, then offline the group and try again with something like:

hares -online app_dep -sys node1;hares -wait app_dep State ONLINE -sys node1; sleep 30; hares -online app_with_issue -sys node1

If you are still having issues, then provide an extract from engine_A.log.

Mike

kjbss
Level 5
Partner Accredited

Prove user 'user1' exists on the failing node AND has the same UID and GID(s) associated with it.   If these are different, this could affect the ability to start your processes.
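One quick way to compare, as a sketch: print the numeric IDs on each node and check that the lines match. The ids_of helper name is made up here; it only wraps the standard id command.

```shell
# Print "uid:primary-gid:all-gids" for a user on one line,
# so the output can be compared between nodes.
ids_of() {
    printf '%s:%s:%s\n' "$(id -u "$1")" "$(id -g "$1")" "$(id -G "$1")"
}
```

Run ids_of user1 on both nodes; the two lines should be identical. If the user is missing on a node, id reports an error instead.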

Have you attempted to start the application manually on the failing node?  You need to do this before spending any other time doing anything else.  This is an important step to isolate where the problem is so that we can focus on solutions.  If starting the application on the failing node works, then we should be able to assume that that node's environment (user credentials, mounts, filesystems, ports, ip addresses, etc.) is sufficient to start the application and therefore there is some problem within the VCS configuration (timings, dependencies, etc), and you need to inspect the VCS log (/var/VRTSvcs/log/engine_A.log) for the error message produced when the Application resource fails to start.  HOWEVER, if this does not work then this has nothing to do with the VCS configuration and you must manually debug your application environment in order to identify why it cannot start on the failing system. 
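For the log inspection step, something like the following can narrow things down. It simply filters the engine log for lines that mention the resource by name; the res_log helper and its default of 50 lines are arbitrary choices for this sketch, and only the log path comes from above.

```shell
# Show the most recent log lines that mention a given resource name.
# $1 = resource name, $2 = log file (defaults to the VCS engine log),
# $3 = how many matching lines to show (defaults to 50).
res_log() {
    res=$1
    log=${2:-/var/VRTSvcs/log/engine_A.log}
    grep -F -- "$res" "$log" | tail -n "${3:-50}"
}
```

Run it right after a failed switch, passing the name of your Application resource, so that the relevant messages are among the last ones shown.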

As a general rule, you should never integrate an application into VCS before first proving it can start manually on all required nodes.

Let us know if you need help in how to start the application manually on the failing node.

 

cedric_tours
Level 3

The application can start manually on both nodes.

The symptom is the same on both nodes.

Offline -> online of the whole group and 'switch to' produce the same issue.

My app needs 3 filesystems mounted and 1 IP; I don't see any other dependency.

The issue has been solved by adding a new child script inside the App StartProgram.

Latest tests:

If we comment out the line, the start fails when we do a 'switch to' or an online of the whole group, but the start is OK when we online only the application.

If we uncomment it, it starts fine in all cases.

We presume it may be a small delay in one process, but we are not 100% sure.
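The thread does not say what the added child script does. If the cause really is one process needing a moment before the next one can start, one common pattern (an assumption here, not necessarily the fix that was applied) is to poll for readiness inside StartProgram instead of starting everything back to back. The wait_until helper and the paths in the usage comment are hypothetical.

```shell
# Retry a command once per second until it succeeds or the timeout
# (in seconds) is reached. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible use inside a start script (hypothetical paths):
#   /opt/app/bin/start_first
#   wait_until 30 test -S /var/run/app/first.sock || exit 1
#   /opt/app/bin/start_second
```

Compared with a fixed sleep, this waits only as long as needed and makes the start script fail visibly if the first process never becomes ready.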

thanks for your time.

Cédric

 

kjbss
Level 5
Partner Accredited

ok; good that you got it sorted...