Application resource faulting instead of Diskgroup/Mount

Zahid_Haseeb
Moderator
Partner    VIP    Accredited

Environment

SFHA = 6.2

Cluster nodes = 2  (local cluster)

OS = Solaris 10 SPARC

SAN storage connectivity = Dual Fiber cables (DMP installed)

Application = Custom-built application with Start/Stop and Monitor processes in place (attributes configured via the Java Console). The complete application resides under the mount point (Mount resource)
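
(For reference, a rough sketch of how such an Application resource can be inspected from the command line; the resource name app_res below is only a placeholder for whatever name is configured here:)

    hares -value app_res StartProgram       # start script wired into the Application agent
    hares -value app_res StopProgram        # stop script
    hares -value app_res MonitorProcesses   # process list the agent checks during monitor
    hares -dep app_res                      # confirms the Application sits on top of the Mount/DiskGroup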

 

WE ARE ASSUMING THAT THE CLUSTER SENSES THE APPLICATION FAILURE FIRST, INSTEAD OF THE DISKGROUP, AND EXECUTES THE FAILOVER

 

[Screenshot: App shows faulted instead of DiskGroup and Mount]

Query

We are performing a test case in which we pull out both fiber cables for SAN connectivity. Instead of the DiskGroup or Mount resources faulting, we are noticing our Application resource faulting.
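
(To correlate which layer noticed the cable pull first, the following can be checked on the active node right after the test; these are standard VxVM/DMP and Solaris commands, and the output will vary with the setup:)

    vxdmpadm getsubpaths        # DMP path states - both paths should show as disabled after the pull
    vxdisk -o alldgs list       # how VxVM currently sees the disks
    tail -f /var/adm/messages   # OS/driver messages showing when the path loss was logged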

3 ACCEPTED SOLUTIONS

Wally_Heim
Level 6
Employee

Hi Zahid,

If you take IMF out of the picture, then it comes down to a timing issue as to which resource hits its next monitor cycle after the cables are pulled.  It would stand to reason that the application would complain or fault if the disks that it needs are removed.

Once the first resource faults, VCS will fault the group if the resource is critical or affects a critical resource.  This corrective process should bring the group down and not worry about the other resources that might fault.
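
(As a rough illustration of what to compare, assuming placeholder resource names app_res, mnt_res and dg_res:)

    hatype -value Application MonitorInterval   # how often the Application monitor cycle runs
    hatype -value Mount MonitorInterval         # same for Mount ...
    hatype -value DiskGroup MonitorInterval     # ... and DiskGroup
    hares -value app_res Critical               # 1 = a fault on this resource can fault the whole group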

The second system is not down in the screenshot.  It is up but marked as faulted for this service group.  The server icon would be grey (not running state) instead of yellow (running state) if the server were down.  The background color is what shows the state: red is faulted, yellow/blue is partially online, blue is online and grey is offline.

Thank you,

Wally


mikebounds
Level 6
Partner Accredited

This comes down to timeouts, what I/O is happening and, as Wally says, timing.

If there are no writes to the storage, then reads may come from cache, so it can take a while for Volume Manager and the O/S to detect that the storage is gone.  When writes are performed there are timeouts as to when the I/O is marked as failed, so it is also feasible that your application times out before the DiskGroup and Mount resources do.

Or it may be that the DiskGroup and Mount resource monitor entry points hang as opposed to fail, and the default is 4 timeouts before the resource faults, whereas your Application resource monitor entry point may return "offline" rather than hang. This would be my guess as to the main reason, but it could also simply be timing, as each monitor runs independently for the different resource types.
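
(The timeout behaviour described above is governed by type-level attributes that can be checked as below; FaultOnMonitorTimeouts is the "4 timeouts" default mentioned, but verify the values on your own installation:)

    hatype -value DiskGroup MonitorTimeout           # how long a monitor entry point may run before it is timed out
    hatype -value DiskGroup FaultOnMonitorTimeouts   # consecutive monitor timeouts before the resource is faulted
    hatype -value Mount FaultOnMonitorTimeouts
    hatype -value Application FaultOnMonitorTimeouts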

As Marianne says, check your engine log (and also agent logs) to see what VCS detects at what time.

Mike


Venkata_Reddy_C
Level 4
Employee

IMF is enabled by default in 6.2. If the application is such that its processes die immediately after storage loss, then IMF triggers the monitor for the Application resource immediately (since you configured the MonitorProcesses attribute) and the resource detects the fault due to process death. With IMF in the picture, the behavior depends on how the application reacts to storage loss. The storage loss itself would be detected a little later even if IMF is enabled for the DiskGroup. Without IMF, fault detection depends on the timing of the monitor entry points of the individual resources.
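
(To see whether IMF is active for these types, and in which mode, something along these lines can be used; haimfconfig is the bundled helper script, so double-check its options against the 6.2 documentation:)

    hatype -value Application IMF   # Mode/MonitorFreq keys; Mode 3 = intelligent monitoring of online and offline resources
    hatype -value DiskGroup IMF
    haimfconfig -display            # per-agent IMF status summary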

Thanks,

Venkat


5 REPLIES

Gaurav_S
Moderator
   VIP    Certified

Hi Zahid,

If the SAN cables are pulled, the impact is obviously to the volumes. Since you have not configured a Volume resource, the next impacted resource is Mount, because one of the mandatory attributes of the Mount resource is the volume device; hence your Mount resource is faulting.

Since your application content sits on top of the Mount resource, once the Mount resource faults (and Mount is marked as critical), the entire group will fail over.
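
(A quick way to see that chain and which resources are marked critical; the group and resource names below are placeholders:)

    hagrp -resources app_sg        # all resources configured in the service group
    hares -dep                     # parent/child dependencies, i.e. Application -> Mount -> DiskGroup
    hares -value mnt_res Critical  # 1 means a Mount fault is enough to fail the group over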

 

G

Marianne
Moderator
Partner    VIP    Accredited Certified

Check engine_A log to see how VCS interprets the failures.

Also check /var/adm/messages to see at which point the hardware loss was recognized.
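
(Assuming the default log locations for VCS 6.2 on Solaris, that translates to roughly the following; the grep patterns are only a starting point:)

    egrep -i "fault|offline" /var/VRTSvcs/log/engine_A.log   # what HAD faulted and at what timestamp
    view /var/VRTSvcs/log/Application_A.log                  # per-agent logs show each monitor result
    view /var/VRTSvcs/log/Mount_A.log
    view /var/VRTSvcs/log/DiskGroup_A.log
    egrep -i "offline|disabled|failed" /var/adm/messages     # when the OS/DMP noticed the path loss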

PS: the screenshot seems to have been taken after the service group had already failed over and was busy coming up.
The Application still shows faulted because the fault has not been cleared.
The one host appears to be in a down state, probably because of the default PanicSystemOnDGLoss attribute on the DiskGroup resource.
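
(That attribute can be confirmed on the DiskGroup resource; dg_res is a placeholder name, and a value of 0 means the node is not halted on disk group loss:)

    hares -value dg_res PanicSystemOnDGLoss
    hares -display dg_res | grep -i panic   # alternative view of the same attribute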

See this URL:
https://sort.symantec.com/public/documents/sf/5.1/aix/html/vcs_admin/ch_vcs_controlling_behavior25.h... 
How VCS attributes control behavior on loss of storage connectivity 
