VCS does not notice SAN LUN disappearance
We have a cluster running on two Oracle M3000 with Solaris 10 Update 9 and VCS 5.1 SP1RP2.
Some more information on the cluster setup:
- Seven containers running as cluster resources
- On each of these one Oracle 11.2 DB is running
- The nodes have their storage on two SAN LUNs mirrored with ZFS
- All containers and the DB storage are also installed on SAN LUNs
- LUNs are stored on two EMC VNX
- All seven Container SGs have been set up the same way.
Here is the problem, which may not even be one. All the cluster failover tests worked as expected save one. We unplugged both FC cables from one node and it continued running. VCS did not notice this either. The node could no longer be reached by SSH but still responded to pings. GAB and LLT did not complain. The XSCF console showed errors about the SAN devices not being available, but that was it. After about 15 minutes we halted the node and rebooted it.
Now, I wonder if VCS should not have noticed that the LUNs had gone or that the node did not respond properly anymore. Or is this expected behaviour?
I know the scenario tested this way is very unlikely indeed.
Thank you for enlightening me.
Excerpt from main.cf:
I think the main reason you don't see the cluster reacting is that while the agents are in memory, the agent entry points are not. So the scenario is that the monitor interval for a Mount resource expires and the Mount agent calls the monitor entry point. Since the entry point script resides on disk on the root file system, the OS has to read it in. It now becomes an OS issue: the OS can't access something it needs, so the monitor never gets to run and report a fault, and the agent framework doesn't know what to do.
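To illustrate the point, here is a rough Python sketch of the general idea, not of how VCS agents actually work. A check that needs a disk read does not fail when the storage vanishes, it just blocks. The only way to notice is a timeout enforced by code that is already in memory. All names here (`run_with_timeout`, `path_readable`, `hung_read`) are made up for the example, and the "unknown" state is only my label for "the check never came back".

```python
import threading
import time


def run_with_timeout(check, timeout_s):
    """Run check() in a daemon thread and classify the outcome.

    Returns "online" if check() returned True, "offline" if it returned
    False or raised OSError, and "unknown" if it did not finish in time.
    """
    result = {}

    def worker():
        try:
            result["state"] = "online" if check() else "offline"
        except OSError:
            # e.g. the file is missing or the device returned an I/O error
            result["state"] = "offline"

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        # The check is still blocked; it never got to report anything.
        return "unknown"
    return result.get("state", "unknown")


def path_readable(path):
    """Hypothetical check: read one byte from a file on the storage."""
    with open(path, "rb") as f:
        f.read(1)
    return True


def hung_read():
    """Stand-in for a read that blocks because the LUN has disappeared."""
    time.sleep(3)
    return True
```

Calling `run_with_timeout(hung_read, 0.2)` returns "unknown" rather than "offline", which is roughly the situation in this thread: nothing ever reports a fault because nothing ever returns. Note that the sketch only works because the checking code is already loaded; if the script itself had to be read from the missing disk, even this would hang.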
Personally, I'd love to see the OS throw a panic when root file system becomes unavailable. That would clear up a lot of problems. :-)