Process fails

mrrout · 04-12-2011

Hi All,

I am running an executable program which sends sequential and Random I/O to mount points. For this case I am using an NFS mount point - "/nfs0" to run sequential and Random I/O. This NFS filesystem has been shared from a two node VCS 5.0 setup.

When I switch the service group (which contains the NFS share) over from the active node to the passive node, the process sometimes fails with the messages below:

Mar 27 08:37:23 /nfs0/io @ File open error!
OPEN ERROR : err_no=52 err_msg=Missing file or filesystem
Mar 27 08:37:23 /nfs0/io @ File close error!
CLOSE ERROR : err_no=9 err_msg=Bad file number

Note: The above "io" is a file name to which the executable program writes sequential / random data.

The VCS cluster is on AIX 5.3 TL11 and my client is running AIX 6.1 TL04. I do not see any difference in the VCS "engine_A.log" messages between switchovers where the process fails and switchovers where it does not. There are no messages in errpt on the client or the server, or in syslog.

Any help on how to diagnose this issue would be appreciated. To me it looks like an issue with the NFS client or server services, but I am unable to find a clue.

Thanks in advance.

Marianne · 04-12-2011

Please post the SG config in main.cf showing the resources and dependencies.

Have you tested each resource manually outside of VCS, including the order in which resources are brought online and offline, and then repeated the same process on the 2nd node?

Cluster merely automates the manual steps. You also need to time each step when you perform the manual steps so that you can determine if the default timeouts are sufficient.


mrrout · 04-12-2011

Hi Marianne,

Thanks for your response. I am including the "main.cf" file used in my setup for your reference. This cluster is running two NFS failover service groups. I get the problem (the I/O generating process terminates on the client) when I try to switch over the 2nd SG.

I am trying to understand why the process gets terminated only at times and not always, and whether there is any way to trace the NFS activity on the client. For your information, the process gets terminated while resources are being taken offline as part of the service group switchover.

Thanks and Best Regards.

Marianne · 04-13-2011

So, you have a 2-node cluster sharing filesystems that you are NFS mounting on another system. Your io program runs continuously on this 3rd system.

Now, when you switch/failover the service group from node1 to node2, the whole service group must be taken offline first on node1. The filesystem needs to be unmounted and the diskgroup needs to be deported before the service group can be brought online on node2 - diskgroup needs to be imported, volumes started, filesystem checked, then mounted, then shared.

With the above in mind, it is understandable that system3, where the filesystem is NFS mounted, will complain when it tries to write to the filesystem during the failover period. This message is 100% correct: "Missing file or filesystem"

This is what happens in a failover cluster - downtime is minimized, not prevented or eliminated. It ensures that should resources fail on one node or if the node fails, the resources will be brought online on another node.
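
If the I/O program's source can be changed, one option is to have it retry the open across the failover window instead of exiting on the first error. Below is a minimal Python sketch of that idea, not the actual program from this thread. The names `open_with_retry` and `TRANSIENT`, and the attempt and delay values, are hypothetical. It also assumes that err_no=52 on the AIX client corresponds to ESTALE, which should be checked against the client's errno.h.

```python
import errno
import os
import time

# Errors treated as "the share is mid-failover, try again".
# Which codes belong here is an assumption; ESTALE is the guess for err_no=52.
TRANSIENT = {errno.ESTALE, errno.ENOENT, errno.EIO}


def open_with_retry(path, flags, attempts=10, delay=2.0,
                    opener=os.open, sleep=time.sleep):
    """Open path, retrying while the error looks like a failover-window error.

    opener and sleep are injectable so the logic can be exercised
    without a real NFS mount.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return opener(path, flags)
        except OSError as exc:
            if exc.errno not in TRANSIENT:
                raise  # a real error, e.g. permission denied
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed with a transient error: give up with the last one.
    raise last_error
```

With `attempts=10` and `delay=2.0` the program would tolerate roughly 18 seconds of outage; those numbers would need to be sized from the measured switchover time of the service group.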


mrrout · 04-14-2011

Hi Marianne,

Thanks for going through the main.cf file and for your view. But I am experiencing this issue only occasionally with the attached main.cf configuration (i.e. with the NFSRestart resource configured). Many times I instead get messages like the ones below on the client (where the I/O executes):

NFS server 10.10.2.5 not responding still trying
NFS server 10.10.2.5 not responding still trying
NFS server 10.10.2.5 not responding still trying
NFS server 10.10.2.5 ok
NFS server 10.10.2.5 ok

I/O continues.

I think this is expected NFS behaviour: when a service group switchover is initiated, the client keeps retrying for some time (based on the retry values; on my setup the client NFS values are the defaults) and fails only after the NFS time-out is reached.

But in the event when I/O fails in the client it doesn't re-try at all, instead it fails with messages as shown in my 1st post the moment switchover of Service group is initiated. I am not sure what is causing this failure. Any idea on how to trace the NFS activities on the client would help me decipher this issue.
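
One low-tech way to get a timeline on the client, sketched here under assumptions, is to poll the file with stat() and log every change in the result, then compare the timestamps against engine_A.log on the cluster. The helper names `probe` and `watch` are hypothetical and the polling interval is arbitrary.

```python
import errno
import os
import time


def probe(path):
    """Run one stat() on path and return (timestamp, state).

    state is "OK" on success, otherwise the symbolic errno name
    (or the raw number if the platform has no name for it).
    """
    stamp = time.strftime("%b %d %H:%M:%S")
    try:
        os.stat(path)
        return stamp, "OK"
    except OSError as exc:
        return stamp, errno.errorcode.get(exc.errno, str(exc.errno))


def watch(path, interval=1.0, count=None):
    """Probe path repeatedly and print a line only when the state changes.

    count=None polls forever; a number limits the probes.
    """
    previous = None
    done = 0
    while count is None or done < count:
        stamp, state = probe(path)
        if state != previous:
            print(f"{stamp} {path} {state}")
            previous = state
        done += 1
        time.sleep(interval)

# Example: watch("/nfs0/io") in one terminal while switching the service group.
```

Two caveats: on a hard mount the stat() call may simply block until the server is back rather than return an error, and cached attributes on the client may hide a short outage, so this shows only what an application would see, not the underlying NFS traffic.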

Best Regards.
