NFS share doesn't fail over due to being busy

semenov_m_o
Level 3

Hello!

We are trying to implement a failover cluster that hosts a database and files on a clustered NFS share.
The files are used by the clustered application itself and by several other hosts.

The problem is that when the active node fails (an ungraceful server shutdown or a clustered service stop), the other hosts continue to use files on our cluster-hosted NFS share.
This leads to the NFS share "hanging": it no longer works on the first node, yet it cannot be brought online on the second node either. The other hosts also see their requests to that NFS share hang.
I will attach logs later, where the problem can be observed.

The only corrective action we have found is a total shutdown followed by a sequential start of all cluster nodes and other hosts.

Please recommend best-practice actions for using an NFS share on Veritas Cluster Server (perhaps some start/stop/clean scripts included as a cluster resource, or additional cluster configuration options).

Thank you in advance!

Best regards,
Maxim Semenov.


13 REPLIES

Samyak
Level 2
Employee

Hi Maxim,

 

Please send us the VCS configuration (main.cf) along with the logs so we can better understand the scenario.

 

Thanks and Regards,

Samyak.

semenov_m_o
Level 3

Hello, Samyak!

Thank you for the response.
Please review our failover testing from yesterday.
 

engine_A.log is the portion of the log covering our test.
After the last message there was no sign of life from the cluster for another 30 minutes.
The service group on app02 was stuck in the offline-pending state after that, so we turned the node off.

Luckily, this time the NFS share did not freeze like that.
Probably because we had changed our cluster configuration beforehand and added this dependency:

app-rg-mount requires app-rg-ip

But the failover is still not automatic, and some additional configuration fixes or scripting are needed.
We would appreciate any help.

Thank you!
Maxim Semenov.

kjbss
Level 5
Partner Accredited

Last I heard, cross-mounting an NFS filesystem within the same cluster is simply bad design.  This is a generic cluster design problem and not specific to VCS.  In cases where some of the nodes in the cluster are NFS clients of another node in the same cluster that is performing NFS server duties, you have effectively made one node dependent on the other -- this makes the whole configuration "non-HA".  A tenet of a valid cluster configuration is that every node be 100% independent of the other nodes, so that it can operate correctly in the event that it is the last surviving node in the cluster.

Best practice is to move the NFS server responsibilities outside the database/application cluster: one cluster (or perhaps an HA-configured, dual-headed NetApp server) supplies the NFS-exported filesystems, and another, completely separate cluster acts as the NFS client and hosts the database/application duties.

Of course, the other best-practice way to handle this is to place the filesystems that all nodes need simultaneous access to on shared storage that is then simultaneously accessed from all the nodes in the cluster via CVM and CFS.  This is a very good solution if your customer can afford the licenses for it (SFCFSHA, among others).
 

semenov_m_o
Level 3

Hello, kjbss!

Thank you for your answer.

Actually, you are not quite right.
The second, passive node in this cluster never uses the clustered NFS share.

This share is used by another clustered application (xapp in my naming).

Please refer to the attached image, which depicts the actual application structure.

Daniel_Matheus
Level 4
Employee Accredited Certified
(Accepted Solution)

Hi Maxim,

 

I think the problem is your dependencies.

VCS tries to unmount the file system while it is still in use.

2013/07/31 17:31:22 VCS NOTICE V-16-1-10300 Initiating Offline of Resource app-rg-mount (Owner: Unspecified, Group: app-rg) on System app02
2013/07/31 17:31:23 VCS NOTICE V-16-10031-5512 (app02) Mount:app-rg-mount:offline:Trying force umount with signal 15...
2013/07/31 17:31:23 VCS NOTICE V-16-10031-5512 (app02) Mount:app-rg-mount:offline:Trying force umount with signal 9...
2013/07/31 17:31:23 VCS INFO V-16-1-10305 Resource nfs-restr (Owner: Unspecified, Group: app-rg) is offline on app02 (VCS initiated)
2013/07/31 17:31:23 VCS NOTICE V-16-1-10300 Initiating Offline of Resource lockinfo-mnt (Owner: Unspecified, Group: app-rg) on System app02
2013/07/31 17:36:24 VCS INFO V-16-2-13003 (app02) Resource(app-rg-mount): Output of the timed out operation (offline) 
         the device is found by lsof(8) or fuser(1))
         the device is found by lsof(8) or fuser(1))
         the device is found by lsof(8) or fuser(1))

 

Please check the Bundled Agents Guide (page 132) for an example of how to set up the dependencies.

http://www.symantec.com/business/support/resources/sites/BUSINESS/content/live/DOCUMENTATION/5000/DOC5881/en_US/vcs_bundled_agents_601_lin.pdf
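For orientation, the failover example in that guide chains the resources, bottom to top, as Mount -> NFSRestart (Lower) -> Share -> IP -> NFSRestart (Upper). Translated to the names visible in your logs, the dependency section of main.cf would look roughly like this (a sketch only: app-rg-share and nfs-restr_L are assumed names for your Share and lower NFSRestart resources):

	// hypothetical dependency block for group app-rg
	nfs-restr_L requires app-rg-mount
	app-rg-share requires nfs-restr_L
	app-rg-ip requires app-rg-share
	nfs-restr_U requires app-rg-ip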

Thanks,
Dan

 

 

Samyak
Level 2
Employee

Hi Maxim,

 

The NFS failover service group requires two NFSRestart resources (Upper and Lower) to stop and start the NFS services in the right order; these are currently missing from your setup.

Also, the IP resource is below the mount and share resources, so the virtual IP remains reachable by clients even after the exports have been unshared, which results in errors on the NFS client side.

Please add an NFSRestart Lower resource (set Lower = 1) and adjust the dependencies as shown in the Bundled Agents Reference Guide; a sketch follows below.
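As a rough illustration, the two NFSRestart resources could be defined in main.cf as below (a sketch only: NFS-group is the NFS resource name taken from your log, while nfs-restr_L is an assumed name for the new Lower resource):

	NFSRestart nfs-restr_L (
		NFSRes = NFS-group
		Lower = 1
		)

	NFSRestart nfs-restr_U (
		NFSRes = NFS-group
		)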

 

Regards,

Samyak.

 

 

semenov_m_o
Level 3

Hello, Daniel!

Thank you very much for your advice.
It really worked for me, and now we do not experience such problems during failover!

However, another problem has appeared.

Now, during service startup on both nodes, we see messages like this:

2013/08/02 11:56:58 VCS ERROR V-16-10031-7029 (app01) NFS:NFS-group:monitor:Daemon [rpc.nfsd] is not running. 

Is it also a dependency problem?

The new config and a full log extract are attached.

Thank you again!
Maxim Semenov.

semenov_m_o
Level 3

Hello, Samyak.

Thank you for your suggestion.
It was already proposed by Daniel earlier.

But now I have another problem (described above), and I would appreciate your help with it.

Best Regards,
Maxim.

Samyak
Level 2
Employee

Hi Maxim,

 

Place the app_rg_app resource above the NFSRestart Upper resource (nfs-restr_U).

This should solve your issue.

Regards,

Samyak.

Daniel_Matheus
Level 4
Employee Accredited Certified

Hi Maxim,

 

The NFS daemons are killed when the NFSRestart resource goes offline, and they are restarted later.
During this window, the NFS resource may find that statd is not running and may report the error message. (Please check on the actual system whether the daemons are running:)

/etc/init.d/nfslock status
/etc/init.d/nfs status
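
If the init scripts are inconclusive, querying the portmapper is another standard (non-VCS-specific) way to confirm the daemons are registered:

rpcinfo -p | egrep 'nfs|mountd|status'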

This is by design and is not an issue. The daemons are started later, which is why you find them in the running state even though the error message says the rpc.statd daemon is not running.

In the NFS monitor, we report the resource online for some time even if the daemons are not running. This gives the NFSRestart resource time to complete its operation.

 

To work around these messages, you could try changing the ToleranceLimit of the NFS resource type to 1 or 2, for example:

 

#haconf -makerw

#hatype -modify NFS ToleranceLimit 2

#haconf -dump -makero

 

The ToleranceLimit attribute defines the number of times the Monitor routine should return an offline status before declaring a resource offline. This attribute is typically used when a resource is busy and appears to be offline. Setting the attribute to a non-zero value instructs VCS to allow multiple failing monitor cycles with the expectation that the resource will eventually respond. Setting a non-zero ToleranceLimit also extends the time required to respond to an actual fault.
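
To verify the new setting afterwards, the standard type query should do:

#hatype -value NFS ToleranceLimit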

semenov_m_o
Level 3

Hello, Samyak!

Do you mean I need to replace

	nfs-restr_U requires app-rg-app

with

	app-rg-app requires nfs-restr_U

?

semenov_m_o
Level 3

Thank you, Daniel.

But the problem is with rpc.nfsd, not with rpc.statd.

Nevertheless, I will try experimenting with ToleranceLimit and will report the results here.

Samyak
Level 2
Employee

Hi Maxim,

 

Please replace 


nfs-restr_U requires app-rg-app
app-rg-app requires app-rg-ip

with 


app-rg-app requires nfs-restr_U
nfs-restr_U requires app-rg-ip

During a service group online operation, the NFS resource falsely reports online for up to 3 minutes (the default) and then invokes the clean routine if the NFS services are still not running.

Since your application (app_rg_app) takes around 5 minutes to come online, and the NFSRestart Upper resource sits above app_rg_app, the NFS services stay offline for those 5 minutes. This causes the NFS resource to invoke clean.

The NFS services will start sooner once the nfs-restr_U resource is moved below app-rg-app.
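
If you prefer to apply this on the running cluster rather than editing main.cf by hand, the equivalent standard commands should be as below (hares -link/-unlink take the parent resource first; checking the result with hares -dep app-rg-app afterwards is a good idea):

#haconf -makerw
#hares -unlink nfs-restr_U app-rg-app
#hares -unlink app-rg-app app-rg-ip
#hares -link app-rg-app nfs-restr_U
#hares -link nfs-restr_U app-rg-ip
#haconf -dump -makero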

 

Thanks and Regards,

Samyak.