Forum Discussion

semenov_m_o
12 years ago

NFS share doesn't fail over due to being busy

Hello!

We are trying to implement a failover cluster that hosts a database and files on a clustered NFS share.
The files are used by the clustered application itself and by several other hosts.

The problem is that when the active node fails (I mean an ungraceful server shutdown or a clustered service stopping), the other hosts continue to use files on our cluster-hosted NFS share.
This leads to the NFS share "hanging": it no longer works on the first node, yet it cannot be brought online on the second node. Requests from the other hosts to that NFS share also hang.
I will attach logs later in which the problem can be observed.

The only corrective action we have found is a total shutdown followed by a sequential start of all cluster nodes and other hosts.

Please recommend best practices for using an NFS share on Veritas Cluster Server (perhaps start/stop/clean scripts included as a cluster resource, or additional cluster configuration options).

Thank you in advance!

Best regards,
Maxim Semenov.

  • Hi Maxim,

     

    I think the problem is your dependencies.

    VCS tries to unmount the file system while it is still in use.

    2013/07/31 17:31:22 VCS NOTICE V-16-1-10300 Initiating Offline of Resource app-rg-mount (Owner: Unspecified, Group: app-rg) on System app02
    2013/07/31 17:31:23 VCS NOTICE V-16-10031-5512 (app02) Mount:app-rg-mount:offline:Trying force umount with signal 15...
    2013/07/31 17:31:23 VCS NOTICE V-16-10031-5512 (app02) Mount:app-rg-mount:offline:Trying force umount with signal 9...
    2013/07/31 17:31:23 VCS INFO V-16-1-10305 Resource nfs-restr (Owner: Unspecified, Group: app-rg) is offline on app02 (VCS initiated)
    2013/07/31 17:31:23 VCS NOTICE V-16-1-10300 Initiating Offline of Resource lockinfo-mnt (Owner: Unspecified, Group: app-rg) on System app02
    2013/07/31 17:36:24 VCS INFO V-16-2-13003 (app02) Resource(app-rg-mount): Output of the timed out operation (offline) 
             the device is found by lsof(8) or fuser(1))
             the device is found by lsof(8) or fuser(1))
             the device is found by lsof(8) or fuser(1))
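
    The umount output above points at lsof(8) and fuser(1). A minimal sketch of how one might check what keeps the file system busy before VCS takes the Mount resource offline is shown here. The function name show_holders and the example path are hypothetical and not taken from this thread. Note that a file system exported by the kernel NFS server can stay busy without any process showing up in fuser or lsof, which is why the sketch also lists matching exports.

    ```shell
    # Sketch: list what may be keeping a file system busy before umount.
    # The mount point passed in is a placeholder, not a path from this thread.
    show_holders() {
        mnt="$1"

        # User-space processes with open files or a working directory
        # on the file system that contains "$mnt".
        if command -v fuser >/dev/null 2>&1; then
            fuser -vm "$mnt" 2>&1 || true
        fi
        if command -v lsof >/dev/null 2>&1; then
            lsof "$mnt" 2>/dev/null || true
        fi

        # Kernel NFS exports do not appear in fuser/lsof output, so also
        # list the active exports that mention the mount point.
        if command -v exportfs >/dev/null 2>&1; then
            exportfs -v 2>/dev/null | grep -F "$mnt" || true
        fi

        return 0
    }

    # Usage (hypothetical path): show_holders /export/app
    ```

    Run it as root on the node that still holds the mount; empty output from all three tools suggests the holder is something else, such as NFS locks.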

     

    Please check page 132 of the Bundled Agents Guide for an example of how to set up dependencies.

    http://www.symantec.com/business/support/resources/sites/BUSINESS/content/live/DOCUMENTATION/5000/DOC5881/en_US/vcs_bundled_agents_601_lin.pdf

    Thanks,
    Dan

     

     

  • Hello, Samyak!

    Do you mean I need to replace

    	nfs-restr_U requires app-rg-app

    with

    	app-rg-app requires nfs-restr_U

    ?

  • Thank you, Daniel.

    But the problem is with rpc.nfsd, not with rpc.statd.

    Nevertheless, I will experiment with ToleranceLimit and report the results here.

  • Hi Maxim,

     

    Please replace 

    
    
    nfs-restr_U requires app-rg-app
    app-rg-app requires app-rg-ip

    with 

    
    
    app-rg-app requires nfs-restr_U
    nfs-restr_U requires app-rg-ip

    During a service group online operation, the NFS resource falsely reports online for up to 3 minutes (the default) and invokes clean if the NFS services are still not running after that.

    Since your application (app-rg-app) takes around 5 minutes to come online, and the NFSRestart_Upper resource (nfs-restr_U) sits above app-rg-app, the NFS services stay offline for those 5 minutes. This causes the NFS resource to invoke clean.

    Moving the nfs-restr_U resource below app-rg-app lets the NFS services start sooner.
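
    If you prefer the command line to editing main.cf by hand, the change could be sketched roughly as follows. This assumes the two links quoted above are the only ones that need to change and that the service group is in a state where relinking is safe. The helper names run and relink_dependencies and the DRY_RUN switch are made up for illustration; the sketch only prints the commands unless DRY_RUN=0 is set.

    ```shell
    # Sketch: swap the two resource links with the VCS command line.
    # DRY_RUN=1 (the default here) only prints the commands.
    DRY_RUN="${DRY_RUN:-1}"

    run() {
        if [ "$DRY_RUN" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    relink_dependencies() {
        # Open the cluster configuration for writing.
        run haconf -makerw || return 1
        # Drop the old links: nfs-restr_U requires app-rg-app,
        # and app-rg-app requires app-rg-ip.
        run hares -unlink nfs-restr_U app-rg-app || return 1
        run hares -unlink app-rg-app app-rg-ip || return 1
        # Add the new links: app-rg-app requires nfs-restr_U,
        # and nfs-restr_U requires app-rg-ip.
        run hares -link app-rg-app nfs-restr_U || return 1
        run hares -link nfs-restr_U app-rg-ip || return 1
        # Save the configuration and make it read-only again.
        run haconf -dump -makero || return 1
        # Show the resulting dependencies of the application resource.
        run hares -dep app-rg-app
    }

    relink_dependencies
    ```

    Review the printed commands first, then rerun with DRY_RUN=0 on one cluster node. If a step fails midway, the configuration may still be open for writing, so check it before retrying.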

     

    Thanks and Regards,

    Samyak.