
NFS4 Input/output error after resource failover

Markus_B
Level 3

Hi everybody,

I'm running VERITAS Cluster Server 5.1.00.2 with clustered NFS (NFSv4 enabled) on Red Hat 5.6. When I mount an export with NFSv4 from a client and start a resource failover during a file copy to this share, the copy fails with an "Input/output error" message.
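For reference, I reproduce it roughly like this (the IP, export path, and group name match the example config I post further down; adjust to your setup):

# on the client: mount the share over NFSv4 and start a large copy
mount -t nfs4 10.10.0.12:/471102 /mnt/nfstest
cp /some/bigfile /mnt/nfstest/

# on a cluster node: switch the service group to the other node mid-copy
hagrp -switch NFS-Service1 -to vcs-1-node-2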

In /var/log/messages on the node where I failover to I see:

kernel: nfs4_cb: server 10.31.1.12 not responding, timed out

The mentioned IP is that of the NFS client. As I could see in tcpdump, NFSv4 uses callbacks and initiates the callback connection from the NFS server to the client using the node IP. The node IP changes during the failover, so this might be the reason the connection breaks. A capture along the lines of the sketch below shows this.
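Roughly what I captured with (interface name is from my bonding setup; the filter is just an example):

# watch traffic to/from the NFS client; the callback shows up as a
# TCP connection initiated by the server's node IP, not the virtual IP
tcpdump -i bond0 -n host 10.31.1.12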

Does anyone have a working installation with NFSv4 failover who can give me a hint on how to avoid the described problem?

If you need further information about the setup, please let me know.

 

Regards

Markus


9 REPLIES

Gaurav_S
Moderator
VIP Certified

Hi Markus,

Did you configure NFSRestart resources in your cluster? If yes, can you post the main.cf so we can verify the resource dependency tree?

In case you haven't configured the NFSRestart resource, it would be worth having a look here:

https://sort.symantec.com/public/documents/sfha/5.1sp1/linux/productguides/pdf/vcs_bundled_agents_51sp1_lin.pdf

 

Check out page 133.

 

Gaurav

Markus_B
Level 3

Hi Gaurav,

I use the NFSRestart resource. As recommended, I have one parallel group, NFS:

 group NFS (
    SystemList = { vcs-1-node-1 = 0, vcs-1-node-2 = 1 }
    Parallel = 1
    AutoStartList = { vcs-1-node-2, vcs-1-node-1 }
    )

    NFS NFS_NFS (
        Nproc = 64
        NFSv4Support = 1
        )

    NIC NFS_NIC (
        Device = bond0
        Mii = 0
        NetworkHosts = { "10.10.0.1" }
        )

    Phantom NFS_Phantom (
        )

    Share NFS_Share_root (
        PathName = "/cluster/nfs"
        Client = "10.10.0.0/24"
        OtherClients = { "10.10.1.0/24" }
        Options = "ro, fsid=0"
        NFSRes = NFS_NFS
        )

    NFS_Share_root requires NFS_NFS


    // resource dependency tree
    //
    //    group NFS
    //    {
    //    NIC NFS_NIC
    //    Phantom NFS_Phantom
    //    Share NFS_Share_root
    //        {
    //        NFS NFS_NFS
    //        }
    //    }

and multiple NFS service groups that handle the mounting and exporting via NFS. Here, for example, is NFS-Service1:

 

 group NFS-Service1 (
    SystemList = { vcs-1-node-1 = 0, vcs-1-node-2 = 1 }
    AutoStartList = { vcs-1-node-2, vcs-1-node-1 }
    PreOnline @vcs-1-node-1 = 1
    PreOnline @vcs-1-node-2 = 1
    )

    DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1 (
        DiskGroup = DiskGroup_nfs1
        )

    IP NFS-Service1_IP_10-10-0-12 (
        Device = bond0
        Address = "10.10.0.12"
        NetMask = "255.255.255.224"
        )

    Mount NFS-Service1_Mount_Volume_cust_471102 (
        MountPoint = "/cluster/nfs/471102"
        BlockDevice = "/dev/vx/dsk/DiskGroup_nfs1/Volume_cust_471102"
        FSType = vxfs
        FsckOpt = "-y"
        )

    Mount NFS-Service1_Mount_Volume_dummy-nfs1 (
        MountPoint = "/cluster/nfs/dummy-nfs1"
        BlockDevice = "/dev/vx/dsk/DiskGroup_nfs1/Volume_dummy-nfs1"
        FSType = vxfs
        FsckOpt = "-y"
        )

    Mount NFS-Service1_Mount_Volume_lock (
        MountPoint = "/cluster/nfs/lock-nfs1"
        BlockDevice = "/dev/vx/dsk/DiskGroup_nfs1/Volume_lock"
        FSType = vxfs
        FsckOpt = "-y"
        )

    NFSRestart NFS-Service1_NFSRestart_NFSRestart (
        NFSRes = NFS_NFS
        LocksPathName = "/cluster/nfs/lock-nfs1"
        NFSLockFailover = 1
        )

    Proxy NFS-Service1_Proxy_NFS (
        TargetResName = NFS_NFS
        )

    Proxy NFS-Service1_Proxy_NIC (
        TargetResName = NFS_NIC
        )

    Share NFS-Service1_Share_cust_471102-0 (
        PathName = "/cluster/nfs/471102"
        OtherClients = { "10.10.1.184/29" }
        Options = "rw,no_root_squash,nohide"
        NFSRes = NFS_NFS
        )

    Share NFS-Service1_Share_dummy-nfs1 (
        PathName = "/cluster/nfs/dummy-nfs1"
        Client = "10.10.0.0/16"
        Options = "ro,no_root_squash,nohide"
        NFSRes = NFS_NFS
        )

    Volume NFS-Service1_Volume_Volume_cust_471102 (
        DiskGroup = DiskGroup_nfs1
        Volume = Volume_cust_471102
        )

    Volume NFS-Service1_Volume_Volume_dummy-nfs1 (
        DiskGroup = DiskGroup_nfs1
        Volume = Volume_dummy-nfs1
        )

    Volume NFS-Service1_Volume_Volume_lock (
        DiskGroup = DiskGroup_nfs1
        Volume = Volume_lock
        )

    requires group NFS online local firm
    NFS-Service1_IP_10-10-0-12 requires NFS-Service1_Proxy_NIC
    NFS-Service1_IP_10-10-0-12 requires NFS-Service1_Share_cust_471102-0
    NFS-Service1_IP_10-10-0-12 requires NFS-Service1_Share_dummy-nfs1
    NFS-Service1_Mount_Volume_cust_471102 requires NFS-Service1_Volume_Volume_cust_471102
    NFS-Service1_Mount_Volume_dummy-nfs1 requires NFS-Service1_Volume_Volume_dummy-nfs1
    NFS-Service1_Mount_Volume_lock requires NFS-Service1_Volume_Volume_lock
    NFS-Service1_NFSRestart_NFSRestart requires NFS-Service1_IP_10-10-0-12
    NFS-Service1_NFSRestart_NFSRestart requires NFS-Service1_Mount_Volume_lock
    NFS-Service1_Share_cust_471102-0 requires NFS-Service1_Mount_Volume_cust_471102
    NFS-Service1_Share_cust_471102-0 requires NFS-Service1_Proxy_NFS
    NFS-Service1_Share_dummy-nfs1 requires NFS-Service1_Mount_Volume_dummy-nfs1
    NFS-Service1_Share_dummy-nfs1 requires NFS-Service1_Proxy_NFS
    NFS-Service1_Volume_Volume_cust_471102 requires NFS-Service1_DiskGroup_DiskGroup_nfs1
    NFS-Service1_Volume_Volume_dummy-nfs1 requires NFS-Service1_DiskGroup_DiskGroup_nfs1
    NFS-Service1_Volume_Volume_lock requires NFS-Service1_DiskGroup_DiskGroup_nfs1


    // resource dependency tree
    //
    //    group NFS-Service1
    //    {
    //    NFSRestart NFS-Service1_NFSRestart_NFSRestart
    //        {
    //        Mount NFS-Service1_Mount_Volume_lock
    //            {
    //            Volume NFS-Service1_Volume_Volume_lock
    //                {
    //                DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1
    //                }
    //            }
    //        IP NFS-Service1_IP_10-10-0-12
    //            {
    //            Share NFS-Service1_Share_dummy-nfs1
    //                {
    //                Mount NFS-Service1_Mount_Volume_dummy-nfs1
    //                    {
    //                    Volume NFS-Service1_Volume_Volume_dummy-nfs1
    //                        {
    //                        DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1
    //                        }
    //                    }
    //                Proxy NFS-Service1_Proxy_NFS
    //                }
    //            Proxy NFS-Service1_Proxy_NIC
    //            Share NFS-Service1_Share_cust_471102-0
    //                {
    //                Mount NFS-Service1_Mount_Volume_cust_471102
    //                    {
    //                    Volume NFS-Service1_Volume_Volume_cust_471102
    //                        {
    //                        DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1
    //                        }
    //                    }
    //                Proxy NFS-Service1_Proxy_NFS
    //                }
    //            }
    //        }
    //    } 


 

RiaanBadenhorst
Moderator
Partner VIP Accredited Certified

Also make sure the major and minor numbers of the disks match between the systems.

Markus_B
Level 3

Thanks for your replies. Today I received a hint from my friendly Veritas trainer, who gave me a thought-provoking impulse. He also mentioned the major/minor numbers, so I checked both systems and found them equal. But he also reminded me that minor numbers greater than 255 can cause trouble with NFS. By default, VxVM uses minor numbers far larger than 255, so you have to reminor. You can check with:

vxprint -g <DiskGroup> -vF %minor <Volume>

To reminor, use:

vxdg -g <DiskGroup> reminor 100
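For example, with the names from my main.cf above (run on both nodes and compare):

# minor number of the lock volume in DiskGroup_nfs1
vxprint -g DiskGroup_nfs1 -vF %minor Volume_lock

# renumber the disk group so its volume minors start at 100
vxdg -g DiskGroup_nfs1 reminor 100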

Although it does not exactly match my problem, I found this article helpful:

http://www.symantec.com/business/support/index?page=content&id=TECH148225&key=15107&actp=LIST

After that, the failover from one node to another works perfectly, even during a file transfer with both NFSv3 and NFSv4.

Markus_B
Level 3

After a few more tests I found that the minor numbers weren't the root of my problem. It was rather a combination of circumstances that made the failover work after the minor number tuning.

Now I've disabled all resources in VCS and failed over manually. The result is that the failover works fine when nfsd is restarted on the new node (see below for what I run). Unfortunately, VCS doesn't restart nfsd when the resource group fails over.
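What I run by hand on the new node after the manual failover (standard RHEL 5 init script):

# restart the kernel NFS server so clients re-establish their NFSv4 state
service nfs restart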

Does anyone have a hint on how to get VCS to restart nfsd, or is this a bug?

mikebounds
Level 6
Partner Accredited

Hi Markus,

You need to copy the triggers nfs_postoffline (and nfs_preonline if you want NFS locks to fail over) from /opt/VRTSvcs/bin/sample_triggers to /opt/VRTSvcs/bin/triggers. Something like the sketch below should do it.
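A sketch of the copy; renaming the files to the generic event names that had invokes is my assumption, so check the comments at the top of each sample trigger:

# assumed rename: had calls triggers by event name (postoffline/preonline)
cp /opt/VRTSvcs/bin/sample_triggers/nfs_postoffline /opt/VRTSvcs/bin/triggers/postoffline
cp /opt/VRTSvcs/bin/sample_triggers/nfs_preonline /opt/VRTSvcs/bin/triggers/preonline
chmod 755 /opt/VRTSvcs/bin/triggers/postoffline /opt/VRTSvcs/bin/triggers/preonline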

In 5.0 these used to be in /opt/VRTSvcs/bin/triggers by default, but they run certain hares commands which are very inefficient for large configs (more than 50 service groups), and the triggers could hang the had daemon for several seconds. So, in 5.0RP2 I think, they were moved out, as the majority of customers do not use NFS shares, and a note was added to the RP notes saying you needed to copy them into place. It seems this info is missing from the 5.1 docs, as I cannot find it in the bundled agents guide, release notes, or admin guide. All of the guides mention the triggers to some extent, but none explicitly mention that they need to be copied into place. This really needs to go into the bundled agents guide, so perhaps someone from Symantec who reads this can action it.

Mike

 

 

Markus_B
Level 3

A few days ago I updated VCS to 5.1SP1. Since SP1 the NFSRestart resource has an additional attribute called "Lower" and the resource chain is a bit different. That made me hopeful that my problem had already been addressed, and in fact it has. The "new" NFSRestart resource does exactly what I figured out has to be done during a failover, which is restarting nfsd.

Thanks a lot for your help and advice.

mikebounds
Level 6
Partner Accredited

Did SP1 copy the triggers into /opt/VRTSvcs/bin/triggers (see my previous comment), or had you already done this manually? Or is it working without these triggers being there?

 

Mike

Markus_B
Level 3

Hi Mike,

As far as I can see, SP1 comes without the NFS triggers. I did try the triggers you mentioned on the previous version, but they didn't solve my problem. My current triggers dir looks like this:

ls -la /opt/VRTSvcs/bin/triggers
total 40
drwxrwxr-x  2 root sys  4096 Feb  9 16:46 .
drwxr-xr-x 52 root root 4096 Oct  5 04:30 ..
-rwxr-----  1 root root 2313 Oct  1 09:23 dump_tunables
-rwxr-----  1 root root 2319 Oct  1 09:23 globalcounter_not_updated
-rwxr--r--  1 root sys  2295 Oct  5 04:30 postoffline
-rwxr--r--  1 root sys  3574 Oct  5 04:30 postonline
-rwxr--r--  1 root sys  7092 Oct  5 04:30 preonline
-rwxr-----  1 root root 7499 Oct  1 09:23 violation

My nfs-service group has preonline disabled:

# hagrp -display NFS-Service1 -attr PreOnline
#Group       Attribute             System              Value
NFS-Service1 PreOnline             vcs-101010-1-node-1 0
NFS-Service1 PreOnline             vcs-101010-1-node-2 0

Since SP1 you have to use two NFSRestart resources, one of them with the "Lower" attribute set. It is quite well described in the agent notes; a rough sketch of the idea is below.
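A minimal sketch of the two resources, with resource names of my own invention (the exact attributes and dependency chain are in the 5.1SP1 bundled agents guide):

    NFSRestart NFS-Service1_NFSRestart_lower (
        NFSRes = NFS_NFS
        Lower = 1
        )

    NFSRestart NFS-Service1_NFSRestart_upper (
        NFSRes = NFS_NFS
        LocksPathName = "/cluster/nfs/lock-nfs1"
        NFSLockFailover = 1
        )

    // upper NFSRestart sits on top of the IP; the Shares sit on the lower one
    NFS-Service1_NFSRestart_upper requires NFS-Service1_IP_10-10-0-12
    NFS-Service1_Share_dummy-nfs1 requires NFS-Service1_NFSRestart_lower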

Regards

Markus