Forum Discussion

Markus_B
Level 3
14 years ago

NFS4 Input/output error after resource failover

Hi everybody,

I'm running VERITAS Cluster Server 5.1.00.2 with clustered NFS (nfs4 enabled) on RedHat 5.6. When I mount an export with nfs4 from a client and trigger a resource failover while a file copy to that share is in progress, the copy fails with an "Input/output error" message.

In /var/log/messages on the node I fail over to, I see:

kernel: nfs4_cb: server 10.31.1.12 not responding, timed out

That IP belongs to the nfs client. As I could see in tcpdump, nfs4 uses callbacks: the nfs server opens a connection to the client, using the node IP as its source address. The node IP changes with the failover, so this might be why the connection breaks.

Does anyone have a working installation with nfs4 failover and can give me a hint on how to avoid the described problem?

If you need further information about the setup, please let me know.
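
To see which peers the kernel reported callback timeouts for, the log can be filtered for the message quoted above. This is only a small sketch: the function name is made up for illustration, and the pattern assumes the message format is exactly the one shown in this post.

```shell
# Print each distinct peer address named in nfs4_cb timeout messages.
# Reads syslog-style lines on stdin; the pattern follows the log line
# quoted in the post and is an assumption beyond that one sample.
nfs4_cb_timeouts() {
    sed -n 's/.*nfs4_cb: server \([0-9.]*\) not responding.*/\1/p' | sort -u
}

# Example, on the node that took over the service group:
#   nfs4_cb_timeouts < /var/log/messages
```

If the addresses printed are the clients that had files open during the failover, that would be consistent with the callback explanation above.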

 

Regards

Markus

  • Hi Markus,

    Did you configure NFSRestart resources in your cluster? If yes, can you post the main.cf so we can verify the resource dependency tree?

    In case you haven't configured an NFSRestart resource, it would be worth having a look at:

    https://sort.symantec.com/public/documents/sfha/5.1sp1/linux/productguides/pdf/vcs_bundled_agents_51sp1_lin.pdf

     

    Check out page 133.

     

    Gaurav

  • Hi Gaurav,

    I use the NFSRestart resource. As recommended, I have one parallel group, NFS:

     group NFS (
        SystemList = { vcs-1-node-1 = 0, vcs-1-node-2 = 1 }
        Parallel = 1
        AutoStartList = { vcs-1-node-2, vcs-1-node-1 }
        )
    
        NFS NFS_NFS (
            Nproc = 64
            NFSv4Support = 1
            )
    
        NIC NFS_NIC (
            Device = bond0
            Mii = 0
            NetworkHosts = { "10.10.0.1" }
            )
    
        Phantom NFS_Phantom (
            )
    
        Share NFS_Share_root (
            PathName = "/cluster/nfs"
            Client = "10.10.0.0/24"
            OtherClients = { "10.10.1.0/24" }
            Options = "ro, fsid=0"
            NFSRes = NFS_NFS
            )
    
        NFS_Share_root requires NFS_NFS
    
    
        // resource dependency tree
        //
        //    group NFS
        //    {
        //    NIC NFS_NIC
        //    Phantom NFS_Phantom
        //    Share NFS_Share_root
        //        {
        //        NFS NFS_NFS
        //        }
        //    }
    

    and multiple NFS service groups that do the mounting and exporting via NFS. Here, for example, is NFS-Service1:

     

     group NFS-Service1 (
        SystemList = { vcs-1-node-1 = 0, vcs-1-node-2 = 1 }
        AutoStartList = { vcs-1-node-2, vcs-1-node-1 }
        PreOnline @vcs-1-node-1 = 1
        PreOnline @vcs-1-node-2 = 1
        )
    
        DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1 (
            DiskGroup = DiskGroup_nfs1
            )
    
        IP NFS-Service1_IP_10-10-0-12 (
            Device = bond0
            Address = "10.10.0.12"
            NetMask = "255.255.255.224"
            )
    
        Mount NFS-Service1_Mount_Volume_cust_471102 (
            MountPoint = "/cluster/nfs/471102"
            BlockDevice = "/dev/vx/dsk/DiskGroup_nfs1/Volume_cust_471102"
            FSType = vxfs
            FsckOpt = "-y"
            )
    
        Mount NFS-Service1_Mount_Volume_dummy-nfs1 (
            MountPoint = "/cluster/nfs/dummy-nfs1"
            BlockDevice = "/dev/vx/dsk/DiskGroup_nfs1/Volume_dummy-nfs1"
            FSType = vxfs
            FsckOpt = "-y"
            )
    
        Mount NFS-Service1_Mount_Volume_lock (
            MountPoint = "/cluster/nfs/lock-nfs1"
            BlockDevice = "/dev/vx/dsk/DiskGroup_nfs1/Volume_lock"
            FSType = vxfs
            FsckOpt = "-y"
            )
    
        NFSRestart NFS-Service1_NFSRestart_NFSRestart (
            NFSRes = NFS_NFS
            LocksPathName = "/cluster/nfs/lock-nfs1"
            NFSLockFailover = 1
            )
    
        Proxy NFS-Service1_Proxy_NFS (
            TargetResName = NFS_NFS
            )
    
        Proxy NFS-Service1_Proxy_NIC (
            TargetResName = NFS_NIC
            )
    
        Share NFS-Service1_Share_cust_471102-0 (
            PathName = "/cluster/nfs/471102"
            OtherClients = { "10.10.1.184/29" }
            Options = "rw,no_root_squash,nohide"
            NFSRes = NFS_NFS
            )
    
        Share NFS-Service1_Share_dummy-nfs1 (
            PathName = "/cluster/nfs/dummy-nfs1"
            Client = "10.10.0.0/16"
            Options = "ro,no_root_squash,nohide"
            NFSRes = NFS_NFS
            )
    
        Volume NFS-Service1_Volume_Volume_cust_471102 (
            DiskGroup = DiskGroup_nfs1
            Volume = Volume_cust_471102
            )
    
        Volume NFS-Service1_Volume_Volume_dummy-nfs1 (
            DiskGroup = DiskGroup_nfs1
            Volume = Volume_dummy-nfs1
            )
    
        Volume NFS-Service1_Volume_Volume_lock (
            DiskGroup = DiskGroup_nfs1
            Volume = Volume_lock
            )
    
        requires group NFS online local firm
        NFS-Service1_IP_10-10-0-12 requires NFS-Service1_Proxy_NIC
        NFS-Service1_IP_10-10-0-12 requires NFS-Service1_Share_cust_471102-0
        NFS-Service1_IP_10-10-0-12 requires NFS-Service1_Share_dummy-nfs1
        NFS-Service1_Mount_Volume_cust_471102 requires NFS-Service1_Volume_Volume_cust_471102
        NFS-Service1_Mount_Volume_dummy-nfs1 requires NFS-Service1_Volume_Volume_dummy-nfs1
        NFS-Service1_Mount_Volume_lock requires NFS-Service1_Volume_Volume_lock
        NFS-Service1_NFSRestart_NFSRestart requires NFS-Service1_IP_10-10-0-12
        NFS-Service1_NFSRestart_NFSRestart requires NFS-Service1_Mount_Volume_lock
        NFS-Service1_Share_cust_471102-0 requires NFS-Service1_Mount_Volume_cust_471102
        NFS-Service1_Share_cust_471102-0 requires NFS-Service1_Proxy_NFS
        NFS-Service1_Share_dummy-nfs1 requires NFS-Service1_Mount_Volume_dummy-nfs1
        NFS-Service1_Share_dummy-nfs1 requires NFS-Service1_Proxy_NFS
        NFS-Service1_Volume_Volume_cust_471102 requires NFS-Service1_DiskGroup_DiskGroup_nfs1
        NFS-Service1_Volume_Volume_dummy-nfs1 requires NFS-Service1_DiskGroup_DiskGroup_nfs1
        NFS-Service1_Volume_Volume_lock requires NFS-Service1_DiskGroup_DiskGroup_nfs1
    
    
        // resource dependency tree
        //
        //    group NFS-Service1
        //    {
        //    NFSRestart NFS-Service1_NFSRestart_NFSRestart
        //        {
        //        Mount NFS-Service1_Mount_Volume_lock
        //            {
        //            Volume NFS-Service1_Volume_Volume_lock
        //                {
        //                DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1
        //                }
        //            }
        //        IP NFS-Service1_IP_10-10-0-12
        //            {
        //            Share NFS-Service1_Share_dummy-nfs1
        //                {
        //                Mount NFS-Service1_Mount_Volume_dummy-nfs1
        //                    {
        //                    Volume NFS-Service1_Volume_Volume_dummy-nfs1
        //                        {
        //                        DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1
        //                        }
        //                    }
        //                Proxy NFS-Service1_Proxy_NFS
        //                }
        //            Proxy NFS-Service1_Proxy_NIC
        //            Share NFS-Service1_Share_cust_471102-0
        //                {
        //                Mount NFS-Service1_Mount_Volume_cust_471102
        //                    {
        //                    Volume NFS-Service1_Volume_Volume_cust_471102
        //                        {
        //                        DiskGroup NFS-Service1_DiskGroup_DiskGroup_nfs1
        //                        }
        //                    }
        //                Proxy NFS-Service1_Proxy_NFS
        //                }
        //            }
        //        }
        //    } 
    


     

  • Thanks for your replies. Today I received a hint from my friendly Veritas trainer that pointed me in the right direction. He also mentioned the major/minor numbers, so I checked on both systems and found them equal. But he also reminded me that minor numbers greater than 255 might cause trouble with NFS. By default VxVM uses minor numbers far bigger than 255, so you have to reminor. You can check with: vxprint -g <DiskGroup> -vF %minor <Volume>

    To reminor use: vxdg -g <DiskGroup> reminor 100

    Although it is not exactly my problem, I found this article helpful:

    http://www.symantec.com/business/support/index?page=content&id=TECH148225&key=15107&actp=LIST

    After that, the failover from one node to another works perfectly, even during a file transfer, with both nfs3 and nfs4.
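
    The check described above can be wrapped in a small helper. This is a minimal sketch: the function name is invented for illustration, the 255 threshold is simply the one mentioned in this post, and the disk group and volume names in the usage comment are placeholders.

    ```shell
    # Succeeds when a device minor number is above 255, the limit this
    # post mentions as possibly causing trouble with NFS.
    minor_too_large() {
        [ "$1" -gt 255 ]
    }

    # Hypothetical usage with the commands from the post:
    #   minor=$(vxprint -g DiskGroup_nfs1 -vF %minor Volume_lock)
    #   minor_too_large "$minor" && echo "consider: vxdg -g DiskGroup_nfs1 reminor 100"
    ```

    The helper only compares numbers. Whether to reminor is still a decision to take per disk group.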

  • After a few more tests I found that the minor number wasn't the root of my problem. It was more a combination of circumstances that made the failover work after the minor number tuning.

    Now I've disabled all resources in VCS and failed over manually. The result is that the failover works fine when nfsd is restarted on the new node. Unfortunately, VCS doesn't restart nfsd when the resource group fails over.

    Does anyone have a hint on how to get VCS to restart nfsd, or is this a bug?

  • Hi Markus,

    You need to copy the triggers nfs_postoffline (and nfs_preonline if you want NFS locks to failover) from /opt/VRTSvcs/bin/sample_triggers to /opt/VRTSvcs/bin/triggers. 

    In 5.0 these used to be in /opt/VRTSvcs/bin/triggers by default, but they run certain hares commands which are very inefficient for large configs (more than 50 service groups), and the triggers could hang the had daemon for several seconds. So in, I think, 5.0RP2 they were moved, as the majority of customers do not use NFS shares, and a note was added to the RP notes saying you need to copy them into place. It seems this info is missing from 5.1, as I cannot find it in the bundled agents guide, release notes, or admin guide. All of the guides mention the triggers to some extent, but do not explicitly mention that they need to be copied into place. This really needs to go into the bundled agents guide, so perhaps someone from Symantec who reads this can action it.

    Mike
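
    The copy step described in this post could be scripted roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and the directories are passed as parameters so it can be tried against scratch paths before touching a cluster node.

    ```shell
    # Copy the named sample triggers into the active triggers directory
    # and make them executable for the owner. Returns non-zero as soon as
    # one copy or chmod fails.
    install_triggers() {
        src_dir=$1
        dst_dir=$2
        shift 2
        for name in "$@"; do
            cp -p "$src_dir/$name" "$dst_dir/$name" || return 1
            chmod u+x "$dst_dir/$name" || return 1
        done
    }

    # Usage on a cluster node, with the paths and trigger names from the post:
    #   install_triggers /opt/VRTSvcs/bin/sample_triggers /opt/VRTSvcs/bin/triggers nfs_postoffline nfs_preonline
    ```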

     

     

  • A few days ago I updated VCS to 5.1SP1. Since SP1 the NFSRestart resource has an additional attribute called "Lower" and the resource chain is a bit different. That made me hopeful that my problem had already been addressed, and in fact it has. The "new" NFSRestart resource does exactly what I figured out has to be done during a failover, which is restarting nfsd.

    Thanks a lot for your help and advice.

  • Did SP1 copy the triggers into /opt/VRTSvcs/bin/triggers (see my previous comment), had you already done this manually, or is it working without these triggers being there?

     

    Mike

  • Hi Mike,

    As far as I can see, SP1 comes without NFS triggers. I did try the triggers you mentioned on the previous version, but they didn't solve my problem. My current triggers directory looks like this:

    ls -la /opt/VRTSvcs/bin/triggers
    total 40
    drwxrwxr-x  2 root sys  4096 Feb  9 16:46 .
    drwxr-xr-x 52 root root 4096 Oct  5 04:30 ..
    -rwxr-----  1 root root 2313 Oct  1 09:23 dump_tunables
    -rwxr-----  1 root root 2319 Oct  1 09:23 globalcounter_not_updated
    -rwxr--r--  1 root sys  2295 Oct  5 04:30 postoffline
    -rwxr--r--  1 root sys  3574 Oct  5 04:30 postonline
    -rwxr--r--  1 root sys  7092 Oct  5 04:30 preonline
    -rwxr-----  1 root root 7499 Oct  1 09:23 violation

    My nfs-service group has preonline disabled:

    # hagrp -display NFS-Service1 -attr PreOnline
    #Group       Attribute             System              Value
    NFS-Service1 PreOnline             vcs-101010-1-node-1 0
    NFS-Service1 PreOnline             vcs-101010-1-node-2 0

    Since SP1 you have to use two NFSRestart resources, one of them with the "Lower" option. It is quite well described in the agent notes.

    Regards

    Markus
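
    For readers who want to try the two-resource layout, the second NFSRestart resource could be added with commands along these lines. This is a dry-run sketch that only prints the commands. The function name and the "_L" resource name suffix are made up here, and setting NFSRes on the lower resource is an assumption. Dependency links (hares -link) are left out on purpose, because the correct chain should be taken from the agent notes for your version.

    ```shell
    # Print (do not run) the VCS commands that would add an NFSRestart
    # resource with Lower = 1 to the given service group.
    # $1 = service group name, $2 = name of the NFS resource (NFSRes).
    print_lower_nfsrestart_cmds() {
        group=$1
        nfs_res=$2
        lower="${group}_NFSRestart_L"   # hypothetical naming scheme
        echo "haconf -makerw"
        echo "hares -add $lower NFSRestart $group"
        echo "hares -modify $lower NFSRes $nfs_res"
        echo "hares -modify $lower Lower 1"
        echo "hares -modify $lower Enabled 1"
        echo "haconf -dump -makero"
    }

    # Example: review the output first, then run the lines by hand.
    #   print_lower_nfsrestart_cmds NFS-Service1 NFS_NFS
    ```

    Printing instead of executing keeps the sketch safe to run anywhere and lets the commands be checked against the agent documentation before they touch the cluster configuration.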