Forum Discussion

aryan3051984
13 years ago

VCS 5.1 SP1, NetBackup 7.1.0.2 & Linux RHEL 5.6

Hi All,

 

I am new to this forum. I need your help with an issue I am facing at a customer site.

I have a setup of 3 clustered nodes: 2 nodes at one location and 1 node at a remote location. Failover between the 2 local nodes is a normal failover, and failover to the remote node is via GCO.

The OS on these nodes is RHEL 5.6.

The VCS version is 5.1 SP1 and the NetBackup version is 7.1.0.2.

The NIC on the 2 local nodes is a bonded NIC.

We are trying to test a scenario in which the network links go down. When we pull out both production cables, the failover does not happen: ethtool shows no connectivity on the Ethernet adapters, but if you run ifconfig -a the bond still shows as UP.
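
For reference, this is how the state can be checked on the node (slave names eth0/eth1 are assumptions; /proc/net/bonding/bond0 lists the actual slaves):

    # Physical link per slave, as reported by the driver
    ethtool eth0 | grep "Link detected"
    ethtool eth1 | grep "Link detected"

    # MII status of the bond and each slave, as the bonding driver sees it
    cat /proc/net/bonding/bond0

    # Administrative state of the bond; this stays UP even with all cables pulled
    ifconfig bond0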

Also, if I shut down one of the nodes, the failover from one node to the other happens without any issues.

If I run ifdown on the bonded NIC, the failover is initiated, but the NBU services are not killed even after running bp.kill_all, so the nbu_server group does not go down on the active node and the failover does not complete.

Does anyone have the same setup, and has anyone faced issues like this?


7 Replies

  • How do you have your NIC resource configured in VCS?

    The NIC agent monitors an interface not just for up/down state, but for active traffic. Just unplugging the cable(s) from a NIC typically won't change its state to DOWN.

    Pasting in the segment from main.cf where your NIC is configured might help.

  • include "OracleASMTypes.cf"
    include "types.cf"
    include "Db2udbTypes.cf"
    include "HTCTypes.cf"
    include "NetBackupTypes.cf"
    include "OracleTypes.cf"
    include "SybaseTypes.cf"

    cluster clusterA (
        UserNames = { admin = gNOgNInKOjOOmWOiNL }
        ClusterAddress = "10.XX.XXX.XXX"
        Administrators = { admin }
        )

    remotecluster clusterB (
        ClusterAddress = "10.XX.XXX.XXX"
        )

    heartbeat Icmp (
        ClusterList = { clusterB }
        Arguments @clusterB = { "10.XX.XXX.XXX" }
        )

    system systemA (
        )

    system systemB (
        )

    group ClusterService (
        SystemList = { systemA = 0, systemB = 1 }
        AutoStartList = { systemA, systemB }
        OnlineRetryLimit = 3
        OnlineRetryInterval = 120
        )

    Application wac (
        StartProgram = "/opt/VRTSvcs/bin/wacstart"
        StopProgram = "/opt/VRTSvcs/bin/wacstop"
        MonitorProcesses = { "/opt/VRTSvcs/bin/wac" }
        RestartLimit = 3
        )

    IP webip (
        Device = bond0
        Address = "10.XX.XXX.XXX"
        NetMask = "255.255.255.0"
        )

    NIC backup_nic (
        Device = eth2
        NetworkHosts @systemA = { "10.XX.XXX.XXX" }
        NetworkHosts @systemB = { "10.XX.XXX.XXX" }
        )

    NIC public_nic (
        Device = bond0
        NetworkHosts @systemA = { "10.XX.XXX.X" }
        NetworkHosts @systemB = { "10.XX.XXX.X" }
        )

    wac requires webip
    webip requires public_nic

    // resource dependency tree
    //
    // group ClusterService
    // {
    //     NIC backup_nic
    //     Application wac
    //         {
    //         IP webip
    //             {
    //             NIC public_nic
    //             }
    //         }
    // }

    group nbu_group (
        SystemList = { systemA = 0, systemB = 1 }
        Frozen = 1
        AutoStart = 0
        ClusterList = { clusterA = 0, clusterB = 1 }
        Authority = 1
        AutoStartList = { systemA, systemB }
        )

    DiskGroup nbu_dg (
        DiskGroup = nbudg
        )

    HTC horcm0 (
        Critical = 0
        GroupName = MAS_HUR
        )

    IP nbu_ip (
        Device @systemA = bond0
        Device @systemB = bond0
        Address = "10.XX.XXX.XXX"
        NetMask = "255.255.255.0"
        )

    IP nbubk_ip (
        Device @systemA = eth2
        Device @systemB = eth2
        Address = "10.XX.XXX.XXX"
        NetMask = "255.255.252.0"
        )

    Mount nbu_mount (
        MountPoint = "/opt/VRTSnbu"
        BlockDevice = "/dev/vx/dsk/nbudg/nbuvol"
        FSType = vxfs
        FsckOpt = "-y"
        )

    NetBackup nbu_server (
        Critical = 0
        ResourceOwner = unknown
        ServerName = NBU_Server
        ServerType = NBUMaster
        MonScript = NONE
        RSPFile = "/usr/openv/netbackup/bin/cluster/NBU_RSP"
        GroupName = nbu_group
        )

    Proxy p_backup_nic (
        TargetResName = backup_nic
        )

    Proxy p_public_nic (
        TargetResName = public_nic
        )

    Volume nbu_vol (
        DiskGroup = nbudg
        Volume = nbuvol
        )

    nbu_dg requires horcm0
    nbu_ip requires p_public_nic
    nbu_mount requires nbu_vol
    nbu_server requires nbu_ip
    nbu_server requires nbu_mount
    nbu_server requires nbubk_ip
    nbu_vol requires nbu_dg
    nbubk_ip requires p_backup_nic

    // resource dependency tree
    //
    // group nbu_group
    // {
    //     NetBackup nbu_server
    //         {
    //         IP nbu_ip
    //             {
    //             Proxy p_public_nic
    //             }
    //         Mount nbu_mount
    //             {
    //             Volume nbu_vol
    //                 {
    //                 DiskGroup nbu_dg
    //                     {
    //                     HTC horcm0
    //                     }
    //                 }
    //             }
    //         IP nbubk_ip
    //             {
    //             Proxy p_backup_nic
    //             }
    //         }
    // }

  • "ifconfig -a  the bond is still showing as UP"

    VCS cannot and will not register a fault as long as ifconfig reports the interface as UP.

    You could possibly add PingHostList and/or UseConnectionStatus as additional tests.
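
    As a sketch, the ping test can be pointed at the NIC resources through the NetworkHosts attribute already present in your main.cf; pick a target whose reachability actually depends on the pulled cable, e.g. a router beyond the local switch (the addresses below are placeholders, and the exact attribute names available on your platform should be confirmed in the Bundled Agents Reference Guide):

        haconf -makerw
        # Point the ping test at a host beyond the local switch
        hares -modify public_nic NetworkHosts "10.XX.XXX.X" -sys systemA
        hares -modify public_nic NetworkHosts "10.XX.XXX.X" -sys systemB
        haconf -dump -makero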

  • Hi Marianne,

    What I am actually trying to achieve is that if I remove the LAN cable, VCS should fault the NIC resource and nbu_group should move to the other node. But I am unable to achieve that.

    Also, when I run ifdown bond0, VCS tries to fail over, but it is not able to bring all the NBU processes down on the active node, so the failover does not complete. We even tried kill -9 and waited for an hour, but it did not work. The only option left was to shut the node down and then bring the resources up manually on the other node.

    We have a Symantec case open; they are trying to reproduce this in a lab. There has been no outcome for the last week, and we have an end-of-March deadline to hand over this project.

  • I see 2 issues here (neither of them a VCS problem):

    • The bonded NIC that stays UP
    • NBU that does not go down

    For the NBU issue, I would test as follows (see the command sketch after this list):

    1. Stop NBU manually using 'netbackup stop'. Since the NBU resource is not marked Critical, this will not cause a failover. Monitor from another window with bpps -x every couple of seconds to see whether the processes are terminating, and note how long it takes for all of them to go down. If there are active backups, it will take longer.
    2. Start NBU again, then use VCS to offline the nbu_server resource. Again, run bpps -x every couple of seconds to check whether the processes are going down.

    It is normal for NBU to take quite a while to go down, but 1 hour is certainly excessive. We normally increase the offline timeout based on the time it takes 'netbackup stop' to stop all the processes.
    As a matter of interest, which processes are not terminating?
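
    A rough sketch of that sequence (the standard NBU paths and the resource/system names from your main.cf are assumed; 600 seconds is only an example value):

        # 1. Stop NBU outside of VCS and watch the processes drain
        /usr/openv/netbackup/bin/goodies/netbackup stop
        watch -n 2 /usr/openv/netbackup/bin/bpps -x

        # 2. Start NBU again, then offline the resource through VCS
        /usr/openv/netbackup/bin/goodies/netbackup start
        hares -offline nbu_server -sys systemA
        watch -n 2 /usr/openv/netbackup/bin/bpps -x

        # If 'netbackup stop' needs several minutes, raise the offline timeout to match
        haconf -makerw
        hares -override nbu_server OfflineTimeout
        hares -modify nbu_server OfflineTimeout 600
        haconf -dump -makero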

  • There's actually a bit more to NIC monitoring than just reported link state.

    From the Bundled Agents Reference Guide:

    If the NIC maintains its connection status, the agent uses MII to determine the status of the resource. If the NIC does not maintain its connection status, the agent verifies that the NIC is configured. The agent then sends a ping to all the hosts that are listed in the NetworkHosts attribute. If the ping test is successful, it marks the NIC resource ONLINE. If the NetworkHosts attribute list is empty, or the ping test fails, the agent counts the number of packets that the NIC received. The agent compares the count with a previously stored value. If the packet count increases, the resource is marked ONLINE. If the count remains unchanged, the agent sends a ping to the broadcast address of the device to generate traffic on the network. The agent counts the number of packets that the NIC receives before and after the broadcast. If the count increases, the resource is marked ONLINE. If the count remains the same or decreases over a period of five broadcast cycles, the resource is marked OFFLINE.
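
    You can sample the same signals by hand (a sketch; eth2 is just the example interface from the main.cf above):

        # Link/MII status, where the driver reports it
        ethtool eth2 | grep "Link detected"

        # Received-packet counter, sampled twice; the agent marks the resource
        # ONLINE if the count increases
        grep "eth2:" /proc/net/dev; sleep 5; grep "eth2:" /proc/net/dev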

  • Hi All,

    Thanks to everyone who replied to this post. The issue has been resolved by Symantec.

    Backline support created an engineering binary for us, and it is working fine now.

    We tested both scenarios, pulling the LAN cable as well as bringing down the bond, and the failover happens without any issues.