Forum Discussion

symsonu
Level 6
12 years ago

multinicb failure case understanding required

Hello,

The MultiNICB resource faulted on both nodes, which led to the failure of the proxies and the other dependent resources configured in the service group.

Can you please let us know what the recovery procedure would be once the network is corrected?

Will all the resources come online automatically, or does the admin have to take any action?

  • If the service groups have faulted on all nodes, they will not come online automatically. But if the network comes back before the service groups have faulted on all nodes, the service groups should come online on the nodes they haven't faulted on yet.

    Some resources, like MultiNICB, will clear the fault themselves (as MultiNICB only monitors), but other resources, like IPMultiNICB, will not clear by themselves, so you should run "hagrp -clear" on any groups that contain faulted resources ("hagrp -state" shows FAULTED). Once the groups are cleared, you will then need to online them manually if the service group had faulted on all nodes.

    Mike
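
The manual recovery Mike describes could be sketched roughly as follows. This is a minimal sketch, not a tested runbook: SG1 is the failover group named later in this thread, `node1` is a hypothetical system name, and the `run` helper with its `DRY_RUN` switch is an illustrative wrapper (not part of VCS) that only prints the commands unless `DRY_RUN=0` is set on a cluster node.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a cluster node to really run them.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper (not a VCS command): print or execute a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_group GROUP SYSTEM
# Check the group state, clear resource faults on SYSTEM, then online the group there.
recover_group() {
    group="$1"
    sys="$2"
    run hagrp -state "$group"               # look for FAULTED in the output
    run hagrp -clear "$group" -sys "$sys"   # clear faulted resources in the group on that system
    run hagrp -online "$group" -sys "$sys"  # bring the group online manually
}

# node1 is a hypothetical system name; substitute your own.
recover_group SG1 node1
```

Only clear the faults once the underlying problem (here, the network) has actually been fixed; clearing and onlining a group whose resources are still broken will just fault it again.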

  • symsonu,

    The following documentation provides a detailed explanation of the restart behaviour / explains the attributes that can affect/control this behaviour:

    Veritas Cluster Server 6.0.1 (Solaris) Administrator's Guide -> VCS communication and operations -> Controlling VCS behavior -> About controlling VCS behavior at the service group level -> About the AutoRestart attribute
    https://sort.symantec.com/public/documents/sfha/6.0.1/solaris/productguides/html/vcs_admin/ch11s02s01.htm
    --------------------
    About the AutoRestart attribute

    If a persistent resource on a service group (GROUP_1) faults, VCS fails the service group over to another system if the following conditions are met:

    • The AutoFailOver attribute is set.
    • Another system in the cluster exists to which GROUP_1 can fail over.

    If neither of these conditions is met, GROUP_1 remains offline and faulted, even after the faulted resource becomes online.

    Setting the AutoRestart attribute enables a service group to be brought back online without manual intervention. If no failover targets are available, setting the AutoRestart attribute enables VCS to bring the group back online on the first available system after the group's faulted resource came online on that system.

    For example, NIC is a persistent resource. In some cases, when a system boots and VCS starts, VCS probes all resources on the system. When VCS probes the NIC resource, the resource may not be online because the networking is not up and fully operational. In such situations, VCS marks the NIC resource as faulted, and does not bring the service group online. However, when the NIC resource becomes online and if AutoRestart is enabled, the service group is brought online.
    --------------------

    Also: VCS behavior when persistent resources transition from faulted to online ( https://sort.symantec.com/public/documents/sfha/6.0.1/solaris/productguides/html/vcs_admin/ch11s02s13s02.htm )

    --------------------
    VCS behavior when persistent resources transition from faulted to online

    The AutoRestart attribute determines the VCS behavior in the following scenarios:

    • A service group cannot be automatically started because of a faulted persistent resource
    • A service group is unable to failover

    Later, when a persistent resource transitions from FAULTED to ONLINE, the VCS engine attempts to bring the service group online if the AutoRestart attribute is set to 1 or 2.

    If AutoRestart is set to 1, the VCS engine restarts the service group. If AutoRestart is set to 2, the VCS engine clears the faults on all faulted non-persistent resources in the service group before it restarts the service group on the same system.
    --------------------

    Looking at Appendixes -> VCS Attributes -> Service Group Attributes ( https://sort.symantec.com/public/documents/sfha/6.0.1/solaris/productguides/html/vcs_admin/apds04.htm )

    --------------------
    • AutoFailOver - Indicates whether VCS initiates an automatic failover if the service group faults.
    [...]
    Default: 1 (enabled)

    • AutoRestart - Restarts a service group after a faulted persistent resource becomes online.
    [...]
    Default: 1 (enabled)
    --------------------

    So if the service group attributes are set to the defaults above, as the multinicb & its proxies would be persistent, the sg would be restarted - as Mike mentions, manual intervention would be required to clear faults on any non-persistent resources.
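
To see what applies to a given cluster, the attribute values and resource categories discussed above can be queried from the command line. The following is a minimal sketch under assumptions: SG1 is the group name used in this thread, and the `run` helper with its `DRY_RUN` switch is an illustrative wrapper (not part of VCS) that only prints the commands unless `DRY_RUN=0` is set on a cluster node.

```shell
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper (not a VCS command): print or execute a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# show_restart_settings GROUP
# Display the two service group attributes quoted from the guide above.
show_restart_settings() {
    group="$1"
    for attr in AutoFailOver AutoRestart; do
        run hagrp -display "$group" -attribute "$attr"
    done
}

# show_operations TYPE...
# A resource type whose Operations value is None is persistent.
show_operations() {
    for type in "$@"; do
        run hatype -display "$type" -attribute Operations
    done
}

show_restart_settings SG1
show_operations MultiNICB Proxy
```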

  • Thanks a ton, Lee, for such a detailed explanation.

    Our setup, for understanding this scenario:

    ==========================================================

    I have a parallel service group (named publan) containing MultiNICB and Phantom resources.

    In my failover service group (named SG1), I use a Proxy resource whose TargetResName is the name of the NIC resource in the parallel service group.
    ==========================================================

    Now, as I understand it, the AutoRestart attribute applies only to service groups containing persistent resources.

    So, will AutoRestart apply only to publan and not to SG1?

    Or will it apply to both, as SG1 also contains a persistent resource, i.e. the proxy?

  • As you've said, SG1 has a proxy. A proxy is a persistent resource as it cannot be onlined/offlined (VCS can monitor status only); thus, AutoRestart would also apply to SG1.

    see VCS Administrator's Guide -> Introducing Veritas Cluster Server -> Logical components of VCS -> Categories of resources

    https://sort.symantec.com/public/documents/sfha/6.0.1/solaris/productguides/html/vcs_admin/ch01s04s02.htm

    ----------
    Persistent
    These resources cannot be brought online or taken offline. For example, a network interface card cannot be started or stopped, but it is required to configure an IP address. A Persistent resource has an operation value of None. VCS monitors Persistent resources to ensure their status and operation. Failure of a Persistent resource triggers a service group failover.
    ----------

    # hatype -display Proxy -attribute Operations
    #Type        Attribute              Value
    Proxy        Operations             None

  • Thank you for clearing up my doubt.

    Now, continuing: say the MultiNICB resources faulted on both nodes due to a network outage, and in turn our parallel service group (publan) faulted on both nodes.

    This causes the proxy and the other dependent resources to fail, so SG1 also faults.

    ===============================================

    After some time, the network is back.

    Now the MultiNICB resources will come back online by themselves, and as we have AutoRestart set to 1 for publan, publan will also restart automatically and come online, as it does not contain any non-persistent resources.

    Now comes SG1:

    Here the proxies have come back online, as the NIC came online, and AutoRestart is 1, so the service group restarts itself.

    As per your lines from previous post

    =======================================================================

    So if the service group attributes are set to the defaults above, as the multinicb & its proxies would be persistent, the sg would be restarted - as Mike mentions, manual intervention would be required to clear faults on any non-persistent resources.

    ===========================================================================

    Now, as per your comment, we need to clear only the non-persistent resources.

    One confusion: if the service group is restarting as a whole due to AutoRestart 1, then why are we acting at the resource level?

    This might be a silly question, or I may not have understood the "restart of service group".

    I appreciate your patience in handling my queries, silly as they might be :)

  • If a (non-persistent) resource is faulted on a system, it cannot go online on that system until the fault is cleared.

    (edited to add note: The non-persistent resources may have faulted due to the network fault / due to fault propagation. There may also be entirely separate reasons for those resources being faulted. This is dependent on your configuration, what / how resources are configured, so you need to understand the implications for your cluster)

    So if you try to restart the service group on that system while these resources are still faulted, the online will not complete/succeed, as it will not be able to online all the resources in the group.

    "One confusion: if the service group is restarting as a whole due to AutoRestart 1, then why are we acting at the resource level?"

    You can clear all resource faults in a group with hagrp -clear (as Mike mentioned), or you can clear the fault on individual resources. Either way, the resource fault still needs to be cleared before the group can go online, i.e. if you want the group to be restarted/brought online on the same system, any faults need to be cleared first.

    Hint: (quoted from AutoRestart info above, bold added for emphasis)

    If AutoRestart is set to 1, the VCS engine restarts the service group. If AutoRestart is set to 2, the VCS engine clears the faults on all faulted non-persistent resources in the service group before it restarts the service group on the same system.

    Further reading:

    About administering service groups -> Clearing faulted resources in a service group

    https://sort.symantec.com/public/documents/sfha/6.0.1/solaris/productguides/html/vcs_admin/ch06s12s08.htm

    VCS behavior on resource faults:

    https://sort.symantec.com/public/documents/sfha/6.0.1/solaris/productguides/html/vcs_admin/ch11s01.htm
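
The difference between the AutoRestart values can be summarised as a small decision table. The function below is only a toy model of the behaviour quoted above, written in plain shell; it does not call VCS, and the function name and message wording are made up for illustration. It takes the AutoRestart value and the number of faulted non-persistent resources in the group at the moment the persistent resource transitions from FAULTED to ONLINE.

```shell
# autorestart_outcome AUTORESTART FAULTED_NONPERSISTENT_COUNT
# Toy model (not VCS code) of what the engine does when a persistent
# resource goes from FAULTED to ONLINE, per the documentation quoted above.
autorestart_outcome() {
    autorestart="$1"
    faulted="$2"
    case "$autorestart" in
        0)
            echo "AutoRestart=0: no automatic restart attempt"
            ;;
        1)
            if [ "$faulted" -gt 0 ]; then
                # Faulted non-persistent resources cannot go online until cleared.
                echo "AutoRestart=1: restart attempted, but $faulted faulted non-persistent resource(s) block the online until cleared"
            else
                echo "AutoRestart=1: restart attempted, nothing to clear"
            fi
            ;;
        2)
            echo "AutoRestart=2: $faulted non-persistent fault(s) cleared by the engine, then restart attempted"
            ;;
        *)
            echo "unknown AutoRestart value: $autorestart" >&2
            return 1
            ;;
    esac
}

autorestart_outcome 1 2
autorestart_outcome 2 2
```

With AutoRestart=1 the clearing step is left to the administrator (hagrp -clear or a per-resource clear); with AutoRestart=2 the engine does it before restarting the group on the same system.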

  • Then wouldn't it be good to set the AutoRestart value to 2 for service group SG1, so that the other non-persistent resources in the service group also get cleared?

  • ... and this is where it comes back to: it depends on your configuration/environment.

    Resources are marked as faulted, so that they will not come online on the same system without manual intervention/being cleared, in order to protect you (i.e. the rationale being that if the resource has faulted, it might be an idea for someone to take a look at it before trying to online it there again).

    If the only fault is with the network, the non-persistent resources may not fault themselves (this depends on the type/nature of the resource), though they may not be usable while the network is down. So if that was the case (no other resources faulted apart from the network), then AutoRestart=1 would be fine.

    Possible considerations before setting AutoRestart=2:

    • The default is 1 for a reason, e.g. what if you had resources faulted from a previous/earlier issue that was still being fixed? If the fault was auto-cleared, it could bring the resource up on the problem system prematurely/before it was ready.

    • If you do want to go ahead with this setting, you really need to test it in your environment (or at least understand the behaviour/implications) to make sure it behaves the way you expect/need it to in the event of a fault/failure.
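
If, after testing, AutoRestart=2 does turn out to suit the environment, changing it might look roughly like the sketch below. Assumptions: SG1 is the group name from this thread, and the `run` helper with its `DRY_RUN` switch is an illustrative wrapper (not part of VCS) that only prints the commands unless `DRY_RUN=0` is set on a cluster node. Try this on a test cluster first.

```shell
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper (not a VCS command): print or execute a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# set_autorestart GROUP VALUE
# Open the configuration, change AutoRestart, save and close it, then verify.
set_autorestart() {
    group="$1"
    value="$2"
    case "$value" in
        0|1|2) ;;
        *) echo "AutoRestart must be 0, 1 or 2" >&2; return 1 ;;
    esac
    run haconf -makerw                                  # make the configuration writable
    run hagrp -modify "$group" AutoRestart "$value"     # change the group attribute
    run haconf -dump -makero                            # save and set back to read-only
    run hagrp -display "$group" -attribute AutoRestart  # confirm the new value
}

set_autorestart SG1 2
```

Setting the value back to 1 with the same function restores the default behaviour described above.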

  • Thanks a lot, Lee.

    You have made my understanding pretty solid now.

    Much appreciated.

    We are all using Symantec products like VxVM, VCS, SFS and NetBackup because of you guys.

    Many thanks once again.