Business Continuity

Many of us who have worked with NetBackup for a long time eventually need to work on a NetBackup installation that is clustered with VCS. Not everyone who knows NetBackup necessarily knows VCS, so here is a short overview of VCS and how it works with NetBackup.


CLUSTERS:

A computer cluster is a group of linked computers working together so closely that in many respects they form a single computer. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer. There are many types of clusters, such as HA clusters and load-balancing clusters.

High-availability clusters (also known as failover clusters, and the most common type for NetBackup) are implemented primarily to improve the availability of the services the cluster provides. They operate by having redundant nodes, which are used to provide service when system components fail. The most common size for an HA cluster is two nodes, which is the minimum required to provide redundancy. HA cluster implementations use redundancy of cluster components to eliminate single points of failure.

For NetBackup, you would usually have a two-node cluster: an active node and a failover node.

Symantec's cluster software for High-Availability is VCS.


HOW IT WORKS:

Consider a server that is installed and configured as a NetBackup master server. Assume that a disaster happens and the host is unavailable. All the services are unavailable and backups stop running until corrective action is taken to bring the host back up (possibly a complete DR). To avoid this, an identical host is configured with exactly the same installation and configuration of NetBackup, and the two hosts are configured as an HA cluster. Both nodes must run the same version of NetBackup and have the same LUN assigned from the SAN. The NetBackup databases (image DB, volDB and mediaDB for 5.x; image DB and EMM for 6.x) reside on this shared LUN. Binaries are installed on each node separately. One node runs normally with NetBackup and VCS on it; this is termed the "Active" node. The host that is not running NetBackup is the "Passive" or "Fail-over" node.

In this setup, VCS monitors the NetBackup application on the active node at all times. If NetBackup becomes unavailable, VCS detects the failure, gracefully stops everything, unmounts the shared volume from the active node, mounts it on the passive node and starts NetBackup there. The failed node can then be worked on for recovery, and backups are interrupted for just a few minutes.
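The failover sequence described above can be sketched as a small Python model. This is only an illustration of the order of operations, not how VCS is implemented; the Node class, the fail_over function and the node names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical model of one cluster node."""
    name: str
    mounted: bool = False      # is the shared volume mounted here?
    nbu_running: bool = False  # is NetBackup running here?


def fail_over(active: Node, passive: Node) -> List[str]:
    """Move NetBackup from the active node to the passive node.

    Returns the actions in the order they are taken: stop the
    application, unmount the shared volume, mount it on the other
    node, start the application there.
    """
    steps = []
    active.nbu_running = False
    steps.append(f"stop NetBackup on {active.name}")
    active.mounted = False
    steps.append(f"unmount shared volume on {active.name}")
    passive.mounted = True
    steps.append(f"mount shared volume on {passive.name}")
    passive.nbu_running = True
    steps.append(f"start NetBackup on {passive.name}")
    return steps


if __name__ == "__main__":
    for step in fail_over(Node("nodeA", True, True), Node("nodeB")):
        print(step)
```

The point of the ordering is that the shared volume is never mounted on both nodes at once.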

For NetBackup to work in a cluster, the following criteria need to be met:
- Shared storage between hosts.
- At least 3 NICs on each host.
- Identical hardware.
- Same OS and NetBackup version.
- VCS installed and configured.


VCS:

VCS (Veritas Cluster Server) is Symantec's solution for high availability. It works on SLES, RHEL, Solaris and Windows. It is responsible for startup, shutdown, monitoring and failover of the applications configured in it. For an application to be configured for failover, VCS must know how to start the application, monitor it and shut it down. A user can define the logic by which VCS handles the applications.

Terminology:

Heartbeat: Heartbeats are a communication mechanism for nodes to exchange information about hardware and software status, keep track of cluster membership, and keep this information synchronized across all cluster nodes. It is recommended to have at least two heartbeats.

Resource: A resource is an entity that may be brought online, taken offline, or monitored on a particular system. Each resource is of a resource type. Examples of resource types are mount points, IP addresses, etc.

There are three categories of VCS resources: on-off, on-only, and persistent.
- An on-off resource is one that VCS can fully control (bring online and take offline).
- An on-only resource is one that VCS can start but not shut down.
- A persistent resource is one that VCS only monitors and cannot control (for example, a NIC).
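One way to picture the three categories is as a table of which operations are permitted for each. The following Python sketch is only a reading aid; the names Category, ALLOWED and can are made up for this example and are not part of VCS.

```python
from enum import Enum


class Category(Enum):
    ON_OFF = "on-off"
    ON_ONLY = "on-only"
    PERSISTENT = "persistent"


# Operations the cluster may perform on a resource of each category
# (illustrative table, derived from the descriptions above).
ALLOWED = {
    Category.ON_OFF: {"online", "offline", "monitor"},
    Category.ON_ONLY: {"online", "monitor"},
    Category.PERSISTENT: {"monitor"},
}


def can(category: Category, operation: str) -> bool:
    """Return True if the operation is permitted for this category."""
    return operation in ALLOWED[category]


if __name__ == "__main__":
    for cat in Category:
        print(cat.value, sorted(ALLOWED[cat]))
```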

Resource agent: Every resource has an associated agent. The agent is responsible for the actions performed on the resource, such as online, offline and monitor.

Service group: A service group is a logical collection of resources. These resources are taken online and offline together. Service groups come in two varieties: failover and parallel. The service group for NetBackup is a failover group.

Dependency: A dependency relationship tells the cluster in what order to bring resource entities online and offline. In each resource dependency relationship there is a parent and a child. A parent resource will not be brought online until all of its children are online.
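To make the parent/child rule concrete, here is a minimal Python sketch that derives an online order (children first) and an offline order (the reverse) from a dependency table, using the standard graphlib module (Python 3.9+). The resource names come from the list created by cluster_config later in this article, but the exact dependency tree shown here is an assumption for illustration; check main.cf for the real one.

```python
from graphlib import TopologicalSorter

# parent -> set of children that must be online first.
# Hypothetical tree; the real dependencies are defined in main.cf.
REQUIRES = {
    "NETBACKUP": {"IP", "MOUNT"},
    "IP": {"NIC"},
    "MOUNT": {"VOL"},
    "VOL": {"DG"},
}


def online_order(requires):
    """Children come before their parents."""
    return list(TopologicalSorter(requires).static_order())


def offline_order(requires):
    """Parents go offline before their children."""
    return list(reversed(online_order(requires)))


if __name__ == "__main__":
    print("online: ", online_order(REQUIRES))
    print("offline:", offline_order(REQUIRES))
```

With this table, NETBACKUP is always last to come online and first to go offline, because every other resource sits below it.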

Split brain: Split brain occurs when two or more systems within the cluster think they have exclusive access to a shared resource at the same time. This can be very damaging because data corruption is common in this situation.

Jeopardy: A system is in jeopardy when only one of its heartbeat connections is still functioning. If the remaining heartbeat link is also lost, VCS cannot tell whether the host has crashed or only the last heartbeat network has failed.
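The jeopardy rule can be written down as a tiny decision function. This Python sketch is only a model of the definition above; the function name and the state labels "regular" and "unknown" are chosen for this example and are not VCS terminology.

```python
def membership_state(links_up: int) -> str:
    """Classify a node by the number of working heartbeat links."""
    if links_up < 0:
        raise ValueError("links_up cannot be negative")
    if links_up >= 2:
        return "regular"   # redundant heartbeats are available
    if links_up == 1:
        return "jeopardy"  # one more link failure leaves the peer blind
    # With no heartbeat left, the peer cannot distinguish a crashed
    # host from a failed network, which is what can lead to split brain.
    return "unknown"


if __name__ == "__main__":
    for n in (2, 1, 0):
        print(n, membership_state(n))
```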

VCS can be divided into two important parts:
  1. Cluster Communication:
    Low Latency Transport (LLT) and Global Atomic Broadcast (GAB) are responsible for heartbeat and cluster communication. These are kernel modules and are installed with VCS. LLT provides fast, high-priority internal cluster communication. LLT does not use TCP/IP; it is a separate communication protocol. GAB runs over LLT and is primarily responsible for cluster membership. So LLT on each node handles the communication and GAB on each node maintains the cluster membership.

    LLT -

    Configuration files:
    /etc/llttab
    /etc/llthosts

    Commands:
    lltstat
    lltconfig

    GAB -
    Configuration file:
    /etc/gabtab
    Command:
    gabconfig
     
  2. HAD:
    HAD stands for High Availability Daemon, also known as the VCS engine. This is the heart of VCS. HAD is responsible for all the cluster functionality: it talks to all the agents and holds the configuration and logic in memory. There is another process called hashadow, whose primary job is to monitor HAD.

    Configuration files for HAD:
    /etc/VRTSvcs/conf/config/main.cf
    /etc/VRTSvcs/conf/config/types.cf

    Commands for HAD:
    /opt/VRTS/bin/hastop (stops HAD)
    /opt/VRTS/bin/hastart (start HAD)
    /opt/VRTS/bin/hastatus (monitor HAD status)
    /opt/VRTS/bin/hagrp (monitor/manage Service group)
    /opt/VRTS/bin/hares (monitor/manage Resources)
    /opt/VRTS/bin/hacf -verify /etc/VRTSvcs/conf/config (checks main.cf for syntax issues)

NOTE: There is a dependency between LLT, GAB and HAD. At system startup, LLT starts first, then GAB and then HAD. HAD will not run without GAB, and GAB will not run without LLT.

VCS startup:

  1. LLT starts. It reads /etc/llttab and /etc/llthosts.
  2. GAB starts (it executes /etc/gabtab). It looks for GAB on the other nodes to establish cluster membership.
  3. Once GAB is loaded, hashadow starts, which loads HAD.
  4. HAD reads /etc/VRTSvcs/conf/config/main.cf and all the .cf files included from main.cf.
  5. HAD checks whether other HADs are available. It registers itself with GAB.
  6. If there are no other HADs, it loads the configuration from main.cf into HAD memory.
  7. The same process happens when HAD starts on the other nodes. The HAD on the first node loads main.cf and the other .cf files from the local system (called a "local build") and all other HADs load the configuration from the first HAD (called a "remote build").
  8. After starting up, HAD knows all the service groups and resources from main.cf. It calls the respective agents to check whether the resources are currently online or offline.
  9. Based on main.cf, HAD brings the service groups online or offline on the respective nodes.
  10. Check that all the service groups are running with the command hastatus -sum.
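The "local build" versus "remote build" decision in steps 5 to 7 can be modelled in a few lines of Python. This is a simplified sketch of the rule as described above, not HAD's actual logic; the function names and node names are hypothetical.

```python
from typing import Dict, List


def build_mode(running_hads: List[str]) -> str:
    """Decide where a starting HAD gets its configuration from."""
    # No other HAD is running: read main.cf from local disk.
    # Otherwise: copy the in-memory configuration from a running HAD.
    return "remote build" if running_hads else "local build"


def start_cluster(nodes: List[str]) -> Dict[str, str]:
    """Start HAD on each node in turn and record its build mode."""
    running: List[str] = []
    modes: Dict[str, str] = {}
    for node in nodes:
        modes[node] = build_mode(running)
        running.append(node)
    return modes


if __name__ == "__main__":
    print(start_cluster(["nodeA", "nodeB"]))
```

Only the first node to start reads main.cf from disk, which is why the copy of main.cf on that node is the one that matters at cluster startup.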

Important actions that can be taken by an admin while working on VCS:

VCS -
Start: Follow steps above.
Stop: Stop the HAD, unload GAB and then unload LLT.

Service Groups -
Online: Manually bring a specific service group online on a specific node or all nodes.
Offline: Manually bring a service group offline on a specific node or all nodes.
Freeze: If NetBackup has problems, you might want to stop and start it a couple of times. It is necessary to freeze the service group while you do that. By freezing the service group, we tell VCS not to take any action on it.
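The effect of a freeze can be pictured with a small Python model: while the group is frozen, a stopped application does not trigger a failover. This is only an illustration of the behaviour described above; the ServiceGroup class and react_to_fault function are hypothetical, and nbu_group is the usual group name mentioned later in this article.

```python
from dataclasses import dataclass


@dataclass
class ServiceGroup:
    """Hypothetical model of a service group."""
    name: str
    frozen: bool = False


def react_to_fault(group: ServiceGroup) -> str:
    """What the cluster does when it sees the application down."""
    if group.frozen:
        return "no action"  # admin is working on the application
    return "fail over"


if __name__ == "__main__":
    grp = ServiceGroup("nbu_group")
    grp.frozen = True        # freeze before manual stop/start
    print(react_to_fault(grp))
    grp.frozen = False       # unfreeze when maintenance is done
    print(react_to_fault(grp))
```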

Resource -
Online: Manually online a resource
Offline: Manually offline a resource
Probe: Ask the resource agent to probe for the resource and get its current status.


NetBackup in VCS:
Install NetBackup on the nodes the way you normally would. When the NetBackup installation wizard asks for the EMM server name and master server name, give the "virtual name" on both nodes. Note that at this point nothing goes on the shared LUN.

Once the installation is done, run the following script:

/usr/openv/netbackup/bin/cluster/cluster_config

This script prompts for all the information it needs and does the following:
- Creates an agent "NetBackup" and its cf file at /usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf.
- Creates the service group (usually nbu_group).
- Creates the resources (NIC, IP, DG, VOL, MOUNT and NETBACKUP).
- Moves the databases to the shared location.
- Creates the file /usr/openv/netbackup/bin/cluster/NBU_RSP, which holds information about the cluster configuration.

The good part about the cluster_config script is that if anything fails, it undoes everything it has done, so the next time you run the script it won't create any duplicates in the configuration.

Basic Tasks:
Create service group (hagrp -add)
Modify service group (hagrp -modify)
Delete service group (hagrp -delete)
Add resource(s) to a service group (hares -add)
Modify resources (hares -modify)
Delete resources (hares -delete)
Monitor the cluster (hastatus)
Switch over service group from one node to other (hagrp -switch)


Config files:
/etc/VRTSvcs/conf/config/main.cf
/etc/VRTSvcs/conf/config/types.cf
/usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf
/usr/openv/netbackup/bin/cluster/NBU_RSP

Logs:
System log
/var/VRTSvcs/logs/engine_A.log
/usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log


I hope you enjoyed reading through it and hope it helps you in your day to day work.
Comments
 Scorpy, it was good and informative. We are running MSCS for our master server, and a few times I have seen running backups fail at the time of switching from node one to node two.

Do you have any comparison of VCS vs MSCS?
hi Taqadus,

I wish I could give you that information but I have never worked on MSCS.
 Bro, 

I am looking for such a comparison because, as I said in my earlier post, I observed a few incidents while switching the cluster. I want to know if this really happens with NetBackup when using VCS.

Can you provide me VCS docs / web refs etc.?
Hi,

If NetBackup is configured "by the book", I have never seen issues in a failover.

As far as VCS goes, I always refer to the VCS admin guide.

Here are the links for VCS on windows:

Storage Foundation and High Availability

http://www.symantec.com/business/support/documentation.jsp?language=english&view=manuals&pid=15227&version=SFFWPVER20946

Cluster Server

http://www.symantec.com/business/support/documentation.jsp?language=english&view=manuals&pid=51064&version=VCSWINNETAPPPVER24508


Links for VCS on Solaris:

Storage Foundation and High Availability

http://www.symantec.com/business/support/documentation.jsp?language=english&view=manuals&pid=15107&version=FOUNDSUITEPVER21694

Cluster Server

http://www.symantec.com/business/support/documentation.jsp?language=english&view=manuals&pid=15066&version=CLUSTERSERVERPVER22108


Also, the NetBackup High Availability guide will be very handy:

ftp://exftpp.symantec.com/pub/support/products/NetBackup_Enterprise_Server/290237.pdf


Here is a technote that lists other cluster related technotes:
http://seer.entsupport.symantec.com/docs/284550.htm

Hope these help you.

One of our best technical product managers wrote an excellent whitepaper on configuring NetBackup for HA. You should read this one as part of your planning efforts.

This is on www.symantec.com/netbackup under Whitepapers

Implementing Highly Available Data Protection with Veritas NetBackup
Outlines design considerations for effective disaster recovery with Veritas NetBackup 6.5. Includes good, better, and best deployment scenarios.

Peter
 Thank you guys. Let me have look on these links and will get back to you guys in case of any further query.
Just for your information, when you fail over most applications, especially applications which don't involve network timeouts and are somewhat connection oriented, you will experience logouts, client disconnects, and - in the case of NetBackup - failed backup jobs.

I don't think VCS will solve that particular problem for you... when you switch a service group from one machine to another it involves shutting down the application.

Think of backup jobs that are live, and how they are affected by running commands to stop and then restart NetBackup. They do not, and cannot, continue where they left off... you will have failed jobs that will be, or may need to be, restarted.

One thing you can do is configure the number of job retries (in NetBackup) to be higher in this environment, to allow for more automated restarts.

The only applications in which downtime is negligible or unseen are ones with timeouts (like NFS and file shares) or applications that run in parallel on multiple machines (e.g. Oracle RAC).

My problem with clustering NetBackup is that only new installations can be clustered (done that successfully on a number of occasions).
Existing stand-alone master cannot be clustered with the method described in the HA guide. The script that is used to configure the cluster will simply discard the current EMM database as well as netbackup/db folder and recreate them (empty) on the shared drive. The old non-clustered emm database cannot be restored.

We have contacted Symantec consulting for information on how to cluster existing master servers. The answer was that there is currently only one trained consultant who has access to a special tool to accomplish this!!

This is a serious shortcoming as most other applications can be converted from non-clustered to clustered. I have personally been involved in doing this with Oracle (on Solaris) as well as SQL and Exchange (Windows).
this how-to looks familiar.  ;-)

scorpy, do you work for Symantec; where did you get the info?
nice work. its a nice foundation document

I am building a NetBackup master server cluster. I have the cluster just about complete with everything except the NetBackup resource. Before, there was a package to install to add the service group. Now there is a "vcs_nbu_config" script. However, this script is trying to do way too much for me. Any help would be highly appreciated.
sourcing
Read the NetBackup High Availability Admin Guide.
It will tell you all the pre-req's to check, before running the cluster_config script. This script will move catalogs to the share,  config the EMM database as a clustered master, copy/link the agent to VCS and create the Service Group.
nice work!
I wish I could just understand the things that I do! In the meantime, have to learn from such as you! Thanks a lot!
Frenki

 

Ok, very good technote.
This technote is a good overview, but I want to know: is there some way to configure a load-balancing cluster for NetBackup?
Thanks,

hello all;

thanks a lot for the information.  technical whitepaper is really excellent.

I have installed nb 7.0 on a clustered environment.

however, netbackup resource could not be online.

I have followed the NB HA admin. guide.

The installation completed successfully. However,

the NetBackup resource could not be brought online. There is no clue in the VCS/NB/system logs.

What do you suggest?

We built a Linux master cluster for NBU 7.1, but we never ran the cluster_config script separately and everything is running fine.
Can someone tell me whether it takes all its entries from the NBU master installation when it asks "if server is going to be a cluster master",
or do we have to run this script separately?

cluster_config was done up to NBU 6.5. Since NBU 7.0, NBU installation sees that you are installing NBU in a clustered environment and will prompt you for cluster configuration info.

See NetBackup Installation Guide for UNIX and Linux  http://www.symantec.com/docs/DOC3647

and

NetBackup in Highly Available Environments Administrator's Guide http://www.symantec.com/docs/DOC3678

PLEASE check above-mentioned guides carefully for requirements (such as rsh).

Hi all ,

I'm currently running

NBU 7.1.0.2 HA on linux cluster with local (within site , local cluster) and global failover (at a secondary site)

After months of the setup being frozen to one master, today we tried to have the cluster fail over from the active cluster to the passive cluster and things went terribly wrong. As such, we have logged a case with Symantec support.

However that support case aside - I was just wondering if anyone has a pre-checklist that I should get my guys to do before we attempt another failover to the secondary node ?

We managed to fix the issues by failing back to the previously active node.

One piece of feedback I got was that bp.conf was different on the two nodes, and this caused some issues, especially on the SAN media server side.

Also, the Java console seemed to have frozen.

Plus, almost 50% of the library and drives went down into mixed mode (we have a pretty large environment with 8 media servers and over 36 drives, which reside in at least 8 separate libraries).

Would really appreciate if anyone has a checklist/cheatlist so that our next attempt would be more reassuring.....

 

Thanks!

Warmest Regards,

Kevin