Knowledge Base Article

NetBackup and VCS

Many of us who have worked with NetBackup for a long time eventually have to work on a NetBackup installation that is configured with VCS. Not everyone who knows NetBackup necessarily knows VCS, so here is a short overview of VCS and how it works with NetBackup.

CLUSTERS:

A computer cluster is a group of linked computers that work together so closely that in many respects they form a single computer. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer. There are several types of clusters, such as HA clusters and load-balancing clusters.

High-availability clusters (also known as failover clusters, and the most common type for NetBackup) are implemented primarily to improve the availability of the services that the cluster provides. They operate by having redundant nodes, which are then used to provide service when system components fail. The most common size for an HA cluster is two nodes, which is the minimum required to provide redundancy. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure.

For NetBackup, you would usually have a two-node cluster: an active node and a failover node.

Symantec's cluster software for High-Availability is VCS.

HOW IT WORKS:

Consider a server that is installed and configured as a NetBackup master server. Assume that a disaster occurs and the host becomes unavailable. All services are down and backups stop running until corrective action is taken to bring the host back up (possibly a complete DR). To avoid this, a second, identical host is configured with exactly the same NetBackup installation and configuration, and the two hosts are configured as an HA cluster. Both nodes must run the same version of NetBackup and have the same LUN assigned from the SAN. The NetBackup databases (image DB, volDB and mediaDB for 5.x; image DB and EMM for 6.x) reside on this shared LUN. The binaries are installed on each node separately. One node runs NetBackup under VCS control; this is termed the "Active" node. The host that is not running NetBackup is the "Passive" or "Fail-over" node.

In this setup, VCS monitors the NetBackup application on the active node at all times. If NetBackup becomes unavailable, VCS detects the failure, gracefully stops everything, unmounts the shared volume from the active node, mounts it on the passive node and starts NetBackup there. The failed node can then be worked on for disaster recovery, and backups are interrupted for just a few minutes.

For NetBackup to work in a cluster, the following criteria need to be met:
- Shared storage between the hosts.
- At least 3 NICs on each host (typically one public interface plus two for heartbeats).
- Identical hardware.
- Same OS and NetBackup version.
- VCS installed and configured.

VCS:

VCS (Veritas Cluster Server) is Symantec's solution for high availability. It runs on SLES, RHEL, Solaris and Windows. It is responsible for the startup, shutdown, monitoring and failover of the applications configured in it. For an application to be configured for failover in VCS, VCS must know how to start the application, how to monitor it and how to shut it down. A user can define the logic by which VCS handles the applications.

Terminology:

Heartbeat: Heartbeats are a communication mechanism for nodes to exchange information concerning hardware and software status, keep track of cluster membership, and keep this information synchronized across all cluster nodes. It is recommended to have at least two heartbeat links.

Resource: A resource is an entity that may be brought online, taken offline, or monitored on a particular system. Each resource is of a resource type. Examples of resource types include mount points and IP addresses.

There are three categories of VCS resources: on-off, on-only, and persistent.
- On-off means VCS can fully control the resource (start it and stop it).
- On-only is a resource that VCS can start but not shut down.
- Persistent is a resource that VCS only monitors and cannot control (for example, a NIC).

Resource agent: Every resource type has an associated agent. The agent is responsible for the actions on resources of that type, such as online, offline and monitor.

Service group: A service group is a logical collection of resources. These resources are taken online and offline together. Service groups come in two varieties: failover and parallel. The service group for NetBackup will be a failover service group.

Dependency: A dependency relationship tells the cluster in what order to bring resource entities online and offline. In each resource dependency relationship there is a parent and a child. A parent resource will not be brought online until all of its children are online.
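
As a rough illustration of dependencies, the sketch below links a chain of resources so that each parent waits for its child. The resource names in the usage comment are hypothetical, the configuration is assumed to be writable already (haconf -makerw), and the VCS commands are assumed to be on the PATH. hares -link takes the parent first and the child second.

```shell
# Link a chain of resources, top-most parent first.
# Usage: link_chain parent child grandchild ...
link_chain() {
    parent="$1"
    shift
    for child in "$@"; do
        # hares -link <parent> <child>: the parent comes online only after the child
        hares -link "$parent" "$child" || return 1
        parent="$child"
    done
}

# Example (hypothetical resource names):
#   link_chain nbu_server nbu_mount nbu_vol nbu_dg
#   hares -dep nbu_server     # list the resulting dependencies
```

With a chain like this, the disk group would come online first and the NetBackup resource last; offlining runs in the opposite order.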

Split brain: Split brain occurs when two or more systems within the cluster think they have exclusive access to a shared resource at the same time. This can be very damaging because data corruption is common in this situation.

Jeopardy: A system is in jeopardy when only one of its heartbeat connections is still functioning. If the remaining heartbeat link is also lost, VCS cannot tell whether the host has crashed or only the last heartbeat network has failed.
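
A minimal sketch of how jeopardy could be spotted from the command line, assuming that "gabconfig -a" prints a line containing the word "jeopardy" for affected ports, as it does in typical VCS output. The function name and messages are illustrative only.

```shell
# Report whether GAB currently lists any port in jeopardy.
# Assumption: "gabconfig -a" marks affected ports with the word "jeopardy".
check_jeopardy() {
    if gabconfig -a | grep -qi "jeopardy"; then
        echo "WARNING: cluster is in jeopardy; check heartbeat links with lltstat -nvv"
        return 1
    fi
    echo "OK: no jeopardy membership reported"
}
```

A check like this is useful before planned maintenance on a heartbeat network, since losing the last link is what leads to the split-brain risk described above.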

VCS can be divided into two important parts: cluster communication and HAD.

Cluster Communication:
Low Latency Transport (LLT) and Group Membership Services/Atomic Broadcast (GAB) are responsible for heartbeats and cluster communication. Both are kernel modules and are installed with VCS. LLT provides fast, high-priority internal cluster communication. LLT does not run over TCP/IP; it is a separate communication protocol. GAB runs over LLT and is primarily responsible for cluster membership. So LLT on each node handles the communication, and GAB on each node maintains the cluster membership.

LLT -

Configuration files:
/etc/llttab
/etc/llthosts

Commands:
lltstat
lltconfig

GAB -

Configuration file:
/etc/gabtab

Command:
gabconfig
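
To give a feel for what these files contain, the sketch below writes illustrative copies into a scratch directory (not /etc). The node names, cluster ID and interface names are hypothetical, and the exact "link" syntax in llttab varies by platform and VCS version, so treat this as a rough shape rather than a template.

```shell
# Write illustrative LLT/GAB files into a scratch directory (NOT /etc).
# Node names, cluster ID and interface names below are hypothetical.
write_sample_cluster_files() {
    dir="$1"
    mkdir -p "$dir" || return 1

    # llttab: this node's name, the cluster ID, and two heartbeat links
    cat > "$dir/llttab" <<'EOF'
set-node nodeA
set-cluster 10
link eth1 eth1 - ether - -
link eth2 eth2 - ether - -
EOF

    # llthosts: LLT node ID to host name mapping, identical on every node
    cat > "$dir/llthosts" <<'EOF'
0 nodeA
1 nodeB
EOF

    # gabtab: the command run at startup; -n2 waits for two nodes to seed
    cat > "$dir/gabtab" <<'EOF'
/sbin/gabconfig -c -n2
EOF
}

# Example: write_sample_cluster_files /tmp/vcs-sample
```

On a running cluster, "lltstat -n" and "gabconfig -a" show the live state that corresponds to these files.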

HAD:
Stands for High Availability Daemon, also known as the VCS engine. This is the heart of VCS. HAD is responsible for all cluster functionality: it talks to all the agents and holds the entire configuration and logic in memory. There is another process called hashadow, whose primary job is to monitor HAD.

Configuration files for HAD:
/etc/VRTSvcs/conf/config/main.cf
/etc/VRTSvcs/conf/config/types.cf

Commands for HAD:
/opt/VRTS/bin/hastop (stops HAD)
/opt/VRTS/bin/hastart (start HAD)
/opt/VRTS/bin/hastatus (monitor HAD status)
/opt/VRTS/bin/hagrp (monitor/manage Service group)
/opt/VRTS/bin/hares (monitor/manage Resources)
/opt/VRTS/bin/hacf -verify /etc/VRTSvcs/conf/config (checks main.cf for syntax issues)
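
As a small example of using these commands together, the sketch below wraps the syntax check so that it can be run before restarting HAD after a manual edit of main.cf. It assumes /opt/VRTS/bin is on the PATH; the function name and messages are illustrative.

```shell
# Syntax-check main.cf before (re)starting HAD.
# The directory defaults to the standard location listed above.
verify_vcs_config() {
    conf_dir="${1:-/etc/VRTSvcs/conf/config}"
    if hacf -verify "$conf_dir"; then
        echo "main.cf in $conf_dir passed hacf -verify"
    else
        echo "main.cf in $conf_dir has errors; fix them before running hastart" >&2
        return 1
    fi
}

# Example: verify_vcs_config && hastart
```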

NOTE: LLT, GAB and HAD depend on each other. At system startup, LLT starts first, then GAB and then HAD. HAD will not run without GAB, and GAB will not run without LLT.

VCS startup:

LLT starts. It reads /etc/llttab and /etc/llthosts.
GAB starts (it executes /etc/gabtab). It checks for other GABs to establish cluster membership.
Once GAB is loaded, hashadow starts, which loads HAD.
HAD reads /etc/VRTSvcs/conf/config/main.cf and every .cf file included from main.cf.
HAD checks whether other HADs are available. It registers itself with GAB.
If there are no other HADs, it builds the configuration in its memory from the local main.cf.
The same process happens when HAD starts on the other nodes. The HAD on the first node loads main.cf and the other .cf files from the local system (called a "local build"), and all other HADs load the configuration from the first HAD (called a "remote build").
After starting up, HAD knows all the service groups and resources from main.cf. It calls the respective agents to check whether the resources are currently online or offline.
Based on main.cf, HAD brings the service groups online or offline on the respective nodes.
Check that all the service groups are running with the command hastatus -sum.
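
The startup order above suggests a simple bottom-up health check, sketched below: LLT first, then GAB, then HAD. It assumes the VCS commands are on the PATH; the function name, messages and return codes are illustrative choices, not part of VCS.

```shell
# Walk the stack bottom-up: LLT, then GAB, then HAD.
check_vcs_stack() {
    echo "== LLT node status =="
    lltstat -n || { echo "lltstat failed; is LLT running?" >&2; return 1; }

    echo "== GAB port membership =="
    # Port a is GAB itself; port h is HAD
    gabconfig -a || { echo "gabconfig failed; is GAB configured?" >&2; return 2; }

    echo "== HAD / service group summary =="
    hastatus -sum || { echo "hastatus failed; is HAD running?" >&2; return 3; }
}
```

Because each layer depends on the one below it, the first section that fails usually points at the layer to investigate.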

Important actions that can be taken by an admin while working on VCS:

VCS -
Start: Follow steps above.
Stop: Stop the HAD, unload GAB and then unload LLT.
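
A hedged sketch of the stop sequence on a single node follows. It assumes that "gabconfig -a" stops listing port h locally once HAD has exited, and it uses an arbitrary wait of about one minute. lltconfig -U may ask for confirmation, so this is meant for interactive use.

```shell
# Stop the stack top-down on the local node: HAD, then GAB, then LLT.
# hastop -local also takes this node's service groups offline.
stop_vcs_local() {
    hastop -local || return 1

    # Wait (up to about 60s) for HAD to leave GAB port h before unloading GAB
    i=0
    while gabconfig -a | grep -q "Port h" && [ "$i" -lt 12 ]; do
        sleep 5
        i=$((i + 1))
    done

    gabconfig -U || return 2   # unconfigure GAB
    lltconfig -U || return 3   # unconfigure LLT (may ask for confirmation)
}
```

To stop HAD on every node at once, "hastop -all" is the usual alternative to "hastop -local".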

Service Groups -
Online: Manually bring a specific service group online on a specific node or all nodes.
Offline: Manually bring a service group offline on a specific node or all nodes.
Freeze: In terms of NetBackup, if NetBackup has problems, you might want to stop and start NetBackup a couple of times. It is necessary to freeze the service group while you do that. By freezing the service group, we are telling VCS not to take any action on it.
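
The freeze workflow can be sketched as a small wrapper that freezes the group, runs a maintenance command and always unfreezes afterwards. nbu_group is the usual group name mentioned later in this article; the maintenance command and node names in the comments are placeholders.

```shell
# Run a maintenance command with the service group frozen, so that VCS
# does not react while NetBackup is being stopped and started by hand.
with_group_frozen() {
    group="$1"
    shift
    hagrp -freeze "$group" || return 1
    if "$@"; then rc=0; else rc=$?; fi    # the maintenance command to run
    hagrp -unfreeze "$group" || return 1
    return "$rc"
}

# Example (my_nbu_restart is a placeholder for your own restart steps):
#   with_group_frozen nbu_group my_nbu_restart
#
# Other common group operations (node names are hypothetical):
#   hagrp -online  nbu_group -sys nodeB
#   hagrp -offline nbu_group -sys nodeA
#   hagrp -state   nbu_group
```

A plain freeze like this is temporary. A freeze that should survive a HAD restart needs the -persistent option and a writable configuration.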

Resource -
Online: Manually online a resource
Offline: Manually offline a resource
Probe: Ask the resource agent to probe for the resource and get its current status.
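
For resources, a minimal sketch of a probe followed by a state check is shown below. The resource and node names in the comments are hypothetical. The probe is asynchronous, so the state printed right afterwards may still be the old one.

```shell
# Ask the agent to re-check a resource on a node, then print its state
# on all systems.
probe_resource() {
    res="$1"
    node="$2"
    hares -probe "$res" -sys "$node" || return 1
    hares -state "$res"
}

# Manual online/offline use the same shape (names are hypothetical):
#   hares -online  nbu_ip -sys nodeA
#   hares -offline nbu_ip -sys nodeA
```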

NetBackup in VCS:
Install NetBackup on the nodes the way you normally would. The NetBackup installation wizard asks for the EMM server name and the master server name; on both nodes, give the "virtual name" at that point. Note that at this stage nothing goes on the shared LUN.

Once the installation is done, run the following script:

/usr/openv/netbackup/bin/cluster/cluster_config

This script will prompt for all the information that it needs and does the following:
- Creates an agent "NetBackup" and its cf file at /usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf
- Creates the service group (usually nbu_group)
- Creates the resources (NIC, IP, DG, VOL, MOUNT and NETBACKUP)
- Moves the databases to the shared location
- Creates the file /usr/openv/netbackup/bin/cluster/NBU_RSP, which holds information about the cluster configuration

The good part about the cluster_config script is that if anything in the script fails, it undoes everything, which means that running the script again will not create any duplicates in the configuration.

Basic Tasks:
Create service group (hagrp -add)
Modify service group (hagrp -modify)
Delete service group (hagrp -delete)
Add resource(s) to a service group (hares -add)
Modify resources (hares -modify)
Delete resources (hares -delete)
Monitor the cluster (hastatus)
Switch over service group from one node to the other (hagrp -switch)
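
To show how these tasks fit together, the sketch below builds a minimal failover group with a single IP resource. Every name and value (demo_group, demo_ip, nodeA, nodeB, eth0 and the address) is hypothetical, and the attributes an IP resource needs can differ by platform and VCS version. Configuration changes have to be wrapped in haconf -makerw and haconf -dump -makero.

```shell
# Build a minimal failover group with one IP resource.
# Group, resource and node names, device and address are all hypothetical.
create_sample_group() {
    haconf -makerw || return 1                     # open the config for writing
    if hagrp -add demo_group &&
       hagrp -modify demo_group SystemList nodeA 0 nodeB 1 &&
       hagrp -modify demo_group AutoStartList nodeA &&
       hares -add demo_ip IP demo_group &&
       hares -modify demo_ip Device eth0 &&
       hares -modify demo_ip Address 192.0.2.10 &&
       hares -modify demo_ip NetMask 255.255.255.0 &&
       hares -modify demo_ip Enabled 1; then
        rc=0
    else
        rc=1
    fi
    haconf -dump -makero || return 1               # save main.cf, back to read-only
    return "$rc"
}

# Afterwards, a manual switch-over would look like:
#   hagrp -switch demo_group -to nodeB
```

For NetBackup itself the cluster_config script creates the group and resources, so a sketch like this is mainly useful for understanding what that script does.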

Config files:
/etc/VRTSvcs/conf/config/main.cf
/etc/VRTSvcs/conf/config/types.cf
/usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf
/usr/openv/netbackup/bin/cluster/NBU_RSP

Logs:
System log
/var/VRTSvcs/logs/engine_A.log
/usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log
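
When troubleshooting, it often helps to filter the engine log for one group, resource or node. The helper below is a simple sketch; the function name and the default of 20 lines are arbitrary choices.

```shell
# Show the most recent engine log lines that mention a given name
# (a service group, resource or node).
recent_engine_log() {
    name="$1"
    log="${2:-/var/VRTSvcs/logs/engine_A.log}"
    grep -- "$name" "$log" | tail -n "${3:-20}"
}

# Example: recent_engine_log nbu_group
```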

I hope you enjoyed reading through it and hope it helps you in your day to day work.

Published 16 years ago

Version 1.0


20 Comments

KRee
Level 2
13 years ago
Hi all,

I'm currently running NBU 7.1.0.2 HA on a Linux cluster with local failover (within the site, local cluster) and global failover (at a secondary site).

After months of the setup being frozen to 1 master, today we tried to have the cluster fail over from the active cluster to the passive cluster and things went terribly wrong. As such, we have logged a case with Symantec support.

However, that support case aside, I was just wondering if anyone has a pre-checklist that I should get my guys to go through before we attempt another failover to the secondary node?

We managed to fix the issues by failing back to the previously active node.

One piece of feedback I got was that bp.conf was different on the 2 nodes, and this caused some issues, especially on the SAN media server side.

Also, the Java console seemed to have frozen.

Plus, almost 50% of the library and drives went down into mixed mode (we have a pretty large environment with 8 media servers and over 36 drives, residing in at least 8 separate libraries).

I would really appreciate it if anyone has a checklist/cheat sheet so that our next attempt would be more reassuring.

Thanks ! :)

Warmest Regards,

Kevin
Marianne
Level 6
14 years ago
cluster_config was done up to NBU 6.5. Since NBU 7.0, NBU installation sees that you are installing NBU in a clustered environment and will prompt you for cluster configuration info.

See NetBackup Installation Guide for UNIX and Linux http://www.symantec.com/docs/DOC3647

and

NetBackup in Highly Available Environments Administrator's Guide http://www.symantec.com/docs/DOC3678

PLEASE check above-mentioned guides carefully for requirements (such as rsh).
nbuno
Level 6
14 years ago
We built a Linux master cluster for NBU 7.1, but we never ran the cluster_config script separately and everything is running fine.
Can someone tell me whether it takes all its entries from the NBU master installation when it asks "if server is going to be a cluster master",
or do we have to run this script separately?
AsiyeYigitGante
Level 2
14 years ago
Hello all,

Thanks a lot for the information. The technical whitepaper is really excellent.

I have installed NB 7.0 in a clustered environment.

However, the NetBackup resource could not be brought online.

I have followed the NB HA admin guide.

The installation completed successfully. However,

the NetBackup resource could not be brought online. There is no clue in the VCS/NB/system logs.

What do you suggest?
Cristiano_Cabra
14 years ago
Ok, very good technote.
This technote is a good overview, but I would like to know: is there some way to configure a load-balancing cluster for NetBackup?
Thanks,
ClubFemina
15 years ago
Symantec Veritas NetBackup (NBU) is one of the most pervasive data management tools used in medium to large size data centers. It is a critical component of users' backup, recovery, and disaster preparedness strategies.

Cheers
Softzine
Frenki
15 years ago
nice work!
I wish I could just understand the things that I do! In the meantime, have to learn from such as you! Thanks a lot!
Frenki
lchun21
15 years ago
This form of backup is important for every large company that stores valuable data used for day to day tasks. Its a nightmare to see all your data disappear, but a life saver to have a net backup. - Mothers Ring
Marianne
Level 6
15 years ago
Read the NetBackup High Availability Admin Guide.
It will tell you all the pre-req's to check, before running the cluster_config script. This script will move catalogs to the share, config the EMM database as a clustered master, copy/link the agent to VCS and create the Service Group.
fishfund
15 years ago

I am building a NetBackup master server cluster. I have the cluster just about complete with everything except the NetBackup resource. Before, there was a package to install to add the service group. Now there is a "vcs_nbu_config" script. However, this script is trying to do way too much for me. Any help would be highly appreciated.