Many of us who have worked with NetBackup for a long time eventually need to work on a NetBackup installation that is configured with VCS. Not everyone who knows NetBackup necessarily knows VCS, so here is a short overview of VCS and how it works with NetBackup.
A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer. There are many types of clusters, such as HA clusters and load-balancing clusters.
High-availability clusters (also known as failover clusters, and the most common type for NetBackup) are implemented primarily to improve the availability of the services that the cluster provides. They operate by having redundant nodes, which are then used to provide service when system components fail. The most common size for an HA cluster is two nodes, which is the minimum required to provide redundancy. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure.
For NetBackup, you would usually have a two-node cluster: an active node and a failover node.
Symantec's cluster software for High-Availability is VCS.
HOW IT WORKS:
Consider a server that is installed and configured as a NetBackup master server. Assume that a disaster happens and the host is unavailable. All the services are unavailable and backups stop running until corrective action is taken to bring the host back up (possibly a complete DR). To avoid this, an identical host is configured with exactly the same installation and configuration of NetBackup, and the two hosts are configured as an HA cluster. Both nodes have to run the same version of NetBackup and have the same LUN assigned from the SAN. The NetBackup databases (image db, voldb and mediadb for 5.x; image db and EMM for 6.x) reside on this shared LUN. Binaries are installed on each node separately. One node runs normally, with NetBackup and VCS on it. This is termed the "Active" node, and the host that is not running NetBackup is the "Passive" or "Fail-over" node.
In this case, VCS monitors the NetBackup application on the active node at all times. If NetBackup becomes unavailable, VCS detects the failure, gracefully stops everything, unmounts the shared volume from the active node, mounts it on the passive node and starts NetBackup there. The failed node can now be worked on for disaster recovery, and backups are interrupted for just a few minutes.
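The failover sequence described above can be sketched as a toy model in Python. This is only an illustration of the order of steps, not VCS code; the class name and the node names nodeA and nodeB are made up for the example.

```python
class TwoNodeCluster:
    """Toy model of an active/passive pair sharing one volume (not real VCS code)."""

    def __init__(self, active, passive):
        self.active = active              # node currently running NetBackup
        self.passive = passive            # standby node
        self.volume_mounted_on = active   # the shared volume follows the active node
        self.log = []                     # record of the steps taken

    def fail_over(self):
        """Move the shared volume and NetBackup from the active to the passive node."""
        old, new = self.active, self.passive
        self.log.append(f"stop NetBackup on {old}")
        self.log.append(f"unmount shared volume on {old}")
        self.volume_mounted_on = None
        self.log.append(f"mount shared volume on {new}")
        self.volume_mounted_on = new
        self.log.append(f"start NetBackup on {new}")
        self.active, self.passive = new, old

    def monitor(self, netbackup_healthy):
        """One monitor cycle: fail over only if NetBackup is reported unhealthy."""
        if not netbackup_healthy:
            self.fail_over()
        return self.active


cluster = TwoNodeCluster("nodeA", "nodeB")
cluster.monitor(netbackup_healthy=True)    # healthy: nothing happens
cluster.monitor(netbackup_healthy=False)   # unhealthy: triggers the failover
for step in cluster.log:
    print(step)
```

The point of the model is the ordering: the shared volume is released on the old node before it is mounted on the new one, so it is never mounted on both nodes at once.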
For NetBackup to work in a cluster, the following criteria need to be met:
- Shared storage between hosts.
- At least 3 NICs on each host.
- Identical hardware.
- Same OS and NetBackup version.
- VCS installed and configured.
VCS (Veritas Cluster Server) is Symantec's solution for high availability. It works on SLES, RHEL, Solaris and Windows. It is responsible for startup, shutdown, monitoring and failover of the applications configured in it. For an application to be configured for failover, VCS must know how to start the application, monitor it and shut it down. A user can define the logic by which VCS handles the applications.
Heartbeat: Heartbeats are a communication mechanism that nodes use to exchange information about hardware and software status, keep track of cluster membership, and keep this information synchronized across all cluster nodes. It is recommended to have at least two heartbeats.
Resource: A resource is an entity that may be brought online, offline, or monitored on a particular system. Each resource is of a resource type. Examples of resource types are mount points, IP addresses and so on.
There are three categories of VCS resources: on-off, on-only, and persistent.
- An on-off resource is one that VCS can fully control.
- An on-only resource is one that VCS can restart but not shut down.
- A persistent resource is one that VCS only monitors and cannot control (for example, a NIC).
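As a compact way to remember the three categories, here is a small Python lookup that encodes what VCS may do to each one. The table and the function name are illustrative only; they restate the list above and are not a VCS API.

```python
# Operations VCS may perform on each resource category, as described above.
ALLOWED_OPERATIONS = {
    "on-off": {"online", "offline", "monitor"},   # full control
    "on-only": {"online", "monitor"},             # can be started, never shut down
    "persistent": {"monitor"},                    # monitored only, e.g. a NIC
}


def vcs_can(category, operation):
    """Return True if VCS may perform `operation` on a resource of `category`."""
    return operation in ALLOWED_OPERATIONS[category]


print(vcs_can("on-off", "offline"))      # True
print(vcs_can("on-only", "offline"))     # False
print(vcs_can("persistent", "online"))   # False
```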
Resource agent: Every resource has an associated agent. The agent is responsible for the various actions on the resource, such as online, offline and monitor.
Service group: A service group is a logical collection of resources. These resources are taken online and offline together. Service groups come in two varieties: failover and parallel. The service group for NetBackup will be a failover group.
Dependency: A dependency relationship tells the cluster in what order to bring resource entities online and offline. In each resource dependency relationship there is a parent and a child. A parent resource will not be brought online until all of its children are online.
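To make the parent/child rule concrete, the following Python sketch derives an online order from a dependency map. The resource names and the links between them are assumptions chosen for illustration (a typical-looking layout, not taken from a real main.cf), and the function is a plain depth-first walk without cycle detection.

```python
# Hypothetical dependency map: each parent lists the children it requires.
REQUIRES = {
    "netbackup": ["ip", "mount"],
    "ip": ["nic"],
    "mount": ["vol"],
    "vol": ["dg"],
}


def online_order(requires):
    """Return resources ordered so that every child comes before its parent."""
    order, seen = [], set()

    def visit(resource):
        if resource in seen:
            return
        seen.add(resource)
        for child in requires.get(resource, []):
            visit(child)          # children must be online first
        order.append(resource)    # the parent goes online after its children

    for resource in requires:
        visit(resource)
    return order


order = online_order(REQUIRES)
print(order)                    # ['nic', 'ip', 'dg', 'vol', 'mount', 'netbackup']
print(list(reversed(order)))    # offline order: parents first, children last
```

Taking resources offline uses the reverse order, so the application at the top of the tree is stopped before the storage and network resources it depends on.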
Split brain: Split brain occurs when two or more systems within the cluster think they have exclusive access to a shared resource at the same time. This can be very damaging because data corruption is common in this situation.
Jeopardy: A system is in jeopardy when only one of its heartbeat connections is still functioning. A loss of the remaining heartbeat network will not allow VCS to know whether the host has crashed or the last heartbeat network has been disabled.
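The jeopardy rule can be written down as a tiny Python function. The state labels are informal names picked for this sketch, not official VCS terminology apart from "jeopardy" itself.

```python
def heartbeat_state(links_up):
    """Classify a node by how many of its heartbeat links still work."""
    if links_up < 0:
        raise ValueError("links_up cannot be negative")
    if links_up >= 2:
        return "normal"       # redundancy intact
    if links_up == 1:
        return "jeopardy"     # one more link failure and the node's state is unknowable
    return "unknown"          # crashed node and lost last link look the same


for links in (2, 1, 0):
    print(links, heartbeat_state(links))
```

The "unknown" case is the dangerous one: because a crashed node and a node with no working heartbeat links cannot be told apart, this is where split brain becomes possible.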
VCS can be divided into two important parts:
LLT and GAB: Low Latency Transport (LLT) and Global Atomic Broadcast (GAB) are responsible for heartbeat and cluster communication. These are kernel modules and are installed with VCS. LLT provides fast, high-priority internal cluster communication. LLT does not run over TCP/IP; it is a different communication technology. GAB runs over LLT and is primarily responsible for cluster membership. So, LLT on each node does the communication and GAB on each node maintains the cluster membership.
HAD: HAD stands for High Availability Daemon. It is also known as the VCS engine and is the heart of VCS. HAD is responsible for all the cluster functionality. HAD talks to all the agents and holds all the configuration and logic in memory. There is another process called hashadow, whose primary job is to monitor HAD.
Configuration files for HAD: /etc/VRTSvcs/conf/config/main.cf and the .cf files it includes.
Commands for HAD:
/opt/VRTS/bin/hastop (stops HAD)
/opt/VRTS/bin/hastart (start HAD)
/opt/VRTS/bin/hastatus (monitor HAD status)
/opt/VRTS/bin/hagrp (monitor/manage Service group)
/opt/VRTS/bin/hares (monitor/manage Resources)
/opt/VRTS/bin/hacf -verify /etc/VRTSvcs/conf/config (checks main.cf for syntax issues)
NOTE: For LLT, GAB and HAD, there is a dependency. At the system start up, first LLT starts, then GAB and then HAD. HAD will not run without GAB and GAB will not run without LLT:
LLT starts. It reads /etc/llttab and /etc/llthosts.
GAB starts (It executes /etc/gabtab). It checks for other GABs to establish a cluster membership.
Once GAB is loaded, hashadow starts, which loads HAD.
HAD reads /etc/VRTSvcs/conf/config/main.cf and all the .cf files included from main.cf.
HAD checks whether other HADs are available. It registers itself with GAB.
If there are no other HADs, it loads main.cf again into HAD memory.
The same process happens when HAD starts on the other nodes. The HAD on the first node loads main.cf and the other .cf files from the local system (called a "local build"), and all other HADs load the configuration from the first HAD (called a "remote build").
After starting up, HAD will know all the service groups and resources from main.cf. It will call the respective agents to check if the resources are currently online or offline.
Based on main.cf, HAD will online/offline the Service group on the respective nodes.
Check that all the service groups are running with the command hastatus -sum.
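Two rules from the startup sequence above lend themselves to a short Python sketch: the LLT, GAB, HAD stacking order, and the local-build versus remote-build decision. The function names and node names are invented for this example, and the model ignores everything else HAD does at startup.

```python
STARTUP_ORDER = ["LLT", "GAB", "HAD"]


def can_start(component, running):
    """A component may start only if everything below it in the stack is running."""
    position = STARTUP_ORDER.index(component)
    return all(dep in running for dep in STARTUP_ORDER[:position])


def start_had_on(nodes):
    """Return how HAD on each node obtains its configuration, in start order."""
    builds = {}
    for node in nodes:
        # The first HAD reads main.cf from local disk; later HADs copy from a running HAD.
        builds[node] = "local build" if not builds else "remote build"
    return builds


print(can_start("HAD", {"LLT"}))           # False: GAB is not running yet
print(can_start("HAD", {"LLT", "GAB"}))    # True
print(start_had_on(["nodeA", "nodeB"]))    # {'nodeA': 'local build', 'nodeB': 'remote build'}
```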
Important actions that can be taken by an admin while working on VCS:
Start: Follow steps above.
Stop: Stop the HAD, unload GAB and then unload LLT.
Service Groups -
Online: Manually bring a specific service group online on a specific node or all nodes.
Offline: Manually bring a service group offline on a specific node or all nodes.
Freeze: In terms of NetBackup, if NetBackup has problems, you might want to stop and start NetBackup a couple of times. It is necessary to freeze the service group while you do that. By freezing the service group, we are telling VCS not to take any action on it.
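The effect of a freeze can be illustrated with a minimal Python model. The class and its method names are hypothetical and only capture the one behaviour described above: a frozen group is left alone even when its application is down.

```python
class ServiceGroup:
    """Toy model of a failover service group (not a VCS API)."""

    def __init__(self, name):
        self.name = name
        self.frozen = False

    def freeze(self):
        self.frozen = True

    def unfreeze(self):
        self.frozen = False

    def on_application_down(self):
        """What the cluster does when it sees the application is down."""
        if self.frozen:
            return "no action"    # the admin is working on it; leave it alone
        return "fail over"


group = ServiceGroup("nbu_group")
group.freeze()
during_maintenance = group.on_application_down()   # NetBackup stopped by hand
group.unfreeze()
after_maintenance = group.on_application_down()    # a real outage
print(during_maintenance, "/", after_maintenance)
```

Forgetting to unfreeze is the usual pitfall: in this model, as in practice, a group that stays frozen will not fail over when a real outage occurs.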
Resources -
Online: Manually bring a resource online.
Offline: Manually take a resource offline.
Probe: Ask the resource agent to probe for the resource and get its current status.
NetBackup in VCS:
Install NetBackup on both nodes the way you normally would. When the NetBackup installation wizard asks for the EMM server name and the master server name, give the "virtual name" on both nodes. Note that at this point nothing goes on the shared LUN.
Once the installation is done, run the cluster_config script.
This script prompts for all the information that it needs and then does the following:
- Creates an agent "NetBackup" and its cf file at /usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf
- Creates the service group (usually nbu_group)
- Creates the resources (NIC, IP, DG, VOL, MOUNT and NETBACKUP)
- Moves the databases to the shared location
- Creates the file /usr/openv/netbackup/bin/cluster/NBU_RSP, which holds information about the cluster configuration.
The good part about the cluster_config script is that if anything fails in the script, it undoes everything, which means that the next time you run the script it won't create any duplicates in the configuration.
Create service group (hagrp -add)
Modify service group (hagrp -modify)
Delete service group (hagrp -delete)
Add resource(s) to a service group (hares -add)
Modify resources (hares -modify)
Delete resources (hares -delete)
Monitor the cluster (hastatus)
Switch over service group from one node to other (hagrp -switch)