Whitepaper: Implementing Highly Available Data Protection with Veritas NetBackup

Turls · ‎01-04-2010

1.0 Introduction

The data protection system must be regarded as a ‘mission critical’ element of any modern data center. The design of any data protection system must, as far as possible, eliminate single points of failure so that data can be recovered to an acceptable state and point in time in the event of data, server or site loss.

This paper looks at the components of a data protection system based on NetBackup and outlines a number of different configurations and solutions aimed at both reducing the risk of failure within a particular site and recovering from the loss of the site.

Although this paper has been written for NetBackup 6.5 GA and NetBackup 6.5.1, many of the concepts described here can also be applied to NetBackup 6.0. It uses a “good”/“better”/“best” classification in certain areas to indicate the merits of one solution over another. In general these classifications may be regarded as follows:
Good – provides an adequate solution in smaller environments and environments where data is less critical.
Better – provides a more robust solution, generally at increased cost or complexity.
Best – provides the best currently available solution but often at significant cost.

1.1 Disaster Recovery and High Availability – what is the difference?

This paper discusses the topics of high availability (HA) and disaster recovery (DR). It is important to understand that these are not the same. While high availability technologies may form part of a disaster recovery solution, simply having high availability components does not ensure that disaster recovery is possible.

1.1.1 High Availability
High availability solutions exist to protect against single points of failure within an environment, thus ensuring that the environment continues to operate as intended (although possibly at reduced capacity) in the event of the failure of a component. In a NetBackup domain high availability can take many forms ranging from the clustering of a Master Server to the use of the Shared Storage Option and NIC teaming to protect against tape drive and network failures.

1.1.2 Disaster Recovery
Disaster recovery is the general term for the process of restoring normal (although possibly reduced) service following a major unforeseen event. In the context of this paper, disaster recovery means restoring data backed up by NetBackup to servers in a different location following the loss of a data center or site.

1.2 Disaster Recovery – all or nothing?

The traditional view of disaster recovery is that the complete environment must be re-created at a secondary site, but this is often prohibitively expensive. In practice the immediate priority in the wake of a disaster is to recover only the key ‘mission critical’ systems, which does not require a complete recovery of the entire environment. Organizations are increasingly seeking ways of providing a disaster recovery capability without keeping a complete mirror of the production environment sitting idle in case a disaster occurs. This paper looks at a variety of approaches to providing rapid recovery capability without incurring the expense of a fully dedicated disaster recovery site or the time penalty of building a recovery environment from the ground up.

1.3 Glossary

The following terms are used throughout this document:

Data Protection Solution – While a data protection solution may encompass many different technologies and levels of protection, in the context of this document it means backup and recovery provided by NetBackup.
Domain – A NetBackup Domain is a collection of NetBackup Servers and Clients under the control of a single Master Server.
Site – A site is a single data center, data hall, building or location where servers requiring backup are located. A single domain may cover two or more sites. Three sub-classes of sites are described in this document:
Primary site – this is the site at which the servers that are protected by the data protection solution normally reside and is also described as the ‘production’ site.
Secondary site – this is the site at which data from the primary site would be recovered in the event of a disaster. The secondary site may be a dedicated disaster recovery site or a second production site.
Disaster recovery site/domain – this is a dedicated facility at the secondary site that is used to recover the data from the primary site in the event of a site loss. This may be either a separate NetBackup Domain or part of another existing NetBackup Domain.
NetBackup catalog – NetBackup catalogs are the internal databases that contain information about NetBackup backups and configuration. Backup information includes records of the files that have been backed up and the media on which the files are stored. The catalogs also contain information about the media and the storage devices. In the event of a disaster in which the NetBackup Master Server is lost, a full or partial recovery of the catalog information is required before recovery from backup can begin.
Single Point of Failure (SPOF) – any component of the data protection solution which, if it fails or is unavailable, prevents the backup and recovery process from working.
SPAS – Symantec Product Authentication Service, formerly known as VxSS.
RPO (recovery point objective) – the most recent point in time to which an application can be recovered using the available backup data. The actual recovery point may differ from any established objective. Proper planning and testing may be needed to ensure that the actual recovery point aligns with the desired recovery point objective.
RTO (recovery time objective) – the time required to recover an application, server or series of applications or servers following a failure. In the context of this document, the RTO is generally assumed to be the RTO at the secondary site, including the time to prepare the NetBackup environment at the secondary site to begin restoring data. Again, the actual recovery time may differ from any established objective. Proper planning and testing may be needed to ensure that the actual recovery time aligns with the desired recovery time objective.
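
The distinction between an objective and the actual recovery point or time can be illustrated with a small calculation. The following Python sketch is not part of NetBackup and is not taken from the paper; the function names, the example timestamps and the example objectives are all hypothetical. It assumes that the actual recovery point is the gap between the last usable backup and the failure, and that the actual recovery time is the time to prepare the environment at the secondary site plus the time to restore the data.

```python
from datetime import datetime, timedelta


def actual_recovery_point(last_usable_backup: datetime, failure_time: datetime) -> timedelta:
    """Return the data-loss window: time between the last usable backup and the failure."""
    if last_usable_backup > failure_time:
        raise ValueError("the last usable backup cannot be later than the failure")
    return failure_time - last_usable_backup


def actual_recovery_time(prepare_environment: timedelta, restore_data: timedelta) -> timedelta:
    """Return total recovery time: environment preparation plus data restore (assumed model)."""
    return prepare_environment + restore_data


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """An objective is met when the actual value does not exceed it."""
    return actual <= objective


# Hypothetical figures, for illustration only.
recovery_point = actual_recovery_point(
    last_usable_backup=datetime(2010, 1, 3, 22, 0),
    failure_time=datetime(2010, 1, 4, 9, 30),
)
recovery_time = actual_recovery_time(
    prepare_environment=timedelta(hours=3),
    restore_data=timedelta(hours=6),
)

rpo_met = meets_objective(recovery_point, objective=timedelta(hours=24))
rto_met = meets_objective(recovery_time, objective=timedelta(hours=8))
```

With these made-up figures the recovery point objective is met but the recovery time objective is not, which is the kind of gap that planning and testing are intended to reveal before a real disaster does.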

To read the complete article, please download the PDF.
