10 Minutes to Get Your Backup and Recovery Jobs Ru...

smoulton · ‎07-05-2023

Are you worried about the cost of a site failure and the stress of getting your environment back up and running quickly? Recovering from a site failure often requires numerous tedious manual operations and results in costly downtime. NetBackup Flex Scale includes automation that makes it quick and easy to recover and get your data protection jobs running again, often with an RTO of less than 10 minutes. All it takes is for an administrator to initiate a site failover (change replication roles) with one click. From there, all required tasks are automated, including:

Starting the management (primary) service on the secondary site
Updating the networking automatically so no network infrastructure changes need to be performed manually
Reversing replication of metadata (catalog)
Optionally changing backup policies to write to the online site
Starting backup and recovery jobs

While one site is down, Flex Scale maintains a catalog of data that needs to be duplicated. When the failed site comes back online, duplication starts automatically to bring both sites back into sync.

Let’s get into some more details about how this works.

Initial Configuration

You start with two NetBackup Flex Scale clusters, and using the web UI or API calls you simply provide configuration details.

Then the automated process takes over the configuration operations including:

setting up a trust relationship between sites
extending the primary site’s domain to include the DR site’s cluster
adding heartbeat monitoring between sites
adding asynchronous replication of the metadata (catalog) between the system with the primary service online and the DR site. Depending on the network bandwidth between sites, this can result in a near-zero RPO.
adding default storage lifecycle policies (SLP) to replicate backup data between sites using NetBackup optimized duplication technology
optionally configuring both management services to use a shared virtual IP (only one service is active at a time)
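
The link between bandwidth and RPO mentioned for catalog replication can be reasoned about with some simple arithmetic. The sketch below is a simplified model and not a Flex Scale formula: the function name, its parameters, and the assumption of constant rates are all illustrative.

```python
def catch_up_seconds(backlog_mb: float,
                     change_rate_mb_per_s: float,
                     link_mb_per_s: float) -> float:
    """Estimate how long asynchronous replication needs to clear a backlog.

    Hypothetical model: catalog changes arrive at a constant rate and the
    inter-site link has a constant usable throughput.
    """
    spare = link_mb_per_s - change_rate_mb_per_s
    if spare <= 0:
        # The link cannot keep up, so the replica falls further behind
        # and the achievable RPO grows over time.
        return float("inf")
    return backlog_mb / spare
```

Under these assumptions, a 100 MB backlog with 2 MB/s of catalog changes over a 12 MB/s link clears in 10 seconds. When the link comfortably exceeds the change rate, the replica stays only moments behind, which is what a near-zero RPO means in practice.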

It is recommended to enable WORM storage on either one site or both sites after the DR setup is complete.

Now you have a fully configured active-active, dual-site configuration within a single NetBackup domain.

Configuring Space Efficient Policies

Next, the backup administrators can add policies to protect their data, with options for where to store their backups, whether to duplicate them between sites, and how long to retain them at each site. Any data configured for duplication between sites is copied using NetBackup optimized duplication. This process retains the deduplication savings by sending only unique blocks to the second site, which keeps duplication fast and makes efficient use of the network and of storage on the DR site.
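
To illustrate why sending only unique blocks saves bandwidth, here is a generic sketch of deduplication-aware copying. It is not NetBackup's implementation: the fixed blocks, the SHA-256 fingerprints, the in-memory `remote_store` dictionary, and all function names are assumptions made for the example.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Identify a block by the hash of its content."""
    return hashlib.sha256(block).hexdigest()


def duplicate_image(blocks: list[bytes],
                    remote_store: dict[str, bytes]) -> tuple[list[str], int]:
    """Copy a backup image to the remote site, sending only unseen blocks.

    Returns the image "recipe" (ordered fingerprints) and the bytes sent.
    """
    recipe = []
    bytes_sent = 0
    for block in blocks:
        fp = fingerprint(block)
        if fp not in remote_store:
            # Only data the remote site does not already hold crosses the link.
            remote_store[fp] = block
            bytes_sent += len(block)
        recipe.append(fp)
    return recipe, bytes_sent


def restore_image(recipe: list[str], remote_store: dict[str, bytes]) -> bytes:
    """Rebuild the image on the remote site from its recipe."""
    return b"".join(remote_store[fp] for fp in recipe)
```

In this model, a second image that shares most of its blocks with an earlier one transfers only the blocks that differ, yet it can still be restored in full from the remote store.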

Failure Detection and Recovery

The cluster constantly monitors the heartbeat between sites and sends an alert if it is lost. The administrator then verifies the event and if they determine it is an unplanned outage, they can simply initiate a takeover operation from the DR site with a single click in the web UI or an API call.
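
The detection step above can be sketched as a heartbeat timer that raises an alert but leaves the takeover decision to a person. This is a minimal illustration, not Flex Scale code: the class name, the `timeout_s` threshold, and the caller-supplied timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    timeout_s: float        # hypothetical threshold before alerting
    last_seen: float = 0.0  # timestamp of the most recent heartbeat
    alerted: bool = False

    def beat(self, now: float) -> None:
        """Record a heartbeat from the peer site and clear any alert state."""
        self.last_seen = now
        self.alerted = False

    def check(self, now: float) -> bool:
        """Return True once, when the heartbeat is first considered lost.

        The monitor only alerts. It never starts a takeover itself, because
        a lost heartbeat may be a network fault rather than a site outage.
        """
        if not self.alerted and now - self.last_seen > self.timeout_s:
            self.alerted = True
            return True
        return False
```

Keeping a person in the loop guards against both sites believing they should run the primary service when only the link between them has failed.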

This operation automates configuring the domain to use the management (primary) service on the remaining site, bringing its management service container online and reversing the direction of the asynchronous replication of the metadata (catalog).

Note: there is also an option to reverse the role of the primary and secondary sites in a planned migration operation.
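
The order of operations in a takeover can be modeled as a small state change on a pair of sites. The sketch below is a hypothetical model for illustration only: the `Site` and `DrPair` classes, their fields, and the step descriptions are assumptions and do not reflect the Flex Scale API.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    reachable: bool = True
    primary_service_online: bool = False


@dataclass
class DrPair:
    """Hypothetical model of two sites in one domain."""
    source: Site  # runs the primary service; the catalog replicates from here
    target: Site  # receives the catalog replica

    def takeover(self) -> list[str]:
        """Unplanned outage: promote the surviving target site."""
        if self.source.reachable:
            raise RuntimeError("source is reachable; use a planned migration")
        steps = []
        self.source.primary_service_online = False
        steps.append(f"point domain at primary service on {self.target.name}")
        self.target.primary_service_online = True
        steps.append(f"bring primary service online on {self.target.name}")
        # Swap roles so the survivor becomes the replication source.
        self.source, self.target = self.target, self.source
        steps.append(
            f"reverse catalog replication: {self.source.name} -> {self.target.name}"
        )
        return steps
```

The guard at the top reflects the distinction drawn in the note: a takeover is for an unreachable site, while a role swap between two healthy sites is a planned migration.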

To speed up recovery and decrease manual operations this process also includes two options:

First, the automated process can update the DNS records for the primary service if DR was set up without a virtual IP.
Second, the administrator can select a checkbox to have the backup policies and storage lifecycle policies (SLPs) temporarily changed to write the first copy of the backup data from both sites to the remaining site’s storage. This is recommended if the failed site will be down for an extended period and you want to retain your RTO for both sites, because it allows backup and recovery jobs to continue for applications on both sites.

Once the failed site comes back online, NetBackup Flex Scale will automatically:

detect and reattach the cluster, keeping its primary service offline
resume replication of the catalog and duplication of the backup data
revert the policies and SLPs to their previous configuration

Any jobs that were already running when the site came back online will finish using the temporary backup location; any new jobs will automatically use the original backup storage.
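
The temporary redirection and later revert of policies can be pictured as saving the original storage targets and restoring them afterwards. This is a minimal sketch under stated assumptions: the `PolicyRedirector` class, the policy names, and the site labels are hypothetical and are not part of NetBackup.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyRedirector:
    """Hypothetical: maps each policy name to its first-copy storage site."""
    targets: dict[str, str]
    saved: dict[str, str] = field(default_factory=dict)

    def redirect(self, failed_site: str, surviving_site: str) -> None:
        """Point policies that wrote to the failed site at the survivor."""
        for policy, site in list(self.targets.items()):
            if site == failed_site:
                self.saved[policy] = site  # remember the original target
                self.targets[policy] = surviving_site

    def revert(self) -> None:
        """Restore every redirected policy to its original target."""
        self.targets.update(self.saved)
        self.saved.clear()

    def start_job(self, policy: str) -> str:
        # A job captures its target when it starts, so a job that is already
        # running during revert() keeps writing to the temporary location.
        return self.targets[policy]
```

For example, a job started for a redirected policy keeps the surviving site as its target even after `revert()` runs, while the next job for that policy goes back to the original site.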

You can use the UI to see the replication status and the amount of data in the queue for replication.
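
The backlog that builds while a site is offline, and drains once it returns, behaves like a simple queue. The sketch below is illustrative only: the `DuplicationBacklog` class, its methods, and the image identifiers are assumptions, not the product's internals.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DuplicationBacklog:
    """Hypothetical model of backup images waiting to be copied to the peer."""
    peer_online: bool = True
    pending: deque = field(default_factory=deque)   # (image_id, size_bytes)
    duplicated: list = field(default_factory=list)  # image ids already copied

    def record_backup(self, image_id: str, size_bytes: int) -> None:
        """Queue a new image; copy it right away if the peer is reachable."""
        self.pending.append((image_id, size_bytes))
        if self.peer_online:
            self.drain()

    def drain(self) -> None:
        """Copy queued images in the order they were created."""
        while self.pending:
            image_id, _ = self.pending.popleft()
            self.duplicated.append(image_id)

    def set_peer_online(self, online: bool) -> None:
        self.peer_online = online
        if online:
            self.drain()  # peer is back: catch up automatically

    def queued_bytes(self) -> int:
        """Amount of data still waiting, the figure a status view would show."""
        return sum(size for _, size in self.pending)
```

While the peer is offline, `queued_bytes()` grows with each backup; once `set_peer_online(True)` is called, the queue empties and the value returns to zero.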

Simplified Upgrades

Beyond protecting against a site-wide disaster, NetBackup Flex Scale also simplifies upgrades: when you initiate an upgrade, it automates the upgrade for both sites in parallel, ensuring your DR environment stays in sync.

Summary

In summary, NetBackup Flex Scale supports a single active-active DR solution that spans two sites. In the event of a site-wide disaster, its built-in automation makes it simple to get your backup and recovery jobs running again, typically in less than 10 minutes.

Want to see DR in action? Check out this YouTube video, which shows how to configure DR and recover from a site-wide disaster.

To learn more about NetBackup Flex Scale check out this technical overview or reach out to your Veritas account representative.
