
Data Protection of Hadoop Cluster using NetBackup

Abhishek_Kulk1

Do you have a Hadoop cluster running with High Availability, multiple nodes, or SSL and Kerberos authentication enabled?

Do you want to protect your Hadoop environment with the storage, deduplication, and performance benefits of NetBackup?

Then you are in the right place. NetBackup supports protection for all of these Hadoop environments and helps in disaster recovery scenarios.

How it works
NetBackup uses a parallel streaming framework to protect scale-out environments like Hadoop.

The NetBackup Hadoop plugin uses an agentless architecture; the administrator doesn't need to install NetBackup clients on the Hadoop nodes. This suits the scale-out nature of Hadoop, where customers add data nodes as data grows. The NetBackup Hadoop plugin installed on the backup hosts (typically NetBackup media servers) is used to discover, back up, and recover Hadoop data. Administrators can add more backup hosts to distribute the data streams and improve performance.

[Image: Abhishek_klk_0-1651514118608.png]

Administrators can create point-in-time copies of Hadoop data using full and incremental backup schedules and recover data to the same or an alternate Hadoop cluster. Where replication is enabled, the NetBackup Hadoop plugin backs up only one copy of the data, which reduces storage costs. If NetBackup deduplication (MSDP) or NetBackup Appliances are part of the environment, further storage savings are possible through built-in deduplication.

You can protect Apache Hadoop and popular HDFS distributions such as Cloudera. The NetBackup Hadoop plugin supports file- and folder-level recovery from a Hadoop cluster.

Administrators can recover the root directory or individual folders and files to the same or an alternate Hadoop cluster.


How to configure protection for a Hadoop cluster using a NetBackup primary server

Backup:

a. Create a BigData policy, as shown below, using the NetBackup primary server.

b. Under the Clients tab, enter the Hadoop name node manually.

[Image: Abhishek_klk_1-1651514118609.png]

c. Under the Backup Selection tab, enter the following strings manually, as seen in the screenshot (see the example after these steps):

  • All the backup hosts taking part in discovery and backup (NetBackup media servers)
  • All HDFS folders to be protected
  • Application_Type=hadoop

[Image: Abhishek_klk_2-1651514118609.png]

d. Select the storage location on the media server, and then run the backup job with the desired backup schedule.
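For reference, the Backup Selection entries are plain strings. A minimal sketch of what they might look like is below; the media server and HDFS folder names are placeholders for illustration, not values from your environment:

  Application_Type=hadoop
  Backup_Host=mediaserver1.example.com
  Backup_Host=mediaserver2.example.com
  /user/data

The Application_Type=hadoop entry tells NetBackup which plugin to invoke, and each Backup_Host entry adds a backup host to the discovery and data movement pool.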


How to configure an HDFS cluster enabled with Kerberos authentication

Here you need to distribute Kerberos tickets to all the backup hosts. To do this, follow these steps (a command-line sketch follows the list):

  1. Copy krb5.conf from the Hadoop name node/KDC server to the same location on all backup hosts (/etc/krb5.conf).
  2. All backup hosts must have access to all Hadoop nodes and must be able to reach the KDC server to receive a Kerberos ticket.
  3. To receive a Kerberos ticket on the backup hosts from the KDC server, run the kinit command or use a keytab file.
  4. The Kerberos principal must have superuser permissions on the HDFS cluster for backup and recovery to work smoothly.
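As a minimal sketch, the commands below show one way to do this on a backup host; the host name, keytab path, principal, and realm are placeholder examples:

  # Copy the Kerberos client configuration from the name node/KDC host (placeholder host)
  scp root@namenode.example.com:/etc/krb5.conf /etc/krb5.conf

  # Obtain a Kerberos ticket non-interactively using a keytab (placeholder principal)
  kinit -k -t /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM

  # Verify the ticket cache
  klist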


How to configure an HDFS cluster enabled with SSL (HTTPS)

To enable access to SSL-enabled clusters for backup and restore, you need to get the root CA certificate from the certificate authority and copy it onto the backup hosts. The NetBackup Hadoop plugin also supports protection for HDFS clusters that use Certificate Revocation Lists (CRLs).

In environments like the Cloudera distribution, the root CA certificate can be obtained from the Cloudera administrator. The Hadoop cluster may have a manual TLS configuration or Auto-TLS enabled. In both cases, NetBackup needs the root CA certificate from the administrator.

The root CA certificate from the Hadoop cluster validates the certificates of all nodes and allows NetBackup to run the backup and restore processes against the secure (SSL) cluster. This root CA certificate is typically a bundle of the certificates issued to all nodes; you can sanity-check the bundle as shown below.
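As a quick sanity check (not a NetBackup step), you can use openssl to confirm that a node certificate validates against the root CA bundle; the file names below are placeholders:

  # Show the subject/issuer of the first certificate in the bundle (placeholder file name)
  openssl x509 -in root-ca-bundle.pem -noout -subject -issuer

  # Verify a node certificate against the root CA bundle (placeholder file names)
  openssl verify -CAfile root-ca-bundle.pem namenode-cert.pem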

In self-signed, third-party CA, or local/intermediate CA environments, the root CA certificate should be configured under the ECA_TRUST_STORE_PATH option in bp.conf on the backup host. (For example, in Auto-TLS-enabled Cloudera environments, you can typically find the root CA file named "cm-auto-global_cacerts.pem" at the path "/var/lib/cloudera-scm-agent/agent-cert".)
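For example, the bp.conf entry on a backup host might look like the following, using the Cloudera Auto-TLS path mentioned above (adjust the path to wherever your root CA bundle actually lives):

  ECA_TRUST_STORE_PATH = /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem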

To protect a secure HDFS cluster with the NetBackup Hadoop plugin, you must configure the following files on all backup hosts:

/usr/openv/var/global/hadoop.conf
/usr/openv/netbackup/bp.conf
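As a rough sketch of what hadoop.conf can look like for an SSL-enabled, highly available cluster, based on the format documented in the NetBackup for Hadoop Administrator's Guide (the host names and port below are placeholders; confirm the exact fields for your NetBackup version against that guide):

  {
    "application_servers": {
      "primary-namenode.example.com": {
        "use_ssl": true,
        "port": 50470,
        "failover_namenodes": [
          {
            "hostname": "failover-namenode.example.com",
            "port": 50470,
            "use_ssl": true
          }
        ]
      }
    }
  }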


Recovering Hadoop data:

If you need to restore HDFS file or folder data, you can recover it using the Backup, Archive, and Restore window in the NetBackup UI. Specify the following values in the "Specify NetBackup Machines and Policy Type" window:

[Image: Abhishek_klk_3-1651514118609.png]

 

Once these values are specified, you can proceed with recovering HDFS folders or files using the Backup, Archive, and Restore window.
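Restores can also be scripted from the command line. Below is a minimal, hedged sketch using bprestore; the host names, file list, and log path are placeholders, and policy type 44 corresponds to BigData (confirm the exact options against the administrator's guide):

  # listfile contains the HDFS paths to restore, one per line (placeholder path)
  echo "/user/data" > /tmp/listfile

  # -S primary server, -D backup host, -C client (Hadoop name node), -t 44 (BigData policy type)
  /usr/openv/netbackup/bin/bprestore -S primaryserver.example.com \
    -D backuphost.example.com -C namenode.example.com -t 44 \
    -L /tmp/restore_progress.log -f /tmp/listfile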

For more information on configuration and support, refer to the Veritas NetBackup™ for Hadoop Administrator's Guide and the Software Compatibility List (SCL) for Hadoop.