De-Duplication (Overview in Backup Environment)

Harish55 · ‎11-20-2009

De-Duplication Role:

The role of data de-duplication is to increase the amount of information that can be stored on storage appliances (disk arrays, filers) and to increase the effective amount of data that can be transmitted over networks.

Data reduction, built on a methodology that systematically substitutes reference pointers for redundant variable-length blocks (or data segments) in a specific dataset, is a key approach to data deduplication.

Data deduplication operates by segmenting a dataset (in a backup environment this is normally a stream of backup data) into blocks and writing those blocks to disk storage. To identify blocks in a transmitted stream, the data de-duplication engine creates a digital signature, like a fingerprint, for each data segment and an index of the signatures for a given repository.

The index, which can be recreated from the stored data segments, provides the reference list to determine whether blocks already exist in a repository. The index is used to determine which data segments need to be stored and also which need to be copied during a replication operation.
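The signature index described above can be sketched as follows. This is a minimal illustration, not how any particular product works: it assumes fixed-size blocks and SHA-256 signatures, and the names `fingerprint`, `split_blocks`, `rebuild_index`, and `segments_to_store` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny, hypothetical block size so the example is easy to follow


def fingerprint(segment: bytes) -> str:
    """Return a digital signature (SHA-256 here, as an assumption) for one segment."""
    return hashlib.sha256(segment).hexdigest()


def split_blocks(stream: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the backup stream into fixed-size blocks (the last one may be shorter)."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


def rebuild_index(stored_segments: list[bytes]) -> set[str]:
    """Recreate the signature index from the segments already in the repository."""
    return {fingerprint(s) for s in stored_segments}


def segments_to_store(stream: bytes, index: set[str]) -> list[bytes]:
    """Return only the segments whose signature is not yet in the index."""
    seen = set(index)
    new_segments = []
    for segment in split_blocks(stream):
        sig = fingerprint(segment)
        if sig not in seen:
            seen.add(sig)
            new_segments.append(segment)
    return new_segments


repository = [b"AAAA", b"BBBB"]          # segments already on disk
index = rebuild_index(repository)        # index recreated from stored data
incoming = b"AAAABBBBCCCCAAAACCCC"       # new backup stream
new_segments = segments_to_store(incoming, index)
print(new_segments)  # [b'CCCC']
```

Only the one segment that the repository has never seen is selected for storage; the same lookup would decide which segments need to be copied during replication.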

• Data Deduplication and Data Movement

– Replacement of duplicate data with references to a shared copy

– Distinction: Whole record level or sub-record level

- Whole record level refers to File or Object level

- Sub-record level refers to Block or Segment level

– Deduplication at the source reduces WAN traffic

– Allows Full backup metadata collection

– Allows sub-file-level incremental data movement

When data de-duplication software sees a block it has processed before, instead of storing the block again, it inserts a pointer to the original block in the dataset's metadata. If the same block shows up multiple times, multiple pointers to it are generated.

• De-duplication can be at the File, Block, or Segment Level:
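The pointer mechanism can be illustrated with a toy block store. It is a sketch under stated assumptions: the class `DedupStore`, its fixed 4-byte block size, and the use of SHA-256 signatures as pointers are all hypothetical choices made for the example.

```python
import hashlib


class DedupStore:
    """Toy store: each unique block is kept once; a dataset is a list of pointers."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks: dict[str, bytes] = {}        # signature -> block contents
        self.metadata: dict[str, list[str]] = {}  # dataset name -> ordered pointers

    def write(self, name: str, data: bytes) -> None:
        pointers = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            sig = hashlib.sha256(block).hexdigest()
            # Store the block only the first time it is seen.
            self.blocks.setdefault(sig, block)
            # Every occurrence, new or repeated, gets a pointer in the metadata.
            pointers.append(sig)
        self.metadata[name] = pointers

    def read(self, name: str) -> bytes:
        """Rebuild the dataset by following its pointers in order."""
        return b"".join(self.blocks[p] for p in self.metadata[name])


store = DedupStore()
store.write("backup1", b"AAAABBBBAAAAAAAA")
print(len(store.metadata["backup1"]))  # 4 pointers
print(len(store.blocks))               # 2 unique blocks stored
assert store.read("backup1") == b"AAAABBBBAAAAAAAA"
```

The block `AAAA` appears three times in the stream, so three pointers reference the single stored copy, and the restore still returns the original data.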

– File Level

Usually, the file is identified through its content, and multiple copies are deduplicated. If a single bit in the file changes,
it is seen as a new file and stored again in full.

– Block Level

Disk technologies can identify individual blocks, so each block's content is stored only once and additional copies are referenced within a file system database. Deduplication is usually done after the data has been stored on disk, as a post-process.

– Segment Level

At file level, the file is read and divided into smaller parts called segments, and a hash is calculated for each segment based upon its content. A database holds the relation between a file's metadata and the segments, which are themselves stored as files. De-duplication is done at the segment level.

In a Backup Environment, De-duplication Happens on:
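To make the difference between file-level and segment-level de-duplication concrete, the sketch below counts how many bytes each approach would store for an original file, an identical copy, and a copy with one bit flipped. The 512-byte segment size, the SHA-256 hash, and the function names are assumptions made for the illustration.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def file_level_bytes(files: list[bytes]) -> int:
    """Bytes stored when whole files are deduplicated by content hash."""
    seen, stored = set(), 0
    for content in files:
        digest = sha(content)
        if digest not in seen:
            seen.add(digest)
            stored += len(content)
    return stored


def segment_level_bytes(files: list[bytes], segment_size: int) -> int:
    """Bytes stored when each fixed-size segment is deduplicated by its hash."""
    seen, stored = set(), 0
    for content in files:
        for i in range(0, len(content), segment_size):
            segment = content[i:i + segment_size]
            digest = sha(segment)
            if digest not in seen:
                seen.add(digest)
                stored += len(segment)
    return stored


# 4096-byte file made of a 2-byte counter, so no two segments are alike.
original = b"".join(i.to_bytes(2, "big") for i in range(2048))
modified = bytearray(original)
modified[1000] ^= 0x01  # flip a single bit
files = [original, bytes(original), bytes(modified)]

print(file_level_bytes(files))          # 8192
print(segment_level_bytes(files, 512))  # 4608
```

At file level the identical copy costs nothing, but the one-bit change forces a second full copy. At segment level only the single 512-byte segment containing the changed bit is stored again.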

Client

– Less impact on WAN

– Backups are moved/calculated by clients

– Reduced backup Window

– Removes tape from Remote location

Media Server

– Easy to implement

– Local LAN solution targets data centre

– Scalable by number of Media Servers

– Offloads clients from dedupe

– Leverage the Media servers

– Storage compatibility

Target deduplication

– Appliance model, usually storage dependent

– Less intelligent

– Dedupe limited to the appliance level

– No impact on backup solution

– Less scalable, questionable suitability for larger data centres

Data Deduplication Applied to Replication

Data deduplication makes the process of replicating backup data practical by reducing the bandwidth and cost needed to create and maintain duplicate datasets over networks. At a basic level, deduplication-enabled replication is similar to de-duplication-enabled data stores.

Once two images of a backup data store are created, all that's required to keep the replica or target identical to the source is the periodic copying and movement of the new data segments added during each backup event, along with their metadata image, or namespace.

The replication process begins by copying all the data segments in one share or portion of a source appliance to an equivalent share or portion in a second, target appliance. Although this initial transfer can occur over a network, data volumes often make it more practical to temporarily co-locate the source and target devices to synchronize the datasets, or to transfer the initial datasets using tape.
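The replication behaviour described above (a full initial copy, then only new segments plus the namespace) can be sketched like this. The `Appliance` class and `replicate` function are hypothetical and stand in for whatever a real appliance does internally; SHA-256 signatures are again an assumption.

```python
import hashlib


def sig(segment: bytes) -> str:
    return hashlib.sha256(segment).hexdigest()


class Appliance:
    """Toy appliance: a segment store plus a namespace (backup name -> pointers)."""

    def __init__(self):
        self.segments: dict[str, bytes] = {}
        self.namespace: dict[str, list[str]] = {}

    def backup(self, name: str, segments: list[bytes]) -> None:
        self.namespace[name] = [sig(s) for s in segments]
        for s in segments:
            self.segments.setdefault(sig(s), s)


def replicate(source: Appliance, target: Appliance) -> int:
    """Copy segments the target lacks, then the namespace; return segments sent."""
    missing = [k for k in source.segments if k not in target.segments]
    for k in missing:
        target.segments[k] = source.segments[k]
    # The metadata image (namespace) is always brought in line with the source.
    target.namespace = {name: list(ptrs) for name, ptrs in source.namespace.items()}
    return len(missing)


source, target = Appliance(), Appliance()

source.backup("monday", [b"seg-a", b"seg-b", b"seg-c"])
first = replicate(source, target)    # initial synchronization: everything is sent

source.backup("tuesday", [b"seg-a", b"seg-b", b"seg-d"])
second = replicate(source, target)   # later run: only the new segment is sent

print(first, second)  # 3 1
```

The first run moves every segment, which is why the initial transfer is often done with the devices co-located or via tape; the second run moves a single segment even though the new backup references three.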

Stefan_Davidov · ‎12-16-2009

Hello, harish 13,

I am currently evaluating Symantec Backup Exec System Recovery 2010 and I want to know whether any de-duplication is included in BESR at any of the layers you described in your article (quite useful :) )?

CraigV · ‎12-21-2009

Great article...quick question though:

When de-duplication tasks are moved to the client server, what sort of performance overhead does this add?

I cannot wait for BEWS 2010 to get released so that I can enable this on our file servers on the 34 sites I manage. It's going to save a lot of time during backups, a lot of cash for buying new tapes...now if I could only get that enabled on the file servers themselves so that I can free up space =)

NEERAJ_MEHTA · ‎12-22-2009

Please advise: how can I implement deduplication in our NetBackup environment?
My current setup is:
1. One NetBackup master server that is also the media server, 6.5.4 (W2K3).
2. 35-40 clients (all virtual machines) (Windows and Linux).
3. Total data size is 2 TB, and the policy retention period is 120 days.
4. Using an MSL 4048 tape library and a VLS 9000.

Please advise.

Many thanks.

NEERAJ_MEHTA · ‎01-12-2010

Please advise,

The HP VLS 9000 presents itself to NetBackup as a tape library and tapes.
So how will PureDisk be able to recognize the VLS 9000 as a disk?

Thanks,

Yogesh_Jadhav1 · ‎08-14-2010

Ask your vendor if they can provide the deduplication option. It should be possible.

Angelique28 · ‎10-07-2010

Deduplication in Backup Exec 2010 is a very good feature.

Reference:

https://www-secure.symantec.com/connect/articles/deduplication-option-and-unified-archiving-option-s...

Hope you find it informative as well.

Angel

Todd_D__Woodwar · ‎11-12-2011

You have a couple of options for deduplication with NetBackup.

If you wish to stick with NetBackup 6.5 (I highly recommend upgrading to 6.5.6), consider a separate PureDisk environment or a NetBackup 5000 series Appliance, and use the PureDisk Deduplication Option (PDDO).

Or you can upgrade your NetBackup 6.5 environment to version 7 and use the Media Server Deduplication Option (MSDO).
