De-Duplication Role:
The role of data de-duplication is to increase the amount of information that can be stored on storage appliances (disk arrays, filers) and to increase the effective amount of data that can be transmitted over networks.
Data deduplication is at its core a data-reduction technique: it systematically substitutes reference pointers for redundant variable-length blocks (or data segments) in a given dataset.
Data deduplication operates by segmenting a dataset (in a backup environment this is normally a stream of backup data) into blocks and writing those blocks to disk storage. To identify blocks in a transmitted stream, the de-duplication engine creates a digital signature - like a fingerprint - for each data segment, along with an index of the signatures for a given repository.
The index, which can be recreated from the stored data segments, provides the reference list to determine whether blocks already exist in a repository. The index is used to determine which data segments need to be stored and also which need to be copied during a replication operation.
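The segment/fingerprint/index mechanism described above can be sketched roughly as follows. Fixed-size segments, SHA-256 fingerprints, and a dict-based index are assumptions for illustration; real engines typically use variable-length segmentation:

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed fixed-size segments; real engines often vary segment length

def fingerprint(segment: bytes) -> str:
    # Digital signature ("fingerprint") of one data segment
    return hashlib.sha256(segment).hexdigest()

def store_stream(stream: bytes, index: dict) -> list:
    # Segment the stream; write only segments whose fingerprint is not
    # already in the repository index. Returns the reference list that
    # reconstructs the stream.
    refs = []
    for i in range(0, len(stream), SEGMENT_SIZE):
        segment = stream[i:i + SEGMENT_SIZE]
        fp = fingerprint(segment)
        if fp not in index:
            index[fp] = segment   # new segment: store it once
        refs.append(fp)           # duplicates become references only
    return refs
```

Storing the same backup stream a second time adds no new segments to the repository; only the (small) reference lists grow.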
– Replacement of duplicate data with references to a shared copy
– Distinction: whole-record level or sub-record level
- Whole-record level refers to file or object level
- Sub-record level
– Deduplication @ source reduces WAN traffic
– Allows full backup metadata collection
– Allows sub-file-level incremental data movement
– File Level
Usually the file is identified through its content, and multiple copies are deduplicated. If even a single bit in the file changes, it is treated as a new copy and stored again in its full size.
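That whole-file behaviour can be sketched minimally as follows (SHA-256 and an in-memory store are assumptions): identical files collapse to one stored copy, while a one-byte change forces a full new copy:

```python
import hashlib

file_store = {}  # content hash -> full file content

def dedupe_file(content: bytes) -> str:
    # The file is identified purely by its content hash; any change,
    # however small, yields a new hash and a full new stored copy.
    digest = hashlib.sha256(content).hexdigest()
    if digest not in file_store:
        file_store[digest] = content
    return digest
```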
– Block Level
Disk technologies allow identification at the block level: each unique block's content is stored only once, and multiple copies are referenced within a file system database. Deduplication is usually performed after the data has been stored on disk, in a post-process fashion.
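A post-process pass of this kind might look like the sketch below (block size and hashing scheme are assumptions): blocks already written to disk are scanned in the background, duplicates are collapsed, and the file system keeps references to the single stored copy:

```python
import hashlib

def post_process_dedupe(raw_blocks: list):
    # Background pass over blocks already on disk: keep each unique
    # block's content once, and replace every occurrence with a
    # reference (its hash) in the file system database.
    unique = {}   # block hash -> block content, stored once
    refs = []     # per-block reference list, in original order
    for block in raw_blocks:
        digest = hashlib.sha256(block).hexdigest()
        unique.setdefault(digest, block)
        refs.append(digest)
    return unique, refs
```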
In a Backup Environment, De-duplication Can Happen At:
At the source (client):
– Less impact on WAN
– Backups are moved/calculated by clients
– Reduced backup window
– Removes tape from remote locations
– Easy to implement
At the media server:
– Local LAN solution targets the data centre
– Scalable by # of Media Servers
– Offloads clients from dedupe
– Leverage the Media servers
– Storage compatibility
At the target appliance:
– Appliance model, usually storage dependent
– Less intelligent
– Dedupe limited to the appliance level
– No impact on backup solution
– Less scalable, questionable suitability for larger data centres
Data Deduplication Applied to Replication
Data deduplication makes the process of replicating backup data practical by reducing the bandwidth and cost needed to create and maintain duplicate datasets over networks. At a basic level, deduplication-enabled replication is similar to deduplication-enabled data stores.
Once two images of a backup data store are created, all that's required to keep the replica or target identical to the source is the periodic copying and movement of the new data segments added during each backup event, along with its metadata image, or namespace.
The replication process begins by copying all the data segments in one share or portion of a source appliance to an equivalent share or portion in a second, target appliance. Although this initial transfer can occur over a network, data volumes often make it more practical to temporarily co-locate the source and target devices to synchronize the datasets, or to transfer the initial datasets using tape.
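After the initial synchronization, each replication pass only needs to move the segments the target does not yet hold. A minimal sketch of that delta transfer (representing each repository as a fingerprint-to-segment dict is an assumption):

```python
import hashlib

def fingerprint(segment: bytes) -> str:
    return hashlib.sha256(segment).hexdigest()

def replicate(source: dict, target: dict) -> int:
    # Compare fingerprint indexes and copy only the missing segments;
    # this is the only data that crosses the network on each pass.
    missing = [fp for fp in source if fp not in target]
    for fp in missing:
        target[fp] = source[fp]
    return len(missing)
```

Segments already present on the target (from the initial sync or earlier passes) are never re-sent; only new fingerprints trigger a transfer.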