Enterprise Data Services Community Blog

Global Deduplication Myths

Alex_Sakaguchi

Don’t customers hate being misled?

I know I do. Sometimes it can be innocent…you know, maybe the salesperson wasn't as knowledgeable as he or she could have been. Or perhaps they were new. In any case, it behooves the customer to do some homework to make sure they are not being misled, innocently or otherwise.

 

Your homework is done.

I came across a situation recently where a customer said a vendor told them their solution could do global deduplication the same as Symantec, but cheaper. My first thought was, wow, that's a big deal. As you may know, the Symantec deduplication capabilities built into NetBackup and Backup Exec give customers the flexibility to dedupe at the client, server, or target, and can dedupe seamlessly across workloads like physical and virtual infrastructures (see the V-Ray video here for more info). On top of that, if the dedupe storage capacity on a single device is maxed out, you can add another to increase capacity and compute resources, and to the customer it still appears as a single dedupe storage pool – global deduplication.

Anyhow, the customer asked if this was true.  Quite honestly, I too needed to do some homework to answer that question…what I found out was pretty disturbing.

First off, the vendor in question was not using the term “global deduplication” correctly. What they were actually describing was plain old deduplication, nothing even bordering on global, which I’ll get to in a minute.

According to the vendor’s documentation, a customer would need to manually set a dedupe block size for all data sources in order to employ “global deduplication”. Furthermore, the default and recommended size was 128KB. For the record, global deduplication refers to the ability to deduplicate across dedupe storage devices so that no two devices contain the same blocks of data. Here’s a generic definition from TechTarget:

“Global data deduplication is a method of preventing redundant data when backing up data to multiple deduplication devices.”

What the vendor is saying is that you can have multiple data sources (like VMware data, file system data, databases, etc.) feeding into a single dedupe storage pool where the dedupe block size is set to 128KB, and those data sources will dedupe against one another. But that’s NOT global deduplication; that’s regular deduplication.
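
To make the distinction concrete, here is a minimal sketch of plain fixed-block deduplication. It is purely illustrative and not any vendor’s actual implementation: every backup stream is cut into fixed 128KB chunks, and a chunk is written only if its fingerprint isn’t already in the single shared pool.

```python
import hashlib

CHUNK_SIZE = 128 * 1024  # 128KB, the vendor's default/recommended block size

class DedupePool:
    def __init__(self):
        self.store = {}  # fingerprint -> chunk bytes

    def ingest(self, data: bytes) -> int:
        """Chunk one backup stream and keep only unique chunks.
        Returns the number of new chunks actually written."""
        new_chunks = 0
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.store:      # duplicates are skipped, not stored again
                self.store[fp] = chunk
                new_chunks += 1
        return new_chunks

pool = DedupePool()
print(pool.ingest(b"A" * CHUNK_SIZE * 3))  # e.g. file system data -> 1 unique chunk stored
print(pool.ingest(b"A" * CHUNK_SIZE * 3))  # e.g. VMware data with the same blocks -> 0 new chunks
```

Multiple data sources dedupe against one another here simply because they share one pool and one chunk size; nothing about this spans multiple storage devices.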

Global deduplication in this example comes into play when the storage capacity of our 128KB chunk-sized pool is reached and we need to stand up another. Can the customer see both of those storage devices as a single pool, with no data redundancies across them, or not? If the answer is no, then the vendor cannot provide global dedupe capabilities. And unfortunately, such was the case with our vendor in question.
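
For contrast, here is a hedged sketch of what the “global” part adds: several storage devices presented as one logical pool, with each fingerprint owned by exactly one device, so no block is ever stored twice across them. The route-by-hash scheme below is just one illustrative way to get that property; it is not a description of Symantec’s or anyone else’s internal design.

```python
import hashlib

class GlobalDedupePool:
    """Several devices behind one namespace; each fingerprint has exactly one owner."""

    def __init__(self, device_count: int):
        # Each "device" is just a dict here; in reality these would be
        # separate appliances pooled behind a single logical storage target.
        self.devices = [{} for _ in range(device_count)]

    def ingest(self, data: bytes, chunk_size: int = 128 * 1024) -> None:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            owner = self.devices[int(fp, 16) % len(self.devices)]  # one owner per fingerprint
            owner.setdefault(fp, chunk)   # stored once across the whole pool

pool = GlobalDedupePool(device_count=2)   # capacity ran out, so a second device was added
pool.ingest(b"B" * 128 * 1024 * 4)
# A given chunk can never end up on both devices, because its fingerprint always
# routes to the same owner -- that is the "no redundancy across devices" test.
```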

The interesting thing was that even though this inquiry started as a result of a question on comparative global dedupe capabilities, I uncovered some other points of information that may cause you to think twice when purchasing from this vendor.

I’ve organized these into the chart below for ease of understanding:

Data Source/Workload             Recommended Block Size
File systems                     128KB
Databases (smaller than 1TB)     128KB
VMware data                      32KB
Databases (1-5TB in size)        256KB
Databases (larger than 5TB)      512KB

 

As you can see above, the vendor is recommending those specific dedupe block sizes to maintain an optimal dedupe efficiency level for each data source.  What this means is that:

  1. If you want dedupe efficiency within data sources, you have to manually configure and manage multiple dedupe storage pools (that’s a lot of management overhead, by the way), and
  2. You’ll likely have duplicate data stored, because your VMware data at 32KB is not going to dedupe with your file system data at 128KB (illustrated in the sketch just after this list), and lastly
  3. If you go ahead and use the same block size (the 128KB the vendor recommends for their “global dedupe”), your dedupe efficiency is lost, because 128KB is only optimal for file systems and databases smaller than 1TB, not for anything else.
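
To make point 2 concrete, the snippet below chunks the very same data at 32KB for one pool and 128KB for another, and shows that the two fingerprint sets share nothing, so no cross-pool dedupe is possible. The data and sizes are arbitrary examples.

```python
import hashlib

def fingerprints(data: bytes, chunk_size: int) -> set:
    """Fixed-size chunking: hash every chunk_size-byte slice of the data."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }

data = bytes(range(256)) * 2048                   # 512KB of identical source data
vmware_pool = fingerprints(data, 32 * 1024)       # chunked at 32KB
filesystem_pool = fingerprints(data, 128 * 1024)  # chunked at 128KB

print(len(vmware_pool & filesystem_pool))         # 0 -- same data, zero shared fingerprints
```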

This is the problem with “content-aligned” deduplication. Because this particular vendor is unable to instead be “content-aware” and efficiently deduplicate source data without manual configuration of block sizes, there is certainly no hope for the vendor to claim global deduplication capabilities…unless the attempt is made to redefine the term.

A better way

With Symantec, the customer would not have to worry about this scenario at all. It doesn’t matter if the data source is a physical machine or a virtual one. It doesn’t matter if the database is large or small, or if it’s just file system data. Symantec is able to look deep into the backup stream, identify the blocks for which a copy is already stored, and store only the ones that are unique. There are no block size limitations or inefficiencies between policies. This means you get the best in dedupe storage efficiency with the lowest management overhead.
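
For readers who want a feel for what “content-aware” means in general, here is a generic sketch of content-defined (variable-length) chunking, a common technique in this space where chunk boundaries follow the data itself rather than a fixed block size. This is a teaching illustration only, with made-up parameters; it is not Symantec’s actual algorithm.

```python
import hashlib
import random

def content_defined_chunks(data: bytes, mask_bits: int = 11,
                           min_size: int = 512, max_size: int = 8192):
    """Yield variable-length chunks whose boundaries are chosen wherever a
    simple rolling value over the recent bytes hits a pattern, so an insertion
    upstream does not shift every later boundary the way fixed-size chunking
    does. Toy parameters, purely for illustration."""
    start, rolling = 0, 0
    mask = (1 << mask_bits) - 1
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            yield data[start:i + 1]
            start, rolling = i + 1, 0
    if start < len(data):
        yield data[start:]

random.seed(0)
block = bytes(random.getrandbits(8) for _ in range(200_000))
data = block + b"X" * 100 + block   # the same content twice, shifted by an insertion
chunks = list(content_defined_chunks(data))
unique = {hashlib.sha256(c).hexdigest() for c in chunks}
print(len(chunks), len(unique))     # boundaries resync, so most of the repeated chunks dedupe
```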

 

Symantec calls its approach end-to-end, intelligent deduplication because we can deliver data reduction at the source, at the media server, or even on target storage (via our OpenStorage API). The intelligence comes from content-awareness of the data stream, which is what makes the backups efficient. And of course, we deliver global deduplication capabilities.

 

More resources:

Symantec Deduplication

NetBackup Platform

Backup Exec Family

NetBackup Appliances

Published 12 years ago