In the data protection world, one number we frequently see and hear is the deduplication rate. We hear of dedupe rates ranging from 10:1 to 20:1 to 50:1. Recently, I heard someone say that 50:1 is 5 times better than 10:1. Their fuzzy math made me cringe, and I knew it was time to address this.

To clarify deduplication rates, we need to examine: 1) the factors that influence deduplication rates and 2) the math.

**Deduplication Factors**

Deduplication rates are like automobile miles per gallon (mpg): Your Results Will Vary. The factors that affect deduplication results are:

- Type of data (unstructured versus structured)
- Change rate of data (what percentage of the data changes)
- Frequency and type of backup (how often you back up the data, i.e. daily or weekly, fulls or incrementals)
- Retention (how long you keep the deduplicated data)

Any one of these factors can significantly impact the deduplication rate.
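To see how much these factors matter, here is a minimal sketch of a toy model, assuming daily full backups of a single data set; the function, its parameters, and all the numbers are hypothetical, chosen only to show how retention alone can swing the ratio:

```python
# Toy model (illustrative only): how change rate and retention can
# swing a dedupe ratio. All parameters below are hypothetical.

def dedupe_ratio(full_size_tb, daily_change_pct, retention_days):
    """Estimate a dedupe ratio for daily full backups.

    Logical data:  every full backup counted at its full size.
    Physical data: one baseline copy plus only the changed blocks
                   written on each subsequent day.
    """
    logical = full_size_tb * retention_days
    physical = (full_size_tb
                + full_size_tb * (daily_change_pct / 100) * (retention_days - 1))
    return logical / physical

# Same 10 TB data set, same 1% daily change -- only retention differs:
print(round(dedupe_ratio(10, 1, 7), 1))   # about one week of retention
print(round(dedupe_ratio(10, 1, 30), 1))  # about one month of retention
```

Under these made-up assumptions, stretching retention from a week to a month more than triples the ratio, without the product getting any "better" at deduplication.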

The challenge with comparing deduplication rates between vendors is that there is no standardized test. In other words, each vendor is free to base its number on whichever of the above variables gives it the best results. This is vastly different from mpg comparisons, where both the U.S. Environmental Protection Agency and the National Highway Traffic Safety Administration (NHTSA) have established standards for mpg calculations.

**Net result:** there is no standard metric for deduplication rates. Remember, your results will vary!

**The Math**

Let’s go back to the statement, “a 50:1 versus 10:1 dedupe rate is 5 times better”. Actually, the difference is only 8%. Let’s do the math:

| Dedupe Rate | Dedupe Percentage | Relative Difference (to 10:1) |
| --- | --- | --- |
| 10:1 | 90% (1-1/10) | — |
| 20:1 | 95% (1-1/20) | 5% (95% - 90%) |
| 25:1 | 96% (1-1/25) | 6% (96% - 90%) |
| 50:1 | 98% (1-1/50) | 8% (98% - 90%) |

As you can see, the difference between 50:1 and 10:1 is only 8%. If you compare numbers that are not at the extremes, the difference becomes negligible. This is why the deduplication rate should not be the sole decision criterion.
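The ratio-to-percentage conversion behind the table can be sketched in a few lines of Python:

```python
# A dedupe ratio of r:1 means r units of logical data are stored as
# 1 unit of physical data, so the fraction eliminated is 1 - 1/r.

def dedupe_percentage(ratio):
    """Fraction of data eliminated at a given r:1 dedupe ratio."""
    return 1 - 1 / ratio

baseline = dedupe_percentage(10)  # 90% at 10:1
for ratio in (10, 20, 25, 50):
    pct = dedupe_percentage(ratio)
    print(f"{ratio}:1 -> {pct:.0%} eliminated, "
          f"{pct - baseline:.0%} more than 10:1")
```

Because the savings curve flattens so quickly, quintupling the ratio from 10:1 to 50:1 only moves the data eliminated from 90% to 98%.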

**Net result:** dedupe rates between vendors are often very similar. Don’t get caught up in the dedupe rate hype.

**How to Reconcile Deduplication Rates?**

What should you do if you are being told one product's dedupe rate is significantly better than another's? Like a car, take the products for a test drive. Conduct a Proof of Concept using your data in your environment to determine what you can expect. Due to time and resource limitations, you may not arrive at absolute numbers, but it will give you a baseline for comparison.

Secondly, the deduplication rate is not the be-all and end-all. What is more important is the total data reduction of the solution; let’s call this data efficiency. Using the car analogy, there is no use having a car with a great published mpg if your driving habits are terrible (jackrabbit starts, racing to red lights), vehicle maintenance is poor (underinflated tires, overdue scheduled maintenance), and you are carrying extra weight (it’s January and you still have two sets of golf clubs and the team’s baseball equipment in the trunk). While the initial purchase was based on mpg, your driving habits are killing your miles per gallon. In other words, poor efficiency.

Let’s apply this thought to data: does your deduplication solution provide data efficiency? Can it deduplicate data at both the source and the target (for better utilization of CPU resources and network bandwidth)? Can it dedupe both physical and virtual machines (eliminating storing the same data twice)? And does it come with cost-effective software licensing?

**Net result:** focus on achieving data efficiency! This will drive the OpEx and CapEx savings of your solution.

While the data deduplication rate may be a buying criterion, it should not be THE buying criterion. Rather, the criteria should focus on data efficiency. This will ensure your IT and business needs are met while delivering the greatest ROI.