
Deduplication expectations...

I'm trying to understand what I should expect from deduplication during the first backup of servers. For example, say I back up a directory made up of Word documents, images, Excel spreadsheets, etc. Should I expect that first backup to show zero, or only a very small, dedupe percentage?

Another example: say my first backup of a directory contains three SQL Server backups (not using the SQL agent, just backing up the .BAK files SQL Server already wrote to disk), which are full backups from three consecutive days. The data in the BAK files has changed a little, but not a huge amount. Say each backup is 300MB, so three files totaling 900MB. Would that first backup consume roughly 900MB, or would dedupe across those three similar files mean only, say, 310MB of space is used in the dedupe folder?

What I'm trying to understand is whether the advantages of dedupe really only show up on subsequent backups of the same files, except in cases like a file server where multiple people have stored the exact same file in different directories, in which case even the first backup would be somewhat smaller.

Thanks ahead of time.


5 Replies

Hi,   It's kind of like: How

Hi,

It's kind of like asking: how long is a piece of string? There's no real answer to your question, because it depends on so many variables. Each person's environment is different, and some data will dedupe better than other data.

By default, the first backup always has the smallest dedupe ratio. The more you run the job, the better the ratio gets. The backups won't necessarily be faster, but the amount of data stored on each run will decrease as the dedupe database is built up, depending on how much the data changes.

Thanks!


But for instance should there

But should there be some kind of dedupe even on the first backup, however small?

For instance, I did a test backup of about 300MB consisting of an assortment of files, and got a dedupe ratio of 2.3%. Is a ratio like that perfectly normal?

Accepted Solution!

First pass backups are dismal

First pass backups are dismal when it comes to dedupe ratios. I get better compression rates with tape drives!

Do a second pass, though, and you'll immediately notice a fantastic dedupe rate.

You could also try enabling compression on the host via the PDCONF file and see what you get on top of the dedupe, though reporting on that metric will be difficult at best.

I will say I was extremely misled by Symantec about faster backup times with dedupe... far from it. Only EMC's Avamar seems to make a dramatic difference in backup times, but that's comparing a Honda to a Ferrari (price, performance, everything).



Dedupe compares data blocks

Dedupe compares data blocks of 128K (IIRC); if two blocks are the same, only one is kept. For example, if File 1 and File 2 somehow end up containing an identical data block, only one copy of that block is stored. So even on your first backup, some deduplication will take place. On subsequent backups, the dedupe ratio will obviously be much higher, because most of the data blocks will be unchanged.
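As a rough illustration (not the product's actual implementation; the 128K block size is just taken from the description above), fixed-size block dedupe can be sketched like this — the first pass stores nearly every block, while a second pass of the same files stores nothing new:

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # 128K blocks, per the description above

def block_hashes(data: bytes):
    """Split data into fixed-size blocks and fingerprint each one."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def dedupe_store(files, store=None):
    """Store only blocks whose fingerprint has not been seen before."""
    store = store if store is not None else {}
    new_blocks = 0
    for data in files:
        for i, h in enumerate(block_hashes(data)):
            if h not in store:
                store[h] = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
                new_blocks += 1
    return store, new_blocks

# First backup: two files that happen to share one identical 128K block
shared = b"A" * BLOCK_SIZE
file1 = shared + b"B" * BLOCK_SIZE
file2 = shared + b"C" * BLOCK_SIZE
store, stored = dedupe_store([file1, file2])
print(stored)  # 3 blocks stored instead of 4 -> small first-pass dedupe

# Second backup of the same files: every block matches the store
store, stored = dedupe_store([file1, file2], store)
print(stored)  # 0 new blocks -> the "fantastic" second-pass rate
```

This matches what people see in practice: the first pass only dedupes the few blocks that happen to repeat, while later passes dedupe almost everything.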


compression/encryption can be an issue

Depending on how they are dumping the SQL backups to disk (or any type of file, for that matter), a compression and/or encryption algorithm may have been applied first. It is important to remember that compressed data does not dedupe well, because a small change in the source can drastically affect the bit patterns in the compressed output.

So always remember that the stages need to flow in this order: deduplication -> compression -> encryption. All the vendors include compression as part of their advertised dedupe ratios. Also, don't believe the marketing claims about what dedupe ratios you will see (yes, some algorithms perform slightly better than others, but it REALLY depends on your data and how unique the data being created is).
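A quick way to see why the order matters (a toy sketch, using a small 4K block size for speed where real products use much larger blocks): two nearly identical raw files share almost all their blocks, but after each file is compressed independently, almost nothing lines up any more.

```python
import hashlib
import zlib

BLOCK = 4096  # small block size for the demo; real appliances use much larger blocks

def block_hashes(data: bytes):
    """Fingerprint fixed-size blocks so we can count how many two files share."""
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

# Two "daily full backups" that differ in only 10 of 20,000 fixed-width records
day1 = b"".join(b"record %08d\n" % i for i in range(20000))
day2 = bytearray(day1)
for i in range(100, 110):
    day2[i * 16:(i + 1) * 16] = b"CHANGE %08d\n" % i  # same 16-byte width
day2 = bytes(day2)

raw_shared = len(block_hashes(day1) & block_hashes(day2))
zip_shared = len(block_hashes(zlib.compress(day1)) & block_hashes(zlib.compress(day2)))

print(raw_shared)  # nearly all raw blocks are shared
print(zip_shared)  # compress first, and almost none are
```

The ten changed records touch only one raw block, so the raw files dedupe beautifully; but the compressed streams diverge from the change point onward, so compress-then-dedupe throws that sharing away.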

Also, databases do not normally produce phenomenal deduplication ratios (although it's not uncommon to see between 3:1 and 6:1, depending on the resource and post-compression), because most data is uniquely structured within the database itself. It's easier to get good dedupe ratios from unstructured data such as file servers.

If you can convince your DBAs, I always recommend using an agent to back up directly to the deduplication solution, so that the compression/dedupe/encryption stages are all handled by the backup software. It also eliminates one stage of I/O in your environment, since the backup doesn't have to be staged to disk first.

Finally, be fully aware that unique media-based data will never deduplicate against other files, and will only deduplicate against itself if it has already been backed up before. This type of data includes pictures, audio files, video files, scanned documents, etc. You'll get better bang for your buck using specialized compression algorithms on those types of files.
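To illustrate (a sketch using random bytes as a stand-in for already-compressed media such as JPEGs or MP3s, which look statistically similar), a general-purpose compressor gains almost nothing on high-entropy data, whereas repetitive office-style data shrinks dramatically:

```python
import os
import zlib

size = 256 * 1024
text = b"quarterly report line item\n" * (size // 27)  # repetitive office-style data
media = os.urandom(size)  # stand-in for already-compressed media (high entropy)

print(len(zlib.compress(text)) / len(text))    # a small fraction of the original
print(len(zlib.compress(media)) / len(media))  # roughly 1.0 -- no real gain
```

The same intuition applies to dedupe: media data has so little internal repetition that neither compression nor block-level deduplication finds anything to reuse, except against a previous backup of the identical file.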

 

Hope that helped!