Compression and de-duplication

Level 6

It's taken me a while, but this article gave me some insight into how de-duplication actually works:

It was a bit of a light-dawning moment. It also helped me understand server-side and client-side deduplication. With client side, I assume the hash calculation (per chunk) and the lookup are carried out on the client, and the chunk is sent to the media server only if it is not already there. With server side, the chunk is sent across the network anyway and it's the media server that calculates the hash. So if you've got a slow link to the media server (say, backing up over the internet), client-side deduplication is attractive because it really cuts down network traffic.
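
To make the client-side versus server-side difference concrete, here is a minimal Python sketch. It is only an illustration of the idea as I understand it, not how Backup Exec does it: `MediaServer`, `backup_server_side` and `backup_client_side` are made-up names, SHA-256 and fixed 64 KiB chunks are assumptions, and the small cost of sending the hashes themselves is ignored.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64 KiB chunks (assumed fixed-size chunking)


def split_chunks(data: bytes) -> list:
    """Cut the data into fixed-size chunks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


class MediaServer:
    """Hypothetical stand-in for the dedupe store on the media server."""

    def __init__(self) -> None:
        self.store = {}          # chunk hash -> chunk bytes
        self.bytes_received = 0  # chunk payload that crossed the network

    def has(self, digest: str) -> bool:
        return digest in self.store

    def put(self, chunk: bytes) -> str:
        self.bytes_received += len(chunk)
        digest = hashlib.sha256(chunk).hexdigest()
        self.store.setdefault(digest, chunk)  # keep only one copy per hash
        return digest


def backup_server_side(data: bytes, server: MediaServer) -> list:
    # Every chunk crosses the network; the server does the hashing.
    return [server.put(chunk) for chunk in split_chunks(data)]


def backup_client_side(data: bytes, server: MediaServer) -> list:
    # The client hashes first and sends only chunks the server lacks.
    recipe = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if not server.has(digest):
            server.put(chunk)
        recipe.append(digest)
    return recipe


def make_data(n_chunks: int) -> bytes:
    # Synthetic data in which every 64 KiB chunk is different.
    return b"".join(
        hashlib.sha256(str(i).encode()).digest() * (CHUNK_SIZE // 32)
        for i in range(n_chunks)
    )


monday = make_data(16)                               # 1 MiB "file"
tuesday = bytes([monday[0] ^ 0xFF]) + monday[1:]     # one byte changed

server_side = MediaServer()
backup_server_side(monday, server_side)
backup_server_side(tuesday, server_side)

client_side = MediaServer()
backup_client_side(monday, client_side)
backup_client_side(tuesday, client_side)

print("server-side bytes over the wire:", server_side.bytes_received)
print("client-side bytes over the wire:", client_side.bytes_received)
```

Both servers end up storing the same 17 unique chunks. The only difference is how much chunk data had to cross the link on the second night: the whole file again with server side, a single 64 KiB chunk with client side.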

A question about the job compression option though. Should software compression be enabled or disabled? I assume (maybe wrongly) that before de-duplication, software compression worked much like WinZip or LZW-style compression, simply reducing the space taken up in the BKF file.

I'm sure I read that with de-duplication, compression should be disabled, but I'm intrigued as to why. As long as the hash calculations are done before compression, would it not still work? Compression could be used to pack that 64k block into a smaller space and therefore would still have some merit. In fact, wouldn't the algorithm still work after compression? If one compresses a file one day and then does the same compression on the same file the next day, the compressed data stream is identical.

Ahh, as I type, I realise that I'm thinking mainly of flat files like Office documents. With a database file, one byte changing at the start can change the compressed data stream for the rest of the file. Okay, so compression before calculating the hash is out. But what about as the blocks are written to the deduplication folder? Do the 64k chunks get compressed automatically?

BTW - on a normal non-dedupe job, where is software compression carried out? On the client or media server?

Cheers, Rob.


Level 6

To enable compression when using deduplication, you must set it in the pd.conf file, which is held locally on both the client and the media server, in the installation directory.

Enabling it in the GUI, within the job properties, could negate the benefit of dedupe.

So, if using dedupe, enable software compression within the pd.conf file. If doing a traditional backup, use the compression settings within the GUI.

Employee Accredited Certified

Compression and de-duplication are close to the same thing, just that one applies inside files and the other applies across a set of files (including across multiple servers).

At a basic level, they both look for chunks/blocks/samples of data, and if the same chunk/block/sample occurs again it is stored only once, with an index recording where it is used so that the reverse process can be performed.

Hence compression on top of deduplication shouldn't find much to compress and could slow the process down while it tries to identify non-existent matching samples of data.

Note I am talking in very general terms here not specifically about BE.

Level 6

The algorithms of something like LZW and hash-based deduplication are pretty different beasts IMO. LZW builds its token dictionary on the fly, and that dictionary is implicit in the data stream itself: the decompressor rebuilds it as it reads. Deduplication stores the hash dictionary in a separate database from the 64k chunks. With a compressed file, all you need is the compressed file to reconstruct the original file. With a de-duplication system, you need the entire hash database as well. This makes a de-duplication system less resilient to errors than a simple bunch of compressed files. That said, backup-to-disk isn't a bunch of separate compressed files but a stream of them all bunched together in BKF files.
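
For anyone who wants to see the "self-contained stream" point in code, here is a minimal textbook-style LZW sketch in Python. It is purely illustrative (unbounded dictionary, codes kept as a list of integers rather than packed into bits) and has nothing to do with whatever Backup Exec uses internally. The function names are my own.

```python
def lzw_compress(data: bytes) -> list:
    """Return the list of LZW codes for data."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate               # keep extending the match
        else:
            codes.append(table[current])      # emit the longest known match
            table[candidate] = len(table)     # learn the new string
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list) -> bytes:
    """Rebuild the original bytes from the codes alone."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("corrupt LZW stream")
        out += entry
        table[len(table)] = previous + entry[:1]  # same entry the compressor added
        previous = entry
    return bytes(out)


sample = b"backup " * 200
codes = lzw_compress(sample)
restored = lzw_decompress(codes)
print(len(sample), "bytes ->", len(codes), "codes; round trip ok:", restored == sample)
```

Note that `lzw_decompress` is given nothing but the codes: no dictionary is passed in. A dedupe "recipe" of chunk hashes is the opposite case, since the hashes are useless without the separate chunk store they point into.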

So in a disaster recovery scenario, you have to bring back the entire de-duplication folder structure from an atomic backup. With BKF files, there is a bit more resilience. If you restore a bunch of BKF files from a tape store, you can catalog them and it'll do the best it can to build a catalog.

Anyway, that wasn't the main discussion, was it? :)

I thought about this some more and realised that client-side compression before de-duplication is a bad idea for database files. It's okay for atomic files like Word documents, where the entire file's contents tend to shift when a save is made. But for large database files (like SQL), a change of just one byte in the first 64k chunk can change the compressed stream for the rest of the file. So a single byte change at the start could mean the entire file needs backing up again, as all the 64k chunk hashes would change.
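
A rough way to check this for yourself, rather than take my word for it: the Python sketch below changes one byte at the start of some synthetic data and counts how many 64 KiB chunk hashes still match, once on the raw data and once on the whole-file compressed stream. It uses zlib (DEFLATE) as a stand-in because it ships with Python; that is not LZW and not necessarily what any backup product uses, and how far the compressed streams diverge will depend on the compressor and the data. SHA-256, fixed-size chunks and the test data are all assumptions.

```python
import hashlib
import random
import zlib

CHUNK_SIZE = 64 * 1024  # 64 KiB chunks (assumed fixed-size chunking)


def chunk_hashes(data: bytes) -> list:
    """SHA-256 of each fixed-size chunk, in order."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]


def shared_positions(a: list, b: list) -> int:
    """How many chunks at the same position have the same hash."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Synthetic, compressible "file": random words from a small vocabulary.
rng = random.Random(0)
words = [b"alpha", b"bravo", b"charlie", b"delta", b"echo", b"foxtrot"]
original = b" ".join(rng.choice(words) for _ in range(300_000))
modified = bytes([original[0] ^ 0xFF]) + original[1:]  # one byte changed

# Hash BEFORE compression: chunk the raw data.
raw_total = len(chunk_hashes(original))
raw_shared = shared_positions(chunk_hashes(original), chunk_hashes(modified))

# Hash AFTER compression: compress the whole file, then chunk the stream.
packed_original = zlib.compress(original)
packed_modified = zlib.compress(modified)
packed_total = len(chunk_hashes(packed_original))
packed_shared = shared_positions(
    chunk_hashes(packed_original), chunk_hashes(packed_modified)
)

print(f"raw:        {raw_shared} of {raw_total} chunks unchanged")
print(f"compressed: {packed_shared} of {packed_total} chunks unchanged")
```

On the raw data only the first chunk can differ, so everything after it dedupes. The compressed figure is the one to watch: any chunk of the compressed stream that no longer matches would have to be sent and stored again.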

However, when that 64k chunk comes to be stored in the database on the media server, something like LZW compression could be applied just to reduce space in the database. Sure, compressing individual 64k chunks isn't as effective as LZW compression of an entire 2GB file, but the compression ratio would most likely still be better than 1:1 for many file types, especially database BAK files from SQL, which tend to be very sparse.
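
Here is a minimal Python sketch of what I have in mind: hash the raw chunk first, then compress each chunk on its own as it goes into the store. Whether BE2010 does anything like this is exactly my question, so treat it as a thought experiment. `CompressedChunkStore` is a made-up class, zlib (DEFLATE) stands in for "something like LZW", and SHA-256 with fixed 64 KiB chunks is an assumption.

```python
import hashlib
import zlib

CHUNK_SIZE = 64 * 1024  # 64 KiB chunks (assumed fixed-size chunking)


class CompressedChunkStore:
    """Hypothetical dedupe store that compresses each chunk individually."""

    def __init__(self) -> None:
        self.chunks = {}  # hash of RAW chunk -> zlib-compressed chunk

    def add(self, data: bytes) -> list:
        """Store data and return its recipe (ordered list of chunk hashes)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()  # hash before compressing
            if digest not in self.chunks:
                self.chunks[digest] = zlib.compress(chunk)
            recipe.append(digest)
        return recipe

    def restore(self, recipe: list) -> bytes:
        return b"".join(zlib.decompress(self.chunks[d]) for d in recipe)

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


def sparse_chunk(i: int) -> bytes:
    # 4 KiB of chunk-specific data followed by zeros, like a sparse BAK file.
    return i.to_bytes(4, "big") * 1024 + bytes(CHUNK_SIZE - 4096)


monday = b"".join(sparse_chunk(i) for i in range(8))
tuesday = bytes([monday[0] ^ 0xFF]) + monday[1:]  # one byte changed

store = CompressedChunkStore()
recipe_monday = store.add(monday)
recipe_tuesday = store.add(tuesday)

print("unique chunks stored:", len(store.chunks))
print("raw size of one backup:", len(monday))
print("bytes held in the store:", store.stored_bytes())
```

Because the hash is taken on the uncompressed chunk, the one-byte change still costs only one new chunk (9 stored in total for the two backups), and the sparse chunks shrink a lot on disk. The trade-off is that each chunk is compressed in isolation, so the ratio will be worse than compressing the whole file in one go.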

Does anyone know if BE2010 compresses the 64k chunks as they are written to the database?

Cheers, Rob.