Forum Discussion

Matt_Harris
9 years ago

Support for deduplication of compressed files?

Good Day,

I have recently spent quite some time reviewing deduplication ratios on backups in NetBackup and looking at how we can change the way we store our data to achieve better backups.

In most cases, compressed files tend to be the worst offenders for poor deduplication. We have NetBackup for SQL Server, which works very well; however, there are many scenarios where an SQL developer will insist that their .bak file (which is often compressed to save space) must be backed up so they can restore from their own backups. It is a simple method that they know and trust.

The second scenario I am coming across is RAR and 7z files, where someone has had to compress files in order to free up the limited space they have. If these are added to the backup too, there goes my deduplication rate.

My question is: has there ever been any investigation into a NetBackup feature where compressed archives can be opened and backed up with their compression removed?

Unlike encryption, we know how compression is structured, written and read, and how to identify compression types such as LZMA1/2 and PPMd.
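To illustrate the identification point, here is a minimal Python sketch that recognises a few archive formats from their leading "magic" bytes. It is not anything NetBackup does; the `SIGNATURES` table and the `sniff_container` name are made up for this example. It only identifies the outer container (an empty ZIP, for instance, starts with a different signature), and it says nothing about which codec such as LZMA or PPMd was used inside.

```python
import gzip
import lzma
from typing import Optional

# Well-known leading signatures for a few container formats.
# Hypothetical table for illustration; a real product would need many more.
SIGNATURES = [
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"Rar!\x1a\x07", "rar"),
    (b"PK\x03\x04", "zip"),      # non-empty ZIP archives
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x1f\x8b", "gzip"),
]


def sniff_container(header: bytes) -> Optional[str]:
    """Return a format name if the first bytes match a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    # Only the first few bytes of each file need to be inspected.
    print(sniff_container(gzip.compress(b"hello")[:8]))
    print(sniff_container(lzma.compress(b"hello")[:8]))
    print(sniff_container(b"plain text"))
```

The check costs one short read per file, which is cheap next to fingerprinting the whole file, but deciding what to do after a match is the hard part.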

 

Any thoughts on the above would be gratefully received.

Matt.

  • Hi Matt,

     

    Interesting idea, but if you start to cater to everyone then we'll end up right where we started, with 10 different backup solutions. Your DBA needs to be told what the company policy is in terms of backup; it's not a choice. We all know they are drama queens but they need to suck it up. If you show them how easy it is to run their own restores it might help.

     

    As for the compressed files, it would be nice, but it is also going to add a whole bunch of unnecessary processing to the backup process.

  • Seems like an interesting idea, a "content handler" within a "stream handler"... but it is fraught with complexities. We'd be asking the NetBackup media server / OST stream handler to watch out for specific file name "types" (i.e. a trailing ".type"), because surely it couldn't waste time trying to match a signature at the start of every single file all the time? Then it would have to intercept and decompress... but what would the catalog handler do? Does it still catalog only the "original" source file name (e.g. the .zip), and not the internal file names within? Cataloging the internal names would, without much additional thought, seem plausible, with even the possible benefit of being able to restore a file from within a compressed file without having to restore the entire original compressed exo-file. But what about compressed files within compressed files? Where would it end: recursive handling?
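To make the catalog and recursion questions concrete, here is a small Python sketch using the standard `zipfile` module. It lists the member names of a ZIP, descends into nested ZIPs, and stops at a depth limit. The `catalog_members` and `make_zip` names, the `!/` path separator and the depth limit are all assumptions made for this example, not how the NetBackup catalog works.

```python
import io
import zipfile
from typing import Dict, List


def make_zip(members: Dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name -> payload mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buf.getvalue()


def catalog_members(data: bytes, prefix: str = "", max_depth: int = 3) -> List[str]:
    """Return catalog-style paths for all members, descending into nested ZIPs.

    max_depth bounds the recursion so archives inside archives cannot
    expand without limit.
    """
    entries: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            path = prefix + info.filename
            entries.append(path)
            if info.is_dir() or max_depth <= 1:
                continue
            payload = archive.read(info)
            if zipfile.is_zipfile(io.BytesIO(payload)):
                entries.extend(catalog_members(payload, path + "!/", max_depth - 1))
    return entries


if __name__ == "__main__":
    inner = make_zip({"report.txt": b"hello"})
    outer = make_zip({"inner.zip": inner, "notes.txt": b"x"})
    for entry in catalog_members(outer):
        print(entry)
```

Even in this toy form, every nested archive has to be fully read and decompressed just to be listed, which is exactly the extra processing being worried about.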

    I suspect the real problems would be:

    1) Cost of licensing the compression algorithms from multiple different vendors/developers.

    2) Maintaining compatibility - what you're asking for is a bit like asking Veritas to support small-scale vendor/developer file systems - but they have enough trouble already keeping up with the current most popular file systems, let alone adding multiple different compression vendor formats.

    3) How to re-construct the compressed file in exactly the same manner as before. There are lots of different vendors of ZIP, ARC and other compressed file formats, with different layouts, architectures, algorithms and ratios, and I bet they have a hard time keeping up with each other, because there's probably different padding and overhead meta-data in a compressed file depending upon which tool/utility/product wrote it. It would be horrible for customers if product A creates said compressed file, then NetBackup decompresses it on the fly, and then at restore time NetBackup re-creates the compressed file with the idiosyncrasies of product B, only to find that the customer's product A can no longer access the contents.
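Point 3 can be shown with nothing more than Python's standard `zlib` module. In this sketch the same payload is compressed at one level, decompressed, then recompressed at another level, standing in for "product A" and a restore-time recompressor. The `simulate_round_trip` function is hypothetical and only demonstrates that the restored bytes differ even though the contents are identical; it does not model any real archive tool.

```python
import hashlib
import zlib
from typing import Tuple


def simulate_round_trip(payload: bytes, original_level: int,
                        restore_level: int) -> Tuple[bytes, bytes]:
    """Compress, store decompressed, then recompress with other settings."""
    original = zlib.compress(payload, original_level)   # what "product A" wrote
    stored = zlib.decompress(original)                  # what the backup would keep
    restored = zlib.compress(stored, restore_level)     # what restore would re-create
    return original, restored


def digest(data: bytes) -> str:
    """Hex SHA-256 of the data, as a checksum-comparing user would see it."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    payload = b"NetBackup deduplication test data. " * 1000
    original, restored = simulate_round_trip(payload, 9, 1)
    print("same contents:", zlib.decompress(restored) == payload)
    print("same bytes:   ", original == restored)
    print(digest(original))
    print(digest(restored))
```

The contents survive, but the compressed file is no longer byte-for-byte the one that was backed up, so any checksum or signature taken over the original archive would no longer match. Reproducing it exactly would mean recording and replaying the original tool's settings and quirks.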

    I don't mean to shoot you down, and this is a discussion topic after all ;) IMO, it's too complex, too costly and too risky for not much gain. I have to ask myself: would potential new customers be so impressed that they would switch away from another backup vendor, and would it be enough to keep existing customers who are thinking of going elsewhere? Personally, I don't see this feature as a compelling architecture case to the degree and impact that something like SLP has been.

    It's still an interesting question though.

  • I recently asked about something kind of related to this... I was seeking a NetBackup configuration option such that MSDP does not attempt to fingerprint and de-dupe the content of certain file types, and so would just store them full-fat inside the MSDP space, i.e. in an effort to save a whole lot of buffering time and CPU effort attempting to de-dupe something that won't de-dupe, and to avoid wasting so much time, effort and catalog meta-data space on fingerprinting, hashing and collating millions upon billions of utterly unique 128KB data segments. I think a feature to skip de-dupe for certain file name types would be a bit easier to implement in code.
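A rough Python sketch of that idea follows, under loud assumptions: fixed 128 KiB segments, SHA-256 fingerprints and a suffix-based skip list are all choices made for this example, and none of it reflects how MSDP actually segments or fingerprints data. Seeded pseudo-random bytes stand in for already-compressed content.

```python
import hashlib
import random
from pathlib import PurePath

SEGMENT_SIZE = 128 * 1024                        # segment size mentioned above
SKIP_SUFFIXES = {".7z", ".rar", ".zip", ".gz"}   # hypothetical exclusion list


def should_fingerprint(file_name: str) -> bool:
    """Decide by file name suffix whether de-dupe is worth attempting."""
    return PurePath(file_name).suffix.lower() not in SKIP_SUFFIXES


def unique_segment_ratio(data: bytes, segment_size: int = SEGMENT_SIZE) -> float:
    """Fraction of fixed-size segments with a distinct fingerprint (1.0 = no de-dupe)."""
    segments = [data[i:i + segment_size] for i in range(0, len(data), segment_size)]
    if not segments:
        return 0.0
    fingerprints = {hashlib.sha256(segment).digest() for segment in segments}
    return len(fingerprints) / len(segments)


def pseudo_compressed(size: int, seed: int = 42) -> bytes:
    """Seeded random bytes, used here as a stand-in for compressed data."""
    return random.Random(seed).getrandbits(8 * size).to_bytes(size, "big")


if __name__ == "__main__":
    repetitive = b"A" * (8 * SEGMENT_SIZE)
    print("repetitive data:", unique_segment_ratio(repetitive))
    print("compressed-like:", unique_segment_ratio(pseudo_compressed(8 * SEGMENT_SIZE)))
    print("fingerprint db.bak?", should_fingerprint("db.bak"))
    print("fingerprint files.7z?", should_fingerprint("files.7z"))
```

The suffix check is trivial to implement but easy to fool: a compressed .bak file would still be fingerprinted, so a real option would probably need the name list to be configurable per policy.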