asg2ki
Level 4
6 years ago

Poor MSDP baseline deduplication ratio

Hi All,

Recently I decided to refresh all my backup data from scratch on my existing MSDP pools by expiring all previous images and taking a fresh copy of all data. I noticed that the deduplication ratio of the baseline is very poor, considering that the majority of the source data I'm protecting is ISO files containing very similar data. The source location of these ISO files is a standard Windows volume with deduplication enabled, where I get a ratio of 61%. By contrast, backing up the very same data to MSDP gave me only 18%, which is a pretty drastic difference.

I'm wondering where this huge difference is coming from. For the moment I assume it might be because Windows Deduplication works with a different block size than the default 128 KB segment size in MSDP. At the moment I don't intend to move to VLD settings in my NetBackup infrastructure due to CPU resource constraints, but I was thinking that modifying the SEGKSIZE or perhaps the PREFERRED_EXT_SEGKSIZE parameter might help.
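As a rough illustration of why segment size can matter, the sketch below measures how many bytes of a buffer duplicate an earlier fixed-size segment. It does not model MSDP or the SEGKSIZE parameter itself; `dedup_ratio`, `synthetic_images` and the 16 KB block layout are hypothetical stand-ins, chosen so that the repeated data occurs at a finer granularity than 128 KB.

```python
import hashlib
import random


def dedup_ratio(data: bytes, segment_size: int) -> float:
    """Fraction of bytes saved by storing each distinct fixed-size segment once."""
    seen = set()
    unique_bytes = 0
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        fingerprint = hashlib.sha256(segment).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_bytes += len(segment)
    return 1 - unique_bytes / len(data)


def synthetic_images(block_size: int = 16 * 1024, pool: int = 64,
                     blocks: int = 512, seed: int = 0) -> bytes:
    """Hypothetical test data: many blocks drawn from a small pool of random blocks."""
    rng = random.Random(seed)
    distinct = [rng.getrandbits(8 * block_size).to_bytes(block_size, "little")
                for _ in range(pool)]
    return b"".join(rng.choice(distinct) for _ in range(blocks))


if __name__ == "__main__":
    data = synthetic_images()
    for kb in (16, 128):
        print(f"{kb:>3} KB segments: {dedup_ratio(data, kb * 1024):.0%} duplicate")
```

On this artificial data the 16 KB run finds most of the repetition while the 128 KB run finds almost none, because a 128 KB segment only matches when eight consecutive blocks repeat in the same order. Real ISO content may behave quite differently.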

Do you guys have any experience with these parameters, and would you recommend modifying them for my scenario in order to achieve a better deduplication ratio?

What I'm also not sure about is whether it would make any difference if I split my single backup policy for those ISO files into multiple ones, or set the current policy to use multiple data streams. To the best of my knowledge there is no requirement for any additional data handlers when backing up data via the standard NBU client on a Windows machine, but I still find this 18% deduplication ratio very low even with a 128 KB segment size, so I was thinking that maybe NBU is not able to efficiently dedupe the baseline data if it goes through one big stream.

Thanks in advance for your help.

  • sdo
    6 years ago

    I was trying to find a structure definition for ISO, and found this:

    http://users.telenet.be/it3.consultants.bvba/handouts/ISO9960.html

    ...which says:

    "the ISO 9660 standard allows an optional extended attributes record (XAR) stored at the beginning of the file's extent"

    ...so I'm wondering if some/many/all file extents within any given typical modern ISO image actually begin with one or more binary metadata fields. If so, this would likely hurt FLD (Fixed Length Dedupe), because many blocks that VLD would recognise as similar would appear different to FLD.
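The boundary-shift effect described above can be sketched with a toy comparison. This is not MSDP's or Data Domain's chunker and says nothing about real ISO 9660 layouts; `fixed_chunks`, `variable_chunks`, the Gear-style rolling hash, the 4 KB fixed size and the 37-byte header are all assumptions picked for illustration.

```python
import hashlib
import random

_table_rng = random.Random(42)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]  # random value per byte


def fixed_chunks(data: bytes, size: int) -> list:
    """Fixed-length chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, min_size: int = 512, max_size: int = 8192,
                    mask: int = 0xFFE00000) -> list:
    """Toy content-defined chunking: cut where a rolling hash hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style hash: only the most recent 32 bytes influence h.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(base_chunks: list, new_chunks: list) -> float:
    """Fraction of the new data's bytes whose chunks already exist in the base."""
    known = {hashlib.sha256(c).digest() for c in base_chunks}
    dup = sum(len(c) for c in new_chunks if hashlib.sha256(c).digest() in known)
    return dup / sum(len(c) for c in new_chunks)


def compare_on_shifted_copy(payload_size: int = 256 * 1024,
                            header_size: int = 37, seed: int = 1) -> tuple:
    """Same payload twice, the second time preceded by a short header."""
    rng = random.Random(seed)
    payload = rng.getrandbits(8 * payload_size).to_bytes(payload_size, "little")
    header = rng.getrandbits(8 * header_size).to_bytes(header_size, "little")
    shifted = header + payload
    fld = shared_fraction(fixed_chunks(payload, 4096), fixed_chunks(shifted, 4096))
    vld = shared_fraction(variable_chunks(payload), variable_chunks(shifted))
    return fld, vld


if __name__ == "__main__":
    fld, vld = compare_on_shifted_copy()
    print(f"fixed-length match: {fld:.0%}, variable-length match: {vld:.0%}")
```

In this toy case a few extra bytes in front of otherwise identical data misalign every fixed-length chunk, while the content-defined boundaries realign shortly after the inserted bytes. Whether ISO images actually contain such shifted regions is exactly the open question here.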


  • I never thought that MSDP might be using just compression for the baseline and it would make perfect sense in this case.

    Strictly speaking, for the first backup any deduplication engine does compression only. This is not specific to NBU or DD. Consider your baseline backup as a large zip file with an index in front (the dedupe hashes), followed by the actual data chunks mixed with pointers to chunks that have the same hash. Yes, there is additional compression for the chunks. If you have ever had exposure to developing a compression algorithm, this is exactly how they work. Deduplication as a term applies to any subsequent backup, where a baseline with hashes is already present on the target. Maybe it is my terminology, but I don't see how baseline deduplication is (or could be) different from good archiving software.


    asg2ki wrote:

    I'll see what I can do to make a comparison with virtual DD, but I know for sure that physical DD is definitely breaking down the data into very small chunks, then it deduplicates them and finally it applies compression on top, where I suppose it still keeps a separate record of the hashing data before the compression itself so that the deduplication would be as efficient as possible for further data. BTW I already tested MSDP with variable deduplication (see my previous reply) and it didn't help the deduplication ratio at all.

    For the record, the deduplication engine used by Virtual and Physical DD is identical. The hashing is done only once as part of the SISL process.

    It's interesting to know that MSDP with variable length dedupe did not work any better. I expected it to improve the dedupe ratio.

  • Hello,

    I think there will be little or no experience on the forum with deduplicating ISO files, as these are not typical source files for backups. I think you must run your own tests to see whether another segment size or VLD helps (despite your doubts I recommend that you test VLD; it has helped me several times).

    And yes, my guess is that the Windows dedup process is better able to interpret ISO file contents and dedupe them.

    And again, is it necessary to back up these files? If these are install media, they are usually easily recreatable from hundreds of sources, or available on many websites, clouds, etc.

    Regards

    Michal

    • asg2ki
      Level 4

      Hi Michal,

      So I tested with a different segment size as well as with VLD, just out of curiosity. Frankly I didn't see much more CPU overhead during the tests, but it is also worth mentioning that VLD would probably work much better with smaller files than with large ISOs. After I expired the images of the ISOs previously taken via FLD, I made some test policies explicitly with VLD, but unfortunately the results were just the same. So apparently MSDP is not as effective at deduplicating the source data as Windows Dedup or a backup appliance such as DataDomain. That is understandable, especially in the latter case, since it uses block sizes between 4 KB and 12 KB, which of course makes the entire deduplication process extremely effective. With MSDP in my case, a particular set of similar ISO files is deduplicated initially to approximately 6 to 10%, which is relatively low. Of course the later backups go up to a 99 - 100% dedup ratio, which is perfectly normal and expected in this case, but I would still have expected MSDP to do a better job in the initial dedup phase.

      Anyway, I decided to run a further test and check what the results would be if I ran Windows Deduplication over the entire MSDP target where I backed up the files as well. Currently I'm running a simple evaluation via the built-in "ddpeval" tool, so I'll post the results once I have them. If I get a high deduplication ratio there, it will be a clear sign that MSDP is not dealing well with deduplication within the source data itself, only with subsequent copies of the same data. We will see about that soon enough.

      As for backing up ISO files, trust me, there are multiple scenarios where such data has to be protected, even within large business environments, and not least replicated to off-site locations (depending on the business requirements). For the moment I'm using the ISO scenario just as a test in my own lab, and apparently it makes a very good use case for researching the limitations of MSDP, even though I've been using such pools for years now. I didn't expect to see such a drastic "waste" of storage resources and started to get suspicious when one of my test pools became full in practically no time. I never had such issues in the past with DataDomain, but the technology behind the two products is quite different after all.

      Anyway I'll post the "ddpeval" results probably tomorrow.

      • asg2ki
        Level 4

        Ok, so "ddpeval" has finished its job and I can definitely see an additional 12% deduplication possibility there (almost 600 GB out of 4.79 TB), although I was expecting a bit more to be honest. The difference probably comes from the way MSDP makes chunks out of the big files, which is supposedly skewing the total percentage by a big amount. Anyway, it is now very clear to me that MSDP simply doesn't do a good job of deduplicating the source files, probably due to its internal mechanism, so I guess I'll have to live with that.
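For reference, the 12% figure follows directly from the two numbers quoted. A quick check, where the helper name `extra_dedup_percent` is hypothetical and both decimal and binary terabytes are tried because the post does not say which unit the tool reports:

```python
def extra_dedup_percent(savable_gb: float, total_tb: float, gb_per_tb: int) -> float:
    """Additional savings as a percentage of the total stored data."""
    return 100 * savable_gb / (total_tb * gb_per_tb)


if __name__ == "__main__":
    for gb_per_tb in (1000, 1024):
        pct = extra_dedup_percent(600, 4.79, gb_per_tb)
        print(f"1 TB = {gb_per_tb} GB: {pct:.1f}%")
```

Either convention lands between 12% and 13%, so the rounded figure is consistent with "almost 600 GB out of 4.79 TB".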

        I also tested whether a policy made up of separate streams per folder would make any difference to the ratio, and the answer is confirmed to be negative, so MSDP is just not efficient enough at storing initial data compared to other deduplicators. I haven't seen much difference with VLD either, even though I set the minimum and maximum segment sizes to 4 KB and 16 KB respectively. I guess MSDP has plenty of room to evolve, but on the other hand it does a pretty good job of deduplicating repeating data and replicating the changes via AIR in a very efficient way.

  • It could be worth checking whether you have a recent NBU version where variable-size deduplication can be switched on. This might help. You can also download an eval version of Data Domain Virtual Edition and dump those files there to see whether it can do the initial compression better. This can be used as a benchmark, as Data Domain usually comes out on top when comparing deduplication efficiency.

    If the numbers are comparable, this may just mean your data has only so much commonality and can't be compressed further. I hope there is a universal understanding here that baseline deduplication is just compression?

    • asg2ki
      Level 4

      Ahhhh.... that might be it.

      I never thought that MSDP might be using just compression for the baseline and it would make perfect sense in this case. I'll see what I can do to make a comparison with virtual DD, but I know for sure that physical DD is definitely breaking down the data into very small chunks, then it deduplicates them and finally it applies compression on top, where I suppose it still keeps a separate record of the hashing data before the compression itself so that the deduplication would be as efficient as possible for further data. BTW I already tested MSDP with variable deduplication (see my previous reply) and it didn't help the deduplication ratio at all.

      Can you please provide me with a reference link where it is stated that MSDP is using just compression over the baseline?

      • davidmoline
        Level 6

        The baseline description is slightly misleading. MSDP, much like other deduplication engines, segments the incoming data into manageable chunks (the default is 128 KB with FLD) and creates a fingerprint for each chunk. The chunk is then compressed (and potentially encrypted if that is enabled). As new data comes in and is chunked, each fingerprint is compared to the existing fingerprints, and the chunk is either saved (and compressed) or discarded after recording its details. As the initial backup will deduplicate the least, it can be thought of as compression only (it will still deduplicate to some extent, depending on the data type).
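The segment, fingerprint, compress and compare steps described above can be sketched as a tiny in-memory store. This is only an illustration of the general technique, not MSDP's implementation: the class `ToyDedupStore`, the use of SHA-256 as the fingerprint and zlib as the compressor are assumptions, and encryption is left out.

```python
import hashlib
import zlib

SEGMENT_SIZE = 128 * 1024  # the default FLD segment size mentioned above


class ToyDedupStore:
    """Hypothetical fingerprint-indexed segment store."""

    def __init__(self):
        self.segments = {}  # fingerprint -> compressed segment

    def backup(self, data: bytes) -> tuple:
        """Store data; return (recipe, bytes newly written after compression)."""
        recipe, new_bytes = [], 0
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.segments:
                # Unknown segment: compress and keep it.
                compressed = zlib.compress(segment)
                self.segments[fingerprint] = compressed
                new_bytes += len(compressed)
            # Known or not, the backup only records a reference to the segment.
            recipe.append(fingerprint)
        return recipe, new_bytes

    def restore(self, recipe: list) -> bytes:
        return b"".join(zlib.decompress(self.segments[fp]) for fp in recipe)


def sample_data(segments: int = 8) -> bytes:
    """Distinct but highly compressible segments (illustrative test data)."""
    parts = []
    for i in range(segments):
        pattern = f"segment {i} ".encode()
        parts.append((pattern * (SEGMENT_SIZE // len(pattern) + 1))[:SEGMENT_SIZE])
    return b"".join(parts)


if __name__ == "__main__":
    store = ToyDedupStore()
    data = sample_data()
    _, first = store.backup(data)
    recipe, second = store.backup(data)
    print(f"input {len(data)} bytes, first backup wrote {first}, second wrote {second}")
    print("restore ok:", store.restore(recipe) == data)
```

In this sketch the first backup contains no repeated segments, so all of its savings come from compression, while the second backup of the same data writes nothing new. That matches the pattern reported in the thread of a modest baseline ratio followed by 99 - 100% on later backups.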

        Data Domain typically doesn't understand what the incoming data is (to the DD it is simply a data stream). That is why it uses VLD so that it can get reasonable deduplication. MSDP on the other hand understands most of the data streams coming in (and has stream handlers for these common types), so the FLD generally works better, and the additional overhead for VLD is not required. 

        I'm curious how you are backing up the ISOs in the first place. You mention they are hosted on a Windows dedupe volume: is the data being rehydrated on the way to MSDP? How is your policy set up for this backup? Also, what NetBackup version are you using for this test?

        Generally all deduplication performs about the same in the long term. You get some minor differences between vendors, but overall it works out about even.

        Have a look at this article to see if there is anything there that may help: https://www.veritas.com/support/en_US/article.100042164