
Poor MSDP baseline deduplication ratio

asg2ki
Level 4

Hi All,

Recently I decided to refresh all my backup data from scratch on my existing MSDP pools by expiring all previous images and taking a fresh copy of everything. I noticed that the deduplication ratio of the baseline is very poor, considering that the majority of the source data I'm protecting is ISO files containing very similar data. The source location of these ISO files is a standard Windows volume with deduplication enabled, where I get a ratio of 61%. By contrast, backing up the very same data to MSDP gave me only 18%, which is a pretty drastic difference.

I'm wondering where this huge difference is coming from, and for the moment I assume it might be due to the fact that Windows Deduplication works with a different block size for its deduplication activities than the default 128K segment size in MSDP. At the moment I don't intend to move to VLD settings in my NetBackup infrastructure due to CPU resource constraints, but I was thinking that modifying the SEGKSIZE or perhaps the PREFERRED_EXT_SEGKSIZE parameters could help the situation.
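For reference, this is roughly how those knobs look in pd.conf (on a Linux media server it sits under /usr/openv/lib/ost-plugins/pd.conf, on a Windows client under the ost-plugins folder of the NetBackup install). The values below are placeholders to show the syntax only, not recommendations - please check the parameter names and defaults in the Deduplication Guide for your NBU version:

# pd.conf excerpt - illustrative only, values are placeholders
SEGKSIZE = 128                    # fixed segment size in KB (128 is the documented default)
PREFERRED_EXT_SEGKSIZE = iso:64   # optional per-extension override, "ext:KB" pairs separated by commas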

Do you have any experience with these parameters, and would you recommend modifying them in my scenario in order to achieve a better deduplication ratio?

Also, I'm not sure whether it would make any difference if I split my single backup policy for those ISO files into multiple ones, or if I set the current policy to use multiple data streams (see the sketch below). To the best of my knowledge there is no requirement for any additional data handlers when backing up data via a standard NBU client on a Windows machine, but I still find this 18% deduplication ratio very low even with a 128K segment size, so I was thinking that maybe NBU is not able to efficiently dedupe the baseline data if it goes through one big stream.
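Just to clarify what I mean by multiple streams: it would simply be ticking "Allow multiple data streams" in the policy attributes and splitting the backup selections with NEW_STREAM directives, roughly like this (the folder names are made up):

NEW_STREAM
E:\ISO\Set1
NEW_STREAM
E:\ISO\Set2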

Thanks in advance for your help.


13 REPLIES

Michal_Mikulik1
Moderator
Partner    VIP    Accredited Certified

Hello,

I think there will be little or no experience on the forum with deduplicating ISO files - they are not typical source files for backups. I think you'll have to run your own tests to see whether a different segment size or VLD helps (despite your doubts, I recommend you test VLD; it has helped me several times).

And yes, my guess is that the Windows dedup process is able to interpret ISO file contents better and dedupe them.

And again, is it necessary to back up these files at all? If they are install media, they are usually easy to recreate from hundreds of sources, or available on many websites/clouds etc.

Regards

Michal

 

Hi Michal,

So I tested with a different segment size as well as with VLD, just out of curiosity. Frankly I didn't see much more CPU overhead during the tests, though it's worth mentioning that VLD would probably work much better with smaller files than with large ISO's. Anyway, after I expired the previously taken images of the ISO's (which were taken via FLD), I made some test policies explicitly with VLD, but unfortunately the results were just the same. So apparently MSDP is not as effective at deduplicating the source data as either Windows Dedup or a backup appliance such as DataDomain - which is understandable, especially in the latter case, since DD uses block sizes between 4KB and 12KB, and that of course makes the whole deduplication process extremely effective. With MSDP in my case, a particular set of similar ISO files initially deduplicates to approximately 6-10%, which is relatively low. Of course the later backups go up to a 99-100% dedup ratio, which is perfectly normal and expected, but I would still have expected MSDP to do a better job in the initial dedup phase.

Anyway, I decided to run a further test and see what the results would be if I ran Windows Deduplication over the entire MSDP target where I backed up the files. Currently I'm running a simple evaluation with the built-in "ddpeval" tool, so I'll post the results once I have them. If I get a high deduplication ratio there, it will be a clear sign that MSDP is not dealing well with deduplicating the source data, only with the subsequent copies of the same data. We will see about that soon enough.
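For anyone who wants to reproduce this, the evaluation is just a matter of pointing the tool (shipped with the Windows Data Deduplication feature) at the folder - the path below is made up:

DDPEval.exe E:\MSDP\data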

As for backing up ISO files, trust me, there are multiple scenarios where such data has to be protected even within large business environments, and not least replicated to off-site locations (depending on the business requirements). For the moment I'm using the ISO scenario just as a test in my own lab, and apparently it makes a very good use case for researching the limitations of MSDP, even though I've been using such pools for years now. I somewhat didn't expect to see such a drastic "waste" of storage resources and started to get suspicious when one of my test pools became full in pretty much no time. I never had such issues in the past with DataDomain, but then the technology behind the two products is quite different after all.

Anyway I'll post the "ddpeval" results probably tomorrow.

Ok so the "ddpeval" has finished its job and I can definitely see additional 12% deduplication possibility there (almost 600 GB out of 4.79 TB) although I was expecting a bit more to be honest. Probably the difference is coming out of the way MSDP is making chunks out of the big files which is supposedly skewing the total percentage by big amount. Anyway it is now very much clear to me that MSDP simply doesn't do good job on deduplicating the source files probably due to its internal mechanism so I guess I'll have to live with that.

I also tested whether a policy made up of separate streams per folder would make any difference to the ratio, but that is now confirmed to be negative, so MSDP is just not efficient enough at storing the initial data compared to other deduplicators. I haven't seen much difference with VLD either, even though I set the minimum and maximum segment sizes to 4KB and 16KB respectively. I guess MSDP has plenty of room to evolve, but on the other hand it does a pretty good job of deduplicating repeating data and replicating the changes via AIR in a very efficient way.

sdo
Moderator
Partner    VIP    Certified

MSDP opportunistically retains awareness of previously deduped data even after image expiry, i.e. old, stale, expired deduped data has to be expunged from the underlying containers before the space is truly free - and I'm wondering if this is skewing the initial dedupe result for a supposedly empty MSDP.

I wonder what your dedupe results would be if you were to use a brand new, truly empty MSDP.

Definitely not my case, since I ran "crcontrol --processqueue" multiple times against the target MSDP pool before I took the next backups. Otherwise I would have seen a much higher dedup ratio on the second attempt at the initial backups. My MSDP pool was behaving pretty much as if it were newly provisioned, but thanks for the thought anyway.
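For completeness, this is roughly what I ran between the test runs on the media server (the binary lives under /usr/openv/pdde/pdcr/bin; option names quoted from memory, so please verify them against the MSDP documentation for your version):

crcontrol --processqueue    # process the transaction queue so expired segments get reclaimed
crcontrol --queueinfo       # check that the queue has drained
crcontrol --dsstat          # look at the data store stats to confirm the space is really free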

Mouse
Moderator
Partner    VIP    Accredited Certified

One thing that could be worth checking, if you have a recent NBU version, is switching on variable-length deduplication. This might help. You can also download an eval version of Data Domain Virtual Edition and dump those files there to see whether it does the initial compression better - it can serve as a benchmark, as Data Domain usually comes out on top when comparing deduplication efficiency.

If the numbers are comparable, it may simply mean your data has only so much commonality and can't be compressed further. I hope there is a universal understanding here that baseline deduplication is essentially just compression?

Ahhhh.... that might be it.

I never thought that MSDP might be using just compression for the baseline, and it would make perfect sense in this case. I'll see what I can do to make a comparison with virtual DD, but I know for sure that physical DD definitely breaks the data down into very small chunks, then deduplicates them, and finally applies compression on top, where I suppose it still keeps a separate record of the hashes from before the compression so that deduplication of further data is as efficient as possible. BTW, I already tested MSDP with variable-length deduplication (see my previous reply) and it didn't help the deduplication ratio at all.

Can you please provide me with a reference link where it is stated that MSDP uses just compression for the baseline?

The baseline description is slightly misleading. MSDP, much like other deduplication engines, segments the incoming data into manageable chunks (the default is 128 KB with FLD) and creates a fingerprint for each chunk. The chunk is then compressed (and potentially encrypted, if that is enabled). As new data comes in and is chunked, each fingerprint is compared to the existing fingerprints, and the data is either saved (and compressed) or discarded after recording its details. Since the initial backup deduplicates the least, it can be thought of as compression only (although it will still deduplicate to some extent, depending on the data type).
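Purely as an illustration of that flow (a toy sketch in Python, nothing to do with how MSDP is actually implemented internally), the chunk/fingerprint/compress-or-discard loop looks roughly like this:

import hashlib, zlib

SEG = 128 * 1024  # 128 KB fixed segment, matching the default FLD size mentioned above

def ingest(stream: bytes, store: dict) -> int:
    """Chunk, fingerprint, then compress-and-keep or discard; returns bytes actually written."""
    written = 0
    for off in range(0, len(stream), SEG):
        seg = stream[off:off + SEG]
        fp = hashlib.sha256(seg).hexdigest()   # fingerprint of the segment
        if fp not in store:                    # unseen fingerprint -> compress and store the segment
            store[fp] = zlib.compress(seg)
            written += len(store[fp])
        # seen fingerprint -> only a reference to the existing segment is recorded
    return written

store = {}
baseline = b"A" * (4 * SEG) + b"B" * (4 * SEG)            # first backup: dedupes only against itself
print("baseline bytes stored:", ingest(baseline, store))
print("rerun bytes stored   :", ingest(baseline, store))  # everything is a duplicate now

Even the baseline stores less than the raw data here, but only because the toy data repeats within itself and compresses well - which matches the "it will still deduplicate to some extent" caveat above.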

Data Domain typically doesn't understand what the incoming data is (to DD it is simply a data stream). That is why it uses VLD, so that it can still get reasonable deduplication. MSDP, on the other hand, understands most of the data streams coming in (and has stream handlers for the common types), so FLD generally works better and the additional overhead of VLD is not required.

I'm curious how you are backing up the ISO's in the first place - you mention they are hosted on a Windows dedupe volume - is the data being rehydrated on the way to MSDP? How is your policy set up for this backup? Also, what NetBackup version are you using for this test?

Generally, all deduplication performs about the same in the long term - you get some minor differences between vendors, but overall it works out about even.

Have a look at this article to see if there is anything there that may help: https://www.veritas.com/support/en_US/article.100042164

 

sdo
Moderator
Partner    VIP    Certified

I was trying to find a structure definition for ISO, and found this:

http://users.telenet.be/it3.consultants.bvba/handouts/ISO9960.html

...which says:

"the ISO 9660 standard allows an optional extended attributes record (XAR) stored at the beginning of the file's extent"

...so I'm wondering if some/many/all file extents within any given typical modern ISO image actually begin with one or more binary metadata fields, and if so, this would likely hurt FLD (Fixed Length Dedupe), because many blocks that would look similar to VLD would actually appear different to FLD.
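A quick toy demo of that boundary-shift effect (plain Python, nothing to do with MSDP internals; the 16-byte "XAR" header below is just a made-up stand-in): a small field prepended to otherwise identical data defeats fixed-length chunking, while a content-defined (VLD-style) chunker realigns after the first chunk.

import hashlib, random

def fixed_chunks(data, size=2048):
    # fixed-length chunking: boundaries depend only on the offset
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask=0x03FF, window=16, min_len=512):
    # toy content-defined chunking: cut where a hash of the trailing window hits a target value
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start >= min_len:
            h = int.from_bytes(hashlib.md5(data[i - window:i]).digest()[:4], "big")
            if h & mask == 0:
                chunks.append(data[start:i])
                start = i
    chunks.append(data[start:])
    return chunks

random.seed(0)
payload = bytes(random.getrandbits(8) for _ in range(64 * 1024))
image_a = payload
image_b = b"XAR-HEADER-00042" + payload   # same content, 16 bytes of metadata in front

def common(chunker):
    seen = {hashlib.sha256(c).digest() for c in chunker(image_a)}
    hashes = [hashlib.sha256(c).digest() for c in chunker(image_b)]
    return sum(h in seen for h in hashes), len(hashes)

print("fixed-length   : %d of %d chunks shared" % common(fixed_chunks))
print("content-defined: %d of %d chunks shared" % common(cdc_chunks))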

I'm currently using NBU 8.2 and my policy is a very simple one. It protects a folder that contains a bunch of ISO's and subfolders. The ISO's are rehydrated on their way to the MSDP pool and then processed as if they were in their original form, so I'm not using the "optimized backup of Windows dedup" feature, just the regular file reading method. NBU's optimized backup of Windows dedup volumes has its own drawbacks/limitations with both backups and restores, so I'd rather stay away from that feature for now. The only "feature" that I'm using on the source client is client-side deduplication, and I assume it shouldn't make any difference whether it is turned on or off.

In the long run I do of course see nearly 100% deduplication on the ISO files during the regular full and rescan-based backup jobs, but I was wondering why MSDP on its own doesn't dedupe the initial set of ISO's as well as Windows Dedup does. FLD seems to work well enough for me, but just out of curiosity I tried VLD as well, to simulate something closer to DD block sizes, unfortunately without much change in the deduplication ratio. Anyway, I suppose the MSDP deduplication engine is just not that good at optimizing the initial source data. It's great for repeated backups of the same data, but I still find the difference too big in comparison with other engines.

Mouse
Moderator
Partner    VIP    Accredited Certified

...so I'm wondering if some/many/all file extents within any given typical modern ISO image actually begin with one or more binary metadata fields, and if so, this would likely hurt FLD (Fixed Length Dedupe), because many blocks that would look similar to VLD would actually appear different to FLD.


Great observation, and this is exactly why the stream handler approach employed by NBU doesn't really work here - for it to work with ISO files, they would have to look like a file system to the stream handler ("mounted"); otherwise they are treated as a normal file and streamed as-is, causing shifts in the fixed blocks and yielding a poor compression rate. The same problem affects databases that are actually sets of blobs, a case in point being Exchange, and SQL used by SharePoint: a stream handler can detect the same blobs during full backups ("mounted database"), but it is completely useless when backing up transaction logs containing exactly the same data (and the same blobs), or when a hardware snapshot is used.

Mouse
Moderator
Partner    VIP    Accredited Certified

I never thought that MSDP might be using just compression for the baseline and it would make perfect sense in this case.

Strictly speaking, any deduplication engine does compression only for the first backup. This is not specific to NBU or DD. Consider your baseline backup as a large zip file with an index in front (the dedupe hashes), followed by the actual data chunks mixed with pointers to chunks that have the same hash. Yes, there is additional compression of the chunks. If you have ever had exposure to developing a compression algorithm, this is exactly how they work. Deduplication as a term applies to any subsequent backup, where a baseline with hashes is already present on the target. Maybe it is my terminology, but I don't see how baseline deduplication is (or could be) different from good archiving software.


@asg2ki wrote:

I'll see what I can do to make a comparison with virtual DD, but I know for sure that physical DD definitely breaks the data down into very small chunks, then deduplicates them, and finally applies compression on top, where I suppose it still keeps a separate record of the hashes from before the compression so that deduplication of further data is as efficient as possible. BTW, I already tested MSDP with variable-length deduplication (see my previous reply) and it didn't help the deduplication ratio at all.

For the record, the deduplication engine used by Virtual and Physical DD is identical. The hashing is done only once as part of the SISL process.

It's interesting to know that MSDP with variable-length dedupe did not work as well. I expected it to improve the dedupe ratio.

Well... I've used DD in the past (it was the 960 model at the time) and in my experience it did a much better job of saving space even with the initial source data. I can't remember exactly, but it might have been an Oracle DB, which of course gives a completely different compression and deduplication ratio since it's a different type of data and constantly changes - but in fact the saving factor was pretty good on both the initial and the subsequent backups.

Anyway, I accept that MSDP might be using just compression at the beginning. I would still have expected the initial set of ISO's to deduplicate much better, though, because as I mentioned many of them contain very similar data - but then what sdo mentioned also makes perfect sense. I would have expected VLD to help the situation, but unfortunately it didn't, so perhaps there is something else I'm missing from the whole picture.

For the record, I wanted to run a small test with Virtual DD, but unfortunately it is not available for download anymore (or at least EMC doesn't give me any links), so I guess I'll just accept the situation as it is and plan my storage capacity based on the outcome of this discussion.

Many thanks for the thoughts and suggestions - I really appreciate your help, guys. Maybe if I get a spare moment I'll give this scenario another try with a different dedupe engine (OpenDedup, perhaps).