Poor MSDP baseline deduplication ratio
Hi All,
Recently I decided to refresh all my backup data from scratch on my existing MSDP pools by expiring all previous images and taking a fresh copy of all data. I noticed that the deduplication ratio of the baseline is very poor, considering that the majority of the source data I'm protecting is ISO files containing very similar data. The source location of these ISO files is a standard Windows volume with deduplication enabled, where I get a ratio of 61%. In contrast, backing up the very same data to MSDP gave me only 18%, which is a pretty drastic difference.
I'm wondering where this huge difference is coming from. For the moment I assume it might be because Windows Deduplication works with a different block size for its deduplication activities, as opposed to the default 128k segment size in MSDP. At the moment I don't intend to proceed with VLD settings in my NetBackup infrastructure due to CPU resource constraints, but I was thinking that modifying the SEGKSIZE or perhaps the PREFERRED_EXT_SEGKSIZE parameters might help the situation.
Do you guys have any experience with these parameters, and would you recommend modifying them for my scenario in order to achieve a better deduplication ratio?
Also, what I'm not sure about is whether it would make any difference if I split my single backup policy for those ISO files into multiple ones, or if I set the current policy to use multiple data streams. To the best of my knowledge there is no requirement for any additional data handlers when backing up data via the standard NBU client on a Windows machine, but I still find this 18% deduplication ratio very low even with a 128K segment size, so I was thinking that maybe NBU is not able to efficiently dedupe the baseline data if it goes through one big stream.
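To make the block-size theory easier to test offline, here is a minimal Python sketch that estimates how much a fixed-length scheme would save at different segment sizes. It does not model MSDP or Windows Deduplication internals, and it says nothing about what SEGKSIZE actually does on a live system. The function names (`fixed_dedup_ratio`, `make_sample`) and the synthetic sample data are purely illustrative assumptions; point `fixed_dedup_ratio` at the bytes of a real ISO to get numbers for your own data.

```python
import hashlib
import random


def fixed_dedup_ratio(data: bytes, seg_size: int) -> float:
    """Fraction of bytes saved if each distinct fixed-size segment is stored once."""
    if not data:
        return 0.0
    seen = set()
    stored = 0
    for off in range(0, len(data), seg_size):
        seg = data[off:off + seg_size]
        digest = hashlib.sha256(seg).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(seg)
    return 1.0 - stored / len(data)


def make_sample(block_size: int = 4096, distinct: int = 40,
                total_blocks: int = 512, seed: int = 0) -> bytes:
    """Synthetic stand-in for real data (an assumption, not an ISO):
    a small set of distinct 4 KiB blocks repeated in a random order."""
    blocks = [hashlib.sha256(str(i).encode()).digest() * (block_size // 32)
              for i in range(distinct)]
    rng = random.Random(seed)
    return b"".join(blocks[rng.randrange(distinct)] for _ in range(total_blocks))


if __name__ == "__main__":
    sample = make_sample()
    for kib in (4, 32, 128):
        ratio = fixed_dedup_ratio(sample, kib * 1024)
        print(f"{kib:>3} KiB segments: {ratio:.0%} saved")
```

The sample is built so that duplicates exist only at 4 KiB granularity. Large segments then almost never repeat as a whole, which is one way two engines looking at the same data could report very different ratios.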
Thanks in advance for your help.
I was trying to find a structure definition for ISO, and found this:
http://users.telenet.be/it3.consultants.bvba/handouts/ISO9960.html
...which says:
"the ISO 9660 standard allows an optional extended attributes record (XAR) stored at the beginning of the file's extent"
...so I'm wondering if some/many/all file extents within any given typical modern ISO image actually begin with one or more binary meta-data fields, and if so, this would likely impact FLD (Fixed Length Dedupe) because many potentially similar blocks (to VLD) would actually appear to be different to FLD.
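The alignment effect described above can be illustrated with a toy Python sketch. It compares fixed-length chunking with a very simple content-defined chunker (a rolling byte sum, which is an assumption chosen for brevity and not the algorithm used by MSDP, Data Domain, or Windows) on the same payload with and without a 7-byte prefix. The prefix `b"XARHDR!"` is a made-up placeholder, not a real ISO 9660 structure, and all function names are hypothetical.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 4096) -> list:
    """Split data at fixed offsets, as a fixed-length scheme would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, divisor: int = 256) -> list:
    """Cut wherever the sum of the last `window` bytes is divisible by `divisor`.
    Boundaries depend on nearby content, not on absolute offset.
    No minimum chunk size is enforced, so some chunks may be tiny."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(reference: list, candidate: list) -> float:
    """Fraction of candidate bytes that sit in chunks already seen in reference."""
    known = {hashlib.sha256(c).digest() for c in reference}
    total = sum(len(c) for c in candidate)
    hit = sum(len(c) for c in candidate if hashlib.sha256(c).digest() in known)
    return hit / total if total else 0.0


if __name__ == "__main__":
    rng = random.Random(1)
    payload = bytes(rng.getrandbits(8) for _ in range(65536))
    shifted = b"XARHDR!" + payload  # same payload behind a small made-up header
    print("fixed:", shared_fraction(fixed_chunks(payload), fixed_chunks(shifted)))
    print("content-defined:", shared_fraction(content_defined_chunks(payload),
                                              content_defined_chunks(shifted)))
```

With fixed offsets, a few extra bytes at the front shift every later segment, so nothing lines up. The content-defined cuts fall back onto the same positions once the window has moved past the prefix.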
I never thought that MSDP might be using just compression for the baseline and it would make perfect sense in this case.
Strictly speaking, any deduplication engine does compression only for the first backup. This is not specific to NBU or DD. Consider your baseline backup as a large zip file with an index in front (the dedupe hashes), followed by the actual data chunks mixed with pointers to chunks that have the same hash. Yes, there is additional compression for the chunks. If you have ever had exposure to developing a compression algorithm, this is exactly how they work. Deduplication as a term applies to any subsequent backup, where a baseline with hashes is already present on the target. Maybe it is my terminology, but I don't see how baseline deduplication is (or could be) different from good archiving software.
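As a rough illustration of the "zip file with an index" picture, here is a small Python sketch of a baseline store: a hash index of unique chunks (each compressed with zlib) plus an ordered list of pointers used to rebuild the stream. This is only a conceptual model under my own assumptions; the names `build_baseline`, `restore`, and `report` are hypothetical, and no real product's on-disk format is implied.

```python
import hashlib
import zlib


def build_baseline(data: bytes, seg_size: int = 128 * 1024):
    """Return (index, recipe): index maps digest -> (raw_len, compressed chunk),
    recipe is the ordered list of digests (the 'pointers')."""
    index, recipe = {}, []
    for off in range(0, len(data), seg_size):
        seg = data[off:off + seg_size]
        digest = hashlib.sha256(seg).digest()
        if digest not in index:          # first time this chunk is seen: store it
            index[digest] = (len(seg), zlib.compress(seg))
        recipe.append(digest)            # always record a pointer
    return index, recipe


def restore(index: dict, recipe: list) -> bytes:
    """Rebuild the original stream by following the pointers."""
    return b"".join(zlib.decompress(index[d][1]) for d in recipe)


def report(data: bytes, index: dict) -> dict:
    """Split the saving into the part from duplicate chunks and the total after compression."""
    unique_raw = sum(raw_len for raw_len, _ in index.values())
    stored = sum(len(comp) for _, comp in index.values())
    return {
        "dedup_saving": 1 - unique_raw / len(data),
        "total_saving": 1 - stored / len(data),
    }


if __name__ == "__main__":
    a = bytes(range(256)) * 4
    b = bytes(reversed(range(256))) * 4
    data = a * 3 + b                     # three identical chunks plus one different
    index, recipe = build_baseline(data, seg_size=1024)
    assert restore(index, recipe) == data
    print(len(index), "unique chunks,", len(recipe), "pointers", report(data, index))
```

On a first backup the index starts empty, so the only duplicates it can find are those inside the stream itself; everything else is saved by compression alone.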
asg2ki wrote:
I'll see what I can do to make a comparison with virtual DD, but I know for sure that physical DD definitely breaks the data down into very small chunks, then deduplicates them, and finally applies compression on top. I suppose it still keeps a separate record of the hashing data before the compression itself, so that deduplication is as efficient as possible for further data. BTW, I already tested MSDP with variable deduplication (see my previous reply) and it didn't help the deduplication ratio at all.
For the record, the deduplication engine used by Virtual and Physical DD is identical. The hashing is done only once as part of the SISL process.
It's interesting to know that MSDP with variable length dedupe did not work well either. I expected it to improve the dedupe ratio.