What you can store and transmit with deduplication...

Jim_MD · ‎09-10-2019

The charts in the PDF are theoretical. It would be interesting to see whether others are getting results like these.

From what I can see, the ability to dedupe data can have a significant impact on storage and transmission.

sdo · ‎09-10-2019

As long as your source backup data is not compressed at source (i.e. randomized) or massively binary-new every day, and usually also as long as you do not retain for too long in the dedupe storage... then yes, it is possible that one can see some fairly impressive figures.

In my experience it is quite rare to see results heading towards the far end of the scale. You can actually simulate some truly stupendously ridiculous dedupe rates using the "GEN_DATA" backup policy directives, but these are only useful in a small way, for proving the hype under perfect (non-real-world) conditions - i.e. the NetBackup MSDP dedupe software really can deliver the goods if the backup data is of exactly the right type - but it never is. E.g. is all of your backup data always all zeros, or always all ones, or always exactly the same every day? No, it probably isn't. And so, no, you probably won't achieve stupendous dedupe results.

Remember - it might not be a good idea to retain for too long in dedupe... because... as data changes (e.g. 2% per day) then after 50 days you will have experienced 100% change, i.e. a doubling from first to last - plus also all of the change of the other 48 days.
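As a rough sketch of that arithmetic, here is one simple reading of it. The model is an assumption, not anything taken from NetBackup: a fixed fraction of the data changes each day, every changed block is new to the pool, and nothing expires inside the retention window. The function name is illustrative only.

```python
def stored_multiple(daily_change: float, days_retained: int) -> float:
    """Unique data held in the pool, as a multiple of one full backup.

    Assumed model: the first backup is stored in full, and each later
    day adds only its changed blocks, none of which were seen before.
    """
    if days_retained < 1:
        raise ValueError("days_retained must be at least 1")
    return 1.0 + daily_change * (days_retained - 1)


if __name__ == "__main__":
    # 2% daily change, as in the example above.
    for days in (1, 14, 50, 100):
        print(f"{days:>3} days retained -> {stored_multiple(0.02, days):.2f}x one full backup")
```

Under these assumptions the pool holds roughly twice the size of one full backup after 50 days and keeps growing linearly after that. Real change rates are rarely this uniform.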

At one site I see only a 5:1 dedupe rate, but then that site ingests a lot of digital media. At other sites, 8:1 and 9:1 are much more typical. I've seen 12:1 at one site - yet I cannot say that I have ever seen the marketeering fluff of 20:1.

One point to note is that the technologies to achieve massively reduced data transit volumes (client to server) of traffic are different for each vendor. To achieve it for NetBackup MSDP then one needs to engage client-side dedupe. For DataDomain, it is different. Some other vendors cannot achieve this reduced volume of "client to server" traffic.

For NetBackup MSDP AIR replication, then again, that is MSDP to MSDP only, or different technologies in other vendor platforms to achieve something similar - but not all apples are comparable.

e.g. we know that NetBackup AIR Replication duplicates/sends only new, never-seen-before changed blocks between MSDP disk pools - and for client-side dedupe something very similar happens: the client computes the finger-print and then sends that small packet to the MSDP server and asks... "Hey MSDP media server, do I need to send you the backup data matching this finger-print?" and the MSDP media server will either respond with "no, seen that before, I'll mark you as having a copy of that in your backup data", or the MSDP media server will respond with "yes please do send me the original backup data as I have never seen that pattern before".
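The conversation described above can be sketched as a toy model. This is not the MSDP wire protocol. The fixed 4 KiB chunking, the SHA-256 fingerprint and the `DedupeServer` class are all assumptions made purely for illustration.

```python
import hashlib


class DedupeServer:
    """Toy stand-in for a dedupe pool: maps fingerprint -> chunk."""

    def __init__(self) -> None:
        self.chunks = {}

    def has(self, fingerprint: str) -> bool:
        return fingerprint in self.chunks

    def store(self, fingerprint: str, chunk: bytes) -> None:
        self.chunks[fingerprint] = chunk


def backup(data: bytes, server: DedupeServer, chunk_size: int = 4096):
    """Back up data chunk by chunk; return (queries_sent, payload_bytes_sent)."""
    queries = 0
    payload = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        queries += 1                     # "do I need to send you this?"
        if not server.has(fingerprint):  # "yes please, never seen it"
            server.store(fingerprint, chunk)
            payload += len(chunk)
        # otherwise: "no, seen that before" and only the fingerprint travelled
    return queries, payload


if __name__ == "__main__":
    server = DedupeServer()
    data = b"".join(bytes([i]) * 4096 for i in range(4))  # four distinct chunks
    print(backup(data, server))  # first run: every chunk is new
    print(backup(data, server))  # second run: queries only, no payload
```

Note that the second run still sends one query per chunk. That is the point picked up in the next paragraph.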

Remember, with client-side dedupe, every piece of backup data still involves a TCP conversation of at least "have you seen this before?", i.e. there are still gazillions of packets flying back-and-forth between backup clients and backup servers. It's just that with client-side dedupe these packets tend to be a lot smaller (because hopefully most of the backup data has been seen before) - and so, from a network perspective, the traffic starts to look more "chatty" and much less like "streaming" (bulk send from client to server). What this means is that for very long distance comms you still have the problem of how to get more packets in flight, i.e. how to keep the wire full... do you see how client-side dedupe does not remove or reduce this problem? Even with MSDP AIR Replication and client-side dedupe (the same technologies under the hood), to achieve good, fast, very long distance comms one may well still need WAN accelerators.
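To put rough numbers on "keeping the wire full", here is a sketch of the usual window-per-round-trip bound. The 64 KiB window and the 100 ms round trip are illustrative values, and `outstanding_queries` is a hypothetical knob: this thread does not say how MSDP batches or pipelines its fingerprint queries.

```python
def max_throughput_mbit(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound for one connection: at most one window of data per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000.0) / 1_000_000


def lookups_per_second(outstanding_queries: int, rtt_ms: float) -> float:
    """Fingerprint queries answered per second with a fixed number in flight."""
    return outstanding_queries / (rtt_ms / 1000.0)


if __name__ == "__main__":
    # One connection, 64 KiB window, 100 ms round trip.
    print(f"{max_throughput_mbit(65536, 100):.2f} Mbit/s ceiling")
    # One query at a time versus 1000 in flight over the same link.
    print(f"{lookups_per_second(1, 100):.0f} lookups/s")
    print(f"{lookups_per_second(1000, 100):.0f} lookups/s")
```

Whatever the payload size, the ceiling scales with how much is in flight divided by the round-trip time. That is why smaller packets alone do not fix a long, high-latency link.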

sdo · ‎09-10-2019

Here's another reason to love NetBackup MSDP's own client-side dedupe capabilities... if, for example, I have one backup appliance (with two CPU sockets and 10 cores each - so 20 cores in total), and say I have a hypervisor-based compute stack (VMware or Hyper-V - it doesn't matter) of say 20 physical nodes, each with two CPU sockets of 20 cores each (so 800 cores in total)... then if I engage client-side dedupe on the backups of either a plain client inside the VM (or, for Hyper-V, say a VM-style backup at the hypervisor layer), then the CPUs in the compute stack will perform the dedupe finger-print hashing and so free up the CPUs in the appliance for processing the ingest of more streams of backups. When spread across so many CPUs in the compute stack, the CPU demand of fingerprint hashing is hardly noticed at all - whereas if the appliance or traditional media server had to do all the hard fingerprint work, then you might well notice this.

Jim_MD · ‎09-10-2019

Client-side dedupe has been maturing for the last 11+ years. I first encountered this in PureDisk when it was a standalone product. What got me interested in investigating the storage and transmission profile was the variability I was seeing in different dedupe devices and how they each responded to encrypted data, multi-media, and data made up of squillions of tiny files. One site showed considerable variability in dedupe backup rates, but overall it was about a 93-95% reduction for the entire storage device.

One thing to be aware of: this is only theoretical; the transmission overhead and meta-data are ignored. What I think is important is the shape of the charts and doing a sensitivity analysis.

What you can store and transmit with deduplication seems to be very sensitive to deduplication rates.
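A small sketch of that sensitivity, converting between the "percent reduction" and "N:1" figures quoted in this thread. It uses only the definitional relationship between the two, so metadata and transmission overhead are ignored, as in the charts. The function names are illustrative.

```python
def ratio_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction (e.g. 95) to an N:1 dedupe ratio."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 100.0 / (100.0 - reduction_pct)


def reduction_from_ratio(ratio: float) -> float:
    """Convert an N:1 dedupe ratio to a percentage reduction."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 100.0 * (1.0 - 1.0 / ratio)


if __name__ == "__main__":
    for pct in (80, 90, 93, 95, 98):
        print(f"{pct}% reduction = {ratio_from_reduction(pct):.1f}:1")
    for ratio in (5, 8, 9, 12, 20):
        print(f"{ratio}:1 = {reduction_from_ratio(ratio):.1f}% reduction")
```

The curve is steep at the top end: moving from 90% to 95% reduction halves what has to be stored or sent, while the 93-95% range mentioned above spans roughly 14:1 to 20:1.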