How distributed global deduplication delivers uniquely dedicated resources for the best performance.
An innovative take on deduplication for the cloud world
How Does Cloud-Scale Distributed Global Deduplication Deliver Uniquely Dedicated Resources for the Best Performance?
Over the last few years, we have been evolving what it means to use object storage as a backup target. The funny thing is that every iteration has gotten simpler, or maybe "more elegant," as it has evolved: from initially using a libfuse underlay, to optimizing the architecture around an elastic, container-based worker that moves data directly from memory to object storage, keeping data in-flight for the best performance.
The Cloud Method: Distributed Deduplication
The method we use, which I like to call “Distributed Deduplication,” started with a monolithic storage target. We took advantage of cloud infrastructure and “exploded” it to push the work of deduplication, compression and encryption into a workload-adjacent dedicated worker for each data stream.
This ensures that each data stream gets all the resources it needs, and the individual data streams can be broken into manageable chunks. Each stream worker only has to deal with a small subset of the data and can therefore deliver great performance. But what about global deduplication? Well, with a central storage pool authority, we can globally track what was read and the relevant fingerprints. This means that on subsequent jobs, the ephemeral deduplication engine has access to all the information it needs to deliver optimization across the entire pool.
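To make that concrete, here is a minimal Python sketch of the pattern (illustrative only, not MSDP code): each ephemeral stream worker chunks and fingerprints its own data, while a shared fingerprint index, standing in for the central storage pool authority, lets every worker deduplicate against the whole pool. The class names, fixed-size chunking, and SHA-256 fingerprints are assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 128 * 1024  # illustrative fixed-size chunking; MSDP uses its own segmentation


class FingerprintIndex:
    """Stand-in for the central storage pool authority that tracks
    which chunk fingerprints already exist in the pool."""
    def __init__(self):
        self._known = set()

    def contains(self, fingerprint: str) -> bool:
        return fingerprint in self._known

    def record(self, fingerprint: str) -> None:
        self._known.add(fingerprint)


class StreamWorker:
    """One ephemeral worker per data stream: chunk, fingerprint,
    and upload only the chunks the pool has never seen."""
    def __init__(self, index: FingerprintIndex, object_store: dict):
        self.index = index
        self.object_store = object_store  # dict as a stand-in for a bucket

    def ingest(self, stream: bytes) -> dict:
        written = skipped = 0
        for offset in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[offset:offset + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if self.index.contains(fp):
                skipped += 1  # global dedupe: another stream already stored this chunk
            else:
                self.object_store[fp] = chunk  # compression/encryption would happen here
                self.index.record(fp)
                written += 1
        return {"chunks_written": written, "chunks_deduped": skipped}


# Two workers sharing one index: the second stream dedupes against the first.
index, bucket = FingerprintIndex(), {}
print(StreamWorker(index, bucket).ingest(b"A" * CHUNK_SIZE * 3))
print(StreamWorker(index, bucket).ingest(b"A" * CHUNK_SIZE * 2 + b"B" * CHUNK_SIZE))
```

Running the sketch shows the second stream skipping the chunks the first stream already stored, which is exactly the cross-stream behavior the central authority enables.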
Compatibility Without Disruption
The coolest part of this is that we were able to make all these changes without impacting the data format. MSDP was always an object store at heart, and making it operate as an object storage overlay, instead of sitting on top of block storage, could be done in a way that kept it compatible with previous iterations of the feature. This means a customer can take a backup they might have written years ago and, within a few minutes, be writing to cloud storage.
What About Advanced Features?
But what about advanced features? Does it perform well enough for those? Yes, the copy data management capabilities of MSDP work on top of object storage as well. This means we can use Instant Access to directly access backup contents for all sorts of options, from quick recovery to analyzing backup data to check for malware. On top of that, all data written to the bucket is self-descriptive and can be "reused" from a different instance, or even a temporary "cloud recovery server." Imagine you backed up a bunch of EC2 instances with copies written to blob storage in Azure. With no pre-existing infrastructure in Azure, you could spin up an on-demand environment, either manually or with the SaaS-based recovery orchestration service (RecoveryAgent), that would be able to directly read those backups and recover them to Azure VMs.
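To illustrate what "self-descriptive" buys you, here is a rough Python sketch (again, an invented layout, not MSDP's actual bucket format): each backup writes a small manifest into the bucket alongside the chunks it references, so a brand-new recovery instance with no pre-existing catalog can rebuild the backup from the bucket contents alone. The helper names and JSON manifest are hypothetical.

```python
import hashlib
import json


def write_backup(bucket: dict, backup_id: str, data: bytes, chunk_size: int = 4):
    """Write chunks plus a self-describing manifest into the bucket (a dict stand-in).
    The manifest format here is invented for illustration."""
    fingerprints = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        bucket[f"chunks/{fp}"] = chunk
        fingerprints.append(fp)
    bucket[f"manifests/{backup_id}.json"] = json.dumps(
        {"backup_id": backup_id, "chunks": fingerprints}).encode()


def recover_from_scratch(bucket: dict, backup_id: str) -> bytes:
    """A fresh recovery instance with no prior catalog: it only needs the bucket,
    because the manifest describes which chunks make up the backup."""
    manifest = json.loads(bucket[f"manifests/{backup_id}.json"])
    return b"".join(bucket[f"chunks/{fp}"] for fp in manifest["chunks"])


bucket = {}
write_backup(bucket, "vm-001", b"hello cloud recovery")
assert recover_from_scratch(bucket, "vm-001") == b"hello cloud recovery"
```

The point of the sketch is simply that the metadata travels with the data, so a temporary cloud recovery server can be stood up anywhere the bucket is reachable.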
Wrapping It Up
All in all, by building a distributed, cloud-native architecture, we can spin up dedicated resources in the right place at the right time to ensure performance needs are met, all while keeping as much of the infrastructure as possible ephemeral so that TCO stays manageable. That delivers the performance and capabilities customers need while also solving for cloud costs.