
Data removal improvement in PureDisk 6.6

Mark_Christiaen

It was a big effort but we finally made it: PureDisk 6.6 (a.k.a. Darrieus) is out! It contains a bucketload of improvements:

  • Storage Pool Installer

  • New Administration User Interface

  • Context sensitive help

  • Remote office job progress information

  • Optimized deduplication for PDDO/VCB backups

  • Oracle Agent

  • Exchange Granular Restore

  • NTFS Special file type support

  • Virtual Synthetics

  • Storage Pool Consistency Checker

  • PureDisk Command Line

  • Multi Stream for Replication

  • Storage Pool Conversion tool (6.2 to 6.5 layout)

I won't go into detail on these since the documentation covers them well. Let's instead talk about a background process that is easy to overlook but crucial: data removal.

Data removal in a deduplicating environment is an underrated process. Most people assume it is an afterthought in the design of a deduplication product, but nothing could be further from the truth. Removing deduplicated data requires careful analysis. When deduplicating data during a backup, you essentially remove duplicate parts of the data and replace them with references to a single stored copy. When removing that data, you have to update this reference structure again: PureDisk has to figure out which parts of the data are currently referenced exactly once (and after removal would therefore not be referenced at all), because only those parts can actually be deleted. PureDisk performs this task in the background, and it constitutes a large part of what is known as "queue processing".
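To make the reference bookkeeping concrete, here is a minimal sketch in Python. It is purely illustrative: the DedupStore class, its methods, and the SHA-1 fingerprints are assumptions made for the example, not PureDisk's actual data structures or queue-processing logic.

    import hashlib

    # Illustrative reference counting in a deduplicating store: each unique
    # segment is stored once, and a count tracks how many backups use it.
    class DedupStore:
        def __init__(self):
            self.segments = {}   # fingerprint -> segment data (stored once)
            self.refcount = {}   # fingerprint -> number of references
            self.backups = {}    # backup id -> list of fingerprints

        def backup(self, backup_id, chunks):
            """Store each chunk once; duplicates only add a reference."""
            fps = []
            for chunk in chunks:
                fp = hashlib.sha1(chunk).hexdigest()
                if fp not in self.segments:
                    self.segments[fp] = chunk      # first copy: store it
                self.refcount[fp] = self.refcount.get(fp, 0) + 1
                fps.append(fp)
            self.backups[backup_id] = fps

        def remove(self, backup_id):
            """Only segments whose count drops to zero can be reclaimed."""
            for fp in self.backups.pop(backup_id):
                self.refcount[fp] -= 1
                if self.refcount[fp] == 0:         # was referenced just once
                    del self.refcount[fp]
                    del self.segments[fp]          # now safe to free the space

    # Two backups sharing a segment: removing one frees only its unique data.
    store = DedupStore()
    store.backup("A", [b"shared", b"only-in-A"])
    store.backup("B", [b"shared"])
    store.remove("A")            # reclaims "only-in-A"; "shared" survives
    print(len(store.segments))   # 1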

Now, why do I mention all of this? Because we've done some serious work on queue processing. In Caver (the previous major release of PureDisk), queue processing was structured in such a way that, to completely remove data from a PureDisk storage pool, you had to run data removal on the MetaBase and then run queue processing four times. Considering that a single queue-processing run can take a couple of hours on a 16 TB content router, queue processing is a major consumer of resources (disk bandwidth and CPU power).

In Darrieus, we've overhauled queue processing to make it "smarter": by keeping track of the commands in the queue, it can analyze and update the state of references more intelligently. As a result, without any loss of correctness or functionality, we can completely remove data by processing the queue only twice. In Caver, the default queue-processing frequency was four times a day; in Darrieus, we changed the default to twice a day to achieve the same net effect. The cool thing is that an individual queue-processing run is still just as fast as it was in Caver, so the total overhead of queue processing has been halved.
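To put rough numbers on that claim, here is a back-of-the-envelope comparison. The two-hour run time is an assumption based on the "couple of hours" figure above, not a measured value.

    hours_per_run = 2           # assumed duration of one queue-processing run

    caver_runs_per_day = 4      # Caver default: process the queue 4x/day
    darrieus_runs_per_day = 2   # Darrieus default: same net effect in 2 runs

    caver_overhead = caver_runs_per_day * hours_per_run        # 8 hours/day
    darrieus_overhead = darrieus_runs_per_day * hours_per_run  # 4 hours/day

    print(darrieus_overhead / caver_overhead)                  # 0.5: halved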

Hope you enjoy Darrieus!

1 Comment
Harish55 (Employee)
Performance enhancements

PureDisk now groups more information together when transferring data from the content routers to a client for a restore. This new method improves performance for most restores. Two new client agent configuration file fields have been added: MaxSegmentPrefetchSize and SegmentChunkSize. You can change the values of these fields to further enhance restore performance for specific clients.
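For what it's worth, tuning these fields might look something like the excerpt below. Only the two field names come from the comment above; the file layout and the values are hypothetical placeholders, so check the PureDisk client agent documentation for the real file location, defaults, and valid ranges.

    # Hypothetical client agent configuration excerpt; the field names come
    # from the comment above, the values are illustrative only.
    MaxSegmentPrefetchSize = 33554432   # example: prefetch up to 32 MB per restore request
    SegmentChunkSize = 1048576          # example: group segment data in 1 MB chunks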