NetBackup Deduplication algorithm aka PureDisk

dedupe · ‎05-17-2011

My understanding of the process is as follows:

If the file size is < 128KB (the default segment size), a single hash is created for the whole file. Any change to the file results in a new hash, so the changed file will not be deduplicated.

If the file size is > 128KB and #segments < maximum segment count: divide the file into fixed 128KB segments and generate an MD5 hash for each segment.
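A minimal Python sketch of the fixed-segment case as described above. It assumes plain MD5 per segment; the function name `segment_hashes` is made up for illustration, and this is not NetBackup's actual code.

```python
import hashlib

SEGMENT_SIZE = 128 * 1024  # 128KB default segment size mentioned above


def segment_hashes(data: bytes, segment_size: int = SEGMENT_SIZE) -> list:
    """Split data into fixed-size segments and return one MD5 hex digest per segment.

    Hypothetical helper for illustration only; the last segment may be shorter.
    """
    return [
        hashlib.md5(data[offset:offset + segment_size]).hexdigest()
        for offset in range(0, len(data), segment_size)
    ]
```

With fixed segments, changing bytes inside one segment only changes that segment's hash, so the untouched segments can still be deduplicated. Inserting or deleting bytes shifts every later segment boundary, though, which is why the fixed vs. variable question matters.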

If the file size is > 128KB and #segments >= maximum segment count: double the segment size until #segments < maximum segment count (or the maximum segment size is reached), then divide the file into fixed segments of that larger size and generate an MD5 hash for each segment.

Reference: "If the number of segments within the file exceeds the maximum segment count (file size /segment size >= maximum segment count), then the segment size is doubled, until the number of segments is below the maximum segment count, or until the maximum segment size is reached."
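The rule in that quote can be sketched in Python as follows. The actual maximum segment count and maximum segment size are not known here, so they are passed in as parameters; `choose_segment_size` is a hypothetical name, not a NetBackup function.

```python
def choose_segment_size(file_size: int,
                        base_segment_size: int,
                        max_segment_count: int,
                        max_segment_size: int) -> int:
    """Double the segment size while file_size / segment_size >= max_segment_count,
    stopping once the maximum segment size is reached.

    All limits are caller-supplied assumptions; the real defaults are unknown.
    """
    segment_size = base_segment_size
    # Integer form of: file_size / segment_size >= max_segment_count
    while (file_size >= max_segment_count * segment_size
           and segment_size < max_segment_size):
        segment_size = min(segment_size * 2, max_segment_size)
    return segment_size
```

For example, with a made-up maximum segment count of 4, a 1000KB file would end up with 256KB segments: 1000/128 is about 7.8 (too many), 1000/256 is about 3.9 (below 4).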

What is the maximum segment count? Does NetBackup use a variable segment size with the stream readers, or does it still use fixed blocks? Can someone shed light on these points, please?

PS: NetBackup fingerprints are not the same as MD5 hashes. NetBackup generates a second hash using its own method and then calculates the fingerprint for a segment, to protect against MD5 collisions. Correct me if I am wrong.

Thanks

Burak Uysal
