How deduplication works in BE2012

capriguy84
Level 3

I'm pretty confused about how deduplication works in BE2012.

Q1) How is a backup job set up to use deduplication? Example: I have to back up a RHEL6 server daily (differential) and weekly (full) for a single 100 GB volume.

  • Q1a) What storage type do we choose for each of the above jobs? Is it correct to choose "dedup disk" for the full and "regular disk" for the differential?
  • Q1b) When given the option of how the dedup disk should be made available, which one should I choose? Any recommendation for Windows or Linux hosts?

Q2) Do the differential backup jobs get any benefit from using dedupe? (I suppose not, but I'm still curious.)

Q3) What happens after the first full backup? I assume the first backup is used as the seed for dedup. When the next full backup job kicks in, does the hash calculation happen on the RHEL6 server or on the BE2012 server?

  • Q3a) Per the theory of deduplication, the collective size of (seed + second full backup) should be less than 2 × one full backup. Is that always the case?
  • Q3b) If I were to restore the 

Q4) If I were to manually copy the backups to a parent site, how should I go about it? The two sites are separated geographically over a slow WAN.

Q4a) Someone suggested using the CASO option to duplicate the backup set. Does this kind of copy also take advantage of deduplication?

 

Any help is much appreciated.

 

1 ACCEPTED SOLUTION


teiva-boy
Level 6

 

Q1) How is a backup job set up to use deduplication? Example: I have to back up a RHEL6 server daily (differential) and weekly (full) for a single 100 GB volume.

  • Q1a) What storage type do we choose for each of the above jobs? Is it correct to choose "dedup disk" for the full and "regular disk" for the differential?
  • Q1b) When given the option of how the dedup disk should be made available, which one should I choose? Any recommendation for Windows or Linux hosts?

 

You send ALL backups, full, diff, and incr alike, to the dedupe storage pool. Don't mix in plain disk unless you have a specific reason to.

I do not believe Linux supports client-side dedupe yet, so you select media server dedupe, and the BE server does all the work. Windows can be either one. Choose client-side dedupe as the default unless you run into performance issues.
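To make the client-side vs. server-side split concrete, here is a minimal Python sketch of the idea (this is not BE's actual engine: the fixed 4 KB segments, SHA-256 hashing, and all names are assumptions for illustration; BE uses its own variable-length segmenting):

```python
import hashlib

SEGMENT_SIZE = 4 * 1024

def segment(data: bytes):
    """Split a byte stream into fixed-size segments (toy approach)."""
    return [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]

class DedupeStore:
    """Stands in for the media server's dedupe folder."""
    def __init__(self):
        self.segments = {}  # sha256 hex -> segment bytes

    def missing(self, hashes):
        """Answer a client: which of these hashes do I not have yet?"""
        return [h for h in hashes if h not in self.segments]

    def put(self, seg: bytes):
        self.segments[hashlib.sha256(seg).hexdigest()] = seg

def server_side_backup(store: DedupeStore, data: bytes):
    """Server-side dedupe (the Linux case): all data crosses the network,
    and the media server does the hashing."""
    for seg in segment(data):
        store.put(seg)

def client_side_backup(store: DedupeStore, data: bytes):
    """Client-side dedupe (the Windows option): the client hashes first,
    asks the server what it lacks, and ships only unique segments."""
    segs = segment(data)
    hashes = [hashlib.sha256(s).hexdigest() for s in segs]
    wanted = set(store.missing(hashes))
    sent = 0
    for h, s in zip(hashes, segs):
        if h in wanted:
            store.put(s)
            sent += len(s)
            wanted.discard(h)  # ship each unique segment only once
    return sent  # bytes that actually crossed the wire
```

The point to notice: in the server-side path the whole volume crosses the network every time, while in the client-side path only segment hashes plus never-before-seen segments do.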

 

Q2) Do the differential backup jobs get any benefit from using dedupe? (I suppose not, but I'm still curious.)

It does. A diff/incr decides what to back up at the file level, but dedupe works on smaller segments, at the sub-file level, so the changed files in a differential still dedupe against what is already in the store.
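Here is a toy illustration of that sub-file effect (the fixed-size segments and random sample data are assumptions; a real engine uses variable-length segments so that insertions don't shift every boundary):

```python
import hashlib
import os

def segment_hashes(data: bytes, size: int = 4096) -> set:
    """Hash each fixed-size segment of a byte stream."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

original = os.urandom(4096 * 100)   # 100 segments already in the dedupe store
modified = bytearray(original)
modified[0:4] = b"edit"             # a small in-place change to one file

new = segment_hashes(bytes(modified)) - segment_hashes(original)
print(len(new))  # 1 -> the diff adds one new segment, not all 100
```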

Q3) What happens after the first full backup? I assume the first backup is used as the seed for dedup. When the next full backup job kicks in, does the hash calculation happen on the RHEL6 server or on the BE2012 server?

  • Q3a) Per the theory of deduplication, the collective size of (seed + second full backup) should be less than 2 × one full backup. Is that always the case?
  • Q3b) If I were to restore the 

The first backup will benefit more from compression than anything, with only some dedupe. Subsequent FULLs will see the real benefits of dedupe; in the best case, 4 full backups plus daily diffs (a month's worth) shouldn't take much more space than a single full backup.
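Rough arithmetic for the 100 GB volume in the question (the daily change rate below is an assumption; real rates vary a lot):

```python
full_gb = 100    # the 100 GB volume from the question
change = 0.005   # assumption: ~0.5% of the data is new each day
days = 30

# With dedupe, the seed stores everything once; every later backup,
# full or diff, adds only segments the store has never seen before.
deduped = full_gb + days * full_gb * change
print(f"~{deduped:.0f} GB for the month")  # ~115 GB, vs 400+ GB for four plain fulls
```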

For Linux, the calculation is done on the BE server. For Windows you can pick, but ideally it's client-side, to distribute the processing amongst multiple hosts.

 

Q4) If I were to manually copy the backups to a parent site, how should I go about it? The two sites are separated geographically over a slow WAN.

Depending on latency and available bandwidth, this will be your biggest hurdle. Any dropped packets will likely fail jobs, and you will have to keep retrying until one completes successfully. You can seed the backups: back up similar data to a removable HDD, ship it, then duplicate it into the other BE server's dedupe store.

Q4a) Someone suggested using the CASO option to duplicate the backup set. Does this kind of copy also take advantage of deduplication?

It does take advantage of it, in the sense that the data has already been deduplicated and stored in a compressed format; the duplicate job only needs to send the difference segments that make up the new data, so it's pretty efficient.
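A sketch of why that duplicate is cheap (the function and store layout here are hypothetical stand-ins for Veritas' optimised duplication): the source offers the backup set's segment hash list, the target says which hashes it lacks, and only those segments cross the WAN.

```python
def optimized_duplicate(source: dict, target: dict, backup_hashes: list) -> int:
    """Duplicate one backup set between two dedupe stores (hash -> bytes).
    Only segments the target has never seen are transmitted."""
    missing = [h for h in backup_hashes if h not in target]  # hash exchange is tiny
    sent = 0
    for h in missing:
        seg = source[h]   # segment bytes live at the source site...
        target[h] = seg   # ...and cross the WAN only if the target lacks them
        sent += len(seg)
    return sent           # bytes that actually crossed the link
```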


2 REPLIES

pkh
Moderator
   VIP    Certified

1) For any backup, you can back up to either a dedup folder or other storage. It all depends on your installation's requirements. There is no "recommendation" as to the choice.

2) Sure.  There will be benefits. The data will be dedup'ed.

3) The hash calculation is done on the media server if you choose server-side dedup, and on the remote server if you choose client-side dedup.

3a) It should.

To "copy" dedup'ed data over a slow WAN link, you should use optimised duplication.  This means that you have to have dedup folders on both ends of the link and CASO.  You then run a duplicate job to duplicate backup sets from one dedup folder to the other dedup folder.  When you do this, only the changed data blocks will be transmitted across the WAN link, thus minimising bandwidth requirements.
