Dedupe Query

Krutina
Level 3

Hi

 

I have a question about configuring a backup to an MSDP.

The intention is to create a full backup every weekend and an incremental backup on weekdays. Retention on the MSDP would be 2 weeks for the weekly fulls and 2 weeks for the incrementals.

I just need clarity here: when the week-3 full backup is triggered, will it take a fresh full backup? I assume that, since there are no pointers to the earlier deduped full backup data, it would send the full data again.

 

Can someone throw light on this query?

 

Tx

5 REPLIES

RonCaplinger
Level 6

You are correct.  Unless you have enabled Client Side Deduplication, or are using NetBackup Accelerator, every full backup will traverse the entire client directory structure and send every block of data from the client, over the network, to the MSDP.  The media server connected to the MSDP will then calculate a hash value from the block of data, compare the hash value to the MSDP "fingerprint cache", and if it has never seen that same block of data, it will write it to the MSDP.  If it *has* seen that block of data, it updates the expiration date for that block already stored on disk, then discards the incoming block.  This will happen on every full backup, from the first to the second to the third, etc.  Deduplicated backups save space, but they don't save any time.
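To illustrate the idea (this is just a toy sketch, not NetBackup's actual implementation -- the fixed block size, SHA-256 hash, and in-memory dictionaries are all simplifying assumptions):

```python
import hashlib
from datetime import date, timedelta

BLOCK_SIZE = 128 * 1024   # assumed fixed block size; real dedupe engines use variable-length segments
fingerprint_cache = {}    # fingerprint -> expiration date (stand-in for the MSDP fingerprint database)
storage_pool = {}         # fingerprint -> block data (stand-in for the dedupe pool on disk)

def ingest_block(block: bytes, expiration: date) -> str:
    """Write a block to the pool only if its fingerprint has never been seen before."""
    fp = hashlib.sha256(block).hexdigest()
    if fp in fingerprint_cache:
        # Duplicate: keep the copy already on disk, extend its retention, discard the incoming block.
        fingerprint_cache[fp] = max(fingerprint_cache[fp], expiration)
        return "duplicate - expiration updated"
    # Never seen before: store the block and remember its fingerprint.
    fingerprint_cache[fp] = expiration
    storage_pool[fp] = block
    return "new block written"

exp = date.today() + timedelta(weeks=2)
print(ingest_block(b"A" * BLOCK_SIZE, exp))   # new block written
print(ingest_block(b"A" * BLOCK_SIZE, exp))   # duplicate - expiration updated
```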

If you enable Client Side Deduplication for that client in your master server's "Client Attributes" settings, then a full backup of the client will still traverse the entire client directory structure, but instead of transferring the blocks of data to the media server to be hashed, the client will calculate the hash and send that to the media server instead.  If the hash value is found in the "fingerprint cache", the media server updates the expiration date for the block (just like above), but tells the client "don't send the full block of data, I already have it".  If the hash value is NOT found in the "fingerprint cache", the media server says "hey, that's completely new, send me the whole block."  Again, this will happen on every full backup.  This option reduces the amount of data sent over the network but requires extra processing on the client.
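Using the same toy model as above (again just a sketch, not the real NetBackup protocol), the client-side variant sends the small fingerprint first and ships the full block only when the media server asks for it:

```python
import hashlib
from datetime import date, timedelta

fingerprint_cache = {}        # media-server-side state, as in the previous sketch
storage_pool = {}
bytes_sent_over_network = 0   # rough counter of what actually crosses the wire

def client_backup_block(block: bytes, expiration: date) -> None:
    """Client-side dedup: the hash travels first; the block follows only if it is unknown."""
    global bytes_sent_over_network
    fp = hashlib.sha256(block).hexdigest()
    bytes_sent_over_network += len(fp)            # small fingerprint query instead of a full block
    if fp in fingerprint_cache:
        fingerprint_cache[fp] = max(fingerprint_cache[fp], expiration)
        return                                    # "I already have it" - block never leaves the client
    bytes_sent_over_network += len(block)         # only genuinely new blocks travel in full
    fingerprint_cache[fp] = expiration
    storage_pool[fp] = block

exp = date.today() + timedelta(weeks=2)
for _ in range(3):                                # three identical "full backups" of the same block
    client_backup_block(b"X" * 131072, exp)
print(bytes_sent_over_network)                    # roughly 128 KB plus three hash strings, not 384 KB
```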

If you also enable NetBackup Accelerator, it can save your systems a dramatic amount of work.  Instead of traversing the directory structure for both full and incremental backups, Accelerator keeps track of every file that gets created or changed on the client (if the client is very busy and frequently has LOTS of new or changed files, this tracking can potentially cause the client to run out of disk space!).  When a backup runs, NetBackup reads the list of files that have been changed or added since the last backup, and will ONLY create hash values of those files and send them to the MSDP media server.  When also using Client Side Deduplication, the blocks of data processed for even a full backup are limited to just the new or changed files since the last full backup.  This is why these features combined make even full backups run just about as quickly as the incremental backups.
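A very rough sketch of that change-tracking idea (the names and structure here are my own simplifying assumptions, not how Accelerator's track log actually works):

```python
import hashlib

track_log = set()   # paths created or changed since the last backup (toy stand-in for the track log)
catalog = {}        # path -> fingerprint recorded by the most recent backup

def note_file_change(path: str) -> None:
    """Called whenever a file is created or modified; the log grows between backups."""
    track_log.add(path)

def accelerated_backup(read_file) -> list:
    """Hash and send only the files listed in the track log, then clear it."""
    sent = []
    for path in sorted(track_log):
        catalog[path] = hashlib.sha256(read_file(path)).hexdigest()
        sent.append(path)           # unchanged files are never re-read or re-hashed
    track_log.clear()
    return sent

# Only the one changed file is processed, even though this stands in for a "full" backup.
note_file_change("C:/data/report.docx")
print(accelerated_backup(lambda p: b"new contents of " + p.encode()))
```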

 

MSDP - saves disk storage space, but requires extra CPU from either the media server or the client, and requires as much network bandwidth as regular backups

Client Side Dedupe - requires MSDP (or other deduplication), reduces CPU usage on the media server and reduces network bandwidth usage to the media server, but requires more CPU usage on the client

Accelerator - Requires MSDP, reduces CPU usage on the client and reduces network bandwidth usage to the media server, but uses some storage space on the client

Marianne
Moderator
Partner    VIP    Accredited Certified

If you create a full backup every week with 2-week retention, there will always be 1 valid, full backup that can be referenced.

So, I don't see that a fresh, full backup will be needed in week 3.

Catalog references to expired backups are removed, but as long as there are links to actual data from unexpired backups, the deduped data is not removed.

This TN describes the dedupe cleanup process: http://www.symantec.com/docs/TECH124914

Under normal operation, a sequence of regularly scheduled operations ensures that obsolete data segments are removed from the storage pools or deduplication pools automatically. Because each data segment may be owned by more than one backup image, segments are only removed when all the associated backups have expired.
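As a rough illustration of that cleanup rule (a toy reference-counting model, not the actual MSDP queue processing the TechNote describes), a segment is only reclaimed once the last backup image that owns it has expired:

```python
# Each stored segment maps to the set of backup images that still reference it.
segment_refs = {
    "seg-A": {"full_week1", "full_week2"},   # shared by two weekly fulls
    "seg-B": {"full_week1"},                 # unique to the week-1 full
}

def expire_image(image: str) -> list:
    """Expire one backup image; reclaim only segments no other image still owns."""
    reclaimed = []
    for seg, owners in list(segment_refs.items()):
        owners.discard(image)
        if not owners:                       # no unexpired backup points at this segment any more
            del segment_refs[seg]
            reclaimed.append(seg)
    return reclaimed

print(expire_image("full_week1"))   # ['seg-B'] - seg-A survives because week 2's full still needs it
print(expire_image("full_week2"))   # ['seg-A'] - nothing references it now, so it can be removed
```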

 

Krutina
Level 3

Hi

When a client sends its 1st-ever copy to the MSDP pool, would it consume storage capacity equivalent to the source data size?

Assuming the source data size is 100 GB, would it consume the same 100 GB for storing the 1st backup copy on the MSDP?

I guess it would not. Any help in this regard would be much appreciated.

Any pointers to documentation would be of great help.

 

Thx

Marianne
Moderator
Partner    VIP    Accredited Certified

Even the very first backup will only store unique blocks/segments of data.

So, depending on types of data, you will see varying dedupe rates for the first backup.

 

See: How deduplication works
http://www.symantec.com/docs/HOWTO36285

RonCaplinger
Level 6

The space occupied by the 1st copy would likely be less than the original size, but that all depends on the data.  Deduplication looks at "blocks" of data, not file-by-file.  Block size will vary.  A single file may be one block, or it could be multiple blocks, depending on how the deduplication engine happens to examine the incoming data.  This can also make estimation difficult.  If your data being backed up would compress well with something like WinZip, then it will likely deduplicate well, although the two technologies are very different from a compression standpoint.  However, if your data would not compress very much, such as video, audio, or .jpg files, then deduplication probably won't save any space. 

An example where deduplication would be really helpful is if you were to back up the C:\windows directory of 10 servers.  The first backup of the very first server would likely take the same space in the MSDP as the original data on the first server, because there isn't a lot of duplicate strings of characters in those files.  But the backup of the second through the tenth servers' C:\windows directory would probably deduplicate enough to save 99% or more every time.
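To put rough numbers on that example (these figures are illustrative assumptions, not measurements):

```python
servers = 10
windows_dir_gb = 20          # assumed size of C:\Windows on each server
first_copy_savings = 0.0     # assume the very first server's backup barely dedupes at all
later_copy_savings = 0.99    # assume ~99% of each later server's blocks match blocks already stored

stored_gb = windows_dir_gb * (1 - first_copy_savings) \
          + (servers - 1) * windows_dir_gb * (1 - later_copy_savings)
logical_gb = servers * windows_dir_gb

print(f"logical data backed up: {logical_gb} GB")              # 200 GB
print(f"space consumed in the MSDP: {stored_gb:.1f} GB")       # 21.8 GB
print(f"overall savings: {1 - stored_gb / logical_gb:.1%}")    # 89.1%
```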