
NBU and the CBT issue for vmware

cdirado
Level 3
Partner Accredited

I’m wondering if anyone knows of any potential issues with NetBackup related to VMware’s recent KB about their Changed Block Tracking (CBT) bug?

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=209063...

 

It looks like some other backup products might be affected, but Google finds very few results related to NetBackup.

 

1 ACCEPTED SOLUTION

aaron_meza
Level 2
Employee
17 REPLIES

Marianne
Moderator
Partner    VIP    Accredited Certified

The VMware KB article is fairly new and may not yet have resulted in any NBU Support calls.

The trouble with issues like this is that they are only noticed at restore time.

You found this article and can therefore be proactive to prevent issues.
IMHO, if you have VMDKs that were extended and you experience the symptoms listed in the doc, then the answer is YES. Any backup product will be affected as a result.

Best to talk to the VM admins to check extended VMDKs and apply the workaround if symptoms are seen.
So, backup and restore issues are easy enough to avoid.

Nicolai
Moderator
Partner    VIP   

Thanks for sharing the KB with us. It is now on our bug watch list!

Anonymous
Not applicable

Note: You can perform a Storage vMotion to reset CBT.

Further, this only highlights the importance of validating backups.

Q: Can NetBackup validate VM backup data for a large number of VMs without a lot of babysitting? That would be ideal in this scenario.

CRZ
Level 6
Employee Accredited Certified

Short answer: yes, but we don't know enough to document them yet.  We're working with VMware.  Stay tuned!

aaron_meza
Level 2
Employee

CRZ
Level 6
Employee Accredited Certified

...and in case you are noticing that this link doesn't work at the moment, we're rewriting this document and will republish it a little later this week (we expect) as a Tech Alert.

To subscribe and get notifications of Alerts as they are published, please use this link:
 http://www.symantec.com/business/support/index?page=content&key=15143&channel=ALERTS

AKopel
Level 6

I'm curious whether using the 'Accelerator Forced Rescan' option would be a workaround until an official fix is issued (via VMware and/or Symantec). From NetBackup 7.6 Feature Briefing - Accelerator support for Virtual Machines.pdf:

...Figure 4, is an optional setting called Accelerator forced rescan. This is basically a safety net to make certain nothing has gone awry with VMware CBT. Accelerator relies on CBT to accurately provide a list of changed blocks and if something goes amiss, it’s possible some blocks might not get backed up.

Based on this, it seems 'reasonable' that this would catch all the data regardless.

Then the question arises: if this 'works', does it sort of 'reset' CBT when you do this (e.g. the next 'accelerated backup' would be fine because a new baseline has been created)? Or is it still using the 'broken' CBT information even after a forced rescan?

AK

 

MortenSeeberg
Level 6
Partner Accredited

Some info we have gotten through the partner network; your question is answered further down:
 

Here is the VMware article:

http://kb.vmware.com/kb/2090639

 

Briefly, the issue is described as follows:   

During vStorage APIs for Data Protection (VADP) style backups of VMware VMs, when incremental backups are desired the NetBackup “BLIB” option is selected. NetBackup then automatically enables the CBT mechanism for all protected VMs. If any protected VMDK is extended in size past the 128GB boundary (or any power of 2 above that, e.g. 256GB, 512GB, 1024GB) after CBT is enabled, the CBT mechanism becomes corrupt. Subsequent backups of the impacted VMs won’t accurately back up the correct (in-use) blocks of data. In many cases the backup only includes extra blocks, but in some cases it can miss blocks that are actually in use. In that case, either data within the restored VM will be corrupt or the restored VM will fail to boot.

VMware has indicated that they plan on providing a fix for all currently supported versions of ESXi.  VMware has not yet announced the exact nature of this fix or exactly when this fix will become available but indications are that VMware will provide this fix in the next VMware “update” release. 

 

Some additional information:

Q:  Does this issue apply if the VMDK is extended more than 128GB?

A:  Yes.  This issue does not necessarily impact all VMs.  The issue only occurs if a VMDK is extended past the 128GB boundary (e.g. 120GB resized to 130GB) or any power of 2 above 128.
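
The boundary rule described above can be sketched as a small check. This is only an illustration, assuming sizes are given in GB and that the affected boundaries are 128GB and every power of 2 above it, as the partner note states. The function name is hypothetical, and the handling of a disk that ends exactly on a boundary is an assumption, not something the VMware KB is quoted on here.

```python
def crossed_cbt_boundaries(old_gb, new_gb):
    """Return the boundaries (128, 256, 512, ...) that a VMDK extension
    from old_gb to new_gb goes past. An empty list means no affected
    boundary was crossed by this extension."""
    crossed = []
    boundary = 128
    # Assumption: "past the boundary" means old size <= boundary < new size.
    while boundary < new_gb:
        if old_gb <= boundary:
            crossed.append(boundary)
        boundary *= 2
    return crossed


print(crossed_cbt_boundaries(120, 130))  # [128]
print(crossed_cbt_boundaries(200, 250))  # []
```

A result such as `[128]` only says the extension matches the pattern in the description; it does not prove that CBT is actually corrupt on that VM.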

 

Q:  Does this issue apply if my VMDK is already larger than 128GB?

A:  No.  The initial size of the VMDK does not matter.  What matters is if the VMDK is extended beyond the affected limits after it is initially configured and CBT has been enabled.

 

Q:  Is this issue caused by a NetBackup bug?

A:  No.  This is not a NetBackup bug.  This is a VMware bug.  Any vendor that uses VMware’s VADP for backups can be impacted by this issue.

 

Q:  Why doesn’t NetBackup provide a fix for this?

A:  NetBackup engineering has determined that there is no way a backup vendor can provide a reliable solution that covers every contingency configuration.  We’ve studied competitor workarounds for this.  Their solutions cover some use cases but there are still configurations where their solutions will not accurately detect and fix the issue and, in our opinion, give the user a false sense of security.  We have determined that VMware must provide a fix for this.  VMware agrees with this assessment.

 

Q:  What should I tell my customer if they request an immediate solution?

A:  The only solution that will ensure that this issue is fixed for every VM is to manually disable CBT.  This can be performed from within the vSphere interface or a PowerCLI script can be written to do this.  Once CBT is disabled, NetBackup will automatically re-enable CBT during the next scheduled backup.

 

Q:  My customer is using NetBackup Accelerator for VMware.  Will running a backup enabling the “forced rescan” option take care of this issue?

A:  No.  Enabling “forced rescan” will not reset or disable CBT.  CBT must be manually disabled.

 

Q:  Is there a way I can test to determine if previous backups are corrupt?

A:  Instant Recovery for VMware (IRV) can be used to test if the VM image is bootable.  However, this will not detect if there is corruption of data inside the VM.

 

Q:  If CBT is reset, will I ever encounter this issue again?

A:  Yes.  It is possible that this issue can subsequently occur after CBT is reset.  If after the CBT reset the VMDK is extended past one of the affected boundaries, CBT can once again become corrupt mandating that CBT once again be reset.
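
To make the "reset, then extend again" case concrete, here is a minimal sketch that replays a VM's history and reports whether CBT should be treated as suspect. It assumes CBT stays enabled throughout, that sizes are in GB, and that the boundaries are 128GB and each power of 2 above it. The event format and function names are hypothetical.

```python
def crosses_boundary(old_gb, new_gb):
    """True if growing a disk from old_gb to new_gb goes past 128GB
    or any power of 2 above it (assumed boundary rule)."""
    boundary = 128
    while boundary < new_gb:
        if old_gb <= boundary:
            return True
        boundary *= 2
    return False


def cbt_suspect_after(events):
    """Replay ordered events and return True if CBT should be reset.

    Each event is either ("reset",) or ("extend", old_gb, new_gb).
    """
    suspect = False
    for event in events:
        if event[0] == "reset":
            # A manual CBT reset clears any earlier suspicion.
            suspect = False
        elif event[0] == "extend" and crosses_boundary(event[1], event[2]):
            # An extension past a boundary makes CBT suspect again.
            suspect = True
    return suspect


history = [("extend", 120, 130), ("reset",), ("extend", 130, 300)]
print(cbt_suspect_after(history))  # True: 130 -> 300 goes past 256
```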

 

Q:  Will disabling CBT force a full backup of my VMs?

A:  Yes.  Your customer should be aware that once CBT is disabled, the next backup run will force a full backup.   All of the data within the VM will be backed up regardless of whether the backup schedule is a differential, cumulative or full backup schedule.

 

Q:  Does my customer need to log a call with VMware regarding this issue?

A:  We recommend that your customer does log a call with VMware.  We’ve discussed this issue with VMware and have indicated that this has the potential of creating a significant data loss scenario.  Having your customer log a call with VMware will reinforce our stance on this.

MortenSeeberg
Level 6
Partner Accredited

What I think is a bit unclear here: Would a regular non-Accelerated full backup be a functional workaround?

VMware describes the bug as CBT and incremental backup related. So if we disable Accelerator for VMware backups and only run full backups, are we then in the clear?

MortenSeeberg
Level 6
Partner Accredited

Got an answer from someone else: Regular full backups without accelerator should not run into this issue.

Of course, subsequent incrementals are still affected, and running full backups every day is not an option for most, so you still need to reset CBT to be safe:
http://www.symantec.com/business/support/index?page=content&id=TECH197311

aaron_meza
Level 2
Employee

No, that is incorrect. Even without accelerator and even without BLIB enabled we will still rely on CBT to get a list of blocks in-use for full backups (if CBT is enabled on the VM/vmdk).

As for the previous inquiry, doing a forced rescan with Accelerator will also not resolve the issue, because it still uses CBT to get in-use blocks. This is a bug at the VMware hypervisor level, so the only way to work around it is to a) not use CBT or b) reset CBT. For a variety of reasons, (a) is not a practical solution. (b) is understandably burdensome but is necessary to get correct CBT information and prevent possible data loss.

Unfortunately, due to the nature of the issue there is no way a backup vendor can provide a reliable solution that covers every contingency configuration.  We have determined that VMware must provide a fix for this.  VMware agrees with this assessment.

It should be noted that CBT being corrupted doesn't necessarily imply data loss. Corrupted means that CBT isn't an accurate reflection of in-use blocks or blocks that have changed since a particular point in time. While not good either way, in many cases this corruption presents itself as only over-reported changed blocks. So not data loss, but backing up too much data.
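
The difference between over-reporting and real data loss can be sketched with plain set arithmetic. This is a toy model: the block numbers are made up, the function name is hypothetical, and in practice the backup application has no independent "true" in-use set to compare against, which is part of why a vendor-side fix is hard.

```python
def classify_cbt_report(in_use, reported):
    """Compare the blocks really in use with the blocks CBT reports.

    Returns "data loss risk" if any in-use block is missing from the
    report, "over-reported" if the report only contains extra blocks,
    and "accurate" if the two sets match.
    """
    missed = in_use - reported   # in use but never backed up
    extra = reported - in_use    # backed up although not needed
    if missed:
        return "data loss risk"
    if extra:
        return "over-reported"
    return "accurate"


print(classify_cbt_report({1, 2, 3}, {1, 2, 3, 4}))  # over-reported
print(classify_cbt_report({1, 2, 3}, {1, 2}))        # data loss risk
```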

It should also be noted that CBT functions properly going forward. Any incrementals taken after the vmdk resize will accurately account for changed blocks. The full schedule could still be corrupt which is why we recommend the CBT reset regardless.

AKopel
Level 6

So I'm a bit confused by the premise that a forced rescan will not 'work around' the problem. I completely understand that it's not a 'fix', but according to the documentation on forced rescan, it does indeed NOT use CBT:

http://www.symantec.com/business/support/index?page=content&id=HOWTO92079

NetBackup is therefore dependent on VMware CBT for correctly identifying changed blocks. To protect against any potential omissions by underlying VMware CBT, the Accelerator forced rescan option conducts the backup without using CBT.

Like you mentioned, 'maybe' forced rescan doesn't RESET CBT, but only skips it for that run. At least that would ensure a forced rescan backup will be a consistent backup?

While I completely agree VMware needs to fix the issue, it seems that a workaround for NetBackup customers (if it indeed is NOT the behavior today...) would be for the accelerator forced rescan to actually reset CBT (or have an OPTION to reset CBT). Symantec could provide this in an EEB or patch and would be really nice for now and if there were future unknown bugs with CBT like this.

AK

 

MortenSeeberg
Level 6
Partner Accredited

Thank you for correcting that. I have passed the information back to the sender.

aaron_meza
Level 2
Employee

The documentation you point out needs to be edited for clarification; I will make sure that happens. What it is supposed to convey is that a forced rescan isn't a synthetic full. It is like the original backup taken with Accelerator. The document does state this in the 3rd paragraph:

When Accelerator forced rescan is used, all the data on the virtual machine is backed up. This backup is similar to the first VMware Accelerator backup for a policy. For the forced rescan job, the optimization percentage for Accelerator is 0. The duration of the backup is similar to a non-Accelerator full backup.

By "all the data" they mean not just incremental changes. With a forced rescan, we will still query CBT, but instead of asking for changed blocks since X date we will ask for all in-use blocks.

We are certainly actively investigating and evaluating possible workarounds for this issue. My personal opinion is that this whole issue brings to light the fact that maybe there are scenarios in which backup admins want to forgo the benefits of CBT in exchange for the security of knowing it's not optimizing out data. So I agree it would be really nice to have an option to make a full schedule actually back up every block.

PeterCrNZ
Level 2
Partner

It would be great to have a workaround hotfix for NetBackup like the one Veeam describes below. The rumour is that the fix from VMware is not going to be available before the end of January 2015.

Good news, we have managed to squeeze a workaround for VMware CBT bug into the v8 release (hotfix for v7 is coming soon). Jobs will now reset CBT on a VM upon detecting any virtual disk configuration change. Given the time constraints, results of our own testing and continuous updates to the corresponding VMware KB article, we felt this is the safest way to go. Keep in mind that this new functionality will address future virtual disk expansion cases only, so you do need to reset CBT manually on all VMs before installing v8 (unless you have already done so) according to the steps outlined in our KB1940

MortenSeeberg
Level 6
Partner Accredited

I wonder how that relates to http://www.veeam.com/kb1113, which states, just like Symantec's TECH note, that a restart of the VM (power off/on) is required for the reset...

AKopel
Level 6

I agree, seems an 'elegant' way they could provide the workaround would be to have "Accelerator Forced Rescan" Reset CBT. Since you are reading all blocks anyways, might as well reset CBT in the process...