VMWare NBD Backups Spend a Long Time Waiting

NESL_Netops · ‎07-02-2012

We're on Backup Exec 2012, currently using NBD as our transport mode for VMWare backups and I've noticed that it's spending a ridiculous amount of time doing (what I don't necessarily assume is) nothing. It takes over 5hrs to backup 133GB worth of VMs, while other, larger backups running at similar speeds complete in more than half that time. I've gone through the job logs and on average, it spends roughly 90min backing up the VM data, and another 5-10min snapping/unsnapping VMs. What it's doing for the remaining 3.5hrs I can't for the life of me figure out.

Just to give an idea, here's a recent schedule:

11:00pm - 11:01pm - Create Snapshot on VMSvr01
11:01pm - 11:53pm - ????
11:53pm - 11:57pm - Backup VMSvr01

.....

1:57am - 1:58am - Create Snapshot on VMSvr02
1:58am - 2:28am - ????
2:28am - 2:33am - Backup VMSvr02

Anyone else run into this scenario? Is there any part of the procedure that is"preparing" the VM somehow that explains these long wait times? From what I can tell, or VCenter or ESX hosts aren't being overutilized or performing any resource-heavy activities at these times.

teiva-boy · ‎07-02-2012

Any way you can check your storage IOPS directly during this period too? Assuming this is a SAN..

NESL_Netops · ‎07-03-2012

I'm not 100% certain how to query the IOPS on our SAN.

I've narrowed down a bit of information. The VM takes literally no time to snap (5-10sec), but afterwards Backup Exec is waiting for some process to complete before it begins processing the backup. I'm having trouble figuring out if it's waiting for a VCB process to complete, or one of it's own processes.

I've gone through VMWare sys logs as well as the bengine and beremote debug logs and notice that during these lengthy intervals, neither is logging any relevant activity, and Backup Exec specifically is not logging any processes at all. I have a test job that has a 5min gap before the backup begins, and the logs report not activity during that gap. What I notice just before that in the beremote log is the following entries:

[5844] 2012-07-03T10:08:12.568 [ndmp\loops]         - LP_ENV::createVMWareSnapshot(): creating a snapshot 'SYMC-FULL 03-07-2012 10:08' for the vm '(DC)VDATACENTER(DC)\vm\VSPHERETESTGUEST'
[5844] 2012-07-03T10:08:12.568 [fsys\shared]        - vddkCreateVMWareSnapshot() creating a snapshot 'SYMC-FULL 03-07-2012 10:08' for the VM '(DC)VDATACENTER(DC)\vm\VSPHERETESTGUEST'
[5844] 2012-07-03T10:08:12.660 [fsys\shared]        - SymVmTools: WaitForTaskCompleteWithTimeout: Waiting for 0 seconds for task to complete
[5844] 2012-07-03T10:08:15.577 [fsys\shared]        - vddkDisconnectFromVCServer(): vmcDisconnect OK, vddkSession = 0000000013CC03D0
[5844] 2012-07-03T10:08:15.578 [fsys\shared]        - vddkDisconnectFromVCServer(): status = 0X0
[5844] 2012-07-03T10:08:15.579 [ndmp\loops]         - LP_ENV::GenIPCObject() Opening the new ipc BACKUP01-VM-VCB-{B3375F0D-14A3-4F21-937A-38605C95200D} at location C:\Program Files\Symantec\Backup Exec\Data.
[5844] 2012-07-03T10:08:15.580 [fsys\shared]        - [memento::OpenWrite] - C:\Program Files\Symantec\Backup Exec\Data\BACKUP01-VM-VCB-{B3375F0D-14A3-4F21-937A-38605C95200D}
[5844] 2012-07-03T10:08:15.582 [fsys\vmvcb]         - VM_VCBPROXY_FS::DisconnectFromVCServer: status = 0X0
[5844] 2012-07-03T10:08:15.632 [fsys\vmvcb]         - SymVmTools: LoginAPI: key            = D02B9D15-7445-43FC-8DD8-9A0FECE82C5F
[5844] 2012-07-03T10:08:15.632 [fsys\vmvcb]         - VM_VCBPROXY_FS::ConnectToVCServer: OK, session = 0000000013CC03D0
[5844] 2012-07-03T10:08:15.633 [fsys\vmvcb]         - SymVmTools: BuildDatabases: Building databases
[5844] 2012-07-03T10:08:16.410 [fsys\vmvcb]         - SymVmTools: BuildDatabases: Building databases
[5844] 2012-07-03T10:08:17.235 [fsys\vmvcb]         - SymVmTools: SetVmPowerStateAPI: enableChangeTracking: err = 0
[4592] 2012-07-03T10:12:37.313 [fsys\shared]        - Calling the vixdiskLib api: VixDiskLib_PrepareForAccess('vcenter.domain.com')

...I'm curious about the part in bold. Is it attempting to download the VM locally before kicking off the backup? I'm thinking no because the NIC is reporting 0% utilization during the gap, but I can't for the life of me figure out what is happening during that time to know whether or not it's something I can address, or if I'm stuck with it. If I knew it was VCB was taking a long time to do something like mirror the VM or something of the like, and I could then chalk it up to SAN performance, then fine, but I can't even determine what's going on.

AshutoshTamhank · ‎07-03-2012

Suspect this to be related to a similar issue reported for NetBackup due to a new VMware API VixDiskLib_PrepareForAccess which is used to notify the host of the virtual machine that the disks of the virtual machine will be opened. The host disables operations on the virtual machine that may be adversely affected if they are performed while the disks are open by a third party application.

The Netbackup technote: http://www.symantec.com/business/support/index?page=content&id=TECH179346

I recommend you to report this issue to support for further investigation with a matching set of job log and beremote logs. Collect them using vxgather. (%program files%/symantec/backup exec/vxgather.exe). You could open a case via mysupport.symantec.com

ManikandanPC · ‎07-04-2012

NBD method is nothing but backup using LAN and not SAN to transfer your data from datastore to backup host(VM proxy server). Seeing the above the logs it appers to be VCB and not VADP. Your backup may be waiting for long time because of the following reasons : 1. The VM may be hugh and it takes long time to quisce the VM. 2. VCB mount point on the proxy server will not have any free space to dumb your VM snapshot for backup. Once the other VM backup completes, the snapshots gets cleared we get the free space on VCB mounts. If you trigger the backup at this point of time your backup will be completing soon

ManikandanPC · ‎07-04-2012

NBD method is nothing but backup using LAN and not SAN to transfer your data from datastore to backup host(VM proxy server). Seeing the above the logs it appers to be VCB and not VADP. Your backup may be waiting for long time because of the following reasons : 1. The VM may be hugh and it takes long time to quisce the VM. 2. VCB mount point on the proxy server will not have any free space to dumb your VM snapshot for backup. Once the other VM backup completes, the snapshots gets cleared we get the free space on VCB mounts. If you trigger the backup at this point of time your backup will be completing soon

NESL_Netops · ‎07-09-2012

Thanks for the response.

Regarding #1: If that's the case, in one instance of the job running, it took 50+min to quiesce a 5GB VM. Seems unlikely. How would I know if the quiesce process was taking place and/or if BExec was waiting for VM quiescing? Any log information that reports that?

Regarding VCB vs. VADP: I was wondering about that. Am I limited to LAN-free backups for using the vStorage API, or can this be used even when the transport method is NBD? If so, I'm not sure what I have to do to instruct it to use that over VCB (doesn't seem to be anywhere I can configure that).

Lastly, does the BExec media server serve as the VCB proxy? Is it physically making a local copy of the VM after the snapshot process but BEFORE the backup? If so, I'm confused as to why network utilization drops to 0% during this interval.

teiva-boy · ‎07-09-2012

My assumption that the labeling of VCB in the logs, is nothing more than legacy coding/nomenclature. Where as it's still using VADP as needed.

VCB would copy the snapshot to a staging area, then the copy is backed up.

VADP will stream copy the snapshot directly to the storage target, with no staging process needed.

But I do believe some debug logs are what is needed and a support case opened.

ManikandanPC · ‎07-10-2012

Yes, in few cases I have seen quiscing action takes long for even smaller machines. Here is the log details http://www.symantec.com/business/support/index?page=content&id=TECH63926. My sugesstion would be Increase the verbose level and work with symantec to find where exactly is the bottle neck.

Regarding VCB vs VADP - You are allowed to choose your own transport mode. It even allows you through SSL. And if you want it to VCB you can select it in your policy.

Yes, your BE media server can also act as a VM proxy server/off host backup server.

NESL_Netops · ‎07-11-2012

Thanks for everyone's input. I've open a case and will report with the results when we reach a conclusion.

NESL_Netops · ‎07-23-2012

OK, so support says it's related to an issue that was remediated (not going to say solved just yet) in a recent hotfix that was released. Here's the article for that hotfix that I was referred to:

http://www.symantec.com/business/support/index?page=content&id=TECH189571

There doesn't appear to be anything specific about the issue in here, but as I'm told the wait time is due to Backup Exec waiting for access to the target device. After applying the hotfix, we saw a slight improvement (reduced by an hour or so), but wait times still occasionally reach 30min. It's a step in the right direction I suppose.

I'm not sure how much I believe the "waiting for the target" answer I got. It was even word of mouth heard from another engineer and then sort of dismissed as "yeah, I'm not sure that's just what he said" when I was asking follow up questions. Seems like if it was waiting for the target device, I'd be having more problems with other backups using the same device, and regardless of job type.

In the end, this had gotten dragged out for 3 weeks and I was at least satisfied that they knew there was an issue and were looking into it enough that I was comfortable knowing it would get addressed in a future service pack or somehting. I hope.

VOX

VMWare NBD Backups Spend a Long Time Waiting