GRT backup makes Hyper-V VM unresponsive...

LouisC · ‎07-30-2013

Hi all!

So I was all excited when SP2 was released because my migration from my current backup solution to BE was on hold due to our move towards Hyper-V 2012. The day it was released, I installed it and thought "Alright! Let's start backing up Hyper-V VMs!" and then BAM! All VMs on a particular CSV became unresponsive because I was backing up one VM on that CSV. The only way out was to shut the host down completely and bring those VMs back up on the other node. Not the best way to start.

Here's my environment and some more of my experience:

Hyper-V Environment
- 2x Hyper-V 2012 hosts w/ BE 2012 SP2 RAWS agent installed.
- Hyper-V hosts are clustered and using CSVs.
- Storage provided by a 6 node P4500 cluster.
- HP Lefthand MPIO used for multipathing.
- Lefthand VSS provider not installed.
- SCVMM 2012 SP1 is used to manage the cluster.

Backup Exec Environment
- Backup Exec 2012 SP2 installed.
- Deduplication Disk Storage configured and is the current destination for Hyper-V backups.

Guest VM Environment
- Server 2012 OS w/ latest Integration Services installed.
- BE 2012 SP2 RAWS agent installed.

I started by trying to back up just one VM to see how the whole process worked before rolling this out to other VMs. The scenario above was what happened after performing the first GRT-enabled backup of the VM. I've since moved the VM to its own CSV so I can test just this server without potentially bringing down the environment.

What I'm seeing is that backing up the VM's VHDXs is fine, every time. If BE just backs up the VHDXs, the job runs successfully and the VM stays responsive the whole time. I can make BE back up just the VHDXs by unchecking "Exclude virtual machines that must be put in a saved state to back up". If I check that box, BE first performs the VHDX backup successfully and then attempts to perform the GRT pass, but the VM becomes unresponsive, more than likely from being put into a saved state.

I've checked everything I can possibly think of to allow for live backups of the guest VM, to no avail. All disks on the VM are basic NTFS drives, each with its shadow copy storage pointed to its own drive. The correct Hyper-V Integration Services are installed and "backups" are enabled in the VM's Integration Services configuration.

I just can't for the life of me figure this out. It seems like it should be pretty straightforward. Any help would be greatly appreciated!

v/r,

Louis

lmosla · ‎07-30-2013

Hello Louis C, please post an image of how you are selecting your machines.

LouisC · ‎07-30-2013

LMosla,

Thanks for the prompt response! I'll attempt to post a picture here soon. I can say that I am selecting them via the "[Cluster's Virtual Name]"\"Microsoft Hyper-V HA Virtual Machines"\"[TestVM]".

Oddly enough, I'm now backing up this particular VM successfully w/ GRT. I have no idea why it started working (or even what made the CSV unresponsive yesterday) but it's good at this moment.

I think I'm going to expand the backup selection to contain another VM and see where it takes me.

v/r,

Louis

MusSeth · ‎07-30-2013

Hello Louis,

Please check the Event Viewer and see if there are any events from the time when the system was frozen. Please let us know about any errors, warnings, or informational events you see there.

LouisC · ‎07-30-2013

One more that is interesting... ever since installing RAWS (the only backup agent ever installed on these Hyper-V hosts) I've been getting the following event repeatedly:

Event ID: 8194 Source: VSS

Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process.

Operation:

Gathering Writer Data

Context:

Writer Class Id: {e8132975-6f93-4464-a53e-1050253ae220}

Writer Name: System Writer

Writer Instance ID: {1486d557-3b49-4314-8e12-db0d4b9c7d98}
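
For anyone reading along, the hr value in that event can be unpacked by hand. This is just a minimal Python sketch of the standard HRESULT bit layout (the helper name decode_hresult is made up for illustration and is not part of any tool). It doesn't reveal which account is being denied, only that the failure is a plain Win32 access-denied code:

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into its failure bit, facility, and code."""
    hr &= 0xFFFFFFFF  # treat the value as an unsigned 32-bit number
    return {
        "failure": bool(hr >> 31),        # top bit set means failure
        "facility": (hr >> 16) & 0x1FFF,  # 7 is FACILITY_WIN32
        "code": hr & 0xFFFF,              # 5 is ERROR_ACCESS_DENIED
    }


# The value reported in Event ID 8194 above.
print(decode_hresult(0x80070005))
```

So 0x80070005 is a failure in the Win32 facility with code 5, which matches the "Access is denied" text in the event.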

v/r,

Louis

LouisC · ‎07-30-2013

- This was an interesting error from the application log as I was trying to shut the host down gracefully:

Event ID: 31 Source: VSS

Volume Shadow Copy Service Warning: A writer with name ASR Writer and ID {be000cbe-11fe-4426-9c58-531aa6355fc4} waited 4294967 seconds for in-progress calls to complete before shutting down.

- This was in the system log after the backup began and right about when everything stopped working:

Event ID: 1146 Source: FailoverClustering

The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource. Please determine which resource and resource DLL is causing the issue and verify it is functioning properly.

- And I get these (from the Microsoft\Windows\Hyper-V-VMMS\Admin log) when backing up a VM now w/ BE:

Event ID: 19050 Source: Hyper-V-VMMS

'TestVM' failed to perform the operation. The virtual machine is not in a valid state to perform the operation. (Virtual machine ID 0A3A39F9-B9B3-4F19-BDB0-ABB2A0076D87)

- This was on HyperVNode1 around the time of the incident:

Event ID: 10028 Source: DistributedCOM

DCOM was unable to communicate with the computer HyperVNode2 using any of the configured protocols; requested by PID c44 (C:\Program Files\Symantec\Backup Exec\RAWS\beremote.exe).

- These were scary events during the outage. This was on HyperVNode1:

Event ID: 5120 Source: FailoverClustering

Cluster Shared Volume 'Volume1' ('HQ-P4000-VOL-2') is no longer available on this node because of 'STATUS_NETWORK_NAME_DELETED(c00000c9)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Event ID: 5142 Source: FailoverClustering

Cluster Shared Volume 'Volume1' ('HQ-P4000-VOL-2') is no longer accessible from this cluster node because of error 'ERROR_TIMEOUT(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
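
Two of the numbers in the events above can be sanity-checked with a few lines of Python. This is only a sketch: reading the 4294967-second wait as a maxed-out unsigned 32-bit millisecond value is an assumption, not something the event states, and decode_ntstatus is a hypothetical helper name. The second part just unpacks the standard NTSTATUS bit layout for c00000c9.

```python
# Event ID 31: ASSUMPTION - the wait is a 32-bit millisecond value at its
# maximum, displayed in whole seconds.
MAX_U32_MS = 0xFFFFFFFF
wait_seconds = MAX_U32_MS // 1000
wait_days = wait_seconds / 86400


def decode_ntstatus(status: int) -> dict:
    """Split a 32-bit NTSTATUS into severity, facility, and code."""
    status &= 0xFFFFFFFF
    return {
        "severity": status >> 30,            # 0 success, 1 info, 2 warning, 3 error
        "facility": (status >> 16) & 0xFFF,  # 0 means no specific facility
        "code": status & 0xFFFF,
    }


print(wait_seconds, round(wait_days, 1))
# Event ID 5120: STATUS_NETWORK_NAME_DELETED (c00000c9)
print(decode_ntstatus(0xC00000C9))
```

If that reading is right, the ASR Writer "wait" was effectively an infinite timeout (about 49.7 days' worth of seconds) rather than a measured duration, and c00000c9 is an error-severity status with no specific facility.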

Other than that, nothing really caught my eye.

v/r,

Louis

lmosla · ‎07-30-2013

Louis, what version of Windows is the Media Server running? Backup Exec 2012 SP2 supports backing up Windows Server 2012, but it does not yet support Windows Server 2012 on the media server itself.

LouisC · ‎07-30-2013

LMosla,

The Backup Exec Media Server is Server 2008 R2 Standard Service Pack 1. Only the Hyper-V hosts and several guest VMs are Server 2012.

v/r,

Louis

LouisC · ‎07-30-2013

Branching out a bit further to back up a few more VMs appears to have been a bad idea. I have now lost all guests on the CSV that is being backed up again...

MusSeth · ‎07-30-2013

Hello Louis,

I found this solution on another forum:

"I found out the problem (error 8194...IVssWriterCallback...) on Hyper-V host when backing up VMs on a CSV :
Go to DCOM setup : dcomcnfg --> Expand Component Services, Computers --> Right-click My Computer --> Properties --> COM Security tab.
Under Access Permission click Edit Default. --> add the "Network Service" account with Local Access allowed.
Restart the computer !
no more 8194 error."

http://forums.veeam.com/viewtopic.php?f=25&t=9486

Since you are also seeing DCOM errors in the event viewer, you could try this. However, I would suggest checking on TechNet as well before you implement this solution.

I see lots of Windows 2012 users are getting this error with different backup applications.

LouisC · ‎07-30-2013

I saw that but wasn't sure if that was going to be a "recommended" solution.

LouisC · ‎07-30-2013

This will be a bit challenging due to the hosts being Hyper-V Core boxes but I'll see what I can do.

LouisC · ‎07-30-2013

Oddly enough, all the VMs came back online. Looking at it, HyperVHost1 and HyperVHost2 never complained; it's the VMs that suffered. Apparently they lost their ability to write to their disks. One of the guest VMs logged this during the backup of the CSV its VHDXs reside on:

Event ID: 129 Source: storvsc

Reset to device, \Device\RaidPort0, was issued.

EDIT: I suppose I'll clarify the statement about the hosts not complaining. Looking through their event logs, there are no events hinting at the CSVs ever falling offline. Furthermore, looking at the cluster logs, not a single CSV resource failed.

I do see the following 252 times in a row (every two seconds) in the event logs on the host that had only the 3 VMs I was trying to back up:

Event ID: 10014 Source: DistributedCOM

The activation for CLSID {ECABAFB9-7F19-11D2-978E-0000F8757E2A} failed because remote activations for COM+ are disabled. To enable this functionality use Server Manager to install the COM+ Network Access feature in the Application Server role.
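
To put that repeat count in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes the events really were evenly spaced two seconds apart from first to last, which is an approximation:

```python
EVENT_COUNT = 252      # consecutive DistributedCOM 10014 events
INTERVAL_SECONDS = 2   # assumed fixed spacing between events

# N events span N - 1 intervals.
span_seconds = (EVENT_COUNT - 1) * INTERVAL_SECONDS
span_minutes = span_seconds / 60

print(f"{span_seconds} s, about {span_minutes:.1f} min")
```

That works out to roughly eight and a half minutes of continuous failed activations.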

What's interesting is the guest VMs that had the problem were on a different node but on the same CSV as the one being backed up during the outage.

MusSeth · ‎07-30-2013

I would suggest opening a support case for this, as we might have to enable logging in order to isolate the issue.

LouisC · ‎07-31-2013

Well that was an interesting night last night.

I created a support case as recommended. A technician got back to me last night. With his guidance, we started the backup job that was causing the outages. This particular job starts to bring down the Hyper-V cluster about an hour into it. Around 15 minutes into the job, the tech said he'd call me back in a few minutes. He did call back around 40-50 minutes into the job and told me it was the end of his shift and another technician would call me back in roughly 30-40 minutes. I explained that this job would start to cause an outage in approximately 10-20 minutes. He told me he would escalate it and someone would get with me in 30-40 minutes. Lo and behold, the job started to crash the Hyper-V cluster 10-20 minutes after I got off the phone with him, and no one had called me back.

So now I had a severe outage happening. I started an online chat via the support site and explained I had an open case, that I had a production outage happening, and I was "in-between" technicians. They told me they couldn't escalate it to a priority 1 via chat so I should call in. So that's what I did, I called back in.

After explaining my situation to another person, they got me in touch with another technician. While the outage was happening, we looked around. It seemed like no real troubleshooting was happening, just waiting for a complete failure (of either the backup job or the Hyper-V cluster). We ended up killing the backup engine on the media server but that didn't help bring the VMs back up. Eventually, the cluster considered one node dead and started to bring VMs up on the second node. At this point, I had one Linux box complaining about a possible corrupt volume and asking to attempt a repair. I got lucky and the repair was successful, but nonetheless, it was a bit scary.

So once I had the VMs back up and the criticality of the outage had subsided, he had me change some settings on the backup job (VSS provider and storage destination), enable debug logging, and start the job back up. After the backup job was started and logging began, he told me to let the job fully complete, attach the logs to the case, and expect a call tomorrow (which would be today). Then he got off the phone with me again, knowing that this backup job could potentially cause another severe outage!

I was completely shocked! I was let go twice when facing a potential outage. Luckily, the second job failed immediately after starting and right after the tech got off the phone with me (that ought to show how quickly I was let go). I'm still working with them but I'll be reaching out to my regional sales rep to see if I can get a different path to engineers for this case.

LouisC · ‎07-31-2013

I wonder if I'm experiencing this....

http://support.microsoft.com/kb/2813630

Not sure if anyone else w/ a Hyper-V cluster has this installed or if they have any thoughts.

LouisC · ‎07-31-2013

I think I may be on the right path... I found this KB that includes KB 2813630, and it's titled "Update that improves cloud service provider resiliency in Windows Server 2012".

http://support.microsoft.com/kb/2870270

This blog post seems to further point towards KB 2813630 being a resolution.

http://blog.aaronmarks.com/?p=154

MusSeth · ‎07-31-2013

Hello Louis,

I apologize for what happened with support last night. Could you please provide me the case number? I will explain the situation to the engineer assigned to the case and try to get it expedited. The article you posted does seem to refer to the same issue, so I would suggest applying the fix and then checking whether that resolves it. However, it would be best to first try a backup using a Windows utility to confirm it's the same issue before you apply that patch.

LouisC · ‎07-31-2013

MusSeth,

The case number is 04837204.

I'll take a look at attempting a backup using a Windows Utility first.

v/r,

Louis

LouisC · ‎07-31-2013

Turns out my case was escalated and I'm currently working with an engineer that specializes in Hyper-V. We are pushing forward with KB 2838669 and we'll see what the results are.

http://support.microsoft.com/kb/2838669/EN-US
