Snapshot error status 156 using VCB

Environment:
NB 6.5.3.1 on dedicated master, media and proxy.
Proxy attached to 2 CLARiiONS, one for source vmdks, the other for snapshot cache.
VCB 1.5
Windows Server 2003 R2 SP2 x64 is the VCB proxy. Attached to array over 2Gb FC.
VM Backup : 1
transfer type: 3
vmdk type: 1

Backing up one VM at a time seems to work. Running them in "streams" results in snapshot errors, although I don't see any bottlenecks anywhere.  When I set up the VCB policy I enabled 5 streams. When the jobs kick off, only the first one succeeds; the others all fail with "- Critical bpbrm(pid=5932) from client qa01: FTL - snapshot creation failed, status 156".

90% of our VMs are Linux, so all I care about is fulls, I guess. I'm attaching the bpfis log from the proxy; the bpvmutil log has nothing relevant to this issue, as the discovery is all running fine, just incredibly slowly. Has anyone seen this issue? Thanks in advance.
There are multiple failing commands and errors, but I do not see how to resolve this based on other posts.
Accepted Solution!

It is not documented, however

It is not documented, however the behavior you describe sounds like a similar issue we are having, where if you try to run multiple jobs against the same LUN/datastore, only one will work and the others will fail with a status 156.  I have heard that you can only snap one VM on a given LUN at a time, although Symantec has not said anything about this.

In our environment we have many VMs per datastore (LUN), and when we run multiple backups where the clients are on the same LUN, we run into the snapshot error issue.  If we run multiple jobs but have them on different LUNs, we have no issue.  The problem is the quiesce step in the flow chart: it cannot be done for two jobs on the same LUN simultaneously.  Our solution will be to create separate policies for each of our datastores and run them simultaneously, but have only one client per policy execute at a time.  That way we can load the proxy and avoid the snapshot 156 errors caused by the LUN constraint.

I am in the process of testing a Perl script I wrote that self-discovers the datastores and the clients on them through Virtual Center, using the Perl API tool from VMware.  Once the script discovers the datastores and clients, it injects them into the corresponding policies in NetBackup.   Think of it as autodiscovery for NetBackup and VMs.
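To illustrate the grouping step only, here is a minimal sketch in Python (not the Perl script described above). It assumes you already have a VM-to-datastore mapping from some discovery step; the function names, the `VCB_` prefix and the policy naming scheme are hypothetical, and actually creating or updating NetBackup policies is not shown.

```python
import re
from collections import defaultdict


def group_clients_by_datastore(vm_to_datastore):
    """Return {datastore: [vm, ...]} from a {vm: datastore} mapping."""
    groups = defaultdict(list)
    for vm, datastore in sorted(vm_to_datastore.items()):
        groups[datastore].append(vm)
    return dict(groups)


def policy_name(datastore, prefix="VCB_"):
    """Build a hypothetical policy name: prefix plus a sanitized datastore name."""
    return prefix + re.sub(r"[^A-Za-z0-9]+", "_", datastore).strip("_")


# Example input (made-up datastore names); one policy per datastore.
discovered = {"qa01": "lun_a", "qa02": "lun_a", "qa03": "lun_b"}
for datastore, clients in group_clients_by_datastore(discovered).items():
    print(policy_name(datastore), clients)
```

Each resulting policy would then be limited to one job at a time, so that no two clients on the same datastore are snapped together.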

Are the VMs that fail all on the same LUN of the CLARiiON?  Do you have VMs on other LUNs?  What happens if you take a client from each of the LUNs, put them into a test policy, and allow all of them to run?

Regards,

Benjamin Schmaus


check this

http://seer.entsupport.symantec.com/docs/319289.htm
http://seer.entsupport.symantec.com/docs/316460.htm

hmm

Not seeing either of those errors, not using PowerPath, not getting the solution.
...
12:10:10.119 [844.4996] <32> onlfi_freeze_fim_fs: VfMS error 11; see following messages:
12:10:10.119 [844.4996] <32> onlfi_freeze_fim_fs: Fatal method error was reported
12:10:10.119 [844.4996] <32> onlfi_freeze_fim_fs: vfm_freeze: method: VMware, type: FIM, function: VMware_freeze
12:10:10.119 [844.4996] <32> onlfi_freeze_fim_fs: VfMS method error 1; see following message:
12:10:10.119 [844.4996] <32> onlfi_freeze_fim_fs: VMware_freeze: Unable to get servers at line 1669
12:10:10.119 [844.4996] <32> onlfi_freeze: VfMS error 11; see following messages:
12:10:10.119 [844.4996] <32> onlfi_freeze: Fatal method error was reported
12:10:10.119 [844.4996] <32> onlfi_freeze: vfm_freeze: method: VMware, type: FIM, function: VMware_freeze
12:10:10.119 [844.4996] <32> onlfi_freeze: VfMS method error 1; see following message:
12:10:10.119 [844.4996] <32> onlfi_freeze: VMware_freeze: Unable to get servers at line 1669
12:10:10.119 [844.4996] <16> bpfis: FTL - snapshot creation failed, status 156
12:10:10.119 [844.4996] <4> onlfi_thaw: Thawing ALL_LOCAL_DRIVES using snapshot method VMware.
12:10:10.119 [844.4996] <8> onlfi_thaw: ALL_LOCAL_DRIVES is not frozen
12:10:10.119 [844.4996] <8> bpfis: WRN - snapshot delete returned status 20
12:10:10.119 [844.4996] <4> bpfis: INF - EXIT STATUS 156: snapshot error encountered
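When a bpfis log is long, it can help to pull out only the higher-numbered entries. The sketch below parses lines in the layout shown in the excerpt above. The field names are guesses: I am assuming the bracketed pair is process and thread ID, and that the number in angle brackets is a level where 16 and above marks the failures (which matches this excerpt, but is not taken from Symantec documentation).

```python
import re
from typing import NamedTuple, Optional

# Layout inferred from the excerpt: time [a.b] <level> function: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<level>\d+)> "
    r"(?P<func>[^:]+): (?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    time: str
    pid: int    # assumed meaning of the first bracketed number
    tid: int    # assumed meaning of the second bracketed number
    level: int
    func: str
    msg: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None for lines that do not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["time"], int(m["pid"]), int(m["tid"]),
                    int(m["level"]), m["func"], m["msg"])


def high_level_entries(lines, min_level=16):
    """Keep only parsed entries whose level is at least min_level."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level >= min_level]
```

Run against the excerpt above, this keeps the onlfi_freeze lines and the "snapshot creation failed" line and drops the thaw and exit-status lines.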

Best Recommendations

Just remember that Symantec's best practices only support a maximum of 4 streams at a time per proxy. Also, are all of these systems coming from the same virtual host? There may be issues with the systems all attempting to snapshot at once.


How is the I/O load on the actual disks themselves? Unfortunately, snapshots are SUPER sensitive to even the smallest I/O spikes, since these can lead to data corruption.


I have been very successful running around 3 at a time; even at 4 I've seen issues like this arise. It also depends on how busy the VMs themselves are at the time of the snapshot, since they have to manage all the additional writes that occur during the snapshot.


Another way to diagnose whether it's a NetBackup issue is to run vcbMounter manually for all 5 of them, one after another, outside of NetBackup. If any of them fail, then the cause is more likely the load on the VMs than NetBackup.
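A rough way to script that manual test is sketched below in Python. It only uses the vcbMounter switches that appear in the VMware example quoted later in this thread (-h, -u, -p, -a, -r, -t, -L); the helper names and the "<shortname>-backup" mount directory naming are my own, and the host, user and paths are placeholders.

```python
import ntpath
import subprocess


def build_vcbmounter_cmd(vc_host, user, password, vm_fqdn, mount_root, verbosity=6):
    """Build a vcbMounter full-VM command line as a list of arguments."""
    short = vm_fqdn.split(".")[0]
    # ntpath keeps Windows-style separators even if this is built elsewhere.
    target = ntpath.join(mount_root, f"{short}-backup")
    return [
        "vcbMounter", "-h", vc_host, "-u", user, "-p", password,
        "-a", f"ipaddr:{vm_fqdn}", "-r", target,
        "-t", "fullvm", "-L", str(verbosity),
    ]


def run_sequentially(commands):
    """Run each command in turn on the proxy; return (command, exit code) pairs."""
    results = []
    for cmd in commands:
        completed = subprocess.run(cmd, capture_output=True, text=True)
        results.append((cmd, completed.returncode))
    return results


# Placeholder values; building the list does not run anything.
cmd = build_vcbmounter_cmd("bc.vmware.com", "backup", "XXXXX",
                           "vm1.vmware.com", r"d:\vtl")
print(" ".join(cmd))
```

Calling run_sequentially() with one command per VM on the proxy gives you an exit code per VM without NetBackup in the picture.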

Take NetBackup out of the equation / Independent Disk mode set?

Agree with Outbacker,
also I would reduce the number of streams if you can.

Here are some snippets from the VMware documentation PDF "Understanding VMware Consolidated Backup":

The VMware Infrastructure Basic System Administration guide provides details on setting up networking on the ESX Server host and on setting up VirtualCenter. Setting up shared storage is essential for Consolidated Backup. Make sure the Consolidated Backup proxy server can access LUNs that hold VMware VMFS datastores. For more information on SAN configuration, see the SAN Configuration Guide.

A special debugging utility — vcbSanDbg — is provided with the Consolidated Backup software. This utility is a Windows program that collects the SAN information as seen from the proxy server. This information is a good starting point to determine whether your storage is properly configured.

When the VMware Infrastructure software reports an error, various logs are updated with diagnostic information. Backup agent logs on the proxy server may also contain information useful in diagnosing errors. The vcbMounter output file can contain vital information to help narrow the scope of diagnostic investigations.

You can increase the verbosity level of vcbMounter output for diagnostic purposes by adding the -L 6 parameter to the command line used to back up the virtual machine, as shown below:
vcbMounter -h bc.vmware.com -u backup -p XXXXX -a ipaddr:vm1.vmware.com -r d:\vtl\vm1-backup -t fullvm -L 6

This command generates output including information messages that are very useful in diagnosing problems. If snapshot problems are reported in the vcbMounter output, one place to look for additional information is the ESX Server host on which the virtual machine is running.

On the service console, check hostd.log and vmkernel.log. These log files are in the /var/vmware/log directory.

----------------------
Make sure your vmdks are not set to independent mode (although that can be a handy way to exclude vmdks from your backup).
Check that your VM is not already in snapshot mode, and that no old orphaned delta files remain on the datastore.

Read here
http://www.yellow-bricks.com/2008/01/11/vcb-problems-with-independent-disks/

---------------------

For your curiosity and mine, here is roughly what happens during a backup with VCB and NetBackup.

[Image: flow chart of the VCB backup process with NetBackup]


More info

It is a good flow chart. But I can't take the credit.

It was taken from a nice white paper on HP's Enterprise Backup Solution website entitled:

VMware Consolidated Backup EBS Solutions guide for Symantec Veritas NetBackup

It was authored at the time of NetBackup 6.5.1 but is very relevant.

Since you guys like diagrams, here's another NICE one to show off VCB.

VMware Consolidated Backup

Above is the VCB Visio shape from a library of VMware.com shapes

So for all you documentation aficionados, here are some high-quality Visio stencils and shapes, courtesy of VMware, that might help with documenting your virtual environment and backup solution.

A Friday Freebie if you will.

http://viops.vmware.com/home/blogs/strategy/2009/02/23/build-your-own-high-quality-vmware-diagrams-and-presentations

My solution for error 156

I just checked an option in the Client Attributes (if your problem is on a virtual machine).

Go to Master Server -> Client Attributes -> Find and select the Client -> Click on Windows Open File Backup options and check the Enable Open File Backup option. This happened on all my virtual machines, not on a physical machine.

I also set the option Disable Snapshot and Continue in the Snapshot Error Control area.


Here is the link to the Symantec documentation:
http://seer.entsupport.symantec.com/docs/276739.htm

Good Luck...

I don't think the solution

I don't think the solution above relates to VCB backups.  We continue to have 156 errors and I've contacted support on a few occasions with no results.  I sent them my logs and filled out a huge form explaining our VCB environment, but they have been no help.  Some people say to set limit jobs per policy to 1, but if we do this our backups will never finish, since we have about 105 servers being backed up by VCB.  Not to mention the TIR file read errors for our incremental VCB jobs... Is anyone happy with their VCB backups?  Does anyone get consistent backups?

Re: Backmeupson

We have roughly 200 VCB backups that occur daily with 1 master and we're doing fine, but we don't set it to 1. You should, however, only be backing up 1 VM at a time per LUN. This is a limitation on the SAN/VM side, and it exists for a good reason: to prevent file corruption.

I'd be glad to help if you are able to send me a bpfis log (you can clean it up if you want). I know that servers that are consistently too busy will sometimes fail with 156s, since the actual point at which the system can lock is different. There is a "patch" that Symantec can offer if that is the case, but with it they DO NOT guarantee that the backup will complete consistently or that you will be able to restore that data.


Thanks for your response.  I

Thanks for your response.  I haven't found any people out there with that many clients in VCB.

We have our policies set up so that all servers in a given policy are on the same datastore.  We currently have limit jobs per policy set to 2; we had this as high as 5 previously.  Multiplexing on the tape drives is set to 10.  We have done a lot of tweaking with limit jobs per policy and the snap_lock_timeout key but can't seem to get a consistent backup.  Are you getting almost 100% completion of backups?  I can provide the bpfis logs if that would help.

We also get a lot of 156

We also get a lot of 156 errors.  I'm backing up using option 0 with about 15 VCB clients added to the policy (I have plenty more to add) and the jobs limited to two.  I logged a call with NetBackup support, who were unable to provide a solution.   I checked the servers that were receiving the 156 errors to see if they were using the same LUN (good idea, by the way) but this wasn't the case.  That's not to say that these particular LUNs aren't very busy, though.

I also get the occasional 156

I also get the occasional 156 error, but this is due to trying to snap on the same LUN, something I will be addressing shortly. As Outbacker said, only one job per LUN at any time!

Are you guys referring to 1

Are you guys referring to 1 snapshot taking place per LUN, or 1 backup job per LUN?  By 1 backup job I mean: take snapshot - mount to proxy mount point - write to tape - complete.

It's very random, because I have nights where incrementals all complete successfully with jobs per policy set to 2, and other nights where 5-10 out of 105 clients fail with a 156.

It should really be able to

It should really be possible to take one snapshot per LUN and, once that snap is closed, run another, but you would not be able to set it up to run like that without potential conflicts between the snaps. I run one job per LUN. I have the policy set up with all the VMs from 2 different LUNs and tick one job per policy. This way you don't get too many jobs running at once and putting heavy load on the disks. Not that it should be too much of an issue for a SAN, but it all depends on your overall setup.
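To make the "one job per LUN" idea concrete, here is a small planning sketch in Python. It is not something NetBackup does for you: it just splits a VM-to-LUN mapping (which you would have to supply) into waves in which no two VMs share a LUN and no wave exceeds a job limit. The function name and the default limit of 3 are illustrative, and it treats a whole backup job, not just the snapshot, as occupying the LUN.

```python
def plan_waves(vm_to_lun, max_concurrent=3):
    """Split VMs into waves with at most one VM per LUN and max_concurrent VMs per wave."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    pending = sorted(vm_to_lun.items())
    waves = []
    while pending:
        wave, used_luns, leftover = [], set(), []
        for vm, lun in pending:
            if lun not in used_luns and len(wave) < max_concurrent:
                wave.append(vm)
                used_luns.add(lun)
            else:
                # LUN already busy in this wave, or the wave is full.
                leftover.append((vm, lun))
        waves.append(wave)
        pending = leftover
    return waves


# Made-up VM and LUN names.
example = {"a": "lun1", "b": "lun1", "c": "lun2", "d": "lun3", "e": "lun4"}
for number, wave in enumerate(plan_waves(example), start=1):
    print(number, wave)
```

Each wave could then map to a policy or a schedule window, depending on how you organise your jobs.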

Two Stage

Clarity:

1 snapshot per LUN is a good rule of thumb, but high activity on that LUN also needs to be taken into account (if it's an application that writes, reads and locks hundreds of files every second, that's one of those cases where VCB will not work without the override patch; again, no guarantee from Symantec on that one).

We get 100% completion of backups... I am not comfortable with getting less than that. Here is my current setup:


2 VCB Proxy servers - 1 per Host Group.


Separate policies depending on what I need:
VCB_VCBSERVERNAMEHERE_DayTime (Certain applications of ours are HEAVILY loaded at night... you should always schedule snapshots away from critical points of use per application / per LUN.)
VCB_VCBSERVERNAMEHERE_NightTime (same as above, but run at night for day-heavy apps)
VCB_VCBSERVERNAMEHERE_ImageOnly (sometimes I have servers that I want to do flash snapshots on but can't map, simply because they have 25,000,000 files; I sometimes get 156s on machines that try to map with that high a file count)
VCB_VCBSERVERNAMEHERE_NTOnly (sometimes I want only specific directories, so I use this option... nice to be able to do this without a NetBackup license)


Inside the configuration, we have the following snapshot settings:
Snapshot Mount Point: E:\mnt (This is a dedicated drive meant only for snapshots. Having other actively used applications on the same drive as the snapshot mount point on the VCB proxy server can slow down backups and cause issues, as it needs all the "write" capability it can get.)

Virtual Machine Backup - option 3 (Full-Mapped with NT Incr)
trantype - option 3 (This means that it'll try over the SAN first, but if that doesn't work it'll go across the network... this is nice just as a failover option)
VMDKType - option 0.


I currently limit the number of jobs at a time to 3. Snapshots complete fairly fast, so this is doable. Symantec supports a MAXIMUM of 4 as a best practice.


Also remember that the LUN that you are writing to on the VCB proxy needs to stay responsive under the amount of writes/reads it has to perform. I don't recommend sharing it with anything else if you can avoid it.

Some important items to note:
1. If you have regular backups running during the snapshot that affect that LUN, you can potentially see a 156 snapshot error from the timeout with the LUN.
2. Do not attempt to do backups of anything with databases... not only could it cause a crash on your VM, but the data is useless unless everything is stopped.
3. Be mindful of the size of your disks. Remember, snapshots have to copy the entire VMDK filesystem down to your proxy... If you're attempting 2 TB of data, it WILL take a while.
4. A quality VCB server is important: don't run it as a VM, and don't use the latest trashed piece of hardware you could find either :)
5. Test different scenarios with your 156s. Does it work if you run it by itself? Does it run more efficiently at one time of day than another?
6. If a snapshot with the same name as the server already exists on the VCB server and was not cleaned up, I've also seen 156s generated.


Also, attempt to do the same thing with the VCB proxy only. If you take the NetBackup software out of the scenario and it still fails, then it's something to bring to VMware. Command-line information on how to do this is readily available on the net.


 

Disk space?

How is your disk space on that LUN?  Each snapshot is going to create a new file that expands as needed to hold the data changes.  Could it be that your LUN is close to full and two good-sized snapshots cause this problem?

There is also some documentation here: communities.vmware.com/thread/121451
that talks about increasing the timeouts associated with VCB backups, whether through NetBackup or direct.  We have seen problems in our environment with Disk Lease timeouts while copying from the SAN, and the link above has helped...

Scott

LUN presentation

I think the LUN where the .vmdk resides has to be presented to the VCB proxy server.

Disk space and snap clean ups

We have had similar issues with 156 errors, and this seems to be the error reported for everything, whether it is a snapshot error or not.

Scott's note about the disk space available on the VCB host is good. Also, note that if the disk fills up while the backup snap is being taken, it will often leave remnants of the old snapshot.  We have to go into Virtual Center and clean up the snapshots to free up the space again.
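A simple pre-flight check along these lines can be sketched in Python. It only looks at free space on a path visible to the machine running it (for example the snapshot mount point on the proxy), not at free space inside a VMFS datastore. The function name and the 1.2 safety factor are arbitrary choices for illustration.

```python
import shutil


def has_snapshot_headroom(path, required_bytes, safety_factor=1.2):
    """Return True if the volume holding path has room for required_bytes plus a margin."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes * safety_factor


# Example: is there room for a 50 GiB image (plus margin) on the current drive?
print(has_snapshot_headroom(".", 50 * 1024**3))
```

Running a check like this before the job starts will not prevent every 156, but it catches the "disk nearly full" case before a half-finished snapshot is left behind.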

Also, be careful when using snaps, because the snapshot space on the ESX server is the same area where the swap file is placed.  If that space fills up and the swap file cannot grow, it can cause your VMs to crash.  It is a wonderful thing when this happens, NOT!

Joe

Another offender of the 156

Short names.  Make sure you are using the FQDN; that fixed many of my 156 errors.
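If you have a long client list, a quick way to spot candidates is to flag any name without a domain part. This is a trivial Python sketch; the function name is made up, and the client list would come from wherever you keep your policy clients.

```python
def short_names(clients):
    """Return the client names that have no domain part (no dot in the name)."""
    return [name for name in clients if "." not in name.strip(".")]


# Example with a short name and a hypothetical FQDN.
print(short_names(["qa01", "qa02.example.com"]))
```

Anything it returns is worth re-adding to the policy under its FQDN.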