Andy_Welcomer · ‎03-25-2015

Hello,

We are having an ongoing issue when trying to back up a large VM.

Our setup is as follows.

Master: Virtual Server 2008 R2 Netbackup 7.6.0.2

Media Server/Backup Proxy: IBM 3650 M4 with Server 2008 R2 Netbackup 7.6.0.2

ESX hosts: IBM 3550 M4 running ESXi 5.1 (Kernel 2323236)

Storage: IBM DS3500 SAS Attached

Issue:

We have a large 2008 R2 VM that is currently using 12.78TB out of 29.05 allocated. The server houses very large image files that are multiple GB in size each.

We are unable to perform a complete backup of this server using SAN transport. It fails with error 6. Here is the text from Activity Monitor:

3/24/2015 1:58:00 PM - Info nbjm(pid=4484) starting backup job (jobid=423561) for client EPNSMS01, policy NewStanton_MS01_Data, schedule Weekly_Full
3/24/2015 1:58:00 PM - Info nbjm(pid=4484) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=423561, request id:{8BF9ACC0-1B65-4749-9FF0-F3EE1B6FFE3B})
3/24/2015 1:58:00 PM - requesting resource epnsbms02-hcart2-robot-tld-7
3/24/2015 1:58:00 PM - requesting resource epenbms01.NBU_CLIENT.MAXJOBS.EPNSMS01
3/24/2015 1:58:00 PM - requesting resource epenbms01.NBU_POLICY.MAXJOBS.NewStanton_MS01_Data
3/24/2015 1:58:00 PM - granted resource epenbms01.NBU_CLIENT.MAXJOBS.EPNSMS01
3/24/2015 1:58:00 PM - granted resource epenbms01.NBU_POLICY.MAXJOBS.NewStanton_MS01_Data
3/24/2015 1:58:00 PM - granted resource 045ACP
3/24/2015 1:58:00 PM - granted resource IBM.ULT3580-HH5.000
3/24/2015 1:58:00 PM - granted resource epnsbms02-hcart2-robot-tld-7
3/24/2015 1:58:00 PM - estimated 0 Kbytes needed
3/24/2015 1:58:00 PM - Info nbjm(pid=4484) started backup (backupid=EPNSMS01_1427219880) job for client EPNSMS01, policy NewStanton_MS01_Data, schedule Weekly_Full on storage unit epnsbms02-hcart2-robot-tld-7
3/24/2015 1:58:01 PM - started process bpbrm (1368)
3/24/2015 1:58:04 PM - Info bpbrm(pid=1368) EPNSMS01 is the host to backup data from
3/24/2015 1:58:04 PM - Info bpbrm(pid=1368) reading file list for client
3/24/2015 1:58:04 PM - Info bpbrm(pid=1368) starting bpbkar32 on client
3/24/2015 1:58:04 PM - Info bpbkar32(pid=1048) Backup started
3/24/2015 1:58:04 PM - connecting
3/24/2015 1:58:04 PM - connected; connect time: 0:00:00
3/24/2015 1:58:04 PM - Info bpbkar32(pid=1048) archive bit processing:<enabled>
3/24/2015 1:58:04 PM - Info bptm(pid=4504) start
3/24/2015 1:58:05 PM - Info bptm(pid=4504) using 65536 data buffer size
3/24/2015 1:58:05 PM - Info bptm(pid=4504) setting receive network buffer to 263168 bytes
3/24/2015 1:58:05 PM - Info bptm(pid=4504) using 30 data buffers
3/24/2015 1:58:06 PM - Info bptm(pid=4504) start backup
3/24/2015 1:58:06 PM - Info bptm(pid=4504) Waiting for mount of media id 045ACP (copy 1) on server epnsbms02.
3/24/2015 1:58:06 PM - mounting 045ACP
3/24/2015 1:58:46 PM - Info bptm(pid=4504) media id 045ACP mounted on drive index 0, drivepath {10,0,0,0}, drivename IBM.ULT3580-HH5.000, copy 1
3/24/2015 1:58:47 PM - mounted; mount time: 0:00:41
3/24/2015 1:58:47 PM - positioning 045ACP to file 17
3/24/2015 2:00:49 PM - positioned 045ACP; position time: 0:02:02
3/24/2015 2:00:49 PM - begin writing
3/24/2015 2:55:22 PM - Info bpbkar32(pid=1048) bpbkar waited 0 times for empty buffer, delayed 0 times.
3/24/2015 2:56:57 PM - Error bpbrm(pid=1368) could not send server status message
3/24/2015 2:56:58 PM - Critical bpbrm(pid=1368) unexpected termination of client EPNSMS01
3/24/2015 2:56:58 PM - Info bpbkar32(pid=0) done. status: 6: the backup failed to back up the requested files
3/24/2015 2:56:58 PM - end writing; write time: 0:56:09
the backup failed to back up the requested files(6)
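
For anyone comparing this log against a timeout value, here is a minimal Python sketch that works out how long the job was writing before it died. The three lines are copied from the log above; the helper name parse_ts and the timestamp format string are my own assumptions about how Activity Monitor prints dates on this system.

```python
from datetime import datetime

# Assumed Activity Monitor timestamp layout: month/day/year, 12-hour clock.
FMT = "%m/%d/%Y %I:%M:%S %p"

def parse_ts(line):
    """Return the leading timestamp of an Activity Monitor line."""
    stamp = line.split(" - ", 1)[0]
    return datetime.strptime(stamp, FMT)

# Lines copied from the job details above.
begin = "3/24/2015 2:00:49 PM - begin writing"
error = "3/24/2015 2:56:57 PM - Error bpbrm(pid=1368) could not send server status message"
end = "3/24/2015 2:56:58 PM - end writing; write time: 0:56:09"

write_time = parse_ts(end) - parse_ts(begin)
time_to_error = parse_ts(error) - parse_ts(begin)

print("write time:      ", write_time)     # 0:56:09, matches the log
print("time until error:", time_to_error)  # 0:56:08
```

The job ran for roughly 56 minutes of writing before bpbrm reported the failure, which is the figure to compare against any timeout configured on the media and master servers.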

Observations:

It will back up if the transport method is set to NBD (takes too long to be a solution).
We did have some success a few weeks ago with the following settings using SAN (no longer works):
- Backing up only the C drive with quiescing disabled
- Backing up only the data drive with quiescing disabled
It's not an issue with the media server seeing the datastore, as other VMs back up with SAN transport without problems.

I see nothing in documentation that says the VM is too big and other than the size, the configuration of the VM is no different than any other we have.

We've had multiple requests open with both VMware and Symantec, and we don't seem to get any real feedback.

Anyone else have a suggestion?

RiaanBadenhorst · ‎03-25-2015

Hi,

Can you collect the vxms log?

Also, your tape buffers are very low (65536). Try increasing that to 262144; it should improve performance all round.
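
As a rough guide to what that change costs in memory, here is a small Python sketch. It takes the buffer count (30) and the current size (65536) from the bptm lines in the job log and assumes the count stays the same after the change. The function name is made up for illustration, and the multiple-of-1024 check reflects my understanding of the usual NetBackup guidance, so verify it against the tuning guide for your release.

```python
def buffer_memory_bytes(buffer_size, buffer_count):
    """Memory bptm needs for one backup stream: size of each buffer times count."""
    if buffer_size % 1024 != 0:
        # Assumption: data buffer sizes are expected to be multiples of 1024.
        raise ValueError("buffer size should be a multiple of 1024")
    return buffer_size * buffer_count

BUFFER_COUNT = 30  # "using 30 data buffers" in the job log

current = buffer_memory_bytes(65536, BUFFER_COUNT)
proposed = buffer_memory_bytes(262144, BUFFER_COUNT)

print(current, "bytes =", current / 1024**2, "MiB")    # 1966080 bytes = 1.875 MiB
print(proposed, "bytes =", proposed / 1024**2, "MiB")  # 7864320 bytes = 7.5 MiB
```

Even at 262144 the per-stream memory is small, so the main thing to check is that the tape drive and HBA accept the larger block size.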

VoropaevPavel · ‎03-26-2015

Do you observe any SAN errors during the backup on the switch level?

In one of the environments with big VMs, error 6 was resolved by setting a fixed 8Gb speed instead of auto on the Brocade switch ports towards the media server.

And as Riaan suggested please collect vxms logs.

Michael_G_Ander · ‎03-26-2015

I would look in the vSphere events for warnings/errors for this VM.

Have you tuned the raw buffer on the media/backup host for VMware backup?

"Error bpbrm(pid=1368) could not send server status message" could be a timeout issue, which could be solved by increasing CLIENT_READ_TIMEOUT on the media and master servers.

Besides that

I would look in the OS Application and System event logs for VSS and VolSnap entries, and for general warnings/errors at the time of the backup. Be aware that, unfortunately, too small a shadow storage does not always generate event log entries.

I would also run vssadmin list writers.

I have often found that problems in VMware backups were caused by VSS in the Windows OS.
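
If the writer list is long, a short Python sketch like the one below can pick out the writers that are not healthy. It parses text in the layout that vssadmin list writers normally prints (a "Writer name:" line followed by "State:" and "Last error:" lines). The sample text is illustrative only, not output from this server, and the function name is hypothetical.

```python
import re

def unhealthy_writers(text):
    """Return (name, state, last_error) for writers that are not Stable / No error."""
    problems = []
    name = state = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name: '(.+)'", line)
        if match:
            name, state = match.group(1), None
        elif line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Last error:") and name is not None:
            last_error = line.split(":", 1)[1].strip()
            if "Stable" not in (state or "") or last_error != "No error":
                problems.append((name, state, last_error))
            name = None
    return problems

# Illustrative sample, NOT taken from the server in this thread.
SAMPLE = """
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'WMI Writer'
   State: [9] Failed
   Last error: Timed out
"""

for writer in unhealthy_writers(SAMPLE):
    print(writer)
```

On the VM itself the input could come from running vssadmin list writers in an elevated prompt and saving the output to a file, then passing the file contents to the function.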

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

Andy_Welcomer · ‎03-31-2015

Hello,

Sorry for responding late.

I will try to address everyone's questions.

Voropaev, the storage is direct attached with SAS connections, no switch. One thing I did notice that I didn't know before is that the media server's HBAs are only 3GB capable.

Michael: There are no VSS issues with the server. The snapshot completes without issue. It's the second job that writes to tape which fails.

Here is what was done since the post:

We tested a theory that Windows could be the issue with the size of the datastore. This environment has 3 datastores, which are sized 20TB, 2.72TB and 500GB respectively. The large VM is on the 20TB datastore. As a test we relocated another VM to the 20TB datastore, and were able to perform the backup.

We've increased the CLIENT_READ_TIMEOUT to the highest it will go. Now a question: does this setting require a restart of the services? On Friday March 27, 6 attempts failed. The media server was rebooted on Saturday night. Another test for a clean set of logs was started yesterday. As of this post, that job has been running for 17 hours. I'm going to attach all of the VxMS logs from this attempt in a zip file shortly.

Andy_Welcomer · ‎03-31-2015

Forgot to mention, Riaan: we haven't increased the buffers yet.

Andy_Welcomer · ‎03-31-2015

Logs attached

RiaanBadenhorst · ‎03-31-2015

It doesn't look like the verbosity is high enough. Set it to 8.

sdo · ‎04-01-2015

Can I get this right... the storage is SAS attached to multiple ESX hosts and to the NetBackup Media Server / Backup Proxy Host?

Never done that myself... but if it is 'supposed' to behave like SAN presentation...

...then... I noticed that you didn't respond to Michael's point re:

"Have you tuned the raw buffer on the media/backup host for vmware backup ?"

...and what Michael is alluding to with this is...

...a VADP backup proxy reads LUNs that are VMFS volumes when SAN transport is specified. Whilst the NetBackup 'stream handler' for VMware will inspect the LUN stream (to catalog files), it does not actually read NTFS folder structures like plain NetBackup Client does.

Anyway, what Michael was steering you towards is... have you experimented with the raw LUN buffer setting that a SAN based 'backup proxy' will use when reading (backing-up) from LUNs? The setting in question is the "Raw partition read buffer size:" attribute of the 'Client Properties' of the backup proxy. This controls the 'block size' (or 'buffer size') of the raw disk IO performed by NetBackup Client (i.e. bpbkar32 -> bpfis -> VxMs -> vix API), i.e. the block size used by the VADP API routines embedded within the VDDK 'dll' files that are included within a backup proxy.

The actual setting name is 'SectorsPerBuffer' - but, get this, the name is now a misnomer, and it is no longer actually based on 'sector' count. In the good (bad?) old days of VCB it really was 'sector' count based, and so one had to manually work things out on the basis of 512B sectors... anyway, I digress.

What I wanted to say, was that years ago we used to be able to drive the value of this setting all the way up to 32MB - but, nowadays, for some reason the NetBackup Admin GUI only lets us specify a maximum of 4096 (in KB), i.e. 4 MB. Now, I suspect that Symantec must have limited the maximum value possible via the GUI because there must be a limitation in either NetBackup, or in the VDDK - so I would not recommend manually punching this value higher than 4MB by using 'bpsetconfig'.

I believe that the default value for this setting is now '32'... so, a SAN attached VMFS datastore LUN will be read in units/buffers of 32KB! I think there's room for improvement here :)

Maybe, first try a value of 1024 (KB), i.e. so that the (apparently) SAN attached LUN will be read in units of 1MB disk IO.

BTW - In Appliance, and RHEL, and SuSE based VMware backup proxy land - I believe this setting is known as FBU_READBLKS (but please, anyone, do correct me if I'm wrong about this).
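
To put rough numbers on the buffer sizes discussed above, here is a back-of-the-envelope Python sketch. It assumes the whole 12.78TB in use is read once, with one read per buffer, and it treats TB as TiB. Both are simplifications of mine, and the function name is made up for illustration; it says nothing about how VDDK actually schedules its IO.

```python
def reads_needed(total_bytes, buffer_kb):
    """Number of read calls needed to cover total_bytes with buffers of buffer_kb KB."""
    buffer_bytes = buffer_kb * 1024
    return -(-total_bytes // buffer_bytes)  # ceiling division

# 12.78TB in use, from the first post; interpreted as TiB (assumption).
USED_BYTES = int(12.78 * 1024**4)

# 32 KB (believed default), 1024 KB (suggested first try), 4096 KB (GUI maximum).
for kb in (32, 1024, 4096):
    print(f"{kb:>5} KB buffer: {reads_needed(USED_BYTES, kb):,} reads")
```

Going from 32 KB to 1024 KB cuts the number of read calls by a factor of about 32, which is why the setting is worth experimenting with on a VM of this size.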

RiaanBadenhorst · ‎04-01-2015

I found another post with a similar issue:

https://www-secure.symantec.com/connect/forums/vmware-backups-failing-status-code-623-0

VMware backup with Error code 6