09-10-2013 05:34 AM
I have a Netbackup Environment with a Virtual Master Server running 7506 and a 5230 Appliance running 2.5.2 (7505). There are a small number of Physical clients (Windows and Linux) and many VMware clients.
Whilst most of the VMware clients (ESX5) are working successfully (120+) I have two which are giving major problems. The ESX LUNs are presented to the 5230 over SAN and we are authenticating to the vCenter server to control backups. Policy type is VMWARE.
The two servers in question have a large number of files, over 1TB total size each.
One of the servers, following the backup being initiated, performs the VMware snapshot, starts the backup job and then sits there for some time before failing with a status 40 - Network Connection Broken. Which network connection? As this is a VMware policy type I am not talking directly to the VM; it is being snapped and then presented to the 5230 as a local image. I have extended the client read timeouts on both the master and the 5230 appliance with no success.
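For reference, the timeout change I made on the appliance side was along these lines (the 1800-second value is just what I tried, not a recommendation; on the Windows master I changed the equivalent setting in Host Properties > Timeouts):

```shell
# Linux media server / appliance: extend the client read timeout in bp.conf
grep -q '^CLIENT_READ_TIMEOUT' /usr/openv/netbackup/bp.conf \
  || echo 'CLIENT_READ_TIMEOUT = 1800' >> /usr/openv/netbackup/bp.conf

# confirm the value NetBackup will actually use
/usr/openv/netbackup/bin/admincmd/bpgetconfig CLIENT_READ_TIMEOUT
```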
This job has run successfully in the past numerous times - both FULL and DIFF.
Failed Job log:
10/09/2013 10:53:38 - Info nbjm(pid=3704) starting backup job (jobid=4321) for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule DAILY
10/09/2013 10:53:38 - estimated 1388605664 Kbytes needed
10/09/2013 10:53:38 - Info nbjm(pid=3704) started backup (backupid=DUB-APPS-001_1378806818) job for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule DAILY on storage unit DUB-5230-DeDup
10/09/2013 10:53:39 - started process bpbrm (29973)
10/09/2013 10:53:41 - Info bpbrm(pid=29973) DUB-APPS-001 is the host to backup data from
10/09/2013 10:53:41 - Info bpbrm(pid=29973) reading file list from client
10/09/2013 10:53:41 - Info bpbrm(pid=29973) starting bpbkar on client
10/09/2013 10:53:41 - Info bpbkar(pid=29997) Backup started
10/09/2013 10:53:41 - Info bpbrm(pid=29973) bptm pid: 29998
10/09/2013 10:53:41 - Info bptm(pid=29998) start
10/09/2013 10:53:41 - connecting
10/09/2013 10:53:41 - connected; connect time: 00:00:00
10/09/2013 10:53:42 - Info bptm(pid=29998) using 1048576 data buffer size
10/09/2013 10:53:42 - Info bptm(pid=29998) using 64 data buffers
10/09/2013 10:53:42 - Info bptm(pid=29998) start backup
10/09/2013 10:54:25 - begin writing
10/09/2013 12:01:17 - Error bptm(pid=29998) media manager terminated by parent process
10/09/2013 12:01:24 - Info dub-5230(pid=29998) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 3 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 100.0%
10/09/2013 12:01:25 - Info bpbkar(pid=0) done. status: 40: network connection broken
10/09/2013 12:01:25 - end writing; write time: 01:07:00
network connection broken(40)
10/09/2013 12:11:28 - Info bpbrm(pid=11144) DUB-APPS-001 is the host to backup data from
10/09/2013 12:11:28 - Info bpbrm(pid=11144) reading file list from client
10/09/2013 12:11:28 - Info bpbrm(pid=11144) starting bpbkar on client
10/09/2013 12:11:28 - Info bpbkar(pid=11151) Backup started
10/09/2013 12:11:28 - Info bpbrm(pid=11144) bptm pid: 11152
10/09/2013 12:11:28 - Info bptm(pid=11152) start
10/09/2013 12:11:29 - Info bptm(pid=11152) using 1048576 data buffer size
10/09/2013 12:11:29 - Info bptm(pid=11152) using 64 data buffers
10/09/2013 12:11:29 - Info bptm(pid=11152) start backup
Following this the job just sits there doing nothing.
Successful Job Log:
07/09/2013 02:59:09 - Info nbjm(pid=3984) starting backup job (jobid=4051) for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule MONTHLY
07/09/2013 02:59:09 - estimated 0 Kbytes needed
07/09/2013 02:59:09 - Info nbjm(pid=3984) started backup (backupid=DUB-APPS-001_1378519149) job for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule MONTHLY on storage unit DUB-5230-DeDup
07/09/2013 02:59:11 - started process bpbrm (7231)
07/09/2013 02:59:12 - Info bpbrm(pid=7231) DUB-APPS-001 is the host to backup data from
07/09/2013 02:59:12 - Info bpbrm(pid=7231) reading file list from client
07/09/2013 02:59:12 - connecting
07/09/2013 02:59:13 - Info bpbrm(pid=7231) starting bpbkar on client
07/09/2013 02:59:13 - Info bpbkar(pid=7238) Backup started
07/09/2013 02:59:13 - Info bpbrm(pid=7231) bptm pid: 7239
07/09/2013 02:59:13 - Info bptm(pid=7239) start
07/09/2013 02:59:13 - Info bptm(pid=7239) using 1048576 data buffer size
07/09/2013 02:59:13 - Info bptm(pid=7239) using 64 data buffers
07/09/2013 02:59:13 - connected; connect time: 00:00:01
07/09/2013 02:59:14 - Info bptm(pid=7239) start backup
07/09/2013 02:59:51 - begin writing
07/09/2013 04:00:53 - Info bpbkar(pid=7238) INF - Transport Type = san
07/09/2013 20:25:04 - Info bpbkar(pid=7238) bpbkar waited 124 times for empty buffer, delayed 77660 times
07/09/2013 20:25:04 - Info bptm(pid=7239) waited for full buffer 848100 times, delayed 3160597 times
07/09/2013 20:25:29 - Info bptm(pid=7239) EXITING with status 0 <----------
07/09/2013 20:25:29 - Info dub-5230(pid=7239) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 1148749424 KB, CR sent: 57710289 KB, CR sent over FC: 0 KB, dedup: 95.0%
07/09/2013 20:25:29 - Info bpbrm(pid=7231) validating image for client DUB-APPS-001
07/09/2013 20:25:29 - end writing; write time: 17:25:38
07/09/2013 20:25:30 - Info bpbkar(pid=7238) done. status: 0: the requested operation was successfully completed
the requested operation was successfully completed(0)
The other server is more promising, but runs extremely slowly and would take days to finish if left to run. This job has run successfully (FULL) twice and secured 1.6TB in 14hrs.
Failed Job Log:
09/09/2013 16:00:45 - Info nbjm(pid=3704) starting backup job (jobid=4227) for client DUB-PTFILE-P01, policy AEGON-PROD-VMWARE-ADHOC, schedule MONTHLY
09/09/2013 16:00:45 - estimated 0 Kbytes needed
09/09/2013 16:00:45 - Info nbjm(pid=3704) started backup (backupid=DUB-PTFILE-P01_1378738845) job for client DUB-PTFILE-P01, policy AEGON-PROD-VMWARE-ADHOC, schedule MONTHLY on storage unit DUB-5230-DeDup
09/09/2013 16:00:46 - started process bpbrm (459)
09/09/2013 16:00:48 - Info bpbrm(pid=459) DUB-PTFILE-P01 is the host to backup data from
09/09/2013 16:00:48 - Info bpbrm(pid=459) reading file list from client
09/09/2013 16:00:48 - Info bpbrm(pid=459) starting bpbkar on client
09/09/2013 16:00:48 - Info bpbkar(pid=467) Backup started
09/09/2013 16:00:48 - Info bpbrm(pid=459) bptm pid: 468
09/09/2013 16:00:48 - connecting
09/09/2013 16:00:48 - connected; connect time: 00:00:00
09/09/2013 16:00:49 - Info bptm(pid=468) start
09/09/2013 16:00:49 - Info bptm(pid=468) using 1048576 data buffer size
09/09/2013 16:00:49 - Info bptm(pid=468) using 64 data buffers
09/09/2013 16:00:50 - Info bptm(pid=468) start backup
09/09/2013 16:00:51 - begin writing
10/09/2013 10:42:07 - Error bptm(pid=468) media manager terminated by parent process
10/09/2013 10:42:14 - Info dub-5230(pid=468) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 3636227 KB, CR sent: 342397 KB, CR sent over FC: 0 KB, dedup: 90.6%
10/09/2013 10:42:15 - Info bpbkar(pid=0) done
10/09/2013 10:42:15 - Info bpbkar(pid=0) done. status: 150: termination requested by administrator
10/09/2013 10:42:15 - end writing; write time: 18:41:24
termination requested by administrator(150)
Successful job log:
10/08/2013 00:13:37 - Info nbjm(pid=3836) starting backup job (jobid=1681) for client DUB-PTFILE-P01, policy Daily_None__Weekly_00.01__Monthly_00.01__VMware, schedule WEEKLY
10/08/2013 00:13:37 - estimated 0 Kbytes needed
10/08/2013 00:13:37 - Info nbjm(pid=3836) started backup (backupid=DUB-PTFILE-P01_1376090017) job for client DUB-PTFILE-P01, policy Daily_None__Weekly_00.01__Monthly_00.01__VMware, schedule WEEKLY on storage unit DUB-5230-DeDup
10/08/2013 00:13:38 - started process bpbrm (23623)
10/08/2013 00:13:40 - Info bpbrm(pid=23623) DUB-PTFILE-P01 is the host to backup data from
10/08/2013 00:13:40 - Info bpbrm(pid=23623) reading file list from client
10/08/2013 00:13:40 - connecting
10/08/2013 00:13:40 - connected; connect time: 00:00:00
10/08/2013 00:13:41 - Info bpbrm(pid=23623) starting bpbkar on client
10/08/2013 00:13:41 - Info bpbkar(pid=23645) Backup started
10/08/2013 00:13:41 - Info bpbrm(pid=23623) bptm pid: 23648
10/08/2013 00:13:41 - Info bptm(pid=23648) start
10/08/2013 00:13:41 - Info bptm(pid=23648) using 1048576 data buffer size
10/08/2013 00:13:41 - Info bptm(pid=23648) using 64 data buffers
10/08/2013 00:13:42 - Info bptm(pid=23648) start backup
10/08/2013 00:13:51 - begin writing
10/08/2013 08:09:45 - Info bpbkar(pid=23645) INF - Transport Type = san
10/08/2013 13:56:30 - Info bpbkar(pid=23645) bpbkar waited 389820 times for empty buffer, delayed 1140862 times
10/08/2013 13:56:30 - Info bptm(pid=23648) waited for full buffer 107426 times, delayed 1894435 times
10/08/2013 13:59:14 - Info bptm(pid=23648) EXITING with status 0 <----------
10/08/2013 13:59:15 - Info dub-5230(pid=23648) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 1617774661 KB, CR sent: 187714934 KB, CR sent over FC: 0 KB, dedup: 88.4%
10/08/2013 13:59:16 - Info bpbrm(pid=23623) validating image for client DUB-PTFILE-P01
10/08/2013 13:59:17 - end writing; write time: 13:45:26
10/08/2013 13:59:18 - Info bpbkar(pid=23645) done. status: 0: the requested operation was successfully completed
the requested operation was successfully completed(0)
I note that in both of the above FAILED job logs we never get the message 'INF - Transport Type = san', whereas we do in the successful jobs.
I am unaware of any other changes having been made in the environment - and in either case, why would it affect only these two VM clients? As a test I have now installed the 7.5.0.6 agent on each of the VMs and am running an MS-Windows policy type with Accelerator enabled and Client Side Dedup, to test whether the job will complete that way. However - I do not want to progress this way, and both jobs worked perfectly previously??
Any pointers ?
AJ
09-10-2013 05:50 AM
There can be several reasons for this sort of error, so it is not always easy to track down, but it is worth working through the logs as advised in the Troubleshooting section of the NetBackup VMware Admin Guide.
It can be caused by VMware Tools being out of date, or by network issues (even though it is a SAN backup, it still has to resolve the clients correctly before it will run the backup).
It can also be due to communications between the Master and the VMware backup host - there are some useful tips in this tech note:
http://www.symantec.com/docs/TECH184305
The 'media manager terminated by parent process' error is also a concern, especially as the job carries on doing things after it.
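A quick way to sanity-check the name resolution between the Master and the VMware backup host (not from the tech note, just commands I find useful - the hostname is from this thread and the IP below is made up):

```shell
# run on the master and on the backup host, in both directions
/usr/openv/netbackup/bin/bpclntcmd -self             # how this host identifies itself
/usr/openv/netbackup/bin/bpclntcmd -hn dub-5230      # forward lookup of the peer
/usr/openv/netbackup/bin/bpclntcmd -ip 192.168.1.50  # reverse lookup (example IP)
```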
Are you fully up to date with patches etc.?
09-10-2013 08:29 AM
Mark,
Thanks for the input.
Master is W2K8R2 SP1, with NBU 7506. Media (5230 Appliance) is at 2.5.2 (7505). The clients are on ESX5, one running W2K3 and one running W2K8. VMware Tools is reasonably current (and it is working with the other 120+ VMs with no issues). Network connectivity should not be an issue as the ESX hosts in the cluster are on a dedicated network with the NBU servers. There is a Checkpoint firewall linking the 'management' network (ESX hosts and NBU servers) with the 'Prod' network (VMs and physical servers). The only links I have open between the two networks allow the 'physical' hosts to contact the NBU environment and vice versa.
The network between the Master and the VM backup host (appliance) is not via the firewall and is on the same subnet etc., so it should not be a problem.
All was working well in both cases, but then it appears to have gone 'pete tong' for these two VMs only. All other VMs are working fine, albeit these two have significantly more data on them.
Could this be something to do with the amount of catalog information which has built up over the month or so we have been live on this environment ? Or a build-up of 'junk' in the appliance side of things ?
I will look into the items you point out.
When you say 'up to date with patches etc.' - what in particular are you thinking of ?
Thanks,
AJ
09-10-2013 08:41 AM
Patches was in terms of NetBackup but looks like you are all up to date with those.
If it was clutter / junk you would expect it to affect all clients - so as you say it may be the data size and hence the amount of time the job takes that you need to look at here.
There is no consistent time factor either that would pin down a keep-alive setting.
One thing that may help would be to set a smaller fragment size on the storage unit - this means smaller chunks are used (even though it is de-dupe, really!), but it does cause a regular conversation all round as each new fragment is started.
I like to use a 5000MB fragment size if I have any GRT backups involved, but just try half your current one as a starting point to see if that helps.
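The fragment size can be checked and changed from the command line as well as the GUI - something along these lines (the STU label is taken from your job logs; do check the bpsturep options on your version first, as I am quoting the flag from memory):

```shell
# show the current storage unit settings, including max fragment size
/usr/openv/netbackup/bin/admincmd/bpstulist -label DUB-5230-DeDup -U

# drop the maximum fragment size to 5000 MB
/usr/openv/netbackup/bin/admincmd/bpsturep -label DUB-5230-DeDup -maxfrag 5000
```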
There are an awful lot of delays indicating that you are not processing very well - so it could be an overload of the vCenter / ESX server causing your issues, especially when it comes to handling these very large snapshots - so you could also limit the concurrent jobs per ESX server or datastore.
Hope these ideas help
09-10-2013 09:23 AM
Mark,
Thanks for that.
I have the limits set for the VMware processing (in NBU Master host properties) to 3 snapshots, 3 backups per ESX server and 3 backups per datastore. These were set to 4 and all was processing well, but I reduced them earlier today. However, a point to note is that I am running some of these tests during the day when there is no other backup processing taking place.
I am using a fragment size of 5000MB on the storage unit.
I'll keep thinking / looking.......
AJ
09-10-2013 11:49 AM
Hi,
do yourself a favor and upgrade to 7.5.0.6 on the VMware backup access host. You will see it will work like a charm.
Sorry Guys,
Best regards,
Cruisen
09-10-2013 12:12 PM
Cruisen - have you come across this and 2.5.3 resolved the issue, or is this just general advice?
AJ.
09-10-2013 12:23 PM
Mark,
Regarding the technote you mentioned above (http://www.symantec.com/docs/TECH184305), this relates to a Windows media server. Are all/any of these parameters relevant on a 5230 appliance, and if so, how do I go about changing them (I'm not a Unix head....)?
AJ.
09-10-2013 12:30 PM
Hi,
I have come across this, yes, and I spent hours troubleshooting and applying best practices, only to find that upgrading to 7.5.0.6 solved everything. 100% success rate and fast backups.
What I was wondering about was your description: are you really having issues with two servers only? Or did it work for a while for the 120 servers and then you started to experience failures? That was my case. The first backups worked perfectly, and then I needed to recycle services and even reboot the server to make it work again. I thought that was normal. Symantec Support gave me binaries that worked. After applying the patch on the VMware access host only - no reboot - backups never failed again.
I propose you figure out whether you have the same or similar problems on all VMs; if so, then go for the patch.
best regards,
Cruisen
09-10-2013 12:38 PM
Hi, I would also like to mention that these errors tend to be related to the deduplication pool more than anything else.
http://www.symantec.com/docs/TECH171327
In our case this was a DataDomain appliance with OST.
There are other technotes that talk about VMware and dedupe pools; please check if any of them apply.
best regards,
Cruisen
09-11-2013 01:50 AM
The tech note does relate to Windows but some details will relate to an appliance too
You can get into the O/S of an appliance via the CLISH using Support - Maintenance - p@ssw0rd - elevate
Some standard tuning I like to do on the appliances (if it is not already there - and many things have now been adopted in the later patches) is:
Set the DATA buffer numbers and sizes for both disk and tape (done via the CLISH): numbers 64 or 128, sizes 262144 for tape, 1048576 for disk
If using Accelerator or doing a very large number of backups, I edit the /disk/etc/puredisk/contentrouter.cfg file so that WorkerThreads=128 (the default is 64)
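If you want to script that edit rather than use vi, something along these lines works (my assumption is that the dedupe services need restarting afterwards for it to take effect - check that before doing it in production):

```shell
# keep a backup copy, then raise WorkerThreads from the default 64 to 128
cp /disk/etc/puredisk/contentrouter.cfg /disk/etc/puredisk/contentrouter.cfg.bak
sed -i 's/^WorkerThreads=64$/WorkerThreads=128/' /disk/etc/puredisk/contentrouter.cfg

# confirm the change
grep '^WorkerThreads' /disk/etc/puredisk/contentrouter.cfg
```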
For slow VMware policy query runs - create the /usr/openv/netbackup/virtualization.conf file with the following in it:
[BACKUP]
"disableIPResolution"=dword:00000000
Change the keep-alive settings (this does need some other work to keep them persistent):
# echo 510 > /proc/sys/net/ipv4/tcp_keepalive_time
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_intvl
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes
Create the /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO file with a value of 800 in it (on all Master and Media Servers - it needs a restart to take effect, and NO other DPS_ files should exist)
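Creating that touch file is just:

```shell
# on every master and media server (restart NetBackup afterwards)
echo 800 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

# make sure no other DPS_ override files are lying around
ls /usr/openv/netbackup/db/config/DPS_* 2>/dev/null
```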
Just a few things I have found that help.
If your VMware LUNs have multiple paths to the appliance (which is not supported) it will cause a memory leak on the appliance - this needs a command run, or a reboot, to clear it down - however, if the appliance is also a Master the command takes EMM down!!!
Hope some of this helps
09-11-2013 02:03 AM
Mark,
Many thanks.
My appliance is NOT the Master server - just a Media Server.
I already have the buffers set as per your note.
I will update the contentrouter.cfg as you say.
How do I make the TCP changes persistent ?
I do have multiple 'physical' paths from the SAN volumes to the appliance; however, we are using Dell Compellent, and an option when presenting the LUN is to present it over one path only (a nice feature which I don't see on the EMC side, which I am more experienced with). Therefore the appliance only sees the LUN over a single path, although there are 8 paths physically available. Also, I rebooted the appliance a couple of days ago with no effect on this issue.
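For what it's worth, this is roughly how I confirmed the appliance really only sees a single path per LUN (standard Linux commands, nothing appliance-specific):

```shell
# one active path per VMware LUN is what I expect to see here
multipath -ll

# the by-path entries also show how many SAN paths each device arrives on
ls -l /dev/disk/by-path/
```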
Also, I am planning to upgrade to 2.5.3 on the appliances tomorrow so I will see if that has any effect.
I will keep you updated.
AJ.
09-11-2013 02:24 AM
2.5.3 should help with things - just bear in mind that if your disk pools are not using the standard name, then you may need to get Support to sort out the installation script for you first to get the upgrade to actually work (see the threads in the Appliance forum for details)
Check your keep alive settings first - they may already be OK:
# cat /proc/sys/net/ipv4/tcp_keepalive_time
510
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
3
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
3
If you do need to change them they are kept persistent as follows:
The changes are rendered persistent with an addition such as the following to /etc/sysctl.conf:
## Keepalive at 8.5 minutes
# start probing for heartbeat after 8.5 idle minutes (default 7200 sec)
net.ipv4.tcp_keepalive_time=510
# close connection after 3 unanswered probes (default 9)
net.ipv4.tcp_keepalive_probes=3
# wait 3 seconds for a response to each probe (default 75)
net.ipv4.tcp_keepalive_intvl=3
and then run :
chkconfig boot.sysctl on
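You can apply and verify the new values immediately, without waiting for a reboot:

```shell
# load the settings from /etc/sysctl.conf now
sysctl -p

# verify the running values
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
```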
09-11-2013 03:05 AM
Mark,
Thanks.
Regarding the disk pool names, if they are different does it not just fail the self test ? Saw the threads previously which I assume you are referring to.
AJ.
09-11-2013 03:05 AM
To check if you have any memory leaks (they are classed as semaphore leaks) you can run a very simple command on the appliance (LOL!):
for semid in `ipcs -s | awk '/^0x/ {if ($1=="0x00000000") print $2}'`; do ipcs -s -i $semid; done | egrep -v "semaphore|uid|mode|nsems|otime|ctime|semnum" | awk '{if ($5 != "0" && $5 != "") print $5}' | uniq | xargs ps -p
Copy this into Notepad first, as it should all be on a single line as a single command - if you get any sort of output then you have a memory leak (I am still not sure how to interpret the results!!)
If it is only a media server you can clear them down by adding a crontab entry (crontab -e):
*/15 * * * * /usr/bin/ipcs -s | grep 0x00000000 | awk '{print $2}' | while read sem; do /usr/bin/ipcrm -s $sem; done
This again is a single line - it runs the command every 15 minutes to keep the memory cleared down. I believe this is fixed in 2.5.3, so you probably won't need to do this after your upgrade.
09-11-2013 04:07 AM
It does just fail the self test - but that ends the upgrade, so you cannot upgrade it.
Check through the threads - you may be able to edit the script file yourself, but if you plan to do it tomorrow you could get a case opened now to help you out.
09-11-2013 04:14 AM
Mark,
As you say, nice simple commands........ ?????
Great advice as always.
I will do the 2.5.3 upgrade first and then address this if needs be.
Thanks again for all your input, I will update when I have details.
AJ
09-13-2013 03:27 AM
A couple of points relating to the above on the 2.5.3 upgrade - and the settings following the upgrade:
The 2.5.3 upgrade completed successfully (the POST UPGRADE SELF TEST failed as per the other post - it worked after allowing the appliance 5 mins to stabilise)
After the upgrade the settings were still at the defaults, so I set them as per the commands above:
# start probing for heartbeat after 8.5 idle minutes (default 7200 sec)
net.ipv4.tcp_keepalive_time=510
# close connection after 3 unanswered probes (default 9)
net.ipv4.tcp_keepalive_probes=3
# wait 3 seconds for a response to each probe (default 75)
net.ipv4.tcp_keepalive_intvl=3
Data buffer numbers for tape were at 32 - set to 64 (disk was 64 by default)
Data buffer sizes were OK for both TAPE and DISK
I will see over the next few days whether the 2.5.3 upgrade has stopped the Network Connection Broken (40) errors etc.
AJ
09-13-2013 03:34 AM
I find that you need to specify the disk and tape data buffer sizes and numbers explicitly (or create the SIZE_DATA_BUFFERS etc. files); otherwise the values set for one type get used for the other.
So if you set tape to 256KB, it will start using 256KB for disk too.
Hope this helps
09-13-2013 03:51 AM
As the sizes were OK (I only had to update the numbers), the backup logs are showing that the correct sizes are being used.
AJ