
Status 42 on VMware backups (somewhat inconsistent)

elanmbx
Level 6

Solaris Master - 7.6.0.3

3 x 5220 appliance media servers - 2.6.0.3

Appliances are the VMware Backup Hosts

All VMware backups are done via SAN transport mode

Job details from a "representative" Status 42 failure:

11/13/2014 12:06:09 - Info nbjm (pid=19785) starting backup job (jobid=8025958) for client denjsit15, policy RETRY-VMware-DEV, schedule Full
11/13/2014 12:06:09 - estimated 0 kbytes needed
11/13/2014 12:06:09 - Info nbjm (pid=19785) started backup (backupid=denjsit15_1415905569) job for client denjsit15, policy RETRY-VMware-DEV, schedule Full on storage unit stu_disk_densyma02p
11/13/2014 12:06:10 - started process bpbrm (pid=12111)
11/13/2014 12:06:11 - Info bpbrm (pid=12111) denjsit15 is the host to backup data from
11/13/2014 12:06:11 - Info bpbrm (pid=12111) reading file list for client
11/13/2014 12:06:11 - Info bpbrm (pid=12111) starting bpbkar on client
11/13/2014 12:06:11 - Info bpbkar (pid=12146) Backup started
11/13/2014 12:06:11 - connecting
11/13/2014 12:06:11 - connected; connect time: 0:00:00
11/13/2014 12:06:12 - Info bpbrm (pid=12111) bptm pid: 12147
11/13/2014 12:06:12 - Info bptm (pid=12147) start
11/13/2014 12:06:15 - Info bptm (pid=12147) using 524288 data buffer size
11/13/2014 12:06:15 - Info bptm (pid=12147) setting receive network buffer to 262144 bytes
11/13/2014 12:06:15 - Info bptm (pid=12147) using 128 data buffers
11/13/2014 12:06:17 - Info bptm (pid=12147) start backup
11/13/2014 12:06:26 - begin writing
11/13/2014 12:07:24 - Info bpbkar (pid=12146) 0 entries sent to bpdbm
11/13/2014 12:07:24 - Info bpbkar (pid=12146) 90 entries sent to bpdbm
11/13/2014 12:07:24 - Info bpbkar (pid=12146) 91 entries sent to bpdbm
11/13/2014 12:07:31 - Info bpbkar (pid=12146) 95091 entries sent to bpdbm
11/13/2014 12:07:40 - Info bpbkar (pid=12146) 190091 entries sent to bpdbm
11/13/2014 12:07:47 - Info bpbkar (pid=12146) 272387 entries sent to bpdbm
11/13/2014 12:07:47 - Info bpbkar (pid=12146) 272388 entries sent to bpdbm
11/13/2014 12:07:50 - Info bpbkar (pid=12146) 287172 entries sent to bpdbm
11/13/2014 12:07:50 - Info bpbkar (pid=12146) 287173 entries sent to bpdbm
11/13/2014 12:07:50 - Info bpbkar (pid=12146) 287232 entries sent to bpdbm
11/13/2014 12:07:50 - Info bpbkar (pid=12146) 287233 entries sent to bpdbm
11/13/2014 12:07:58 - Info bpbkar (pid=12146) 382233 entries sent to bpdbm
11/13/2014 12:08:06 - Info bpbkar (pid=12146) 477233 entries sent to bpdbm
11/13/2014 12:08:14 - Info bpbkar (pid=12146) 558340 entries sent to bpdbm
11/13/2014 12:08:14 - Info bpbkar (pid=12146) 558341 entries sent to bpdbm
11/13/2014 12:08:14 - Info bpbkar (pid=12146) 558348 entries sent to bpdbm
11/13/2014 12:08:14 - Info bpbkar (pid=12146) INF - Transport Type =  san
11/13/2014 12:08:14 - Info bpbkar (pid=12146) 558381 entries sent to bpdbm
11/13/2014 12:09:39 - Info bpbkar (pid=12146) 558382 entries sent to bpdbm
11/13/2014 12:13:20 - Info bpbkar (pid=12146) bpbkar waited 10177 times for empty buffer, delayed 29247 times
11/13/2014 12:13:20 - Info bptm (pid=12147) waited for full buffer 6491 times, delayed 17216 times
11/13/2014 12:22:35 - Error bptm (pid=12147) get_string() failed, Broken pipe (32), premature end of file encountered
11/13/2014 12:22:35 - Info bptm (pid=12147) EXITING with status 42 <----------
11/13/2014 12:22:38 - Info densyma02p (pid=12147) StorageServer=PureDisk:densyma02p; Report=PDDO Stats (multi-threaded stream used) for (densyma02p): scanned: 23198447 KB, CR sent: 31511 KB, CR sent over FC: 0 KB, dedup: 99.9%, cache hits: 260943 (99.5%)
11/13/2014 12:22:40 - Info bpbkar (pid=0) done
11/13/2014 12:22:40 - Info bpbkar (pid=0) done. status: 42: network read failed
11/13/2014 12:22:40 - end writing; write time: 0:16:14
network read failed  (42)

I have attached the associated bpbrm, bpbkar, and bptm logs (I have only included the last 500 lines of the bptm log entries for pid=12147, since at VERBOSE=5 the bptm log was HUGE for this job).

Any help would be greatly appreciated!

13 REPLIES

elanmbx
Level 6

Just tested without "SAN" as the transport mode - forced "nbd" - and got the same Status 42 result.

Michael_G_Ander
Level 6
Certified

11/13/2014 12:13:20 - Info bpbkar (pid=12146) bpbkar waited 10177 times for empty buffer, delayed 29247 times
11/13/2014 12:13:20 - Info bptm (pid=12147) waited for full buffer 6491 times, delayed 17216 times
11/13/2014 12:22:35 - Error bptm (pid=12147) get_string() failed, Broken pipe (32), premature end of file encountered
11/13/2014 12:22:35 - Info bptm (pid=12147) EXITING with status 42 <----------

This looks like a timeout, possibly caused by performance. Increasing CLIENT_READ_TIMEOUT to 600 seconds rather than the default 300 should, I think, mitigate the problem.

You probably also want to look into tuning the buffers, as there are some waits and delays. Increasing the number of buffers would be the first thing I would try.

Also, changing the raw read buffer size on the backup host can improve VMware backup speed; the recommended setting was 1024 (1 MB).
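
For reference, on a plain UNIX media server these settings live in bp.conf and the db/config touch files; on the appliances you would normally make such changes through the supported menus rather than editing files by hand, so treat the following only as an illustration of which knobs I mean (the values are examples, not recommendations for your environment):

# check the current read timeout, then add/raise the entry in bp.conf (seconds)
grep CLIENT_READ_TIMEOUT /usr/openv/netbackup/bp.conf
echo "CLIENT_READ_TIMEOUT = 600" >> /usr/openv/netbackup/bp.conf

# data buffer touch files - each contains a single integer
echo 256    > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
echo 524288 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS    # size in bytes (512 KB)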

 

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

elanmbx
Level 6

CLIENT_READ_TIMEOUT is already significantly higher.  I already have quite a bit more than the default number of buffers (128).  Is the raw read buffer size modifiable on the appliances from the CLISH?  Not sure it's in the "regular" buffer tuning menu...

elanmbx
Level 6

I'm starting to think that this is not backup-resource related... because I can reproduce it on some VMs when there is NOTHING else going on in the backup environment.

I did note that in one backup I looked at this morning there were no buffer issues reported either:

11/18/2014 07:46:10 - mounted 060516; mount time: 0:00:53
11/18/2014 07:46:10 - positioning 060516 to file 1211
11/18/2014 07:50:52 - positioned 060516; position time: 0:04:42
11/18/2014 07:50:52 - begin writing
11/18/2014 07:50:56 - Info bpbkar (pid=29899) bpbkar waited 0 times for empty buffer, delayed 0 times
11/18/2014 07:59:19 - Error bptm (pid=29858) get_string() failed, Broken pipe (32), premature end of file encountered
11/18/2014 07:59:24 - Info bpbrm (pid=29832) media manager for backup id denoest1_1416321908 exited with status 42: network read failed
11/18/2014 08:05:27 - end writing; write time: 0:14:35
network read failed  (42)

elanmbx
Level 6

Another curious thing - when retried, the backups fail at basically the SAME point (at least as reported) each time.

Screenshot attached.

elanmbx
Level 6

And the "inconsistent" nature of this issue - the VM I recently worked on (and ran retry jobs against, which also failed) backed up fine last night.  This is quite irritating, seeing as I sometimes have a couple hundred "Status 42" failures in my VMware environment.

Screenshot attached.

elanmbx
Level 6

I wonder if this is in any way related... I have over 145,000 files in one of my media server appliance's VMware directories:

maintenance-!> pwd
/usr/openv/netbackup/online_util/fi_cntl
maintenance-!> ls | wc -l
145888

Some are as old as January of 2013?!?  Why in the world are these files being left behind?
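
For anyone checking their own backup hosts, a quick way to see how many of these control files have gone stale (the 30-day cutoff is just an arbitrary example):

# count fi_cntl files untouched for more than 30 days
find /usr/openv/netbackup/online_util/fi_cntl -type f -mtime +30 | wc -l

# and eyeball the oldest ones
ls -ltr /usr/openv/netbackup/online_util/fi_cntl | head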

elanmbx
Level 6

Wow - the same directories on my two other media server appliances hold over 150,000 files.  I need to get those cleaned up!  Perhaps this is part of my problem...

elanmbx
Level 6

I deleted all the old files and then retried the backup that was recently failing.  And it worked.  I'm going to be cautiously optimistic that perhaps my "Status 42" issues will be less prevalent with this discovery.

Anyone know why these files get left behind?  The backup I retried left behind the following files:

densyma02p:/usr/openv/netbackup/online_util/fi_cntl # ll
total 21744
-rw-r--r-- 1 root root    2473 Nov 20 14:58 bpfis.fim.<host>_1416520694.1.0
-rw-r--r-- 1 root root    2058 Nov 20 14:58 bpfis.fim.<host>_1416520694.1.0.NBU_DATA.xml
-rw-r--r-- 1 root root      21 Nov 20 15:05 bpfis.fim.<host>_1416520694.1.0.NBU_DATA.xml.BID
-rw-r--r-- 1 root root   22406 Nov 20 14:58 bpfis.fim.<host>_1416520694.1.0.VM_ObjInfoXML.xml
-rw-r--r-- 1 root root     734 Nov 20 14:58 bpfis.fim.<host>_1416520694.1.0.changeid.xml
-rw-rw-rw- 1 root root     284 Nov 20 15:10 <host>_1416520694_copy1.lock

I'm inclined to stick a cron job on the appliance to clean up this directory daily, but would like to avoid such a modification if I can manage to.  I really dislike doing these types of "one off" fixes on the appliances.
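
If I do go that route, it would be nothing fancier than a nightly find with an age check, so nothing belonging to an in-flight job gets touched - just a sketch, and the 7-day cutoff is arbitrary:

# root crontab entry on the appliance - purge fi_cntl files untouched for 7+ days
0 3 * * * find /usr/openv/netbackup/online_util/fi_cntl -type f -mtime +7 -delete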

elanmbx
Level 6

Definitely was the issue.  My nightly window opened and the 1st batch of several hundred VM backups went without a single "Status 42" thrown.

The previous evening's backups saw ~30 "Status 42" errors in the same couple of policies.

I will monitor the directory and see how many files get orphaned in there by tomorrow morning.  I suspect there are going to be a LOT.

SRP
Level 4

Just to add my comments:

1 x 5220 appliance Master - 2.6.0.3

Windows 2008 Vmware Backup Host

VMware backups via SAN transport mode

 

Just noticed that the fi_cntl folder has several hundred files in it.

I had the same problem in the beginning when I upgraded to 2.5.0.2.  Can't remember whether a fix was applied... but it worked beautifully for over a year.

After upgrading from 2.5.0.2 straight to 2.6.0.3, I've got the same problem again...

 

Sean

elanmbx
Level 6

Interesting.  I wonder if a bug was "reintroduced"...

elanmbx
Level 6

I am going to conclude with almost 100% certainty that this was the issue.  I am still not seeing any "Status 42" errors after a couple of weeks.  I'm keeping an eye on those fi_cntl directories on my appliances - they are currently at around 20K files each.  If they haven't cleaned up the old image files after 60 days, I will do it manually.
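
If it comes to that, the manual cleanup would just be a one-off along these lines (counting first so I can sanity-check before deleting anything):

# preview how many would go, then remove fi_cntl files untouched for 60+ days
find /usr/openv/netbackup/online_util/fi_cntl -type f -mtime +60 | wc -l
find /usr/openv/netbackup/online_util/fi_cntl -type f -mtime +60 -delete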

Sending Symantec Support a note to let them know what the issue seems to be.