cancel
Showing results for 
Search instead for 
Did you mean: 

Strange errors in NBU 7.6.0.2, VMware clients

backupguy2014
Level 3

NBU Master is on RHEL 6.4 x86_64.  Affected media servers are Win2k3.

 

# bperror -problems -hoursago 24 | grep 'tar file write error'

 

1412286931 1 4 16 fdswp00301-bur 796271 796169 0 FDSWV01423 bpbrm from client FDSWV01423: ERR - tar file write error (14)

1412286931 1 4 16 fdswp00305-bur 796254 796133 0 FDSWV01003 bpbrm from client FDSWV01003: ERR - tar file write error (14)

1412286931 1 4 16 fdswp00301-bur 796270 796157 0 FDSWV01458 bpbrm from client FDSWV01458: ERR - tar file write error (14)

1412286931 1 4 16 fdswp00305-bur 796257 796154 0 FDSWV01505 bpbrm from client FDSWV01505: ERR - tar file write error (14)

1412286934 1 4 16 fdswp00301-bur 796269 796155 0 FDSWV01517 bpbrm from client FDSWV01517: ERR - tar file write error (14)

1412287111 1 4 16 fdswp00302-bur 796255 796151 0 FDSWV01471 bpbrm from client FDSWV01471: ERR - tar file write error (14)

1412290751 1 4 16 fdswp00305-bur 796254 796133 0 FDSWV01003 bpbrm from client FDSWV01003: ERR - tar file write error (14)

1412290751 1 4 16 fdswp00303-bur 796281 796153 0 FDSWV01482 bpbrm from client FDSWV01482: ERR - tar file write error (14)

1412290751 1 4 16 fdswp00305-bur 796257 796154 0 FDSWV01505 bpbrm from client FDSWV01505: ERR - tar file write error (14)

1412290753 1 4 16 fdswp00301-bur 796269 796155 0 FDSWV01517 bpbrm from client FDSWV01517: ERR - tar file write error (14)

{etc}

 

40 failures in the last 24 hours.

 

I can view a job log with bpdbjobs.  Picking one at random (so far, they all look very much alike):

 

# /usr/openv/netbackup/bin/admincmd/bpdbjobs -jobid 796416 -all_columns | ./reformat

 

10/02/2014 19:43:21 - Info nbjm(pid=18149) starting backup job (jobid=796416) for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH

10/02/2014 19:43:21 - estimated 16226320 kbytes needed

10/02/2014 19:43:21 - Info nbjm(pid=18149) started backup (backupid=FDSWV00373_1412293401) job for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH on storage unit 305-PureDisk

10/02/2014 19:43:23 - Info bpbrm(pid=6980) FDSWV00373 is the host to backup data from

10/02/2014 19:43:23 - Info bpbrm(pid=6980) reading file list for client

10/02/2014 19:43:23 - Info bpbrm(pid=6980) accelerator enabled

10/02/2014 19:43:24 - Info bpbrm(pid=6980) starting bpbkar32 on client

10/02/2014 19:43:22 - started process bpbrm (6980)

10/02/2014 19:43:24 - connecting

10/02/2014 19:43:24 - connected; connect time: 000:00:00

10/02/2014 19:43:24 - Info bpbkar32(pid=8364) Backup started

10/02/2014 19:43:24 - Info bpbkar32(pid=8364) accelerator enabled backup, archive bit processing:<disabled>

10/02/2014 19:43:25 - Info bptm(pid=8344) start

10/02/2014 19:43:25 - Info bptm(pid=8344) using 61440 data buffer size

10/02/2014 19:43:25 - Info bptm(pid=8344) setting receive network buffer to 246784 bytes

10/02/2014 19:43:25 - Info bptm(pid=8344) using 16 data buffers

10/02/2014 19:43:25 - Info bptm(pid=8344) start backup

10/02/2014 19:43:27 - begin writing

10/02/2014 19:44:05 - Info bpbkar32(pid=8364) INF - Transport Type =  san

10/02/2014 19:45:10 - Critical bptm(pid=8344) include image failed

10/02/2014 19:45:10 - Critical bptm(pid=8344) image write failed: error 2060001: one or more invalid arguments

10/02/2014 19:46:23 - Error bptm(pid=8344) cannot write image to disk, Invalid argument

10/02/2014 19:46:23 - Info bptm(pid=8344) EXITING with status 84 <----------

10/02/2014 19:46:23 - Error bpbrm(pid=6980) from client FDSWV00373: ERR - tar file write error (14)

10/02/2014 19:46:25 - Info bpbkar32(pid=8364) accelerator sent 0 bytes out of 0 bytes to server, optimization 0.0%

10/02/2014 19:46:25 - Info bpbkar32(pid=8364) bpbkar waited 5 times for empty buffer, delayed 4394 times.

10/02/2014 19:46:26 - Info fdswp00305-bur(pid=8344) StorageServer=PureDisk:fdswp00305-bur; Report=PDDO Stats for (fdswp00305-bur): scanned: 134414 KB, CR sent: 10118 KB, CR sent over FC: 0 KB, dedup: 92.5%, cache disabled

10/02/2014 19:46:26 - Error bpbrm(pid=6980) could not send server status message

10/02/2014 19:46:26 - Critical bpbrm(pid=6980) unexpected termination of client FDSWV00373

10/02/2014 19:46:27 - Info bpbkar32(pid=0) done. status: 84: media write error

10/02/2014 19:46:27 - end writing; write time: 000:03:00

-

10/02/2014 20:01:27 - Info nbjm(pid=18149) starting backup job (jobid=796416) for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH

10/02/2014 20:01:27 - Info nbjm(pid=18149) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=796416, request id:{6BAD9324-4A90-11E4-8CC2-51EAD643EE46})

10/02/2014 20:01:27 - requesting resource 305-PureDisk

10/02/2014 20:01:27 - requesting resource fdsup00001nbu.NBU_CLIENT.MAXJOBS.FDSWV00373

10/02/2014 20:01:27 - requesting resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305

10/02/2014 20:01:29 - Info nbrb(pid=18089) Limit has been reached for the logical resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305

10/02/2014 20:20:10 - Info bpbrm(pid=9092) FDSWV00373 is the host to backup data from

10/02/2014 20:20:10 - Info bpbrm(pid=9092) reading file list for client

10/02/2014 20:20:10 - Info bpbrm(pid=9092) accelerator enabled

10/02/2014 20:20:08 - granted resource fdsup00001nbu.NBU_CLIENT.MAXJOBS.FDSWV00373

10/02/2014 20:20:08 - granted resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305

10/02/2014 20:20:08 - granted resource MediaID=@aaaap;DiskVolume=PureDiskVolume;DiskPool=305-PureDisk;Path=PureDiskVolume;StorageServer=fdswp00305-bur;MediaServer=fdswp00305-bur

10/02/2014 20:20:08 - granted resource 305-PureDisk

10/02/2014 20:20:08 - estimated 16226320 kbytes needed

10/02/2014 20:20:08 - Info nbjm(pid=18149) started backup (backupid=FDSWV00373_1412295608) job for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH on storage unit 305-PureDisk

10/02/2014 20:20:09 - started process bpbrm (9092)

10/02/2014 20:20:11 - Info bpbrm(pid=9092) starting bpbkar32 on client

10/02/2014 20:20:11 - Info bpbkar32(pid=8728) Backup started

10/02/2014 20:20:11 - Info bpbkar32(pid=8728) accelerator enabled backup, archive bit processing:<disabled>

10/02/2014 20:20:11 - Info bptm(pid=9152) start

10/02/2014 20:20:12 - Info bptm(pid=9152) using 61440 data buffer size

10/02/2014 20:20:12 - Info bptm(pid=9152) setting receive network buffer to 246784 bytes

10/02/2014 20:20:12 - Info bptm(pid=9152) using 16 data buffers

10/02/2014 20:20:11 - connecting

10/02/2014 20:20:11 - connected; connect time: 000:00:00

10/02/2014 20:20:12 - Info bptm(pid=9152) start backup

10/02/2014 20:20:18 - begin writing

10/02/2014 20:20:51 - Info bpbkar32(pid=8728) INF - Transport Type =  san

10/02/2014 20:22:25 - Info bptm(pid=9152) waited for full buffer 2037 times, delayed 3985 times

10/02/2014 20:22:26 - Info bpbkar32(pid=8728) accelerator sent 1115131392 bytes out of 14165895168 bytes to server, optimization 92.1%

10/02/2014 20:22:26 - Info bpbkar32(pid=8728) bpbkar waited 548 times for empty buffer, delayed 1106 times.

10/02/2014 20:22:42 - Info bptm(pid=9152) EXITING with status 0 <----------

10/02/2014 20:22:42 - Info fdswp00305-bur(pid=9152) StorageServer=PureDisk:fdswp00305-bur; Report=PDDO Stats for (fdswp00305-bur): scanned: 13841369 KB, CR sent: 488111 KB, CR sent over FC: 0 KB, dedup: 96.5%, cache disabled

10/02/2014 20:22:43 - Info bpbrm(pid=9092) validating image for client FDSWV00373

10/02/2014 20:22:45 - Info bpbkar32(pid=8728) done. status: 0: the requested operation was successfully completed

10/02/2014 20:22:45 - end writing; write time: 000:02:27

 

The problem is, I don’t know what I am looking at.  It reads as if the jobs ran, failed, ran again and succeeded?  The first attempt looks like the second except where the ‘granted resource’ messages should be – they are not present.

 

‘bplist’ shows a catalog entry for 10/2 20:20, which lines up with the 2nd job:

 

# bplist -C FDSWV00373 -k WIN_VM_305 -b -l -unix_files "/C/"

d--------- user      group               0 Oct 02 20:20 /C/

d--------- user      group               0 Oct 01 21:57 /C/

d--------- user      group               0 Sep 30 21:31 /C/

d--------- user      group               0 Sep 29 21:52 /C/

{etc}

 

7 REPLIES 7

RamNagalla
Moderator
Moderator
Partner    VIP    Certified

your job got successfull in 2nd attempt,

and you have vmware backup with accelator enable, though SAN transport...

backup are going to MSPD pool.

now.. do you see any other backup faillures to this MSDP pools with error 84?

looks like there was an hick-up on MSDP pool while writing the data...

so need to make sure if you get the failures on this pool for other backup...... 

all failures  are at specific time range..?

 

backupguy2014
Level 3

The jobs all run within the same window, so the times are clustered together.

We have 5 Windows-based media servers - they all show the same symptoms.

Here's one from today:

 

10/05/2014 21:01:26 - Info nbjm(pid=18149) starting backup job (jobid=801397) for client FDSWV01426, policy WIN_VM_302, schedule WIN_VMW_INCR
10/05/2014 21:01:26 - estimated 0 kbytes needed
10/05/2014 21:01:26 - Info nbjm(pid=18149) started backup (backupid=FDSWV01426_1412557286) job for client FDSWV01426, policy WIN_VM_302, schedule WIN_VMW_INCR on storage unit 302-PureDisk
10/05/2014 21:01:28 - Info bpbrm(pid=14756) FDSWV01426 is the host to backup data from
10/05/2014 21:01:28 - Info bpbrm(pid=14756) reading file list for client
10/05/2014 21:01:28 - Info bpbrm(pid=14756) accelerator enabled
10/05/2014 21:01:27 - started process bpbrm (14756)
10/05/2014 21:01:29 - Info bpbrm(pid=14756) starting bpbkar32 on client
10/05/2014 21:01:30 - Info bpbkar32(pid=9460) Backup started
10/05/2014 21:01:30 - Info bpbkar32(pid=9460) accelerator enabled backup, archive bit processing:<disabled>
10/05/2014 21:01:30 - Info bptm(pid=9208) start
10/05/2014 21:01:30 - Info bptm(pid=9208) using 61440 data buffer size
10/05/2014 21:01:30 - Info bptm(pid=9208) setting receive network buffer to 246784 bytes
10/05/2014 21:01:30 - Info bptm(pid=9208) using 16 data buffers
10/05/2014 21:01:28 - connecting
10/05/2014 21:01:29 - connected; connect time: 000:00:01
10/05/2014 21:01:31 - Info bptm(pid=9208) start backup
10/05/2014 21:01:32 - begin writing
10/05/2014 21:02:32 - Info bpbkar32(pid=9460) INF - Transport Type =  san
10/05/2014 23:13:03 - Critical bptm(pid=9208) Storage Server Error: (Storage server: PureDisk:fdswp00302-bur) write_buf: Failed to write to spoold on storage server fdswp00302-bur (Bad file descriptor). Ensure storage server services are running and operational. V-454-16
10/05/2014 23:13:03 - Critical bptm(pid=9208) image write failed: error 2060001: one or more invalid arguments
10/05/2014 23:13:05 - Critical bptm(pid=9208) sts_close_handle failed: 2060017 system call failed
10/05/2014 23:13:07 - Error bptm(pid=9208) cannot write image to disk, Invalid argument
10/05/2014 23:13:07 - Info bptm(pid=9208) EXITING with status 84 <----------
10/05/2014 23:13:07 - Error bpbrm(pid=14756) from client FDSWV01426: ERR - tar file write error (14)
10/05/2014 23:13:09 - Info bpbkar32(pid=9460) accelerator sent 0 bytes out of 0 bytes to server, optimization 0.0%
10/05/2014 23:13:10 - Info bpbkar32(pid=9460) bpbkar waited 123750 times for empty buffer, delayed 174713 times.
10/05/2014 23:13:11 - Info fdswp00303-bur(pid=9208) StorageServer=PureDisk:fdswp00302-bur; Report=PDDO Stats for (fdswp00302-bur): scanned: 163386248 KB, CR sent: 124751746 KB, CR sent over FC: 0 KB, dedup: 23.6%, cache disabled
10/05/2014 23:13:11 - Error bpbrm(pid=14756) could not send server status message
10/05/2014 23:13:11 - Critical bpbrm(pid=14756) unexpected termination of client FDSWV01426
10/05/2014 23:13:11 - Info bpbkar32(pid=0) done. status: 84: media write error
10/05/2014 23:13:11 - end writing; write time: 002:11:39

 

In the media server's bptm, tons of these:

23:13:02.353 [7560.7296] <16> 801707:bptm:7560:fdswp00302-bur: [ERROR] PDSTS: fbu_align_extents: (9132) stream_offset(8976996864) != extent.stream_offset=(8976920064)
23:13:02.353 [7560.7296] <16> 801707:bptm:7560:fdswp00302-bur: [ERROR] PDSTS: fbu_scan_buf: fbu_align_extents() failed (2060022:software error)
23:13:02.353 [7560.7296] <16> 801707:bptm:7560:fdswp00302-bur: [ERROR] PDSTS: fbu_scan_buf: err=(2060022:software error) can't process FBU format, ignore

In the media server's bpbkar log:

23:12:40.903 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Found 2 virtual volume(s)
23:12:44.226 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
23:12:44.226 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
23:12:52.322 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
23:12:52.338 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
23:12:52.338 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
23:12:52.494 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
23:12:52.494 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Virtual machine backup host: fdswp00302-bur
23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Transport Type =  san
23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VIRTUAL_FILE
23:13:00.574 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VIRTUAL_FILE
23:13:00.574 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VMFILE0_1412565138
23:13:41.383 [9484.9236] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VMFILE1_1412564822
23:13:52.474 [9484.9236] <4> tar_backup_tfi::flash_start_state: Skip writing raw partition
23:13:52.474 [9484.9236] <4> tar_backup::backup_endarg_state: INF - finish error count: 0
23:13:52.474 [9484.9236] <4> tar_backup::backup_endarg_state: INF - finish fatal count: 0
23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:                     4439 files
23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:          file data:          0 bytes
23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:         image data: 1069038592 bytes  37 gigabytes
23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:       elapsed time:        410 secs     99506063 bps
23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -     number of file directives found: 1
23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -      number of files sent to server: 0 out of 4439
23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -                          image size: 40797486080 bytes
23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -                     adjustment size: -39021356544 bytes
23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -                         stream size: 1776129536 bytes
23:13:58.589 [9484.9236] <2> tar_base::V_vTarMsgW: JBD - accelerator sent 1776129536 bytes out of 40797486080 bytes to server, optimization 95.6%
23:13:58.589 [9484.9236] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
23:13:58.589 [9484.9236] <2> ov_log::V_GlobalLog: INF - Bypass the loading of BEDS
23:13:58.589 [9484.9236] <4> send_msg_to_monitor: INF - in send_msg_to_monitor()...
23:13:58.589 [9484.9236] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 0: the requested operation was successfully completed
23:14:43.204 [9484.9236] <4> dos_backup::fscp_fini: INF - backup status:<0>
23:14:43.204 [9484.9236] <4> dos_backup::fscp_fini: INF - backup succeeded, renaming current track log
23:14:43.204 [9484.9236] <16> dos_backup::fscp_fini: ERR - ct_fini() failed, error (15)
23:14:43.204 [9484.9236] <16> dos_backup::V_PostCleanup: ERR - fscp_fini() failed, error : (15)
23:14:43.204 [9484.9236] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
23:14:43.204 [9484.9236] <4> OVStopCmd: INF - EXIT - status = 0
23:14:43.204 [9484.9236] <2> tar_base::V_Close: closing...
23:14:43.204 [9484.9236] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
23:14:44.218 [9484.9236] <4> OVShutdown: INF - Finished process
23:14:44.218 [9484.9236] <4> WinMain: INF - Exiting D:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe

 

I suppose the 'bad file descriptor' is the invalid argument that causes the image write to fail?  Where is the bad file descriptor coming from?  And why does it work ~5 minutes later?

 

backupguy2014
Level 3

Comparing the bptm log output of a failed job to a successful job (which ran 5 mins later) I noticed the following:

 

In the FAILED job, diskvolumeinfo.committed space = 818,099,694,592

In the SUCCESSFUL job, diskvolumeinfo.committed space = 1,262,662,880,256

 

And

 

When we clear the avalanche of fbu_* errors, the FAILED job emits:

 

<2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -6

whereas the SUCCESSFUL job:

<2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = 0

 

So what can I infer from a backup_status of -6?

Will_Restore
Level 6

Status 6 is generic "task failed".  A search on fbu_* errors mentions various older EEBs.  Since you are running 7.6.0.2 it would be best to upgrade to current 7.6.0.3.  If you still have trouble, open a support case. 

 

backupguy2014
Level 3

I dont like the idea of trying to hit a moving target.  If there is a sufficient mass of fixes that relate directly to fbu_* alignment errors it would probably be worth a go.

 

I had thought the alignment issue(s) might be related to our BUFFER_SIZE (61000) and the fact that we are running MSEO, which I assume generates a non-predictable amount of data to add to each write.  But that's just a guess.

backupguy2014
Level 3

Okay, our SIZE_DATA_BUFFERS on our Windows Media Servers is 61000, and it looks like NBU rounds that up to 61440.  That should leave us 4kB for MSEO overhead.

 

TECH65340 recommends SIZE_DATA_BUFFERS be set to no larger than 59392 (58kB) for systems that use MSEO.  For every job that shows the fbu_* buffer alignment errors, I see the Fragment Number is -1 (TIR) and there is always a random Remainder value in Disk Reports/Images on Disk.  Every other Fragment Number either has a Remainder of 0 or 512 (an artifact of tar?).

 

Is it possible my SIZE_DATA_BUFFERS is too large and that is what is causing all of this churn?

 

--Grasping at straws!

Will_Restore
Level 6

Sounds reasonable.  Try reducing SIZE_DATA_BUFFERS to 59392.