Forum Discussion

backupguy2014's avatar
11 years ago

Strange errors in NBU 7.6.0.2, VMware clients

NBU Master is on RHEL 6.4 x86_64.  Affected media servers are Win2k3.

 

# bperror -problems -hoursago 24 | grep 'tar file write error'

 

1412286931 1 4 16 fdswp00301-bur 796271 796169 0 FDSWV01423 bpbrm from client FDSWV01423: ERR - tar file write error (14)

1412286931 1 4 16 fdswp00305-bur 796254 796133 0 FDSWV01003 bpbrm from client FDSWV01003: ERR - tar file write error (14)

1412286931 1 4 16 fdswp00301-bur 796270 796157 0 FDSWV01458 bpbrm from client FDSWV01458: ERR - tar file write error (14)

1412286931 1 4 16 fdswp00305-bur 796257 796154 0 FDSWV01505 bpbrm from client FDSWV01505: ERR - tar file write error (14)

1412286934 1 4 16 fdswp00301-bur 796269 796155 0 FDSWV01517 bpbrm from client FDSWV01517: ERR - tar file write error (14)

1412287111 1 4 16 fdswp00302-bur 796255 796151 0 FDSWV01471 bpbrm from client FDSWV01471: ERR - tar file write error (14)

1412290751 1 4 16 fdswp00305-bur 796254 796133 0 FDSWV01003 bpbrm from client FDSWV01003: ERR - tar file write error (14)

1412290751 1 4 16 fdswp00303-bur 796281 796153 0 FDSWV01482 bpbrm from client FDSWV01482: ERR - tar file write error (14)

1412290751 1 4 16 fdswp00305-bur 796257 796154 0 FDSWV01505 bpbrm from client FDSWV01505: ERR - tar file write error (14)

1412290753 1 4 16 fdswp00301-bur 796269 796155 0 FDSWV01517 bpbrm from client FDSWV01517: ERR - tar file write error (14)

{etc}

 

40 failures in the last 24 hours.

 

I can view a job log with bpdbjobs.  Picking one at random (so far, they all look very much alike):

 

# /usr/openv/netbackup/bin/admincmd/bpdbjobs -jobid 796416 -all_columns | ./reformat

 

10/02/2014 19:43:21 - Info nbjm(pid=18149) starting backup job (jobid=796416) for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH

10/02/2014 19:43:21 - estimated 16226320 kbytes needed

10/02/2014 19:43:21 - Info nbjm(pid=18149) started backup (backupid=FDSWV00373_1412293401) job for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH on storage unit 305-PureDisk

10/02/2014 19:43:23 - Info bpbrm(pid=6980) FDSWV00373 is the host to backup data from

10/02/2014 19:43:23 - Info bpbrm(pid=6980) reading file list for client

10/02/2014 19:43:23 - Info bpbrm(pid=6980) accelerator enabled

10/02/2014 19:43:24 - Info bpbrm(pid=6980) starting bpbkar32 on client

10/02/2014 19:43:22 - started process bpbrm (6980)

10/02/2014 19:43:24 - connecting

10/02/2014 19:43:24 - connected; connect time: 000:00:00

10/02/2014 19:43:24 - Info bpbkar32(pid=8364) Backup started

10/02/2014 19:43:24 - Info bpbkar32(pid=8364) accelerator enabled backup, archive bit processing:<disabled>

10/02/2014 19:43:25 - Info bptm(pid=8344) start

10/02/2014 19:43:25 - Info bptm(pid=8344) using 61440 data buffer size

10/02/2014 19:43:25 - Info bptm(pid=8344) setting receive network buffer to 246784 bytes

10/02/2014 19:43:25 - Info bptm(pid=8344) using 16 data buffers

10/02/2014 19:43:25 - Info bptm(pid=8344) start backup

10/02/2014 19:43:27 - begin writing

10/02/2014 19:44:05 - Info bpbkar32(pid=8364) INF - Transport Type =  san

10/02/2014 19:45:10 - Critical bptm(pid=8344) include image failed

10/02/2014 19:45:10 - Critical bptm(pid=8344) image write failed: error 2060001: one or more invalid arguments

10/02/2014 19:46:23 - Error bptm(pid=8344) cannot write image to disk, Invalid argument

10/02/2014 19:46:23 - Info bptm(pid=8344) EXITING with status 84 <----------

10/02/2014 19:46:23 - Error bpbrm(pid=6980) from client FDSWV00373: ERR - tar file write error (14)

10/02/2014 19:46:25 - Info bpbkar32(pid=8364) accelerator sent 0 bytes out of 0 bytes to server, optimization 0.0%

10/02/2014 19:46:25 - Info bpbkar32(pid=8364) bpbkar waited 5 times for empty buffer, delayed 4394 times.

10/02/2014 19:46:26 - Info fdswp00305-bur(pid=8344) StorageServer=PureDisk:fdswp00305-bur; Report=PDDO Stats for (fdswp00305-bur): scanned: 134414 KB, CR sent: 10118 KB, CR sent over FC: 0 KB, dedup: 92.5%, cache disabled

10/02/2014 19:46:26 - Error bpbrm(pid=6980) could not send server status message

10/02/2014 19:46:26 - Critical bpbrm(pid=6980) unexpected termination of client FDSWV00373

10/02/2014 19:46:27 - Info bpbkar32(pid=0) done. status: 84: media write error

10/02/2014 19:46:27 - end writing; write time: 000:03:00

-

10/02/2014 20:01:27 - Info nbjm(pid=18149) starting backup job (jobid=796416) for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH

10/02/2014 20:01:27 - Info nbjm(pid=18149) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=796416, request id:{6BAD9324-4A90-11E4-8CC2-51EAD643EE46})

10/02/2014 20:01:27 - requesting resource 305-PureDisk

10/02/2014 20:01:27 - requesting resource fdsup00001nbu.NBU_CLIENT.MAXJOBS.FDSWV00373

10/02/2014 20:01:27 - requesting resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305

10/02/2014 20:01:29 - Info nbrb(pid=18089) Limit has been reached for the logical resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305

10/02/2014 20:20:10 - Info bpbrm(pid=9092) FDSWV00373 is the host to backup data from

10/02/2014 20:20:10 - Info bpbrm(pid=9092) reading file list for client

10/02/2014 20:20:10 - Info bpbrm(pid=9092) accelerator enabled

10/02/2014 20:20:08 - granted resource fdsup00001nbu.NBU_CLIENT.MAXJOBS.FDSWV00373

10/02/2014 20:20:08 - granted resource fdsup00001nbu.NBU_POLICY.MAXJOBS.WIN_VM_305

10/02/2014 20:20:08 - granted resource MediaID=@aaaap;DiskVolume=PureDiskVolume;DiskPool=305-PureDisk;Path=PureDiskVolume;StorageServer=fdswp00305-bur;MediaServer=fdswp00305-bur

10/02/2014 20:20:08 - granted resource 305-PureDisk

10/02/2014 20:20:08 - estimated 16226320 kbytes needed

10/02/2014 20:20:08 - Info nbjm(pid=18149) started backup (backupid=FDSWV00373_1412295608) job for client FDSWV00373, policy WIN_VM_305, schedule WIN_VMW_MNTH on storage unit 305-PureDisk

10/02/2014 20:20:09 - started process bpbrm (9092)

10/02/2014 20:20:11 - Info bpbrm(pid=9092) starting bpbkar32 on client

10/02/2014 20:20:11 - Info bpbkar32(pid=8728) Backup started

10/02/2014 20:20:11 - Info bpbkar32(pid=8728) accelerator enabled backup, archive bit processing:<disabled>

10/02/2014 20:20:11 - Info bptm(pid=9152) start

10/02/2014 20:20:12 - Info bptm(pid=9152) using 61440 data buffer size

10/02/2014 20:20:12 - Info bptm(pid=9152) setting receive network buffer to 246784 bytes

10/02/2014 20:20:12 - Info bptm(pid=9152) using 16 data buffers

10/02/2014 20:20:11 - connecting

10/02/2014 20:20:11 - connected; connect time: 000:00:00

10/02/2014 20:20:12 - Info bptm(pid=9152) start backup

10/02/2014 20:20:18 - begin writing

10/02/2014 20:20:51 - Info bpbkar32(pid=8728) INF - Transport Type =  san

10/02/2014 20:22:25 - Info bptm(pid=9152) waited for full buffer 2037 times, delayed 3985 times

10/02/2014 20:22:26 - Info bpbkar32(pid=8728) accelerator sent 1115131392 bytes out of 14165895168 bytes to server, optimization 92.1%

10/02/2014 20:22:26 - Info bpbkar32(pid=8728) bpbkar waited 548 times for empty buffer, delayed 1106 times.

10/02/2014 20:22:42 - Info bptm(pid=9152) EXITING with status 0 <----------

10/02/2014 20:22:42 - Info fdswp00305-bur(pid=9152) StorageServer=PureDisk:fdswp00305-bur; Report=PDDO Stats for (fdswp00305-bur): scanned: 13841369 KB, CR sent: 488111 KB, CR sent over FC: 0 KB, dedup: 96.5%, cache disabled

10/02/2014 20:22:43 - Info bpbrm(pid=9092) validating image for client FDSWV00373

10/02/2014 20:22:45 - Info bpbkar32(pid=8728) done. status: 0: the requested operation was successfully completed

10/02/2014 20:22:45 - end writing; write time: 000:02:27

 

The problem is, I don’t know what I am looking at.  It reads as if the jobs ran, failed, ran again and succeeded?  The first attempt looks like the second except where the ‘granted resource’ messages should be – they are not present.

 

‘bplist’ shows a catalog entry for 10/2 20:20, which lines up with the 2nd job:

 

# bplist -C FDSWV00373 -k WIN_VM_305 -b -l -unix_files "/C/"

d--------- user      group               0 Oct 02 20:20 /C/

d--------- user      group               0 Oct 01 21:57 /C/

d--------- user      group               0 Sep 30 21:31 /C/

d--------- user      group               0 Sep 29 21:52 /C/

{etc}

 

7 Replies

  • your job got successfull in 2nd attempt,

    and you have vmware backup with accelator enable, though SAN transport...

    backup are going to MSPD pool.

    now.. do you see any other backup faillures to this MSDP pools with error 84?

    looks like there was an hick-up on MSDP pool while writing the data...

    so need to make sure if you get the failures on this pool for other backup...... 

    all failures  are at specific time range..?

     

  • The jobs all run within the same window, so the times are clustered together.

    We have 5 Windows-based media servers - they all show the same symptoms.

    Here's one from today:

     

    10/05/2014 21:01:26 - Info nbjm(pid=18149) starting backup job (jobid=801397) for client FDSWV01426, policy WIN_VM_302, schedule WIN_VMW_INCR
    10/05/2014 21:01:26 - estimated 0 kbytes needed
    10/05/2014 21:01:26 - Info nbjm(pid=18149) started backup (backupid=FDSWV01426_1412557286) job for client FDSWV01426, policy WIN_VM_302, schedule WIN_VMW_INCR on storage unit 302-PureDisk
    10/05/2014 21:01:28 - Info bpbrm(pid=14756) FDSWV01426 is the host to backup data from
    10/05/2014 21:01:28 - Info bpbrm(pid=14756) reading file list for client
    10/05/2014 21:01:28 - Info bpbrm(pid=14756) accelerator enabled
    10/05/2014 21:01:27 - started process bpbrm (14756)
    10/05/2014 21:01:29 - Info bpbrm(pid=14756) starting bpbkar32 on client
    10/05/2014 21:01:30 - Info bpbkar32(pid=9460) Backup started
    10/05/2014 21:01:30 - Info bpbkar32(pid=9460) accelerator enabled backup, archive bit processing:<disabled>
    10/05/2014 21:01:30 - Info bptm(pid=9208) start
    10/05/2014 21:01:30 - Info bptm(pid=9208) using 61440 data buffer size
    10/05/2014 21:01:30 - Info bptm(pid=9208) setting receive network buffer to 246784 bytes
    10/05/2014 21:01:30 - Info bptm(pid=9208) using 16 data buffers
    10/05/2014 21:01:28 - connecting
    10/05/2014 21:01:29 - connected; connect time: 000:00:01
    10/05/2014 21:01:31 - Info bptm(pid=9208) start backup
    10/05/2014 21:01:32 - begin writing
    10/05/2014 21:02:32 - Info bpbkar32(pid=9460) INF - Transport Type =  san
    10/05/2014 23:13:03 - Critical bptm(pid=9208) Storage Server Error: (Storage server: PureDisk:fdswp00302-bur) write_buf: Failed to write to spoold on storage server fdswp00302-bur (Bad file descriptor). Ensure storage server services are running and operational. V-454-16
    10/05/2014 23:13:03 - Critical bptm(pid=9208) image write failed: error 2060001: one or more invalid arguments
    10/05/2014 23:13:05 - Critical bptm(pid=9208) sts_close_handle failed: 2060017 system call failed
    10/05/2014 23:13:07 - Error bptm(pid=9208) cannot write image to disk, Invalid argument
    10/05/2014 23:13:07 - Info bptm(pid=9208) EXITING with status 84 <----------
    10/05/2014 23:13:07 - Error bpbrm(pid=14756) from client FDSWV01426: ERR - tar file write error (14)
    10/05/2014 23:13:09 - Info bpbkar32(pid=9460) accelerator sent 0 bytes out of 0 bytes to server, optimization 0.0%
    10/05/2014 23:13:10 - Info bpbkar32(pid=9460) bpbkar waited 123750 times for empty buffer, delayed 174713 times.
    10/05/2014 23:13:11 - Info fdswp00303-bur(pid=9208) StorageServer=PureDisk:fdswp00302-bur; Report=PDDO Stats for (fdswp00302-bur): scanned: 163386248 KB, CR sent: 124751746 KB, CR sent over FC: 0 KB, dedup: 23.6%, cache disabled
    10/05/2014 23:13:11 - Error bpbrm(pid=14756) could not send server status message
    10/05/2014 23:13:11 - Critical bpbrm(pid=14756) unexpected termination of client FDSWV01426
    10/05/2014 23:13:11 - Info bpbkar32(pid=0) done. status: 84: media write error
    10/05/2014 23:13:11 - end writing; write time: 002:11:39

     

    In the media server's bptm, tons of these:

    23:13:02.353 [7560.7296] <16> 801707:bptm:7560:fdswp00302-bur: [ERROR] PDSTS: fbu_align_extents: (9132) stream_offset(8976996864) != extent.stream_offset=(8976920064)
    23:13:02.353 [7560.7296] <16> 801707:bptm:7560:fdswp00302-bur: [ERROR] PDSTS: fbu_scan_buf: fbu_align_extents() failed (2060022:software error)
    23:13:02.353 [7560.7296] <16> 801707:bptm:7560:fdswp00302-bur: [ERROR] PDSTS: fbu_scan_buf: err=(2060022:software error) can't process FBU format, ignore

    In the media server's bpbkar log:

    23:12:40.903 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Found 2 virtual volume(s)
    23:12:44.226 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
    23:12:44.226 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
    23:12:52.322 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
    23:12:52.338 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
    23:12:52.338 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog C:
    23:12:52.494 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
    23:12:52.494 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
    23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
    23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
    23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog D:
    23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Virtual machine backup host: fdswp00302-bur
    23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Transport Type =  san
    23:13:00.278 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VIRTUAL_FILE
    23:13:00.574 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VIRTUAL_FILE
    23:13:00.574 [9424.7312] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VMFILE0_1412565138
    23:13:41.383 [9484.9236] <2> tar_base::V_vTarMsgW: INF - Files System Name to catalog VMFILE1_1412564822
    23:13:52.474 [9484.9236] <4> tar_backup_tfi::flash_start_state: Skip writing raw partition
    23:13:52.474 [9484.9236] <4> tar_backup::backup_endarg_state: INF - finish error count: 0
    23:13:52.474 [9484.9236] <4> tar_backup::backup_endarg_state: INF - finish fatal count: 0
    23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:                     4439 files
    23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:          file data:          0 bytes
    23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:         image data: 1069038592 bytes  37 gigabytes
    23:13:58.589 [9484.9236] <2> tar_base::backup_finish: TAR - backup:       elapsed time:        410 secs     99506063 bps
    23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF - number of file directives not found: 0
    23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -     number of file directives found: 1
    23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -      number of files sent to server: 0 out of 4439
    23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -                          image size: 40797486080 bytes
    23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -                     adjustment size: -39021356544 bytes
    23:13:58.589 [9484.9236] <4> tar_backup::backup_done_state: INF -                         stream size: 1776129536 bytes
    23:13:58.589 [9484.9236] <2> tar_base::V_vTarMsgW: JBD - accelerator sent 1776129536 bytes out of 40797486080 bytes to server, optimization 95.6%
    23:13:58.589 [9484.9236] <4> tar_base::stopKeepaliveThread: INF - keepalive thread has exited. (reason: WAIT_OBJECT_0)
    23:13:58.589 [9484.9236] <2> ov_log::V_GlobalLog: INF - Bypass the loading of BEDS
    23:13:58.589 [9484.9236] <4> send_msg_to_monitor: INF - in send_msg_to_monitor()...
    23:13:58.589 [9484.9236] <2> tar_base::V_vTarMsgW: INF - EXIT STATUS 0: the requested operation was successfully completed
    23:14:43.204 [9484.9236] <4> dos_backup::fscp_fini: INF - backup status:<0>
    23:14:43.204 [9484.9236] <4> dos_backup::fscp_fini: INF - backup succeeded, renaming current track log
    23:14:43.204 [9484.9236] <16> dos_backup::fscp_fini: ERR - ct_fini() failed, error (15)
    23:14:43.204 [9484.9236] <16> dos_backup::V_PostCleanup: ERR - fscp_fini() failed, error : (15)
    23:14:43.204 [9484.9236] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
    23:14:43.204 [9484.9236] <4> OVStopCmd: INF - EXIT - status = 0
    23:14:43.204 [9484.9236] <2> tar_base::V_Close: closing...
    23:14:43.204 [9484.9236] <4> dos_backup::tfs_reset: INF - Snapshot deletion start
    23:14:44.218 [9484.9236] <4> OVShutdown: INF - Finished process
    23:14:44.218 [9484.9236] <4> WinMain: INF - Exiting D:\Program Files\Veritas\NetBackup\bin\bpbkar32.exe

     

    I suppose the 'bad file descriptor' is the invalid argument that causes the image write to fail?  Where is the bad file descriptor coming from?  And why does it work ~5 minutes later?

     

  • Comparing the bptm log output of a failed job to a successful job (which ran 5 mins later) I noticed the following:

     

    In the FAILED job, diskvolumeinfo.committed space = 818,099,694,592

    In the SUCCESSFUL job, diskvolumeinfo.committed space = 1,262,662,880,256

     

    And

     

    When we clear the avalanche of fbu_* errors, the FAILED job emits:

     

    <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = -6

    whereas the SUCCESSFUL job:

    <2> write_backup: write_data() returned, exit_status = 0, CINDEX = 0, TWIN_INDEX = 0, backup_status = 0

     

    So what can I infer from a backup_status of -6?

  • Status 6 is generic "task failed".  A search on fbu_* errors mentions various older EEBs.  Since you are running 7.6.0.2 it would be best to upgrade to current 7.6.0.3.  If you still have trouble, open a support case. 

     

  • I dont like the idea of trying to hit a moving target.  If there is a sufficient mass of fixes that relate directly to fbu_* alignment errors it would probably be worth a go.

     

    I had thought the alignment issue(s) might be related to our BUFFER_SIZE (61000) and the fact that we are running MSEO, which I assume generates a non-predictable amount of data to add to each write.  But that's just a guess.

  • Okay, our SIZE_DATA_BUFFERS on our Windows Media Servers is 61000, and it looks like NBU rounds that up to 61440.  That should leave us 4kB for MSEO overhead.

     

    TECH65340 recommends SIZE_DATA_BUFFERS be set to no larger than 59392 (58kB) for systems that use MSEO.  For every job that shows the fbu_* buffer alignment errors, I see the Fragment Number is -1 (TIR) and there is always a random Remainder value in Disk Reports/Images on Disk.  Every other Fragment Number either has a Remainder of 0 or 512 (an artifact of tar?).

     

    Is it possible my SIZE_DATA_BUFFERS is too large and that is what is causing all of this churn?

     

    --Grasping at straws!