07-17-2017 05:16 AM
Hello All,
NetBackup 7.1
Server OS: Red Hat 2.6
Client 1 OS: Solaris 10
Client 2 OS: HP-UX
On some policies, when the backup process reaches approximately 1 TB, it fails with the error "cannot write image to media id XXXXX, drive index 34, Input/output error".
We're using a VTL, not physical tape.
However, the client and media server system logs show no I/O errors.
Any ideas?
Below are the logs from bptm:
03:57:44.364 [14818] <2> io_ioctl: command (5)MTWEOF 0 0x10 from (overwrite.c.491) on drive index 34
03:57:44.364 [14818] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_hcart2033, from overwrite.c.527
03:57:44.365 [14818] <2> tape_error_rec: absolute block position of media after error is 0 (with buffer is 0)
03:57:44.366 [14818] <16> write_data: cannot write image to media id S05552, drive index 34, Input/output error
03:57:44.366 [14818] <2> send_MDS_msg: DEVICE_STATUS 1 1998464 nbumedia S05552 4010424 hcart2033 2001498 WRITE_ERROR 0 0
03:57:44.384 [14818] <2> Orb::setDebugLevelFromVxul: Orb logging configuration level set to 0(../Orb.cpp:2254)
03:57:44.389 [14818] <2> log_media_error: successfully wrote to error file - 07/17/17 03:57:44 S05552 34 WRITE_ERROR hcart2033
03:57:44.389 [14818] <2> KILL_MM_CHILD: Sending SIGUSR2 (kill) to child 14841 (bptm.c:19188)
03:57:44.389 [14818] <2> wait_for_sigcld: waiting for child to exit, timeout is 1800
03:57:44.389 [14818] <2> child_wait: waitpid returned zero, no children waiting (tmcommon.c:5712)
03:57:44.389 [14841] <2> Media_child_dispatch_signal: calling child_catch_signal for 12 (child.c:1209) delay 0 seconds
03:57:44.389 [14841] <2> child_catch_signal: child.c.642: Entered: child_catch_signal
03:57:44.389 [14841] <2> Media_siginfo_print: 0: delay 0 signo SIGUSR2:12 code 0 pid 14818
03:57:44.402 [14818] <2> Media_siginfo_print: 0: delay 0 signo SIGCHLD:17 code 1 pid 14841
03:57:44.402 [14818] <2> child_wait: SIGCHLD: exit=0, signo=0 core=no, pid=14841 (tmcommon.c:5712)
03:57:44.402 [14818] <2> check_error_history: just tpunmount: called from bptm line 19204, EXIT_Status = 84
03:57:44.402 [14818] <2> drivename_write: Called with mode 1
03:57:44.402 [14818] <2> drivename_unlock: unlocked
03:57:44.402 [14818] <2> drivename_close: Called for file hcart2033
03:57:44.402 [14818] <2> tpunmount: NOP: MEDIA_DONE 0 1757703 0 S05552 4010424 0 {EF357672-6A80-11E7-8A94-968D68008353}
03:57:44.402 [14818] <2> bptm: EXITING with status 84 <----------
03:57:44.407 [14818] <2> cleanup: Detached from BPBRM shared memory
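For what it's worth, the log_media_error line above shows bptm appending each failure to its media error history file. Tallying that file is a quick way to see whether the same drive or virtual cartridge keeps failing. This is a rough sketch only: the file path and exact column layout are assumptions based on the logged entry ("07/17/17 03:57:44 S05552 34 WRITE_ERROR hcart2033"), so adjust both for your install.

```python
# Minimal sketch: tally bptm media error history entries per
# (media id, drive index, error type). The line layout assumed here
# mirrors the logged entry:
#   07/17/17 03:57:44 S05552 34 WRITE_ERROR hcart2033

def summarize_media_errors(lines):
    """Count error entries per (media id, drive index, error type)."""
    counts = {}
    for line in lines:
        parts = line.split()
        # parts: date, time, media id, drive index, error type, drive name
        if len(parts) >= 5 and parts[4].endswith("ERROR"):
            key = (parts[2], parts[3], parts[4])
            counts[key] = counts.get(key, 0) + 1
    return counts

# Usage (file path assumed, adjust to your install):
#   with open("/usr/openv/netbackup/db/media/errors") as fh:
#       print(summarize_media_errors(fh))
```

A media ID that shows up repeatedly points at a bad virtual cartridge; a drive index that shows up across many media points at the (virtual) drive or its path.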
BR.
Turgun
07-17-2017 06:03 AM
Status 84 is just about ALWAYS a hardware error.
The OS is reporting an I/O error, and NBU is simply reporting what it gets back from the OS.
Here is a very old but still valid in-depth troubleshooting guide for Status 84:
https://www.veritas.com/support/en_US/article.000029813
Extract:
As an application, NetBackup has no direct access to a device, instead relying on the operating system (OS) to handle any communication with the device. This means that during a write operation NetBackup asks the OS to write to the device and report back the success or failure of that operation. If there is a failure, NetBackup will merely report that a failure occurred, and any troubleshooting should start at the OS level. If the OS is unable to perform the write, there are three likely causes; OS configuration, a problem on the SCSI path, or a problem with the device.
So, the place to look for errors is /var/log/messages on the media server as well as the VTL system logs.
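To follow up on that: a minimal sketch of filtering the media server's syslog for I/O-related messages around the failure time. The keyword list is illustrative only, not an exhaustive catalogue of what a SCSI/FC path problem can log.

```python
# Rough sketch: pull syslog lines that look I/O related.
# Keywords are illustrative assumptions, not a complete list.
KEYWORDS = ("i/o error", "scsi error", "sense key",
            "reservation conflict", "link down", "device offline")

def find_io_errors(lines, keywords=KEYWORDS):
    """Return syslog lines containing any of the given keywords."""
    hits = []
    for line in lines:
        low = line.lower()
        if any(k in low for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits

# Usage:
#   with open("/var/log/messages") as fh:
#       for hit in find_io_errors(fh):
#           print(hit)
```

If this turns up nothing at the failure timestamps, the VTL appliance's own logs become the next place to look.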
07-18-2017 12:02 AM
Thank you, Marianne, for the valuable feedback.
I will work through the document you've provided.
I will post a summary later.
Thanks.
Turgun
07-18-2017 12:18 AM
Hello
If my memory serves me correctly, I had a similar issue and it was related to the bptm binary, which was crashing due to a bug after a certain amount of data had been written... Unfortunately I do not recall the exact NBU version it was happening on, but I was provided with a new bptm binary for the media server and the error was gone...
Also, I find it hard to believe a VTL would return a genuine hardware error, so unless you are running out of disk space on the VTL's backing store, this seems unlikely.
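Ruling out the disk-space angle is cheap, by the way. A minimal sketch (the mount point is hypothetical; substitute the path of your VTL's disk pool):

```python
import os

def free_gib(path):
    """Free space in GiB on the filesystem containing path,
    as seen by an unprivileged caller (f_bavail)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / 2**30

# Usage (mount point hypothetical):
#   print(free_gib("/vtl/pool"))
```

If the free space at the failure time is anywhere near zero, the VTL is simply out of room and everything else is a red herring.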
07-18-2017 12:23 AM
I have personally seen this type of error caused by bad disks in the VTL.
07-18-2017 12:54 AM
OK Marianne ;) could be - to be frank, we are not a VTL shop ;) but I did have a few some time back.
I just reviewed my cases with VRTS and my memory was serving me incorrectly - it was bpbkar, not bptm. Here is the quote from the case:
"Here is a snippet:
Abstract: Netbackup 7.5.0.4 multiplexing, multistreaming, and checkpoint enabled backup of large files results in integer overflow
Problem Description
bpbkar was sending incorrect file size due to 32 bit block number variable instead of
expected 64 bit. This caused overflow issue. This is evident only on Large sized files
since the block number used to overflow.
Customer Problem Description
The bpbkar process sends incorrect file size information to the bpbrm process which results in Media Manager waiting for additional blocks to be sent by bpbkar while bpbkar waits for bpbrm to send completion status. This results in hung backup job."
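That class of overflow is easy to illustrate. A 32-bit signed block counter wraps at exactly 2^31 blocks; with a 512-byte block size (an assumption for illustration - the case notes don't state the block size) that is exactly 1 TiB, which lines up suspiciously with the ~1 TB failure point mentioned at the top of the thread:

```python
# Demonstration of a 32-bit block-number overflow like the one described
# in the case notes. The 512-byte block size is an assumption.
BLOCK_SIZE = 512

def to_int32(n):
    """Truncate an integer to signed 32-bit (C int semantics)."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

one_tib = 1 << 40                  # 1 TiB in bytes
blocks = one_tib // BLOCK_SIZE     # 2**31 blocks
print(blocks, to_int32(blocks))    # 2147483648 wraps to -2147483648
```

A counter that goes negative at exactly the 1 TiB mark would explain a backup that consistently dies "at approx. 1 TB" - though as noted below, the fix cited was for the 7.5 line, not 7.1.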
Since you are on the 7.1 family, it looks like this is not the issue...
07-18-2017 01:26 AM
Well, seeing that this is a 7.1 environment (released in 2011) and already EOSL, one wonders how old the VTL hardware is...
Nothing is meant to last forever...
07-18-2017 03:42 AM
You'd be amazed how realistic some VTL vendors seem to have made their products ....
When troubleshooting VTL issues I treat them exactly the same as real hardware. This approach has not failed me yet (there is in fact no choice in the matter; NBU doesn't understand the difference between real tape and a VTL).
I even had a VTL case the other week where, when Vault ejected the tapes, they completely disappeared from the VTL, as if it had really ejected them. Impossible, you would think... - the VTL vendor even tried to blame NBU.