06-17-2014 10:21 PM
I got a NBU master server and a media server, the media server has a SL48 robot with 2 SCSI-LTO4
All my backups are running just fine, mounting tapes all the time, except my Vault_catalogbackup.
When the catalog backup starts it tries to mount 2 inline copies, NLTxxx NRTxxx
Looking at the NBU logs it seems it can't mount any of them, timing out.
The strange thing is that it works ~ 4 days a week and fails ~ 3
Server: bpkar log
22:00:48.129 [21007] <2> bpbkar resolve_path: INF - Actual mount point of /opt/openv/db/staging is /opt/openv/db/staging
22:00:48.129 [21007] <2> bpbkar SelectFile: INF - Resolved_path = /opt/openv/db/staging/DBM_DATA.db
22:00:48.130 [21007] <4> bpbkar PrintFile: /opt/openv/db/staging/
22:01:40.949 [21007] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
22:01:40.950 [21007] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed
22:01:40.950 [21007] <4> bpbkar Exit: INF - EXIT STATUS 24: socket write failed
22:01:40.950 [21007] <2> bpbkar Exit: INF - Close of stdout complete
22:01:40.950 [21007] <4> bpbkar Exit: INF - setenv FINISHED=0
media server: bptm log
22:00:47.877 [12382] <2> tapelib: wait_for_ltid, Mount, timeout 0
22:01:10.384 [12397] <2> bptm: INITIATING (VERBOSE = 5): -rptdrv -jobid -1402269145 -jm
22:01:10.384 [12397] <2> bptm: PORT_STATUS = 0x00000000
22:01:10.384 [12397] <2> drivename_open: Called with Create 0, file HP.ULTRIUM4-SCSI.001
22:01:10.384 [12397] <2> drivename_checklock: Called
22:01:10.384 [12397] <2> drivename_checklock: PID 12382 has lock
22:01:10.384 [12397] <2> report_drives: DRIVE = HP.ULTRIUM4-SCSI.001 LOCK = TRUE CURTIME = 1403006470
22:01:10.384 [12397] <2> report_drives: MODE = 0
22:01:10.384 [12397] <2> report_drives: TIME = 1403006446
22:01:10.384 [12397] <2> report_drives: MASTER = ipndms
22:01:10.384 [12397] <2> report_drives: SR_KEY = 0 1
22:01:10.384 [12397] <2> report_drives: PATH = /dev/rmt/1cbn
22:01:10.384 [12397] <2> report_drives: MEDIA = NLT004
22:01:10.384 [12397] <2> report_drives: REQID = -1402269143
22:01:10.384 [12397] <2> report_drives: ALOCID = 139498
22:01:10.384 [12397] <2> report_drives: RBID = {F825C9EE-F616-11E3-894C-00144FF80172}
22:01:10.385 [12397] <2> report_drives: PID = 12382
22:01:10.385 [12397] <2> report_drives: FILE = /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM4-SCSI.001
22:01:10.385 [12397] <2> drivename_open: Called with Create 0, file HP.ULTRIUM4-SCSI.002
22:01:10.385 [12397] <2> drivename_checklock: Called
22:01:10.385 [12397] <2> drivename_checklock: PID 12382 has lock
22:01:10.385 [12397] <2> report_drives: DRIVE = HP.ULTRIUM4-SCSI.002 LOCK = TRUE CURTIME = 1403006470
22:01:10.385 [12397] <2> report_drives: MODE = 0
22:01:10.385 [12397] <2> report_drives: TIME = 1403006446
22:01:10.385 [12397] <2> report_drives: MASTER = ipndms
22:01:10.385 [12397] <2> report_drives: SR_KEY = 0 1
22:01:10.385 [12397] <2> report_drives: PATH = /dev/rmt/0cbn
22:01:10.385 [12397] <2> report_drives: MEDIA = NRT101
22:01:10.385 [12397] <2> report_drives: REQID = -1402269143
22:01:10.385 [12397] <2> report_drives: ALOCID = 139499
22:01:10.385 [12397] <2> report_drives: RBID = {F82ACD18-F616-11E3-A122-00144FF80172}
22:01:10.385 [12397] <2> report_drives: PID = 12382
22:01:10.385 [12397] <2> report_drives: FILE = /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM4-SCSI.002
22:01:10.385 [12397] <2> main: Sending [EXIT STATUS 0] to NBJM
22:01:10.385 [12397] <2> bptm: EXITING with status 0 <----------
22:01:40.987 [12382] <2> Media_signal_poll: 2:Terminate detected (tapelib.c:615)
22:01:40.987 [12382] <2> mount_open_media: mount canceled detected in tpreq(), signo = 1
22:01:40.987 [12382] <2> set_job_details: Tfile (61970): LOG 1403006500 16 bptm 12382 media manager terminated during mount of media id NLT004, possible media mount timeout
I have tried freezing certain tapes to lock out particular tape problems, doesnt' matter what I do.
Checking the SL48 Library logs tells me nothing either.
Any ideas what to do next?
- Roland
Solved! Go to Solution.
09-02-2014 02:33 PM
Hi,
This post got a bit sidelined...
Yes I think I solved it, think...
There was a stray sticker that had went off a tape inside the library.
It was discovered when a tape was ejected without a sticker!
After pulling the library apart and removed the sticker the problem dissappeared.
For what ever reason the sticker must have been in a very odd spot to cause such strange behaviour.
And why just the Vault?
Maybe vault does a scan of all the tapes and thus hit the sticker "bug" all the time?
One could only speculate!
- Roland
06-19-2014 10:26 PM
Very interesting! I think you might having some form of media contention, like it needs to read and write to the same tape or similar. Do any of your backups span tapes? Or just to confirm, you mention catalog backup, and 2 inline copies, which would require 3 drives if I'm reading it right?
I'd generally recommend opening a case with support, as vault cases can be served by another pair of eyes on the issue (it's a bit like a logic puzzle, and a party with no prior knowledge of the environment can see conflicts more quickly).
However if this isn't an option for you, the next step would be to look at the vault logs (mainly in the relevant SID directory), detail.log will be most helpful probably. You don't mention which version of NetBackup you are running, and the logs have changed location and name over the versions but this may assist: http://www.symantec.com/docs/TECH27773.
Best of luck!
06-19-2014 10:36 PM
Hi,
I'm using 7.1.0.4.
The Vault kicks off a catalog backup as per standard, the schedule defines the 2 inline copies.
So it tries to mount 2 tapes to do that NRTxxx and NLTxxx.
I have checked the tapes and there are available tapes in the 2 pools they reside in.
As it is only Catalog backup the backup is tiny it wont spill over to another tape.
Sometimes it actually mounts one of the tapes but not the other, very strange.
It has succeded the last 2 nights...
Maybe I should open a case, I just thought it might be something easy I have overlooked.
- Roland
06-20-2014 12:24 AM
Please show us output of the following for these 2 tapes:
nbemmcmd -listmedia -mediaid <media-id>
Also use robtest to see if you can manually mount these tapes.
Robtest commands that can be used to test the SCSI functionality of a robot
http://www.symantec.com/docs/TECH83129
Increase logging level for bptm (level 3 should be fine).
Copy log to bptm.txt and upload as File attachment.
Oh, one more thing to check - what is Media Mount Timeout set to?
Weird that we see a mount failure/termination within 1 minute.
My recommendation is to have Media Mount Timeout of 15 minutes or more.
06-24-2014 09:21 PM
Hi,
I have done some more testing.
I can move the tape manually with robtest and the SL48 WEB-GUI, still no errors in the SL48 logs.
On Friday it succesfully used NLT104, but on Saturday it didn't mount properly and since then mounted the tapes every time.
I do get some Vault errors but I think it fails because it can't mount the tape.
These are socket write failures.
It happily retries an hour later and runs fine.
62237 Catalog Backup Done 0 NBU_Catalog_tape Vault_CatalogBackup NRT101 ipndms 06/21/14 23:13:39 000:06:51 06/21/14 23:21:09 10689344 36503 100
62236 Catalog Backup Done 0 NBU_Catalog_tape Vault_CatalogBackup NLT004 ipndms 06/21/14 23:13:39 000:06:51 06/21/14 23:21:09 10689344 36503 100
62235 Catalog Backup Done 0 NBU_Catalog_tape Vault_CatalogBackup ipndms 06/21/14 23:13:39 000:06:51 06/21/14 23:21:09 100
62234 Catalog Backup Done 0 NBU_Catalog_tape Vault_CatalogBackup NRT101 ipndms 06/21/14 23:08:38 000:04:50 06/21/14 23:13:29 32480 86844 100
62233 Catalog Backup Done 0 NBU_Catalog_tape Vault_CatalogBackup NLT004 ipndms 06/21/14 23:08:38 000:04:50 06/21/14 23:13:29 32480 86844 100
62232 Catalog Backup Done 0 NBU_Catalog_tape Vault_CatalogBackup ipndms 06/21/14 23:08:38 000:04:50 06/21/14 23:13:29 100
62231 Catalog Backup Done 0 NBU_Catalog_tape Vault_CatalogBackup ipndms 06/21/14 23:08:33 000:05:03 06/21/14 23:13:36 100
62230 Catalog Backup Done 0 NBU_Catalog_tape - ipndms 06/21/14 23:08:32 000:12:37 06/21/14 23:21:09 100
62229 Vault Done 0 Run_Vault Daily_Vault ipndms 06/21/14 23:08:17 000:19:32 06/21/14 23:27:49 100
62228 Catalog Backup Done 24 NBU_Catalog_tape Vault_CatalogBackup NRT109 ipndms 06/21/14 22:00:26 000:01:12 06/21/14 22:01:39 100
62227 Catalog Backup Done 24 NBU_Catalog_tape Vault_CatalogBackup NLT004 ipndms 06/21/14 22:00:26 000:01:12 06/21/14 22:01:39 100
62226 Catalog Backup Done 24 NBU_Catalog_tape Vault_CatalogBackup ipndms 06/21/14 22:00:26 000:01:12 06/21/14 22:01:39 100
62225 Catalog Backup Done 2 NBU_Catalog_tape Vault_CatalogBackup ipndms 06/21/14 22:00:14 000:01:16 06/21/14 22:01:30 100
62224 Catalog Backup Done 2 NBU_Catalog_tape - ipndms 06/21/14 22:00:14 000:01:25 06/21/14 22:01:39 100
62223 Vault Done 294 Run_Vault Daily_Vault ipndms 06/21/14 22:00:00 000:08:19 06/21/14 22:08:19 100
Master Server:
root@ipndms:~# ndd -get /dev/tcp tcp_time_wait_interval
10000
root@ipndms:~# ndd -get /dev/tcp tcp_fin_wait_2_flush_interval
67500
root@ipndms:~# ndd -get /dev/tcp tcp_keepalive_interval
60000
root@ipndms:~#
Media Server
root@pnms01:(LT)~# ndd -get /dev/tcp tcp_time_wait_interval
10000
root@pnms01:(LT)~# ndd -get /dev/tcp tcp_fin_wait_2_flush_interval
67500
root@pnms01:(LT)~# ndd -get /dev/tcp tcp_keepalive_interval
60000
root@pnms01:(LT)~#
These servers don't do anything else than backups.
Media timeout is not changed that I know of.
Master Server Properties has Timeouts->Media Mount Timeout unselected, does that mean it uses a default value?
root@ipndms:~# /opt/openv/netbackup/bin/admincmd/bpgetconfig -M ipndms|grep -i mount
BELIEVE_MOUNTS = NO
MEDIA_UNMOUNT_DELAY = 180
root@ipndms:~#
loglevel is always 5
root@ipndms:~# /opt/openv/netbackup/bin/admincmd/nbemmcmd -listmedia -mediaid NLT004
NBEMMCMD, Version:7.1
====================================================================
Media GUID: 37ce8264-307b-11de-8000-a996759a3cf5
Media ID: NLT004
Partner: -
Media Type: HCART
Volume Group: 000_00000_TLD
Application: Netbackup
Media Flags: 1
Description: NetBackup DB Local LTO4
Barcode: NLT004L4
Partner Barcode: --------
Last Write Host: pnms01
Created: 04/23/2009 16:54
Time Assigned: 06/10/2014 22:00
First Mount: 11/25/2011 19:05
Last Mount: 06/24/2014 22:06
Volume Expiration: -
Data Expiration: 06/30/2014 22:05
Last Written: 06/24/2014 22:05
Last Read: -
Robot Type: TLD
Robot Control Host: pnms01
Robot Number: 0
Slot: 18
Side/Face: -
Cleanings Remaining: -
Number of Mounts: 508
Maximum Mounts Allowed: 0
Media Status: ACTIVE
Kilobytes: 151886377
Images: 42
Valid Images: 18
Retention Period: 4
Number of Restores: 0
Optical Header Size Bytes: 1024
Optical Sector Size Bytes: 0
Optical Partition Size Bytes: 0
Last Header Offset: 593413
Adamm Guid: 00000000-0000-0000-0000-000000000000
Rsm Guid: 00000000-0000-0000-0000-000000000000
Origin Host: NONE
Master Host: ipndms
Server Group: NO_SHARING_GROUP
Upgrade Conflicts Flag:
Pool Number: 3
Volume Pool: CatalogBackup
Previous Pool Name: -
Vault Flags: -
Vault Container: -
Vault Name: -
Vault Slot: -
Session ID: -
Date Vaulted: -
Return Date: -
====================================================================
Command completed successfully.
root@ipndms:~# /opt/openv/netbackup/bin/admincmd/nbemmcmd -listmedia -mediaid NRT109
NBEMMCMD, Version:7.1
====================================================================
Media GUID: d7ad85c2-da9b-11e2-8000-92a5b15f1c60
Media ID: NRT109
Partner: -
Media Type: HCART
Volume Group: 000_00000_TLD
Application: Netbackup
Media Flags: 1
Description: NetBackup DB Remote LTO4
Barcode: NRT109L4
Partner Barcode: --------
Last Write Host: NONE
Created: 06/21/2013 07:55
Time Assigned: -
First Mount: 06/21/2013 23:03
Last Mount: 03/27/2014 22:03
Volume Expiration: -
Data Expiration: -
Last Written: -
Last Read: -
Robot Type: TLD
Robot Control Host: pnms01
Robot Number: 0
Slot: 9
Side/Face: -
Cleanings Remaining: -
Number of Mounts: 51
Maximum Mounts Allowed: 0
Media Status: ACTIVE
Kilobytes: 0
Images: 0
Valid Images: 0
Retention Period: -
Number of Restores: 0
Optical Header Size Bytes: 0
Optical Sector Size Bytes: 0
Optical Partition Size Bytes: 0
Last Header Offset: 0
Adamm Guid: 00000000-0000-0000-0000-000000000000
Rsm Guid: 00000000-0000-0000-0000-000000000000
Origin Host: NONE
Master Host: ipndms
Server Group: -
Upgrade Conflicts Flag:
Pool Number: 12
Volume Pool: CatalogBackupOffsite
Previous Pool Name: -
Vault Flags: -
Vault Container: -
Vault Name: -
Vault Slot: -
Session ID: -
Date Vaulted: -
Return Date: -
====================================================================
Command completed successfully.
bptm log attached for Saturday with the failed (at 22:00) and the succefull (23:08) mount
- Roland
06-26-2014 09:37 PM
The tape NRT109 that didn't mount on Saturday was used without any problems on Wednesday.
This is really strange.
- Roland
06-26-2014 11:44 PM
The mount failure at 22:00 was for NLT004.
22:01:27.408 [8228] <2> mount_open_media: mount canceled detected in tpreq(), signo = 1 22:01:27.408 [8228] <2> set_job_details: Tfile (62227): LOG 1403352087 16 bptm 8228 media manager terminated during mount of media id NLT004, possible media mount timeout 22:01:29.480 [8228] <16> catch_signal: media manager terminated by parent process
NRT101 and NLT004 were used at 23:08.
NRT109 is not assigned and was last mounted in March: 03/27/2014 22:03
From what I see in bptm, the 'possible' media mount timeout is a red herring.
There is no evidence of any kind of hardware issues.
It is the 'parent process' that is killing the mount attempt after 1 minute:
media manager terminated by parent process
So, my assumption is that bpbrm is the 'parent process'.
Seems something else is causing the catalog backup to fail and subsequently killing the mount request.
Please show us all the text in Details tab of failed job(s) as well as the parent job for the Catalog backup.
If 2 separate child jobs are created for the 2 tape backups, please show us both.
Please post bpbrm log as well for the same period.
06-29-2014 04:32 PM
ok, I had a feeling for the same.
This happens a lot on Saturdays!! wonder why...
This is this Saturdays failed attempt at 22:00 in the bpbrm.log
22:00:29.254 [15297] <2> pfi_start_client: command = /usr/openv/netbackup/bin/bpbkar bpbkar -L /usr/openv/netbackup/logs/user_ops/dbext/logs/vxbsa.1403956819.12909.prog.pcb_std -IEL -PCB /usr/openv/netback
up/logs/user_ops/dbext/logs/vxbsa.1403956819.12909.files.1 -ct 7 -dt 0 -to 0 -read_to 300 -clnt ipndms -class NBU_Catalog_tape -sched Vault_CatalogBackup -st UBAK -bpstart_to 300 -bpend_to 300 -keyword NBD
B:62675:1403956817:F -bt 1403956823 -fso -b ipndms_1403956823 -kl 10
22:00:29.254 [15297] <2> pfi_start_client: received bpcd success message
22:00:29.524 [15297] <2> pfi_start_client: read start message from ipndms, msg=<INF - BACKUP START 12963>
22:00:29.524 [15297] <2> set_job_details: Tfile (62678): LOG 1403956829 4 bphdb 12963 Backup started
22:00:29.524 [15297] <2> bpbrm main: client_pid=12912
22:00:29.524 [15297] <2> set_job_details: Tfile (62678): LOG 1403956829 4 bphdb 12912 Backup started
22:00:29.524 [15297] <2> bpbrm main: from client ipndms: read client start message
22:00:29.524 [15297] <2> bpbrm spawn_child: /usr/openv/netbackup/bin/bptm bptm -w -c ipndms -den 6 -rt 8 -rn 0 -stunit pnms01-hcart-robot-tld-0 -cl NBU_Catalog_tape -bt 1403956823 -b ipndms_1403956823 -st
2 -cj 2 -p CatalogBackup -p CatalogBackupOffsite -stunit pnms01-hcart-robot-tld-0 -cj 2 -twin_fail_on_error 0 -twin_fail_on_error 0 -reqid -1402274654 -jm -brm -hostname ipndms -L /usr/openv/netbackup/logs
/user_ops/dbext/logs/vxbsa.1403956819.12909.prog.pcb_std -ru root -rclnt ipndms -rclnthostname ipndms -rl 4 -rl 4 -rp 518400 -rp 518400 -sl Vault_CatalogBackup -ct 7 -maxfrag 1048576 -eari 0 -eari 0 -v -me
diasvr pnms01 -no_callback -connect_options 0x01010100 -jobid 62678 -jobgrpid 62675 -masterversion 710000 -bpbrm_shm_id 1862270989 -blks_per_buffer 512
22:00:29.525 [15297] <2> set_job_details: Tfile (62678): LOG 1403956829 4 bpbrm 15297 bptm pid: 15298
22:00:29.526 [15297] <2> bpbrm write_continue_backup: wrote CONTINUE BACKUP on COMM_SOCK <11>
22:00:29.526 [15297] <2> write_file_names: buffering file name '/usr/openv/db/staging/DARS_DATA.db' for output
22:00:29.526 [15297] <2> write_file_names: successfully wrote buffer to COMM_SOCK
22:00:29.526 [15297] <2> bpbrm main: wrote CONTINUE on COMM_SOCK
22:00:29.526 [15297] <2> bpbrm main: closing DATA_SOCK
22:00:29.526 [15297] <2> bpbrm main: closing COMM_SOCK
22:00:29.526 [15297] <2> bpbrm main: ESTIMATE -1 -1 ipndms_1403956823
22:00:31.027 [15297] <2> bpbrm main: ADDED FILES TO DB FOR ipndms_1403956823 1 /opt/openv/db/staging/DARS_DATA.db
22:01:26.406 [15297] <16> bpbrm main: from client ipndms: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
22:01:26.406 [15297] <2> set_job_details: Tfile (62678): LOG 1403956886 16 bpbrm 15297 from client ipndms: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
And success at 23:08.
23:08:46.936 [5251] <2> pfi_start_client: command = /usr/openv/netbackup/bin/bpbkar bpbkar -L /usr/openv/netbackup/logs/user_ops/dbext/logs/vxbsa.1403960917.22232.prog.pcb_std -IEL -PCB /usr/openv/netbacku
p/logs/user_ops/dbext/logs/vxbsa.1403960917.22232.files.1 -ct 7 -dt 0 -to 0 -read_to 300 -clnt ipndms -class NBU_Catalog_tape -sched Vault_CatalogBackup -st UBAK -bpstart_to 300 -bpend_to 300 -keyword NBDB
:62681:1403960914:F -bt 1403960921 -fso -b ipndms_1403960921 -kl 10
23:08:46.936 [5251] <2> pfi_start_client: received bpcd success message
23:08:47.176 [5251] <2> pfi_start_client: read start message from ipndms, msg=<INF - BACKUP START 24759>
23:08:47.176 [5251] <2> set_job_details: Tfile (62684): LOG 1403960927 4 bphdb 24759 Backup started
23:08:47.176 [5251] <2> bpbrm main: client_pid=22816
23:08:47.176 [5251] <2> set_job_details: Tfile (62684): LOG 1403960927 4 bphdb 22816 Backup started
23:08:47.176 [5251] <2> bpbrm main: from client ipndms: read client start message
23:08:47.176 [5251] <2> bpbrm spawn_child: /usr/openv/netbackup/bin/bptm bptm -w -c ipndms -den 6 -rt 8 -rn 0 -stunit pnms01-hcart-robot-tld-0 -cl NBU_Catalog_tape -bt 1403960921 -b ipndms_1403960921 -st 2
-cj 2 -p CatalogBackup -p CatalogBackupOffsite -stunit pnms01-hcart-robot-tld-0 -cj 2 -twin_fail_on_error 0 -twin_fail_on_error 0 -reqid -1402274678 -jm -brm -hostname ipndms -L /usr/openv/netbackup/logs/
user_ops/dbext/logs/vxbsa.1403960917.22232.prog.pcb_std -ru root -rclnt ipndms -rclnthostname ipndms -rl 4 -rl 4 -rp 518400 -rp 518400 -sl Vault_CatalogBackup -ct 7 -maxfrag 1048576 -eari 0 -eari 0 -v -med
iasvr pnms01 -no_callback -connect_options 0x01010100 -jobid 62684 -jobgrpid 62681 -masterversion 710000 -bpbrm_shm_id 1862270991 -blks_per_buffer 512
23:08:47.178 [5251] <2> set_job_details: Tfile (62684): LOG 1403960927 4 bpbrm 5251 bptm pid: 5252
23:08:47.178 [5251] <2> bpbrm write_continue_backup: wrote CONTINUE BACKUP on COMM_SOCK <11>
23:08:47.178 [5251] <2> write_file_names: buffering file name '/usr/openv/db/staging/DARS_DATA.db' for output
23:08:47.178 [5251] <2> write_file_names: successfully wrote buffer to COMM_SOCK
23:08:47.178 [5251] <2> bpbrm main: wrote CONTINUE on COMM_SOCK
23:08:47.178 [5251] <2> bpbrm main: closing DATA_SOCK
23:08:47.178 [5251] <2> bpbrm main: closing COMM_SOCK
23:08:47.178 [5251] <2> bpbrm main: ESTIMATE -1 -1 ipndms_1403960921
23:08:48.709 [5251] <2> bpbrm main: ADDED FILES TO DB FOR ipndms_1403960921 1 /opt/openv/db/staging/DARS_DATA.db
23:08:51.902 [5251] <2> bpbrm main: client ipndms EXIT STATUS = 0: the requested operation was successfully completed
23:08:51.902 [5251] <2> set_job_details: Tfile (62684): LOG 1403960931 4 bphdb 22816 done. status: 0
- Roland
06-29-2014 05:22 PM
Details logs:
The "main" vault catalog job:
06/28/2014 22:00:22 - Info nbjm (pid=23557) starting backup job (jobid=62677) for client ipndms, policy NBU_Catalog_tape, schedule Vault_CatalogBackup
06/28/2014 22:00:22 - Info nbjm (pid=23557) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=62677, request id:{C88FD28A-FEBB-11E3-818F-00144FF80172})
06/28/2014 22:00:22 - requesting resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:22 - requesting resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:22 - requesting resource ipndms.NBU_CLIENT.MAXJOBS.ipndms
06/28/2014 22:00:22 - requesting resource ipndms.NBU_POLICY.MAXJOBS.NBU_Catalog_tape
06/28/2014 22:00:23 - granted resource ipndms.NBU_CLIENT.MAXJOBS.ipndms
06/28/2014 22:00:23 - granted resource ipndms.NBU_POLICY.MAXJOBS.NBU_Catalog_tape
06/28/2014 22:00:23 - granted resource NLT004
06/28/2014 22:00:23 - granted resource HP.ULTRIUM4-SCSI.001
06/28/2014 22:00:23 - granted resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:23 - granted resource NRT105
06/28/2014 22:00:23 - granted resource HP.ULTRIUM4-SCSI.002
06/28/2014 22:00:23 - granted resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:24 - estimated 0 kbytes needed
06/28/2014 22:00:24 - estimated 0 kbytes needed
06/28/2014 22:00:24 - Info nbjm (pid=23557) started backup job for client ipndms, policy NBU_Catalog_tape, schedule Vault_CatalogBackup on storage unit pnms01-hcart-robot-tld-0
06/28/2014 22:00:24 - started process bpbrm (pid=15297)
06/28/2014 22:00:25 - connecting
06/28/2014 22:00:28 - connected; connect time: 0:00:00
06/28/2014 22:00:30 - mounting NLT004
06/28/2014 22:01:28 - Error bptm (pid=15298) media manager terminated by parent process
06/28/2014 22:01:28 - end writing
socket write failed (24)
Tape 1 job:
06/28/2014 22:00:22 - requesting resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:22 - requesting resource ipndms.NBU_CLIENT.MAXJOBS.ipndms
06/28/2014 22:00:22 - requesting resource ipndms.NBU_POLICY.MAXJOBS.NBU_Catalog_tape
06/28/2014 22:00:23 - granted resource ipndms.NBU_CLIENT.MAXJOBS.ipndms
06/28/2014 22:00:23 - granted resource ipndms.NBU_POLICY.MAXJOBS.NBU_Catalog_tape
06/28/2014 22:00:23 - granted resource NLT004
06/28/2014 22:00:23 - granted resource HP.ULTRIUM4-SCSI.001
06/28/2014 22:00:23 - granted resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:24 - started process bpbrm (pid=15297)
06/28/2014 22:00:25 - Info bpbrm (pid=15297) ipndms is the host to backup data from
06/28/2014 22:00:25 - Info bpbrm (pid=15297) reading file list from client
06/28/2014 22:00:25 - Info bpbrm (pid=15297) listening for client connection
06/28/2014 22:00:25 - connecting
06/28/2014 22:00:28 - Info bpbrm (pid=15297) INF - Client read timeout = 300
06/28/2014 22:00:28 - Info bpbrm (pid=15297) accepted connection from client
06/28/2014 22:00:28 - Info bpbrm (pid=15297) start bpbkar on client
06/28/2014 22:00:28 - connected; connect time: 0:00:00
06/28/2014 22:00:29 - Info bphdb (pid=12963) Backup started
06/28/2014 22:00:29 - Info bphdb (pid=12912) Backup started
06/28/2014 22:00:29 - Info bpbrm (pid=15297) bptm pid: 15298
06/28/2014 22:00:29 - Info bptm (pid=15298) start
06/28/2014 22:00:29 - Info bptm (pid=15298) using 262144 data buffer size
06/28/2014 22:00:30 - Info bptm (pid=15298) setting receive network buffer to 262144 bytes
06/28/2014 22:00:30 - Info bptm (pid=15298) using 256 data buffers
06/28/2014 22:00:30 - Info bptm (pid=15298) start backup
06/28/2014 22:00:30 - Info bptm (pid=15298) backup child process is pid 15299
06/28/2014 22:00:30 - Info bptm (pid=15298) Waiting for mount of media id NLT004 (copy 1) on server pnms01.
06/28/2014 22:00:30 - mounting NLT004
06/28/2014 22:01:26 - Error bpbrm (pid=15297) from client ipndms: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
06/28/2014 22:01:26 - Info bphdb (pid=12912) done. status: 24
06/28/2014 22:01:26 - Error bptm (pid=15298) media manager terminated during mount of media id NLT004, possible media mount timeout
06/28/2014 22:01:28 - Info bphdb (pid=12912) done. status: 24: socket write failed
06/28/2014 22:01:28 - end writing
socket write failed (24)
Tape 2 job:
06/28/2014 22:00:22 - requesting resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:22 - requesting resource ipndms.NBU_CLIENT.MAXJOBS.ipndms
06/28/2014 22:00:22 - requesting resource ipndms.NBU_POLICY.MAXJOBS.NBU_Catalog_tape
06/28/2014 22:00:23 - granted resource ipndms.NBU_CLIENT.MAXJOBS.ipndms
06/28/2014 22:00:23 - granted resource ipndms.NBU_POLICY.MAXJOBS.NBU_Catalog_tape
06/28/2014 22:00:23 - granted resource NRT105
06/28/2014 22:00:23 - granted resource HP.ULTRIUM4-SCSI.002
06/28/2014 22:00:23 - granted resource pnms01-hcart-robot-tld-0
06/28/2014 22:00:24 - started process bpbrm (pid=15297)
06/28/2014 22:00:25 - connecting
06/28/2014 22:00:28 - connected; connect time: 0:00:00
06/28/2014 22:01:28 - end writing
socket write failed (24)
06-29-2014 10:46 PM
All I can suggest is that you log a support call with Symantec.
"Something" is happening between the master and media server at this point that causes network failure:
22:00:31.027 [15297] <2> bpbrm main: ADDED FILES TO DB FOR ipndms_1403956823 1 /opt/openv/db/staging/DARS_DATA.db
22:01:26.406 [15297] <16> bpbrm main: from client ipndms: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
We see that a successful backup reports success within 3 seconds:
23:08:48.709 [5251] <2> bpbrm main: ADDED FILES TO DB FOR ipndms_1403960921 1 /opt/openv/db/staging/DARS_DATA.db
23:08:51.902 [5251] <2> bpbrm main: client ipndms EXIT STATUS = 0: the requested operation was successfully completed
So, unless there is a real network failure between master and media server, all processes will need to be logged and checked.
I am not sure about all of them - Support will be able to tell you which logs are needed with increased logging levels.
07-10-2014 03:50 PM
I have a case with support and we are testing a few things.
No results as of yet but I will post any solutions/answers when I get them.
07-14-2014 06:45 AM
Hopefully your Support engineer has asked for bpbrm on the media server and bpbkar on the master?
The 'broken pipe' seems to indicate a connection issue between these 2 processes.
09-02-2014 02:33 PM
Hi,
This post got a bit sidelined...
Yes I think I solved it, think...
There was a stray sticker that had went off a tape inside the library.
It was discovered when a tape was ejected without a sticker!
After pulling the library apart and removed the sticker the problem dissappeared.
For what ever reason the sticker must have been in a very odd spot to cause such strange behaviour.
And why just the Vault?
Maybe vault does a scan of all the tapes and thus hit the sticker "bug" all the time?
One could only speculate!
- Roland