Drive down Issue

Hernan_Peralta · ‎08-21-2016

OCCASIONALLY, WHEN BACKUPS ARE LAUNCHED OR ERRORS OCCUR, THE TAPE DRIVE GOES DOWN AND HAS TO BE BROUGHT UP MANUALLY IN THE GUI. I NEED THIS TO HAPPEN AUTOMATICALLY.

IN ADDITION, THIS HAPPENS WHEN A PENDING REQUEST IS SHOWN.

HELP PLEASE

mph999 · ‎08-21-2016

NetBackup will never 'UP' a drive automatically, and this feature will not be possible in the future.

If a drive has an issue, it will be marked down. This is a safety feature. It prevents the drive from being used and going down partway through a backup, which would cause a failed job. More importantly, if the drive has some mechanical issue, it could damage tapes (I have seen this happen many times over the years). If the drive were automatically UP'ed, it could work its way through many tapes, causing damage and data loss.

If you are seeing PEND issues, you have a SCSI reservation issue.

1. If drives are shared between media servers and NDMP hosts (e.g., a NetApp filer or similar), be sure that the SCSI reservation type set in NBU (either SPC-2 or Persistent) is set the same on the filers. In fact, for a drive that is seen by more than one device, every device with access to that drive must use the same reservation type.

2. Make sure only NetBackup servers (and/or filers) can see the drives. They should never be zoned in to other servers.

3. Make sure no 'monitoring' software is affecting the drives, e.g., HP OpenView or similar. This can send SCSI commands to the drives; NBU knows nothing about the software, and hence it can cause issues.

PEND means that when NBU goes to reserve a drive, it finds that something else, external to NBU, has made a SCSI reservation. Apart from the SCSI reservation type setting not being consistent across all devices, I have never seen NBU be the cause of a PEND issue.

Marianne · ‎08-21-2016

I agree with Martin.

You need to troubleshoot and find the reason for the drive being DOWN'ed.

Only when you have fixed the problem should the drive be manually UP'ed.

Add a VERBOSE entry to vm.conf on all media servers and restart the NBU Device Management service (ltid) so that device-related issues are logged to the OS syslog.
Create a bptm log folder on all media servers for additional troubleshooting.
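
The two steps above could be scripted roughly as follows. This is only a sketch: the function names enable_vm_verbose and make_bptm_dir are made up for illustration, and the default paths assume a standard UNIX/Linux install under /usr/openv.

```shell
#!/bin/sh
# Sketch only: helper names and default paths are assumptions for a
# standard UNIX/Linux NetBackup install; adjust for your environment.

# Add the single word VERBOSE to vm.conf unless it is already there.
enable_vm_verbose() {
    vmconf="${1:-/usr/openv/volmgr/vm.conf}"
    touch "$vmconf" || return 1
    if ! grep -qx 'VERBOSE' "$vmconf"; then
        # Make sure the file ends with a newline before appending.
        if [ -s "$vmconf" ] && [ -n "$(tail -c 1 "$vmconf")" ]; then
            echo >> "$vmconf"
        fi
        echo 'VERBOSE' >> "$vmconf"
    fi
}

# Create the bptm legacy log directory under the NetBackup logs folder.
make_bptm_dir() {
    logroot="${1:-/usr/openv/netbackup/logs}"
    mkdir -p "$logroot/bptm"
}

# Typical use on each media server, as root:
#   enable_vm_verbose && make_bptm_dir
# then restart ltid while no jobs are running.
```

Running it twice is harmless: the VERBOSE line is only added once and mkdir -p does nothing if the folder already exists.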

PS:

Please release the CAPS button on your keyboard.
When everything is typed in UPPERCASE you are yelling at us....


Hernan_Peralta · ‎08-22-2016

/usr/openv/netbackup/logs/bptm

11:12:32.890 [37716] <2> bptm: INITIATING (VERBOSE = 0): -verify -c exadatanbclient -b exadatanbclient_1471828081 -hostname srvbackup -L /usr/openv/netbackup/logs/user_ops/root/logs/jbpVerify-20160822111226.log -ru root -rclnt srvbackup -rclnthostname srvbackup -everything -nonrsvdports -connect_options 0x01020001 -jobid 168405
11:12:32.892 [37716] <2> job_connect: SO_KEEPALIVE set on socket 1 for client srvbackup
11:12:32.892 [37716] <2> logconnections: BPJOBD CONNECT FROM 192.168.80.141.60977 TO 192.168.80.141.13723 fd = 1
11:12:32.892 [37716] <2> job_authenticate_connection: ignoring VxSS authentication check for now...
11:12:32.892 [37716] <2> job_connect: Connected to the host srvbackup contype 53 jobid <168405> socket <1>
11:12:32.892 [37716] <2> job_connect: Connected on port 60977
11:12:32.892 [37716] <4> bptm: emmserver_name = srvbackup
11:12:32.892 [37716] <4> bptm: emmserver_port = 1556
11:12:32.898 [37716] <2> Orb::init: Enabling ORBNativeCharCodeSet UTF-8(Orb.cpp:713)
11:12:32.898 [37716] <2> Orb::init: initializing ORB EMMlib_Orb with: bptm -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static Resource_Factory '-ORBNativeCharCodeSet UTF-8'" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb EMMlib_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(Orb.cpp:916)
11:12:32.899 [37716] <2> Orb::init: caching EndpointSelectorFactory(Orb.cpp:930)
11:12:32.899 [37716] <2> Orb::setOrbConnectTimeout: timeout seconds: 60(Orb.cpp:1562)
11:12:32.899 [37716] <2> Orb::setOrbRequestTimeout: timeout seconds: 1800(Orb.cpp:1571)
11:12:32.903 [37716] <4> report_client: VBRC 2 37716 0 exadatanbclient_1471828081 -1 *NULL* -1 *NULL* 0 1 1
11:12:32.903 [37716] <2> ConnectionCache::connectAndCache: Acquiring new connection for host srvbackup, query type 81
11:12:32.903 [37716] <2> logconnections: BPDBM CONNECT FROM 192.168.80.141.45481 TO 192.168.80.141.13721 fd = 12
11:12:32.937 [37716] <2> db_end: Need to collect reply
11:12:32.946 [37716] <2> read_backup: media id Q161L6, copy 1, fragment 1 (365088 Kbytes) being considered for verify
11:12:32.946 [37716] <2> io_init: CINDEX 0, sched Kbytes for monitoring = 400000
11:12:32.946 [37716] <2> read_legacy_touch_file: Found /usr/openv/netbackup/NET_BUFFER_SZ; requested from (tmcommon.c.3857).
11:12:32.946 [37716] <2> read_legacy_touch_file: 262144 read ; requested from (tmcommon.c.3857).
11:12:32.946 [37716] <2> io_set_sendbuf: setting send network buffer to 262144 bytes
11:12:32.946 [37716] <2> read_legacy_touch_file: Found /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS; requested from (tmcommon.c.3525).
11:12:32.946 [37716] <2> read_legacy_touch_file: 256 read ; requested from (tmcommon.c.3525).
11:12:32.946 [37716] <2> io_init: using 256 data buffers
11:12:32.946 [37716] <2> io_init: buffer size for read is 262144
11:12:32.946 [37716] <2> io_init: child delay = 10, parent delay = 15 (milliseconds)
11:12:32.947 [37716] <2> create_shared_memory: shm_size = 67115064, buffer address = 0x0x7fbfa6b41000, buf control = 0x0x7fbfaab41000, ready ptr = 0x0x7fbfaab42800, res_cntl = 0x0x7fbfaab42808
11:12:32.947 [37716] <2> read_backup: verify child process is pid 37718
11:12:32.947 [37716] <2> read_client: dname=., offline=0, online_at=0 offline_at=0
11:12:32.947 [37716] <2> read_client: dname=.., offline=0, online_at=0 offline_at=0
11:12:32.947 [37716] <2> read_client: dname=CO_0, offline=0, online_at=0 offline_at=0
11:12:32.947 [37716] <2> read_client: dname=OA_0, offline=0, online_at=0 offline_at=0
11:12:32.947 [37716] <2> read_client: dname=host_info, offline=0, online_at=0 offline_at=0
11:12:32.948 [37716] <2> db_freeEXDB_INFO: ?
11:12:32.948 [37716] <2> logconnections: BPCD CONNECT FROM 192.168.80.141.43105 TO 192.168.80.141.13782 fd = 0
11:12:32.950 [37716] <2> bpcr_get_version_rqst: bpcd version: 07610001
11:12:32.982 [37716] <2> media_id_to_monitor: job_id = 168405, pSrcMediaId = Q161L6
11:12:32.982 [37716] <2> nbjm_media_request: Passing job control to NBJM, type READ/12
11:12:32.982 [37716] <2> nbjm_media_request: old_media_id = , media_id = Q161L6
11:12:32.982 [37716] <2> Orb::init: Enabling ORBNativeCharCodeSet UTF-8(Orb.cpp:713)
11:12:32.982 [37716] <2> Orb::init: Created anon service name: NB_37716_548304112(Orb.cpp:795)
11:12:32.982 [37716] <2> Orb::init: endpointvalue is : pbxiop://1556:NB_37716_548304112(Orb.cpp:805)
11:12:32.982 [37716] <2> Orb::init: initializing ORB Default_DAEMON_Orb with: Unknown -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static Resource_Factory '-ORBNativeCharCodeSet UTF-8'" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb Default_DAEMON_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBEndpoint pbxiop://1556:NB_37716_548304112 -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(Orb.cpp:916)
11:12:32.982 [37716] <2> Orb::init: caching EndpointSelectorFactory(Orb.cpp:930)
11:12:33.026 [37716] <8> copy_addrinfo: [vnet_addrinfo.c:3599] no valid addresses to copy 0 0x0
11:12:33.026 [37716] <8> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1651] copy_addrinfo() failed NAME=fe80::8edc:d4ff:fea9:bfd9%5 STAT=6
11:12:33.026 [37716] <8> vnet_cached_getaddrinfo: [vnet_addrinfo.c:1267] vnet_cached_getaddrinfo_and_update() failed 6 0x6
11:12:33.027 [37716] <8> copy_addrinfo: [vnet_addrinfo.c:3599] no valid addresses to copy 0 0x0
11:12:33.027 [37716] <8> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1651] copy_addrinfo() failed NAME=fe80::8edc:d4ff:fea9:bfd8%3 STAT=6
11:12:33.028 [37716] <8> vnet_cached_getaddrinfo: [vnet_addrinfo.c:1267] vnet_cached_getaddrinfo_and_update() failed 6 0x6
11:12:33.029 [37716] <8> copy_addrinfo: [vnet_addrinfo.c:3599] no valid addresses to copy 0 0x0
11:12:33.029 [37716] <8> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1651] copy_addrinfo() failed NAME=fe80::1658:d0ff:fe5f:f6f3%7 STAT=6
11:12:33.029 [37716] <8> vnet_cached_getaddrinfo: [vnet_addrinfo.c:1267] vnet_cached_getaddrinfo_and_update() failed 6 0x6
11:12:33.031 [37716] <8> copy_addrinfo: [vnet_addrinfo.c:3599] no valid addresses to copy 0 0x0
11:12:33.031 [37716] <8> vnet_cached_getaddrinfo_and_update: [vnet_addrinfo.c:1651] copy_addrinfo() failed NAME=fe80::1658:d0ff:fe5f:f6f1%4 STAT=6
11:12:33.031 [37716] <8> vnet_cached_getaddrinfo: [vnet_addrinfo.c:1267] vnet_cached_getaddrinfo_and_update() failed 6 0x6
11:16:44.194 [38002] <2> bptm: INITIATING (VERBOSE = 0): -delete_all_expired
11:16:44.194 [38002] <4> bptm: emmserver_name = srvbackup
11:16:44.194 [38002] <4> bptm: emmserver_port = 1556
11:16:44.199 [38002] <2> Orb::init: Enabling ORBNativeCharCodeSet UTF-8(Orb.cpp:713)
11:16:44.200 [38002] <2> Orb::init: initializing ORB EMMlib_Orb with: bptm -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static Resource_Factory '-ORBNativeCharCodeSet UTF-8'" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb EMMlib_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(Orb.cpp:916)
11:16:44.201 [38002] <2> Orb::init: caching EndpointSelectorFactory(Orb.cpp:930)
11:16:44.201 [38002] <2> Orb::setOrbConnectTimeout: timeout seconds: 60(Orb.cpp:1562)
11:16:44.201 [38002] <2> Orb::setOrbRequestTimeout: timeout seconds: 1800(Orb.cpp:1571)
11:16:44.234 [38002] <2> bptm: EXITING with status 0 <----------
11:17:28.968 [38017] <2> bptm: INITIATING (VERBOSE = 0): -rptdrv -jobid -1468615423 -jm
11:17:28.968 [38017] <2> main: Sending [EXIT STATUS 0] to NBJM
11:17:28.968 [38017] <2> bptm: EXITING with status 0 <----------

Marianne · ‎08-22-2016

There are no errors in this bptm log. It shows a successful 'verify' job.


Hernan_Peralta · ‎08-23-2016

The drives constantly go down

/var/log/messages

Aug 21 20:05:23 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.003 (device 3)
Aug 21 20:12:07 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.003 (device 3)
Aug 21 20:13:01 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 21 20:17:10 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.003 (device 3)
Aug 21 20:17:11 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 21 20:24:20 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 21 20:31:25 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 21 21:42:14 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 22 06:53:34 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 22 07:11:51 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 22 08:40:31 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 22 09:06:56 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 22 10:09:00 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 22 10:37:09 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 22 10:47:33 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 22 11:31:33 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.003 (device 3)
Aug 22 14:37:15 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.003 (device 3)
Aug 22 15:22:55 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 22 16:16:07 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 22 16:46:53 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 09:35:25 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 09:39:21 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 09:40:40 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 09:51:37 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 10:14:54 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 10:24:48 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 10:40:51 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 23 12:04:30 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.000 (device 0)
Aug 23 12:04:53 srvbackup ltid[9733]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.002 (device 2)
Aug 23 12:37:09 srvbackup ltid[9733]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.000 (device 0)

Marianne · ‎08-23-2016

Have you created the bptm log folder?

Probably more than 3 errors in 12 hours? Each subsequent error simply adds to the error count within that window and DOWNs the drive again.
I am curious to see the bptm log.
Please copy it to bptm.txt and upload the file.

Did you copy selectively from the messages file, or are these the only entries in the file?
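
To see at a glance which drives are being DOWN'ed most often, something like the sketch below could be run against the messages file. It assumes the ltid line format posted above ("... has DOWN'ed drive NAME (device N)"); the function name count_drive_downs is hypothetical, and the default path /var/log/messages may differ on your OS.

```shell
#!/bin/sh
# Sketch: count DOWN events per drive in a syslog file.
# Assumes ltid lines of the form
#   "... Operator/EMM server has DOWN'ed drive <name> (device <n>)"
count_drive_downs() {
    # "DOWN.ed" matches DOWN'ed without embedding a quote in the awk script.
    awk '/has DOWN.ed drive/ {
             for (i = 1; i < NF; i++)
                 if ($i == "drive") downs[$(i + 1)]++
         }
         END { for (d in downs) print downs[d], d }' "${1:-/var/log/messages}" |
        sort -k2
}

# Example:
#   count_drive_downs /var/log/messages
```

This only shows how often each drive went down, not why. The reason still has to come from the bptm log and the VERBOSE syslog entries.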


Hernan_Peralta · ‎08-23-2016

How do I create the bptm log folder? Please help, so that I can collect this file.

mph999 · ‎08-23-2016

You really need to add VERBOSE = 5 to /usr/openv/netbackup/bp.conf on the media server as well. Logs at verbose 0 are generally pretty useless.

I would also create this empty file (command given), again on the media server:

touch /usr/openv/volmgr/DRIVE_DEBUG

Add the single word VERBOSE to /usr/openv/volmgr/vm.conf.

This will increase the logging to the operating system messages log (/var/adm/messages or similar, depending on the exact OS).

We might need the tpcommand log (/usr/openv/volmgr/debug/tpcommand), but I'd leave this for the moment. After creating the empty DRIVE_DEBUG file and editing vm.conf, you will need to restart the services on the media server to pick up the change.

With no jobs running on the media server:

/usr/openv/volmgr/bin/stopltid

/usr/openv/volmgr/bin/ltid -v (to restart)

Once drive(s) start to go down, collect /usr/openv/netbackup/logs/bptm log that covers the time period of the issue, and system messages file.

If you have a SCSI reservation issue, which from your description of PEND it sounds like you do, you need to look at my previous advice. Generally, you cannot solve drive PEND issues by looking at NBU logs, because drive PEND issues are not caused by NetBackup (with the possible exception of checking that the SCSI reservation type used is the same across all 'devices' that share a given tape drive).
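
For collecting only the part of the bptm log that covers the time a drive went down, a rough sketch like this may help. It assumes every entry starts with an HH:MM:SS.mmm timestamp, as in the bptm log posted earlier in this thread, and that the log file holds a single day (no wrap past midnight). The function name bptm_window is made up, and the log file name is passed in rather than guessed.

```shell
#!/bin/sh
# Sketch: print bptm log entries between two times of day.
# Usage: bptm_window <logfile> <from HH:MM:SS> <to HH:MM:SS>
# Assumes each entry begins with HH:MM:SS.mmm and the file covers one day.
bptm_window() {
    awk -v from="$2" -v to="$3" '
        /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
            t = substr($0, 1, 8)          # HH:MM:SS part of the timestamp
            if (t >= from && t <= to) print
        }' "$1"
}

# Example (log file name is a placeholder, use the file for the day in question):
#   bptm_window /usr/openv/netbackup/logs/bptm/<logfile> 20:10:00 20:15:00 > bptm.txt
```

Take the time window from the DOWN'ed entries in the system messages file, with a few minutes added on either side.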

Hernan_Peralta · ‎08-23-2016

After activating VERBOSE, must the NetBackup services be restarted?

Marianne · ‎08-23-2016

Create a bptm directory under /usr/openv/netbackup/logs.
If you have more media servers, create the folder on each media server.


mph999 · ‎08-24-2016

Yes, you need to restart the media manager services:

/usr/openv/volmgr/bin/stopltid

/usr/openv/volmgr/bin/ltid -v (to restart)

No restart is required for bptm log, as this picks up the new VERBOSE = 5 setting when a new job runs.

Just to confirm, in vm.conf you only put VERBOSE, there is no number as it is simply on or off.
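
For the bp.conf side, a minimal sketch of setting the level without ending up with duplicate VERBOSE lines might look like this. The function name set_bp_verbose is hypothetical and the default path assumes a standard UNIX install; take a copy of bp.conf first.

```shell
#!/bin/sh
# Sketch: set "VERBOSE = <level>" in bp.conf, replacing any existing entry.
# Usage: set_bp_verbose [level] [path-to-bp.conf]
set_bp_verbose() {
    bp_level="${1:-5}"
    bp_conf="${2:-/usr/openv/netbackup/bp.conf}"
    bp_tmp="$bp_conf.tmp.$$"
    [ -f "$bp_conf" ] || return 1
    # Keep every line except an existing VERBOSE entry, then add the new one.
    grep -v '^VERBOSE[[:space:]]*=' "$bp_conf" > "$bp_tmp"
    echo "VERBOSE = $bp_level" >> "$bp_tmp"
    # cp (not mv) so the original file keeps its owner and permissions.
    cp "$bp_tmp" "$bp_conf" && rm -f "$bp_tmp"
}

# Example, on the media server as root:
#   set_bp_verbose 5
#   touch /usr/openv/volmgr/DRIVE_DEBUG
# No restart is needed for bp.conf; the ltid restart is only for
# DRIVE_DEBUG and vm.conf.
```

Once troubleshooting is finished, the same function can put the level back, for example set_bp_verbose 0.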

Genericus · ‎08-30-2016

You may want to include some information on your environment: what OS is the master/media server, and what kind of drive/robot?

For example, I have a current issue with my SL8500 robot: when I replace a drive, the process does not always "stick", and the drives will sporadically go down afterwards.

On the media server, "tpautoconf -report_disc" will report discrepancies with a serialized robot. Mine will be fine for a few hours, then the serial number reverts and the drive goes DOWN. I can use the command "tpautoconf -replace_drive -path " to fix it for an hour or so.

I am only able to resolve it by rescanning for drives, which requires a stop/start of the media server and can be disruptive.

Basic process:

1. verify drives at robot
2. verify drives at OS
3. verify drives at NB

Please verify 1 & 2!
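
The host-side part of that process could be sketched as below. Treat it as an assumption-laden outline: run_if_present and verify_drives are made-up helper names, the path assumes a standard UNIX install, and using the NetBackup "scan -tape" utility as the OS-level view is my choice here (native OS tools work too). Step 1 has to be done on the robot itself.

```shell
#!/bin/sh
# Sketch: run the host-side drive checks in order.
# Helper names are hypothetical; paths assume /usr/openv/volmgr/bin.

# Run a command only if its binary exists, otherwise say it was skipped.
run_if_present() {
    if [ -x "$1" ]; then
        echo "== $*"
        "$@"
    else
        echo "skipped (not found): $1"
    fi
}

verify_drives() {
    volmgr="${1:-/usr/openv/volmgr/bin}"
    # 1. Robot: check drives and serial numbers on the library itself.
    # 2. OS: tape devices the host can actually see.
    run_if_present "$volmgr/scan" -tape
    # 3. NetBackup: configured drives, then discrepancies vs. discovery.
    run_if_present "$volmgr/tpconfig" -d
    run_if_present "$volmgr/tpautoconf" -report_disc
}

# Example, on the media server as root:
#   verify_drives
```

Compare the serial numbers from step 2 with what the robot reports for each drive bay before trusting the NetBackup configuration in step 3.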

NetBackup 9.1.0.1 on Solaris 11, writing to Data Domain 9800 7.7.4.0
duplicating via SLP to LTO5 & LTO8 in SL8500 via ACSLS
