tpautoconf -report_disc output is unchanged after a successful tpautoconf -replace_drive

liuyl
Level 6

After a faulty tape drive (TD) was replaced successfully via tpautoconf on the Media Server, why does -report_disc still output the previous mismatch record?
Note: On the corresponding Master Server, tpautoconf -report_disc returns empty output, and the drive also works fine on the Media Server!

#
#
# tpautoconf -report_disc
======================= Missing Device (Drive) =======================
Drive Name = HP.ULTRIUM5-SCSI.002
Drive Path = /dev/rmt6.1
Inquiry = "HP Ultrium 5-SCSI I6RZ"
Serial Number = HU1248TF0U
TLD(3) definition Drive = 10
Hosts configured for this device:
Host = JCERPDB1
======================= New Device (Drive) =======================
Inquiry = "HP Ultrium 5-SCSI I5GZ"
Serial Number = HU1213MTPL
Drive Path = /dev/rmt6.1
#
#
#
# tpconfig -l|tail -28
drive - 11 hcart3 6 UP - IBM.ULT3580-TD3.001 /dev/rmt19.1
drive - 12 hcart3 3 UP - IBM.ULT3580-TD3.006 /dev/rmt2.1
drive - 13 hcart3 8 UP - IBM.ULT3580-TD3.000 /dev/rmt20.1
drive - 14 hcart3 5 UP - IBM.ULT3580-TD3.005 /dev/rmt3.1
drive - 15 hcart3 7 UP - IBM.ULT3580-TD3.004 /dev/rmt4.1
robot 2 - TLD - - - - jchxbak
drive - 21 hcart3 1 UP - HP.ULTRIUM6-SCSI.001 /dev/rmt21.1
drive - 22 hcart3 2 UP - HP.ULTRIUM6-SCSI.007 /dev/rmt22.1
drive - 23 hcart3 3 UP - HP.ULTRIUM6-SCSI.003 /dev/rmt23.1
drive - 24 hcart3 4 UP - HP.ULTRIUM6-SCSI.005 /dev/rmt24.1
drive - 25 hcart3 5 UP - HP.ULTRIUM6-SCSI.002 /dev/rmt25.1
drive - 26 hcart3 6 UP - HP.ULTRIUM6-SCSI.004 /dev/rmt26.1
drive - 27 hcart3 7 UP - HP.ULTRIUM6-SCSI.000 /dev/rmt27.1
drive - 28 hcart3 8 UP - HP.ULTRIUM6-SCSI.006 /dev/rmt28.1
robot 3 - TLD - - - - jchxbak
drive - 2 hcart2 2 UP - HP.ULTRIUM5-SCSI.009 /dev/rmt10.1
drive - 3 hcart2 3 UP - HP.ULTRIUM5-SCSI.010 /dev/rmt11.1
drive - 4 hcart2 4 UP - HP.ULTRIUM5-SCSI.011 /dev/rmt12.1
drive - 5 hcart2 5 UP - HP.ULTRIUM5-SCSI.006 /dev/rmt13.1
drive - 6 hcart2 6 UP - HP.ULTRIUM5-SCSI.007 /dev/rmt14.1
drive - 7 hcart2 7 UP - HP.ULTRIUM5-SCSI.005 /dev/rmt15.1
drive - 8 hcart2 8 UP - HP.ULTRIUM5-SCSI.004 /dev/rmt16.1
drive - 16 hcart2 9 UP - HP.ULTRIUM5-SCSI.003 /dev/rmt5.1
drive - 17 hcart2 10 UP - HP.ULTRIUM5-SCSI.002 /dev/rmt6.1
drive - 18 hcart2 11 UP - HP.ULTRIUM5-SCSI.001 /dev/rmt7.1
drive - 19 hcart2 12 UP - HP.ULTRIUM5-SCSI.000 /dev/rmt8.1
drive - 20 hcart2 1 UP - HP.ULTRIUM5-SCSI.008 /dev/rmt9.1
drive - 0 pcd - DISABL - IBM.DDSGEN6.000 /dev/rmt0.1
#
#
#
# tpautoconf -replace_drive HP.ULTRIUM5-SCSI.002 -path /dev/rmt6.1
Found a matching device in global DB, HP.ULTRIUM5-SCSI.002 on host JCERPDB1
#
#
#
# tpautoconf -report_disc|grep -Ei "124|1213"
Serial Number = HU1248TF0U
Serial Number = HU1213MTPL
#
#
#
# stopltid
#
#
#
# ltid -v
#
#
#
# tpautoconf -report_disc|grep -Ei "124|1213"
Serial Number = HU1248TF0U
Serial Number = HU1213MTPL
#
#
#
# netbackup stop
stopping the NetBackup Service Monitor
stopping the NetBackup Service Layer
stopping the NetBackup Remote Monitoring Management System
stopping the NetBackup compatibility daemon
stopping the Media Manager device daemon
stopping the Media Manager volume daemon
stopping the NetBackup client daemon
stopping the NetBackup network daemon
#
#
#
# bpps -a
NB Processes
------------


MM Processes
------------
#
#
#
# netbackup start
NetBackup network daemon started.
NetBackup client daemon started.
NetBackup SAN Client Fibre Transport daemon started.
NetBackup Database Server started.
NetBackup Event Manager started.
NetBackup Audit Manager started.
NetBackup Enterprise Media Manager started.
NetBackup Resource Broker started.
Media Manager daemons started.
NetBackup request daemon started.
NetBackup compatibility daemon started.
NetBackup Job Manager started.
NetBackup Policy Execution Manager started.
NetBackup Storage Lifecycle Manager started.
NetBackup Remote Monitoring Management System started.
NetBackup Key Management daemon started.
NetBackup Service Layer started.
NetBackup Agent Request Server started.
NetBackup Bare Metal Restore daemon not started.
NetBackup Vault daemon started.
NetBackup Service Monitor started.
NetBackup Bare Metal Restore Boot Server daemon started.
#
#
#
# bpps -a
NB Processes
------------
root 17301572 1 0 15:22:04 - 0:00 /usr/openv/netbackup/bin/vnetd -standalone
root 4194620 1 0 15:22:08 - 0:00 /usr/openv/netbackup/bin/nbsvcmon
root 4784508 1 0 15:22:06 - 0:00 /usr/openv/netbackup/bin/nbrmms
root 7537248 1 0 15:22:04 - 0:00 /usr/openv/netbackup/bin/bpcd -standalone
root 4522834 1 0 15:22:08 - 0:00 /usr/openv/netbackup/bin/bmrbd
root 5178226 1 0 15:22:06 - 0:00 /usr/openv/netbackup/bin/bpcompatd
root 7930840 1 0 15:22:07 - 0:00 /usr/openv/netbackup/bin/nbsl


MM Processes
------------
root 21692480 3212062 0 15:22:09 - 0:00 tldd -v
root 6226440 1 0 15:22:06 - 0:00 vmd -v
root 9568958 3212062 0 15:22:11 - 0:00 avrd -v
root 3212062 1 0 15:22:06 - 0:00 /usr/openv/volmgr/bin/ltid
#
#
#
# tpautoconf -report_disc|grep -Ei "124|1213"
Serial Number = HU1248TF0U
Serial Number = HU1213MTPL
#
#
#

1 ACCEPTED SOLUTION


mph999
Level 6
Employee Accredited

I do not know the answer to that - I suspect if you dug through NBDB you would perhaps find something amiss with regard to which machines are mapped to the drive.

I have never seen this not work before, as I mentioned.  It's a simple concept that has been working for years, and from what I saw, it appears that the new drive may have been added before the replace_drive was run and it has got itself all upset.

The only way to fix this, well two ways ...

Delete the missing drive and possibly the new drive from the config and re-add - this should clear the missing device from the output.

Manual SQL commands to remove it from NBDB - but this is very last resort and would only be used if the method above fails.

I don't think there is much else I can add to this, because ultimately the fix will be as I mention, and I am confident that in the future if a drive is swapped - running only tpautoconf -replace_drive and not adding the drive, will be successful.


31 REPLIES

VirgilDobos
Moderator
Partner    VIP    Accredited Certified

Have you tried restarting the ltid service or the Media Server? Oftentimes this solves the problem.

--Virgil

mph999
Level 6
Employee Accredited

 

From report_disc

======================= Missing Device (Drive) =======================
Drive Name = HP.ULTRIUM5-SCSI.002
Drive Path = /dev/rmt6.1
Inquiry = "HP Ultrium 5-SCSI I6RZ"
Serial Number = HU1248TF0U
TLD(3) definition Drive = 10
Hosts configured for this device:
Host = JCERPDB1
======================= New Device (Drive) =======================
Inquiry = "HP Ultrium 5-SCSI I5GZ"
Serial Number = HU1213MTPL
Drive Path = /dev/rmt6.1


If the new drive shown is the replacement for the missing drive (I presume it is, but this may not be the case) ... then run ...

tpautoconf -replace_drive HP.ULTRIUM5-SCSI.002 -path /dev/rmt6.1

The issue happens because NBU does not automatically detect swapped drives; it needs to be told that <drivename> has been replaced by the drive at <new path> - in this case, the path is the same.

I think you need to restart ltid afterwards: stopltid, then ltid -v

As you can see from my post above, I had already done so (plus restarting all the NBU services)!
But the mismatch result is still the same!

Marianne
Moderator
Partner    VIP    Accredited Certified

Always best to delete the drive that was removed, restart ltid,  then run device config, followed by restart of ltid. 

If the drive is shared, you need to delete the drive on all media servers. Best to do this on the master server. 

This made me wonder if the drive is shared: 

tpautoconf -replace_drive HP.ULTRIUM5-SCSI.002 -path /dev/rmt6.1
Found a matching device in global DB, HP.ULTRIUM5-SCSI.002 on host JCERPDB1

 

That is my question too!
It is indeed an SSO tape drive, so why is there only a single host entry for it in my tpautoconf output?

https://www.veritas.com/support/en_US/article.000027601
http://symcnbu.blogspot.com/2010/04/updating-replaced-tape-drive-in-nbu.html

Genericus
Moderator
   VIP   

The tpautoconf -replace_drive is supposed to swap the new path on the fly, I have the same issue with my SSO drives sometimes.

I have to stop netbackup - it is vitally important the drive is clear and no reservations remain - especially in shared drives.

I use "nbrbutil -dump" and grep for the drive name to make sure there is nothing internal to NetBackup manipulating the drive - reservations and unload commands can remain hidden there!

Once all is clear, you can run the replace drive command - I like to also do the tpautoconf -a and recycle netbackup once the drive is online.

I have replaced the drive using just the replace_drive command, and verified the serial number in NetBackup is changed, and 5 - 10 minutes later it reverts to the old one! This causes drive issues due to the serialization, so unmount commands are executed and NetBackup thinks it has emptied the drive but the path is bad, so the tape stays in the drive - tapes get frozen because they fail to load. 

NetBackup 9.1.0.1 on Solaris 11, writing to Data Domain 9800 7.7.4.0
duplicating via SLP to LTO5 & LTO8 in SL8500 via ACSLS

Genericus
Moderator
   VIP   

Now, since this non-intrusive command is not consistent, I am unable to use it as a non-intrusive command.

Whenever I have to swap a drive and rescan it, I am forced to do a complete NetBackup shutdown.

The good news is that if you totally stop netbackup and rescan the drives and restart the media servers, the drive rescan usually works.

The bad news - it can look like it is replaced, then revert back - usually in about 10 minutes.

 

 


Genericus
Moderator
   VIP   

mph999 - the process you describe is exactly correct - it is what is supposed to happen.

However - it does not always work


Genericus
Moderator
   VIP   

liuyl - I have the same issue. I did not have the time or inclination to solve the issue for Netbackup, so I found a workaround.

My support thought it was caused by the SSO - somebody has cached information about the drive and overwrites your drive replacement command.


Genericus
Moderator
   VIP   

From the link in a prior post, I noticed a few KEY points

down the drive - this ensures nothing is using it - I would add the steps using "nbrbutil -dump | grep -i drive" from my post.

Run the tpautoconf -replace_drive and tpautoconf -a from the robot control host! I think this is where I may have gone wrong!

 

1 Down the drive. In the Device Monitor, select the drive to swap or update. From the Actions menu, select Down Drive. 

2 Replace the drive or physically update the firmware for the drive. If you replace the drive, specify the same SCSI ID for the new drive as the old drive. 

3 To produce a list of new and missing hardware, run tpautoconf -report_disc on one of the reconfigured servers. This command scans for new hardware and produces a report that shows the new and the replaced hardware. 

4 Ensure that all servers that share the new hardware are up and that all NetBackup services are active. 

5 Run tpautoconf with the -replace_drive drive_name -path path_name options or the -replace_robot robot_number -path robot_path options. The tpautoconf command reads the serial number from the new hardware device and then updates the EMM database. 

6 If the new device is an unserialized drive, run the device configuration wizard on all servers that share the drive. If the new device is a robot, run the device configuration wizard on the server that is the robot control host.

7 Up the drive. In the Device Monitor, select the new drive. From the Actions menu, select Up Drive. 
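
Steps 3 and 5 above can be sketched in a few lines of Python. This is only a rough illustration, not a Veritas tool: it assumes the -report_disc layout matches the report quoted in the first post, and it pairs a missing drive with a new one purely because they share the same Drive Path, which may not hold in every swap. The function names are made up for this example.

```python
import re
import shlex

# Report text copied from the first post in this thread.
SAMPLE_REPORT = """\
======================= Missing Device (Drive) =======================
Drive Name = HP.ULTRIUM5-SCSI.002
Drive Path = /dev/rmt6.1
Inquiry = "HP Ultrium 5-SCSI I6RZ"
Serial Number = HU1248TF0U
TLD(3) definition Drive = 10
Hosts configured for this device:
Host = JCERPDB1
======================= New Device (Drive) =======================
Inquiry = "HP Ultrium 5-SCSI I5GZ"
Serial Number = HU1213MTPL
Drive Path = /dev/rmt6.1
"""

SECTION_RE = re.compile(r"^=+\s*(Missing|New) Device \(Drive\)\s*=+$")


def parse_report(text):
    """Return one dict per Missing/New section, keyed by the field labels."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            current = {"kind": header.group(1)}
            sections.append(current)
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip().strip('"')
    return sections


def suggest_commands(sections):
    """Build a -replace_drive command for each missing drive whose path
    also shows up in a New Device section (an assumption, see above)."""
    new_paths = {s["Drive Path"] for s in sections
                 if s["kind"] == "New" and "Drive Path" in s}
    commands = []
    for s in sections:
        if s["kind"] != "Missing" or "Drive Name" not in s:
            continue
        path = s.get("Drive Path")
        if path in new_paths:
            commands.append(shlex.join(
                ["tpautoconf", "-replace_drive", s["Drive Name"], "-path", path]))
    return commands


sections = parse_report(SAMPLE_REPORT)
commands = suggest_commands(sections)
for command in commands:
    print(command)
```

The script only prints the command; whether the new drive really is the replacement for the missing one still has to be confirmed by hand before running it.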


Marianne
Moderator
Partner    VIP    Accredited Certified

In fact, it is very hard to avoid this problem, even though we followed the above conditions and steps!
It seems that the replace_drive option can only find the local host's own mismatched S/N records, and it also does not update the record with the new S/N at all!

Marianne
Moderator
Partner    VIP    Accredited Certified

You have 2 choices:

  1. Log a call with Veritas Support about their documented process not working.
  2. Delete removed drive and add new drive via Device Config wizard.
    We know that this process works well. 

We won't be able to solve your issue in this forum. 

mph999
Level 6
Employee Accredited

If you reconfigure the drives via the wizard, we won't be able to troubleshoot this, as any evidence will have disappeared.

As a very minimum, we would need:

Add -zr SQL in /usr/openv/var/global/server.conf

Add VERBOSE to /usr/openv/volmgr/vm.conf

Create dir /usr/openv/volmgr/debug/tpcommand

Restart services

nbdb_unload output  (/usr/openv/db/bin/nbdb_unload /tmp/output.before)

Recreate issue

nbdb_unload output command again  (/usr/openv/db/bin/nbdb_unload /tmp/output.after)

tpcommand log

server.log (from /usr/openv/db/log)

OK!
But I am a bit afraid that nbdb_unload might make the situation unexpectedly worse!

https://vox.veritas.com/t5/NetBackup/Help-needed-nbdb-unload/td-p/843803

Note: it seems my worry about it was unnecessary, so I will do that soon!

 

Now I have collected and uploaded all the logs you need!

From my tpcommand log, I can see that the replace_drive with the new TD S/N did fail!

09:32:07.182 [25886860] <16> update_drive: (0) UpdateDrive failed, emmError = 2009005, nbError = 0
09:32:07.182 [25886860] <16> MMreplace_hw: (-) Translating EMM_ERROR_DriveSerialNumberAlreadyExists(2009005) to 91 in the Device Config context
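
For anyone who wants to pull such errors out of a longer tpcommand log, here is a minimal Python sketch. It assumes every line follows the same "time [pid] <severity> function: message" shape as the two lines above; the regular expressions and the function name are my own and not part of NetBackup.

```python
import re

# The two log lines quoted above, copied verbatim.
SAMPLE_LOG = """\
09:32:07.182 [25886860] <16> update_drive: (0) UpdateDrive failed, emmError = 2009005, nbError = 0
09:32:07.182 [25886860] <16> MMreplace_hw: (-) Translating EMM_ERROR_DriveSerialNumberAlreadyExists(2009005) to 91 in the Device Config context
"""

LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[(?P<pid>\d+)\] <(?P<sev>\d+)> "
    r"(?P<func>\w+): (?P<msg>.*)$"
)
NAMED_RE = re.compile(r"Translating (EMM_ERROR_\w+)\((\d+)\) to (\d+)")
CODED_RE = re.compile(r"emmError = (\d+)")


def emm_errors(log_text):
    """Map each EMM error code found in the log to its symbolic name,
    or to None when the log never spells the name out."""
    errors = {}
    for line in log_text.splitlines():
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue
        message = parsed.group("msg")
        named = NAMED_RE.search(message)
        if named:
            errors[int(named.group(2))] = named.group(1)
            continue
        coded = CODED_RE.search(message)
        if coded:
            errors.setdefault(int(coded.group(1)), None)
    return errors


found = emm_errors(SAMPLE_LOG)
for code, name in found.items():
    print(code, name)
```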

mph999
Level 6
Employee Accredited


This is the missing device in the NBDB table that 'defines' devices:

'2000423',0x16FFD58A4E1211E68000FE9DA945725D,'2','10','0','128','1','NetBackup HCART2','NetBackup HCART','523118080','16176','6','0','HP.ULTRIUM5-SCSI.002','','2000420','3','8','','10','HP','Ultrium 5-SCSI','I6RZ','','','','HU1248TF0U','','','HP Ultrium 5-SCSI I6RZ','','0','','1000015','1000014','1970-01-01 00:00:00.000000','1970-01-01 00:00:00.000000','2018-12-23 18:12:07.000000','0','3663947','0','1',0x00000000,0x00000000000000000000000000000000,'-1','-1','1970-01-01 00:00:00.000000','0','0','-1','-1','-1','-1','','','','82','0','0','8388608','2016-07-19 08:37:08.362446','2018-12-27 01:02:50.048002'

This is the new one ...

'2000792',0x4148AA1CE45411E880009852720722F7,'2','10','0','128','1','NetBackup HCART2','NetBackup HCART','523118080','16176','6','0','HP.ULTRIUM5-SCSI.012','','2000420','3','8','','10','HP','Ultrium 5-SCSI','I5GZ','','','','HU1213MTPL','','','HP Ultrium 5-SCSI I5GZ','','0','','1000003','1000002','1970-01-01 00:00:00.000000','1970-01-01 00:00:00.000000','2018-12-27 00:38:52.000000','0','207705','0','0',0x00000000,0x00000000000000000000000000000000,'-1','-1','1970-01-01 00:00:00.000000','0','0','-1','-1','-1','-1','','','','82','0','0','8388608','2018-11-09 03:18:35.846744','2018-12-27 00:45:23.609944'


So, quite simply, it seems the new drive was added via the wizard or manually before the tpautoconf -replace_drive command was run.

If you manually delete the drive with name HP.ULTRIUM5-SCSI.002 hopefully it will resolve the issue.
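
The comparison of the two rows can be reproduced with a short Python sketch. The rows below are the ones quoted above, cut off after the inquiry string to keep the example short. Nothing here relies on knowing the NBDB column layout: the drive name is picked out by its shape (vendor.model.index) and the serial is matched as a whole field, both of which are assumptions for illustration.

```python
import csv
import re

# Leading fields of the two nbdb_unload rows quoted above (truncated).
ROWS = [
    "'2000423',0x16FFD58A4E1211E68000FE9DA945725D,'2','10','0','128','1',"
    "'NetBackup HCART2','NetBackup HCART','523118080','16176','6','0',"
    "'HP.ULTRIUM5-SCSI.002','','2000420','3','8','','10','HP',"
    "'Ultrium 5-SCSI','I6RZ','','','','HU1248TF0U','','',"
    "'HP Ultrium 5-SCSI I6RZ'",
    "'2000792',0x4148AA1CE45411E880009852720722F7,'2','10','0','128','1',"
    "'NetBackup HCART2','NetBackup HCART','523118080','16176','6','0',"
    "'HP.ULTRIUM5-SCSI.012','','2000420','3','8','','10','HP',"
    "'Ultrium 5-SCSI','I5GZ','','','','HU1213MTPL','','',"
    "'HP Ultrium 5-SCSI I5GZ'",
]

DRIVE_NAME_RE = re.compile(r"^[A-Z]+\.[A-Z0-9-]+\.\d{3}$")


def parse_row(line):
    """Split one unloaded row into fields; values are single-quoted."""
    return next(csv.reader([line], quotechar="'"))


def drive_name(fields):
    """Return the first field that looks like a NetBackup drive name."""
    return next((f for f in fields if DRIVE_NAME_RE.match(f)), None)


def owners_of_serial(rows, serial):
    """List the drive names of every row that carries this serial number."""
    return [drive_name(fields)
            for fields in map(parse_row, rows)
            if serial in fields]


print(owners_of_serial(ROWS, "HU1213MTPL"))
print(owners_of_serial(ROWS, "HU1248TF0U"))
```

The new serial HU1213MTPL already belongs to a separate record named HP.ULTRIUM5-SCSI.012, so writing it onto HP.ULTRIUM5-SCSI.002 would create a duplicate, which fits the DriveSerialNumberAlreadyExists error in the tpcommand log.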

1) How do you explain that replace_drive cannot take effect once the new TD S/N has been added via the Device Wizard or tpautoconf -a?
2) Are all the SSO TD S/Ns registered with every Media Server? That is, why can my tpautoconf only find each server's own mismatch records?
Note: that would mean I must run tpautoconf on all the corresponding SSO Media Servers.

mph999
Level 6
Employee Accredited

1.   As per the log message you found, it can't take effect because it already exists.

2.  The drive is only referenced once in the device table, it is given a unique device key ( a number).  There is another table that references the 'device key' of the drive to each media server it is associated with.

In theory therefore, you should only need to run tpautoconf on one server ...
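
Both answers can be illustrated with a toy model. To be clear, this is not the real NBDB schema: the table names, column names, the uniqueness rule on the serial and the second host name are all hypothetical, and SQLite stands in for the actual database. It only shows the shape described above: one row per drive, a separate table mapping the device key to each host, and an update that is refused when the serial already exists on another drive.

```python
import sqlite3


def build_model():
    """Create an in-memory toy model: one device row per drive, plus a
    mapping table from device key to each host that uses the drive."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE device (
            device_key INTEGER PRIMARY KEY,
            name       TEXT NOT NULL UNIQUE,
            serial     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE device_host (
            device_key INTEGER NOT NULL REFERENCES device(device_key),
            host       TEXT NOT NULL,
            PRIMARY KEY (device_key, host)
        );
    """)
    # Names and serials come from the thread; treating the first column
    # of the unloaded rows as the device key is an assumption.
    db.executemany("INSERT INTO device VALUES (?, ?, ?)", [
        (2000423, "HP.ULTRIUM5-SCSI.002", "HU1248TF0U"),
        (2000792, "HP.ULTRIUM5-SCSI.012", "HU1213MTPL"),
    ])
    # JCERPDB1 is from the report; "media-server-b" is a made-up second host.
    db.executemany("INSERT INTO device_host VALUES (?, ?)", [
        (2000423, "JCERPDB1"),
        (2000423, "media-server-b"),
    ])
    db.commit()
    return db


def replace_serial(db, name, new_serial):
    """Try to give an existing drive a new serial; False if it is taken."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("UPDATE device SET serial = ? WHERE name = ?",
                       (new_serial, name))
        return True
    except sqlite3.IntegrityError:
        return False


def hosts_for(db, name):
    """Return every host mapped to the drive with this name."""
    rows = db.execute(
        "SELECT host FROM device_host JOIN device USING (device_key) "
        "WHERE name = ? ORDER BY host", (name,))
    return [row[0] for row in rows]


db = build_model()
print(replace_serial(db, "HP.ULTRIUM5-SCSI.002", "HU1213MTPL"))
print(hosts_for(db, "HP.ULTRIUM5-SCSI.002"))
```

In this model the update is rejected because HP.ULTRIUM5-SCSI.012 already holds the serial, and a successful update would touch a single device row that every mapped host shares, which is why one run on one server should be enough in theory.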