Forum Discussion

rajeshthink
13 years ago

WWN number and device troubleshooting

**** Moved to new discussion from https://www-secure.symantec.com/connect/forums/wwn-no-drive  ****

The Passthru Name in the scan output never gave me the WWN number:

Device Name  : "/dev/rmt/15cbn"
Passthru Name: "/dev/sg/c20t2l0"
Volume Header: ""
Port: -1; Bus: -1; Target: -1; LUN: -1
Inquiry    : "HP      Ultrium 4-SCSI  H58S"
Vendor ID  : "HP      "
Product ID : "Ultrium 4-SCSI  "
Product Rev: "H58S"
Serial Number: "HU195184RE"
WWN          : ""
WWN Id Type  : 0
Device Identifier: ""
Device Type    : SDT_TAPE
NetBackup Drive Type: 3
Removable      : Yes
Device Supports: SCSI-5
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name  : "/dev/rmt/18cbn"
Passthru Name: "/dev/sg/c21t0l0"
Volume Header: ""
Port: -1; Bus: -1; Target: -1; LUN: -1
Inquiry    : "HP      Ultrium 4-SCSI  H58S"
Vendor ID  : "HP      "
Product ID : "Ultrium 4-SCSI  "
Product Rev: "H58S"
Serial Number: "HU19497WLF"
WWN          : ""
WWN Id Type  : 0
Device Identifier: ""
Device Type    : SDT_TAPE
NetBackup Drive Type: 3
Removable      : Yes
Device Supports: SCSI-5
Flags : 0x0
Reason: 0x0
 

 

I am still waiting for a command that can provide the WWN number of all the devices attached to the server.

 

I also need some documents or references for drive and media troubleshooting.
We usually see status 84/85/86 errors on our server.

Another thing I want to point out is that we have an ACSLS server that this master server is attached to.

It would also be great if you can help me from the OS and storage end, such as checking WWNs and kernel-related settings.

  • There is no NetBackup command that can do this as I explained before.

    What the pass-thru name shows will differ between systems, as we have discovered.

    The correct way to do this is to look at OS commands, such as cfgadm -al or cfgadm -al -o show_FCP_dev on Solaris.
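    The cfgadm route can be sketched as follows. This is only an illustration, assuming the Leadville-style attachment-point layout where the Ap_Id is "cN::<wwn>" and the Type column reads "tape"; the canned lines below are made-up sample output, not from a real system.

```shell
# Hedged sketch: pull tape-drive WWNs out of `cfgadm -al`-style output.
# ASSUMPTION: Leadville attachment points of the form "cN::<wwn> tape ...".
list_tape_wwns() {
    awk '$2 == "tape" { split($1, ap, "::"); print ap[2] }'
}

# On a live Solaris host this would be:  cfgadm -al | list_tape_wwns
# Demo on canned sample output:
printf '%s\n' \
    'c4::500104f000939a02 tape connected configured unknown' \
    'c4::21000011c61e65ee disk connected configured unknown' \
    | list_tape_wwns
# → 500104f000939a02
```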

    Status 86 - positioning error

    NetBackup does not position the tapes; this is done by the OS. Positioning errors are normally due to a tape driver issue or a tape drive firmware issue.

    Status 85 - Read error

    Status 84 - Write error

    Without seeing the bptm log / job details for the 84/85 errors, you cannot expect to get an answer. However, it is quite possible, and I would say in fact most likely, that none of these issues have anything to do with NetBackup at all.

    NetBackup does not read, write or position tapes; it is all handled by the OS. So for any of these errors, the troubleshooting should start with the devices or the operating system, with the help of the device vendor or the operating system vendor if required.

    http://www.symantec.com/docs/TECH169477

    Martin


  • My 2c (In addition to Martin's excellent post):

    You can see that the WWN in the previous post is part of the OS device name, but not in yours.
    The reason for this is probably that you are not using the Solaris Leadville drivers. We have found that customers who use the Leadville drivers have fewer device-related issues than those who use vendor HBA drivers. (Personal experience - others may have a different opinion.)

    About status 84/5/6 errors:
    A good starting point is /usr/openv/netbackup/db/media/errors.
    From this file we can get an indication of whether the same tapes (bad media?) or a particular tape drive keep reporting errors.
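    To turn that file into a quick tally, an awk one-liner wrapped in a function can help. This is a sketch only: the exact column layout of the errors file differs between NBU versions, so the assumption made here (media ID in field 3) must be adapted to the real file.

```shell
# Hedged sketch: count error entries per media ID in the NBU media
# errors file. ASSUMPTION: the media ID sits in field 3; adjust $3
# to match the column layout of your NBU version.
count_media_errors() {
    awk '{ n[$3]++ } END { for (m in n) print n[m], m }' "$1" | sort -rn
}

# Usage on a media server would be something like:
#   count_media_errors /usr/openv/netbackup/db/media/errors
```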

    bptm log will also be helpful as indicated by Martin.

    Most helpful will be OS logs (/var/adm/messages on Solaris). 
    SCSI errors, TapeAlerts, Sense Key errors, hba errors will all be logged to messages file.

    It is important to have a VERBOSE entry in /usr/openv/volmgr/vm.conf on all media servers.
    (NBU needs to be restarted after adding this line.)
    All Media Manager actions/errors will then be logged to /var/adm/messages.
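    A small idempotent snippet for that change could look like the sketch below (run it against vm.conf on each media server, then restart NBU as noted above):

```shell
# Hedged sketch: append a VERBOSE entry to vm.conf only if one is not
# already there, so repeated runs do not duplicate the line.
ensure_verbose() {
    conf=$1
    grep -q '^VERBOSE$' "$conf" 2>/dev/null || echo 'VERBOSE' >> "$conf"
}

# On a media server:  ensure_verbose /usr/openv/volmgr/vm.conf
# (restart NetBackup afterwards for the setting to take effect)
```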
     

     

  • I got some pointers from Martin's post, and some resolutions for the OS admin guys in my team.

    I ran some interesting commands like the one below to check whether the drives have connectivity from the HBA end:

    /usr/sbin/lpfc/lputil shownodes $i

    and found that many drives were disconnected.

    I also observed that some media listed in the log
    /usr/openv/netbackup/db/media/errors are causing issues time and again, even with drives that are working fine.

     

    And yes, my master server runs Solaris 9, which would be one of the issues. Since it has now reached EOS (end of support), even Oracle will not help us fix device issues on it.

     

    Also, about the ACSLS drives, I got a brilliant post from Martin to fix the issue, which is quoted below:

     

    ----------------------------------------------------------------------------------------------

     

    I do not know of any doc; there is a good ACS TN - http://www.symantec.com/docs/TECH59332 - which shows what logs are needed.

    We might need the ACS trace logs – these are a bit like Solaris snoop output and show the exact network activity. The problem is they need to be run through a StorageTek tool which converts them to a readable format (a bit like vxlogview, I guess). We'll have a copy somewhere, but let's not concern ourselves with these logs for the moment.

     

     

    There is a mistake in the TN; it should be:

     

    # rpcinfo -t {acsls_hostname} 300031 2

    program 300031 version 2 ready and waiting

    # rpcinfo -t {acsls_hostname} 300031 1

    program 300031 version 1 ready and waiting

     

    # rpcinfo -u {acsls_hostname} 300031 2

    program 300031 version 2 ready and waiting

    # rpcinfo -u {acsls_hostname} 300031 1

    program 300031 version 1 ready and waiting

     

    -t is for TCP / -u is for UDP.

     

    One of the commands, or preferably all four, should show 'ready and waiting'. If all fail at any point in time, there is no comms to the ACSLS server and the drives will go to AVR.
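    The four probes can be wrapped in a small loop. A sketch, assuming the rpcinfo exit status is enough to tell success from failure; the RPCINFO override is a made-up hook so the function can be exercised without a live ACSLS host:

```shell
# Hedged sketch: probe ACSLS RPC program 300031, versions 2 and 1,
# over both TCP (-t) and UDP (-u). Returns the number of failed probes,
# so 0 means all four answered.
check_acsls_rpc() {
    host=$1 fails=0
    for proto in t u; do
        for ver in 2 1; do
            if ${RPCINFO:-rpcinfo} "-$proto" "$host" 300031 "$ver" >/dev/null 2>&1; then
                echo "rpcinfo -$proto $host 300031 $ver: ready and waiting"
            else
                echo "rpcinfo -$proto $host 300031 $ver: FAILED" >&2
                fails=$((fails + 1))
            fi
        done
    done
    return "$fails"
}

# e.g.  check_acsls_rpc acsls-server
# If all four probes fail, there is no comms to the ACSLS server and
# the drives will drop to AVR.
```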

     

    My thoughts.

     

    The drive going down is not the issue; the issue is whatever causes AVR, and that will be why the drive is down. Fix that and hopefully the drive stays up.

     

    “I would like to have document about primary step we would do to bring drive to active”

     

    There is no one-step quick-fix doc – sorry.

     

    "What logs would we look at apart from messages?"

    As mentioned in the TN – messages and the ACS event log.

     

    This is the most important point you have made

     

    "Why are only some drives going to AVR?"

     

    If you have a network issue (the most common cause of AVR mode with ACS) then I would expect all the drives to go to AVR.

     

    I presume that the drives you are talking about, working and non-working, are on the same media server. If all the drives on any one media server are stuck in AVR, then it is almost certainly a comms issue between that media server and the ACS server.

     

    If only certain drives on any one media server are in AVR, then the only thing I can think of is some config issue.

     

    The server may be critical, but things are not working. We could talk for days and never fix anything, so the troubleshooting will have to start with the most likely cause and eliminate it.

    I would do this by removing the drive config from a single media server and reconfiguring it via the device wizard. OK, it changes EMM, but it will not require a restart of any services on the master.

     

    I would put all the drives that are in AVR into the DOWN state, and then reconfigure the media server – with the shared drives down, the only thing that should be talking to them is that one media server.

     

    There is a big misunderstanding about troubleshooting. It is sometimes thought that we can look in a log or a document and find the instant answer first time. Sometimes yes, but very often no.

     

    Consider this:

     

    I go home and the light is not working in my front room ... it could be ...

     

    Broken lightbulb

    Broken fuse

    Broken light switch

     

    Am I going to go to the shop and buy a new switch first, or am I going to try the most likely causes first, that is, change the bulb and then the fuse?

     

    Of course, I will change the bulb first.

     

    In the same way, for NBU troubleshooting we must eliminate the most likely cause – if this means a restart / change request etc., then so be it.

     

    I can only think of two causes of AVR – network and config. Given that only some drives go to AVR, I don't think it is the network, which only leaves config. I have personally never seen SCSI conflicts make a drive go to AVR; could it happen? I can't say no for sure – but given that AVR means NBU thinks the drive is not in a robot, or cannot talk to the robot, I'm struggling to see how.

     

    Given that, for this issue, we can at least say the network is not the cause, as described above; this only leaves the config – unless you can find someone who can suggest something else.

     

    Will a reconfig fix it? I hope so, but if not, we will have eliminated this as a cause. This obviously doesn't give a solution, but it narrows the issue down to 'something else' and is a 100% valid method of troubleshooting.

     

    The other possibility is still config, but on the ACS side – is the ACS server aware of all the tape drives? You could run 'robtest' from the media server and check that the correct drives appear. I can't test this as I don't have access to an ACS setup, but it's worth a look.

  • Thanks for the feedback!

    If your master server is Solaris 9, it means that your NBU version has also not yet been upgraded to 7.x, right?

    NBU 6.x also reached EOSL earlier this month.

    ACSLS is for robot control only. Each media server still needs direct access to the tape drives.
    The fact that you saw 'disconnect' in the HBA tools says that something is wrong at the SAN level.
    You will need to check HBA firmware, drivers, cables, connectors, switch logs, switch firmware, even the Solaris 'st' patch level, etc. - basically every component in the data path.

    I personally find TECH31526 the best ACSLS troubleshooting guide (see Related Articles in TECH59332).

  • Right, Marian - both the OS and the application have reached EOSL, so we may resolve most of the issues by upgrading them or by changing the hardware.

     

  • Your backup SAN probably also needs to be upgraded...
    As I've said - you need to check each component in the data path.

    Old Emulex HBAs are known for I/O issues at high transfer rates.
    /var/adm/messages is a good starting point.