sgscan panics new Solaris 10 media server

Marianne
Level 6
Partner    VIP    Accredited Certified
NetBackup 6.5.4 server software has been installed on a server that, until now, was doing its backups over the network.
sgscan causes a kernel panic (as does tpautoconf -t). The HBA is a QLogic QLA2462 with the Leadville driver.
We have logged a Sev 1 call with Symantec - all they could suggest in the 3 hours since the call was logged was to 'disable tcp fusion'!
Any ideas will be welcome.

7 REPLIES

John_Stockard
Level 5
Partner Certified
Try taking NetBackup out of the picture and see if the HBA is working correctly and if Solaris is playing nice with the HBA.

Does Solaris 10 see the HBA correctly if you run the Solaris command "cfgadm -a"?

Does Solaris see any devices on the SAN fabric through this HBA (not via "sgscan")?

Also, try taking a look at Sun patch ID 123305-04, which is a firmware patch for the QLA2462 HBA (and also references various Solaris OS patches).  It might be possible that you're encountering a bug in either the QLogic HBA firmware or the HBA driver in Solaris 10.

Marianne
Level 6
Partner    VIP    Accredited Certified
Hi John, thanks a lot for all the suggestions - will let you know.

Seems 'scan' works fine and reports on all tape drives (14 in total):

Device Name : "/dev/rmt/0cbn"
Passthru Name: "/dev/sg/c0tw500507630f9a1902l0"
Volume Header: ""
Port: -1; Bus: -1; Target: -1; LUN: -1
Inquiry : "IBM 03592E05 1E0D"
Vendor ID : "IBM "
Product ID : "03592E05 "
Product Rev: "1E0D"
Serial Number: "000007877009"
WWN : ""
WWN Id Type : 0
Device Identifier: "IBM 03592E05 000007877009"
Device Type : SDT_TAPE
NetBackup Drive Type: 10
Removable : Yes
Device Supports: SCSI-3
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name : "/dev/rmt/1cbn"
Passthru Name: "/dev/sg/c0tw500507630f9a1903l0"
Volume Header: ""
Port: -1; Bus: -1; Target: -1; LUN: -1
Inquiry : "IBM 03592E05 1E0D"
Vendor ID : "IBM "
Product ID : "03592E05 "
Product Rev: "1E0D"
Serial Number: "000007879308"
WWN : ""
WWN Id Type : 0
Device Identifier: "IBM 03592E05 000007879308"
Device Type : SDT_TAPE
NetBackup Drive Type: 10
Removable : Yes
Device Supports: SCSI-3
Flags : 0x0
Reason: 0x0
.....
.....
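
For anyone wanting to compare this listing against the /dev/sg entries, here is a minimal Python sketch that splits 'scan'-style output into one record per device. It assumes only the layout visible above ("Key : value" lines, records separated by a dashed rule); the function names parse_scan_output and tape_passthru_names are made up for illustration and are not part of NetBackup.

```python
import re


def parse_scan_output(text):
    """Split 'scan'-style output into one dict per device record."""
    devices, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        # A dashed rule closes the current record (layout assumed from the post).
        if re.fullmatch(r"-{5,}", line):
            if current:
                devices.append(current)
            current = {}
            continue
        # Skip filler lines and the multi-field "Port: ...; Bus: ..." line.
        if ":" not in line or line.startswith("Port:"):
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip().strip('"').strip()
    if current:
        devices.append(current)
    return devices


def tape_passthru_names(devices):
    """Return the passthru paths of records reported as SDT_TAPE."""
    return [d["Passthru Name"] for d in devices
            if d.get("Device Type") == "SDT_TAPE" and "Passthru Name" in d]
```

Feeding it the saved output of 'scan' gives the list of passthru paths that are actually backed by a tape drive.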

mph999
Level 6
Employee Accredited
Must be something environmental, which you'll already be aware of (since sgscan doesn't normally panic Solaris 10 servers).

I'm with John re. the patch. Also consider the firmware level of the HBA: the latest firmware is not always the best. Can you change it and see if it makes a difference?

I'd reconfigure the sg driver (remove and recreate it). Also, pull the cable and try sgscan with no drives attached. Could the issue be coming back from the hardware? (Not sure on that, but hey, worth a go.)

Is there a core? If so, grab it along with the sgscan binary and ask Symantec about running pflags and pstack against the core (there's one other I can't think of). Escalate to Backline to analyze.

Martin

Abesama
Level 6
Partner
sgscan panics but scan works fine.

In that case you can also look at the device link files and remove the unnecessary lines for LUNs and targets.

Look at /etc/devlink.tab, as well as the sg* and st* files under /kernel/drv.

I've seen a similar issue before, where the devlinks contained an excessive number of lines covering a big LUN and target range (from sg.build and sg.install), and we had to remove the lines with no devices.

So, in your case, the necessary lines are the ones with your WWNs, right?

/dev/sg/c0tw500507630f9a1902l0
/dev/sg/c0tw500507630f9a1903l0
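
A rough Python sketch of that clean-up idea, as a review aid only: it pulls the WWN out of passthru names like the two above, then splits the lines of a config file into "keep" and "names a WWN that is not in use". The tw<16 hex digits>l<lun> naming is taken from the paths in this thread; the function names are hypothetical, and nothing here assumes the exact syntax of the sg or devlink entries. Lines that carry no WWN at all are left in the "keep" pile for manual review.

```python
import re

# WWN as embedded in passthru names such as /dev/sg/c0tw500507630f9a1902l0
WWN_IN_PATH = re.compile(r"tw([0-9a-fA-F]{16})l\d+$")
# Any standalone run of exactly 16 hex digits inside a config line
HEX16 = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{16}(?![0-9a-fA-F])")


def wwns_from_passthru(paths):
    """Collect the lower-cased WWNs embedded in passthru device names."""
    wwns = set()
    for path in paths:
        match = WWN_IN_PATH.search(path)
        if match:
            wwns.add(match.group(1).lower())
    return wwns


def split_lines_by_wwn(lines, wwns_in_use):
    """Return (keep, unused): unused lines name only WWNs that are not in use."""
    keep, unused = [], []
    for line in lines:
        found = {w.lower() for w in HEX16.findall(line)}
        if found and not (found & wwns_in_use):
            unused.append(line)
        else:
            keep.append(line)
    return keep, unused
```

The script only reports candidates; any actual edit to the files should be made by hand, on a copy, after checking each flagged line.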

my 2c

A

Marianne
Level 6
Partner    VIP    Accredited Certified
Thanks to all for the valuable advice.
I did not stay on site through the night and was cc'ed on an email this morning saying that the problem has been solved with no indication of what the solution was.
As soon as I have details, I will post an update.
In all honesty, I never believed that NetBackup was at fault. I have been working with NBU for more than 10 years and have seen this twice - each time the problem was caused by the HBA (JNI and Emulex).

Marianne
Level 6
Partner    VIP    Accredited Certified
Thanks to everyone who contributed.
Problem solved:
The system owners replaced the controller for the internal disks and also updated firmware on internal disks. Not sure which change fixed the problem, but all is fine now! No changes were made to hba drivers and/or firmware.

stu1
Level 4
Employee Accredited Certified
Also remember, sgscan is just a shell script. It's plain ASCII, so you can open it in vi and see what it does.

If you do, you'll notice that it runs a for loop over 'ls /dev/sg/', then does a glorified scsi_command -inquiry on each symlink there.

The /dev/sg/ files are usually symlinks to /devices...../sg@.... files, which represent a passthrough device through which initiators can pass SCSI commands directly to the SCSI layer.

So scsi_command is sending an 'inquiry' to each of these devices, which just passes the same command on to the target.

In past cases, there was either a controller or a device that, when an inquiry hit it, caused a SCSI bus reset. No big deal on its own, but when enough inquiries are going through the controller, and the controller is doing a lot of resets, IF one of the disks on that controller happens to be your root disk... BAM, panic.

The last time I saw this happen was on a SPARC system using older SCSI controllers. These controllers were not necessarily the ones the tape drives were connected to, but sgscan sends enough inquiries to all controllers that the one the root disk was attached to kept doing resets, and that led to a kernel panic.
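
To see in advance which controllers a loop over /dev/sg would touch, without sending a single SCSI command, something like the following Python sketch can help. It only resolves the symlinks and groups them by the directory of the resolved target; treating that directory as "the controller" is an assumption about how the /devices tree is laid out, and group_links_by_parent is a made-up name.

```python
import os
from collections import defaultdict


def group_links_by_parent(link_dir):
    """Map each resolved parent directory to the symlink names that point under it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(link_dir)):
        link = os.path.join(link_dir, name)
        if not os.path.islink(link):
            continue  # only symlinks are of interest here
        target = os.path.realpath(link)  # resolve the link; nothing is opened
        groups[os.path.dirname(target)].append(name)
    return dict(groups)
```

Calling group_links_by_parent("/dev/sg") on the media server and printing the result shows whether any of the entries resolve under the controller that also carries the internal disks.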