04-18-2013 12:00 AM
Good day,
I have VCS running and, as far as I can tell, everything looks OK. It is a 2-node cluster directly connected (no switch) to a Sun/LSI 6180 disk array. The connections look fine and DMP shows no errors, but the OS messages file constantly shows SCSI read errors:
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci111d,806e@0/pci111d,806e@4/pci1077,171@0,1/fp@0,0/disk@w20140080e518345e,... (sd12):
Mar 20 16:00:11 MIRTL01 Error for Command: read(10) Error Level: Retryable
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice] Requested Block: 288 Error Block: 288
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 4^ 9N7
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice] ASC: 0x8b (<vendor unique code 0x8b>), ASCQ: 0x2, FRU: 0x0
Mar 22 10:41:00 MIRTL01 explorer: [ID 702911 daemon.notice] Explorer started
I don't understand where this is coming from. Could you give me some advice as to where I should be looking? Logfiles are available if need be.
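(Not from the original post: as a first triage step, the repeating warnings can be tallied by sd instance and by sense key. This is a sketch that assumes the standard Solaris /var/adm/messages format shown above; the function name and default path are illustrative.)

```shell
# Sketch: summarize Solaris SCSI warnings in a messages file.
# Counts occurrences per sd instance and per sense key so you can see
# whether one device or one condition dominates.
scsi_error_summary() {
    msg=${1:-/var/adm/messages}   # assumed default path
    echo "== warnings per sd instance =="
    grep 'scsi:.*WARNING' "$msg" | grep -o '(sd[0-9]*)' | sort | uniq -c
    echo "== occurrences per sense key =="
    grep 'Sense Key:' "$msg" | sed 's/.*Sense Key: //' | sort | uniq -c
}
```

Usage: `scsi_error_summary /var/adm/messages`. If a single sd instance accounts for all the hits, that points at one LUN rather than a path or HBA problem.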
Thanks
Remco
04-18-2013 12:34 AM
Hi Remco,
For this kind of SCSI error, you had better consult your hardware vendor for further troubleshooting first.
If there is an array-level mirror (which may be read-only), or the LUN has a reservation key, this kind of message may show up.
04-18-2013 12:39 AM
Hi Remco,
First, the error was reported by the OS kernel, not by DMP or VxVM, so you may want to ask the OS vendor about it first.
This is a SCSI read error returned by the SCSI target on the disk array, so checking the disk array may be helpful.
From a VxVM/DMP perspective, I suggest you verify the supported mode and host configuration for the array. Please check the DMP configuration guide for the detailed requirements:
http://www.symantec.com/business/support/resources/sites/BUSINESS/content/live/TECHNICAL_SOLUTION/47000/TECH47728/en_US/TECH47728.pdf
04-18-2013 02:20 AM
Thanks for your reply. What I'm getting at is this: if DMP is working properly, shouldn't these errors not show up on the OS?
Let me clarify a bit more:
The OS messages I'm seeing are on the active node of the cluster. The DMPevents.log file is also showing entries that I do not fully understand, but they lead me to believe that something is not right:
------LOGGING START------
Wed Feb 27 01:57:54.737: Enabled Disk array sun6180-0
Wed Feb 27 01:57:54.737: Enabled Disk array disk
Wed Feb 27 01:57:54.737: Added Dmpnode sun6180-0_0
Wed Feb 27 01:57:54.737: Added Dmpnode sun6180-0_1
Wed Feb 27 01:57:54.753: Dmpnode disk_0 has migrated from enclosure - to disk
Wed Feb 27 01:57:54.753: Disabled Disk array -
Wed Feb 27 01:58:07.000: Initiated SAN topology discovery
Wed Feb 27 01:58:07.000: Completed SAN topology discovery
Wed Feb 27 02:29:12.400: Lost 325 DMP I/O statistics records
Wed Feb 27 03:14:55.509: Lost 9262 DMP I/O statistics records
Wed Mar 6 01:25:06.258: Lost 10060 DMP I/O statistics records
Wed Mar 6 01:25:07.258: Lost 9996 DMP I/O statistics records
Wed Mar 6 01:25:08.258: Lost 8380 DMP I/O statistics records
Wed Mar 20 02:08:22.905: Lost 3917 DMP I/O statistics records
Wed Mar 20 02:08:23.915: Lost 6303 DMP I/O statistics records
Wed Mar 20 02:08:24.915: Lost 2935 DMP I/O statistics records
Wed Mar 20 03:08:28.062: Lost 9914 DMP I/O statistics records
Wed Mar 20 03:08:29.072: Lost 1276 DMP I/O statistics records
Thanks so far! ;)
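(An aside, not from the thread: as far as I know, the "Lost ... DMP I/O statistics records" entries mean the statistics collection buffer overflowed, which affects accounting rather than the I/O path itself. A sketch to total the lost records per day, assuming the DMPevents.log format shown above; the default log path is an assumption.)

```shell
# Sketch: sum "Lost N DMP I/O statistics records" entries per day
# from a DMP events log in the format quoted above.
lost_records_per_day() {
    awk '/Lost [0-9]+ DMP I\/O statistics records/ {
        day = $1 " " $2 " " $3            # e.g. "Wed Mar 6"
        for (i = 1; i <= NF; i++)
            if ($i == "Lost") { sum[day] += $(i+1); break }
    }
    END { for (d in sum) print d, sum[d] }' "${1:-/etc/vx/dmpevents.log}"
}
```

Usage: `lost_records_per_day /etc/vx/dmpevents.log`. If the totals cluster around the same times as the kernel warnings, that correlation is worth noting when talking to the array vendor.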
04-18-2013 04:36 AM
There is nothing wrong with the paths to the disk - the errors are on the disk itself.
04-18-2013 08:42 AM
As per the DSM document for this array, this ASC/ASCQ combination means the LU is in a quiescent state.
Please ask Oracle for further information and the root cause of this.
http://docs.oracle.com/cd/E19373-01/820-4737-13/chapsing.html
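(Illustrative only: the sense data from the messages can be decoded in one place. "Unit Attention" is a standard SCSI sense key; the ASC 0x8b / ASCQ 0x2 mapping below is the vendor-specific meaning cited in this thread for the 6180, not a general SCSI code. The function is a hypothetical helper.)

```shell
# Sketch: interpret the sense data seen in the kernel warnings.
# Standard sense-key meaning plus the vendor-specific mapping quoted
# in this thread (6180: ASC 0x8b / ASCQ 0x2 = logical unit quiesced).
decode_sense() {
    key=$1 asc=$2 ascq=$3
    case "$key" in
        "Unit Attention") note="target state changed (e.g. reset, LUN change)";;
        *)                note="see the standard SPC sense-key table";;
    esac
    case "$asc/$ascq" in
        0x8b/0x2) vendor="6180 vendor code: logical unit quiesced";;
        *)        vendor="no vendor-specific mapping known";;
    esac
    echo "$key: $note; $vendor"
}
```

Usage: `decode_sense "Unit Attention" 0x8b 0x2` prints the combined interpretation, which matches the quiesce condition described in the Oracle doc above.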
04-18-2013 12:47 PM
Hello yasuhisa and Marianne,
So you think the issue is on the array itself? I have checked the array and see no errors there. There are 2 volumes built on the array that are mapped to the cluster. Would it be possible to determine which array controller is causing the issue? I have explorers from both systems. The errors are only visible on the active node.
Thanks
Remco
04-18-2013 09:24 PM
Have you had a look at the document that stinsong referred you to?
It is extremely important to verify correct host settings as well as array settings.
For one - If this is Solaris Sparc, MPxIO must be disabled.
The 6180 array settings are covered on the last page of the doc.
04-19-2013 02:26 AM
Hi Marianne,
Yes, I have checked and double-checked that MPxIO is disabled. The firmware settings are correct, and they are using the correct host settings (AVT enabled). I was hoping we could offline each controller one at a time to see whether the problem persists on a particular controller, but the customer is not willing to try this.
Thanks
05-08-2013 08:02 PM
Hi Remco,
If you have checked all the points everyone mentioned, I suggest mapping 2 other new LUNs from the array and creating a DG/volume on them with I/O running, to test whether the same errors appear on the new LUNs. That way we could see whether it is an array controller issue or a LUN issue.
06-21-2013 09:11 AM
It looks like there are some issues with the assigned HDDs/LUNs themselves.
Try issuing a SCSI inquiry to the devices:
vxscsiinq /dev/vx/rdmp/<devicename>
I assume this will pass; if it does, then check for any error events in the array's event logs.
06-24-2013 01:49 AM
Hi Remco,
Try running iostat -En and post the output; if you have any hardware errors, they should show up there.
Also, please let us know which OS version you are using, because you obviously have an OS- or hardware-related issue.
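(A small sketch to go with that suggestion: this filter flags devices that report nonzero hard or transport errors in the standard Solaris `iostat -En` error-summary lines. The function name is illustrative; pipe the real output into it.)

```shell
# Sketch: flag devices with nonzero hard or transport errors in
# "iostat -En" output lines like:
#   c0t0d0  Soft Errors: 0 Hard Errors: 5 Transport Errors: 2
flag_errors() {
    awk '/Soft Errors:/ {
        dev = $1; hard = 0; trans = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "Hard")      hard  = $(i+2)
            if ($i == "Transport") trans = $(i+2)
        }
        if (hard + trans > 0) print dev, "hard=" hard, "transport=" trans
    }'
}
```

Usage: `iostat -En | flag_errors`. Transport errors would suggest a path/cabling problem; hard errors point at the media or LUN itself, which would fit the theory above.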