09-13-2013 12:47 AM
Environment
SFHA/DR
iSCSI SAN
Primary Site = two nodes
DR Site = one node
SFHA version = 5.0 MP4 RP1
RHEL OS = 5
DiskGroup Agent logs
2013/09/10 11:56:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 11:56:49 VCS ERROR V-16-2-13027 Thread(4152359824) Resource(DG) - monitor procedure did not complete within the expected time.
2013/09/10 11:58:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:00:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:02:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:02:49 VCS ERROR V-16-2-13210 Thread(4155296656) Agent is calling clean for resource(DG) because 4 successive invocations of the monitor procedure did not complete within the expected time.
2013/09/10 12:02:50 VCS ERROR V-16-2-13068 Thread(4155296656) Resource(DG) - clean completed successfully.
2013/09/10 12:03:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:03:51 VCS ERROR V-16-2-13077 Thread(4152359824) Agent is unable to offline resource(DG). Administrative intervention may be required.
2013/09/10 12:05:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:07:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:09:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:11:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:13:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:15:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:17:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:19:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:21:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:23:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
ERROR V-16-2-13210
https://sort.symantec.com/ecls/umi/V-16-2-13210
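As a side note while the root cause is investigated: if the DiskGroup monitor only times out while storage is slow, the monitor timeout can be raised as a stopgap. A sketch, assuming standard VCS type attributes (the 300-second value is illustrative, and raising timeouts does not fix the underlying latency):

```shell
# Raise the DiskGroup monitor timeout as a stopgap while storage
# latency is investigated. Guarded so it is a no-op on hosts
# without the VCS command-line tools installed.
if command -v hatype >/dev/null 2>&1; then
  haconf -makerw                                  # open config read-write
  hatype -modify DiskGroup MonitorTimeout 300     # default is 60 seconds
  haconf -dump -makero                            # save and close config
else
  echo "VCS commands not available on this host"
fi
```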
===========================================================================
Mount Agent logs
2013/09/10 12:12:53 VCS ERROR V-16-2-13064 Thread(4154968976) Agent is calling clean for resource(MOUNT1) because the resource is up even after offline completed.
2013/09/10 12:12:54 VCS ERROR V-16-2-13069 Thread(4154968976) Resource(MOUNT1) - clean failed.
2013/09/10 12:13:54 VCS ERROR V-16-2-13077 Thread(4154968976) Agent is unable to offline resource(MOUNT1). Administrative intervention may be required.
2013/09/10 12:14:00 VCS WARNING V-16-2-13102 Thread(4153916304) Resource (MOUNT1) received unexpected event info in state GoingOfflineWaiting
2013/09/10 12:15:28 VCS WARNING V-16-2-13102 Thread(4153916304) Resource (MOUNT1) received unexpected event info in state GoingOfflineWaiting
WARNING V-16-2-13102
https://sort.symantec.com/ecls/umi/V-16-2-13102
===========================================================================
My investigation
Extract from the above TN: "Links 0 and 1 are both connected to the same switch, resulting in a cross-link scenario". I googled that extract and found this TN: http://www.symantec.com/business/support/index?page=content&id=HOWTO79920. Its description reads: it seems that you're using only one network switch between the cluster nodes; Symantec recommends configuring two independent networks between the cluster nodes, with one network switch for each network, for failure protection.
The problem seems to be this: Links 0 and 1 are both connected to the same switch, resulting in a cross-link scenario.
===========================================================================
The above is one case. I also see a problem with the disks themselves, which may leave the disks unresponsive and cause the Mount resource problems. See the attached RHEL messages log:
09-13-2013 01:59 AM
It seems you have two separate issues here:
Mike
09-13-2013 02:19 AM
1. Do you think that "disk(s) unresponsive" means:
- slow I/O on the disks due to heavy I/O load on the SAN disks,
OR
- intermittent drops between the SAN storage and the cluster node due to a network issue (as it is an iSCSI SAN)?
I need to conclude this.
09-13-2013 02:25 AM
I don't know - this is a storage/network question, not a cluster question, unless you are sharing your heartbeat networks with the network iSCSI is using (in which case you need to separate them, as VCS heartbeats are very "chatty" and require a private network).
Mike
09-13-2013 03:08 AM
Do you think that mixing a heartbeat network with iSCSI SAN traffic can cause problems between the cluster nodes and the iSCSI SAN communication?
09-13-2013 03:38 AM
Yes - putting ONE heartbeat on the same network as something else will affect traffic on that network - that is why they are called "private" links: they should be private. Your situation is worse, as you have put TWO heartbeats on the same network as a busy service - iSCSI!
Note that low-pri heartbeats normally carry only heartbeats, not cluster state information, but a low-pri heartbeat can be promoted to a full heartbeat, so even low-pri heartbeats should not be used on heavily loaded networks: if one is promoted, the heartbeat traffic and the network load will affect each other.
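For reference, this is roughly how full and low-pri heartbeat links are distinguished in /etc/llttab (node name, cluster ID, and device names here are illustrative, not taken from this cluster):

```text
set-node node1
set-cluster 100
link eth1 eth1 - ether - -          # full heartbeat, private network 1
link eth2 eth2 - ether - -          # full heartbeat, private network 2
link-lowpri eth0 eth0 - ether - -   # low-pri heartbeat on public network
```

The `link` entries carry full heartbeat and cluster state traffic; the `link-lowpri` entry normally carries heartbeats only, but LLT can promote it if the full links fail.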
Mike
09-13-2013 06:21 AM
I have no experience using iSCSI, but I would have thought a network used for iSCSI should be dedicated, so that excessive traffic on the network doesn't affect I/O on iSCSI. Therefore excessive network traffic (like heartbeats or FTPing a large file) may be able to make an iSCSI disk unresponsive. You could do your own tests by trying to max out the network, say by FTPing large files, to see how it affects the iSCSI disks.
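The load test suggested above can be sketched as follows. The dd here is a local stand-in so the commands are self-contained; in a real test you would push a large file across the iSCSI VLAN (scp/ftp) and watch disk latency at the same time:

```shell
# Crude load test: generate bulk traffic and watch whether iSCSI disk
# latency degrades. Replace the local dd with a real network transfer
# (e.g. scp of a large file) across the iSCSI VLAN.
dd if=/dev/zero of=/dev/null bs=1M count=256 2>&1
# While the transfer runs, watch latency on the iSCSI LUNs from another
# terminal, e.g.:  iostat -x 2   (check the await column per device)
```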
Mike
09-13-2013 06:55 AM
Mike, I have talked to my client and the configuration is as follows.
- Both heartbeats are directly connected between the two cluster nodes; there is no switch in the middle of the heartbeat links.
- The iSCSI SAN is connected to a layer 3 switch, and all other network traffic also uses that switch, but the iSCSI SAN is on a separate VLAN.
I think everything is placed properly.
09-13-2013 07:12 AM
In this case, as I said earlier, the unresponsive iSCSI disk is most likely a storage/network issue, not a cluster issue. It could be a cluster issue if the cluster were having problems with the Mount and DiskGroup agents while there was no underlying storage issue, but you said that the "problem is with disks as well .... see the attached RHEL messages log", and you would expect agent errors for Mount and DiskGroup if there is a problem with unresponsive disks.
But if there is no switch involved in the heartbeats and you are still getting crossed messages, it could be that heartbeat 1 on node 1 is connected to heartbeat 2 on node 2, though I am not sure the heartbeats would work at all if you had this.
Mike
09-13-2013 07:28 AM
but you said that "problem is with disks as well .... see the attached RHEL messages log"
Yes, my disks have a problem. I shared that with you; see "Mount Agent logs" in my first post, which clearly shows a problem with the disks.
Actually, the clustered application was unresponsive, so my client switched over the service group, and during the switchover the Mount resource got stuck while going offline. So now I want to investigate the reasons behind the failed unmount.
09-13-2013 07:51 AM
I already posted this:
Disks unresponsive - this means you can't umount the filesystem, as you can't write the mount flag to denote that the filesystem is unmounted.
Mike
09-18-2013 06:58 AM
It is concluded that the disks were unresponsive. May we dig into this more, please, as we need to conclude the reason(s) behind the unresponsiveness?
As per my understanding there may be two reasons behind it:
1- The disks were unresponsive (as you said), maybe because the disks were overburdened (my thought).
2- The disks were not overburdened, and the OS may have a bug, as kill signal 15 as well as signal 9 was not able to terminate the running process. See the below log piece for your reference
09-18-2013 07:51 AM
VCS just runs an O/S command - umount - and the error you receive is from umount, not from VCS or SF.
Whether umount fails due to an O/S or hardware issue is not an SF/VCS question, so I am unable to determine why your umount failed.
Mike
09-18-2013 08:33 AM
Thanks, Mike, for your words.
09-18-2013 09:08 PM
umount: /xxx/XXX/abc: device is busy
As per Mike's excellent post, VCS is reporting the message that it is getting back from the OS.
This means that you need to troubleshoot at OS level, not VCS level.
If you Google 'umount device is busy' you will find lots of OS-related forums and blogs where possible reasons are discussed.
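As a starting point for that OS-level troubleshooting, you can find out which processes are keeping the path busy. `fuser -vm /mountpoint` and `lsof +D /mountpoint` are the usual tools; the sketch below does the same scan by hand via /proc, and holds a file open itself so there is something to find (paths are temporary, nothing here is specific to VCS):

```shell
# Find processes with open files under a directory - the usual cause of
# "umount: device is busy". The demo holds a file open itself.
d=$(mktemp -d)
sleep 30 > "$d/held.log" &     # background job keeps the file open
holder=$!
# Scan every process's open file descriptors for paths under $d:
for p in /proc/[0-9]*/fd/*; do
  t=$(readlink "$p" 2>/dev/null) || continue
  case "$t" in
    "$d"/*) echo "pid ${p#/proc/} holds $t" ;;
  esac
done
kill "$holder"
rm -rf "$d"
```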
09-20-2013 12:49 AM
Here is what I think could be happening:
Given 1: The Mount agent, like all VCS agents/components, runs within the root user's context.
Given 2: In UNIX, when a process is sent signal 9 from root, the process is *guaranteed* to die (signal 9 is not catchable).
So if you have a group of processes that are sent signal 9 by root and are not killed, it basically means those processes are never getting scheduled to run (because if they ever got scheduled, they would receive signal 9 from root and *would* die).
So why aren't they getting scheduled to run? A classic case is that a UNIX scheduler will not schedule a process to run (put it in the run queue) while that process is waiting for an I/O operation to complete.
So why are those processes waiting for I/O? From your previous posts it looks like there are problems communicating with the iSCSI storage devices - this would cause exactly this kind of error. If you had been able to reconnect and/or resolve the iSCSI communications error earlier (after the Mount agent sent those processes signal 9), those processes would have had their I/Os returned, and therefore the scheduler would have scheduled them to run, at which point they would have died immediately as a result of having been sent signal 9 from root.
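A quick way to check for this state on Linux (nothing VCS-specific): processes blocked on I/O show state D (uninterruptible sleep) in ps, and those are exactly the processes that ignore kill -9 until the I/O returns:

```shell
# List processes in uninterruptible sleep (state D). A process stuck
# here will not die, even on SIGKILL, until its outstanding I/O
# completes; the wchan column hints at what it is waiting on.
ps -eo state,pid,wchan:32,comm | awk 'NR==1 || $1 ~ /^D/'
```

On a healthy system this usually prints only the header line; on a host with hung iSCSI I/O you would expect to see the stuck mount/umount processes listed here.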
To handle such issues, I know that most UNIXes have a forceful umount (certainly Solaris does), which does unmount the filesystems and therefore lets the Mount agent complete. You want to investigate whether your OS (which I think is Linux?) is also capable of such a forceful umount, and then why the Linux Mount agent is not using it. This is SYMC Tech Support territory, I would think. I assume you already have a case opened on this?
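For the record, Linux does have both a lazy and a forced unmount in util-linux. A sketch (/xxx/XXX/abc is the mount point from the error message earlier in the thread; substitute your own):

```shell
mnt=/xxx/XXX/abc   # mount point from the "device is busy" error
# Lazy unmount: detach the filesystem from the tree now; cleanup
# happens once the last holder closes its open files.
umount -l "$mnt" || echo "lazy umount failed for $mnt"
# Forced unmount: intended mainly for unreachable network filesystems.
umount -f "$mnt" || echo "forced umount failed for $mnt"
```

Note that a lazy unmount lets the VCS Mount resource go offline, but the underlying device is still in use until the stuck processes release it, so it hides rather than fixes the I/O hang.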
Hope this helps.