09-13-2013 12:47 AM
Environment
SFHA/DR
iSCSI SAN
Primary Site = two nodes
DR Site = one node
SFHA version = 5.0 MP4 RP1
RHEL OS = 5
DiskGroup Agent logs
2013/09/10 11:56:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 11:56:49 VCS ERROR V-16-2-13027 Thread(4152359824) Resource(DG) - monitor procedure did not complete within the expected time.
2013/09/10 11:58:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:00:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:02:48 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:02:49 VCS ERROR V-16-2-13210 Thread(4155296656) Agent is calling clean for resource(DG) because 4 successive invocations of the monitor procedure did not complete within the expected time.
2013/09/10 12:02:50 VCS ERROR V-16-2-13068 Thread(4155296656) Resource(DG) - clean completed successfully.
2013/09/10 12:03:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:03:51 VCS ERROR V-16-2-13077 Thread(4152359824) Agent is unable to offline resource(DG). Administrative intervention may be required.
2013/09/10 12:05:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:07:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:09:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:11:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:13:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:15:50 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:17:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:19:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
2013/09/10 12:21:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4152359824)
2013/09/10 12:23:51 VCS WARNING V-16-2-13139 Thread(4156349328) Canceling thread (4155296656)
ERROR V-16-2-13210
https://sort.symantec.com/ecls/umi/V-16-2-13210
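As a side note while the root cause is investigated: if the DiskGroup monitor only times out while storage is slow, the monitor timeout can be raised as a stopgap. A sketch, assuming standard VCS type attributes (the 300-second value is illustrative, and raising timeouts does not fix the underlying latency):

```shell
# Raise the DiskGroup monitor timeout as a stopgap while storage
# latency is investigated. Guarded so it is a no-op on hosts
# without the VCS command-line tools installed.
if command -v hatype >/dev/null 2>&1; then
  haconf -makerw                                  # open config read-write
  hatype -modify DiskGroup MonitorTimeout 300     # default is 60 seconds
  haconf -dump -makero                            # save and close config
else
  echo "VCS commands not available on this host"
fi
```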
===========================================================================
Mount Agent logs
2013/09/10 12:12:53 VCS ERROR V-16-2-13064 Thread(4154968976) Agent is calling clean for resource(MOUNT1) because the resource is up even after offline completed.
2013/09/10 12:12:54 VCS ERROR V-16-2-13069 Thread(4154968976) Resource(MOUNT1) - clean failed.
2013/09/10 12:13:54 VCS ERROR V-16-2-13077 Thread(4154968976) Agent is unable to offline resource(MOUNT1). Administrative intervention may be required.
2013/09/10 12:14:00 VCS WARNING V-16-2-13102 Thread(4153916304) Resource (MOUNT1) received unexpected event info in state GoingOfflineWaiting
2013/09/10 12:15:28 VCS WARNING V-16-2-13102 Thread(4153916304) Resource (MOUNT1) received unexpected event info in state GoingOfflineWaiting
WARNING V-16-2-13102
https://sort.symantec.com/ecls/umi/V-16-2-13102
===========================================================================
My investigation
Extract from the above TN: "Links 0 and 1 are both connected to the same switch, resulting in a cross-link scenario". I googled that extract and found this TN: http://www.symantec.com/business/support/index?page=content&id=HOWTO79920. Its description reads: it seems that you're using only one network switch between the cluster nodes; Symantec recommends configuring two independent networks between the cluster nodes, with one network switch for each network, for failure protection.
The problem seems to be this: Links 0 and 1 are both connected to the same switch, resulting in a cross-link scenario.
===========================================================================
The above is one case. I also see a problem with the disks themselves, which may leave the disks unresponsive and cause the Mount resource problems. See the attached RHEL messages log:
09-13-2013 01:59 AM
It seems you have two separate issues here:
Mike
09-13-2013 02:19 AM
1. Do you think that "disk(s) unresponsive" means:
- slow I/O on the disks due to heavy I/O load on the SAN disks,
OR
- intermittent drops between the SAN storage and the cluster node due to a network issue (as it is an iSCSI SAN)?
I need to conclude this.
09-13-2013 02:25 AM
I don't know - this is a storage/network question, not a cluster question, unless you are sharing your heartbeat networks with the network iSCSI is using (in which case you need to separate them, as VCS heartbeats are very "chatty" and require a private network).
Mike
09-13-2013 03:08 AM
Do you think that mixing a heartbeat network with iSCSI SAN traffic can cause problems between the cluster nodes and the iSCSI SAN communication?
09-13-2013 03:38 AM
Yes - putting ONE heartbeat on the same network as something else will affect traffic on that network - that is why they are called "private" links: they should be private. Your situation is worse, as you have put TWO heartbeats on the same network as a busy service - iSCSI!
Note that low-pri heartbeats normally carry only heartbeats, not cluster state information, but a low-pri heartbeat can be promoted to a full heartbeat, so even low-pri heartbeats should not be used on heavily loaded networks: if one is promoted, the heartbeat traffic and the network load will affect each other.
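For reference, this is roughly how full and low-pri heartbeat links are distinguished in /etc/llttab (node name, cluster ID, and device names here are illustrative, not taken from this cluster):

```text
set-node node1
set-cluster 100
link eth1 eth1 - ether - -          # full heartbeat, private network 1
link eth2 eth2 - ether - -          # full heartbeat, private network 2
link-lowpri eth0 eth0 - ether - -   # low-pri heartbeat on public network
```

The `link` entries carry full heartbeat and cluster state traffic; the `link-lowpri` entry normally carries heartbeats only, but LLT can promote it if the full links fail.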
Mike
09-13-2013 06:21 AM
I have no experience using iSCSI, but I would have thought a network used for iSCSI should be dedicated, so that excessive traffic on the network doesn't affect I/O on iSCSI. Therefore excessive network traffic (like heartbeats or FTPing a large file) may be able to make an iSCSI disk unresponsive. You could do your own tests by trying to max out the network, say by FTPing large files, to see how it affects the iSCSI disks.
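The load test suggested above can be sketched as follows. The dd here is a local stand-in so the commands are self-contained; in a real test you would push a large file across the iSCSI VLAN (scp/ftp) and watch disk latency at the same time:

```shell
# Crude load test: generate bulk traffic and watch whether iSCSI disk
# latency degrades. Replace the local dd with a real network transfer
# (e.g. scp of a large file) across the iSCSI VLAN.
dd if=/dev/zero of=/dev/null bs=1M count=256 2>&1
# While the transfer runs, watch latency on the iSCSI LUNs from another
# terminal, e.g.:  iostat -x 2   (check the await column per device)
```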
Mike
09-13-2013 06:55 AM
Mike, I have talked to my client and the configuration is as follows.
- Both heartbeats are directly connected between the two cluster nodes; there is no switch in the middle of the heartbeat links.
- The iSCSI SAN is connected to a layer 3 switch, and all other network traffic also uses that switch, but the iSCSI SAN is on a separate VLAN.
I think everything is placed properly.
09-13-2013 07:12 AM
In this case, as I said earlier, the unresponsive iSCSI disk is most likely a storage/network issue, not a cluster issue. It could be a cluster issue if the cluster were having problems with the Mount and DiskGroup agents while there was no underlying storage issue, but you said that the "problem is with disks as well .... see the attached RHEL messages log", and you would expect agent errors for Mount and DiskGroup if there is a problem with unresponsive disks.
But if there is no switch involved in the heartbeats and you are still getting crossed messages, it could be that heartbeat 1 on node 1 is connected to heartbeat 2 on node 2, though I am not sure the heartbeats would work at all if you had this.
Mike
09-13-2013 07:28 AM
but you said that "problem is with disks as well .... see the attached RHEL messages log"
Yes, my disks have a problem. I shared that with you; see "Mount Agent logs" in my first post, which clearly shows a problem with the disks.
Actually, the clustered application was unresponsive, so my client switched over the service group, and during the switchover the Mount resource got stuck while going offline. So now I want to investigate the reasons behind the failed unmount.
09-13-2013 07:51 AM
I already posted this:
Disks unresponsive - this means you can't umount the filesystem, as you can't write the mount flag to denote that the filesystem is unmounted.
Mike
09-18-2013 06:58 AM
It is concluded that the disks were unresponsive. May we dig into this more, please, as we need to conclude the reason(s) behind the unresponsiveness?
As per my understanding there may be two reasons behind it:
1- The disks were unresponsive (as you said), maybe because the disks were overburdened (my thought).
2- The disks were not overburdened, and the OS may have a bug, as kill signal 15 as well as signal 9 was not able to terminate the running process. See the below log piece for your reference
09-18-2013 07:51 AM
VCS just runs an O/S command - umount - and the error you receive is from umount, not from VCS or SF.
Whether umount fails due to an O/S or hardware issue is not an SF/VCS question, so I am unable to determine why your umount failed.
Mike
09-18-2013 08:33 AM
Thanks, Mike, for your words.
09-18-2013 09:08 PM
umount: /xxx/XXX/abc: device is busy
As per Mike's excellent post, VCS is reporting the message that it is getting back from the OS.
This means that you need to troubleshoot at OS level, not VCS level.
If you Google 'umount device is busy' you will find lots of OS-related forums and blogs where possible reasons are discussed.
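As a starting point for that OS-level troubleshooting, you can find out which processes are keeping the path busy. `fuser -vm /mountpoint` and `lsof +D /mountpoint` are the usual tools; the sketch below does the same scan by hand via /proc, and holds a file open itself so there is something to find (paths are temporary, nothing here is specific to VCS):

```shell
# Find processes with open files under a directory - the usual cause of
# "umount: device is busy". The demo holds a file open itself.
d=$(mktemp -d)
sleep 30 > "$d/held.log" &     # background job keeps the file open
holder=$!
# Scan every process's open file descriptors for paths under $d:
for p in /proc/[0-9]*/fd/*; do
  t=$(readlink "$p" 2>/dev/null) || continue
  case "$t" in
    "$d"/*) echo "pid ${p#/proc/} holds $t" ;;
  esac
done
kill "$holder"
rm -rf "$d"
```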
09-20-2013 12:49 AM
Here is what I think could be happening:
Given 1: The Mount agent, like all VCS agents/components, runs within the root user's context.
Given 2: In UNIX, when a process is sent signal 9 from root, the process is *guaranteed* to die (signal 9 is not catchable).
So if you have a group of processes that are sent signal 9 by root and are not killed, it basically means those processes are never getting scheduled to run (because if they ever got scheduled, they would receive signal 9 from root and *would* die).
So why aren't they getting scheduled to run? A classic case is that a UNIX scheduler will not schedule a process to run (put it in the run queue) while that process is waiting for an I/O operation to complete.
So why are those processes waiting for I/O? From your previous posts it looks like there are problems communicating with the iSCSI storage devices - this would cause exactly this kind of error. If you had been able to reconnect and/or resolve the iSCSI communications error earlier (after the Mount agent sent those processes signal 9), those processes would have had their I/Os returned, and therefore the scheduler would have scheduled them to run, at which point they would have died immediately as a result of having been sent signal 9 from root.
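A quick way to check for this state on Linux (nothing VCS-specific): processes blocked on I/O show state D (uninterruptible sleep) in ps, and those are exactly the processes that ignore kill -9 until the I/O returns:

```shell
# List processes in uninterruptible sleep (state D). A process stuck
# here will not die, even on SIGKILL, until its outstanding I/O
# completes; the wchan column hints at what it is waiting on.
ps -eo state,pid,wchan:32,comm | awk 'NR==1 || $1 ~ /^D/'
```

On a healthy system this usually prints only the header line; on a host with hung iSCSI I/O you would expect to see the stuck mount/umount processes listed here.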
To handle such issues, I know that most UNIXes have a forceful umount (certainly Solaris does), which does unmount the filesystems and therefore lets the Mount agent complete. You want to investigate whether your OS (which I think is Linux?) is also capable of such a forceful umount, and then why the Linux Mount agent is not using it. This is SYMC Tech Support territory, I would think. I assume you already have a case opened on this?
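For the record, Linux does have both a lazy and a forced unmount in util-linux. A sketch (/xxx/XXX/abc is the mount point from the error message earlier in the thread; substitute your own):

```shell
mnt=/xxx/XXX/abc   # mount point from the "device is busy" error
# Lazy unmount: detach the filesystem from the tree now; cleanup
# happens once the last holder closes its open files.
umount -l "$mnt" || echo "lazy umount failed for $mnt"
# Forced unmount: intended mainly for unreachable network filesystems.
umount -f "$mnt" || echo "forced umount failed for $mnt"
```

Note that a lazy unmount lets the VCS Mount resource go offline, but the underlying device is still in use until the stuck processes release it, so it hides rather than fixes the I/O hang.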
Hope this helps.