Application monitoring

43 Topics

Listener resource remains faulted
Hello,

We are doing some failure tests for a customer. We have VCS 6.2 running on Solaris 10, with an Oracle database and, of course, the listener associated with it. We are trying to simulate different kinds of failures. One of them is to kill the listener. In this situation the cluster observes that the listener has died and fails the service over to the other node. BUT the listener resource remains in the FAULTED state on the original node, and the group to which it belongs is in the OFFLINE FAULTED state. In this situation, if something goes wrong on the second node, the service will not fail back to the original one until we manually run hagrp -clear.

Is there anything we can do to fix this (to have the clear done automatically)?

Here are some lines from the log:

2015/03/30 17:26:10 VCS ERROR V-16-2-13067 (node2p) Agent is calling clean for resource(ora_listener-res) because the resource became OFFLINE unexpectedly, on its own.
2015/03/30 17:26:11 VCS INFO V-16-2-13068 (node2p) Resource(ora_listener-res) - clean completed successfully.
2015/03/30 17:26:11 VCS INFO V-16-1-10307 Resource ora_listener-res (Owner: Unspecified, Group: oracle_rg) is offline on node2p (Not initiated by VCS)

These lines say that clean for the resource completed successfully, but the resource is still faulted. If I run hares -clear manually, the fault goes away:

20150330-173628:root@node1p:~# hares -state ora_listener-res
#Resource Attribute System Value
ora_listener-res State node1p ONLINE
ora_listener-res State node2p FAULTED
20150330-173636:root@node1p:~# hares -clear ora_listener-res
20150330-173653:root@node1p:~# hares -state ora_listener-res
#Resource Attribute System Value
ora_listener-res State node1p ONLINE
ora_listener-res State node2p OFFLINE
20150330-173655:root@node1p:~#
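For anyone who wants to check for leftover faults from a script, here is a minimal Python sketch that reads text in the four-column layout of the hares -state output quoted in the post and lists the resource/system pairs that are still FAULTED. The function name and the sample text are illustrative assumptions; the sketch only reports faults and does not run hares -clear or any other VCS command.

```python
def faulted_resources(hares_output):
    """Return (resource, system) pairs whose State value contains FAULTED.

    Assumes the four-column layout shown in the post:
    #Resource Attribute System Value
    """
    faulted = []
    for line in hares_output.splitlines():
        fields = line.split()
        # Skip blank lines, prompts, and the '#Resource ...' header line.
        if len(fields) != 4 or fields[0].startswith("#"):
            continue
        resource, attribute, system, value = fields
        if attribute == "State" and "FAULTED" in value:
            faulted.append((resource, system))
    return faulted


# Sample text copied from the output in the post (prompts removed).
sample = """\
#Resource Attribute System Value
ora_listener-res State node1p ONLINE
ora_listener-res State node2p FAULTED
"""

print(faulted_resources(sample))  # [('ora_listener-res', 'node2p')]
```

Whether a detected fault should then be cleared automatically is a policy decision for the cluster owner; the sketch deliberately stops at detection.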
Solved
Laszlo_Budai
10 years ago Place Cluster Server
3.4K Views
0 likes
5 Comments
Integrating SAP with VCS 6.2 (on Oracle Linux 6.5)
Hi,

I was wondering if someone has some additional information on how to set up my cluster...

I have both VCS (including Storage Foundation) and Linux knowledge. However, I have no background in SAP, and as SAP is a very complex product, I cannot see the forest for the trees...

Setup
  2-node (active-passive) cluster of Oracle Linux 6.5 nodes.
  Veritas Storage Foundation HA (= VxVM + DMP + VCS).
  Oracle 11.2 as database.
  SAP ECC 6.0

Apart from the Installation & Configuration Guide for the SAP NetWeaver Agent, I found little information about implementing SAP in VCS.
Source: "Symantec™ High Availability Agent for SAP NetWeaver Installation and Configuration Guide for Linux 6.2".
Unfortunately I cannot find a how-to, a guide or anything similar from Symantec, nor through the usual Google attempts.

My customer is also not very knowledgeable about SAP. From what I understand it is a very basic SAP setup, if not the simplest. They are using SAP ECC 6.0 and an Oracle 11.2 database, so I assume they just have a Central Instance and the database.

After some Google research, I found out that SAP ECC 6.0 is technically SAP NetWeaver 7.0. On Symantec SORT, I found 3 versions of SAP NetWeaver:
  SAP NetWeaver
  SAP NetWeaver 7.1, 7.2, 7.3, 7.4
  SAP NetWeaver 7.1, 7.3
I downloaded the first one, as the description says:
  Agent: SAPNW04 5.0.16.0
  Application version(s): SAP R/3 4.6, R/3 Enterprise 4.7, NW04, NW04s, ERP/ECC 5.0/6.0, SCM/APO 4.1/5.0/5.1/7.0, SRM 4.0/5.0/7.0, CRM 4.0/5.0/7.0
Source: https://sort.symantec.com/agents/detail/1077

SAP ERP 2005 = SAP NetWeaver 2004s (BASIS 7.00) = ECC 6.0
Source: http://itknowledgeexchange.techtarget.com/itanswers/difference-bet-ecc-60-sap-r3-47/
Source: http://www.fasttrackph.com/sap-ecc-6-0/
Source: http://wulibi.blogspot.be/2010/03/what-is-sap-ecc-60-in-brief.html

Currently I have this unfinished setup:
  Installed & configured Storage Foundation HA on both nodes.
  Installed the ACC Libraries on both nodes. See: https://sort.symantec.com/agents/detail/1183
  Installed the SAP NetWeaver Agent on both nodes. See: https://sort.symantec.com/agents/detail/1077
  Configured, next to the ClusterServiceGroup, 3 Service Groups:
    SG_sap: the shared storage Resources (DiskGroup + Volumes + Mount) and the SAPNW Agent Resource.
    SG_oracle: the shared storage Resources (DiskGroup + Volumes + Mount) and the Oracle Agent Resource.
    SG_nfs: still empty.

SAPNW Agent

SAP instance type
The SAPNW Agent documentation states:
The agent supports the following SAP instance types:
  Central Services Instance
  Application Server Instance
  Enqueue Replication Server Instance
Source: "Symantec™ High Availability Agent for SAP NetWeaver Installation and Configuration Guide for Linux 6.2"
But I guess SAP ECC 6.0 has them all in one central instance, right? So I only need one SAPNW Agent.

How is SAP installed:
  only ABAP
  only Java
  add-in (both ABAP and Java)
Source: "Symantec™ High Availability Agent for SAP NetWeaver Installation and Configuration Guide for Linux 6.2"
I have no idea. How can I find this out?

InstName Attribute
Another thing is the InstName Attribute. This also does not correspond with the information I have. My SAP instance is T30. So the syntax is more or less correct, but it isn't listed below. This is also important for deciding on the value of the ProcMon Attribute.
The SAPSID and InstName form a unique identifier that can identify the processes running for a particular instance.
Some examples of SAP instances are given as follows:
  InstName = InstType
  DVEBMGS00 = SAP Application Server - ABAP (Primary)
  D01 = SAP Application Server - ABAP (Additional)
  ASCS02 = SAP Central Services - ABAP
  J03 = SAP Application Server - Java
  SCS04 = SAP Central Services - Java
  ERS05 = SAP Enqueue Replication Server
  SMDA97 = Solution Manager Diagnostics Agent
Source: "Symantec™ High Availability Agent for SAP NetWeaver Installation and Configuration Guide for Linux 6.2"

It is also stated in the listing of the required attributes. However, the default value is CENTRAL. I guess this is correct in my case?
InstName Attribute: An identifier that classifies and describes the SAP server instance type. Valid values are:
  APPSERV: SAP Application Server
  ENQUEUE: SAP Central Services
  ENQREP: Enqueue Replication Server
  SMDAGENT: Solution Manager Diagnostics Agent
  SAPSTARTSRV: SAPSTARTSRV Process
Note: The value of this attribute is not case-sensitive.
Type and dimension: string-scalar
Default: APPSERV
Example: ENQUEUE

EnqSrvResName Attribute
A required attribute is the EnqSrvResName Attribute. The documentation says this should be the Resource Name for the SAP Central Instance. But I am assuming I only have a SAP Central Instance. So I guess I should use the name of my SAP Agent Resource from my SAP Service Group?
EnqSrvResName Attribute: The name of the VCS resource for SAP Central Services (A)SCS Instance. This attribute is used by Enqueue and Enqueue Replication Server. Using this attribute the Enqueue server queries the Enqueue Replication Server resource state while determining the failover target, and vice versa.
Type and dimension: string-scalar
Default: No default value
Example: SAP71-PI1SCS_sap
Source: "Symantec™ High Availability Agent for SAP NetWeaver Installation and Configuration Guide for Linux 6.2"

Is anyone able to help me out? Thanks in advance.
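To make the InstName examples quoted above easier to check against a concrete value, here is a small Python sketch that maps the letter prefix of an instance name to the description given in the guide. The function name and the assumption that an InstName is an uppercase prefix followed by a two-digit instance number are mine, inferred only from the quoted examples; this is not the agent's own validation logic.

```python
import re

# Prefix -> description, copied from the examples quoted from the agent guide.
INSTANCE_PREFIXES = {
    "DVEBMGS": "SAP Application Server - ABAP (Primary)",
    "D": "SAP Application Server - ABAP (Additional)",
    "ASCS": "SAP Central Services - ABAP",
    "J": "SAP Application Server - Java",
    "SCS": "SAP Central Services - Java",
    "ERS": "SAP Enqueue Replication Server",
    "SMDA": "Solution Manager Diagnostics Agent",
}


def classify_inst_name(inst_name):
    """Return the quoted description for an InstName, or None if no example matches.

    Assumption: an InstName is an uppercase prefix plus a two-digit number.
    """
    match = re.fullmatch(r"([A-Z]+)(\d{2})", inst_name)
    if match is None:
        return None
    return INSTANCE_PREFIXES.get(match.group(1))


for name in ("DVEBMGS00", "D01", "ERS05", "T30"):
    print(name, "->", classify_inst_name(name))
```

With this mapping, T30 matches none of the quoted prefixes, which is consistent with the poster's observation that it is not in the list.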
Solved
sanderfiers
10 years ago Place Cluster Server
2.4K Views
2 likes
9 Comments
Veritas Cluster Server, Resource Application failed to start
Hello, on 2 servers running Red Hat Linux 6.3, I've got VRTS 6.0. For one Application resource, when I take it offline and then bring it online on the same server (with all other resources online on this server), it starts fine. But when I test a 'switch to' of the whole service group to the other node, it doesn't start properly (while all the other resources do start). The Application resource is linked with 3 Mount resources and 1 IP resource. We set the Critical attribute to false and UseSUDash to true. The StartProgram script is supposed to start several processes. With an offline/online action all the processes are started; with a 'switch to' action, only half of them are started. There is nothing interesting in the logs on the application side. Any suggestions for debugging would be appreciated.
cedric_tours
10 years ago Place Cluster Server
2.4K Views
0 likes
7 Comments
Unable to bring the Service Group online.
Hi All,

I tried to bring an SG online on a node but it is not coming online. Let me explain the issue. We rebooted node aixprd001 and found that /etc/filesystems was corrupted, so the SG bosinit_SG was in a partial state because many of the cluster file systems were not mounted. We then corrected the entry and manually mounted all the file systems, but the SG still showed the status partial, so we ran the command below.

hagrp -clear bosinit_SG -all

Once that was done, the SG was in the online state. To be on the safe side we took the SG offline and tried to bring it online again, but the SG failed to come online. Below is the only error we were able to find in the engine_A.log file.

2014/12/17 06:49:04 VCS NOTICE V-16-1-10166 Initiating manual online of group bosinit_SG on system aixprd001
2014/12/17 06:49:04 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group bosinit_SG on all nodes

Please help me with suggestions; I will provide log output if needed.

Thanks,
Rufus
Solved
prabindr
10 years ago Place Cluster Server
2K Views
0 likes
4 Comments
Understanding RestartLimit for non-critical resource
Hello,

We have some trouble with our Oracle listener process. Sometimes the listener is killed by VCS, and we don't know why.

xxx VCS ERROR V-16-2-13027 (node1) Resource(lsnr-ORADB1) - monitor procedure did not complete within the expected time.
xxx VCS ERROR V-16-2-13210 (node1) Agent is calling clean for resource(lsnr-ORADB1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
xxx VCS NOTICE V-16-20002-42 (node1) Netlsnr:lsnr-ORADB1:clean:Listener(LISTENER) kill TERM 2342
xxx VCS INFO V-16-2-13068 (node1) Resource(lsnr-ORADB1) - clean completed successfully.
xxx VCS INFO V-16-2-13026 (node1) Resource(lsnr-ORADB1) - monitor procedure finished successfully after failing to complete within the expected time for (4) consecutive times.
xxx VCS INFO V-16-1-10307 Resource lsnr-ORADB1 (Owner: unknown, Group: ORADB1) is offline on node1 (Not initiated by VCS)

Resource(lsnr-ORADB1) is set to non-critical to prevent a failover. I will now set a RestartLimit for Resource(lsnr-ORADB1) so that the cluster tries to restart the listener, but what happens if the restart fails? Will the resource simply stay offline, or will the cluster initiate a failover of the whole service group?

Thanks in advance for any help!
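When trying to see how often the monitor timeouts in the log above actually occur, a quick tally of message IDs can help. The following Python sketch counts the V-... identifiers in engine log text that uses the "VCS <SEVERITY> <ID>" layout of the quoted lines. The function name, the regular expression and the sample text are assumptions for illustration; the sketch does not interpret what each message ID means.

```python
import re
from collections import Counter

# Matches e.g. "VCS ERROR V-16-2-13027" in the layout quoted in the post.
MESSAGE_PATTERN = re.compile(r"VCS (ERROR|WARNING|NOTICE|INFO) (V-\d+-\d+-\d+)")


def count_message_ids(log_text):
    """Count how many times each VCS message ID appears in the log text."""
    return Counter(match.group(2) for match in MESSAGE_PATTERN.finditer(log_text))


# Three of the lines from the post; "xxx" stands in for the timestamp.
sample_log = """\
xxx VCS ERROR V-16-2-13027 (node1) Resource(lsnr-ORADB1) - monitor procedure did not complete within the expected time.
xxx VCS ERROR V-16-2-13210 (node1) Agent is calling clean for resource(lsnr-ORADB1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
xxx VCS INFO V-16-2-13068 (node1) Resource(lsnr-ORADB1) - clean completed successfully.
"""

for message_id, count in count_message_ids(sample_log).most_common():
    print(message_id, count)
```

Run against a longer stretch of the engine log, a high count for the monitor-timeout message relative to the clean message would suggest the monitor is slow far more often than it is killed.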
Foo_Bar_BLN
9 years ago Place Cluster Server
1.9K Views
0 likes
3 Comments
SG is not switching to next node.
Hi All,

I am new to VCS but experienced with HACMP. In our environment we are using VCS 6.0. On one server we found that the SG does not move from one node to the other when we try a manual failover using the command below.

hagrp -switch <SGname> -to <sysname>

We can see that the SG goes offline on the current node, but it does not come online on the secondary node. No error is logged in engine_A.log except the entry below.

cpus load more than 60% <Secondary node name>

Can anyone help me find a solution for this? I will provide the output of any commands if you need more information to troubleshoot this :)

Thanks,
Solved
prabindr
10 years ago Place Cluster Server
1.8K Views
1 like
8 Comments
Long running action on custom agent timing out
Hi,

I've created an action on a custom agent (based on ApplicationAgent) which can take a couple of minutes to complete. However, the action times out after MonitorInterval / 2. If I set MonitorInterval to a sufficiently high value, the action completes, but the cluster manager then takes a very long time to recognise that the application has started (some multiple of MonitorInterval), causing the dependent applications in the group to take too long to start up.

I had hoped that I could override the action timeout using VCSAG_SET_RES_EP_TIMEOUT from ag_i18n_inc.sh, but this does not appear to affect the MonitorInterval / 2 maximum, so it does not help.

In other instances we have created a completely separate custom agent with the custom action, so that its MonitorInterval can be set very high without changing the value for the real applications; this still leaves the application groups in the 'Partial_Online' state for far longer than is acceptable.

I have also contemplated changing MonitorInterval on the custom agent to a high value only during the period when the long-running action is to be carried out, switching it back when the action completes, but there is a risk that the value might not get switched back, again causing slow startup.

Is there any way of allowing my custom action to use a timeout of several minutes without affecting the cluster manager's ability to rapidly confirm the correct startup of the applications?

Any suggestions gratefully received.

Thanks,
Bill Hurn
Solved
Bill_Hurn
10 years ago Place Cluster Server
1.8K Views
0 likes
7 Comments
The VCS 5.1 administrator Java console
I want to download the VCS 5.1 administrator Java console, but I cannot find the download address. Can anyone help me?
Solved
nick_netbackup
13 years ago Place Cluster Server
1.7K Views
0 likes
5 Comments
Unable to add user parameter while configuring Apache Agent
I'm using VCS 4.0 on Linux. Cluster Manager: 4.4. Cluster Server: 4.1. When I try to do "Import Types" and import vcsApacheTypes.cf, I don't see a User parameter. Moreover, if I try to add a User parameter manually, that doesn't work either. Please let me know how to configure an Apache agent so that the Apache process starts as a specified user instead of the root user under which I installed VCS.
Solved
raga
13 years ago Place Cluster Server
1.6K Views
0 likes
9 Comments
setting llt (low prio net) in a bond configured in a bridge
Hi,

I want to set up a low-priority network over a bond which is part of a bridge. I have a cluster with two nodes. This is the network configuration (it is the same for both nodes):

node 2:

[root@node2 ]# cat /proc/version
Linux version 2.6.32-504.el6.x86_64 (mockbuild@x86-023.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Tue Sep 16 01:56:35 EDT 2014

[root@node2 ]# ifconfig | head -n 24
bond0 Link encap:Ethernet HWaddr 52:54:00:14:13:21
      inet6 addr: fe80::5054:ff:fe14:1321/64 Scope:Link
      UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
      RX packets:761626 errors:0 dropped:0 overruns:0 frame:0
      TX packets:605968 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:0
      RX bytes:1188025449 (1.1 GiB) TX bytes:582093867 (555.1 MiB)

br0 Link encap:Ethernet HWaddr 52:54:00:14:13:21
      inet addr:10.10.11.102 Bcast:10.10.11.255 Mask:255.255.255.0
      inet6 addr: fe80::5054:ff:fe14:1321/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
      RX packets:49678 errors:0 dropped:0 overruns:0 frame:0
      TX packets:50264 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:0
      RX bytes:44061727 (42.0 MiB) TX bytes:28800387 (27.4 MiB)

eth0 Link encap:Ethernet HWaddr 52:54:00:14:13:21
      UP BROADCAST RUNNING PROMISC SLAVE MULTICAST MTU:1500 Metric:1
      RX packets:761626 errors:0 dropped:0 overruns:0 frame:0
      TX packets:605968 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:1188025449 (1.1 GiB) TX bytes:582093867 (555.1 MiB)

[root@node2 ]# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.525400141321 no bond0

[root@node2 ]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 52:54:00:14:13:21
Slave queue ID: 0

node 1:

[root@node1]# ifconfig | head -n 24
bond0 Link encap:Ethernet HWaddr 52:54:00:2E:6D:23
      inet6 addr: fe80::5054:ff:fe2e:6d23/64 Scope:Link
      UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
      RX packets:816219 errors:0 dropped:0 overruns:0 frame:0
      TX packets:668207 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:0
      RX bytes:1194971130 (1.1 GiB) TX bytes:607831273 (579.6 MiB)

br0 Link encap:Ethernet HWaddr 52:54:00:2E:6D:23
      inet addr:10.10.11.101 Bcast:10.10.11.255 Mask:255.255.255.0
      inet6 addr: fe80::5054:ff:fe2e:6d23/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
      RX packets:813068 errors:0 dropped:0 overruns:0 frame:0
      TX packets:640374 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:0
      RX bytes:1181039350 (1.0 GiB) TX bytes:604216197 (576.2 MiB)

eth0 Link encap:Ethernet HWaddr 52:54:00:2E:6D:23
      UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
      RX packets:816219 errors:0 dropped:0 overruns:0 frame:0
      TX packets:668207 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:1194971130 (1.1 GiB) TX bytes:607831273 (579.6 MiB)

[root@node1]# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.5254002e6d23 no bond0

[root@node1 ]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 52:54:00:2e:6d:23
Slave queue ID: 0

The LLT configuration files are the following:

[root@node2 ]# cat /etc/llttab
set-node node2
set-cluster 1042
link eth3 eth3-52:54:00:c3:a0:55 - ether - -
link eth2 eth2-52:54:00:35:f6:a5 - ether - -
link-lowpri bond0 bond0 - ether - -

[root@node1 ]# cat /etc/llttab
set-node node1
set-cluster 1042
link eth3 eth3-52:54:00:bc:9b:e5 - ether - -
link eth2 eth2-52:54:00:31:fb:31 - ether - -
link-lowpri bond0 bond0 - ether - -

However, this does not seem to be working. When I check the LLT status, each node thinks the bond0 interface is down on the other one.

[root@node2 ]# lltstat -nvv | head
LLT node information:
    Node State Link Status Address
      0 node1 OPEN
                   eth3 UP 52:54:00:BC:9B:E5
                   eth2 UP 52:54:00:31:FB:31
                   bond0 DOWN
    * 1 node2 OPEN
                   eth3 UP 52:54:00:C3:A0:55
                   eth2 UP 52:54:00:35:F6:A5
                   bond0 UP 52:54:00:14:13:21

[root@node1 ]# lltstat -nvv | head
LLT node information:
    Node State Link Status Address
    * 0 node1 OPEN
                   eth3 UP 52:54:00:BC:9B:E5
                   eth2 UP 52:54:00:31:FB:31
                   bond0 UP 52:54:00:2E:6D:23
      1 node2 OPEN
                   eth3 UP 52:54:00:C3:A0:55
                   eth2 UP 52:54:00:35:F6:A5
                   bond0 DOWN

Do you know if I have something wrong? Is this a valid configuration?

Thanks,
Javier
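To spot a link like bond0 going DOWN without reading the table by eye, the lltstat -nvv output can be scanned with a few lines of Python. This is a minimal sketch that assumes the layout shown in the post (a node line with id, name and state, followed by one line per link); the function name and the sample text are illustrative, and the parser has not been checked against other LLT versions.

```python
def down_links(lltstat_output):
    """Return (node, link) pairs that lltstat reports with Status DOWN."""
    down = []
    node = None
    for raw_line in lltstat_output.splitlines():
        # The '*' only marks the local node, so it is dropped before splitting.
        fields = raw_line.replace("*", " ").split()
        # Node line: <id> <name> <state>, optionally followed by a link entry.
        if len(fields) >= 3 and fields[0].isdigit():
            node = fields[1]
            fields = fields[3:]
        # Link line: <link> <UP|DOWN> [address]
        if len(fields) >= 2 and fields[1] == "DOWN":
            down.append((node, fields[0]))
    return down


# Sample text following the first lltstat output in the post.
sample = """\
LLT node information:
    Node State Link Status Address
      0 node1 OPEN
                   eth3 UP 52:54:00:BC:9B:E5
                   eth2 UP 52:54:00:31:FB:31
                   bond0 DOWN
    * 1 node2 OPEN
                   eth3 UP 52:54:00:C3:A0:55
                   eth2 UP 52:54:00:35:F6:A5
                   bond0 UP 52:54:00:14:13:21
"""

print(down_links(sample))  # [('node1', 'bond0')]
```

Running the same check on both nodes and comparing the results shows the symmetry described in the post: each node sees its own bond0 as UP and the peer's bond0 as DOWN.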
javierv
9 years ago Place Cluster Server
1.5K Views
0 likes
4 Comments