Automatic Failover
I would like to use Veritas Cluster Server to achieve high availability and automatic failover for my applications. These are Java applications, with some services backed by Sybase databases. The Java applications can be broken into two tiers: a web layer and an application layer. The underlying infrastructure is RHEL Linux for both the Java applications and the Sybase database.

My question is: does VCS support seamless automatic failover for the Java services, including the database services, without requiring manual intervention? What I want to achieve is this: after I set up active-passive for the application layer, I expect the active node to fail over automatically to the passive node, with the passive node immediately becoming the active node.
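
For reference, a minimal main.cf sketch of an active-passive group combining the bundled Sybase and Application agents; every name below (group, resources, paths, server name) is illustrative rather than taken from the poster's environment, and the attribute set is abbreviated:

group app_sg (
    SystemList = { nodeA = 0, nodeB = 1 }
    AutoStartList = { nodeA }
    )

    Sybase sybase_srv (
        Server = SYBSRV1
        Owner = sybase
        Home = "/opt/sybase"
        Version = 15
        SA = sa
        )

    Application java_app (
        StartProgram = "/opt/app/bin/appctl start"
        StopProgram = "/opt/app/bin/appctl stop"
        PidFiles = { "/var/run/app.pid" }
        )

    java_app requires sybase_srv

With resources left Critical (the default) and both nodes in SystemList, a fault on the active node triggers automatic failover to the passive node with no manual step.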

Understanding RestartLimit for a non-critical resource

Hello, we are having some trouble with our Oracle listener process: sometimes the listener is killed by VCS, and we don't know why.

xxx VCS ERROR V-16-2-13027 (node1) Resource(lsnr-ORADB1) - monitor procedure did not complete within the expected time.
xxx VCS ERROR V-16-2-13210 (node1) Agent is calling clean for resource(lsnr-ORADB1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
xxx VCS NOTICE V-16-20002-42 (node1) Netlsnr:lsnr-ORADB1:clean:Listener(LISTENER) kill TERM 2342
xxx VCS INFO V-16-2-13068 (node1) Resource(lsnr-ORADB1) - clean completed successfully.
xxx VCS INFO V-16-2-13026 (node1) Resource(lsnr-ORADB1) - monitor procedure finished successfully after failing to complete within the expected time for (4) consecutive times.
xxx VCS INFO V-16-1-10307 Resource lsnr-ORADB1 (Owner: unknown, Group: ORADB1) is offline on node1 (Not initiated by VCS)

However, Resource(lsnr-ORADB1) is set to non-critical to prevent a failover. I will now set a RestartLimit for Resource(lsnr-ORADB1) to let the cluster try to restart the listener, but what happens if that fails? Will the resource stay offline, or will the cluster initiate a failover of the whole service group?

Thanks in advance for any help!
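
For reference, a sketch of how RestartLimit is typically set: it is a static attribute of the Netlsnr resource type, so it has to be overridden at the resource level before it can be modified (the value 3 below is only an example):

# haconf -makerw
# hares -override lsnr-ORADB1 RestartLimit
# hares -modify lsnr-ORADB1 RestartLimit 3
# haconf -dump -makero

Once the restart attempts are exhausted the resource faults; because it is non-critical, a fault on it alone does not fail over the group.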

How does the LLT heartbeat set the MAC address?

I was able to set the MAC address in LLT, but it gets changed after every server reboot.

Node A:
# lltstat -nvv | head -10
LLT node information:
Node State Link Status Address
* 0 node a OPEN
vnet2 UP 00:14:4F:F8:3B:B7
vnet3 UP 00:14:4F:F9:73:DC
vnet1 UP 00:14:4F:F8:F5:AC
1 node b OPEN
vnet2 UP 00:14:4F:FA:2B:77
vnet3 UP 00:14:4F:FB:5E:07
vnet1 UP 00:14:4F:F9:E0:17

Node B:
# lltstat -nvv | head -10
LLT node information:
Node State Link Status Address
0 node a OPEN
vnet2 UP 00:14:4F:F8:3B:B7
vnet3 UP 00:14:4F:F9:73:DC
vnet1 UP 00:14:4F:F8:F5:AC
* 1 node b OPEN
vnet2 UP 00:14:4F:FA:2B:77
vnet3 UP 00:14:4F:F9:E0:17
vnet1 UP 00:14:4F:FB:5E:07

I got the above right after VCS installation, so I followed the steps below to change the MAC addresses for vnet1 and vnet3:

gabconfig -U
svcadm disable svc:/system/gab:default
lltconfig -k disable
svcadm disable svc:/system/llt:default
svcadm enable svc:/system/llt:default
lltconfig -k enable
svcadm enable svc:/system/gab:default
/sbin/gabconfig -c -n2

After that it had changed:

Node A:
# lltstat -nvv | head -10
LLT node information:
Node State Link Status Address
* 0 node a OPEN
vnet2 UP 00:14:4F:F8:3B:B7
vnet3 UP 00:14:4F:F9:73:DC
vnet1 UP 00:14:4F:F8:F5:AC
1 node b OPEN
vnet2 UP 00:14:4F:FA:2B:77
vnet3 UP 00:14:4F:FB:5E:07
vnet1 UP 00:14:4F:F9:E0:17

Node B:
# lltstat -nvv | head -10
LLT node information:
Node State Link Status Address
0 node a OPEN
vnet2 UP 00:14:4F:F8:3B:B7
vnet3 UP 00:14:4F:F9:73:DC
vnet1 UP 00:14:4F:F8:F5:AC
* 1 node b OPEN
vnet2 UP 00:14:4F:FA:2B:77
vnet3 UP 00:14:4F:FB:5E:07
vnet1 UP 00:14:4F:F9:E0:17

But after a server reboot it went back to the original, out-of-sync state, where vnet1's and vnet3's MAC addresses are swapped. Does anybody know how this setting happens? Where does LLT fetch it from, and how can I make it permanent?
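
For reference, LLT takes its link definitions from /etc/llttab at start-up, and the Address column in lltstat is simply the MAC address of the underlying interface, so LLT itself does not store or assign it. A typical Solaris llttab sketch (node and cluster values are illustrative):

set-node nodea
set-cluster 100
link vnet1 /dev/vnet:1 - ether - -
link vnet2 /dev/vnet:2 - ether - -
link vnet3 /dev/vnet:3 - ether - -

If the vnet instance-to-MAC mapping itself changes across reboots, that mapping is owned by the LDOM virtual-network configuration rather than by LLT, so any permanent fix would belong there, not in LLT.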

VCS clustered Enterprise Vault: migrate the Index service to a new node

Hi, we have a three-node Enterprise Vault cluster under Veritas Cluster Server. The node details are as follows:

Node 1: EV server with all services
Node 2: SQL server
Node 3: Spare server for EV service and SQL service failover

GCO and VVR are also configured for the SQL server, and the EV server volumes are replicated with VVR but are not under VCS control. We now want to remove the Index service from the EV server (node 1), add it to a new server, and add that new server to the cluster. Can anybody share the steps we should follow on the VCS side to perform these changes?
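
On the VCS side alone, extending a service group to a newly joined cluster node generally comes down to editing its SystemList and AutoStartList; a sketch, with EV_SG and newnode as placeholder names:

# haconf -makerw
# hagrp -modify EV_SG SystemList -add newnode 3
# hagrp -modify EV_SG AutoStartList -add newnode
# haconf -dump -makero

The new node must first be added to the cluster itself (llthosts, llttab, and gabtab on all nodes) before any group can list it, and the EV-specific service migration is done with the Enterprise Vault tooling rather than VCS.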

Listener resource remains faulted

Hello, we are doing some failure tests for a customer. We have VCS 6.2 running on Solaris 10, with an Oracle database and, of course, its associated listener. We simulate different kinds of failures; one of them is killing the listener. In this situation the cluster notices that the listener has died and fails the service over to the other node, BUT the listener resource remains in FAULTED state on the original node, and the group it belongs to goes to OFFLINE|FAULTED. If something then goes wrong on the second node, the service will not fail back to the original one until we manually run hagrp -clear. Is there anything we can do to fix this (to have the clear done automatically)?

Here are some lines from the log:

2015/03/30 17:26:10 VCS ERROR V-16-2-13067 (node2p) Agent is calling clean for resource(ora_listener-res) because the resource became OFFLINE unexpectedly, on its own.
2015/03/30 17:26:11 VCS INFO V-16-2-13068 (node2p) Resource(ora_listener-res) - clean completed successfully.
2015/03/30 17:26:11 VCS INFO V-16-1-10307 Resource ora_listener-res (Owner: Unspecified, Group: oracle_rg) is offline on node2p (Not initiated by VCS)

These say that clean completed successfully, yet the resource is still faulted; but if I run hares -clear manually, the fault goes away:

20150330-173628:root@node1p:~# hares -state ora_listener-res
#Resource Attribute System Value
ora_listener-res State node1p ONLINE
ora_listener-res State node2p FAULTED
20150330-173636:root@node1p:~# hares -clear ora_listener-res
20150330-173653:root@node1p:~# hares -state ora_listener-res
#Resource Attribute System Value
ora_listener-res State node1p ONLINE
ora_listener-res State node2p OFFLINE
20150330-173655:root@node1p:~#
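
One common pattern for clearing such faults automatically is the resfault event trigger, which the engine invokes whenever a resource faults; a sketch of the idea follows (the script body is illustrative, and auto-clearing faults can mask real failures, so use it with care):

# cat /opt/VRTSvcs/bin/triggers/resfault
#!/bin/sh
# Arguments passed by the engine: system, resource name, previous state.
SYSTEM=$1
RESOURCE=$2
# Hypothetical example: clear only the listener resource's fault.
if [ "$RESOURCE" = "ora_listener-res" ]; then
    /opt/VRTSvcs/bin/hares -clear "$RESOURCE" -sys "$SYSTEM"
fi

The trigger also has to be enabled for the group, e.g. hagrp -modify oracle_rg TriggersEnabled -add RESFAULT.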

vxfen module causes system panic after I/O fencing

I have a three-node cluster configuration: vcs1, vcs2, vcs3, using three iSCSI disks (served by the SCST iSCSI target simulator) as the fencing disk group fendg.

# vxfenadm -d

I/O Fencing Cluster Information:
================================

Fencing Protocol Version: 201
Fencing Mode: SCSI3
Fencing SCSI3 Disk Policy: dmp
Cluster Members:

* 0 (vcs1)
1 (vcs2)
2 (vcs3)

RFSM State Information:
node 0 in state 8 (running)
node 1 in state 8 (running)
node 2 in state 8 (running)

# vxfenadm -s all -f /etc/vxfentab

Device Name: /dev/vx/rdmp/disk_1s3
Total Number Of Keys: 3
key[0]:
[Numeric Format]: 86,70,49,48,53,50,48,48
[Character Format]: VF105200
* [Node Format]: Cluster ID: 4178 Node ID: 0 Node Name: vcs1
key[1]:
[Numeric Format]: 86,70,49,48,53,50,48,49
[Character Format]: VF105201
* [Node Format]: Cluster ID: 4178 Node ID: 1 Node Name: vcs2
key[2]:
[Numeric Format]: 86,70,49,48,53,50,48,50
[Character Format]: VF105202
* [Node Format]: Cluster ID: 4178 Node ID: 2 Node Name: vcs3

Device Name: /dev/vx/rdmp/disk_0s3
Total Number Of Keys: 3
key[0]:
[Numeric Format]: 86,70,49,48,53,50,48,48
[Character Format]: VF105200
* [Node Format]: Cluster ID: 4178 Node ID: 0 Node Name: vcs1
key[1]:
[Numeric Format]: 86,70,49,48,53,50,48,49
[Character Format]: VF105201
* [Node Format]: Cluster ID: 4178 Node ID: 1 Node Name: vcs2
key[2]:
[Numeric Format]: 86,70,49,48,53,50,48,50
[Character Format]: VF105202
* [Node Format]: Cluster ID: 4178 Node ID: 2 Node Name: vcs3

Device Name: /dev/vx/rdmp/disk_2s3
Total Number Of Keys: 3
key[0]:
[Numeric Format]: 86,70,49,48,53,50,48,48
[Character Format]: VF105200
* [Node Format]: Cluster ID: 4178 Node ID: 0 Node Name: vcs1
key[1]:
[Numeric Format]: 86,70,49,48,53,50,48,49
[Character Format]: VF105201
* [Node Format]: Cluster ID: 4178 Node ID: 1 Node Name: vcs2
key[2]:
[Numeric Format]: 86,70,49,48,53,50,48,50
[Character Format]: VF105202
* [Node Format]: Cluster ID: 4178 Node ID: 2 Node Name: vcs3

# lltstat -l
LLT link information:
link 0 eth1 on ether hipri
mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
txpkts 190145 txbytes 23138275
rxpkts 174420 rxbytes 11540391
latehb 0 badcksum 0 errors 0
link 1 eth0 on ether lowpri
mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
txpkts 71940 txbytes 3495901
rxpkts 73537 rxbytes 3617055
latehb 0 badcksum 0 errors 0

After I disconnected the network links of vcs3, vcs1 took over the application that was running on vcs3.

After waiting several minutes with the log message "VCS waiting for I/O fencing to be completed", vcs3 showed a kernel panic like this:

BUG: unable to handle kernel paging request at ffffffff00000019
[32353.581223] IP: [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.581991] PGD 1806067 PUD 0
[32353.582446] Oops: 0000 [#1] SMP
[32353.582928] last sysfs file: /sys/devices/system/node/node0/cpumap
[32353.583751] CPU 0
[32353.584031] Modules linked in: vxodm(PN) vxfen(PN) dmpjbod(PN) dmpap(PN) dmpaa(PN) vxspec(PN) vxio(PN) vxdmp(PN) binfmt_misc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device gab(PN) crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet llt(PN) microcode amf(PN) fuse loop vxportal(PN) fdd(PN) vxfs(PN) exportfs dm_mod virtio_console virtio_balloon virtio_net rtc_cmos snd_hda_intel rtc_core snd_hda_codec rtc_lib snd_hwdep snd_pcm tpm_tis virtio_pci snd_timer tpm sym53c8xx virtio_ring snd button sg tpm_bios floppy pcspkr virtio scsi_transport_spi i2c_piix4 soundcore i2c_core snd_page_alloc uhci_hcd sd_mod crc_t10dif ehci_hcd usbcore edd ext3 mbcache jbd fan processor ide_pci_generic piix ide_core ata_generic ata_piix libata scsi_mod thermal thermal_sys hwmon
[32353.584031] Supported: Yes, External
[32353.584031] Pid: 4730, comm: vxfen Tainted: P 2.6.32.12-0.7-default #1 Bochs
[32353.584031] RIP: 0010:[<ffffffff810399b5>] [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.584031] RSP: 0018:ffff88006c6b5cc0 EFLAGS: 00010086
[32353.584031] RAX: ffffffff00000001 RBX: 0000000000013680 RCX: dead000000100100
[32353.584031] RDX: 0000000000000000 RSI: ffff88006c6b5d00 RDI: ffffffff81ab2df0
[32353.584031] RBP: ffff88006c6b5ce0 R08: 00000000000005db R09: 000000000000000a
[32353.584031] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000013680
[32353.584031] R13: ffffffff81ab2df0 R14: ffff88006c6b5d00 R15: 0000000000000000
[32353.584031] FS: 0000000000000000(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
[32353.584031] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[32353.584031] CR2: ffffffff00000019 CR3: 0000000037d1b000 CR4: 00000000000406f0
[32353.584031] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32353.584031] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32353.584031] Process vxfen (pid: 4730, threadinfo ffff88006c6b4000, task ffff88006d98c580)
[32353.584031] Stack:
[32353.584031] 0000000000000000 00000001007a4653 ffffffff81ab2df0 ffff880062de34b0
[32353.584031] <0> ffff88006c6b5d30 ffffffff81040e5a 000000000000000f ffff88007c0de148
[32353.584031] <0> 0000000000000086 ffff88007c0de140 00000001007a4653 0000000000000001
[32353.584031] Call Trace:
[32353.584031] [<ffffffff81040e5a>] try_to_wake_up+0x4a/0x340
[32353.584031] [<ffffffff810682a8>] up+0x48/0x50
[32353.584031] [<ffffffffa0d5ed4a>] vxfen_bcast_lost_race_msg+0x8a/0x1b0 [vxfen]
[32353.584031] [<ffffffffa0d5f63d>] vxfen_grab_coord_pt_30+0x76d/0x830 [vxfen]
[32353.584031] [<ffffffffa0d5fbe7>] vxfen_grab_coord_pt+0x87/0x1a0 [vxfen]
[32353.584031] [<ffffffffa0d6eb7c>] vxfen_msg_node_left_ack+0x22c/0x330 [vxfen]
[32353.584031] [<ffffffffa0d70f22>] vxfen_process_client_msg+0x7d2/0xb30 [vxfen]
[32353.584031] [<ffffffffa0d716db>] vxfen_vrfsm_cback+0x45b/0x17b0 [vxfen]
[32353.584031] [<ffffffffa0d8cb20>] vrfsm_step+0x1b0/0x3b0 [vxfen]
[32353.584031] [<ffffffffa0d8ee1c>] vrfsm_recv_thread+0x32c/0x970 [vxfen]
[32353.584031] [<ffffffffa0d8f5b4>] vxplat_lx_thread_base+0xa4/0x100 [vxfen]
[32353.584031] [<ffffffff81003fba>] child_rip+0xa/0x20
[32353.584031] Code: 6c 24 10 4c 89 74 24 18 49 89 fd 48 89 1c 24 49 89 f6 4c 89 64 24 08 49 c7 c4 80 36 01 00 9c 58 fa 49 89 06 49 8b 45 08 4c 89 e3 <8b> 40 18 48 03 1c c5 40 dc 91 81 48 89 df e8 68 d5 35 00 49 8b
[32353.584031] RIP [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.584031] RSP <ffff88006c6b5cc0>
[32353.584031] CR2: ffffffff00000019
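
If a node that lost the fencing race is later unable to rejoin because stale SCSI-3 registration keys remain on the coordinator disks, the keys can be inspected and, with the whole cluster stopped, removed; a sketch (the vxfenclearpre path may vary by release, and it must only be run when no node is using the disks):

# vxfenadm -s all -f /etc/vxfentab      # inspect remaining keys
# /opt/VRTSvcs/vxfen/bin/vxfenclearpre  # clear stale keys, cluster down

The Oops itself, occurring inside vxfen_grab_coord_pt during the race for the coordinator points, looks like a product defect worth raising with support rather than a configuration issue.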

Unable to bring the Service Group online

Hi all, I tried to bring a service group online on a node, but it is not coming online. Let me explain the issue. We rebooted node aixprd001 and found that /etc/filesystems was corrupted, so the service group bosinit_SG was in PARTIAL state, since many of its cluster file systems were not mounted. We corrected the entries, mounted all the file systems manually, and, as the group still showed PARTIAL, ran the command below:

hagrp -clear bosinit_SG -all

After that the group was in ONLINE state. To be safe we took the group offline and tried to bring it online again, but it failed to come online. Below are the only related entries we can find in engine_A.log:

2014/12/17 06:49:04 VCS NOTICE V-16-1-10166 Initiating manual online of group bosinit_SG on system aixprd001
2014/12/17 06:49:04 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group bosinit_SG on all nodes

Please help with suggestions; I will provide log output if needed.

Thanks, Rufus
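
When an online request logs only those two NOTICE lines and then nothing, the usual next step is to see which resource the engine is waiting on; a few standard checks (group and node names as in the post, everything else generic):

# hastatus -sum
# hares -display -group bosinit_SG -attribute State IState
# hagrp -flush bosinit_SG -sys aixprd001   # clear a stuck pending-online, then retry

An IState such as W_ONLINE shows which resource is stuck; after hand-mounting the file systems, a Mount resource is the likely suspect, since VCS will refuse to online a Mount whose file system is already mounted outside its control.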

Volume is mounted on two split-brain nodes in VCS 6.0

I have built a three-node cluster using VCS 6.0 on SLES 11 SP1. Here is the configuration:

main.cf:

include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"

cluster vcscluster (
    ClusterAddress = "192.168.4.10"
    SecureClus = 1
    UseFence = SCSI3
    )

system vcs1 (
    )

system vcs2 (
    )

system vcs3 (
    )

group ClusterService (
    SystemList = { vcs1 = 0, vcs2 = 1, vcs3 = 2 }
    AutoStartList = { vcs1, vcs2, vcs3 }
    OnlineRetryLimit = 3
    OnlineRetryInterval = 120
    )

    IP webip (
        Device = eth0
        Address = "192.168.4.10"
        NetMask = "255.255.255.0"
        )

    NIC csgnic (
        Device = eth0
        )

    webip requires csgnic

    // resource dependency tree
    //
    // group ClusterService
    // {
    // IP webip
    //     {
    //     NIC csgnic
    //     }
    // }

group apache (
    SystemList = { vcs1 = 0, vcs2 = 1, vcs3 = 2 }
    AutoStartList = { vcs1 }
    )

    DiskGroup share_dg (
        DiskGroup = share_dg
        )

    Mount apache_fs (
        MountPoint = "/srv/www/htdocs"
        BlockDevice = "/dev/vx/dsk/share_dg/apache"
        FSType = vxfs
        FsckOpt = "-y"
        )

    apache_fs requires share_dg

    // resource dependency tree
    //
    // group apache
    // {
    // Mount apache_fs
    //     {
    //     DiskGroup share_dg
    //     }
    // }

# lltstat -l
LLT link information:
link 0 eth1 on ether hipri
mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
txpkts 129728 txbytes 14155153
rxpkts 119866 rxbytes 7909769
latehb 0 badcksum 0 errors 0
link 1 eth2 on ether lowpri
mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
txpkts 49369 txbytes 2400217
rxpkts 50476 rxbytes 2480391
latehb 0 badcksum 0 errors 0

# vxfenconfig -l
I/O Fencing Configuration Information:
======================================
Single Disk Flag : 0
Count : 3
Disk List
Disk Name Major Minor Serial Number Policy
/dev/vx/rdmp/disk_2s3 201 67 7ae525da dmp
/dev/vx/rdmp/disk_1s3 201 51 27cddc71 dmp
/dev/vx/rdmp/disk_0s3 201 35 132a74e8 dmp

# vxfenadm -d

I/O Fencing Cluster Information:
================================

Fencing Protocol Version: 201
Fencing Mode: SCSI3
Fencing SCSI3 Disk Policy: dmp
Cluster Members:

* 0 (vcs1)
1 (vcs2)
2 (vcs3)

RFSM State Information:
node 0 in state 8 (running)
node 1 in state 8 (running)
node 2 in state 8 (running)

# hastatus -summary

-- SYSTEM STATE
-- System State Frozen
A vcs1 RUNNING 0
A vcs2 RUNNING 0
A vcs3 RUNNING 0

-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService vcs1 Y N ONLINE
B ClusterService vcs2 Y N OFFLINE
B ClusterService vcs3 Y N OFFLINE
B apache vcs1 Y N OFFLINE
B apache vcs2 Y N OFFLINE
B apache vcs3 Y N ONLINE

# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 7218432 4356176 2495576 64% /
devtmpfs 995788 212 995576 1% /dev
tmpfs 995788 0 995788 0% /dev/shm
tmpfs 4 0 4 0% /dev/vx
/dev/vx/dsk/share_dg/apache 512000 3285 476928 1% /srv/www/htdocs

When I disconnect the network links eth1/eth2 on vcs3, apache is brought up on vcs1, but when I check vcs3, the mount point still exists, and after several minutes a kernel panic occurs on vcs3. I think it is very dangerous that a volume can be mounted on two split-brain nodes. How can I prevent this from happening?
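
With SCSI-3 fencing the data disk group itself can also be protected, so that I/O from the ejected node to share_dg is rejected at the disk even before that node panics; a sketch using two DiskGroup agent attributes (values shown are the usual ones, verify against the bundled-agents guide for your release):

# haconf -makerw
# hares -modify share_dg Reservation "SCSI3"
# hares -modify share_dg MonitorReservation 1
# haconf -dump -makero

With the reservation in place, the stale mount on the fenced node may still appear in df during the window before the panic, but writes from that node fail, which removes the data-corruption risk.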

Application service returns "The program exited with return code <0>"

Hi all, I configured my application in VCS, and when I try to online the service it actually starts (I can check via the service status command), but VCS reports the error "The program exited with return code <0>". Here are the main.cf parameters for this service:

Application dfm-ocie (
    StartProgram = "/etc/init.d/ocie start"
    StopProgram = "/etc/init.d/ocie stop"
    MonitorProcesses = { classpath, "/opt/netapp/essentials/jboss/lib/jboss-logmanager.jar" }
    )

2014/12/12 11:57:18 VCS INFO V-16-10031-509 (vmlnx64-xyz) Application:dfm-ocie:online:Executed </etc/init.d/ocie> as user <root>. The program exited with return code <0>.
2014/12/12 11:57:19 VCS INFO V-16-2-13716 (vmlnx64-xyz) Resource(dfm-ocie): Output of the completed operation (online)
==============================================
Starting NetApp OnCommand Insight Essentials Server service. This may take couple of minutes
Successfully started NetApp OnCommand Insight Essentials Server service
==============================================
2014/12/12 11:59:20 VCS ERROR V-16-2-13066 (vmlnx64-xyz) Agent is calling clean for resource(dfm-ocie) because the resource is not up even after online completed.
2014/12/12 11:59:21 VCS INFO V-16-2-13068 (vmlnx64-xyz) Resource(dfm-ocie) - clean completed successfully.
2014/12/12 11:59:21 VCS INFO V-16-2-13071 (vmlnx64-xyz) Resource(dfm-ocie): reached OnlineRetryLimit(0).
2014/12/12 11:59:23 VCS ERROR V-16-1-54031 Resource dfm-ocie (Owner: Unspecified, Group: dfmgrpkjag) is FAULTED on sys vmlnx64-xyz

Is there anything I am missing here? Also, can someone explain what MonitorProcesses does?

- Jaga
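
MonitorProcesses is the Application agent's liveness check: each entry must be the complete command line of a process exactly as it appears in the process table, and the resource is reported online only while every listed entry matches a running process. The snippet above lists a bare word (classpath) and a jar path rather than full command lines, so the monitor never finds a match and the agent calls clean even though the start succeeded. A sketch of the fix, with the command line as a placeholder to be copied verbatim from ps -ef on the running system:

Application dfm-ocie (
    StartProgram = "/etc/init.d/ocie start"
    StopProgram = "/etc/init.d/ocie stop"
    MonitorProcesses = { "<full java command line exactly as shown by ps -ef>" }
    )

An alternative is to drop MonitorProcesses and point MonitorProgram at a script (for example one wrapping /etc/init.d/ocie status) that exits 110 for online and 100 for offline, per the Application agent's exit-code convention.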

SG is not switching to the next node

Hi all, I am new to VCS but good in HACMP. In our environment we are using VCS 6.0. On one cluster we found that a service group does not move from one node to the other when we try a manual failover with the command below:

hagrp -switch <SGname> -to <sysname>

We can see that the service group goes offline on the current node, but it does not come online on the secondary node. No error is logged in engine_A.log except the entry below:

cpus load more than 60% <secondary node name>

Can anyone help me find a solution for this? I will provide the output of any commands if you need more information. :)

Thanks,
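
That log line does not look like a stock engine message, which suggests a custom preonline trigger on the group is vetoing the online when the target node's CPU load is above 60%. A few hedged checks (the group name is a placeholder):

# hagrp -display <SGname> -attribute PreOnline TriggersEnabled
# ls /opt/VRTSvcs/bin/triggers

If PreOnline is 1 (or PREONLINE appears in TriggersEnabled) and a preonline script exists, its logic decides whether the online proceeds; the switch would then succeed once the load on the target node drops, or once the threshold coded in the script is changed.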