08-13-2014 08:38 AM
Hello Experts,
I am facing a simple but slightly tricky issue in VCS.
A non-existent system with the hostname "-i" has been added to the cluster. I am not aware of how this system was added to the list. Find the system list below.
bash-3.00# hasys -list
-i
MMINDIA01
MMINDIA02
MMINDIA03
MMINDIA04
I tried "hasys -delete -i" and "hasys -delete "/i"" but had no success.
Kindly help on priority.
Regards,
Amit Mane
08-13-2014 10:25 PM
Hi,
Please post the output of gabconfig -a to see if this system is present there as well. If it is not, then post or PM /etc/VRTSvcs/conf/config/main.cf. If it is only in the main.cf, it should be easy to correct.
08-13-2014 11:32 PM
Hi Amit,
First check the /etc/hosts file, as somebody may have changed the hostname; also provide the hastatus -sum output.
Before you remove any node from the cluster, please verify how many nodes the cluster is configured with by running a few commands:
cat /etc/gabtab
cat /etc/hosts
cat /etc/VRTSvcs/conf/config/main.cf|grep -i system
Please first verify that nobody changed the hostname.
If you think node "-i" is not part of the cluster, then below is the procedure to remove a node from the cluster.
1 Switch any service groups running on the leaving node to another node:
# hagrp -switch group -to <another node>
2 Check for any dependencies involving any service groups that run on the
leaving node; for example, grp4 runs only on the leaving node.
# hagrp -dep
3 If the service group on the leaving node requires other service groups, that
is, if it is a parent to service groups on other nodes, then unlink the service
groups.
# haconf -makerw
# hagrp -unlink group <another node>
These commands enable you to edit the configuration and to remove the
requirement grp4 has for grp1.
4 Stop VCS on the leaving node:
# hastop -sys C
5 Check the status again. The state of the leaving node should be EXITED. Also,
any service groups set up for failover should be online on other nodes:
# hastatus -summary
-- SYSTEM STATE
-- System State Frozen
A A RUNNING 0
A B RUNNING 0
A C EXITED 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B grp1 A Y N ONLINE
B grp1 B Y N OFFLINE
B grp2 A Y N ONLINE
B grp3 B Y N ONLINE
B grp3 C Y Y OFFLINE
B grp4 C Y N OFFLINE
6 Delete the leaving node from the SystemList of service groups grp3 and
grp4.
# hagrp -modify <groupA> SystemList -delete C
# hagrp -modify <groupB> SystemList -delete C
7 For service groups that run only on the leaving node, delete the resources
from the group before deleting the group.
# hagrp -resources <groupB>
processx_grp4
processy_grp4
# hares -delete <dependent resources>
8 Delete the service group configured to run on the leaving node.
# hagrp -delete groupA
9 Check the status.
# hastatus -summary
-- SYSTEM STATE
-- System State Frozen
A A RUNNING 0
A B RUNNING 0
A C EXITED 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B grp1 A Y N ONLINE
B grp1 B Y N OFFLINE
B grp2 A Y N ONLINE
B grp3 B Y N ONLINE
10 Delete the node from the cluster.
# hasys -delete C
11 Save the configuration, making it read only.
# haconf -dump -makero
Modifying configuration files on each remaining node
Perform the following tasks on each of the remaining nodes of the cluster.
To modify the configuration files on a remaining node
1 If necessary, modify the /etc/gabtab file.
No change is required to this file if the /sbin/gabconfig command has only
the argument -c, although Symantec recommends using the -nN option,
where N is the number of cluster systems.
If the command has the form /sbin/gabconfig -c -nN, where N is the
number of cluster systems, then make sure that N is not greater than the
actual number of nodes in the cluster, or GAB does not automatically seed.
Note: Symantec does not recommend the use of the -c -x option for /sbin/
gabconfig. The Gigabit Ethernet controller does not support the use of
-c -x.
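For example, on a cluster going from 4 nodes to 3, /etc/gabtab on each remaining node would typically contain a line like the following (an illustration of the -nN rule above, not taken from this cluster):
/sbin/gabconfig -c -n3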
2 Modify the /etc/llthosts file on each remaining node to remove the entry of the
leaving node.
For example, change:
0 A
1 B
2 C
to:
0 A
1 B
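As a sketch of that llthosts edit (a hypothetical example using the generic node names above and temporary file paths, not the actual files from this cluster), the leaving node's line can be filtered out into a new file:

```shell
# Hypothetical /etc/llthosts with three nodes; node C (LLT id 2) is leaving
printf '0 A\n1 B\n2 C\n' > /tmp/llthosts
# Keep every line except the leaving node's entry
grep -v '^2 C$' /tmp/llthosts > /tmp/llthosts.new
cat /tmp/llthosts.new
```

On a real cluster you would of course edit /etc/llthosts in place, on each remaining node.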
Unloading LLT and GAB and removing VCS on the leaving node
Perform the tasks on the node leaving the cluster.
To stop LLT and GAB and remove VCS
1 Stop GAB and LLT:
# /etc/init.d/gab stop
# /etc/init.d/llt stop
2 To determine the RPMs to remove, enter:
# rpm -qa | grep VRTS
08-14-2014 05:44 AM
Please post /etc/llthosts and "system" section of main.cf
I tried to add host "-i" on a test system and I could add it to llthosts, but I could not add it to main.cf, either by editing (the syntax is rejected) or by using the hasys command.
Mike
08-18-2014 01:02 AM
Please find requested details.
# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 26eef6e membership 0123
Port b gen 26eef6f membership 0123
Port d gen 26eef71 membership 0123
Port f gen 26ef015 membership 0123
Port h gen 26ef013 membership 0123
Port u gen 26ef013 membership 0123
Port v gen 26ef00f membership 0123
Port w gen 26ef011 membership 0123
# hostname
MHOGW04
# Connection to MHOGW04 closed.
# hostname
MHOGW03
# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 26eef6e membership 0123
Port b gen 26eef6f membership 0123
Port d gen 26eef71 membership 0123
Port f gen 26ef015 membership 0123
Port h gen 26ef013 membership 0123
Port u gen 26ef013 membership 0123
Port v gen 26ef00f membership 0123
Port w gen 26ef011 membership 0123
# Connection to MHOGW03 closed.
# hostname
MHOGW02
# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 26eef6e membership 0123
Port b gen 26eef6f membership 0123
Port d gen 26eef71 membership 0123
Port f gen 26ef015 membership 0123
Port h gen 26ef013 membership 0123
Port u gen 26ef013 membership 0123
Port v gen 26ef00f membership 0123
Port w gen 26ef011 membership 0123
# Connection to MHOGW02 closed.
bash-3.00# hostname
MHOGW01
bash-3.00# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 26eef6e membership 0123
Port b gen 26eef6f membership 0123
Port d gen 26eef71 membership 0123
Port f gen 26ef015 membership 0123
Port h gen 26ef013 membership 0123
Port u gen 26ef013 membership 0123
Port v gen 26ef00f membership 0123
Port w gen 26ef011 membership 0123
bash-3.00#
# hastatus -summ | more
-- SYSTEM STATE
-- System State Frozen
A -i FAULTED 0
A MHOGW01 RUNNING 0
A MHOGW02 RUNNING 0
A MHOGW03 RUNNING 0
A MHOGW04 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
bash-3.00# more /etc/VRTSvcs/conf/config/main.cf
include "types.cf"
include "mediationtypes.cf"
include "CFSTypes.cf"
include "CVMTypes.cf"
cluster MultiMediation (
UserNames = { admin = INOkNPnMNsNM }
Administrators = { admin }
UseFence = SCSI3
HacliUserLevel = COMMANDROOT
)
system MHOGW01 (
)
system MHOGW02 (
)
system MHOGW03 (
)
system MHOGW04 (
)
---------------------------
bash-3.00# hasys -list
MHOGW01
MHOGW02
MHOGW03
MHOGW04
bash-3.00#
---------------------------
bash-3.00# more /etc/VRTSvcs/conf/config/main.cf
include "types.cf"
include "mediationtypes.cf"
include "CFSTypes.cf"
include "CVMTypes.cf"
cluster MultiMediation (
UserNames = { admin = INOkNPnMNsNM }
Administrators = { admin }
UseFence = SCSI3
HacliUserLevel = COMMANDROOT
)
system "-i" (
)
system MHOGW01 (
)
system MHOGW02 (
)
system MHOGW03 (
)
system MHOGW04 (
)
---------------------------
bash-3.00# hasys -list
-i
MHOGW01
MHOGW02
MHOGW03
MHOGW04
bash-3.00#
---------------------------
Also note that I am not able to verify the main.cf with hacf -verify on nodes 2, 3 and 4.
Regards,
Amit Mane
08-18-2014 01:09 AM
Dear Naveen,
Thanks for reply!!
As I mentioned in the forum, it is a nonexistent system. We have a 4-node cluster, and someone added "-i" to the config, which does not exist on the network.
Still, find the required details below.
Node1:
# cat /etc/hosts
GW02
#
# Internet host table
#
::1 localhost
127.0.0.1 localhost
IP MHOGW02
IP MHOGW01 loghost
IP MHOGW03
IP MHOGW04
Node2,3,4
# cat /etc/hosts
GW02
#
# Internet host table
#
::1 localhost
127.0.0.1 localhost
IP MHOGW02 loghost
IP MHOGW01
IP MHOGW03
IP MHOGW04
Regards,
Amit Mane
08-18-2014 01:18 AM
Hello Mike,
Thanks for the help!!
Find the requested details below.
bash-3.00# hostname
MHOGW01
bash-3.00# cat /etc/llthosts
0 MHOGW01
1 MHOGW02
2 MHOGW03
3 MHOGW04
bash-3.00#
# hostname
MHOGW02
# cat /etc/llthosts
0 MHOGW01
1 MHOGW02
2 MHOGW03
3 MHOGW04
# hostname
MHOGW03
# cat /etc/llthosts
0 MHOGW01
1 MHOGW02
2 MHOGW03
3 MHOGW04
#
# cat /etc/llthosts
0 MHOGW01
1 MHOGW02
2 MHOGW03
3 MHOGW04
#
bash-3.00# more /etc/VRTSvcs/conf/config/main.cf
include "types.cf"
include "mediationtypes.cf"
include "CFSTypes.cf"
include "CVMTypes.cf"
cluster MultiMediation (
UserNames = { admin = INOkNPnMNsNM }
Administrators = { admin }
UseFence = SCSI3
HacliUserLevel = COMMANDROOT
)
system MHOGW01 (
)
system MHOGW02 (
)
system MHOGW03 (
)
system MHOGW04 (
)
bash-3.00# more /etc/VRTSvcs/conf/config/main.cf
include "types.cf"
include "mediationtypes.cf"
include "CFSTypes.cf"
include "CVMTypes.cf"
cluster MultiMediation (
UserNames = { admin = INOkNPnMNsNM }
Administrators = { admin }
UseFence = SCSI3
HacliUserLevel = COMMANDROOT
)
system "-i" (
)
system MHOGW01 (
)
system MHOGW02 (
)
system MHOGW03 (
)
system MHOGW04 (
)
I tried to verify the config on nodes 2, 3 and 4 but was not able to verify it.
Node1
bash-3.00# hasys -list
MHOGW01
MHOGW02
MHOGW03
MHOGW04
Node 2,3 & 4
# hasys -list
-i
MHOGW01
MHOGW02
MHOGW03
MHOGW04
#
Regards,
Amit Mane
08-18-2014 07:41 AM
I don't understand why 3 of the 4 cluster nodes have a different main.cf.
The 1st node to start up in a multi-node cluster will do a 'local build' - reading the local main.cf and loading that config into memory.
Subsequent nodes that start up should find a currently running config, do a 'remote build' and update their local main.cf. This ensures that all nodes have the same config.
Since the incorrect entry only appears in the main.cf on the 3 nodes, it will be easy to fix without bringing down any applications, but it is more important now to see the state of cluster membership.
Please show us the output of 'gabconfig -a' on all nodes.
**** EDIT *****
I found the output of gabconfig in your quarantined post and published it.
The fact that all nodes show correct cluster membership makes the different output on node 1 even more strange...
08-18-2014 08:22 AM
Well that is very strange indeed.
I agree with Mike; I also could not in any way insert a node name of "-i" into the cluster, either by CLI or direct editing of the main.cf file.
I also agree with Marianne that if a node had a bad configuration and it was the first to boot, it would not form a cluster; but when one of the other nodes with a good main.cf file did start VCS, it would do a local build, and only then would node 1 (with the bad main.cf) do a remote build and overwrite its main.cf file.
The fact that node 1 now has an incorrect main.cf means that somebody with root authority has directly modified the file after the node joined the currently running cluster.
On the nodes that show you the incorrect system name, you can attempt the following command, though I doubt it will work:
-$ hasys -delete %-i
Probably, the best thing to do is to fix the bad file on node 1 simply by opening the cluster configuration and then closing it again -- this will overwrite every main.cf on every node in the cluster with a good copy. Here's the CLI to do that:
-$ haconf -makerw
-$ haconf -dump -makero
Once you have done that, if the bad node name still persists on the three nodes that "reveal" it, I guess you could then stop the cluster and restart it.
Something tells me that there is more to this story, as the situation you describe should not really be possible -- I directly modified the main.cf file and put in 'system "-i" ( )', but I could not re-create your result of having the bad system name revealed via the "hasys -list" command -- neither from running that command on the host with the good main.cf (as I believe you describe), nor from running it on the host with the bad main.cf.
08-18-2014 08:27 AM
I did not notice if you provided the relevant release levels, but I see you are on Solaris -- FYI, my testing was done on "VRTSvcs 6.0.300.000" and "Red Hat Enterprise Linux Server release 6.4 (Santiago)".
08-18-2014 09:19 AM
Hi,
It's Solaris 10 and VRTSvcs 5.1.
Regards,
Amit Mane
08-18-2014 09:22 AM
Dear Mike,
This problem occurred because the hostname of one of the nodes was changed: someone ran the command 'hostname -i'.
If you have a test system, can you please check by changing the hostname of the server with "hostname -i"?
Share your observations with me.
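To illustrate the suspected root cause (a sketch based on documented Solaris 10 behaviour, not reproduced in this thread): Solaris 10's hostname command treats its operand as the new system name rather than parsing it as an option, so running it with -i renames the host to the literal string "-i":
# hostname -i       (on Solaris 10 this SETS the hostname to "-i")
# hostname
-i
On Linux, by contrast, "hostname -i" prints the host's IP address, which is probably what whoever ran it expected.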
Thanks in advance.
Regards,
Amit Mane
08-18-2014 10:07 AM
Hi Amit,
I would run hastop -all -force on node 1 (the one with the correct main.cf).
Once all are down, run hastart on node 1. Then, once it is up, run hastart on the remaining nodes.
They should then pull the correct main.cf and rebuild their own.
It is weird, though, for nodes 2, 3 and 4 to see something different from node 1. And they have not partitioned into mini-clusters, as their ids are all the same.
08-18-2014 10:50 AM
I managed to replicate what you have by putting "-i" in /etc/VRTSvcs/conf/sysname; when you do this and start VCS, it adds host "-i" to the main.cf.
So to fix this I would:
Stop VCS on all nodes, but leave the applications running:
# hastop -all -force
Correct /etc/VRTSvcs/conf/sysname
Start VCS on the node with the correct main.cf (node 1)
Start VCS on the other nodes (this will build the main.cf from node 1, so there is no need to edit main.cf on the other nodes)
Mike
08-18-2014 08:16 PM
Hi Mike,
I checked /etc/VRTSvcs/conf/sysname on all the nodes, but there is no such entry in sysname on any of the 4 nodes.
I think you are aware that the main.cf on node 1 is correct, but the main.cf on the other 3 nodes has the "-i" entry.
Kindly let me know whether "hastop -all -force" will affect the running services in the cluster, as all the nodes have some running services.
bash-3.00# hastatus -summ | grep ONLINE
B DG1 MHOGW01 Y N ONLINE
B DG1 MHOGW02 Y N ONLINE
B DG1 MHOGW03 Y N ONLINE
B DG1 MHOGW04 Y N ONLINE
B Network MHOGW01 Y N ONLINE
B Network MHOGW02 Y N ONLINE
B Network MHOGW03 Y N ONLINE
B Network MHOGW04 Y N ONLINE
B ALARM MHOGW01 Y N ONLINE
B MANAGER MHOGW01 Y N ONLINE
B SERVER MHOGW01 Y N ONLINE
B SERVER10 MHOGW04 Y N ONLINE
B SERVER2 MHOGW03 Y N ONLINE
B SERVER3 MHOGW03 Y N ONLINE
B SERVER4 MHOGW01 Y N ONLINE
B SERVER5 MHOGW04 Y N ONLINE
B SERVER6 MHOGW02 Y N ONLINE
B SERVER7 MHOGW04 Y N ONLINE
B SERVER8 MHOGW01 Y N ONLINE
B SERVER9 MHOGW04 Y N ONLINE
B TRACER MHOGW01 Y N ONLINE
B Sentinel MHOGW01 Y N ONLINE
B cvm MHOGW01 Y N ONLINE
B cvm MHOGW02 Y N ONLINE
B cvm MHOGW03 Y N ONLINE
B cvm MHOGW04 Y N ONLINE
Node2,3,4
bash-3.00# hastatus -summ
-- SYSTEM STATE
-- System State Frozen
A -i FAULTED 0
A MHOGW01 RUNNING 0
A MHOGW02 RUNNING 0
A MHOGW03 RUNNING 0
A MHOGW04 RUNNING 0
Node 1:
bash-3.00# hastatus -summ | more
-- SYSTEM STATE
-- System State Frozen
A MHOGW01 RUNNING 0
A MHOGW02 RUNNING 0
A MHOGW03 RUNNING 0
A MHOGW04 RUNNING 0
08-18-2014 08:31 PM
I can see that currently Node 3 is the master node; this is for your information.
bash-3.00# /etc/vx/bin/vxclustadm nidmap
Name CVM Nid CM Nid State
MHOGW01 0 0 Joined: Slave
MHOGW02 1 1 Joined: Slave
MHOGW03 2 2 Joined: Master
MHOGW04 3 3 Joined: Slave
08-19-2014 12:01 AM
As per Mike's previous post, 'hastop -all -force' will stop VCS on all nodes, but leave apps running.
So, we want node1 with correct main.cf to start 'had' and load correct config into memory.
Run hastatus in one window to view continuous progress. (It will firstly say 'cannot connect...' while had is down on all nodes.)
After stopping VCS (had) on all nodes, run 'hastart' on node 1.
Wait for 'hastatus' to show node 1 in 'LOCAL BUILD'. Wait for RUNNING state.
Run hastart on remaining nodes. They should all do a 'Remote Build' from node 1.
All nodes should now share the correct main.cf with no '-i' entry.
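As a sketch, the whole sequence from the steps above looks like this (node names taken from this thread):
# hastop -all -force        (run once, from any node; had stops, applications stay up)
# hastart                   (on MHOGW01, the node with the correct main.cf)
# hastatus -summary         (wait for LOCAL_BUILD, then RUNNING)
# hastart                   (then on MHOGW02, MHOGW03 and MHOGW04; each does a remote build)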
08-19-2014 01:06 AM
I agree with Marianne: you still need to do the hastop -force. In essence, "-i" is an invalid string for a host name, which is rejected both by the check of main.cf when VCS starts and by the check in the hasys command. But there is a side case: if the VCS node name (which is normally taken from /etc/VRTSvcs/conf/sysname) is not in main.cf when VCS starts, then VCS adds it, and it looks as though those checks do not take place there. Once VCS is started, changing /etc/VRTSvcs/conf/sysname (or, if that file is not there, I think VCS uses hostname) will not affect VCS, as this is only read at VCS start-up. So, as hasys rejects "-i" as a host name, you need to run "hastop -all -force" and then restart VCS, which will only start on the first node if its main.cf does NOT contain the invalid "-i" hostname (all other systems will build their main.cf from the first node).
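Before restarting, it may be worth confirming which name each node will register with; the one-liner below is just an illustrative sketch (it assumes passwordless ssh between the nodes, which this thread does not confirm):
# for h in MHOGW01 MHOGW02 MHOGW03 MHOGW04; do ssh $h 'cat /etc/VRTSvcs/conf/sysname 2>/dev/null || hostname'; done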
Mike