One SFRAC node starts VCS very slowly after a direct reboot or power-off

David_Zhang
Level 4
Partner Certified
Hi all! We have two SUN M5000 servers running Solaris 10 with the latest patches, and SFRAC 5.0 MP3RP4 installed. I was able to configure I/O fencing and the other resources, and everything works fine. But the customer wants to run some tests on this SFRAC setup, for example pulling one network cable or one HBA cable, shutting a server down directly with the "reboot" command, or pulling all the power cables to simulate a disaster.

Now I find that node A starts VCS very slowly (about 14 minutes) after running the "reboot" command directly, while node B starts VCS fine (about 5 minutes). What I have tried so far:

1. Cleaned the main.cf file of all configuration and disabled the I/O fencing service, then rebooted both nodes; the boot speed of both nodes was fine. I also tried shutting down the GAB a, b, d... ports, then restarting GAB and starting HA; that was fine.
2. Used the configured main.cf with I/O fencing enabled, and ran "vxfenclearpre" to clear the coordinator disk group; both nodes started fine.
3. Retested node A with the "reboot" command; node A started slowly again.

Can anybody give me a solution? Thank you very much.

Log excerpt from m5005:

Aug 3 10:33:53 m5005 e1000g: [ID 801725 kern.info] NOTICE: pciex8086,105f - e1000g[1] : link up, 1000 Mbps, full duplex
Aug 3 10:33:58 m5005 e1000g: [ID 801725 kern.info] NOTICE: pciex8086,105f - e1000g[3] : link up, 1000 Mbps, full duplex
Aug 3 10:35:07 m5005 llt: [ID 122464 kern.notice] LLT INFO V-14-1-10499 recvarpack link 0 for node 1 addr change from 00:00:00:00:00:00 to 00:15:17:DF:A2:C5
Aug 3 10:35:07 m5005 gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Aug 3 10:35:08 m5005 llt: [ID 122464 kern.notice] LLT INFO V-14-1-10499 recvarpack link 1 for node 1 addr change from 00:00:00:00:00:00 to 00:15:17:DF:A6:61
Aug 3 10:35:08 m5005 gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port d registration waiting for seed port membership
Aug 3 10:35:09 m5005 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (e1000g1) node 1 active
Aug 3 10:35:09 m5005 llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g3) node 1 active
Aug 3 10:35:13 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port a gen 926a0c membership 01
Aug 3 10:35:13 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port d gen 926a0b membership 01
Aug 3 10:48:11 m5005 Had[4171]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-53021 Diagnostics directory moved to /var/VRTSvcs/diag/had.1280803691, please check its contents and contact VERITAS Support
Aug 3 10:48:11 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: m5005
Aug 3 10:48:11 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Aug 3 10:48:11 m5005 syslog[4201]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-53021 Diagnostics directory moved to /var/VRTSvcs/diag/CmdServer.1280803691, please check its contents and contact VERITAS Support
Aug 3 10:48:11 m5005 syslog[4208]: [ID 702911 daemon.notice] VCS INFO V-16-1-11240 Command Server: running with security OFF
Aug 3 10:48:12 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Aug 3 10:48:12 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
Aug 3 10:48:12 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11035 Waiting for cluster membership
Aug 3 10:48:12 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port b gen 926a0f membership 01
Aug 3 10:48:12 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port o gen 926a0e membership 01
Aug 3 10:48:12 m5005 vcsmm: [ID 357760 kern.notice] VCS RAC INFO V-10-1-15047 mmpl_reconfig_ioctl: dev_ioctl failed, vxfen may not be configured
Aug 3 10:48:12 m5005 vxfen: [ID 416634 kern.notice] NOTICE: VXFEN INFO V-11-1-35 Fencing driver going into RUNNING state
Aug 3 10:48:16 m5005 mac: [ID 736570 kern.info] NOTICE: bge0 unregistered
Aug 3 10:48:16 m5005 mac: [ID 736570 kern.info] NOTICE: bge2 unregistered
Aug 3 10:48:16 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port h gen 926a1a membership 01
Aug 3 10:48:16 m5005 Had[4200]: [ID 702911 daemon.notice] VCS INFO V-16-1-10077 Received new cluster membership
Aug 3 10:48:16 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10086 System m5005 (Node '0') is in Regular Membership - Membership: 0x3
Aug 3 10:48:16 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10086 System (Node '1') is in Regular Membership - Membership: 0x3
Aug 3 10:48:17 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10075 Building from remote system
Aug 3 10:48:17 m5005 mac: [ID 736570 kern.info] NOTICE: bge3 unregistered
Aug 3 10:48:17 m5005 mac: [ID 736570 kern.info] NOTICE: bge1 unregistered
Aug 3 10:48:18 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10066 Entering RUNNING state
Aug 3 10:48:18 m5005 Had[4200]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-50311 VCS Engine: running with security OFF
Aug 3 10:48:20 m5005 Had[4200]: [ID 702911 daemon.notice] VCS ERROR V-16-1-1005 (m5005) CVMCluster:???:monitor:node - state: out of cluster
Aug 3 10:48:36 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port v gen 926a1c membership 01
Aug 3 10:48:38 m5005 vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-7900 CVM_VOLD_CONFIG command received
Aug 3 10:48:40 m5005 vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-7899 CVM_VOLD_CHANGE command received
Aug 3 10:48:44 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port w gen 926a1e membership 01
Aug 3 10:49:01 m5005 vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-12143 CVM_VOLD_JOINOVER command received for node(s) 0
Aug 3 10:49:01 m5005 vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-10994 join completed for node 0
Aug 3 10:49:01 m5005 vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-4123 cluster established successfully
Aug 3 10:49:01 m5005 pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
Aug 3 10:49:01 m5005 genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
Aug 3 10:49:08 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port f gen 926a17 membership 01
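While the slow node is hanging during startup, it can help to watch from another shell (or the console) which layer is actually waiting, for example with the standard LLT/GAB status commands (a sketch only):

# lltstat -nvv
# gabconfig -a

lltstat -nvv shows whether both private links are up and the peer node is visible; gabconfig -a shows which GAB ports (a, b, d, h, ...) have already formed membership, which narrows down where the long gap is being spent.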

5 REPLIES

Gaurav_S
Moderator
VIP Certified
Hi David,

Interesting problem... can you let me know the following:

a) When you issue a reboot command with the configured main.cf and I/O fencing enabled, what do you observe in the console logs? Does the shutdown of the node go fine, or does the shutdown also get stuck somewhere? If the boot-up takes time, where does the boot process hang (I mean, at which message does the server's boot-up get stuck)?

Points to notice from the logs above:

Aug 3 10:35:13 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port d gen 926a0b membership 01
Aug 3 10:48:11 m5005 Had[4171]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-53021 Diagnostics directory moved to /var/VRTSvcs/diag/had.1280803691, please check its contents and contact VERITAS Support

a) You can see above that it took almost 13 minutes after port d registered, and basically HAD core dumped... I am surprised to see the core dump, because why did "HAD" attempt to start at this point? Fencing should have attempted to start first... So:
-- Is the startup sequence of the scripts correct? Fencing is S97vxfen and HAD is S99vcs... is that intact, or might it be using the SMF methods? Compare both nodes to see whether the startup sequence matches (see the sketch after these points).

-- Does your main.cf contain "UseFence = SCSI3"? Paste the output of the following:
# grep -i usefence /etc/VRTSvcs/conf/config/main.cf
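
A quick way to compare the startup ordering on the two nodes is to run something like this on each node and compare the output (a sketch only; the exact rc directory that holds S97vxfen and S99vcs can vary, so the find simply walks all of them):

# find /etc/rc?.d -name 'S*llt*' -o -name 'S*gab*' -o -name 'S*vxfen*' -o -name 'S*vcs*'
# svcs -a | egrep -i 'llt|gab|vxfen|vcs'

The first command lists the legacy start scripts and their sequence numbers (S97vxfen should sort before S99vcs); the second shows whether any part of the stack has been converted to SMF services on that node.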

Again, later we see:

Aug 3 10:48:12 m5005 gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port o gen 926a0e membership 01
Aug 3 10:48:12 m5005 vcsmm: [ID 357760 kern.notice] VCS RAC INFO V-10-1-15047 mmpl_reconfig_ioctl: dev_ioctl failed, vxfen may not be configured
Aug 3 10:48:12 m5005 vxfen: [ID 416634 kern.notice] NOTICE: VXFEN INFO V-11-1-35 Fencing driver going into RUNNING state

We see that the configuration of VCSMM failed because fencing wasn't running... so somehow the fencing driver is taking time to go into the RUNNING state (this can be watched during boot, as in the sketch below)...
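
To check how far fencing has come up at any point during the slow boot, the standard VxFEN utility can be run from another shell (a sketch):

# vxfenadm -d

This shows the fencing mode and whether the fencing driver has reached the RUNNING state on that node; comparing its output on the fast and slow nodes at the same point in the boot may show where the delay sits.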

-- Are the storage settings the same for both nodes? Have you observed anything like one node accessing the storage faster than the other?

-- Did you run "vxfentsthdw" before configuring fencing? Did every test pass? (A sketch of a non-destructive run is below.)
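
If it needs to be re-run, a non-destructive check against the coordinator disk group can be done roughly like this (a sketch; the -r read-only and -g disk group options and the install path should be confirmed against the 5.0 vxfentsthdw documentation, and vxfencoorddg is just an example disk group name):

# /opt/VRTSvcs/vxfen/bin/vxfentsthdw -r -g vxfencoorddg

This tests SCSI-3 persistent reservation support on the disks in that disk group without overwriting data on them.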

We can also check the HAD core to find out why it is dumping core... but that would come at a slightly later stage, and won't be possible to analyze here...


I hope the points above bring some clarity...


Gaurav

David_Zhang
Level 4
Partner Certified
Hi Gaurav, thanks for your response.

The core dump directory is "/var/VRTSvcs/diag/had.1280803691", but that directory is empty, without any information.

"UseFence = SCSI3" is present in main.cf on both nodes.

The "vxfentsthdw" command runs fine, with a "passed" message.

The S97vxfen / S99vcs startup sequence is fine.

So, as you suggested, the long time is being spent on the HAD core dump.


Gaurav_S
Moderator
VIP Certified

Hi David...

I am still thinking about the reasons why "had" would attempt to start before vxfen... The only thing I can think of is an issue with the startup sequence... I would suggest making a thorough check to see whether any other script is triggering "had" to start before S99vcs...

I was looking in the documentation and found this among the known issues... from the Read First guide:

ftp://ftp.veritas.com/pub/support/patchcentral/Solaris/5.0_MP3/sfha/sfha-sol_sparc-5.0MP3RP3-patches.tar.gz_doc/sfha_readfirst.pdf

1779172 [Oakmont][Opteron]had core dump on the non-first node of a cluster

Not sure if this is related, or whether it is solved in RP4 or 5.1... we will need Symantec support's help to confirm this...

Gaurav

David_Zhang
Level 4
Partner Certified
Hi Gaurav, I can confirm that SFRAC 5.0 MP3RP4 is installed.
I have now found the root cause and the resolution.

The two SUN M5000 servers connect to one EMC CX3-80 array over redundant paths. The Symantec documentation suggests setting the array parameter to "failover mode = 1", but the EMC engineer had set one server to failover mode = 1 and the other server to failover mode = 4. The server that was starting up slowly was the one with failover mode = 4.

After changing the failover mode to 1, testing with at least 3 reboots has been fine.
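
For reference, a per-host mismatch like this can often be spotted from the host side by comparing how DMP sees the array on each node (a sketch using standard VxVM/DMP commands; c2 is just an example controller name, and the actual failover-mode value still has to be verified on the array itself):

# vxdmpadm listenclosure all
# vxdmpadm getsubpaths ctlr=c2

If the two nodes report a different array type for the same CX3-80 enclosure, the failover mode has probably been set differently for the two hosts.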

OK, thanks again, Gaurav.

Gaurav_S
Moderator
VIP Certified

OK, good... so my suspicion in my first response was not really wrong...

" -- Is storage settings same for both the nodes ? Have you observed something like one node access the storage faster then other ? "


Gaurav