Failover time more than 1 minute

solom
Level 4

Hi,

I have a Veritas cluster 6.1 configured on Red Hat 6.4, and it takes more than one minute to fail over even though no agents are configured yet.

The only resources configured are the disk groups and mounts. The cluster is not generating any errors, yet it takes a very long time either to deport the disk group or to import it, and sometimes even the volumes take a long time to come online.

At first, the cluster service had only one disk group as a resource, and the failover time was around 25 seconds, but when I added another disk group, the failover time increased to 1.5 to 2 minutes.

Any advice on what might be causing this slowness?

Thanks

17 REPLIES 17

Gaurav_S
Moderator
   VIP    Certified
How long does it take to import the disk group and start the volumes manually on the servers, outside VCS? I would first use that to find out whether the problem is in the VCS, VxVM, or OS layer.

G
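A rough way to take that measurement is sketched below. It assumes the service group is offline on both nodes first, and `time_step` and `measure_dg` are hypothetical helper names, not part of VCS or VxVM. Each manual step is timed separately so the slow one stands out.

```shell
# time_step LABEL CMD...: run CMD and print how many whole seconds it took.
time_step() {
  ts_label=$1; shift
  ts_start=$(date +%s)
  ts_rc=0
  "$@" || ts_rc=$?
  ts_end=$(date +%s)
  echo "$ts_label: $((ts_end - ts_start))s (exit $ts_rc)"
  return "$ts_rc"
}

# measure_dg DG: time a manual import, volume start, and deport of one disk group.
# Assumption: the disk group is not online under VCS anywhere while this runs.
measure_dg() {
  time_step "import $1" vxdg import "$1" &&
  time_step "start volumes in $1" vxvol -g "$1" startall &&
  time_step "deport $1" vxdg deport "$1"
}
```

It could be called as `measure_dg <diskgroup>` on each node in turn. If the manual numbers are small but a VCS switch is slow, the delay is more likely in the VCS layer than in VxVM or the OS.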

solom
Level 4

It takes 12 seconds.

stinsong
Level 5

Hi Solom,

It seems to be because there are many disks and volumes in the second DG you added to VCS.

The time it takes to import a DG depends on how many disks and volumes it contains, because during the import all disks are read and all volumes are started at the same time. So typically, when there are many disks and volumes, importing the DG or bringing the DG resource online takes a long time to complete.

Another thing to check is whether you configured all the volumes as "Volume" resources. If so, the VCS Volume agent waits until every volume has started or stopped before the service group online or offline completes.

solom
Level 4

I have one disk in each DG and 5 volumes mounted from that disk, which I don't think is many.

And when I force a failover from node1 or node2, the failover takes 10 seconds.

 

Regards

Gaurav_S
Moderator
   VIP    Certified

Can you attach engine_A.log and main.cf for us?

 

G

solom
Level 4

I attached them.

 

Thanks

starflyfly
Level 6
Employee Accredited Certified

Hi, solom  

 

Do you mean this log:

======

2014/04/22 16:10:32 VCS INFO V-16-1-50135 User admin fired command: hagrp -switch INT-PRI  TCPRI-CLU2  localclus  from ::ffff:10.100.208.76
2014/04/22 16:10:32 VCS NOTICE V-16-1-10208 Initiating switch of group INT-PRI from system TCPRI-CLU1 to system TCPRI-CLU2
2014/04/22 16:10:32 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:10:32 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:10:32 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:10:32 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:10:32 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:10:33 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:10:34 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:10:35 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:10:35 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:10:41 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:10:41 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:10:48 VCS INFO V-16-1-10305 Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:10:48 VCS NOTICE V-16-1-10446 Group INT-PRI is offline on system TCPRI-CLU1
2014/04/22 16:10:48 VCS NOTICE V-16-1-10301 Initiating Online of Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:10:48 VCS NOTICE V-16-10031-1513 (TCPRI-CLU2) DiskGroup:PRI-INT:online:Diskgroups will be imported with reservations.
2014/04/22 16:10:54 VCS WARNING V-16-10031-1509 (TCPRI-CLU2) DiskGroup:PRI-INT:online:vxdg import succeeded on Disk Group PRI-INT.
2014/04/22 16:10:54 VCS NOTICE V-16-10031-1559 (TCPRI-CLU2) DiskGroup:PRI-INT:online:Volumes in DiskGroup PRI-INT will be started automatically as part of import command,the system level autostartvolume is set On 
2014/04/22 16:10:55 VCS INFO V-16-1-10298 Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:10:55 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:10:55 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:10:55 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:10:55 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:10:55 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:10:58 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:10:59 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:11:00 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:11:01 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:11:02 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:11:02 VCS NOTICE V-16-1-10447 Group INT-PRI is online on system TCPRI-CLU2

==============

 

Or this one:

===========

2014/04/22 16:20:16 VCS INFO V-16-1-50135 User admin fired command: hagrp -switch INT-PRI  TCPRI-CLU2  localclus  from ::ffff:10.100.208.76
2014/04/22 16:20:16 VCS NOTICE V-16-1-10208 Initiating switch of group INT-PRI from system TCPRI-CLU1 to system TCPRI-CLU2
2014/04/22 16:20:16 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:20:16 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:20:16 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:20:16 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:20:16 VCS NOTICE V-16-1-10300 Initiating Offline of Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:20:17 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:20:18 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:20:19 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:20:19 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:20:20 VCS INFO V-16-1-10305 Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:20:20 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU1
2014/04/22 16:23:21 VCS WARNING V-16-6-16100 (TCPRI-CLU1) chkvxconfigd:The VxVM process vxconfigd seems to be un-responsive. Stopping vxnotify process, so that resources get unregistered from AMF monitoring
2014/04/22 16:23:21 VCS INFO V-16-2-13717 (TCPRI-CLU1) Output of the completed operation (imf_getnotification) 
==============================================
Cannot continue monitoring event
Got notification for group: PRI-INT

==============================================

2014/04/22 16:25:22 VCS WARNING V-16-2-13011 (TCPRI-CLU1) Resource(PRI-INT): offline procedure did not complete within the expected time.
2014/04/22 16:25:22 VCS ERROR V-16-2-13063 (TCPRI-CLU1) Agent is calling clean for resource(PRI-INT) because offline did not complete within the expected time.
2014/04/22 16:26:23 VCS ERROR V-16-2-13006 (TCPRI-CLU1) Resource(PRI-INT): clean procedure did not complete within the expected time.
2014/04/22 16:27:24 VCS INFO V-16-1-10305 Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) is offline on TCPRI-CLU1 (VCS initiated)
2014/04/22 16:27:24 VCS NOTICE V-16-1-10446 Group INT-PRI is offline on system TCPRI-CLU1
2014/04/22 16:27:24 VCS NOTICE V-16-1-10301 Initiating Online of Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:27:24 VCS NOTICE V-16-10031-1513 (TCPRI-CLU2) DiskGroup:PRI-INT:online:Diskgroups will be imported with reservations.
2014/04/22 16:27:45 VCS WARNING V-16-10031-1509 (TCPRI-CLU2) DiskGroup:PRI-INT:online:vxdg import succeeded on Disk Group PRI-INT.
2014/04/22 16:27:45 VCS NOTICE V-16-10031-1559 (TCPRI-CLU2) DiskGroup:PRI-INT:online:Volumes in DiskGroup PRI-INT will be started automatically as part of import command,the system level autostartvolume is set On 
2014/04/22 16:27:45 VCS INFO V-16-2-13717 (TCPRI-CLU2) Output of the completed operation (imf_getnotification) 
==============================================
Got notification for group: PRI-INT

==============================================

2014/04/22 16:27:46 VCS INFO V-16-1-10298 Resource PRI-INT (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:27:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:27:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:27:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:27:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:27:46 VCS NOTICE V-16-1-10301 Initiating Online of Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) on System TCPRI-CLU2
2014/04/22 16:27:49 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INT-HS (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:28:05 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INTJRNALT (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:28:06 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INTBACKUP (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:28:07 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INTJRNPRI (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:28:08 VCS INFO V-16-1-10298 Resource TRAKPRIVOL-INT-DB (Owner: Unspecified, Group: INT-PRI) is online on TCPRI-CLU2 (VCS initiated)
2014/04/22 16:28:08 VCS NOTICE V-16-1-10447 Group INT-PRI is online on system TCPRI-CLU2

===============

solom
Level 4

the last one

Daniel_Matheus
Level 4
Employee Accredited Certified

Hi Solom,

 

Even in the last one, the online procedure took only a few seconds.

What took long was the offline procedure, because according to the logs vxconfigd was either not running or not responding:

 

2014/04/22 16:23:21 VCS WARNING V-16-6-16100 (TCPRI-CLU1) chkvxconfigd:The VxVM process vxconfigd seems to be un-responsive. Stopping vxnotify process, so that resources get unregistered from AMF monitoring
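To see which step of a switch eats the time, the gap between each "Initiating Offline/Online" entry and the matching "is offline/online" entry can be computed from engine_A.log. The following is only a sketch: `resource_durations` is a hypothetical helper name, it relies on the field layout of the log lines quoted in this thread, and it assumes a switch does not span midnight.

```shell
# resource_durations: read engine_A.log lines on stdin and print, per resource,
# the seconds between "Initiating Offline/Online of Resource X" and
# "Resource X ... is offline/online".
resource_durations() {
  awk '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    # Start marker: field 10 is the resource name.
    $6 == "Initiating" && $9 == "Resource" { start[$10] = secs($2); next }
    # Completion marker: field 7 is the resource name, field 13 the new state.
    $6 == "Resource" && $12 == "is" && ($7 in start) {
      print $7, $13, (secs($2) - start[$7]) "s"
      delete start[$7]
    }
  '
}
```

A possible invocation is `resource_durations < /var/VRTSvcs/log/engine_A.log`. For the second switch quoted in this thread, the PRI-INT offline runs from 16:20:20 to 16:27:24, which is 424 seconds, while the online on TCPRI-CLU2 takes 22 seconds.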

 

vxconfigd is the main VxVM daemon, which manages the import and deport of disk groups and the start and stop of volumes.

 

Can you check if vxconfigd is running?

#ps -ef | grep vxconfigd

 

If it is running, check whether it is enabled:

#vxdctl mode

If it is disabled, try to enable it:

#vxdctl enable

or

#vxconfigd -m enable

If vxconfigd is not running or cannot be enabled, try to restart it:

#vxconfigd -k -x syslog
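The decision steps above can be sketched as a small helper. This is an illustration only: `next_step_for_mode` is a hypothetical name, and it assumes `vxdctl mode` prints a single line such as "mode: enabled" or "mode: disabled"; any other output is treated as "not running".

```shell
# next_step_for_mode TEXT: map the text printed by `vxdctl mode`
# to the follow-up action suggested in this thread.
next_step_for_mode() {
  case "$1" in
    "mode: enabled")  echo "ok" ;;
    "mode: disabled") echo "run: vxdctl enable" ;;
    *)                echo "run: vxconfigd -k -x syslog" ;;
  esac
}
```

On a cluster node it could be used as `next_step_for_mode "$(vxdctl mode)"`. It only prints the suggestion and does not change anything itself.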

 

regards,
Dan

 

solom
Level 4

[root@TCPRI-CLU1 ~]# vxdctl mode
mode: enabled
[root@TCPRI-CLU1 ~]#

 

Daniel_Matheus
Level 4
Employee Accredited Certified

Can you do another simple check to see whether vxconfigd is responsive?

#vxdisk list
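Since the daemon may only hang now and then, it can help to put a time limit on that check instead of waiting on it by hand. A minimal sketch, assuming the GNU coreutils `timeout` command is available; `responds_within` is a hypothetical helper name:

```shell
# responds_within SECONDS CMD...: succeed only if CMD exits with status 0
# before SECONDS have elapsed. Output of CMD is discarded.
responds_within() {
  rw_limit=$1; shift
  timeout "$rw_limit" "$@" >/dev/null 2>&1
}
```

For example, `responds_within 10 vxdisk list || echo "vxconfigd slow or unresponsive"`. The 10 second limit is an arbitrary choice for illustration.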

solom
Level 4

vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
disk_0       auto:LVM        -            -            online invalid
eva64000_0   auto:cdsdisk    -            -            online
eva64000_1   auto:cdsdisk    -            -            online
eva64000_2   auto:cdsdisk    -            -            online
eva64000_3   auto:cdsdisk    -            -            online
eva64000_4   auto:cdsdisk    -            -            online
eva64000_59  auto:cdsdisk    eva64000_59  PRI-INT      online
eva64000_60  auto:cdsdisk    eva64000_60  PRI-LAB      online
eva64000_61  auto:cdsdisk    eva64000_61  PRI-TC       online
eva64000_62  auto:cdsdisk    eva64000_62  PRI-LAB      online
eva64000_63  auto:cdsdisk    eva64000_63  PRI-INT      online
eva64000_64  auto:cdsdisk    eva64000_64  PRI-TC       online

 

Yes, it is.

The same configuration on the other side is working well. Maybe the problem is in the network, if there is more traffic on the VLAN.

 

starflyfly
Level 6
Employee Accredited Certified

Hi, 

 

We need to check what happened to vxconfigd at the time VCS logged this message:

==========

2014/04/22 16:23:21 VCS WARNING V-16-6-16100 (TCPRI-CLU1) chkvxconfigd:The VxVM process vxconfigd seems to be un-responsive. Stopping vxnotify process, so that resources get unregistered from AMF monitoring

========

Check /var/log/messages and /etc/vx/dmpevents.log.

If needed, check the debug log.

solom
Level 4

See the messages log for 27 April.

I attached it.

 


 

starflyfly
Level 6
Employee Accredited Certified

Hi, we had better check the system log for Apr 22.

Anyway, here is the log from April 27:

=========

Apr 27 03:31:30 TCPRI-CLU1 multipathd: mpathbm: load table [0 629145600 multipath 1 queue_if_no_path 0 3 2 round-robin 0 1 1 68:16 1 round-robin 0 4 1 68:192 1 67:96 1 65:80 1 66:176 1 round-robin 0 3 1 69:112 1 8:160 1 66:0 1]
Apr 27 03:31:50 TCPRI-CLU1 multipathd: mpathbm: load table [0 629145600 multipath 1 queue_if_no_path 0 2 1 round-robin 0 4 1 68:192 1 67:96 1 65:80 1 66:176 1 round-robin 0 4 1 68:16 1 69:112 1 8:160 1 66:0 1]    <<<<<<<<<<<<<<
Apr 27 03:37:31 TCPRI-CLU1 kernel: __ratelimit: 1 callbacks suppressed
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:3:3: [sdcc] Unhandled error code<<<<<<<<<<<<
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:3:3: [sdcc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:3:3: [sdcc] CDB: Read(10): 28 00 00 00 01 20 00 00 10 00
Apr 27 03:37:31 TCPRI-CLU1 kernel: __ratelimit: 1 callbacks suppressed
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:2:8: [sdbw] Unhandled error code
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:2:8: [sdbw] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:2:8: [sdbw] CDB: Read(10): 28 00 14 01 01 40 00 00 02 00
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:2:8: [sdbw] Unhandled error code
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:2:8: [sdbw] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:2:8: [sdbw] CDB: Read(10): 28 00 14 01 01 10 00 00 10 00
Apr 27 03:37:31 TCPRI-CLU1 kernel: sd 2:0:2:8: [sdbw] Unhandled error code

=========

 

Suggestions:

1. If possible, stop multipathd, since DMP may not work well together with other multipathing software.

2. Check whether something is abnormal on the storage paths, since there are many "Unhandled error code" messages.
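For the second suggestion, counting the errors per device gives a quick picture of which paths are affected. This is a sketch that assumes the kernel message layout shown in the excerpt above; `count_scsi_errors` is a hypothetical helper name. The commented commands for the first suggestion assume the RHEL 6 SysV tools and should only be run once it is confirmed that nothing on the host depends on device-mapper multipath.

```shell
# count_scsi_errors: read syslog lines on stdin and print how many
# "Unhandled error code" messages each sd device logged.
count_scsi_errors() {
  awk '/Unhandled error code/ { n[$8]++ } END { for (d in n) print d, n[d] }' | sort
}

# Suggestion 1 on RHEL 6 (assumption: no other storage relies on multipathd):
#   service multipathd stop
#   chkconfig multipathd off
```

A possible invocation is `count_scsi_errors < /var/log/messages`.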

 

 

 

solom
Level 4

DMP??

solom
Level 4

Suggestions:

1. If possible, stop multipathd, since DMP may not work well together with other multipathing software.

This was the problem.

 

I'm sorry for the delay in replying.

Thank you very much