
vxfen module causes system panic after I/O fencing

I have a three-node cluster: vcs1, vcs2, and vcs3, using three iSCSI disks (served by an SCST iSCSI target simulator) as the fencing disk group (fendg).

 # vxfenadm   -d

I/O Fencing Cluster Information:
================================

 Fencing Protocol Version: 201
 Fencing Mode: SCSI3
 Fencing SCSI3 Disk Policy: dmp
 Cluster Members:  

        * 0 (vcs1)
          1 (vcs2)
          2 (vcs3)

 RFSM State Information:
        node   0 in state  8 (running)
        node   1 in state  8 (running)
        node   2 in state  8 (running)

 

# vxfenadm -s all -f /etc/vxfentab                                                                                                                                      

Device Name: /dev/vx/rdmp/disk_1s3
Total Number Of Keys: 3
key[0]: 
        [Numeric Format]:   86,70,49,48,53,50,48,48
        [Character Format]: VF105200
   *    [Node Format]: Cluster ID: 4178  Node ID: 0   Node Name: vcs1
key[1]: 
        [Numeric Format]:   86,70,49,48,53,50,48,49
        [Character Format]: VF105201
   *    [Node Format]: Cluster ID: 4178  Node ID: 1   Node Name: vcs2
key[2]: 
        [Numeric Format]:   86,70,49,48,53,50,48,50
        [Character Format]: VF105202
   *    [Node Format]: Cluster ID: 4178  Node ID: 2   Node Name: vcs3

Device Name: /dev/vx/rdmp/disk_0s3
Total Number Of Keys: 3
key[0]: 
        [Numeric Format]:   86,70,49,48,53,50,48,48
        [Character Format]: VF105200
   *    [Node Format]: Cluster ID: 4178  Node ID: 0   Node Name: vcs1
key[1]: 
        [Numeric Format]:   86,70,49,48,53,50,48,49
        [Character Format]: VF105201
   *    [Node Format]: Cluster ID: 4178  Node ID: 1   Node Name: vcs2
key[2]: 
        [Numeric Format]:   86,70,49,48,53,50,48,50
        [Character Format]: VF105202
   *    [Node Format]: Cluster ID: 4178  Node ID: 2   Node Name: vcs3

Device Name: /dev/vx/rdmp/disk_2s3
Total Number Of Keys: 3
key[0]: 
        [Numeric Format]:   86,70,49,48,53,50,48,48
        [Character Format]: VF105200
   *    [Node Format]: Cluster ID: 4178  Node ID: 0   Node Name: vcs1
key[1]: 
        [Numeric Format]:   86,70,49,48,53,50,48,49
        [Character Format]: VF105201
   *    [Node Format]: Cluster ID: 4178  Node ID: 1   Node Name: vcs2
key[2]: 
        [Numeric Format]:   86,70,49,48,53,50,48,50
        [Character Format]: VF105202
   *    [Node Format]: Cluster ID: 4178  Node ID: 2   Node Name: vcs3
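The two key formats printed by `vxfenadm -s` above are two views of the same SCSI-3 persistent-reservation key: the Numeric Format is just the decimal ASCII byte values of the Character Format string. From the output, the key also appears to be composed of "VF", the cluster ID in hex (0x1052 == 4178), and the two-digit node ID. A quick decode as a sketch (the composition rule is my reading of this output, not documented behavior):

```python
def decode_key(numeric):
    """Convert a vxfenadm numeric-format key (decimal ASCII bytes)
    to its character form."""
    return "".join(chr(b) for b in numeric)

# Node 0's key from the output above:
key = decode_key([86, 70, 49, 48, 53, 50, 48, 48])
print(key)  # VF105200

# Apparent structure: "VF" + cluster ID in hex + node ID.
# Cluster ID 4178 == 0x1052, node 0 -> "00".
assert key == "VF" + format(4178, "04X") + "00"
```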

 # lltstat -l
LLT link information:
link 0  eth1 on ether hipri
        mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
        txpkts 190145  txbytes 23138275
        rxpkts 174420  rxbytes 11540391
        latehb 0  badcksum 0  errors 0
link 1  eth0 on ether lowpri
        mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
        txpkts 71940  txbytes 3495901
        rxpkts 73537  rxbytes 3617055
        latehb 0  badcksum 0  errors 0

 

 

After I disconnect the network links of vcs3, vcs1 takes over the application that was running on vcs3. After waiting for several minutes and logging "VCS waiting for I/O fencing to be completed", vcs3 shows a kernel panic message like this:

 

BUG: unable to handle kernel paging request at ffffffff00000019
[32353.581223] IP: [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.581991] PGD 1806067 PUD 0 
[32353.582446] Oops: 0000 [#1] SMP 
[32353.582928] last sysfs file: /sys/devices/system/node/node0/cpumap
[32353.583751] CPU 0 
[32353.584031] Modules linked in: vxodm(PN) vxfen(PN) dmpjbod(PN) dmpap(PN) dmpaa(PN) vxspec(PN) vxio(PN) vxdmp(PN) binfmt_misc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device gab(PN) crc32
c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet llt(PN) microcode amf(PN) fuse loop vxportal(PN) fdd(PN) vxfs(PN) exportfs dm_mod virtio_console virtio_balloon virtio_net rt
c_cmos snd_hda_intel rtc_core snd_hda_codec rtc_lib snd_hwdep snd_pcm tpm_tis virtio_pci snd_timer tpm sym53c8xx virtio_ring snd button sg tpm_bios floppy pcspkr virtio scsi_transport_spi i2
c_piix4 soundcore i2c_core snd_page_alloc uhci_hcd sd_mod crc_t10dif ehci_hcd usbcore edd ext3 mbcache jbd fan processor ide_pci_generic piix ide_core ata_generic ata_piix libata scsi_mod th
ermal thermal_sys hwmon
[32353.584031] Supported: Yes, External
[32353.584031] Pid: 4730, comm: vxfen Tainted: P             2.6.32.12-0.7-default #1 Bochs
[32353.584031] RIP: 0010:[<ffffffff810399b5>]  [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.584031] RSP: 0018:ffff88006c6b5cc0  EFLAGS: 00010086
[32353.584031] RAX: ffffffff00000001 RBX: 0000000000013680 RCX: dead000000100100
[32353.584031] RDX: 0000000000000000 RSI: ffff88006c6b5d00 RDI: ffffffff81ab2df0
[32353.584031] RBP: ffff88006c6b5ce0 R08: 00000000000005db R09: 000000000000000a
[32353.584031] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000013680
[32353.584031] R13: ffffffff81ab2df0 R14: ffff88006c6b5d00 R15: 0000000000000000
[32353.584031] FS:  0000000000000000(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
[32353.584031] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[32353.584031] CR2: ffffffff00000019 CR3: 0000000037d1b000 CR4: 00000000000406f0
[32353.584031] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32353.584031] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32353.584031] Process vxfen (pid: 4730, threadinfo ffff88006c6b4000, task ffff88006d98c580)
[32353.584031] Stack:
[32353.584031]  0000000000000000 00000001007a4653 ffffffff81ab2df0 ffff880062de34b0
[32353.584031] <0> ffff88006c6b5d30 ffffffff81040e5a 000000000000000f ffff88007c0de148
[32353.584031] <0> 0000000000000086 ffff88007c0de140 00000001007a4653 0000000000000001
[32353.584031] Call Trace:
[32353.584031]  [<ffffffff81040e5a>] try_to_wake_up+0x4a/0x340
[32353.584031]  [<ffffffff810682a8>] up+0x48/0x50
[32353.584031]  [<ffffffffa0d5ed4a>] vxfen_bcast_lost_race_msg+0x8a/0x1b0 [vxfen]
[32353.584031]  [<ffffffffa0d5f63d>] vxfen_grab_coord_pt_30+0x76d/0x830 [vxfen]
[32353.584031]  [<ffffffffa0d5fbe7>] vxfen_grab_coord_pt+0x87/0x1a0 [vxfen]
[32353.584031]  [<ffffffffa0d6eb7c>] vxfen_msg_node_left_ack+0x22c/0x330 [vxfen]
[32353.584031]  [<ffffffffa0d70f22>] vxfen_process_client_msg+0x7d2/0xb30 [vxfen]
[32353.584031]  [<ffffffffa0d716db>] vxfen_vrfsm_cback+0x45b/0x17b0 [vxfen]
[32353.584031]  [<ffffffffa0d8cb20>] vrfsm_step+0x1b0/0x3b0 [vxfen]
[32353.584031]  [<ffffffffa0d8ee1c>] vrfsm_recv_thread+0x32c/0x970 [vxfen]
[32353.584031]  [<ffffffffa0d8f5b4>] vxplat_lx_thread_base+0xa4/0x100 [vxfen]
[32353.584031]  [<ffffffff81003fba>] child_rip+0xa/0x20
[32353.584031] Code: 6c 24 10 4c 89 74 24 18 49 89 fd 48 89 1c 24 49 89 f6 4c 89 64 24 08 49 c7 c4 80 36 01 00 9c 58 fa 49 89 06 49 8b 45 08 4c 89 e3 <8b> 40 18 48 03 1c c5 40 dc 91 81 48 89
 df e8 68 d5 35 00 49 8b 
[32353.584031] RIP  [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.584031]  RSP <ffff88006c6b5cc0>
[32353.584031] CR2: ffffffff00000019

 

 

 


3 Replies
Accepted Solution!


Hi,

If both private networks fail, then each node will race for the coordinator disks. If a node does not gain control of a majority of the disks, it will panic and leave the cluster.
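The majority rule above can be sketched as follows. This is a simplified illustration of the decision, not the actual vxfen implementation; the disk count matches the three coordinator disks in this cluster:

```python
# Simplified sketch of the fencing race outcome (illustration only):
# with 3 coordinator disks, a racing node must win a majority (>= 2)
# of them to survive; otherwise it panics and leaves the cluster.

def survives_race(disks_won, total_disks=3):
    """Return True if the racing node keeps running,
    False if it must panic."""
    return disks_won > total_disks // 2

# The vcs1+vcs2 sub-cluster's racer grabs all 3 coordinator disks:
print(survives_race(3))  # True  -> sub-cluster keeps running
# vcs3 loses the race, winning at most 1 disk:
print(survives_race(1))  # False -> vxfen panics the node
```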

 

The VCS administrator's guide has more on this topic; here is the link to the 6.1/Linux docs:

https://sort.symantec.com/documents/doc_details/sfha/6.1/Linux/ProductGuides/

 

cheers

tony

View solution in original post


HI,

  It's by design.

  When you disconnect the network for vcs3, the cluster separates into two sub-clusters: vcs1 and vcs2 in one sub-cluster,

vcs3 in the other.

  vcs3 loses the race for the coordinator disks, so the vxfen module panics the server.
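The membership split described above can be sketched as follows. This is an assumption-laden illustration, not vxfen code; in particular, "the larger sub-cluster is favored to win the race" is a simplification of the actual racing-delay behavior:

```python
# Illustration of the split-brain scenario in this thread:
# cutting vcs3's interconnect partitions the membership into two
# sub-clusters, and the larger one is favored to win the
# coordinator-disk race (simplified assumption).

nodes = {"vcs1", "vcs2", "vcs3"}
disconnected = {"vcs3"}        # vcs3's cluster interconnect is cut

sub_a = nodes - disconnected   # {vcs1, vcs2}
sub_b = disconnected           # {vcs3}

winner = sub_a if len(sub_a) >= len(sub_b) else sub_b
loser = sub_b if winner is sub_a else sub_a

print(sorted(winner))  # ['vcs1', 'vcs2']  keep running
print(sorted(loser))   # ['vcs3']          panics via vxfen
```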

 

 


It is by design to panic the node that loses the race. However, in this kind of scenario, if we can collect the crash dump from the system, it helps with the analysis.

I am not sure what the VCS and OS versions are here.

Apart from the VXFEN design of panicking the loser, there was a known issue causing a kernel panic due to a stack overrun with VxFS/VxVM on Red Hat:

https://access.redhat.com/solutions/526393

Please check whether you are hitting the issue mentioned above (it may require a Red Hat login).

 

Hope this helps.