
vxfen module causes system panic after I/O fencing

I have a three-node cluster: vcs1, vcs2, and vcs3, using three iSCSI disks (served by an SCST iSCSI target simulator) as the fencing disk group (fendg).

 # vxfenadm   -d

I/O Fencing Cluster Information:
================================

 Fencing Protocol Version: 201
 Fencing Mode: SCSI3
 Fencing SCSI3 Disk Policy: dmp
 Cluster Members:  

        * 0 (vcs1)
          1 (vcs2)
          2 (vcs3)

 RFSM State Information:
        node   0 in state  8 (running)
        node   1 in state  8 (running)
        node   2 in state  8 (running)

 

# vxfenadm -s all -f /etc/vxfentab                                                                                                                                      

Device Name: /dev/vx/rdmp/disk_1s3
Total Number Of Keys: 3
key[0]: 
        [Numeric Format]:   86,70,49,48,53,50,48,48
        [Character Format]: VF105200
   *    [Node Format]: Cluster ID: 4178  Node ID: 0   Node Name: vcs1
key[1]: 
        [Numeric Format]:   86,70,49,48,53,50,48,49
        [Character Format]: VF105201
   *    [Node Format]: Cluster ID: 4178  Node ID: 1   Node Name: vcs2
key[2]: 
        [Numeric Format]:   86,70,49,48,53,50,48,50
        [Character Format]: VF105202
   *    [Node Format]: Cluster ID: 4178  Node ID: 2   Node Name: vcs3

Device Name: /dev/vx/rdmp/disk_0s3
Total Number Of Keys: 3
key[0]: 
        [Numeric Format]:   86,70,49,48,53,50,48,48
        [Character Format]: VF105200
   *    [Node Format]: Cluster ID: 4178  Node ID: 0   Node Name: vcs1
key[1]: 
        [Numeric Format]:   86,70,49,48,53,50,48,49
        [Character Format]: VF105201
   *    [Node Format]: Cluster ID: 4178  Node ID: 1   Node Name: vcs2
key[2]: 
        [Numeric Format]:   86,70,49,48,53,50,48,50
        [Character Format]: VF105202
   *    [Node Format]: Cluster ID: 4178  Node ID: 2   Node Name: vcs3

Device Name: /dev/vx/rdmp/disk_2s3
Total Number Of Keys: 3
key[0]: 
        [Numeric Format]:   86,70,49,48,53,50,48,48
        [Character Format]: VF105200
   *    [Node Format]: Cluster ID: 4178  Node ID: 0   Node Name: vcs1
key[1]: 
        [Numeric Format]:   86,70,49,48,53,50,48,49
        [Character Format]: VF105201
   *    [Node Format]: Cluster ID: 4178  Node ID: 1   Node Name: vcs2
key[2]: 
        [Numeric Format]:   86,70,49,48,53,50,48,50
        [Character Format]: VF105202
   *    [Node Format]: Cluster ID: 4178  Node ID: 2   Node Name: vcs3
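The two key formats printed by `vxfenadm -s` above are two views of the same SCSI-3 persistent-reservation key: the Numeric Format is just the decimal ASCII byte values of the Character Format string. From the output, the key also appears to be composed of "VF", the cluster ID in hex (0x1052 == 4178), and the two-digit node ID. A quick decode as a sketch (the composition rule is my reading of this output, not documented behavior):

```python
def decode_key(numeric):
    """Convert a vxfenadm numeric-format key (decimal ASCII bytes)
    to its character form."""
    return "".join(chr(b) for b in numeric)

# Node 0's key from the output above:
key = decode_key([86, 70, 49, 48, 53, 50, 48, 48])
print(key)  # VF105200

# Apparent structure: "VF" + cluster ID in hex + node ID.
# Cluster ID 4178 == 0x1052, node 0 -> "00".
assert key == "VF" + format(4178, "04X") + "00"
```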

 # lltstat -l
LLT link information:
link 0  eth1 on ether hipri
        mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
        txpkts 190145  txbytes 23138275
        rxpkts 174420  rxbytes 11540391
        latehb 0  badcksum 0  errors 0
link 1  eth0 on ether lowpri
        mtu 1500, sap 0xcafe, broadcast FF:FF:FF:FF:FF:FF, addrlen 6
        txpkts 71940  txbytes 3495901
        rxpkts 73537  rxbytes 3617055
        latehb 0  badcksum 0  errors 0

 

 

After I disconnect the network links of vcs3, vcs1 takes over the application that was running on vcs3. After waiting for several minutes and logging "VCS waiting for I/O fencing to be completed", vcs3 shows a kernel panic message like this:

 

BUG: unable to handle kernel paging request at ffffffff00000019
[32353.581223] IP: [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.581991] PGD 1806067 PUD 0 
[32353.582446] Oops: 0000 [#1] SMP 
[32353.582928] last sysfs file: /sys/devices/system/node/node0/cpumap
[32353.583751] CPU 0 
[32353.584031] Modules linked in: vxodm(PN) vxfen(PN) dmpjbod(PN) dmpap(PN) dmpaa(PN) vxspec(PN) vxio(PN) vxdmp(PN) binfmt_misc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device gab(PN) crc32
c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet llt(PN) microcode amf(PN) fuse loop vxportal(PN) fdd(PN) vxfs(PN) exportfs dm_mod virtio_console virtio_balloon virtio_net rt
c_cmos snd_hda_intel rtc_core snd_hda_codec rtc_lib snd_hwdep snd_pcm tpm_tis virtio_pci snd_timer tpm sym53c8xx virtio_ring snd button sg tpm_bios floppy pcspkr virtio scsi_transport_spi i2
c_piix4 soundcore i2c_core snd_page_alloc uhci_hcd sd_mod crc_t10dif ehci_hcd usbcore edd ext3 mbcache jbd fan processor ide_pci_generic piix ide_core ata_generic ata_piix libata scsi_mod th
ermal thermal_sys hwmon
[32353.584031] Supported: Yes, External
[32353.584031] Pid: 4730, comm: vxfen Tainted: P             2.6.32.12-0.7-default #1 Bochs
[32353.584031] RIP: 0010:[<ffffffff810399b5>]  [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.584031] RSP: 0018:ffff88006c6b5cc0  EFLAGS: 00010086
[32353.584031] RAX: ffffffff00000001 RBX: 0000000000013680 RCX: dead000000100100
[32353.584031] RDX: 0000000000000000 RSI: ffff88006c6b5d00 RDI: ffffffff81ab2df0
[32353.584031] RBP: ffff88006c6b5ce0 R08: 00000000000005db R09: 000000000000000a
[32353.584031] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000013680
[32353.584031] R13: ffffffff81ab2df0 R14: ffff88006c6b5d00 R15: 0000000000000000
[32353.584031] FS:  0000000000000000(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
[32353.584031] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[32353.584031] CR2: ffffffff00000019 CR3: 0000000037d1b000 CR4: 00000000000406f0
[32353.584031] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32353.584031] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32353.584031] Process vxfen (pid: 4730, threadinfo ffff88006c6b4000, task ffff88006d98c580)
[32353.584031] Stack:
[32353.584031]  0000000000000000 00000001007a4653 ffffffff81ab2df0 ffff880062de34b0
[32353.584031] <0> ffff88006c6b5d30 ffffffff81040e5a 000000000000000f ffff88007c0de148
[32353.584031] <0> 0000000000000086 ffff88007c0de140 00000001007a4653 0000000000000001
[32353.584031] Call Trace:
[32353.584031]  [<ffffffff81040e5a>] try_to_wake_up+0x4a/0x340
[32353.584031]  [<ffffffff810682a8>] up+0x48/0x50
[32353.584031]  [<ffffffffa0d5ed4a>] vxfen_bcast_lost_race_msg+0x8a/0x1b0 [vxfen]
[32353.584031]  [<ffffffffa0d5f63d>] vxfen_grab_coord_pt_30+0x76d/0x830 [vxfen]
[32353.584031]  [<ffffffffa0d5fbe7>] vxfen_grab_coord_pt+0x87/0x1a0 [vxfen]
[32353.584031]  [<ffffffffa0d6eb7c>] vxfen_msg_node_left_ack+0x22c/0x330 [vxfen]
[32353.584031]  [<ffffffffa0d70f22>] vxfen_process_client_msg+0x7d2/0xb30 [vxfen]
[32353.584031]  [<ffffffffa0d716db>] vxfen_vrfsm_cback+0x45b/0x17b0 [vxfen]
[32353.584031]  [<ffffffffa0d8cb20>] vrfsm_step+0x1b0/0x3b0 [vxfen]
[32353.584031]  [<ffffffffa0d8ee1c>] vrfsm_recv_thread+0x32c/0x970 [vxfen]
[32353.584031]  [<ffffffffa0d8f5b4>] vxplat_lx_thread_base+0xa4/0x100 [vxfen]
[32353.584031]  [<ffffffff81003fba>] child_rip+0xa/0x20
[32353.584031] Code: 6c 24 10 4c 89 74 24 18 49 89 fd 48 89 1c 24 49 89 f6 4c 89 64 24 08 49 c7 c4 80 36 01 00 9c 58 fa 49 89 06 49 8b 45 08 4c 89 e3 <8b> 40 18 48 03 1c c5 40 dc 91 81 48 89
 df e8 68 d5 35 00 49 8b 
[32353.584031] RIP  [<ffffffff810399b5>] task_rq_lock+0x35/0x90
[32353.584031]  RSP <ffff88006c6b5cc0>
[32353.584031] CR2: ffffffff00000019

 

 

 


3 Replies
Accepted Solution!


Hi,

If both private networks fail, then each node will race for the coordinator disks. If a node does not gain control of a majority of the disks, it will panic and leave the cluster.
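The majority rule above can be sketched as follows. This is a simplified illustration of the decision, not the actual vxfen implementation; the disk count matches the three coordinator disks in this cluster:

```python
# Simplified sketch of the fencing race outcome (illustration only):
# with 3 coordinator disks, a racing node must win a majority (>= 2)
# of them to survive; otherwise it panics and leaves the cluster.

def survives_race(disks_won, total_disks=3):
    """Return True if the racing node keeps running,
    False if it must panic."""
    return disks_won > total_disks // 2

# The vcs1+vcs2 sub-cluster's racer grabs all 3 coordinator disks:
print(survives_race(3))  # True  -> sub-cluster keeps running
# vcs3 loses the race, winning at most 1 disk:
print(survives_race(1))  # False -> vxfen panics the node
```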

 

The VCS administrator's guide has more on this topic; here is the link to the 6.1/Linux docs:

https://sort.symantec.com/documents/doc_details/sfha/6.1/Linux/ProductGuides/

 

cheers

tony

View solution in original post


HI,

  It's by design.

  When you disconnect the network for vcs3, the cluster separates into two sub-clusters: vcs1 and vcs2 in one sub-cluster,

vcs3 in the other.

  vcs3 loses the race for the coordinator disks, so the vxfen module panics the server.
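The membership split described above can be sketched as follows. This is an assumption-laden illustration, not vxfen code; in particular, "the larger sub-cluster is favored to win the race" is a simplification of the actual racing-delay behavior:

```python
# Illustration of the split-brain scenario in this thread:
# cutting vcs3's interconnect partitions the membership into two
# sub-clusters, and the larger one is favored to win the
# coordinator-disk race (simplified assumption).

nodes = {"vcs1", "vcs2", "vcs3"}
disconnected = {"vcs3"}        # vcs3's cluster interconnect is cut

sub_a = nodes - disconnected   # {vcs1, vcs2}
sub_b = disconnected           # {vcs3}

winner = sub_a if len(sub_a) >= len(sub_b) else sub_b
loser = sub_b if winner is sub_a else sub_a

print(sorted(winner))  # ['vcs1', 'vcs2']  keep running
print(sorted(loser))   # ['vcs3']          panics via vxfen
```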

 

 


It is by design to panic the node that loses the race. However, in this kind of scenario, if we can collect the crash dump from the system, it helps with the analysis.

I am not sure what the VCS and OS versions are here.

Apart from the VXFEN design of panicking the loser, there was a known issue causing a kernel panic due to a stack overrun with VxFS/VxVM on Red Hat:

https://access.redhat.com/solutions/526393

Please check whether you are hitting the issue mentioned above (it may require a Red Hat login).

 

Hope this helps.