
http down on cluster

Aerosmith1
Level 4
Hi

Nagios suddenly emailed me that HTTP is critical. I checked on one of the nodes, where all service groups had been running fine:

hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  server1            RUNNING              0
A  server2            RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  ClusterService  server1            Y          N               OFFLINE
B  ClusterService  server2            Y          N               ONLINE
B  bb              server1            Y          N               STOPPING|PARTIAL
B  bb              server2            Y          N               OFFLINE

-- RESOURCES OFFLINING
-- Group           Type            Resource             System               IState

F  bb              DiskGroup       bbdg                 server1            W_OFFLINE_PROPAGATE



Server log from the messages file:

Jan 18 23:22:51 server1 genunix: [ID 408789 kern.warning] WARNING: ce1: fault detected external to device; service degraded
Jan 18 23:22:51 server1 genunix: [ID 451854 kern.warning ] WARNING: ce1: xcvr addr:0x01 - link down
Jan 18 23:22:58 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 8 sec (2421280/13593146)
Jan 18 23:22:59 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 9 sec (2421280/13593150)
Jan 18 23:23:00 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 10 sec (2421280/13593154)
Jan 18 23:23:01 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 11 sec (2421280/13593158)
Jan 18 23:23:02 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 12 sec (2421280/13593162)
Jan 18 23:23:03 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 13 sec (2421280/13593168)
Jan 18 23:23:04 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 14 sec (2421280/13593172)
Jan 18 23:23:05 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 15 sec (2421280/13593176)
Jan 18 23:23:06 server1 llt: [ID 120420 kern.notice] LLT:10032: link 2 (ce1) node 1 inactive 16 sec (2421280/13593180)
Jan 18 23:23:06 server1 llt: [ID 106513 kern.notice] LLT:10033: link 2 (ce1) node 1 expired
Jan 18 23:32:39 server1 genunix: [ID 408789 kern.notice] NOTICE: ce1: fault cleared external to device; service available
Jan 18 23:32:39 server1 genunix: [ID 451854 kern.notice] NOTICE: ce1: xcvr addr:0x01 - link up 1000 Mbps full duplex
Jan 18 23:32:42 server1 llt: [ID 465730 kern.notice] LLT:10024: link 2 (ce1) node 1 active
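
For anyone reading the excerpt above, the timeline can be worked out directly from the timestamps. The following is only a minimal sketch: it copies four of the log lines verbatim, assumes the year is 2008 (syslog does not record it), and the helper names `stamp` and `first_time` are made up for illustration.

```python
from datetime import datetime

# Four lines copied from the messages excerpt above.
LOG = """\
Jan 18 23:22:51 server1 genunix: [ID 451854 kern.warning ] WARNING: ce1: xcvr addr:0x01 - link down
Jan 18 23:23:06 server1 llt: [ID 106513 kern.notice] LLT:10033: link 2 (ce1) node 1 expired
Jan 18 23:32:39 server1 genunix: [ID 451854 kern.notice] NOTICE: ce1: xcvr addr:0x01 - link up 1000 Mbps full duplex
Jan 18 23:32:42 server1 llt: [ID 465730 kern.notice] LLT:10024: link 2 (ce1) node 1 active
"""


def stamp(line, year=2008):
    """Parse the leading syslog timestamp. The year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_time(marker, log=LOG):
    """Return the timestamp of the first log line containing `marker`."""
    for line in log.splitlines():
        if marker in line:
            return stamp(line)
    raise ValueError(f"no log line contains {marker!r}")


link_down = first_time("link down")
llt_expired = first_time("expired")
link_up = first_time("link up")

# Seconds between the NIC reporting link down and LLT expiring the link.
expiry_delay = (llt_expired - link_down).total_seconds()
# Seconds the physical link on ce1 stayed down.
outage = (link_up - link_down).total_seconds()

print(f"LLT expired the link {expiry_delay:.0f} s after link down")
print(f"ce1 was down for {outage:.0f} s")
```

Read this way, the excerpt shows LLT expiring the ce1 link shortly after the NIC reported link down, and the link coming back a little under ten minutes later.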


What could be the issue ?

springer
Level 4
Hi

I have no idea what your bb service group is made up of, apart from the disk group. It is in a partial state because it is waiting for the disk group to go offline; maybe an open process is holding on to the disk group. Run vxdg list to see whether the disk group is really on that server. If not, a probe and flush may be needed.

But why did it fail in the first place? Well, "ce1: fault detected" might be a clue. Is there a NIC resource in there? I bet there is, and if you have a hardware fault and VCS detects it, then it is doing its job.



springer
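
To make the "partial state" point concrete, here is a rough sketch that picks the unsettled items out of the `hastatus -sum` output from the first post. It assumes the whitespace-separated column layout shown there (`B` rows for groups, `F` rows for resources being offlined); the function name `stuck_items` is hypothetical and not part of VCS.

```python
# Rows copied from the hastatus -sum output in the original post.
HASTATUS = """\
B  ClusterService  server1            Y          N               OFFLINE
B  ClusterService  server2            Y          N               ONLINE
B  bb              server1            Y          N               STOPPING|PARTIAL
B  bb              server2            Y          N               OFFLINE
F  bb              DiskGroup       bbdg                 server1            W_OFFLINE_PROPAGATE
"""


def stuck_items(text):
    """Return (groups, resources) that are not in a settled state.

    groups:    (group, system, state) for B rows whose state is neither
               ONLINE nor OFFLINE.
    resources: (resource, system, istate) for every F row.
    """
    groups, resources = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        if fields[0] == "B" and fields[5] not in ("ONLINE", "OFFLINE"):
            groups.append((fields[1], fields[2], fields[5]))
        elif fields[0] == "F":
            # F row columns: marker, group, type, resource, system, istate
            resources.append((fields[3], fields[4], fields[5]))
    return groups, resources


groups, resources = stuck_items(HASTATUS)
for name, system, state in groups:
    print(f"group {name} on {system}: {state}")
for name, system, istate in resources:
    print(f"resource {name} on {system}: {istate}")
```

The group and resource this reports are the ones a probe (`hares -probe`) or flush (`hagrp -flush`) would be aimed at, if it turns out one is needed.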

Aerosmith1
Level 4
group bb (
        SystemList = { server1 = 0, server2 = 1 }
        AutoStartList = { server1 }
        )

        Application bbservice (
                User = root
                StartProgram = "/opt/VRTSvcs/bin/blackboard/online"
                StopProgram = "/opt/VRTSvcs/bin/blackboard/offline"
                CleanProgram = "/opt/VRTSvcs/bin/blackboard/clean"
                PidFiles = { "/usr/local/blackboard/logs/pid-files/httpd.pid",
                         "/usr/local/blackboard/logs/pid-files/modperl.pid",
                         "/usr/local/blackboard/logs/pid-files/tomcat.pid" }
                )

        DiskGroup bbdg (
                DiskGroup = bboradg
                )

        IP bb_ip (
                Device = ce1
                Address = "2.4.1.152"
                NetMask = "255.255.255.0"
                )

        Mount bb_mnt-bb (
                MountPoint = "/usr/local/blackboard"
                BlockDevice = "/dev/vx/dsk/bboradg/VOLBB"
                FSType = vxfs
                MountOpt = rw
                FsckOpt = "-y"
                )

        Mount bb_mnt-or (
                MountPoint = "/usr/local/oracle"
                BlockDevice = "/dev/vx/dsk/bboradg/VOL-OR"
                FSType = vxfs
                MountOpt = rw
                FsckOpt = "-y"
                )

        Mount bb_mnt-u01 (
                MountPoint = "/u01"
                BlockDevice = "/dev/vx/dsk/bboradg/VOL-U01"
                FSType = vxfs
                MountOpt = rw
                FsckOpt = "-y"
                )

        Mount bb_mnt-u02 (
                MountPoint = "/u02"
                BlockDevice = "/dev/vx/dsk/bboradg/VOL-U02"
                FSType = vxfs
                MountOpt = rw
                FsckOpt = "-y"
                )

        NIC bb_ce1 (
                Device = ce1
                NetworkType = ether
                )

        Oracle BB60 (
                Sid = BB60
                Owner = oracle
                Home = "/u01/app/oracle/product/9.2.0"
                Pfile = "/u01/app/oracle/admin/BB60/pfile/initBB60.ora"
                EnvFile = "/opt/VRTSvcs/bin/Oracle/envfile"
                )

        Sqlnet LISTENER (
                Owner = oracle
                Home = "/u01/app/oracle/product/9.2.0"
                TnsAdmin = "/u01/app/oracle/product/9.2.0/network/admin"
                Listener = LISTENER
                MonScript = "./bin/Sqlnet/LsnrTest.pl"
                EnvFile = "/opt/VRTSvcs/bin/Oracle/envfile"
                )

        Volume bb_vol-or (
                Volume = VOL-OR
                DiskGroup = bboradg
                )

        Volume bb_vol-u01 (
                Volume = VOL-U01
                DiskGroup = bboradg
                )

        Volume bb_vol-u02 (
                Volume = VOL-U02
                DiskGroup = bboradg
                )

        Volume bb_volbb (
                Volume = VOLBB
                DiskGroup = bboradg
                )

        BB60 requires bb_mnt-bb
        BB60 requires bb_mnt-or
        BB60 requires bb_mnt-u01
        BB60 requires bb_mnt-u02
        LISTENER requires BB60
        LISTENER requires bb_ip
        bb_ip requires bb_ce1
        bb_mnt-bb requires bb_volbb
        bb_mnt-or requires bb_vol-or
        bb_mnt-u01 requires bb_vol-u01
        bb_mnt-u02 requires bb_vol-u02
        bb_vol-or requires bbdg
        bb_vol-u01 requires bbdg
        bb_vol-u02 requires bbdg
        bb_volbb requires bbdg
        bbservice requires LISTENER


        // resource dependency tree
        //
        //      group bb
        //      {
        //      Application bbservice
        //          {
        //          Sqlnet LISTENER
        //              {
        //              Oracle BB60
        //                  {
        //                  Mount bb_mnt-bb
        //                      {
        //                      Volume bb_volbb
        //                          {
        //                          DiskGroup bbdg
        //                          }
        //                      }
        //                  Mount bb_mnt-or
        //                      {
        //                      Volume bb_vol-or
        //                          {
        //                          DiskGroup bbdg
        //                          }
        //                      }
        //                  Mount bb_mnt-u01
        //                      {
        //                      Volume bb_vol-u01
        //                          {
        //                          DiskGroup bbdg
        //                          }
        //                      }
        //                  Mount bb_mnt-u02
        //                      {
        //                      Volume bb_vol-u02
        //                          {
        //                          DiskGroup bbdg
        //                          }
        //                      }
        //                  }
        //              IP bb_ip
        //                  {
        //                  NIC bb_ce1
        //                  }
        //              }
        //          }
        //      }
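
The `requires` lines above fix the order in which resources can start and stop: VCS brings a child online before its parent and takes the parent offline before its child. The sketch below derives one valid order from those lines with Python's standard `graphlib` module (Python 3.9 or later). It is an illustration of the dependency tree only, not of how the VCS engine schedules work, and `parse_requires` is a made-up helper.

```python
from graphlib import TopologicalSorter

# Dependency lines copied from the main.cf above.
REQUIRES = """\
BB60 requires bb_mnt-bb
BB60 requires bb_mnt-or
BB60 requires bb_mnt-u01
BB60 requires bb_mnt-u02
LISTENER requires BB60
LISTENER requires bb_ip
bb_ip requires bb_ce1
bb_mnt-bb requires bb_volbb
bb_mnt-or requires bb_vol-or
bb_mnt-u01 requires bb_vol-u01
bb_mnt-u02 requires bb_vol-u02
bb_vol-or requires bbdg
bb_vol-u01 requires bbdg
bb_vol-u02 requires bbdg
bb_volbb requires bbdg
bbservice requires LISTENER
"""


def parse_requires(text):
    """Map each resource to the set of resources it requires (its children)."""
    deps = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "requires":
            parent, _, child = parts
            deps.setdefault(parent, set()).add(child)
            deps.setdefault(child, set())
    return deps


deps = parse_requires(REQUIRES)
# static_order() yields each resource only after everything it requires.
online_order = list(TopologicalSorter(deps).static_order())
# Offlining runs the other way: parents first, children last.
offline_order = online_order[::-1]

print("one valid online order: ", " -> ".join(online_order))
print("one valid offline order:", " -> ".join(offline_order))
```

In this tree bbdg sits underneath every volume and mount, so it can only go offline after all of them have, which is consistent with it being the resource still waiting in the hastatus output.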


vxdg list
NAME         STATE           ID
rootdg       enabled  1069283981.1025.server1
bboradg      enabled  1069366553.1248.server1

I know the network guys were working on something, and I think they restarted something, because I saw Nagios emailing me about services being down on other standalone servers that are not part of the cluster, but those services came back online after some time.


RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified
Did you resolve the issue?
 
I'm a little worried about something I saw in your post.
 
In the messages you posted, LLT complains about ce1 having gone down / lost connection. In your main.cf, ce1 is listed as the NIC in use by the service group.
 
You should not be using ce1 for LLT high-priority links as well as public communications.
 
Just a thought.
 
R

Aerosmith1
Level 4
Hi Riaan,

The issue was resolved that night.

Actually, to resolve it I force-stopped the cluster and mounted all the partitions/volumes that were unmounted (I saw they were not there). After mounting, I checked and 4 resources were already offline, so I took them online one by one.

Then I ran hastart on both nodes; fortunately, that solved the issue.

Regarding your concern, thank you very much.

Can you give me more info? If you want to know more, I can provide more details, and we can see whether I can benefit from our discussion.

I understood from the network guys that the links were connected to the core network switch, which they generally never take offline. Still, it caused a lot of problems for sure. I don't know why the disks remained unmounted even after the link was restored.

Thanks



Message Edited by Aerosmith1 on 01-29-2008 08:39 AM

RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified
Can you post the /etc/llttab files from all nodes in the cluster, as well as the main.cf?

Aerosmith1
Level 4
@server1
cat /etc/llttab
set-node server1
set-cluster 1
link qfe0 /dev/qfe:0 - ether - -
link ce2 /dev/ce:2 - ether - -
link-lowpri ce1 /dev/ce:1 - ether - -
start

@server2
cat /etc/llttab
set-node server2
set-cluster 1
link qfe0 /dev/qfe:0 - ether - -
link ce2 /dev/ce:2 - ether - -
link-lowpri ce1 /dev/ce:1 - ether - -
start


main.cf @ server1; I guess this will be the same as main.cf @ server2:

h**p://pastebin.ca/879776

I could not paste it here; the message size goes beyond the limit.

Thank you

RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified
Hi,
 
Thanks, you don't have to be concerned. The ce1 interface is configured as a low-priority LLT link, so it was quite all right for it to go down while the network engineers were working on the LAN.
 
Have a nice day!
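
For reference, the check described here can be sketched as a small script: read the `link` and `link-lowpri` directives from /etc/llttab and compare them with the devices the service group uses for public traffic. The llttab text is copied from the post above; `PUBLIC_DEVICES` is filled in by hand from the `Device = ce1` attributes in main.cf, and `llt_links` is a made-up helper name.

```python
# /etc/llttab as posted for server1 (server2 has the same link lines).
LLTTAB = """\
set-node server1
set-cluster 1
link qfe0 /dev/qfe:0 - ether - -
link ce2 /dev/ce:2 - ether - -
link-lowpri ce1 /dev/ce:1 - ether - -
start
"""

# Devices used for public traffic, taken by hand from main.cf
# (NIC bb_ce1 and IP bb_ip both have Device = ce1).
PUBLIC_DEVICES = {"ce1"}


def llt_links(text):
    """Group LLT link tags by directive: 'link' (high priority) or 'link-lowpri'."""
    links = {"link": [], "link-lowpri": []}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in links:
            links[fields[0]].append(fields[1])
    return links


links = llt_links(LLTTAB)
shared_high = sorted(PUBLIC_DEVICES & set(links["link"]))
shared_low = sorted(PUBLIC_DEVICES & set(links["link-lowpri"]))

print("high-priority LLT links:", links["link"])
print("public NICs that are also high-priority LLT links:", shared_high)
print("public NICs that are also low-priority LLT links:", shared_low)
```

With the files posted in this thread, the public interface only shows up under `link-lowpri`, which matches the conclusion above.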

Aerosmith1
Level 4
Thanks, Riaan and others, for looking into my issue.


Message Edited by Aerosmith1 on 01-30-2008 09:01 AM