Solved: Faulty Memory Replacement on one of the nodes of Oracle RAC Cluster

Zubair_mohammed · 11-22-2009

Hi,

I would appreciate your help reviewing the plan below for replacing a memory module on one node of a RAC cluster. It's a 3-node cluster running 9 service groups (SGs) in ACTIVE/ACTIVE mode.

# /etc/vx/bin/vxclustadm nidmap [Identify the master node]

# haconf -makerw [On the master node]

# hasys -freeze -persistent <nodename> [Freeze the systems that do not have the faulty memory]

# Log in to the node [which has the faulty memory].

# hagrp -list [SGs that are ONLINE on the node with the faulty memory]

# haconf -dump -makero

# hagrp -offline <SG> -sys <nodename>

# hastop -local

# /sbin/gabconfig -a [Ports h, v and w should be stopped]

# /opt/VRTSvcs/rac//uload_drv

# /sbin/vxfenconfig -U
# /sbin/vcsmmconfig -U

# /sbin/lmxconfig -U

# /sbin/gabconfig -a

# modinfo | egrep "lmx|vxfen|vcsmm" [Determine the module IDs for VCSMM, I/O fencing, and LMX]

# modunload -i <ID>
# modunload -i <ID>
# modunload -i <ID>

# /sbin/gabconfig -U

# /sbin/lltconfig -U

# modinfo | egrep "gab|llt"
# modunload -i <ID>
# modunload -i <ID>

# shutdown -g0 -y -i0

# su - sms-svc
$ setkeyswitch -d <> OFF

Hand the box over for memory module replacement.

$ setkeyswitch -d <> ON [Upon confirmation that the memory module was replaced successfully]

Once the system is running in multiuser mode, verify that the LLT, GAB, LMX, VXFEN and VCSMM drivers have been loaded.

# haconf -makerw

# hasys -unfreeze -persistent <nodename>

# hagrp -online <SG> -sys <nodename>

# hastatus -sum
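
The two checks in the plan (ports h, v and w no longer listed by gabconfig -a before the modules are unloaded, and all five drivers present in modinfo after the reboot) could be scripted roughly as below. This is only a sketch: the helper names ports_stopped and drivers_loaded are made up for illustration, and it assumes that gabconfig -a prints one "Port <letter> gen ... membership ..." line per active port and that modinfo shows each driver under its lowercase module name.

```shell
#!/bin/sh
# Hypothetical helpers, not part of VCS or SF Oracle RAC.

# Reads `gabconfig -a` output on stdin. Succeeds only if none of the
# given ports (default: h v w) still has a membership line.
ports_stopped() {
    ports=${*:-"h v w"}
    input=$(cat)
    for p in $ports; do
        if printf '%s\n' "$input" | grep "^Port $p " >/dev/null; then
            echo "port $p still has membership" >&2
            return 1
        fi
    done
    return 0
}

# Reads `modinfo` output on stdin. Succeeds only if every driver named
# in the plan appears as a whole word (assumed lowercase module names).
drivers_loaded() {
    input=$(cat)
    for d in llt gab lmx vxfen vcsmm; do
        if ! printf '%s\n' "$input" | grep -w "$d" >/dev/null; then
            echo "driver $d not loaded" >&2
            return 1
        fi
    done
    return 0
}
```

Usage would be along the lines of "/sbin/gabconfig -a | ports_stopped" after hastop -local, and "modinfo | drivers_loaded" once the node is back in multiuser mode.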

Regards,
Zubair

Gaurav_S · 11-22-2009

Hello Zubair,

The plan is OK; however, I would like to highlight a couple of things:

a) On the faulty node you are offlining the groups. Don't you want to switch them to the other active nodes to avoid downtime?
b) I don't see any harm in freezing the whole cluster rather than just the nodes without faulty memory. Please note that on a system you have not frozen, VCS is free to take any action. You can offline a service group even after freezing the system; I guess that should be possible. Try it out in the Simulator, it works.
c) While stopping the stack, I don't see a line to stop ODM. You might want to include that.
d) The order of unconfiguring the modules doesn't seem correct. Fencing should come last, just before GAB. Consider unconfiguring vcsmm, ODM and LMX first, then move on to fencing once all the others are closed.
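
To make point d) concrete, here is a rough dry-run sketch of the order I mean. By default it only prints the commands, so the sequence can be reviewed before anything is touched. The step and teardown helpers and the RUN switch are made up for illustration, and the ODM stop command (/etc/init.d/odm stop) is an assumption about your Solaris install, so please check it against your own setup.

```shell
#!/bin/sh
# Dry-run sketch of the suggested unconfigure order.
# RUN=1 would execute the commands instead of printing them (hypothetical switch).
RUN=${RUN:-0}

step() {
    if [ "$RUN" = "1" ]; then
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    else
        echo "$*"
    fi
}

teardown() {
    step /sbin/vcsmmconfig -U      # VCSMM first
    step /etc/init.d/odm stop      # assumption: ODM stop script location
    step /sbin/lmxconfig -U        # LMX
    step /sbin/vxfenconfig -U      # fencing last of the RAC modules
    step /sbin/gabconfig -U        # then GAB
    step /sbin/lltconfig -U        # and finally LLT
}

teardown
```

The modunload steps from the original plan would follow the same order, after each module has been unconfigured.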

Hope this helps.

Gaurav
