Faulty Memory Replacement on One Node of an Oracle RAC Cluster

Zubair_mohammed

Hi,

I would appreciate your help reviewing the following plan for replacing a faulty memory module on one node of a RAC cluster. It's a 3-node cluster running 9 service groups (SGs) in ACTIVE/ACTIVE mode.


# /etc/vx/bin/vxclustadm nidmap   [Identify the Master Node]

# haconf -makerw     [On Master node]

# hasys -freeze -persistent <nodename>  [Freeze the systems which do not have faulty Memory]

# Log in to the node that has the faulty memory.

# hagrp -list      [SGs that are ONLINE on the faulty-memory node]

# haconf -dump -makero

# hagrp -offline <SG> -sys <nodename>

# hastop -local

# /sbin/gabconfig -a     [Ports h, v and w should be stopped]

# /opt/VRTSvcs/rac//uload_drv

# /sbin/vxfenconfig -U
# /sbin/vcsmmconfig -U

# /sbin/lmxconfig -U

# /sbin/gabconfig -a

# modinfo | egrep "lmx|vxfen|vcsmm"  [Determine the module IDs for VCSMM, I/O fencing, and LMX]

# modunload -i <ID>
# modunload -i <ID>
# modunload -i <ID>

# /sbin/gabconfig -U

# /sbin/lltconfig -U

# modinfo | egrep "gab|llt"
# modunload -i <ID>
# modunload -i <ID>

# shutdown -g0 -y -i0

# su - sms-svc
$ setkeyswitch -d <> OFF

Hand over the box for memory module replacement.

$ setkeyswitch -d <> ON     [Upon confirmation of successful replacement of the memory module]

Cross-verify that the LLT, GAB, LMX, VXFEN and VCSMM drivers have been loaded once the system is running in multiuser mode.

# haconf -makerw

# hasys -unfreeze -persistent <nodename>

# hagrp -online <SG> -sys <nodename>

# hastatus -sum

Regards,
Zubair
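
The "ports h, v and w should be stopped" check in the plan can be scripted rather than eyeballed. Below is a minimal sketch, assuming `gabconfig -a` prints one `Port <letter> gen ... membership ...` line per configured port. The helper name `stale_ports` is hypothetical, and the sample text is illustrative only, not captured from this cluster.

```shell
#!/bin/sh
# stale_ports: hypothetical helper. Reads `gabconfig -a` output on stdin
# and prints each of ports h, v, w that still has a membership line.
stale_ports() {
    awk '$1 == "Port" && ($2 == "h" || $2 == "v" || $2 == "w") { print $2 }'
}

# Illustrative sample only (assumed output layout, made-up generation numbers).
sample='GAB Port Memberships
===============================================================
Port a gen   ada401 membership 012
Port b gen   ada40d membership 012
Port h gen   ada41c membership 012'

left=$(printf '%s\n' "$sample" | stale_ports)
if [ -n "$left" ]; then
    # $left is left unquoted on purpose so the port letters join on one line.
    echo "still configured:" $left
else
    echo "ports h, v and w are down"
fi
```

On the node itself the same helper would be fed live output, for example `/sbin/gabconfig -a | stale_ports`, and the unconfigure steps would continue only once it prints nothing.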

1 ACCEPTED SOLUTION

Gaurav_S
Moderator
Hello Zubair,

The plan looks OK; however, there are a couple of things I would like to highlight:

a) On the faulty node you are offlining the groups. Don't you want to switch them to the other active nodes to avoid downtime?
b) I don't see any harm in freezing the whole cluster rather than just the nodes without faulty memory. Please note that on a system you have not frozen, VCS is still free to take any action. You should be able to offline a service group even after freezing the system; I guess that should be possible, so try it out in the Simulator first.

c) While stopping the stack, I don't see a step to stop ODM. You might want to include that.
d) The order of unconfiguring the modules doesn't seem correct. Fencing should come last, just before GAB. Consider unconfiguring VCSMM, ODM and LMX first, then move on to fencing once all the others are closed.


Hope this helps.

Gaurav
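
To make point (d) concrete, here is a dry-run sketch of the reordered unconfigure sequence. It only prints the steps and executes nothing. The `*config -U` commands are the ones from the original plan. The helper name `teardown_steps` is hypothetical. The ODM stop command is not given in this thread, so it is left as a placeholder. The relative order of VCSMM, ODM and LMX simply follows the order they are listed in above; the only constraints stated are that all three come before fencing and that fencing comes just before GAB.

```shell
#!/bin/sh
# teardown_steps: hypothetical helper that prints the suggested order.
# Nothing is executed; this is a checklist, not a runbook.
teardown_steps() {
    cat <<'EOF'
/sbin/vcsmmconfig -U
<stop ODM: command not given in this thread>
/sbin/lmxconfig -U
/sbin/vxfenconfig -U
/sbin/gabconfig -U
/sbin/lltconfig -U
EOF
}

# Print the steps numbered so they can be ticked off during the change window.
teardown_steps | awk '{ printf "%d. %s\n", NR, $0 }'
```

Keeping the sequence in one place like this makes it easy to review the order before the maintenance window, and `gabconfig -a` can still be run between steps as in the original plan.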
