
VCS cannot start up

Home_224
Level 6

Hi All,

The environment is configured as a two-node active/passive cluster. I had maintenance on the active node and switched over to the passive node to online the cluster, but when I check the status it is PARTIAL, and I have no idea what has happened. Can you please advise how to fix it?

root@devuaebms42 # gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 63750d membership 01
Port h gen 63750b membership ;1
Port h gen 63750b visible 0

root@devuaebms42 # hastatus -sum

-- SYSTEM STATE
-- System                State          Frozen

A  devuaebms41           EXITED         0
A  devuaebms42           RUNNING        0

-- GROUP STATE
-- Group          System         Probed   AutoDisabled   State

B  cf_bms_sg_01   devuaebms41    Y        Y              OFFLINE
B  cf_bms_sg_01   devuaebms42    Y        N              PARTIAL

Many thanks,

Hong


Marianne
Moderator
Partner    VIP    Accredited Certified

@Home_224 

I have moved your post from the NetBackup forum to the Cluster forum.

Even if this is a NetBackup Cluster, you appear to have a cluster-related issue.

Resource(cf_bms_ser_01) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 2 of 2) the resource.

Usually when we see 'the resource became OFFLINE unexpectedly, on its own', it is because 'someone' is manually stopping or restarting processes outside of VCS. 

Can you share exactly which steps were followed before the node was switched off for maintenance, as well as afterwards?

It seems that 'someone' manually started NetBackup on node devuaebms42, instead of properly onlining it with VCS: 

2020/02/11 20:47:09 VCS INFO V-16-1-10299 Resource cf_bms_ser_01 (Owner: unknown, Group: cf_bms_sg_01) is online on devuaebms42 (Not initiated by VCS) 

and then stopped it again?

2020/02/11 20:49:26 VCS ERROR V-16-2-13067 (devuaebms42) Agent is calling clean for resource(cf_bms_ser_01) because the resource became OFFLINE unexpectedly, on its own.
2020/02/11 20:50:28 VCS ERROR V-16-2-13006 (devuaebms42) Resource(cf_bms_ser_01): clean procedure did not complete within the expected time.

Can you show us the full 'hastatus' output?

This will show which resources in the SG are online and which ones are offline (causing the Partial status).
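For reference, the status and the recent engine log entries can typically be gathered with standard VCS commands along these lines (group/resource names are taken from your configuration; the log path is the usual VCS default, so adjust if yours differs):

root@devuaebms42 # hastatus -sum
root@devuaebms42 # hagrp -state cf_bms_sg_01
root@devuaebms42 # hares -state | grep cf_bms
root@devuaebms42 # tail -100 /var/VRTSvcs/log/engine_A.log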

Hi Marianne ,

Thank you for your comment.

I will show the hastatus output here when I am back in the office tomorrow.

Marianne
Moderator
Partner    VIP    Accredited Certified

I am curious to see how this SG is configured.

I see no attempt in the EngineA log to import the dg, mount volumes or online virtual IP.

I only see the NBU resource going online.

So it is important to know what other resources are in the SG and what their status is.

root@devuaebms42 # hastatus
attempting to connect....connected

group resource system message
--------------- -------------------- -------------------- --------------------
devuaebms41 RUNNING
devuaebms42 RUNNING
cf_bms_sg_01 devuaebms41 *FAULTED* OFFLINE
cf_bms_sg_01 devuaebms42 *FAULTED* OFFLINE
-------------------------------------------------------------------------

cf_bms_dg_01 devuaebms41 OFFLINE
cf_bms_dg_01 devuaebms42 *FAULTED*
cf_bms_ipmnicb_01 devuaebms41 OFFLINE
cf_bms_ipmnicb_01 devuaebms42 OFFLINE
cf_bms_mount_01 devuaebms41 OFFLINE

-------------------------------------------------------------------------
cf_bms_mount_01 devuaebms42 OFFLINE
CFgroup devuaebms41 ONLINE
CFgroup devuaebms42 ONLINE
cf_bms_ser_01 devuaebms41 OFFLINE
cf_bms_ser_01 devuaebms42 OFFLINE

-------------------------------------------------------------------------
cf_bms_vol_01 devuaebms41 *FAULTED*
cf_bms_vol_01 devuaebms42 OFFLINE

root@devuaebms42 # more main.cf
include "types.cf"
include "/usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf"

cluster devuaebms (
    UserNames = { admin = eLMeLGlIMhMMkUMgLJ }
    Administrators = { admin }
    CredRenewFrequency = 0
    )

system devuaebms41 (
    )

system devuaebms42 (
    )

group cf_bms_sg_01 (
    SystemList = { devuaebms41 = 0, devuaebms42 = 1 }
    AutoStartList = { devuaebms41, devuaebms42 }
    )

DiskGroup cf_bms_dg_01 (
    DiskGroup = cf_bms_dg_01
    StartVolumes = 0
    )

IPMultiNICB cf_bms_ipmnicb_01 (
    BaseResName = CFgroup
    Address = "10.26.144.153"
    NetMask = "255.255.255.192"
    )

Mount cf_bms_mount_01 (
    MountPoint = "/opt/VRTSnbu"
    BlockDevice = "/dev/vx/dsk/cf_bms_dg_01/cf_bms_vol_01"
    FSType = vxfs
    FsckOpt = "-y"
    )

MultiNICB CFgroup (
    UseMpathd = 1
    MpathdCommand = "/usr/lib/inet/in.mpathd -a"
    Device = { bge0 = 0, bge1 = 1 }
    NetworkHosts = { "10.26.144.129" }
    DefaultRouter = "10.26.144.129"
    )

NetBackup cf_bms_ser_01 (
    ServerType = NBUMaster
    )

Volume cf_bms_vol_01 (
    Volume = cf_bms_vol_01
    DiskGroup = cf_bms_dg_01
    )

cf_bms_ipmnicb_01 requires CFgroup
cf_bms_mount_01 requires cf_bms_vol_01
cf_bms_ser_01 requires cf_bms_ipmnicb_01
cf_bms_ser_01 requires cf_bms_mount_01
cf_bms_vol_01 requires cf_bms_dg_01

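For readability, the 'requires' lines at the bottom of this main.cf translate into the following resource dependency tree (parent requires child; VCS onlines children before parents, so the DiskGroup comes up first and NetBackup last):

cf_bms_ser_01 (NetBackup)
|-- cf_bms_mount_01 (Mount)
|   `-- cf_bms_vol_01 (Volume)
|       `-- cf_bms_dg_01 (DiskGroup)
`-- cf_bms_ipmnicb_01 (IPMultiNICB)
    `-- CFgroup (MultiNICB)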

Marianne
Moderator
Partner    VIP    Accredited Certified

@Home_224 

Have you tried to troubleshoot the faulted resources? 

Please bear in mind that VCS is merely reacting to underlying system errors.

You need to trace the chain of events to find out why the diskgroup is faulted on one system and the volume on the other node.
Check the OS messages files and all cluster logs.

cf_bms_dg_01 devuaebms42 *FAULTED*

cf_bms_vol_01 devuaebms41 *FAULTED*

You first need to confirm that the disks are visible at OS level on both nodes, then clear the faults.

When you are confident that all is fine, first try to online the dg resource on the node where you wish NBU to run (it seems both nodes are up now).

If the diskgroup has imported fine, try to online the volume (on the same node).

If the volume comes online fine, then online the rest of the SG, or else keep onlining the resources one by one (Mount next, then IP, then NBU); see the command sketch below.
Check and validate each resource before you online the next one.

Please speak to your team members - ensure they understand the basic principles of a cluster, like resource dependencies, and that applications (such as NetBackup) should never be manually started or stopped outside of VCS.
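As a rough sketch of that sequence, using the resource and node names from the main.cf above (substitute the node you actually want NBU to run on, and only proceed once the underlying disk/zoning problem is understood):

root@devuaebms42 # hares -clear cf_bms_dg_01 -sys devuaebms42
root@devuaebms42 # hares -clear cf_bms_vol_01 -sys devuaebms41
root@devuaebms42 # hares -online cf_bms_dg_01 -sys devuaebms42
root@devuaebms42 # vxdg list
root@devuaebms42 # hares -online cf_bms_vol_01 -sys devuaebms42
root@devuaebms42 # hares -online cf_bms_mount_01 -sys devuaebms42
root@devuaebms42 # hares -online cf_bms_ipmnicb_01 -sys devuaebms42
root@devuaebms42 # hares -online cf_bms_ser_01 -sys devuaebms42

Run 'hastatus -sum' (or 'hares -state') between steps and stop if anything faults again rather than pushing on. ('hagrp -clear cf_bms_sg_01' would clear all faulted resources in the group in one go.)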

Marianne
Moderator
Partner    VIP    Accredited Certified

@Home_224 

I see that you have been online this morning.

Could you please give us feedback? 

Hi Marianne,

I checked that both nodes can see the disks, but the diskgroup does not show up in vxprint or vxdg list.

root@devuaebms42 # vxdisk -eo alldgs list
DEVICE TYPE DISK GROUP STATUS OS_NATIVE_NAME
Disk_0 auto - - online c1t0d0s2
Disk_3 auto - - online c1t1d0s2
EMC0_0 auto - - error emcpower99c
EMC0_1 auto - - error emcpower98c
EMC0_4 auto - - error emcpower95c
EMC0_5 auto - - error emcpower94c
EMC0_6 auto - - error emcpower93c
EMC0_7 auto - - error emcpower92c
EMC0_8 auto - - error emcpower91c
EMC0_9 auto - - error emcpower90c
EMC0_10 auto - - error emcpower89c
EMC0_14 auto - (db3_datadg) online emcpower85c
EMC0_15 auto - (db3_datadg) online emcpower84c
EMC0_16 auto - (db3_datadg) online emcpower83c
EMC0_18 auto - (db3_datadg) online emcpower81c
EMC0_19 auto - - error emcpower80c
EMC0_20 auto - - error emcpower79c
EMC0_22 auto - - error emcpower77c
EMC0_23 auto - - error emcpower76c
EMC0_24 auto - - error emcpower75c
EMC0_25 auto - - error emcpower74c
EMC0_26 auto - - error emcpower73c
EMC0_27 auto - - error emcpower72c
EMC0_28 auto - - error emcpower71c
EMC0_29 auto - - error emcpower70c
EMC0_30 auto - - error emcpower69c
EMC0_34 auto - (mirrdg) online emcpower65c
EMC0_36 auto - (mirrdg) online emcpower63c
EMC0_38 auto - (share_testdg) online emcpower61c
EMC0_43 auto - (cfs2_datadg) online emcpower56c
EMC0_44 auto - (cfs2_datadg) online emcpower55c
EMC0_47 auto - (cfs2_datadg) online emcpower52c
EMC0_50 auto - (cfs2_datadg) online emcpower49c
EMC0_51 auto - (cfs2_datadg) online emcpower48c
EMC0_52 auto - (cfs2_datadg) online emcpower47c
EMC0_53 auto - (cfs2_datadg) online emcpower46c
EMC0_54 auto - - error 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
EMC0_55 auto - - online emcpower44c
EMC0_56 auto - - online emcpower43c
EMC0_57 auto - - online emcpower42c
EMC0_59 auto - - online emcpower40c
EMC0_61 auto - - online emcpower38c
EMC0_62 auto - - online emcpower37c
EMC0_69 auto - - error emcpower168c
EMC0_70 auto - - error emcpower167c
EMC0_72 auto - - error emcpower165c
EMC0_73 auto - - error emcpower164c
EMC0_76 auto - - error emcpower161c
EMC0_79 auto - - error emcpower158c
EMC0_81 auto - - error emcpower156c
EMC0_85 auto - - error emcpower152c
EMC0_86 auto - - error emcpower151c
EMC0_87 auto - - error emcpower150c
EMC0_89 auto - - error emcpower148c
EMC0_97 auto - - error emcpower140c
EMC0_101 auto - - error emcpower136c
EMC0_102 auto - - error emcpower135c
EMC0_103 auto - - error emcpower134c
EMC0_104 auto - - error emcpower133c
EMC0_105 auto - - error emcpower132c
EMC0_106 auto - - error emcpower131c
EMC0_107 auto - - error emcpower130c
EMC0_108 auto - - error emcpower129c
EMC0_109 auto - - error emcpower128c
EMC0_110 auto - - error emcpower127c
EMC0_111 auto - - error emcpower126c
EMC0_112 auto - - error emcpower125c
EMC0_113 auto - - error emcpower124c
EMC0_114 auto - - error emcpower123c
EMC0_121 auto - - error emcpower116c
EMC0_122 auto - - error emcpower115c
EMC0_123 auto - - error emcpower114c
EMC0_124 auto - - error emcpower113c
EMC0_125 auto - - error emcpower112c
EMC0_126 auto - - error emcpower111c
EMC0_127 auto - - error emcpower110c
EMC0_128 auto - - error emcpower109c
EMC0_129 auto - - error emcpower108c
EMC0_130 auto - - error emcpower107c
EMC0_131 auto - - error emcpower106c
EMC0_132 auto - - error emcpower105c
EMC0_133 auto - - error emcpower104c
EMC0_134 auto - - error emcpower103c
EMC0_135 auto - - error emcpower102c
EMC0_136 auto - - error emcpower101c
EMC0_137 auto - - error emcpower100c
EMC1_10 auto - (db2_datadg) online emcpower15c
EMC1_13 auto - (db1_datadg) online emcpower32c
EMC1_14 auto - (db1_datadg) online emcpower33c
EMC1_16 auto - (db1_datadg) online emcpower35c
EMC1_18 auto - (db1_datadg) online emcpower21c
EMC1_19 auto - (db1_datadg) online emcpower22c
EMC1_20 auto - (db1_datadg) online emcpower23c
EMC1_21 auto - (db1_datadg) online emcpower24c
EMC1_22 auto - (db1_datadg) online emcpower25c
EMC1_23 auto - (db1_datadg) online emcpower26c
EMC1_24 auto - (db1_datadg) online emcpower27c
EMC1_25 auto - (db1_datadg) online emcpower28c
EMC1_26 auto - (db1_datadg) online emcpower29c
EMC1_27 auto - (db1_datadg) online emcpower30c
EMC1_28 auto - (db3_datadg) online emcpower7c
EMC1_29 auto - (db1_datadg) online emcpower34c
EMC1_30 auto - (cf_bms_dg_01) online emcpower2c

I checked the main.cf; cf_bms_dg_01 is the diskgroup used by this server's service group, and I found other diskgroups that should not be attached to this server. Something may be wrong with the zoning to the server, and that may be why VCS fails to start up.
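If it helps, each device showing 'error' status can be inspected in more detail. EMC0_0 below is just one of the errored LUNs from the listing above, and the powermt command applies only if EMC PowerPath is in use (which the emcpower device names suggest):

root@devuaebms42 # vxdisk list EMC0_0
root@devuaebms42 # powermt display dev=all | more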

Marianne
Moderator
Partner    VIP    Accredited Certified
(Accepted Solution)

The NBU dg is visible on devuaebms42 (in deported state) - check the bottom of the list:

EMC1_30 auto - (cf_bms_dg_01) online emcpower2c

If this server is an NBU master only, you need to find out why so many LUNs have been zoned to this environment.
Looks like a disaster waiting to happen....
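To confirm: a diskgroup name shown in parentheses by 'vxdisk -eo alldgs list' is known to the host but deported (not imported), which is why it does not appear in 'vxdg list'. Something along these lines should verify it, with the import then done through VCS rather than manually (names per the main.cf above):

root@devuaebms42 # vxdisk -eo alldgs list | grep cf_bms_dg_01
root@devuaebms42 # vxdg list
root@devuaebms42 # hares -clear cf_bms_dg_01 -sys devuaebms42
root@devuaebms42 # hagrp -online cf_bms_sg_01 -sys devuaebms42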