Solved: VCS WARNING V-16-1-53025 Agent <AGENT> has faulted; ipm connection was lost; restarting the agent

bsobek · 08-20-2010

Hi,

we have a problem with 2 agents.
2010/08/20 12:59:57 VCS WARNING V-16-1-53025 Agent Application has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:00:19 VCS WARNING V-16-1-53025 Agent Netlsnr has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:05:07 VCS WARNING V-16-1-53025 Agent Application has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:05:29 VCS WARNING V-16-1-53025 Agent Netlsnr has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:10:08 VCS WARNING V-16-1-53025 Agent Application has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:10:30 VCS WARNING V-16-1-53025 Agent Netlsnr has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:15:19 VCS WARNING V-16-1-53025 Agent Application has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:15:41 VCS WARNING V-16-1-53025 Agent Netlsnr has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:20:21 VCS WARNING V-16-1-53025 Agent Application has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:20:21 VCS ERROR V-16-1-10009 Agent Application has faulted 6 times in less than 1800 seconds -- Will not attempt to restart. Correct the problem and use haagent -start to start the agent
2010/08/20 13:22:11 VCS ERROR V-16-1-10009 Agent Netlsnr has faulted 6 times in less than 1800 seconds -- Will not attempt to restart. Correct the problem and use haagent -start to start the agent
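
A quick way to see how often each agent was restarted is to count the V-16-1-53025 warnings per agent. This is only a minimal sketch: count_agent_faults is an illustrative name, and the log path in the last comment is the usual default location, which is an assumption because the thread does not state it.

```shell
# Count "ipm connection was lost" restart warnings per agent.
# Reads engine log lines on stdin, prints "<agent> <count>" sorted by agent.
count_agent_faults() {
    awk '/V-16-1-53025/ {
             # the agent name is the word that follows "Agent"
             for (i = 1; i < NF; i++)
                 if ($i == "Agent") { n[$(i + 1)]++; break }
         }
         END { for (a in n) print a, n[a] }' | sort
}

# Example with three lines taken from the log excerpt above
count_agent_faults <<'EOF'
2010/08/20 12:59:57 VCS WARNING V-16-1-53025 Agent Application has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:00:19 VCS WARNING V-16-1-53025 Agent Netlsnr has faulted; ipm connection was lost; restarting the agent
2010/08/20 13:05:07 VCS WARNING V-16-1-53025 Agent Application has faulted; ipm connection was lost; restarting the agent
EOF

# On a cluster node (assumed default log location):
# count_agent_faults < /var/VRTSvcs/log/engine_A.log
```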

We have 411 Netlsnr resources and 234 Application resources configured in our cluster. We set NumThreads for both agents to 50 and AgentReplyTimeout to 300.
Both agents fault on only one of the 3 cluster nodes, although the load on the other systems is higher than on the faulty node.
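
To get a rough feel for the load per agent, these figures can be turned into a time budget per monitor call. This is a back-of-envelope sketch and not a description of how VCS actually schedules its threads: it assumes the monitor calls are spread evenly over the threads, and it uses the MonitorInterval of 60 seconds shown in the hatype output later in this thread. The function name monitor_budget is made up for illustration.

```shell
# Average time (whole seconds, rounded down) one monitor call may take
# if all resources are to be monitored once per interval.
# usage: monitor_budget <resources> <threads> <interval_seconds>
monitor_budget() {
    resources=$1
    threads=$2
    interval=$3
    echo $(( interval * threads / resources ))
}

echo "Netlsnr:     $(monitor_budget 411 50 60) s per monitor call"
echo "Application: $(monitor_budget 234 50 60) s per monitor call"
```

If the real monitor calls take longer than this on average, the agent falls behind no matter how idle the node looks overall.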

Does anybody have an idea how to solve the problem?

Thanks!

Regards
Björn

Gaurav_S · 08-20-2010

This looks like very heavy use of the agents. Can you provide the following:

-- What are the VCS version and the OS version?
-- The output of hatype -display Netlsnr
-- When these agents fault, what is the average load on the server? Is the load on memory or on CPU?

Gaurav

bsobek · 08-23-2010

Hi Gaurav,

# pkginfo -l VRTSvcs
   PKGINST: VRTSvcs
      NAME: Veritas Cluster Server by Symantec
  CATEGORY: system
      ARCH: sparc
   VERSION: 5.1
   BASEDIR: /
    VENDOR: Symantec Corporation
      DESC: Veritas Cluster Server by Symantec
    PSTAMP: 5.1.001.004-5.1RP1HF4-2010-06-21_02.42.49
  INSTDATE: Jul 02 2010 10:11
    STATUS: completely installed
     FILES:      280 installed pathnames
                  27 shared pathnames
                   4 linked files
                  59 directories
                 102 executables
              233894 blocks used (approx)

# uname -a
SunOS gf01sxdb101t 5.10 Generic_142900-13 sun4u sparc SUNW,SPARC-Enterprise

# hatype -display Netlsnr
#Type        Attribute               Value
Netlsnr      AEPTimeout              0
Netlsnr      ActionTimeout           30
Netlsnr      AgentClass              TS
Netlsnr      AgentDirectory          /opt/VRTSagents/ha/bin/Netlsnr
Netlsnr      AgentFailedOn           sys1
Netlsnr      AgentFile
Netlsnr      AgentPriority           0
Netlsnr      AgentReplyTimeout       300
Netlsnr      AgentStartTimeout       60
Netlsnr      ArgList                 Owner      Home    TnsAdmin        Listener        EnvFile MonScript       LsnrPwd AgentDebug      Encoding
Netlsnr      AttrChangedTimeout      60
Netlsnr      CleanRetryLimit         0
Netlsnr      CleanTimeout            60
Netlsnr      CloseTimeout            60
Netlsnr      ConfInterval            600
Netlsnr      ContainerOpts           RunInContainer     1       PassCInfo       0
Netlsnr      EPClass                 -1
Netlsnr      EPPriority              -1
Netlsnr      ExternalStateChange
Netlsnr      FaultOnMonitorTimeouts 4
Netlsnr      FaultPropagation        1
Netlsnr      FireDrill               0
Netlsnr      InfoInterval            0
Netlsnr      InfoTimeout             30
Netlsnr      IntentionalOffline      0
Netlsnr      LevelTwoMonitorFreq     1
Netlsnr      LogDbg                  DBG_AGINFO DBG_AGTRACE     DBG_AGDEBUG
Netlsnr      LogFileSize             33554432
Netlsnr      MonitorInterval         60
Netlsnr      MonitorStatsParam       Frequency 0       ExpectedValue   100     ValueThreshold 100     AvgThreshold    40
Netlsnr      MonitorTimeout          60
Netlsnr      NumThreads              50
Netlsnr      OfflineMonitorInterval 300
Netlsnr      OfflineProcScanInterval 60
Netlsnr      OfflineTimeout          300
Netlsnr      OfflineWaitLimit        0
Netlsnr      OnlineClass             -1
Netlsnr      OnlinePriority          -1
Netlsnr      OnlineRetryLimit        5
Netlsnr      OnlineTimeout           300
Netlsnr      OnlineWaitLimit         2
Netlsnr      OpenTimeout             60
Netlsnr      Operations              OnOff
Netlsnr      ProcScanInterval        60
Netlsnr      ProcScanStatus          0
Netlsnr      RestartLimit            5
Netlsnr      ScriptClass             TS
Netlsnr      ScriptPriority          0
Netlsnr      SourceFile              ./OracleTypes.cf
Netlsnr      SupportedActions        VRTS_GetInstanceName       VRTS_GetRunningServices tnsadmin.vfd
Netlsnr      ToleranceLimit          0
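
Only a handful of these attributes matter for the agent-load question. A small filter makes them easier to compare between the two agent types. This is a sketch: show_tuning_attrs is an illustrative name, and the choice of attributes is just the ones discussed in this thread.

```shell
# Reduce `hatype -display <type>` output (on stdin) to the attributes
# discussed in this thread, printed as attribute=value.
show_tuning_attrs() {
    awk '$2 == "AgentReplyTimeout"      ||
         $2 == "FaultOnMonitorTimeouts" ||
         $2 == "LevelTwoMonitorFreq"    ||
         $2 == "MonitorInterval"        ||
         $2 == "MonitorTimeout"         ||
         $2 == "NumThreads" { print $2 "=" $3 }'
}

# Example with a few lines from the output above
show_tuning_attrs <<'EOF'
Netlsnr      AgentReplyTimeout       300
Netlsnr      AgentStartTimeout       60
Netlsnr      MonitorInterval         60
Netlsnr      NumThreads              50
EOF

# On a cluster node:
# hatype -display Netlsnr | show_tuning_attrs
# hatype -display Application | show_tuning_attrs
```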

The load average is between 4 and 5. Currently we have 30GB of physical memory free and 130GB of swap free.
# sar
...
Average        4       7       0      89

Thanks!

regards
Björn

Gaurav_S · 08-23-2010

Hello Bjorn,

A couple of observations and suggestions based on the output above:

-- MonitorInterval and MonitorTimeout are set to the default of 60 seconds. I believe 60 seconds is far too short for an agent to monitor 411 or 234 resources. I presume you are also getting "monitor procedure did not complete in expected time" messages. Increasing both values may be worthwhile, since it would take some burden off the agent. Please note that you need to do this for both agents separately.

-- LevelTwoMonitorFreq is set to 1, which means the agent runs a detail monitor on every monitor cycle. This may be the default, but it still seems to put extra load on the agent. I am still trying to work out what the detail monitor for the Netlsnr and Application agents does, but I believe it would be worth increasing this number so that a detail monitor runs only every 4-5 cycles.

-- You have set NumThreads to 50. Did you pick this number by some calculation, or arbitrarily? I would suggest trying some different values here. I have it in the back of my mind that a large NumThreads value can also cause issues. Try 20 or 30 and see whether it makes any difference.
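
The three suggestions above could be tried roughly as follows. This is a sketch, not a tested procedure: the value 120 for MonitorInterval and MonitorTimeout is only an illustration (no number was suggested above), NumThreads 20 and LevelTwoMonitorFreq 5 are taken from the suggested ranges, and LevelTwoMonitorFreq is changed for Netlsnr only because that is the only type whose attributes are shown in this thread. The run and tune_timeouts helpers are made up for this example. By default the script only prints the commands; set DRY_RUN=0 on a cluster node to execute them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply the illustrative monitor and thread settings to one resource type.
tune_timeouts() {
    rtype=$1
    run hatype -modify "$rtype" MonitorInterval 120
    run hatype -modify "$rtype" MonitorTimeout 120
    run hatype -modify "$rtype" NumThreads 20
}

run haconf -makerw                      # open the configuration for writing
tune_timeouts Netlsnr
tune_timeouts Application               # each agent type is changed separately
run hatype -modify Netlsnr LevelTwoMonitorFreq 5
run haconf -dump -makero                # save and close the configuration
```

Changing one attribute at a time and watching the engine log in between makes it easier to tell which change, if any, has an effect.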

Gaurav

bsobek · 09-27-2010

We received an IDR patch, the problem is solved.

Gaurav_S · 09-27-2010

Can you please share the exact patch name, just for reference?

Thanks

VCS WARNING V-16-1-53025 Agent <AGENT> has faulted; ipm connection was lost; restarting the agent