LHC : IpmHandle::recv _read_errno is 5

Marianne · ‎06-30-2010

Strange Warning in engine_A log at system startup...

Background: HP-UX 11i with VCS 5.0.10.2, 3-node cluster.
Node 0 in the cluster was rebooted this morning. All processes started normally, node 0 joined the cluster, went from CURRENT_DISCOVER_WAIT to REMOTE_BUILD; System: <node-0> is building configuration from system: <node-1>
2010/06/30 09:14:21 VCS NOTICE V-16-1-10465 Getting snapshot.

AutoRestart for all service groups checked:
2010/06/30 09:14:29 VCS NOTICE V-16-1-10181 Group <group1> AutoRestart set to 1
2010/06/30 09:14:30 VCS NOTICE V-16-1-10181 Group <group2> AutoRestart set to 1
(lots of service groups...)

Then this warning appears (literally HUNDREDS of lines, all identical, even the timestamp):
2010/06/30 09:15:48 VCS WARNING V-16-1-10638 IpmHandle::recv _read_errno is 5. Client (hastatus) Pid (2654)
2010/06/30 09:15:48 VCS WARNING V-16-1-10638 IpmHandle::recv _read_errno is 5. Client (hastatus) Pid (2654)
....
2010/06/30 09:15:58 VCS INFO V-16-1-10466 End of snapshot received from node: 1. snapped_membership: 0x7 current_membership: 0x7 current_jeopardy_membership: 0x0
2010/06/30 09:15:58 VCS WARNING V-16-1-10030 UseFence=NONE. Hence do not need fencing
2010/06/30 09:15:59 VCS NOTICE V-16-1-10467 Replaying broadcast queue. snapped_membership: 0x7 current_membership: 0x7 current_jeopardy_membership: 0x0
2010/06/30 09:15:59 VCS NOTICE V-16-1-10322 System .... (Node '0') changed state from REMOTE_BUILD to RUNNING

Service Groups start going online normally.

I found a TechNote with a similar warning, but the 'Client' in the TN is IPMultiNICB, whereas in this instance the 'Client' is hastatus. No other resemblance to the TN.

Has anyone seen this? Any idea what's causing it?
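
For anyone wanting to quantify the flood before digging further, here is a minimal sketch that tallies the V-16-1-10638 warnings per timestamp, client and PID. It assumes the engine log lines have exactly the layout quoted above; the function name `tally_ipm_warnings` and the sample list are made up for illustration and are not part of VCS.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt in this post:
# 2010/06/30 09:15:48 VCS WARNING V-16-1-10638 IpmHandle::recv _read_errno is 5. Client (hastatus) Pid (2654)
PATTERN = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS WARNING V-16-1-10638 "
    r".*?_read_errno is (?P<errno>\d+)\. "
    r"Client \((?P<client>[^)]*)\) Pid \((?P<pid>\d+)\)"
)

def tally_ipm_warnings(lines):
    """Count V-16-1-10638 warnings per (timestamp, client, pid, errno)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line.strip())
        if m:
            key = (m["ts"], m["client"], int(m["pid"]), int(m["errno"]))
            counts[key] += 1
    return counts

# Hypothetical sample built from the lines quoted above.
sample = [
    "2010/06/30 09:15:48 VCS WARNING V-16-1-10638 IpmHandle::recv _read_errno is 5. Client (hastatus) Pid (2654)",
    "2010/06/30 09:15:48 VCS WARNING V-16-1-10638 IpmHandle::recv _read_errno is 5. Client (hastatus) Pid (2654)",
    "2010/06/30 09:15:58 VCS WARNING V-16-1-10030 UseFence=NONE. Hence do not need fencing",
]

for key, n in tally_ipm_warnings(sample).items():
    print(key, n)
```

To run it against a real log, pass an open file object instead of `sample`. If every warning collapses to a single (client, pid) pair, that points at one process rather than many separate hastatus invocations.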


g_lee · ‎06-30-2010

VOS seems to indicate this is not a major problem:
https://vos.symantec.com/ecls/umi/V-16-1-10638

It also mentions: "A possible action is to determine the reason why the client has suddenly gone down."

Off the top of my head, the only thing I can think of is that had was busy or still starting, so hastatus wasn't able to return cleanly. Even if that was the case, it doesn't explain why there would be hundreds of messages (i.e. it implies hastatus was run hundreds of times in that window?)
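
As a side note, assuming the `_read_errno` value in the warning is an ordinary OS errno number (an assumption; the VOS entry does not say so), it can be looked up with a short Python sketch. The helper name `describe_errno` is made up for illustration, and the result reflects the platform the script runs on, which may differ from HP-UX.

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for an errno number on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# 5 is the value reported in the V-16-1-10638 warning.
name, message = describe_errno(5)
print(f"errno 5 -> {name}: {message}")
```

On most Unix-like systems errno 5 is EIO (a generic I/O error), which would be consistent with a read on a connection whose client went away, but it does not by itself say why.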

Gaurav_S · ‎06-30-2010

Hi Marianne,

As you are probably aware, VCS intercommunication depends on IPM. VCS commands like hastatus use Inter Process Messaging to communicate with the "had" daemon.

The message above suggests to me that PID 2654 was looping continuous hastatus commands. Did you get a chance to see what that PID was?

It's quite possible that some old hastatus processes were stuck. The messages would have appeared while the node was still building the configuration.

Gaurav

Marianne · ‎06-30-2010

Thanks for the responses so far. The messages appeared at system startup (after reboot) while the node was building its config. The weird thing is that this config has been unchanged for at least the last 18 months, except for SF/HA patches recommended by Symantec. This is the first time that this has occurred. Notification has been set to 'Warning' level - imagine the hundreds of emails!

I was not on site at the time. The assumption is that the PID belongs to hastatus....


Marianne · ‎07-01-2010

Managed to get more info - got syslog from this node. It seems this wasn't a startup after a normal reboot, but a startup after a panic (I'm not familiar with HP-UX, so I'm not sure if they use the same term...)

I see the following in syslog:

Jun 30 08:58:55 <node> vmunix: GAB INFO V-15-1-20032 Port h closed
Jun 30 08:59:05 <node> syslog: pid:3549.10 - pam_request.c:152:process_pam_ldap_request(): _hp_ldap_bind_ux() failed, err=-2
Jun 30 09:50:29 <node> syslogd: restart

No errors in either engine_A or syslog prior to this.

I am wondering if we might have the situation described in this TechNote (although we do not see the same entries in syslog and the TechNote is actually for Solaris):

http://seer.entsupport.symantec.com/docs/184301.htm
Intentional Panic of a node in a Veritas Cluster Server cluster by GAB

I have requested extracts of syslog and engine_A from the other 2 nodes for the period between about 08:50 and 09:05.

Now to find out what could be overloading the system....
