Fix log formatting, heartbeat message handling, and node state management#1053

Merged
cgalibern merged 12 commits into opensvc:main from cvaroqui:main on Jun 25, 2026
Conversation

@cvaroqui

Member

No description provided.

cvaroqui added 12 commits June 24, 2026 10:39
The prefix had 2 chars for 2 nodes instead of only 1 char.
Instead of d.clusterData.Cluster.Config.Nodes, which is empty
when this func is called (because ccfg has not published its first
cluster.Config).

This fixes the following bogus startup event, which was used to format a
bogus NodeRejoin event, itself used to freeze the node when a peer node
was frozen while we were stopped:

	at: "2026-06-22T18:42:57.019766924+02:00"
	data:
	  installed_gens:
	    dev2n1: 16
	    dev2n2: 11
	    dev2n3: 1
	  joined_nodes:
	  - dev2n3
	  - dev2n1
	  labels:
	    node: dev2n3
	  new_type: patch
	  node: dev2n3
	  nodes:
	  - dev2n1
	  - dev2n2
	  - dev2n3
	  old_type: full
	id: 2
	kind: HeartbeatMessageTypeUpdated
on nmon.Manager Start(), after the event subscriptions are
accepted.

Without this init, the Manager.clusterConfig can be a zero value.
Early in the daemon startup sequence,
d.clusterData.Cluster.Config.Nodes is still nil, but
daemondata.GetHbMessageType() is already called to forge a
HeartbeatMessageTypeUpdated event, which nmon then uses to
decide whether to end the rejoin period immediately because all nodes
have already rejoined. A nil Nodes was always interpreted as "no missing nodes".
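
The pitfall described above can be illustrated with a minimal Go sketch. It is not the om3 code: `missingNodes` and its `known` result are hypothetical names, used only to show why a nil node list must be distinguished from "everybody has joined".

```go
package main

import "fmt"

// missingNodes is a hypothetical helper, not the om3 implementation.
// It returns the configured nodes that have not joined yet. The known
// result is false when the configured list is nil, because ranging over
// a nil slice runs zero iterations and would otherwise look like
// "nobody is missing".
func missingNodes(configured, joined []string) (missing []string, known bool) {
	if configured == nil {
		return nil, false
	}
	seen := make(map[string]bool, len(joined))
	for _, n := range joined {
		seen[n] = true
	}
	for _, n := range configured {
		if !seen[n] {
			missing = append(missing, n)
		}
	}
	return missing, true
}

func main() {
	joined := []string{"dev2n3", "dev2n1"}

	// Cluster config not published yet: the answer is "unknown".
	m, known := missingNodes(nil, joined)
	fmt.Println(len(m), known)

	// Cluster config published: dev2n2 is reported as missing.
	m, known = missingNodes([]string{"dev2n1", "dev2n2", "dev2n3"}, joined)
	fmt.Println(m, known)
}
```

A caller deciding on an immediate rejoin would only take the shortcut when `known` is true and `missing` is empty.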
queueFreeze() errors were logged at info level.
* Also daemondata now initializes node.Monitor.State to rejoin

* Publish the NodeRejoin event from the immediate rejoin codepath,
  so the peer-frozen merge is not bypassed by this shortcut.

* Log "we are late to the party, immediate rejoin" when this
  shortcut is entered.

* Use direct <var>/node/frozen manipulation from nmon, instead of
  a mix of direct and callout

* Better and more symmetric log entries format in rejoin period
  expire and peer-frozen-when-dead-merge

* Log all peers frozen since the last shutdown instead of just one
  of them.
So a crashing node has its last_shutdown at most 10s before the
event, and peers frozen after this barrier are mirrored when the
daemon starts up again.
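
As a rough illustration of the barrier comparison, the following Go sketch selects the peers frozen after a last_shutdown timestamp. The function names, the data layout, and the scenario timings are assumptions made for the example; they are not taken from the om3 code.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// peersFrozenAfter is a hypothetical helper. It returns, sorted, the
// peers whose frozen timestamp is later than the barrier. A zero
// timestamp means "not frozen".
func peersFrozenAfter(barrier time.Time, frozenAt map[string]time.Time) []string {
	var peers []string
	for name, at := range frozenAt {
		if !at.IsZero() && at.After(barrier) {
			peers = append(peers, name)
		}
	}
	sort.Strings(peers)
	return peers
}

// demo builds an invented crash scenario: the last_shutdown barrier was
// refreshed 7s before the crash, so it is less than 10s old.
func demo() []string {
	crash := time.Date(2026, 6, 22, 18, 42, 0, 0, time.UTC)
	lastShutdown := crash.Add(-7 * time.Second)
	frozenAt := map[string]time.Time{
		"dev2n1": crash.Add(30 * time.Second), // frozen while we were down
		"dev2n2": crash.Add(-time.Hour),       // frozen long before the barrier
		"dev2n3": {},                          // not frozen
	}
	return peersFrozenAfter(lastShutdown, frozenAt)
}

func main() {
	fmt.Println(demo())
}
```

Only dev2n1 passes the barrier in this scenario, so it is the only peer whose freeze would be mirrored at startup.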
These allow tracking the begin and end times of the nmon blackout,
which will be used by imon to decide if a peer instance freeze
should be mirrored locally.
Fixing:

    --- FAIL: TestDaemonData/Ensure_ClusterNodeData_result_is_a_deep_copy (0.00s)
        main_test.go:140:
            	Error Trace:	/home/runner/work/om3/om3/daemon/daemondata/main_test.go:140
            	Error:      	Not equal:
            	            	expected: 0
            	            	actual  : 2
            	Test:       	TestDaemonData/Ensure_ClusterNodeData_result_is_a_deep_copy
    main_test.go:142:
        	Error Trace:	/home/runner/work/om3/om3/daemon/daemondata/main_test.go:142
        	Error:      	Should be false
        	Test:       	TestDaemonData

And:

    --- FAIL: TestDaemonData/Ensure_node.MonitorData.GetByNode_result_is_a_deep_copy (0.00s)
        main_test.go:157:
            	Error Trace:	/root/dev/om3/daemon/daemondata/main_test.go:157
            	Error:      	Not equal:
            	            	expected: 0
            	            	actual  : 2
            	Test:       	TestDaemonData/Ensure_node.MonitorData.GetByNode_result_is_a_deep_copy
            	Messages:   	State changed !
    main_test.go:163:
        	Error Trace:	/root/dev/om3/daemon/daemondata/main_test.go:163
        	Error:      	Should be false
        	Test:       	TestDaemonData
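
For context, the property these tests assert is that callers receive a deep copy of the cached data. The Go sketch below shows that property with a hypothetical `nodeData` type. It illustrates what the tests check, not the change made by the commit.

```go
package main

import "fmt"

// nodeData is a hypothetical stand-in for a cached per-node structure.
type nodeData struct {
	State int
	Gen   map[string]uint64
}

// deepCopy returns a copy that shares no memory with the receiver.
func (d *nodeData) deepCopy() *nodeData {
	c := *d // copies State, but c.Gen still aliases d.Gen at this point
	c.Gen = make(map[string]uint64, len(d.Gen))
	for k, v := range d.Gen {
		c.Gen[k] = v
	}
	return &c
}

func main() {
	cache := &nodeData{State: 0, Gen: map[string]uint64{"dev2n1": 16}}

	// Mutating the copy must leave the cached value untouched.
	got := cache.deepCopy()
	got.State = 2
	got.Gen["dev2n1"] = 99

	fmt.Println(cache.State, cache.Gen["dev2n1"])
}
```

A plain struct assignment would have protected `State` but not `Gen`, since maps are reference types in Go.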
The commit adding Levelf as a backend for Infof & co added one
more frame to skip.

Without this patch, the caller information of log entries points
to the Levelf call, which is not useful.
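
The effect of an extra wrapper on the skip count can be reproduced with the standard `runtime` package. In this sketch `callerName`, `levelf`, `infof` and `handleRequest` are hypothetical functions that only mimic the layering described above; the real logger code may differ.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// callerName returns the short name of the function located skip frames
// above the function that calls callerName (skip=0 is that function).
func callerName(skip int) string {
	pcs := make([]uintptr, 1)
	// +2 skips runtime.Callers and callerName themselves.
	if runtime.Callers(skip+2, pcs) == 0 {
		return "unknown"
	}
	frame, _ := runtime.CallersFrames(pcs).Next()
	return frame.Function[strings.LastIndex(frame.Function, ".")+1:]
}

// levelf plays the role of the shared backend.
//
//go:noinline
func levelf(skip int) string { return callerName(skip) }

// infof is a thin wrapper around levelf, which adds one stack frame.
//
//go:noinline
func infof(skip int) string { return levelf(skip) }

// handleRequest stands for application code that emits a log entry.
//
//go:noinline
func handleRequest(skip int) string { return infof(skip) }

func main() {
	// A skip count that ignores the extra wrapper points into the logger.
	fmt.Println(handleRequest(1))
	// One more skipped frame reaches the real call site.
	fmt.Println(handleRequest(2))
}
```

With a skip of 1 the reported function is `infof`, the frame containing the `levelf` call; with a skip of 2 it is `handleRequest`, which is the information a reader of the log wants.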
@cgalibern cgalibern merged commit 907a1c8 into opensvc:main Jun 25, 2026
1 check passed
2 participants