Skip to content

[client] Fix engine lifecyrcle race#6443

Merged
pappz merged 10 commits into
mainfrom
fix/engine-stop-error-handling
Jun 22, 2026
Merged

[client] Fix engine lifecyrcle race#6443
pappz merged 10 commits into
mainfrom
fix/engine-stop-error-handling

Conversation

@pappz

@pappz pappz commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Describe your changes

Follow-up to #6397, which fixed the embedded-client startup-rollback deadlock
by cancelling the client context before stopping. That avoided the deadlock but
left a data race underneath, which -race reliably surfaced in
TestClientStartTimeoutRollback (client/embed): Engine.close() nils
wgInterface from the Stop path while a still-running Engine.Start
(firewall setup) reads it.

Root cause

ConnectClient.Stop() stopped the engine directly while the run backoff loop
could still be bringing an engine up. The two paths touched the same engine on
different goroutines: the c.engine pointer was mutex-guarded, but the engine's
Start <-> close lifecycle was not synchronized between them. The loop's
if engine.wgInterface != nil check (an unsynchronized read, already flagged
with a // todo: ... Is not thread safe) was the other half of the race.

What changed

The engine lifecycle is now hardened on three fronts:

  • Engine.Stop is idempotent and thread-safe, and the engine can stop
    itself.
    Stop cancels the run context before taking syncMsgMux, so a
    Start parked waiting on the signal stream while holding the mutex is
    unblocked and cannot deadlock the teardown. close() is nil-guarded
    throughout, so stopping a partially-started or already-stopped engine is safe.

  • Engine.Start cleanup via a single deferred guard instead of scattered
    per-branch e.close() calls, plus single-use guards (a started flag and a
    cancelled-context check) so a duplicate or post-stop Start is rejected with
    ErrEngineAlreadyStarted / a stopped-context error rather than racing the
    running engine. Note: started is a dedicated flag rather than a check on
    wgInterface, which close() nils and other goroutines read.

  • The run loop is the sole owner of engine shutdown. The run context is
    derived in NewConnectClient, and ConnectClient.Stop() now cancels it and
    waits for the loop to exit (skipping the wait when the loop never ran) instead
    of calling engine.Stop() itself. The loop always stops the engine on its way
    out, so the unsynchronized wgInterface check is gone. Self-calls from within
    the loop use runCancel to avoid waiting on themselves and deadlocking.

WaitStreamConnected now takes a context and selects on it, and the engine
passes e.ctx, so cancelling the engine unblocks a Start parked there. This
touches the internal signal.Client interface (not a public gRPC/CLI surface).

embed.Client.Start keeps a defensive pre-Stop cancel() (no longer required
for correctness, kept belt-and-suspenders). The daemon's cleanupConnection
gets a // TODO to adopt ConnectClient.Stop() instead of stopping the engine
in parallel with the run loop.

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)
  • This change does not modify the public API, gRPC protocols, functionality behavior, CLI / service flags, or introduce a new feature — OR I have discussed it with the NetBird team beforehand (link the issue / Slack thread in the description). See CONTRIBUTING.md.

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features
    • Added ErrEngineAlreadyStarted to indicate attempts to start an engine that’s already running.
  • Improvements
    • Strengthened engine and connection lifecycle coordination for more reliable startup/shutdown and safer teardown sequencing.
    • Improved shutdown behavior: stopping now consistently cancels internal run contexts, unblocks on run-exit signaling, and handles permission/login failures more predictably.
    • Updated Signal stream connection handling to respect cancellation and surface failures.
  • Breaking Changes
    • Updated Signal client WaitStreamConnected to require a context.Context.
  • Tests
    • Updated engine and Signal client tests to use the new context-based API (including timeout handling).

pappz added 4 commits June 16, 2026 11:54
The rosenpass init paths (NewManager/Run) returned without calling
e.close(), leaking the WireGuard interface and other partially
initialized state on failure. Per-branch cleanup was easy to miss when
adding new early returns.

Convert Start to a named error return and tear down via a single defer
that calls e.close() whenever err != nil, removing the scattered
per-branch close() calls (including the redundant one in initFirewall).
Create the run context once in NewEngine instead of in Start. This
keeps e.cancel valid for the engine's whole lifetime, so Stop can
cancel a Start that is blocked waiting on the network while holding
syncMsgMux: Stop now cancels before taking the lock, unblocking that
Start so it can release the mutex.

Reject re-entry into Start: a non-nil wgInterface means a prior Start
already ran (ErrEngineAlreadyStarted), and a cancelled run context
means the engine was stopped (ErrEngineAlreadyStopped). Both checks run
before the cleanup defer so a duplicate call cannot tear down the
running engine's state.
WaitStreamConnected only watched the signal client's own context, which
derives from the parent engineCtx rather than the engine's run context.
A Start blocked here (signal stream not yet up) could therefore not be
released by Engine.Stop, since Stop only cancels the engine's run
context.

Pass a context into WaitStreamConnected and select on it too, and have
the engine pass e.ctx, so Stop cancelling e.ctx unblocks a parked Start.
Update the Client interface, the mock, and callers accordingly.
ConnectClient.Stop stopped the engine directly while the run loop's
backoff cycle could still be starting an engine, so Engine.close raced
Engine.Start (e.g. firewall setup reading wgInterface while close nils
it). embed.Client.Start's rollback only avoided a deadlock by cancelling
before Stop; the race itself remained and was caught by -race.

Make the run loop the sole owner of engine shutdown: derive the run
context in NewConnectClient, and have Stop cancel it and wait for the
loop to exit (skipping the wait when the loop never ran) instead of
calling engine.Stop. The loop now always stops the engine on its way
out, dropping the unsynchronised wgInterface check it used to guard that
call. Self-calls from within the loop use runCancel to avoid waiting on
themselves.

embed keeps a defensive pre-Stop cancel(); the daemon's cleanupConnection
gets a TODO to adopt Stop() rather than stopping the engine in parallel.
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)
📝 Walkthrough

Walkthrough

ConnectClient gains an owned run context, exit channel, and atomic started flag so Stop() cancels the run loop and waits for it to exit instead of directly stopping the engine. Engine becomes single-use via a started guard and ErrEngineAlreadyStarted sentinel, with deferred cleanup on partial-init failures and unconditional context cancellation in Stop(). WaitStreamConnected gains a context.Context parameter across the interface, gRPC implementation, mock, and all callers, allowing stream-connection waits to be tied to caller-provided cancellation contexts.

Changes

Engine & ConnectClient Lifecycle Hardening

Layer / File(s) Summary
WaitStreamConnected context-aware interface and implementation
shared/signal/client/client.go, shared/signal/client/grpc.go, shared/signal/client/mock.go, shared/signal/client/client_test.go
Client interface changes WaitStreamConnected to accept context.Context; gRPC implementation removes getStreamStatusChan() helper, adds early-exit on connected state, and selects on the caller-supplied context; mock signature updated; three test call sites pass context.WithTimeout(..., 5*time.Second) and assert StreamConnected() is true.
Engine single-use guard, run context, and deferred cleanup
client/internal/engine.go
Adds ErrEngineAlreadyStarted sentinel and started boolean; NewEngine stores a derived run ctx/cancel; Start checks started and ctx.Err(), sets started=true, uses named return, validates MTU, defers cleanup on failure, and removes redundant explicit close() calls from partial-init error paths; Stop unconditionally calls cancel() and performs shutdown sequencing; receiveSignalEvents now returns error and waits for WaitStreamConnected(e.ctx).
ConnectClient run context and exit signaling
client/internal/connect.go
ConnectClient gains runCancel, runExited, runOnce, runStarted fields; NewConnectClient derives the run context; run() marks the loop started via atomic flag and defers closing runExited exactly once; Stop() cancels runCancel and blocks on runExited if run started; PermissionDenied paths call runCancel() directly; engine teardown calls Stop() unconditionally.
Engine test context initialization
client/internal/engine_test.go
Five engine test cases updated to initialize root context via CtxInitState(context.Background()) before context.WithCancel, ensuring consistent context setup for the new engine run-context lifecycle.
Embed and server shutdown ownership documentation
client/embed/embed.go, client/server/server.go
Updates comments in Client.Start and cleanupConnection to describe cancel() as belt-and-suspenders behavior and document shutdown ownership semantics.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ConnectClient
    participant runLoop as run loop
    participant Engine
    participant SignalClient

    Caller->>ConnectClient: Stop()
    ConnectClient->>ConnectClient: runCancel() [cancels run ctx]
    ConnectClient->>runLoop: ctx.Done fires
    runLoop->>Engine: engine.Stop() [unconditional]
    Engine->>Engine: cancel() [run ctx]
    Engine->>SignalClient: WaitStreamConnected(e.ctx)
    SignalClient->>SignalClient: ctx.Done fires → unblocks
    Engine-->>runLoop: stopped
    runLoop->>ConnectClient: close(runExited)
    ConnectClient-->>Caller: return (unblocked)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • netbirdio/netbird#6397: Addresses embedded-client startup/shutdown deadlock by changing context cancellation order relative to stopping the engine, overlapping with this PR's context-propagation and lifecycle-hardening approach.
  • netbirdio/netbird#4891: Modifies Engine.Stop and signal-stream shutdown sequencing to address races/deadlocks, directly overlapping with this PR's lifecycle and stop-behavior updates.
  • netbirdio/netbird#5623: Modifies client/internal/connect.go's ConnectClient execution and shutdown paths, so the Android wiring can interact with the new run-context and stop-coordination changes.

Suggested reviewers

  • mlsmaycon
  • lixmal

Poem

🐇 Hops through the code with careful paws,
A run context owned, no deadlock clause!
runExited waits, runCancel calls,
The engine guards its lifecycle walls.
Belt and suspenders—no dangling thread,
This bunny ships safe shutdown instead! 🎉

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title mentions 'engine lifecycle race' which is the core issue fixed across multiple files. However, it contains a typo ('lifecyrcle' instead of 'lifecycle') and is somewhat vague about what specifically was fixed. Correct the typo 'lifecyrcle' to 'lifecycle'. Consider if a more specific title like 'Fix engine lifecycle race in concurrent Start/Stop paths' would better describe the multi-component fix.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The description comprehensively covers the root cause, the three-pronged fix, and implementation details. The checklist confirms this is a bug fix and that no public API changes were made. Documentation is appropriately marked as not needed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/engine-stop-error-handling

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
client/server/server.go (1)

991-999: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use ConnectClient.Stop() here instead of directly stopping the captured engine.

This path still races the run loop’s own engine.Stop() after actCancel(), so shutdown has two owners again and can double-close engine subsystems. Let ConnectClient.Stop() cancel and wait for the run loop so engine teardown remains single-owner.

Proposed fix
-	// TODO: consider calling s.connectClient.Stop() instead of engine.Stop().
-	// actCancel() lets the run loop stop the engine too, so both stop it
-	// concurrently; ConnectClient.Stop cancels and waits for the run loop,
-	// making the run loop the sole owner of engine shutdown.
-	if engine != nil {
-		if err := engine.Stop(); err != nil {
-			return err
-		}
+	if err := s.connectClient.Stop(); err != nil {
+		return err
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/server/server.go` around lines 991 - 999, Replace the direct
engine.Stop() call in the conditional block with s.connectClient.Stop() instead.
This ensures that ConnectClient becomes the sole owner of engine shutdown by
canceling and waiting for the run loop, eliminating the race condition where
both actCancel() and the direct engine.Stop() call attempt to stop the engine
concurrently, which can cause double-closing of engine subsystems.
shared/signal/client/mock.go (1)

14-14: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Forward the wait context through the mock hook.

The mock method accepts a context, but WaitStreamConnectedFunc still has no parameter, so tests cannot assert or simulate cancellation behavior for the new contract.

Proposed fix
-	WaitStreamConnectedFunc      func()
+	WaitStreamConnectedFunc      func(context.Context)
-func (sm *MockClient) WaitStreamConnected(context.Context) {
+func (sm *MockClient) WaitStreamConnected(ctx context.Context) {
 	if sm.WaitStreamConnectedFunc == nil {
 		return
 	}
-	sm.WaitStreamConnectedFunc()
+	sm.WaitStreamConnectedFunc(ctx)
 }

Also applies to: 58-62

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/signal/client/mock.go` at line 14, The WaitStreamConnectedFunc mock
hook at shared/signal/client/mock.go line 14 needs to accept a context parameter
to match the new contract that the actual method expects, allowing tests to
simulate cancellation behavior. Update the function signature of
WaitStreamConnectedFunc to include a context.Context parameter. Additionally, at
shared/signal/client/mock.go lines 58-62, update the code that invokes
WaitStreamConnectedFunc to pass the context parameter through to the mock hook
function call.
🧹 Nitpick comments (1)
client/internal/engine.go (1)

450-628: 🏗️ Heavy lift

Split Start into lifecycle phases to satisfy the failing complexity check.

SonarCloud flags Start for cognitive complexity and length. Extracting setup phases such as interface creation, DNS/routes/firewall initialization, event-stream startup, and monitor startup would make the new cleanup/cancellation paths easier to verify.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` around lines 450 - 628, Refactor the Start method
in Engine to split its cognitive complexity by extracting setup phases into
separate helper methods. Create new private methods that group related
initialization steps: one for interface and WireGuard setup (covering
ValidateMTU, newWgIface, flowManager creation, rosenpass manager, and
stateManager.Start), one for DNS/routes/firewall initialization (covering
readInitialSettings, newDnsServer, routeManager creation and init, firewall
creation, and SetFirewall), one for interface activation (wgInterfaceCreate,
wgInterface.Up, setupWGProxyNoTrack, and port forwarding startup), one for
connection management and relays (connMgr and srWatcher startup), one for event
streams (receiveSignalEvents, receiveManagementEvents, receiveJobEvents), and
one for monitoring (startNetworkMonitor and wgIfaceMonitor startup). Call each
helper method sequentially from Start while maintaining the existing error
handling and defer cleanup logic so that e.close() is still called on any
failure. This preserves the current error propagation behavior and cancellation
semantics while reducing the method's complexity.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/internal/connect.go`:
- Around line 527-530: The retry loop in the ConnectClient's run method uses a
raw ExponentialBackOff that does not respect context cancellation, causing the
Stop() method to block when cancellation occurs during a retry sleep. Locate the
ExponentialBackOff usage in the retry logic (likely in the method that populates
runExited) and wrap it with c.ctx so that when Stop() calls c.runCancel(), the
backoff sleep is immediately interrupted instead of continuing for its full
duration. This ensures the retry loop exits promptly and Stop() does not block
unnecessarily waiting on runExited.

In `@client/internal/engine.go`:
- Around line 467-470: The error defer block in engine.go (lines 467-470)
currently only calls e.close(), which is insufficient to clean up components
that may have been initialized after that point during startup, including
stateManager, flowManager, DNS and route managers, and registered goroutines.
Create a new cleanupStartFailureLocked function that mirrors the relevant
cleanup logic from the Stop method for partially initialized state, handling
cleanup of DNS manager, route manager, flow manager, state manager, port
forwarding, Rosenpass, firewall, WG interface, and waiting for goroutines that
were registered before the failure. Then replace the e.close() call in the defer
block with a call to this new cleanupStartFailureLocked function to ensure
comprehensive cleanup of all components that may have been partially
initialized.
- Around line 1779-1780: After the WaitStreamConnected call in the Start method
returns, add a check to determine if the context e.ctx has been cancelled. If
e.ctx is done (meaning Stop() was called and cancelled the context), the Start
method should return an error immediately instead of continuing with the startup
sequence. This prevents the engine from being marked as started when it is
actually being shut down, and ensures that the cancellation is properly reported
back to the caller of Start.

In `@shared/signal/client/client_test.go`:
- Line 68: The WaitStreamConnected calls with context.Background() at lines 68,
94, and 132 in shared/signal/client/client_test.go can hang indefinitely if the
stream never connects. Replace each context.Background() with a context that has
a small timeout (using a helper like context.WithTimeout). After each
WaitStreamConnected call returns, add an assertion to verify that the client
actually connected successfully before proceeding with the test. This ensures
tests fail fast rather than hanging when stream connections fail unexpectedly.

In `@shared/signal/client/grpc.go`:
- Around line 285-296: The WaitStreamConnected method has a race condition where
notifyStreamConnected can set the status to StreamConnected after the status
check on line 287 but before getStreamStatusChan is called on line 291, causing
the method to wait forever on a stale channel. Fix this by moving the status
check for StreamConnected inside the mutex (c.mux) lock and returning
immediately if already connected, then only call getStreamStatusChan and perform
the select after confirming the stream is not yet connected. This ensures that
the status check and channel creation are atomic operations and no notifications
can be missed.

---

Outside diff comments:
In `@client/server/server.go`:
- Around line 991-999: Replace the direct engine.Stop() call in the conditional
block with s.connectClient.Stop() instead. This ensures that ConnectClient
becomes the sole owner of engine shutdown by canceling and waiting for the run
loop, eliminating the race condition where both actCancel() and the direct
engine.Stop() call attempt to stop the engine concurrently, which can cause
double-closing of engine subsystems.

In `@shared/signal/client/mock.go`:
- Line 14: The WaitStreamConnectedFunc mock hook at shared/signal/client/mock.go
line 14 needs to accept a context parameter to match the new contract that the
actual method expects, allowing tests to simulate cancellation behavior. Update
the function signature of WaitStreamConnectedFunc to include a context.Context
parameter. Additionally, at shared/signal/client/mock.go lines 58-62, update the
code that invokes WaitStreamConnectedFunc to pass the context parameter through
to the mock hook function call.

---

Nitpick comments:
In `@client/internal/engine.go`:
- Around line 450-628: Refactor the Start method in Engine to split its
cognitive complexity by extracting setup phases into separate helper methods.
Create new private methods that group related initialization steps: one for
interface and WireGuard setup (covering ValidateMTU, newWgIface, flowManager
creation, rosenpass manager, and stateManager.Start), one for
DNS/routes/firewall initialization (covering readInitialSettings, newDnsServer,
routeManager creation and init, firewall creation, and SetFirewall), one for
interface activation (wgInterfaceCreate, wgInterface.Up, setupWGProxyNoTrack,
and port forwarding startup), one for connection management and relays (connMgr
and srWatcher startup), one for event streams (receiveSignalEvents,
receiveManagementEvents, receiveJobEvents), and one for monitoring
(startNetworkMonitor and wgIfaceMonitor startup). Call each helper method
sequentially from Start while maintaining the existing error handling and defer
cleanup logic so that e.close() is still called on any failure. This preserves
the current error propagation behavior and cancellation semantics while reducing
the method's complexity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5a0a3a04-ffaa-4709-8af3-f30aa568ca0b

📥 Commits

Reviewing files that changed from the base of the PR and between 08a2b63 and 98c71d7.

📒 Files selected for processing (8)
  • client/embed/embed.go
  • client/internal/connect.go
  • client/internal/engine.go
  • client/server/server.go
  • shared/signal/client/client.go
  • shared/signal/client/client_test.go
  • shared/signal/client/grpc.go
  • shared/signal/client/mock.go

Comment thread client/internal/connect.go
Comment thread client/internal/engine.go
Comment thread client/internal/engine.go
Comment thread shared/signal/client/client_test.go Outdated
Comment thread shared/signal/client/grpc.go
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown

Release artifacts

Built for PR head f4e2836 in workflow run #15912.

Artifact Link
All release artifacts Download
Linux packages Download
Windows packages Download
macOS packages Download
UI artifacts Download
UI macOS artifacts Download

GHCR images (amd64)

This comment is updated by the Release workflow. Artifact links expire according to the workflow retention policy.

pappz added 5 commits June 16, 2026 14:35
Engine tests built the engine context with context.WithCancel(
context.Background()), omitting CtxInitState. Now that the run context
is created in the constructor, the wgIfaceMonitor goroutine can reach
triggerClientRestart during teardown, which calls CtxGetState and
panics on the missing state. Real entry points (up, embed, service)
always CtxInitState; only the tests skipped it.
The run loop retried with a raw ExponentialBackOff, so a backoff sleep
ignored context cancellation. Now that ConnectClient.Stop waits for the
run loop to exit, a cancel landing during a sleep would block Stop for
the full interval (up to MaxInterval). Wrap the backoff with the run
context so Retry returns promptly on cancel; the retry budget itself
(MaxElapsedTime) is unchanged.
The tests waited on WaitStreamConnected with context.Background() and the
client's own context was also Background, so a stream that never connects
would hang until the suite timeout. Pass a 5s timeout context and assert
StreamConnected afterwards so the tests fail fast with a clear reason.
The StreamConnected check and the wait-channel creation took the mutex
separately, so notifyStreamConnected could set the status and close/clear
connectedCh in between: the waiter then created a fresh channel nobody
would ever close and blocked forever. Also, the status read was unlocked
while notify wrote it under the mutex (a data race). Do the check and the
channel fetch in one locked section; drop the now-unused
getStreamStatusChan helper. Pre-existing bug, not introduced by this branch.
…ream

receiveSignalEvents blocks in WaitStreamConnected until the signal stream
connects or the context is cancelled. If Stop cancelled e.ctx while Start
was parked there, Start kept going: it started the remaining subsystems on
a cancelled context and marked a shutting-down engine as started. Return
the context error from receiveSignalEvents and propagate it from Start, so
the deferred cleanup runs and the cancellation reaches the caller.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
client/internal/engine.go (1)

1717-1717: 💤 Low value

Consider complexity reduction in a follow-up.

SonarCloud flags cognitive complexity of 28 (allowed 20), but this is pre-existing. The inner message handler with its switch statement and error handling accounts for most of the complexity. This PR's changes (context error check) add minimal complexity.

If desired, the message handling logic could be extracted to a separate method in a future PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` at line 1717, The receiveSignalEvents() function
currently has high cognitive complexity of 28, exceeding the allowed threshold
of 20. While this is pre-existing and the current PR changes add minimal
complexity, consider extracting the inner message handler logic (the switch
statement and its associated error handling) into a separate private method to
reduce complexity. This refactoring would move the bulk of the conditional logic
out of the main function and into a dedicated handler method, bringing the
overall cognitive complexity of receiveSignalEvents() below the threshold.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@client/internal/engine.go`:
- Line 1717: The receiveSignalEvents() function currently has high cognitive
complexity of 28, exceeding the allowed threshold of 20. While this is
pre-existing and the current PR changes add minimal complexity, consider
extracting the inner message handler logic (the switch statement and its
associated error handling) into a separate private method to reduce complexity.
This refactoring would move the bulk of the conditional logic out of the main
function and into a dedicated handler method, bringing the overall cognitive
complexity of receiveSignalEvents() below the threshold.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6a54435b-f0fe-4722-a8ed-db780660d264

📥 Commits

Reviewing files that changed from the base of the PR and between 90af9dd and 3190347.

📒 Files selected for processing (1)
  • client/internal/engine.go

Start's failure defer only called close(), which covers the wg interface,
firewall, rosenpass and port forwarding but leaves connMgr, srWatcher,
route/DNS/flow/state managers and the monitor goroutines running. A late
failure (e.g. the context-cancelled check after the signal stream) thus
leaked them.

Extract Stop's locked teardown into stopLocked (caller holds syncMsgMux,
does not wait on shutdownWg) and call it from both Stop and Start's defer.
The defer also cancels the run context first so goroutines started before
the failure unwind. Teardown order is unchanged.
@pappz pappz changed the title [client] fix engine lifecyrcle race [client] Fix engine lifecyrcle race Jun 16, 2026
@pappz pappz merged commit ac9529e into main Jun 22, 2026
49 checks passed
@pappz pappz deleted the fix/engine-stop-error-handling branch June 22, 2026 11:53
@sonarqubecloud

Copy link
Copy Markdown

pappz added a commit that referenced this pull request Jun 22, 2026
…#6503)

watchdog_test.go called WaitStreamConnected() without the context.Context
argument added in #6443, breaking the signal client test build.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants