[client] Fix engine lifecycle race #6443
The rosenpass init paths (NewManager/Run) returned without calling e.close(), leaking the WireGuard interface and other partially initialized state on failure. Per-branch cleanup was easy to miss when adding new early returns. Convert Start to a named error return and tear down via a single defer that calls e.close() whenever err != nil, removing the scattered per-branch close() calls (including the redundant one in initFirewall).
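A minimal sketch of the single-defer cleanup described above, assuming a toy `engine` type. The `ifaceUp` and `closed` fields and the `failRosenpass` parameter are hypothetical stand-ins, not the real Engine's fields; only the named-return-plus-defer shape is taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// engine is a hypothetical stand-in for the real Engine.
type engine struct {
	ifaceUp bool // stands in for the WireGuard interface
	closed  int  // counts close() calls, for demonstration only
}

// close releases whatever was initialized so far. It is safe to call on a
// partially initialized engine.
func (e *engine) close() {
	e.ifaceUp = false
	e.closed++
}

// start uses a named error return so that one deferred guard covers every
// early return, including ones added later.
func (e *engine) start(failRosenpass bool) (err error) {
	defer func() {
		if err != nil {
			e.close()
		}
	}()

	e.ifaceUp = true // step 1: bring the interface up

	if failRosenpass { // step 2: an init path that may fail
		return errors.New("rosenpass: init failed")
	}
	return nil
}

func main() {
	e := &engine{}
	err := e.start(true)
	fmt.Println(err, e.ifaceUp, e.closed)
}
```

Because the guard reads the named `err`, a new early return needs no cleanup code of its own.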
Create the run context once in NewEngine instead of in Start. This keeps e.cancel valid for the engine's whole lifetime, so Stop can cancel a Start that is blocked waiting on the network while holding syncMsgMux: Stop now cancels before taking the lock, unblocking that Start so it can release the mutex. Reject re-entry into Start: a non-nil wgInterface means a prior Start already ran (ErrEngineAlreadyStarted), and a cancelled run context means the engine was stopped (ErrEngineAlreadyStopped). Both checks run before the cleanup defer so a duplicate call cannot tear down the running engine's state.
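The lock ordering and re-entry guards can be sketched as follows. This is an illustration under stated assumptions: the `engine` type, the `ready` channel (modelling "the network is up"), the `parked` channel (demo-only synchronization), and the lowercase error names are all hypothetical, and the cleanup defer is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

var (
	errAlreadyStarted = errors.New("engine already started")
	errAlreadyStopped = errors.New("engine already stopped")
)

// engine is a hypothetical stand-in for the real Engine.
type engine struct {
	mu      sync.Mutex // plays the role of syncMsgMux
	ctx     context.Context
	cancel  context.CancelFunc
	started bool
	parked  chan struct{} // closed just before start blocks; demo-only
}

// newEngine creates the run context once, so cancel is valid from construction.
func newEngine() *engine {
	ctx, cancel := context.WithCancel(context.Background())
	return &engine{ctx: ctx, cancel: cancel, parked: make(chan struct{})}
}

// start holds mu while it waits for the network (modelled by ready).
func (e *engine) start(ready <-chan struct{}) error {
	e.mu.Lock()
	defer e.mu.Unlock()

	// Re-entry guards run before any work that would need cleanup.
	if e.started {
		return errAlreadyStarted
	}
	if e.ctx.Err() != nil {
		return errAlreadyStopped
	}

	close(e.parked)
	select {
	case <-ready:
	case <-e.ctx.Done():
		return e.ctx.Err()
	}
	e.started = true
	return nil
}

// stop cancels before locking, which releases a start parked under the mutex.
func (e *engine) stop() {
	e.cancel()
	e.mu.Lock()
	defer e.mu.Unlock()
	// teardown would go here
}

// demo parks a start, stops the engine, then tries to start again.
func demo() (first, second error) {
	e := newEngine()
	done := make(chan error, 1)
	go func() { done <- e.start(make(chan struct{})) }()

	<-e.parked // start now holds mu and is blocked
	e.stop()   // would deadlock if it locked before cancelling
	first = <-done
	second = e.start(make(chan struct{}))
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first)
	fmt.Println(second)
}
```

Swapping the two statements at the top of `stop` reproduces the deadlock: `stop` would wait for a mutex that the parked `start` never releases.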
WaitStreamConnected only watched the signal client's own context, which derives from the parent engineCtx rather than the engine's run context. A Start blocked here (signal stream not yet up) could therefore not be released by Engine.Stop, since Stop only cancels the engine's run context. Pass a context into WaitStreamConnected and select on it too, and have the engine pass e.ctx, so Stop cancelling e.ctx unblocks a parked Start. Update the Client interface, the mock, and callers accordingly.
ConnectClient.Stop stopped the engine directly while the run loop's backoff cycle could still be starting an engine, so Engine.close raced Engine.Start (e.g. firewall setup reading wgInterface while close nils it). embed.Client.Start's rollback only avoided a deadlock by cancelling before Stop; the race itself remained and was caught by -race. Make the run loop the sole owner of engine shutdown: derive the run context in NewConnectClient, and have Stop cancel it and wait for the loop to exit (skipping the wait when the loop never ran) instead of calling engine.Stop. The loop now always stops the engine on its way out, dropping the unsynchronised wgInterface check it used to guard that call. Self-calls from within the loop use runCancel to avoid waiting on themselves. embed keeps a defensive pre-Stop cancel(); the daemon's cleanupConnection gets a TODO to adopt Stop() rather than stopping the engine in parallel.
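A rough sketch of the single-owner shutdown, assuming a toy `connectClient`. The `running` flag and `engineStops` counter are hypothetical; `runCancel` and `runExited` follow the names used in this PR, but the real fields may differ.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// connectClient is a hypothetical stand-in for the real ConnectClient.
type connectClient struct {
	ctx       context.Context
	runCancel context.CancelFunc

	mu        sync.Mutex
	running   bool          // true once run has been entered
	runExited chan struct{} // closed when run returns

	engineStops int // how many times the loop stopped the engine
}

// newConnectClient derives the run context up front, so stop can always cancel.
func newConnectClient() *connectClient {
	ctx, cancel := context.WithCancel(context.Background())
	return &connectClient{ctx: ctx, runCancel: cancel, runExited: make(chan struct{})}
}

// run is the only place that stops the engine.
func (c *connectClient) run() {
	c.mu.Lock()
	c.running = true
	c.mu.Unlock()
	defer close(c.runExited)

	<-c.ctx.Done()  // stands in for the backoff cycle and the running engine
	c.engineStops++ // the loop stops the engine on its way out
}

// stop cancels the run context and waits for the loop, unless it never ran.
func (c *connectClient) stop() {
	c.runCancel()

	c.mu.Lock()
	running := c.running
	c.mu.Unlock()
	if running {
		<-c.runExited
	}
}

func demo() int {
	idle := newConnectClient()
	idle.stop() // loop never ran: returns without waiting

	c := newConnectClient()
	go c.run()
	c.stop()
	<-c.runExited // safe even if stop returned before run was entered
	return c.engineStops
}

func main() {
	fmt.Println(demo())
}
```

Code running inside the loop must call `runCancel` rather than `stop`, since `stop` would wait on a channel that only the loop's own return can close.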
📝 Walkthrough
Changes: Engine & ConnectClient Lifecycle Hardening
Sequence Diagram(s)
sequenceDiagram
participant Caller
participant ConnectClient
participant runLoop as run loop
participant Engine
participant SignalClient
Caller->>ConnectClient: Stop()
ConnectClient->>ConnectClient: runCancel() [cancels run ctx]
ConnectClient->>runLoop: ctx.Done fires
runLoop->>Engine: engine.Stop() [unconditional]
Engine->>Engine: cancel() [run ctx]
Engine->>SignalClient: WaitStreamConnected(e.ctx)
SignalClient->>SignalClient: ctx.Done fires → unblocks
Engine-->>runLoop: stopped
runLoop->>ConnectClient: close(runExited)
ConnectClient-->>Caller: return (unblocked)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
client/server/server.go (1)
991-999: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use `ConnectClient.Stop()` here instead of directly stopping the captured engine.
This path still races the run loop’s own `engine.Stop()` after `actCancel()`, so shutdown has two owners again and can double-close engine subsystems. Let `ConnectClient.Stop()` cancel and wait for the run loop so engine teardown remains single-owner.
Proposed fix
    -	// TODO: consider calling s.connectClient.Stop() instead of engine.Stop().
    -	// actCancel() lets the run loop stop the engine too, so both stop it
    -	// concurrently; ConnectClient.Stop cancels and waits for the run loop,
    -	// making the run loop the sole owner of engine shutdown.
    -	if engine != nil {
    -		if err := engine.Stop(); err != nil {
    -			return err
    -		}
    +	if err := s.connectClient.Stop(); err != nil {
    +		return err
     	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/server/server.go` around lines 991 - 999, Replace the direct engine.Stop() call in the conditional block with s.connectClient.Stop() instead. This ensures that ConnectClient becomes the sole owner of engine shutdown by canceling and waiting for the run loop, eliminating the race condition where both actCancel() and the direct engine.Stop() call attempt to stop the engine concurrently, which can cause double-closing of engine subsystems.
shared/signal/client/mock.go (1)
14-14: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Forward the wait context through the mock hook.
The mock method accepts a context, but `WaitStreamConnectedFunc` still has no parameter, so tests cannot assert or simulate cancellation behavior for the new contract.
Proposed fix
    -	WaitStreamConnectedFunc func()
    +	WaitStreamConnectedFunc func(context.Context)

    -func (sm *MockClient) WaitStreamConnected(context.Context) {
    +func (sm *MockClient) WaitStreamConnected(ctx context.Context) {
     	if sm.WaitStreamConnectedFunc == nil {
     		return
     	}
    -	sm.WaitStreamConnectedFunc()
    +	sm.WaitStreamConnectedFunc(ctx)
     }
Also applies to: 58-62
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/signal/client/mock.go` at line 14, The WaitStreamConnectedFunc mock hook at shared/signal/client/mock.go line 14 needs to accept a context parameter to match the new contract that the actual method expects, allowing tests to simulate cancellation behavior. Update the function signature of WaitStreamConnectedFunc to include a context.Context parameter. Additionally, at shared/signal/client/mock.go lines 58-62, update the code that invokes WaitStreamConnectedFunc to pass the context parameter through to the mock hook function call.
🧹 Nitpick comments (1)
client/internal/engine.go (1)
450-628: 🏗️ Heavy lift
Split `Start` into lifecycle phases to satisfy the failing complexity check.
SonarCloud flags `Start` for cognitive complexity and length. Extracting setup phases such as interface creation, DNS/routes/firewall initialization, event-stream startup, and monitor startup would make the new cleanup/cancellation paths easier to verify.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/engine.go` around lines 450 - 628, Refactor the Start method in Engine to split its cognitive complexity by extracting setup phases into separate helper methods. Create new private methods that group related initialization steps: one for interface and WireGuard setup (covering ValidateMTU, newWgIface, flowManager creation, rosenpass manager, and stateManager.Start), one for DNS/routes/firewall initialization (covering readInitialSettings, newDnsServer, routeManager creation and init, firewall creation, and SetFirewall), one for interface activation (wgInterfaceCreate, wgInterface.Up, setupWGProxyNoTrack, and port forwarding startup), one for connection management and relays (connMgr and srWatcher startup), one for event streams (receiveSignalEvents, receiveManagementEvents, receiveJobEvents), and one for monitoring (startNetworkMonitor and wgIfaceMonitor startup). Call each helper method sequentially from Start while maintaining the existing error handling and defer cleanup logic so that e.close() is still called on any failure. This preserves the current error propagation behavior and cancellation semantics while reducing the method's complexity.
Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@client/internal/connect.go`:
- Around line 527-530: The retry loop in the ConnectClient's run method uses a
raw ExponentialBackOff that does not respect context cancellation, causing the
Stop() method to block when cancellation occurs during a retry sleep. Locate the
ExponentialBackOff usage in the retry logic (likely in the method that populates
runExited) and wrap it with c.ctx so that when Stop() calls c.runCancel(), the
backoff sleep is immediately interrupted instead of continuing for its full
duration. This ensures the retry loop exits promptly and Stop() does not block
unnecessarily waiting on runExited.
In `@client/internal/engine.go`:
- Around line 467-470: The error defer block in engine.go (lines 467-470)
currently only calls e.close(), which is insufficient to clean up components
that may have been initialized after that point during startup, including
stateManager, flowManager, DNS and route managers, and registered goroutines.
Create a new cleanupStartFailureLocked function that mirrors the relevant
cleanup logic from the Stop method for partially initialized state, handling
cleanup of DNS manager, route manager, flow manager, state manager, port
forwarding, Rosenpass, firewall, WG interface, and waiting for goroutines that
were registered before the failure. Then replace the e.close() call in the defer
block with a call to this new cleanupStartFailureLocked function to ensure
comprehensive cleanup of all components that may have been partially
initialized.
- Around line 1779-1780: After the WaitStreamConnected call in the Start method
returns, add a check to determine if the context e.ctx has been cancelled. If
e.ctx is done (meaning Stop() was called and cancelled the context), the Start
method should return an error immediately instead of continuing with the startup
sequence. This prevents the engine from being marked as started when it is
actually being shut down, and ensures that the cancellation is properly reported
back to the caller of Start.
In `@shared/signal/client/client_test.go`:
- Line 68: The WaitStreamConnected calls with context.Background() at lines 68,
94, and 132 in shared/signal/client/client_test.go can hang indefinitely if the
stream never connects. Replace each context.Background() with a context that has
a small timeout (using a helper like context.WithTimeout). After each
WaitStreamConnected call returns, add an assertion to verify that the client
actually connected successfully before proceeding with the test. This ensures
tests fail fast rather than hanging when stream connections fail unexpectedly.
In `@shared/signal/client/grpc.go`:
- Around line 285-296: The WaitStreamConnected method has a race condition where
notifyStreamConnected can set the status to StreamConnected after the status
check on line 287 but before getStreamStatusChan is called on line 291, causing
the method to wait forever on a stale channel. Fix this by moving the status
check for StreamConnected inside the mutex (c.mux) lock and returning
immediately if already connected, then only call getStreamStatusChan and perform
the select after confirming the stream is not yet connected. This ensures that
the status check and channel creation are atomic operations and no notifications
can be missed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5a0a3a04-ffaa-4709-8af3-f30aa568ca0b
📒 Files selected for processing (8)
client/embed/embed.go
client/internal/connect.go
client/internal/engine.go
client/server/server.go
shared/signal/client/client.go
shared/signal/client/client_test.go
shared/signal/client/grpc.go
shared/signal/client/mock.go
Engine tests built the engine context with context.WithCancel(context.Background()), omitting CtxInitState. Now that the run context is created in the constructor, the wgIfaceMonitor goroutine can reach triggerClientRestart during teardown, which calls CtxGetState and panics on the missing state. Real entry points (up, embed, service) always call CtxInitState; only the tests skipped it.
The run loop retried with a raw ExponentialBackOff, so a backoff sleep ignored context cancellation. Now that ConnectClient.Stop waits for the run loop to exit, a cancel landing during a sleep would block Stop for the full interval (up to MaxInterval). Wrap the backoff with the run context so Retry returns promptly on cancel; the retry budget itself (MaxElapsedTime) is unchanged.
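To illustrate why a sleep has to watch the context, here is a standard-library-only sketch. It does not use the backoff library from the PR: `retry` is a hypothetical helper with a simple doubling wait and no elapsed-time budget, meant only to show the cancellable sleep.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls op until it succeeds or ctx is cancelled. The wait doubles
// after each failure, capped at max. Hypothetical helper, not the PR's code.
func retry(ctx context.Context, initial, max time.Duration, op func() error) error {
	wait := initial
	for {
		if err := op(); err == nil {
			return nil
		}

		// Sleep, but give up immediately if the context is cancelled.
		timer := time.NewTimer(wait)
		select {
		case <-timer.C:
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		}

		wait *= 2
		if wait > max {
			wait = max
		}
	}
}

// demo cancels the context while retry is inside a one-hour sleep and
// reports whether retry returned promptly, along with its error.
func demo() (bool, error) {
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(20*time.Millisecond, cancel)

	begin := time.Now()
	err := retry(ctx, time.Hour, time.Hour, func() error {
		return errors.New("not yet")
	})
	return time.Since(begin) < 5*time.Second, err
}

func main() {
	prompt, err := demo()
	fmt.Println(prompt, err)
}
```

With a plain `time.Sleep(wait)` in place of the `select`, the same cancel would go unnoticed until the sleep ended, which is the blocking behavior the commit removes.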
The tests waited on WaitStreamConnected with context.Background() and the client's own context was also Background, so a stream that never connects would hang until the suite timeout. Pass a 5s timeout context and assert StreamConnected afterwards so the tests fail fast with a clear reason.
The StreamConnected check and the wait-channel creation took the mutex separately, so notifyStreamConnected could set the status and close/clear connectedCh in between: the waiter then created a fresh channel nobody would ever close and blocked forever. Also, the status read was unlocked while notify wrote it under the mutex (a data race). Do the check and the channel fetch in one locked section; drop the now-unused getStreamStatusChan helper. Pre-existing bug, not introduced by this branch.
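The fixed shape, with the status check and the channel fetch under one lock, might look like the sketch below. The `client` type and its fields are hypothetical simplifications (a boolean instead of a status enum, and no second select on the client's own context).

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for the signal client.
type client struct {
	mu          sync.Mutex
	connected   bool
	connectedCh chan struct{} // created lazily; closed and cleared on connect
}

// notifyStreamConnected records the status and wakes current waiters.
func (c *client) notifyStreamConnected() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.connected = true
	if c.connectedCh != nil {
		close(c.connectedCh)
		c.connectedCh = nil
	}
}

// waitStreamConnected checks the status and fetches the wait channel in one
// locked section, so a notify cannot slip in between the two steps.
func (c *client) waitStreamConnected(ctx context.Context) {
	c.mu.Lock()
	if c.connected {
		c.mu.Unlock()
		return
	}
	if c.connectedCh == nil {
		c.connectedCh = make(chan struct{})
	}
	ch := c.connectedCh
	c.mu.Unlock()

	select {
	case <-ch:
	case <-ctx.Done():
	}
}

// isConnected reads the status under the same mutex that notify writes it.
func (c *client) isConnected() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.connected
}

// demo covers both exits: a wait released by notify and one released by ctx.
func demo() (connected, releasedByCtx bool) {
	c := &client{}
	go c.notifyStreamConnected()
	c.waitStreamConnected(context.Background()) // returns only once connected
	connected = c.isConnected()

	idle := &client{}
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	idle.waitStreamConnected(ctx) // returns because ctx is already cancelled
	return connected, !idle.isConnected()
}

func main() {
	fmt.Println(demo())
}
```

Whichever side takes the mutex first, the waiter either sees the connected status or holds the very channel that the notifier will close.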
…ream receiveSignalEvents blocks in WaitStreamConnected until the signal stream connects or the context is cancelled. If Stop cancelled e.ctx while Start was parked there, Start kept going: it started the remaining subsystems on a cancelled context and marked a shutting-down engine as started. Return the context error from receiveSignalEvents and propagate it from Start, so the deferred cleanup runs and the cancellation reaches the caller.
🧹 Nitpick comments (1)
client/internal/engine.go (1)
1717-1717: 💤 Low value
Consider complexity reduction in a follow-up.
SonarCloud flags cognitive complexity of 28 (allowed 20), but this is pre-existing. The inner message handler with its switch statement and error handling accounts for most of the complexity. This PR's changes (context error check) add minimal complexity.
If desired, the message handling logic could be extracted to a separate method in a future PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/engine.go` at line 1717, The receiveSignalEvents() function currently has high cognitive complexity of 28, exceeding the allowed threshold of 20. While this is pre-existing and the current PR changes add minimal complexity, consider extracting the inner message handler logic (the switch statement and its associated error handling) into a separate private method to reduce complexity. This refactoring would move the bulk of the conditional logic out of the main function and into a dedicated handler method, bringing the overall cognitive complexity of receiveSignalEvents() below the threshold.
Source: Linters/SAST tools
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6a54435b-f0fe-4722-a8ed-db780660d264
📒 Files selected for processing (1)
client/internal/engine.go
Start's failure defer only called close(), which covers the wg interface, firewall, rosenpass and port forwarding but leaves connMgr, srWatcher, route/DNS/flow/state managers and the monitor goroutines running. A late failure (e.g. the context-cancelled check after the signal stream) thus leaked them. Extract Stop's locked teardown into stopLocked (caller holds syncMsgMux, does not wait on shutdownWg) and call it from both Stop and Start's defer. The defer also cancels the run context first so goroutines started before the failure unwind. Teardown order is unchanged.
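A sketch of the shared teardown, assuming a toy `engine` whose subsystems are just names in a slice. `stopLocked`, `syncMsgMux`, and `shutdownWg` follow the names in the commit message; the `running` slice, the `failLate` parameter, and `Stop` waiting on `shutdownWg` after unlocking are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// engine is a hypothetical stand-in for the real Engine.
type engine struct {
	syncMsgMux sync.Mutex
	ctx        context.Context
	cancel     context.CancelFunc
	shutdownWg sync.WaitGroup
	running    []string // subsystems started so far
}

func newEngine() *engine {
	ctx, cancel := context.WithCancel(context.Background())
	return &engine{ctx: ctx, cancel: cancel}
}

// stopLocked tears down everything that was started. The caller must hold
// syncMsgMux; it does not wait on shutdownWg.
func (e *engine) stopLocked() {
	e.running = nil
}

// Stop cancels first, tears down under the lock, then waits for goroutines
// (the wait after unlocking is an assumption of this sketch).
func (e *engine) Stop() {
	e.cancel()
	e.syncMsgMux.Lock()
	e.stopLocked()
	e.syncMsgMux.Unlock()
	e.shutdownWg.Wait()
}

// Start reuses stopLocked on failure, so a late error leaks nothing.
func (e *engine) Start(failLate bool) (err error) {
	e.syncMsgMux.Lock()
	defer e.syncMsgMux.Unlock()
	defer func() {
		if err != nil {
			e.cancel()     // let already-started goroutines unwind
			e.stopLocked() // runs before the deferred Unlock above
		}
	}()

	e.running = append(e.running, "wgInterface", "connMgr", "monitor")
	e.shutdownWg.Add(1)
	go func() { // stands in for a monitor goroutine
		defer e.shutdownWg.Done()
		<-e.ctx.Done()
	}()

	if failLate {
		return errors.New("late failure")
	}
	return nil
}

// demo fails late and reports what is still running afterwards.
func demo() (int, error) {
	e := newEngine()
	err := e.Start(true)
	e.shutdownWg.Wait() // returns because the defer cancelled the run context
	return len(e.running), err
}

func main() {
	fmt.Println(demo())
}
```

Deferred calls run last-in, first-out, so the cleanup guard executes while `syncMsgMux` is still held, which is the precondition `stopLocked` states.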
Describe your changes
Follow-up to #6397, which fixed the embedded-client startup-rollback deadlock by cancelling the client context before stopping. That avoided the deadlock but left a data race underneath, which `-race` reliably surfaced in `TestClientStartTimeoutRollback` (`client/embed`): `Engine.close()` nils `wgInterface` from the `Stop` path while a still-running `Engine.Start` (firewall setup) reads it.
Root cause
`ConnectClient.Stop()` stopped the engine directly while the `run` backoff loop could still be bringing an engine up. The two paths touched the same engine on different goroutines: the `c.engine` pointer was mutex-guarded, but the engine's `Start` <-> `close` lifecycle was not synchronized between them. The loop's `if engine.wgInterface != nil` check (an unsynchronized read, already flagged with a `// todo: ... Is not thread safe`) was the other half of the race.
What changed
The engine lifecycle is now hardened on three fronts:
- `Engine.Stop` is idempotent and thread-safe, and the engine can stop itself. `Stop` cancels the run context before taking `syncMsgMux`, so a `Start` parked waiting on the signal stream while holding the mutex is unblocked and cannot deadlock the teardown. `close()` is nil-guarded throughout, so stopping a partially-started or already-stopped engine is safe.
- `Engine.Start` cleanup via a single deferred guard instead of scattered per-branch `e.close()` calls, plus single-use guards (a `started` flag and a cancelled-context check) so a duplicate or post-stop `Start` is rejected with `ErrEngineAlreadyStarted` / a stopped-context error rather than racing the running engine. Note: `started` is a dedicated flag rather than a check on `wgInterface`, which `close()` nils and other goroutines read.
- The run loop is the sole owner of engine shutdown. The run context is derived in `NewConnectClient`, and `ConnectClient.Stop()` now cancels it and waits for the loop to exit (skipping the wait when the loop never ran) instead of calling `engine.Stop()` itself. The loop always stops the engine on its way out, so the unsynchronized `wgInterface` check is gone. Self-calls from within the loop use `runCancel` to avoid waiting on themselves and deadlocking.
`WaitStreamConnected` now takes a context and selects on it, and the engine passes `e.ctx`, so cancelling the engine unblocks a `Start` parked there. This touches the internal `signal.Client` interface (not a public gRPC/CLI surface).
`embed.Client.Start` keeps a defensive pre-`Stop` `cancel()` (no longer required for correctness, kept belt-and-suspenders). The daemon's `cleanupConnection` gets a `// TODO` to adopt `ConnectClient.Stop()` instead of stopping the engine in parallel with the run loop.
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
- Added `ErrEngineAlreadyStarted` to indicate attempts to start an engine that’s already running.
- Changed `WaitStreamConnected` to require a `context.Context`.