feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954
Conversation
…rns CoordinatorExitStatus The three periodic sync coordinators (platform-address, identity, shielded) run their `!Send` loops on detached OS threads via `Handle::block_on`. `shutdown()`/`quiesce()` previously only drained the in-flight pass (the `is_syncing` barrier) and never joined the threads, so a consumer that drops the tokio runtime right after `shutdown()` (one-shot / headless / stdio) could race a coordinator still polling `tokio::time` on a shutting-down runtime and panic with "A Tokio 1.x context was found, but it is being shutdown". Each coordinator now stores its OS-thread `JoinHandle`; `quiesce()` joins it (via `spawn_blocking`, after the existing drain) and returns a `CoordinatorThreadStatus` (NotRunning / Ok / Panicked / Error). Joining while the runtime is still alive guarantees the loop has stopped touching `tokio::time` before the host drops the runtime. `shutdown()` aggregates the three into `CoordinatorExitStatus`, so a panicked loop surfaces in the status instead of being silently dropped. JoinHandle-join chosen over a oneshot/Notify signal: `JoinHandle::join` natively distinguishes a clean return from a panic and waits for the actual OS thread to terminate (not just a signal fired mid-teardown), yielding the per-thread status for free. The generation-guard reschedule and quiesce-drain behavior are preserved. BREAKING CHANGE: `PlatformWalletManager::shutdown()` now returns `CoordinatorExitStatus` instead of `()`. FFI: the internal `shutdown()` call logs the new status; the `extern "C"` `platform_wallet_manager_destroy` signature and C ABI are unchanged. <sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
…nnot wedge shutdown
SEC-001: Add `IsSyncingGuard` RAII struct to all three coordinator
`sync_now` (and shielded `sync_wallet`) implementations. The guard
clears `is_syncing=false` on every exit path — normal return, early
return, and panic-unwind — so `quiesce()`'s drain loop can never spin
forever on a panicked pass, and the `Panicked` thread-exit status
becomes reachable.
SEC-002: Wrap each coordinator's `quiesce()` call in `shutdown()` with
`tokio::time::timeout(30 s)`. On timeout the slot reports
`CoordinatorThreadStatus::Error("join timed out")` rather than hanging
forever.
SEC-003: Add `debug_assert!` in `shutdown()` that the current runtime
is `MultiThread`; document the precondition in the method doc.
F-5: In all three coordinators' `start()`, store the `JoinHandle` in
`background_join` while still holding the `background_cancel` lock —
eliminates the theoretical window where a concurrent `quiesce()` could
take a `None` handle because spawn completed before the store.
Rename `CoordinatorThreadExit` → `CoordinatorThreadStatus` with
variants `Ok / NotRunning / Panicked / Error` to match the coordinator
module's existing `super::CoordinatorThreadStatus` references (fixing
the compile break in f3354f6). `join_coordinator_thread`'s
spawn_blocking `Err` arm now maps to `Error` rather than `Panicked`
to distinguish infra failure from thread panic (F-6 documented).
Co-Authored-By: Claudius the Magnificent <noreply@anthropic.com>
<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR adds shared worker lifecycle primitives, refactors wallet sync managers to use them, and propagates incomplete-shutdown status through wallet, FFI, and Swift result handling. ChangesCoordinator Shutdown and Lifecycle Refactor
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## v3.1-dev #3954 +/- ##
=============================================
+ Coverage 52.54% 87.18% +34.63%
=============================================
Files 11 2632 +2621
Lines 1707 327563 +325856
=============================================
+ Hits 897 285592 +284695
- Misses 810 41971 +41161
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
packages/rs-platform-wallet/src/manager/identity_sync.rs (1)
418-459: 🩺 Stability & Availability | 🟠 MajorDon't overwrite an unjoined coordinator generation.
stop()removes onlybackground_canceland leaves the previousbackground_joinunjoined. Astop()→start()sequence passes thecancel_guard.is_some()check and overwritesbackground_joinwith a new handle, losing the old OS-thread handle before shutdown can join it. Gate restart on confirming the prior handle has been joined, or ensure every generation's handle is tracked and joined.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 418 - 459, The start() method can overwrite an existing background_join handle before it has been properly joined, causing resource leaks. Before spawning the new identity-sync thread and storing its handle in background_join, add a check to ensure any existing join handle in background_join has been properly cleaned up or joined first. This can be done either by joining the existing handle before proceeding, or by verifying that background_join is None before allowing the new thread spawn to proceed. This ensures the prior thread's OS handle is not lost and can be properly shutdown.packages/rs-platform-wallet/src/manager/platform_address_sync.rs (1)
218-255: 🩺 Stability & Availability | 🟠 MajorAdd generation guard to prevent thread handle loss after stop/start cycles.
After
stop()cancels the token,background_cancelbecomesNonewhile the old thread keeps running. A subsequentstart()seescancel_guard.is_some() == falseand spawns a new thread, unconditionally overwritingbackground_join. The old thread's join handle is lost, making it impossible to join its cleanup later. Additionally, without a generation counter, the exiting old thread clears the new generation'sbackground_canceltoken as it shuts down, creating a race where the new loop runs but appears stopped tois_running().Both
IdentitySyncManagerandShieldedSyncManageralready implement this pattern: they incrementbackground_generationon eachstart(), passmy_gento the spawned thread, and checkbackground_generation.load() == my_genbefore clearingbackground_cancelon exit. Apply the same approach here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around lines 218 - 255, The thread handle for the platform address sync loop can be lost during stop/start cycles because a new thread spawns and unconditionally overwrites the background_join handle before the old thread finishes cleanup. Additionally, the old thread clears the new generation's background_cancel token, creating a race condition. Implement the generation guard pattern already used in IdentitySyncManager and ShieldedSyncManager: add a background_generation atomic counter to the struct, increment it at the start of the start() method, capture the current generation as my_gen before spawning the thread, pass my_gen into the spawned thread closure, and modify the cleanup code (where background_cancel is set to None) to only clear the token if background_generation.load() equals my_gen, ensuring old exiting threads do not interfere with new generations.packages/rs-platform-wallet/src/manager/shielded_sync.rs (1)
245-300: 🩺 Stability & Availability | 🔴 CriticalDon't replace a pending shielded-sync join handle on rapid stop/start cycles.
The
start()method checks onlybackground_cancelto guard against concurrent starts. Whenstop()removes the token frombackground_cancel, a subsequentstart()proceeds to spawn a new thread and overwritesbackground_join, even though the previous generation's thread is still winding down. The generation check (line 290) only prevents the old thread from clearingbackground_cancel—it does not protectbackground_join. This leaves prior-generation handles permanently lost.To fix: either require
quiesce()beforestart()to join all pending generations, or track join handles per-generation and join all pending before spawning a new thread.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 245 - 300, The `start()` method only checks `background_cancel` to guard against concurrent starts, but when `stop()` clears this token, a subsequent `start()` can spawn a new thread and overwrite the `background_join` handle before the previous generation's thread has finished cleaning up. The generation check at the cleanup section prevents the old thread from clearing `background_cancel`, but does not protect `background_join` from being overwritten. Fix this by either adding logic to join all pending prior-generation join handles before storing the new one (by tracking handles per-generation), or by ensuring that before assigning a new join handle to `background_join` in the `start()` method, any existing pending handle from a prior generation is properly joined and waited for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 706-710: The while loop waiting for handler_started to become true
has no timeout, which will cause the test to hang indefinitely if the slow
callback never executes. Wrap the entire while loop (that checks
handler_started.load(AO::Acquire)) with tokio::time::timeout() and provide an
appropriate timeout duration, then handle the timeout error case with an
assertion that explicitly fails the test with a useful message. This ensures CI
fails fast with clear feedback instead of timing out.
- Around line 483-489: In the wallet event adapter task join error handling, the
else branch that handles non-panic JoinErrors currently returns
CoordinatorThreadStatus::Ok, which should instead return
CoordinatorThreadStatus::Error with an appropriate error message. Change the
else clause that follows the is_panic() check to map the error to a
CoordinatorThreadStatus::Error variant containing details about the join error,
rather than treating it as a successful completion.
- Around line 175-181: The current implementation uses
tokio::task::spawn_blocking to wrap handle.join(), but this pattern prevents the
timeout from effectively interrupting the blocking task if the coordinator
thread hangs. Replace the spawn_blocking closure approach with explicit polling:
repeatedly check if the JoinHandle is finished using is_finished() in a loop
until the deadline is reached, and only call join() once the handle confirms it
is finished. This ensures the timeout boundary is enforced even if the
coordinator thread misbehaves or fails to clear is_syncing before join() is
called.
---
Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 418-459: The start() method can overwrite an existing
background_join handle before it has been properly joined, causing resource
leaks. Before spawning the new identity-sync thread and storing its handle in
background_join, add a check to ensure any existing join handle in
background_join has been properly cleaned up or joined first. This can be done
either by joining the existing handle before proceeding, or by verifying that
background_join is None before allowing the new thread spawn to proceed. This
ensures the prior thread's OS handle is not lost and can be properly shutdown.
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 218-255: The thread handle for the platform address sync loop can
be lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.
In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 245-300: The `start()` method only checks `background_cancel` to
guard against concurrent starts, but when `stop()` clears this token, a
subsequent `start()` can spawn a new thread and overwrite the `background_join`
handle before the previous generation's thread has finished cleaning up. The
generation check at the cleanup section prevents the old thread from clearing
`background_cancel`, but does not protect `background_join` from being
overwritten. Fix this by either adding logic to join all pending
prior-generation join handles before storing the new one (by tracking handles
per-generation), or by ensuring that before assigning a new join handle to
`background_join` in the `start()` method, any existing pending handle from a
prior generation is properly joined and waited for completion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 01cab324-105d-4c5d-afe0-0ceb6faff13e
📒 Files selected for processing (5)
packages/rs-platform-wallet-ffi/src/manager.rspackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rs
Introduces `AtomicFlagGuard`, a pub RAII guard that clears an `AtomicBool` flag to `false` (Release ordering) on drop. The guard does not set the flag on construction — the caller is responsible for doing so (typically via a `compare_exchange`) — preserving the exact semantics of the three identical `IsSyncingGuard` structs that were copy-pasted across the platform-wallet sync coordinators. This is the panic-safety keystone for the quiesce drain loop: if a sync pass panics, the guard's `drop` still clears `is_syncing`, so `quiesce()` is never permanently wedged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…en runtime check
**Task 1 — new enum variants**
Add `Stopped(Option<String>)` (non-panic, non-clean task exit, e.g.
tokio cancel/abort) and `Timeout` (join exceeded
SHUTDOWN_JOIN_TIMEOUT_SECS) to `CoordinatorThreadStatus`.
- Non-panic JoinError on the event-adapter task → `Stopped(Some(...))`,
not the previous `Ok` (wrong: a cancelled task is not a clean exit).
- Timeout on any `quiesce()` wrapper → `Timeout`, not `Error("join
timed out")`.
- `is_clean()` now returns `true` only for `Ok` and `NotRunning`; all
other variants — including the two new ones — are non-clean.
- Update all docs / comments that referenced the old `Error("join timed
out")` wording.
**Task 2 — promote debug_assert to assert**
`shutdown()`'s multi-thread-runtime guard was `debug_assert!`, making
it a no-op in release builds. Changed to `assert!` — this is a real
invariant: `spawn_blocking` deadlocks on a `current_thread` runtime.
**Task 3 — bound the test wait loop**
Wrap the `while !handler_started…` polling in
`shutdown_waits_for_in_flight_pass_to_drain` with a 5 s
`tokio::time::timeout` so a broken test fails fast instead of hanging.
**Task 4 — DRY IsSyncingGuard**
Replace the three identical copy-pasted `IsSyncingGuard` structs in
`identity_sync.rs`, `platform_address_sync.rs`, and `shielded_sync.rs`
with the new `dash_async::AtomicFlagGuard`. Adds `dash-async` to
`rs-platform-wallet/Cargo.toml`. Zero behavioral change: construction
semantics preserved (callers set the flag via `compare_exchange` before
creating the guard; `Drop` clears it with `Ordering::Release`).
**Task 5 — new tests**
- `coordinator_thread_status_clean_predicate`: unit-tests `is_clean()`
for all six variants including the two new ones; no real timeout needed.
- `coordinator_exit_status_all_clean`: tests `all_clean()` with
`Timeout` and `Stopped` slots.
- `event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean`:
aborts the adapter task before `shutdown()` and asserts the result is
`Stopped` (covers the non-panic JoinError path).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/rs-platform-wallet/src/manager/mod.rs (2)
405-408: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winHandle unclean shielded quiesce before clearing state.
quiesce()now returns a meaningful shutdown status. Ignoring it letsclear_shielded()proceed tocoord.clear()afterTimeout,Stopped, orPanicked, which can race a still-running shielded pass that the quiesce barrier was meant to stop.Proposed fix
- self.shielded_sync_manager.quiesce().await; + let status = self.shielded_sync_manager.quiesce().await; + if !status.is_clean() { + return Err(crate::error::PlatformWalletError::ShieldedStoreError( + format!("shielded sync did not stop cleanly before clear: {status:?}"), + )); + } if let Some(coord) = self.shielded_coordinator().await { coord.clear().await?; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 405 - 408, The clear_shielded function ignores the return value from self.shielded_sync_manager.quiesce().await, which now provides meaningful shutdown status information. Instead of ignoring this result, capture the return value and check if the quiesce completed cleanly. If quiesce returns a status indicating Timeout, Stopped, or Panicked, return an appropriate error from clear_shielded rather than continuing to call coord.clear(), which could race with a still-running shielded pass. Only proceed with coord.clear() when quiesce has successfully shut down cleanly.
463-477: 🩺 Stability & Availability | 🟠 MajorWrap
quiescinginAtomicFlagGuardto ensure cancellation-safe reset in all coordinatorquiesce()implementations.The current implementations set
quiescing = truebefore the awaited drain loop and reset it only after. Iftokio::time::timeoutdrops the future during the loop, the reset never executes, permanently wedgingquiescingand blocking all future syncs.All three coordinators (
platform_address_sync.rs:291,identity_sync.rs:494,shielded_sync.rs:334) have identical patterns. Use the sameAtomicFlagGuardapproach already correctly applied tois_syncinginsync_now():pub async fn quiesce(&self) -> super::CoordinatorThreadStatus { self.quiescing.store(true, Ordering::Release); let _quiescing_guard = AtomicFlagGuard::new(&self.quiescing); self.stop(); while self.is_syncing.load(Ordering::Acquire) { tokio::time::sleep(Duration::from_millis(20)).await; } // quiescing.store(false) removed — guard handles reset on all exit paths // ...rest of implementation }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 463 - 477, The `quiesce()` method implementations in all three coordinators (the ones containing the timeout calls for `platform_address_sync_manager`, `identity_sync_manager`, and `shielded_sync_manager`) don't properly handle cancellation when `tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an `AtomicFlagGuard` to ensure it's reset on all exit paths including early cancellation. In each coordinator's `quiesce()` method, after setting `quiescing` to true, immediately create an `AtomicFlagGuard` using `AtomicFlagGuard::new(&self.quiescing)`, and remove any manual `quiescing.store(false)` reset call at the end since the guard will handle it automatically. Use the same pattern already correctly implemented in `sync_now()` for the `is_syncing` flag.
🧹 Nitpick comments (1)
packages/rs-dash-async/src/atomic.rs (1)
8-15: 📐 Maintainability & Code Quality | 🔵 TrivialAdd
#[must_use]annotation toAtomicFlagGuard.The guard is dropped as a temporary if not bound, silently resetting the flag immediately. This breaks the intended guarded scope behavior. Mark the type with
#[must_use]to catch accidental non-binding at compile time.Proposed fix
-pub struct AtomicFlagGuard<'a>(&'a AtomicBool); +#[must_use = "AtomicFlagGuard clears the flag on drop; bind it to keep the flag set for the guarded scope"] +pub struct AtomicFlagGuard<'a>(&'a AtomicBool);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-dash-async/src/atomic.rs` around lines 8 - 15, The AtomicFlagGuard struct can be accidentally dropped without being bound to a variable, causing the flag to be reset immediately and breaking the guarded scope behavior. Add the #[must_use] attribute to the AtomicFlagGuard struct definition to make the compiler warn when the guard is not explicitly bound to a variable, ensuring the developer catches this mistake at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 671-695: The test
`event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean` is
non-deterministic because it accepts both `CoordinatorThreadStatus::Stopped` and
`CoordinatorThreadStatus::Ok` as valid outcomes. This allows the test to pass
without actually exercising the non-panic JoinError branch. To fix this, replace
the current task abort approach with a task that is guaranteed to be pending and
never complete on its own, such as a task that awaits on a channel or a
never-resolving future. This ensures the abort always triggers the Stopped path
deterministically, and update the assertion to only expect
`CoordinatorThreadStatus::Stopped`.
---
Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 405-408: The clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.
- Around line 463-477: The `quiesce()` method implementations in all three
coordinators (the ones containing the timeout calls for
`platform_address_sync_manager`, `identity_sync_manager`, and
`shielded_sync_manager`) don't properly handle cancellation when
`tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an
`AtomicFlagGuard` to ensure it's reset on all exit paths including early
cancellation. In each coordinator's `quiesce()` method, after setting
`quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.
---
Nitpick comments:
In `@packages/rs-dash-async/src/atomic.rs`:
- Around line 8-15: The AtomicFlagGuard struct can be accidentally dropped
without being bound to a variable, causing the flag to be reset immediately and
breaking the guarded scope behavior. Add the #[must_use] attribute to the
AtomicFlagGuard struct definition to make the compiler warn when the guard is
not explicitly bound to a variable, ensuring the developer catches this mistake
at compile time.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 192ea517-ff8a-4b43-9773-c096391c8a49
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
packages/rs-dash-async/src/atomic.rspackages/rs-dash-async/src/lib.rspackages/rs-platform-wallet/Cargo.tomlpackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/rs-platform-wallet/src/manager/platform_address_sync.rs
- packages/rs-platform-wallet/src/manager/shielded_sync.rs
|
✅ Review complete (commit 748c4f8) |
|
Swift 🤖 Co-authored by Claudius the Magnificent AI Agent |
…on tests Wrap the cleanup `mgr.quiesce().await` in all three `restart_after_stop_reaps_prior_thread` tests with a 2-second `tokio::time::timeout`. An unbounded quiesce after the restarted loop would hang CI forever if the loop wedges; now it fails fast with a clear message. Also assert `status.is_clean()` on the returned `CoordinatorThreadStatus`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
The latest delta resolves prior-1 (Swift deinit at PlatformWalletManager.swift:153-195 now captures both shielded_sync_stop and destroy result codes via discardReturningCode() and leaks persistenceHandler/eventHandler via unbalanced passRetained on .errorShutdownIncomplete) and prior-3 (restart_after_stop_reaps_prior_thread now exists in shielded_sync.rs alongside its identity/platform-address siblings). Prior-2 — the 1 s wedge-backstop detach in each coordinator's start() — remains in the tree on the deliberate-wedge path; all six agents converge on it as a suggestion, narrow but conceptually opposed to the PR's central honest-status contract. No new blocking issues introduced.
🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517: start()'s 1 s wedge-backstop detaches a still-live coordinator that shutdown()/destroy then reports as clean
Carried forward from prior review (still valid; no reviewer-repeat required by the verifier protocol).
On a genuine wedge — a `sync_now` future whose `Drop` never yields, or a non-yielding step inside the SDK — the prior thread cannot reach its exit epilogue within the 1 s deadline at identity_sync.rs:502-517 (and the symmetric patterns at platform_address_sync.rs:317-332 and shielded_sync.rs:350-365). `start()` then drops `h`, detaching the still-running OS thread; `background_join` already holds the new generation's handle, so a later `quiesce()`/`shutdown()` joins only the new thread, `CoordinatorExitStatus::all_clean()` returns true, and `platform_wallet_manager_destroy` returns `ok()`.
That directly defeats the central FFI contract this PR introduces. The Swift deinit at PlatformWalletManager.swift:190-194 only retains `persistenceHandler` / `eventHandler` when it observes `.errorShutdownIncomplete`; a falsely-clean code lets ARC free both immediately, and the detached prior thread — which still owns Arc'd Rust callback wrappers carrying the `passUnretained(self).toOpaque()` raw pointers (PlatformWalletPersistenceHandler.swift:993-996, PlatformWalletManagerAddressSync.swift:47-50) — can fire one last callback into freed memory. The same hazard exists on the timeout paths of `clear_shielded` and the FFI `platform_wallet_manager_shielded_sync_stop`, where an outer `tokio::time::timeout` firing while the inner quiesce is parked inside `join_coordinator_thread(handle).await` drops the local handle at the .await — detach.
The cancellable `tokio::select!` in each loop body and the prompt `drop(cancel_guard)` ordering (identity_sync.rs:488-494, mirrored in the other two) make this vanishingly rare in practice — `restart_after_stop_reaps_prior_thread` now pins the common-case reap at <500 ms across all three coordinators. But the residual path is the one bypass of the PR's honest-status guarantee. Two clean closures: (a) track detached handles in a per-manager orphans list that `shutdown()` polls and surfaces as `Timeout`/`Detached`, so `destroy` returns `ErrorShutdownIncomplete` and the host keeps the callback context alive; or (b) once cancellation has been signalled, drop the 1 s deadline entirely — the cancellable select! makes a real-world stall essentially impossible, and turning a degraded shutdown into a hang is arguably less dangerous than the silent UAF window the detach can open.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta (5017ba1..b491773) is test-only: it bounds the post-restart cleanup quiesce() in three restart-reap regression tests with a 2 s tokio::time::timeout and asserts is_clean(). No production code changed and no new defects in the delta. The prior finding about start()'s 1 s wedge-backstop silently detaching a still-live coordinator is STILL VALID on the current head for all three coordinators and is carried forward as the sole in-scope finding.
Carried-forward prior findings
-
[suggestion] 1 s wedge-backstop in start() detaches a live prior coordinator that shutdown()/destroy then reports as clean
packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517Verified against b491773. On a genuine wedge — a sync_now future whose Drop never yields, or a non-yielding step inside the SDK — the prior coordinator thread cannot reach its exit epilogue within the 1 s deadline. The loop at lines 504–513 then
breaks and dropsh, which detaches the still-running OS thread (std::thread::JoinHandle's Drop does not join).background_joinnow holds only the replacement handle installed at lines 483–486, so a laterquiesce()(lines 576–581) joins only the replacement andCoordinatorExitStatus::all_clean()returns true. FFIplatform_wallet_manager_destroy(packages/rs-platform-wallet-ffi/src/manager.rs:366–386) then returnsok()instead ofErrorShutdownIncomplete=19. SwiftPlatformWalletManager.deinit(packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153–195) does NOT take the retain-leak branch and ARC releases the persister/event-handler wrappers, even though the detached orphan thread still holds Arc/Arc referencing the freedpassUnretainedcontext — exactly the UAF class this PR exists to close, but now on the stop()→start()→destroy() degraded path the new restart bookkeeping creates.The biased cancel-select introduced at lines 446–461 narrows the common-case window (cancel beats sync_now's next .await), but it cannot abort an SDK step or future Drop that is itself non-yielding — which is precisely when the 1 s backstop fires. Same pattern exists in platform_address_sync.rs:317–332 and shielded_sync.rs:350–365; none of the three record the detach event into state observable by shutdown()/CoordinatorExitStatus. The new restart-reap tests only exercise the prompt-reap (<500 ms) path, not a genuine wedge past 1 s.
Minimum FFI-correct fix options (none fit an inline patch): (a) track detached orphans in a Mutex<Vec<JoinHandle<()>>> the next quiesce()/shutdown() polls and surfaces as Timeout/non-clean status; (b) have start() flip a per-coordinator atomic on detach that quiesce() ORs into its returned CoordinatorThreadStatus; (c) have start() return an error rather than silently install a replacement when the prior couldn't be reaped, so the FFI host can escalate. The current shape — silently dropping the orphan — is the one outcome that defeats the PR's central invariant that destroy()'s result code tells the host whether it is safe to free callback contexts.
…) reports them as non-clean Closes the residual use-after-free window left by the coordinator reap backstop. On a tight stop()->start(), each sync coordinator waits ~1s for the prior OS thread to finish; if that thread is genuinely wedged in a non-yielding Drop, the backstop previously DROPPED the still-live JoinHandle (detaching it). A later shutdown() joined only the current handle, all_clean() returned true, and the FFI destroy returned ok() — at which point the host could free the callback context the detached, still-running thread might still touch. Fix (review option i): the manager now owns a shared CoordinatorOrphans list (Arc<Mutex<Vec<JoinHandle>>>) cloned into every coordinator. The duplicated reap blocks in identity/platform-address/shielded start() are consolidated into reap_prior_or_park(), which PARKS a wedged prior thread in that list instead of dropping it (lock-ordering preserved: drop(cancel_guard) still happens before the reap). shutdown() then drains the list via join_detached_orphans() within a bounded, yielding is_finished() poll and reports a new CoordinatorThreadStatus::Detached (non-clean) in CoordinatorExitStatus::detached_threads when any orphan is still alive at the grace deadline. all_clean() folds it in, so the FFI destroy correctly returns ErrorShutdownIncomplete and the host delays freeing its context. The new Detached variant re-exports through lib.rs with its sibling statuses. Tests (manager/mod.rs): reap_prior_or_park parks a force-wedged thread; join_detached_orphans reports Detached then Ok; and a manager shutdown() with a parked still-live orphan reports non-clean. All proven non-vacuous by neutering the park/join. Cleanup quiesce/join in tests is bounded; a wedged stand-in thread is released and joined so none leak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Prior finding (start()/wedge-backstop detaching a live coordinator) is FIXED at head 76c8bee via reap_prior_or_park + CoordinatorOrphans + join_detached_orphans, with Detached folded into all_clean(). However, two related lifecycle gaps in the same orphan-tracking design remain: (1) quiesce() takes the JoinHandle out of background_join before awaiting join_coordinator_thread, so a wrapping tokio::time::timeout (used by shutdown(), clear_shielded(), and the FFI shielded_sync_stop bridge) drops the live handle and detaches the thread without parking it; on the explicit retry path the next call sees background_join == None and reports NotRunning/clean while a wedged thread may still call host callbacks. (2) clear_shielded() consults only the current shielded quiesce status and wipes the commitment-tree store even when an earlier wedged shielded coordinator is still alive in coordinator_orphans, breaking the PR's 'Clear leaves the store intact whenever a sync drain is incomplete' invariant. Both are in scope for the PR's stated FFI lifecycle/UAF hardening.
🔴 2 blocking
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:240-261: Timed-out quiesce drops the JoinHandle and silently detaches a live coordinator
join_coordinator_thread takes the JoinHandle by value. Every caller of quiesce() — shutdown(), clear_shielded(), and the FFI platform_wallet_manager_shielded_sync_stop — wraps it in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...). Each coordinator's quiesce() first synchronously takes background_join (identity_sync.rs:579, shielded_sync.rs:423, platform_address_sync.rs analogous) and only then awaits this helper. If the outer timeout fires while this loop is in tokio::time::sleep, the future is dropped, the locally-owned handle is dropped with it, and the still-live OS thread is detached — not parked in coordinator_orphans. The FFI/host contract is explicitly retry-capable here (Timeout → ErrorShutdownIncomplete → host should be able to retry stop/destroy), but a retry now sees background_join == None and join_coordinator_thread returns NotRunning, which is_clean() treats as clean. The host then frees the persister / event-handler context while a wedged thread still holds Arcs to it and may still fire one final callback — the exact UAF class the orphan list was added to close. The fix is to keep the handle reachable across timeout/cancel: either don't take() from background_join until is_finished() is true, or have join_coordinator_thread re-park the unjoined handle into a manager-scoped orphans list on Drop / on the deadline (mirroring reap_prior_or_park's park-on-wedge contract).
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the store while parked coordinator orphans may still be alive
clear_shielded() gates the store wipe on `self.shielded_sync_manager.quiesce()` returning a clean status, but a tight stop()→start() sequence can park a previously-wedged shielded coordinator thread into coordinator_orphans (reap_prior_or_park, mod.rs:296-332). Those parked threads hold Arc references to the same shielded coordinator / persister context the clear() below is about to wipe, and shutdown() is currently the only path that drains them via join_detached_orphans. As soon as the current quiesce reports clean, this path calls coord.clear() while a parked orphan may still be inside its sync_now drop / persister fan-out, violating the PR's stated invariant that Clear leaves the store intact whenever a sync drain is incomplete and re-creating the store-desync the orphan tracking was added to surface. Mirror the shutdown() path: after quiesce, drain coordinator_orphans (with SHUTDOWN_ORPHAN_GRACE_SECS) and refuse the wipe with ShieldedShutdownIncomplete if any orphan is still Detached/Panicked.
…via tokio::join! The three periodic coordinators (platform-address, identity, shielded) were quiesced sequentially in shutdown(), making the worst-case wait additive (~3 x SHUTDOWN_JOIN_TIMEOUT_SECS). Each quiesce() touches only its own state (its quiescing/is_syncing atomics and its own background_cancel/background_join mutexes) and joins its own OS thread, sharing no lock, so racing them is sound. Drain them concurrently via tokio::join!, collapsing the worst case to ~max(timeouts). Each join! arm keeps its OWN inner tokio::time::timeout, so every coordinator still yields its own per-coordinator CoordinatorThreadStatus (a single outer timeout would flatten all three to Timeout). The event adapter teardown and join_detached_orphans stay sequential and ordered strictly AFTER the coordinator join!, since the adapter sinks the coordinators' stores. The multi-thread runtime assert is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
|
@thepastaclaw please review |
Centralizes the previously-triplicated background-worker lifecycle — generation-match exit epilogue, restart reap-or-park, orphan drain — into one tested engine in the shared dash-async crate, generic over a worker key and supporting both OS-thread (`!Send` block_on loops) and tokio-task workers. Makes two confirmed bugs impossible by construction: - F1: quiesce/join paths take `&self`; the live JoinHandle stays owned by the slot and is never moved into a cancellable future's frame. A dropped/timed-out quiesce re-parks the handle into orphans (Timeout), never drop-and-detach to a clean NotRunning. - F2: any_alive() is the single liveness gate spanning live slots AND parked orphans, so store-wiping paths can refuse while a prior thread is alive. Weight-ordered shutdown drains tiers ascending, concurrently within a tier. WorkerStatus variants are byte-identical to the wallet's CoordinatorThreadStatus for a stable FFI mapping. Adds the full registry test suite (TC-001/001b/003-014, F1 shutdown-path GAP-006, compile-fail DrainHook Send check, default-config and idempotent- shutdown gaps). 22 tests + 1 doctest green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
Tag each parked orphan with its originating worker key and add any_alive_for(key), so a store-wiping path scoped to one worker (the wallet's clear_shielded F2 gate) can refuse only while that worker — its slot or a parked prior-generation thread under its key — is alive, without being blocked by unrelated workers that are legitimately running (e.g. the always-on event-adapter task). Registry-wide any_alive() is retained. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…adRegistry Replace the triplicated per-coordinator lifecycle machinery (background_cancel/join/generation mutexes, the cloned coordinator_orphans list, and the mod.rs free fns join_coordinator_thread / reap_prior_or_park / join_detached_orphans / panic_message) with the shared dash-async ThreadRegistry. The manager holds one Arc<ThreadRegistry<WalletWorker>>; each coordinator's start/stop/ is_running/quiesce now delegate to it under a fixed WalletWorker key, and each exposes its quiescing-gate raise as a registry DrainHook. The wallet-event adapter becomes a registry start_task worker (weight 10, draining after the coordinators it sinks at weight 0). Fixes the two confirmed bugs structurally: - F1: shutdown() is now CoordinatorExitStatus::from_report(registry .shutdown()); each worker's join is bounded by its own join_budget inside the registry, where the live handle stays owned by the slot. A dropped/timed-out join can no longer detach a live thread — it re-parks to Timeout. - F2: clear_shielded() gates the store wipe on registry.any_alive_for(ShieldedSync) (shielded-scoped, so the always-on event adapter and the other coordinators running normally do not block Clear), refusing while a prior-generation shielded thread is parked alive. CoordinatorThreadStatus / CoordinatorExitStatus / all_clean() are byte-stable (FFI destroy maps !all_clean -> ErrorShutdownIncomplete); WorkerStatus maps onto them 1:1 via From. The three wall-clock restart-reap regression tests and the relocated free-fn tests are deleted (subsumed by the dash-async registry suite); F2 (clear_shielded) and R5 (from_report) gain wallet-level tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…te assert asymmetry QA-002: anchor the DrainHook Send+Sync compile_fail doctest to E0277 so it verifies the !Send capture is rejected for the right reason (unsatisfied Send bound) and cannot pass vacuously on an unrelated compile error. QA-001: document the runtime-flavor assert asymmetry — start_thread and shutdown assert a multi-thread runtime but start_task does not, so a task-only consumer (rs-dapi) on a current_thread runtime would panic late at shutdown(). The wallet always uses start_thread, so it trips the assert at start and is unaffected. Fix deferred to the rs-dapi adoption PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Incremental review against head 3cca1cf. The new tokio::join! concurrent drain in shutdown() is structurally sound — each coordinator's quiesce() touches only its own quiescing/is_syncing atomics and background_cancel/background_join mutexes, and the shared coordinator_orphans Mutex is not taken inside quiesce(), so racing the three drains is safe; per-coordinator inner timeouts correctly preserve per-thread CoordinatorThreadStatus that a single outer timeout would have flattened. However, both prior blocking findings are unchanged on this head: a timed-out quiesce still loses the only JoinHandle and silently detaches a live coordinator OS thread, and clear_shielded still wipes the shielded store without consulting coordinator_orphans. Both were unanimously re-flagged by all six agent reviews and both are required for this PR's stated lifetime/UAF guarantees to actually hold.
🔴 1 blocking
1 additional finding(s) omitted (not in diff).
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the shielded store without draining parked coordinator orphans
clear_shielded (mod.rs:644–674) gates coord.clear().await on shielded_sync_manager.quiesce() returning is_clean(), and the doc on SHUTDOWN_JOIN_TIMEOUT_SECS (mod.rs:407–408) explicitly lists clear_shielded and the FFI shielded-stop bridge as using the same backstop. But neither path inspects self.coordinator_orphans. Shielded coordinator start() parks a prior wedged shielded OS thread into that shared list via reap_prior_or_park (mod.rs:296–332) whenever a tight stop()→start() cannot reap within the 1 s wedge backstop. Only shutdown() actually drains the list via join_detached_orphans (mod.rs:820–824).
A parked shielded thread still holds an Arc to the same NetworkShieldedCoordinator and persister context that coord.clear() (mod.rs:670–672) is about to wipe; the doc on lines 663–666 warns exactly about this — "a surviving pass writing into a store we just cleared, desyncing the host's own wipe from a repopulated tree." Because the orphan drain is missing, a sequence of stop()→start() that wedges the prior shielded thread, followed by clear_shielded, lets the wipe proceed while the parked thread is still alive and can persister.store(...) / fire on_shielded_* host callbacks against the cleared store.
The same defect is in the FFI platform_wallet_manager_shielded_sync_stop bridge: it awaits only quiesce(), so a non-clean orphan goes unreported and the host (which sees ok()) is free to release the callback context that a parked shielded thread still references.
Fix shape (symmetric with shutdown()): in both clear_shielded and platform_wallet_manager_shielded_sync_stop, after quiesce() returns clean, run a bounded join_detached_orphans(&self.coordinator_orphans, deadline) and fold the result into the precondition — return ShieldedShutdownIncomplete { status } (and leave the store intact) unless both the live drain and the orphan drain are clean. Caveat: coordinator_orphans is shared across all three coordinator kinds; either accept the conservative wait for identity/platform-address orphans before a shielded Clear, or tag parked handles with their coordinator kind so Clear can drain only the shielded subset.
…c accuracy Apply consolidated review findings to the shared ThreadRegistry and the wallet's shutdown/clear paths. Concurrency hardening (with regression tests): - generation-guard quiesce(): a concurrent same-key restart can no longer have its fresh live handle parked or reported Timeout — the superseded quiesce returns NotRunning and the gen-scoped Repark leaves the new handle alone. - graceful spawn-failure rollback: start_thread no longer .expect()s on thread spawn; a failure re-installs the prior handle (never detached), clears the running flag, and returns instead of panicking across the FFI boundary. - panic-safe epilogue: an EpilogueGuard drop-guard runs the gen-gated epilogue even when a worker body unwinds, so is_running() reflects a crash and start() can relaunch it. - closing latch: shutdown() latches the registry closed under the slot lock before snapshotting tiers; start_thread/start_task refuse new workers once teardown begins, so a start racing shutdown cannot leave an un-joined worker behind. - clear_shielded holds the shielded quiescing gate across its liveness check and store wipe (closes the direct sync_now/sync_wallet TOCTOU) and documents the residual start-vs-clear host-serialization precondition. Hygiene: - feature-gate park_orphan_for_test behind a new test-util feature so the mutation seam never ships in a downstream production build. - add Debug to WorkerConfig/ThreadRegistry; add # Panics docs to start_thread/start_task/shutdown; drop a tombstone comment. - correct the WorkerStatus<->CoordinatorThreadStatus "byte-identical" claim to "exhaustive by-name match, never a layout cast". - repoint dead [CoordinatorOrphans] and [spawn_wallet_event_adapter] intra-doc links at the surviving registry / renamed loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…three sync coordinators The migration onto ThreadRegistry left the coordinator-side delegate + pass-gating boilerplate copy-pasted across IdentitySyncManager, PlatformAddressSyncManager, and ShieldedSyncManager: the same five fields (registry, interval, is_syncing, quiescing, last_sync), byte- identical interval/is_syncing/last_sync/drain_hook/stop/quiesce delegations, and — critically — the subtle is_syncing-CAS + quiescing-gate pass preamble reproduced four times. Hoist all of it into one CoordinatorLifecycle helper that each coordinator embeds and delegates to, so the teardown-critical handshake has a single home. begin_pass() folds the CAS + guard + gate check into one RAII-guard- returning call; hold_quiescing_gate() is the shared primitive the shielded Clear flow holds across its check-then-wipe. Each coordinator now keeps only its domain-specific pass body. Behaviour-preserving: the full platform-wallet lib suite (307 tests, both default and shielded) and the coordinator pass-gate tests are unchanged and green. Also fixes the dangling [JoinHandle] intra-doc link in core_bridge (its import was removed in the migration) by fully qualifying it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Both prior blocking findings are structurally fixed in the new ThreadRegistry engine (prior-1 by the slot-owned JoinHandle + generation-scoped Repark drop-guard; prior-2 by clear_shielded's any_alive_for(ShieldedSync) gate under a held quiescing guard). One new blocking regression was introduced by the CoordinatorLifecycle dedup: the shared quiesce() no longer raises the quiescing gate itself or drains is_syncing — it only delegates to ThreadRegistry::quiesce, whose drain hook is skipped when no slot is registered — so a direct sync_now/sync_wallet in flight (which the FFI exposes without requiring the background loop to be running) is no longer awaited before quiesce reports clean. Two narrower follow-ups (symmetric orphan check on shielded_sync_stop FFI; out-of-lock prior-handle reap racing shutdown's orphan reap) round out the review.
🔴 1 blocking | 🟡 2 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:146-149: Shared quiesce() no longer drains direct in-flight sync passes
The pre-refactor per-coordinator `quiesce()` explicitly raised `quiescing` and awaited `is_syncing` to fall before joining (`self.quiescing.store(true, ...); self.stop(); while self.is_syncing { sleep(20ms) }`). The new shared `CoordinatorLifecycle::quiesce` is `let _g = AtomicFlagGuard::new(&self.quiescing); self.registry.quiesce(self.worker).await.into()`. `AtomicFlagGuard::new` (rs-dash-async/src/atomic.rs:18-22) deliberately does **not** raise the flag — only the registry's drain hook does, and `ThreadRegistry::quiesce` (registry.rs:507-519) early-returns `NotRunning` without invoking the drain hook whenever no live slot is registered.
The FFI exposes `platform_wallet_manager_shielded_sync_sync_now` / `shielded_sync_wallet` (shielded_sync.rs:208, 590) — and the same shape for identity / platform-address — which call `shielded_sync().sync_now(true)` / `sync_wallet(..., true)` directly, with no requirement that the background loop is running. So a host-driven direct pass can be in flight while `clear_shielded` calls `quiesce()`: the registry has no slot, the drain hook never fires, the gate is never raised, and `quiesce()` returns `NotRunning` — clean — immediately. `clear_shielded` (mod.rs:531-562) then takes the `is_clean()` branch, raises the gate via `hold_quiescing_gate()` (too late — the in-flight pass already passed `begin_pass()` and does not recheck mid-pass), and `any_alive_for(ShieldedSync)` returns false (it only inspects slots/orphans, not `is_syncing`). `coord.clear().await` wipes the store while the direct pass keeps writing through `coord.sync(...)` and its persister fan-out.
This directly contradicts the documented contract on `clear_shielded` (mod.rs:517-518: "A concurrent direct `sync_now`/`sync_wallet` is held off (the quiescing gate stays raised across the liveness check and the wipe)"), and is the same store-desync class the prior F2 finding closed for parked orphans. Fix by restoring the explicit gate raise + `is_syncing` drain in the shared `quiesce`, before delegating to the registry — symmetric with what each coordinator did before the dedup.
In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:117-134: shielded_sync_stop FFI does not consult any_alive_for(ShieldedSync)
`platform_wallet_manager_shielded_sync_stop` returns `Success` whenever the per-key `quiesce()` lands clean (Ok/NotRunning). It does not consult `registry.any_alive_for(WalletWorker::ShieldedSync)` — the symmetric gate `clear_shielded` adopted at manager/mod.rs:553 (the F2 fix) and the destroy path covers via `CoordinatorExitStatus::all_clean()` which folds in `detached_threads`.
A prior tight start->stop reap that had to park a wedged shielded thread as an orphan would not block the next stop's `Success`, even though the parked thread still holds `Arc`s to the host's `FFIEventHandler::context` / `FFIPersister::context` pointers and can fire one final callback through them. The function's docstring (shielded_sync.rs:80-83) says "On this code the host must NOT free the callback context immediately" only on the non-clean branch — implying that on `Success` the host MAY free it.
Today's only consumer (Swift `deinit`) always follows stop with destroy, and destroy's `manager.shutdown()` drains orphans, so no UAF currently reaches a host. The contract gap is still worth closing symmetrically with the other two quiesce-then-touch-host-context teardown paths.
In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:406-414: start_thread's out-of-lock prior reap races shutdown's orphan reap
`start_thread` takes the prior handle under the slot lock, releases the lock at line 406, then calls `self.reap_prior_or_park(prior, key)` synchronously at line 413. `reap_prior_or_park` for an OS thread (registry.rs:737-762) spins up to `reap_backstop` (1 s in the wallet) before pushing a still-wedged handle into orphans. The `closing` latch and shutdown's tier snapshot only synchronize on the slot lock — they do not block on this out-of-lock reap. Interleaving from different threads:
1. A: `start_thread` takes the slot lock, passes the `closing` check, takes the wedged `prior`, installs the new gen handle, releases the lock. A is now spinning in `reap_prior_or_park`.
2. B: `shutdown()` takes the slot lock, sets `closing=true`, snapshots tiers (sees only the NEW gen handle — `prior` was already moved out), releases.
3. B drains the new gen via `quiesce` (clean).
4. B calls `reap_orphans_impl(reap_backstop)` — orphan list is still empty because A hasn't pushed yet.
5. B returns a clean `ShutdownReport` → `CoordinatorExitStatus::all_clean()` → FFI `destroy` returns `ok()` → host frees its callback context.
6. A's 1 s backstop expires; A pushes the still-live wedged prior into orphans and returns.
The prior thread is now a live orphan holding `Arc`s to the freed host context — the same UAF surface F1 was designed to close. Requires (a) a wedged-past-backstop prior, (b) `start_thread` invoked concurrently with `shutdown` from a different thread, and (c) shutdown's reap_orphans phase to land in A's 1 s backstop window — narrow, but the asymmetric to the closing-latch protection the PR explicitly adds. Close by parking `prior` immediately under the slot lock and reaping asynchronously, or by tracking in-flight reaps so `reap_orphans_impl` waits for them to settle.
|
@coderabbitai review all |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
packages/rs-platform-wallet-ffi/src/manager.rs (1)
363-365: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUpdate the inline comment to match the new FFI behavior.
Lines 363-365 still say destroy will "just log it and drop it", but Lines 367-382 now surface non-clean shutdowns as
ErrorShutdownIncomplete. Leaving the old comment here is misleading in the callback-lifetime path this PR is hardening.✏️ Suggested edit
- // It now joins the coordinator OS threads and returns their - // per-thread exit status; the C ABI exposes none of that, so we - // just log it (a panicked loop is worth surfacing) and drop it. + // It now joins the coordinator OS threads and returns their + // per-thread exit status. The C ABI does not expose that full + // structure directly, but a non-clean aggregate is surfaced as + // `ErrorShutdownIncomplete` so the host can keep its callback + // context alive until lingering workers settle.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet-ffi/src/manager.rs` around lines 363 - 365, The inline comment near the destroy/callback-lifetime handling in manager::destroy is outdated and still says the shutdown result is only logged and dropped. Update that comment to describe the current FFI behavior in destroy and the related callback-lifetime path: the coordinator OS threads are joined, and non-clean shutdowns are surfaced as ErrorShutdownIncomplete instead of being silently discarded. Keep the wording aligned with the destroy logic and ErrorShutdownIncomplete handling so the comment matches the implemented shutdown semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-dash-async/src/registry.rs`:
- Around line 674-679: The shutdown path in `Registry::shutdown` is discarding
the orphan result status from `reap_orphans_impl()`, which can hide non-clean
exits. Update `ShutdownReport` and its construction so the orphan status is
preserved alongside `detached`, and adjust `ShutdownReport::all_clean()` to
consider that stored status instead of treating parked orphans as clean by
default.
In `@packages/rs-platform-wallet/src/changeset/core_bridge.rs`:
- Around line 68-110: When the cancel branch in core_bridge.rs wins the
tokio::select! race, the loop exits immediately and can drop buffered broadcast
events, so update the cancellation handling in the main event loop to drain
receiver before breaking. Use the existing build_core_changeset and
persister.store flow for each pending event from receiver.try_recv(), preserve
the core.is_empty_no_records() skip, and keep the existing error/warn logging
style keyed by wallet_id so shutdown still persists any queued CoreChangeSet
updates.
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 407-410: Update the `stop()` documentation in `identity_sync.rs`
to reflect that background sync passes are cancellable and may exit early when
shutdown is requested. Adjust the wording around `sync_now()`/`tokio::select!`
in `stop()` and any related comments so it no longer claims a pass will always
run to completion after `stop()`, and make the doc mention that cancellation can
interrupt an in-flight `.await`.
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 515-522: The `clear_shielded`/`shielded_sync_start` path in
`manager::mod` still relies on a host-side precondition, but `start()` can race
`clear()` and re-persist into a wiped store. Add a per-key clearing latch in the
wallet manager/registry logic, have `shielded_sync_start` honor it, and keep it
held across quiesce, liveness check, and `coord.clear()` so the clear/start
exclusion is enforced inside the wallet rather than by the caller.
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 206-210: Update the `stop()` documentation in
`PlatformAddressSync` to reflect that cancellation can win over `sync_now()` in
the `tokio::select!` loop, so a background pass may exit before reaching
`on_platform_address_sync_completed`. Adjust the docs/comments near `stop()`,
`sync_now()`, and the select loop to clearly state that shutdown does not
guarantee completion of an in-flight sync pass.
In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 226-230: Update the stop() documentation to reflect that
background-pass cancellation can interrupt an in-flight sync_now(false) call
rather than guaranteeing the persister fan-out completes. In shielded_sync.rs,
adjust the docs/comments around the tokio::select! path in
ShieldedSyncManager::stop and the background coordinator.sync() flow to state
that cancellation may win the race and the sync can be dropped at an await. Keep
the description aligned with the actual behavior of sync_now(false) and the
cancel.cancelled() branch.
---
Nitpick comments:
In `@packages/rs-platform-wallet-ffi/src/manager.rs`:
- Around line 363-365: The inline comment near the destroy/callback-lifetime
handling in manager::destroy is outdated and still says the shutdown result is
only logged and dropped. Update that comment to describe the current FFI
behavior in destroy and the related callback-lifetime path: the coordinator OS
threads are joined, and non-clean shutdowns are surfaced as
ErrorShutdownIncomplete instead of being silently discarded. Keep the wording
aligned with the destroy logic and ErrorShutdownIncomplete handling so the
comment matches the implemented shutdown semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ef1d0648-a464-44ab-8151-583159ad51f5
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (20)
packages/rs-dash-async/Cargo.tomlpackages/rs-dash-async/src/atomic.rspackages/rs-dash-async/src/lib.rspackages/rs-dash-async/src/registry.rspackages/rs-platform-wallet-ffi/Cargo.tomlpackages/rs-platform-wallet-ffi/src/error.rspackages/rs-platform-wallet-ffi/src/manager.rspackages/rs-platform-wallet-ffi/src/shielded_sync.rspackages/rs-platform-wallet/Cargo.tomlpackages/rs-platform-wallet/src/changeset/core_bridge.rspackages/rs-platform-wallet/src/changeset/mod.rspackages/rs-platform-wallet/src/error.rspackages/rs-platform-wallet/src/lib.rspackages/rs-platform-wallet/src/manager/coordinator_lifecycle.rspackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rspackages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swiftpackages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift
| // Account for parked orphans last. | ||
| let (_status, detached) = self.reap_orphans_impl(self.reap_backstop).await; | ||
| ShutdownReport { | ||
| per_worker, | ||
| detached, | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift
Preserve non-clean orphan status in ShutdownReport.
reap_orphans_impl() can return Panicked, Stopped, or Error for an orphan that exits before the grace deadline, but shutdown() discards _status and only keeps detached. That lets ShutdownReport::all_clean() report clean when a parked worker actually ended non-cleanly.
Suggested shape
pub struct ShutdownReport<K: RegistryKey> {
pub per_worker: BTreeMap<K, WorkerStatus>,
+ pub orphan_status: WorkerStatus,
pub detached: usize,
}
pub fn all_clean(&self) -> bool {
- self.detached == 0 && self.per_worker.values().all(WorkerStatus::is_clean)
+ self.detached == 0
+ && self.orphan_status.is_clean()
+ && self.per_worker.values().all(WorkerStatus::is_clean)
}
- let (_status, detached) = self.reap_orphans_impl(self.reap_backstop).await;
+ let (orphan_status, detached) = self.reap_orphans_impl(self.reap_backstop).await;
ShutdownReport {
per_worker,
+ orphan_status,
detached,
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Account for parked orphans last. | |
| let (_status, detached) = self.reap_orphans_impl(self.reap_backstop).await; | |
| ShutdownReport { | |
| per_worker, | |
| detached, | |
| } | |
| // Account for parked orphans last. | |
| let (orphan_status, detached) = self.reap_orphans_impl(self.reap_backstop).await; | |
| ShutdownReport { | |
| per_worker, | |
| orphan_status, | |
| detached, | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/rs-dash-async/src/registry.rs` around lines 674 - 679, The shutdown
path in `Registry::shutdown` is discarding the orphan result status from
`reap_orphans_impl()`, which can hide non-clean exits. Update `ShutdownReport`
and its construction so the orphan status is preserved alongside `detached`, and
adjust `ShutdownReport::all_clean()` to consider that stored status instead of
treating parked orphans as clean by default.
| loop { | ||
| tokio::select! { | ||
| recv = receiver.recv() => { | ||
| match recv { | ||
| Ok(event) => { | ||
| let wallet_id = event.wallet_id(); | ||
| // For events that need to consult per-wallet | ||
| // state (today only `TransactionInstantLocked`, | ||
| // which checks finality before recording the IS | ||
| // lock), grab a brief read lock on the manager. | ||
| let core = build_core_changeset(&wallet_manager, &event).await; | ||
| if core.is_empty_no_records() { | ||
| // SyncHeightAdvanced for an unknown wallet, | ||
| // empty BlockProcessed, etc. — nothing to | ||
| // persist. Skip the round-trip. | ||
| continue; | ||
| } | ||
| Err(RecvError::Lagged(n)) => { | ||
| let cs = PlatformWalletChangeSet { | ||
| core: Some(core), | ||
| ..PlatformWalletChangeSet::default() | ||
| }; | ||
| if let Err(e) = persister.store(wallet_id, cs) { | ||
| tracing::warn!( | ||
| missed = n, | ||
| "wallet-event adapter lagged on broadcast channel; some events were dropped" | ||
| wallet_id = %hex::encode(wallet_id), | ||
| error = %e, | ||
| "Persister rejected core changeset; state will be re-emitted on next sync round" | ||
| ); | ||
| } | ||
| } | ||
| Err(RecvError::Closed) if cancel.is_cancelled() => break, | ||
| Err(RecvError::Closed) => { | ||
| tracing::error!("WalletEvent broadcast closed unexpectedly"); | ||
| break; | ||
| } | ||
| Err(RecvError::Lagged(n)) => { | ||
| tracing::warn!( | ||
| missed = n, | ||
| "wallet-event adapter lagged on broadcast channel; some events were dropped" | ||
| ); | ||
| } | ||
| } | ||
| _ = cancel.cancelled() => break, | ||
| } | ||
| _ = cancel.cancelled() => break, |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major
Drain pending events before exiting on cancellation
If the cancel signal wins the race in tokio::select!, the loop exits immediately. Since the receiver is a broadcast channel, any events queued in the buffer at that moment will be dropped, potentially losing persistence updates for CoreChangeSet entries generated before shutdown.
Refactor the cancelled branch to drain the receiver:
_ = cancel.cancelled() => {
while let Ok(event) = receiver.try_recv() {
let core = build_core_changeset(&wallet_manager, &event).await;
if core.is_empty_no_records() {
continue;
}
if let Err(e) = persister.store(event.wallet_id(), PlatformWalletChangeSet {
core: Some(core),
..Default::default()
}) {
tracing::warn!(wallet_id = %hex::encode(event.wallet_id()), "Drain persist failed");
}
}
break;
},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/rs-platform-wallet/src/changeset/core_bridge.rs` around lines 68 -
110, When the cancel branch in core_bridge.rs wins the tokio::select! race, the
loop exits immediately and can drop buffered broadcast events, so update the
cancellation handling in the main event loop to drain receiver before breaking.
Use the existing build_core_changeset and persister.store flow for each pending
event from receiver.try_recv(), preserve the core.is_empty_no_records() skip,
and keep the existing error/warn logging style keyed by wallet_id so shutdown
still persists any queued CoreChangeSet updates.
| tokio::select! { | ||
| biased; | ||
| _ = cancel.cancelled() => break, | ||
| _ = this.sync_now() => {} |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Update stop() docs for cancellable background passes.
Line 407 races sync_now() against cancellation, so a background pass can be dropped at an .await instead of “running to completion” after stop().
Suggested doc adjustment
- /// **Cancel-only**: requests cancellation and returns immediately. A
- /// pass already inside `sync_now` keeps running to completion,
- /// including its `persister.store(...)` fan-out. For a real "nothing
+ /// **Cancel-only**: requests cancellation and returns immediately. In
+ /// the background loop, cancellation can drop an in-flight `sync_now()`
+ /// at its next `.await`; direct caller-owned `sync_now()` futures are
+ /// not cancelled by this method. For a real "nothingAlso applies to: 423-430
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 407 -
410, Update the `stop()` documentation in `identity_sync.rs` to reflect that
background sync passes are cancellable and may exit early when shutdown is
requested. Adjust the wording around `sync_now()`/`tokio::select!` in `stop()`
and any related comments so it no longer claims a pass will always run to
completion after `stop()`, and make the doc mention that cancellation can
interrupt an in-flight `.await`.
| tokio::select! { | ||
| biased; | ||
| _ = cancel.cancelled() => break, | ||
| _ = this.sync_now() => {} | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Update stop() docs for cancellation-winning passes.
Line 206 lets cancellation win over sync_now(), so an in-flight background pass is not guaranteed to reach on_platform_address_sync_completed after stop().
Suggested doc adjustment
- /// **Cancel-only**: requests cancellation and returns immediately. A
- /// pass already inside `sync_now` keeps running to completion,
- /// including its `on_platform_address_sync_completed` host-callback
- /// dispatch. For a real "nothing is running and nothing more will
+ /// **Cancel-only**: requests cancellation and returns immediately. In
+ /// the background loop, cancellation can drop an in-flight `sync_now()`
+ /// at its next `.await`, so the completion callback is not guaranteed
+ /// after this method. For a real "nothing is running and nothing more willAlso applies to: 222-230
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around
lines 206 - 210, Update the `stop()` documentation in `PlatformAddressSync` to
reflect that cancellation can win over `sync_now()` in the `tokio::select!`
loop, so a background pass may exit before reaching
`on_platform_address_sync_completed`. Adjust the docs/comments near `stop()`,
`sync_now()`, and the select loop to clearly state that shutdown does not
guarantee completion of an in-flight sync pass.
| tokio::select! { | ||
| biased; | ||
| _ = cancel.cancelled() => break, | ||
| _ = this.sync_now(false) => {} | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Update stop() docs for background-pass cancellation.
Line 226 races cancellation against sync_now(false), so a background coordinator.sync() can be dropped at an .await rather than always running through its persister fan-out.
Suggested doc adjustment
- /// **Cancel-only**: this requests cancellation and returns
- /// immediately. A pass already inside `sync_now` /
- /// `coordinator.sync()` keeps running to completion (including its
- /// persister-callback fan-out). For a real "nothing is running and
+ /// **Cancel-only**: this requests cancellation and returns
+ /// immediately. In the background loop, cancellation can drop an
+ /// in-flight `sync_now(false)` / `coordinator.sync()` at its next
+ /// `.await`. For a real "nothing is running andAlso applies to: 242-249
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 226 -
230, Update the stop() documentation to reflect that background-pass
cancellation can interrupt an in-flight sync_now(false) call rather than
guaranteeing the persister fan-out completes. In shielded_sync.rs, adjust the
docs/comments around the tokio::select! path in ShieldedSyncManager::stop and
the background coordinator.sync() flow to state that cancellation may win the
race and the sync can be dropped at an await. Keep the description aligned with
the actual behavior of sync_now(false) and the cancel.cancelled() branch.
…uiesce regardless of a running loop registry.quiesce early-returns NotRunning WITHOUT running the DrainHook when no background-loop slot is registered, so quiesce left the quiescing gate down and never drained an in-flight pass. A concurrent direct sync_now/sync_wallet that had already cleared begin_pass() was therefore not held off, breaking the clear_shielded/stop contract. Fix: quiesce now raises quiescing itself (gate-before-cancel preserved; the AtomicFlagGuard still reopens it on return), then — after the registry's bounded cancel+join — drains is_syncing when the status is clean. Draining only on a clean status keeps a wedged loop pass (reported Timeout, its thread orphaned) from reintroducing the shutdown stall the bounded join exists to prevent, while still covering the no-loop and idle-loop+direct-pass cases. TDD: new test fails against the pre-fix delegating quiesce. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…so shutdown can't miss it start_thread took the prior handle out of the slot and ran reap_prior_or_park (a ~1s backstop spin) OUT of the slot lock. A concurrent shutdown() could take the lock, latch closing, snapshot tiers seeing only the NEW handle, release, and reap an EMPTY orphan list — reporting clean while the wedged prior was still live and un-joined. Fix: park the prior into orphans UNDER the slot lock (park_prior_locked), making take-prior + park-prior atomic from shutdown's under-lock view; the bounded join stays out of the lock (reap_parked_prior, which finds the prior by ThreadId, removes+joins it when finished, or leaves a genuine wedge parked). start_task parks under the lock too. This introduces the module's only slots->orphans nesting; it is deadlock-free since no path takes slots while holding orphans. TDD: long-backstop test asserts the prior is parked before the spin could elapse; fails pre-fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…iveness, like clear_shielded/destroy shielded_sync_stop returned Success on status.is_clean() alone, ignoring a prior-generation shielded thread still parked alive as an orphan — asymmetric with clear_shielded/destroy and a misleading contract (the orphan still holds the host callback context). No live UAF today since Swift always does stop->destroy, but Success should imply no live shielded worker/orphan. Add manager::shielded_worker_alive() (the same shielded-scoped any_alive_for gate clear_shielded consults) and have shielded_sync_stop return ErrorShutdownIncomplete when a parked orphan survives a clean drain. FFI ABI unchanged (same return-code semantics); docstring updated so Success accurately implies no live shielded worker/orphan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…cing gate continuously SEC-001: clear_shielded's in-flight-pass drain was unbounded and the FFI clear bridge a bare block_on, so a heavy direct pass could hang the host (ANR). Bound it with a SHUTDOWN_JOIN_TIMEOUT_SECS timeout (mirroring shielded_sync_stop); on timeout the clear reports Timeout and aborts BEFORE the wipe, leaving the store intact. Split out clear_shielded_inner(drain_timeout) so the timeout path is testable without the 30s budget. SEC-002/RUST-002: the gate lapsed between quiesce() returning (its RAII guard lowers the shared flag) and the post-drain re-raise, letting a direct sync_now/sync_wallet slip past any_alive_for and re-persist into the wiped store. Fix: raise+HOLD the gate via clear's own guard BEFORE draining, and drain via a new gate-neutral quiesce_under_held_gate (extracted cancel_join_and_drain shared with quiesce, which stays byte-identical — Fix-1 invariant untouched). The gate now stays raised continuously across drain, liveness check, and wipe; doc softened to note the only residual is a full start() racing clear (per-key-latch follow-up). PROJ-004: clear now calls shielded_worker_alive() instead of re-inlining any_alive_for. Also clarifies the quiesce doc that only shielded gates sync_wallet (platform-address's is intentionally ungated). TDD: both SEC tests proven non-vacuous (revert->fail->restore). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…story-tags RUST-004: a failed thread spawn left the slot carrying the FAILED start's weight/drain/join_budget (and a bumped generation). Now snapshot the pre-start config and restore ALL of it on the Err path, so the re-installed prior keeps its own teardown config; generation rolls back too (the +1 is only observed under the slot lock and the failed start spawns no thread, so the rollback is net-zero and the externally-visible generation stays monotonic). New regression test asserts the restored config. RUST-005: trimmed the duplicated park-under-lock rustdoc block in start_thread that repeated park_prior_locked's doc. RUST-003/PROJ hygiene: removed [F1/F2/F3 FIX] history-tags from committed comments, replaced the 'Why F1 and F2 cannot recur' module section with present-state invariant descriptions, and fixed the glossary to reference the key-scoped any_alive_for (the gate store-wiping paths actually consult) rather than the registry-wide any_alive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…an-trigger contract shielded_sync_stop now also returns .errorShutdownIncomplete when the drain was clean but a prior-generation shielded thread is still parked alive as an orphan (not only when the in-flight drain times out). Update the manager deinit comment to reflect both triggers; behaviour unchanged (the deinit already leaks one strong ref to the handlers on .errorShutdownIncomplete). Comment-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…fencing (SeqCst) SEC-003: the pass-gate handshake is a Dekker-style StoreLoad pair across two distinct atomics — quiesce does store(quiescing) … load(is_syncing); begin_pass does CAS(is_syncing) … load(quiescing). Release/Acquire do NOT order StoreLoad across separate locations, so by the annotations alone both sides could miss each other (begin_pass reads a stale quiescing==false and runs a pass past a raised gate while the drain reads a stale is_syncing==false and returns). It was sound only incidentally — registry.quiesce happens to take the slots Mutex (a fence) before returning; a lock-free refactor of that path would make the race live. Promote the four handshake ops to SeqCst (a single total order guarantees at least one side observes the other): the gate-raise stores in quiesce, hold_quiescing_gate, and the registry drain hook; the is_syncing CAS (success) and quiescing load in begin_pass; and the is_syncing load in the drain. Gate lowering (reopen / RAII drop) and observational reads stay Release/Acquire — a stale-high gate read only makes a pass bail conservatively. Fix-1's gate-before-cancel + never-latched invariant is preserved (SeqCst is strictly stronger than the prior Release). Added a load-bearing-ordering comment at begin_pass. Not unit-testable (ordering); the existing gate/drain handshake tests still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
All three prior findings (911f99f) are resolved at 748c4f8: CoordinatorLifecycle::quiesce now self-raises the quiescing gate and drains in-flight direct passes (coordinator_lifecycle.rs:166-173, with a regression test); the shielded_sync_stop FFI now consults manager.shielded_worker_alive() and returns ErrorShutdownIncomplete when a prior-generation shielded orphan is parked (shielded_sync.rs:127-157); and start_thread now parks the prior worker under the slot lock via park_prior_locked before the out-of-lock reap, with a regression test in the registry. One in-scope cumulative concern remains: the shared quiescing AtomicBool means a concurrent quiesce()'s AtomicFlagGuard drop can lower the gate that clear_shielded's hold_quiescing_gate is supposed to keep raised continuously — the mechanism does not match the doc's continuous-hold promise in clear_shielded_inner. In practice this is gated by the host's UI-thread serialization, so it is a suggestion, not a blocker.
🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-173: clear_shielded's "continuously held" gate can be lowered by a concurrent quiesce()
`clear_shielded_inner` holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across drain → liveness check → wipe, expecting the shielded `quiescing` boolean to stay `true` for the whole sequence (and the doc at manager/mod.rs:517-520 explicitly promises "the quiescing gate is raised continuously" so concurrent direct `sync_now`/`sync_wallet` are held off). But the same `AtomicBool` backs `CoordinatorLifecycle::quiesce`, which builds its own `AtomicFlagGuard` (coordinator_lifecycle.rs:171); `AtomicFlagGuard::drop` unconditionally stores `false` (rs-dash-async/src/atomic.rs:25-29). If `platform_wallet_manager_shielded_sync_stop` is invoked concurrently with `clear_shielded` (the FFI handle store only takes a brief read lock in `with_item`, so two calls on the same handle can interleave), the stop's `quiesce()` returns and its dropping guard clears the flag while clear is still between its drain and its wipe. A direct `sync_now`/`sync_wallet` then sees `quiescing=false` in `begin_pass`, claims `is_syncing`, and can re-persist into the store `coord.clear()` is about to wipe — a store/host desync. The host's UI-thread serialization mitigates this today, but the documented invariant on `clear_shielded` only excludes concurrent `shielded_sync_start`, not `_stop`. Fix options: refcount the gate (so the inner drop doesn't lower it under the outer hold), use a clear-specific second flag that `begin_pass` also consults, or tighten the host-serialization doc to enumerate `_stop` too.
| pub(crate) async fn quiesce(&self) -> CoordinatorThreadStatus { | ||
| // Gate up first (instant) and held until the guard drops on return. | ||
| // SeqCst: store-half of the `quiescing`<->`is_syncing` handshake | ||
| // (see `begin_pass`). | ||
| self.quiescing.store(true, Ordering::SeqCst); | ||
| let _quiescing_gate = AtomicFlagGuard::new(&self.quiescing); | ||
| self.cancel_join_and_drain().await | ||
| } |
There was a problem hiding this comment.
🟡 Suggestion: clear_shielded's "continuously held" gate can be lowered by a concurrent quiesce()
clear_shielded_inner holds _clearing_gate = shielded_sync_manager.hold_quiescing_gate() across drain → liveness check → wipe, expecting the shielded quiescing boolean to stay true for the whole sequence (and the doc at manager/mod.rs:517-520 explicitly promises "the quiescing gate is raised continuously" so concurrent direct sync_now/sync_wallet are held off). But the same AtomicBool backs CoordinatorLifecycle::quiesce, which builds its own AtomicFlagGuard (coordinator_lifecycle.rs:171); AtomicFlagGuard::drop unconditionally stores false (rs-dash-async/src/atomic.rs:25-29). If platform_wallet_manager_shielded_sync_stop is invoked concurrently with clear_shielded (the FFI handle store only takes a brief read lock in with_item, so two calls on the same handle can interleave), the stop's quiesce() returns and its dropping guard clears the flag while clear is still between its drain and its wipe. A direct sync_now/sync_wallet then sees quiescing=false in begin_pass, claims is_syncing, and can re-persist into the store coord.clear() is about to wipe — a store/host desync. The host's UI-thread serialization mitigates this today, but the documented invariant on clear_shielded only excludes concurrent shielded_sync_start, not _stop. Fix options: refcount the gate (so the inner drop doesn't lower it under the outer hold), use a clear-specific second flag that begin_pass also consults, or tighten the host-serialization doc to enumerate _stop too.
source: ['codex']
Why this PR exists
Problem. The SPV wallet's three sync coordinators (identity / platform-address / shielded) and the event-adapter each ran their own ad-hoc worker lifecycle. The coordinators must run on dedicated OS threads (
std::thread+Handle::block_on) because the dash-sdk sync futures are!Send, so a tokio task group cannot host them. The teardown logic was triplicated across the three coordinators plus two parallel bookkeeping structures, and two teardown paths could leave a live coordinator thread detached while the host freed its context.What breaks without it (concrete):
shutdown()/quiesce()joined a coordinator by taking itsJoinHandleby value into a future wrapped in a 30 stokio::time::timeout. If the timeout fired mid-poll, the future dropped and the handle dropped with it — silently detaching a still-live OS thread without parking it. The nextdestroysawbackground_join == None→NotRunning→ reported clean, so the host freed the callback context while the detached thread could still fire into it.clear_shielded()wiped the shielded store gated only on the current coordinator quiesce being clean. A prior-generation shielded thread parked after a stop→start race still heldArcrefs to the persister being wiped, so the clear could run under a live writer.Blocking relationship. Builds on the in-branch shutdown-join work. Introduces a generic engine in the shared
dash-asynccrate thatrs-dapicould adopt later (nors-dapichange here).What was done?
ThreadRegistry<K>indash-async— one generic engine managing both!SendOS-thread workers andSendtokio tasks behind a uniform API: a generation-match epilogue, reap-or-park orphan tracking, weight-ordered gracefulshutdown(), and per-worker async drain hooks. Theis_syncing/quiescingdomain barrier stays in each coordinator and is handed to the registry as aDrainHook— the registry owns the lifecycle machinery, not the domain semantics.background_*fields, theCoordinatorOrphanstype); extracted aCoordinatorLifecyclehelper to dedup the remaining sharedis_syncing/quiescingprotocol.JoinHandleowned by the registry slot (never moved into a cancellable future) and deterministically re-park on timeout; F2 fixed — a single key-scopedany_alive_for()gate over live + parked workers blocks the store wipe.quiesce(no spuriousTimeout/ErrorShutdownIncompletewhen a same-keystartraces a slow quiesce), graceful thread-spawn-failure re-park with full slot-config rollback (no detach, no unwind across theextern "C"shim onEAGAIN), panic-safe epilogue (a panicking worker still runs cleanup sois_running()stays truthful), a terminalclosinglatch so astartracingshutdowncannot leave an un-joined worker, a restarted worker's prior parked under the slot lock soshutdown()'s orphan scan can never miss it,clear_shielded's drain bounded by a timeout (no host ANR) and itsquiescinggate held continuously across drain→liveness-check→wipe, and thequiescing↔is_syncinghandshake made self-fencing (SeqCst) so its correctness is by-design rather than incidental.CoordinatorLifecyclehelper.CoordinatorThreadStatus/CoordinatorExitStatus/all_clean()are unchanged (the registry'sWorkerStatusmaps 1:1 viaFrom); no FFI or other public platform-wallet API break.How Has This Been Tested?
platform-walletlib: 210 (default) / 310 (--features shielded) passing — at/above baseline, no coverage regression.dash-async: 29 unit tests + acompile_faildoctest, including new security regressions; the F1/F2 + lifecycle fixes are proven non-vacuous (reverting each fix makes its regression test fail, then restored).cargo clippy --all-targets -- -D warningsclean (dash-async; platform-wallet and platform-wallet-ffi in both default and--features shielded);cargo fmt --all --checkclean.quiescing↔is_syncinghandshake's mutual-exclusion proven both ways (Dekker), gate-continuity proven across the wipe, and Fix-1's gate-before-cancel/never-latched invariant confirmed intact.clear_shielded-vs-concurrent-starthost-serialization window (store-desync, not UAF), and astart_taskruntime-flavor assert for the future task-only (rs-dapi) consumer.Breaking Changes
PlatformWalletManager::shutdown()now returnsCoordinatorExitStatus(was()); FFIdestroymaps!all_clean()→ErrorShutdownIncomplete— hence the!in the title. The status enum variants are byte/variant-stable across this change.dash-async(ThreadRegistry,WorkerStatus,WorkerConfig,ShutdownWeight, …) and a new dev-onlydash-asynctest-utilfeature.Checklist:
🤖 Co-authored by Claudius the Magnificent AI Agent