Skip to content

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954

Open
Claudius-Maginificent wants to merge 30 commits into
v3.1-devfrom
feat/platform-wallet-shutdown-join
Open

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954
Claudius-Maginificent wants to merge 30 commits into
v3.1-devfrom
feat/platform-wallet-shutdown-join

Conversation

@Claudius-Maginificent

@Claudius-Maginificent Claudius-Maginificent commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Why this PR exists

Problem. The SPV wallet's three sync coordinators (identity / platform-address / shielded) and the event-adapter each ran their own ad-hoc worker lifecycle. The coordinators must run on dedicated OS threads (std::thread + Handle::block_on) because the dash-sdk sync futures are !Send, so a tokio task group cannot host them. The teardown logic was triplicated across the three coordinators plus two parallel bookkeeping structures, and two teardown paths could leave a live coordinator thread detached while the host freed its context.

What breaks without it (concrete):

  1. F1 — use-after-free on shutdown. shutdown()/quiesce() joined a coordinator by taking its JoinHandle by value into a future wrapped in a 30 s tokio::time::timeout. If the timeout fired mid-poll, the future dropped and the handle dropped with it — silently detaching a still-live OS thread without parking it. The next destroy saw background_join == NoneNotRunning → reported clean, so the host freed the callback context while the detached thread could still fire into it.
  2. F2 — shielded-store data-loss / UAF. clear_shielded() wiped the shielded store gated only on the current coordinator quiesce being clean. A prior-generation shielded thread parked after a stop→start race still held Arc refs to the persister being wiped, so the clear could run under a live writer.

Blocking relationship. Builds on the in-branch shutdown-join work. Introduces a generic engine in the shared dash-async crate that rs-dapi could adopt later (no rs-dapi change here).

What was done?

  • New ThreadRegistry<K> in dash-async — one generic engine managing both !Send OS-thread workers and Send tokio tasks behind a uniform API: a generation-match epilogue, reap-or-park orphan tracking, weight-ordered graceful shutdown(), and per-worker async drain hooks. The is_syncing/quiescing domain barrier stays in each coordinator and is handed to the registry as a DrainHook — the registry owns the lifecycle machinery, not the domain semantics.
  • Migrated the three coordinators + the event-adapter onto the registry; deleted the triplicated lifecycle code and both parallel structures (background_* fields, the CoordinatorOrphans type); extracted a CoordinatorLifecycle helper to dedup the remaining shared is_syncing/quiescing protocol.
  • F1 fixed structurally — joins keep the JoinHandle owned by the registry slot (never moved into a cancellable future) and deterministically re-park on timeout; F2 fixed — a single key-scoped any_alive_for() gate over live + parked workers blocks the store wipe.
  • Lifecycle hardening (from two internal review rounds): generation-guarded quiesce (no spurious Timeout/ErrorShutdownIncomplete when a same-key start races a slow quiesce), graceful thread-spawn-failure re-park with full slot-config rollback (no detach, no unwind across the extern "C" shim on EAGAIN), panic-safe epilogue (a panicking worker still runs cleanup so is_running() stays truthful), a terminal closing latch so a start racing shutdown cannot leave an un-joined worker, a restarted worker's prior parked under the slot lock so shutdown()'s orphan scan can never miss it, clear_shielded's drain bounded by a timeout (no host ANR) and its quiescing gate held continuously across drain→liveness-check→wipe, and the quiescing↔is_syncing handshake made self-fencing (SeqCst) so its correctness is by-design rather than incidental.
  • Dedup: the three coordinators' shared lifecycle state + gate protocol are factored into a single CoordinatorLifecycle helper.
  • FFI-stable: CoordinatorThreadStatus / CoordinatorExitStatus / all_clean() are unchanged (the registry's WorkerStatus maps 1:1 via From); no FFI or other public platform-wallet API break.

How Has This Been Tested?

  • platform-wallet lib: 210 (default) / 310 (--features shielded) passing — at/above baseline, no coverage regression.
  • dash-async: 29 unit tests + a compile_fail doctest, including new security regressions; the F1/F2 + lifecycle fixes are proven non-vacuous (reverting each fix makes its regression test fail, then restored).
  • cargo clippy --all-targets -- -D warnings clean (dash-async; platform-wallet and platform-wallet-ffi in both default and --features shielded); cargo fmt --all --check clean.
  • Validated by a functional/spec QA pass and two full three-lens code-review rounds (security/concurrency, Rust quality, project consistency); all findings applied, and the concurrency fixes adversarially re-verified — the quiescing↔is_syncing handshake's mutual-exclusion proven both ways (Dekker), gate-continuity proven across the wipe, and Fix-1's gate-before-cancel/never-latched invariant confirmed intact.
  • Two non-blocking follow-ups are tracked: a per-key clearing latch to fully close a documented clear_shielded-vs-concurrent-start host-serialization window (store-desync, not UAF), and a start_task runtime-flavor assert for the future task-only (rs-dapi) consumer.

Breaking Changes

  • PlatformWalletManager::shutdown() now returns CoordinatorExitStatus (was ()); FFI destroy maps !all_clean()ErrorShutdownIncomplete — hence the ! in the title. The status enum variants are byte/variant-stable across this change.
  • New additive public API in dash-async (ThreadRegistry, WorkerStatus, WorkerConfig, ShutdownWeight, …) and a new dev-only dash-async test-util feature.
  • No breaking change to the remaining platform-wallet public/FFI surface.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section
  • I have made corresponding changes to the documentation if needed

🤖 Co-authored by Claudius the Magnificent AI Agent

lklimek added 2 commits June 22, 2026 21:46
…rns CoordinatorExitStatus

The three periodic sync coordinators (platform-address, identity,
shielded) run their `!Send` loops on detached OS threads via
`Handle::block_on`. `shutdown()`/`quiesce()` previously only drained the
in-flight pass (the `is_syncing` barrier) and never joined the threads,
so a consumer that drops the tokio runtime right after `shutdown()`
(one-shot / headless / stdio) could race a coordinator still polling
`tokio::time` on a shutting-down runtime and panic with
"A Tokio 1.x context was found, but it is being shutdown".

Each coordinator now stores its OS-thread `JoinHandle`; `quiesce()` joins
it (via `spawn_blocking`, after the existing drain) and returns a
`CoordinatorThreadStatus` (NotRunning / Ok / Panicked / Error). Joining
while the runtime is still alive guarantees the loop has stopped touching
`tokio::time` before the host drops the runtime. `shutdown()` aggregates
the three into `CoordinatorExitStatus`, so a panicked loop surfaces in
the status instead of being silently dropped.

JoinHandle-join chosen over a oneshot/Notify signal: `JoinHandle::join`
natively distinguishes a clean return from a panic and waits for the
actual OS thread to terminate (not just a signal fired mid-teardown),
yielding the per-thread status for free. The generation-guard reschedule
and quiesce-drain behavior are preserved.

BREAKING CHANGE: `PlatformWalletManager::shutdown()` now returns
`CoordinatorExitStatus` instead of `()`.

FFI: the internal `shutdown()` call logs the new status; the `extern "C"`
`platform_wallet_manager_destroy` signature and C ABI are unchanged.

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
…nnot wedge shutdown

SEC-001: Add `IsSyncingGuard` RAII struct to all three coordinator
`sync_now` (and shielded `sync_wallet`) implementations.  The guard
clears `is_syncing=false` on every exit path — normal return, early
return, and panic-unwind — so `quiesce()`'s drain loop can never spin
forever on a panicked pass, and the `Panicked` thread-exit status
becomes reachable.

SEC-002: Wrap each coordinator's `quiesce()` call in `shutdown()` with
`tokio::time::timeout(30 s)`.  On timeout the slot reports
`CoordinatorThreadStatus::Error("join timed out")` rather than hanging
forever.

SEC-003: Add `debug_assert!` in `shutdown()` that the current runtime
is `MultiThread`; document the precondition in the method doc.

F-5: In all three coordinators' `start()`, store the `JoinHandle` in
`background_join` while still holding the `background_cancel` lock —
eliminates the theoretical window where a concurrent `quiesce()` could
take a `None` handle because spawn completed before the store.

Rename `CoordinatorThreadExit` → `CoordinatorThreadStatus` with
variants `Ok / NotRunning / Panicked / Error` to match the coordinator
module's existing `super::CoordinatorThreadStatus` references (fixing
the compile break in f3354f6).  `join_coordinator_thread`'s
spawn_blocking `Err` arm now maps to `Error` rather than `Panicked`
to distinguish infra failure from thread panic (F-6 documented).

Co-Authored-By: Claudius the Magnificent <noreply@anthropic.com>

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR adds shared worker lifecycle primitives, refactors wallet sync managers to use them, and propagates incomplete-shutdown status through wallet, FFI, and Swift result handling.

Changes

Coordinator Shutdown and Lifecycle Refactor

Layer / File(s) Summary
Atomic guard and registry primitives
packages/rs-dash-async/src/atomic.rs, packages/rs-dash-async/src/lib.rs, packages/rs-dash-async/src/registry.rs, packages/rs-dash-async/Cargo.toml
AtomicFlagGuard is added as a drop-based atomic flag wrapper, and ThreadRegistry is added with worker status, shutdown reporting, worker config, lifecycle APIs, and tests. The rs-dash-async crate exports the new modules and feature wiring.
Shared coordinator lifecycle and manager refactor
packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs, packages/rs-platform-wallet/src/manager/identity_sync.rs, packages/rs-platform-wallet/src/manager/platform_address_sync.rs, packages/rs-platform-wallet/src/manager/shielded_sync.rs, packages/rs-platform-wallet/src/changeset/core_bridge.rs, packages/rs-platform-wallet/src/changeset/mod.rs, packages/rs-platform-wallet/Cargo.toml
CoordinatorLifecycle is introduced and then used by IdentitySyncManager, PlatformAddressSyncManager, and ShieldedSyncManager for interval tracking, pass gating, quiescing, and thread startup. The managers switch to ThreadRegistry, return CoordinatorThreadStatus from quiesce(), and update tests and construction paths accordingly.
Wallet shutdown status modeling and manager teardown
packages/rs-platform-wallet/src/manager/mod.rs, packages/rs-platform-wallet/src/error.rs, packages/rs-platform-wallet/src/lib.rs
PlatformWalletManager adds shutdown status modeling, registers the wallet event adapter with ThreadRegistry, returns CoordinatorExitStatus from shutdown(), and refuses clear_shielded() when shielded quiesce is incomplete. New tests cover status mapping, idempotency, orphan handling, and teardown ordering.
Wallet error and FFI result propagation
packages/rs-platform-wallet/src/error.rs, packages/rs-platform-wallet-ffi/src/error.rs, packages/rs-platform-wallet-ffi/Cargo.toml, packages/rs-platform-wallet/src/lib.rs
PlatformWalletError gains ShieldedShutdownIncomplete, PlatformWalletFFIResultCode gains ErrorShutdownIncomplete, and the FFI conversion maps the new wallet error to the new FFI result code. The crate root re-exports SHUTDOWN_JOIN_TIMEOUT_SECS.
FFI destroy and shielded sync shutdown handling
packages/rs-platform-wallet-ffi/src/manager.rs, packages/rs-platform-wallet-ffi/src/shielded_sync.rs, packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift, packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift
platform_wallet_manager_destroy now inspects shutdown status and returns ErrorShutdownIncomplete when the managers do not shut down cleanly. platform_wallet_manager_shielded_sync_stop uses a timeout around quiesce and maps non-clean results to the same FFI code, while platform_wallet_manager_shielded_clear maps ShieldedShutdownIncomplete separately. Swift result and error types add matching cases and mapping logic, and the Swift deinit path retains callback handlers on incomplete shutdown.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Suggested labels

ready for final review

Suggested reviewers

  • shumkov
  • lklimek
  • QuantumExplorer
  • thepastaclaw

Poem

🐇 I hopped through threads and counted each stop,
A guard kept flags from staying on top.
If shutdown is wobbly, the status will speak,
And callback tails linger just long as they պետք.
Clean joins at dusk, with a nibble and grin —
the moon sees a rabbit who knows when to spin.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the shared ThreadRegistry refactor and the shutdown safety fixes in platform-wallet.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/platform-wallet-shutdown-join

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@lklimek lklimek marked this pull request as ready for review June 23, 2026 08:06
@lklimek lklimek requested a review from thepastaclaw June 23, 2026 08:06
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.18%. Comparing base (96cba16) to head (261178e).
⚠️ Report is 2 commits behind head on v3.1-dev.

Additional details and impacted files
@@              Coverage Diff              @@
##           v3.1-dev    #3954       +/-   ##
=============================================
+ Coverage     52.54%   87.18%   +34.63%     
=============================================
  Files            11     2632     +2621     
  Lines          1707   327563   +325856     
=============================================
+ Hits            897   285592   +284695     
- Misses          810    41971    +41161     
Components Coverage Δ
dpp 87.70% <ø> (∅)
drive 86.14% <ø> (∅)
drive-abci 89.45% <ø> (∅)
sdk ∅ <ø> (∅)
dapi-client ∅ <ø> (∅)
platform-version ∅ <ø> (∅)
platform-value 92.20% <ø> (∅)
platform-wallet ∅ <ø> (∅)
drive-proof-verifier 49.55% <ø> (∅)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
packages/rs-platform-wallet/src/manager/identity_sync.rs (1)

418-459: 🩺 Stability & Availability | 🟠 Major

Don't overwrite an unjoined coordinator generation.

stop() removes only background_cancel and leaves the previous background_join unjoined. A stop()start() sequence passes the cancel_guard.is_some() check and overwrites background_join with a new handle, losing the old OS-thread handle before shutdown can join it. Gate restart on confirming the prior handle has been joined, or ensure every generation's handle is tracked and joined.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 418 -
459, The start() method can overwrite an existing background_join handle before
it has been properly joined, causing resource leaks. Before spawning the new
identity-sync thread and storing its handle in background_join, add a check to
ensure any existing join handle in background_join has been properly cleaned up
or joined first. This can be done either by joining the existing handle before
proceeding, or by verifying that background_join is None before allowing the new
thread spawn to proceed. This ensures the prior thread's OS handle is not lost
and can be properly shutdown.
packages/rs-platform-wallet/src/manager/platform_address_sync.rs (1)

218-255: 🩺 Stability & Availability | 🟠 Major

Add generation guard to prevent thread handle loss after stop/start cycles.

After stop() cancels the token, background_cancel becomes None while the old thread keeps running. A subsequent start() sees cancel_guard.is_some() == false and spawns a new thread, unconditionally overwriting background_join. The old thread's join handle is lost, making it impossible to join its cleanup later. Additionally, without a generation counter, the exiting old thread clears the new generation's background_cancel token as it shuts down, creating a race where the new loop runs but appears stopped to is_running().

Both IdentitySyncManager and ShieldedSyncManager already implement this pattern: they increment background_generation on each start(), pass my_gen to the spawned thread, and check background_generation.load() == my_gen before clearing background_cancel on exit. Apply the same approach here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around
lines 218 - 255, The thread handle for the platform address sync loop can be
lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.
packages/rs-platform-wallet/src/manager/shielded_sync.rs (1)

245-300: 🩺 Stability & Availability | 🔴 Critical

Don't replace a pending shielded-sync join handle on rapid stop/start cycles.

The start() method checks only background_cancel to guard against concurrent starts. When stop() removes the token from background_cancel, a subsequent start() proceeds to spawn a new thread and overwrites background_join, even though the previous generation's thread is still winding down. The generation check (line 290) only prevents the old thread from clearing background_cancel—it does not protect background_join. This leaves prior-generation handles permanently lost.

To fix: either require quiesce() before start() to join all pending generations, or track join handles per-generation and join all pending before spawning a new thread.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 245 -
300, The `start()` method only checks `background_cancel` to guard against
concurrent starts, but when `stop()` clears this token, a subsequent `start()`
can spawn a new thread and overwrite the `background_join` handle before the
previous generation's thread has finished cleaning up. The generation check at
the cleanup section prevents the old thread from clearing `background_cancel`,
but does not protect `background_join` from being overwritten. Fix this by
either adding logic to join all pending prior-generation join handles before
storing the new one (by tracking handles per-generation), or by ensuring that
before assigning a new join handle to `background_join` in the `start()` method,
any existing pending handle from a prior generation is properly joined and
waited for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 706-710: The while loop waiting for handler_started to become true
has no timeout, which will cause the test to hang indefinitely if the slow
callback never executes. Wrap the entire while loop (that checks
handler_started.load(AO::Acquire)) with tokio::time::timeout() and provide an
appropriate timeout duration, then handle the timeout error case with an
assertion that explicitly fails the test with a useful message. This ensures CI
fails fast with clear feedback instead of timing out.
- Around line 483-489: In the wallet event adapter task join error handling, the
else branch that handles non-panic JoinErrors currently returns
CoordinatorThreadStatus::Ok, which should instead return
CoordinatorThreadStatus::Error with an appropriate error message. Change the
else clause that follows the is_panic() check to map the error to a
CoordinatorThreadStatus::Error variant containing details about the join error,
rather than treating it as a successful completion.
- Around line 175-181: The current implementation uses
tokio::task::spawn_blocking to wrap handle.join(), but this pattern prevents the
timeout from effectively interrupting the blocking task if the coordinator
thread hangs. Replace the spawn_blocking closure approach with explicit polling:
repeatedly check if the JoinHandle is finished using is_finished() in a loop
until the deadline is reached, and only call join() once the handle confirms it
is finished. This ensures the timeout boundary is enforced even if the
coordinator thread misbehaves or fails to clear is_syncing before join() is
called.

---

Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 418-459: The start() method can overwrite an existing
background_join handle before it has been properly joined, causing resource
leaks. Before spawning the new identity-sync thread and storing its handle in
background_join, add a check to ensure any existing join handle in
background_join has been properly cleaned up or joined first. This can be done
either by joining the existing handle before proceeding, or by verifying that
background_join is None before allowing the new thread spawn to proceed. This
ensures the prior thread's OS handle is not lost and can be properly shutdown.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 218-255: The thread handle for the platform address sync loop can
be lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 245-300: The `start()` method only checks `background_cancel` to
guard against concurrent starts, but when `stop()` clears this token, a
subsequent `start()` can spawn a new thread and overwrite the `background_join`
handle before the previous generation's thread has finished cleaning up. The
generation check at the cleanup section prevents the old thread from clearing
`background_cancel`, but does not protect `background_join` from being
overwritten. Fix this by either adding logic to join all pending
prior-generation join handles before storing the new one (by tracking handles
per-generation), or by ensuring that before assigning a new join handle to
`background_join` in the `start()` method, any existing pending handle from a
prior generation is properly joined and waited for completion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 01cab324-105d-4c5d-afe0-0ceb6faff13e

📥 Commits

Reviewing files that changed from the base of the PR and between 9a93387 and 261178e.

📒 Files selected for processing (5)
  • packages/rs-platform-wallet-ffi/src/manager.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated
lklimek and others added 2 commits June 23, 2026 11:12
Introduces `AtomicFlagGuard`, a pub RAII guard that clears an
`AtomicBool` flag to `false` (Release ordering) on drop.  The guard
does not set the flag on construction — the caller is responsible for
doing so (typically via a `compare_exchange`) — preserving the exact
semantics of the three identical `IsSyncingGuard` structs that were
copy-pasted across the platform-wallet sync coordinators.

This is the panic-safety keystone for the quiesce drain loop: if a sync
pass panics, the guard's `drop` still clears `is_syncing`, so
`quiesce()` is never permanently wedged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…en runtime check

**Task 1 — new enum variants**
Add `Stopped(Option<String>)` (non-panic, non-clean task exit, e.g.
tokio cancel/abort) and `Timeout` (join exceeded
SHUTDOWN_JOIN_TIMEOUT_SECS) to `CoordinatorThreadStatus`.

- Non-panic JoinError on the event-adapter task → `Stopped(Some(...))`,
  not the previous `Ok` (wrong: a cancelled task is not a clean exit).
- Timeout on any `quiesce()` wrapper → `Timeout`, not `Error("join
  timed out")`.
- `is_clean()` now returns `true` only for `Ok` and `NotRunning`; all
  other variants — including the two new ones — are non-clean.
- Update all docs / comments that referenced the old `Error("join timed
  out")` wording.

**Task 2 — promote debug_assert to assert**
`shutdown()`'s multi-thread-runtime guard was `debug_assert!`, making
it a no-op in release builds.  Changed to `assert!` — this is a real
invariant: `spawn_blocking` deadlocks on a `current_thread` runtime.

**Task 3 — bound the test wait loop**
Wrap the `while !handler_started…` polling in
`shutdown_waits_for_in_flight_pass_to_drain` with a 5 s
`tokio::time::timeout` so a broken test fails fast instead of hanging.

**Task 4 — DRY IsSyncingGuard**
Replace the three identical copy-pasted `IsSyncingGuard` structs in
`identity_sync.rs`, `platform_address_sync.rs`, and `shielded_sync.rs`
with the new `dash_async::AtomicFlagGuard`.  Adds `dash-async` to
`rs-platform-wallet/Cargo.toml`.  Zero behavioral change: construction
semantics preserved (callers set the flag via `compare_exchange` before
creating the guard; `Drop` clears it with `Ordering::Release`).

**Task 5 — new tests**
- `coordinator_thread_status_clean_predicate`: unit-tests `is_clean()`
  for all six variants including the two new ones; no real timeout needed.
- `coordinator_exit_status_all_clean`: tests `all_clean()` with
  `Timeout` and `Stopped` slots.
- `event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean`:
  aborts the adapter task before `shutdown()` and asserts the result is
  `Stopped` (covers the non-panic JoinError path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/rs-platform-wallet/src/manager/mod.rs (2)

405-408: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Handle unclean shielded quiesce before clearing state.

quiesce() now returns a meaningful shutdown status. Ignoring it lets clear_shielded() proceed to coord.clear() after Timeout, Stopped, or Panicked, which can race a still-running shielded pass that the quiesce barrier was meant to stop.

Proposed fix
-        self.shielded_sync_manager.quiesce().await;
+        let status = self.shielded_sync_manager.quiesce().await;
+        if !status.is_clean() {
+            return Err(crate::error::PlatformWalletError::ShieldedStoreError(
+                format!("shielded sync did not stop cleanly before clear: {status:?}"),
+            ));
+        }
         if let Some(coord) = self.shielded_coordinator().await {
             coord.clear().await?;
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 405 - 408, The
clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.

463-477: 🩺 Stability & Availability | 🟠 Major

Wrap quiescing in AtomicFlagGuard to ensure cancellation-safe reset in all coordinator quiesce() implementations.

The current implementations set quiescing = true before the awaited drain loop and reset it only after. If tokio::time::timeout drops the future during the loop, the reset never executes, permanently wedging quiescing and blocking all future syncs.

All three coordinators (platform_address_sync.rs:291, identity_sync.rs:494, shielded_sync.rs:334) have identical patterns. Use the same AtomicFlagGuard approach already correctly applied to is_syncing in sync_now():

pub async fn quiesce(&self) -> super::CoordinatorThreadStatus {
    self.quiescing.store(true, Ordering::Release);
    let _quiescing_guard = AtomicFlagGuard::new(&self.quiescing);
    self.stop();
    while self.is_syncing.load(Ordering::Acquire) {
        tokio::time::sleep(Duration::from_millis(20)).await;
    }
    // quiescing.store(false) removed — guard handles reset on all exit paths
    // ...rest of implementation
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 463 - 477, The
`quiesce()` method implementations in all three coordinators (the ones
containing the timeout calls for `platform_address_sync_manager`,
`identity_sync_manager`, and `shielded_sync_manager`) don't properly handle
cancellation when `tokio::time::timeout` drops the future. Wrap the `quiescing`
flag reset in an `AtomicFlagGuard` to ensure it's reset on all exit paths
including early cancellation. In each coordinator's `quiesce()` method, after
setting `quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.
🧹 Nitpick comments (1)
packages/rs-dash-async/src/atomic.rs (1)

8-15: 📐 Maintainability & Code Quality | 🔵 Trivial

Add #[must_use] annotation to AtomicFlagGuard.

The guard is dropped as a temporary if not bound, silently resetting the flag immediately. This breaks the intended guarded scope behavior. Mark the type with #[must_use] to catch accidental non-binding at compile time.

Proposed fix
-pub struct AtomicFlagGuard<'a>(&'a AtomicBool);
+#[must_use = "AtomicFlagGuard clears the flag on drop; bind it to keep the flag set for the guarded scope"]
+pub struct AtomicFlagGuard<'a>(&'a AtomicBool);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-dash-async/src/atomic.rs` around lines 8 - 15, The
AtomicFlagGuard struct can be accidentally dropped without being bound to a
variable, causing the flag to be reset immediately and breaking the guarded
scope behavior. Add the #[must_use] attribute to the AtomicFlagGuard struct
definition to make the compiler warn when the guard is not explicitly bound to a
variable, ensuring the developer catches this mistake at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 671-695: The test
`event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean` is
non-deterministic because it accepts both `CoordinatorThreadStatus::Stopped` and
`CoordinatorThreadStatus::Ok` as valid outcomes. This allows the test to pass
without actually exercising the non-panic JoinError branch. To fix this, replace
the current task abort approach with a task that is guaranteed to be pending and
never complete on its own, such as a task that awaits on a channel or a
never-resolving future. This ensures the abort always triggers the Stopped path
deterministically, and update the assertion to only expect
`CoordinatorThreadStatus::Stopped`.

---

Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 405-408: The clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.
- Around line 463-477: The `quiesce()` method implementations in all three
coordinators (the ones containing the timeout calls for
`platform_address_sync_manager`, `identity_sync_manager`, and
`shielded_sync_manager`) don't properly handle cancellation when
`tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an
`AtomicFlagGuard` to ensure it's reset on all exit paths including early
cancellation. In each coordinator's `quiesce()` method, after setting
`quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.

---

Nitpick comments:
In `@packages/rs-dash-async/src/atomic.rs`:
- Around line 8-15: The AtomicFlagGuard struct can be accidentally dropped
without being bound to a variable, causing the flag to be reset immediately and
breaking the guarded scope behavior. Add the #[must_use] attribute to the
AtomicFlagGuard struct definition to make the compiler warn when the guard is
not explicitly bound to a variable, ensuring the developer catches this mistake
at compile time.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 192ea517-ff8a-4b43-9773-c096391c8a49

📥 Commits

Reviewing files that changed from the base of the PR and between 261178e and 6e78b77.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • packages/rs-dash-async/src/atomic.rs
  • packages/rs-dash-async/src/lib.rs
  • packages/rs-platform-wallet/Cargo.toml
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
@thepastaclaw

thepastaclaw commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

✅ Review complete (commit 748c4f8)

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs Outdated
Comment thread packages/rs-dash-async/src/atomic.rs
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-dash-async/src/atomic.rs
Comment thread packages/rs-dash-async/src/atomic.rs
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Swift deinit blocking finding — fixed in 5017ba1313. PlatformWalletManager.deinit now captures the result code from shielded_sync_stop and destroy (via a new discardReturningCode() helper) and, only on .errorShutdownIncomplete, deliberately leaks one extra strong ref to persistenceHandler/eventHandler (unbalanced Unmanaged.passRetained) so a lingering coordinator that returned Timeout cannot fire its final callback into a freed context. A clean shutdown takes neither branch — no unconditional leak. platform_address_sync_stop stays .discard() (cancel-only; never returns code 19).

⚠️ Build-unverified: this host has no Swift toolchain / xcframework, so the change was made correct-by-construction and is NOT compiled here — it needs an iOS-environment build before merge. Tracked.

🤖 Co-authored by Claudius the Magnificent AI Agent

…on tests

Wrap the cleanup `mgr.quiesce().await` in all three
`restart_after_stop_reaps_prior_thread` tests with a 2-second
`tokio::time::timeout`. An unbounded quiesce after the restarted loop
would hang CI forever if the loop wedges; now it fails fast with a clear
message. Also assert `status.is_clean()` on the returned
`CoordinatorThreadStatus`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The latest delta resolves prior-1 (Swift deinit at PlatformWalletManager.swift:153-195 now captures both shielded_sync_stop and destroy result codes via discardReturningCode() and leaks persistenceHandler/eventHandler via unbalanced passRetained on .errorShutdownIncomplete) and prior-3 (restart_after_stop_reaps_prior_thread now exists in shielded_sync.rs alongside its identity/platform-address siblings). Prior-2 — the 1 s wedge-backstop detach in each coordinator's start() — remains in the tree on the deliberate-wedge path; all six agents converge on it as a suggestion, narrow but conceptually opposed to the PR's central honest-status contract. No new blocking issues introduced.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517: start()'s 1 s wedge-backstop detaches a still-live coordinator that shutdown()/destroy then reports as clean
  Carried forward from prior review (still valid; no reviewer-repeat required by the verifier protocol).

  On a genuine wedge — a `sync_now` future whose `Drop` never yields, or a non-yielding step inside the SDK — the prior thread cannot reach its exit epilogue within the 1 s deadline at identity_sync.rs:502-517 (and the symmetric patterns at platform_address_sync.rs:317-332 and shielded_sync.rs:350-365). `start()` then drops `h`, detaching the still-running OS thread; `background_join` already holds the new generation's handle, so a later `quiesce()`/`shutdown()` joins only the new thread, `CoordinatorExitStatus::all_clean()` returns true, and `platform_wallet_manager_destroy` returns `ok()`.

  That directly defeats the central FFI contract this PR introduces. The Swift deinit at PlatformWalletManager.swift:190-194 only retains `persistenceHandler` / `eventHandler` when it observes `.errorShutdownIncomplete`; a falsely-clean code lets ARC free both immediately, and the detached prior thread — which still owns Arc'd Rust callback wrappers carrying the `passUnretained(self).toOpaque()` raw pointers (PlatformWalletPersistenceHandler.swift:993-996, PlatformWalletManagerAddressSync.swift:47-50) — can fire one last callback into freed memory. The same hazard exists on the timeout paths of `clear_shielded` and the FFI `platform_wallet_manager_shielded_sync_stop`, where an outer `tokio::time::timeout` firing while the inner quiesce is parked inside `join_coordinator_thread(handle).await` drops the local handle at the .await — detach.

  The cancellable `tokio::select!` in each loop body and the prompt `drop(cancel_guard)` ordering (identity_sync.rs:488-494, mirrored in the other two) make this vanishingly rare in practice — `restart_after_stop_reaps_prior_thread` now pins the common-case reap at <500 ms across all three coordinators. But the residual path is the one bypass of the PR's honest-status guarantee. Two clean closures: (a) track detached handles in a per-manager orphans list that `shutdown()` polls and surfaces as `Timeout`/`Detached`, so `destroy` returns `ErrorShutdownIncomplete` and the host keeps the callback context alive; or (b) once cancellation has been signalled, drop the 1 s deadline entirely — the cancellable select! makes a real-world stall essentially impossible, and turning a degraded shutdown into a hang is arguably less dangerous than the silent UAF window the detach can open.

Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta (5017ba1..b491773) is test-only: it bounds the post-restart cleanup quiesce() in three restart-reap regression tests with a 2 s tokio::time::timeout and asserts is_clean(). No production code changed and no new defects in the delta. The prior finding about start()'s 1 s wedge-backstop silently detaching a still-live coordinator is STILL VALID on the current head for all three coordinators and is carried forward as the sole in-scope finding.

Carried-forward prior findings

  1. [suggestion] 1 s wedge-backstop in start() detaches a live prior coordinator that shutdown()/destroy then reports as clean

    packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517

    Verified against b491773. On a genuine wedge — a sync_now future whose Drop never yields, or a non-yielding step inside the SDK — the prior coordinator thread cannot reach its exit epilogue within the 1 s deadline. The loop at lines 504–513 then breaks and drops h, which detaches the still-running OS thread (std::thread::JoinHandle's Drop does not join). background_join now holds only the replacement handle installed at lines 483–486, so a later quiesce() (lines 576–581) joins only the replacement and CoordinatorExitStatus::all_clean() returns true. FFI platform_wallet_manager_destroy (packages/rs-platform-wallet-ffi/src/manager.rs:366–386) then returns ok() instead of ErrorShutdownIncomplete=19. Swift PlatformWalletManager.deinit (packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153–195) does NOT take the retain-leak branch and ARC releases the persister/event-handler wrappers, even though the detached orphan thread still holds Arc/Arc referencing the freed passUnretained context — exactly the UAF class this PR exists to close, but now on the stop()→start()→destroy() degraded path the new restart bookkeeping creates.

    The biased cancel-select introduced at lines 446–461 narrows the common-case window (cancel beats sync_now's next .await), but it cannot abort an SDK step or future Drop that is itself non-yielding — which is precisely when the 1 s backstop fires. Same pattern exists in platform_address_sync.rs:317–332 and shielded_sync.rs:350–365; none of the three record the detach event into state observable by shutdown()/CoordinatorExitStatus. The new restart-reap tests only exercise the prompt-reap (<500 ms) path, not a genuine wedge past 1 s.

    Minimum FFI-correct fix options (none fit an inline patch): (a) track detached orphans in a Mutex<Vec<JoinHandle<()>>> the next quiesce()/shutdown() polls and surfaces as Timeout/non-clean status; (b) have start() flip a per-coordinator atomic on detach that quiesce() ORs into its returned CoordinatorThreadStatus; (c) have start() return an error rather than silently install a replacement when the prior couldn't be reaped, so the FFI host can escalate. The current shape — silently dropping the orphan — is the one outcome that defeats the PR's central invariant that destroy()'s result code tells the host whether it is safe to free callback contexts.

…) reports them as non-clean

Closes the residual use-after-free window left by the coordinator
reap backstop. On a tight stop()->start(), each sync coordinator
waits ~1s for the prior OS thread to finish; if that thread is
genuinely wedged in a non-yielding Drop, the backstop previously
DROPPED the still-live JoinHandle (detaching it). A later shutdown()
joined only the current handle, all_clean() returned true, and the
FFI destroy returned ok() — at which point the host could free the
callback context the detached, still-running thread might still touch.

Fix (review option i): the manager now owns a shared CoordinatorOrphans
list (Arc<Mutex<Vec<JoinHandle>>>) cloned into every coordinator. The
duplicated reap blocks in identity/platform-address/shielded start()
are consolidated into reap_prior_or_park(), which PARKS a wedged prior
thread in that list instead of dropping it (lock-ordering preserved:
drop(cancel_guard) still happens before the reap). shutdown() then
drains the list via join_detached_orphans() within a bounded, yielding
is_finished() poll and reports a new CoordinatorThreadStatus::Detached
(non-clean) in CoordinatorExitStatus::detached_threads when any orphan
is still alive at the grace deadline. all_clean() folds it in, so the
FFI destroy correctly returns ErrorShutdownIncomplete and the host
delays freeing its context. The new Detached variant re-exports through
lib.rs with its sibling statuses.

Tests (manager/mod.rs): reap_prior_or_park parks a force-wedged thread;
join_detached_orphans reports Detached then Ok; and a manager shutdown()
with a parked still-live orphan reports non-clean. All proven
non-vacuous by neutering the park/join. Cleanup quiesce/join in tests is
bounded; a wedged stand-in thread is released and joined so none leak.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Prior finding (start()/wedge-backstop detaching a live coordinator) is FIXED at head 76c8bee via reap_prior_or_park + CoordinatorOrphans + join_detached_orphans, with Detached folded into all_clean(). However, two related lifecycle gaps in the same orphan-tracking design remain: (1) quiesce() takes the JoinHandle out of background_join before awaiting join_coordinator_thread, so a wrapping tokio::time::timeout (used by shutdown(), clear_shielded(), and the FFI shielded_sync_stop bridge) drops the live handle and detaches the thread without parking it; on the explicit retry path the next call sees background_join == None and reports NotRunning/clean while a wedged thread may still call host callbacks. (2) clear_shielded() consults only the current shielded quiesce status and wipes the commitment-tree store even when an earlier wedged shielded coordinator is still alive in coordinator_orphans, breaking the PR's 'Clear leaves the store intact whenever a sync drain is incomplete' invariant. Both are in scope for the PR's stated FFI lifecycle/UAF hardening.

🔴 2 blocking

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:240-261: Timed-out quiesce drops the JoinHandle and silently detaches a live coordinator
  join_coordinator_thread takes the JoinHandle by value. Every caller of quiesce() — shutdown(), clear_shielded(), and the FFI platform_wallet_manager_shielded_sync_stop — wraps it in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...). Each coordinator's quiesce() first synchronously takes background_join (identity_sync.rs:579, shielded_sync.rs:423, platform_address_sync.rs analogous) and only then awaits this helper. If the outer timeout fires while this loop is in tokio::time::sleep, the future is dropped, the locally-owned handle is dropped with it, and the still-live OS thread is detached — not parked in coordinator_orphans. The FFI/host contract is explicitly retry-capable here (Timeout → ErrorShutdownIncomplete → host should be able to retry stop/destroy), but a retry now sees background_join == None and join_coordinator_thread returns NotRunning, which is_clean() treats as clean. The host then frees the persister / event-handler context while a wedged thread still holds Arcs to it and may still fire one final callback — the exact UAF class the orphan list was added to close. The fix is to keep the handle reachable across timeout/cancel: either don't take() from background_join until is_finished() is true, or have join_coordinator_thread re-park the unjoined handle into a manager-scoped orphans list on Drop / on the deadline (mirroring reap_prior_or_park's park-on-wedge contract).
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the store while parked coordinator orphans may still be alive
  clear_shielded() gates the store wipe on `self.shielded_sync_manager.quiesce()` returning a clean status, but a tight stop()→start() sequence can park a previously-wedged shielded coordinator thread into coordinator_orphans (reap_prior_or_park, mod.rs:296-332). Those parked threads hold Arc references to the same shielded coordinator / persister context the clear() below is about to wipe, and shutdown() is currently the only path that drains them via join_detached_orphans. As soon as the current quiesce reports clean, this path calls coord.clear() while a parked orphan may still be inside its sync_now drop / persister fan-out, violating the PR's stated invariant that Clear leaves the store intact whenever a sync drain is incomplete and re-creating the store-desync the orphan tracking was added to surface. Mirror the shutdown() path: after quiesce, drain coordinator_orphans (with SHUTDOWN_ORPHAN_GRACE_SECS) and refuse the wipe with ShieldedShutdownIncomplete if any orphan is still Detached/Panicked.

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
…via tokio::join!

The three periodic coordinators (platform-address, identity, shielded)
were quiesced sequentially in shutdown(), making the worst-case wait
additive (~3 x SHUTDOWN_JOIN_TIMEOUT_SECS). Each quiesce() touches only
its own state (its quiescing/is_syncing atomics and its own
background_cancel/background_join mutexes) and joins its own OS thread,
sharing no lock, so racing them is sound. Drain them concurrently via
tokio::join!, collapsing the worst case to ~max(timeouts).

Each join! arm keeps its OWN inner tokio::time::timeout, so every
coordinator still yields its own per-coordinator CoordinatorThreadStatus
(a single outer timeout would flatten all three to Timeout). The event
adapter teardown and join_detached_orphans stay sequential and ordered
strictly AFTER the coordinator join!, since the adapter sinks the
coordinators' stores. The multi-thread runtime assert is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
@lklimek

lklimek commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@thepastaclaw please review

lklimek and others added 4 commits June 24, 2026 22:22
Centralizes the previously-triplicated background-worker lifecycle —
generation-match exit epilogue, restart reap-or-park, orphan drain — into
one tested engine in the shared dash-async crate, generic over a worker
key and supporting both OS-thread (`!Send` block_on loops) and tokio-task
workers.

Makes two confirmed bugs impossible by construction:
- F1: quiesce/join paths take `&self`; the live JoinHandle stays owned by
  the slot and is never moved into a cancellable future's frame. A
  dropped/timed-out quiesce re-parks the handle into orphans (Timeout),
  never drop-and-detach to a clean NotRunning.
- F2: any_alive() is the single liveness gate spanning live slots AND
  parked orphans, so store-wiping paths can refuse while a prior thread
  is alive.

Weight-ordered shutdown drains tiers ascending, concurrently within a
tier. WorkerStatus variants are byte-identical to the wallet's
CoordinatorThreadStatus for a stable FFI mapping.

Adds the full registry test suite (TC-001/001b/003-014, F1 shutdown-path
GAP-006, compile-fail DrainHook Send check, default-config and idempotent-
shutdown gaps). 22 tests + 1 doctest green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
Tag each parked orphan with its originating worker key and add
any_alive_for(key), so a store-wiping path scoped to one worker (the
wallet's clear_shielded F2 gate) can refuse only while that worker — its
slot or a parked prior-generation thread under its key — is alive, without
being blocked by unrelated workers that are legitimately running (e.g. the
always-on event-adapter task). Registry-wide any_alive() is retained.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…adRegistry

Replace the triplicated per-coordinator lifecycle machinery
(background_cancel/join/generation mutexes, the cloned coordinator_orphans
list, and the mod.rs free fns join_coordinator_thread / reap_prior_or_park
/ join_detached_orphans / panic_message) with the shared
dash-async ThreadRegistry. The manager holds one
Arc<ThreadRegistry<WalletWorker>>; each coordinator's start/stop/
is_running/quiesce now delegate to it under a fixed WalletWorker key, and
each exposes its quiescing-gate raise as a registry DrainHook. The
wallet-event adapter becomes a registry start_task worker (weight 10,
draining after the coordinators it sinks at weight 0).

Fixes the two confirmed bugs structurally:
- F1: shutdown() is now CoordinatorExitStatus::from_report(registry
  .shutdown()); each worker's join is bounded by its own join_budget
  inside the registry, where the live handle stays owned by the slot. A
  dropped/timed-out join can no longer detach a live thread — it re-parks
  to Timeout.
- F2: clear_shielded() gates the store wipe on
  registry.any_alive_for(ShieldedSync) (shielded-scoped, so the always-on
  event adapter and the other coordinators running normally do not block
  Clear), refusing while a prior-generation shielded thread is parked
  alive.

CoordinatorThreadStatus / CoordinatorExitStatus / all_clean() are
byte-stable (FFI destroy maps !all_clean -> ErrorShutdownIncomplete);
WorkerStatus maps onto them 1:1 via From. The three wall-clock
restart-reap regression tests and the relocated free-fn tests are deleted
(subsumed by the dash-async registry suite); F2 (clear_shielded) and R5
(from_report) gain wallet-level tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…te assert asymmetry

QA-002: anchor the DrainHook Send+Sync compile_fail doctest to E0277 so it
verifies the !Send capture is rejected for the right reason (unsatisfied
Send bound) and cannot pass vacuously on an unrelated compile error.

QA-001: document the runtime-flavor assert asymmetry — start_thread and
shutdown assert a multi-thread runtime but start_task does not, so a
task-only consumer (rs-dapi) on a current_thread runtime would panic late
at shutdown(). The wallet always uses start_thread, so it trips the assert
at start and is unaffected. Fix deferred to the rs-dapi adoption PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Incremental review against head 3cca1cf. The new tokio::join! concurrent drain in shutdown() is structurally sound — each coordinator's quiesce() touches only its own quiescing/is_syncing atomics and background_cancel/background_join mutexes, and the shared coordinator_orphans Mutex is not taken inside quiesce(), so racing the three drains is safe; per-coordinator inner timeouts correctly preserve per-thread CoordinatorThreadStatus that a single outer timeout would have flattened. However, both prior blocking findings are unchanged on this head: a timed-out quiesce still loses the only JoinHandle and silently detaches a live coordinator OS thread, and clear_shielded still wipes the shielded store without consulting coordinator_orphans. Both were unanimously re-flagged by all six agent reviews and both are required for this PR's stated lifetime/UAF guarantees to actually hold.

🔴 1 blocking

1 additional finding(s) omitted (not in diff).

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the shielded store without draining parked coordinator orphans
  clear_shielded (mod.rs:644–674) gates coord.clear().await on shielded_sync_manager.quiesce() returning is_clean(), and the doc on SHUTDOWN_JOIN_TIMEOUT_SECS (mod.rs:407–408) explicitly lists clear_shielded and the FFI shielded-stop bridge as using the same backstop. But neither path inspects self.coordinator_orphans. Shielded coordinator start() parks a prior wedged shielded OS thread into that shared list via reap_prior_or_park (mod.rs:296–332) whenever a tight stop()→start() cannot reap within the 1 s wedge backstop. Only shutdown() actually drains the list via join_detached_orphans (mod.rs:820–824).

  A parked shielded thread still holds an Arc to the same NetworkShieldedCoordinator and persister context that coord.clear() (mod.rs:670–672) is about to wipe; the doc on lines 663–666 warns exactly about this — "a surviving pass writing into a store we just cleared, desyncing the host's own wipe from a repopulated tree." Because the orphan drain is missing, a sequence of stop()→start() that wedges the prior shielded thread, followed by clear_shielded, lets the wipe proceed while the parked thread is still alive and can persister.store(...) / fire on_shielded_* host callbacks against the cleared store.

  The same defect is in the FFI platform_wallet_manager_shielded_sync_stop bridge: it awaits only quiesce(), so a non-clean orphan goes unreported and the host (which sees ok()) is free to release the callback context that a parked shielded thread still references.

  Fix shape (symmetric with shutdown()): in both clear_shielded and platform_wallet_manager_shielded_sync_stop, after quiesce() returns clean, run a bounded join_detached_orphans(&self.coordinator_orphans, deadline) and fold the result into the precondition — return ShieldedShutdownIncomplete { status } (and leave the store intact) unless both the live drain and the orphan drain are clean. Caveat: coordinator_orphans is shared across all three coordinator kinds; either accept the conservative wait for identity/platform-address orphans before a shielded Clear, or tag parked handles with their coordinator kind so Clear can drain only the shielded subset.

lklimek and others added 2 commits June 25, 2026 01:03
…c accuracy

Apply consolidated review findings to the shared ThreadRegistry and the
wallet's shutdown/clear paths.

Concurrency hardening (with regression tests):
- generation-guard quiesce(): a concurrent same-key restart can no longer
  have its fresh live handle parked or reported Timeout — the superseded
  quiesce returns NotRunning and the gen-scoped Repark leaves the new
  handle alone.
- graceful spawn-failure rollback: start_thread no longer .expect()s on
  thread spawn; a failure re-installs the prior handle (never detached),
  clears the running flag, and returns instead of panicking across the FFI
  boundary.
- panic-safe epilogue: an EpilogueGuard drop-guard runs the gen-gated
  epilogue even when a worker body unwinds, so is_running() reflects a
  crash and start() can relaunch it.
- closing latch: shutdown() latches the registry closed under the slot
  lock before snapshotting tiers; start_thread/start_task refuse new
  workers once teardown begins, so a start racing shutdown cannot leave an
  un-joined worker behind.
- clear_shielded holds the shielded quiescing gate across its liveness
  check and store wipe (closes the direct sync_now/sync_wallet TOCTOU) and
  documents the residual start-vs-clear host-serialization precondition.

Hygiene:
- feature-gate park_orphan_for_test behind a new test-util feature so the
  mutation seam never ships in a downstream production build.
- add Debug to WorkerConfig/ThreadRegistry; add # Panics docs to
  start_thread/start_task/shutdown; drop a tombstone comment.
- correct the WorkerStatus<->CoordinatorThreadStatus "byte-identical"
  claim to "exhaustive by-name match, never a layout cast".
- repoint dead [CoordinatorOrphans] and [spawn_wallet_event_adapter]
  intra-doc links at the surviving registry / renamed loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…three sync coordinators

The migration onto ThreadRegistry left the coordinator-side delegate +
pass-gating boilerplate copy-pasted across IdentitySyncManager,
PlatformAddressSyncManager, and ShieldedSyncManager: the same five
fields (registry, interval, is_syncing, quiescing, last_sync), byte-
identical interval/is_syncing/last_sync/drain_hook/stop/quiesce
delegations, and — critically — the subtle is_syncing-CAS + quiescing-gate
pass preamble reproduced four times.

Hoist all of it into one CoordinatorLifecycle helper that each coordinator
embeds and delegates to, so the teardown-critical handshake has a single
home. begin_pass() folds the CAS + guard + gate check into one RAII-guard-
returning call; hold_quiescing_gate() is the shared primitive the shielded
Clear flow holds across its check-then-wipe. Each coordinator now keeps
only its domain-specific pass body.

Behaviour-preserving: the full platform-wallet lib suite (307 tests, both
default and shielded) and the coordinator pass-gate tests are unchanged and
green. Also fixes the dangling [JoinHandle] intra-doc link in core_bridge
(its import was removed in the migration) by fully qualifying it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
@Claudius-Maginificent Claudius-Maginificent changed the title feat(platform-wallet)!: shutdown() joins coordinator threads and returns CoordinatorExitStatus feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes Jun 25, 2026

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Both prior blocking findings are structurally fixed in the new ThreadRegistry engine (prior-1 by the slot-owned JoinHandle + generation-scoped Repark drop-guard; prior-2 by clear_shielded's any_alive_for(ShieldedSync) gate under a held quiescing guard). One new blocking regression was introduced by the CoordinatorLifecycle dedup: the shared quiesce() no longer raises the quiescing gate itself or drains is_syncing — it only delegates to ThreadRegistry::quiesce, whose drain hook is skipped when no slot is registered — so a direct sync_now/sync_wallet in flight (which the FFI exposes without requiring the background loop to be running) is no longer awaited before quiesce reports clean. Two narrower follow-ups (symmetric orphan check on shielded_sync_stop FFI; out-of-lock prior-handle reap racing shutdown's orphan reap) round out the review.

🔴 1 blocking | 🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:146-149: Shared quiesce() no longer drains direct in-flight sync passes
  The pre-refactor per-coordinator `quiesce()` explicitly raised `quiescing` and awaited `is_syncing` to fall before joining (`self.quiescing.store(true, ...); self.stop(); while self.is_syncing { sleep(20ms) }`). The new shared `CoordinatorLifecycle::quiesce` is `let _g = AtomicFlagGuard::new(&self.quiescing); self.registry.quiesce(self.worker).await.into()`. `AtomicFlagGuard::new` (rs-dash-async/src/atomic.rs:18-22) deliberately does **not** raise the flag — only the registry's drain hook does, and `ThreadRegistry::quiesce` (registry.rs:507-519) early-returns `NotRunning` without invoking the drain hook whenever no live slot is registered.

  The FFI exposes `platform_wallet_manager_shielded_sync_sync_now` / `shielded_sync_wallet` (shielded_sync.rs:208, 590) — and the same shape for identity / platform-address — which call `shielded_sync().sync_now(true)` / `sync_wallet(..., true)` directly, with no requirement that the background loop is running. So a host-driven direct pass can be in flight while `clear_shielded` calls `quiesce()`: the registry has no slot, the drain hook never fires, the gate is never raised, and `quiesce()` returns `NotRunning` — clean — immediately. `clear_shielded` (mod.rs:531-562) then takes the `is_clean()` branch, raises the gate via `hold_quiescing_gate()` (too late — the in-flight pass already passed `begin_pass()` and does not recheck mid-pass), and `any_alive_for(ShieldedSync)` returns false (it only inspects slots/orphans, not `is_syncing`). `coord.clear().await` wipes the store while the direct pass keeps writing through `coord.sync(...)` and its persister fan-out.

  This directly contradicts the documented contract on `clear_shielded` (mod.rs:517-518: "A concurrent direct `sync_now`/`sync_wallet` is held off (the quiescing gate stays raised across the liveness check and the wipe)"), and is the same store-desync class the prior F2 finding closed for parked orphans. Fix by restoring the explicit gate raise + `is_syncing` drain in the shared `quiesce`, before delegating to the registry — symmetric with what each coordinator did before the dedup.

In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:117-134: shielded_sync_stop FFI does not consult any_alive_for(ShieldedSync)
  `platform_wallet_manager_shielded_sync_stop` returns `Success` whenever the per-key `quiesce()` lands clean (Ok/NotRunning). It does not consult `registry.any_alive_for(WalletWorker::ShieldedSync)` — the symmetric gate `clear_shielded` adopted at manager/mod.rs:553 (the F2 fix) and the destroy path covers via `CoordinatorExitStatus::all_clean()` which folds in `detached_threads`.

  A prior tight start->stop reap that had to park a wedged shielded thread as an orphan would not block the next stop's `Success`, even though the parked thread still holds `Arc`s to the host's `FFIEventHandler::context` / `FFIPersister::context` pointers and can fire one final callback through them. The function's docstring (shielded_sync.rs:80-83) says "On this code the host must NOT free the callback context immediately" only on the non-clean branch — implying that on `Success` the host MAY free it.

  Today's only consumer (Swift `deinit`) always follows stop with destroy, and destroy's `manager.shutdown()` drains orphans, so no UAF currently reaches a host. The contract gap is still worth closing symmetrically with the other two quiesce-then-touch-host-context teardown paths.

In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:406-414: start_thread's out-of-lock prior reap races shutdown's orphan reap
  `start_thread` takes the prior handle under the slot lock, releases the lock at line 406, then calls `self.reap_prior_or_park(prior, key)` synchronously at line 413. `reap_prior_or_park` for an OS thread (registry.rs:737-762) spins up to `reap_backstop` (1 s in the wallet) before pushing a still-wedged handle into orphans. The `closing` latch and shutdown's tier snapshot only synchronize on the slot lock — they do not block on this out-of-lock reap. Interleaving from different threads:

  1. A: `start_thread` takes the slot lock, passes the `closing` check, takes the wedged `prior`, installs the new gen handle, releases the lock. A is now spinning in `reap_prior_or_park`.
  2. B: `shutdown()` takes the slot lock, sets `closing=true`, snapshots tiers (sees only the NEW gen handle — `prior` was already moved out), releases.
  3. B drains the new gen via `quiesce` (clean).
  4. B calls `reap_orphans_impl(reap_backstop)` — orphan list is still empty because A hasn't pushed yet.
  5. B returns a clean `ShutdownReport` → `CoordinatorExitStatus::all_clean()` → FFI `destroy` returns `ok()` → host frees its callback context.
  6. A's 1 s backstop expires; A pushes the still-live wedged prior into orphans and returns.

  The prior thread is now a live orphan holding `Arc`s to the freed host context — the same UAF surface F1 was designed to close. Requires (a) a wedged-past-backstop prior, (b) `start_thread` invoked concurrently with `shutdown` from a different thread, and (c) shutdown's reap_orphans phase to land in A's 1 s backstop window — narrow, but the asymmetric to the closing-latch protection the PR explicitly adds. Close by parking `prior` immediately under the slot lock and reaping asynchronously, or by tracking in-flight reaps so `reap_orphans_impl` waits for them to settle.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
Comment thread packages/rs-platform-wallet-ffi/src/shielded_sync.rs
Comment thread packages/rs-dash-async/src/registry.rs
@lklimek

lklimek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@coderabbitai review all

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@lklimek Re-reviewing the full PR against the latest tip now.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
packages/rs-platform-wallet-ffi/src/manager.rs (1)

363-365: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Update the inline comment to match the new FFI behavior.

Lines 363-365 still say destroy will "just log it and drop it", but Lines 367-382 now surface non-clean shutdowns as ErrorShutdownIncomplete. Leaving the old comment here is misleading in the callback-lifetime path this PR is hardening.

✏️ Suggested edit
-        // It now joins the coordinator OS threads and returns their
-        // per-thread exit status; the C ABI exposes none of that, so we
-        // just log it (a panicked loop is worth surfacing) and drop it.
+        // It now joins the coordinator OS threads and returns their
+        // per-thread exit status. The C ABI does not expose that full
+        // structure directly, but a non-clean aggregate is surfaced as
+        // `ErrorShutdownIncomplete` so the host can keep its callback
+        // context alive until lingering workers settle.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet-ffi/src/manager.rs` around lines 363 - 365, The
inline comment near the destroy/callback-lifetime handling in manager::destroy
is outdated and still says the shutdown result is only logged and dropped.
Update that comment to describe the current FFI behavior in destroy and the
related callback-lifetime path: the coordinator OS threads are joined, and
non-clean shutdowns are surfaced as ErrorShutdownIncomplete instead of being
silently discarded. Keep the wording aligned with the destroy logic and
ErrorShutdownIncomplete handling so the comment matches the implemented shutdown
semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-dash-async/src/registry.rs`:
- Around line 674-679: The shutdown path in `Registry::shutdown` is discarding
the orphan result status from `reap_orphans_impl()`, which can hide non-clean
exits. Update `ShutdownReport` and its construction so the orphan status is
preserved alongside `detached`, and adjust `ShutdownReport::all_clean()` to
consider that stored status instead of treating parked orphans as clean by
default.

In `@packages/rs-platform-wallet/src/changeset/core_bridge.rs`:
- Around line 68-110: When the cancel branch in core_bridge.rs wins the
tokio::select! race, the loop exits immediately and can drop buffered broadcast
events, so update the cancellation handling in the main event loop to drain
receiver before breaking. Use the existing build_core_changeset and
persister.store flow for each pending event from receiver.try_recv(), preserve
the core.is_empty_no_records() skip, and keep the existing error/warn logging
style keyed by wallet_id so shutdown still persists any queued CoreChangeSet
updates.

In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 407-410: Update the `stop()` documentation in `identity_sync.rs`
to reflect that background sync passes are cancellable and may exit early when
shutdown is requested. Adjust the wording around `sync_now()`/`tokio::select!`
in `stop()` and any related comments so it no longer claims a pass will always
run to completion after `stop()`, and make the doc mention that cancellation can
interrupt an in-flight `.await`.

In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 515-522: The `clear_shielded`/`shielded_sync_start` path in
`manager::mod` still relies on a host-side precondition, but `start()` can race
`clear()` and re-persist into a wiped store. Add a per-key clearing latch in the
wallet manager/registry logic, have `shielded_sync_start` honor it, and keep it
held across quiesce, liveness check, and `coord.clear()` so the clear/start
exclusion is enforced inside the wallet rather than by the caller.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 206-210: Update the `stop()` documentation in
`PlatformAddressSync` to reflect that cancellation can win over `sync_now()` in
the `tokio::select!` loop, so a background pass may exit before reaching
`on_platform_address_sync_completed`. Adjust the docs/comments near `stop()`,
`sync_now()`, and the select loop to clearly state that shutdown does not
guarantee completion of an in-flight sync pass.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 226-230: Update the stop() documentation to reflect that
background-pass cancellation can interrupt an in-flight sync_now(false) call
rather than guaranteeing the persister fan-out completes. In shielded_sync.rs,
adjust the docs/comments around the tokio::select! path in
ShieldedSyncManager::stop and the background coordinator.sync() flow to state
that cancellation may win the race and the sync can be dropped at an await. Keep
the description aligned with the actual behavior of sync_now(false) and the
cancel.cancelled() branch.

---

Nitpick comments:
In `@packages/rs-platform-wallet-ffi/src/manager.rs`:
- Around line 363-365: The inline comment near the destroy/callback-lifetime
handling in manager::destroy is outdated and still says the shutdown result is
only logged and dropped. Update that comment to describe the current FFI
behavior in destroy and the related callback-lifetime path: the coordinator OS
threads are joined, and non-clean shutdowns are surfaced as
ErrorShutdownIncomplete instead of being silently discarded. Keep the wording
aligned with the destroy logic and ErrorShutdownIncomplete handling so the
comment matches the implemented shutdown semantics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef1d0648-a464-44ab-8151-583159ad51f5

📥 Commits

Reviewing files that changed from the base of the PR and between 87f6b73 and 911f99f.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • packages/rs-dash-async/Cargo.toml
  • packages/rs-dash-async/src/atomic.rs
  • packages/rs-dash-async/src/lib.rs
  • packages/rs-dash-async/src/registry.rs
  • packages/rs-platform-wallet-ffi/Cargo.toml
  • packages/rs-platform-wallet-ffi/src/error.rs
  • packages/rs-platform-wallet-ffi/src/manager.rs
  • packages/rs-platform-wallet-ffi/src/shielded_sync.rs
  • packages/rs-platform-wallet/Cargo.toml
  • packages/rs-platform-wallet/src/changeset/core_bridge.rs
  • packages/rs-platform-wallet/src/changeset/mod.rs
  • packages/rs-platform-wallet/src/error.rs
  • packages/rs-platform-wallet/src/lib.rs
  • packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
  • packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift
  • packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift

Comment on lines +674 to +679
// Account for parked orphans last.
let (_status, detached) = self.reap_orphans_impl(self.reap_backstop).await;
ShutdownReport {
per_worker,
detached,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Preserve non-clean orphan status in ShutdownReport.

reap_orphans_impl() can return Panicked, Stopped, or Error for an orphan that exits before the grace deadline, but shutdown() discards _status and only keeps detached. That lets ShutdownReport::all_clean() report clean when a parked worker actually ended non-cleanly.

Suggested shape
 pub struct ShutdownReport<K: RegistryKey> {
     pub per_worker: BTreeMap<K, WorkerStatus>,
+    pub orphan_status: WorkerStatus,
     pub detached: usize,
 }

 pub fn all_clean(&self) -> bool {
-    self.detached == 0 && self.per_worker.values().all(WorkerStatus::is_clean)
+    self.detached == 0
+        && self.orphan_status.is_clean()
+        && self.per_worker.values().all(WorkerStatus::is_clean)
 }

- let (_status, detached) = self.reap_orphans_impl(self.reap_backstop).await;
+ let (orphan_status, detached) = self.reap_orphans_impl(self.reap_backstop).await;
 ShutdownReport {
     per_worker,
+    orphan_status,
     detached,
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Account for parked orphans last.
let (_status, detached) = self.reap_orphans_impl(self.reap_backstop).await;
ShutdownReport {
per_worker,
detached,
}
// Account for parked orphans last.
let (orphan_status, detached) = self.reap_orphans_impl(self.reap_backstop).await;
ShutdownReport {
per_worker,
orphan_status,
detached,
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-dash-async/src/registry.rs` around lines 674 - 679, The shutdown
path in `Registry::shutdown` is discarding the orphan result status from
`reap_orphans_impl()`, which can hide non-clean exits. Update `ShutdownReport`
and its construction so the orphan status is preserved alongside `detached`, and
adjust `ShutdownReport::all_clean()` to consider that stored status instead of
treating parked orphans as clean by default.

Comment on lines +68 to +110
loop {
tokio::select! {
recv = receiver.recv() => {
match recv {
Ok(event) => {
let wallet_id = event.wallet_id();
// For events that need to consult per-wallet
// state (today only `TransactionInstantLocked`,
// which checks finality before recording the IS
// lock), grab a brief read lock on the manager.
let core = build_core_changeset(&wallet_manager, &event).await;
if core.is_empty_no_records() {
// SyncHeightAdvanced for an unknown wallet,
// empty BlockProcessed, etc. — nothing to
// persist. Skip the round-trip.
continue;
}
Err(RecvError::Lagged(n)) => {
let cs = PlatformWalletChangeSet {
core: Some(core),
..PlatformWalletChangeSet::default()
};
if let Err(e) = persister.store(wallet_id, cs) {
tracing::warn!(
missed = n,
"wallet-event adapter lagged on broadcast channel; some events were dropped"
wallet_id = %hex::encode(wallet_id),
error = %e,
"Persister rejected core changeset; state will be re-emitted on next sync round"
);
}
}
Err(RecvError::Closed) if cancel.is_cancelled() => break,
Err(RecvError::Closed) => {
tracing::error!("WalletEvent broadcast closed unexpectedly");
break;
}
Err(RecvError::Lagged(n)) => {
tracing::warn!(
missed = n,
"wallet-event adapter lagged on broadcast channel; some events were dropped"
);
}
}
_ = cancel.cancelled() => break,
}
_ = cancel.cancelled() => break,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major

Drain pending events before exiting on cancellation

If the cancel signal wins the race in tokio::select!, the loop exits immediately. Since the receiver is a broadcast channel, any events queued in the buffer at that moment will be dropped, potentially losing persistence updates for CoreChangeSet entries generated before shutdown.

Refactor the cancelled branch to drain the receiver:

_ = cancel.cancelled() => {
    while let Ok(event) = receiver.try_recv() {
        let core = build_core_changeset(&wallet_manager, &event).await;
        if core.is_empty_no_records() {
            continue;
        }
        if let Err(e) = persister.store(event.wallet_id(), PlatformWalletChangeSet {
            core: Some(core),
            ..Default::default()
        }) {
            tracing::warn!(wallet_id = %hex::encode(event.wallet_id()), "Drain persist failed");
        }
    }
    break;
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/changeset/core_bridge.rs` around lines 68 -
110, When the cancel branch in core_bridge.rs wins the tokio::select! race, the
loop exits immediately and can drop buffered broadcast events, so update the
cancellation handling in the main event loop to drain receiver before breaking.
Use the existing build_core_changeset and persister.store flow for each pending
event from receiver.try_recv(), preserve the core.is_empty_no_records() skip,
and keep the existing error/warn logging style keyed by wallet_id so shutdown
still persists any queued CoreChangeSet updates.

Comment on lines +407 to +410
tokio::select! {
biased;
_ = cancel.cancelled() => break,
_ = this.sync_now() => {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Update stop() docs for cancellable background passes.

Line 407 races sync_now() against cancellation, so a background pass can be dropped at an .await instead of “running to completion” after stop().

Suggested doc adjustment
-    /// **Cancel-only**: requests cancellation and returns immediately. A
-    /// pass already inside `sync_now` keeps running to completion,
-    /// including its `persister.store(...)` fan-out. For a real "nothing
+    /// **Cancel-only**: requests cancellation and returns immediately. In
+    /// the background loop, cancellation can drop an in-flight `sync_now()`
+    /// at its next `.await`; direct caller-owned `sync_now()` futures are
+    /// not cancelled by this method. For a real "nothing

Also applies to: 423-430

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 407 -
410, Update the `stop()` documentation in `identity_sync.rs` to reflect that
background sync passes are cancellable and may exit early when shutdown is
requested. Adjust the wording around `sync_now()`/`tokio::select!` in `stop()`
and any related comments so it no longer claims a pass will always run to
completion after `stop()`, and make the doc mention that cancellation can
interrupt an in-flight `.await`.

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment on lines +206 to +210
tokio::select! {
biased;
_ = cancel.cancelled() => break,
_ = this.sync_now() => {}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Update stop() docs for cancellation-winning passes.

Line 206 lets cancellation win over sync_now(), so an in-flight background pass is not guaranteed to reach on_platform_address_sync_completed after stop().

Suggested doc adjustment
-    /// **Cancel-only**: requests cancellation and returns immediately. A
-    /// pass already inside `sync_now` keeps running to completion,
-    /// including its `on_platform_address_sync_completed` host-callback
-    /// dispatch. For a real "nothing is running and nothing more will
+    /// **Cancel-only**: requests cancellation and returns immediately. In
+    /// the background loop, cancellation can drop an in-flight `sync_now()`
+    /// at its next `.await`, so the completion callback is not guaranteed
+    /// after this method. For a real "nothing is running and nothing more will

Also applies to: 222-230

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around
lines 206 - 210, Update the `stop()` documentation in `PlatformAddressSync` to
reflect that cancellation can win over `sync_now()` in the `tokio::select!`
loop, so a background pass may exit before reaching
`on_platform_address_sync_completed`. Adjust the docs/comments near `stop()`,
`sync_now()`, and the select loop to clearly state that shutdown does not
guarantee completion of an in-flight sync pass.

Comment on lines +226 to +230
tokio::select! {
biased;
_ = cancel.cancelled() => break,
_ = this.sync_now(false) => {}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Update stop() docs for background-pass cancellation.

Line 226 races cancellation against sync_now(false), so a background coordinator.sync() can be dropped at an .await rather than always running through its persister fan-out.

Suggested doc adjustment
-    /// **Cancel-only**: this requests cancellation and returns
-    /// immediately. A pass already inside `sync_now` /
-    /// `coordinator.sync()` keeps running to completion (including its
-    /// persister-callback fan-out). For a real "nothing is running and
+    /// **Cancel-only**: this requests cancellation and returns
+    /// immediately. In the background loop, cancellation can drop an
+    /// in-flight `sync_now(false)` / `coordinator.sync()` at its next
+    /// `.await`. For a real "nothing is running and

Also applies to: 242-249

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 226 -
230, Update the stop() documentation to reflect that background-pass
cancellation can interrupt an in-flight sync_now(false) call rather than
guaranteeing the persister fan-out completes. In shielded_sync.rs, adjust the
docs/comments around the tokio::select! path in ShieldedSyncManager::stop and
the background coordinator.sync() flow to state that cancellation may win the
race and the sync can be dropped at an await. Keep the description aligned with
the actual behavior of sync_now(false) and the cancel.cancelled() branch.

lklimek and others added 7 commits June 25, 2026 10:34
…uiesce regardless of a running loop

registry.quiesce early-returns NotRunning WITHOUT running the DrainHook when no background-loop slot is registered, so quiesce left the quiescing gate down and never drained an in-flight pass. A concurrent direct sync_now/sync_wallet that had already cleared begin_pass() was therefore not held off, breaking the clear_shielded/stop contract.

Fix: quiesce now raises quiescing itself (gate-before-cancel preserved; the AtomicFlagGuard still reopens it on return), then — after the registry's bounded cancel+join — drains is_syncing when the status is clean. Draining only on a clean status keeps a wedged loop pass (reported Timeout, its thread orphaned) from reintroducing the shutdown stall the bounded join exists to prevent, while still covering the no-loop and idle-loop+direct-pass cases. TDD: new test fails against the pre-fix delegating quiesce.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…so shutdown can't miss it

start_thread took the prior handle out of the slot and ran reap_prior_or_park (a ~1s backstop spin) OUT of the slot lock. A concurrent shutdown() could take the lock, latch closing, snapshot tiers seeing only the NEW handle, release, and reap an EMPTY orphan list — reporting clean while the wedged prior was still live and un-joined.

Fix: park the prior into orphans UNDER the slot lock (park_prior_locked), making take-prior + park-prior atomic from shutdown's under-lock view; the bounded join stays out of the lock (reap_parked_prior, which finds the prior by ThreadId, removes+joins it when finished, or leaves a genuine wedge parked). start_task parks under the lock too. This introduces the module's only slots->orphans nesting; it is deadlock-free since no path takes slots while holding orphans. TDD: long-backstop test asserts the prior is parked before the spin could elapse; fails pre-fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…iveness, like clear_shielded/destroy

shielded_sync_stop returned Success on status.is_clean() alone, ignoring a prior-generation shielded thread still parked alive as an orphan — asymmetric with clear_shielded/destroy and a misleading contract (the orphan still holds the host callback context). No live UAF today since Swift always does stop->destroy, but Success should imply no live shielded worker/orphan.

Add manager::shielded_worker_alive() (the same shielded-scoped any_alive_for gate clear_shielded consults) and have shielded_sync_stop return ErrorShutdownIncomplete when a parked orphan survives a clean drain. FFI ABI unchanged (same return-code semantics); docstring updated so Success accurately implies no live shielded worker/orphan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…cing gate continuously

SEC-001: clear_shielded's in-flight-pass drain was unbounded and the FFI clear bridge a bare block_on, so a heavy direct pass could hang the host (ANR). Bound it with a SHUTDOWN_JOIN_TIMEOUT_SECS timeout (mirroring shielded_sync_stop); on timeout the clear reports Timeout and aborts BEFORE the wipe, leaving the store intact. Split out clear_shielded_inner(drain_timeout) so the timeout path is testable without the 30s budget.

SEC-002/RUST-002: the gate lapsed between quiesce() returning (its RAII guard lowers the shared flag) and the post-drain re-raise, letting a direct sync_now/sync_wallet slip past any_alive_for and re-persist into the wiped store. Fix: raise+HOLD the gate via clear's own guard BEFORE draining, and drain via a new gate-neutral quiesce_under_held_gate (extracted cancel_join_and_drain shared with quiesce, which stays byte-identical — Fix-1 invariant untouched). The gate now stays raised continuously across drain, liveness check, and wipe; doc softened to note the only residual is a full start() racing clear (per-key-latch follow-up). PROJ-004: clear now calls shielded_worker_alive() instead of re-inlining any_alive_for. Also clarifies the quiesce doc that only shielded gates sync_wallet (platform-address's is intentionally ungated). TDD: both SEC tests proven non-vacuous (revert->fail->restore).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…story-tags

RUST-004: a failed thread spawn left the slot carrying the FAILED start's weight/drain/join_budget (and a bumped generation). Now snapshot the pre-start config and restore ALL of it on the Err path, so the re-installed prior keeps its own teardown config; generation rolls back too (the +1 is only observed under the slot lock and the failed start spawns no thread, so the rollback is net-zero and the externally-visible generation stays monotonic). New regression test asserts the restored config.

RUST-005: trimmed the duplicated park-under-lock rustdoc block in start_thread that repeated park_prior_locked's doc.

RUST-003/PROJ hygiene: removed [F1/F2/F3 FIX] history-tags from committed comments, replaced the 'Why F1 and F2 cannot recur' module section with present-state invariant descriptions, and fixed the glossary to reference the key-scoped any_alive_for (the gate store-wiping paths actually consult) rather than the registry-wide any_alive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…an-trigger contract

shielded_sync_stop now also returns .errorShutdownIncomplete when the drain was clean but a prior-generation shielded thread is still parked alive as an orphan (not only when the in-flight drain times out). Update the manager deinit comment to reflect both triggers; behaviour unchanged (the deinit already leaks one strong ref to the handlers on .errorShutdownIncomplete). Comment-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…fencing (SeqCst)

SEC-003: the pass-gate handshake is a Dekker-style StoreLoad pair across two distinct atomics — quiesce does store(quiescing) … load(is_syncing); begin_pass does CAS(is_syncing) … load(quiescing). Release/Acquire do NOT order StoreLoad across separate locations, so by the annotations alone both sides could miss each other (begin_pass reads a stale quiescing==false and runs a pass past a raised gate while the drain reads a stale is_syncing==false and returns). It was sound only incidentally — registry.quiesce happens to take the slots Mutex (a fence) before returning; a lock-free refactor of that path would make the race live.

Promote the four handshake ops to SeqCst (a single total order guarantees at least one side observes the other): the gate-raise stores in quiesce, hold_quiescing_gate, and the registry drain hook; the is_syncing CAS (success) and quiescing load in begin_pass; and the is_syncing load in the drain. Gate lowering (reopen / RAII drop) and observational reads stay Release/Acquire — a stale-high gate read only makes a pass bail conservatively. Fix-1's gate-before-cancel + never-latched invariant is preserved (SeqCst is strictly stronger than the prior Release). Added a load-bearing-ordering comment at begin_pass. Not unit-testable (ordering); the existing gate/drain handshake tests still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

All three prior findings (911f99f) are resolved at 748c4f8: CoordinatorLifecycle::quiesce now self-raises the quiescing gate and drains in-flight direct passes (coordinator_lifecycle.rs:166-173, with a regression test); the shielded_sync_stop FFI now consults manager.shielded_worker_alive() and returns ErrorShutdownIncomplete when a prior-generation shielded orphan is parked (shielded_sync.rs:127-157); and start_thread now parks the prior worker under the slot lock via park_prior_locked before the out-of-lock reap, with a regression test in the registry. One in-scope cumulative concern remains: the shared quiescing AtomicBool means a concurrent quiesce()'s AtomicFlagGuard drop can lower the gate that clear_shielded's hold_quiescing_gate is supposed to keep raised continuously — the mechanism does not match the doc's continuous-hold promise in clear_shielded_inner. In practice this is gated by the host's UI-thread serialization, so it is a suggestion, not a blocker.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-173: clear_shielded's "continuously held" gate can be lowered by a concurrent quiesce()
  `clear_shielded_inner` holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across drain → liveness check → wipe, expecting the shielded `quiescing` boolean to stay `true` for the whole sequence (and the doc at manager/mod.rs:517-520 explicitly promises "the quiescing gate is raised continuously" so concurrent direct `sync_now`/`sync_wallet` are held off). But the same `AtomicBool` backs `CoordinatorLifecycle::quiesce`, which builds its own `AtomicFlagGuard` (coordinator_lifecycle.rs:171); `AtomicFlagGuard::drop` unconditionally stores `false` (rs-dash-async/src/atomic.rs:25-29). If `platform_wallet_manager_shielded_sync_stop` is invoked concurrently with `clear_shielded` (the FFI handle store only takes a brief read lock in `with_item`, so two calls on the same handle can interleave), the stop's `quiesce()` returns and its dropping guard clears the flag while clear is still between its drain and its wipe. A direct `sync_now`/`sync_wallet` then sees `quiescing=false` in `begin_pass`, claims `is_syncing`, and can re-persist into the store `coord.clear()` is about to wipe — a store/host desync. The host's UI-thread serialization mitigates this today, but the documented invariant on `clear_shielded` only excludes concurrent `shielded_sync_start`, not `_stop`. Fix options: refcount the gate (so the inner drop doesn't lower it under the outer hold), use a clear-specific second flag that `begin_pass` also consults, or tighten the host-serialization doc to enumerate `_stop` too.

Comment on lines +166 to +173
pub(crate) async fn quiesce(&self) -> CoordinatorThreadStatus {
// Gate up first (instant) and held until the guard drops on return.
// SeqCst: store-half of the `quiescing`<->`is_syncing` handshake
// (see `begin_pass`).
self.quiescing.store(true, Ordering::SeqCst);
let _quiescing_gate = AtomicFlagGuard::new(&self.quiescing);
self.cancel_join_and_drain().await
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: clear_shielded's "continuously held" gate can be lowered by a concurrent quiesce()

clear_shielded_inner holds _clearing_gate = shielded_sync_manager.hold_quiescing_gate() across drain → liveness check → wipe, expecting the shielded quiescing boolean to stay true for the whole sequence (and the doc at manager/mod.rs:517-520 explicitly promises "the quiescing gate is raised continuously" so concurrent direct sync_now/sync_wallet are held off). But the same AtomicBool backs CoordinatorLifecycle::quiesce, which builds its own AtomicFlagGuard (coordinator_lifecycle.rs:171); AtomicFlagGuard::drop unconditionally stores false (rs-dash-async/src/atomic.rs:25-29). If platform_wallet_manager_shielded_sync_stop is invoked concurrently with clear_shielded (the FFI handle store only takes a brief read lock in with_item, so two calls on the same handle can interleave), the stop's quiesce() returns and its dropping guard clears the flag while clear is still between its drain and its wipe. A direct sync_now/sync_wallet then sees quiescing=false in begin_pass, claims is_syncing, and can re-persist into the store coord.clear() is about to wipe — a store/host desync. The host's UI-thread serialization mitigates this today, but the documented invariant on clear_shielded only excludes concurrent shielded_sync_start, not _stop. Fix options: refcount the gate (so the inner drop doesn't lower it under the outer hold), use a clear-specific second flag that begin_pass also consults, or tighten the host-serialization doc to enumerate _stop too.

source: ['codex']

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants