Skip to content

fix(desktop): macOS/Linux pre-CEF stale-lock reap + installer pre-install kill (#4395)#4405

Merged
senamakel merged 2 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4395-crossplatform-single-instance-guard
Jul 3, 2026
Merged

fix(desktop): macOS/Linux pre-CEF stale-lock reap + installer pre-install kill (#4395)#4405
senamakel merged 2 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4395-crossplatform-single-instance-guard

Conversation

@M3gA-Mind

@M3gA-Mind M3gA-Mind commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Summary

Problem

PR #3793 fixed the "wedged stale process blocks the update" bug (#3605) on Windows only; the cross-platform pieces were deferred here (#4395) because they are unsafe without extra guards.

On Windows a Win32 named mutex early in run() forces any concurrent peer to exit(0) before the reap, so a survivor holding the CEF cache lock is provably wedged. macOS/Linux have no equivalent pre-CEF guard (tauri_plugin_single_instance registers later, after CEF init), so a process holding the CEF SingletonLock past the cache-wait budget may be a healthy running primary — a live app holds its lock for its whole lifetime. Reaping it purely on "held N seconds" could SIGKILL a healthy instance and lose user data (exactly why the earlier macOS SIGKILL was reverted, flagged by @oxoxDev + Codex P1).

Solution

  • cef_stale_reap (macOS/Linux), new module. The updater writes a relaunch marker immediately before app.restart(); the freshly launched process consumes it (once, deleted, age-bounded to 5 min). Only a fresh marker + a live CEF-lock holder is treated as stale — precisely the post-update wedge, never a normal launch of a healthy app. Extra guards: never reap self; skip when the lock's <hostname>-<pid> host ≠ ours (NFS/networked-home pid-collision safety); SIGTERM, grace, then re-read the lock to confirm the same pid still owns it and is alive before SIGKILL.
  • Wired into run() immediately before cef_preflight::wait_for_cache_release() for cfg(macos, linux). Complementary to process_recovery::reap_stale_openhuman_processes, which deliberately skips when a live CEF-lock holder exists.
  • Installer (Part 2): bundle.windows.nsis.installerHooksnsis-hooks.nsh with NSIS_HOOK_PREINSTALL running taskkill /F /T on openhuman-core.exe and the main binary before files are copied.
  • Adds the hostname feature to the existing nix unix dependency (for gethostname); no new crates.
  • Explicitly out of scope / untouched: the Windows runtime reap hardening (sibling Windows: scoped-safe startup reap of wedged stale process (follow-up to #3605) #3900).

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — unit tests for the pure decision matrix (incl. the "no reap without a fresh marker even with a live holder" safety property), marker freshness/path, and mac/linux glue (lock-holder parse, pid re-validation, write→consume round-trip, safe-noop orchestrator run).
  • Diff coverage ≥ 80%confirmed by CI: the Rust Tauri Coverage (cargo-llvm-cov) job passed (green) on this PR, which builds + tests app/src-tauri and enforces the ≥ 80% changed-line coverage gate. (The reap_pid SIGKILL branch itself is a thin OS-kill wrapper; the decision matrix that drives it is fully unit-tested.)
  • Coverage matrix updated — N/A: no matrix feature row for this platform-lifecycle guard (behaviour is startup-recovery internal, not a user-facing feature ID).
  • All affected feature IDs from the matrix are listed in the PR description under ## RelatedN/A: no feature IDs affected.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: no smoke-doc change; verification is the real-host matrix noted under Impact.
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Platform: desktop macOS/Linux (runtime reap, cfg-gated) + Windows (NSIS installer hook). No mobile/web/CLI behaviour change.
  • Security/safety: the whole design centers on not killing a healthy instance; kills are gated on a post-update marker + host match + pid re-validation, SIGTERM-first.
  • Verification: the cfg(...)-gated runtime paths and the NSIS hook must be verified on real macOS, Linux, and Windows(installer) hosts — they cannot be exercised cross-platform on a single dev machine.

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Keep this section for AI-authored PRs. For human-only PRs, mark each field N/A.

Linear Issue

Commit & Branch

  • Branch: fix/GH-4395-crossplatform-single-instance-guard
  • Commit SHA: 3450adc

Validation Run

  • pnpm --filter openhuman-app format:checkN/A: frontend (app/src) untouched; the diff is app/src-tauri only (the Frontend Checks lane is correctly skipped for this PR).
  • pnpm typecheckN/A: frontend untouched (same reason as above).
  • Focused tests — cef_stale_reap unit tests compiled and ran green in CI: the passing Rust Tauri Coverage (cargo-llvm-cov) job builds and executes the app/src-tauri test suite.
  • Rust fmt/check (if changed) — N/A: root core crate (src/) unchanged; the Rust Quality (fmt, clippy) lane passed on this PR.
  • Tauri fmt/check (if changed) — cargo check for app/src-tauri compiles clean, verified green in CI by the passing Rust Tauri Coverage job that builds this crate.

Validation Blocked

  • command: real-host runtime verification of the cfg-gated reap (macOS/Linux) and the NSIS NSIS_HOOK_PREINSTALL (Windows installer).
  • status: the app/src-tauri build, unit tests, and diff-coverage gate are now verified green in CI (Rust Tauri Coverage); what remains is the platform-gated runtime behaviour, which cannot be exercised cross-platform on a single dev host.
  • impact: cfg(windows)/cfg(macos,linux) runtime paths and the NSIS hook are platform-gated and require real macOS/Linux/Windows(installer) hosts to observe end-to-end.

Behavior Changes

  • Intended behavior change: after an update on macOS/Linux, a wedged prior instance holding the CEF cache lock is reaped so the new app starts, instead of the new app silently timing out and exiting.
  • User-visible effect: updates apply cleanly instead of "app won't open until I kill the old process manually"; no change on a normal (non-update) launch.

Parity Contract

  • Legacy behavior preserved: normal launches and healthy running instances are never touched (no marker → no reap); cef_preflight::wait_for_cache_release remains the fallback.
  • Guard/fallback/dispatch parity checks: mirrors the Windows pre-CEF reap intent, but gated on a staleness signal + host match + pid re-validation instead of the Win32 mutex; Windows reap (Windows: scoped-safe startup reap of wedged stale process (follow-up to #3605) #3900) untouched.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: this PR
  • Resolution: N/A

Summary by CodeRabbit

  • New Features

    • Improved app updates and relaunch behavior on macOS and Linux, helping avoid stuck previous instances after updating.
    • Windows installer now closes running app processes before installation to reduce file-lock issues.
  • Bug Fixes

    • Better handling of browser-related startup locks, making launches more reliable after an update.
    • Added support for detecting the current machine name on Unix-like systems to improve update recovery.

…tall kill (tinyhumansai#4395)

Follow-up to tinyhumansai#3605; the Windows-only runtime reap landed in tinyhumansai#3793. macOS and
Linux had no pre-CEF single-instance guard, so a process holding the CEF
SingletonLock past the cache-wait budget could be a healthy running primary,
not a wedged one — the reason the earlier macOS SIGKILL escalation was reverted.

Part 1: add `cef_stale_reap` (macOS/Linux), a pre-CEF reap of a wedged prior
instance gated on a post-update relaunch marker (the staleness signal from
tinyhumansai#4395). The updater writes the marker before `app.restart()`; the relaunched
process consumes it (once, deleted, age-bounded). Only a fresh marker + a live
lock holder triggers a reap. Guards: skip self, skip cross-host holders, and
SIGTERM → grace → re-validate the same pid still owns the lock → SIGKILL
(closes the PID-reuse window, honors kill_pid_force's contract). Never reaps a
healthy instance. Wired into run() before cef_preflight::wait_for_cache_release.

Part 2: add an NSIS NSIS_HOOK_PREINSTALL that terminates openhuman-core.exe and
the main binary tree before any file copy, so a stale prior-version process
can't survive an update.

Adds the nix `hostname` feature for gethostname. Pure decision logic is unit
tested on any host; the cfg-gated runtime/installer paths need verification on
real macOS, Linux, and Windows(installer) hosts.
@M3gA-Mind M3gA-Mind requested a review from a team July 2, 2026 10:50
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 27 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7552d278-1fad-4c9a-9d89-7b68fc231c1b

📥 Commits

Reviewing files that changed from the base of the PR and between 3450adc and 42cd980.

📒 Files selected for processing (1)
  • app/src-tauri/src/cef_stale_reap.rs
📝 Walkthrough

Walkthrough

Adds a macOS/Linux-only pre-CEF stale lock reap mechanism (marker-based staleness detection, ownership revalidation, SIGTERM/SIGKILL escalation) wired into update install and startup flow, enables the nix hostname feature, and adds a Windows NSIS pre-install hook to kill prior OpenHuman processes, wired via tauri.conf.json.

Changes

macOS/Linux stale CEF lock reap

Layer / File(s) Summary
Reap decision model
app/src-tauri/src/cef_stale_reap.rs
Defines constants, ReapDecision, and decide_reap gating logic on self/host/marker recency, plus marker freshness/path helpers.
Marker write/consume
app/src-tauri/src/cef_stale_reap.rs
Resolves CEF cache path, writes a PID marker post-update, and consumes (reads then deletes) it once per launch.
Reap orchestrator + nix feature
app/src-tauri/src/cef_stale_reap.rs, app/src-tauri/Cargo.toml
Implements the orchestrator that reads the SingletonLock, applies decide_reap, and escalates SIGTERM→revalidate→SIGKILL; enables the hostname feature on nix.
Startup/update wiring
app/src-tauri/src/lib.rs
Declares the module, writes the relaunch marker after update install, and invokes the reap step before waiting for cache release at startup.
Unit tests
app/src-tauri/src/cef_stale_reap.rs
Covers decision logic, marker semantics, lock holder parsing, and orchestrator no-op behavior.

Windows installer pre-install process kill

Layer / File(s) Summary
NSIS pre-install kill hook and config wiring
app/src-tauri/nsis-hooks.nsh, app/src-tauri/tauri.conf.json
Adds NSIS_HOOK_PREINSTALL macro that taskkills OpenHuman processes before install, registered via bundle.windows.nsis.installerHooks.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant AppStartup as run()
  participant Reap as cef_stale_reap
  participant LockFile as SingletonLock
  participant Marker as relaunch marker
  participant PriorProc as prior process

  AppStartup->>Reap: reap_stale_cef_lock_holder()
  Reap->>LockFile: read symlink for (host, pid)
  Reap->>Marker: consume_marker()
  Reap->>Reap: decide_reap(holder, marker, self, host)
  alt should reap
    Reap->>PriorProc: SIGTERM
    Reap->>Reap: wait TERM_GRACE
    Reap->>LockFile: holder_still_owns_lock()
    Reap->>PriorProc: SIGKILL
  else skip
    Reap-->>AppStartup: defer to cef_preflight
  end
  AppStartup->>AppStartup: wait_for_cache_release()
Loading

Suggested labels: rust-core, bug

Poem

A lock held stale, a ghost of before,
I sniff out the marker behind the door,
SIGTERM whispered, a grace-time pause,
Then taskkill stomps with rabbit-paws.
Hop, hop — the update installs clean at last! 🐇🔑

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the macOS/Linux stale-lock reap and installer pre-install kill changes.
Linked Issues check ✅ Passed The PR implements the guarded macOS/Linux reap, marker-based staleness signal, PID revalidation, and installer pre-install kill required by #4395.
Out of Scope Changes check ✅ Passed The nix hostname feature addition supports the new hostname-based lock-holder checks and is in scope.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot added bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. labels Jul 2, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3450adc113

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread app/src-tauri/src/lib.rs
Comment on lines +2741 to +2742
#[cfg(any(target_os = "macos", target_os = "linux"))]
cef_stale_reap::reap_stale_cef_lock_holder();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't reap before the normal relaunch grace period

On macOS/Linux post-update relaunches where app.restart() starts the new process while the old CEF runtime is still tearing down, a fresh marker plus a live SingletonLock holder is an expected transient state—the existing preflight comment above calls this the common sequential relaunch race and waits for it to clear. Running the marker-gated reaper before that wait means an otherwise healthy prior instance gets SIGTERM/SIGKILL as soon as the new process reaches startup, rather than only when it is actually wedged, reintroducing forced termination during ordinary updates.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
app/src-tauri/nsis-hooks.nsh (1)

19-29: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Consider a brief wait after force-kill to avoid handle-release races.

taskkill /F returns once termination is requested, not necessarily once the OS has fully released open handles/DLL locks. Given the whole point of this hook is to prevent a "stale process still holding files" scenario, a short Sleep (e.g. 500ms–1s) after the two taskkill calls would reduce the residual race window before file copy begins.

💡 Optional hardening
   nsExec::Exec 'taskkill /F /IM ${MAINBINARYNAME}.exe /T'
   Pop $0
+  ; Give the OS a moment to fully release file handles after force-kill.
+  Sleep 500
 !macroend
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src-tauri/nsis-hooks.nsh` around lines 19 - 29, The NSIS preinstall hook
in NSIS_HOOK_PREINSTALL should add a short delay after force-killing OpenHuman
processes to avoid handle-release races before file replacement starts. Update
the macro so that after the two nsExec::Exec taskkill calls (for
openhuman-core.exe and ${MAINBINARYNAME}.exe), it pauses briefly with Sleep to
give the OS time to fully release locks and handles. Keep the change localized
to NSIS_HOOK_PREINSTALL and preserve the existing process-kill sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src-tauri/src/cef_stale_reap.rs`:
- Around line 151-152: The stale-reap flow currently trusts marker freshness
without tying it to the current lock owner, so update the logic around
consume_marker(), SingletonLock handling, and the marker write/read path to
require the parsed marker PID to match the live lock-holder PID before any
reaping occurs. In the code that builds the marker body and in the later reap
checks, propagate and compare the marker’s PID against the PID associated with
the active SingletonLock, and only proceed when both freshness and PID identity
match.
- Around line 101-105: The stale-lock reap logic in cef_stale_reap.rs should
fail closed when host identity cannot be confirmed and should revalidate the
full lock owner instead of only the PID. Update the host check around the
local_host/host comparison to skip reaping whenever the local hostname is
unavailable, and make the SIGKILL revalidation and any lock-owner checks compare
both host and pid as a pair. Apply the same host+pid validation consistently in
the related reap decision paths (including the helpers around the other marked
sections) so a reused PID on another host cannot be mistaken for the same lock
owner.

---

Nitpick comments:
In `@app/src-tauri/nsis-hooks.nsh`:
- Around line 19-29: The NSIS preinstall hook in NSIS_HOOK_PREINSTALL should add
a short delay after force-killing OpenHuman processes to avoid handle-release
races before file replacement starts. Update the macro so that after the two
nsExec::Exec taskkill calls (for openhuman-core.exe and ${MAINBINARYNAME}.exe),
it pauses briefly with Sleep to give the OS time to fully release locks and
handles. Keep the change localized to NSIS_HOOK_PREINSTALL and preserve the
existing process-kill sequence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e50de52b-a491-44d2-b315-2a93baf73034

📥 Commits

Reviewing files that changed from the base of the PR and between f979bfa and 3450adc.

📒 Files selected for processing (5)
  • app/src-tauri/Cargo.toml
  • app/src-tauri/nsis-hooks.nsh
  • app/src-tauri/src/cef_stale_reap.rs
  • app/src-tauri/src/lib.rs
  • app/src-tauri/tauri.conf.json

Comment thread app/src-tauri/src/cef_stale_reap.rs Outdated
Comment thread app/src-tauri/src/cef_stale_reap.rs
@M3gA-Mind

Copy link
Copy Markdown
Collaborator Author

Fixed the PR Submission Checklist lane (the only red check). The parser (scripts/check-pr-checklist.mjs) hard-fails on any - [ ] box, and 6 were left unchecked pending CI. Those results have since landed green, so I ticked them honestly against real CI evidence — no faked/unrun items:

  • Diff coverage ≥ 80% → confirmed by the passing Rust Tauri Coverage (cargo-llvm-cov) job (builds + tests app/src-tauri and enforces the changed-line gate).
  • Focused tests / Tauri fmt/checkcef_stale_reap tests + app/src-tauri build verified green by that same CI job.
  • format:check, typecheck, Rust fmt/check (core) → marked N/A with reasons: this PR touches only app/src-tauri, so frontend (app/src) and root core (src/) are untouched (Frontend Checks correctly skipped; Rust Quality passed).

Verified locally: node scripts/check-pr-checklist.mjs on the new body reports 12/12 items satisfied (0 unchecked), exit 0. Also refreshed the Validation Blocked note so it reflects that only platform-gated runtime verification remains. No code changes.

…ansai#4395 review)

Codex + CodeRabbit review — three safety tightenings so the pre-CEF reaper can
only ever signal a provably-wedged pre-update instance:

- Fail closed on host identity. An unresolved local hostname previously still
  permitted a reap; now it skips (can't prove the SingletonLock holder is
  same-host). Revalidate the full (host, pid) owner — not pid alone — in
  holder_still_owns_lock, so a reused numeric pid on another machine (shared/NFS
  cache path) can't be mistaken for the holder we decided to reap.
- Bind the marker to the lock holder. consume_marker now returns the pid the
  marker recorded (not just freshness); decide_reap requires marker_pid == the
  live holder pid, so a leaked-but-fresh marker can't authorize killing an
  unrelated same-host process that happens to own the lock now.
- Don't reap before the normal relaunch teardown. A fresh marker + live holder
  is the expected transient on an ordinary update (the old runtime is still
  tearing down). reap_pid now waits a bounded RELAUNCH_TEARDOWN_GRACE and
  re-checks ownership before SIGTERM, so only a still-wedged holder is signalled
  — a healthy instance is never force-terminated during a routine update.

Tests: (host, pid) revalidation incl. wrong-host, marker-pid mismatch skip,
fail-closed-on-unresolved-host skip, consume_marker returns the recorded pid.
15 cef_stale_reap tests pass.
@senamakel senamakel merged commit ae41fcb into tinyhumansai:main Jul 3, 2026
15 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in Team Openhuman Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

macOS/Linux pre-CEF single-instance guard + reap, and installer pre-install process kill (follow-up to #3605)

2 participants