fix(remote-signer): keep retrying discovery while the cache is empty#3973

Draft
rickstaa wants to merge 1 commit into ja/live-runner from rs/fix-remote-discovery-startup-race

@rickstaa rickstaa commented Jul 4, 2026

Member

Problem

On the remote signer, GET /discover-orchestrators returns 503 "cache empty" for up to -liveAICapReportInterval (default 25m) after startup, even once orchestrators are advertising. The only known workaround is to set a very short -liveAICapReportInterval, which makes the metrics poll run far more often than intended.

Reported by Brad:

$ curl -s http://localhost:8081/discover-orchestrators | jq
{ "error": { "message": "Service Unavailable Error" } }

Root cause

The /discover-orchestrators snapshot in remoteDiscoveryPool is derived from the node's network-capabilities cache (GetNetworkCapabilities()), which the orchestrator pool (DBOrchestratorPoolCache) populates on its first poll — asynchronously at startup.

remoteDiscoveryPool.refresh() set lastRefresh = now unconditionally, including when the derived snapshot was empty. The first request that landed before that first poll completed therefore built an empty snapshot and then rate-limited every subsequent refresh for a full refreshEvery (= -liveAICapReportInterval). With the default setting, the empty snapshot is locked in for up to 25m; a short interval "fixes" it only because the next allowed refresh comes soon enough to pick up the now-populated node cache.
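To make the race concrete, here is a minimal sketch of the pre-fix behaviour. It is not the actual go-livepeer code: buggyPool, source, and lockedInEmpty are hypothetical names, the pre-fix guard is inferred from the description above, and the clock is passed in explicitly so the sequence is deterministic.

```go
package main

import "time"

// buggyPool is a hypothetical stand-in for remoteDiscoveryPool. The field
// names follow the PR description; the real struct may differ.
type buggyPool struct {
	source       func() []string // stand-in for GetNetworkCapabilities()
	cached       []string
	lastRefresh  time.Time
	refreshEvery time.Duration
}

// refresh models the assumed pre-fix logic: lastRefresh advances even when
// the derived snapshot is empty.
func (p *buggyPool) refresh(now time.Time) {
	if !p.lastRefresh.IsZero() && now.Sub(p.lastRefresh) <= p.refreshEvery {
		return // rate-limited, whether or not cached is empty
	}
	p.cached = p.source()
	p.lastRefresh = now // set unconditionally: this is the bug
}

// lockedInEmpty replays the startup race and returns the cache size seen by
// a request made one minute after the node cache was populated.
func lockedInEmpty() int {
	var node []string // node cache, empty until the first poll completes
	p := &buggyPool{
		source:       func() []string { return node },
		refreshEvery: 25 * time.Minute,
	}

	start := time.Date(2026, 7, 4, 10, 0, 0, 0, time.UTC)
	p.refresh(start)                  // request lands before the first poll
	node = []string{"orch-1"}         // first poll completes
	p.refresh(start.Add(time.Minute)) // still inside refreshEvery: skipped
	return len(p.cached)
}
```

In this model lockedInEmpty() returns 0: the second call is rate-limited even though the node cache already holds an orchestrator.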

Fix

Rate-limit refreshes only once a non-empty snapshot exists:

if len(p.cached) > 0 && !p.lastRefresh.IsZero() && now.Sub(p.lastRefresh) <= p.refreshEvery {
    return
}

While the cache is empty, every call re-derives from the in-memory GetNetworkCapabilities() snapshot. refresh() does no network I/O (the actual polling is on the separate DBOrchestratorPoolCache ticker), so retrying while empty is cheap and bounded. Once populated, the normal interval applies. This removes the need to shorten -liveAICapReportInterval.
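The guard above can be exercised in isolation with a small model. This is a sketch under assumptions, not the real implementation: fixedPool, source, derivations, and replayFixed are hypothetical names, and the clock is injected so the example does not depend on wall time.

```go
package main

import "time"

// fixedPool is a hypothetical stand-in for remoteDiscoveryPool after the fix.
type fixedPool struct {
	source       func() []string // stand-in for GetNetworkCapabilities()
	cached       []string
	lastRefresh  time.Time
	refreshEvery time.Duration
	derivations  int // counts how often the snapshot was re-derived
}

// refresh rate-limits only once a non-empty snapshot exists.
func (p *fixedPool) refresh(now time.Time) {
	if len(p.cached) > 0 && !p.lastRefresh.IsZero() && now.Sub(p.lastRefresh) <= p.refreshEvery {
		return
	}
	p.cached = p.source() // in-memory read only, no network I/O
	p.lastRefresh = now
	p.derivations++
}

// replayFixed returns the final cache size and the number of derivations.
func replayFixed() (size, derivations int) {
	var node []string // node cache, empty until the first poll completes
	p := &fixedPool{
		source:       func() []string { return node },
		refreshEvery: time.Hour,
	}

	t0 := time.Date(2026, 7, 4, 10, 0, 0, 0, time.UTC)
	p.refresh(t0)                      // empty: derived
	p.refresh(t0.Add(time.Second))     // still empty: derived again
	node = []string{"orch-1"}          // first poll completes
	p.refresh(t0.Add(2 * time.Second)) // picks up the orchestrator
	p.refresh(t0.Add(3 * time.Second)) // populated: rate-limited
	return len(p.cached), p.derivations
}
```

Here replayFixed() returns (1, 3): two derivations while the cache is empty, one that populates it, and a fourth call that is skipped because the normal interval now applies.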

Test

Adds TestRemoteSigner_Discovery_EmptyCacheRetriesBeforeInterval: with refreshEvery = 1h, a first request against an unpopulated node returns 503; after UpdateNetworkCapabilities, a follow-up request within the interval returns the orchestrator. Fails on main (503), passes with the fix.
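The shape of that scenario can be sketched with net/http/httptest against a stand-in handler. Everything here is illustrative: fakeNode, discoveryHandler, statusOf, and replayHTTP are hypothetical names and do not reflect the actual test or the remote signer's types.

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// fakeNode is a hypothetical stand-in for the node's capabilities cache.
type fakeNode struct {
	mu    sync.Mutex
	orchs []string
}

func (n *fakeNode) update(orchs []string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.orchs = orchs
}

func (n *fakeNode) snapshot() []string {
	n.mu.Lock()
	defer n.mu.Unlock()
	return append([]string(nil), n.orchs...)
}

// discoveryHandler is a hypothetical handler using the post-fix guard.
type discoveryHandler struct {
	mu           sync.Mutex
	node         *fakeNode
	cached       []string
	lastRefresh  time.Time
	refreshEvery time.Duration
}

func (h *discoveryHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	now := time.Now()
	limited := len(h.cached) > 0 && !h.lastRefresh.IsZero() && now.Sub(h.lastRefresh) <= h.refreshEvery
	if !limited {
		h.cached = h.node.snapshot()
		h.lastRefresh = now
	}
	orchs := h.cached
	h.mu.Unlock()

	if len(orchs) == 0 {
		http.Error(w, "cache empty", http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(orchs)
}

// statusOf issues an in-process GET and returns the response status code.
func statusOf(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/discover-orchestrators", nil))
	return rec.Code
}

// replayHTTP returns the status codes seen before and after the node cache
// is populated, with both requests well inside a one-hour refreshEvery.
func replayHTTP() (before, after int) {
	node := &fakeNode{}
	h := &discoveryHandler{node: node, refreshEvery: time.Hour}
	before = statusOf(h)
	node.update([]string{"orch-1"})
	after = statusOf(h)
	return before, after
}
```

With the guard in place replayHTTP() yields 503 and then 200. Dropping the len(h.cached) > 0 condition from limited makes the second request return 503 as well, which mirrors the failure on main.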

🤖 Generated with Claude Code

The remote-signer /discover-orchestrators cache is derived from the node's
network-capabilities snapshot, which the orchestrator pool populates on its
first poll (asynchronously at startup). remoteDiscoveryPool.refresh() advanced
lastRefresh unconditionally, so a request that landed before that first poll
completed locked in an empty snapshot for a full refreshEvery interval
(= -liveAICapReportInterval, default 25m). Until then /discover-orchestrators
returned "503 cache empty", and the only workaround was setting a very short
-liveAICapReportInterval.

Rate-limit refreshes only once a non-empty snapshot exists. While the cache is
empty, every call re-derives from the in-memory GetNetworkCapabilities()
snapshot (no network I/O), so orchestrators surface as soon as they appear
instead of after a full interval.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Jul 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4062be9b-134a-49bc-9a68-bdc83c6c6e0d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@github-actions github-actions Bot added the go Pull requests that update Go code label Jul 4, 2026
@rickstaa rickstaa marked this pull request as draft July 4, 2026 10:39