fix(remote-signer): keep retrying discovery while the cache is empty by rickstaa · Pull Request #3973 · livepeer/go-livepeer

rickstaa · 2026-07-04T09:50:17Z

Problem

On the remote signer, GET /discover-orchestrators returns 503 "cache empty" for up to -liveAICapReportInterval (default 25m) after startup, even once orchestrators are advertising. The only known workaround is setting a very short -liveAICapReportInterval, which makes the metrics poll run far more often than intended.

Reported by Brad:

$ curl -s http://localhost:8081/discover-orchestrators | jq
{ "error": { "message": "Service Unavailable Error" } }

Root cause

The /discover-orchestrators snapshot in remoteDiscoveryPool is derived from the node's network-capabilities cache (GetNetworkCapabilities()), which the orchestrator pool (DBOrchestratorPoolCache) populates on its first poll — asynchronously at startup.

Before this change, remoteDiscoveryPool.refresh() set lastRefresh = now unconditionally, including when the derived snapshot was empty. So the first request that landed before that first poll completed built an empty snapshot and then rate-limited every subsequent refresh for a full refreshEvery (= -liveAICapReportInterval). Result: the empty snapshot is locked in for up to 25m at the default interval; a short interval "fixes" it only because the next allowed refresh comes soon enough to pick up the now-populated node cache.

Fix

Rate-limit refreshes only once a non-empty snapshot exists:

if len(p.cached) > 0 && !p.lastRefresh.IsZero() && now.Sub(p.lastRefresh) <= p.refreshEvery {
    return
}

While the cache is empty, every call re-derives from the in-memory GetNetworkCapabilities() snapshot. refresh() does no network I/O (the actual polling is on the separate DBOrchestratorPoolCache ticker), so retrying while empty is cheap and bounded. Once populated, the normal interval applies. This removes the need to shorten -liveAICapReportInterval.

Test

Adds TestRemoteSigner_Discovery_EmptyCacheRetriesBeforeInterval: with refreshEvery = 1h, a first request against an unpopulated node returns 503; after UpdateNetworkCapabilities, a follow-up request within the interval returns the orchestrator. Fails on main (503), passes with the fix.

🤖 Generated with Claude Code

The remote-signer /discover-orchestrators cache is derived from the node's network-capabilities snapshot, which the orchestrator pool populates on its first poll (asynchronously at startup). remoteDiscoveryPool.refresh() advanced lastRefresh unconditionally, so a request that landed before that first poll completed locked in an empty snapshot for a full refreshEvery interval (= -liveAICapReportInterval, default 25m). Until then /discover-orchestrators returned "503 cache empty", and the only workaround was setting a very short -liveAICapReportInterval.

Rate-limit refreshes only once a non-empty snapshot exists. While the cache is empty, every call re-derives from the in-memory GetNetworkCapabilities() snapshot (no network I/O), so orchestrators surface as soon as they appear instead of after a full interval.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-07-04T09:50:26Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

github-actions Bot added the go Pull requests that update Go code label Jul 4, 2026

rickstaa marked this pull request as draft July 4, 2026 10:39

rickstaa wants to merge 1 commit into ja/live-runner from rs/fix-remote-discovery-startup-race
