fix(remote-signer): keep retrying discovery while the cache is empty#3973
Draft
rickstaa wants to merge 1 commit into
Draft
fix(remote-signer): keep retrying discovery while the cache is empty#3973rickstaa wants to merge 1 commit into
rickstaa wants to merge 1 commit into
Conversation
The remote-signer /discover-orchestrators cache is derived from the node's network-capabilities snapshot, which the orchestrator pool populates on its first poll (asynchronously at startup). remoteDiscoveryPool.refresh() advanced lastRefresh unconditionally, so a request that landed before that first poll completed locked in an empty snapshot for a full refreshEvery interval (= -liveAICapReportInterval, default 25m). Until then /discover-orchestrators returned "503 cache empty", and the only workaround was setting a very short -liveAICapReportInterval. Rate-limit refreshes only once a non-empty snapshot exists. While the cache is empty, every call re-derives from the in-memory GetNetworkCapabilities() snapshot (no network I/O), so orchestrators surface as soon as they appear instead of after a full interval. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On the remote signer,
GET /discover-orchestratorsreturns503 "cache empty"for up to-liveAICapReportInterval(default 25m) after startup, even once orchestrators are advertising. The only known workaround was setting a very short-liveAICapReportInterval, which then makes the metrics poll run far more often than intended.Reported by Brad:
Root cause
The
/discover-orchestratorssnapshot inremoteDiscoveryPoolis derived from the node's network-capabilities cache (GetNetworkCapabilities()), which the orchestrator pool (DBOrchestratorPoolCache) populates on its first poll — asynchronously at startup.remoteDiscoveryPool.refresh()setlastRefresh = nowunconditionally, including when the derived snapshot was empty. So the first request that landed before that first poll completed built an empty snapshot and then rate-limited every subsequent refresh for a fullrefreshEvery(=-liveAICapReportInterval). Result: the empty result is locked in for 25m; a short interval "fixes" it only because the next allowed refresh comes soon enough to pick up the now-populated node cache.Fix
Rate-limit refreshes only once a non-empty snapshot exists:
While the cache is empty, every call re-derives from the in-memory
GetNetworkCapabilities()snapshot.refresh()does no network I/O (the actual polling is on the separateDBOrchestratorPoolCacheticker), so retrying while empty is cheap and bounded. Once populated, the normal interval applies. This removes the need to shorten-liveAICapReportInterval.Test
Adds
TestRemoteSigner_Discovery_EmptyCacheRetriesBeforeInterval: withrefreshEvery = 1h, a first request against an unpopulated node returns 503; afterUpdateNetworkCapabilities, a follow-up request within the interval returns the orchestrator. Fails onmain(503), passes with the fix.🤖 Generated with Claude Code