fix(sdk): ban rate-limited node for Envoy-advertised reset window #3951
…limits
ResourceExhausted is a congestion/backpressure signal, not endpoint ill-health. Banning a rate-limited (but healthy) node relocates its load onto survivors and cascades to NoAvailableAddressesToRetry.
Now RE is classified as rate-limited: the node is not banned, the retry rotates to a different node via explicit exclusion, and the existing bounded retry count bounds an over-limit client. Genuine-failure ban path (60s x e^n) is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 Walkthrough
Adds header-driven rate-limit banning to …
Changes: header-driven flat ban window
Sequence Diagram(s)
sequenceDiagram
participant DapiClient
participant CanRetry
participant AddressList
participant ScriptedRequest
ScriptedRequest-->>DapiClient: ResourceExhausted + ratelimit-reset
DapiClient->>CanRetry: rate_limit_ban_duration()
CanRetry-->>DapiClient: Some(period) or None
DapiClient->>AddressList: ban_for(address, period, reason) or ladder ban
AddressList-->>DapiClient: updated ban state
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 passed
✅ Review complete (commit 5d45556)
Pull request overview
This PR adjusts rs-dapi-client retry/failover behavior so gRPC ResourceExhausted (rate-limiting / backpressure) no longer triggers the exponential “health ban” logic, preventing ban-cascades under sustained rate-limits while keeping genuine node-failure banning unchanged.
Changes:
- Introduces a new CanRetry::is_rate_limited() classification (default false), implemented for gRPC Status::ResourceExhausted and delegated through TransportError, DapiClientError, and ExecutionError.
- Updates update_address_ban_status to not ban rate-limited addresses (log + rotate instead), preserving the existing ban ladder for genuine failures.
- Adds address-rotation across retries via AddressList::get_live_address_excluding(&[Address]), plus new unit/integration tests to assert both invariants and rotation behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/rs-dapi-client/tests/rate_limit_rotation.rs | New integration tests driving the real executor retry loop with a fake transport to validate “rotate, don’t ban” for ResourceExhausted. |
| packages/rs-dapi-client/src/transport/grpc.rs | Implements CanRetry::is_rate_limited() for gRPC Status (ResourceExhausted only). |
| packages/rs-dapi-client/src/transport.rs | Delegates is_rate_limited() for TransportError and adds unit tests pinning the “ResourceExhausted only” classification. |
| packages/rs-dapi-client/src/lib.rs | Extends the public CanRetry trait with is_rate_limited() (default false) and documents intended semantics. |
| packages/rs-dapi-client/src/executor.rs | Delegates is_rate_limited() through ExecutionError<E>. |
| packages/rs-dapi-client/src/dapi_client.rs | Skips banning for rate-limited errors and rotates retries away from already-tried addresses. Adds invariants-focused tests for ban ladder vs rate-limit behavior. |
| packages/rs-dapi-client/src/address_list.rs | Adds get_live_address_excluding and tests ensuring exclusion/rotation semantics and graceful fallback. |
thepastaclaw left a comment
Code Review
Narrow, well-tested fix that adds an is_rate_limited axis to CanRetry, suppresses banning for gRPC ResourceExhausted, and rotates the next retry off the throttled node via get_live_address_excluding plus a per-execution tried set. The default trait impl preserves prior behavior and the genuine-failure ban ladder is pinned by tests. Only minor docs/test-robustness nits — nothing blocking.
🟡 1 suggestion(s) | 💬 4 nitpick(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-dapi-client/src/lib.rs`:
- [SUGGESTION] packages/rs-dapi-client/src/lib.rs:107-109: Pin `is_rate_limited ⇒ can_retry` with a test or debug_assert
The fix is load-bearing on the implicit contract that any rate-limited error is also retryable: `update_address_ban_status` only consults `is_rate_limited()` from inside the `if error.can_retry()` branch. If a future change ever marks `ResourceExhausted` non-retryable (e.g. a tweak to `Status::can_retry`), rate-limit handling silently stops firing and the error falls through to the no-op tail — neither ban nor rotation happens. Lock the invariant with either a unit test asserting `is_rate_limited() ⇒ can_retry()` across the gRPC code matrix, or a `debug_assert!(!self.is_rate_limited() || self.can_retry())` on the hot path, so the two predicates can't drift unnoticed.
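To make the suggested invariant concrete, here is a minimal sketch. The trait is a simplified, hypothetical stand-in for the crate's `CanRetry`; the `Code` enum, its retryability mapping, and the `Drifted` type are illustrative assumptions, not the real gRPC classification in rs-dapi-client.

```rust
/// Simplified, hypothetical stand-in for the crate's `CanRetry` trait.
trait CanRetry {
    fn can_retry(&self) -> bool;
    /// Defaults to `false`, mirroring the default described in the PR.
    fn is_rate_limited(&self) -> bool {
        false
    }
}

/// Illustrative subset of status codes (assumption, not the full gRPC matrix).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Code {
    Ok,
    Unavailable,
    ResourceExhausted,
    InvalidArgument,
}

impl CanRetry for Code {
    // Assumed mapping for illustration only.
    fn can_retry(&self) -> bool {
        matches!(self, Code::Unavailable | Code::ResourceExhausted)
    }
    fn is_rate_limited(&self) -> bool {
        matches!(self, Code::ResourceExhausted)
    }
}

/// A deliberately drifted implementor: rate-limited but not retryable.
struct Drifted;

impl CanRetry for Drifted {
    fn can_retry(&self) -> bool {
        false
    }
    fn is_rate_limited(&self) -> bool {
        true
    }
}

/// The invariant under discussion: is_rate_limited() implies can_retry().
fn invariant_holds<E: CanRetry>(error: &E) -> bool {
    !error.is_rate_limited() || error.can_retry()
}
```

A property test would iterate every code and assert `invariant_holds`, while a `debug_assert!(invariant_holds(&error))` on the hot path would catch a `Drifted`-style implementor in debug builds.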
…limit retry, lock can_retry invariant
Three review fixes for the rotate-instead-of-ban rate-limit handling.
M1 (invariant lock): the rotate-don't-ban behavior depends on `is_rate_limited() => can_retry()`. Add a debug_assert! at the rotation decision in `execute` and an exhaustive property test (`test_rate_limit_implies_retryable_invariant`) covering every gRPC code across tonic::Status, TransportError, DapiClientError and ExecutionError, so a future widening of `is_rate_limited` (or narrowing of `can_retry`) fails loudly instead of silently killing rotation.
M2 (rotate off the throttled node + backoff): on a rate-limited retry the selection no longer re-picks the just-throttled node in small pools: the fallback now excludes the just-tried address and only reuses it when it is the single live node. The rate-limit retry path also replaces the flat 10ms delay with capped exponential backoff (base 10ms, x2 per attempt, capped at 500ms) plus full jitter, so a congested fleet de-correlates instead of retrying in lockstep. Genuine-failure delays are unchanged. Server-side RetryInfo is left as a documented TODO (needs metadata plumbing).
M3 (rs-sdk propagation): override `is_rate_limited` on the SDK `Error` to delegate to the wrapped DapiClientError, so the rotate-don't-ban semantics survive at the SDK layer independently of how `can_retry` is defined. Pure future-proofing (today the can_retry guard short-circuits first); locked with a regression test.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iant assert
Two self-introduced nits from the verify pass.
1. `rate_limit_backoff_window`'s `checked_shl` guards only the shift AMOUNT, not value bit-loss: a large `attempt` (e.g. 63) shifts the set bits out of `10 << attempt`, collapsing the window to 0ms, i.e. no backoff at all. Pathological (needs retries >= 64; default is 5) but it's a footgun the backoff math created. Clamp the shift amount at 16 (already saturates the 500ms cap, so nothing changes in the valid range) and extend the unit test with the attempt=63/64 boundary asserting the window stays at the cap.
2. Add the symmetric `debug_assert!(!is_rate_limited() || can_retry())` to the transport-client-creation error block for parity with the transport-error block. Defense-in-depth; the property test already covers the invariant.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
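The shift-clamp fix in item 1 can be sketched as below. This is a hypothetical reconstruction from the commit message (base 10ms, doubling per attempt, 500ms cap, shift clamped at 16), not the code that was committed; the constant and function names are assumptions, and the full-jitter step is omitted so the sketch stays dependency-free.

```rust
// Values taken from the commit message; names are hypothetical.
const BASE_DELAY_MS: u64 = 10;
const MAX_DELAY_MS: u64 = 500;
const MAX_SHIFT: u32 = 16;

/// Upper bound of the backoff window for a given retry attempt.
/// Clamping the shift amount keeps the set bits of `BASE_DELAY_MS`
/// from being shifted out for very large `attempt` values.
fn rate_limit_backoff_window_ms(attempt: u32) -> u64 {
    let shift = attempt.min(MAX_SHIFT);
    (BASE_DELAY_MS << shift).min(MAX_DELAY_MS)
}
```

With full jitter, the actual delay would be drawn uniformly from zero up to this window; that step needs a random source and is left out here.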
Review fixes (adversarial review pass)
Addressed reviewer findings (CodeRabbit / thepastaclaw / internal review). All MEDIUM; two LOW left open by design.
M1: …
M2: rate-limit retry hardening
M3: rs-sdk rate-limit classification propagated
Left open (LOW, by design): ban-evasion blind spot for a node that always returns …
Commits: …
🤖 Co-authored by Claudius the Magnificent AI Agent
thepastaclaw left a comment
Code Review
Cumulative review at b8a8c32. The latest delta hardens the rate-limit handling: capped exponential backoff with full jitter, three-tier address selection (excluding tried → just-tried → fallback), debug_assert! invariants on both retry arms, and a property test pinning is_rate_limited() ⇒ can_retry() across all CanRetry implementors. Prior findings #1 and #4 are resolved; #2, #3, #5 are still valid nitpicks. One new latest-delta finding: the new AddressStatus ban-ladder test inherits the same thin 50ms wall-clock tolerance. No blocking issues.
💬 3 nitpick(s)
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
Opened #3956 against this branch to address the Copilot note by matching the ban-info entry on the exact URI instead of a substring search.
…an-for-duration
Replace the "rotate-don't-ban" approach with "ban for server-dictated window":
- Add `rate_limit.rs`: decode `google.rpc.RetryInfo` from gRPC status details via minimal prost structs; `DAPI_RATE_LIMIT_BAN_MS` env-var fallback (default 60 s) when no RetryInfo is present; WASM-compatible guard on env-var read.
- Add `AddressStatus::ban_for_duration()` + `AddressList::ban_for_duration()`: set `banned_until` without touching `ban_count` so the health-ban exponential ladder is never disturbed by rate-limit events.
- Add `CanRetry::rate_limit_ban_duration()` (default `None`) and wire it through `tonic::Status`, `TransportError`, `ExecutionError`, `DapiClientError` and `dash_sdk::Error`.
- Update `update_address_ban_status`: `ResourceExhausted` now calls `ban_for_duration(server_hint.unwrap_or(fallback))` instead of skipping.
- Remove rotation machinery: `tried: Vec<Address>`, 3-tier `get_live_address_excluding` selection, jitter backoff (`retry_delay`, `rate_limit_backoff_window`, `RATE_LIMIT_MAX_DELAY_MS`), and rand import. The retry loop now always calls `get_live_address()`; rate-limited nodes are naturally absent from the live pool until their ban window expires.
- Fix `AddressBanInfo.banned`: now consistent with `get_live_address` filtering (checks only `banned_until`, not `ban_count > 0`), so rate-limit bans are visible in diagnostics.
- Delete `tests/rate_limit_rotation.rs`; update/add invariant tests.
All 419 unit/integration/doc tests pass. Zero clippy warnings.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r; drop hand-rolled rotation
Replace prost/RetryInfo-based approach with direct `ratelimit-reset` HTTP header parsing. No hand-rolled rotation machinery: banning a node already achieves rotation because `get_live_address()` skips banned nodes.
Changes:
- Delete `rate_limit.rs` (prost/RetryInfo decoder) and remove prost dep
- Replace `is_rate_limited() -> bool` on `CanRetry` trait with `rate_limit_ban_duration() -> Option<Duration>` (default `None`)
- `tonic::Status` impl: if `ResourceExhausted` + `ratelimit-reset` header is a parseable u64 > 0, return `Some(clamped [1s, 600s])`; else `None`
- Add `AddressStatus::ban_for(period, reason)` + `AddressList::ban_for` that sets `banned_until = now + period`, `ban_count = max(ban_count, 1)`; does NOT inflate the exponential health-ban ladder
- `update_address_ban_status`: dispatch to `ban_for` on `Some(period)`, fall back to existing `ban_with_reason` on `None`
- Revert execute loop to base: flat 10ms retry delay, plain `get_live_address()`
- Revert `rs-sdk/src/error.rs` to base (banning stays inside rs-dapi-client)
- Add `LIMIT_RESPONSE_HEADERS_ENABLED=true` to dashmate rate-limiter compose
- Replace deleted `tests/rate_limit_rotation.rs` with `tests/rate_limit_ban.rs` (5 focused tests for the header-parse path)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
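The header-parse rule described in this commit (parseable u64 greater than zero, clamped to [1s, 600s], otherwise `None`) can be sketched roughly as follows. This is an illustration under stated assumptions: the function name and `MAX_RATE_LIMIT_BAN_SECS` are hypothetical, the header value is passed in as a plain string rather than read from `tonic` metadata, and the `ResourceExhausted` status check is left to the caller.

```rust
use std::time::Duration;

// `MIN_RATE_LIMIT_BAN_SECS` is named in the PR; the MAX name is an assumption.
const MIN_RATE_LIMIT_BAN_SECS: u64 = 1;
const MAX_RATE_LIMIT_BAN_SECS: u64 = 600;

/// Hypothetical helper: turn a raw `ratelimit-reset` header value into a ban window.
/// Returns `None` for a missing, unparseable, or zero value, in which case the
/// caller would fall back to the exponential health-ban ladder.
fn ban_duration_from_reset_header(raw: Option<&str>) -> Option<Duration> {
    let secs = raw?.parse::<u64>().ok()?;
    if secs == 0 {
        // Zero is rejected before the clamp, so the lower bound never activates.
        return None;
    }
    let clamped = secs.clamp(MIN_RATE_LIMIT_BAN_SECS, MAX_RATE_LIMIT_BAN_SECS);
    Some(Duration::from_secs(clamped))
}
```

For example, a value of `45` would yield a 45 s window, `9999` would be capped at 600 s, and `0` or a non-numeric value would yield `None`.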
… fixes
QA-001: Correct `AddressStatus::ban_for` / `AddressList::ban_for` docstrings: the "no ladder inflation" claim was false for fresh nodes (ban_count 0 → 1 raises the next genuine health ban from ~60 s to ~163 s). Docs now describe the floor/side-effect accurately. New test pins the 0→1 ladder-floor behaviour.
QA-002: Fix `MIN_RATE_LIMIT_BAN_SECS` comment: the `> 0` filter before the clamp already rejects 0, so MIN=1 is a documentation/intent marker, not an active lower clamp.
QA-003: De-tautologize "bad/zero/missing header → ladder" tests. Previous assertions (ban_count == 1) passed for BOTH ban_for and ban_with_reason. Now also assert the banned_until window is ~60 s (first ladder rung), which would fail if a bad header were wrongly routed to ban_for.
QA-004: Add clamp-edge coverage: 1 → 1 s, 600 → 600 s, 601 → 600 s (was only 45 and 9999).
QA-005: Add test that a ban_for'd address re-enters rotation via get_live_address after its window expires (zero-duration re-ban simulates an expired window).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
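The ~60 s to ~163 s figure in QA-001 follows from the ladder arithmetic. The sketch below assumes the ladder is exactly 60 s × e^n, with n being the `ban_count` in effect before the new ban, as the "60s x e^n" wording in the PR description suggests; the function name is hypothetical.

```rust
/// Hypothetical model of the health-ban ladder: 60 s multiplied by e^ban_count.
fn ladder_ban_secs(ban_count: u32) -> f64 {
    60.0 * f64::from(ban_count).exp()
}
```

Under that assumption a fresh node (`ban_count` 0) gets a 60 s health ban, while a node whose count was already raised to 1 by `ban_for` gets 60 × e ≈ 163.1 s, which is the side effect the corrected docstring describes.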
@thepastaclaw review most recent commit
thepastaclaw left a comment
Code Review
Carried-forward prior findings reconcile cleanly against HEAD 2c76a14: ban_for still unconditionally overwrites banned_until (blocking regression introduced by this PR), Envoy's CORS expose_headers still omits the ratelimit-* family (suggestion — browser/WASM clients silently fall back to the ladder), and the rate-limit path still emits a success-shaped debug log before calling ban_for (nitpick). No new latest-delta findings beyond what the agents and prior review already flagged. Quality-agent suggestions about ban_count semantics, constant placement, and timing tolerances are out-of-scope architectural/test-stability commentary and were dropped.
🟡 1 suggestion(s)
1 additional finding(s) omitted (not in diff).
2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: Expose `ratelimit-*` headers via CORS so browser/WASM clients can read them
This PR teaches `rs-dapi-client` to read Envoy's `ratelimit-reset` header and pin an exact ban window from it. The WASM transport (`tonic_web_wasm_client`, used by `wasm-sdk`) reaches the gateway through browser `fetch`, which only surfaces non-safelisted response headers that the server enumerates in `Access-Control-Expose-Headers`. The current `expose_headers` list (line 201) does not include `ratelimit-limit`, `ratelimit-remaining`, or `ratelimit-reset`, so browser clients receive `ResourceExhausted` but cannot see the reset window and fall back to the exponential health-ban ladder — defeating the PR's stated improvement for the WASM SDK path. Native Rust clients are unaffected; the fix is one line on the gateway template.
thepastaclaw left a comment
Code Review correction
The previous exact-SHA automated review at 2c76a14f was posted as COMMENT, but the verifier output still contains a blocking carried-forward finding. Requesting changes so the PR state matches the verified result.
🔴 1 blocking | 🟡 1 suggestion | ⚪ 1 nitpick
Blocking: packages/rs-dapi-client/src/address_list.rs:116-120 — ban_for still unconditionally writes banned_until = Some(now + period), so a later short ratelimit-reset response can shorten an existing longer health/rate-limit ban and return the node to rotation early. Preserve the longer existing/advertised deadline.
Suggestion: packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201 — expose ratelimit-limit, ratelimit-remaining, and ratelimit-reset in CORS so browser/WASM clients can read the reset window instead of falling back to the exponential ladder.
Nitpick: packages/rs-dapi-client/src/dapi_client.rs:213-225 — the rate-limit path logs a success-shaped ban message before ban_for returns, producing contradictory logs when the address has been removed.
Run artifacts: /Users/claw/.openclaw/workspace/worktrees/review-platform-3951/.review/run-1782229049-2c76a14f.
…cs and log ordering
Fix 1 (BLOCKING): `AddressStatus::ban_for` now uses max-semantics: it extends `banned_until` only when `now + period` is later than the current `banned_until`, preventing a short `ratelimit-reset` response from shortening a longer active ban. `ban_reason` is updated only when the window extends; `ban_count` is raised to `max(ban_count, 1)` unconditionally. Docstring updated to present-state semantics.
Fix 2 (Suggestion): the "rate-limited: banning" debug log in `update_address_ban_status` now fires inside the `if banned` branch only, so it cannot contradict a subsequent "unable to ban … not in the list anymore" trace when the address was SML-removed.
Test: `test_ban_for_never_shortens_active_ban` in `tests/rate_limit_ban.rs` asserts (a) LONG→SHORT does NOT reduce `banned_until`, (b) SHORT→LONG extends it, and (c) `ban_count` ends at ≥ 1 in both cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
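The max-semantics described in Fix 1 can be illustrated with a small sketch. The struct below is a hypothetical stand-in for `AddressStatus`, with field names taken from the commit message; passing `now` explicitly is an assumption made to keep the example deterministic, and the real method's signature may differ.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for the crate's `AddressStatus`.
#[derive(Debug, Default)]
struct AddressStatusSketch {
    ban_count: usize,
    banned_until: Option<Instant>,
    ban_reason: Option<String>,
}

impl AddressStatusSketch {
    /// Ban for a flat window, never shortening an active ban.
    fn ban_for(&mut self, now: Instant, period: Duration, reason: &str) {
        let candidate = now + period;
        // Extend only when the new deadline is later than the current one.
        let extends = self
            .banned_until
            .map_or(true, |current| candidate > current);
        if extends {
            self.banned_until = Some(candidate);
            // The reason follows the window: updated only on extension.
            self.ban_reason = Some(reason.to_string());
        }
        // The count is floored at 1 unconditionally, never incremented further.
        self.ban_count = self.ban_count.max(1);
    }
}
```

Calling `ban_for` with a 300 s window and then a 5 s window leaves the 300 s deadline and its reason in place, while a later 600 s call extends both.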
…or test
The previous fix swapped `ban_for(ZERO)` for `unban()`, but those exercise different code paths: `get_live_address` reinstates a node on `banned_until < now` independent of `ban_count`, while `unban()` also zeroes `ban_count`. Restore the expiry path by back-dating `banned_until` directly (white-box, in-module test) after the 300s ban + hidden assertion.
Two invariants now pinned:
- `get_live_address()` returns the node once the window is in the past.
- `is_banned()` is still true (ban_count > 0); window-expiry ≠ unban.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o-shorten test hardening
QA-006: Document that `AddressStatus::ban_with_reason` intentionally always
re-bases `banned_until` (unconditional overwrite for exponential escalation).
Note that the no-shorten max-semantics invariant is scoped to ban_for→ban_for
interactions only and does not apply to this method.
QA-007: Strengthen `test_ban_for_never_shortens_active_ban` case (a) with an
exact `assert_eq!` on a `banned_until` snapshot taken between the LONG and SHORT
calls, making a last-wins regression fail unambiguously rather than relying on
a >=299s threshold that could theoretically pass with wrong semantics.
QA-008: Add named reasons ("long-reason" / "short-reason") to both test cases
and assert the `ban_reason` update contract:
- case (a) LONG→SHORT: reason preserved from LONG call (SHORT must not overwrite)
- case (b) SHORT→LONG: reason adopted from LONG call (window extended, so reason updates)
No production logic changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…den case (b) last-wins guard
DOC (ban_with_reason): explicitly name the cross-method shortening scenario: a health failure on a node holding a longer rate-limit window (set via ban_for) will re-base banned_until to the exponential value, which may be shorter. This is intentional: the health-ban ladder owns the window for genuinely-unhealthy nodes; the no-shorten guarantee is scoped to ban_for→ban_for sequences only.
DOC (ban_for): add cross-reference note that the no-shorten guard is ban_for-only and that ban_with_reason re-bases unconditionally; see its docs.
TEST QA-007 (case b): add a trailing ban_for(SHORT) after the LONG extension, snapshot banned_until with assert_eq!, and explain in a comment that case (a) + this trailing-short assertion together kill both MIN-wins and last-wins regressions. Under a last-wins implementation the trailing SHORT would clobber the LONG window and the equality check would fail unambiguously.
Zero production-logic change: git diff on src/ shows doc comments only.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…led config toggle
Replaces the hard-coded LIMIT_RESPONSE_HEADERS_ENABLED=true in
docker-compose.rate_limiter.yml with a first-class dashmate config option,
defaulting to true so existing deployments keep emitting RateLimit-* headers.
rs-dapi-client reads RateLimit-Reset to apply a precise server-advertised
ban window instead of the exponential health-ban ladder; this header toggle
must stay enabled for the ban-for-duration feature in rs-dapi-client to work.
Changes:
- configJsonSchema.js: rateLimiter.responseHeaders object schema with
description linking it to rs-dapi-client ban-for-duration behaviour
- getBaseConfigFactory.js: default responseHeaders.enabled=true; all derived
configs (local/testnet/mainnet) inherit via lodashMerge — only base updated
- getConfigFileMigrationsFactory.js: migration at 4.0.0-rc.2 sets
responseHeaders from base default for any existing config missing the field
- docker-compose.rate_limiter.yml: LIMIT_RESPONSE_HEADERS_ENABLED wired from
${PLATFORM_GATEWAY_RATE_LIMITER_RESPONSE_HEADERS_ENABLED:?err}, which
convertObjectToEnvs generates automatically from the config path
- docs/config/gateway.md: table row documenting the new option
Validation: static checks confirm schema/factory/migration/env-var/docs
are consistent. Migration spec passes in main tree; worktree mocha runner
has a pre-existing ERR_REQUIRE_CYCLE_MODULE issue unrelated to these changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4.0.0-rc.2 is already a released tag; semver.gt('4.0.0-rc.2','4.0.0-rc.2')
is false, so the migration runner's gt-filter would skip it entirely for any
operator already on rc.2 — leaving responseHeaders absent and breaking
schema validation + compose :?err on startup.
4.0.0 is gt above every 4.0.0-prerelease (rc.2, beta.x, etc.), matching the
pattern established by the '3.1.0' migration key. The idempotent guard makes
a re-run a no-op.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🔬 Verification evidence: server-advertised ban window
Documenting the real-network + client-side verification behind this PR. It is deliberately honest about one environmental limitation (a SHA-1 devnet TLS cert) that prevents a single-process live end-to-end run. Both ends of the chain are proven directly; the only inch between them is standard …
1. Server side: gateway emits …
| Link in the chain | Proven | How |
|---|---|---|
| Gateway emits ratelimit-reset on over-limit | ✅ live | grpcurl vs running gateway (flag on); absent with flag off |
| Client parses header → ban_for(window) | ✅ | mock H2C metadata → ban expires in 60.0s |
| Client clamps / falls back to ladder / never shortens | ✅ | rate_limit_ban integration tests |
| Metadata transport (server → client) | standard | tonic::Status trailers (no new code path) |
Both ends are proven directly; the only unproven inch is the standard gRPC trailer hop between them, which tonic already handles. Confidence: high.
🤖 Co-authored by Claudius the Magnificent AI Agent
thepastaclaw left a comment
Code Review
Incremental review reconciling against 2c76a14. Prior blocking finding (ban_for shortening) and prior nit (debug log before ban) are both FIXED in c49adb2 — see address_list.rs:133-144 and dapi_client.rs:218-226 plus their regression tests. Prior CORS expose_headers concern remains valid: the PR's WASM/browser path depends on the gateway exposing ratelimit-* headers, but envoy.yaml.dot was not touched. One new in-scope suggestion: the migration is keyed at 4.0.0 while package.json is still 4.0.0-rc.2, so rc.2-stamped configs hit a self-validation gap until the next version bump — the author has acknowledged this in the commit message but the coordination should be confirmed by maintainers.
🟡 2 suggestion(s)
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: Expose ratelimit-* headers via CORS so browser/WASM clients can read the reset window
The PR's whole value-add is letting rs-dapi-client read `ratelimit-reset` and pin an exact ban window from it. On the native gRPC transport this works because metadata is passed through transparently, but the WASM transport (`tonic-web-wasm-client`) goes through the browser Fetch API, which only surfaces response headers that the server lists in CORS `expose_headers`. The current list (`custom-header-1,grpc-status,grpc-message,code,drive-error-data-bin,dash-serialized-consensus-error-bin,stack-bin`) omits `ratelimit-limit`, `ratelimit-remaining`, and `ratelimit-reset`, so browser/WASM clients will silently fall back to the exponential health-ban ladder even when Envoy/RLS advertised the precise window enabled by the new dashmate `responseHeaders.enabled` toggle. This was flagged in the prior review and is not addressed in c49adb26.
In `packages/dashmate/src/config/configJsonSchema.js`:
- [SUGGESTION] packages/dashmate/src/config/configJsonSchema.js:708: responseHeaders is schema-required but the 4.0.0 migration won't fire for existing 4.0.0-rc.2 configs at the current package version
`configJsonSchema.js:708` makes `responseHeaders` a required field, and `getConfigFileMigrationsFactory.js:1523` backfills it via the `'4.0.0'` migration. `ConfigFileJsonRepository.read()` calls `migrateConfigFile(configData, configFormatVersion, packageVersion)` (ConfigFileJsonRepository.js:48-52), and `migrateConfigFile` returns the input unchanged when `fromVersion === toVersion` (migrateConfigFileFactory.js:12-14). `package.json` is still `4.0.0-rc.2` at this commit, so any operator already stamped at `4.0.0-rc.2` will hit `fromVersion === toVersion`, skip the new migration, and then fail AJV validation on the missing `responseHeaders` block. Older configFormatVersions migrate cleanly because `semver.gt('4.0.0', '<older>')` is true. The commit message (`fix(dashmate): key responseHeaders migration at 4.0.0 not released rc.2`) shows this is a deliberate release-coordination decision: the migration is expected to fire when the package bumps to `4.0.0` final. Please confirm the release flow will bump `package.json` to a version `> 4.0.0-rc.2` at the same time this PR ships; otherwise add a no-op `4.0.0-rc.3` migration, drop `responseHeaders` from the schema `required` array, or bump the package version in this PR.
…d with shared fake-transport harness
A `ResourceExhausted` response carrying `ratelimit-reset: 300` that travels through the real `DapiClient::execute()` loop reaches `ban_for`, not `ban_with_reason`: the new test asserts a ~300 s ban window (not the ~60 s ladder rung) and `ban_count == 1`, closing the gap left by unit tests that call `update_address_ban_status` directly.
Shared harness (`tests/common/mod.rs`): `FakeClient`, `FakeResponse`, and `ScriptedRequest` (closure-driven, records hit URIs), reused by both `rate_limit_ban.rs` (new test) and a refactored `unimplemented_failover.rs` (all 3 existing tests preserved and green).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
….3, not 4.0.0
The migration ships in rc.3; keying at the next version (not the far-future 4.0.0 final) matches the 3.1.0 precedent and avoids a future '4.0.0' object-key collision.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
thepastaclaw left a comment
Code Review
Incremental delta (458a28e) is test-only — adds a shared fake-transport harness and an end-to-end DapiClient::execute() rate-limit ban test. Prior fixes (ban_for no-shorten max-semantics, debug-log scoped inside the ban branch) remain in place and are well-covered. Two prior findings carry forward unaddressed: (1) Envoy CORS expose_headers still does not surface ratelimit-*, so browser/WASM clients cannot observe the reset window and silently fall back to the exponential ladder; (2) dashmate schema requires responseHeaders while package.json is still 4.0.0-rc.2, so existing rc.2 configs hit migrateConfigFile's fromVersion === toVersion early-return and fail schema validation. New agent findings were nitpicks/style and pre-existing spirit; dropped per scope discipline.
🟡 1 suggestion(s)
1 additional finding(s) omitted (not in diff).
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: CORS `expose_headers` does not surface `ratelimit-*`, so browser/WASM clients can never read the reset window
This PR's whole value is letting rs-dapi-client read `ratelimit-reset` (now emitted because dashmate enables `responseHeaders`) and ban a node for the server-advertised window instead of climbing the exponential ladder. Native tonic transports pass metadata through transparently, but the WASM path uses `tonic-web-wasm-client`, which goes through the browser Fetch API. Browsers only expose non-safelisted response headers that are listed in CORS `expose_headers`. The current list at envoy.yaml.dot:201 omits `ratelimit-limit`, `ratelimit-remaining`, and `ratelimit-reset`, so browser/WASM SDK clients will silently fall back to the exponential health-ban ladder even though the headers are emitted. Add the three RateLimit headers to `expose_headers` so the new ban-window behavior actually applies on the WASM transport.
thepastaclaw left a comment
Code Review correction
The automated verifier output for 458a28e2 contains a blocking carried-forward finding, but the first exact-SHA post went out as COMMENT after the inline poster omitted the non-diff carried-forward item. Requesting changes so the PR state matches the verified result.
🔴 1 blocking | 🟡 1 suggestion
Blocking: packages/dashmate/src/config/configJsonSchema.js:708 — responseHeaders is now schema-required, but the backfill is keyed to the 4.0.0 migration while packages/dashmate/package.json is still 4.0.0-rc.2. Existing rc.2-stamped configs hit migrateConfigFile's fromVersion === toVersion early return, skip the new migration, and then fail AJV validation for the missing responseHeaders block. Bump the package version in lockstep, add an rc migration, or relax the schema until the final-version migration can run.
Suggestion: packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201 — expose ratelimit-limit, ratelimit-remaining, and ratelimit-reset in CORS expose_headers so browser/WASM clients can read the reset window instead of falling back to the exponential ladder.
Run artifacts: /Users/claw/.openclaw/workspace/worktrees/review-platform-3951/.review/run-1782294645-458a28e2.
thepastaclaw left a comment
Code Review
Two prior findings remain unresolved at b23008e: (1) the dashmate responseHeaders schema/migration coordination gap — the migration is now keyed at 4.0.0-rc.3 while package.json is still 4.0.0-rc.2, so existing rc.2 configs short-circuit the migration runner and then fail AJV validation against the now-required responseHeaders field; (2) the CORS expose_headers list in envoy.yaml.dot still omits ratelimit-*, blocking wasm/browser clients from reading the very header this PR teaches the client to consume. The two prior Rust findings (ban_for no-shorten guard, debug-log placement inside the success branch) are confirmed fixed in this revision. Two additional review-agent nits this round (expect! interpolation, dead MIN_RATE_LIMIT_BAN_SECS constant) were dropped as low-value.
🔴 1 blocking | 🟡 1 suggestion(s)
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/dashmate/configs/getConfigFileMigrationsFactory.js`:
- [BLOCKING] packages/dashmate/configs/getConfigFileMigrationsFactory.js:1523-1541: Existing 4.0.0-rc.2 configs break: required responseHeaders is queued behind a 4.0.0-rc.3 migration but package.json is still rc.2
This commit re-keys the responseHeaders backfill from `4.0.0` to `4.0.0-rc.3`, but `packages/dashmate/package.json` still reports `4.0.0-rc.2`. `ConfigFileJsonRepository.read()` calls `migrateConfigFile(configFileData, configFileData.configFormatVersion, version)` with `version` read from package.json; `migrateConfigFileFactory` early-returns at `fromVersion === toVersion` (`packages/dashmate/src/config/configFile/migrateConfigFileFactory.js:12-14`). So for an operator already on rc.2, fromVersion === toVersion === `4.0.0-rc.2` and the new migration never runs.
Meanwhile this PR makes `responseHeaders` a required field of the rate-limiter schema (`packages/dashmate/src/config/configJsonSchema.js:705,708`), and the per-config schema is enforced by `new Config(name, opts, skipValidation)` inside `ConfigFileJsonRepository.read()`. The result: any existing rc.2 deployment that pulls this PR before package.json bumps to rc.3 fails AJV validation on read with a missing `responseHeaders` error, even though the migration that would have backfilled it is sitting one version ahead.
The comment correctly documents the design intent ("runs once the package bumps to rc.3"), so the change is internally consistent — but the coupling to the rc.3 version bump must land in the same PR. Either bump `packages/dashmate/package.json` to `4.0.0-rc.3` in this PR, or temporarily drop `responseHeaders` from the schema `required` list so existing rc.2 configs still validate until the migration can run.
In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: CORS expose_headers omits ratelimit-*, so wasm/browser clients can't read the reset window this PR depends on
This PR's value is letting clients read `ratelimit-reset` (now emitted because dashmate enables responseHeaders) and ban a node for the server-advertised window instead of climbing the exponential ladder. Native tonic transports surface response metadata transparently, but browser/wasm clients consuming the gateway via grpc-web can only read non-CORS-safelisted response headers that Envoy explicitly enumerates in `expose_headers`. The current list (`custom-header-1,grpc-status,grpc-message,code,drive-error-data-bin,dash-serialized-consensus-error-bin,stack-bin`) does not include `ratelimit-limit`, `ratelimit-remaining`, or `ratelimit-reset`, so a wasm-sdk consumer of this same gateway sees a `429`/`ResourceExhausted` with no readable reset hint — defeating the per-window ban path for that transport.
This is in scope: dashmate ships the envoy template, dashmate now enables emission of these headers by default, and the entire chain (`configJsonSchema` → `docker-compose.rate_limiter.yml` → envoy → client) is being changed in this PR for the purpose of consuming exactly that header. Add the three `ratelimit-*` headers to `expose_headers`.
…it) + expose RateLimit-* so browser clients can read the ban window cors+grpc_web must precede ratelimit so the over-limit local reply is CORS-exposed + grpc-web-framed, enabling wasm/grpc-web clients to read ratelimit-reset for the same node-backoff the native client does (native path unaffected — security-reviewed). Note: the grpc_web-encode behavior on a trailers-only over-limit local reply is spec-inferred and needs a smoke test on the pinned Envoy build (see PR #3951). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sted so browser clients ban for the advertised window An HCM local_reply_config mapper rewrites the grpc-web rate-limit 429 (scoped via x-grpc-web, so JSON-RPC and native paths stay untouched) into HTTP 200 + grpc-status:8, co-locating grpc-status and ratelimit-reset as HTTP headers (trailers-only, empty body). tonic-web then builds a ResourceExhausted Status with ratelimit-reset in its metadata via Status::from_header_map, giving browser clients the same node-backoff the native client gets — no client change. Live-verified both gates against the pinned Envoy build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015i49U5QMneGDJp3AHFh9po
thepastaclaw
left a comment
Code Review
Cumulative review: rs-dapi-client rate-limit-driven flat-ban path is well-implemented and well-tested; the latest delta (gateway filter reorder, grpc-web trailers-only 429 rewrite, expose_headers, .env.*.bak gitignore) is narrow and live-verified. Prior CORS exposure gap is fixed at envoy.yaml.dot:259. One prior blocking issue remains: the responseHeaders schema requirement is shipped with a migration keyed at 4.0.0-rc.3 while package.json still reports 4.0.0-rc.2, so existing rc.2 configs will fail validation until the package version is bumped.
🚫 Requesting changes: 1 blocking carried-forward finding remains valid on 5d45556c.
Carried-forward finding still valid
- Existing 4.0.0-rc.2 configs break: required responseHeaders is queued behind a 4.0.0-rc.3 migration, but package.json is still 4.0.0-rc.2 —
packages/dashmate/configs/getConfigFileMigrationsFactory.js:1523-1541
The responseHeaders backfill is re-keyed to 4.0.0-rc.3, but packages/dashmate/package.json:3 still reports 4.0.0-rc.2. ConfigFileJsonRepository.read() calls migrateConfigFile(configFileData, configFileData.configFormatVersion, version) with version from package.json, and migrateConfigFileFactory.js:12 early-returns when fromVersion === toVersion. For an operator on an existing rc.2 config, the rc.3 migration never runs — yet configJsonSchema.js:708 already lists responseHeaders in the required array, so the config fails schema validation on load. The in-code comment at lines 1530-1533 documents the intent to wait for a future rc.3 bump, but the schema requirement is live now. Either bump dashmate to 4.0.0-rc.3 in this PR (so the migration runs and backfills responseHeaders for existing rc.2 configs), or drop responseHeaders from the required array until the rc.3 release cuts.
I am not re-opening the same inline thread here because the finding has already been raised on this PR; this review records exact-SHA coverage and keeps the blocking status attached to the current head.
Prior findings reconciled
- FIXED: CORS expose_headers omits ratelimit-*, so wasm/browser clients can't read the reset window this PR depends on — Fixed at packages/dashmate/templates/platform/gateway/envoy.yaml.dot:259 — expose_headers now includes ratelimit-reset, ratelimit-limit, and ratelimit-remaining, allowing tonic-web-wasm-client to surface the reset metadata.
CodeRabbit inline findings: 0; no CodeRabbit reactions requested.
thepastaclaw
left a comment
Code Review
Cumulative review: prior-2 (CORS expose_headers) is FIXED at envoy.yaml.dot:259, which lists ratelimit-reset/limit/remaining so browser/wasm clients can read the ban window. Prior-1 (responseHeaders schema/migration vs package.json) is technically still open but now explicitly documented as intentionally deferred to the next version bump (mirroring the 3.1.0 pattern); carried forward as a suggestion. The new grpc-web local_reply mapper plus filter reorder are correctly scoped (gated by x-grpc-web + HTTP 429 + rateLimiter.enabled) and the trailers-only rewrite preserves grpc-status:8 and ratelimit-reset for tonic-web's Status::from_header_map. No new in-scope blockers; remaining agent items were nitpick-grade operational hardening.
🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/dashmate/configs/getConfigFileMigrationsFactory.js`:
- [SUGGESTION] packages/dashmate/configs/getConfigFileMigrationsFactory.js:1523-1541: responseHeaders backfill keyed at 4.0.0-rc.3 leaves a validation gap until the version bump lands
The schema marks `platform.gateway.rateLimiter.responseHeaders` as required (configJsonSchema.js:708) and the base config defaults `responseHeaders.enabled: true` (getBaseConfigFactory.js:244), but the backfill migration is keyed at `4.0.0-rc.3` while `packages/dashmate/package.json` is still `4.0.0-rc.2`. Per `migrateConfigFileFactory`, a migration only runs when `semver.gt(version, fromVersion)`, so an existing rc.2 deployment running this PR's dashmate (still reporting rc.2) skips the new migration and then fails schema validation on the missing required field. The added comment now explicitly documents the intent — mirror the 3.1.0/3.1.0-dev.1 pattern and let the backfill fire once dashmate bumps to rc.3 — which is fine as long as the release process couples this PR with the rc.3 version bump. If they could ship decoupled (e.g. a dashmate command run between merge and the rc.3 cut), either bump the package version in the same release cycle as this PR or make `responseHeaders` non-required in the schema with a defaulted reader. Carried forward from prior review at reduced severity now that the deferred-intent rationale is in-tree.
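The version gate this finding turns on can be modelled in a few lines. The sketch below is a simplified Rust model, not dashmate's JavaScript: `Version`, `key`, and `migrations_to_run` are hypothetical names, versions are reduced to `(major, minor, patch, rc)`, and only the two behaviours quoted in the review are modelled (the early return when `fromVersion === toVersion`, and selecting migration keys greater than `fromVersion`). Any upper bound tied to `toVersion` is deliberately left out because the review does not describe one.

```rust
/// Hypothetical, simplified version: `rc == None` stands for a final release.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Version {
    major: u32,
    minor: u32,
    patch: u32,
    rc: Option<u32>,
}

impl Version {
    /// Sort key that orders a final release after all of its release candidates.
    fn key(self) -> (u32, u32, u32, u32) {
        (self.major, self.minor, self.patch, self.rc.unwrap_or(u32::MAX))
    }
}

/// Returns the migration keys that would run, in ascending order.
/// Models only what the review states: identical versions short-circuit,
/// otherwise every key strictly greater than `from` is selected.
fn migrations_to_run(keys: &[Version], from: Version, to: Version) -> Vec<Version> {
    if from == to {
        // Early return: nothing runs, including a migration keyed one version ahead.
        return Vec::new();
    }
    let mut due: Vec<Version> = keys
        .iter()
        .copied()
        .filter(|k| k.key() > from.key())
        .collect();
    due.sort_by_key(|k| k.key());
    due
}
```

With a config stamped `4.0.0-rc.2` and a package that still reports `4.0.0-rc.2`, the model returns an empty list, so a backfill keyed at `4.0.0-rc.3` is skipped while the schema already requires the field. Once the package reports `4.0.0-rc.3`, the same call selects the backfill.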
Live devnet verification Verified this change end-to-end against a running local devnet gateway already serving the committed Envoy config (filter reorder + 🤖 Co-authored by Claudius the Magnificent AI Agent |
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## v3.1-dev #3951 +/- ##
=============================================
+ Coverage 52.54% 87.17% +34.62%
=============================================
Files 11 2629 +2618
Lines 1707 327221 +325514
=============================================
+ Hits 897 285265 +284368
- Misses 810 41956 +41146
Why this PR exists
- `rs-dapi-client` treats a gRPC `ResourceExhausted` (per-IP rate-limit / backpressure) the same as a node-down failure: it applies the `60s × e^ban_count` health ban to the address. Banning a healthy-but-throttled node doesn't shed load, it relocates it onto the remaining nodes.
- `ResourceExhausted` → banned 60s → its traffic shifts to B/C → they cross the limit → banned → … → `NoAvailableAddressesToRetry`, with zero server faults.
- `v3.1-dev`.

What was done?
A single rate-limit ban mechanism, driven by the duration Envoy already advertises:
- `CanRetry::rate_limit_ban_duration(&self) -> Option<Duration>` (default `None`), implemented for `tonic::Status` (`packages/rs-dapi-client/src/transport/grpc.rs`): returns `None` unless the code is `ResourceExhausted`; otherwise parses the `ratelimit-reset` response-metadata header (whole seconds), filters out `0`/non-numeric, and clamps to `[1s, 600s]`. Delegated unchanged through `TransportError → DapiClientError / ExecutionError`.
- `update_address_ban_status` dispatch (`dapi_client.rs`): `Some(period)` → `AddressList::ban_for(address, period)`; `None` → the existing `ban_with_reason` exponential health-ban ladder (the fallback is the normal ladder, not a hardcoded default).
- `AddressList::ban_for` / `AddressStatus::ban_for` (`address_list.rs`): advances `banned_until` to `now + period` only when that extends the current window (max-semantics: a short rate-limit reset can't shorten a longer active ban under out-of-order completions on the shared `AddressList`), updates `ban_reason` only on extension, and raises `ban_count` to a floor of `1` (so `is_banned()` stays consistent with `banned_until`). The rate-limit path is flat: it never inflates the exponential ladder. Reinstatement is the existing `banned_until`-expiry path.
- `LIMIT_RESPONSE_HEADERS_ENABLED` on the Lyft RLS container (which makes Envoy emit `RateLimit-Reset`, surfaced to the client as `ratelimit-reset` gRPC metadata) is now driven by a first-class config option `platform.gateway.rateLimiter.responseHeaders.enabled` (default on, since the ban-for-duration feature depends on the header) instead of a hard-coded `=true`. Spans the JSON schema, the base-config default, a config migration keyed at the next release `4.0.0-rc.3` (back-fills existing deployments, default on, so an upgrade never silently disables it), env rendering in `docker-compose.rate_limiter.yml`, and `docs/config/gateway.md` (documents the privacy trade-off so a cautious operator can switch it off).
- (`packages/dashmate/templates/platform/gateway/envoy.yaml.dot`): reordered the Envoy HTTP filters to `cors → grpc_web → ratelimit → router` and added `ratelimit-reset` / `ratelimit-limit` / `ratelimit-remaining` to CORS `expose_headers`, so browser (grpc-web / wasm-sdk) clients can read the over-limit `RateLimit-Reset` and apply the same per-node backoff as the native client (the ban logic in `transport/grpc.rs` already compiles for wasm). The over-limit response is a local reply from the ratelimit filter, and an Envoy local reply only traverses encoder filters positioned above the filter that generated it (envoyproxy/envoy#11776); placing `cors` + `grpc_web` above `ratelimit` is what gets that reply CORS-exposed and grpc-web-framed on encode. The native path is unaffected (security-reviewed against Envoy source): `grpc_web` latches `is_grpc_web_request_=false` and early-returns for `application/grpc`, and `cors` no-ops without an `Origin` (and only ever appends headers). The one behavioural delta is that `OPTIONS` preflights now short-circuit at `cors` before the limiter (negligible; arguably a fix).
- (`local_reply_config`): the reorder alone isn't enough. Envoy's gRPC-detection matches `application/grpc` only, so `rate_limited_as_resource_exhausted` never tags the grpc-web path and the browser would get a bare `HTTP 429` (which tonic-web maps to `Unavailable`, not `ResourceExhausted`, so the ban never fires). A `local_reply_config` mapper, scoped to `status == 429` and request header `x-grpc-web` present so that JSON-RPC and native gRPC are untouched, rewrites that reply to `HTTP 200` + `grpc-status: 8`. Because the reply is headers-only (empty body), `grpc_web` passes it through without reframing, so `grpc-status` and `ratelimit-reset` stay co-located in the HTTP response headers; `tonic-web-wasm-client` (→ `tonic 0.14.6`) then builds a `ResourceExhausted` Status with `ratelimit-reset` in `status.metadata()` via `Status::from_header_map`, firing the same `ban_for` the native client uses. No client-side code change.

Review fixes folded in
- `ratelimit-reset` arriving after a longer ban no longer truncates the window; `ban_reason` is preserved unless the window extends. Pinned by `test_ban_for_never_shortens_active_ban`.
- (`dapi_client.rs`): the `banning … for Ns (from RateLimit-Reset header)` debug line now fires inside the `if banned` branch, so it can no longer precede/contradict the `unable to ban … not in the list anymore` trace when `ban_for` returns `false`.

Net effect: a throttled node is banned for exactly the window the server says it needs, no more and no less, instead of a fixed exponential health ban; genuine node ill-health still bans exactly as before.
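The header handling and the no-shorten rule described above can be illustrated with a minimal, std-only Rust sketch. It is not the crate's code: `Code`, `MIN_BAN_SECS`, `MAX_BAN_SECS`, the free function `rate_limit_ban_duration`, and this `AddressStatus` are hypothetical stand-ins. The real implementation reads `tonic::Status` metadata; here the header value is passed in as `Option<&str>`, time is modelled with `std::time::Instant`, and `now` is an explicit parameter so the behaviour is deterministic.

```rust
use std::time::{Duration, Instant};

// Assumed bounds, taken from the `[1s, 600s]` clamp in the description.
const MIN_BAN_SECS: u64 = 1;
const MAX_BAN_SECS: u64 = 600;

/// Hypothetical stand-in for the gRPC status code; only two cases are needed here.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Code {
    ResourceExhausted,
    Unavailable,
}

/// Returns a flat ban window only for `ResourceExhausted` with a usable
/// `ratelimit-reset` value (whole seconds). `None` means "use the ladder".
fn rate_limit_ban_duration(code: Code, ratelimit_reset: Option<&str>) -> Option<Duration> {
    if code != Code::ResourceExhausted {
        return None;
    }
    // Missing, empty, or non-numeric values fall through to `None`.
    let secs: u64 = ratelimit_reset?.parse().ok()?;
    if secs == 0 {
        return None;
    }
    Some(Duration::from_secs(secs.clamp(MIN_BAN_SECS, MAX_BAN_SECS)))
}

/// Hypothetical per-address state mirroring the fields named in the description.
#[derive(Debug, Default)]
struct AddressStatus {
    ban_count: u32,
    banned_until: Option<Instant>,
    ban_reason: Option<String>,
}

impl AddressStatus {
    /// Flat ban with max-semantics: the window and reason change only when
    /// `now + period` ends later than the ban already in place.
    fn ban_for(&mut self, now: Instant, period: Duration, reason: &str) {
        let candidate = now + period;
        let extends = self.banned_until.map_or(true, |until| candidate > until);
        if extends {
            self.banned_until = Some(candidate);
            self.ban_reason = Some(reason.to_string());
        }
        // Floor of 1, never incremented: the exponential ladder is not inflated.
        // (Applying the floor unconditionally is an assumption of this sketch.)
        self.ban_count = self.ban_count.max(1);
    }
}
```

A caller would dispatch on the `Option`: `Some(period)` goes to `ban_for`, `None` goes to the existing exponential ladder. Taking the later of the two deadlines is what keeps out-of-order completions on a shared list from shortening an active ban.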
How Has This Been Tested?
- `cargo test -p rs-dapi-client`: 121 pass (106 unit + 7 `rate_limit_ban` integration + 3 failover + 5 doc-tests), 0 failures. `cargo clippy -p rs-dapi-client -p dash-sdk -- -D warnings` clean; `cargo fmt` clean.
- `rate_limited_node_banned_for_advertised_window_via_execute` drives the real `DapiClient::execute()` loop: a `ResourceExhausted` carrying `ratelimit-reset: 300` is banned for a ~300 s window (deliberately ≠ the ~60 s ladder rung), proving `ban_for` fires through the full client path, not just via a hand-built `Status`. It shares a fake-transport harness (`tests/common/mod.rs`) with the unimplemented-failover suite (refactored onto it, no duplication).
- `convertObjectToEnvs` confirmed to derive `PLATFORM_GATEWAY_RATE_LIMITER_RESPONSE_HEADERS_ENABLED` from the config path. The dashmate mocha suite was not run in the isolated worktree (pre-existing `ERR_REQUIRE_CYCLE_MODULE` in that runner, unrelated to these changes); the config-migration spec passes in the main tree.
- `ban_for`; clamp edges `1→1s` / `600→600s` / `601→600s`; `0` / garbage / empty / missing header → ladder fallback (assertions bound `banned_until` to the `≈60s` first ladder rung, so they fail if a bad header is wrongly routed to `ban_for`); full delegation chain; the `ban_for` ladder-floor side-effect (fresh node `ban_count 0→1`); and a banned address re-entering rotation after its window expires.
- `RateLimit-Reset`; the dashmate change above enables it. Where the header is absent, behaviour falls back to the unchanged exponential ban ladder.
- (`dashmate config render` + gateway restart, then `curl` / `grpcurl`). Confirmed on the running gateway: `HTTP 200`, `content-type: application/grpc-web+proto`, `grpc-status: 8`, `grpc-message: rate limited`, `ratelimit-reset: <n>`; `grpc-status` + `ratelimit-reset` co-located in HTTP headers, body 0 bytes (`xxd` confirms no trailer frame), `access-control-allow-origin` + `access-control-expose-headers` listing both. This is exactly the trailers-only shape `tonic 0.14.6` `Status::from_header_map` turns into `ResourceExhausted` + `ratelimit-reset` metadata (verified that path short-circuits on the header `grpc-status` without reading the body).
- `ResourceExhausted` + `ratelimit-reset` in trailers.
- (`local_reply_config` only rewrites Envoy local replies; normal proxied responses are untouched).

Breaking Changes
None. Adds a `CanRetry` method with a default implementation; the genuine-failure ban path is unchanged.

Checklist:
For repository code-owners and collaborators only
Attribution
🤖 Co-authored by Claudius the Magnificent AI Agent
Summary by CodeRabbit
`RateLimit-*` headers (including `RateLimit-Reset`) for reset-window support.