
fix(sdk): ban rate-limited node for Envoy-advertised reset window#3951

Merged
lklimek merged 19 commits into v3.1-dev from fix/rs-dapi-client-rate-limit-rotate
Jun 25, 2026

Conversation

@Claudius-Maginificent

@Claudius-Maginificent Claudius-Maginificent commented Jun 22, 2026

Collaborator

Why this PR exists

  • Problem: rs-dapi-client treats a gRPC ResourceExhausted (per-IP rate-limit / backpressure) the same as a node-down failure — it applies the 60s × e^ban_count health ban to the address. Banning a healthy-but-throttled node doesn't shed load, it relocates it onto the remaining nodes.
  • What breaks without it: Under a sustained per-IP rate limit, one high-concurrency client cascades to total failure: node A returns ResourceExhausted → banned 60s → its traffic shifts to B/C → they cross the limit → banned → … → NoAvailableAddressesToRetry, with zero server faults.
  • Blocking relationship: none. Independent fix off v3.1-dev.
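For a sense of scale, the 60s × e^ban_count ladder mentioned above can be tabulated with a few lines of Rust. This is only a sketch: `health_ban_secs` is a hypothetical helper, and using the number of prior bans as the exponent is an assumption (it matches a first ban of about 60 s).

```rust
/// Hypothetical helper (not part of rs-dapi-client): length in seconds of the
/// next health ban for a node that already has `prior_bans` bans, following
/// the 60s × e^ban_count formula quoted above.
fn health_ban_secs(prior_bans: u32) -> f64 {
    60.0 * (prior_bans as f64).exp()
}
```

Under that assumption the first three rungs come out at roughly 60 s, 163 s and 443 s.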

Redesign note: an earlier revision of this PR rotated to a different node on ResourceExhausted (plus exclusion/backoff bookkeeping). Review feedback flagged that as over-engineered. This revision replaces it with a single, server-driven ban whose duration is dictated by the rate limiter itself. The intermediate commit history reflects that evolution; the net diff is the redesign only.

What was done?

A single rate-limit ban mechanism, driven by the duration Envoy already advertises:

  • CanRetry::rate_limit_ban_duration(&self) -> Option<Duration> (default None), implemented for tonic::Status (packages/rs-dapi-client/src/transport/grpc.rs): returns None unless the code is ResourceExhausted; otherwise parses the ratelimit-reset response-metadata header (whole seconds), filters out 0/non-numeric, and clamps to [1s, 600s]. Delegated unchanged through TransportError → DapiClientError / ExecutionError.
  • update_address_ban_status dispatch (dapi_client.rs): Some(period) → AddressList::ban_for(address, period); None → the existing ban_with_reason exponential health-ban ladder (the fallback is the normal ladder, not a hardcoded default).
  • AddressList::ban_for / AddressStatus::ban_for (address_list.rs): advances banned_until to now + period only when that extends the current window (max-semantics — a short rate-limit reset can't shorten a longer active ban under out-of-order completions on the shared AddressList), updates ban_reason only on extension, and raises ban_count to a floor of 1 (so is_banned() stays consistent with banned_until). The rate-limit path is flat — it never inflates the exponential ladder. Reinstatement is the existing banned_until-expiry path.
  • dashmate: LIMIT_RESPONSE_HEADERS_ENABLED on the Lyft RLS container (which makes Envoy emit RateLimit-Reset, surfaced to the client as ratelimit-reset gRPC metadata) is now driven by a first-class config option platform.gateway.rateLimiter.responseHeaders.enabled (default on, since the ban-for-duration feature depends on the header) instead of a hard-coded =true. The change spans the JSON schema, the base-config default, a config migration keyed at the next release 4.0.0-rc.3 (back-fills existing deployments, default on, so an upgrade never silently disables it), env rendering in docker-compose.rate_limiter.yml, and docs/config/gateway.md (documents the privacy trade-off so a cautious operator can switch it off).
  • Gateway filter reorder for browser-client parity (packages/dashmate/templates/platform/gateway/envoy.yaml.dot): reordered the Envoy HTTP filters to cors → grpc_web → ratelimit → router and added ratelimit-reset/ratelimit-limit/ratelimit-remaining to CORS expose_headers, so browser (grpc-web / wasm-sdk) clients can read the over-limit RateLimit-Reset and apply the same per-node backoff as the native client (the ban logic in transport/grpc.rs already compiles for wasm). The over-limit response is a local reply from the ratelimit filter, and an Envoy local reply only traverses encoder filters positioned above the filter that generated it (envoyproxy/envoy#11776); placing cors + grpc_web above ratelimit is what gets that reply CORS-exposed and grpc-web-framed on encode. The native path is unaffected — security-reviewed against Envoy source: grpc_web latches is_grpc_web_request_=false and early-returns for application/grpc, and cors no-ops without an Origin (and only ever appends headers). The one behavioural delta is that OPTIONS preflights now short-circuit at cors before the limiter (negligible; arguably a fix).
    • Trailers-only over-limit reply for grpc-web (same file, HCM local_reply_config): the reorder alone isn't enough — Envoy's gRPC-detection matches application/grpc only, so rate_limited_as_resource_exhausted never tags the grpc-web path and the browser would get a bare HTTP 429 (which tonic-web maps to Unavailable, not ResourceExhausted, so the ban never fires). A local_reply_config mapper — scoped to status == 429 and request header x-grpc-web present, so JSON-RPC and native gRPC are untouched — rewrites that reply to HTTP 200 + grpc-status: 8. Because the reply is headers-only (empty body), grpc_web passes it through without reframing, so grpc-status and ratelimit-reset stay co-located in the HTTP response headers; tonic-web-wasm-client (→ tonic 0.14.6) then builds a ResourceExhausted Status with ratelimit-reset in status.metadata() via Status::from_header_map, firing the same ban_for the native client uses. No client-side code change.
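The header-parsing rule in the first bullet above (ResourceExhausted only, whole seconds, reject 0 and non-numeric values, clamp to [1s, 600s]) can be sketched without tonic as follows. This is an illustration under stated assumptions, not the code in transport/grpc.rs: the free function takes a boolean and an optional header string in place of a `tonic::Status`, and `MAX_RATE_LIMIT_BAN_SECS` is a hypothetical constant name.

```rust
use std::time::Duration;

const MIN_RATE_LIMIT_BAN_SECS: u64 = 1;
// Hypothetical name; the description only states the 600 s upper bound.
const MAX_RATE_LIMIT_BAN_SECS: u64 = 600;

/// Stand-in for the `tonic::Status` implementation: the caller passes whether
/// the status code was ResourceExhausted and the raw `ratelimit-reset` value.
fn rate_limit_ban_duration(
    is_resource_exhausted: bool,
    ratelimit_reset: Option<&str>,
) -> Option<Duration> {
    if !is_resource_exhausted {
        return None;
    }
    // Missing, empty, non-numeric and zero values all yield `None`,
    // which sends the caller down the exponential ladder instead.
    let secs = ratelimit_reset?.parse::<u64>().ok().filter(|s| *s > 0)?;
    Some(Duration::from_secs(
        secs.clamp(MIN_RATE_LIMIT_BAN_SECS, MAX_RATE_LIMIT_BAN_SECS),
    ))
}
```

Returning `None` for every malformed input keeps the fallback decision in one place: the caller only has to distinguish `Some(period)` from `None`.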

Review fixes folded in

  • ban_for never shortens an active ban (max-semantics above) — a short ratelimit-reset arriving after a longer ban no longer truncates the window; ban_reason is preserved unless the window extends. Pinned by test_ban_for_never_shortens_active_ban.
  • Rate-limit debug log ordering (dapi_client.rs) — the banning … for Ns (from RateLimit-Reset header) debug line now fires inside the if banned branch, so it can no longer precede/contradict the unable to ban … not in the list anymore trace when ban_for returns false.

Net effect: a throttled node is banned for exactly the window the server says it needs — no more, no less — instead of a fixed exponential health ban; genuine node ill-health still bans exactly as before.
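The max-semantics of ban_for described above can be modelled in a few lines. The struct below is a simplified, hypothetical stand-in for AddressStatus, and passing `now` as a parameter is an assumption made purely to keep the sketch deterministic.

```rust
use std::time::{Duration, Instant};

/// Simplified, hypothetical model of the per-address ban state.
#[derive(Debug, Default)]
struct AddressStatus {
    ban_count: u32,
    banned_until: Option<Instant>,
    ban_reason: Option<String>,
}

impl AddressStatus {
    /// Flat ban: move `banned_until` to `now + period` only when that is
    /// later than the current window, so an active ban is never shortened.
    fn ban_for(&mut self, now: Instant, period: Duration, reason: &str) {
        let candidate = now + period;
        let extends = self.banned_until.map_or(true, |until| candidate > until);
        if extends {
            self.banned_until = Some(candidate);
            // The reason changes only when the window is extended.
            self.ban_reason = Some(reason.to_string());
        }
        // Floor of 1: the count is raised from 0 but never incremented further.
        self.ban_count = self.ban_count.max(1);
    }
}
```

In this model a 5 s reset arriving after a 300 s ban leaves both the window and the reason untouched, while a 400 s reset extends them; ban_count stays at 1 throughout.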

How Has This Been Tested?

  • cargo test -p rs-dapi-client: 121 pass (106 unit + 7 rate_limit_ban integration + 3 failover + 5 doc-tests), 0 failures. cargo clippy -p rs-dapi-client -p dash-sdk -- -D warnings clean; cargo fmt clean.
  • End-to-end coverage: rate_limited_node_banned_for_advertised_window_via_execute drives the real DapiClient::execute() loop — a ResourceExhausted carrying ratelimit-reset: 300 is banned for a ~300 s window (deliberately ≠ the ~60 s ladder rung), proving ban_for fires through the full client path, not just via a hand-built Status. It shares a fake-transport harness (tests/common/mod.rs) with the unimplemented-failover suite (refactored onto it, no duplication).
  • dashmate toggle: schema ↔ base-config default ↔ migration ↔ env-var rendering name-consistency statically verified; convertObjectToEnvs confirmed to derive PLATFORM_GATEWAY_RATE_LIMITER_RESPONSE_HEADERS_ENABLED from the config path. The dashmate mocha suite was not run in the isolated worktree (pre-existing ERR_REQUIRE_CYCLE_MODULE in that runner, unrelated to these changes); the config-migration spec passes in the main tree.
  • Coverage: header → ban_for; clamp edges 1→1s / 600→600s / 601→600s; 0 / garbage / empty / missing header → ladder fallback (assertions bound banned_until to the ≈60s first ladder rung, so they fail if a bad header is wrongly routed to ban_for); full delegation chain; the ban_for ladder-floor side-effect (fresh node ban_count 0→1); and a banned address re-entering rotation after its window expires.
  • Operational dependency: the per-duration ban only engages where the rate limiter advertises RateLimit-Reset; the dashmate change above enables it. Where the header is absent, behaviour falls back to the unchanged exponential ban ladder.
  • ✅ Browser-parity is live-verified against the local devnet gateway (dashmate config render + gateway restart, then curl/grpcurl). Confirmed on the running gateway:
    • grpc-web over-limit → HTTP 200, content-type: application/grpc-web+proto, grpc-status: 8, grpc-message: rate limited, ratelimit-reset: <n>; grpc-status + ratelimit-reset co-located in HTTP headers, body 0 bytes (xxd confirms no trailer frame), access-control-allow-origin + access-control-expose-headers listing both. This is exactly the trailers-only shape tonic 0.14.6 Status::from_header_map turns into ResourceExhausted + ratelimit-reset metadata (verified that path short-circuits on the header grpc-status without reading the body).
    • native grpcurl over-limit (regression gate) → still ResourceExhausted + ratelimit-reset in trailers.
    • under-limit, both protocols → unchanged (local_reply_config only rewrites Envoy local replies; normal proxied responses are untouched).
    • Remaining gap: an actual in-browser wasm client executing the path was not run here (no browser harness; the devnet cert is SHA-1, which browsers reject) — but the wire contract that client depends on is now verified end-to-end against the real Envoy build.

Breaking Changes

None. Adds a CanRetry method with a default implementation; the genuine-failure ban path is unchanged.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes (n/a — no breaking changes)
  • I have made corresponding changes to the documentation if needed

For repository code-owners and collaborators only

  • I have assigned this pull request to a milestone

Attribution

🤖 Co-authored by Claudius the Magnificent AI Agent

Summary by CodeRabbit

  • New Features
    • Rate-limit responses now apply a flat temporary ban matching the server-provided reset window (when available), improving timing accuracy.
    • Added a Rate Limiter option to emit RateLimit-* headers (including RateLimit-Reset) for reset-window support.
  • Bug Fixes
    • Follow-up bans no longer shorten an active ban window; ban reasons are only updated when the ban is extended.
    • Expired bans can re-enter rotation without losing ban history.

…limits

ResourceExhausted is a congestion/backpressure signal, not endpoint
ill-health. Banning a rate-limited (but healthy) node relocates its
load onto survivors and cascades to NoAvailableAddressesToRetry. Now
RE is classified as rate-limited: the node is not banned, the retry
rotates to a different node via explicit exclusion, and the existing
bounded retry count bounds an over-limit client. Genuine-failure ban
path (60s x e^n) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 50c92027-b92a-4b59-8034-a9e9e4c807b7

📥 Commits

Reviewing files that changed from the base of the PR and between 6818f3e and 5d45556.

📒 Files selected for processing (1)
  • .gitignore
✅ Files skipped from review due to trivial changes (1)
  • .gitignore

📝 Walkthrough

Walkthrough

Adds header-driven rate-limit banning to rs-dapi-client, propagating a new rate_limit_ban_duration() capability through the retry error chain and routing bans to AddressList::ban_for when ratelimit-reset is present. Dashmate adds configuration and runtime support for emitting RateLimit response headers.

Changes

Header-driven flat ban window

| Layer / File(s) | Summary |
|---|---|
| Address ban window semantics (packages/rs-dapi-client/src/address_list.rs) | Updates AddressStatus::ban_for to use a flat window with a no-shortening guard, a ban_count floor of 1, and conditional ban_reason updates on extension. Adds AddressList::ban_for, keeps get_live_address() filtering unchanged, and extends tests for timing bounds, count behavior, ladder interaction, unknown addresses, and expiry re-entry. |
| Rate-limit duration parsing and propagation (packages/rs-dapi-client/src/lib.rs, packages/rs-dapi-client/src/transport/grpc.rs, packages/rs-dapi-client/src/transport.rs, packages/rs-dapi-client/src/dapi_client.rs, packages/rs-dapi-client/src/executor.rs) | Adds rate_limit_ban_duration() to CanRetry, implements ratelimit-reset parsing for ResourceExhausted gRPC status with clamping, and propagates the result through transport, client, and executor error types. Tests cover valid, missing, invalid, and clamped header values. |
| Header-driven ban routing and tests (packages/rs-dapi-client/src/dapi_client.rs, packages/rs-dapi-client/tests/common/mod.rs, packages/rs-dapi-client/tests/rate_limit_ban.rs, packages/rs-dapi-client/tests/unimplemented_failover.rs) | Routes update_address_ban_status to ban_for(address, period, reason) when a rate-limit duration is available and keeps the exponential ladder otherwise. Adds a shared scripted transport test harness, rewrites unimplemented_failover to use it, and expands rate_limit_ban coverage for header-driven bans, fallback behavior, delegation, and end-to-end execution. |
| Dashmate RateLimit response headers (packages/dashmate/src/config/configJsonSchema.js, packages/dashmate/configs/defaults/getBaseConfigFactory.js, packages/dashmate/configs/getConfigFileMigrationsFactory.js, packages/dashmate/docker-compose.rate_limiter.yml, packages/dashmate/docs/config/gateway.md) | Adds responseHeaders.enabled to the dashmate gateway rate limiter schema, base config, config migration, docker-compose environment, and gateway documentation. The option controls emission of RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers. |

Sequence Diagram(s)

sequenceDiagram
  participant DapiClient
  participant CanRetry
  participant AddressList
  participant ScriptedRequest

  ScriptedRequest-->>DapiClient: ResourceExhausted + ratelimit-reset
  DapiClient->>CanRetry: rate_limit_ban_duration()
  CanRetry-->>DapiClient: Some(period) or None
  DapiClient->>AddressList: ban_for(address, period, reason) or ladder ban
  AddressList-->>DapiClient: updated ban state
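
The flow in the diagram (a trait method that defaults to None, delegation through wrapper errors, then a two-way dispatch) can be sketched as below. All types here are hypothetical stand-ins: `Throttled`, `NodeDown`, `Wrapper` and `BanAction` do not exist in rs-dapi-client, and the trait is reduced to the one new method.

```rust
use std::time::Duration;

/// Reduced version of the trait: only the new method, with its `None` default.
trait CanRetry {
    fn rate_limit_ban_duration(&self) -> Option<Duration> {
        None
    }
}

// Hypothetical error types standing in for the real transport/client errors.
struct Throttled(u64);
struct NodeDown;
struct Wrapper<E>(E);

impl CanRetry for Throttled {
    fn rate_limit_ban_duration(&self) -> Option<Duration> {
        Some(Duration::from_secs(self.0))
    }
}

// Relies on the default implementation, i.e. "not a rate limit".
impl CanRetry for NodeDown {}

// Outer error layers delegate unchanged to the wrapped error.
impl<E: CanRetry> CanRetry for Wrapper<E> {
    fn rate_limit_ban_duration(&self) -> Option<Duration> {
        self.0.rate_limit_ban_duration()
    }
}

#[derive(Debug, PartialEq)]
enum BanAction {
    FlatBan(Duration),
    LadderBan,
}

/// Mirrors the dispatch: a server-provided period selects the flat ban,
/// anything else falls back to the exponential ladder.
fn choose_ban<E: CanRetry>(error: &E) -> BanAction {
    match error.rate_limit_ban_duration() {
        Some(period) => BanAction::FlatBan(period),
        None => BanAction::LadderBan,
    }
}
```

Because the default is `None`, existing implementors that do not override the method keep the ladder behaviour, which is consistent with the PR's claim of no breaking changes.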

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Suggested labels

Client Only

Suggested reviewers

  • QuantumExplorer
  • shumkov
  • lklimek

Poem

🐰 I sniffed a reset in the header breeze,
and hopped to a ban with measured ease.
Flat little naps, no ladder climb,
till RateLimit-* marks the time.
Then dashmate sings in headers bright,
while bunny paws keep the window right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: banning rate-limited nodes for Envoy-advertised reset windows. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@lklimek lklimek requested a review from Copilot June 22, 2026 10:51
@lklimek lklimek marked this pull request as ready for review June 22, 2026 10:51
@lklimek lklimek requested a review from QuantumExplorer as a code owner June 22, 2026 10:51
@thepastaclaw

thepastaclaw commented Jun 22, 2026

Collaborator

✅ Review complete (commit 5d45556)

Copilot AI left a comment

Contributor


Pull request overview

This PR adjusts rs-dapi-client retry/failover behavior so gRPC ResourceExhausted (rate-limiting / backpressure) no longer triggers the exponential “health ban” logic, preventing ban-cascades under sustained rate-limits while keeping genuine node-failure banning unchanged.

Changes:

  • Introduces a new CanRetry::is_rate_limited() classification (default false), implemented for gRPC Status::ResourceExhausted and delegated through TransportError, DapiClientError, and ExecutionError.
  • Updates update_address_ban_status to not ban rate-limited addresses (log + rotate instead), preserving the existing ban ladder for genuine failures.
  • Adds address-rotation across retries via AddressList::get_live_address_excluding(&[Address]), plus new unit/integration tests to assert both invariants and rotation behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| packages/rs-dapi-client/tests/rate_limit_rotation.rs | New integration tests driving the real executor retry loop with a fake transport to validate “rotate, don’t ban” for ResourceExhausted. |
| packages/rs-dapi-client/src/transport/grpc.rs | Implements CanRetry::is_rate_limited() for gRPC Status (ResourceExhausted only). |
| packages/rs-dapi-client/src/transport.rs | Delegates is_rate_limited() for TransportError and adds unit tests pinning the “ResourceExhausted only” classification. |
| packages/rs-dapi-client/src/lib.rs | Extends the public CanRetry trait with is_rate_limited() (default false) and documents intended semantics. |
| packages/rs-dapi-client/src/executor.rs | Delegates is_rate_limited() through ExecutionError<E>. |
| packages/rs-dapi-client/src/dapi_client.rs | Skips banning for rate-limited errors and rotates retries away from already-tried addresses. Adds invariants-focused tests for ban ladder vs rate-limit behavior. |
| packages/rs-dapi-client/src/address_list.rs | Adds get_live_address_excluding and tests ensuring exclusion/rotation semantics and graceful fallback. |


Comment thread packages/rs-dapi-client/src/dapi_client.rs Outdated

@thepastaclaw thepastaclaw left a comment

Collaborator


Code Review

Narrow, well-tested fix that adds an is_rate_limited axis to CanRetry, suppresses banning for gRPC ResourceExhausted, and rotates the next retry off the throttled node via get_live_address_excluding plus a per-execution tried set. The default trait impl preserves prior behavior and the genuine-failure ban ladder is pinned by tests. Only minor docs/test-robustness nits — nothing blocking.

🟡 1 suggestion(s) | 💬 4 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-dapi-client/src/lib.rs`:
- [SUGGESTION] packages/rs-dapi-client/src/lib.rs:107-109: Pin `is_rate_limited ⇒ can_retry` with a test or debug_assert
  The fix is load-bearing on the implicit contract that any rate-limited error is also retryable: `update_address_ban_status` only consults `is_rate_limited()` from inside the `if error.can_retry()` branch. If a future change ever marks `ResourceExhausted` non-retryable (e.g. a tweak to `Status::can_retry`), rate-limit handling silently stops firing and the error falls through to the no-op tail — neither ban nor rotation happens. Lock the invariant with either a unit test asserting `is_rate_limited() ⇒ can_retry()` across the gRPC code matrix, or a `debug_assert!(!self.is_rate_limited() || self.can_retry())` on the hot path, so the two predicates can't drift unnoticed.
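
The suggested invariant check can be illustrated with a toy model. Everything here is an assumption made for the sketch: the `Code` enum covers only four codes, and which codes count as retryable is invented for illustration rather than taken from the real `Status::can_retry`.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Code {
    Ok,
    Unavailable,
    ResourceExhausted,
    InvalidArgument,
}

const ALL_CODES: [Code; 4] = [
    Code::Ok,
    Code::Unavailable,
    Code::ResourceExhausted,
    Code::InvalidArgument,
];

// Hypothetical retry classification, for illustration only.
fn can_retry(code: Code) -> bool {
    matches!(code, Code::Unavailable | Code::ResourceExhausted)
}

fn is_rate_limited(code: Code) -> bool {
    matches!(code, Code::ResourceExhausted)
}

/// True when every rate-limited code is also retryable.
fn invariant_holds() -> bool {
    ALL_CODES
        .iter()
        .all(|&code| !is_rate_limited(code) || can_retry(code))
}

/// Guards against a vacuous pass: at least one code must be rate-limited.
fn invariant_is_non_vacuous() -> bool {
    ALL_CODES.iter().any(|&code| is_rate_limited(code))
}
```

Checking non-vacuity alongside the implication matters because the implication alone passes trivially if no code is ever classified as rate-limited.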

Comment thread packages/rs-dapi-client/src/lib.rs Outdated
Comment thread packages/rs-dapi-client/src/lib.rs Outdated
Comment thread packages/rs-dapi-client/src/dapi_client.rs Outdated
Comment thread packages/rs-dapi-client/src/dapi_client.rs Outdated
Comment thread packages/rs-dapi-client/src/dapi_client.rs Outdated
@lklimek lklimek changed the title from "fix(rs-dapi-client): rotate instead of ban on ResourceExhausted rate-limits" to "fix(sdk): rotate instead of ban on ResourceExhausted rate-limits" Jun 22, 2026
lklimek and others added 2 commits June 22, 2026 16:55
…limit retry, lock can_retry invariant

Three review fixes for the rotate-instead-of-ban rate-limit handling.

M1 (invariant lock): the rotate-don't-ban behavior depends on
`is_rate_limited() => can_retry()`. Add a debug_assert! at the rotation
decision in `execute` and an exhaustive property test
(`test_rate_limit_implies_retryable_invariant`) covering every gRPC code
across tonic::Status, TransportError, DapiClientError and ExecutionError,
so a future widening of `is_rate_limited` (or narrowing of `can_retry`)
fails loudly instead of silently killing rotation.

M2 (rotate off the throttled node + backoff): on a rate-limited retry the
selection no longer re-picks the just-throttled node in small pools — the
fallback now excludes the just-tried address and only reuses it when it is
the single live node. The rate-limit retry path also replaces the flat
10ms delay with capped exponential backoff (base 10ms, x2 per attempt,
capped at 500ms) plus full jitter, so a congested fleet de-correlates
instead of retrying in lockstep. Genuine-failure delays are unchanged.
Server-side RetryInfo is left as a documented TODO (needs metadata
plumbing).

M3 (rs-sdk propagation): override `is_rate_limited` on the SDK `Error` to
delegate to the wrapped DapiClientError, so the rotate-don't-ban semantics
survive at the SDK layer independently of how `can_retry` is defined.
Pure future-proofing (today the can_retry guard short-circuits first);
locked with a regression test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iant assert

Two self-introduced nits from the verify pass.

1. `rate_limit_backoff_window`'s `checked_shl` guards only the shift AMOUNT,
   not value bit-loss: a large `attempt` (e.g. 63) shifts the set bits out of
   `10 << attempt`, collapsing the window to 0ms — no backoff at all.
   Pathological (needs retries >= 64; default is 5) but it's a footgun the
   backoff math created. Clamp the shift amount at 16 (already saturates the
   500ms cap, so nothing changes in the valid range) and extend the unit test
   with the attempt=63/64 boundary asserting the window stays at the cap.

2. Add the symmetric `debug_assert!(!is_rate_limited() || can_retry())` to the
   transport-client-creation error block for parity with the transport-error
   block. Defense-in-depth; the property test already covers the invariant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
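
The shift-overflow nit in item 1 of the commit above is easy to reproduce in isolation. The sketch below is an assumption-laden reconstruction, not the removed `rate_limit_backoff_window` itself: the function and constant names are hypothetical, jitter is left out, and only the 10 ms base, 500 ms cap and shift clamp of 16 come from the commit message.

```rust
const BASE_MS: u64 = 10;
const CAP_MS: u64 = 500;
const MAX_SHIFT: u32 = 16;

/// Naive window: `checked_shl` only rejects shift amounts >= 64, so for
/// attempt = 63 both set bits of 10 are shifted out and the result is 0.
fn naive_window_ms(attempt: u32) -> u64 {
    BASE_MS.checked_shl(attempt).unwrap_or(CAP_MS).min(CAP_MS)
}

/// Clamped window: limiting the shift to 16 already exceeds the cap
/// (10 << 16 = 655_360), so large attempts stay pinned at 500 ms.
fn clamped_window_ms(attempt: u32) -> u64 {
    (BASE_MS << attempt.min(MAX_SHIFT)).min(CAP_MS)
}
```

In the valid range both versions agree (10 ms at attempt 0, 320 ms at attempt 5, capped from attempt 6 onward); they differ only at the pathological boundary, where the naive version returns 0 for attempt 63.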
@lklimek lklimek self-requested a review as a code owner June 22, 2026 15:10
@Claudius-Maginificent

Collaborator Author

Review fixes (adversarial review pass)

Addressed reviewer findings (CodeRabbit / thepastaclaw / internal review). All MEDIUM; two LOW left open by design.

M1 — is_rate_limited ⇒ can_retry invariant locked (thepastaclaw)

  • Exhaustive property test over 17 gRPC codes × 4 CanRetry impls (non-vacuous) + debug_assert at the rotation decision in the typed execute() loop.

M2 — rate-limit retry hardening

  • Retry now rotates to a different node (3-tier selection; never re-picks the just-throttled node while any alternative exists) — closes a gap where small/single-node pools re-hit the throttled node.
  • Flat 10 ms retry replaced with capped exponential backoff (10 ms → 500 ms) + full jitter on the rate-limit path only; genuine-failure path unchanged. Shift amount clamped to avoid pathological window collapse. Server RetryInfo honoring left as a documented TODO (needs grpc-status-details decoding).

M3 — rs-sdk rate-limit classification propagated

  • Error::is_rate_limited() now delegates to the inner DapiClientError, future-proofing against a latent SDK-layer ban regression. Zero behavior change today (no consumer reads it outside if can_retry()).

Left open (LOW, by design): ban-evasion blind spot for a node that always returns ResourceExhausted; .read().unwrap() panic on a poisoned routing lock (TODO breadcrumb added).

Commits: 396865fc, b8a8c329. Tests: clippy -D warnings clean; 110 lib + 7 integration + 5 doc-tests pass.

🤖 Co-authored by Claudius the Magnificent AI Agent

@lklimek lklimek added the ready for final review Ready for the final review. If AI was involved in producing this PR, it has already had a reviewer. label Jun 22, 2026

@thepastaclaw thepastaclaw left a comment

Collaborator


Code Review

Cumulative review at b8a8c32. The latest delta hardens the rate-limit handling: capped exponential backoff with full jitter, three-tier address selection (excluding tried → just-tried → fallback), debug_assert! invariants on both retry arms, and a property test pinning is_rate_limited() ⇒ can_retry() across all CanRetry implementors. Prior findings #1 and #4 are resolved; #2, #3, #5 are still valid nitpicks. One new latest-delta finding: the new AddressStatus ban-ladder test inherits the same thin 50ms wall-clock tolerance. No blocking issues.

💬 3 nitpick(s)

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

Comment thread packages/rs-dapi-client/src/lib.rs Outdated
Comment thread packages/rs-dapi-client/src/dapi_client.rs Outdated
Comment thread packages/rs-dapi-client/src/dapi_client.rs Outdated
@thepastaclaw

Collaborator

Opened #3956 against this branch to address the Copilot note by matching the ban-info entry on the exact URI instead of a substring search.

lklimek and others added 3 commits June 23, 2026 14:30
…an-for-duration

Replace the "rotate-don't-ban" approach with "ban for server-dictated window":

- Add `rate_limit.rs`: decode `google.rpc.RetryInfo` from gRPC status details
  via minimal prost structs; `DAPI_RATE_LIMIT_BAN_MS` env-var fallback (default
  60 s) when no RetryInfo is present; WASM-compatible guard on env-var read.
- Add `AddressStatus::ban_for_duration()` + `AddressList::ban_for_duration()`:
  set `banned_until` without touching `ban_count` so the health-ban exponential
  ladder is never disturbed by rate-limit events.
- Add `CanRetry::rate_limit_ban_duration()` (default `None`) and wire it
  through `tonic::Status`, `TransportError`, `ExecutionError`, `DapiClientError`
  and `dash_sdk::Error`.
- Update `update_address_ban_status`: `ResourceExhausted` now calls
  `ban_for_duration(server_hint.unwrap_or(fallback))` instead of skipping.
- Remove rotation machinery: `tried: Vec<Address>`, 3-tier
  `get_live_address_excluding` selection, jitter backoff (`retry_delay`,
  `rate_limit_backoff_window`, `RATE_LIMIT_MAX_DELAY_MS`), and rand import.
  The retry loop now always calls `get_live_address()` — rate-limited nodes are
  naturally absent from the live pool until their ban window expires.
- Fix `AddressBanInfo.banned`: now consistent with `get_live_address` filtering
  (checks only `banned_until`, not `ban_count > 0`), so rate-limit bans are
  visible in diagnostics.
- Delete `tests/rate_limit_rotation.rs`; update/add invariant tests.

All 419 unit/integration/doc tests pass. Zero clippy warnings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r; drop hand-rolled rotation

Replace prost/RetryInfo-based approach with direct `ratelimit-reset` HTTP
header parsing.  No hand-rolled rotation machinery — banning a node already
achieves rotation because `get_live_address()` skips banned nodes.

Changes:
- Delete `rate_limit.rs` (prost/RetryInfo decoder) and remove prost dep
- Replace `is_rate_limited() -> bool` on `CanRetry` trait with
  `rate_limit_ban_duration() -> Option<Duration>` (default `None`)
- `tonic::Status` impl: if `ResourceExhausted` + `ratelimit-reset` header
  is a parseable u64 > 0, return `Some(clamped [1s, 600s])`; else `None`
- Add `AddressStatus::ban_for(period, reason)` + `AddressList::ban_for`
  that sets `banned_until = now + period`, `ban_count = max(ban_count, 1)`;
  does NOT inflate the exponential health-ban ladder
- `update_address_ban_status`: dispatch to `ban_for` on `Some(period)`,
  fall back to existing `ban_with_reason` on `None`
- Revert execute loop to base: flat 10ms retry delay, plain `get_live_address()`
- Revert `rs-sdk/src/error.rs` to base (banning stays inside rs-dapi-client)
- Add `LIMIT_RESPONSE_HEADERS_ENABLED=true` to dashmate rate-limiter compose
- Replace deleted `tests/rate_limit_rotation.rs` with
  `tests/rate_limit_ban.rs` (5 focused tests for the header-parse path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… fixes

QA-001: Correct `AddressStatus::ban_for` / `AddressList::ban_for` docstrings —
  the "no ladder inflation" claim was false for fresh nodes (ban_count 0 → 1
  raises the next genuine health ban from ~60 s to ~163 s). Docs now describe
  the floor/side-effect accurately. New test pins the 0→1 ladder-floor behaviour.

QA-002: Fix `MIN_RATE_LIMIT_BAN_SECS` comment — the `> 0` filter before the
  clamp already rejects 0, so MIN=1 is a documentation/intent marker, not an
  active lower clamp.

QA-003: De-tautologize "bad/zero/missing header → ladder" tests. Previous
  assertions (ban_count == 1) passed for BOTH ban_for and ban_with_reason.
  Now also assert the banned_until window is ~60 s (first ladder rung), which
  would fail if a bad header were wrongly routed to ban_for.

QA-004: Add clamp-edge coverage: 1 → 1 s, 600 → 600 s, 601 → 600 s (was only
  45 and 9999).

QA-005: Add test that a ban_for'd address re-enters rotation via
  get_live_address after its window expires (zero-duration re-ban simulates
  an expired window).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Claudius-Maginificent Claudius-Maginificent changed the title from "fix(sdk): rotate instead of ban on ResourceExhausted rate-limits" to "fix(rs-dapi-client): ban rate-limited node for Envoy-advertised reset window" Jun 23, 2026
@lklimek

lklimek commented Jun 23, 2026

Contributor

@thepastaclaw review the most recent commit.

@lklimek lklimek changed the title from "fix(rs-dapi-client): ban rate-limited node for Envoy-advertised reset window" to "fix(dapi-client): ban rate-limited node for Envoy-advertised reset window" Jun 23, 2026
lklimek previously approved these changes Jun 23, 2026

@thepastaclaw thepastaclaw left a comment

Collaborator


Code Review

Carried-forward prior findings reconcile cleanly against HEAD 2c76a14: ban_for still unconditionally overwrites banned_until (blocking regression introduced by this PR), Envoy's CORS expose_headers still omits the ratelimit-* family (suggestion — browser/WASM clients silently fall back to the ladder), and the rate-limit path still emits a success-shaped debug log before calling ban_for (nitpick). No new latest-delta findings beyond what the agents and prior review already flagged. Quality-agent suggestions about ban_count semantics, constant placement, and timing tolerances are out-of-scope architectural/test-stability commentary and were dropped.

🟡 1 suggestion(s)

1 additional finding(s) omitted (not in diff).

2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: Expose `ratelimit-*` headers via CORS so browser/WASM clients can read them
  This PR teaches `rs-dapi-client` to read Envoy's `ratelimit-reset` header and pin an exact ban window from it. The WASM transport (`tonic_web_wasm_client`, used by `wasm-sdk`) reaches the gateway through browser `fetch`, which only surfaces non-safelisted response headers that the server enumerates in `Access-Control-Expose-Headers`. The current `expose_headers` list (line 201) does not include `ratelimit-limit`, `ratelimit-remaining`, or `ratelimit-reset`, so browser clients receive `ResourceExhausted` but cannot see the reset window and fall back to the exponential health-ban ladder — defeating the PR's stated improvement for the WASM SDK path. Native Rust clients are unaffected; the fix is one line on the gateway template.

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review correction

The previous exact-SHA automated review at 2c76a14f was posted as COMMENT, but the verifier output still contains a blocking carried-forward finding. Requesting changes so the PR state matches the verified result.

🔴 1 blocking | 🟡 1 suggestion | ⚪ 1 nitpick

Blocking: packages/rs-dapi-client/src/address_list.rs:116-120: ban_for still unconditionally writes banned_until = Some(now + period), so a later short ratelimit-reset response can shorten an existing longer health/rate-limit ban and return the node to rotation early. Preserve the longer existing/advertised deadline.

Suggestion: packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201 — expose ratelimit-limit, ratelimit-remaining, and ratelimit-reset in CORS so browser/WASM clients can read the reset window instead of falling back to the exponential ladder.

Nitpick: packages/rs-dapi-client/src/dapi_client.rs:213-225 — the rate-limit path logs a success-shaped ban message before ban_for returns, producing contradictory logs when the address has been removed.

Run artifacts: /Users/claw/.openclaw/workspace/worktrees/review-platform-3951/.review/run-1782229049-2c76a14f.

lklimek and others added 6 commits June 24, 2026 09:22
…cs and log ordering

Fix 1 (BLOCKING): `AddressStatus::ban_for` now uses max-semantics — it extends
`banned_until` only when `now + period` is later than the current `banned_until`,
preventing a short `ratelimit-reset` response from shortening a longer active ban.
`ban_reason` is updated only when the window extends; `ban_count` is raised to
`max(ban_count, 1)` unconditionally.  Docstring updated to present-state semantics.

Fix 2 (Suggestion): the "rate-limited: banning" debug log in `update_address_ban_status`
now fires inside the `if banned` branch only, so it cannot contradict a subsequent
"unable to ban … not in the list anymore" trace when the address was SML-removed.

Test: `test_ban_for_never_shortens_active_ban` in `tests/rate_limit_ban.rs` asserts
(a) LONG→SHORT does NOT reduce `banned_until`, (b) SHORT→LONG extends it, and (c)
`ban_count` ends at ≥ 1 in both cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
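
The max-semantics rule described in Fix 1 can be sketched as below. This is an illustration under stated assumptions, not the code in `address_list.rs`: `AddressStatusSketch` is a hypothetical, simplified stand-in for `AddressStatus`, and `now` is passed in explicitly (an assumption made here so the behaviour is deterministic to check).

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for `AddressStatus`; the real struct
/// may use different field and time types.
#[derive(Debug, Default)]
struct AddressStatusSketch {
    ban_count: usize,
    banned_until: Option<Instant>,
    ban_reason: Option<String>,
}

impl AddressStatusSketch {
    /// Ban for a fixed `period` without ever shortening an active ban.
    fn ban_for(&mut self, now: Instant, period: Duration, reason: &str) {
        let candidate = now + period;
        // Extend only when the new deadline is later than the current one.
        let extends = match self.banned_until {
            Some(current) => candidate > current,
            None => true,
        };
        if extends {
            self.banned_until = Some(candidate);
            // The reason follows the window: it changes only on extension.
            self.ban_reason = Some(reason.to_owned());
        }
        // Raised to at least 1 whether or not the window moved.
        self.ban_count = self.ban_count.max(1);
    }
}
```

With this shape, a LONG then SHORT sequence keeps the LONG deadline and reason, while SHORT then LONG adopts the LONG deadline and reason, which is the contract the commit message says `test_ban_for_never_shortens_active_ban` asserts.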
…or test

The previous fix swapped `ban_for(ZERO)` for `unban()`, but those exercise
different code paths: `get_live_address` reinstates a node on `banned_until < now`
independent of `ban_count`, while `unban()` also zeroes `ban_count`.

Restore the expiry path by back-dating `banned_until` directly (white-box,
in-module test) after the 300s ban + hidden assertion.  Two invariants now pinned:
- `get_live_address()` returns the node once the window is in the past.
- `is_banned()` is still true (ban_count > 0) — window-expiry ≠ unban.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o-shorten test hardening

QA-006: Document that `AddressStatus::ban_with_reason` intentionally always
re-bases `banned_until` (unconditional overwrite for exponential escalation).
Note that the no-shorten max-semantics invariant is scoped to ban_for→ban_for
interactions only and does not apply to this method.

QA-007: Strengthen `test_ban_for_never_shortens_active_ban` case (a) with an
exact `assert_eq!` on a `banned_until` snapshot taken between the LONG and SHORT
calls, making a last-wins regression fail unambiguously rather than relying on
a >=299s threshold that could theoretically pass with wrong semantics.

QA-008: Add named reasons ("long-reason" / "short-reason") to both test cases
and assert the `ban_reason` update contract:
- case (a) LONG→SHORT: reason preserved from LONG call (SHORT must not overwrite)
- case (b) SHORT→LONG: reason adopted from LONG call (window extended, so reason updates)

No production logic changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…den case (b) last-wins guard

DOC (ban_with_reason): explicitly name the cross-method shortening scenario —
a health failure on a node holding a longer rate-limit window (set via ban_for)
will re-base banned_until to the exponential value, which may be shorter.  This
is intentional: the health-ban ladder owns the window for genuinely-unhealthy
nodes; the no-shorten guarantee is scoped to ban_for→ban_for sequences only.

DOC (ban_for): add cross-reference note that the no-shorten guard is ban_for-only
and that ban_with_reason re-bases unconditionally — see its docs.

TEST QA-007 (case b): add a trailing ban_for(SHORT) after the LONG extension,
snapshot banned_until with assert_eq!, and explain in a comment that case (a) +
this trailing-short assertion together kill both MIN-wins and last-wins regressions.
Under a last-wins implementation the trailing SHORT would clobber the LONG window
and the equality check would fail unambiguously.

Zero production-logic change — git diff on src/ shows doc comments only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…led config toggle

Replaces the hard-coded LIMIT_RESPONSE_HEADERS_ENABLED=true in
docker-compose.rate_limiter.yml with a first-class dashmate config option,
defaulting to true so existing deployments keep emitting RateLimit-* headers.

rs-dapi-client reads RateLimit-Reset to apply a precise server-advertised
ban window instead of the exponential health-ban ladder; this header toggle
must stay enabled for the ban-for-duration feature in rs-dapi-client to work.

Changes:
- configJsonSchema.js: rateLimiter.responseHeaders object schema with
  description linking it to rs-dapi-client ban-for-duration behaviour
- getBaseConfigFactory.js: default responseHeaders.enabled=true; all derived
  configs (local/testnet/mainnet) inherit via lodashMerge — only base updated
- getConfigFileMigrationsFactory.js: migration at 4.0.0-rc.2 sets
  responseHeaders from base default for any existing config missing the field
- docker-compose.rate_limiter.yml: LIMIT_RESPONSE_HEADERS_ENABLED wired from
  ${PLATFORM_GATEWAY_RATE_LIMITER_RESPONSE_HEADERS_ENABLED:?err}, which
  convertObjectToEnvs generates automatically from the config path
- docs/config/gateway.md: table row documenting the new option

Validation: static checks confirm schema/factory/migration/env-var/docs
are consistent. Migration spec passes in main tree; worktree mocha runner
has a pre-existing ERR_REQUIRE_CYCLE_MODULE issue unrelated to these changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4.0.0-rc.2 is already a released tag; semver.gt('4.0.0-rc.2','4.0.0-rc.2')
is false, so the migration runner's gt-filter would skip it entirely for any
operator already on rc.2 — leaving responseHeaders absent and breaking
schema validation + compose :?err on startup.

4.0.0 is gt above every 4.0.0-prerelease (rc.2, beta.x, etc.), matching the
pattern established by the '3.1.0' migration key. The idempotent guard makes
a re-run a no-op.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Claudius-Maginificent

Collaborator Author

🔬 Verification evidence — server-advertised ban window

Documenting the real-network + client-side verification behind this PR. It is deliberately honest about one environmental limitation (a SHA-1 devnet TLS cert) that prevents a single-process live end-to-end run. Both ends of the chain are proven directly; the only inch between them is standard tonic gRPC trailer transport.

1. Server side — gateway emits ratelimit-reset on ResourceExhausted (live, grpcurl)

With LIMIT_RESPONSE_HEADERS_ENABLED=true (now the dashmate platform.gateway.rateLimiter.responseHeaders.enabled toggle this PR adds), the running devnet gateway advertises the reset window. Observed against the live gateway via grpcurl:

-- Request 6 --
ratelimit-limit: 5, ratelimit-remaining: 0, ratelimit-reset: 60, x-envoy-ratelimited: true
*** RESOURCE_EXHAUSTED on request 6 ***

Baseline confirms the header is genuinely gated by the flag — against a gateway without it, x-envoy-ratelimited: true is present but ratelimit-reset is absent. The toggle does exactly what it claims.

2. Client side — rs-dapi-client bans for exactly the advertised window ✅ (mock H2C)

A mock H2C gRPC server returns ResourceExhausted with ratelimit-reset: 60 in the response metadata. rs-dapi-client parses it and routes to ban_for:

rate-limited (ResourceExhausted): banning http://127.0.0.1:.../ for 60s (from RateLimit-Reset header)
ban_info: banned=true … ban expires in 60.0s    ✅ ban_for PATH CONFIRMED

The clamp [1s, 600s], the 0/garbage/missing-header fallback to the exponential ladder, and the never-shorten max-semantics are pinned by the rate_limit_ban integration suite (6 tests).
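
The parse, filter, and clamp behaviour summarised above can be sketched with the standard library alone. This is an illustrative sketch, not the `tonic::Status` implementation: `parse_rate_limit_reset` and `MAX_RATE_LIMIT_BAN_SECS` are hypothetical names, the header value is assumed to arrive as an optional string, and the `ResourceExhausted` status check is left out.

```rust
use std::time::Duration;

// Bounds from the PR description's [1s, 600s] clamp.
const MIN_RATE_LIMIT_BAN_SECS: u64 = 1;
const MAX_RATE_LIMIT_BAN_SECS: u64 = 600; // hypothetical constant name

/// Hypothetical helper: turn a raw `ratelimit-reset` value into a ban window.
/// `None` means "no usable hint, fall back to the exponential ladder".
fn parse_rate_limit_reset(raw: Option<&str>) -> Option<Duration> {
    raw.and_then(|value| value.parse::<u64>().ok()) // non-numeric -> None
        .filter(|secs| *secs > 0) // zero -> None
        .map(|secs| secs.clamp(MIN_RATE_LIMIT_BAN_SECS, MAX_RATE_LIMIT_BAN_SECS))
        .map(Duration::from_secs)
}
```

Because zero is filtered out before the clamp, the lower bound never changes a value in this sketch, which matches the QA-002 note that the minimum acts as an intent marker.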

3. Honest gap — no single-process live e2e, and why

A direct rs-dapi-client → devnet run is not achievable in this environment. The devnet gateway certificate is sha1WithRSAEncryption, and rustls (rs-dapi-client's TLS stack) hard-rejects SHA-1 signatures with no danger_accept_invalid_certs escape hatch for this case:

invalid peer certificate: UnsupportedSignatureAlgorithmContext {
  signature_algorithm_id: [6, 9, 42, 134, 72, 134, 247, 13, 1, 1, 5, 5, 0] }

OID 1.2.840.113549.1.1.5 = sha1WithRSAEncryption. grpcurl connects because Go's TLS accepts SHA-1; rustls does not. This is an inherited property of the devnet cert, not a defect in this PR.
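
The mapping from the byte array in the error to the dotted OID can be checked by hand-decoding the DER. A minimal sketch, assuming a short-form length and a single-byte first subidentifier; `decode_der_oid` is a hypothetical helper written for this comment, not part of rustls:

```rust
/// Decode a DER OBJECT IDENTIFIER (tag 0x06, short-form length) found at
/// the start of `der` into dotted-decimal notation. Trailing bytes, such
/// as the NULL parameters `05 00`, are ignored.
fn decode_der_oid(der: &[u8]) -> Option<String> {
    let (&tag, rest) = der.split_first()?;
    let (&len, rest) = rest.split_first()?;
    if tag != 0x06 || len >= 0x80 {
        return None; // not an OID, or long-form length (out of scope here)
    }
    let body = rest.get(..usize::from(len))?;
    let (&first, tail) = body.split_first()?;
    if first >= 0x80 || tail.last().map_or(false, |b| b & 0x80 != 0) {
        return None; // multi-byte first subidentifier or truncated final arc
    }
    // The first byte packs two arcs as 40 * arc1 + arc2, with arc1 at most 2.
    let arc1 = (first / 40).min(2);
    let mut arcs: Vec<u64> = vec![u64::from(arc1), u64::from(first - 40 * arc1)];
    // Remaining arcs are base-128, high bit set on every byte but the last.
    let mut acc: u64 = 0;
    for &byte in tail {
        acc = (acc << 7) | u64::from(byte & 0x7f);
        if byte & 0x80 == 0 {
            arcs.push(acc);
            acc = 0;
        }
    }
    let parts: Vec<String> = arcs.iter().map(|arc| arc.to_string()).collect();
    Some(parts.join("."))
}
```

Feeding it the `signature_algorithm_id` bytes above: `42` splits into arcs 1 and 2, `134 72` decodes to 840, and `134 247 13` decodes to 113549, followed by 1, 1, 5.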

Net

| Link in the chain | Proven | How |
|---|---|---|
| Gateway emits ratelimit-reset on over-limit | ✅ | live grpcurl vs running gateway (flag on); absent with flag off |
| Client parses header → ban_for(window) | | mock H2C metadata → ban expires in 60.0s |
| Client clamps / falls back to ladder / never shortens | | rate_limit_ban integration tests |
| Metadata transport (server → client) | | standard tonic::Status trailers — no new code path |

Both ends are proven directly; the only unproven inch is the standard gRPC trailer hop between them, which tonic already handles. Confidence: high.

🤖 Co-authored by Claudius the Magnificent AI Agent

@lklimek lklimek requested review from shumkov and thepastaclaw June 24, 2026 09:25

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Incremental review reconciling against 2c76a14. Prior blocking finding (ban_for shortening) and prior nit (debug log before ban) are both FIXED in c49adb2 — see address_list.rs:133-144 and dapi_client.rs:218-226 plus their regression tests. Prior CORS expose_headers concern remains valid: the PR's WASM/browser path depends on the gateway exposing ratelimit-* headers, but envoy.yaml.dot was not touched. One new in-scope suggestion: the migration is keyed at 4.0.0 while package.json is still 4.0.0-rc.2, so rc.2-stamped configs hit a self-validation gap until the next version bump — the author has acknowledged this in the commit message but the coordination should be confirmed by maintainers.

🟡 2 suggestion(s)

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: Expose ratelimit-* headers via CORS so browser/WASM clients can read the reset window
  The PR's whole value-add is letting rs-dapi-client read `ratelimit-reset` and pin an exact ban window from it. On the native gRPC transport this works because metadata is passed through transparently, but the WASM transport (`tonic-web-wasm-client`) goes through the browser Fetch API, which only surfaces response headers that the server lists in CORS `expose_headers`. The current list (`custom-header-1,grpc-status,grpc-message,code,drive-error-data-bin,dash-serialized-consensus-error-bin,stack-bin`) omits `ratelimit-limit`, `ratelimit-remaining`, and `ratelimit-reset`, so browser/WASM clients will silently fall back to the exponential health-ban ladder even when Envoy/RLS advertised the precise window enabled by the new dashmate `responseHeaders.enabled` toggle. This was flagged in the prior review and is not addressed in c49adb26.

In `packages/dashmate/src/config/configJsonSchema.js`:
- [SUGGESTION] packages/dashmate/src/config/configJsonSchema.js:708: responseHeaders is schema-required but the 4.0.0 migration won't fire for existing 4.0.0-rc.2 configs at the current package version
  `configJsonSchema.js:708` makes `responseHeaders` a required field, and `getConfigFileMigrationsFactory.js:1523` backfills it via the `'4.0.0'` migration. `ConfigFileJsonRepository.read()` calls `migrateConfigFile(configData, configFormatVersion, packageVersion)` (ConfigFileJsonRepository.js:48-52), and `migrateConfigFile` returns the input unchanged when `fromVersion === toVersion` (migrateConfigFileFactory.js:12-14). `package.json` is still `4.0.0-rc.2` at this commit, so any operator already stamped at `4.0.0-rc.2` will hit `fromVersion === toVersion`, skip the new migration, and then fail AJV validation on the missing `responseHeaders` block. Older configFormatVersions migrate cleanly because `semver.gt('4.0.0', '<older>')` is true. The commit message (`fix(dashmate): key responseHeaders migration at 4.0.0 not released rc.2`) shows this is a deliberate release-coordination decision: the migration is expected to fire when the package bumps to `4.0.0` final. Please confirm the release flow will bump `package.json` to a version `> 4.0.0-rc.2` at the same time this PR ships; otherwise add a no-op `4.0.0-rc.3` migration, drop `responseHeaders` from the schema `required` array, or bump the package version in this PR.

Comment thread packages/dashmate/src/config/configJsonSchema.js
lklimek and others added 2 commits June 24, 2026 11:35
…d with shared fake-transport harness

A `ResourceExhausted` response carrying `ratelimit-reset: 300` that travels
through the real `DapiClient::execute()` loop reaches `ban_for`, not
`ban_with_reason`: the new test asserts a ~300 s ban window (not the ~60 s
ladder rung) and `ban_count == 1`, closing the gap left by unit tests that
call `update_address_ban_status` directly.

Shared harness (`tests/common/mod.rs`): `FakeClient`, `FakeResponse`, and
`ScriptedRequest` (closure-driven, records hit URIs) — reused by both
`rate_limit_ban.rs` (new test) and a refactored `unimplemented_failover.rs`
(all 3 existing tests preserved and green).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
….3, not 4.0.0

The migration ships in rc.3; keying at the next version (not the far-future
4.0.0 final) matches the 3.1.0 precedent and avoids a future '4.0.0'
object-key collision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Incremental delta (458a28e) is test-only — adds a shared fake-transport harness and an end-to-end DapiClient::execute() rate-limit ban test. Prior fixes (ban_for no-shorten max-semantics, debug-log scoped inside the ban branch) remain in place and are well-covered. Two prior findings carry forward unaddressed: (1) Envoy CORS expose_headers still does not surface ratelimit-*, so browser/WASM clients cannot observe the reset window and silently fall back to the exponential ladder; (2) dashmate schema requires responseHeaders while package.json is still 4.0.0-rc.2, so existing rc.2 configs hit migrateConfigFile's fromVersion === toVersion early-return and fail schema validation. New agent findings were nitpicks/style and pre-existing spirit; dropped per scope discipline.

🟡 1 suggestion(s)

1 additional finding(s) omitted (not in diff).

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: CORS `expose_headers` does not surface `ratelimit-*`, so browser/WASM clients can never read the reset window
  This PR's whole value is letting rs-dapi-client read `ratelimit-reset` (now emitted because dashmate enables `responseHeaders`) and ban a node for the server-advertised window instead of climbing the exponential ladder. Native tonic transports pass metadata through transparently, but the WASM path uses `tonic-web-wasm-client`, which goes through the browser Fetch API. Browsers only expose non-safelisted response headers that are listed in CORS `expose_headers`. The current list at envoy.yaml.dot:201 omits `ratelimit-limit`, `ratelimit-remaining`, and `ratelimit-reset`, so browser/WASM SDK clients will silently fall back to the exponential health-ban ladder even though the headers are emitted. Add the three RateLimit headers to `expose_headers` so the new ban-window behavior actually applies on the WASM transport.

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review correction

The automated verifier output for 458a28e2 contains a blocking carried-forward finding, but the first exact-SHA post went out as COMMENT after the inline poster omitted the non-diff carried-forward item. Requesting changes so the PR state matches the verified result.

🔴 1 blocking | 🟡 1 suggestion

Blocking: packages/dashmate/src/config/configJsonSchema.js:708: responseHeaders is now schema-required, but the backfill is keyed to the 4.0.0 migration while packages/dashmate/package.json is still 4.0.0-rc.2. Existing rc.2-stamped configs hit migrateConfigFile's fromVersion === toVersion early return, skip the new migration, and then fail AJV validation for the missing responseHeaders block. Bump the package version in lockstep, add an rc migration, or relax the schema until the final-version migration can run.

Suggestion: packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201 — expose ratelimit-limit, ratelimit-remaining, and ratelimit-reset in CORS expose_headers so browser/WASM clients can read the reset window instead of falling back to the exponential ladder.

Run artifacts: /Users/claw/.openclaw/workspace/worktrees/review-platform-3951/.review/run-1782294645-458a28e2.

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Two prior findings remain unresolved at b23008e: (1) the dashmate responseHeaders schema/migration coordination gap — the migration is now keyed at 4.0.0-rc.3 while package.json is still 4.0.0-rc.2, so existing rc.2 configs short-circuit the migration runner and then fail AJV validation against the now-required responseHeaders field; (2) the CORS expose_headers list in envoy.yaml.dot still omits ratelimit-*, blocking wasm/browser clients from reading the very header this PR teaches the client to consume. The two prior Rust findings (ban_for no-shorten guard, debug-log placement inside the success branch) are confirmed fixed in this revision. Two additional review-agent nits this round (expect! interpolation, dead MIN_RATE_LIMIT_BAN_SECS constant) were dropped as low-value.

🔴 1 blocking | 🟡 1 suggestion(s)

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/dashmate/configs/getConfigFileMigrationsFactory.js`:
- [BLOCKING] packages/dashmate/configs/getConfigFileMigrationsFactory.js:1523-1541: Existing 4.0.0-rc.2 configs break: required responseHeaders is queued behind a 4.0.0-rc.3 migration but package.json is still rc.2
  This commit re-keys the responseHeaders backfill from `4.0.0` to `4.0.0-rc.3`, but `packages/dashmate/package.json` still reports `4.0.0-rc.2`. `ConfigFileJsonRepository.read()` calls `migrateConfigFile(configFileData, configFileData.configFormatVersion, version)` with `version` read from package.json; `migrateConfigFileFactory` early-returns at `fromVersion === toVersion` (`packages/dashmate/src/config/configFile/migrateConfigFileFactory.js:12-14`). So for an operator already on rc.2, fromVersion === toVersion === `4.0.0-rc.2` and the new migration never runs.

  Meanwhile this PR makes `responseHeaders` a required field of the rate-limiter schema (`packages/dashmate/src/config/configJsonSchema.js:705,708`), and the per-config schema is enforced by `new Config(name, opts, skipValidation)` inside `ConfigFileJsonRepository.read()`. The result: any existing rc.2 deployment that pulls this PR before package.json bumps to rc.3 fails AJV validation on read with a missing `responseHeaders` error, even though the migration that would have backfilled it is sitting one version ahead.

  The comment correctly documents the design intent ("runs once the package bumps to rc.3"), so the change is internally consistent — but the coupling to the rc.3 version bump must land in the same PR. Either bump `packages/dashmate/package.json` to `4.0.0-rc.3` in this PR, or temporarily drop `responseHeaders` from the schema `required` list so existing rc.2 configs still validate until the migration can run.
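
  The gate described in this finding can be modelled in a few lines. What follows is a Rust sketch of the behaviour as the review describes it (an early return when the versions are equal, then a greater-than filter on migration keys); it is not dashmate's JavaScript, and `Version`, `parse`, `precedence`, and `migrations_to_run` are hypothetical names. The ordering covers only the `X.Y.Z` and `X.Y.Z-pre.N` shapes that appear in this thread.

```rust
use std::cmp::Ordering;

/// Hypothetical minimal version: core triple plus prerelease identifiers.
#[derive(Debug)]
struct Version {
    core: [u64; 3],
    pre: Vec<String>,
}

fn parse(text: &str) -> Option<Version> {
    let (core, pre) = match text.split_once('-') {
        Some((core, pre)) => (core, pre.split('.').map(str::to_owned).collect()),
        None => (text, Vec::new()),
    };
    let mut parts = core.split('.').map(|part| part.parse::<u64>().ok());
    let core = [parts.next()??, parts.next()??, parts.next()??];
    if parts.next().is_some() {
        return None;
    }
    Some(Version { core, pre })
}

/// Numeric identifiers compare as numbers and sort below alphanumeric ones.
fn cmp_ident(a: &str, b: &str) -> Ordering {
    match (a.parse::<u64>(), b.parse::<u64>()) {
        (Ok(x), Ok(y)) => x.cmp(&y),
        (Ok(_), Err(_)) => Ordering::Less,
        (Err(_), Ok(_)) => Ordering::Greater,
        (Err(_), Err(_)) => a.cmp(b),
    }
}

fn precedence(a: &Version, b: &Version) -> Ordering {
    a.core.cmp(&b.core).then_with(|| match (a.pre.is_empty(), b.pre.is_empty()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater, // a release outranks its prereleases
        (false, true) => Ordering::Less,
        (false, false) => {
            for (x, y) in a.pre.iter().zip(&b.pre) {
                let ord = cmp_ident(x, y);
                if ord != Ordering::Equal {
                    return ord;
                }
            }
            a.pre.len().cmp(&b.pre.len())
        }
    })
}

/// Keys that would run: none when `from == to`, else every key above `from`.
fn migrations_to_run<'a>(keys: &[&'a str], from: &str, to: &str) -> Vec<&'a str> {
    if from == to {
        return Vec::new(); // the early return that skips the rc.3 backfill
    }
    let from_version = match parse(from) {
        Some(version) => version,
        None => return Vec::new(),
    };
    keys.iter()
        .copied()
        .filter(|key| {
            parse(key).map_or(false, |k| precedence(&k, &from_version) == Ordering::Greater)
        })
        .collect()
}
```

  In this model an rc.2 config read by an rc.2 package runs nothing, while the same config read by an rc.3 package runs the `4.0.0-rc.3` key, which is the coupling to the version bump that the finding asks to land together.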

In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: CORS expose_headers omits ratelimit-*, so wasm/browser clients can't read the reset window this PR depends on
  This PR's value is letting clients read `ratelimit-reset` (now emitted because dashmate enables responseHeaders) and ban a node for the server-advertised window instead of climbing the exponential ladder. Native tonic transports surface response metadata transparently, but browser/wasm clients consuming the gateway via grpc-web can only read non-CORS-safelisted response headers that Envoy explicitly enumerates in `expose_headers`. The current list (`custom-header-1,grpc-status,grpc-message,code,drive-error-data-bin,dash-serialized-consensus-error-bin,stack-bin`) does not include `ratelimit-limit`, `ratelimit-remaining`, or `ratelimit-reset`, so a wasm-sdk consumer of this same gateway sees a `429`/`ResourceExhausted` with no readable reset hint — defeating the per-window ban path for that transport.

  This is in scope: dashmate ships the envoy template, dashmate now enables emission of these headers by default, and the entire chain (`configJsonSchema` → `docker-compose.rate_limiter.yml` → envoy → client) is being changed in this PR for the purpose of consuming exactly that header. Add the three `ratelimit-*` headers to `expose_headers`.

Comment thread packages/dashmate/configs/getConfigFileMigrationsFactory.js
…it) + expose RateLimit-* so browser clients can read the ban window

cors+grpc_web must precede ratelimit so the over-limit local reply is CORS-exposed
+ grpc-web-framed, enabling wasm/grpc-web clients to read ratelimit-reset for
the same node-backoff the native client does (native path unaffected — security-reviewed).
Note: the grpc_web-encode behavior on a trailers-only over-limit local reply is
spec-inferred and needs a smoke test on the pinned Envoy build (see PR #3951).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shumkov previously approved these changes Jun 24, 2026
…sted so browser clients ban for the advertised window

An HCM local_reply_config mapper rewrites the grpc-web rate-limit 429 (scoped via x-grpc-web, so JSON-RPC and native paths stay untouched) into HTTP 200 + grpc-status:8, co-locating grpc-status and ratelimit-reset as HTTP headers (trailers-only, empty body). tonic-web then builds a ResourceExhausted Status with ratelimit-reset in its metadata via Status::from_header_map, giving browser clients the same node-backoff the native client gets — no client change. Live-verified both gates against the pinned Envoy build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015i49U5QMneGDJp3AHFh9po

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Cumulative review: rs-dapi-client rate-limit-driven flat-ban path is well-implemented and well-tested; the latest delta (gateway filter reorder, grpc-web trailers-only 429 rewrite, expose_headers, .env.*.bak gitignore) is narrow and live-verified. Prior CORS exposure gap is fixed at envoy.yaml.dot:259. One prior blocking issue remains: the responseHeaders schema requirement is shipped with a migration keyed at 4.0.0-rc.3 while package.json still reports 4.0.0-rc.2, so existing rc.2 configs will fail validation until the package version is bumped.

🚫 Requesting changes: 1 blocking carried-forward finding remains valid on 5d45556c.

Carried-forward finding still valid

  • Existing 4.0.0-rc.2 configs break: required responseHeaders is queued behind a 4.0.0-rc.3 migration, but package.json is still 4.0.0-rc.2 (packages/dashmate/configs/getConfigFileMigrationsFactory.js:1523-1541)

The responseHeaders backfill is re-keyed to 4.0.0-rc.3, but packages/dashmate/package.json:3 still reports 4.0.0-rc.2. ConfigFileJsonRepository.read() calls migrateConfigFile(configFileData, configFileData.configFormatVersion, version) with version from package.json, and migrateConfigFileFactory.js:12 early-returns when fromVersion === toVersion. For an operator on an existing rc.2 config, the rc.3 migration never runs — yet configJsonSchema.js:708 already lists responseHeaders in the required array, so the config fails schema validation on load. The in-code comment at lines 1530-1533 documents the intent to wait for a future rc.3 bump, but the schema requirement is live now. Either bump dashmate to 4.0.0-rc.3 in this PR (so the migration runs and backfills responseHeaders for existing rc.2 configs), or drop responseHeaders from the required array until the rc.3 release cuts.

I am not re-opening the same inline thread here because the finding has already been raised on this PR; this review records exact-SHA coverage and keeps the blocking status attached to the current head.

Prior findings reconciled

  • FIXED: CORS expose_headers omits ratelimit-*, so wasm/browser clients can't read the reset window this PR depends on — Fixed at packages/dashmate/templates/platform/gateway/envoy.yaml.dot:259 — expose_headers now includes ratelimit-reset, ratelimit-limit, and ratelimit-remaining, allowing tonic-web-wasm-client to surface the reset metadata.

CodeRabbit inline findings: 0; no CodeRabbit reactions requested.

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Cumulative review: prior-2 (CORS expose_headers) is FIXED at envoy.yaml.dot:259, which lists ratelimit-reset/limit/remaining so browser/wasm clients can read the ban window. Prior-1 (responseHeaders schema/migration vs package.json) is technically still open but now explicitly documented as intentionally deferred to the next version bump (mirroring the 3.1.0 pattern); carried forward as a suggestion. The new grpc-web local_reply mapper plus filter reorder are correctly scoped (gated by x-grpc-web + HTTP 429 + rateLimiter.enabled) and the trailers-only rewrite preserves grpc-status:8 and ratelimit-reset for tonic-web's Status::from_header_map. No new in-scope blockers; remaining agent items were nitpick-grade operational hardening.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/dashmate/configs/getConfigFileMigrationsFactory.js`:
- [SUGGESTION] packages/dashmate/configs/getConfigFileMigrationsFactory.js:1523-1541: responseHeaders backfill keyed at 4.0.0-rc.3 leaves a validation gap until the version bump lands
  The schema marks `platform.gateway.rateLimiter.responseHeaders` as required (configJsonSchema.js:708) and the base config defaults `responseHeaders.enabled: true` (getBaseConfigFactory.js:244), but the backfill migration is keyed at `4.0.0-rc.3` while `packages/dashmate/package.json` is still `4.0.0-rc.2`. Per `migrateConfigFileFactory`, a migration only runs when `semver.gt(version, fromVersion)`, so an existing rc.2 deployment running this PR's dashmate (still reporting rc.2) skips the new migration and then fails schema validation on the missing required field. The added comment now explicitly documents the intent — mirror the 3.1.0/3.1.0-dev.1 pattern and let the backfill fire once dashmate bumps to rc.3 — which is fine as long as the release process couples this PR with the rc.3 version bump. If they could ship decoupled (e.g. a dashmate command run between merge and the rc.3 cut), either bump the package version in the same release cycle as this PR or make `responseHeaders` non-required in the schema with a defaulted reader. Carried forward from prior review at reduced severity now that the deferred-intent rationale is in-tree.

Comment thread packages/dashmate/configs/getConfigFileMigrationsFactory.js
@Claudius-Maginificent

Collaborator Author

Live devnet verification

Verified this change end-to-end against a running local devnet gateway already serving the committed Envoy config (filter reorder + local_reply_config + expose_headers). The ratelimit-reset/-limit/-remaining headers are exposed on both wires: native gRPC carries them in response metadata (trailers, with ResourceExhausted, when over-limit), and grpc-web gets them via CORS expose_headers plus the local_reply_config that rewrites the over-limit reply to HTTP 200 + grpc-status: 8 + ratelimit-reset co-located in the HTTP headers with a 0-byte body — exactly the trailers-only shape the wasm client parses. Running the rs-sdk test-vector generation through the modified Envoy showed no regression to normal proxied SDK traffic. A concurrent batch even captured rs-dapi-client's own ban_for logic firing live — banning the node from the ratelimit-reset it read off the Envoy gRPC metadata — confirming the feature works with zero client-code changes (Envoy-only).

🤖 Co-authored by Claudius the Magnificent AI Agent

@lklimek lklimek merged commit 48f0cc3 into v3.1-dev Jun 25, 2026
5 checks passed
@lklimek lklimek deleted the fix/rs-dapi-client-rate-limit-rotate branch June 25, 2026 09:29
@codecov

codecov Bot commented Jun 25, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.17%. Comparing base (96cba16) to head (5d45556).
⚠️ Report is 5 commits behind head on v3.1-dev.

Additional details and impacted files
@@              Coverage Diff              @@
##           v3.1-dev    #3951       +/-   ##
=============================================
+ Coverage     52.54%   87.17%   +34.62%     
=============================================
  Files            11     2629     +2618     
  Lines          1707   327221   +325514     
=============================================
+ Hits            897   285265   +284368     
- Misses          810    41956    +41146     
| Components | Coverage Δ |
|---|---|
| dpp | 87.70% <ø> (∅) |
| drive | 86.14% <ø> (∅) |
| drive-abci | 89.45% <ø> (∅) |
| sdk | ∅ <ø> (∅) |
| dapi-client | ∅ <ø> (∅) |
| platform-version | ∅ <ø> (∅) |
| platform-value | 92.20% <ø> (∅) |
| platform-wallet | ∅ <ø> (∅) |
| drive-proof-verifier | 49.55% <ø> (∅) |

Labels

ready for final review Ready for the final review. If AI was involved in producing this PR, it has already had a reviewer.


5 participants