Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ Live at dmarc.mx | Repo: github.com/schmug/dmarcheck
- `src/views/` — HTML generation via template literals (styles.ts, scripts.ts, components.ts, html.ts, favicon.ts)
- `components.ts` — `generateCreature(size, mood, partyHat?)` helper and `gradeToMood()` mapping
- `markdown.ts` — markdown renderings served when `Accept: text/markdown` (landing, /check report, /scoring, /learn, /docs/api)
- `src/rate-limit.ts` — Cache API-based rate limiter (10 req/IP/60s)
- `src/rate-limit.ts` — per-identity rate limiter (free 10/60s, pro 60/3600s). Primary path is an atomic Durable Object counter (`src/rate-limit-do.ts` `RateLimiterDO`, bound as `RATE_LIMITER`); its single-threaded RPC serializes increments so a concurrent burst under one identity can't exceed the ceiling (GHSA-v7qc-7qh8-h69g — replaced a non-atomic Cache-API read-modify-write). `checkRateLimit(identity, config, namespace?)` falls back to the in-memory limiter when the binding is absent (self-host deploys, Node test pool)

## Agent discovery

Expand Down
13 changes: 5 additions & 8 deletions THREAT_MODEL.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ flowchart LR
end

subgraph worker["dmarcheck Worker — trust boundary (Cloudflare edge)"]
rl["Rate limiter / cache (E9)"]
rl["Rate limiter (E9)"]
scan["Scan API + orchestrator (E1, E2)"]
auth["Auth & session (E5)"]
dash["Dashboard / history CRUD (E6)"]
Expand Down Expand Up @@ -93,14 +93,14 @@ flowchart LR
| entry_point | description | trust_boundary | reachable_assets |
|---|---|---|---|
| E1 — Public scan API (`/check`, `/api/check`, `/api/check/stream`, `/badge`, `/mx/:slug`) | Attacker controls `?domain`, `?selectors`, `?format`, `Accept`; drives DNS lookups + HTML/JSON/CSV/SSE rendering | unauth HTTP → app logic; Worker → upstream DNS | grade integrity, service availability |
| E2 — MCP handler (`POST /mcp` `scan_domain`) | Arbitrary JSON-RPC body; `domain`/`dkim_selectors` drive a full scan. No bearer requirement and **no rate-limit middleware** (contrast `/check`) | unauth HTTP → Worker → DNS/HTTP | service availability, grade integrity |
| E2 — MCP handler (`POST /mcp` `scan_domain`) | Arbitrary JSON-RPC body; `domain`/`dkim_selectors` drive a full scan. No bearer requirement, but rate-limited per-IP by `rateLimitMiddleware` (same anon bucket as `/api/check`, `src/index.ts` `app.use("/mcp", …)`) | unauth HTTP → Worker → DNS/HTTP | service availability, grade integrity |
| E3 — Analyzer outbound fetch (MTA-STS, security.txt, BIMI) | Scanned domain interpolated into upstream HTTPS URLs; MTA-STS uses `redirect: "manual"`, security.txt uses `redirect: "follow"` | Worker → attacker-named upstream HTTP | internal network, service integrity |
| E4 — Outbound webhook dispatch | Fetches a Pro user's saved `webhook.url`; save path validates only `protocol === "https:"` | authenticated user → Worker outbound to arbitrary host | internal network, service integrity |
| E5 — Auth & session (session cookie JWT, bearer API key, Cloudflare Access JWT) | HS256 session HMAC + exp; `dmk_` API key SHA-256 lookup; `jose` RS256 Access JWT (preview only, fail-closed) | unauth → authenticated identity | all authenticated assets |
| E6 — Dashboard CRUD + history/bulk-scan APIs (D1, per-user) | Authenticated reads/writes scoped by `WHERE user_id = ?` / `getDomainByUserAndName` | authenticated session → another user's data | scan history, API keys, user/billing data |
| E7 — Stripe webhook (`POST /webhooks/stripe`) | Raw-body HMAC-SHA256 verify, 5-min skew, event-id idempotency, then mutates subscription state | unauth internet → billing state mutation | subscription state, billing data |
| E8 — HTML report rendering (`src/views/*`) | User/DNS-derived values interpolated into template-literal HTML | scan data → rendered HTML in a viewer's browser | viewer session, grade integrity |
| E9 — Rate limiter / cache | Keyed on `CF-Connecting-IP` (`ip:<x>`) or `user:<id>`; Cache API store with in-memory fallback | spoofable identity / shared cache key | service availability |
| E9 — Rate limiter | Keyed on `CF-Connecting-IP` (`ip:<x>`) or `user:<id>`; per-identity Durable Object atomic counter (`RateLimiterDO`, single-threaded RPC) with in-memory fallback when the binding is absent | spoofable identity | service availability |
| E10 — CI/CD + deploy (GitHub Actions: ci, codeql, migrate, release, deploy-mta-sts; Cloudflare Git integration) | `pull_request` on a public repo; `main`-gated jobs hold prod D1 / deploy / release tokens | PR/main → CI runner → prod | infra tokens, prod D1, releases |
| E11 — Autonomous-routine PR merge path | External routine identity opens + auto-merges PRs; CODEOWNERS + fail-closed gate | external automation → `main` | analyzers, orchestration, scoring, CI |

Expand All @@ -116,7 +116,7 @@ flowchart LR
| T6 | Secret or PII exposure via logs or error responses | remote_unauth | E1, E5, E7 | secrets, user/billing data | high | possible | unmitigated | Sentry capture; no documented scrubbing audit | |
| T7 | Billing privilege escalation (free → paid) via forged or replayed Stripe webhook | remote_unauth | E7 | subscription state | high | rare | partially_mitigated | raw-body HMAC-SHA256 verify, constant-time compare, 5-min skew, event-id idempotency | |
| T8 | Supply-chain / CI compromise escalating to prod D1 write or deploy | supply_chain | E10 | prod D1, infra tokens, releases | high | rare | partially_mitigated | SHA-pinned actions, ubuntu-latest only, explicit `permissions:` blocks, secrets only on `main`-gated jobs | |
| T9 | Rate-limit bypass → DNS amplification / scan abuse via unauthenticated, unmetered `/mcp` and non-`/check` scan routes | remote_unauth | E2, E9 | service availability, upstream DNS | medium | likely | partially_mitigated | `CF-Connecting-IP` keying on `/check`; XFF no longer trusted | #71, #123, #59 |
| T9 | Rate-limit bypass → DNS amplification / scan abuse: an unauthenticated caller rotating source IPs earns a fresh per-IP bucket on each scan route | remote_unauth | E2, E9 | service availability, upstream DNS | medium | possible | partially_mitigated | every scan-triggering route carries `rateLimitMiddleware` (`/check`, `/api/check`, `/api/bulk-scan`, SSE `/api/check/stream`, `/badge`, `/mcp`, `/api/domain/*`); `CF-Connecting-IP` keying (XFF no longer trusted); per-identity Durable Object atomic counter closes the Cache-API read-modify-write burst-bypass window (GHSA-v7qc-7qh8-h69g). Residual: IP-rotation (botnet) still gets per-IP buckets | #71, #123, #59, GHSA-v7qc-7qh8-h69g |
| T10 | Stored/reflected XSS via unescaped scan data rendered into the HTML report | remote_unauth | E8, E1 | viewer session, grade integrity | medium | possible | partially_mitigated | `esc()` on interpolated values; per-request CSP nonce + `strict-dynamic`; `default-src 'none'` | #59, #281, 0fc81e2 |
| T11 | Denial of service via DNS resource exhaustion or scan-abort on attacker-controlled domains | remote_unauth | E1, E3 | service availability | medium | possible | partially_mitigated | SPF lookup-limit early-exit; per-analyzer failure isolation (one analyzer error can't abort the scan); `DnsLookupError` catch on external lookups | #90, #354 |
| T12 | Login CSRF / OAuth-flow tampering | remote_unauth | E5 | user session | medium | rare | mitigated | OAuth `state` cookie (HttpOnly/Secure/SameSite=Lax) + strict callback match | #150 |
Expand All @@ -138,9 +138,6 @@ flowchart LR
- **Webhook SSRF posture (T2):** Is the outbound-webhook feature intended to
reach arbitrary user hosts, or should it enforce a public-IP/host allowlist
and `redirect: "manual"`? Does the dispatch fetch currently follow redirects?
- **`/mcp` rate limiting (T9):** Is the unauthenticated MCP scan path
intentionally exempt from `rateLimitMiddleware`, or an oversight? Same
question for `/badge` and `/mx/:slug`.
- **Bot-identity split (T5):** Has #299 landed? Until the routine runs as a
non-admin identity, the CODEOWNERS gate is advisory (admin bypasses the
ruleset).
Expand All @@ -163,7 +160,7 @@ flowchart LR
| mitigation | threat_ids | closes_class | effort |
|---|---|---|---|
| Enforce a public-host allowlist + `redirect: "manual"` + private-IP/DNS-rebinding guard on all server-side fetches built from user input | T2, T3 | partial | M |
| Apply `rateLimitMiddleware` to every scan-triggering route (`/mcp`, `/badge`, `/mx/:slug`, SSE) — centralize "any route that performs a DNS scan is rate-limited" | T9, T11 | yes | S |
| ✅ Done — `rateLimitMiddleware` applied to every scan-triggering route (`/check`, `/api/check`, `/api/bulk-scan`, SSE `/api/check/stream`, `/badge`, `/mcp`, `/api/domain/*`); `/mx/:slug` is a static provider page (no scan, no limiter needed) | T9, T11 | yes | S |
| Centralize per-user row scoping in a query helper so no handler can issue an unscoped read/write of a tenant-owned table | T4 | yes | M |
| Keep all HTML interpolation behind `esc()` and the CSP nonce; lint/block raw user input inside inline `<script>` or unescaped attributes | T10 | yes | S |
| Audit Sentry/error paths for secret + PII scrubbing; never echo internal state in 5xx bodies | T6 | partial | S |
Expand Down
7 changes: 7 additions & 0 deletions src/env.ts
Original file line number Diff line number Diff line change
@@ -1,5 +1,12 @@
import type { RateLimiterDO } from "./rate-limit-do.js";

export interface Env {
DB: D1Database;
// Atomic per-identity rate-limit counter (GHSA-v7qc-7qh8-h69g). Optional so
// self-host deploys without the binding fall back to the in-memory limiter
// (see checkRateLimit in src/rate-limit.ts); the hosted dmarc.mx worker has
// it wired in wrangler.toml.
RATE_LIMITER?: DurableObjectNamespace<RateLimiterDO>;
WORKOS_CLIENT_ID: string;
WORKOS_CLIENT_SECRET: string;
WORKOS_REDIRECT_URI: string;
Expand Down
15 changes: 11 additions & 4 deletions src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,11 @@ import { JS } from "./views/scripts.js";
import { CSS } from "./views/styles.js";
import { fireBulkScanWebhooks } from "./webhooks/triggers.js";

// Durable Object class for the atomic rate limiter (GHSA-v7qc-7qh8-h69g).
// Must be re-exported from the Worker entry module so the `RATE_LIMITER`
// binding in wrangler.toml can resolve its `class_name`.
export { RateLimiterDO } from "./rate-limit-do.js";

// The Hono app is exported for tests (which call `app.request(...)`).
// Runtime Workers use the Sentry-wrapped default export below, which adds
// cron (`scheduled`) alongside `fetch`.
Expand Down Expand Up @@ -409,10 +414,12 @@ type RateLimitBlockedResponder = (
export function rateLimitMiddleware(onBlocked: RateLimitBlockedResponder) {
return async (c: Context, next: () => Promise<void>) => {
const { identity, config } = await resolveRateLimitScope(c);
const result = await checkRateLimit(identity, config);
if (result.pendingWrite) {
c.executionCtx.waitUntil(result.pendingWrite.catch(() => {}));
}
// The Durable Object RPC is awaited end-to-end, so the counter is durably
// updated before the decision is used — no deferred write to drain.
// `c.env` is always present at runtime; the optional chain keeps the
// limiter working in lightweight unit tests that call `app.request(path)`
// without an env (falls back to the in-memory limiter).
const result = await checkRateLimit(identity, config, c.env?.RATE_LIMITER);

const headers = rateLimitHeaders(result);

Expand Down
71 changes: 71 additions & 0 deletions src/rate-limit-do.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
import { DurableObject } from "cloudflare:workers";
import type { RateLimitResult } from "./rate-limit.js";

// Atomic per-identity rate-limit counter (GHSA-v7qc-7qh8-h69g).
//
// The previous limiter was a non-atomic read-modify-write on the Cache API
// (`match` → `count++` → `put`). The Cache API has no atomic increment/CAS, so
// a concurrent burst under one identity could all read the same stale count
// and each write `count + 1`, letting the effective ceiling exceed the
// configured limit. A Durable Object's single-threaded execution serializes
// the read-modify-write across isolates and colos, which is the canonical
// Workers primitive for an atomic counter.
//
// One DO instance per identity: callers route with `getByName("ip:<x>")` /
// `getByName("user:<id>")`, so each instance owns exactly one bucket (a single
// row). The whole `increment` body is synchronous SQL — it runs to completion
// without yielding, so overlapping RPCs cannot interleave their read and write.
export class RateLimiterDO extends DurableObject {
constructor(ctx: DurableObjectState, env: Cloudflare.Env) {
super(ctx, env);
ctx.blockConcurrencyWhile(async () => {
this.ctx.storage.sql.exec(
`CREATE TABLE IF NOT EXISTS bucket (
id INTEGER PRIMARY KEY,
count INTEGER NOT NULL,
reset_at INTEGER NOT NULL
)`,
);
});
}

// Atomically increments this identity's counter for the current window and
// returns the resulting decision. `limit`/`windowSec` are passed per call so
// the same DO class serves both tiers (free 10/60, pro 60/3600) — the bucket
// is keyed entirely by the DO instance (identity), not the window size.
increment(limit: number, windowSec: number): RateLimitResult {
const nowSec = Math.floor(Date.now() / 1000);
const existing = this.ctx.storage.sql
.exec<{ count: number; reset_at: number }>(
"SELECT count, reset_at FROM bucket WHERE id = 1",
)
.toArray()[0];

let count: number;
let resetAt: number;
if (existing && existing.reset_at > nowSec) {
count = existing.count + 1;
resetAt = existing.reset_at;
} else {
// Fresh window: no row yet, or the previous window has elapsed.
count = 1;
resetAt = nowSec + windowSec;
}

this.ctx.storage.sql.exec(
`INSERT INTO bucket (id, count, reset_at) VALUES (1, ?, ?)
ON CONFLICT(id) DO UPDATE SET count = excluded.count, reset_at = excluded.reset_at`,
count,
resetAt,
);

return {
allowed: count <= limit,
remaining: Math.max(0, limit - count),
limit,
windowSec,
resetAt,
count,
};
}
}
102 changes: 30 additions & 72 deletions src/rate-limit.ts
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
import type { RateLimiterDO } from "./rate-limit-do.js";

export interface RateLimitConfig {
limit: number;
windowSec: number;
Expand All @@ -9,7 +11,10 @@ export interface RateLimitResult {
limit: number;
windowSec: number;
resetAt: number;
pendingWrite?: Promise<void>;
// Post-increment value of the identity's counter for the current window.
// Not surfaced in headers; exposed for observability and to let tests assert
// the atomic counter reached the expected total under concurrency.
count: number;
}

export type PlanTier = "free" | "pro";
Expand All @@ -34,86 +39,38 @@ const SWEEP_INTERVAL = 100;
export async function checkRateLimit(
identity: string,
config: RateLimitConfig,
namespace?: DurableObjectNamespace<RateLimiterDO>,
): Promise<RateLimitResult> {
try {
if (typeof caches !== "undefined" && caches.default) {
return await checkRateLimitCache(identity, config);
// Durable Object is the only atomic primitive on Workers: its single-threaded
// RPC serializes the read-modify-write so a concurrent burst under one
// identity cannot exceed the limit (GHSA-v7qc-7qh8-h69g). The Cache API
// counter it replaces had no atomic increment, so bursts could bypass the
// ceiling.
if (namespace) {
try {
return await checkRateLimitDO(identity, config, namespace);
} catch {
// DO unreachable (transient error, or no binding at runtime). Fall back
// to the in-memory limiter so requests stay bounded rather than failing
// open or 500ing.
}
} catch {
// Cache API unavailable — fall through to in-memory
}
// Graceful fallback for environments without the DO binding (self-host
// deploys that strip it, and the Node test pool). Atomic within a single
// isolate; not shared across isolates/colos.
return checkRateLimitMemory(identity, config);
}

interface StoredPayload {
count: number;
resetAt: number;
}

function parseStoredPayload(raw: string): StoredPayload | null {
try {
const parsed = JSON.parse(raw) as unknown;
if (
parsed &&
typeof parsed === "object" &&
typeof (parsed as StoredPayload).count === "number" &&
typeof (parsed as StoredPayload).resetAt === "number"
) {
return parsed as StoredPayload;
}
} catch {
// Legacy integer-only bodies from a previous deploy won't parse as JSON.
// Treat them as a fresh window — worst case a caller gets one extra
// quota bucket during the seconds it takes for the old entry to age out.
}
return null;
}

async function checkRateLimitCache(
async function checkRateLimitDO(
identity: string,
config: RateLimitConfig,
namespace: DurableObjectNamespace<RateLimiterDO>,
): Promise<RateLimitResult> {
const cache = caches.default;
const key = new Request(
`https://dmarc-mx-ratelimit.internal/${encodeURIComponent(identity)}`,
);

const cached = await cache.match(key);
const nowSec = Math.floor(Date.now() / 1000);
let count = 0;
let resetAt = nowSec + config.windowSec;

if (cached) {
const stored = parseStoredPayload(await cached.text());
if (stored && stored.resetAt > nowSec) {
count = stored.count;
resetAt = stored.resetAt;
}
}

count++;
const allowed = count <= config.limit;
const remaining = Math.max(0, config.limit - count);
const ttl = Math.max(1, resetAt - nowSec);

const response = new Response(JSON.stringify({ count, resetAt }), {
headers: {
"Cache-Control": `s-maxage=${ttl}`,
},
});
// ⚡ Bolt Optimization: Do not await cache.put on the critical path.
// Return the promise so the caller can pass it to executionCtx.waitUntil(),
// removing Cache API write latency from every rate-limited request.
const pendingWrite = cache.put(key, response);

return {
allowed,
remaining,
limit: config.limit,
windowSec: config.windowSec,
resetAt,
pendingWrite,
};
// One DO instance per identity bucket (`ip:<x>` / `user:<id>`). `getByName`
// maps the identity string to a stable instance; the RPC returns the
// post-increment decision for the current window.
const stub = namespace.getByName(identity);
return stub.increment(config.limit, config.windowSec);
}

function checkRateLimitMemory(
Expand Down Expand Up @@ -156,6 +113,7 @@ function checkRateLimitMemory(
limit: config.limit,
windowSec: config.windowSec,
resetAt,
count,
};
}

Expand Down
Loading