diff --git a/skills/review-publisher/PLAYBOOK.md b/skills/review-publisher/PLAYBOOK.md
new file mode 100644
index 000000000..480b4f1b4
--- /dev/null
+++ b/skills/review-publisher/PLAYBOOK.md
@@ -0,0 +1,307 @@
+# How to Review a Publisher PR
+
+> **Audience: AI coding agents.** This is an operational playbook meant to be executed by an agent,
+> not a human contributor guide. It's also the source doc behind the `review-publisher` skill.
+
+A review playbook for PRs that **add a new publisher** or **add/change a parser version**. The matching
+authoring process is [`how_to_add_a_publisher.md`](/docs/how_to_add_a_publisher.md). You're reviewing
+its output, so the checks below map onto the promises that process makes.
+
+All mechanics run through one driver. `<skill>` below is the skill directory that SKILL.md gave you
+(`${CLAUDE_SKILL_DIR}` is substituted only there). The cross-doc links above and below are
+repo-root-relative, so read them from the repo root, where you already are.
+
+    python "<skill>/scripts/review.py" {crawl,sweep,show,adjudicate,status,payload} <cc>.<Class>
+
+The driver enforces the bookkeeping (what was crawled, what was swept, what is still un-adjudicated);
+this playbook is about the part that stays yours: **judgment**. The split is deliberate — the driver
+may refuse, warn, and count, but it never decides; you never count by hand.
+
+**The unit of review is the PR's diff, not the whole publisher.** Review what *this PR* changes. On a
+parser-version bump, scope your reading to the new version and how it differs from the old — don't
+re-litigate code the PR doesn't touch. A new publisher is all-new, so there the diff *is* the whole file.
+
+**A multi-publisher PR is N independent reviews, not one.** Run §2 once per publisher — each gets its
+own crawl, sweep, and `status` READY. A blocker on one publisher does **not** discharge the body checks
+on the others: the loud finding is a distraction from the clean-reading publishers, which get the *same*
+full treatment, never a lighter pass.
+
+## What you're protecting
+
+> **The extracted `ArticleBody` must mirror the real article — no missing content, no extra content.**
+
+This isn't aesthetic. Fundus maps extracted text back to the source HTML for annotation, so a dropped
+paragraph or a leaked photo caption corrupts that mapping. Over-capture is a blocker, not a nit.
+
+## Rules
+
+- **Don't run `pytest`, `mypy`, `ruff`, or any other checks locally — the GitHub CI covers all of
+  that.** Your value-add is confirming the parser is correct on **live** articles; the unit test only
+  checks one frozen HTML snapshot, so a green test proves nothing about today's site or other layouts.
+- **Cite everything.** Every finding needs quotable evidence: the verbatim dropped/leaked text, the
+  article URL, and the offending selector. Adjudication notes are part of that trail — they land in
+  `findings.json` verbatim.
+- **Keep scratch out of the repo.** `review.json` and any throwaway dump go in the cache dir the
+  driver prints (`<tempdir>/fundus-review/…`) or any OS temp dir — never the working directory.
+- **Never truncate driver output.** Don't slice `crawl`'s output (`[:N]`) or cap the printed text: a
+  slice that cuts off partway through the draw reads clean on articles you never saw. The state file
+  records what was actually cached, and `status` reconciles the counts for you.
+
+## Inputs
+
+- **PR number** and whether it's **your own** PR (changes the verdict event — see §4). **If the user
+  didn't name a PR, resolve it before doing anything else:** `gh pr view --json number,title,author`
+  reads the current branch's open PR; show it and confirm. If the branch has no open PR, ask — don't
+  guess. (The `author` also settles the own-PR question.)
+- The publisher(s) under review as `PublisherCollection.<cc>.<Class>`, and the parser version(s) touched.
+
+## 1. Static read
+
+```bash
+gh pr checkout <PR_NUMBER>
+gh pr diff <PR_NUMBER>
+```
+
+Skim the diff and sanity-check the parser:
+
+- `@attribute` return types match [`attribute_guidelines.md`](/docs/attribute_guidelines.md).
+- **`@attribute(validate=False)` bypasses CI.** Type conformance is enforced by the unit tests for
+  *validated* attributes only. Any attribute marked `validate=False` is checked by **nobody but you** —
+  read its logic and return value by hand.
+- **`free_access` / paywall.** If the publisher runs a subscription model, `free_access` must be
+  implemented (read from `isAccessibleForFree` in the `ld+json`). A missing or wrong `free_access` on a
+  premium publisher is a blocker. If the publisher is fully free, there's nothing to check here.
+- **`VALID_UNTIL` is *not* inherited** — every `BaseParser` subclass defaults to `date.max`. When a PR
+  adds a new version for a layout change, the *previous* version must get an explicit `VALID_UNTIL`
+  (the day before the change), and the newest must leave it unset. (Subclassing another publisher's
+  version and leaving `VALID_UNTIL` unset is fine — that's the intended `date.max`.)
+- **Version bump fits the change.** Selector-only fix → minor bump (`V1_1(V1)`); new or substantially
+  changed attributes → major bump (`V2`). Flag a `V2` that's really just selectors, or a minor bump
+  that quietly changes attribute behavior.
+- **Shared-utility changes** (`parser/utility.py`: `generic_topic_parsing`, `apply_result_filter`,
+  `image_extraction`, …) affect **every** publisher, not just this one. Check the call sites and rely
+  on the full (CI-run) suite staying green.
+
+## 2. Crawl live articles and verify the body
+
+### Crawl once
+
+```bash
+python "<skill>/scripts/review.py" crawl <cc>.<Class>             # pool 50 -> 2 articles per layout
+python "<skill>/scripts/review.py" crawl <cc>.<Class> --pool 80   # widen the draw if rare layouts are missed
+```
+
+One live crawl per publisher; everything after it replays the cache, so the read set and the swept
+set are the same draw by construction. The crawl draws a **candidate pool** (`--pool`, default 50)
+and reduces it to a **layout-diverse subset** — **two representatives per distinct layout** the
+publisher uses (its most typical article plus its most different one), **never dropping a layout** —
+so you review each layout twice over instead of the first *N* near-duplicate news stories. The number
+of layouts is discovered automatically; if a near-uniform publisher collapses to fewer than three
+articles, it falls back to the three most-diverse articles so the coherence read and the leak scan
+still have something to work with. It prints the **Tier-1 view** (url / title / authors / topics /
+image count / body, each tagged with its layout id and `rep`/`extra` role) and caches the selected
+articles' exact bytes. Sampling needs the whole pool up front, so an interrupted crawl caches nothing
+— just re-run it (the crawl is the only networked step).
+
+**Layout coverage is the gate, and the sampler is what delivers it.** The subset must span the
+layouts that break parsers — a straight news piece, an opinion/column, a **listicle or bullet-list**
+piece, and an **image-heavy** one. The pool→sample draw surfaces these automatically; if a layout you
+know exists is still missing, **widen the draw (`--pool`) and re-crawl**. If the publisher is too
+small or uniform to cover all four, note that in the review. Don't prompt the user for a number; the
+default pool is the floor and coverage decides the rest.
+
+**Stop condition:** 0 articles after a fair attempt means the sources or parser are broken — that is
+itself a blocker-level finding; report it, don't silently stall. Many publishers block generic
+fetchers, so inspect via the cached html (`show` prints the per-article file paths), never a manual
+`urllib`/`requests` fetch.
+
+### Tier 1 — coherence read (all articles, from the crawl output)
+
+Read each extracted body. It should read like the article: no dangling sentences, no abrupt jumps, no
+sentence referencing something that isn't there ("…fell into either of two groups:" followed by
+nothing), and no boilerplate (newsletter sign-ups, "Read More" teasers, photo captions) mixed in. Also
+eyeball `title`/`authors`/`topics` for emptiness or obvious junk (UUIDs, section names, the domain,
+`Category:` prefixes). Note every article that trips a signal.
+
+Coherence has **one blind spot**: a paragraph dropped mid-body leaves the surrounding text still
+flowing — the seam closes and nothing reads wrong. Tier 2 exists for exactly that.
+
+### Tier 2 — sweep, then adjudicate every candidate (hard gate, every publisher)
+
+```bash
+python "<skill>/scripts/review.py" sweep <cc>.<Class>                 # same cache, no network
+python "<skill>/scripts/review.py" sweep <cc>.<Class> --version V1_1  # pin a legacy version; re-runs free
+```
+
+The sweep applies the version's *real* body selectors to each cached article and emits two kinds of
+candidate, each with an id:
+
+- **DROP**: a structural block (`table`/`ul`/`ol`/`dl`/`blockquote`/`pre`), or a `<p>` or `<h2>` to
+  `<h5>` element, in the body container whose text is **absent from the extracted body**. Either
+  real content the selector misses (blocker) or page chrome (fine).
+- **LEAK**: a body unit **repeated across half the cached articles**, the signature of boilerplate
+  *inside* the body (newsletter pitches, teasers, bios repeat; article text doesn't).
+
+The boilerplate-vs-body call is yours, and it's recorded, not implied:
+
+```bash
+python "<skill>/scripts/review.py" show <cc>.<Class> <id>      # full text + cached html paths
+python "<skill>/scripts/review.py" adjudicate <cc>.<Class> <id> ok --note "site-wide cookie banner"
+python "<skill>/scripts/review.py" adjudicate <cc>.<Class> <id> blocker --note "agenda table dropped; <url>"
+```
+
+`ok` means benign (chrome outside the body for a DROP; legitimately recurring content for a LEAK);
+`blocker` means a real finding — the note is your evidence line and lands in `findings.json` verbatim.
+Adjudicate from the **cached** html (`show` prints each article's `NN.html` path) so there's no "site
+changed" ambiguity. Heads-up: the sweep *prints* the blocks it suppressed as duplicates (text already
+in the body) — skim them; that suppression is vetoable, not authoritative.
+
+What the sweep **cannot** do, and stays on you:
+
+- **Over-capture beyond repetition.** The LEAK scan only catches *recurring* boilerplate. A one-off
+  caption or "Related:" line leaked into a single body never repeats — on the articles you read,
+  check the body for anything that isn't article content.
+- **Not-sweepable articles.** A version without a `_paragraph_selector` builds its body another way;
+  the sweep reports it N/A and you do the by-hand walk below.
+- **Layout coverage** (§2 crawl) and the **image attributes** (§3).
+
+When you want the raw container — a candidate needs context, or an article is N/A — walk the cached
+bytes directly:
+
+```python
+from pathlib import Path
+from lxml import html as lhtml
+doc = lhtml.fromstring(Path("<cache>/01.html").read_bytes())  # path from `show`
+# ILLUSTRATIVE selector - use the actual ones: the version class's public accessor
+# `.body_selectors()` returns the real paragraph/summary/subheadline selectors.
+for el in doc.xpath("//div[@class='story-content']/*"):
+    print(el.tag, "|", " ".join(el.text_content().split())[:100])
+```
+
+**Delegating to subagents.** When multiple candidates trip or the HTML is large, fan out — one
+cheap-model subagent per suspect article, pointed at the cached `NN.html` and the candidate's `show`
+output, so the dumps stay out of this context and nothing re-crawls. The subagent reports evidence;
+the adjudication call stays with you.
+
+### The gate
+
+```bash
+python "<skill>/scripts/review.py" status <cc>.<Class>
+```
+
+`status` is the self-audit the old coverage table used to be, but machine-checked: it exits non-zero
+until the crawl completed, the sweep ran on it, and **every candidate is adjudicated**. It also lists
+what it *can't* check (coherence read, layout coverage, one-off over-capture, images) — those become
+one line each in your review body per publisher. **No verdict before `status` reports READY for every
+publisher in the PR.**
+
+## 3. Diagnose a miss
+
+When the sweep shows content missing, the usual culprits:
+
+- **`<ul>/<ol>` lists dropped** — a paragraph-only selector skips list items.
+- **`<p>` whose text is only inside `<em>`/`<i>`/`<a>`** dropped when the selector needs direct
+  `text()` (e.g. every address in a listicle: `<p><em>73 York St.</em></p>`).
+- **`<span>`-wrapped paragraphs** missing if the selector doesn't allow them.
+- **Over-capture**: boilerplate or captions leaking *into* the body.
+
+For images, compare each `caption`/`authors`/`is_cover` against the live `<figure>` to confirm they're
+paired with the right image and that prefixes like `"Photo by "` are stripped — these map to the
+parser's `caption_selector`, `author_selector` (whose `credits` named group is stripped from the
+caption), and the `upper_`/`lower_boundary_selector` cover boundaries. Name the offending selector when
+you report.
+
+To see articles the extraction filter *dropped* entirely (not just mis-parsed), re-crawl with
+`crawl(max_articles=..., only_complete=False)` in a scratch snippet — incomplete articles then show up
+instead of being silently filtered. (For a one-off URL:
+`PublisherCollection.<cc>.<Class>.parser(date.today()).parse(content, "raise")` — but source the HTML
+from the cache or the crawler's session; many publishers block generic fetchers.)
+
+## 4. Decide the verdict
+
+**Don't decide until `status` reports READY for every publisher.** A multi-publisher PR gets a
+per-publisher verdict; the PR's overall event is the most severe of them.
+
+Group findings by severity with quotable evidence (the dropped text, the URL, the selector at fault):
+
+- **Blockers** — crashes, empty/wrong required attributes, body that misrepresents the article
+  (missing paragraphs/lists or leaked boilerplate), mis-paired image data, missing/incorrect
+  `free_access` on a premium publisher, `VALID_UNTIL`/version-bump mistakes, a `validate=False`
+  attribute (unchecked by CI) that's wrong.
+- **Nits** — dropped trailers ("With files from …"), minor topic noise.
+
+Then pick the review **event**:
+
+- **Any blocker → `REQUEST_CHANGES`.**
+- **No blockers → `COMMENT`.** Leave the actual `APPROVE` to a human — an agent reports findings, it
+  doesn't sign off the merge gate.
+- **It's your own PR → `COMMENT`** regardless — GitHub blocks `REQUEST_CHANGES` on your own PR anyway.
+
+If a gap lives in a **base/parent parser** that other publishers inherit, say so — the fix likely
+belongs there. When proposing a fix, prefer locking it in with a new test case.
+
+## 5. Post the review
+
+```bash
+python "<skill>/scripts/review.py" payload <cc>.<Class>   # refuses while anything is pending
+```
+
+`payload` emits `findings.json` — the adjudicated blockers with your notes and the article URLs, plus
+a suggested event. That's the mechanical half; your Tier-1 / layout / over-capture / image findings
+join it. Submit **one review** via the GitHub API so the summary, the inline comments, and the verdict
+land together.
+
+### Write it to be read
+
+A review nobody can skim doesn't get acted on. The failure mode to avoid is **stating the same
+evidence twice** — once in the summary and again in the inline comment. Split the labor:
+
+- **Summary body** — skimmable, no verbatim evidence. Lead with the verdict, then **one line per
+  publisher** (clean, or the blocker named in a clause; include the not-machine-checked line: read N,
+  layouts covered, over-capture scanned), then a short **bulleted blocker list** where each bullet
+  names the problem and **links to its inline thread** instead of restating the quote.
+- **Inline comment** — where the evidence lives, one finding each, in this shape:
+  - **Line 1:** `**Blocker — <one-line claim>.**` (or `**Nit — …**`).
+  - **One** quoted snippet — the single most damning example, ellipsized to the few words that prove
+    it. If it recurs, append `(also N more)` rather than quoting each.
+  - `Fix:` one line naming the selector/change.
+  - The article URL as an **anchored footnote** (`[[1]](https://…)`), never a bare inline URL.
+
+  Aim for ~60–80 words. Keep the one claim, the one proof, the one fix.
+
+> **Confirm before posting.** A review is outward-facing and hard to retract. Build `review.json`,
+> show it to the user, and run the `gh api` POST only once they approve.
+
+```bash
+# commit_id is the PR head SHA:
+gh pr view <PR> --json headRefOid -q .headRefOid
+
+# Build the payload in the cache dir the driver printed (set SCRATCH to it), NOT the repo working dir.
+# Valid JSON — no comments inside it. One entry per finding; each comment needs path + line + side + body.
+cat > "$SCRATCH/review.json" <<'JSON'
+{
+  "commit_id": "<PR_HEAD_SHA>",
+  "event": "REQUEST_CHANGES",
+  "body": "<verdict; one line per publisher; bulleted blocker list linking to the inline threads>",
+  "comments": [
+    { "path": "src/fundus/publishers/<cc>/<file>.py", "line": 9, "side": "RIGHT",
+      "body": "**Blocker — list items dropped.** \"…<ellipsized snippet>…\" Fix: <selector>. [[1]](<live article url>)" }
+  ]
+}
+JSON
+gh api repos/flairNLP/fundus/pulls/<PR>/reviews -X POST --input "$SCRATCH/review.json"
+```
+
+Notes:
+
+- `event` is `REQUEST_CHANGES` or `COMMENT` per the rule in §4.
+- `line` must fall inside the diff hunk (added or context lines are both fine); for a brand-new file
+  every line qualifies. The `9` above is illustrative — use a real line from the diff.
+- The review posts under the `gh`-authenticated user and, with `REQUEST_CHANGES`, blocks merge until
+  resolved if the repo enforces review gating.
+
+---
+
+*Cleanup: with scratch kept in the cache/temp dir (see Rules), there's nothing to clean in the repo.
+If anything of yours did land in the working dir, remove only that; never touch pre-existing untracked
+files without asking.*
diff --git a/skills/review-publisher/SKILL.md b/skills/review-publisher/SKILL.md
new file mode 100644
index 000000000..e2fc3a08c
--- /dev/null
+++ b/skills/review-publisher/SKILL.md
@@ -0,0 +1,33 @@
+---
+name: review-publisher
+description: >-
+  Review a Fundus publisher PR — one that adds a new publisher or adds/changes a parser version.
+  Crawls live articles to verify the extracted ArticleBody mirrors the real article (no missing or
+  leaked content), checks VALID_UNTIL / version bumps / validate=False attributes / free_access, and
+  drafts a single GitHub review. Use when asked to review a publisher or parser PR.
+---
+
+# Review a Publisher PR
+
+**This skill's directory is `${CLAUDE_SKILL_DIR}`** — that placeholder is substituted *only here in
+SKILL.md*, so note the literal path above now. The procedure lives in the sibling
+[`PLAYBOOK.md`](PLAYBOOK.md); wherever its commands write `<skill>`, substitute the literal path.
+The bundled driver — the only tool you need — is:
+
+    python "<skill>/scripts/review.py" {crawl,sweep,show,adjudicate,status,payload} <cc>.<Class>
+
+Open the PLAYBOOK and work through §1–§5; it's the source of truth. These are the guardrails to
+hold onto even before you open it:
+
+- **No PR named? Resolve it first** — read the current branch's PR (`gh pr view`), confirm it, or ask.
+  Never guess or default to the latest.
+- **Don't run `pytest` / `mypy` / `ruff`** — CI covers them; your value-add is live-article correctness.
+- **The body must mirror the article** — no dropped paragraphs, no leaked boilerplate. The driver's
+  crawl-once-then-sweep flow (§2) is a hard gate on **every** publisher: every candidate it surfaces
+  must be explicitly adjudicated (`adjudicate <id> ok|blocker --note ...`), and `status` must report
+  READY before any verdict. The judgment calls stay yours; silently skipping them is not an option.
+- **Each publisher is its own review** — its own crawl, sweep, and `status READY`; a blocker on one
+  does not discharge the checks on the rest.
+- **Verdict:** any blocker → `REQUEST_CHANGES`, else `COMMENT`. Never `APPROVE`; your own PR → `COMMENT`.
+- **Keep the review tight, and confirm before posting** — one skimmable review, no evidence stated
+  twice (§5); show `review.json` to the user before the `gh api` POST.
diff --git a/skills/review-publisher/requirements.txt b/skills/review-publisher/requirements.txt
new file mode 100644
index 000000000..a312233cf
--- /dev/null
+++ b/skills/review-publisher/requirements.txt
@@ -0,0 +1,5 @@
+# Sampler dependencies beyond Fundus itself. Not part of the shipped package's runtime deps.
+# Pinned for Python 3.8 (the repo's CI target).
+numpy==1.24.4
+scikit-learn==1.3.2
+# lxml is already a Fundus dependency.
diff --git a/skills/review-publisher/scripts/_store.py b/skills/review-publisher/scripts/_store.py
new file mode 100644
index 000000000..0fe8d57eb
--- /dev/null
+++ b/skills/review-publisher/scripts/_store.py
@@ -0,0 +1,196 @@
+"""Persistence for the publisher-review driver (`review.py`): one cache dir per review.
+
+The cache dir holds the raw crawled bytes (`NN.html`) plus a single `state.json` that is
+the source of truth for the whole review: crawl parameters, the per-article records, the
+sweep's candidates, and the agent's adjudications. Everything `review.py` knows it knows
+from here, which is what makes the workflow crash-safe (state is rewritten after every
+article) and gateable (`payload_gaps` can name exactly what is still missing).
+
+Layout of a cache dir (default: <tempdir>/fundus-review/<cc>.<Class>/):
+
+    state.json   # crawl meta + article records + sweep candidates + adjudications
+    01.html      # raw html.content bytes for article 1 (exact crawled bytes)
+    02.html      # ...
+"""
+
+import hashlib
+import json
+import shutil
+import tempfile
+import time
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+from fundus import Article, PublisherCollection
+from fundus.publishers.base_objects import Publisher
+
+STATE_FILE = "state.json"
+
+# Adjudication verdicts: "ok" = benign (boilerplate outside the body for a drop candidate,
+# legitimately repeated content for a leak candidate); "blocker" = a real finding.
+VERDICTS = ("ok", "blocker")
+
+
+# --- publisher + path resolution ---
+
+
+def resolve_publisher(spec: str) -> Publisher:
+    """`ca.NationalPost` -> the Publisher object on PublisherCollection."""
+    try:
+        cc, name = spec.split(".", 1)
+    except ValueError:
+        raise SystemExit(f"publisher spec must be '<cc>.<Class>', got {spec!r}")
+    region = getattr(PublisherCollection, cc, None)
+    if region is None:
+        raise SystemExit(f"no such country code on PublisherCollection: {cc!r}")
+    publisher = getattr(region, name, None)
+    if not isinstance(publisher, Publisher):
+        raise SystemExit(f"no publisher {name!r} under PublisherCollection.{cc}")
+    return publisher
+
+
+def default_cache_dir(spec: str) -> Path:
+    """Predictable per-publisher temp location, so every subcommand agrees without a path."""
+    return Path(tempfile.gettempdir()) / "fundus-review" / spec
+
+
+def resolve_cache_dir(spec: str, provided: Optional[str]) -> Path:
+    return Path(provided) if provided else default_cache_dir(spec)
+
+
+def prepare_cache_dir(cache_dir: Path) -> None:
+    """Wipe and recreate `cache_dir` for a fresh crawl — refusing anything that isn't a review cache.
+
+    The guard is what makes `--cache-dir` safe: a non-empty directory without a
+    `state.json` (someone's working tree, a typo'd path) is never deleted.
+    """
+    if cache_dir.exists():
+        if any(cache_dir.iterdir()) and not (cache_dir / STATE_FILE).exists():
+            raise SystemExit(
+                f"refusing to wipe {cache_dir}: non-empty and no {STATE_FILE}, so it doesn't look like "
+                f"a review cache. Use an empty or not-yet-existing --cache-dir."
+            )
+        shutil.rmtree(cache_dir)
+    cache_dir.mkdir(parents=True, exist_ok=True)
+
+
+# --- state file ---
+
+
+def new_state(spec: str, pool: int) -> Dict[str, Any]:
+    return {
+        "publisher": spec,
+        "crawl": {"pool": pool, "started": time.time(), "finished": None, "completed": False},
+        "articles": [],
+        "sweep": None,
+        "adjudications": {},
+    }
+
+
+def write_state(cache_dir: Path, state: Dict[str, Any]) -> None:
+    """Atomically (write-then-replace) persist the state, so a crash never corrupts it."""
+    tmp = cache_dir / (STATE_FILE + ".tmp")
+    tmp.write_text(json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8")
+    tmp.replace(cache_dir / STATE_FILE)
+
+
+def load_state(cache_dir: Path) -> Optional[Dict[str, Any]]:
+    state_file = cache_dir / STATE_FILE
+    if not state_file.exists():
+        return None
+    state: Dict[str, Any] = json.loads(state_file.read_text(encoding="utf-8"))
+    return state
+
+
+# --- article records ---
+
+
+def html_filename(index: int) -> str:
+    return f"{index:02d}.html"
+
+
+def save_article(cache_dir: Path, index: int, article: Article) -> Dict[str, Any]:
+    """Write one article's raw bytes and return its state record.
+
+    `write_bytes` stores the crawled `html.content` string as UTF-8 with no newline translation;
+    text mode would rewrite line endings on Windows (\\r\\n -> \\r\\r\\n on disk).
+    """
+    (cache_dir / html_filename(index)).write_bytes(article.html.content.encode("utf-8"))
+    body = article.body
+    return {
+        "index": index,
+        "url": article.html.requested_url,
+        "crawl_date": article.html.crawl_date.isoformat(),
+        "title": article.title,
+        "authors": list(article.authors),
+        "topics": list(article.topics),
+        "images": len(article.images),
+        "body": body.serialize() if body is not None else None,
+        "html_file": html_filename(index),
+    }
+
+
+def read_html(cache_dir: Path, record: Dict[str, Any]) -> str:
+    return (cache_dir / str(record["html_file"])).read_bytes().decode("utf-8")
+
+
+def record_crawl_date(record: Dict[str, Any]) -> datetime:
+    return datetime.fromisoformat(str(record["crawl_date"]))
+
+
+def body_units(serialized_body: Optional[Dict[str, Any]]) -> List[str]:
+    """Flatten a serialized ArticleBody into its text units (summary, headlines, paragraphs)."""
+    if not serialized_body:
+        return []
+    units: List[str] = list(serialized_body.get("summary") or [])
+    for section in serialized_body.get("sections") or []:
+        units.extend(section.get("headline") or [])
+        units.extend(section.get("paragraphs") or [])
+    return units
+
+
+# --- candidates + adjudication gate ---
+
+
+def candidate_id(kind: str, *key_parts: str) -> str:
+    """Stable short id from the candidate's content, so re-sweeps keep existing adjudications."""
+    digest = hashlib.sha1("\x1f".join(key_parts).encode("utf-8")).hexdigest()[:6]
+    return f"{kind[0].upper()}{digest}"
+
+
+def candidates(state: Dict[str, Any]) -> List[Dict[str, Any]]:
+    sweep = state.get("sweep") or {}
+    result: List[Dict[str, Any]] = sweep.get("candidates") or []
+    return result
+
+
+def pending_candidates(state: Dict[str, Any]) -> List[Dict[str, Any]]:
+    adjudicated = state.get("adjudications") or {}
+    return [c for c in candidates(state) if c["id"] not in adjudicated]
+
+
+def blocker_candidates(state: Dict[str, Any]) -> List[Dict[str, Any]]:
+    adjudications = state.get("adjudications") or {}
+    return [c for c in candidates(state) if adjudications.get(c["id"], {}).get("verdict") == "blocker"]
+
+
+def payload_gaps(state: Dict[str, Any]) -> List[str]:
+    """Every reason the review is not ready to be written up; empty means the gate is open."""
+    gaps: List[str] = []
+    crawl = state.get("crawl") or {}
+    if not crawl.get("completed"):
+        gaps.append("the crawl did not complete (interrupted?) — re-run `crawl`")
+    if not state.get("articles"):
+        gaps.append("no articles in the cache — 0 crawled is itself a blocker-level finding (see PLAYBOOK §2)")
+    sweep = state.get("sweep")
+    if not sweep:
+        gaps.append("no sweep recorded — run `sweep`")
+    else:
+        if crawl.get("finished") and sweep.get("swept_at", 0) < crawl["finished"]:
+            gaps.append("the sweep predates the last crawl — re-run `sweep`")
+        pending = pending_candidates(state)
+        if pending:
+            ids = ", ".join(c["id"] for c in pending[:15]) + (", …" if len(pending) > 15 else "")
+            gaps.append(f"{len(pending)} candidate(s) un-adjudicated: {ids}")
+    return gaps
diff --git a/skills/review-publisher/scripts/_sweep.py b/skills/review-publisher/scripts/_sweep.py
new file mode 100644
index 000000000..7347f07d3
--- /dev/null
+++ b/skills/review-publisher/scripts/_sweep.py
@@ -0,0 +1,263 @@
+"""Pure sweep logic for the publisher review: drop candidates and leak candidates.
+
+No I/O and no crawling here — `review.py` feeds parsed documents and cached body text in,
+results come out. That is what makes this file unit-testable (tests/test_review_skill.py),
+and the tests pin exactly the failure modes a review gate must not have.
+
+Design rule: **a heuristic may be noisy, but it must never be silently green.**
+- Fewer than two captured nodes -> the walk falls back to the whole document and is
+  flagged `loose_scope`, never silently scoped to the one captured node.
+- A block whose text is judged "already in the body" is returned as a *duplicate* (and
+  printed by the driver) so the suppression itself is visible and vetoable.
+- Presence is checked on *every* segment of a block (each <li>, cell, paragraph), not a
+  prefix probe — a list that opens by restating the lede still flags its dropped tail.
+- Text is normalized with fundus's own `normalize_whitespace`, the same function that
+  built the body text being compared against, so the two sides can never drift.
+"""
+
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional, Sequence, Set, Tuple, Type
+
+import lxml.html
+
+from fundus.parser import BaseParser, ParserProxy
+from fundus.parser.utility import normalize_whitespace
+
+# Structural blocks whose dropped content silently corrupts the body. A paragraph-only
+# selector never matches these, so any one of them with real text is a drop candidate.
+STRUCTURAL_TAGS = {"table", "ul", "ol", "dl", "blockquote", "pre"}
+# Text-bearing blocks the selector is *supposed* to catch; an uncaptured one usually
+# means a selector gap (e.g. <p> whose text lives only inside <em>/<a>).
+TEXT_TAGS = {"p", "h2", "h3", "h4", "h5"}
+REPORT_TAGS = STRUCTURAL_TAGS | TEXT_TAGS
+
+# The text-bearing leaves a structural block is split into for the presence check.
+SEGMENT_TAGS = {"p", "li", "dt", "dd", "td", "th", "caption", "h2", "h3", "h4", "h5"}
+# Segments shorter than this prove nothing on their own ("Yes.", a number cell).
+MIN_SEGMENT_CHARS = 12
+
+# A body unit must be at least this long to count for cross-article leak detection.
+MIN_LEAK_CHARS = 8
+
+
+@dataclass
+class DropCandidate:
+    """An uncaptured block whose text is absent from the extracted body."""
+
+    tag: str
+    description: str
+    text: str
+    chars: int
+    missing_segments: List[str]
+
+
+@dataclass
+class LeakCandidate:
+    """A body unit repeated across many articles — the boilerplate signature."""
+
+    text: str
+    article_indices: List[int]
+
+
+@dataclass
+class SweepResult:
+    applicable: bool
+    counts: Dict[str, int] = field(default_factory=dict)
+    container: Optional[str] = None
+    loose_scope: bool = False
+    captured_blocks: int = 0
+    duplicates: List[str] = field(default_factory=list)
+    drops: List[DropCandidate] = field(default_factory=list)
+
+
+# --- version + selector access ---
+
+
+def version_classes(parser_proxy: ParserProxy) -> Dict[str, Type[BaseParser]]:
+    """Version label (e.g. 'V1_1') -> version class, from the proxy's own registry."""
+    return {cls.__name__: cls for cls in parser_proxy}
+
+
+# --- per-article helpers (bottom-up) ---
+
+
+def _describe(el: lxml.html.HtmlElement) -> str:
+    cls = el.get("class")
+    return f"<{str(el.tag)}{(' class=' + repr(cls)) if cls else ''}>"
+
+
+def _element_nodes(selector: Any, doc: lxml.html.HtmlElement) -> List[lxml.html.HtmlElement]:
+    """Apply a selector once, keeping only real elements (a text()-XPath yields strings)."""
+    return [el for el in selector(doc) if isinstance(el, lxml.html.HtmlElement)]
+
+
+def lowest_common_ancestor(elements: Sequence[lxml.html.HtmlElement]) -> Optional[lxml.html.HtmlElement]:
+    """LCA of the captured nodes = the body container to walk.
+
+    Returns None for fewer than two *distinct* elements: the degenerate case must fall
+    back to a loud whole-document walk, never scope the sweep to the lone captured node
+    (which would cover its whole subtree and report a guaranteed false 'clean').
+
+    Sets hold the *elements*, never bare id()s: lxml elements are proxies, and only a
+    live reference guarantees the same node yields the same Python object again.
+    """
+    distinct: List[lxml.html.HtmlElement] = []
+    for el in elements:
+        if el not in distinct:
+            distinct.append(el)
+    if len(distinct) < 2:
+        return None
+    # Ancestor chain (root-first) of the first element.
+    chain = list(reversed(list(distinct[0].iterancestors()))) + [distinct[0]]
+    common = set(chain)
+    for el in distinct[1:]:
+        common &= set(el.iterancestors()) | {el}
+    # Deepest element in `chain` that's still common across all.
+    for el in reversed(chain):
+        if el in common:
+            return el
+    return None
+
+
+def coverage_sets(
+    captured_nodes: Sequence[lxml.html.HtmlElement],
+) -> Tuple[Set[lxml.html.HtmlElement], Set[lxml.html.HtmlElement]]:
+    """Precompute coverage once per article: (elements above a captured node, elements inside one).
+
+    'Above' (a captured descendant exists) covers `<p><strong>Subhead</strong></p>` where
+    the <strong> is matched; 'inside' covers walking into a captured container's children.
+    Replaces a per-element `iterdescendants` scan, which is O(elements x subtree). The sets
+    keep the element proxies alive on purpose — see `lowest_common_ancestor`.
+    """
+    above: Set[lxml.html.HtmlElement] = set()
+    inside: Set[lxml.html.HtmlElement] = set()
+    for node in captured_nodes:
+        above.add(node)
+        inside.add(node)
+        above.update(node.iterancestors())
+        inside.update(node.iterdescendants())
+    return above, inside
+
+
+def _segments(el: lxml.html.HtmlElement) -> List[str]:
+    """A block's text-bearing leaves (each <li>, cell, paragraph); the block itself if none."""
+    segments = [
+        normalize_whitespace(sub.text_content()) for sub in el.iter() if sub is not el and sub.tag in SEGMENT_TAGS
+    ]
+    segments = [segment for segment in segments if segment]
+    return segments or [normalize_whitespace(el.text_content())]
+
+
+def text_present(el: lxml.html.HtmlElement, body_norm_folded: str) -> Tuple[bool, List[str]]:
+    """Is this block's text fully in the body? Returns (present, missing segments).
+
+    Every long-enough segment must appear — a partial match is a drop, with the missing
+    segments named. Blocks made only of short segments fall back to their whole text, so
+    a tiny block can't pass by having nothing probative to check.
+    """
+    segments = _segments(el)
+    missing = [
+        segment
+        for segment in segments
+        if len(segment) >= MIN_SEGMENT_CHARS and segment.casefold() not in body_norm_folded
+    ]
+    if missing:
+        return False, missing
+    if all(len(segment) < MIN_SEGMENT_CHARS for segment in segments):
+        whole = normalize_whitespace(el.text_content())
+        if whole.casefold() not in body_norm_folded:
+            return False, [whole]
+    return True, []
+
+
+# --- the per-article sweep ---
+
+
+def sweep_article(doc: lxml.html.HtmlElement, selectors: Dict[str, Optional[Any]], units: Sequence[str]) -> SweepResult:
+    """Sweep one parsed document against the body units the parser extracted from it."""
+    paragraph_selector = selectors.get("paragraph")
+    if paragraph_selector is None:
+        return SweepResult(applicable=False)
+
+    para_nodes = _element_nodes(paragraph_selector, doc)
+    summary_selector = selectors.get("summary")
+    summ_nodes = _element_nodes(summary_selector, doc) if summary_selector is not None else []
+    subheadline_selector = selectors.get("subheadline")
+    sub_nodes = _element_nodes(subheadline_selector, doc) if subheadline_selector is not None else []
+
+    above, inside = coverage_sets(para_nodes + summ_nodes + sub_nodes)
+    body_norm_folded = " ".join(normalize_whitespace(unit) for unit in units).casefold()
+
+    # Scope the walk to the body container: the LCA of paragraph + subheadline nodes.
+    # Summary often lives in the page header, so it's excluded from the LCA.
+    container = lowest_common_ancestor(para_nodes + sub_nodes)
+    loose_scope = container is None or container.tag in ("html", "body")
+    walk_root = doc if loose_scope or container is None else container
+
+    result = SweepResult(
+        applicable=True,
+        counts={"paragraph": len(para_nodes), "summary": len(summ_nodes), "subheadline": len(sub_nodes)},
+        container=_describe(container) if container is not None and not loose_scope else "(whole document)",
+        loose_scope=loose_scope,
+    )
+    reported: Set[lxml.html.HtmlElement] = set()
+    for el in walk_root.iter():
+        if el.tag not in REPORT_TAGS:
+            continue
+        if el in above or el in inside:
+            result.captured_blocks += 1
+            continue
+        text = normalize_whitespace(el.text_content())
+        if not text:
+            continue
+        # Only the outermost uncaptured block is a candidate; its nested blocks ride along.
+        if any(ancestor in reported for ancestor in el.iterancestors()):
+            continue
+        present, missing = text_present(el, body_norm_folded)
+        if present:
+            # Visible, vetoable suppression — never a silent one.
+            result.duplicates.append(f'{_describe(el)} "{text[:80]}"')
+            continue
+        result.drops.append(
+            DropCandidate(
+                tag=str(el.tag),
+                description=_describe(el),
+                text=text,
+                chars=len(text),
+                missing_segments=missing[:3],
+            )
+        )
+        reported.add(el)
+    return result
+
+
+# --- the cross-article leak scan ---
+
+
+def find_leaks(units_per_article: Sequence[Tuple[int, Sequence[str]]]) -> List[LeakCandidate]:
+    """Body units repeated across at least half the articles — language-agnostic boilerplate.
+
+    The sweep's drop side cannot see leaks (leaked text is *in* the body), but boilerplate
+    repeats across articles while article text doesn't. Needs >= 3 articles to mean anything;
+    below that it returns nothing and the driver says so.
+    """
+    n = len(units_per_article)
+    if n < 3:
+        return []
+    threshold = max(2, (n + 1) // 2)
+    occurrences: Dict[str, Tuple[str, List[int]]] = {}
+    for index, units in units_per_article:
+        per_article: Dict[str, str] = {}
+        for unit in units:
+            normalized = normalize_whitespace(unit)
+            if len(normalized) >= MIN_LEAK_CHARS:
+                per_article[normalized.casefold()] = normalized
+        for key, display in per_article.items():
+            occurrences.setdefault(key, (display, []))[1].append(index)
+    leaks = [
+        LeakCandidate(text=display, article_indices=indices)
+        for display, indices in occurrences.values()
+        if len(indices) >= threshold
+    ]
+    leaks.sort(key=lambda leak: (-len(leak.article_indices), leak.text))
+    return leaks
diff --git a/skills/review-publisher/scripts/review.py b/skills/review-publisher/scripts/review.py
new file mode 100644
index 000000000..8d5586394
--- /dev/null
+++ b/skills/review-publisher/scripts/review.py
@@ -0,0 +1,472 @@
+"""Driver for reviewing a Fundus publisher: crawl once, sweep, adjudicate, gate, report.
+
+This is a *review aid*, not part of the shipped package. It owns the review's state
+machine so the agent can't substitute "looks clean" for verification:
+
+    crawl       crawl a candidate pool -> sample a layout-diverse subset -> Tier-1 read + cache it
+    sweep       offline structural sweep of that same cache -> DROP / LEAK candidates with ids
+    show        full detail for one candidate (text, articles, cached html paths)
+    adjudicate  record the boilerplate-vs-body judgment for one candidate (the only judgment input)
+    status      where the review stands; exit 0 only when nothing is pending
+    payload     refuse while anything is un-adjudicated, else emit findings.json for §5
+
+Both tiers and every re-run work the *same* crawled draw (PLAYBOOK.md §2): `crawl` is the
+only networked step, everything else replays the cache. Re-sweeping (e.g. `--version V1`)
+costs nothing and keeps existing adjudications — candidate ids are content-hashes.
+
+Usage (any working directory; <skill>/ is this skill's directory):
+
+    python <skill>/scripts/review.py crawl ca.NationalPost [--pool 50]
+    python <skill>/scripts/review.py sweep ca.NationalPost [--version V1_1]
+    python <skill>/scripts/review.py adjudicate ca.NationalPost D3f2a1c ok --note "cookie banner"
+    python <skill>/scripts/review.py status ca.NationalPost
+    python <skill>/scripts/review.py payload ca.NationalPost
+"""
+
+import argparse
+import io
+import json
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Tuple
+
+import lxml.html
+from _store import (
+    VERDICTS,
+    blocker_candidates,
+    body_units,
+    candidate_id,
+    candidates,
+    default_cache_dir,
+    load_state,
+    new_state,
+    payload_gaps,
+    pending_candidates,
+    prepare_cache_dir,
+    read_html,
+    record_crawl_date,
+    resolve_cache_dir,
+    resolve_publisher,
+    save_article,
+    write_state,
+)
+from _sweep import STRUCTURAL_TAGS, SweepResult, find_leaks, sweep_article, version_classes
+
+from fundus import Article, Crawler
+
+SCRIPT = Path(__file__).resolve()
+RULE = "=" * 100
+
+TEXT_CAP = 400  # cap on stored candidate text; listings truncate further, `show` prints all that is stored
+ARTICLES_PER_LAYOUT = 2  # representatives reviewed per layout: the medoid + its most different member
+MIN_REVIEW_ARTICLES = 3  # floor below which per-layout sampling falls back to diverse sampling
+
+
+# --- small helpers ---
+
+
+def _self_invocation(args: argparse.Namespace, command: str) -> str:
+    """A copy-pasteable follow-up command with the resolved script path and cache dir."""
+    cache_flag = f' --cache-dir "{args.cache_dir}"' if args.cache_dir else ""
+    return f'python "{SCRIPT}" {command} {args.publisher}{cache_flag}'
+
+
+def _require_state(cache_dir: Path, spec: str) -> Dict[str, Any]:
+    state = load_state(cache_dir)
+    if state is None:
+        raise SystemExit(f"no review state in {cache_dir} - run `crawl {spec}` first.")
+    return state
+
+
+def _records_by_index(state: Dict[str, Any]) -> Dict[int, Dict[str, Any]]:
+    return {int(record["index"]): record for record in state["articles"]}
+
+
+def _candidate_lines(state: Dict[str, Any]) -> List[str]:
+    adjudications = state.get("adjudications") or {}
+    lines = []
+    for candidate in candidates(state):
+        verdict = adjudications.get(candidate["id"], {}).get("verdict", "PENDING")
+        marker = "  [STRUCTURAL]" if candidate.get("tag") in STRUCTURAL_TAGS else ""
+        where = f"articles {candidate['articles']}"
+        lines.append(
+            f"  {candidate['id']}  [{verdict:>7}]  {candidate['kind']}: "
+            f'{candidate.get("description", "")} "{candidate["text"][:90]}" ({where}){marker}'
+        )
+    return lines
+
+
+# --- crawl ---
+
+
+def _select_for_review(pool: List[Article]) -> List[Tuple[Article, int, bool]]:
+    """Reduce the crawled pool to `ARTICLES_PER_LAYOUT` representatives per distinct layout.
+
+    Every layout the sampler resolves contributes its representatives (the medoid plus its most
+    different member) — no layout is ever dropped; the number of layouts is discovered unsupervised
+    (bounded by the sampler's `k_max`). So the review covers each layout the publisher uses, twice
+    over, instead of *n* near-duplicate news stories.
+
+    Floor: if that yields fewer than `MIN_REVIEW_ARTICLES` (a near-uniform publisher collapsing to
+    one layout), the coherence read and the cross-article leak scan have too little to chew on, so
+    fall back to diverse sampling for `MIN_REVIEW_ARTICLES` farthest-point picks instead.
+
+    Returns each kept article with its layout id and whether it is that layout's representative
+    (medoid) vs the extra per-layout pick.
+    """
+    if not pool:
+        return []
+    from sampler import Sampler  # numpy / scikit-learn — only imported when we actually reduce
+
+    sampler = Sampler()
+    sampled = sampler.per_layout(pool, k=ARTICLES_PER_LAYOUT)
+    if len(sampled) < MIN_REVIEW_ARTICLES:
+        sampled = sampler.diverse(pool, n=MIN_REVIEW_ARTICLES)
+    return [(s.article, s.layout, s.is_representative) for s in sampled]
+
+
+def cmd_crawl(args: argparse.Namespace) -> int:
+    publisher = resolve_publisher(args.publisher)
+    cache_dir = resolve_cache_dir(args.publisher, args.cache_dir)
+    prepare_cache_dir(cache_dir)
+
+    state = new_state(args.publisher, args.pool)
+    write_state(cache_dir, state)
+
+    pool: List[Article] = []
+    completed = False
+    try:
+        # The only networked step: draw a candidate pool, then reduce it to a layout-diverse subset.
+        # Sampling needs the whole pool up front, so unlike the per-article cache this buffers in
+        # memory — an interrupted crawl caches nothing and is simply re-run (it's the cheap step).
+        pool = list(Crawler(publisher).crawl(max_articles=args.pool))
+        for article, layout, is_representative in _select_for_review(pool):
+            index = len(state["articles"]) + 1
+            state["articles"].append(save_article(cache_dir, index, article))
+            write_state(cache_dir, state)
+
+            # Tier-1 coherence view: read each body here for dangling sentences / boilerplate.
+            role = "rep" if is_representative else "extra"
+            print(RULE)
+            print(f"[{index}] (layout {layout} {role}) {article.html.requested_url}")
+            print(f"{article.title} | {article.authors} | {article.topics} | imgs: {len(article.images)}")
+            print(str(article.body))
+        completed = True
+    finally:
+        state["crawl"]["finished"] = time.time()
+        state["crawl"]["completed"] = completed
+        write_state(cache_dir, state)
+
+    reviewing = len(state["articles"])
+    print(RULE)
+    print(f"crawled {len(pool)} candidate(s), reviewing {reviewing} layout-diverse article(s) -> {cache_dir}")
+    if reviewing == 0:
+        print("0 articles crawled - sources or parser likely broken (blocker-level; see PLAYBOOK §2).")
+    else:
+        print(f"next (Tier 2): {_self_invocation(args, 'sweep')}")
+    return 0
+
+
+# --- sweep ---
+
+
+def _aggregate_drops(per_article: List[Tuple[int, SweepResult]], spec: str) -> List[Dict[str, Any]]:
+    """Merge identical-text drops across articles (nav/chrome repeats) into one candidate."""
+    merged: Dict[Tuple[str, str], Dict[str, Any]] = {}
+    for index, result in per_article:
+        for drop in result.drops:
+            key = (drop.tag, drop.text[:160].casefold())
+            entry = merged.get(key)
+            if entry is None:
+                merged[key] = entry = {
+                    "id": candidate_id("drop", spec, *key),  # same (tag, folded text prefix) as the merge key
+                    "kind": "drop",
+                    "tag": drop.tag,
+                    "description": drop.description,
+                    "text": drop.text[:TEXT_CAP],
+                    "chars": drop.chars,
+                    "missing": drop.missing_segments,
+                    "articles": [],
+                }
+            entry["articles"].append(index)
+    return list(merged.values())
+
+
+def cmd_sweep(args: argparse.Namespace) -> int:
+    publisher = resolve_publisher(args.publisher)
+    cache_dir = resolve_cache_dir(args.publisher, args.cache_dir)
+    state = _require_state(cache_dir, args.publisher)
+
+    parser_proxy = publisher.parser
+    version_map = version_classes(parser_proxy)
+    pinned_cls = None
+    if args.version is not None:
+        pinned_cls = version_map.get(args.version)
+        if pinned_cls is None:
+            raise SystemExit(f"no version {args.version!r}; available: {sorted(version_map)}")
+
+    per_article: List[Tuple[int, SweepResult]] = []
+    units_per_article: List[Tuple[int, List[str]]] = []
+    not_applicable = 0
+    for record in state["articles"]:
+        index = int(record["index"])
+        crawl_date = record_crawl_date(record)
+        version_cls = pinned_cls or type(parser_proxy(crawl_date))
+        doc = lxml.html.document_fromstring(read_html(cache_dir, record))
+        units = body_units(record["body"])
+        units_per_article.append((index, units))
+        result = sweep_article(doc, version_cls.body_selectors(), units)
+        per_article.append((index, result))
+
+        print(RULE)
+        print(f"[{index}] {record['url']}")
+        print(f"     version={version_cls.__name__}  crawl_date={record['crawl_date']}")
+        if not result.applicable:
+            not_applicable += 1
+            print("     sweep N/A (no _paragraph_selector - body built another way).")
+            print("     MANUAL diff required for this article (PLAYBOOK §2, by-hand walk).")
+            continue
+        counts = result.counts
+        print(
+            f"     captured nodes: {counts['paragraph']} paragraph, {counts['summary']} summary, "
+            f"{counts['subheadline']} subheadline | other captured blocks: {result.captured_blocks}"
+        )
+        print(f"     body container: {result.container}")
+        if result.loose_scope:
+            print(
+                "     ! walking the WHOLE page (selectors share no tight ancestor) - "
+                "expect chrome noise; adjudicate carefully."
+            )
+        for duplicate in result.duplicates[:12]:
+            print(f"     duplicate (text already in body - verify, not a drop): {duplicate}")
+        if len(result.duplicates) > 12:
+            print(f"     ... and {len(result.duplicates) - 12} more duplicates")
+        print(f"     uncaptured blocks with text: {len(result.drops)}")
+
+    drop_candidates = _aggregate_drops(per_article, args.publisher)
+    leaks = find_leaks(units_per_article)
+    leak_candidates: List[Dict[str, Any]] = [
+        {
+            "id": candidate_id("leak", args.publisher, leak.text.casefold()),
+            "kind": "leak",
+            "text": leak.text[:TEXT_CAP],
+            "articles": leak.article_indices,
+        }
+        for leak in leaks
+    ]
+
+    state["sweep"] = {
+        "version": args.version,
+        "swept_at": time.time(),
+        "articles_swept": len(per_article),
+        "not_applicable": not_applicable,
+        "candidates": drop_candidates + leak_candidates,
+    }
+    write_state(cache_dir, state)
+
+    print(RULE)
+    print(
+        f"swept {len(per_article)} article(s): {len(drop_candidates)} drop, {len(leak_candidates)} leak candidate(s)."
+    )
+    if len(state["articles"]) < 3:
+        print("note: <3 articles cached, so the cross-article leak scan is inactive - scan bodies by hand.")
+    if not_applicable:
+        print(f"! {not_applicable} article(s) not sweepable - the manual diff there is on you (PLAYBOOK §2).")
+    for line in _candidate_lines(state):
+        print(line)
+    if state["sweep"]["candidates"]:
+        print("adjudicate each candidate (`show <id>` for detail, cached html included):")
+        print(f'  {_self_invocation(args, "adjudicate")} <id> ok|blocker --note "..."')
+    print(f"then: {_self_invocation(args, 'status')}")
+    return 0
+
+
+# --- show / adjudicate ---
+
+
+def _find_candidate(state: Dict[str, Any], candidate_ref: str) -> Dict[str, Any]:
+    for candidate in candidates(state):
+        if candidate["id"] == candidate_ref:
+            return candidate
+    known = ", ".join(c["id"] for c in candidates(state)) or "(none - run `sweep` first)"
+    raise SystemExit(f"no candidate {candidate_ref!r}; known: {known}")
+
+
+def cmd_show(args: argparse.Namespace) -> int:
+    cache_dir = resolve_cache_dir(args.publisher, args.cache_dir)
+    state = _require_state(cache_dir, args.publisher)
+    candidate = _find_candidate(state, args.id)
+    records = _records_by_index(state)
+
+    print(f"{candidate['id']}  kind={candidate['kind']}  {candidate.get('description', '')}")
+    print(f"text ({candidate.get('chars', len(candidate['text']))} chars):")
+    print(f"  {candidate['text']}")
+    for segment in candidate.get("missing") or []:
+        print(f'  missing from body: "{segment[:120]}"')
+    print("articles:")
+    for index in candidate["articles"]:
+        record = records.get(index)
+        if record is not None:
+            print(f"  [{index}] {record['url']}")
+            print(f"       raw html: {cache_dir / str(record['html_file'])}")
+    adjudication = (state.get("adjudications") or {}).get(candidate["id"])
+    if adjudication:
+        print(f"adjudicated: {adjudication['verdict']} - {adjudication['note']}")
+    return 0
+
+
+def cmd_adjudicate(args: argparse.Namespace) -> int:
+    cache_dir = resolve_cache_dir(args.publisher, args.cache_dir)
+    state = _require_state(cache_dir, args.publisher)
+    candidate = _find_candidate(state, args.id)
+
+    state.setdefault("adjudications", {})[candidate["id"]] = {
+        "verdict": args.verdict,
+        "note": args.note,
+        "at": time.time(),
+    }
+    write_state(cache_dir, state)
+
+    pending = pending_candidates(state)
+    print(f"{candidate['id']} -> {args.verdict}: {args.note}")
+    print(
+        f"{len(pending)} candidate(s) still pending" + (f": {', '.join(c['id'] for c in pending)}" if pending else "")
+    )
+    return 0
+
+
+# --- status / payload ---
+
+
+def cmd_status(args: argparse.Namespace) -> int:
+    cache_dir = resolve_cache_dir(args.publisher, args.cache_dir)
+    state = _require_state(cache_dir, args.publisher)
+    crawl, sweep = state["crawl"], state.get("sweep")
+    adjudications = state.get("adjudications") or {}
+
+    print(f"publisher: {state['publisher']}")
+    print(f"cache:     {cache_dir}")
+    print(
+        f"crawl:     {len(state['articles'])} reviewed (pool {crawl.get('pool', '?')}), "
+        f"{'completed' if crawl.get('completed') else 'NOT completed (interrupted?)'}"
+    )
+    if sweep is None:
+        print("sweep:     not run")
+    else:
+        version = sweep["version"] or "by crawl date"
+        print(f"sweep:     {sweep['articles_swept']} article(s), selectors {version}, N/A: {sweep['not_applicable']}")
+        verdicts = [adjudications.get(c["id"], {}).get("verdict") for c in candidates(state)]
+        print(
+            f"candidates: {len(verdicts)} total - {verdicts.count('blocker')} blocker, "
+            f"{verdicts.count('ok')} ok, {verdicts.count(None)} PENDING"
+        )
+        for line in _candidate_lines(state):
+            print(line)
+
+    print("still yours (not machine-checked): Tier-1 coherence read; layout coverage (story/opinion/")
+    print("listicle/image-heavy); the over-capture scan beyond repeated boilerplate; image attributes.")
+
+    gaps = payload_gaps(state)
+    if gaps:
+        print("gate: NOT READY")
+        for gap in gaps:
+            print(f"  - {gap}")
+        return 1
+    print("gate: READY - `payload` will emit findings.json")
+    return 0
+
+
+def cmd_payload(args: argparse.Namespace) -> int:
+    cache_dir = resolve_cache_dir(args.publisher, args.cache_dir)
+    state = _require_state(cache_dir, args.publisher)
+
+    gaps = payload_gaps(state)
+    if gaps:
+        print("refusing to emit findings - the review is not complete:")
+        for gap in gaps:
+            print(f"  - {gap}")
+        return 2
+
+    records = _records_by_index(state)
+    adjudications = state.get("adjudications") or {}
+    blockers = []
+    for candidate in blocker_candidates(state):
+        blockers.append(
+            {
+                "id": candidate["id"],
+                "kind": candidate["kind"],
+                "text": candidate["text"],
+                "note": adjudications[candidate["id"]]["note"],
+                "urls": [records[i]["url"] for i in candidate["articles"] if i in records],
+            }
+        )
+    findings = {
+        "publisher": state["publisher"],
+        "articles_cached": len(state["articles"]),
+        "selector_version": (state.get("sweep") or {}).get("version") or "by crawl date",
+        "not_applicable_articles": (state.get("sweep") or {}).get("not_applicable", 0),
+        "event_suggestion": "REQUEST_CHANGES" if blockers else "COMMENT",
+        "blockers": blockers,
+        "ok_candidates": sum(1 for c in candidates(state) if adjudications.get(c["id"], {}).get("verdict") == "ok"),
+    }
+
+    findings_file = cache_dir / "findings.json"
+    findings_file.write_text(json.dumps(findings, ensure_ascii=False, indent=2), encoding="utf-8")
+    print(json.dumps(findings, ensure_ascii=False, indent=2))
+    print(RULE)
+    print(f"written to {findings_file}")
+    print("This is the mechanical half only - your Tier-1 / layout / over-capture findings join it in")
+    print("the review body. Assemble review.json per PLAYBOOK §5 (own PR -> COMMENT; never APPROVE)")
+    print("and show it to the user BEFORE any `gh api` POST.")
+    return 0
+
+
+# --- entry point ---
+
+
+def main() -> int:
+    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")  # smart quotes survive on Windows
+
+    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    subparsers = parser.add_subparsers(dest="command", required=True)
+
+    def add(name: str, help_text: str) -> argparse.ArgumentParser:
+        sub = subparsers.add_parser(name, help=help_text)
+        sub.add_argument("publisher", help="publisher spec, e.g. 'ca.NationalPost'")
+        sub.add_argument(
+            "--cache-dir", default=None, help=f"override the cache dir (default: {default_cache_dir('<spec>')})"
+        )
+        return sub
+
+    crawl = add("crawl", "crawl a candidate pool, then Tier-1 read + cache a layout-diverse subset")
+    crawl.add_argument("--pool", type=int, default=50, help="candidate articles to crawl before sampling")
+    crawl.set_defaults(func=cmd_crawl)
+
+    sweep = add("sweep", "offline structural sweep of the cached draw -> candidates")
+    sweep.add_argument("--version", default=None, help="pin a version label (e.g. V1_1) instead of by crawl date")
+    sweep.set_defaults(func=cmd_sweep)
+
+    show = add("show", "full detail for one candidate")
+    show.add_argument("id", help="candidate id, e.g. D3f2a1c")
+    show.set_defaults(func=cmd_show)
+
+    adjudicate = add("adjudicate", "record the judgment for one candidate")
+    adjudicate.add_argument("id", help="candidate id, e.g. D3f2a1c")
+    adjudicate.add_argument("verdict", choices=list(VERDICTS), help="ok = benign; blocker = real finding")
+    adjudicate.add_argument("--note", required=True, help="one line of evidence/reasoning (lands in findings.json)")
+    adjudicate.set_defaults(func=cmd_adjudicate)
+
+    status = add("status", "where the review stands; exit 0 only when the gate is open")
+    status.set_defaults(func=cmd_status)
+
+    payload = add("payload", "emit findings.json - refuses while anything is pending")
+    payload.set_defaults(func=cmd_payload)
+
+    args = parser.parse_args()
+    result: int = args.func(args)
+    return result
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/skills/review-publisher/scripts/sampler/__init__.py b/skills/review-publisher/scripts/sampler/__init__.py
new file mode 100644
index 000000000..2e171ea8a
--- /dev/null
+++ b/skills/review-publisher/scripts/sampler/__init__.py
@@ -0,0 +1,18 @@
+"""Structural sampler for Fundus publishers — reduce a crawled pool to layout-diverse articles.
+
+Helper for the review-publisher skill: it lets `review.py crawl` review a few representatives of each
+distinct layout instead of the first N, for full layout coverage at minimal cost. Run from `scripts/`, so:
+
+    from sampler import Sampler
+
+    sampler = Sampler()
+    diverse = sampler.diverse(articles, n=8)        # mode 1: the n most diverse
+    layouts = sampler.per_layout(articles, k=1)     # mode 2: one per distinct layout
+
+    for s in layouts:
+        print(s.layout, s.is_representative, s.url)
+"""
+
+from .sampler import SampledArticle, Sampler, sample_diverse, sample_per_layout
+
+__all__ = ["Sampler", "SampledArticle", "sample_diverse", "sample_per_layout"]
diff --git a/skills/review-publisher/scripts/sampler/_features.py b/skills/review-publisher/scripts/sampler/_features.py
new file mode 100644
index 000000000..35bd1291a
--- /dev/null
+++ b/skills/review-publisher/scripts/sampler/_features.py
@@ -0,0 +1,132 @@
+"""Featurization: raw article HTML -> a publisher-agnostic structural fingerprint, and the
+pairwise distance matrix over a pool of them.
+
+The design choices here are the ones the clustering study landed on:
+
+* **Body isolation by text density, not tags.** A new publisher has no parser, so we can't ask
+  where its body is. `content_region` finds the tightest subtree holding most of the body
+  text, after stripping link-heavy boilerplate (nav/related/comment rails). Pure text + link
+  ratios over tag-agnostic text blocks — no tag/class/parser assumptions — so it generalizes
+  across publishers.
+* **`tagpath` representation.** Each page becomes the multiset of its root-to-node tag paths;
+  TF-IDF over the pool then zeroes out the chrome that's constant across a publisher's articles
+  and amplifies the body structures that vary. Robust even when the body isolation is imperfect.
+"""
+
+from __future__ import annotations
+
+from typing import Dict, List, Optional, Sequence
+
+import lxml.html
+import numpy as np
+from lxml.html import HtmlElement
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_distances
+
+# Tags carrying no layout meaning (or only noise) — skipped when building tag paths.
+_SKIP_TAGS = frozenset({"script", "style", "noscript", "template", "svg", "path", "br", "wbr"})
+
+DEFAULT_TEXT_SHARE = 0.7  # tightest container holding >= this share of body text (validated)
+_MAX_LINK_DENSITY = 0.5  # a block whose text is >this fraction inside <a> is boilerplate
+_MIN_BLOCK_TEXT = 40  # ignore tiny blocks when stripping boilerplate / protecting prose
+
+
+# --- body isolation ---
+
+
+def _link_density(el: HtmlElement) -> float:
+    total = len((el.text_content() or "").strip())
+    if not total:
+        return 0.0
+    return sum(len(a.text_content() or "") for a in el.iter("a")) / total
+
+
+def _direct_text_len(el: HtmlElement) -> int:
+    """Length of the text ``el`` owns directly — its own ``text`` plus its children's ``tail`` — but
+    not text nested deeper. Keeps text blocks disjoint so summing their lengths never double-counts
+    the same characters up the tree."""
+    parts = [el.text or ""] + [child.tail or "" for child in el]
+    return len("".join(parts).strip())
+
+
+def _text_blocks(root: HtmlElement) -> Dict[HtmlElement, int]:
+    """The prose leaves keyed to their direct-text length: every element under ``root`` that owns a
+    substantial run of direct text. Tag-agnostic: it replaces the old ``<p>``-only assumption so that
+    ``<div>``/``<span>``-based bodies are found too, while short link labels (text inside ``<a>``)
+    stay below ``_MIN_BLOCK_TEXT`` and so do not register as prose."""
+    blocks: Dict[HtmlElement, int] = {}
+    for el in root.iter():
+        if not isinstance(el.tag, str):
+            continue
+        length = _direct_text_len(el)
+        if length >= _MIN_BLOCK_TEXT:
+            blocks[el] = length
+    return blocks
+
+
+def _strip_boilerplate(doc: HtmlElement) -> None:
+    """Drop link-heavy blocks in place; never drop a block that still holds real prose."""
+    for el in list(doc.iter("div", "section", "ul", "ol", "nav", "aside", "header", "footer")):
+        if el is doc or el.getroottree().getroot() is not doc:  # root, or removed with an ancestor
+            continue
+        if (
+            len((el.text_content() or "").strip()) >= _MIN_BLOCK_TEXT
+            and _link_density(el) > _MAX_LINK_DENSITY
+            and sum(_text_blocks(el).values()) < _MIN_BLOCK_TEXT
+        ):
+            el.drop_tree()
+
+
+def content_region(doc: HtmlElement, text_share: float = DEFAULT_TEXT_SHARE) -> HtmlElement:
+    """The article body subtree, found by text density after boilerplate removal."""
+    _strip_boilerplate(doc)
+    blocks = _text_blocks(doc)
+    if not blocks:
+        return doc
+    total = sum(blocks.values())
+    scores: Dict[HtmlElement, int] = {}
+    for block, length in blocks.items():
+        for ancestor in block.iterancestors():
+            scores[ancestor] = scores.get(ancestor, 0) + length
+    candidates = [el for el, score in scores.items() if score >= text_share * total]
+    if not candidates:
+        return doc
+    return min(candidates, key=lambda el: sum(1 for _ in el.iter()))
+
+
+# --- tag-path fingerprint ---
+
+
+def tag_paths(region: HtmlElement, max_path_len: Optional[int] = None) -> List[str]:
+    """One ``a/b/c`` root-to-node path per element in the region (noise tags skipped)."""
+    paths: List[str] = []
+
+    def walk(el: HtmlElement, ancestry: List[str]) -> None:
+        if not isinstance(el.tag, str) or el.tag in _SKIP_TAGS:
+            return
+        ancestry = ancestry + [el.tag]
+        kept = ancestry if max_path_len is None else ancestry[-max_path_len:]
+        paths.append("/".join(kept))
+        for child in el:
+            walk(child, ancestry)
+
+    walk(region, [])
+    return paths
+
+
+def fingerprint(html: str, text_share: float = DEFAULT_TEXT_SHARE, max_path_len: Optional[int] = None) -> List[str]:
+    """Tag-path multiset of one article's body region — the unit the distance is computed over."""
+    doc = lxml.html.document_fromstring(html)
+    return tag_paths(content_region(doc, text_share), max_path_len)
+
+
+# --- distance ---
+
+
+def distance_matrix(fingerprints: Sequence[List[str]]) -> np.ndarray:
+    """Symmetric (m, m) tagpath TF-IDF cosine distances; IDF suppresses the shared site chrome."""
+    vectorizer = TfidfVectorizer(analyzer=lambda paths: paths, norm="l2", sublinear_tf=True)
+    matrix = vectorizer.fit_transform(list(fingerprints))
+    distances: np.ndarray = cosine_distances(matrix)
+    np.fill_diagonal(distances, 0.0)
+    return np.asarray(distances, dtype=float)
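To make the tag-path TF-IDF distance concrete, here is a minimal standard-library sketch, not the module's implementation. It assumes the same weighting configuration that `distance_matrix` passes to the vectorizer (sublinear tf, l2 normalization) together with scikit-learn's default smoothed idf, `ln((1 + n) / (1 + df)) + 1`, re-implemented by hand. The names `tfidf_vectors` and `cosine_distance` and the toy `pool` are hypothetical and exist only for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_vectors(fingerprints: Sequence[List[str]]) -> List[Dict[str, float]]:
    """L2-normalized TF-IDF vectors: sublinear tf, smoothed idf (hand-rolled sketch)."""
    n = len(fingerprints)
    # Document frequency: in how many fingerprints does each path occur at least once?
    doc_freq = Counter(path for fp in fingerprints for path in set(fp))
    vectors: List[Dict[str, float]] = []
    for fp in fingerprints:
        counts = Counter(fp)
        weights = {
            path: (1.0 + math.log(tf)) * (math.log((1 + n) / (1 + doc_freq[path])) + 1.0)
            for path, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({path: w / norm for path, w in weights.items()})
    return vectors


def cosine_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """1 - cosine similarity; the inputs are already unit-length."""
    return 1.0 - sum(w * b.get(path, 0.0) for path, w in a.items())


# Hypothetical fingerprints: two text-only articles and one gallery-style article.
pool = [
    ["div", "div/p", "div/p", "div/p"],
    ["div", "div/p", "div/p", "div/p", "div/p"],
    ["div", "div/figure", "div/figure/img", "div/figure", "div/figure/img"],
]
vectors = tfidf_vectors(pool)
text_vs_text = cosine_distance(vectors[0], vectors[1])
text_vs_gallery = cosine_distance(vectors[0], vectors[2])
print(f"text vs text: {text_vs_text:.3f}, text vs gallery: {text_vs_gallery:.3f}")
```

Under this formula a path that appears in every page (`div` here) keeps the minimum idf of 1 rather than 0, so shared chrome is weighted below rarer paths instead of being removed outright. The two text-only pages end up close together and the gallery page far from both.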
diff --git a/skills/review-publisher/scripts/sampler/_select.py b/skills/review-publisher/scripts/sampler/_select.py
new file mode 100644
index 000000000..7480b94f7
--- /dev/null
+++ b/skills/review-publisher/scripts/sampler/_select.py
@@ -0,0 +1,61 @@
+"""Selection primitives over a precomputed distance matrix: unsupervised clustering (auto-k)
+and farthest-point sampling. These back the Sampler's two modes.
+"""
+
+from __future__ import annotations
+
+from typing import List, Optional, Sequence
+
+import numpy as np
+from sklearn.cluster import AgglomerativeClustering
+from sklearn.metrics import silhouette_score
+
+
+def medoid(distances: np.ndarray, members: Sequence[int]) -> int:
+    """Index (into the full matrix) of the most central member — minimizes summed distance."""
+    members = np.asarray(members)
+    sub = distances[np.ix_(members, members)]
+    return int(members[int(np.argmin(sub.sum(axis=1)))])
+
+
+def farthest_first(distances: np.ndarray, n: int, pool: Optional[Sequence[int]] = None) -> List[int]:
+    """Greedy farthest-point sampling over `pool` (default all), seeded at the pool's medoid.
+
+    Returns up to n indices; each pick maximizes its distance to the nearest one already chosen.
+    Seeding at the medoid makes the first pick the most *representative* article, then each
+    subsequent pick adds the most *different* one: "diverse but meaningful".
+    """
+    candidates = np.arange(len(distances)) if pool is None else np.asarray(pool)
+    n = min(n, len(candidates))
+    if n <= 0:
+        return []
+    chosen = [medoid(distances, candidates)]
+    while len(chosen) < n:
+        nearest = distances[np.ix_(candidates, chosen)].min(axis=1)
+        nearest[np.isin(candidates, chosen)] = -1.0  # never re-pick
+        chosen.append(int(candidates[int(np.argmax(nearest))]))
+    return chosen
+
+
+def auto_cluster(distances: np.ndarray, k_max: int) -> np.ndarray:
+    """Average-linkage agglomerative clustering; k chosen unsupervised by maximizing silhouette.
+
+    Returns a layout label per article. Pools smaller than 4 (or ``k_max`` < 2) yield one layout.
+    """
+    n = len(distances)
+    if n < 4 or k_max < 2:
+        return np.zeros(n, dtype=int)
+    best_score, best_labels = -2.0, np.zeros(n, dtype=int)
+    for k in range(2, min(k_max, n - 1) + 1):
+        labels = AgglomerativeClustering(n_clusters=k, metric="precomputed", linkage="average").fit_predict(distances)
+        score = float(silhouette_score(distances, labels, metric="precomputed"))
+        if score > best_score:
+            best_score, best_labels = score, labels
+    return best_labels
+
+
+def layouts_by_size(labels: np.ndarray) -> List[np.ndarray]:
+    """Member indices per layout, largest layout first."""
+    groups = [np.where(labels == label)[0] for label in set(labels.tolist())]
+    groups.sort(key=len, reverse=True)
+    return groups
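The medoid-seeded farthest-point selection used by `farthest_first` can be illustrated on a toy distance matrix. The following is a simplified pure-Python sketch under stated assumptions, not the NumPy code above: it always samples from the whole matrix (no `pool` argument), and the 1-D `points` are hypothetical stand-ins for articles.

```python
from typing import List, Sequence


def medoid(distances: Sequence[Sequence[float]], members: Sequence[int]) -> int:
    """Member with the smallest summed distance to all other members."""
    return min(members, key=lambda i: sum(distances[i][j] for j in members))


def farthest_first(distances: Sequence[Sequence[float]], n: int) -> List[int]:
    """Greedy farthest-point sampling over all indices, seeded at the medoid."""
    pool = list(range(len(distances)))
    n = min(n, len(pool))
    if n <= 0:
        return []
    chosen = [medoid(distances, pool)]
    while len(chosen) < n:
        remaining = [i for i in pool if i not in chosen]
        # Next pick: the candidate whose nearest already-chosen index is farthest away.
        chosen.append(max(remaining, key=lambda i: min(distances[i][j] for j in chosen)))
    return chosen


# Hypothetical pool: five "articles" placed on a line; distance = absolute difference.
points = [0, 1, 2, 10, 50]
distances = [[abs(a - b) for b in points] for a in points]
picks = farthest_first(distances, 3)
print(picks)  # -> [2, 4, 3]
```

The first pick is the medoid (index 2, the most central point), the second is the outlier at 50, and the third is the point at 10, which is farther from both earlier picks than the remaining near-duplicates at 0 and 1.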
diff --git a/skills/review-publisher/scripts/sampler/sampler.py b/skills/review-publisher/scripts/sampler/sampler.py
new file mode 100644
index 000000000..f8cd962e4
--- /dev/null
+++ b/skills/review-publisher/scripts/sampler/sampler.py
@@ -0,0 +1,138 @@
+"""Structural sampler for Fundus publishers.
+
+Given an already-crawled pool of articles, reduce it to a small set that maximizes *layout*
+coverage. Two modes:
+
+    Sampler().diverse(articles, n=8)           # mode 1: the n most diverse/representative articles
+    Sampler().per_layout(articles, k=1)        # mode 2: k representatives from every distinct layout
+
+The pipeline is fully publisher-agnostic — no parser, no body selector, no per-site rules:
+
+    isolate body by text density -> tagpath TF-IDF distance -> select
+
+Mode 1 is farthest-point sampling (fixed size n). Mode 2 clusters the pool into layouts
+(unsupervised, k discovered from the data) and takes k per layout, so the output size reflects how
+many distinct layouts the publisher actually has.
+
+Dependencies beyond Fundus: numpy, scikit-learn, lxml (see the skill's requirements.txt, installed
+by skills/install.py). A helper for the review-publisher skill, not part of the shipped package.
+"""
+
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass
+from typing import Iterable, List, Union
+
+import numpy as np
+
+from fundus import Article
+from fundus.publishers.base_objects import Publisher
+
+from ._features import DEFAULT_TEXT_SHARE, distance_matrix, fingerprint
+from ._select import auto_cluster, farthest_first, layouts_by_size
+
+logger = logging.getLogger(__name__)
+
+# A publisher to crawl, or an already-crawled pool of articles (handy for tests / re-use).
+Source = Union[Publisher, Iterable[Article]]
+
+
+@dataclass(frozen=True)
+class SampledArticle:
+    """One selected article plus where it sits in the sample."""
+
+    article: Article
+    layout: int  # layout/cluster id (mode 2); in mode 1 each pick is its own layout
+    rank: int  # selection order within the returned list
+    is_representative: bool  # the layout's primary representative (medoid) vs an extra per-layout pick
+
+    @property
+    def url(self) -> str:
+        return self.article.html.requested_url
+
+
+class Sampler:
+    """Reduce a crawled pool of articles to a layout-diverse sample.
+
+    Parameters
+    ----------
+    text_share     : body-isolation tightness for the density finder (0.7 validated; <0.9).
+    k_max          : cap on the number of layouts the clustering may find (mode 2).
+    max_path_len   : optionally keep only the last ``max_path_len`` tags of each path (None = full path).
+    """
+
+    def __init__(
+        self,
+        *,
+        text_share: float = DEFAULT_TEXT_SHARE,
+        k_max: int = 12,
+        max_path_len: Union[int, None] = None,
+    ) -> None:
+        self.text_share = text_share
+        self.k_max = k_max
+        self.max_path_len = max_path_len
+
+    # --- public API ---
+
+    def diverse(self, articles: List[Article], n: int) -> List[SampledArticle]:
+        """Mode 1 — the n most diverse yet representative articles (farthest-point sampling)."""
+        if n < 1:
+            raise ValueError("n must be >= 1")
+        if not articles:  # check first: the vectorizer cannot be fitted on an empty pool
+            return []
+        distances = self._compute_distance_matrix(articles)
+        order = farthest_first(distances, n)
+        return [
+            SampledArticle(article=articles[idx], layout=rank, rank=rank, is_representative=True)
+            for rank, idx in enumerate(order)
+        ]
+
+    def per_layout(self, articles: List[Article], k: int = 1) -> List[SampledArticle]:
+        """Mode 2 — k representatives from every distinct layout the publisher uses.
+
+        The number of layouts is discovered unsupervised; the first pick per layout is its medoid
+        (most typical), any extra picks are the most different members within that layout.
+        """
+        if k < 1:
+            raise ValueError("k must be >= 1")
+        if not articles:
+            return []
+
+        distances = self._compute_distance_matrix(articles)
+
+        labels = auto_cluster(distances, min(self.k_max, len(articles) - 1))
+
+        sampled: List[SampledArticle] = []
+        for layout_id, members in enumerate(layouts_by_size(labels)):
+            picks = farthest_first(distances, k, pool=members)  # medoid first, then farthest-within
+            for within_rank, idx in enumerate(picks):
+                sampled.append(
+                    SampledArticle(
+                        article=articles[idx],
+                        layout=layout_id,
+                        rank=len(sampled),
+                        is_representative=(within_rank == 0),
+                    )
+                )
+        return sampled
+
+    # --- internals ---
+
+    def _compute_distance_matrix(self, articles: List[Article]) -> np.ndarray:
+        """Fingerprint each article's body, and build the distance matrix."""
+        fingerprints = [fingerprint(a.html.content, self.text_share, self.max_path_len) for a in articles]
+        return distance_matrix(fingerprints)
+
+
+# --- module-level convenience wrappers ---
+
+
+def sample_diverse(articles: List[Article], n: int, **kwargs: object) -> List[SampledArticle]:
+    """Shortcut for `Sampler(**kwargs).diverse(articles, n)`."""
+    return Sampler(**kwargs).diverse(articles, n)  # type: ignore[arg-type]
+
+
+def sample_per_layout(articles: List[Article], k: int = 1, **kwargs: object) -> List[SampledArticle]:
+    """Shortcut for `Sampler(**kwargs).per_layout(articles, k)`."""
+    return Sampler(**kwargs).per_layout(articles, k)  # type: ignore[arg-type]
diff --git a/src/fundus/parser/base_parser.py b/src/fundus/parser/base_parser.py
index 30f3ab2cf..ebed2c73c 100644
--- a/src/fundus/parser/base_parser.py
+++ b/src/fundus/parser/base_parser.py
@@ -24,6 +24,7 @@
 )
 
 import lxml.html
+from lxml.etree import XPath
 
 from fundus.logging import create_logger
 from fundus.parser.data import LinkedDataMapping
@@ -233,6 +234,22 @@ def __init_subclass__(cls, **kwargs):
     def version(cls) -> str:
         return cls.__name__
 
+    @classmethod
+    def body_selectors(cls) -> Dict[str, Optional[XPath]]:
+        """The version's body selectors, keyed 'summary' / 'subheadline' / 'paragraph'.
+
+        These are the selectors a version feeds to ``extract_article_body_with_selector``;
+        a value is None when the version does not declare the corresponding selector
+        (or builds its body another way entirely). Public so external tooling (e.g. the
+        review skill under skills/) does not have to reach into the private
+        ``_*_selector`` attributes.
+        """
+        return {
+            "summary": getattr(cls, "_summary_selector", None),
+            "subheadline": getattr(cls, "_subheadline_selector", None),
+            "paragraph": getattr(cls, "_paragraph_selector", None),
+        }
+
     @classmethod
     def _search_members(cls, obj_type: type) -> List[Tuple[str, Any]]:
         members = inspect.getmembers(cls, predicate=lambda x: isinstance(x, obj_type)) if obj_type else []
diff --git a/tests/test_review_skill.py b/tests/test_review_skill.py
new file mode 100644
index 000000000..9efa984e0
--- /dev/null
+++ b/tests/test_review_skill.py
@@ -0,0 +1,201 @@
+"""Tests for the review-publisher skill's sweep/store logic (skills/review-publisher/scripts).
+
+The sweep is a review *gate*: each test here pins a failure mode the gate must not have —
+above all, ways it could report a silent false "clean".
+"""
+
+import sys
+from datetime import datetime
+from pathlib import Path
+from types import SimpleNamespace
+from typing import Any, Dict, cast
+
+import lxml.html
+import pytest
+from lxml.etree import XPath
+
+from fundus import Article
+from fundus.parser import BaseParser
+
+_SCRIPTS = Path(__file__).resolve().parents[1] / "skills" / "review-publisher" / "scripts"
+sys.path.insert(0, str(_SCRIPTS))
+
+import _store  # noqa: E402
+import _sweep  # noqa: E402
+
+PARAGRAPH_SELECTOR = XPath("//p[@class='b']")
+SELECTORS: Dict[str, Any] = {"paragraph": PARAGRAPH_SELECTOR, "summary": None, "subheadline": None}
+
+
+def _doc(body_html: str) -> lxml.html.HtmlElement:
+    return lxml.html.document_fromstring(f"<html><body>{body_html}</body></html>")
+
+
+class TestSweepArticle:
+    def test_single_captured_node_falls_back_to_loud_whole_document_walk(self):
+        # One captured <p> must never scope the walk to itself - that covered its whole
+        # subtree and guaranteed a false "clean" on exactly the worst article.
+        doc = _doc(
+            """
+            <div class="content">
+              <p class="b">Intro paragraph text here.</p>
+              <ul><li>First dropped item with enough characters.</li></ul>
+            </div>
+            """
+        )
+        result = _sweep.sweep_article(doc, SELECTORS, ["Intro paragraph text here."])
+        assert result.applicable
+        assert result.loose_scope
+        assert [drop.tag for drop in result.drops] == ["ul"]
+
+    def test_nested_uncaptured_blocks_report_outermost_only(self):
+        doc = _doc(
+            """
+            <div class="content">
+              <p class="b">Para one long enough text.</p>
+              <p class="b">Para two long enough text.</p>
+              <table><tr><td><ul><li>Nested dropped list item text.</li></ul></td></tr></table>
+            </div>
+            """
+        )
+        result = _sweep.sweep_article(doc, SELECTORS, ["Para one long enough text.", "Para two long enough text."])
+        assert not result.loose_scope
+        assert [drop.tag for drop in result.drops] == ["table"]
+
+    def test_duplicated_opening_does_not_suppress_a_dropped_tail(self):
+        # A block that *opens* with body text but carries unseen content is a drop -
+        # the old 60-char prefix probe suppressed it silently.
+        lede = "The lede sentence appears in the body fully."
+        dropped = "Completely different dropped content sentence."
+        doc = _doc(
+            f"""
+            <div class="content">
+              <p class="b">{lede}</p>
+              <p class="b">Second body paragraph with plenty of text.</p>
+              <ul><li>{lede}</li><li>{dropped}</li></ul>
+            </div>
+            """
+        )
+        result = _sweep.sweep_article(doc, SELECTORS, [lede, "Second body paragraph with plenty of text."])
+        assert len(result.drops) == 1
+        assert dropped in result.drops[0].missing_segments
+
+    def test_true_duplicate_is_suppressed_but_visibly(self):
+        text = "Second body paragraph with plenty of text."
+        doc = _doc(
+            f"""
+            <div class="content">
+              <p class="b">First body paragraph with plenty of text.</p>
+              <p class="b">{text}</p>
+              <p class="dek">{text}</p>
+            </div>
+            """
+        )
+        result = _sweep.sweep_article(doc, SELECTORS, ["First body paragraph with plenty of text.", text])
+        assert result.drops == []
+        assert len(result.duplicates) == 1 and text[:40] in result.duplicates[0]
+
+    def test_zero_width_characters_do_not_false_flag(self):
+        # The parser body is normalize_whitespace()-normalized; the sweep must use the
+        # same normalization or zero-width characters cause false drop candidates.
+        raw = "Zero​width joined words make a sentence."
+        from fundus.parser.utility import normalize_whitespace
+
+        doc = _doc(
+            f"""
+            <div class="content">
+              <p class="b">Some captured paragraph with text.</p>
+              <p class="dek">{raw}</p>
+            </div>
+            """
+        )
+        result = _sweep.sweep_article(doc, SELECTORS, [normalize_whitespace(raw)])
+        assert result.drops == []
+
+
+class TestFindLeaks:
+    def test_repeated_unit_is_flagged(self):
+        boilerplate = "Subscribe to our newsletter for daily updates!"
+        units = [(i, [f"unique article text number {i} here", boilerplate]) for i in range(1, 6)]
+        units += [(i, [f"unique article text number {i} here"]) for i in range(6, 11)]
+        leaks = _sweep.find_leaks(units)
+        assert [leak.text for leak in leaks] == [boilerplate]
+        assert leaks[0].article_indices == [1, 2, 3, 4, 5]
+
+    def test_below_threshold_and_tiny_samples_yield_nothing(self):
+        boilerplate = "Subscribe to our newsletter for daily updates!"
+        units = [(i, [f"unique article text number {i} here", boilerplate]) for i in range(1, 5)]
+        units += [(i, [f"unique article text number {i} here"]) for i in range(5, 11)]
+        assert _sweep.find_leaks(units) == []  # 4 of 10 < threshold 5
+        assert _sweep.find_leaks(units[:2]) == []  # <3 articles: scan is inactive
+
+
+class TestStore:
+    def test_save_article_round_trips_exact_bytes(self, tmp_path: Path):
+        # CRLF must survive: text mode would write \r\r\n on Windows and read back \n\n.
+        content = "<html>\r\n<body>line1\r\nline2</body>\r\n</html>"
+        article = SimpleNamespace(
+            html=SimpleNamespace(content=content, requested_url="https://x.test/a", crawl_date=datetime(2026, 6, 1)),
+            title="t",
+            authors=["a"],
+            topics=[],
+            images=[],
+            body=None,
+        )
+        record = _store.save_article(tmp_path, 1, cast(Article, article))
+        assert _store.read_html(tmp_path, record) == content
+        assert _store.body_units(record["body"]) == []
+
+    def test_prepare_cache_dir_refuses_foreign_directories(self, tmp_path: Path):
+        foreign = tmp_path / "foreign"
+        foreign.mkdir()
+        (foreign / "important.txt").write_text("do not delete", encoding="utf-8")
+        with pytest.raises(SystemExit):
+            _store.prepare_cache_dir(foreign)
+        assert (foreign / "important.txt").exists()
+
+        cache = tmp_path / "cache"
+        cache.mkdir()
+        (cache / _store.STATE_FILE).write_text("{}", encoding="utf-8")
+        (cache / "01.html").write_text("x", encoding="utf-8")
+        _store.prepare_cache_dir(cache)  # a real cache is wiped and recreated
+        assert cache.exists() and not any(cache.iterdir())
+
+        _store.prepare_cache_dir(tmp_path / "fresh")  # nonexistent is simply created
+        assert (tmp_path / "fresh").is_dir()
+
+    def test_candidate_ids_are_stable_content_hashes(self):
+        assert _store.candidate_id("drop", "x", "ul", "text") == _store.candidate_id("drop", "x", "ul", "text")
+        assert _store.candidate_id("drop", "x", "ul", "text") != _store.candidate_id("drop", "x", "ul", "other")
+        assert _store.candidate_id("leak", "x", "text").startswith("L")
+
+    def test_payload_gaps_gate(self):
+        candidate = {"id": "Dabc123", "kind": "drop", "text": "x", "articles": [1]}
+        state: Dict[str, Any] = {
+            "publisher": "xx.Test",
+            "crawl": {"pool": 50, "started": 1.0, "finished": 2.0, "completed": True},
+            "articles": [{"index": 1}],
+            "sweep": {"version": None, "swept_at": 3.0, "not_applicable": 0, "candidates": [candidate]},
+            "adjudications": {},
+        }
+        assert any("un-adjudicated" in gap for gap in _store.payload_gaps(state))
+
+        state["adjudications"] = {"Dabc123": {"verdict": "ok", "note": "chrome"}}
+        assert _store.payload_gaps(state) == []
+        assert _store.blocker_candidates(state) == []
+
+        state["sweep"]["swept_at"] = 1.5  # sweep predates the crawl -> stale
+        assert any("re-run `sweep`" in gap for gap in _store.payload_gaps(state))
+
+        state["crawl"]["completed"] = False
+        assert any("did not complete" in gap for gap in _store.payload_gaps(state))
+
+
+class TestBodySelectorsAccessor:
+    def test_declared_and_absent_selectors(self):
+        class Dummy(BaseParser):
+            _paragraph_selector = XPath("//p")
+
+        selectors = Dummy.body_selectors()
+        assert selectors["paragraph"] is Dummy._paragraph_selector
+        assert selectors["summary"] is None and selectors["subheadline"] is None