diff --git a/skills/review-publisher/PLAYBOOK.md b/skills/review-publisher/PLAYBOOK.md new file mode 100644 index 000000000..480b4f1b4 --- /dev/null +++ b/skills/review-publisher/PLAYBOOK.md @@ -0,0 +1,307 @@ +# How to Review a Publisher PR + +> **Audience: AI coding agents.** This is an operational playbook meant to be executed by an agent, +> not a human contributor guide. It's also the source doc behind the `review-publisher` skill. + +A review playbook for PRs that **add a new publisher** or **add/change a parser version**. The matching +authoring process is [`how_to_add_a_publisher.md`](/docs/how_to_add_a_publisher.md) — you're reviewing the +fruits of it, so the checks below map onto the promises that process makes. + +All mechanics run through one driver. `` below is the skill directory SKILL.md told you +(`${CLAUDE_SKILL_DIR}` is substituted only there); the cross-doc links above and below are +repo-root-relative — read them from the repo root, where you are anyway. + + python "/scripts/review.py" {crawl,sweep,show,adjudicate,status,payload} . + +The driver enforces the bookkeeping (what was crawled, what was swept, what is still un-adjudicated); +this playbook is about the part that stays yours: **judgment**. The split is deliberate — the driver +may refuse, warn, and count, but it never decides; you never count by hand. + +**The unit of review is the PR's diff, not the whole publisher.** Review what *this PR* changes. On a +parser-version bump, scope your reading to the new version and how it differs from the old — don't +re-litigate code the PR doesn't touch. A new publisher is all-new, so there the diff *is* the whole file. + +**A multi-publisher PR is N independent reviews, not one.** Run §2 once per publisher — each gets its +own crawl, sweep, and `status` READY. A blocker on one publisher does **not** discharge the body checks +on the others: the loud finding is a distraction from the clean-reading publishers, which get the *same* +full treatment, never a lighter pass. + +## What you're protecting + +> **The extracted `ArticleBody` must mirror the real article — no missing content, no extra content.** + +This isn't aesthetic. Fundus maps extracted text back to the source HTML for annotation, so a dropped +paragraph or a leaked photo caption corrupts that mapping. Over-capture is a blocker, not a nit. + +## Rules + +- **Don't run `pytest`, `mypy`, `ruff`, or any other checks locally — the GitHub CI covers all of + that.** Your value-add is confirming the parser is correct on **live** articles; the unit test only + checks one frozen HTML snapshot, so a green test proves nothing about today's site or other layouts. +- **Cite everything.** Every finding needs quotable evidence: the verbatim dropped/leaked text, the + article URL, and the offending selector. Adjudication notes are part of that trail — they land in + `findings.json` verbatim. +- **Keep scratch out of the repo.** `review.json` and any throwaway dump go in the cache dir the + driver prints (`/fundus-review/…`) or any OS temp dir — never the working directory. +- **Never truncate driver output.** Don't pipe `crawl` through `[:N]` or cap the printed text — a + slice that cuts off partway through the draw reads clean on articles you never saw. The state file + records what was actually cached, and `status` reconciles the counts for you. + +## Inputs + +- **PR number** and whether it's **your own** PR (changes the verdict event — see §4). **If the user + didn't name a PR, resolve it before doing anything else:** `gh pr view --json number,title,author` + reads the current branch's open PR; show it and confirm. If the branch has no open PR, ask — don't + guess. (The `author` also settles the own-PR question.) +- The publisher(s) under review as `PublisherCollection..`, and the parser version(s) touched. + +## 1. Static read + +```bash +gh pr checkout +gh pr diff +``` + +Skim the diff and sanity-check the parser: + +- `@attribute` return types match [`attribute_guidelines.md`](/docs/attribute_guidelines.md). +- **`@attribute(validate=False)` bypasses CI.** Type conformance is enforced by the unit tests for + *validated* attributes only. Any attribute marked `validate=False` is checked by **nobody but you** — + read its logic and return value by hand. +- **`free_access` / paywall.** If the publisher runs a subscription model, `free_access` must be + implemented (off `isAccessibleForFree` in the `ld+json`). A missing or wrong `free_access` on a + premium publisher is a blocker. If the publisher is fully free, there's nothing to check here. +- **`VALID_UNTIL` is *not* inherited** — every `BaseParser` subclass defaults to `date.max`. When a PR + adds a new version for a layout change, the *previous* version must get an explicit `VALID_UNTIL` + (the day before the change), and the newest must leave it unset. (Subclassing another publisher's + version and leaving `VALID_UNTIL` unset is fine — that's the intended `date.max`.) +- **Version bump fits the change.** Selector-only fix → minor bump (`V1_1(V1)`); new or substantially + changed attributes → major bump (`V2`). Flag a `V2` that's really just selectors, or a minor bump + that quietly changes attribute behavior. +- **Shared-utility changes** (`parser/utility.py`: `generic_topic_parsing`, `apply_result_filter`, + `image_extraction`, …) affect **every** publisher, not just this one. Check the call sites and rely + on the full (CI-run) suite staying green. + +## 2. Crawl live articles and verify the body + +### Crawl once + +```bash +python "/scripts/review.py" crawl . # pool 50 -> 2 articles per layout +python "/scripts/review.py" crawl . --pool 80 # widen the draw if rare layouts are missed +``` + +One live crawl per publisher; everything after it replays the cache, so the read set and the swept +set are the same draw by construction. The crawl draws a **candidate pool** (`--pool`, default 50) +and reduces it to a **layout-diverse subset** — **two representatives per distinct layout** the +publisher uses (its most typical article plus its most different one), **never dropping a layout** — +so you review each layout twice over instead of the first *N* near-duplicate news stories. The number +of layouts is discovered automatically; if a near-uniform publisher collapses to fewer than three +articles, it falls back to the three most-diverse articles so the coherence read and the leak scan +still have something to work with. It prints the **Tier-1 view** (url / title / authors / topics / +image count / body, each tagged with its layout id and `rep`/`extra` role) and caches the selected +articles' exact bytes. Sampling needs the whole pool up front, so an interrupted crawl caches nothing +— just re-run it (the crawl is the only networked step). + +**Layout coverage is the gate, and the sampler is what delivers it.** The subset must span the +layouts that break parsers — a straight news piece, an opinion/column, a **listicle or bullet-list** +piece, and an **image-heavy** one. The pool→sample draw surfaces these automatically; if a layout you +know exists is still missing, **widen the draw (`--pool`) and re-crawl**. If the publisher is too +small or uniform to cover all four, note that in the review. Don't prompt the user for a number; the +default pool is the floor and coverage decides the rest. + +**Stop condition:** 0 articles after a fair attempt means the sources or parser are broken — that is +itself a blocker-level finding; report it, don't silently stall. Many publishers block generic +fetchers, so inspect via the cached html (`show` prints the per-article file paths), never a manual +`urllib`/`requests` fetch. + +### Tier 1 — coherence read (all articles, from the crawl output) + +Read each extracted body. It should read like the article: no dangling sentences, no abrupt jumps, no +sentence referencing something that isn't there ("…fell into either of two groups:" followed by +nothing), and no boilerplate (newsletter sign-ups, "Read More" teasers, photo captions) mixed in. Also +eyeball `title`/`authors`/`topics` for emptiness or obvious junk (UUIDs, section names, the domain, +`Category:` prefixes). Note every article that trips a signal. + +Coherence has **one blind spot**: a paragraph dropped mid-body leaves the surrounding text still +flowing — the seam closes and nothing reads wrong. Tier 2 exists for exactly that. + +### Tier 2 — sweep, then adjudicate every candidate (hard gate, every publisher) + +```bash +python "/scripts/review.py" sweep . # same cache, no network +python "/scripts/review.py" sweep . --version V1_1 # pin a legacy version; re-runs free +``` + +The sweep applies the version's *real* body selectors to each cached article and emits two kinds of +candidate, each with an id: + +- **DROP** — a structural block (`table`/`ul`/`ol`/`dl`/`blockquote`/`pre`) or `

`/`` in the + body container whose text is **absent from the extracted body**. Either real content the selector + misses (blocker) or page chrome (fine). +- **LEAK** — a body unit **repeated across half the cached articles**: the signature of boilerplate + *inside* the body (newsletter pitches, teasers, bios repeat; article text doesn't). + +The boilerplate-vs-body call is yours, and it's recorded, not implied: + +```bash +python "/scripts/review.py" show . # full text + cached html paths +python "/scripts/review.py" adjudicate . ok --note "site-wide cookie banner" +python "/scripts/review.py" adjudicate . blocker --note "agenda table dropped; " +``` + +`ok` means benign (chrome outside the body for a DROP; legitimately recurring content for a LEAK); +`blocker` means a real finding — the note is your evidence line and lands in `findings.json` verbatim. +Adjudicate from the **cached** html (`show` prints each article's `NN.html` path) so there's no "site +changed" ambiguity. Heads-up: the sweep *prints* the blocks it suppressed as duplicates (text already +in the body) — skim them; that suppression is vetoable, not authoritative. + +What the sweep **cannot** do, and stays on you: + +- **Over-capture beyond repetition.** The LEAK scan only catches *recurring* boilerplate. A one-off + caption or "Related:" line leaked into a single body never repeats — on the articles you read, + check the body for anything that isn't article content. +- **Not-sweepable articles.** A version without a `_paragraph_selector` builds its body another way; + the sweep reports it N/A and you do the by-hand walk below. +- **Layout coverage** (§2 crawl) and the **image attributes** (§3). + +When you want the raw container — a candidate needs context, or an article is N/A — walk the cached +bytes directly: + +```python +from pathlib import Path +from lxml import html as lhtml +doc = lhtml.fromstring(Path("/01.html").read_bytes()) # path from `show` +# ILLUSTRATIVE selector - use the actual ones: the version class's public accessor +# `.body_selectors()` returns the real paragraph/summary/subheadline selectors. +for el in doc.xpath("//div[@class='story-content']/*"): + print(el.tag, "|", " ".join(el.text_content().split())[:100]) +``` + +**Delegating to subagents.** When multiple candidates trip or the HTML is large, fan out — one +cheap-model subagent per suspect article, pointed at the cached `NN.html` and the candidate's `show` +output, so the dumps stay out of this context and nothing re-crawls. The subagent reports evidence; +the adjudication call stays with you. + +### The gate + +```bash +python "/scripts/review.py" status . +``` + +`status` is the self-audit the old coverage table used to be, but machine-checked: it exits non-zero +until the crawl completed, the sweep ran on it, and **every candidate is adjudicated**. It also lists +what it *can't* check (coherence read, layout coverage, one-off over-capture, images) — those become +one line each in your review body per publisher. **No verdict before `status` reports READY for every +publisher in the PR.** + +## 3. Diagnose a miss + +When the sweep shows content missing, the usual culprits: + +- **`