Refactor scraping module into layered source/pipeline/crawler packages by MaxDall · Pull Request #933 · flairNLP/fundus

MaxDall · 2026-06-08T18:20:31Z

The scraping/ module had grown into a few large files (crawler.py ~870 lines, scraper.py, html.py) that mixed fetching, parsing, and orchestration. This PR splits them along the three-layer flow from the architecture docs — Source → Pipeline → Crawler — so each layer is its own module and independently testable. No public API change: everything re-exported from fundus/__init__.py stays the same.

Source layer — `scraping/pipeline/source/`

Raw HTML producers behind the HTMLSource protocol:

web.py — WebSource for live-web fetching, with an internal _Pacer that enforces per-host crawl delays and WebSourceInfo carrying per-source metadata.
ccnews.py — CCNewsSource reading the CC-NEWS WARC archive (WarcSourceInfo, with WarcFileLoadError for unreadable segments).
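
For readers who have not opened web.py: the per-host pacing idea can be sketched roughly as below. This is an illustrative stand-in, not the actual _Pacer; the class name `HostPacer`, the `wait` method, and the injectable `clock`/`sleep` parameters are assumptions made for the example, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class HostPacer:
    """Hypothetical pacer: keeps a minimum delay between requests to the same host."""

    def __init__(
        self,
        delay: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        # Earliest point in time at which each host may be requested again.
        self._next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep until the host of `url` may be hit again; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        pause = max(0.0, self._next_allowed.get(host, now) - now)
        if pause:
            self._sleep(pause)
        self._next_allowed[host] = now + pause + self._delay
        return pause
```

Injecting the clock and the sleep function keeps such a helper testable without real waiting.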

Pipeline layer — `scraping/pipeline/__init__.py`

Pipeline pairs a source with the publisher's parser, turning each HTML into an Article or dropping it via the extraction/language filters. Owns the HTMLSource protocol and PipelineError.
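
To make the source/pipeline split concrete, here is a minimal sketch of the shape described above. The method name `fetch`, the simplified `HTML` and `Article` dataclasses, the `ListSource` helper, and the `parse`/`keep` callables are all assumptions for illustration and do not mirror the real signatures.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Protocol


@dataclass
class HTML:
    url: str
    content: str


@dataclass
class Article:
    url: str
    title: Optional[str]


class HTMLSource(Protocol):
    """Anything that can produce raw HTML documents (hypothetical method name)."""

    def fetch(self) -> Iterator[HTML]: ...


class ListSource:
    """In-memory source, handy for tests; satisfies HTMLSource structurally."""

    def __init__(self, pages: Iterable[HTML]) -> None:
        self._pages = list(pages)

    def fetch(self) -> Iterator[HTML]:
        yield from self._pages


class Pipeline:
    """Pairs a source with a parser and drops articles rejected by the filter."""

    def __init__(
        self,
        source: HTMLSource,
        parse: Callable[[HTML], Article],
        keep: Callable[[Article], bool],
    ) -> None:
        self._source = source
        self._parse = parse
        self._keep = keep

    def run(self) -> Iterator[Article]:
        for html in self._source.fetch():
            article = self._parse(html)
            if self._keep(article):
                yield article


def toy_parse(html: HTML) -> Article:
    # Toy parser: take whatever sits between <title> tags, if present.
    start, end = html.content.find("<title>"), html.content.find("</title>")
    title = html.content[start + 7 : end] if start != -1 and end != -1 else None
    return Article(html.url, title)


pages = [HTML("u1", "<title>Hello</title>"), HTML("u2", "<p>no title</p>")]
articles = list(Pipeline(ListSource(pages), toy_parse, lambda a: a.title is not None).run())
```

Because the source is only a structural protocol, a test can swap in an in-memory source without touching the network or the archive.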

Crawler layer — `scraping/crawler/`

Public entry points, split by backend:

base.py — CrawlerBase (the crawl() contract) and the internal _CrawlState (progress/limit bookkeeping).
web.py — Crawler, threaded across publishers, plus the publisher_context_wrapper that scopes per-publisher cancellation state.
ccnews.py — CCNewsCrawler, multiprocessing over the archive.
queueing.py — shared producer/consumer plumbing (enqueue_results, iter_pool_results) and RemoteException for surfacing worker-process errors.
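
The error-surfacing pattern behind RemoteException can be sketched as follows. This is an assumption-laden illustration rather than the code in queueing.py: the `capture_errors` and `unwrap` helpers are hypothetical names, and the class here only carries a formatted traceback string.

```python
import traceback
from typing import Any, Callable


class RemoteException(Exception):
    """Hypothetical carrier for an error raised inside a worker."""

    def __init__(self, formatted: str) -> None:
        super().__init__(formatted)
        self.formatted = formatted


def capture_errors(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a worker function so failures are returned as values, not raised."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except Exception:
            # A plain string travels safely through a queue between workers.
            return RemoteException(traceback.format_exc())

    return wrapper


def unwrap(result: Any) -> Any:
    """Consumer side: re-raise worker errors, pass normal results through."""
    if isinstance(result, RemoteException):
        raise result
    return result
```

Returning the error as a value keeps the consumer loop in control of when and where the failure is raised.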

Cross-cutting

Utils: new concurrency.py (execution-context detection, dill_wrapper for pickling closures across processes, a proxied tqdm for multiprocessing progress) and timing.py (random_sleep); timeout.py rewritten to compose a ResettableTimer instead of subclassing threading.Thread; serialization.py and events.py refactored with fuller docstrings.
Primitives html/article/url/filter refactored, with supporting cleanup in parser/ and publishers/base_objects.py.
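
As a rough picture of "compose instead of subclass" for the timer, consider the sketch below. It is not the timeout.py implementation; the constructor signature and the `start`/`reset`/`cancel` method names are assumptions, and passing `None` as the duration is used here to show how a disabled timer can fall out of the same code path.

```python
import threading
from typing import Callable, Optional


class ResettableTimer:
    """Hypothetical timer that owns a threading.Timer instead of subclassing Thread."""

    def __init__(self, seconds: Optional[float], callback: Callable[[], None]) -> None:
        self._seconds = seconds
        self._callback = callback
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def start(self) -> None:
        """(Re)start the countdown; with seconds=None this is a no-op."""
        with self._lock:
            self._cancel_locked()
            if self._seconds is None:
                return
            self._timer = threading.Timer(self._seconds, self._callback)
            self._timer.daemon = True
            self._timer.start()

    # Resetting is just starting again: the old timer is cancelled first.
    reset = start

    def cancel(self) -> None:
        with self._lock:
            self._cancel_locked()

    def _cancel_locked(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

Composition means only public `threading.Timer` behavior is relied on, so no private thread attributes need to be touched.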

Fixed / improved

5xx retry in the session layer (session.py): the stack previously had no retry anywhere, so any transient 5xx aborted the source. 5xx responses are now retried in place with interruptible full-jitter exponential backoff (configurable via max_retries / retry_backoff_base / retry_backoff_cap, honoring Retry-After); an exhausted retry surfaces as a normal HTTPError.
timeout hardening: dropped the fragile threading.Thread subclass that reached into private _target/_args (with a type: ignore) and the while True and … loop; Timeout(None) now cleanly disables the timer instead of relying on the separate disable= flag.
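
For reviewers unfamiliar with full-jitter backoff, the retry behaviour described above can be approximated like this. It is a sketch under assumptions, not the session.py code: `send`, `backoff_delay`, `fetch_with_retry`, and the local `HTTPError` class are hypothetical stand-ins, the default values are invented, and whether Retry-After is itself capped is left open here.

```python
import random
import threading
from typing import Callable, Optional, Tuple


class HTTPError(Exception):
    """Stand-in for the session's HTTP error type (hypothetical here)."""


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 30.0,
    retry_after: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)); Retry-After wins if given."""
    if retry_after is not None:
        return max(0.0, retry_after)
    return rng() * min(cap, base * (2 ** attempt))


def fetch_with_retry(
    send: Callable[[], Tuple[int, Optional[float]]],
    max_retries: int = 3,
    retry_backoff_base: float = 1.0,
    retry_backoff_cap: float = 30.0,
    stop: Optional[threading.Event] = None,
) -> int:
    """Call `send` until it returns a non-5xx status or retries are exhausted."""
    stop = stop or threading.Event()
    status = 0
    for attempt in range(max_retries + 1):
        status, retry_after = send()  # send() returns (status, Retry-After seconds or None)
        if status < 500:
            return status
        if attempt == max_retries:
            break
        delay = backoff_delay(attempt, retry_backoff_base, retry_backoff_cap, retry_after)
        # Event.wait doubles as an interruptible sleep: it returns True if `stop` was set.
        if stop.wait(delay):
            break
    raise HTTPError(f"giving up after status {status}")
```

Waiting on an event rather than calling `time.sleep` is what lets a cancellation cut the backoff short.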

Tests

Reorganized to mirror the source tree (tests/scraping/{crawler,pipeline,…}, tests/utils/, tests/parser/, tests/publishers/), with shared fixtures/builders under tests/fixtures/ and a tests/README.md documenting conventions. New unit coverage for the timeout, serialization, queueing, and source/pipeline layers. An integration pytest marker separates the mocked-I/O end-to-end tests.

TODO (this PR)

Refactor logging — loggers currently propagate upstream, producing duplicate (double-printed) messages.
Fix Publisher.__eq__ — it compares self.parser by identity and ParserProxy defines no __eq__, so two value-equal publishers never compare equal. The bug is currently pinned by a strict xfail in tests/publishers/test_base_objects.py; remove the xfail once fixed.
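
The equality problem can be reproduced in isolation. The classes below are simplified stand-ins with assumed fields (`name`, `parser`), not the real Publisher or ParserProxy, and the value-comparing variant is only one possible direction for a fix.

```python
class ParserProxy:
    """Simplified stand-in: defines no __eq__, so == falls back to identity."""

    def __init__(self, name: str) -> None:
        self.name = name


class Publisher:
    """Simplified stand-in with assumed fields."""

    def __init__(self, name: str, parser: ParserProxy) -> None:
        self.name = name
        self.parser = parser

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Publisher):
            return NotImplemented
        # parser == parser is an identity check because ParserProxy has no __eq__.
        return self.name == other.name and self.parser == other.parser

    def __hash__(self) -> int:
        return hash(self.name)


class ValueParserProxy(ParserProxy):
    """One possible fix direction (an assumption): compare proxies by value."""

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParserProxy):
            return NotImplemented
        return self.name == other.name

    def __hash__(self) -> int:
        return hash(self.name)


a = Publisher("example", ParserProxy("p"))
b = Publisher("example", ParserProxy("p"))
buggy_equal = a == b  # False: the two proxies are distinct objects

c = Publisher("example", ValueParserProxy("p"))
d = Publisher("example", ValueParserProxy("p"))
fixed_equal = c == d  # True: proxies now compare by value
```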

Outlook (follow-up PRs)

Replace the global __EVENTS__ registry with explicit CancellationToken objects (planned redesign documented at the top of events.py). The current registry conflates cancellation, shutdown propagation, and post-mortem queryability into one string-keyed, thread-id-resolved mechanism, has no clean seam for multiprocessing (threading.Event doesn't cross process boundaries), and forces __EVENTS__.context("test") boilerplate into every source test.
Refactor the parser / publisher module — publishers/base_objects.py was only lightly touched here; the parser-proxy and publisher layer still deserve the same restructuring/test treatment.
Adopt a TypedDict in article.py once PEP 728 lands (tracked by an inline TODO).
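
To give the first outlook item some shape, an explicit token could look roughly like the sketch below. This is speculative and not the design documented in events.py; `CancellationToken`, `child`, and `cancelled` are assumed names. It still wraps `threading.Event`, so it does not by itself address the multiprocessing gap mentioned above.

```python
import threading
from typing import Optional


class CancellationToken:
    """Hypothetical explicit token: passed to whoever needs it, no global registry."""

    def __init__(self, parent: Optional["CancellationToken"] = None) -> None:
        self._event = threading.Event()
        self._parent = parent

    def cancel(self) -> None:
        self._event.set()

    @property
    def cancelled(self) -> bool:
        # A token counts as cancelled if it or any ancestor was cancelled.
        if self._event.is_set():
            return True
        return self._parent is not None and self._parent.cancelled

    def child(self) -> "CancellationToken":
        """Derive a scoped token, e.g. one per publisher under a crawl-wide token."""
        return CancellationToken(parent=self)


crawl = CancellationToken()
publisher_a = crawl.child()
publisher_b = crawl.child()
publisher_a.cancel()  # stops only publisher A
```

A source test would then construct its own token instead of entering a registry context.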


MaxDall added 8 commits June 8, 2026 19:54

refactor utility helpers and add concurrency and timing modules

cec7570

add retry on 5xx errors and refactor session handling

36beda3

refactor html, article, url and filter primitives

71f1149

add layered source and pipeline modules

1c88d04

split crawler into base, web and ccnews modules

636cf7e

restructure test suite into package layout

dab634d

refactor docstrings and documentation

6dc02dc

test timeout handling with real curl_cffi timeouts across session, web source and robots

57cf746

MaxDall wants to merge 8 commits into master from refactor-scraping-module
