
Refactor scraping module into layered source/pipeline/crawler packages #933

Draft
MaxDall wants to merge 8 commits into master from refactor-scraping-module

MaxDall commented Jun 8, 2026

Collaborator

The scraping/ module had grown into a few large files (crawler.py ~870 lines, scraper.py, html.py) that mixed fetching, parsing, and orchestration. This PR splits them along the three-layer flow from the architecture docs — Source → Pipeline → Crawler — so each layer is its own module and independently testable. No public API change: everything re-exported from fundus/__init__.py stays the same.

Source layer — scraping/pipeline/source/

Raw HTML producers behind the HTMLSource protocol:

  • web.py: WebSource for live-web fetching, with an internal _Pacer that enforces per-host crawl delays and WebSourceInfo carrying per-source metadata.
  • ccnews.py: CCNewsSource reading the CC-NEWS WARC archive (WarcSourceInfo, with WarcFileLoadError for unreadable segments).
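
For reviewers unfamiliar with per-host pacing, here is a minimal sketch of the idea. It is not the PR's _Pacer: `HostPacer`, its `wait` method, and the injectable `clock`/`sleep` parameters are hypothetical names chosen for illustration, and the sketch is single-threaded (a real implementation would likely need a lock).

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class HostPacer:
    """Hypothetical stand-in for a per-host crawl-delay gate (not the PR's _Pacer)."""

    def __init__(
        self,
        delay: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        # Earliest time at which each host may be contacted again.
        self._next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the host of `url` may be fetched; return the seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        pause = max(0.0, self._next_allowed.get(host, now) - now)
        if pause:
            self._sleep(pause)
        # The next slot is measured from the moment this request actually starts.
        self._next_allowed[host] = now + pause + self._delay
        return pause
```

Requests to different hosts never wait on each other; only a second request to the same host inside the delay window is held back.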

Pipeline layer — scraping/pipeline/__init__.py

Pipeline pairs a source with the publisher's parser, turning each HTML document into an Article or dropping it via the extraction/language filters. It owns the HTMLSource protocol and PipelineError.
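
The source/parser pairing can be sketched roughly as below. This is an illustration only: the `fetch` method on the protocol, `ListSource`, `run_pipeline`, and the dict-based article are assumptions made for the example, and the real HTMLSource and Pipeline signatures may differ.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Protocol

Article = Dict[str, str]  # simplified placeholder for the real Article type


class HTMLSource(Protocol):
    """Illustrative re-declaration; the real protocol's members may differ."""

    def fetch(self) -> Iterator[str]: ...


class ListSource:
    """Hypothetical in-memory source that satisfies the protocol structurally."""

    def __init__(self, pages: Iterable[str]) -> None:
        self._pages = list(pages)

    def fetch(self) -> Iterator[str]:
        yield from self._pages


def run_pipeline(
    source: HTMLSource,
    parse: Callable[[str], Optional[Article]],
    keep: Callable[[Article], bool],
) -> Iterator[Article]:
    """Yield parsed articles, dropping parse failures and filtered-out results."""
    for html in source.fetch():
        article = parse(html)
        if article is None or not keep(article):
            continue
        yield article
```

Because the source is only a structural protocol, a test can pass an in-memory source like `ListSource` without touching the network or a WARC file.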

Crawler layer — scraping/crawler/

Public entry points, split by backend:

  • base.py: CrawlerBase (the crawl() contract) and the internal _CrawlState (progress/limit bookkeeping).
  • web.py: Crawler, threaded across publishers, plus the publisher_context_wrapper that scopes per-publisher cancellation state.
  • ccnews.py: CCNewsCrawler, multiprocessing over the archive.
  • queueing.py: shared producer/consumer plumbing (enqueue_results, iter_pool_results) and RemoteException for surfacing worker-process errors.
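
As background on why a dedicated exception type helps here: traceback objects cannot be pickled, so a worker has to ship the formatted traceback as text. The sketch below shows one way to do that. `RemoteError`, `guarded`, and `unwrap` are hypothetical names, not the PR's RemoteException or its helpers.

```python
import traceback
from typing import Any, Callable, Tuple


class RemoteError(Exception):
    """Hypothetical carrier for a worker failure, including its formatted traceback."""

    def __init__(self, summary: str, remote_traceback: str) -> None:
        # Passing both values to Exception keeps them in `args`, which is what
        # pickle uses to rebuild the exception in the parent process.
        super().__init__(summary, remote_traceback)
        self.remote_traceback = remote_traceback

    def __str__(self) -> str:
        return str(self.args[0])


def guarded(func: Callable[..., Any], *args: Any) -> Tuple[str, Any]:
    """Run in the worker: return ("ok", value) or ("err", RemoteError)."""
    try:
        return "ok", func(*args)
    except Exception as exc:
        summary = f"{type(exc).__name__}: {exc}"
        return "err", RemoteError(summary, traceback.format_exc())


def unwrap(result: Tuple[str, Any]) -> Any:
    """Run in the parent: return the value or re-raise the worker's failure."""
    status, payload = result
    if status == "err":
        raise payload
    return payload
```

The parent can then log `remote_traceback` alongside its own stack, which keeps the worker's failure location visible.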

Cross-cutting

  • Utils: new concurrency.py (execution-context detection, dill_wrapper for pickling closures across processes, a proxied tqdm for multiprocessing progress) and timing.py (random_sleep); timeout.py rewritten to compose a ResettableTimer instead of subclassing threading.Thread; serialization.py and events.py refactored with fuller docstrings.
  • Primitives html/article/url/filter refactored, with supporting cleanup in parser/ and publishers/base_objects.py.
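
To illustrate the composition approach mentioned for timeout.py, here is a minimal sketch of a timer that owns a `threading.Timer` rather than subclassing `threading.Thread`. The class body is an assumption written for this example and is not the PR's ResettableTimer; treating `None` as "disabled" mirrors the Timeout(None) behaviour described in this PR.

```python
import threading
from typing import Callable, Optional


class ResettableTimer:
    """Illustrative timer built by composition (not the PR's implementation)."""

    def __init__(self, seconds: Optional[float], callback: Callable[[], None]) -> None:
        self._seconds = seconds
        self._callback = callback
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def start(self) -> None:
        """Start the countdown, replacing any countdown already in flight."""
        with self._lock:
            self._cancel_locked()
            if self._seconds is None:  # None disables the timer entirely
                return
            self._timer = threading.Timer(self._seconds, self._callback)
            self._timer.daemon = True
            self._timer.start()

    # Restarting the countdown is the same operation as starting afresh.
    reset = start

    def cancel(self) -> None:
        with self._lock:
            self._cancel_locked()

    def _cancel_locked(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

Nothing here touches private `Thread` attributes, which is the main point of composing instead of subclassing.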

Fixed / improved

  • 5xx retry in the session layer (session.py): the stack previously had no retry anywhere, so any transient 5xx aborted the source. 5xx responses are now retried in place with interruptible full-jitter exponential backoff (configurable via max_retries / retry_backoff_base / retry_backoff_cap, honoring Retry-After); an exhausted retry surfaces as a normal HTTPError.
  • timeout hardening: dropped the fragile threading.Thread subclass that reached into private _target/_args (with a type: ignore) and the while True and … loop; Timeout(None) now cleanly disables the timer instead of relying on the separate disable= flag.
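
For reference, full-jitter backoff draws each delay uniformly between zero and a capped exponential ceiling. The sketch below shows the calculation under a few assumptions: `backoff_delay` and `interruptible_sleep` are hypothetical helpers, their parameters only loosely mirror retry_backoff_base / retry_backoff_cap, Retry-After is taken as already parsed into seconds (the HTTP-date form is not handled), and whether session.py caps a server-supplied value is not something this sketch claims to know.

```python
import random
import threading
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 30.0,
    retry_after: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Assumption: a server-supplied Retry-After (in seconds) wins over jitter.
        return max(0.0, retry_after)
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly below the ceiling instead of sleeping the ceiling.
    return rng() * ceiling


def interruptible_sleep(seconds: float, stop: threading.Event) -> bool:
    """Sleep up to `seconds`; return True if `stop` was set before time ran out."""
    return stop.wait(seconds)
```

Waiting on an event rather than calling `time.sleep` is what lets a cancellation cut a long backoff short.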

Tests

Reorganized to mirror the source tree (tests/scraping/{crawler,pipeline,…}, tests/utils/, tests/parser/, tests/publishers/), with shared fixtures/builders under tests/fixtures/ and a tests/README.md documenting conventions. New unit coverage for the timeout, serialization, queueing, and source/pipeline layers. An integration pytest marker separates the mocked-I/O end-to-end tests.

TODO (this PR)

  • Refactor logging — loggers currently propagate upstream, producing duplicate (double-printed) messages.
  • Fix Publisher.__eq__ — it compares self.parser by identity and ParserProxy defines no __eq__, so two value-equal publishers never compare equal. The bug is currently pinned by a strict xfail in tests/publishers/test_base_objects.py; remove the xfail once fixed.
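
The Publisher.__eq__ item comes down to Python's default equality: a class without `__eq__` compares by identity, so any `==` that delegates to such a member inherits that behaviour. The classes below are hypothetical stand-ins written to reproduce the effect; they are not the real Publisher or ParserProxy, and the `versions` attribute is invented for the example.

```python
from typing import Tuple


class IdentityParser:
    """Stand-in with no __eq__: two instances are equal only if they are the same object."""

    def __init__(self, versions: Tuple[str, ...]) -> None:
        self.versions = versions


class ValueParser(IdentityParser):
    """Same data, but compared by value (one possible shape of a fix)."""

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ValueParser):
            return NotImplemented
        return self.versions == other.versions

    def __hash__(self) -> int:
        return hash(self.versions)


class PublisherLike:
    """Stand-in whose equality delegates to its parser member."""

    def __init__(self, name: str, parser: IdentityParser) -> None:
        self.name = name
        self.parser = parser

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, PublisherLike):
            return NotImplemented
        return self.name == other.name and self.parser == other.parser


# Value-equal publishers compare unequal while the parser falls back to identity.
broken = PublisherLike("x", IdentityParser(("v1",))) == PublisherLike("x", IdentityParser(("v1",)))
fixed = PublisherLike("x", ValueParser(("v1",))) == PublisherLike("x", ValueParser(("v1",)))
```

Here `broken` is False and `fixed` is True, which is the behaviour change the strict xfail is waiting for.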

Outlook (follow-up PRs)

  • Replace the global __EVENTS__ registry with explicit CancellationToken objects (planned redesign documented at the top of events.py). The current registry conflates cancellation, shutdown propagation, and post-mortem queryability into one string-keyed, thread-id-resolved mechanism, has no clean seam for multiprocessing (threading.Event doesn't cross process boundaries), and forces __EVENTS__.context("test") boilerplate into every source test.
  • Refactor the parser / publisher module — publishers/base_objects.py was only lightly touched here; the parser-proxy and publisher layer still deserve the same restructuring/test treatment.
  • Adopt a TypedDict in article.py once PEP 728 lands (tracked by an inline TODO).
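
As a rough picture of what an explicit token could look like, here is a sketch under stated assumptions. The planned design lives in events.py and may differ entirely; `CancellationToken`, its members, and `produce` are hypothetical. Accepting any event-like object is one possible seam for multiprocessing, since a `multiprocessing.Event` exposes the same `set` / `is_set` / `wait` methods.

```python
import threading
from typing import Any, Iterable, Iterator, Optional


class CancellationToken:
    """Hypothetical explicit token passed to whoever should be cancellable."""

    def __init__(self, event: Optional[Any] = None) -> None:
        # Any object with set() / is_set() / wait() works; threading.Event by default.
        self._event = threading.Event() if event is None else event

    def cancel(self) -> None:
        self._event.set()

    @property
    def cancelled(self) -> bool:
        return self._event.is_set()

    def wait(self, timeout: Optional[float] = None) -> bool:
        return self._event.wait(timeout)


def produce(items: Iterable[int], token: CancellationToken) -> Iterator[int]:
    """Yield items until the token is cancelled; checked once per item."""
    for item in items:
        if token.cancelled:
            return
        yield item
```

A source test would then construct its own token and pass it in, with no global registry or context manager involved.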
