Refactor scraping module into layered source/pipeline/crawler packages #933
Draft
MaxDall wants to merge 8 commits into
The `scraping/` module had grown into a few large files (`crawler.py` ~870 lines, `scraper.py`, `html.py`) that mixed fetching, parsing, and orchestration. This PR splits them along the three-layer flow from the architecture docs (Source → Pipeline → Crawler), so each layer is its own module and independently testable. No public API change: everything re-exported from `fundus/__init__.py` stays the same.

Source layer: `scraping/pipeline/source/`

Raw `HTML` producers behind the `HTMLSource` protocol:

- `web.py`: `WebSource` for live-web fetching, with an internal `_Pacer` that enforces per-host crawl delays and `WebSourceInfo` carrying per-source metadata.
- `ccnews.py`: `CCNewsSource` reading the CC-NEWS WARC archive (`WarcSourceInfo`, with `WarcFileLoadError` for unreadable segments).

Pipeline layer: `scraping/pipeline/__init__.py`

`Pipeline` pairs a source with the publisher's parser, turning each `HTML` into an `Article` or dropping it via the extraction/language filters. Owns the `HTMLSource` protocol and `PipelineError`.

Crawler layer: `scraping/crawler/`

Public entry points, split by backend:

- `base.py`: `CrawlerBase` (the `crawl()` contract) and the internal `_CrawlState` (progress/limit bookkeeping).
- `web.py`: `Crawler`, threaded across publishers, plus the `publisher_context_wrapper` that scopes per-publisher cancellation state.
- `ccnews.py`: `CCNewsCrawler`, multiprocessing over the archive.
- `queueing.py`: shared producer/consumer plumbing (`enqueue_results`, `iter_pool_results`) and `RemoteException` for surfacing worker-process errors.

Cross-cutting

- `concurrency.py` (execution-context detection, `dill_wrapper` for pickling closures across processes, a proxied tqdm for multiprocessing progress) and `timing.py` (`random_sleep`); `timeout.py` rewritten to compose a `ResettableTimer` instead of subclassing `threading.Thread`; `serialization.py` and `events.py` refactored with fuller docstrings.
- `html`/`article`/`url`/`filter` refactored, with supporting cleanup in `parser/` and `publishers/base_objects.py`.

Fixed / improved

- `session.py`: the stack previously had no retry anywhere, so any transient 5xx aborted the source. 5xx responses are now retried in place with interruptible full-jitter exponential backoff (configurable via `max_retries`/`retry_backoff_base`/`retry_backoff_cap`, honoring `Retry-After`); an exhausted retry surfaces as a normal `HTTPError`.
- `timeout` hardening: dropped the fragile `threading.Thread` subclass that reached into private `_target`/`_args` (with a `type: ignore`) and the `while True and …` loop; `Timeout(None)` now cleanly disables the timer instead of relying on the separate `disable=` flag.

Tests

Reorganized to mirror the source tree (`tests/scraping/{crawler,pipeline,…}`, `tests/utils/`, `tests/parser/`, `tests/publishers/`), with shared fixtures/builders under `tests/fixtures/` and a `tests/README.md` documenting conventions. New unit coverage for the timeout, serialization, queueing, and source/pipeline layers. An `integration` pytest marker separates the mocked-I/O end-to-end tests.

TODO (this PR)

- `Publisher.__eq__`: it compares `self.parser` by identity and `ParserProxy` defines no `__eq__`, so two value-equal publishers never compare equal. The bug is currently pinned by a strict `xfail` in `tests/publishers/test_base_objects.py`; remove the `xfail` once fixed.

Outlook (follow-up PRs)

- Replace the `__EVENTS__` registry with explicit `CancellationToken` objects (planned redesign documented at the top of `events.py`). The current registry conflates cancellation, shutdown propagation, and post-mortem queryability into one string-keyed, thread-id-resolved mechanism, has no clean seam for multiprocessing (`threading.Event` doesn't cross process boundaries), and forces `__EVENTS__.context("test")` boilerplate into every source test.
- `publishers/base_objects.py` was only lightly touched here; the parser-proxy and publisher layer still deserve the same restructuring/test treatment.
- `TypedDict` in `article.py` once PEP 728 lands (tracked by an inline `TODO`).
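
For reviewers who want a mental model of the Source → Pipeline split described above, here is a minimal sketch. It is not the code in this PR: the `fetch()` method name, the `ListSource` class, the `title_parser` function, and the fields on `HTML` and `Article` are all assumptions made up for illustration. Only the idea (a protocol-typed source paired with a parser, with rejected documents dropped) comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Protocol


@dataclass(frozen=True)
class HTML:
    # Hypothetical stand-in for the raw document a source yields.
    url: str
    content: str


@dataclass(frozen=True)
class Article:
    # Hypothetical stand-in for a parsed article.
    url: str
    title: str


class HTMLSource(Protocol):
    # Assumed shape: any object with a matching fetch() satisfies the protocol.
    def fetch(self) -> Iterator[HTML]: ...


class ListSource:
    """In-memory source, used here only so the sketch runs without I/O."""

    def __init__(self, documents: Iterable[HTML]) -> None:
        self._documents = list(documents)

    def fetch(self) -> Iterator[HTML]:
        yield from self._documents


class Pipeline:
    """Pairs a source with a parser and drops documents the parser rejects."""

    def __init__(
        self,
        source: HTMLSource,
        parser: Callable[[HTML], Optional[Article]],
    ) -> None:
        self._source = source
        self._parser = parser

    def run(self) -> Iterator[Article]:
        for html in self._source.fetch():
            article = self._parser(html)
            if article is not None:  # None models a filtered-out document
                yield article


def title_parser(html: HTML) -> Optional[Article]:
    # Toy parser: the first non-empty line becomes the title.
    lines = [line.strip() for line in html.content.splitlines() if line.strip()]
    return Article(html.url, lines[0]) if lines else None
```

Because the pipeline only depends on the protocol, a test can swap the live source for an in-memory one, which is the "independently testable" property the split is after.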
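
The retry behaviour listed under "Fixed / improved" can be sketched roughly as follows. The parameter names `max_retries`, `retry_backoff_base`, and `retry_backoff_cap` are taken from the description; everything else (`ServerError`, `backoff_delay`, `fetch_with_retry`, the `stop` event, and treating `Retry-After` as a lower bound on the delay) is an assumption for illustration and may differ from what `session.py` actually does.

```python
import random
import threading
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ServerError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a 5xx status."""

    def __init__(self, status: int, retry_after: Optional[float] = None) -> None:
        super().__init__(f"server responded with {status}")
        self.status = status
        self.retry_after = retry_after


def backoff_delay(
    attempt: int, base: float, cap: float, retry_after: Optional[float] = None
) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)].

    Assumption: a Retry-After value acts as a floor for the delay.
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = random.uniform(0.0, ceiling)
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay


def fetch_with_retry(
    request: Callable[[], T],
    *,
    max_retries: int = 3,
    retry_backoff_base: float = 0.5,
    retry_backoff_cap: float = 30.0,
    stop: Optional[threading.Event] = None,
) -> T:
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    stop = stop or threading.Event()
    attempt = 0
    while True:
        try:
            return request()
        except ServerError as error:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the original error
            delay = backoff_delay(
                attempt, retry_backoff_base, retry_backoff_cap, error.retry_after
            )
            # Event.wait returns True if the event was set, so the sleep
            # can be interrupted from another thread.
            if stop.wait(delay):
                raise
            attempt += 1
```

Sleeping on an event rather than calling `time.sleep` is what makes the backoff interruptible: setting `stop` from another thread ends the wait immediately.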
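
To illustrate the `timeout` hardening item, the sketch below composes a timer instead of subclassing `threading.Thread`. The class names `ResettableTimer` and `Timeout` come from the description, but their interfaces here (`reset()`, `cancel()`, an `expired` event, use as a context manager) are assumptions and not necessarily what `timeout.py` exposes.

```python
import threading
from typing import Callable, Optional


class ResettableTimer:
    """Runs callback once after interval seconds unless reset or cancelled."""

    def __init__(self, interval: Optional[float], callback: Callable[[], None]) -> None:
        self._interval = interval
        self._callback = callback
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def reset(self) -> None:
        """(Re)start the countdown; a None interval leaves the timer disabled."""
        with self._lock:
            self._cancel_locked()
            if self._interval is None:
                return
            self._timer = threading.Timer(self._interval, self._callback)
            self._timer.daemon = True
            self._timer.start()

    def cancel(self) -> None:
        with self._lock:
            self._cancel_locked()

    def _cancel_locked(self) -> None:
        # Caller must hold self._lock.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None


class Timeout:
    """Sets `expired` if the guarded block runs longer than `seconds`."""

    def __init__(self, seconds: Optional[float]) -> None:
        self.expired = threading.Event()
        self._timer = ResettableTimer(seconds, self.expired.set)

    def reset(self) -> None:
        self._timer.reset()

    def __enter__(self) -> "Timeout":
        self._timer.reset()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._timer.cancel()
```

With this shape, `Timeout(None)` needs no separate flag: the `None` interval simply means no `threading.Timer` is ever created.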
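
The `Publisher.__eq__` item in the TODO list is easy to reproduce in isolation. The classes below are simplified stand-ins, not the ones in `publishers/base_objects.py`: the fields, the `ValueParserProxy` subclass, and the idea of comparing proxies by a `name` attribute are assumptions chosen only to show why identity comparison defeats value equality, and one possible direction for a fix.

```python
class ParserProxy:
    """Stand-in that defines no __eq__, so == falls back to identity."""

    def __init__(self, name: str) -> None:
        self.name = name


class ValueParserProxy(ParserProxy):
    """Hypothetical variant that compares by value instead."""

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParserProxy):
            return NotImplemented
        return self.name == other.name

    def __hash__(self) -> int:
        return hash(self.name)


class Publisher:
    """Stand-in whose equality delegates to the equality of its parser."""

    def __init__(self, name: str, domain: str, parser: ParserProxy) -> None:
        self.name = name
        self.domain = domain
        self.parser = parser

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Publisher):
            return NotImplemented
        return (
            self.name == other.name
            and self.domain == other.domain
            and self.parser == other.parser  # identity unless the proxy defines __eq__
        )

    def __hash__(self) -> int:
        return hash((self.name, self.domain))


# Two publishers built from the same values but distinct proxy instances.
first = Publisher("Example", "https://example.com", ParserProxy("example"))
second = Publisher("Example", "https://example.com", ParserProxy("example"))
never_equal = first == second

# Same construction with a proxy that compares by value.
third = Publisher("Example", "https://example.com", ValueParserProxy("example"))
fourth = Publisher("Example", "https://example.com", ValueParserProxy("example"))
value_equal = third == fourth
```

Here `never_equal` is `False` and `value_equal` is `True`, which mirrors the behaviour the strict `xfail` pins today and the behaviour expected once the comparison is fixed.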
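
As a rough picture of what explicit cancellation tokens could look like, consider the sketch below. It is not the redesign documented in `events.py`: the `child()` method, the `cancelled` property, and the toy `crawl` loop are assumptions for illustration. It is also thread-only; crossing process boundaries would need a different primitive, such as `multiprocessing.Event`.

```python
import threading
from typing import Iterable, List, Optional


class CancellationToken:
    """Explicit, passable cancellation flag with optional parent linkage."""

    def __init__(self, parent: Optional["CancellationToken"] = None) -> None:
        self._event = threading.Event()
        self._parent = parent

    def cancel(self) -> None:
        self._event.set()

    @property
    def cancelled(self) -> bool:
        # A token counts as cancelled if it or any ancestor was cancelled,
        # so cancelling the root propagates to every child.
        if self._event.is_set():
            return True
        return self._parent is not None and self._parent.cancelled

    def child(self) -> "CancellationToken":
        return CancellationToken(parent=self)


def crawl(urls: Iterable[str], token: CancellationToken) -> List[str]:
    """Toy loop that checks the token before each unit of work."""
    visited: List[str] = []
    for url in urls:
        if token.cancelled:
            break
        visited.append(url)
    return visited
```

Since the token is an ordinary argument, a test can construct one directly and pass it in, with no registry lookup or context setup.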