flairNLP · MaxDall · Jun 8, 2026
diff --git a/docs/1_getting_started.md b/docs/1_getting_started.md
@@ -49,8 +49,8 @@ crawler = Crawler(PublisherCollection)
 
 # How to crawl articles
 
-Now to crawl articles make use of the `crawl()` method of the initialized crawler class.
-Calling this will return an `Iterator` over articles.
+To crawl articles, call the `crawl()` method of the initialized crawler.
+This returns an `Iterator` over articles.
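
Because the result is an iterator, articles are produced lazily and a consumer can stop early. The sketch below does not use Fundus at all: `fake_crawl` is a hypothetical stand-in generator, and its `max_articles` argument only imitates the parameter of the same name, to show the difference between passing a limit and leaving it out.

```python
from itertools import count, islice
from typing import Iterator, Optional


def fake_crawl(max_articles: Optional[int] = None) -> Iterator[str]:
    """Hypothetical stand-in for a crawler; not part of Fundus."""
    # An unbounded lazy source: nothing is produced until it is requested.
    source = (f"article-{i}" for i in count(1))
    # With a limit, stop after that many items; with None, never stop.
    return islice(source, max_articles)


# A limit makes the iterator finite, so it can be fully consumed.
limited = list(fake_crawl(max_articles=2))

# Without a limit this toy source is endless, so the caller decides when to stop.
first_three = list(islice(fake_crawl(), 3))
```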
 
 Let's crawl one news article from a publisher based in the US and print it.
 
@@ -76,7 +76,7 @@ Fundus-Article:
 - From:   FreeBeacon (2023-05-11 18:41)
 ```
 
-You can also crawl all available articles by simply removing the `max_articles` parameter.
+You can also crawl all available articles by simply omitting the `max_articles` parameter.
 
 ```` python
 # crawl all available articles

diff --git a/docs/2_crawl_from_cc_news.md b/docs/2_crawl_from_cc_news.md
@@ -59,9 +59,9 @@ The CC-NEWS dataset consists of multiple terabytes of articles.
 Due to the sheer amount of data, the crawler utilizes multiple processes.
 Per default, it uses all CPUs available in your system.
 You can alter the number of additional processes used for crawling with the `processes` parameter of `CCNewsCrawler`.
-For optimal performance, we recommend setting the amount of process used manually.
+For optimal performance, we recommend setting the number of processes used manually.
 A good rule of thumb is to allocate `one process per 200 Mbps of bandwidth`.
-This can vary depending on the actual speed of your cpu cores.
+This can vary depending on the actual speed of your CPU cores.
 
 ````python
 from fundus import CCNewsCrawler, PublisherCollection
@@ -70,7 +70,7 @@ from fundus import CCNewsCrawler, PublisherCollection
 crawler = CCNewsCrawler(*PublisherCollection, processes=5)
 ````
 
-To omit multiprocessing, pass `-1` to the `processes` parameter.
+To disable multiprocessing, pass `0` to the `processes` parameter.
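
The rule of thumb can be turned into a small calculation. This is only a sketch: `recommended_processes` is a hypothetical helper, not part of Fundus, and capping the result at the number of CPUs is an assumption made here rather than something the docs prescribe.

```python
import math
import os
from typing import Optional


def recommended_processes(bandwidth_mbps: float, cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: one process per 200 Mbps, capped by available CPUs."""
    if bandwidth_mbps < 0:
        raise ValueError("bandwidth_mbps must be non-negative")
    # Fall back to the CPU count reported by the OS (or 1 if it is unknown).
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    by_bandwidth = math.floor(bandwidth_mbps / 200)
    # A result of 0 corresponds to running without additional processes.
    return max(0, min(by_bandwidth, cpus))


# A 1 Gbps link on an 8-core machine gives the value used in the example above.
suggested = recommended_processes(1000, cpu_count=8)
```

The resulting number would then be passed as `processes=...` when constructing `CCNewsCrawler`.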
 
 In the [next section](3_the_article_class.md) we will introduce you to the `Article` class.
 
diff --git a/docs/3_the_article_class.md b/docs/3_the_article_class.md
@@ -38,37 +38,41 @@ Donald Trump asks judge to delay classified documents trial
 Now have a look at the [**attribute guidelines**](attribute_guidelines.md).
 All attributes listed here can be safely accessed through the `Article` class.
 
-**_NOTE:_** The listed attributes represent fields of the `Article` dataclass with all of them having default values.
+> [!NOTE]
+> The listed attributes are exposed as properties of the `Article` class, each falling back to a default value when the parser is unable to extract it.
 
 Some parsers may support additional attributes not listed in the guidelines.
 You can find those attributes under the [**supported publisher**](supported_publishers.md) tables under `Additional Attributes`.
 
-**_NOTE:_** Keep in mind that these additional attributes are specific to a parser and cannot be accessed safely for every article.
+> [!NOTE]
+> Keep in mind that these additional attributes are specific to a parser and cannot be accessed safely for every article.
 
 Sometimes an attribute listed in the attribute guidelines isn't supported at all by a specific parser.
 You can find this information under the `Missing Attributes` tab within the supported publisher tables.
-There is also a built-in search mechanic you can learn about [here](5_advanced_topics)
+There is also a built-in search mechanism you can learn about [here](5_advanced_topics.md).
 
 ## The articles' body
 
 Fundus supports two methods to access the body of the article
 1. Accessing the `plaintext` property of `Article` with `article.plaintext`.
-   This will return a cleaned and formatted version of the article body as a single string object and should be suitable for most use cases. <br>
-   **_NOTE:_** The different DOM elements are joined with two new lines and cleaned with `split()` and `' '.join()`.
+   This will return a cleaned and formatted version of the article body as a single string object and should be suitable for most use cases.
 2. Accessing the `body` attribute of `Article`. 
    This returns an `ArticleBody` instance, granting more fine-grained access to the DOM structure of the article body.
 
+> [!NOTE]
+> When the body is rendered as text, its DOM elements are joined with two newlines and normalized with `split()` and `' '.join()`.
+
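
The normalization described in the note can be sketched in a few lines. `render_plaintext` is a hypothetical helper written for illustration, not Fundus' actual implementation, and skipping whitespace-only elements is an assumption made here.

```python
from typing import Iterable


def render_plaintext(elements: Iterable[str]) -> str:
    """Hypothetical helper mirroring the described normalization; not Fundus' code."""
    # split() without arguments splits on any run of whitespace and drops empties,
    # so ' '.join(...) collapses tabs, newlines, and repeated spaces into one space.
    cleaned = (" ".join(text.split()) for text in elements)
    # Skip elements that were whitespace only, then join the rest with two newlines.
    return "\n\n".join(part for part in cleaned if part)


text = render_plaintext(["  First\tparagraph \n here. ", "   ", "Second  one."])
```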
 The `ArticleBody` consists of
 - a `summary` giving a brief introduction of the article
-- a attribute `sections` containing multiple `ArticleSection`
+- an attribute `sections` containing multiple `ArticleSection`
 
 With `ArticleSection` including
 - a `headline`; separating the section from other sections
 - multiple `paragraphs` following the headline
 
 ````console
-ArticleSection
-    |-- headline: TextSequence
+ArticleBody
+    |-- summary: TextSequence
     |-- sections: List[ArticleSection]
                             |-- headline: TextSequence
                             |-- paragraphs: TextSequence
@@ -101,9 +105,10 @@ This is a paragraph: When someone dies, the executor presents their will [...]
 This is a paragraph: People who would like to keep the details of their [...]
 ```
 
-**_NOTE:_** Not all publishers support the layout format shown above.
-Sometimes headlines are missing or the entire summary is.
-You can always check the specific parser what to expect, but even within publishers, the layout differs from article to article.
+> [!NOTE]
+> Not all publishers support the layout format shown above.
+> Sometimes headlines, or even the entire summary, are missing.
+> You can always check the specific parser to see what to expect, but even within a single publisher, the layout differs from article to article.
 
 ## HTML
 
@@ -116,7 +121,7 @@ Here you have access to the following information:
    Often the same as `requested_url`; can change with redirects.
 3. `content: str`: The HTML content.
 4. `crawl_date: datetime`: The exact timestamp the article was crawled.
-5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purpose.
+5. `source_info: SourceInfo`: Provenance metadata about the HTML's origin, mostly for debugging purposes.
 
 ## Images
 
@@ -170,6 +175,6 @@ for article in crawler.crawl(max_articles=10):
     article_json = article.to_json("title", "plaintext", "lang")
 ````
 
-To save all articles at once, using the default serialization and only specifying a location, refer to [this section](5_advanced_topics.md#saving-the-crawled-articles).
+To save all articles at once, using the default serialization and only specifying a location, refer to [this section](1_getting_started.md#saving-crawled-articles).
 
 In the [**next section**](4_how_to_filter_articles.md) we will show you how to filter articles.
diff --git a/docs/4_how_to_filter_articles.md b/docs/4_how_to_filter_articles.md
@@ -3,7 +3,7 @@
 * [How to filter articles](#how-to-filter-articles)
   * [Extraction filter](#extraction-filter)
     * [Custom extraction filter](#custom-extraction-filter)
-      * [Some more extraction filter examples:](#some-more-extraction-filter-examples)
+      * [Some more extraction filter examples](#some-more-extraction-filter-examples)
   * [URL filter](#url-filter)
     * [Combine filters](#combine-filters)
   * [Filter sources](#filter-sources)
@@ -20,7 +20,7 @@ A specific article may not contain all attributes the parser is capable of extra
 By default, Fundus drops all articles without at least a title, body, and publishing date extracted to ensure data quality.
 To alter this behavior make use of the `only_complete` parameter of the `crawl()` method.
 You have three options to do so:
-- Use the build in `ExtractionFilter` `Requires`, or write a custome one.
+- Use the built-in `ExtractionFilter` `Requires`, or write a custom one.
 - Set it to `false` to disable extraction filtering entirely.
 - Set it to `true` to yield only fully extracted articles.
 
@@ -35,7 +35,8 @@ for article in crawler.crawl(max_articles=2, only_complete=Requires("title", "bo
     print(article)
 ````
 
-**_NOTE:_** We recommend thinking about what kind of data is needed first and then running Fundus with a configured extraction filter afterward.
+> [!NOTE]
+> We recommend thinking about what kind of data is needed first and then running Fundus with a configured extraction filter afterward.
 
 ### Custom extraction filter
 
@@ -64,11 +65,12 @@ for us_themed_article in crawler.crawl(only_complete=topic_filter):
     print(us_themed_article)
 ````
 
-**_NOTE:_** Fundus' filters work inversely to Python's built-in filter.
-A filter in Fundus describes what is filtered out and not what's kept.
-If a filter returns True on a specific element the element will be dropped.
+> [!NOTE]
+> Fundus' filters work inversely to Python's built-in filter.
+> A filter in Fundus describes what is filtered out and not what's kept.
+> If a filter returns `True` for an element, that element is dropped.
 
-#### Some more extraction filter examples:
+#### Some more extraction filter examples
 
 ````python
 # only select articles from the past seven days
@@ -106,8 +108,8 @@ for article in crawler.crawl(max_articles=5, url_filter=regex_filter("advertisem
     print(article.html.requested_url)
 ````
 
-Often it's useful to select certain criteria rather than filtering them.
-To do so use the `inverse` operator from `fundus.scraping.filter.py`.
+Often it's useful to select for certain criteria rather than filter them out.
+To do so, use the `inverse` operator from `fundus.scraping.filter`.
 
 Let's crawl a bunch of articles with URLs including the string `politic`.
 
@@ -131,12 +133,13 @@ https://www.cnbc.com/2023/07/12/thai-elections-deep-generational-divides-belie-t
 https://www.reuters.com/business/autos-transportation/volkswagens-china-chief-welcomes-political-goal-germanys-beijing-strategy-2023-07-13/
 ````
 
-**_NOTE:_** As with the `ExtractionFilter` you can also write custom URL filters satisfying the `URLFilter` protocol.
+> [!NOTE]
+> As with the `ExtractionFilter`, you can also write custom URL filters that satisfy the `URLFilter` protocol.
 
 ### Combine filters
 
 Sometimes it is useful to combine filters of the same kind.
-You can do so by using the `lor` (logic `or`) and `land` (logic `and`) operators from `fundus.scraping.filter.py`.
+You can do so by using the `lor` (logic `or`) and `land` (logic `and`) operators from `fundus.scraping.filter`.
 
 Let's combine both URL filters from the examples above and add a new condition.
 Our goal is to get articles that include both strings 'politic' and 'trump' in their URL and don't include the strings 'podcast' or 'advertisement'.
@@ -169,8 +172,9 @@ https://www.thegatewaypundit.com/2023/06/pres-trump-defends-punching-down-politi
 https://www.thegatewaypundit.com/2023/06/breaking-poll-trump-most-popular-politician-country-rfk/
 ````
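
The "a filter returns `True` to drop" convention and the way `inverse`, `lor`, and `land` compose can be illustrated without the library. The functions below are simplified re-implementations written for this sketch; they borrow the Fundus names only for readability, and the real signatures in `fundus.scraping.filter` may differ. The URLs are made up.

```python
import re
from typing import Callable

# Convention from the text above: a filter returns True for URLs to DROP.
URLFilter = Callable[[str], bool]


def regex_filter(pattern: str) -> URLFilter:
    compiled = re.compile(pattern)
    return lambda url: bool(compiled.search(url))


def inverse(f: URLFilter) -> URLFilter:
    # Turns "drop what matches" into "drop what does not match".
    return lambda url: not f(url)


def lor(*filters: URLFilter) -> URLFilter:
    # Drop if any of the filters would drop.
    return lambda url: any(f(url) for f in filters)


def land(*filters: URLFilter) -> URLFilter:
    # Drop only if all of the filters would drop.
    return lambda url: all(f(url) for f in filters)


# Keep URLs containing both 'politic' and 'trump', but not 'podcast' or 'advertisement'.
drop = lor(
    regex_filter(r"podcast|advertisement"),
    inverse(land(regex_filter("politic"), regex_filter("trump"))),
)

urls = [
    "https://example.com/politics/trump-rally",
    "https://example.com/politics/podcast-trump",
    "https://example.com/sports/trump-golf",
    "https://example.com/politics/budget",
]
kept = [url for url in urls if not drop(url)]
```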
 
-**_NOTE:_** You can use the `combine`, `lor`, and `land` operators on `ExtractionFilter` as well.
-Make sure to only use them on filters of the same kind.
+> [!NOTE]
+> You can use the `lor` and `land` operators on `ExtractionFilter` as well.
+> Make sure to only use them on filters of the same kind.
 
 ## Filter sources
 
@@ -179,7 +183,8 @@ Fundus supports different sources for articles which are split into two categori
 1. Only recent articles: `RSSFeed`, `NewsMap` (recommended for continuous crawling jobs)
 2. The whole site: `Sitemap` (recommended for one-time crawling)
 
-**_NOTE:_** Sometimes the `Sitemap` provided by a specific publisher won't span the entire site.
+> [!NOTE]
+> Sometimes the `Sitemap` provided by a specific publisher won't span the entire site.
 
 You can preselect the source for your articles when initializing a new `Crawler`.
 Let's initiate a crawler who only crawls from `NewsMaps`'s.
@@ -190,7 +195,8 @@ from fundus import Crawler, PublisherCollection, NewsMap
 crawler = Crawler(PublisherCollection.us, restrict_sources_to=[NewsMap])
 ````
 
-**_NOTE:_** The `restrict_sources_to` parameter expects a list as value to specify multiple sources at once, e.g. `[RSSFeed, NewsMap]`
+> [!NOTE]
+> The `restrict_sources_to` parameter expects a list as its value, so you can specify multiple sources at once, e.g. `[RSSFeed, NewsMap]`.
 
 ## Filter unique articles
 
@@ -202,4 +208,4 @@ You can alter this behavior by setting the `only_unique` parameter.
 Finally, the `crawl()` method also allows you to filter articles by language.
 You can do so by passing a list of 2 letter language codes ([ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes)) to the method using the `language_filter` parameter.
 
-In the [next section](5_advanced_topics) we will guide you through advanced topics as how to search through publishers in the `PublisherCollection` and how to deal with deprecated publishers.
+In the [next section](5_advanced_topics.md) we will guide you through advanced topics such as how to search through publishers in the `PublisherCollection` and how to deal with deprecated publishers.
diff --git a/docs/5_advanced_topics.md b/docs/5_advanced_topics.md
@@ -1,13 +1,13 @@
 # Table of Contents
 
-* [Advanced Topics](#advanced-topics)
+* [Advanced topics](#advanced-topics)
   * [How to search for publishers](#how-to-search-for-publishers)
     * [Using `search()`](#using-search)
   * [Working with deprecated publishers](#working-with-deprecated-publishers)
   * [Filtering publishers for AI training](#filtering-publishers-for-ai-training)
   * [Browser impersonation](#browser-impersonation)
 
-# Advanced Topics
+# Advanced topics
 
 This tutorial will show further options such as searching for specific publishers in the `PublisherCollection` or dealing with deprecated ones.
 
@@ -19,7 +19,7 @@ There are quite a few differences between the publishers, especially in the attr
 You can search through the collection to get only publishers fitting your use case by utilizing the `search()` method.
 
 Let's get some publishers based in the US, supporting an attribute called `topics` and `NewsMap` as a source, and use them to initialize a crawler afterward.
-The `search()` method also implements an internal language filter, allowing you to restrict your results to a specific languages.
+The `search()` method also implements an internal language filter, allowing you to restrict your results to specific languages.
 In this example, we are only interested in Spanish articles.
 
 ````python
@@ -32,7 +32,7 @@ crawler = Crawler(*fitting_publishers)
 ## Working with deprecated publishers
 
 When we notice that a publisher is uncrawlable for whatever reason, we will mark it with a deprecated flag.
-This mostly has internal usages, since the default value for the `Crawler` `ignore_deprecated` flag is `False`.
+This is mostly for internal use, since the `Crawler`'s `ignore_deprecated` flag defaults to `False`.
 You can alter this behaviour when initiating the `Crawler` and setting the `ignore_deprecated` flag.
 
 ## Filtering publishers for AI training

diff --git a/docs/6_logging.md b/docs/6_logging.md
@@ -1,7 +1,7 @@
 # Table of Contents
 
 * [Logging in Fundus](#logging-in-fundus)
-  * [Principals](#principals)
+  * [Principles](#principles)
   * [Accessing loggers](#accessing-loggers)
   * [Changing log levels](#changing-log-levels)
   * [Format and handlers](#format-and-handlers)
@@ -10,9 +10,9 @@
 
 This tutorial will introduce you to the logging mechanics used in Fundus
 
-## Principals
+## Principles
 
-Fundus uses module scoped logging with module names as logger names.
+Fundus uses module-scoped logging with module names as logger names.
 Not every module has a logger per se, but every module that logs a message has.
 All module related implementation is centralized in Fundus' logging module under `fundus.logging`.
 
@@ -25,14 +25,15 @@ Fundus uses 4 different log levels:
 
 with default log level for all Fundus loggers being `ERROR`.
 
-*__NOTE__*: Depending on the spawn method (spawn) your OS uses to spawn new processes in python (this effects mostly Windows), log messages beneath `ERROR` won't be received when using multiprocessing. 
+> [!NOTE]
+> Depending on the start method your OS uses to spawn new processes in Python (this mainly affects Windows), log messages below `ERROR` won't be received when using multiprocessing.
 
 ## Accessing loggers
 
 You can import a specific logger from the corresponding module like this:
 
 ````python
-from fundus.scraping.crawler import logger
+from fundus.scraping.crawler.web import logger
 ````
 
 Or find a collection of all existing loggers with their module names here:
@@ -58,7 +59,7 @@ from fundus.logging import set_log_level
 set_log_level(logging.DEBUG)
 ````
 
-## Format and Handlers
+## Format and handlers
 
 By default, all Fundus log messages are written to `stderr` with the following format `%(asctime)s - %(name)s - %(levelname)s - %(message)s`
 To add another handler use the `add_handler` function.
@@ -73,5 +74,6 @@ file_handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(mes
 add_handler(file_handler)
 ````
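
The interplay of log level, handler, and format can be tried out with the standard library alone. This sketch deliberately avoids Fundus' own `set_log_level` and `add_handler` helpers: the logger name `example_package.scraping` is hypothetical, and the format omits `%(asctime)s` so that the output is reproducible.

```python
import io
import logging

# Hypothetical logger name following the "module name as logger name" convention.
logger = logging.getLogger("example_package.scraping")
logger.setLevel(logging.ERROR)  # mirrors the documented default level
logger.propagate = False        # keep the demo isolated from the root logger

# Write to an in-memory stream instead of stderr so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)

logger.warning("below the threshold, so this is discarded")
logger.error("this one is emitted")

logger.setLevel(logging.DEBUG)  # lowering the level lets more messages through
logger.debug("now visible")

lines = stream.getvalue().splitlines()
```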
 
-*__NOTE__*: All of the above can also be done individually for every logger by [accessing loggers](#accessing-loggers) directly.
+> [!NOTE]
+> All of the above can also be done individually for every logger by [accessing loggers](#accessing-loggers) directly.
 
diff --git a/docs/attribute_guidelines.md b/docs/attribute_guidelines.md
@@ -4,14 +4,16 @@ Consistency between publishers and parsers is a main goal, please report any cas
 document.
 If you want to contribute a parser to this library, please ensure that these attributes are named consistently.
 
-**_NOTE:_** There are certain utility functions to aid you with parsing.
-These can be found under `fundus/parser/utility.py`.
-We *highly* recommend using them.
+> [!NOTE]
+> There are certain utility functions to aid you with parsing.
+> These can be found under `fundus/parser/utility.py`.
+> We *highly* recommend using them.
 
 The following table lists Fundus' core attributes and includes the name of the corresponding utility function.
 Those attributes will be validated with unit tests when used.
 
-**_NOTE:_** If you want to bypass validation you can set the `validate` parameter of the `attribute` decorator to false.
+> [!NOTE]
+> If you want to bypass validation, you can set the `validate` parameter of the `attribute` decorator to `False`.
 
 ## Attributes table
 
@@ -60,16 +62,16 @@ Those attributes will be validated with unit tests when used.
     </tr>
     <tr>
         <td>free_access</td>
-        <td>A boolean which is set to be False, if the article is restricted to users with a subscription. This usually indicates
+        <td>A boolean that is False if the article is restricted to users with a subscription. This usually indicates
         that the article cannot be crawled completely.
         <i><b>This attribute is implemented by default</b></i></td>
         <td><code>bool</code></td>
         <td></td>
     </tr>
     <tr>
         <td>images</td>
-        <td>A list of `Images` - Fundus own datatype for image representation - included within the article. 
-        The `Images` include metadata like caption, authors, and position if available.</td>
+        <td>A list of <code>Image</code> objects, Fundus' own datatype for image representation, included within the article.
+        The <code>Image</code> objects include metadata like caption, authors, and position if available.</td>
         <td><code>List[Image]</code></td>
         <td><code>image_extraction</code></td>
     </tr>