6 changes: 3 additions & 3 deletions docs/1_getting_started.md
@@ -49,8 +49,8 @@ crawler = Crawler(PublisherCollection)

# How to crawl articles

Now to crawl articles make use of the `crawl()` method of the initialized crawler class.
Calling this will return an `Iterator` over articles.
To crawl articles, call the `crawl()` method of the initialized crawler.
This returns an `Iterator` over articles.

Let's crawl one news article from a publisher based in the US and print it.

@@ -76,7 +76,7 @@ Fundus-Article:
- From: FreeBeacon (2023-05-11 18:41)
```

You can also crawl all available articles by simply removing the `max_articles` parameter.
You can also crawl all available articles by simply omitting the `max_articles` parameter.

```` python
# crawl all available articles
6 changes: 3 additions & 3 deletions docs/2_crawl_from_cc_news.md
@@ -59,9 +59,9 @@ The CC-NEWS dataset consists of multiple terabytes of articles.
Due to the sheer amount of data, the crawler utilizes multiple processes.
By default, it uses all CPUs available in your system.
You can alter the number of additional processes used for crawling with the `processes` parameter of `CCNewsCrawler`.
For optimal performance, we recommend setting the amount of process used manually.
For optimal performance, we recommend setting the number of processes used manually.
A good rule of thumb is to allocate `one process per 200 Mbps of bandwidth`.
This can vary depending on the actual speed of your cpu cores.
This can vary depending on the actual speed of your CPU cores.
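
The rule of thumb above can be turned into a quick estimate. The following is only a sketch: `suggested_processes` is a hypothetical helper, not part of Fundus, and the bandwidth figure is something you would have to measure yourself.

```python
def suggested_processes(bandwidth_mbps: float, mbps_per_process: float = 200.0) -> int:
    """Estimate a process count from bandwidth (hypothetical helper, not a Fundus API)."""
    # One process per 200 Mbps, rounded down, but never fewer than one process.
    return max(1, int(bandwidth_mbps // mbps_per_process))


# Example: a 1 Gbps (1000 Mbps) connection
print(suggested_processes(1000))  # 5
```

Treat the result as a starting point and adjust it to the speed of your CPU cores.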

````python
from fundus import CCNewsCrawler, PublisherCollection
@@ -70,7 +70,7 @@ from fundus import CCNewsCrawler, PublisherCollection
crawler = CCNewsCrawler(*PublisherCollection, processes=5)
````

To omit multiprocessing, pass `-1` to the `processes` parameter.
To omit multiprocessing, pass `0` to the `processes` parameter.

In the [next section](3_the_article_class.md) we will introduce you to the `Article` class.

31 changes: 18 additions & 13 deletions docs/3_the_article_class.md
@@ -38,37 +38,41 @@ Donald Trump asks judge to delay classified documents trial
Now have a look at the [**attribute guidelines**](attribute_guidelines.md).
All attributes listed here can be safely accessed through the `Article` class.

**_NOTE:_** The listed attributes represent fields of the `Article` dataclass with all of them having default values.
> [!NOTE]
> The listed attributes are exposed as properties of the `Article` class, each falling back to a default value when the parser is unable to extract it.

Some parsers may support additional attributes not listed in the guidelines.
You can find those attributes under the [**supported publisher**](supported_publishers.md) tables under `Additional Attributes`.

**_NOTE:_** Keep in mind that these additional attributes are specific to a parser and cannot be accessed safely for every article.
> [!NOTE]
> Keep in mind that these additional attributes are specific to a parser and cannot be accessed safely for every article.

Sometimes an attribute listed in the attribute guidelines isn't supported at all by a specific parser.
You can find this information under the `Missing Attributes` tab within the supported publisher tables.
There is also a built-in search mechanic you can learn about [here](5_advanced_topics)
There is also a built-in search mechanism you can learn about [here](5_advanced_topics.md).

## The articles' body

Fundus supports two methods to access the body of the article
1. Accessing the `plaintext` property of `Article` with `article.plaintext`.
This will return a cleaned and formatted version of the article body as a single string object and should be suitable for most use cases. <br>
**_NOTE:_** The different DOM elements are joined with two new lines and cleaned with `split()` and `' '.join()`.
This will return a cleaned and formatted version of the article body as a single string object and should be suitable for most use cases.
2. Accessing the `body` attribute of `Article`.
This returns an `ArticleBody` instance, granting more fine-grained access to the DOM structure of the article body.

> [!NOTE]
> When the body is rendered as text, its DOM elements are joined with two newlines and normalized with `split()` and `' '.join()`.
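
As a rough illustration of the normalization described in the note, the following sketch joins hypothetical paragraph strings in the same way. `render_plaintext` is an illustrative name and not the Fundus implementation.

```python
from typing import Iterable


def render_plaintext(elements: Iterable[str]) -> str:
    """Join text elements with two newlines after collapsing whitespace (illustrative only)."""
    # split() without arguments splits on any run of whitespace; ' '.join()
    # glues the pieces back together with single spaces.
    normalized = (" ".join(element.split()) for element in elements)
    return "\n\n".join(normalized)


print(render_plaintext(["First   paragraph\nwith a break.", "  Second paragraph. "]))
```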

The `ArticleBody` consists of
- a `summary` giving a brief introduction of the article
- a attribute `sections` containing multiple `ArticleSection`
- an attribute `sections` containing multiple `ArticleSection`

With `ArticleSection` including
- a `headline`; separating the section from other sections
- multiple `paragraphs` following the headline

````console
ArticleSection
|-- headline: TextSequence
ArticleBody
|-- summary: TextSequence
|-- sections: List[ArticleSection]
    |-- headline: TextSequence
    |-- paragraphs: TextSequence
@@ -101,9 +105,10 @@ This is a paragraph: When someone dies, the executor presents their will [...]
This is a paragraph: People who would like to keep the details of their [...]
```

**_NOTE:_** Not all publishers support the layout format shown above.
Sometimes headlines are missing or the entire summary is.
You can always check the specific parser what to expect, but even within publishers, the layout differs from article to article.
> [!NOTE]
> Not all publishers support the layout format shown above.
> Sometimes headlines are missing or the entire summary is.
> You can always check the specific parser what to expect, but even within publishers, the layout differs from article to article.

## HTML

@@ -116,7 +121,7 @@ Here you have access to the following information:
Often the same as `requested_url`; can change with redirects.
3. `content: str`: The HTML content.
4. `crawl_date: datetime`: The exact timestamp the article was crawled.
5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purpose.
5. `source_info: SourceInfo`: Provenance metadata about the HTML's origin, mostly for debugging purposes.

## Images

@@ -170,6 +175,6 @@ for article in crawler.crawl(max_articles=10):
article_json = article.to_json("title", "plaintext", "lang")
````

To save all articles at once, using the default serialization and only specifying a location, refer to [this section](5_advanced_topics.md#saving-the-crawled-articles).
To save all articles at once, using the default serialization and only specifying a location, refer to [this section](1_getting_started.md#saving-crawled-articles).

In the [**next section**](4_how_to_filter_articles.md) we will show you how to filter articles.
38 changes: 22 additions & 16 deletions docs/4_how_to_filter_articles.md
@@ -3,7 +3,7 @@
* [How to filter articles](#how-to-filter-articles)
* [Extraction filter](#extraction-filter)
* [Custom extraction filter](#custom-extraction-filter)
* [Some more extraction filter examples:](#some-more-extraction-filter-examples)
* [Some more extraction filter examples](#some-more-extraction-filter-examples)
* [URL filter](#url-filter)
* [Combine filters](#combine-filters)
* [Filter sources](#filter-sources)
@@ -20,7 +20,7 @@ A specific article may not contain all attributes the parser is capable of extra
By default, Fundus drops all articles without at least a title, body, and publishing date extracted to ensure data quality.
To alter this behavior make use of the `only_complete` parameter of the `crawl()` method.
You have three options to do so:
- Use the build in `ExtractionFilter` `Requires`, or write a custome one.
- Use the built-in `ExtractionFilter` `Requires`, or write a custom one.
- Set it to `False` to disable extraction filtering entirely.
- Set it to `True` to yield only fully extracted articles.

@@ -35,7 +35,8 @@ for article in crawler.crawl(max_articles=2, only_complete=Requires("title", "bo
print(article)
````

**_NOTE:_** We recommend thinking about what kind of data is needed first and then running Fundus with a configured extraction filter afterward.
> [!NOTE]
> We recommend thinking about what kind of data is needed first and then running Fundus with a configured extraction filter afterward.

### Custom extraction filter

@@ -64,11 +65,12 @@ for us_themed_article in crawler.crawl(only_complete=topic_filter):
print(us_themed_article)
````

**_NOTE:_** Fundus' filters work inversely to Python's built-in filter.
A filter in Fundus describes what is filtered out and not what's kept.
If a filter returns True on a specific element the element will be dropped.
> [!NOTE]
> Fundus' filters work inversely to Python's built-in filter.
> A filter in Fundus describes what is filtered out and not what's kept.
> If a filter returns True on a specific element, the element will be dropped.
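
To make the inverted semantics concrete, here is a plain-Python comparison that does not use Fundus at all; `is_short` is a hypothetical predicate chosen only for illustration.

```python
titles = ["Hi", "A much longer headline", "Ok"]


def is_short(title: str) -> bool:
    return len(title) < 5


# Python's built-in filter() KEEPS elements for which the predicate is True.
kept_by_builtin = list(filter(is_short, titles))

# A Fundus-style filter DROPS elements for which the predicate is True,
# which corresponds to keeping the elements where it returns False.
kept_fundus_style = [title for title in titles if not is_short(title)]

print(kept_by_builtin)
print(kept_fundus_style)
```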

#### Some more extraction filter examples:
#### Some more extraction filter examples

````python
# only select articles from the past seven days
@@ -106,8 +108,8 @@ for article in crawler.crawl(max_articles=5, url_filter=regex_filter("advertisem
print(article.html.requested_url)
````

Often it's useful to select certain criteria rather than filtering them.
To do so use the `inverse` operator from `fundus.scraping.filter.py`.
Often it's useful to select for certain criteria rather than filtering them out.
To do so use the `inverse` operator from `fundus.scraping.filter`.

Let's crawl a bunch of articles with URLs including the string `politic`.

@@ -131,12 +133,13 @@ https://www.cnbc.com/2023/07/12/thai-elections-deep-generational-divides-belie-t
https://www.reuters.com/business/autos-transportation/volkswagens-china-chief-welcomes-political-goal-germanys-beijing-strategy-2023-07-13/
````

**_NOTE:_** As with the `ExtractionFilter` you can also write custom URL filters satisfying the `URLFilter` protocol.
> [!NOTE]
> As with the `ExtractionFilter` you can also write custom URL filters satisfying the `URLFilter` protocol.

### Combine filters

Sometimes it is useful to combine filters of the same kind.
You can do so by using the `lor` (logic `or`) and `land` (logic `and`) operators from `fundus.scraping.filter.py`.
You can do so by using the `lor` (logic `or`) and `land` (logic `and`) operators from `fundus.scraping.filter`.

Let's combine both URL filters from the examples above and add a new condition.
Our goal is to get articles that include both strings 'politic' and 'trump' in their URL and don't include the strings 'podcast' or 'advertisement'.
@@ -169,8 +172,9 @@ https://www.thegatewaypundit.com/2023/06/pres-trump-defends-punching-down-politi
https://www.thegatewaypundit.com/2023/06/breaking-poll-trump-most-popular-politician-country-rfk/
````

**_NOTE:_** You can use the `combine`, `lor`, and `land` operators on `ExtractionFilter` as well.
Make sure to only use them on filters of the same kind.
> [!NOTE]
> You can use the `lor` and `land` operators on `ExtractionFilter` as well.
> Make sure to only use them on filters of the same kind.
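
The combination logic can be reasoned about with plain callables. The sketch below re-implements `inverse`, `lor`, and `land` in a few lines purely for illustration. It assumes a URL filter is a callable that takes a URL string and returns `True` when the URL should be dropped, and the helper `contains` is hypothetical. In real code, use the operators from `fundus.scraping.filter` instead.

```python
import re
from typing import Callable

UrlFilter = Callable[[str], bool]  # assumption: returns True if the URL should be dropped


def contains(pattern: str) -> UrlFilter:
    """Hypothetical helper: drop URLs matching the regular expression."""
    return lambda url: bool(re.search(pattern, url))


def inverse(f: UrlFilter) -> UrlFilter:
    return lambda url: not f(url)


def lor(*filters: UrlFilter) -> UrlFilter:
    return lambda url: any(f(url) for f in filters)


def land(*filters: UrlFilter) -> UrlFilter:
    return lambda url: all(f(url) for f in filters)


# Drop a URL if it mentions 'advertisement' or 'podcast', or if it does NOT
# contain both 'politic' and 'trump'.
combined = lor(
    contains("advertisement|podcast"),
    inverse(land(contains("politic"), contains("trump"))),
)

urls = [
    "https://example.com/politics/trump-speech",
    "https://example.com/podcast/trump-politics",
    "https://example.com/sports/final",
]
kept = [url for url in urls if not combined(url)]
print(kept)
```

Only the first URL survives: the second is dropped by the `podcast` rule, and the third because it lacks both required strings.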

## Filter sources

@@ -179,7 +183,8 @@ Fundus supports different sources for articles which are split into two categori
1. Only recent articles: `RSSFeed`, `NewsMap` (recommended for continuous crawling jobs)
2. The whole site: `Sitemap` (recommended for one-time crawling)

**_NOTE:_** Sometimes the `Sitemap` provided by a specific publisher won't span the entire site.
> [!NOTE]
> Sometimes the `Sitemap` provided by a specific publisher won't span the entire site.

You can preselect the source for your articles when initializing a new `Crawler`.
Let's initialize a crawler that only crawls from `NewsMap` sources.
@@ -190,7 +195,8 @@ from fundus import Crawler, PublisherCollection, NewsMap
crawler = Crawler(PublisherCollection.us, restrict_sources_to=[NewsMap])
````

**_NOTE:_** The `restrict_sources_to` parameter expects a list as value to specify multiple sources at once, e.g. `[RSSFeed, NewsMap]`
> [!NOTE]
> The `restrict_sources_to` parameter expects a list as value to specify multiple sources at once, e.g. `[RSSFeed, NewsMap]`

## Filter unique articles

@@ -202,4 +208,4 @@ You can alter this behavior by setting the `only_unique` parameter.
Finally, the `crawl()` method also allows you to filter articles by language.
You can do so by passing a list of 2 letter language codes ([ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes)) to the method using the `language_filter` parameter.

In the [next section](5_advanced_topics) we will guide you through advanced topics as how to search through publishers in the `PublisherCollection` and how to deal with deprecated publishers.
In the [next section](5_advanced_topics.md) we will guide you through advanced topics such as how to search through publishers in the `PublisherCollection` and how to deal with deprecated publishers.
8 changes: 4 additions & 4 deletions docs/5_advanced_topics.md
@@ -1,13 +1,13 @@
# Table of Contents

* [Advanced Topics](#advanced-topics)
* [Advanced topics](#advanced-topics)
* [How to search for publishers](#how-to-search-for-publishers)
* [Using `search()`](#using-search)
* [Working with deprecated publishers](#working-with-deprecated-publishers)
* [Filtering publishers for AI training](#filtering-publishers-for-ai-training)
* [Browser impersonation](#browser-impersonation)

# Advanced Topics
# Advanced topics

This tutorial will show further options such as searching for specific publishers in the `PublisherCollection` or dealing with deprecated ones.

@@ -19,7 +19,7 @@ There are quite a few differences between the publishers, especially in the attr
You can search through the collection to get only publishers fitting your use case by utilizing the `search()` method.

Let's get some publishers based in the US, supporting an attribute called `topics` and `NewsMap` as a source, and use them to initialize a crawler afterward.
The `search()` method also implements an internal language filter, allowing you to restrict your results to a specific languages.
The `search()` method also implements an internal language filter, allowing you to restrict your results to specific languages.
In this example, we are only interested in Spanish articles.

````python
@@ -32,7 +32,7 @@ crawler = Crawler(*fitting_publishers)
## Working with deprecated publishers

When we notice that a publisher is uncrawlable for whatever reason, we will mark it with a deprecated flag.
This mostly has internal usages, since the default value for the `Crawler` `ignore_deprecated` flag is `False`.
This is mostly for internal use, since the `Crawler`'s `ignore_deprecated` flag defaults to `False`.
You can alter this behaviour when initiating the `Crawler` and setting the `ignore_deprecated` flag.

## Filtering publishers for AI training
16 changes: 9 additions & 7 deletions docs/6_logging.md
@@ -1,7 +1,7 @@
# Table of Contents

* [Logging in Fundus](#logging-in-fundus)
* [Principals](#principals)
* [Principles](#principles)
* [Accessing loggers](#accessing-loggers)
* [Changing log levels](#changing-log-levels)
* [Format and handlers](#format-and-handlers)
@@ -10,9 +10,9 @@

This tutorial will introduce you to the logging mechanics used in Fundus.

## Principals
## Principles

Fundus uses module scoped logging with module names as logger names.
Fundus uses module-scoped logging with module names as logger names.
Not every module has a logger per se, but every module that logs a message has.
All module related implementation is centralized in Fundus' logging module under `fundus.logging`.

@@ -25,14 +25,15 @@ Fundus uses 4 different log levels:

with default log level for all Fundus loggers being `ERROR`.

*__NOTE__*: Depending on the spawn method (spawn) your OS uses to spawn new processes in python (this effects mostly Windows), log messages beneath `ERROR` won't be received when using multiprocessing.
> [!NOTE]
> Depending on the start method your OS uses to spawn new processes in Python (this mainly affects Windows), log messages below `ERROR` won't be received when using multiprocessing.
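
If you are unsure which start method applies on your system, the standard library can report it. This sketch only uses Python's `multiprocessing` module and makes no Fundus calls.

```python
import multiprocessing

# Returns 'fork', 'spawn', or 'forkserver' depending on platform and Python version.
start_method = multiprocessing.get_start_method()
print(f"Start method: {start_method}")
```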

## Accessing loggers

You can import a specific logger from the corresponding module like this:

````python
from fundus.scraping.crawler import logger
from fundus.scraping.crawler.web import logger
````

Or find a collection of all existing loggers with their module names here:
@@ -58,7 +59,7 @@ from fundus.logging import set_log_level
set_log_level(logging.DEBUG)
````

## Format and Handlers
## Format and handlers

By default, all Fundus log messages are written to `stderr` with the following format: `%(asctime)s - %(name)s - %(levelname)s - %(message)s`.
To add another handler use the `add_handler` function.
@@ -73,5 +74,6 @@ file_handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(mes
add_handler(file_handler)
````

*__NOTE__*: All of the above can also be done individually for every logger by [accessing loggers](#accessing-loggers) directly.
> [!NOTE]
> All of the above can also be done individually for every logger by [accessing loggers](#accessing-loggers) directly.
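
Since Fundus loggers are named after their modules, per-logger configuration should work with the standard `logging` API. The sketch below uses a made-up logger name, `example.module`, rather than a real Fundus module path, and writes to an in-memory stream so the effect is easy to inspect.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

# 'example.module' is a placeholder; substitute the module name of the logger you want.
logger = logging.getLogger("example.module")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("crawl started")
logger.debug("not shown, below the configured level")

print(stream.getvalue(), end="")
```

Only the `INFO` message reaches the handler, because the logger's level filters out the `DEBUG` record.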

16 changes: 9 additions & 7 deletions docs/attribute_guidelines.md
@@ -4,14 +4,16 @@ Consistency between publishers and parsers is a main goal, please report any cas
document.
If you want to contribute a parser to this library, please ensure that these attributes are named consistently.

**_NOTE:_** There are certain utility functions to aid you with parsing.
These can be found under `fundus/parser/utility.py`.
We *highly* recommend using them.
> [!NOTE]
> There are certain utility functions to aid you with parsing.
> These can be found under `fundus/parser/utility.py`.
> We *highly* recommend using them.

The following table lists Fundus' core attributes and includes the name of the corresponding utility function.
Those attributes will be validated with unit tests when used.

**_NOTE:_** If you want to bypass validation you can set the `validate` parameter of the `attribute` decorator to false.
> [!NOTE]
> If you want to bypass validation you can set the `validate` parameter of the `attribute` decorator to `False`.

## Attributes table

@@ -60,16 +62,16 @@ Those attributes will be validated with unit tests when used.
</tr>
<tr>
<td>free_access</td>
<td>A boolean which is set to be False, if the article is restricted to users with a subscription. This usually indicates
<td>A boolean that is False if the article is restricted to users with a subscription. This usually indicates
that the article cannot be crawled completely.
<i><b>This attribute is implemented by default</b></i></td>
<td><code>bool</code></td>
<td></td>
</tr>
<tr>
<td>images</td>
<td>A list of `Images` - Fundus own datatype for image representation - included within the article.
The `Images` include metadata like caption, authors, and position if available.</td>
<td>A list of `Image` objects — Fundus' own datatype for image representation included within the article.
The `Image` objects include metadata like caption, authors, and position if available.</td>
<td><code>List[Image]</code></td>
<td><code>image_extraction</code></td>
</tr>