# https://github.com/adbar/trafilatura Project Manual

Generated at: 2026-06-19 18:34:18 UTC

## Table of Contents

- [Overview, Installation, and Quickstart](#page-1)
- [Core Extraction Engine: Text and Metadata](#page-2)
- [Web Discovery, Crawling, and Downloads](#page-3)
- [Settings, Output Formats, and Known Issues](#page-4)

<a id='page-1'></a>

## Overview, Installation, and Quickstart

### Related Pages

Related topics: [Core Extraction Engine: Text and Metadata](#page-2), [Web Discovery, Crawling, and Downloads](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/adbar/trafilatura/blob/main/README.md)
- [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py)
- [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py)
- [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py)
- [trafilatura/metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/metadata.py)
- [trafilatura/json_metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/json_metadata.py)
- [trafilatura/readability_lxml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/readability_lxml.py)
- [CONTRIBUTING.md](https://github.com/adbar/trafilatura/blob/main/CONTRIBUTING.md)
- [HISTORY.md](https://github.com/adbar/trafilatura/blob/main/HISTORY.md)
</details>

# Overview, Installation, and Quickstart

## Purpose and Scope

Trafilatura is a Python library and command-line tool for gathering text on the Web. It bundles web crawling/scraping, downloading, extraction of main text, metadata, and comments into a single package. According to the package metadata in [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py), it is released under the Apache-2.0 license (versions 1.8.0 and later) and the current stable version is **2.1.0**.

The library is designed for researchers and engineers who need to build web corpora or scrape articles at scale. The [README.md](https://github.com/adbar/trafilatura/blob/main/README.md) documents that it supports TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI output formats and that it is integrated into thousands of projects by organizations such as HuggingFace, IBM, and Microsoft Research.

The public API is intentionally narrow. As shown in [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py), the top-level exports are `bare_extraction`, `baseline`, `extract`, `extract_metadata`, `extract_with_metadata`, `fetch_response`, `fetch_url`, `html2txt`, and `load_html`. The three most commonly used functions for new users are `extract`, `extract_metadata`, and `fetch_url`.

## High-Level Architecture

Trafilatura separates concerns into three layers: downloading, parsing/loading HTML, and extraction. The data flow is straightforward and is the same whether the entry point is the CLI or Python:

```mermaid
flowchart LR
    A[URL or HTML string] --> B[fetch_url / load_html]
    B --> C[lxml HtmlElement tree]
    C --> D[Pruning + XPath selection]
    D --> E{bare_extraction}
    E --> F[Document metadata]
    E --> G[Main text body]
    E --> H[Comments]
    G --> I[Output formatter]
    I --> J[txt / markdown / json / xml / xmltei / csv / html]
```

The `extract()` and `bare_extraction()` functions in [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py) accept both raw HTML and pre-loaded trees, which lets the caller skip the download step when working with cached files. When `with_metadata=True`, metadata is computed first by `extract_metadata()` in [trafilatura/metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/metadata.py), including JSON-LD parsing from [trafilatura/json_metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/json_metadata.py). A readability-style fallback is provided through [trafilatura/readability_lxml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/readability_lxml.py), and the `baseline()`/`html2txt()` functions offer a simpler alternative extractor.

## Installation

The package is published on PyPI. According to the [README.md](https://github.com/adbar/trafilatura/blob/main/README.md), a minimal install is:

```shell
pip install trafilatura
```

Optional add-ons documented in the README include language detection and faster downloads:

```shell
pip install "trafilatura[all]"
```

### Dependency Notes (community-reported issues)

Trafilatura 2.1.0 updates its `lxml` requirement, as noted in issue #532 and in the release notes. Older releases imported `lxml.html.clean`, which `lxml` 5.2.0 moved into the separate `lxml_html_clean` project, so an old trafilatura paired with `lxml` 5.2.0 or later fails at import time. Users moving from `lxml<5.2` must therefore also move to a trafilatura release that supports the newer `lxml`. Separately, issue #846 points to a CVE in older `lxml` versions (pre-6.1.0), so upgrading the dependency is also a security concern. The fix for the import error is therefore to upgrade both `lxml` and `trafilatura`:

```shell
pip install --upgrade lxml trafilatura
```
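
To confirm which versions are actually installed before and after the upgrade, the standard library's `importlib.metadata` can be queried. This is a minimal sketch; the helper name `installed_version` is hypothetical and not part of trafilatura.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages discussed above.
for name in ("trafilatura", "lxml"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```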

## Quickstart

### Python: Extract Main Text from a URL

The shortest path from URL to plain text uses `fetch_url` followed by `extract`, both exported from [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py):

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    text = trafilatura.extract(downloaded)
    print(text)
```

`fetch_url` returns the HTML body as a string or `None` on failure. `extract` returns a string in the requested `output_format` or `None` if extraction is not possible. The supported formats are listed in the docstring of `_internal_extraction` in [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py): `txt`, `markdown`, `csv`, `json`, `html`, `xml`, and `xmltei`.
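
Because `output_format` is passed as a plain string, a typo only surfaces at call time. The sketch below validates the name against the seven values listed above before delegating to `extract`. The wrapper `extract_as` and the constant `SUPPORTED_FORMATS` are hypothetical helpers, not part of the library.

```python
from typing import Optional

# The seven values named above for `output_format`.
SUPPORTED_FORMATS = ("txt", "markdown", "csv", "json", "html", "xml", "xmltei")


def extract_as(html: str, output_format: str = "txt") -> Optional[str]:
    """Validate the format name, then delegate to trafilatura.extract."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    # Imported here so the validation above can be used without the package.
    import trafilatura

    return trafilatura.extract(html, output_format=output_format)
```

With trafilatura installed, `extract_as(downloaded, "markdown")` would return Markdown or `None`, mirroring the behavior of `extract` described above.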

### Python: Extract Metadata

For structured access to title, author, date, sitename, and categories, use `extract_metadata`:

```python
import trafilatura

html = trafilatura.fetch_url("https://example.org/article")
if html is not None:  # fetch_url returns None when the download fails
    tree = trafilatura.load_html(html)
    metadata = trafilatura.extract_metadata(tree)
    print(metadata.title, metadata.author, metadata.date, metadata.url)
```

The `Document` dataclass returned by `extract_metadata` is defined in [trafilatura/metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/metadata.py), which combines XPath selectors from `xpaths.py` with JSON-LD parsing routines in [trafilatura/json_metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/json_metadata.py).

### Python: Full Bundle with `bare_extraction`

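When you need both content and metadata in one call, use `bare_extraction`, which returns Python data instead of a serialized string. In the 2.x series the result is a `Document` object (call its `as_dict()` method for a plain dict); older releases returned a dict directly: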

```python
import trafilatura

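downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:  # skip extraction when the download failed
    result = trafilatura.bare_extraction(downloaded, with_metadata=True)
    # result fields include 'text', 'comments', 'title', 'author', ...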
```

The `with_metadata` and `only_with_metadata` flags are documented in [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py). The latter short-circuits if essential metadata (date, title, URL) is missing.
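
The short-circuit rule can be pictured as a simple presence check on three fields. The sketch below is an illustration under that assumption, not the library's code; `PageMetadata` and `has_essential_metadata` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageMetadata:
    """Hypothetical stand-in for the metadata fields discussed above."""

    title: Optional[str] = None
    date: Optional[str] = None
    url: Optional[str] = None


def has_essential_metadata(meta: PageMetadata) -> bool:
    """True only when date, title, and URL are all present and non-empty."""
    return all((meta.date, meta.title, meta.url))


complete = PageMetadata(title="Example", date="2024-01-01", url="https://example.org/a")
partial = PageMetadata(title="Example")
print(has_essential_metadata(complete), has_essential_metadata(partial))
```

A caller using `only_with_metadata=True` should expect `None` for pages that would fail a check of this kind.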

### Common Configuration Knobs

The `extract` function accepts a wide range of toggles. The most useful ones, defined in [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py), are listed below.

| Option | Purpose | Notes |
| --- | --- | --- |
| `output_format` | Output serialization | One of `txt`, `markdown`, `csv`, `json`, `html`, `xml`, `xmltei` |
| `fast` | Skip fallback extractors | Trades recall for speed |
| `favor_precision` / `favor_recall` | Bias extraction | Mutually exclusive presets |
| `include_comments` | Extract comment sections | Default `True` |
| `include_tables` | Keep `<table>` content | See community issue #794 below |
| `include_links` | Preserve hyperlinks | See community issue #794 below |
| `include_images` | Preserve image references | Experimental |
| `target_language` | Filter by ISO 639-1 code | Requires `py3langid` |
| `deduplicate` | Drop repeated segments | |
| `with_metadata` / `only_with_metadata` | Metadata handling | |
| `url_blacklist` / `author_blacklist` | Filtering | Sets of strings |
| `date_extraction_params` | Pass-through to `htmldate` | Dict |

### Command-Line Quickstart

The package also installs a `trafilatura` CLI. The [README.md](https://github.com/adbar/trafilatura/blob/main/README.md) links to [usage-cli.html](https://trafilatura.readthedocs.io/en/latest/usage-cli.html) for the full reference. The most common invocations are:

```shell
# Download and extract to plain text
trafilatura -u "https://example.org/article"

# Convert an existing HTML file to Markdown (the file is read from standard input)
trafilatura --output-format markdown < page.html

# Keep links and output JSON with metadata (tables are kept by default)
trafilatura -u "https://example.org/article" --json --with-metadata --links
```

## Known Limitations and Community-Reported Issues

Several community-reported issues are worth flagging for new users:

- **Issue #532: `lxml` 5.2.0 breaks import.** Older trafilatura releases import `lxml.html.clean`, which `lxml` 5.2.0 moved into the separate `lxml_html_clean` project. Users must upgrade trafilatura (2.1.0 is current) or pin `lxml` below 5.2.0. Source: community context.
- **Issues #777 and #794: table output.** Issue #777 reports incorrect table tags in HTML-formatted output. Issue #794 reports that `include_links=True` (or the `--links` CLI flag) breaks some tables when combined with `include_tables=True`. Source: community context and [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py) (which exposes both flags).
- **Issue #846: release cadence and CVE exposure.** Because releases are infrequent, downstream users must monitor dependency CVEs (notably in `lxml`) themselves. The 2.1.0 release addresses the immediate `lxml` concern. Source: community context.

## See Also

- Core Functions reference (CLI and Python)
- Configuration file format (`settings.cfg`)
- Evaluation benchmarks and accuracy comparisons
- Contributing guide ([CONTRIBUTING.md](https://github.com/adbar/trafilatura/blob/main/CONTRIBUTING.md))

---

<a id='page-2'></a>

## Core Extraction Engine: Text and Metadata

### Related Pages

Related topics: [Overview, Installation, and Quickstart](#page-1), [Settings, Output Formats, and Known Issues](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py)
- [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py)
- [trafilatura/metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/metadata.py)
- [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py)
- [trafilatura/xpaths.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/xpaths.py)
- [trafilatura/json_metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/json_metadata.py)
- [trafilatura/readability_lxml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/readability_lxml.py)
- [HISTORY.md](https://github.com/adbar/trafilatura/blob/main/HISTORY.md)
- [README.md](https://github.com/adbar/trafilatura/blob/main/README.md)
</details>

# Core Extraction Engine: Text and Metadata

## Overview

Trafilatura's core extraction engine is the heart of the library, responsible for transforming raw HTML into clean, structured main text and metadata. The package exposes this functionality primarily through three functions in the `core` module: `extract`, `bare_extraction`, and `extract_with_metadata`, all of which are re-exported in the top-level `trafilatura` namespace for convenience. Source: [trafilatura/__init__.py:18-22]().

The engine is designed to balance precision and recall while supporting multiple output formats (TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI). According to the README, Trafilatura "consistently outperforms other open-source libraries in text extraction" as measured by ROUGE-LSum Mean F1 Page Scores in independent benchmarks. Source: [README.md:142-145]().

## Architecture and Pipeline

The extraction pipeline can be conceptualized as a series of stages that progressively refine HTML input into clean output. The `core` module orchestrates this through an internal `_internal_extraction` helper, which is invoked by the public functions. Source: [trafilatura/core.py:55-100]().

```mermaid
flowchart TD
    A[HTML Input] --> B[Load & Parse HTML<br/>load_html]
    B --> C[Prune Unwanted Nodes<br/>prune_unwanted_nodes]
    C --> D[HTMLProcessing<br/>link conversion, cleaning]
    D --> E[Main Text Extraction<br/>XPath-based + fallbacks]
    E --> F[Metadata Extraction<br/>metadata.py]
    F --> G[Format Output<br/>txt, markdown, csv, json, html, xml, xmltei]
    G --> H[Document Object]

    style E fill:#f9f,stroke:#333
    style F fill:#bbf,stroke:#333
```

The pipeline begins by loading the HTML, either from raw content or a URL, and then applies a series of transformations: pruning unwanted nodes, processing links and formatting, identifying the main content region via XPath expressions, and finally extracting both the body text and structured metadata. The main text extraction leverages compiled XPath expressions defined in `xpaths.py`, which target elements such as `<article>`, `<main>`, and common class names like "content" or "article-body". Source: [trafilatura/xpaths.py:1-30]().
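
The idea of trying content selectors in priority order can be sketched with the standard library alone. This is a simplification under stated assumptions: it uses `xml.etree.ElementTree`, which needs well-formed markup and supports only a subset of XPath, whereas trafilatura works on `lxml` trees. The `CANDIDATE_PATHS` list and `find_main_region` are hypothetical and far shorter than the expressions in `xpaths.py`.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Illustrative candidates, tried in order of preference.
CANDIDATE_PATHS = (
    ".//article",
    ".//main",
    ".//div[@class='content']",
    ".//div[@class='article-body']",
)


def find_main_region(tree: ET.Element) -> Optional[ET.Element]:
    """Return the first element matched by any candidate path, else None."""
    for path in CANDIDATE_PATHS:
        match = tree.find(path)
        if match is not None:
            return match
    return None


page = ET.fromstring(
    "<html><body><nav>Menu</nav>"
    "<div class='content'><p>Main text.</p></div>"
    "<footer>Imprint</footer></body></html>"
)
region = find_main_region(page)
print("".join(region.itertext()) if region is not None else None)
```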

## Text Extraction

The main extraction logic is centralized in the `extract` function, which accepts extensive configuration parameters documented in its docstring. Key parameters include `output_format` (supporting "txt", "markdown", "csv", "json", "html", "xml", and "xmltei"), `include_tables`, `include_images`, `include_links`, `include_formatting`, and `include_comments`. Source: [trafilatura/core.py:10-50]().

### Precision vs. Recall Trade-offs

The engine supports three operating modes that tune the extraction behavior:

- **Default**: Balanced precision and recall.
- **`favor_precision=True`**: "prefer less text but correct extraction."
- **`favor_recall=True`**: "when unsure, prefer more text."

The `fast=True` option uses "faster heuristics and skip backup extraction," trading thoroughness for speed. Source: [trafilatura/core.py:18-22]().

### Fallback Strategies

When the primary XPath-based extraction fails or yields insufficient content, Trafilatura falls back to alternative algorithms. The `readability_lxml.py` module implements a port of the Readability algorithm, which scores candidate elements to identify the main content. Source: [trafilatura/readability_lxml.py:1-40](). The `baseline` function provides additional fallback extraction using simpler heuristics. The engine may also invoke `jusText` as a tertiary fallback, depending on configuration.
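
The fallback logic can be thought of as a chain of extractors tried in order until one produces enough text. The sketch below shows that pattern in isolation; it is not trafilatura's implementation. The function `extract_with_fallbacks`, the stand-in extractors, and the `min_size` default are all hypothetical.

```python
from typing import Callable, Optional, Sequence

ExtractFn = Callable[[str], Optional[str]]


def extract_with_fallbacks(
    html: str, extractors: Sequence[ExtractFn], min_size: int = 250
) -> Optional[str]:
    """Return the first result of at least min_size characters.

    If no extractor reaches the threshold, return the longest result seen.
    """
    best: Optional[str] = None
    for extractor in extractors:
        result = extractor(html)
        if result and len(result) >= min_size:
            return result
        if result and (best is None or len(result) > len(best)):
            best = result
    return best


def primary(html: str) -> Optional[str]:
    return None  # pretend the XPath pass found nothing


def readability_like(html: str) -> Optional[str]:
    return "short"  # pretend the second pass found too little


def baseline_like(html: str) -> Optional[str]:
    return "x" * 300  # pretend the last pass found enough text


result = extract_with_fallbacks("<html></html>", [primary, readability_like, baseline_like])
print(len(result))
```

Setting `fast=True`, as described earlier, corresponds to running only the first entry of such a chain.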

## Metadata Extraction

Metadata extraction is handled by the `metadata` module, which exports a `Document` class and an `extract_metadata` function. The module scrapes metadata from multiple sources, including HTML `<meta>` tags, JSON-LD structured data, microdata, and visible page elements. Source: [trafilatura/metadata.py:1-30]().

The `json_metadata.py` module contains specialized parsers for JSON-LD and Schema.org markup. It recognizes numerous schema types including `NewsArticle`, `Report`, `BlogPosting`, `ScholarlyArticle`, and others. The module uses compiled regular expressions to extract author names, publisher information, article sections, and content types from JSON-LD blocks. Source: [trafilatura/json_metadata.py:1-25]().
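
To make the JSON-LD step concrete, here is a minimal sketch that pulls the headline and author name out of an embedded block. It uses the `json` module on top of one regular expression, which differs from the regex-driven parser described above. `article_metadata`, `JSON_LD_BLOCK`, and the reduced `ARTICLE_TYPES` set are hypothetical names for illustration.

```python
import json
import re
from typing import Any, Dict, List

JSON_LD_BLOCK = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
# A subset of the schema types named above.
ARTICLE_TYPES = {"NewsArticle", "Report", "BlogPosting", "ScholarlyArticle"}


def article_metadata(html: str) -> List[Dict[str, Any]]:
    """Collect headline and author name from JSON-LD blocks of article types."""
    found: List[Dict[str, Any]] = []
    for raw in JSON_LD_BLOCK.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        if not isinstance(data, dict):
            continue
        kind = data.get("@type")
        if not isinstance(kind, str) or kind not in ARTICLE_TYPES:
            continue
        author = data.get("author")
        name = author.get("name") if isinstance(author, dict) else author
        found.append({"title": data.get("headline"), "author": name})
    return found


sample = """<html><head><script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Example headline",
 "author": {"@type": "Person", "name": "Jane Doe"}}
</script></head><body></body></html>"""
print(article_metadata(sample))
```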

Metadata fields extracted include: title, author, date (via the `htmldate` library), site name, description, categories, tags, and license. The `find_date` function from `htmldate` is invoked to determine publication dates. Source: [trafilatura/metadata.py:18-20]().

## Configuration System

Trafilatura uses a comprehensive settings system defined in `settings.py`. The module defines an `Extractor` configuration class that maps CLI and Python arguments to internal processing options. The `CONFIG_MAPPING` dictionary includes keys for format, precision controls, content inclusion flags (comments, formatting, links, images, tables), deduplication, language filtering, and various size thresholds. Source: [trafilatura/settings.py:1-40]().

Key size thresholds include `min_extracted_size`, `min_output_size`, `min_output_comm_size`, `min_extracted_comm_size`, `min_duplcheck_size`, and `max_repetitions`. These control the minimum amount of text required for extraction to succeed and limit the repetition of duplicate segments. Source: [trafilatura/settings.py:30-40]().

The `MANUALLY_STRIPPED` list in `settings.py` defines HTML elements that are unconditionally removed during preprocessing (e.g., `font`, `ins`, `mark`, `small`, `template`). The `BASIC_CLEAN_XPATH` expression targets elements such as `<aside>`, `<footer>`, `<script>`, and `<style>` for removal. Source: [trafilatura/settings.py:60-80]().

## Output Formats and Serialization

The `core` module serializes the extracted `Document` object into the requested output format. The `bare_extraction` function returns the result as a Python object (a `Document` in the 2.x series, convertible with `as_dict()`), while `extract` and `extract_with_metadata` return formatted strings. The `tei_validation` parameter enables DTD-based validation of XML-TEI output against the Text Encoding Initiative standard. Source: [trafilatura/core.py:35-40]().

## Known Issues and Limitations

Several community-reported issues affect the extraction engine:

- **LXML Compatibility**: Issue #532 documents that `lxml 5.2.0` breaks imports because `lxml.html.clean` was extracted to a separate `lxml_html_clean` project. The latest 2.1.0 release addressed this with updated dependencies. Source: [HISTORY.md:1-10]().

- **Table Extraction Bugs**: Issue #777 reports incorrect table tags in HTML-formatted output, where standard `<thead>`/`<tr>`/`<th>` structures are converted to `<row>`/`<cell>` elements. Issue #794 documents that the `--links` flag breaks tables in certain Wikipedia pages. The 2.1.0 release included "fix table extraction bugs" by contributor @unsleepy22. Source: [HISTORY.md:5-10]().

- **Performance**: Release 2.1.0 introduced "Faster XPath performance using XSLT extensions" (PR #793), indicating ongoing optimization of the XPath evaluation layer used for main content identification. Source: [HISTORY.md:3-5]().

## See Also

- [Trafilatura Documentation](https://trafilatura.readthedocs.io/)
- [Core Python Functions](https://trafilatura.readthedocs.io/en/latest/corefunctions.html)
- [Configuration Options](https://trafilatura.readthedocs.io/en/latest/usage-python.html)

---

<a id='page-3'></a>

## Web Discovery, Crawling, and Downloads

### Related Pages

Related topics: [Overview, Installation, and Quickstart](#page-1), [Core Extraction Engine: Text and Metadata](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py)
- [trafilatura/downloads.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/downloads.py)
- [trafilatura/spider.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/spider.py)
- [trafilatura/sitemaps.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/sitemaps.py)
- [trafilatura/feeds.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/feeds.py)
- [trafilatura/cli.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/cli.py)
- [trafilatura/cli_utils.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/cli_utils.py)
- [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py)
</details>

# Web Discovery, Crawling, and Downloads

## Overview

Trafilatura bundles a complete pipeline for finding, fetching, and navigating web content. The pipeline is split across several modules that work together: a low-level HTTP fetcher, a high-level focused crawler, sitemap and feed parsers, and a command-line entry point. The current release, **2.1.0**, exposes these capabilities through both a Python API (re-exported from the top-level package) and a CLI surface (Source: [trafilatura/__init__.py:13-39](); Source: [HISTORY.md:1-15]()).

The web discovery subsystem is responsible for three concerns:

1. **Fetching** raw HTTP responses for a single URL or a queue of URLs.
2. **Discovering** candidate URLs through sitemaps (TXT/XML) and syndication feeds (RSS/Atom/JSON).
3. **Crawling** websites in a targeted, polite fashion using heuristics that prioritize content-bearing pages.

The extraction layer (covered elsewhere) consumes the bytes produced by this subsystem. Keeping discovery separate from extraction allows users to mix-and-match: they can pre-fetch a corpus with Trafilatura and feed the resulting HTML files to a different extractor, or they can let Trafilatura handle the whole chain end-to-end.

## Module Layout and Public API

The top-level `__init__.py` re-exports a small, stable surface: `fetch_response`, `fetch_url`, and the high-level extraction entry points. The `__all__` list is the contract callers should rely on (Source: [trafilatura/__init__.py:21-35]()).

| Symbol | Source file | Purpose |
|---|---|---|
| `fetch_url` | `trafilatura/downloads.py` | Download a URL and return decoded HTML content as a string. |
| `fetch_response` | `trafilatura/downloads.py` | Return a response object (status, headers, raw data) for advanced inspection. |
| `focused_crawler` | `trafilatura/spider.py` | Targeted crawl of a website starting from a homepage. |
| Sitemap helpers | `trafilatura/sitemaps.py` | Parse TXT and XML sitemaps, return iterators of URLs. |
| Feed helpers | `trafilatura/feeds.py` | Detect and parse RSS, Atom, and JSON feeds. |
| CLI driver | `trafilatura/cli.py` + `trafilatura/cli_utils.py` | Expose the above through `trafilatura` and `trafilatura-cli` commands. |

```mermaid
flowchart LR
    A[CLI / Python caller] --> B{Discovery source}
    B -->|Single URL| C[fetch_url / fetch_response]
    B -->|Sitemap| D[sitemaps.py]
    B -->|Feed| E[feeds.py]
    B -->|Website| F[focused_crawler]
    C --> G[Extraction layer]
    D --> G
    E --> G
    F --> C
    F --> G
```

## Single-URL Downloads

The lowest layer is the downloader. `fetch_url` returns decoded HTML suitable for direct handoff to `extract()` or `bare_extraction()`. `fetch_response` is the more capable variant: it surfaces a `urllib3`-style response object so callers can inspect headers, status codes, and raw bytes (Source: [trafilatura/__init__.py:13-17]()).

Both functions delegate to `trafilatura/downloads.py`, which wraps `urllib3` directly rather than `requests` to keep dependencies minimal and to give Trafilatura tight control over retries, encoding detection, and backoff. The trade-off is documented in `HISTORY.md`: replacing `requests` with bare `urllib3` and custom decoding shipped in **0.7.0** (Source: [HISTORY.md:128-136]()).

### Configuration

`trafilatura/settings.py` defines the keys consumed by the downloader and the crawler. Among the parameters relevant to discovery are:

- `SLEEP_TIME` — base delay between requests, honored by the crawler.
- `MAX_REDIRECTS` — added in **1.6.4** to prevent redirect loops (Source: [HISTORY.md:36-44]()).
- `max_file_size` / `min_file_size` — size bounds used to filter responses.
- `max_tree_size` — cap on parsed-tree size to avoid pathological inputs.

These keys are loaded from `DEFAULT_CONFIG` and can be overridden per call by passing a `configparser` object to the high-level functions, or by editing the user's `trafilatura.cfg` (Source: [trafilatura/settings.py:11-58]()).
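
Since the configuration is a standard `configparser` object, overriding a key follows the usual pattern of loading defaults and then replacing individual values. The sketch below uses only the standard library. The values in `BASE_SETTINGS` are illustrative rather than the shipped defaults, and `load_config` is a hypothetical helper.

```python
from configparser import ConfigParser
from typing import Any, Mapping, Optional

# Illustrative values only; the shipped defaults live in settings.cfg.
BASE_SETTINGS = """
[DEFAULT]
SLEEP_TIME = 5
MAX_REDIRECTS = 2
"""


def load_config(overrides: Optional[Mapping[str, Any]] = None) -> ConfigParser:
    """Load the base settings, then apply per-call overrides on top."""
    config = ConfigParser()
    config.read_string(BASE_SETTINGS)
    for key, value in (overrides or {}).items():
        config["DEFAULT"][key] = str(value)  # ConfigParser stores strings
    return config


config = load_config({"SLEEP_TIME": 10})
print(config.getfloat("DEFAULT", "SLEEP_TIME"), config.getint("DEFAULT", "MAX_REDIRECTS"))
```

With trafilatura installed, the same override would normally be applied to a copy of `DEFAULT_CONFIG` and the result passed through the `config` argument.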

## Focused Crawling

`focused_crawler` is the workhorse for traversing an entire website. Its signature (truncated) is:

```
focused_crawler(homepage, max_seen_urls=10, max_known_urls=100000,
                todo=None, known_links=None, lang=None,
                config=DEFAULT_CONFIG, rules=None, prune_xpath=None)
```

The function uses a `URLStore` (from the `courlan` dependency) to deduplicate URLs, tracks a navigation vs. content heuristic, and stops when either `max_seen_urls` or `max_known_urls` is reached (Source: [trafilatura/spider.py:65-95]()).

Key behaviors worth noting:

- **Politeness** is enforced through `URLStore.get_crawl_delay()`, which respects `robots.txt` rules supplied via the `rules` parameter. A `sleep_time` is read from configuration as a fallback (Source: [trafilatura/spider.py:71-75]()).
- **Boundary detection** uses URL-path matching so the crawler does not wander into unrelated subdomains. This restriction was tightened in **1.12.1** (Source: [HISTORY.md:81-89]()).
- **Optional seeding** lets callers pass a pre-built `todo` frontier or a list of `known_links`, which is useful when combining Trafilatura with an external URL discovery step (Source: [trafilatura/spider.py:54-67]()).

The crawler returns a tuple of `(todo, known_links)`, both as ordered lists, so callers can resume a crawl in a later process.
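
The stopping conditions and the resumable `(todo, known_links)` result can be illustrated with a toy breadth-first frontier. This sketch is not the library's crawler: it has no politeness delays, robots.txt handling, or navigation heuristics. The `crawl` function and the in-memory `SITE` link graph are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set, Tuple


def crawl(
    homepage: str,
    get_links: Callable[[str], Iterable[str]],
    max_seen_urls: int = 10,
    max_known_urls: int = 100000,
) -> Tuple[List[str], List[str]]:
    """Visit pages breadth-first until a limit is hit; return (todo, known_links)."""
    todo: Deque[str] = deque([homepage])
    known: List[str] = [homepage]      # ordered, so a later run can resume
    known_set: Set[str] = {homepage}   # fast duplicate check
    seen = 0
    while todo and seen < max_seen_urls and len(known) < max_known_urls:
        url = todo.popleft()
        seen += 1
        for link in get_links(url):
            if link not in known_set:
                known_set.add(link)
                known.append(link)
                todo.append(link)
    return list(todo), known


# Stand-in for real downloads: a fixed link graph.
SITE = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/c"],
}
todo, known_links = crawl("https://example.org/", lambda u: SITE.get(u, []), max_seen_urls=2)
print(todo, len(known_links))
```

The unvisited `todo` entries returned here are the kind of frontier that would be handed back through the `todo` and `known_links` parameters to continue a crawl.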

## Sitemaps and Feeds

For sites that publish a sitemap, `trafilatura/sitemaps.py` parses both the plain-text variant and the XML variant, and can also follow nested sitemap indexes. The `max_sitemaps` parameter was added in **1.12.2** to prevent runaway recursion on misconfigured sites (Source: [HISTORY.md:71-80]()).
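
The effect of a cap such as `max_sitemaps` is easiest to see in a small parser that follows sitemap indexes. The following is a standard-library sketch under simplifying assumptions (XML sitemaps only, documents supplied by a caller-provided `fetch` function). `sitemap_urls` and `DOCS` are hypothetical and do not mirror the code in `sitemaps.py`.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Set

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(start: str, fetch: Callable[[str], str], max_sitemaps: int = 10) -> List[str]:
    """Collect page URLs, reading at most max_sitemaps sitemap documents."""
    pending: List[str] = [start]
    seen: Set[str] = set()
    pages: List[str] = []
    while pending and len(seen) < max_sitemaps:
        current = pending.pop(0)
        if current in seen:
            continue  # guards against indexes that reference each other
        seen.add(current)
        root = ET.fromstring(fetch(current))
        # Entries of a <sitemapindex> point to further sitemaps.
        pending += [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS) if loc.text]
        # Entries of a <urlset> are the page URLs themselves.
        pages += [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]
    return pages


DOCS = {
    "https://example.org/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.org/news.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.org/news.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.org/a</loc></url>"
        "<url><loc>https://example.org/b</loc></url>"
        "</urlset>"
    ),
}
print(sitemap_urls("https://example.org/sitemap.xml", DOCS.__getitem__))
```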

Feed discovery in `trafilatura/feeds.py` accepts RSS, Atom, and JSON Feed formats. Robustness improvements shipped across the 1.x line: better feed detection in **1.6.4**, JSON web feed support in **1.0.0**, and a long-running series of fixes for link discovery inside feeds (Source: [HISTORY.md:46-54](); Source: [HISTORY.md:100-108]()).

A typical discovery workflow is therefore: try the homepage, look for a `<link rel="alternate">` tag pointing at a feed, parse the feed for canonical article URLs, and finally hand those URLs to `fetch_url` or `focused_crawler`.
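
The feed-detection step of that workflow can be sketched with the standard library's `html.parser`. The class `FeedLinkFinder` and the `FEED_TYPES` set are hypothetical; `feeds.py` may apply additional or different heuristics.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS, Atom, and JSON feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        kind = (attr.get("type") or "").lower()
        if rel == "alternate" and kind in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


finder = FeedLinkFinder("https://example.org/")
finder.feed(
    '<html><head><link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '<link rel="stylesheet" href="/style.css"></head></html>'
)
print(finder.feeds)
```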

## Command-Line Interface

The CLI is split between the entry point in `trafilatura/cli.py` and shared helpers in `trafilatura/cli_utils.py`. From the README and the issue tracker, the most relevant flags for discovery are:

- `--input-file` / `-i`: read URLs from a file (earlier releases spelled the long form `--inputfile`).
- `--parallel`: set the number of cores/threads used to process a download queue.
- `--crawl`: drive `focused_crawler` instead of treating each line as an isolated URL.
- `--probe`: added in **1.6.3** to test a list of URLs for extractable content without writing output (Source: [HISTORY.md:62-70]()).

The `trafilatura` and `trafilatura-cli` executables are both available; this was clarified in earlier releases and is reflected in the package's `console_scripts` entry points.

## Community Notes Relevant to This Page

- **Dependency churn and stale releases** — Issue **#846** ("Is this project dead?") and **#813** ("Release new version on pypi") both flag the gap between source-level fixes and PyPI releases. The **2.1.0** release notes mention dependency updates (notably `lxml`) and XPath performance improvements, but downstream users who pinned to the previous PyPI version will not see them until they upgrade (Source: [HISTORY.md:1-15]()).
- **`lxml` 5.2.0 break** — Issue **#532** documents `ImportError: lxml.html.clean module is now a separate project lxml_html_clean`. The downloader does not import `lxml.html.clean` directly, but any user code that combined the downloader with `lxml_html_clean` for sanitization needs the new separate package. The fix in **1.7.0** was an explicit compatibility pass for `lxml` v5+ (Source: [HISTORY.md:18-26]()).
- **Feeds and `max_sitemaps`** — When feeding large sites into the pipeline, prefer the `max_sitemaps` parameter; without it, a malformed sitemap index can keep the parser busy indefinitely (Source: [HISTORY.md:71-80]()).

## See Also

- [Core Extraction Functions](#page-2): covers `extract`, `bare_extraction`, and `extract_metadata`.
- [Settings and Configuration](#page-4): full list of config keys and their defaults.
- [Metadata Extraction](#page-2): title, author, date, sitename, and categories.
- Evaluation and Benchmarks: how the downloader/crawler are exercised in the test corpus.

External: [Trafilatura documentation](https://trafilatura.readthedocs.io/) and the [`courlan`](https://github.com/adbar/courlan) package that powers URL validation and the `URLStore` used by `focused_crawler`.

---

<a id='page-4'></a>

## Settings, Output Formats, and Known Issues

### Related Pages

Related topics: [Core Extraction Engine: Text and Metadata](#page-2), [Web Discovery, Crawling, and Downloads](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py)
- [trafilatura/settings.cfg](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.cfg)
- [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py)
- [trafilatura/xml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/xml.py)
- [trafilatura/xpaths.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/xpaths.py)
- [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py)
- [trafilatura/metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/metadata.py)
- [trafilatura/readability_lxml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/readability_lxml.py)
- [HISTORY.md](https://github.com/adbar/trafilatura/blob/main/HISTORY.md)
- [README.md](https://github.com/adbar/trafilatura/blob/main/README.md)
- [CONTRIBUTING.md](https://github.com/adbar/trafilatura/blob/main/CONTRIBUTING.md)
</details>

# Settings, Output Formats, and Known Issues

Trafilatura is a Python and command-line toolkit for web crawling, downloading, and structured text extraction. The current release, **v2.1.0**, is exported from [trafilatura/__init__.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/__init__.py) and exposes high-level functions `extract`, `bare_extraction`, `extract_with_metadata`, `baseline`, `extract_metadata`, `html2txt`, `load_html`, `fetch_url`, and `fetch_response`. This page documents how those functions are configured, what formats they emit, and the failure modes the community reports most often.

## Configuration System

### Extractor Class and Defaults

All extraction behavior is centralized in the `Extractor` class defined in [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py). It declares the canonical list of tunables — `format`, `fast`, `focus`, `comments`, `formatting`, `links`, `images`, `tables`, `dedup`, `lang`, `min_extracted_size`, `min_output_size`, `min_output_comm_size`, `min_extracted_comm_size`, `min_duplcheck_size`, `max_repetitions`, `max_file_size`, `min_file_size`, `max_tree_size`, `source`, `url`, `with_metadata`, `only_with_metadata`, `tei_validation`, `date_params`, `author_blacklist`, `url_blacklist` — and binds them through the `CONFIG_MAPPING` constant. The constructor accepts a `ConfigParser` (`DEFAULT_CONFIG`), a chosen `output_format`, and the corresponding keyword flags (`fast`, `precision`, `recall`, `comments`, `formatting`, `links`, `images`, `tables`, `dedup`, `lang`, `url`) that propagate into the instance attributes.

The helper `_add_config` projects the parser section into strongly typed attributes (`min_extracted_size: int`, `max_repetitions: int`, etc.), which guarantees that downstream consumers in [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py) never read raw strings from the config file. Source: [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py).
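
The benefit of converting config values once, up front, can be shown with a small typed container. This is a sketch of the pattern rather than the `Extractor` code: `SizeLimits` and `typed_limits` are hypothetical, and the numbers are illustrative, not the shipped defaults.

```python
from configparser import ConfigParser
from dataclasses import dataclass


@dataclass
class SizeLimits:
    """Typed view of a few thresholds named above."""

    min_extracted_size: int
    min_output_size: int
    max_repetitions: int


def typed_limits(config: ConfigParser, section: str = "DEFAULT") -> SizeLimits:
    """Convert string config values to ints once, at construction time."""
    return SizeLimits(
        min_extracted_size=config.getint(section, "MIN_EXTRACTED_SIZE"),
        min_output_size=config.getint(section, "MIN_OUTPUT_SIZE"),
        max_repetitions=config.getint(section, "MAX_REPETITIONS"),
    )


config = ConfigParser()
config.read_string(
    "[DEFAULT]\nMIN_EXTRACTED_SIZE = 250\nMIN_OUTPUT_SIZE = 1\nMAX_REPETITIONS = 2\n"
)
limits = typed_limits(config)
print(limits)
```

A non-numeric value in the file then fails immediately in `getint` rather than later inside the extraction code.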

### Settings File Override

The shipped [trafilatura/settings.cfg](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.cfg) supplies the defaults. They can be replaced wholesale via the `settingsfile` argument, or selectively via `config` and `options`. The docstring of `extract` in [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py) also documents related arguments: `prune_xpath` (str or list[str]) removes matching subtrees before extraction runs, and `author_blacklist` / `url_blacklist` (Python sets) are applied immediately after `extract_metadata` returns. Source: [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py).
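A selective override can be sketched as copying a `ConfigParser` and replacing individual entries before passing it as `config`. The helper names `with_overrides` and `extract_with_overrides` are hypothetical; the sketch assumes that `extract` accepts a `config` argument and that `DEFAULT_CONFIG` is importable from `trafilatura.settings`, as described above.

```python
from configparser import ConfigParser
from copy import deepcopy
from typing import Optional


def with_overrides(base: ConfigParser, **overrides) -> ConfigParser:
    """Return a copy of `base` with selected DEFAULT entries replaced."""
    custom = deepcopy(base)
    for key, value in overrides.items():
        # ConfigParser only stores strings, so convert every value explicitly.
        custom["DEFAULT"][key.upper()] = str(value)
    return custom


def extract_with_overrides(html: str, base: Optional[ConfigParser] = None, **overrides):
    """Run trafilatura's `extract` with a selectively modified configuration."""
    # Imported lazily so that `with_overrides` works without trafilatura installed.
    from trafilatura import extract
    from trafilatura.settings import DEFAULT_CONFIG

    source = base if base is not None else DEFAULT_CONFIG
    return extract(html, config=with_overrides(source, **overrides))


# Stand-alone demonstration with a tiny stand-in parser.
base = ConfigParser()
base.read_string("[DEFAULT]\nMIN_OUTPUT_SIZE = 1\n")
custom = with_overrides(base, min_output_size=100)
print(custom["DEFAULT"]["MIN_OUTPUT_SIZE"])
```

Copying before modifying keeps the shared default configuration untouched, so one caller's override does not leak into other extractions in the same process.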

## Output Formats

```mermaid
flowchart LR
    A[HTML Input] --> B[load_html<br/>utils.py]
    B --> C[Options<br/>Extractor instance]
    C --> D{output_format}
    D -->|txt| E[TXT]
    D -->|markdown| F[Markdown]
    D -->|csv| G[CSV]
    D -->|json| H[JSON]
    D -->|html| I[HTML]
    D -->|xml| J[XML]
    D -->|xmltei| K[XML-TEI]
    E --> L[Return str]
    F --> L
    G --> L
    H --> L
    I --> L
    J --> L
    K --> L
```

The `output_format` argument accepts seven values documented in the docstrings of `extract` and `bare_extraction`: `"txt"`, `"markdown"`, `"csv"`, `"json"`, `"html"`, `"xml"`, and `"xmltei"`. Format-specific behavior is split across modules: the generic serializers live in [trafilatura/xml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/xml.py), while `xmltei` output is additionally validated against the DTD at [trafilatura/data/tei_corpus.dtd](https://github.com/adbar/trafilatura/blob/main/trafilatura/data/tei_corpus.dtd) when the `tei_validation` flag is on. This validation is performed with `lxml`, which the package already depends on for parsing, so no extra dependency is needed for it.
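A caller-side guard for the format name can be sketched as follows. `SUPPORTED_FORMATS`, `check_format`, and `extract_in_formats` are hypothetical names introduced here; the tuple simply mirrors the seven values listed above and is not read from the library.

```python
# Mirrors the seven values listed above; not imported from trafilatura.
SUPPORTED_FORMATS = ("txt", "markdown", "csv", "json", "html", "xml", "xmltei")


def check_format(output_format: str) -> str:
    """Return the format unchanged, or raise if it is not one of the seven."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported output_format {output_format!r}; "
            f"expected one of {', '.join(SUPPORTED_FORMATS)}"
        )
    return output_format


def extract_in_formats(html: str, formats=("txt", "json")) -> dict:
    """Extract the same document once per requested format."""
    # Imported lazily so that `check_format` works without trafilatura installed.
    from trafilatura import extract

    return {fmt: extract(html, output_format=check_format(fmt)) for fmt in formats}


print(check_format("xmltei"))
```

Validating the name before calling `extract` turns a typo such as `"tei"` into an immediate, descriptive error on the caller's side.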

Formatting hints are toggled independently: `include_formatting` is "only valuable if `output_format` is set to XML" per the docstring in [trafilatura/core.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/core.py), and `include_links` / `include_images` are tagged experimental. Tables are on by default (`include_tables=True`) and can be disabled, whereas formatting, links, and images are off by default and must be enabled explicitly.
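The default state of these toggles can be captured in a small keyword builder. `formatting_kwargs` is a hypothetical helper; it assumes the `include_*` argument names quoted above and the defaults just described.

```python
def formatting_kwargs(tables: bool = True, formatting: bool = False,
                      links: bool = False, images: bool = False) -> dict:
    """Build keyword arguments for `extract`: tables on, the rest opt-in."""
    return {
        "include_tables": tables,
        "include_formatting": formatting,
        "include_links": links,
        "include_images": images,
    }


defaults = formatting_kwargs()
xml_with_links = formatting_kwargs(formatting=True, links=True)
print(defaults)
print(xml_with_links)
```

With trafilatura installed, the result could be passed on as `extract(html, output_format="xml", **xml_with_links)`.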

## Known Issues and Community Concerns

### Dependency Compatibility

The most upvoted community thread, issue #532, reports that LXML 5.2.0 removed `lxml.html.clean`, breaking imports with `ImportError: lxml.html.clean module is now a separate project lxml_html_clean`. According to [HISTORY.md](https://github.com/adbar/trafilatura/blob/main/HISTORY.md), the LXML v5+ transition was addressed in **1.7.0** ("support for LXML v5+"), and the legacy `lxml.html.Cleaner` was fully removed in **1.8.0**. The **2.1.0** release continues the cleanup with "Dependencies updated, lxml in particular (with minimal changes in the code)".

Users who combine an older Trafilatura release with lxml 5.2.0 or later should either install `lxml_html_clean` separately or upgrade to Trafilatura 1.8.0 or later, where the legacy cleaner was removed. Issue #846 further flags a CVE in `lxml` prior to v6.1.0, which is mitigated by upgrading Trafilatura to a release that bumps the dependency floor.
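Before pinning or upgrading, it can help to check which cleaner module is importable in the current environment. The function below is a hypothetical diagnostic, not part of Trafilatura; it only attempts the two imports discussed above and reports the outcome.

```python
def cleaner_import_status() -> str:
    """Report which HTML cleaner module, if any, can be imported."""
    try:
        import lxml_html_clean  # noqa: F401  (separate project since lxml 5.2.0)
        return "lxml_html_clean"
    except ImportError:
        pass
    try:
        from lxml.html import clean  # noqa: F401  (bundled in older lxml releases)
        return "lxml.html.clean"
    except ImportError:
        return "unavailable"


status = cleaner_import_status()
print(f"HTML cleaner available via: {status}")
```

A result of `"unavailable"` alongside an old Trafilatura release suggests the environment would hit the `ImportError` quoted above.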

### Extraction Edge Cases

Two open community reports describe table-related regressions:

- **#777 — Table tags incorrect in HTML formatted output.** HTML such as `<table><thead><tr><th>…</th></tr></thead></table>` is emitted as `<table><row span="4"><cell role="head">…</cell></row></table>`, suggesting that the row/span synthesis in [trafilatura/xml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/xml.py) collapses header cells into a single attribute-bag cell.
- **#794 — `--links` flag breaks tables.** When `include_links=True`, complex Wikipedia-style tables containing `<span typeof="mw:File/…">` inside `<td>` lose their cells. The version 2.1.0 changelog credits @unsleepy22 with "fix table extraction bugs", but the community threads remain open, indicating residual edge cases.

These regressions are sensitive to the `MANUALLY_STRIPPED` list in [trafilatura/settings.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/settings.py), which already excludes `tbody`, `thead`, and `tfoot` from stripping but still routes raw `<table>` content through the link-density heuristics in [trafilatura/readability_lxml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/readability_lxml.py).
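One way to check whether a given page is affected is to compare the number of `cell` elements in the XML output with and without links. The sketch below uses only the standard library for the comparison; the helper names are hypothetical, the two sample strings are hand-written stand-ins rather than real Trafilatura output, and `compare_link_modes` assumes that `extract` accepts `output_format="xml"` and `include_links` as described on this page.

```python
import xml.etree.ElementTree as ET


def count_cells(xml_output: str) -> int:
    """Count `cell` elements in an XML extraction result."""
    root = ET.fromstring(xml_output)
    return sum(1 for _ in root.iter("cell"))


def lost_cells(without_links: str, with_links: str) -> int:
    """Number of cells present without links but missing once links are kept."""
    return max(0, count_cells(without_links) - count_cells(with_links))


def compare_link_modes(html: str) -> int:
    """Extract twice with trafilatura and report how many cells disappear."""
    # Imported lazily so the helpers above work without trafilatura installed.
    from trafilatura import extract

    plain = extract(html, output_format="xml", include_links=False)
    linked = extract(html, output_format="xml", include_links=True)
    if plain is None or linked is None:
        raise ValueError("extraction returned no content")
    return lost_cells(plain, linked)


# Hand-written strings standing in for two extraction results.
BEFORE = "<doc><main><table><row><cell>a</cell><cell>b</cell></row></table></main></doc>"
AFTER = "<doc><main><table><row><cell>a</cell></row></table></main></doc>"
print(lost_cells(BEFORE, AFTER))
```

A non-zero result for a real page would be a useful minimal reproduction to attach to the corresponding issue.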

### Project Maintenance

Issue #846 ("Is this project dead?") and #813 ("Release new version on pypi") both voice concern over release cadence. The concern is only partly addressed: [HISTORY.md](https://github.com/adbar/trafilatura/blob/main/HISTORY.md) shows that version 2.1.0 shipped with "More deprecation warnings" and "More robust code", but the PyPI release visible to users in #813 still lagged at 2.0.0 at the time of writing. The [CONTRIBUTING.md](https://github.com/adbar/trafilatura/blob/main/CONTRIBUTING.md) file explicitly invites sponsorships and PRs to sustain the project.

## Recent Release Highlights

Version **2.1.0**, documented in [HISTORY.md](https://github.com/adbar/trafilatura/blob/main/HISTORY.md), brings:

- Faster XPath via XSLT extensions (#793 by @Honesty-of-the-Cavernous-Tissue).
- Patched `AttributeError` during node pruning (#761 by @PLPeeters).
- Refined `<img src>` URL handling and additional table extraction fixes by @unsleepy22.
- Tightened metadata extraction in [trafilatura/metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/metadata.py), which still delegates date parsing to `htmldate.find_date` and URL normalization to `courlan`.

For a usage-oriented overview, the [README.md](https://github.com/adbar/trafilatura/blob/main/README.md) lists the supported input sources (live URLs, sitemaps, feeds, on-disk HTML, pre-parsed trees) and the evaluation results showing Trafilatura leading ROUGE-LSum F1 benchmarks against competing extractors.

## See Also

- Core Python API and CLI usage — see `corefunctions.md` and `usage-cli.md` on the documentation site linked from [README.md](https://github.com/adbar/trafilatura/blob/main/README.md).
- Web crawling and link discovery — `crawling.md`.
- Metadata extraction internals — `metadata.md` (covers [trafilatura/metadata.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/metadata.py) and `htmldate` integration).
- Fallback extractor algorithms — `fallbacks.md` (covers [trafilatura/readability_lxml.py](https://github.com/adbar/trafilatura/blob/main/trafilatura/readability_lxml.py) and `jusText`).

---


## Pitfall Log

Project: adbar/trafilatura

Summary: Found 37 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.

## 1. Runtime risk - Runtime risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/661

## 2. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/634

## 3. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: duplicate paragraph extraction when a long sibling paragraph is present.
- User impact: Extracted text may contain duplicated paragraphs when a long sibling paragraph is present.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/817

## 4. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: duplicated lines when nested in `<article>` and `<main>`, with `<br>` in front.
- User impact: Extracted text may contain duplicated lines when content is nested in `<article>` and `<main>` with a leading `<br>`.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/768

## 5. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: `include_images` changes text extraction.
- User impact: Enabling `include_images` may change the extracted text, not only add image elements.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/194

## 6. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: some extraction duplicated in XML.
- User impact: Some extracted content may appear more than once in XML output.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/634

## 7. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/768

## 8. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/236

## 9. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/829

## 10. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/adbar/trafilatura

## 11. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/755

## 12. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/78

## 13. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/842

## 14. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/471

## 15. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/825

## 16. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/396

## 17. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/411

## 18. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this migration risk before relying on the project: Investigate spacing in element tails
- User impact: Developers may hit a documented source-backed failure mode: Investigate spacing in element tails
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/661

## 19. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this migration risk before relying on the project: trafilatura-1.12.0
- User impact: Upgrade or migration may change expected behavior: trafilatura-1.12.0
- Evidence: failure_mode_cluster:github_release | https://github.com/adbar/trafilatura/releases/tag/v1.12.0

## 20. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this migration risk before relying on the project: trafilatura-1.12.1
- User impact: Upgrade or migration may change expected behavior: trafilatura-1.12.1
- Evidence: failure_mode_cluster:github_release | https://github.com/adbar/trafilatura/releases/tag/v1.12.1

## 21. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this migration risk before relying on the project: trafilatura-2.0.0
- User impact: Upgrade or migration may change expected behavior: trafilatura-2.0.0
- Evidence: failure_mode_cluster:github_release | https://github.com/adbar/trafilatura/releases/tag/v2.0.0

## 22. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/788

## 23. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/adbar/trafilatura

## 24. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/adbar/trafilatura

## 25. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/adbar/trafilatura

## 26. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/817

## 27. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/194

## 28. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Keeping all valid table information and formatting
- User impact: Developers may hit a documented source-backed failure mode: Keeping all valid table information and formatting
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/78

## 29. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Keeping images breaks parsing
- User impact: Developers may hit a documented source-backed failure mode: Keeping images breaks parsing
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/842

## 30. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: `included_images` failed when trying to extract images in a table
- User impact: Developers may hit a documented source-backed failure mode: `included_images` failed when trying to extract images in a table
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/396

## 31. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: include_links breaks the extraction for https://news.ycombinator.com
- User impact: Developers may hit a documented source-backed failure mode: include_links breaks the extraction for https://news.ycombinator.com
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/411

## 32. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this conceptual risk before relying on the project: Backticks produce extra line breaks
- User impact: Developers may hit a documented source-backed failure mode: Backticks produce extra line breaks
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/755

## 33. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this conceptual risk before relying on the project: HTML conversion: 'NoneType' object is not subscriptable
- User impact: Developers may hit a documented source-backed failure mode: HTML conversion: 'NoneType' object is not subscriptable
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/236

## 34. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this conceptual risk before relying on the project: Missing Yoast FAQ block headers
- User impact: Developers may hit a documented source-backed failure mode: Missing Yoast FAQ block headers
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/471

## 35. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this conceptual risk before relying on the project: Text dropped in table after setting `include_formatting=True`
- User impact: Developers may hit a documented source-backed failure mode: Text dropped in table after setting `include_formatting=True`
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/829

## 36. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/adbar/trafilatura

## 37. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/adbar/trafilatura

