trafilatura Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Overview, Installation, and Quickstart

Related topics: Core Extraction Engine: Text and Metadata, Web Discovery, Crawling, and Downloads

Purpose and Scope

Trafilatura is a Python library and command-line tool for gathering text on the Web. It bundles web crawling/scraping, downloading, extraction of main text, metadata, and comments into a single package. According to the package metadata in trafilatura/__init__.py, it is released under the Apache-2.0 license (versions 1.8.0 and later) and the current stable version is 2.1.0.

The library is designed for researchers and engineers who need to build web corpora or scrape articles at scale. The README.md documents that it supports TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI output formats and that it is integrated into thousands of projects by organizations such as HuggingFace, IBM, and Microsoft Research.

The public API is intentionally narrow. As shown in trafilatura/__init__.py, the top-level exports are bare_extraction, baseline, extract, extract_metadata, extract_with_metadata, fetch_response, fetch_url, html2txt, and load_html. The three most commonly used functions for new users are extract, extract_metadata, and fetch_url.

High-Level Architecture

Trafilatura separates concerns into three layers: downloading, parsing/loading HTML, and extraction. The data flow is straightforward and is the same whether the entry point is the CLI or Python:

flowchart LR
    A[URL or HTML string] --> B[fetch_url / load_html]
    B --> C[lxml HtmlElement tree]
    C --> D[Pruning + XPath selection]
    D --> E{bare_extraction}
    E --> F[Document metadata]
    E --> G[Main text body]
    E --> H[Comments]
    G --> I[Output formatter]
    I --> J[txt / markdown / json / xml / xmltei / csv / html]

The extract() and bare_extraction() functions in trafilatura/core.py accept both raw HTML and pre-loaded trees, which lets the caller skip the download step when working with cached files. When with_metadata=True, metadata is computed first by extract_metadata() in trafilatura/metadata.py, including JSON-LD parsing from trafilatura/json_metadata.py. A readability-style fallback is provided through trafilatura/readability_lxml.py, and the baseline()/html2txt() functions offer a simpler alternative extractor.

Installation

The package is published on PyPI. According to the README.md, a minimal install is:

pip install trafilatura

Optional add-ons documented in the README include language detection and faster downloads:

pip install trafilatura[all]

Dependency Notes (community-reported issues)

The 2.1.0 release notes call out an lxml update, and the community context for issue #532 explains why it matters. Older releases imported lxml.html.clean, which lxml 5.2.0 moved into the separate lxml_html_clean project, so an old Trafilatura combined with lxml 5.2.0 or later fails at import time. Issue #846 additionally points to a CVE in lxml versions before 6.1.0, which makes the upgrade a security matter as well. Upgrading both packages resolves the import error:

pip install --upgrade lxml trafilatura
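
Before upgrading, it can help to see which versions are actually installed. The following is a minimal sketch using only the standard library; the helper names (parse_version, installed_version, lxml_has_separate_cleaner) are hypothetical, and the 5.2.0 threshold is the one quoted in the note above.

```python
from importlib import metadata


def parse_version(text):
    """Turn '5.2.0' into (5, 2, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for chunk in text.split(".")[:3]:
        digits = ""
        for char in chunk:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad '5.2' to (5, 2, 0) so comparisons behave
        parts.append(0)
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def lxml_has_separate_cleaner(lxml_version):
    """True for lxml 5.2.0 and later, where lxml.html.clean is a separate project."""
    return parse_version(lxml_version) >= (5, 2, 0)


for name in ("lxml", "trafilatura"):
    print(name, installed_version(name) or "not installed")
```

If lxml reports 5.2.0 or later while Trafilatura is an older release, the import error described above is the likely outcome.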

Quickstart

Python: Extract Main Text from a URL

The shortest path from URL to plain text uses fetch_url followed by extract, both exported from trafilatura/__init__.py:

import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    text = trafilatura.extract(downloaded)
    print(text)

fetch_url returns the HTML body as a string or None on failure. extract returns a string in the requested output_format or None if extraction is not possible. The supported formats are listed in the docstring of _internal_extraction in trafilatura/core.py: txt, markdown, csv, json, html, xml, and xmltei.

Python: Extract Metadata

For structured access to title, author, date, sitename, and categories, use extract_metadata:

import trafilatura

html = trafilatura.fetch_url("https://example.org/article")
# fetch_url returns None on failure, so guard before parsing
tree = trafilatura.load_html(html) if html is not None else None
if tree is not None:
    metadata = trafilatura.extract_metadata(tree)
    print(metadata.title, metadata.author, metadata.date, metadata.url)

The Document dataclass returned by extract_metadata is defined in trafilatura/metadata.py, which combines XPath selectors from xpaths.py with JSON-LD parsing routines in trafilatura/json_metadata.py.

Python: Full Bundle with `bare_extraction`

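When you need both content and metadata in one call, use bare_extraction. In the 2.x line it returns a Document object rather than a plain dict; call its as_dict() method when a dictionary is needed:

import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    result = trafilatura.bare_extraction(downloaded, with_metadata=True)
    # result is None if extraction fails; otherwise it carries text, comments, title, author, ...
    if result is not None:
        print(result.as_dict().keys())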

The with_metadata and only_with_metadata flags are documented in trafilatura/core.py. The latter short-circuits if essential metadata (date, title, URL) is missing.
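
The short-circuit can be pictured as a simple completeness check. This is a minimal sketch on plain dictionaries, not the library's code; has_essential_metadata and ESSENTIAL_FIELDS are hypothetical names, and the field list is the one given above.

```python
ESSENTIAL_FIELDS = ("date", "title", "url")  # the fields named in the paragraph above


def has_essential_metadata(record):
    """Return True only if every essential field is present and non-empty."""
    return all(record.get(field) for field in ESSENTIAL_FIELDS)


complete = {"title": "Example", "date": "2024-01-01", "url": "https://example.org/a"}
undated = {"title": "Example", "date": None, "url": "https://example.org/a"}

print(has_essential_metadata(complete))  # True
print(has_essential_metadata(undated))   # False
```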

Common Configuration Knobs

The extract function accepts a wide range of toggles. The most useful ones, defined in trafilatura/settings.py, are listed below.

Option	Purpose	Notes
`output_format`	Output serialization	One of `txt`, `markdown`, `csv`, `json`, `html`, `xml`, `xmltei`
`fast`	Skip fallback extractors	Trades recall for speed
`favor_precision` / `favor_recall`	Bias extraction	Mutually exclusive presets
`include_comments`	Extract comment sections	Default `True`
`include_tables`	Keep `<table>` content	See community issue #794 below
`include_links`	Preserve hyperlinks	See community issue #794 below
`include_images`	Preserve image references	Experimental
`target_language`	Filter by ISO 639-1 code	Requires `py3langid`
`deduplicate`	Drop repeated segments
`with_metadata` / `only_with_metadata`	Metadata handling
`url_blacklist` / `author_blacklist`	Filtering	Sets of strings
`date_extraction_params`	Pass-through to `htmldate`	Dict

Command-Line Quickstart

The package also installs a trafilatura CLI. The README.md links to usage-cli.html for the full reference. The most common invocations are:

# Download and extract to plain text
trafilatura -u "https://example.org/article"

# Convert an existing HTML file to Markdown (the page is read from standard input)
trafilatura --output-format markdown < page.html

# Keep links, output JSON with metadata (tables are kept by default)
trafilatura -u "https://example.org/article" --json --with-metadata --links

Known Limitations and Community-Reported Issues

Three groups of issues from the community context are worth flagging for new users:

Issue #532: lxml 5.2.0 breaks import. Older Trafilatura releases import lxml.html.clean, which lxml 5.2.0 moved into the separate lxml_html_clean project. Affected users should upgrade Trafilatura to a release with LXML v5+ support, install lxml_html_clean separately, or pin lxml below 5.2.0. Source: community context.
Issues #777 and #794: table output. #777 reports incorrect table tags in HTML-formatted output. #794 reports that combining include_links=True (or the --links CLI flag) with table extraction loses some <table> content. Source: community context and trafilatura/settings.py (which exposes both flags).
Issue #846: release cadence and CVE exposure. Because releases are infrequent, downstream users must monitor dependency CVEs (notably in lxml) themselves. The 2.1.0 release addresses the immediate lxml concern. Source: community context.

Core Extraction Engine: Text and Metadata

Related topics: Overview, Installation, and Quickstart, Settings, Output Formats, and Known Issues

Overview

Trafilatura's core extraction engine is the heart of the library, responsible for transforming raw HTML into clean, structured main text and metadata. The package exposes this functionality primarily through three functions in the core module: extract, bare_extraction, and extract_with_metadata, all of which are re-exported in the top-level trafilatura namespace for convenience. Source: trafilatura/__init__.py:18-22.

The engine is designed to balance precision and recall while supporting multiple output formats (TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI). According to the README, Trafilatura "consistently outperforms other open-source libraries in text extraction" as measured by ROUGE-LSum Mean F1 Page Scores in independent benchmarks. Source: README.md:142-145.

Architecture and Pipeline

The extraction pipeline can be conceptualized as a series of stages that progressively refine HTML input into clean output. The core module orchestrates this through an internal _internal_extraction helper, which is invoked by the public functions. Source: trafilatura/core.py:55-100.

flowchart TD
    A[HTML Input] --> B[Load & Parse HTML<br/>load_html]
    B --> C[Prune Unwanted Nodes<br/>prune_unwanted_nodes]
    C --> D[HTMLProcessing<br/>link conversion, cleaning]
    D --> E[Main Text Extraction<br/>XPath-based + fallbacks]
    E --> F[Metadata Extraction<br/>metadata.py]
    F --> G[Format Output<br/>txt, json, xml, xmltei, csv]
    G --> H[Document Object]

    style E fill:#f9f,stroke:#333
    style F fill:#bbf,stroke:#333

The pipeline begins by loading the HTML, either from raw content or a URL, and then applies a series of transformations: pruning unwanted nodes, processing links and formatting, identifying the main content region via XPath expressions, and finally extracting both the body text and structured metadata. The main text extraction leverages compiled XPath expressions defined in xpaths.py, which target elements such as <article>, <main>, and common class names like "content" or "article-body". Source: trafilatura/xpaths.py:1-30.

Text Extraction

The main extraction logic is centralized in the extract function, which accepts extensive configuration parameters documented in its docstring. Key parameters include output_format (supporting "txt", "markdown", "csv", "json", "html", "xml", and "xmltei"), include_tables, include_images, include_links, include_formatting, and include_comments. Source: trafilatura/core.py:10-50.

Precision vs. Recall Trade-offs

The engine supports three operating modes that tune the extraction behavior:

Default: Balanced precision and recall.
favor_precision=True: "prefer less text but correct extraction."
favor_recall=True: "when unsure, prefer more text."

The fast=True option uses "faster heuristics and skip backup extraction," trading thoroughness for speed. Source: trafilatura/core.py:18-22.
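
The three modes can be pictured as one setting derived from two boolean flags. The sketch below is illustrative only: resolve_focus and the returned labels are hypothetical names, and rejecting the combination of both flags is a choice made for this example rather than documented library behavior.

```python
def resolve_focus(favor_precision=False, favor_recall=False):
    """Collapse the two preset flags into a single mode label."""
    if favor_precision and favor_recall:
        # The presets pull in opposite directions, so this sketch refuses both.
        raise ValueError("favor_precision and favor_recall are mutually exclusive")
    if favor_precision:
        return "precision"
    if favor_recall:
        return "recall"
    return "balanced"


print(resolve_focus())                      # balanced
print(resolve_focus(favor_precision=True))  # precision
print(resolve_focus(favor_recall=True))     # recall
```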

Fallback Strategies

When the primary XPath-based extraction fails or yields insufficient content, Trafilatura falls back to alternative algorithms. The readability_lxml.py module implements a port of the Readability algorithm, which scores candidate elements to identify the main content. Source: trafilatura/readability_lxml.py:1-40. The baseline module provides additional fallback extraction using simpler heuristics. The engine may also invoke jusText as a tertiary fallback, depending on configuration.
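
The fallback idea, trying extractors in order until one returns enough text, can be sketched independently of the library. Everything here is hypothetical: the stand-in extractors, the first_sufficient helper, and the 250-character threshold are placeholders, not the engine's actual logic.

```python
def first_sufficient(html, extractors, min_length=250):
    """Run extractors in order and return (name, text) for the first adequate result."""
    for name, extractor in extractors:
        text = extractor(html)
        if text and len(text) >= min_length:
            return name, text
    return None, None


# Stand-in extractors, for illustration only.
def primary(html):
    return ""  # pretend the XPath-based pass found nothing


def readability_like(html):
    return "short"  # found something, but too little


def baseline_like(html):
    return "x" * 300  # crude but long enough


chain = [
    ("primary", primary),
    ("readability", readability_like),
    ("baseline", baseline_like),
]
name, text = first_sufficient("<html></html>", chain)
print(name, len(text))  # baseline 300
```

Ordering the chain from most to least precise means the cruder extractors only run when the better ones come up short.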

Metadata Extraction

Metadata extraction is handled by the metadata module, which exports a Document class and an extract_metadata function. The module scrapes metadata from multiple sources, including HTML <meta> tags, JSON-LD structured data, microdata, and visible page elements. Source: trafilatura/metadata.py:1-30.

The json_metadata.py module contains specialized parsers for JSON-LD and Schema.org markup. It recognizes numerous schema types including NewsArticle, Report, BlogPosting, ScholarlyArticle, and others. The module uses compiled regular expressions to extract author names, publisher information, article sections, and content types from JSON-LD blocks. Source: trafilatura/json_metadata.py:1-25.
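
To make the JSON-LD step concrete, here is a minimal sketch that pulls a headline and author out of an embedded block. It is not the module's implementation: it uses one regular expression plus the standard json module, handles only a single top-level object with a string @type, and the function name jsonld_article_fields is hypothetical.

```python
import json
import re

JSON_LD_BLOCK = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
# Schema types named in the paragraph above
ARTICLE_TYPES = {"NewsArticle", "Report", "BlogPosting", "ScholarlyArticle"}


def jsonld_article_fields(html):
    """Return headline and author name from the first article-like JSON-LD block."""
    for match in JSON_LD_BLOCK.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed blocks are skipped
        if not isinstance(data, dict):
            continue
        kind = data.get("@type")
        if not isinstance(kind, str) or kind not in ARTICLE_TYPES:
            continue
        author = data.get("author")
        if isinstance(author, dict):
            author = author.get("name")
        return {"title": data.get("headline"), "author": author}
    return {}


SAMPLE = '''<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Hello", "author": {"@type": "Person", "name": "A. Writer"}}
</script></head><body></body></html>'''

print(jsonld_article_fields(SAMPLE))  # {'title': 'Hello', 'author': 'A. Writer'}
```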

Metadata fields extracted include: title, author, date (via the htmldate library), site name, description, categories, tags, and license. The find_date function from htmldate is invoked to determine publication dates. Source: trafilatura/metadata.py:18-20.

Configuration System

Trafilatura's settings system is defined in settings.py. The module provides an Extractor configuration class that maps CLI and Python arguments to internal processing options. The CONFIG_MAPPING dictionary includes keys for format, precision controls, content inclusion flags (comments, formatting, links, images, tables), deduplication, language filtering, and various size thresholds. Source: trafilatura/settings.py:1-40.

Key size thresholds include min_extracted_size, min_output_size, min_output_comm_size, min_extracted_comm_size, min_duplcheck_size, and max_repetitions. These control the minimum amount of text required for extraction to succeed and limit the repetition of duplicate segments. Source: trafilatura/settings.py:30-40.

The MANUALLY_STRIPPED list in settings.py defines HTML elements that are unconditionally removed during preprocessing (e.g., font, ins, mark, small, template). The BASIC_CLEAN_XPATH expression targets elements such as <aside>, <footer>, <script>, and <style> for removal. Source: trafilatura/settings.py:60-80.

Output Formats and Serialization

The core module serializes the extracted Document object into the requested output format. In the 2.x line, bare_extraction returns the Document itself (its as_dict() method yields a dictionary), while extract and extract_with_metadata return formatted strings. The tei_validation parameter enables DTD-based validation of XML-TEI output against the Text Encoding Initiative standard. Source: trafilatura/core.py:35-40.

Known Issues and Limitations

Several community-reported issues affect the extraction engine:

LXML Compatibility: Issue #532 documents that lxml 5.2.0 breaks imports because lxml.html.clean was extracted to a separate lxml_html_clean project. The latest 2.1.0 release addressed this with updated dependencies. Source: HISTORY.md:1-10.

Table Extraction Bugs: Issue #777 reports incorrect table tags in HTML-formatted output, where standard <thead>/<tr>/<th> structures are converted to <row>/<cell> elements. Issue #794 documents that the --links flag breaks tables in certain Wikipedia pages. The 2.1.0 release included "fix table extraction bugs" by contributor @unsleepy22. Source: HISTORY.md:5-10.

Performance: Release 2.1.0 introduced "Faster XPath performance using XSLT extensions" (PR #793), indicating ongoing optimization of the XPath evaluation layer used for main content identification. Source: HISTORY.md:3-5.

Web Discovery, Crawling, and Downloads

Related topics: Overview, Installation, and Quickstart, Core Extraction Engine: Text and Metadata

Overview

Trafilatura bundles a complete pipeline for finding, fetching, and navigating web content. The pipeline is split across several modules that work together: a low-level HTTP fetcher, a high-level focused crawler, sitemap and feed parsers, and a command-line entry point. The current release, 2.1.0, exposes these capabilities through both a Python API (re-exported from the top-level package) and a CLI surface (Source: trafilatura/__init__.py:13-39; Source: HISTORY.md:1-15).

The web discovery subsystem is responsible for three concerns:

Fetching raw HTTP responses for a single URL or a queue of URLs.
Discovering candidate URLs through sitemaps (TXT/XML) and syndication feeds (RSS/Atom/JSON).
Crawling websites in a targeted, polite fashion using heuristics that prioritize content-bearing pages.

The extraction layer (covered elsewhere) consumes the bytes produced by this subsystem. Keeping discovery separate from extraction allows users to mix-and-match: they can pre-fetch a corpus with Trafilatura and feed the resulting HTML files to a different extractor, or they can let Trafilatura handle the whole chain end-to-end.

Module Layout and Public API

The top-level __init__.py re-exports a small, stable surface: fetch_response, fetch_url, and the high-level extraction entry points. The __all__ list is the contract callers should rely on (Source: trafilatura/__init__.py:21-35).

Symbol	Source file	Purpose
`fetch_url`	`trafilatura/downloads.py`	Download a URL and return decoded HTML content as a string.
`fetch_response`	`trafilatura/downloads.py`	Return a response object exposing status, headers, and raw data for advanced inspection.
`focused_crawler`	`trafilatura/spider.py`	Targeted crawl of a website starting from a homepage.
Sitemap helpers	`trafilatura/sitemaps.py`	Parse TXT and XML sitemaps, return iterators of URLs.
Feed helpers	`trafilatura/feeds.py`	Detect and parse RSS, Atom, and JSON feeds.
CLI driver	`trafilatura/cli.py` + `trafilatura/cli_utils.py`	Expose the above through `trafilatura` and `trafilatura-cli` commands.

flowchart LR
    A[CLI / Python caller] --> B{Discovery source}
    B -->|Single URL| C[fetch_url / fetch_response]
    B -->|Sitemap| D[sitemaps.py]
    B -->|Feed| E[feeds.py]
    B -->|Website| F[focused_crawler]
    C --> G[Extraction layer]
    D --> G
    E --> G
    F --> C
    F --> G

Single-URL Downloads

The lowest layer is the downloader. fetch_url returns decoded HTML suitable for direct handoff to extract() or bare_extraction(). fetch_response is the more capable variant: it surfaces a urllib3-style response object so callers can inspect headers, status codes, and raw bytes (Source: trafilatura/__init__.py:13-17).

Both functions delegate to trafilatura/downloads.py, which wraps urllib3 directly rather than requests to keep dependencies minimal and to give Trafilatura tight control over retries, encoding detection, and backoff. The trade-off is documented in HISTORY.md: replacing requests with bare urllib3 and custom decoding shipped in 0.7.0 (Source: HISTORY.md:128-136).

Configuration

trafilatura/settings.py defines the keys consumed by the downloader and the crawler. Among the parameters relevant to discovery are:

SLEEP_TIME — base delay between requests, honored by the crawler.
MAX_REDIRECTS — added in 1.6.4 to prevent redirect loops (Source: HISTORY.md:36-44).
max_file_size / min_file_size — size bounds used to filter responses.
max_tree_size — cap on parsed-tree size to avoid pathological inputs.

These keys are loaded from DEFAULT_CONFIG and can be overridden per call by passing a configparser object to the high-level functions, or by supplying a customized copy of the settings file (Source: trafilatura/settings.py:11-58).
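
The override mechanism rests on the standard configparser module. The sketch below shows the pattern with a hand-written defaults block; the values are placeholders rather than the shipped defaults, and build_config is a hypothetical helper.

```python
from configparser import ConfigParser

# Placeholder defaults for illustration; the real values live in the packaged settings file.
DEFAULTS = """
[DEFAULT]
SLEEP_TIME = 5
MAX_REDIRECTS = 2
MAX_FILE_SIZE = 20000000
MIN_FILE_SIZE = 10
"""


def build_config(overrides=None):
    """Return a ConfigParser holding the defaults plus any per-call overrides."""
    config = ConfigParser()
    config.read_string(DEFAULTS)
    for key, value in (overrides or {}).items():
        config["DEFAULT"][key] = str(value)  # configparser stores strings only
    return config


slow = build_config({"SLEEP_TIME": 15})
print(slow.getint("DEFAULT", "SLEEP_TIME"))     # 15
print(slow.getint("DEFAULT", "MAX_REDIRECTS"))  # 2
```

An object built this way is the kind of value that a config argument, such as the one in the focused_crawler signature, is meant to receive.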

Focused Crawling

focused_crawler is the workhorse for traversing an entire website. Its signature (truncated) is:

focused_crawler(homepage, max_seen_urls=10, max_known_urls=100000,
                todo=None, known_links=None, lang=None,
                config=DEFAULT_CONFIG, rules=None, prune_xpath=None)

The function uses a URLStore (from the courlan dependency) to deduplicate URLs, tracks a navigation vs. content heuristic, and stops when either max_seen_urls or max_known_urls is reached (Source: trafilatura/spider.py:65-95).
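
The stopping rule and the deduplication can be illustrated with a small breadth-first frontier. This is a sketch under stated assumptions, not the crawler's code: an in-memory link graph replaces real downloads, a dict replaces the URLStore, and crawl_frontier is a hypothetical name. The limits are checked between pages.

```python
from collections import deque


def crawl_frontier(start, get_links, max_seen_urls=10, max_known_urls=100000):
    """Breadth-first traversal that stops at either limit.

    Returns (todo, known_links): URLs still queued and every URL discovered.
    """
    todo = deque([start])
    known = {start: None}  # a dict keeps insertion order and deduplicates
    seen = 0
    while todo and seen < max_seen_urls and len(known) < max_known_urls:
        url = todo.popleft()
        seen += 1
        for link in get_links(url):
            if link not in known:
                known[link] = None
                todo.append(link)
    return list(todo), list(known)


# A tiny in-memory site replaces real downloads (hypothetical data).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}
todo, known_links = crawl_frontier("/", lambda url: SITE.get(url, []), max_seen_urls=2)
print(todo)         # ['/b', '/c']
print(known_links)  # ['/', '/a', '/b', '/c']
```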

Key behaviors worth noting:

Politeness is enforced through URLStore.get_crawl_delay(), which respects robots.txt rules supplied via the rules parameter. A sleep_time is read from configuration as a fallback (Source: trafilatura/spider.py:71-75).
Boundary detection uses URL-path matching so the crawler does not wander into unrelated subdomains. This restriction was tightened in 1.12.1 (Source: HISTORY.md:81-89).
Optional seeding lets callers pass a pre-built todo frontier or a list of known_links, which is useful when combining Trafilatura with an external URL discovery step (Source: trafilatura/spider.py:54-67).
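
The politeness behavior can be explored with the standard library's robots.txt parser. The sketch assumes the rules object behaves like urllib.robotparser.RobotFileParser; the robots.txt content, the polite_delay helper, and the 5-second fallback are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines; no network access


def polite_delay(rules, user_agent="*", fallback=5.0):
    """Use the site's Crawl-delay if present, else a configured fallback."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else fallback


print(rules.can_fetch("*", "https://example.org/private/page"))  # False
print(rules.can_fetch("*", "https://example.org/news/page"))     # True
print(polite_delay(rules))                                       # 3.0
```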

The crawler returns a tuple of (todo, known_links), both as ordered lists, so callers can resume a crawl in a later process.

Sitemaps and Feeds

For sites that publish a sitemap, trafilatura/sitemaps.py parses both the plain-text variant and the XML variant, and can also follow nested sitemap indexes. The max_sitemaps parameter was added in 1.12.2 to prevent runaway recursion on misconfigured sites (Source: HISTORY.md:71-80).
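
Why a cap on nested sitemaps matters is easiest to see in a small walker. The sketch below uses xml.etree.ElementTree on in-memory documents; collect_urls, its fetch callback, and the default cap of 100 are hypothetical and do not reproduce the module's parser.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(root_url, fetch, max_sitemaps=100):
    """Walk a sitemap or sitemap index and return page URLs.

    `fetch` maps a sitemap URL to its XML text; at most `max_sitemaps`
    documents are parsed, which bounds the work on looping indexes.
    """
    pending, visited, pages = [root_url], set(), []
    while pending and len(visited) < max_sitemaps:
        current = pending.pop(0)
        if current in visited:
            continue
        visited.add(current)
        root = ET.fromstring(fetch(current))
        # <sitemapindex> entries point at further sitemaps
        pending.extend(loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS))
        # <urlset> entries point at pages
        pages.extend(loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS))
    return pages


DOCS = {
    "https://example.org/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.org/news.xml</loc></sitemap>"
        "<sitemap><loc>https://example.org/sitemap.xml</loc></sitemap>"  # self-reference
        "</sitemapindex>"
    ),
    "https://example.org/news.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.org/a</loc></url>"
        "<url><loc>https://example.org/b</loc></url>"
        "</urlset>"
    ),
}
print(collect_urls("https://example.org/sitemap.xml", DOCS.__getitem__))
# ['https://example.org/a', 'https://example.org/b']
```

The visited set handles simple self-references; the cap covers indexes that keep producing new sitemap URLs.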

Feed discovery in trafilatura/feeds.py accepts RSS, Atom, and JSON Feed formats. Robustness improvements shipped across the 1.x line: better feed detection in 1.6.4, JSON web feed support in 1.0.0, and a long-running series of fixes for link discovery inside feeds (Source: HISTORY.md:46-54; Source: HISTORY.md:100-108).

A typical discovery workflow is therefore: try the homepage, look for a <link rel="alternate"> tag pointing at a feed, parse the feed for canonical article URLs, and finally hand those URLs to fetch_url or focused_crawler.
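
The first step of that workflow, spotting a feed link on the homepage, can be sketched with the standard HTML parser. FeedLinkFinder and find_feeds are hypothetical names, the set of MIME types is an assumption for this example, and the library's own detection is more thorough.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS, Atom, and JSON feeds (assumed list)
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        kind = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel and kind in FEED_TYPES and href:
            # Resolve relative links against the page URL
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


HOMEPAGE = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body></body></html>"""

print(find_feeds(HOMEPAGE, "https://example.org/"))  # ['https://example.org/feed.xml']
```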

Command-Line Interface

The CLI is split between the entry point in trafilatura/cli.py and shared helpers in trafilatura/cli_utils.py. From the README and the issue tracker, the most relevant flags for discovery are:

--input-file / -i: read URLs from a file (spelled --inputfile in older releases).
--parallel: set the number of cores or threads used to process a download queue.
--crawl: drive focused_crawler instead of treating each line as an isolated URL.
--probe: added in 1.6.3 to test a list of URLs for extractable content without writing output (Source: HISTORY.md:62-70).

The trafilatura and trafilatura-cli executables are both available; this was clarified in earlier releases and is reflected in the package's console_scripts entry points.

Community Notes Relevant to This Page

Dependency churn and stale releases — Issue #846 ("Is this project dead?") and #813 ("Release new version on pypi") both flag the gap between source-level fixes and PyPI releases. The 2.1.0 release notes mention dependency updates (notably lxml) and XPath performance improvements, but downstream users who pinned to the previous PyPI version will not see them until they upgrade (Source: HISTORY.md:1-15).
lxml 5.2.0 break — Issue #532 documents ImportError: lxml.html.clean module is now a separate project lxml_html_clean. The downloader does not import lxml.html.clean directly, but any user code that combined the downloader with lxml_html_clean for sanitization needs the new separate package. The fix in 1.7.0 was an explicit compatibility pass for lxml v5+ (Source: HISTORY.md:18-26).
Feeds and max_sitemaps — When feeding large sites into the pipeline, prefer the max_sitemaps parameter; without it, a malformed sitemap index can keep the parser busy indefinitely (Source: HISTORY.md:71-80).

Settings, Output Formats, and Known Issues

Related topics: Core Extraction Engine: Text and Metadata, Web Discovery, Crawling, and Downloads

Trafilatura is a Python and command-line toolkit for web crawling, downloading, and structured text extraction. The current release, v2.1.0, is exported from trafilatura/__init__.py and exposes high-level functions extract, bare_extraction, extract_with_metadata, baseline, extract_metadata, html2txt, load_html, fetch_url, and fetch_response. This page documents how those functions are configured, what formats they emit, and the failure modes the community reports most often.

Configuration System

Extractor Class and Defaults

All extraction behavior is centralized in the Extractor class defined in trafilatura/settings.py. It declares the canonical list of tunables — format, fast, focus, comments, formatting, links, images, tables, dedup, lang, min_extracted_size, min_output_size, min_output_comm_size, min_extracted_comm_size, min_duplcheck_size, max_repetitions, max_file_size, min_file_size, max_tree_size, source, url, with_metadata, only_with_metadata, tei_validation, date_params, author_blacklist, url_blacklist — and binds them through the CONFIG_MAPPING constant. The constructor accepts a ConfigParser (DEFAULT_CONFIG), a chosen output_format, and the corresponding keyword flags (fast, precision, recall, comments, formatting, links, images, tables, dedup, lang, url) that propagate into the instance attributes.

The helper _add_config projects the parser section into strongly typed attributes (min_extracted_size: int, max_repetitions: int, etc.), which guarantees that downstream consumers in trafilatura/core.py never read raw strings from the config file. Source: trafilatura/settings.py.
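
The benefit of converting options once, at load time, can be shown with a small dataclass. This is a sketch of the pattern only: SizeLimits and limits_from_config are hypothetical names, the values are placeholders, and only three of the thresholds are included.

```python
from configparser import ConfigParser
from dataclasses import dataclass


@dataclass
class SizeLimits:
    min_extracted_size: int
    min_output_size: int
    max_repetitions: int


def limits_from_config(config, section="DEFAULT"):
    """Convert string options into integers once, at load time."""
    return SizeLimits(
        min_extracted_size=config.getint(section, "MIN_EXTRACTED_SIZE"),
        min_output_size=config.getint(section, "MIN_OUTPUT_SIZE"),
        max_repetitions=config.getint(section, "MAX_REPETITIONS"),
    )


config = ConfigParser()
# Placeholder values for illustration only.
config.read_string(
    "[DEFAULT]\nMIN_EXTRACTED_SIZE = 250\nMIN_OUTPUT_SIZE = 1\nMAX_REPETITIONS = 2\n"
)
limits = limits_from_config(config)
print(limits.min_extracted_size + 1)  # 251: integer arithmetic, not string concatenation
```

A malformed value then fails with a ValueError when the configuration is loaded, not later in the middle of an extraction.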

Settings File Override

A shipped trafilatura/settings.cfg supplies sane defaults that can be replaced wholesale via the settingsfile argument, or selectively via config and options. The docstring of extract in trafilatura/core.py makes the precedence explicit: prune_xpath (str or list[str]) lets callers wipe subtrees before extraction runs, and author_blacklist / url_blacklist (Python sets) are applied immediately after extract_metadata returns. Source: trafilatura/core.py.

Output Formats

flowchart LR
    A[HTML Input] --> B[load_html<br/>utils.py]
    B --> C[Options<br/>Extractor instance]
    C --> D{output_format}
    D -->|txt| E[TXT]
    D -->|markdown| F[Markdown]
    D -->|csv| G[CSV]
    D -->|json| H[JSON]
    D -->|html| I[HTML]
    D -->|xml| J[XML]
    D -->|xmltei| K[XML-TEI]
    E --> L[Return str]
    F --> L
    G --> L
    H --> L
    I --> L
    J --> L
    K --> L

The output_format argument enumerates seven legal values documented in the docstrings of extract and bare_extraction: "txt", "markdown", "csv", "json", "html", "xml", and "xmltei". Format-specific behavior is split across modules: the generic serializers live in trafilatura/xml.py, while xmltei additionally validates against the DTD at trafilatura/data/tei_corpus.dtd when the tei_validation flag is on. That validation is performed with lxml, which is already a core dependency, so it adds a processing step but no extra package.

Formatting hints are toggled independently: include_formatting is "only valuable if output_format is set to XML" per the docstring in trafilatura/core.py, and include_links / include_images are tagged experimental. Tables are on by default (include_tables=True) and can be disabled; links, images, and formatting are off by default and must be enabled explicitly.

Known Issues and Community Concerns

Dependency Compatibility

The most upvoted community thread, issue #532, reports that LXML 5.2.0 removed lxml.html.clean, breaking imports with ImportError: lxml.html.clean module is now a separate project lxml_html_clean. According to HISTORY.md, the LXML v5+ transition was addressed in 1.7.0 ("support for LXML v5+"), and the legacy lxml.html.Cleaner was fully removed in 1.8.0. The 2.1.0 release continues the cleanup with "Dependencies updated, lxml in particular (with minimal changes in the code)".

Users on older LXML pins should install lxml_html_clean separately or upgrade Trafilatura past 1.7.0. Issue #846 further flags a CVE in lxml prior to v6.1.0, which is mitigated by upgrading Trafilatura to a release that bumps the dependency floor.

Extraction Edge Cases

Two open community reports describe table-related regressions:

#777 — Table tags incorrect in HTML formatted output. HTML such as <table><thead><tr><th>…</th></tr></thead></table> is emitted as <table><row span="4"><cell role="head">…</cell></row></table>, suggesting that the row/span synthesis in trafilatura/xml.py collapses header cells into a single attribute-bag cell.
#794 — --links flag breaks tables. When include_links=True, complex Wikipedia-style tables containing <span typeof="mw:File/…"> inside <td> lose their cells. The version 2.1.0 changelog credits @unsleepy22 with "fix table extraction bugs", but the community threads remain open, indicating residual edge cases.

These regressions are sensitive to the MANUALLY_STRIPPED list in trafilatura/settings.py, which already excludes tbody, thead, and tfoot from stripping but still routes raw <table> content through the link-density heuristics in trafilatura/readability_lxml.py.

Project Maintenance

Issue #846 ("Is this project dead?") and #813 ("Release new version on pypi") both voice concern over release cadence. The fix is partial: HISTORY.md shows version 2.1.0 shipped with "More deprecation warnings" and "More robust code", but the PyPI release log visible to users in #813 still lagged at 2.0.0 at the time of writing. The CONTRIBUTING.md file explicitly invites sponsorships and PRs to sustain the project.

Recent Release Highlights

Version 2.1.0, documented in HISTORY.md, brings:

Faster XPath via XSLT extensions (#793 by @Honesty-of-the-Cavernous-Tissue).
Patched AttributeError during node pruning (#761 by @PLPeeters).
Refined <img src> URL handling and additional table extraction fixes by @unsleepy22.
Tightened metadata extraction in trafilatura/metadata.py, which still delegates date parsing to htmldate.find_date and URL normalization to courlan.

For a usage-oriented overview, the README.md lists the supported input sources (live URLs, sitemaps, feeds, on-disk HTML, pre-parsed trees) and the evaluation results showing Trafilatura leading ROUGE-LSum F1 benchmarks against competing extractors.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 37 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.

1. Runtime risk: Runtime risk requires verification

Severity: high
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/661

2. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/634

3. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: Duplicate paragraph extraction when a long sibling paragraph is present
User impact: Extraction output may be affected by the reported issue: Duplicate paragraph extraction when a long sibling paragraph is present
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Duplicate paragraph extraction when a long sibling paragraph is present. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/817

4. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: Duplicated lines when nested in <article> and <main>, with <br> in front
User impact: Extraction output may be affected by the reported issue: Duplicated lines when nested in <article> and <main>, with <br> in front
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Duplicated lines when nested in <article> and <main>, with <br> in front. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/768
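
Items 3 and 4 both describe duplicated text in the extraction output. One way to check your own pages for this is to scan the extracted text for repeated lines. The sketch below is an illustration only: `find_duplicate_lines` and `check_extraction` are hypothetical helper names, and the 20-character threshold is an arbitrary assumption meant to skip short lines that repeat legitimately.

```python
from collections import Counter


def find_duplicate_lines(text, min_length=20):
    """Return lines of at least min_length characters that occur more than once."""
    lines = [line.strip() for line in text.splitlines()]
    counts = Counter(line for line in lines if len(line) >= min_length)
    return [line for line, n in counts.items() if n > 1]


def check_extraction(html):
    """Extract main text from an HTML string and report repeated lines."""
    # Imported lazily so find_duplicate_lines works without trafilatura installed.
    from trafilatura import extract

    text = extract(html) or ""  # extract returns None when nothing is found
    return find_duplicate_lines(text)
```

An empty list from `check_extraction` does not prove the issue is absent; it only means no long line repeated verbatim in that document.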

5. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: include_images changes text extraction
User impact: Extraction output may be affected by the reported issue: include_images changes text extraction
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: include_images changes text extraction. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/194
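
To see whether this affects a given page, the text extracted with and without `include_images` can be compared line by line. The following is a rough sketch under stated assumptions: `changed_lines` and `compare_include_images` are hypothetical helpers, and the comparison uses the standard-library `difflib` rather than anything provided by trafilatura.

```python
import difflib


def changed_lines(baseline, variant):
    """Return the lines added to or removed from baseline, prefixed with '+ ' or '- '."""
    diff = difflib.ndiff(baseline.splitlines(), variant.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def compare_include_images(html):
    """Extract the same HTML string twice and report how the output differs."""
    # Imported lazily so changed_lines works without trafilatura installed.
    from trafilatura import extract

    without_images = extract(html, include_images=False) or ""
    with_images = extract(html, include_images=True) or ""
    return changed_lines(without_images, with_images)
```

Added lines that correspond to images are expected; removed or altered text lines are the kind of difference the linked issue describes and are worth inspecting by hand.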

6. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: some extraction duplicated in xml
User impact: Extraction output may be affected by the reported issue: some extraction duplicated in xml
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: some extraction duplicated in xml. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/634
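
For XML output, a similar check can be run on the serialized document. This sketch assumes that paragraphs appear as `<p>` elements in the XML returned by `extract(..., output_format="xml")`; the tag name is a parameter in case your output differs, and both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicated_element_texts(xml_string, tag="p"):
    """Return the text of elements named `tag` whose full text occurs more than once."""
    root = ET.fromstring(xml_string)
    # itertext() gathers text from nested inline elements as well.
    texts = ["".join(el.itertext()).strip() for el in root.iter(tag)]
    counts = Counter(t for t in texts if t)
    return [t for t, n in counts.items() if n > 1]


def check_xml_extraction(html):
    """Extract an HTML string as XML and report repeated paragraph texts."""
    # Imported lazily so the helper above works without trafilatura installed.
    from trafilatura import extract

    xml_output = extract(html, output_format="xml")
    if xml_output is None:
        return []
    return duplicated_element_texts(xml_output)
```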

7. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/768

8. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/236

9. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/829

10. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/adbar/trafilatura

11. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/755

12. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/78

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12

Count of project-level external discussion links exposed on this manual page.

Use: Review before install

Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using trafilatura with real data or production workflows.

Investigate spacing in element tails - github / github_issue
HTML conversion: 'NoneType' object is not subscriptable - github / github_issue
include_links breaks the extraction for https://news.ycombinator.com - github / github_issue
Keeping all valid table information and formatting - github / github_issue
Backticks produce extra line breaks - github / github_issue
some extraction duplicated in xml - github / github_issue
include_images changes text extraction - github / github_issue
Keeping images breaks parsing - github / github_issue
Text dropped in table after setting include_formatting=True - github / github_issue
Duplicate paragraph extraction when a long sibling paragraph is present - github / github_issue
Duplicated lines when nested in <article> and <main>, with <br> in front - github / github_issue
included_images failed when trying to extract images in a table - github / github_issue

Source: Project Pack community evidence and pitfall evidence

