Doramagic Project Pack · Human Manual
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Overview, Installation, and Quickstart
Related topics: Core Extraction Engine: Text and Metadata, Web Discovery, Crawling, and Downloads
Purpose and Scope
Trafilatura is a Python library and command-line tool for gathering text on the Web. It bundles web crawling/scraping, downloading, extraction of main text, metadata, and comments into a single package. According to the package metadata in trafilatura/__init__.py, it is released under the Apache-2.0 license (versions 1.8.0 and later) and the current stable version is 2.1.0.
The library is designed for researchers and engineers who need to build web corpora or scrape articles at scale. The README.md documents that it supports TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI output formats and that it is integrated into thousands of projects by organizations such as HuggingFace, IBM, and Microsoft Research.
The public API is intentionally narrow. As shown in trafilatura/__init__.py, the top-level exports are bare_extraction, baseline, extract, extract_metadata, extract_with_metadata, fetch_response, fetch_url, html2txt, and load_html. The three most commonly used functions for new users are extract, extract_metadata, and fetch_url.
High-Level Architecture
Trafilatura separates concerns into three layers: downloading, parsing/loading HTML, and extraction. The data flow is straightforward and is the same whether the entry point is the CLI or Python:
flowchart LR
A[URL or HTML string] --> B[fetch_url / load_html]
B --> C[lxml HtmlElement tree]
C --> D[Pruning + XPath selection]
D --> E{bare_extraction}
E --> F[Document metadata]
E --> G[Main text body]
E --> H[Comments]
G --> I[Output formatter]
I --> J[txt / markdown / json / xml / xmltei / csv / html]
The extract() and bare_extraction() functions in trafilatura/core.py accept both raw HTML and pre-loaded trees, which lets the caller skip the download step when working with cached files. When with_metadata=True, metadata is computed first by extract_metadata() in trafilatura/metadata.py, including JSON-LD parsing from trafilatura/json_metadata.py. A readability-style fallback is provided through trafilatura/readability_lxml.py, and the baseline()/html2txt() functions offer a simpler alternative extractor.
Installation
The package is published on PyPI. According to the README.md, a minimal install is:
pip install trafilatura
Optional add-ons documented in the README include language detection and faster downloads:
pip install "trafilatura[all]"
Dependency Notes (community-reported issues)
Trafilatura 2.1.0 explicitly updates lxml, as called out in the community context for issue #532 and in the release notes. Older releases imported lxml.html.clean, which lxml 5.2.0 moved into the separate lxml_html_clean project; users running lxml 5.2.0 or later therefore need a Trafilatura release with LXML v5+ support (1.7.0 or later according to HISTORY.md). Separately, issue #846 points to a CVE in older lxml versions (pre-6.1.0), so upgrading the dependency is also a security concern. The fix for the import error is therefore to upgrade both lxml and trafilatura:
pip install --upgrade lxml trafilatura
Quickstart
Python: Extract Main Text from a URL
The shortest path from URL to plain text uses fetch_url followed by extract, both exported from trafilatura/__init__.py:
import trafilatura
downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    text = trafilatura.extract(downloaded)
    print(text)
fetch_url returns the HTML body as a string or None on failure. extract returns a string in the requested output_format or None if extraction is not possible. The supported formats are listed in the docstring of _internal_extraction in trafilatura/core.py: txt, markdown, csv, json, html, xml, and xmltei.
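As a minimal sketch of the behavior just described, the helper below checks the requested format against the seven documented values and propagates None when the download fails. The name url_to_text is hypothetical and not part of Trafilatura; only fetch_url, extract, and the output_format argument come from the text above.

```python
from typing import Optional

# The seven output formats named in the paragraph above.
SUPPORTED_FORMATS = ("txt", "markdown", "csv", "json", "html", "xml", "xmltei")


def url_to_text(url: str, output_format: str = "txt") -> Optional[str]:
    """Hypothetical helper: download a page and extract it, or return None."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    # Imported here so the format check runs before the dependency is touched.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # the download failed
        return None
    # extract() may itself return None when no content can be extracted.
    return trafilatura.extract(downloaded, output_format=output_format)
```

A call such as url_to_text("https://example.org/article", "markdown") needs network access; an unknown format fails fast before any request is made.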
Python: Extract Metadata
For structured access to title, author, date, sitename, and categories, use extract_metadata:
import trafilatura
html = trafilatura.fetch_url("https://example.org/article")
if html is not None:  # fetch_url returns None when the download fails
    tree = trafilatura.load_html(html)
    metadata = trafilatura.extract_metadata(tree)
    print(metadata.title, metadata.author, metadata.date, metadata.url)
The Document dataclass returned by extract_metadata is defined in trafilatura/metadata.py, which combines XPath selectors from xpaths.py with JSON-LD parsing routines in trafilatura/json_metadata.py.
Python: Full Bundle with `bare_extraction`
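When you need both content and metadata in one call, bare_extraction returns a single structured result instead of a formatted string (a Document object in the 2.x series, convertible with its as_dict() method; older releases returned a dict):
import trafilatura
downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    result = trafilatura.bare_extraction(downloaded, with_metadata=True)
    # result fields include 'text', 'comments', 'title', 'author', ...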
The with_metadata and only_with_metadata flags are documented in trafilatura/core.py. The latter short-circuits if essential metadata (date, title, URL) is missing.
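The only_with_metadata rule can be pictured with a few lines of plain Python. This is an illustrative sketch of the "date, title, URL" check described above, not Trafilatura's internal code; the function name and the dictionary layout are assumptions.

```python
from typing import Any, Mapping

# Fields the text above calls essential for only_with_metadata.
ESSENTIAL_FIELDS = ("date", "title", "url")


def has_essential_metadata(fields: Mapping[str, Any]) -> bool:
    """Return True only if every essential field is present and non-empty."""
    return all(fields.get(name) for name in ESSENTIAL_FIELDS)


complete = {"title": "A headline", "date": "2024-01-01", "url": "https://example.org/a"}
undated = {"title": "A headline", "date": None, "url": "https://example.org/a"}

print(has_essential_metadata(complete))  # True
print(has_essential_metadata(undated))   # False
```

A document like the second one would be discarded when only_with_metadata is enabled.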
Common Configuration Knobs
The extract function accepts a wide range of toggles. The most useful ones, defined in trafilatura/settings.py, are listed below.
| Option | Purpose | Notes |
|---|---|---|
| `output_format` | Output serialization | One of txt, markdown, csv, json, html, xml, xmltei |
| `fast` | Skip fallback extractors | Trades recall for speed |
| `favor_precision` / `favor_recall` | Bias extraction | Mutually exclusive presets |
| `include_comments` | Extract comment sections | Default True |
| `include_tables` | Keep `<table>` content | See community issue #794 below |
| `include_links` | Preserve hyperlinks | See community issue #794 below |
| `include_images` | Preserve image references | Experimental |
| `target_language` | Filter by ISO 639-1 code | Requires py3langid |
| `deduplicate` | Drop repeated segments | |
| `with_metadata` / `only_with_metadata` | Metadata handling | |
| `url_blacklist` / `author_blacklist` | Filtering | Sets of strings |
| `date_extraction_params` | Pass-through to htmldate | Dict |
Command-Line Quickstart
The package also installs a trafilatura CLI. The README.md links to usage-cli.html for the full reference. The most common invocations are:
# Download and extract to plain text
trafilatura -u "https://example.org/article"
# Convert an existing HTML file to Markdown (the page is read from standard input)
trafilatura --output-format markdown < page.html
# Keep links (tables are kept by default), output JSON with metadata
trafilatura -u "https://example.org/article" --json --with-metadata --links
Known Limitations and Community-Reported Issues
Three issues from the community context are worth flagging for new users:
- Issue #532: `lxml` 5.2.0 breaks import. Older versions of trafilatura import `lxml.html.clean`, which was removed in `lxml` 5.2.0. Users must upgrade trafilatura (for example to 2.1.0) or downgrade `lxml`. Source: community context.
- Issue #777 / #794: table and `--links` interaction. When `include_links=True` (or the `--links` CLI flag) is combined with `include_tables=True`, some `<table>` markup is not preserved correctly in the output. Source: community context and trafilatura/settings.py (which exposes both flags).
- Issue #846: release cadence and CVE exposure. Because releases are infrequent, downstream users must monitor dependency CVEs (notably in `lxml`) themselves. The 2.1.0 release addresses the immediate `lxml` concern. Source: community context.
See Also
- Core Functions reference (CLI and Python)
- Configuration file format (
settings.cfg) - Evaluation benchmarks and accuracy comparisons
- Contributing guide (CONTRIBUTING.md)
Source: https://github.com/adbar/trafilatura / Human Manual
Core Extraction Engine: Text and Metadata
Related topics: Overview, Installation, and Quickstart, Settings, Output Formats, and Known Issues
Overview
Trafilatura's core extraction engine is the heart of the library, responsible for transforming raw HTML into clean, structured main text and metadata. The package exposes this functionality primarily through three functions in the core module: extract, bare_extraction, and extract_with_metadata, all of which are re-exported in the top-level trafilatura namespace for convenience. Source: trafilatura/__init__.py:18-22.
The engine is designed to balance precision and recall while supporting multiple output formats (TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI). According to the README, Trafilatura "consistently outperforms other open-source libraries in text extraction" as measured by ROUGE-LSum Mean F1 Page Scores in independent benchmarks. Source: README.md:142-145.
Architecture and Pipeline
The extraction pipeline can be conceptualized as a series of stages that progressively refine HTML input into clean output. The core module orchestrates this through an internal _internal_extraction helper, which is invoked by the public functions. Source: trafilatura/core.py:55-100.
flowchart TD
A[HTML Input] --> B[Load & Parse HTML<br/>load_html]
B --> C[Prune Unwanted Nodes<br/>prune_unwanted_nodes]
C --> D[HTMLProcessing<br/>link conversion, cleaning]
D --> E[Main Text Extraction<br/>XPath-based + fallbacks]
E --> F[Metadata Extraction<br/>metadata.py]
F --> G[Format Output<br/>txt, json, xml, xmltei, csv]
G --> H[Document Object]
style E fill:#f9f,stroke:#333
style F fill:#bbf,stroke:#333
The pipeline begins by loading the HTML, either from raw content or a URL, and then applies a series of transformations: pruning unwanted nodes, processing links and formatting, identifying the main content region via XPath expressions, and finally extracting both the body text and structured metadata. The main text extraction leverages compiled XPath expressions defined in xpaths.py, which target elements such as <article>, <main>, and common class names like "content" or "article-body". Source: trafilatura/xpaths.py:1-30.
Text Extraction
The main extraction logic is centralized in the extract function, which accepts extensive configuration parameters documented in its docstring. Key parameters include output_format (supporting "txt", "markdown", "csv", "json", "html", "xml", and "xmltei"), include_tables, include_images, include_links, include_formatting, and include_comments. Source: trafilatura/core.py:10-50.
Precision vs. Recall Trade-offs
The engine supports three operating modes that tune the extraction behavior:
- Default: Balanced precision and recall.
- `favor_precision=True`: "prefer less text but correct extraction."
- `favor_recall=True`: "when unsure, prefer more text."
The fast=True option uses "faster heuristics and skip backup extraction," trading thoroughness for speed. Source: trafilatura/core.py:18-22.
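One way to keep the three modes mutually exclusive in calling code is to derive the keyword arguments from a single mode name. The helper below is a hypothetical convenience wrapper, not part of the library; only the favor_precision, favor_recall, and fast keywords come from the text above.

```python
from typing import Dict


def mode_to_kwargs(mode: str = "default", fast: bool = False) -> Dict[str, bool]:
    """Hypothetical helper: map one mode name to extract() keyword arguments."""
    if mode not in ("default", "precision", "recall"):
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "favor_precision": mode == "precision",
        "favor_recall": mode == "recall",
        "fast": fast,
    }


print(mode_to_kwargs("precision"))
# Intended use (not executed here): trafilatura.extract(html, **mode_to_kwargs("recall"))
```

Because the two bias flags are computed from one value, they can never both be True at the same time.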
Fallback Strategies
When the primary XPath-based extraction fails or yields insufficient content, Trafilatura falls back to alternative algorithms. The readability_lxml.py module implements a port of the Readability algorithm, which scores candidate elements to identify the main content. Source: trafilatura/readability_lxml.py:1-40. The baseline module provides additional fallback extraction using simpler heuristics. The engine may also invoke jusText as a tertiary fallback, depending on configuration.
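The fallback idea can be sketched generically: try extractors in order and accept the first result that is long enough. This is a simplified illustration, not the actual decision logic in trafilatura/core.py; the function names, the stand-in extractors, and the threshold of 250 characters are assumptions.

```python
from typing import Callable, Iterable, Optional

ExtractFn = Callable[[str], Optional[str]]


def first_sufficient(html: str, extractors: Iterable[ExtractFn],
                     min_size: int = 250) -> Optional[str]:
    """Return the first extractor output with at least min_size characters."""
    for extractor in extractors:
        text = extractor(html)
        if text and len(text) >= min_size:
            return text
    return None


# Stand-in extractors for demonstration only.
def primary(html: str) -> Optional[str]:
    return None  # pretend the primary selection found nothing


def backup(html: str) -> Optional[str]:
    return "x" * 300  # pretend a readability-style pass found enough text


def simple_baseline(html: str) -> Optional[str]:
    return "short"


result = first_sufficient("<html></html>", [primary, backup, simple_baseline])
print(len(result))  # 300
```

The ordering of the list expresses the priority: cheaper or more precise strategies first, broader ones last.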
Metadata Extraction
Metadata extraction is handled by the metadata module, which exports a Document class and an extract_metadata function. The module scrapes metadata from multiple sources, including HTML <meta> tags, JSON-LD structured data, microdata, and visible page elements. Source: trafilatura/metadata.py:1-30.
The json_metadata.py module contains specialized parsers for JSON-LD and Schema.org markup. It recognizes numerous schema types including NewsArticle, Report, BlogPosting, ScholarlyArticle, and others. The module uses compiled regular expressions to extract author names, publisher information, article sections, and content types from JSON-LD blocks. Source: trafilatura/json_metadata.py:1-25.
Metadata fields extracted include: title, author, date (via the htmldate library), site name, description, categories, tags, and license. The find_date function from htmldate is invoked to determine publication dates. Source: trafilatura/metadata.py:18-20.
Configuration System
Trafilatura uses a comprehensive settings system defined in settings.py. The module exports an Extractor configuration class that maps CLI and Python arguments to internal processing options. The CONFIG_MAPPING dictionary includes keys for format, precision controls, content inclusion flags (comments, formatting, links, images, tables), deduplication, language filtering, and various size thresholds. Source: trafilatura/settings.py:1-40.
Key size thresholds include min_extracted_size, min_output_size, min_output_comm_size, min_extracted_comm_size, min_duplcheck_size, and max_repetitions. These control the minimum amount of text required for extraction to succeed and limit the repetition of duplicate segments. Source: trafilatura/settings.py:30-40.
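To make the role of min_duplcheck_size and max_repetitions concrete, here is a rough sketch of threshold-based deduplication. The semantics are an assumption made for illustration (only segments at or above the size threshold are counted, and a segment is dropped once it has appeared more than max_repetitions times); the default values are placeholders, and the library's own implementation may differ.

```python
from collections import Counter
from typing import Iterable, List


def drop_repeated(segments: Iterable[str], min_duplcheck_size: int = 100,
                  max_repetitions: int = 2) -> List[str]:
    """Keep segments, dropping long ones seen more than max_repetitions times."""
    seen: Counter = Counter()
    kept: List[str] = []
    for segment in segments:
        if len(segment) >= min_duplcheck_size:
            seen[segment] += 1
            if seen[segment] > max_repetitions:
                continue  # too many repetitions of a long segment
        kept.append(segment)
    return kept


long_segment = "a" * 120
sample = [long_segment] * 4 + ["short"] * 4
print(len(drop_repeated(sample)))  # 6: two long copies plus four short ones
```

Short segments bypass the check entirely, which mirrors the purpose of a minimum size for duplicate detection.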
The MANUALLY_STRIPPED list in settings.py defines HTML elements that are unconditionally removed during preprocessing (e.g., font, ins, mark, small, template). The BASIC_CLEAN_XPATH expression targets elements such as <aside>, <footer>, <script>, and <style> for removal. Source: trafilatura/settings.py:60-80.
Output Formats and Serialization
The core module serializes the extracted Document object into the requested output format. The bare_extraction function returns the structured result itself (a Document object in the 2.x series, a dictionary in older releases), while extract and extract_with_metadata return formatted strings. The tei_validation parameter enables DTD-based validation of XML-TEI output against the Text Encoding Initiative standard. Source: trafilatura/core.py:35-40.
Known Issues and Limitations
Several community-reported issues affect the extraction engine:
- LXML Compatibility: Issue #532 documents that `lxml 5.2.0` breaks imports because `lxml.html.clean` was extracted to a separate `lxml_html_clean` project. The latest 2.1.0 release addressed this with updated dependencies. Source: HISTORY.md:1-10.
- Table Extraction Bugs: Issue #777 reports incorrect table tags in HTML-formatted output, where standard `<thead>`/`<tr>`/`<th>` structures are converted to `<row>`/`<cell>` elements. Issue #794 documents that the `--links` flag breaks tables in certain Wikipedia pages. The 2.1.0 release included "fix table extraction bugs" by contributor @unsleepy22. Source: HISTORY.md:5-10.
- Performance: Release 2.1.0 introduced "Faster XPath performance using XSLT extensions" (PR #793), indicating ongoing optimization of the XPath evaluation layer used for main content identification. Source: HISTORY.md:3-5.
Web Discovery, Crawling, and Downloads
Related topics: Overview, Installation, and Quickstart, Core Extraction Engine: Text and Metadata
Overview
Trafilatura bundles a complete pipeline for finding, fetching, and navigating web content. The pipeline is split across several modules that work together: a low-level HTTP fetcher, a high-level focused crawler, sitemap and feed parsers, and a command-line entry point. The current release, 2.1.0, exposes these capabilities through both a Python API (re-exported from the top-level package) and a CLI surface (Source: trafilatura/__init__.py:13-39; Source: HISTORY.md:1-15).
The web discovery subsystem is responsible for three concerns:
- Fetching raw HTTP responses for a single URL or a queue of URLs.
- Discovering candidate URLs through sitemaps (TXT/XML) and syndication feeds (RSS/Atom/JSON).
- Crawling websites in a targeted, polite fashion using heuristics that prioritize content-bearing pages.
The extraction layer (covered elsewhere) consumes the bytes produced by this subsystem. Keeping discovery separate from extraction allows users to mix-and-match: they can pre-fetch a corpus with Trafilatura and feed the resulting HTML files to a different extractor, or they can let Trafilatura handle the whole chain end-to-end.
Module Layout and Public API
The top-level __init__.py re-exports a small, stable surface: fetch_response, fetch_url, and the high-level extraction entry points. The __all__ list is the contract callers should rely on (Source: trafilatura/__init__.py:21-35).
| Symbol | Source file | Purpose |
|---|---|---|
| `fetch_url` | trafilatura/downloads.py | Download a URL and return decoded HTML content as a string. |
| `fetch_response` | trafilatura/downloads.py | Return a full response object (status, headers, raw data) for advanced inspection. |
| `focused_crawler` | trafilatura/spider.py | Targeted crawl of a website starting from a homepage. |
| Sitemap helpers | trafilatura/sitemaps.py | Parse TXT and XML sitemaps, return iterators of URLs. |
| Feed helpers | trafilatura/feeds.py | Detect and parse RSS, Atom, and JSON feeds. |
| CLI driver | trafilatura/cli.py + trafilatura/cli_utils.py | Expose the above through trafilatura and trafilatura-cli commands. |
flowchart LR
A[CLI / Python caller] --> B{Discovery source}
B -->|Single URL| C[fetch_url / fetch_response]
B -->|Sitemap| D[sitemaps.py]
B -->|Feed| E[feeds.py]
B -->|Website| F[focused_crawler]
C --> G[Extraction layer]
D --> G
E --> G
F --> C
F --> G
Single-URL Downloads
The lowest layer is the downloader. fetch_url returns decoded HTML suitable for direct handoff to extract() or bare_extraction(). fetch_response is the more capable variant: it surfaces a urllib3-style response object so callers can inspect headers, status codes, and raw bytes (Source: trafilatura/__init__.py:13-17).
Both functions delegate to trafilatura/downloads.py, which wraps urllib3 directly rather than requests to keep dependencies minimal and to give Trafilatura tight control over retries, encoding detection, and backoff. The trade-off is documented in HISTORY.md: replacing requests with bare urllib3 and custom decoding shipped in 0.7.0 (Source: HISTORY.md:128-136).
Configuration
trafilatura/settings.py defines the keys consumed by the downloader and the crawler. Among the parameters relevant to discovery are:
- `SLEEP_TIME`: base delay between requests, honored by the crawler.
- `MAX_REDIRECTS`: added in 1.6.4 to prevent redirect loops (Source: HISTORY.md:36-44).
- `max_file_size` / `min_file_size`: size bounds used to filter responses.
- `max_tree_size`: cap on parsed-tree size to avoid pathological inputs.
These keys are loaded from DEFAULT_CONFIG and can be overridden per call by passing a configparser object to the high-level functions, or by editing the user's trafilatura.cfg (Source: trafilatura/settings.py:11-58).
Focused Crawling
focused_crawler is the workhorse for traversing an entire website. Its signature (truncated) is:
focused_crawler(homepage, max_seen_urls=10, max_known_urls=100000,
                todo=None, known_links=None, lang=None,
                config=DEFAULT_CONFIG, rules=None, prune_xpath=None)
The function uses a URLStore (from the courlan dependency) to deduplicate URLs, tracks a navigation vs. content heuristic, and stops when either max_seen_urls or max_known_urls is reached (Source: trafilatura/spider.py:65-95).
Key behaviors worth noting:
- Politeness is enforced through `URLStore.get_crawl_delay()`, which respects `robots.txt` rules supplied via the `rules` parameter. A `sleep_time` is read from configuration as a fallback (Source: trafilatura/spider.py:71-75).
- Boundary detection uses URL-path matching so the crawler does not wander into unrelated subdomains. This restriction was tightened in 1.12.1 (Source: HISTORY.md:81-89).
- Optional seeding lets callers pass a pre-built `todo` frontier or a list of `known_links`, which is useful when combining Trafilatura with an external URL discovery step (Source: trafilatura/spider.py:54-67).
The crawler returns a tuple of (todo, known_links), both as ordered lists, so callers can resume a crawl in a later process.
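The stop conditions and the resumable (todo, known_links) return value can be illustrated with a self-contained toy crawler. This is a sketch of the general frontier pattern under stated assumptions, not the code in trafilatura/spider.py: bounded_crawl and get_links are hypothetical names, and the "site" is an in-memory dictionary rather than real HTTP traffic.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple


def bounded_crawl(homepage: str, get_links: Callable[[str], Iterable[str]],
                  max_seen_urls: int = 10, max_known_urls: int = 100000,
                  todo: Optional[List[str]] = None,
                  known_links: Optional[List[str]] = None) -> Tuple[List[str], List[str]]:
    """Visit pages breadth-first until a limit is hit; return (todo, known_links)."""
    frontier = deque(todo if todo is not None else [homepage])
    known = list(known_links or [])
    known_set = set(known)
    for url in frontier:  # make sure queued URLs count as known
        if url not in known_set:
            known_set.add(url)
            known.append(url)
    seen = 0
    while frontier and seen < max_seen_urls and len(known) < max_known_urls:
        url = frontier.popleft()
        seen += 1
        for link in get_links(url):
            if len(known) >= max_known_urls:
                break
            if link not in known_set:
                known_set.add(link)
                known.append(link)
                frontier.append(link)
    return list(frontier), known


# Toy link graph standing in for a real website.
SITE = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/c"],
}


def fake_links(url: str) -> List[str]:
    return SITE.get(url, [])


# First run visits a single page, then the crawl is resumed from its output.
todo, known = bounded_crawl("https://example.org/", fake_links, max_seen_urls=1)
todo2, known2 = bounded_crawl("https://example.org/", fake_links,
                              todo=todo, known_links=known)
print(todo, len(known2))
```

Passing the first run's output back in as todo and known_links is what makes the crawl resumable across processes.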
Sitemaps and Feeds
For sites that publish a sitemap, trafilatura/sitemaps.py parses both the plain-text variant and the XML variant, and can also follow nested sitemap indexes. The max_sitemaps parameter was added in 1.12.2 to prevent runaway recursion on misconfigured sites (Source: HISTORY.md:71-80).
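The effect of a cap on processed sitemaps can be shown with the standard library alone. The sketch below follows a sitemap index into its child sitemaps and stops after max_sitemaps documents; it illustrates the idea only. collect_urls and the fetch callback are hypothetical, the documents live in a dictionary, and the default cap is a placeholder rather than Trafilatura's value.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url: str, fetch: Callable[[str], str],
                 max_sitemaps: int = 5) -> List[str]:
    """Collect page URLs, following index files but capping processed sitemaps."""
    pending = [sitemap_url]
    processed = 0
    urls: List[str] = []
    while pending and processed < max_sitemaps:
        current = pending.pop(0)
        processed += 1
        root = ET.fromstring(fetch(current))
        locs = [el.text.strip() for el in root.iter(f"{NS}loc") if el.text]
        if root.tag == f"{NS}sitemapindex":
            pending.extend(locs)  # nested sitemaps to visit later
        else:
            urls.extend(locs)     # regular sitemap: these are page URLs
    return urls


XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DOCS = {
    "https://example.org/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.org/s1.xml</loc></sitemap>"
        "<sitemap><loc>https://example.org/s2.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.org/s1.xml": f"<urlset {XMLNS}><url><loc>https://example.org/a</loc></url></urlset>",
    "https://example.org/s2.xml": f"<urlset {XMLNS}><url><loc>https://example.org/b</loc></url></urlset>",
}

all_urls = collect_urls("https://example.org/sitemap.xml", DOCS.__getitem__)
capped = collect_urls("https://example.org/sitemap.xml", DOCS.__getitem__, max_sitemaps=2)
print(all_urls, capped)
```

With the cap set to 2, only the index and its first child are read, so a self-referencing or endless index cannot keep the loop running.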
Feed discovery in trafilatura/feeds.py accepts RSS, Atom, and JSON Feed formats. Robustness improvements shipped across the 1.x line: better feed detection in 1.6.4, JSON web feed support in 1.0.0, and a long-running series of fixes for link discovery inside feeds (Source: HISTORY.md:46-54; Source: HISTORY.md:100-108).
A typical discovery workflow is therefore: try the homepage, look for a <link rel="alternate"> tag pointing at a feed, parse the feed for canonical article URLs, and finally hand those URLs to fetch_url or focused_crawler.
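The first step of that workflow, spotting a feed link on the homepage, can be sketched with the standard library. This is an illustration of the <link rel="alternate"> lookup only, not the detection logic in trafilatura/feeds.py; the class name and the set of accepted MIME types are assumptions.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# MIME types treated as feeds in this sketch (RSS, Atom, JSON Feed).
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}


class FeedLinkFinder(HTMLParser):
    """Collect absolute feed URLs from <link rel="alternate"> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        link_type = (attr.get("type") or "").lower()
        if "alternate" in rel and link_type in FEED_TYPES and attr.get("href"):
            self.feeds.append(urljoin(self.base_url, attr["href"]))


HOMEPAGE = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body></body></html>"""

finder = FeedLinkFinder("https://example.org/")
finder.feed(HOMEPAGE)
print(finder.feeds)  # ['https://example.org/feed.xml']
```

The resulting feed URLs would then be parsed for article links, which in turn go to fetch_url or focused_crawler as described above.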
Command-Line Interface
The CLI is split between the entry point in trafilatura/cli.py and shared helpers in trafilatura/cli_utils.py. From the README and the issue tracker, the most relevant flags for discovery are:
- `--input-file` / `-i`: read URLs from a file.
- `--parallel`: process a download queue across multiple cores.
- `--crawl`: drive `focused_crawler` instead of treating each line as an isolated URL.
- `--probe`: added in 1.6.3 to test a list of URLs for extractable content without writing output (Source: HISTORY.md:62-70).
The trafilatura and trafilatura-cli executables are both available; this was clarified in earlier releases and is reflected in the package's console_scripts entry points.
Community Notes Relevant to This Page
- Dependency churn and stale releases: Issue #846 ("Is this project dead?") and #813 ("Release new version on pypi") both flag the gap between source-level fixes and PyPI releases. The 2.1.0 release notes mention dependency updates (notably `lxml`) and XPath performance improvements, but downstream users who pinned to the previous PyPI version will not see them until they upgrade (Source: HISTORY.md:1-15).
- `lxml` 5.2.0 break: Issue #532 documents `ImportError: lxml.html.clean module is now a separate project lxml_html_clean`. The downloader does not import `lxml.html.clean` directly, but any user code that combined the downloader with `lxml_html_clean` for sanitization needs the new separate package. The fix in 1.7.0 was an explicit compatibility pass for `lxml` v5+ (Source: HISTORY.md:18-26).
- Feeds and `max_sitemaps`: When feeding large sites into the pipeline, prefer the `max_sitemaps` parameter; without it, a malformed sitemap index can keep the parser busy indefinitely (Source: HISTORY.md:71-80).
See Also
- Core Extraction Functions: covers `extract`, `bare_extraction`, and `extract_metadata`.
- Settings and Configuration: full list of config keys and their defaults.
- Metadata Extraction: title, author, date, sitename, and categories.
- Evaluation and Benchmarks: how the downloader/crawler are exercised in the test corpus.
External: Trafilatura documentation and the courlan package that powers URL validation and the URLStore used by focused_crawler.
Settings, Output Formats, and Known Issues
Related topics: Core Extraction Engine: Text and Metadata, Web Discovery, Crawling, and Downloads
Trafilatura is a Python and command-line toolkit for web crawling, downloading, and structured text extraction. The current release, v2.1.0, is exported from trafilatura/__init__.py and exposes high-level functions extract, bare_extraction, extract_with_metadata, baseline, extract_metadata, html2txt, load_html, fetch_url, and fetch_response. This page documents how those functions are configured, what formats they emit, and the failure modes the community reports most often.
Configuration System
Extractor Class and Defaults
All extraction behavior is centralized in the Extractor class defined in trafilatura/settings.py. It declares the canonical list of tunables — format, fast, focus, comments, formatting, links, images, tables, dedup, lang, min_extracted_size, min_output_size, min_output_comm_size, min_extracted_comm_size, min_duplcheck_size, max_repetitions, max_file_size, min_file_size, max_tree_size, source, url, with_metadata, only_with_metadata, tei_validation, date_params, author_blacklist, url_blacklist — and binds them through the CONFIG_MAPPING constant. The constructor accepts a ConfigParser (DEFAULT_CONFIG), a chosen output_format, and the corresponding keyword flags (fast, precision, recall, comments, formatting, links, images, tables, dedup, lang, url) that propagate into the instance attributes.
The helper _add_config projects the parser section into strongly typed attributes (min_extracted_size: int, max_repetitions: int, etc.), which guarantees that downstream consumers in trafilatura/core.py never read raw strings from the config file. Source: trafilatura/settings.py.
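The idea of projecting a parser section into typed attributes can be shown with configparser and a dataclass. This is a minimal sketch of the pattern, not the _add_config implementation; the SizeLimits class, the load_limits function, and the sample values are assumptions chosen for illustration.

```python
from configparser import ConfigParser
from dataclasses import dataclass

# Sample configuration text; the values are placeholders for this sketch.
SAMPLE_CFG = """
[DEFAULT]
MIN_EXTRACTED_SIZE = 250
MAX_REPETITIONS = 2
"""


@dataclass
class SizeLimits:
    min_extracted_size: int
    max_repetitions: int


def load_limits(parser: ConfigParser) -> SizeLimits:
    """Read raw config strings once and expose them as integers."""
    section = parser["DEFAULT"]
    return SizeLimits(
        min_extracted_size=section.getint("MIN_EXTRACTED_SIZE"),
        max_repetitions=section.getint("MAX_REPETITIONS"),
    )


parser = ConfigParser()
parser.read_string(SAMPLE_CFG)
limits = load_limits(parser)
print(limits)
```

Converting once at load time means later comparisons such as len(text) >= limits.min_extracted_size never have to deal with strings.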
Settings File Override
A shipped trafilatura/settings.cfg supplies sane defaults that can be replaced wholesale via the settingsfile argument, or selectively via config and options. The docstring of extract in trafilatura/core.py makes the precedence explicit: prune_xpath (str or list[str]) lets callers wipe subtrees before extraction runs, and author_blacklist / url_blacklist (Python sets) are applied immediately after extract_metadata returns. Source: trafilatura/core.py.
Output Formats
flowchart LR
A[HTML Input] --> B[load_html<br/>utils.py]
B --> C[Options<br/>Extractor instance]
C --> D{output_format}
D -->|txt| E[TXT]
D -->|markdown| F[Markdown]
D -->|csv| G[CSV]
D -->|json| H[JSON]
D -->|html| I[HTML]
D -->|xml| J[XML]
D -->|xmltei| K[XML-TEI]
E --> L[Return str]
F --> L
G --> L
H --> L
I --> L
J --> L
K --> L
The output_format argument enumerates seven legal values documented in the docstrings of extract and bare_extraction: "txt", "markdown", "csv", "json", "html", "xml", and "xmltei". Format-specific behavior is split across modules: the generic serializers live in trafilatura/xml.py, while xmltei additionally validates against the DTD at trafilatura/data/tei_corpus.dtd when the tei_validation flag is on. The XML TEI option requires lxml for schema validation, whereas plain xml and txt work with minimal dependencies.
Formatting hints are toggled independently: include_formatting is "only valuable if output_format is set to XML" per the docstring in trafilatura/core.py, and include_links / include_images are tagged experimental. Tables are on by default (include_tables=True) but can be disabled; links, images, and formatting work the other way round and stay off unless explicitly requested.
Known Issues and Community Concerns
Dependency Compatibility
The most upvoted community thread, issue #532, reports that LXML 5.2.0 removed lxml.html.clean, breaking imports with ImportError: lxml.html.clean module is now a separate project lxml_html_clean. According to HISTORY.md, the LXML v5+ transition was addressed in 1.7.0 ("support for LXML v5+"), and the legacy lxml.html.Cleaner was fully removed in 1.8.0. The 2.1.0 release continues the cleanup with "Dependencies updated, lxml in particular (with minimal changes in the code)".
Users on older LXML pins should install lxml_html_clean separately or upgrade Trafilatura past 1.7.0. Issue #846 further flags a CVE in lxml prior to v6.1.0, which is mitigated by upgrading Trafilatura to a release that bumps the dependency floor.
Extraction Edge Cases
Two open community reports describe table-related regressions:
- #777: Table tags incorrect in HTML formatted output. HTML such as `<table><thead><tr><th>…</th></tr></thead></table>` is emitted as `<table><row span="4"><cell role="head">…</cell></row></table>`, suggesting that the row/span synthesis in trafilatura/xml.py collapses header cells into a single attribute-bag cell.
- #794: `--links` flag breaks tables. When `include_links=True`, complex Wikipedia-style tables containing `<span typeof="mw:File/…">` inside `<td>` lose their cells. The version 2.1.0 changelog credits @unsleepy22 with "fix table extraction bugs", but the community threads remain open, indicating residual edge cases.
These regressions are sensitive to the MANUALLY_STRIPPED list in trafilatura/settings.py, which already excludes tbody, thead, and tfoot from stripping but still routes raw <table> content through the link-density heuristics in trafilatura/readability_lxml.py.
Project Maintenance
Issue #846 ("Is this project dead?") and #813 ("Release new version on pypi") both voice concern over release cadence. The fix is partial: HISTORY.md shows version 2.1.0 shipped with "More deprecation warnings" and "More robust code", but the PyPI release log visible to users in #813 still lagged at 2.0.0 at the time of writing. The CONTRIBUTING.md file explicitly invites sponsorships and PRs to sustain the project.
Recent Release Highlights
Version 2.1.0, documented in HISTORY.md, brings:
- Faster XPath via XSLT extensions (#793 by @Honesty-of-the-Cavernous-Tissue).
- Patched `AttributeError` during node pruning (#761 by @PLPeeters).
- Refined `<img src>` URL handling and additional table extraction fixes by @unsleepy22.
- Tightened metadata extraction in trafilatura/metadata.py, which still delegates date parsing to `htmldate.find_date` and URL normalization to `courlan`.
For a usage-oriented overview, the README.md lists the supported input sources (live URLs, sitemaps, feeds, on-disk HTML, pre-parsed trees) and the evaluation results showing Trafilatura leading ROUGE-LSum F1 benchmarks against competing extractors.
See Also
- Core Python API and CLI usage: see `corefunctions.md` and `usage-cli.md` on the documentation site linked from README.md.
- Web crawling and link discovery: `crawling.md`.
- Metadata extraction internals: `metadata.md` (covers trafilatura/metadata.py and `htmldate` integration).
- Fallback extractor algorithms: `fallbacks.md` (covers trafilatura/readability_lxml.py and `jusText`).
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 37 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.
1. Runtime risk: Runtime risk requires verification
- Severity: high
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/661
2. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/634
3. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: Duplicate paragraph extraction when a long sibling paragraph is present
- User impact: Developers may misconfigure credentials, environment, or host setup: Duplicate paragraph extraction when a long sibling paragraph is present
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Duplicate paragraph extraction when a long sibling paragraph is present. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/817
4. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: Duplicated lines when nested in <article> and <main>, with <br> in front
- User impact: Developers may misconfigure credentials, environment, or host setup: Duplicated lines when nested in <article> and <main>, with <br> in front
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Duplicated lines when nested in <article> and <main>, with <br> in front. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/768
5. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: include_images changes text extraction
- User impact: Developers may misconfigure credentials, environment, or host setup: include_images changes text extraction
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: include_images changes text extraction. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/194
6. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: some extraction duplicated in xml
- User impact: Developers may misconfigure credentials, environment, or host setup: some extraction duplicated in xml
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: some extraction duplicated in xml. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/634
7. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/768
8. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/236
9. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/829
10. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/adbar/trafilatura
11. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/755
12. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/78
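Several of the items above report duplicated paragraphs or lines in extracted output. A quick way to check whether a given document is affected is to scan the extracted text for long lines that appear more than once. The following is a minimal sketch under stated assumptions: the helper name `find_repeated_lines`, the 40-character threshold, and the sample text are all hypothetical and are not part of Trafilatura. It only assumes that the extraction result is available as a plain string.

```python
from collections import Counter


def find_repeated_lines(text: str, min_length: int = 40) -> list[tuple[str, int]]:
    """Return (line, count) pairs for long lines that occur more than once.

    Short lines are skipped because headings or list items can
    legitimately repeat. The threshold is an arbitrary assumption.
    """
    # Collapse runs of whitespace so trivially different copies compare equal.
    lines = [" ".join(raw.split()) for raw in text.splitlines()]
    counts = Counter(line for line in lines if len(line) >= min_length)
    return [(line, n) for line, n in counts.items() if n > 1]


# Hypothetical extracted output containing one duplicated paragraph.
sample = (
    "Short heading\n"
    "This paragraph is long enough to be checked for accidental duplication.\n"
    "A different paragraph that only appears once in the extracted output.\n"
    "This paragraph is long enough to be checked for accidental duplication.\n"
)

repeated = find_repeated_lines(sample)
print(repeated)
```

An empty result does not prove the extraction is correct; it only means no long line was repeated verbatim. Any hit should be compared against the source page, since some pages do repeat text on purpose.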
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using trafilatura with real data or production workflows.
- Investigate spacing in element tails - github / github_issue
- HTML conversion: 'NoneType' object is not subscriptable - github / github_issue
- include_links breaks the extraction for https://news.ycombinator.com - github / github_issue
- Keeping all valid table information and formatting - github / github_issue
- Backticks produce extra line breaks - github / github_issue
- some extraction duplicated in xml - github / github_issue
- include_images changes text extraction - github / github_issue
- Keeping images breaks parsing - github / github_issue
- Text dropped in table after setting include_formatting=True - github / github_issue
- Duplicate paragraph extraction when a long sibling paragraph is present - github / github_issue
- Duplicated lines when nested in <article> and <main>, with <br> in front - github / github_issue
- included_images failed when trying to extract images in a table - github / github_issue
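Many of the discussions listed above describe output that changes when a single option such as include_images, include_links, or include_formatting is toggled. One way to reproduce that kind of report is to run the extraction twice on the same HTML and diff the two results. The sketch below uses only the standard library; the helper name `diff_outputs` and the two sample strings are hypothetical stand-ins for real extraction output and are not part of Trafilatura.

```python
import difflib


def diff_outputs(baseline: str, variant: str, label: str = "variant") -> list[str]:
    """Return unified-diff lines between two extraction outputs."""
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            variant.splitlines(),
            fromfile="baseline",
            tofile=label,
            lineterm="",
        )
    )


# Hypothetical outputs; in practice these would come from two extraction
# runs on the same HTML with exactly one option changed.
baseline_text = "First paragraph.\nSecond paragraph.\nThird paragraph."
variant_text = "First paragraph.\nThird paragraph."

changes = diff_outputs(baseline_text, variant_text, label="include_images=True")

# Skip the two file-header lines, then keep lines present only in the baseline.
dropped = [line[1:] for line in changes[2:] if line.startswith("-")]
print(dropped)
```

Changing one option per run keeps the diff attributable to that option. Lines that appear only in the variant can be collected the same way by testing for a leading "+".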
Source: Project Pack community evidence and pitfall evidence