`, and common class names like "content" or "article-body". Source: [trafilatura/xpaths.py:1-30]().

## Text Extraction

The main extraction logic is centralized in the `extract` function, which accepts extensive configuration parameters documented in its docstring. Key parameters include `output_format` (supporting "txt", "markdown", "csv", "json", "html", "xml", and "xmltei"), `include_tables`, `include_images`, `include_links`, `include_formatting`, and `include_comments`. Source: [trafilatura/core.py:10-50]().

### Precision vs. Recall Trade-offs

The engine supports three operating modes that tune the extraction behavior:

- **Default**: Balanced precision and recall.
- **`favor_precision=True`**: "prefer less text but correct extraction."
- **`favor_recall=True`**: "when unsure, prefer more text."

The `fast=True` option uses "faster heuristics and skip backup extraction," trading thoroughness for speed. Source: [trafilatura/core.py:18-22]().

### Fallback Strategies

When the primary XPath-based extraction fails or yields insufficient content, Trafilatura falls back to alternative algorithms. The `readability_lxml.py` module implements a port of the Readability algorithm, which scores candidate elements to identify the main content. Source: [trafilatura/readability_lxml.py:1-40]().

The `baseline` module provides additional fallback extraction using simpler heuristics. The engine may also invoke `jusText` as a tertiary fallback, depending on configuration.

## Metadata Extraction

Metadata extraction is handled by the `metadata` module, which exports a `Document` class and an `extract_metadata` function. The module scrapes metadata from multiple sources, including HTML `<meta>` tags, JSON-LD structured data, microdata, and visible page elements. Source: [trafilatura/metadata.py:1-30]().

The `json_metadata.py` module contains specialized parsers for JSON-LD and Schema.org markup.
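
To make the JSON-LD step more concrete, the following is a minimal, standard-library-only sketch of how article metadata might be pulled from a JSON-LD block. It is not Trafilatura's implementation: the names `ARTICLE_TYPES`, `JSON_LD_BLOCK`, and `parse_json_ld` are hypothetical, the type list is an assumed subset, and the sketch decodes the block with `json` rather than reproducing the module's regular expressions.

```python
import json
import re

# Hypothetical subset of Schema.org article types (assumption for this sketch).
ARTICLE_TYPES = {"NewsArticle", "Report", "BlogPosting", "ScholarlyArticle"}

# Captures the body of <script type="application/ld+json"> elements.
JSON_LD_BLOCK = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def parse_json_ld(html: str) -> dict:
    """Return title, author and page type from the first article-like block."""
    for raw in JSON_LD_BLOCK.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type") or []
            if isinstance(types, str):
                types = [types]
            if not isinstance(types, list):
                continue
            matched = next(
                (t for t in types if isinstance(t, str) and t in ARTICLE_TYPES),
                None,
            )
            if matched is None:
                continue
            # "author" may be a string, an object, or a list of either.
            author = item.get("author")
            if isinstance(author, list):
                author = author[0] if author else None
            if isinstance(author, dict):
                author = author.get("name")
            return {
                "title": item.get("headline"),
                "author": author,
                "pagetype": matched,
            }
    return {}


sample = """<html><head><script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Example headline",
 "author": {"@type": "Person", "name": "Jane Doe"}}
</script></head><body></body></html>"""

result = parse_json_ld(sample)
```

Malformed blocks are skipped rather than raised, on the assumption that a metadata extractor should degrade gracefully and let other sources (such as `<meta>` tags) fill the gaps.
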
The JSON-LD parsers recognize numerous schema types, including `NewsArticle`, `Report`, `BlogPosting`, `ScholarlyArticle`, and others. They use compiled regular expressions to extract author names, publisher information, article sections, and content types from JSON-LD blocks. Source: [trafilatura/json_metadata.py:1-25]().

Extracted metadata fields include title, author, date, site name, description, categories, tags, and license. Publication dates are determined by calling the `find_date` function from the `htmldate` library. Source: [trafilatura/metadata.py:18-20]().

## Configuration System

Trafilatura's settings system is defined in `settings.py`. The module exports a `Document` configuration class that maps CLI and Python arguments to internal processing options. The `CONFIG_MAPPING` dictionary includes keys for format, precision controls, content inclusion flags (comments, formatting, links, images, tables), deduplication, language filtering, and various size thresholds. Source: [trafilatura/settings.py:1-40]().

Key size thresholds include `min_extracted_size`, `min_output_size`, `min_output_comm_size`, `min_extracted_comm_size`, `min_duplcheck_size`, and `max_repetitions`. These set the minimum amount of text required for extraction to succeed and cap how often duplicate segments may repeat. Source: [trafilatura/settings.py:30-40]().

The `MANUALLY_STRIPPED` list in `settings.py` defines HTML elements that are unconditionally removed during preprocessing (e.g., `font`, `ins`, `mark`, `small`, `template`). The `BASIC_CLEAN_XPATH` expression targets elements such as `