# trafilatura - Doramagic AI Context Pack

> Purpose: pre-work context for the user's host AI. This pack does not prove that the project has been installed, run, or validated.

## Project

- canonical_name: `adbar/trafilatura`
- capability: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
- expected_user_outcome: The user can gather text and metadata from the Web (crawling, scraping, extraction) and output the results as CSV, JSON, HTML, MD, TXT, or XML.

## Operating Boundaries

- Do not claim that the project has been installed, run, called through an API, or used on local files unless separate evidence proves it.
- Project facts must come from repo evidence, Claim Graph, or explicit source references.
- When a capability is not verified, mark it as unverified instead of completing it as fact.
- publish_status: `publishable`
- blocking_gaps: none

---

## Doramagic Context Augmentation

The following sections strengthen the repository context for a host AI. Human Manual data is a reading route, and pitfall notes become operating constraints.

## Human Manual Outline

Usage rule: this is only a reading route and salience signal, not factual authority. Concrete claims must still return to repo evidence or Claim Graph.

Host AI hard rules:
- Do not treat page titles, section order, summaries, or importance values as factual project evidence.
- When explaining the Human Manual outline, state that it is only a reading route or salience signal.
- Capability, installation, compatibility, runtime state, and risk claims must cite repo evidence, source paths, or Claim Graph.

- **Overview, Installation, and Quickstart**: importance `high`
  - source_paths: README.md, trafilatura/__init__.py, docs/quickstart.rst, docs/installation.rst
- **Core Extraction Engine: Text and Metadata**: importance `high`
  - source_paths: trafilatura/core.py, trafilatura/main_extractor.py, trafilatura/metadata.py, trafilatura/htmlprocessing.py, trafilatura/utils.py
- **Web Discovery, Crawling, and Downloads**: importance `high`
  - source_paths: trafilatura/downloads.py, trafilatura/spider.py, trafilatura/sitemaps.py, trafilatura/feeds.py, trafilatura/cli.py
- **Settings, Output Formats, and Known Issues**: importance `high`
  - source_paths: trafilatura/settings.py, trafilatura/settings.cfg, trafilatura/xml.py, trafilatura/xpaths.py, trafilatura/data/tei_corpus.dtd

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `1c614458e5bb3637e8e628c2c86ab2ceaee33b58`
- inspected_files: `README.md`, `pyproject.toml`, `docs/conf.py`

Host AI hard rules:
- Without repo_clone_verified=true, do not claim that the source code has been read.
- Without repo_inspection_verified=true, do not write README, docs, or package-file conclusions as facts.
- Without quick_start_verified=true, do not claim that the Quick Start path has run successfully.
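
The three rules above amount to gating each kind of claim on its verification flag. A minimal sketch of that gating follows. The class name, function name, and claim labels are hypothetical and are not part of this pack or of trafilatura; only the three flag names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InspectionEvidence:
    """Verification flags named in this pack; a flag that is not listed stays False."""
    repo_clone_verified: bool = False
    repo_inspection_verified: bool = False
    quick_start_verified: bool = False


def permitted_claims(evidence: InspectionEvidence) -> set:
    """Return the (hypothetical) claim labels that the flags allow."""
    claims = set()
    if evidence.repo_clone_verified:
        claims.add("source_code_read")
    if evidence.repo_inspection_verified:
        claims.add("readme_docs_package_facts")
    if evidence.quick_start_verified:
        claims.add("quick_start_ran")
    return claims


# Flags as recorded above; quick_start_verified is not listed, so it defaults to False.
pack_evidence = InspectionEvidence(
    repo_clone_verified=True,
    repo_inspection_verified=True,
)
print(sorted(permitted_claims(pack_evidence)))
```

With the flags recorded in this pack, the sketch permits claims about reading the source and about README, docs, or package-file contents, but not about a successful Quick Start run.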

## Doramagic Pitfall Constraints

These rules come from Doramagic discovery, validation, or compilation findings. The host AI must treat them as operating constraints, not background notes.

### Constraint 1: Runtime risk requires verification

- Trigger: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- Host AI rule: Reproduce the official install and quickstart path in an isolated environment.
- Why it matters: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/661
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.
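
One way to approach the "isolated environment" rule is a throwaway virtual environment. The sketch below only builds the commands and does not run them. The helper name and the directory are hypothetical, and `pip install trafilatura` is assumed to be the install path; confirm it against `docs/installation.rst` before relying on it.

```python
import sys
from pathlib import Path


def isolated_check_commands(env_dir: str) -> list:
    """Build (but do not run) commands for a throwaway virtual environment."""
    env = Path(env_dir)
    # Virtual environments keep executables in "Scripts" on Windows, "bin" elsewhere.
    bin_dir = env / ("Scripts" if sys.platform == "win32" else "bin")
    python = str(bin_dir / "python")
    return [
        # 1. Create the environment with the standard-library venv module.
        [sys.executable, "-m", "venv", str(env)],
        # 2. Install the package (assumed install path, see lead-in).
        [python, "-m", "pip", "install", "trafilatura"],
        # 3. Report the installed version through package metadata.
        [python, "-c",
         "from importlib.metadata import version; print(version('trafilatura'))"],
    ]


for command in isolated_check_commands("trafilatura-check-env"):
    print(" ".join(command))
```

Each command can then be passed to `subprocess.run(command, check=True)`. A successful run shows only that installation worked in that environment; it does not close the linked issue.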

### Constraint 2: Security or permission risk requires verification

- Trigger: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- Host AI rule: Reproduce the official install and quickstart path in an isolated environment.
- Why it matters: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/634
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.

### Constraint 3: Configuration risk requires verification

- Trigger: Developers should check this configuration risk before relying on the project: "Duplicate paragraph extraction when a long sibling paragraph is present".
- Host AI rule: Before packaging this project, run the relevant install, configuration, and quickstart checks for this issue. The source discussion did not expose a precise runtime context.
- Why it matters: Developers may misconfigure credentials, environment, or host setup. Reported symptom: duplicate paragraph extraction when a long sibling paragraph is present.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/817
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.
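
A check for the reported symptom could look for lines that occur more than once in extracted text. This is a minimal sketch, not the project's own test: the helper name, the `min_length` threshold, and the sample text are hypothetical.

```python
from collections import Counter


def repeated_lines(text: str, min_length: int = 20) -> dict:
    """Map each sufficiently long line that occurs more than once to its count."""
    lines = [line.strip() for line in text.splitlines()]
    # Short lines (list bullets, bylines) repeat legitimately, so they are skipped.
    counts = Counter(line for line in lines if len(line) >= min_length)
    return {line: n for line, n in counts.items() if n > 1}


# Hand-written sample standing in for extracted text.
sample = "\n".join([
    "A long paragraph that the extractor emitted once.",
    "Another long paragraph that shows up twice in the output.",
    "Another long paragraph that shows up twice in the output.",
    "Short.",
    "Short.",
])
print(repeated_lines(sample))
```

In a real check, the input would be the string returned by `trafilatura.extract(html)` for a page that reproduces the issue; that call returns `None` when nothing is extracted, so guard for that case first.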

### Constraint 4: Configuration risk requires verification

- Trigger: Developers should check this configuration risk before relying on the project: "Duplicated lines when nested in `<article>` and `<main>`, with `<br>` in front".
- Host AI rule: Before packaging this project, run the relevant install, configuration, and quickstart checks for this issue. The source discussion did not expose a precise runtime context.
- Why it matters: Developers may misconfigure credentials, environment, or host setup. Reported symptom: duplicated lines when content is nested in `<article>` and `<main>`, with `<br>` in front.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/768
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.

### Constraint 5: Configuration risk requires verification

- Trigger: Developers should check this configuration risk before relying on the project: "`include_images` changes text extraction".
- Host AI rule: Before packaging this project, run the relevant install, configuration, and quickstart checks for this issue. The source discussion did not expose a precise runtime context.
- Why it matters: Developers may misconfigure credentials, environment, or host setup. Reported symptom: enabling `include_images` changes the extracted text.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/194
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.
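
To check this risk, one could run the same input twice and compare the text lines of both runs. The sketch below takes any extractor callable that accepts an `include_images` keyword. The helper name, the stand-in extractor, and the default image-line filter (lines starting with `![`) are assumptions for illustration, not confirmed trafilatura behavior.

```python
def diff_text_lines(extract_fn, html, is_image_line=lambda line: line.startswith("![")):
    """Return text lines that differ between runs with and without images."""

    def text_lines(include_images):
        # Treat a None result (nothing extracted) as empty text.
        text = extract_fn(html, include_images=include_images) or ""
        stripped = (line.strip() for line in text.splitlines())
        return [line for line in stripped if line and not is_image_line(line)]

    plain = text_lines(False)
    with_images = text_lines(True)
    return {
        "only_without_images": [line for line in plain if line not in with_images],
        "only_with_images": [line for line in with_images if line not in plain],
    }


def fake_extract(html, include_images=False):
    """Stand-in extractor used only to exercise the helper."""
    if include_images:
        return "First paragraph.\n![logo](logo.png)"
    return "First paragraph.\nSecond paragraph."


print(diff_text_lines(fake_extract, "<html></html>"))
```

Passing `trafilatura.extract` as `extract_fn` would apply the same comparison to real output. Two empty lists mean the text lines matched for that input only.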

### Constraint 6: Configuration risk requires verification

- Trigger: Developers should check this configuration risk before relying on the project: "some extraction duplicated in xml".
- Host AI rule: Before packaging this project, run the relevant install, configuration, and quickstart checks for this issue. The source discussion did not expose a precise runtime context.
- Why it matters: Developers may misconfigure credentials, environment, or host setup. Reported symptom: some extracted content is duplicated in XML output.
- Evidence: failure_mode_cluster:github_issue | https://github.com/adbar/trafilatura/issues/634
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.
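
For XML output, a similar check can count repeated paragraph texts. This is a sketch under assumptions: the helper name is hypothetical, and the `p` tag and the sample document are illustrative rather than a confirmed description of trafilatura's XML schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def repeated_xml_paragraphs(xml_text: str, tag: str = "p") -> dict:
    """Map each paragraph text that occurs more than once to its count."""
    root = ET.fromstring(xml_text)
    # itertext() also collects text nested in child elements of each paragraph.
    texts = ("".join(element.itertext()).strip() for element in root.iter(tag))
    counts = Counter(text for text in texts if text)
    return {text: n for text, n in counts.items() if n > 1}


# Hand-written sample; real input would come from the extractor's XML output.
sample_xml = (
    "<doc><main>"
    "<p>Repeated text.</p><p>Unique text.</p><p>Repeated text.</p>"
    "</main></doc>"
)
print(repeated_xml_paragraphs(sample_xml))
```

The input for a real check would be the string returned by `trafilatura.extract(html, output_format="xml")` on a page that reproduces the issue.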

### Constraint 7: Configuration risk requires verification

- Trigger: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- Host AI rule: Reproduce the official install and quickstart path in an isolated environment.
- Why it matters: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/768
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.

### Constraint 8: Configuration risk requires verification

- Trigger: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- Host AI rule: Reproduce the official install and quickstart path in an isolated environment.
- Why it matters: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/236
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.

### Constraint 9: Configuration risk requires verification

- Trigger: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- Host AI rule: Reproduce the official install and quickstart path in an isolated environment.
- Why it matters: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/adbar/trafilatura/issues/829
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.

### Constraint 10: Capability evidence risk requires verification

- Trigger: This pack assumes, without verification, that the README and documentation are current enough for a first validation pass.
- Host AI rule: Reproduce the official install and quickstart path in an isolated environment.
- Why it matters: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/adbar/trafilatura
- Hard boundary: Do not present this pitfall as solved, verified, or ignorable unless later evidence explicitly closes it.
