# trafilatura - Doramagic AI Context Pack

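> Positioning: a pre-installation preview and decision asset. It gives the host AI a good starting point, but it does not mean the target project has been installed, executed, or verified.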

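## Sufficiency Principle

- **Sufficiency, not compression**: an AI Context Pack should be complete enough for the host AI to understand the project's value, capability boundaries, entry points, risks, and evidence sources before work starts. It may be organized in layers, but the shortest possible summary is not the goal.
- **Compression strategy**: compress only noise and repetition, never context that affects judgment or the quality of the work.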

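## How the Host AI Should Use This

You are reading the AI Context Pack that Doramagic compiled for trafilatura. Treat it as pre-work context: it helps the user understand who the project suits, what it can do, how to start, what must be verified after installation, and where the risks are. Do not claim that you have installed, run, or executed the target project.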

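## Claim Consumption Rules

- **Source of facts**: Repo Evidence + Claim/Evidence Graph; the Human Wiki supplies only salience, terminology, and narrative structure.
- **Minimum status for facts**: `supported`
- `supported`: may be used as a project fact, but the answer must cite the claim_id and the evidence path.
- `weak`: a low-confidence lead only; the user must be asked to verify it further.
- `inferred`: may be used only for risk notes or open questions, never presented as a project fact.
- `unverified`: must not be used as a fact; state clearly that evidence is insufficient.
- `contradicted`: the conflicting sources must be shown; do not pick one version on the user's behalf.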
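
The status rules above amount to a small lookup from claim status to permitted use. The following is a minimal sketch under stated assumptions: the `Claim` record, its field names, and the `render_claim` helper are hypothetical and are not part of the pack or of trafilatura.

```python
from dataclasses import dataclass
from typing import Tuple

# Permitted use per claim status, transcribed from the rules above.
# The policy labels on the right are illustrative names, not pack fields.
STATUS_POLICY = {
    "supported": "fact",
    "weak": "lead",
    "inferred": "risk_note",
    "unverified": "insufficient",
    "contradicted": "show_conflict",
}


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; the field names are assumptions."""
    claim_id: str
    status: str
    confidence: float
    evidence_paths: Tuple[str, ...]


def render_claim(claim: Claim, text: str) -> str:
    """Wrap a statement according to the status of the claim behind it."""
    usage = STATUS_POLICY.get(claim.status, "insufficient")
    # A fact must cite both the claim_id and an evidence path; without a
    # path, fall back to the insufficient-evidence wording.
    if usage == "fact" and not claim.evidence_paths:
        usage = "insufficient"
    cite = f"[{claim.claim_id}; evidence: {', '.join(claim.evidence_paths)}]"
    if usage == "fact":
        return f"{text} {cite}"
    if usage == "lead":
        return f"Low-confidence lead, please verify: {text} {cite}"
    if usage == "risk_note":
        return f"Risk or open question, not a project fact: {text} {cite}"
    if usage == "show_conflict":
        return f"Sources conflict, all shown for review: {text} {cite}"
    return f"Insufficient evidence to state as fact: {text}"
```

Unknown statuses are treated like `unverified`, which keeps the helper on the conservative side when a pack uses a status this sketch does not know.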

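## Who It Suits Best

- **AI researchers or builders of research-oriented agents**: the README is explicitly framed around research, experiment, or paper workflows. Evidence: `README.md` Claim: `clm_0002` supported 0.86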

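## What It Can Do

- **Project knowledge preview** (available as a pre-installation preview): the project can be read and explained, but current evidence is not sufficient to confirm installable capabilities or a runtime entry point. Evidence: `README.md`, `CONTRIBUTING.md`, `LICENSE`, `HISTORY.md`, and others. Claim: `clm_0001` supported 0.86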

## How to Start

- The project evidence contains no stable Quick Start command; this item is left empty rather than invented by Doramagic.

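## Pre-Continuation Decision Card

- **Current recommendation**: start with Prompt Preview
- **Why**: the available information is enough for a pre-installation trial, but real compatibility, output quality, and risk boundaries cannot yet be taken on trust.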

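### 30-Second Verdict

- **What to do now**: start with Prompt Preview
- **Minimal safe next step**: run Prompt Preview first
- **Do not trust yet**: real output quality cannot be trusted before installation.
- **Continuing will touch**: the host AI context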

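### What You Can Trust Now

- **Audience signal: AI researchers or builders of research-oriented agents** (supported): backed by a supported claim or project evidence, but this still does not equal real post-installation results. Evidence: `README.md` Claim: `clm_0002` supported 0.86
- **Capability exists: project knowledge preview** (supported): you can trust that the project contains signals of this capability; whether it fits your specific task still has to be checked by trial or after installation. Evidence: `README.md`, `CONTRIBUTING.md`, `LICENSE`, `HISTORY.md`, and others. Claim: `clm_0001` supported 0.86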

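### What You Cannot Trust Yet

- **Real output quality cannot be trusted before installation.** (unverified): Prompt Preview only shows how the project guides the work; it cannot prove result quality in a real project.
- **Host AI version compatibility cannot be trusted before installation.** (unverified): loading rules and version differences across hosts such as Claude, Cursor, Codex, and Gemini must be verified in a real environment.
- **That it will not pollute existing host AI behavior cannot be taken on trust.** (inferred): Skills, plugins, and AGENTS/CLAUDE/GEMINI instructions may change the host AI's default behavior.
- **Safe rollback cannot be assumed.** (unverified): unless the project explicitly provides uninstall and restore instructions, verify in an isolated environment first.
- **After a real installation, is it compatible with the user's current host AI version?** (unverified): compatibility can only be verified in the actual host environment.
- **Does the project's output quality meet the user's specific task?** (unverified): a pre-installation preview only shows the workflow and boundaries and cannot replace a real evaluation.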

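### What Continuing Will Touch

- **Host AI context**: the AI Context Pack, Prompt Preview, Skill routing, risk rules, and project facts. Reason: imported context influences the host AI's later judgments, so unverified items must not be presented as facts.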

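### Minimal Safe Next Steps

- **Run Prompt Preview first**: use the interactive pre-installation trial to judge whether the way of working fits; no authorization or environment change is needed. (Applies to: any project, especially when output quality is unknown.)
- **After installation, verify a single minimal task**: verify loading, compatibility, output quality, and rollback before deciding on deeper use. (Applies to: the move from trial to a real workflow.)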

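### Exit Paths

- **Preserve the pre-installation state**: record the original host configuration and project state so you can later judge whether recovery is possible.
- **No rollback path, no primary environment**: the absence of a rollback path is a blocker before continuing; do not proceed on trust or luck.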

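## Preview Only

- Explain who the project suits and what it can do
- Demonstrate a typical conversation flow based on the project documentation
- Help the user decide whether it is worth installing or researching further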

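## Must Be Verified After Installation

- Actually installing a Skill, plugin, or CLI
- Running scripts, modifying local files, or accessing external services
- Verifying real output quality, performance, and compatibility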

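## Boundary and Risk Card

- **Mistaking the pre-installation preview for a real run**: the user may overestimate how much configuration, permission, and compatibility verification the project has already completed. Mitigation: clearly separate prompt_preview_can_do from runtime_required. Claim: `clm_0003` inferred 0.45
- **To be confirmed**: after a real installation, is it compatible with the user's current host AI version? Reason: compatibility can only be verified in the actual host environment.
- **To be confirmed**: does the project's output quality meet the user's specific task? Reason: a pre-installation preview only shows the workflow and boundaries and cannot replace a real evaluation.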

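## Pre-Work Context

### Loading Order

- First read how_to_use.host_ai_instruction to establish the boundaries of this pre-installation decision asset.
- Read claim_graph_summary to confirm that facts come from the Claim/Evidence Graph rather than the Human Wiki narrative.
- Then read intended_users, capabilities, and quick_start_candidates to judge whether the user is a match.
- When a concrete task needs to be carried out, check role_skill_index first, then evidence_index.
- For questions about real installation, file modification, network access, performance, or compatibility, switch to risk_card and boundaries.runtime_required.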
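
The loading order above can be sketched as a small planner. This is a hedged illustration: the section keys come from the list, while the nested-dict pack layout and the `lookup` and `load_plan` helpers are assumptions made for the example.

```python
# Section keys are taken from the loading order above; treating the pack
# as a plain nested dict is an assumption made for illustration.
BASE_ORDER = [
    "how_to_use.host_ai_instruction",
    "claim_graph_summary",
    "intended_users",
    "capabilities",
    "quick_start_candidates",
]
TASK_ORDER = ["role_skill_index", "evidence_index"]
RUNTIME_ORDER = ["risk_card", "boundaries.runtime_required"]


def lookup(pack, dotted_key):
    """Resolve 'a.b' against nested dicts; return None if any part is missing."""
    node = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def load_plan(pack, has_task=False, is_runtime_question=False):
    """Return (key, value) pairs in the order the host AI should read them."""
    keys = list(BASE_ORDER)
    if has_task:
        keys += TASK_ORDER
    if is_runtime_question:
        keys += RUNTIME_ORDER
    return [(key, lookup(pack, key)) for key in keys]
```

A missing section shows up as `None` in the plan, which gives the host AI an explicit place to report insufficient evidence instead of filling the gap.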

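### Task Routing

- **Project knowledge preview**: start from role_skill_index / evidence_index to help the user pick usable roles, Skills, or workflows. Boundary: available as a pre-installation Prompt experience. Evidence: `README.md`, `CONTRIBUTING.md`, `LICENSE`, `HISTORY.md`, and others. Claim: `clm_0001` supported 0.86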

### Context Size

- Total files: 68
- Important file coverage: 40/68
- Evidence index entries: 66
- Role / Skill entries: 3

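### Handling Insufficient Evidence

- **missing_evidence**: state that evidence is insufficient and ask the user for the target file, README section, or post-installation verification record; do not fill in facts.
- **out_of_scope_request**: state that the task is outside the evidence scope of the current AI Context Pack, and suggest that the user consult the Human Manual or verify after a real installation.
- **runtime_request**: provide a pre-installation checklist and the source of any commands, but do not run commands for the user or claim to have run them.
- **source_conflict**: show the conflicting sources side by side and mark the item as pending verification; do not pick one version.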
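
The four situations above can be expressed as a lookup table with a conservative default. A minimal sketch; the `FALLBACKS` table and the `respond` helper are hypothetical names, and the response wording is paraphrased from the list.

```python
# Fallback responses per situation, paraphrased from the list above.
# The table layout and the helper name are illustrative assumptions.
FALLBACKS = {
    "missing_evidence": (
        "Evidence is insufficient. Please provide the target file, README "
        "section, or post-installation verification record."
    ),
    "out_of_scope_request": (
        "This task is outside the evidence scope of the AI Context Pack. "
        "Consult the Human Manual or verify after a real installation."
    ),
    "runtime_request": (
        "Here is a pre-installation checklist and where each command comes "
        "from. No command has been run on your behalf."
    ),
    "source_conflict": (
        "The sources conflict. Both are shown and the item is marked as "
        "pending verification."
    ),
}


def respond(situation: str) -> str:
    """Return the fallback text; unknown situations get the most cautious one."""
    return FALLBACKS.get(situation, FALLBACKS["missing_evidence"])
```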

## Prompt Recipes

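### Fit Assessment

- Goal: judge whether this project fits the user's current task.
- Expected output: a fit verdict, key reasons, evidence citations, what can be previewed before installation, what must be verified after installation, and suggested next steps.

```text
Based on the AI Context Pack for trafilatura, first ask me 3 necessary questions, then judge whether it fits my task. The answer must cover: who it suits, what it can do, what it cannot do, whether it is worth installing, and where the evidence comes from. Every project fact must cite evidence_refs, source_paths, or a claim_id.
```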

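### Pre-Installation Experience

- Goal: let the user get a feel for the core workflow before installation, without presenting the preview as a real capability or a marketing promise.
- Expected output: an experience script with boundary labels, a post-installation verification checklist, and a cautious recommendation; no promises about real runs and no hard-sell wording.

```text
Treat trafilatura as a pre-installation experience asset, not as an installed tool or a real runtime environment.

Output exactly four parts:
1. First ask me 3 necessary questions.
2. Give an "experience script": use the three labels [previewable before installation], [must be verified after installation], and [insufficient evidence] to show how it might guide the workflow.
3. Give a post-installation verification checklist: list the capabilities that can only be confirmed after a real installation, real host loading, and a real project run.
4. Give a cautious recommendation: say only "worth further research or a trial install", "provide more information before deciding", or "not recommended to continue"; do not endorse the project.

Hard boundaries:
- Do not claim to have installed, run, tested, modified files, or produced real results.
- Do not use commitment-style wording such as "automatically adapts", "guaranteed to pass", "perfect fit", or "strongly recommend installing".
- When describing how it works after installation, use a conditional such as "if installation succeeds and the host loads the Skill correctly, it may ...".
- Write the experience script only as sample lines or a hypothetical flow: use "may ask / may suggest / may show", not "has been written, has been generated, has passed, is running, is generating".
- Prompt Preview is not responsible for giving installation commands; if the user is ready for a trial install, only advise reading the Quick Start and Risk Card first and verifying in an isolated environment.
- Every project fact must come from a supported claim, evidence_refs, or source_paths; inferred/unverified items may serve only as risks or open questions.
```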
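
The hard boundaries in the recipe above can be screened with a simple phrase check before a preview script is shown to the user. This is only a sketch: the English phrase lists are illustrative renderings of the wording the recipe forbids, and `find_violations` is a hypothetical helper, not something the pack ships.

```python
# Phrases the recipe rules out. These English strings are illustrative
# renderings; adapt them to the language the host AI answers in.
COMMITMENT_PHRASES = [
    "automatically adapts",
    "guaranteed to pass",
    "perfect fit",
    "strongly recommend installing",
]
COMPLETED_ACTION_PHRASES = [
    "has been written",
    "has been generated",
    "has passed",
    "is running",
    "is generating",
]


def find_violations(script: str) -> list:
    """Return every forbidden phrase found in the script, case-insensitively."""
    lowered = script.lower()
    return [
        phrase
        for phrase in COMMITMENT_PHRASES + COMPLETED_ACTION_PHRASES
        if phrase in lowered
    ]
```

Substring matching is deliberately crude: it can flag harmless sentences and miss paraphrases, so a hit is a reason to review the script by hand, not a verdict.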

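### Role / Skill Selection

- Goal: pick the best-matching assets from the project's roles or Skills.
- Expected output: a list of candidate roles or Skills, each with its applicable scenario, evidence path, risk boundary, and whether post-installation verification is needed.

```text
Read role_skill_index and recommend the 3-5 roles or Skills most relevant to my target task. For each recommendation, state the applicable scenario, likely output, risk boundary, and evidence_refs.
```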

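### Risk Pre-Check

- Goal: identify environment, permission, rule-conflict, and quality risks before installing or adopting the project.
- Expected output: a checklist covering environment, permissions, dependencies, licensing, host conflicts, quality risks, and unknowns.

```text
Based on risk_card, boundaries, and quick_start_candidates, give me a pre-installation risk checklist. Do not run commands for me; only explain what I should check, why, and what happens if a check fails.
```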

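### Host AI Kickoff Instruction

- Goal: turn the project context into an instruction for the host AI at the start of a conversation.
- Expected output: a kickoff instruction with clear boundaries and explicit evidence citations, ready to paste into the host AI.

```text
Based on the AI Context Pack for trafilatura, generate a kickoff instruction that I can paste into the host AI. The instruction must respect not_runtime=true and must not claim that the project has been installed, run, or produced real results.
```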

## Role / Skill Index

- 3 role / Skill / project documentation entries are indexed.

- **Trafilatura: Discover and Extract Text Data on the Web** (project_doc): Trafilatura: Discover and Extract Text Data on the Web. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `README.md`
- **How to contribute** (project_doc): If you value this software or depend on it for your product, consider sponsoring it and contributing to its codebase. Your support will help ensure the sustainability and growth of the project. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `CONTRIBUTING.md`
- **History / Changelog** (project_doc): Major changes: - Dependencies updated, lxml in particular with minimal changes in the code - Faster XPath performance using XSLT extensions by @Honesty-of-the-Cavernous-Tissue 793 - More deprecation warnings - More robust code. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `HISTORY.md`

## Evidence Index

- 66 evidence entries are indexed.

- **Trafilatura: Discover and Extract Text Data on the Web** (documentation): Trafilatura: Discover and Extract Text Data on the Web Evidence: `README.md`
- **How to contribute** (documentation): If you value this software or depend on it for your product, consider sponsoring it and contributing to its codebase. Your support will help ensure the sustainability and growth of the project. Evidence: `CONTRIBUTING.md`
- **License** (source_file): Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ Evidence: `LICENSE`
- **History / Changelog** (documentation): Major changes: - Dependencies updated, lxml in particular with minimal changes in the code - Faster XPath performance using XSLT extensions by @Honesty-of-the-Cavernous-Tissue 793 - More deprecation warnings - More robust code Evidence: `HISTORY.md`
- **Init** (source_file): title = "Trafilatura" author = "Adrien Barbaresi and contributors" license = "Apache-2.0" copyright = "Copyright 2019-present, Adrien Barbaresi" version = "2.1.0" ⋮---- all = Evidence: `trafilatura/__init__.py`
- **Cli** (source_file): def add args parser: argparse.ArgumentParser - argparse.ArgumentParser ⋮---- "Add argument groups and arguments to parser." group1 = parser.add argument group "Input", "URLs, files or directories to process" group1 ex = group1.add mutually exclusive group group2 = parser.add argument group "Output", "Determines if and how files will be written" group3 = parser.add argument group "Navigation", "Link discovery and web crawling" group3 ex = group3.add mutually exclusive group group4 = parser.add argument group "Extraction", "Customization of text and metadata processing" group5 = parser.add argument group "Format", "Selection of the output format" group5 ex = group5.add mutually exclusive grou… Evidence: `trafilatura/cli.py`
- **deduplicate, filter and convert to dict** (source_file): HAS GZIP = True ⋮---- HAS GZIP = False ⋮---- LOGGER = logging.getLogger name ⋮---- CHAR CLASS = string.ascii letters + string.digits STRIP DIR = re.compile r" ^/ +$" STRIP EXTENSION = re.compile r"\. a-z {2,5}$" CLEAN XML = re.compile r" " INPUT URLS ARGS = "URL", "crawl", "explore", "probe", "feed", "sitemap" EXTENSION MAPPING = { def load input urls args: argparse.Namespace - list str ⋮---- "Read list of URLs to process or derive one from command-line arguments." input urls: list str = ⋮---- input urls = getattr args, arg ⋮---- def load blacklist filename: str - set str ⋮---- "Read list of unwanted URLs." ⋮---- blacklist = {URL BLACKLIST REGEX.sub "", line.strip for line in inputfh} ⋮----… Evidence: `trafilatura/cli_utils.py`
- **normalize Unicode format defaults to NFC** (source_file): LOGGER = logging.getLogger name TXT FORMATS = {"markdown", "txt"} def determine returnstring document: Document, options: Extractor - str ⋮---- parent = element.getparent ⋮---- returnstring = control xml output document, options ⋮---- returnstring = xmltocsv document, options.formatting ⋮---- returnstring = build json output document, options.with metadata ⋮---- returnstring = build html output document, options.with metadata ⋮---- header = "---\n" ⋮---- header = "" returnstring = f"{header}{xmltotxt document.body, options.formatting }" ⋮---- returnstring = f"{returnstring}\n{xmltotxt document.commentsbody, options.formatting }".strip normalize Unicode format defaults to NFC ⋮---- "Execute… Evidence: `trafilatura/core.py`
- **Replace all punctuation with spaces using translation table** (source_file): "Code parts dedicated to duplicate removal and text similarity." ⋮---- Replace all punctuation with spaces using translation table ⋮---- perform hashing with limited size ⋮---- possibly a hex string Evidence: `trafilatura/deduplication.py`
- **https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies** (source_file): PROXY URL = os.environ.get "http proxy" ⋮---- PROXY URL = None ⋮---- CURL SHARE = pycurl.CurlShare ⋮---- HAS PYCURL = True ⋮---- HAS PYCURL = False LOGGER = logging.getLogger name ⋮---- HTTP POOL = None NO CERT POOL = None RETRY STRATEGY = None def create pool args: Any - urllib3.PoolManager Any ⋮---- "Configure urllib3 download pool according to user-defined settings." manager class = SOCKSProxyManager if PROXY URL else urllib3.PoolManager manager args = {"proxy url": PROXY URL} if PROXY URL else {} ⋮---- def apply curl proxy curl: "pycurl.Curl" - None ⋮---- "Route the pycurl request through PROXY URL when one is configured." ⋮---- DEFAULT HEADERS = urllib3.util.make headers accept encodin… Evidence: `trafilatura/downloads.py`
- **backup** (source_file): LOGGER = logging.getLogger name FEED TYPES = { FEED OPENING = re.compile r" ?:\s ?: ? ?:\s ", re.DOTALL BLACKLIST = re.compile r"\bcomments\b" LINK VALIDATION RE = re.compile ⋮---- r"\?type=100$ " Typo3 r"feeds/posts/default/?$ " Blogger ⋮---- r"feed$" Generic ⋮---- class FeedParameters ⋮---- "Store necessary information to proceed a feed." slots = "base", "domain", "ext", "lang", "ref" ⋮---- def is potential feed feed string: str - bool ⋮---- "Check if the string could be a feed." ⋮---- beginning = feed string :100 ⋮---- def handle link list linklist: list str , params: FeedParameters - list str ⋮---- output links = ⋮---- link = fix relative urls params.base, item checked = check url link,… Evidence: `trafilatura/feeds.py`
- **author info** (source_file): LOGGER = logging.getLogger name JSON ARTICLE SCHEMA = { JSON OGTYPE SCHEMA = { JSON PUBLISHER SCHEMA = {"newsmediaorganization", "organization", "webpage", "website"} JSON AUTHOR 1 = re.compile r'"author": ^} +?"name?\\?": ?\\?" ^"\\ + "author" ^} +?"names?".+?" ^" + ', re.DOTALL JSON AUTHOR 2 = re.compile r'" Pp erson" ^} +?"names?".+?" ^" + ', re.DOTALL JSON AUTHOR REMOVE = re.compile JSON PUBLISHER = re.compile r'"publisher": ^} +?"name?\\?": ?\\?" ^"\\ + ', re.DOTALL JSON TYPE = re.compile r'"@type"\s :\s " ^" "', re.DOTALL JSON CATEGORY = re.compile r'"articleSection": ?" ^"\\ + ', re.DOTALL JSON MATCH = re.compile r'"author": "person":', flags=re.IGNORECASE JSON REMOVE HTML = re.compi… Evidence: `trafilatura/json_metadata.py`
- **also interesting: article:section** (source_file): all = "Document" LOGGER = logging.getLogger name ⋮---- META URL = re.compile r"https?:// ?:www\. w 0-9 +\. ? ^/ + " JSON MINIFY = re.compile r' " ?:\\. ^"\\ " \s' HTMLTITLE REGEX = re.compile r"^ .+ ?\s+ –•·— ⁄ ⋆~‹« :- \s+ .+ $" part without dots? CLEAN META TAGS = re.compile r' "\' ' LICENSE REGEX = re.compile r"/ by-nc-nd by-nc-sa by-nc by-nd by-sa by zero / 1-9 \. 0-9 " TEXT LICENSE REGEX = re.compile METANAME AUTHOR = { ⋮---- } questionable: twitter:creator METANAME DESCRIPTION = { METANAME PUBLISHER = { ⋮---- } questionable: citation publisher METANAME TAG = { METANAME TITLE = { METANAME URL = {"rbmainurl", "twitter:url"} METANAME IMAGE = { PROPERTY AUTHOR = {"author", "article:author"… Evidence: `trafilatura/metadata.py`
- **Loop through and try again.** (source_file): LOGGER = logging.getLogger name DOT SPACE = re.compile r"\. $ " def tostring string: HtmlElement - str DIV TO P ELEMS = { DIV SCORES = {"div", "article"} BLOCK SCORES = {"pre", "td", "blockquote"} BAD ELEM SCORES = {"address", "ol", "ul", "dl", "dd", "dt", "li", "form", "aside"} STRUCTURE SCORES = {"h1", "h2", "h3", "h4", "h5", "h6", "th", "header", "footer", "nav"} TEXT CLEAN ELEMS = {"p", "img", "li", "a", "embed", "input"} REGEXES = { FRAME TAGS = {"body", "html"} LIST TAGS = {"ol", "ul"} def text length elem: HtmlElement - int ⋮---- "Return the length of the element with all its contents." ⋮---- class Candidate ⋮---- "Defines a class to score candidate elements." slots = "score", "elem"… Evidence: `trafilatura/readability_lxml.py`
- **Defines settings for trafilatura https://trafilatura.readthedocs.io/en/latest/settings.html** (source_file): Defines settings for trafilatura https://trafilatura.readthedocs.io/en/latest/settings.html Evidence: `trafilatura/settings.cfg`
- **todo Python = 3.10: use dataclass with slots=True** (source_file): get affinity = getattr os, "sched getaffinity", None CPU COUNT = len get affinity 0 if get affinity is not None else os.cpu count or 1 SUPPORTED FMT CLI = "csv", "json", "html", "markdown", "txt", "xml", "xmltei" SUPPORTED FORMATS = set SUPPORTED FMT CLI {"python"} def use config filename: str None = None, config: ConfigParser None = None - ConfigParser ⋮---- config = ConfigParser default file = str Path file .parent / "settings.cfg" ⋮---- DEFAULT CONFIG = use config CONFIG MAPPING = { def get optional int config: ConfigParser, option: str - int None ⋮---- "Read an optional positive integer setting; None when empty or non-numeric." value = config.get "DEFAULT", option, fallback="" .strip ⋮-… Evidence: `trafilatura/settings.py`
- **fix, check, clean and normalize** (source_file): LOGGER = logging.getLogger name LINK REGEX = re.compile r" ?: ? " XHTML REGEX = re.compile r" ", re.DOTALL HREFLANG REGEX = re.compile r'href= "\' .+? "\' ' WHITELISTED PLATFORMS = re.compile SITEMAP FORMAT = re.compile r"^.{0,5}<\?xml <sitemap <urlset" ⋮---- if link == self.current url: safety check ⋮---- fix, check, clean and normalize ⋮---- self.extract links LINK REGEX, 1, self.handle link process middle part of the match tuple ⋮---- safeguard ⋮---- try to extract links from TXT file ⋮---- process XML sitemap ⋮---- iterate through nested sitemaps and results ⋮---- sanity check: keep track of visited sitemaps and exclude them ⋮---- check content Evidence: `trafilatura/sitemaps.py`
- **test meta-refresh redirection** (source_file): LOGGER = logging.getLogger name URL STORE = UrlStore compressed=False, strict=False ROBOTS TXT URL = "/robots.txt" MAX SEEN URLS = 10 MAX KNOWN URLS = 100000 class CrawlParameters ⋮---- "Store necessary information to manage a focused crawl." slots = "start", "base", "lang", "rules", "ref", "i", "known num", "is on", "prune xpath" ⋮---- def get base url self, start: str - str ⋮---- "Set reference domain for the crawl." base: str = get base url start ⋮---- def get reference self, start: str - str ⋮---- "Determine the reference URL." ⋮---- def update metadata self, url store: UrlStore - None ⋮---- "Adjust crawl data based on URL store info." ⋮---- def filter list self, todo: list str None - l… Evidence: `trafilatura/spider.py`
- **control characters** (source_file): HAS GZIP = True ⋮---- HAS GZIP = False ⋮---- HAS ZLIB = True ⋮---- HAS ZLIB = False ⋮---- HAS BROTLI = True ⋮---- HAS BROTLI = False ⋮---- HAS ZSTD = True ⋮---- HAS ZSTD = False ⋮---- LANGID FLAG = True ⋮---- LANGID FLAG = False ⋮---- cchardet detect = None ⋮---- LOGGER = logging.getLogger name UNICODE ALIASES = {"utf-8", "utf 8"} DOCTYPE TAG = re.compile "^ / ^< ", re.I FAULTY HTML = re.compile r" ", re.I HTML STRIP TAGS = re.compile r" " control characters INVALID XML CHARS = re.compile r" \x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff " note: htmldate could use HTML comments huge tree=True, remove blank text=True HTML PARSER = HTMLParser collect ids=False, default doctype=False, encoding="utf-8"… Evidence: `trafilatura/utils.py`
- **There is a previous node, append text to its tail** (source_file): LOGGER = logging.getLogger name PKG VERSION = version "trafilatura" TEI SCHEMA = str Path file .parent / "data" / "tei corpus.dtd" TEI VALID TAGS = { TEI VALID ATTRS = {"rend", "rendition", "role", "target", "type"} TEI DTD = None TEI REMOVE TAIL = {"ab", "p"} TEI DIV SIBLINGS = {"p", "list", "table", "quote", "ab"} CONTROL PARSER = XMLParser remove blank text=True NEWLINE ELEMS = {"graphic", "head", "lb", "list", "p", "quote", "row", "table"} SPECIAL FORMATTING = {"code", "del", "head", "hi", "ref", "item", "cell"} WITH ATTRIBUTES = {"cell", "row", "del", "graphic", "head", "hi", "item", "list", "ref"} NESTING WHITELIST = {"cell", "figure", "item", "note", "quote"} META ATTRIBUTES = HI FOR… Evidence: `trafilatura/xml.py`
- **.coveragerc** (source_file): report exclude lines = pragma: no cover if name == . main .: except . ImportError. : except . UnicodeDecodeError. : except . urllib3.exceptions. : Evidence: `.coveragerc`
- **override github/linguist settings** (source_file): override github/linguist settings tests/cache/ linguist-vendored tests/eval/ linguist-vendored tests/resources/ linguist-vendored Evidence: `.gitattributes`
- **Compiled python modules.** (source_file): packaging dist/ build/ .egg-info/ .idea/ Evidence: `.gitignore`
- **.Readthedocs** (source_file): version: 2 build: os: ubuntu-24.04 tools: python: "3.13" sphinx: configuration: docs/conf.py python: install: - method: pip path: . extra requirements: - docs Evidence: `.readthedocs.yaml`
- **Citation** (source_file): authors: - family-names: Barbaresi given-names: Adrien orcid: https://orcid.org/0000-0002-8079-8694 cff-version: 1.2.0 identifiers: - description: "This is the collection of archived snapshots of all versions of Trafilatura" type: doi value: 10.5281/zenodo.3460969 message: "If you use this software, please cite both the article from preferred-citation and the software itself." preferred-citation: authors: - family-names: Barbaresi given-names: Adrien title: "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction" type: article year: 2021 repository: https://github.com/adbar/trafilatura repository-code: https://github.com/adbar/trafilatura title: Trafilat… Evidence: `CITATION.cff`
- **Manifest** (source_file): include CITATION.cff CONTRIBUTING.md HISTORY.md README.rst LICENSE graft trafilatura/data/ include trafilatura/settings.cfg include trafilatura/py.typed Evidence: `MANIFEST.in`
- **Minimal makefile for Sphinx documentation** (source_file): Minimal makefile for Sphinx documentation Evidence: `docs/Makefile`
- **Trafilatura Overview** (source_file): { "cells": { "cell type": "markdown", "id": "59249398-24e8-4339-a6ad-1c0ef69fb25e", "metadata": {}, "source": " Trafilatura: Overview and main functions\n", "\n", " 1. Installation\n", "\n", " pip install trafilatura \n", "\n", " Updating\n", "\n", " pip install -U trafilatura \n", "\n", "For more info see Installation https://trafilatura.readthedocs.io/en/latest/installation.html ." }, { "cell type": "markdown", "id": "8505f5ea-b7dd-455e-94a2-0950ee5ebdc5", "metadata": {}, "source": " 2. Downloads" }, { "cell type": "code", "execution count": 1, "id": "8998cb12-efd0-4822-b214-77a94c79aead", "metadata": {}, "outputs": , "source": "from trafilatura import fetch url\n", "\n", "document = fetc… Evidence: `docs/Trafilatura_Overview.ipynb`
- **Background** (source_file): The pages below provide background information on scientific approaches to web data collection and processing, corpus linguistics, digital humanities, and natural language processing. Evidence: `docs/background.rst`
- **Compendium** (source_file): Compendium: Web texts in linguistics and humanities =================================================== Evidence: `docs/compendium.rst`
- **Conf** (source_file): project = 'Trafilatura' copyright = '2025, Adrien Barbaresi' html show sphinx = False author = 'Adrien Barbaresi' version = trafilatura. version master doc = 'index' ⋮---- release = trafilatura. version language = 'en' extensions = templates path = ' templates' exclude patterns = ' build', 'Thumbs.db', '.DS Store' add module names = True html theme = 'pydata sphinx theme' html theme options = { ⋮---- html logo = "trafilatura-logo.png" html context = { intersphinx mapping = { html baseurl = 'https://trafilatura.readthedocs.io/' sitemap url scheme = "{lang}latest/{link}" html extra path = 'robots.txt' Evidence: `docs/conf.py`
- **Corefunctions** (source_file): .. contents:: Table of contents :depth: 2 :local: :backlinks: none Evidence: `docs/corefunctions.rst`
- **Corpus Data** (source_file): Working with corpus data ======================== Evidence: `docs/corpus-data.rst`
- **perform the first iteration will not work with this website, there are no internal links** (source_file): .. meta:: :description lang=en: Dive deep into the web with Python and on the command-line. Trafilatura supports focused crawling, enforces politeness rules, and navigates through websites. Evidence: `docs/crawls.rst`
- **create a filename-safe string by hashing the given content** (source_file): .. meta:: :description lang=en: Duplicate content can harm data quality and efficiency. Trafilatura detects similar texts and segments using a LRU cache and locality sensitive hashing LSH . Evidence: `docs/deduplication.rst`
- **single download** (source_file): Download web pages ================== Evidence: `docs/downloads.rst`
- **Evaluation** (source_file): .. meta:: :description lang=en: See how Python tools work on main text extraction from HTML pages html2txt . Trafilatura consistently outperforms other open-source libraries, showcasing its accuracy in extracting web content. Evidence: `docs/evaluation.rst`
- **outputs main content and comments as plain text ...** (source_file): A Python package & command-line tool to gather text on the Web ============================================================== Evidence: `docs/index.rst`
- **to make sure you have the latest version** (source_file): .. meta:: :description lang=en: Setting up Trafilatura is straightforward. This installation guide walks you through the process step-by-step. Evidence: `docs/installation.rst`
- **Make** (source_file): REM Command file for Sphinx documentation Evidence: `docs/make.bat`
- **import the necessary functions** (source_file): Trafilatura is a tool that simplifies the process of turning raw HTML into structured, meaningful data. This quickstart guide will walk you through the main functions of the software package using Python or the command-line. Evidence: `docs/quickstart.rst`
- **Robots** (source_file): Sitemap: https://trafilatura.readthedocs.io/en/latest/sitemap.xml Evidence: `docs/robots.txt`
- **load necessary functions and data** (source_file): Settings and customization ========================== Evidence: `docs/settings.rst`
- **use the list gathered in 1** (source_file): Finding sources for web corpora =============================== Evidence: `docs/sources.rst`
- **url is the target** (source_file): .. meta:: :description lang=en: This page explains how to solve common issues about content extraction and downloads. They include missing content, paywalls, cookies, and networks. Evidence: `docs/troubleshooting.rst`
- **Tutorial Dwds** (source_file): Tutorial: DWDS-Korpusdaten reproduzieren ======================================== Evidence: `docs/tutorial-dwds.rst`
- **replace with a production server if not running a local docker container** (source_file): Tutorial: Text embedding ======================== Evidence: `docs/tutorial-epsilla.rst`
- **sort the links and make sure they are unique** (source_file): Tutorial: Gathering a custom web corpus ======================================= Evidence: `docs/tutorial0.rst`
- **display most frequent tokens** (source_file): Tutorial: From a list of links to a frequency list ================================================== Evidence: `docs/tutorial1.rst`
- **load the necessary components** (source_file): Tutorial: Validation of TEI files ================================= Evidence: `docs/tutorial2.rst`
- **Tutorials** (source_file): Learn through practical examples. The following tutorials cover various scenarios, from text embedding for vector search to building custom web corpora and generating word frequency lists. Evidence: `docs/tutorials.rst`
- **load the function from the included courlan package** (source_file): .. meta:: :description lang=en: This page shows how to filter and refine a list of URLs, with Python and on the command-line, using the functions provided by the included courlan package. Evidence: `docs/url-management.rst`
- **Usage Api** (source_file): .. meta:: :description lang=en: See how to use the official Trafilatura API to download and extract data. Evidence: `docs/usage-api.rst`
- **outputs main content and comments as plain text ...** (source_file): On the command-line =================== Evidence: `docs/usage-cli.rst`
- **Usage Gui** (source_file): Graphical user interface ======================== Evidence: `docs/usage-gui.rst`
- **some formatting preserved in basic XML structure** (source_file): .. meta:: :description lang=en: This tutorial focuses on text extraction from web pages with Python code snippets. Data mining with this library encompasses HTML parsing and language identification. Evidence: `docs/usage-python.rst`
- **getting started** (source_file): .. meta:: :description lang=en: Trafilatura extends its download and extractions capabilities to the R community. Discover how to use Trafilatura in your R projects with this dedicated guide. Evidence: `docs/usage-r.rst`
- **Usage** (source_file): quickstart usage-python usage-cli usage-r usage-api usage-gui downloads crawls settings deduplication troubleshooting url-management Evidence: `docs/usage.rst`
- **Used By** (source_file): .. meta:: :description lang=en: Trafilatura now widely used, integrated into other software packages and cited in research publications. Notable projects and institutional users are listed on this page. Evidence: `docs/used-by.rst`
- **https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/** (source_file): https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/ build-system requires = "setuptools =61.0" build-backend = "setuptools.build meta" Evidence: `pyproject.toml`
- The remaining 6 evidence entries are in `AI_CONTEXT_PACK.json` or `EVIDENCE_INDEX.json`.

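## Rules the Host AI Must Follow

- **Treat this asset as pre-work context, not a runtime environment.** The AI Context Pack contains only evidence-backed understanding of the project, not the target project's executable state. Evidence: `README.md`, `CONTRIBUTING.md`, `LICENSE`
- **When answering the user, distinguish what can be previewed from what can only be verified after installation.** The value of the pre-installation experience to the user comes from reducing mistaken installs and misjudgments, not from posing as a real run. Evidence: `README.md`, `CONTRIBUTING.md`, `LICENSE`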

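## Questions the User Should Answer Before Starting

- In which host AI or local environment do you plan to use it?
- Do you only want to try the workflow first, or are you preparing a real installation?
- What matters most to you: installation cost, output quality, or conflicts with your existing rules?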

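## Acceptance Criteria

- Every capability statement can be traced back to a file path in evidence_refs.
- AI_CONTEXT_PACK.md does not present the preview as a real run.
- The user can understand within 3 minutes who the project suits, what it can do, how to start, and where the risk boundaries are.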
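
The first criterion, that every capability statement traces back to a path in evidence_refs, lends itself to a mechanical check. A minimal sketch, assuming capabilities are dicts with `name` and `evidence` keys; that record shape and the `untraceable_capabilities` helper are assumptions, not the pack's actual schema.

```python
def untraceable_capabilities(capabilities, evidence_refs):
    """Map each failing capability name to its unknown evidence paths.

    A capability fails when it cites no evidence at all (empty list in the
    result) or cites a path that is absent from evidence_refs.
    """
    known = set(evidence_refs)
    failures = {}
    for capability in capabilities:
        paths = capability.get("evidence", [])
        missing = [path for path in paths if path not in known]
        if not paths or missing:
            failures[capability["name"]] = missing
    return failures


# Usage with entries that mirror the pack's own evidence paths.
refs = ["README.md", "CONTRIBUTING.md", "LICENSE", "HISTORY.md"]
caps = [
    {"name": "project knowledge preview", "evidence": ["README.md", "LICENSE"]},
    {"name": "hypothetical unbacked capability", "evidence": []},
]
failures = untraceable_capabilities(caps, refs)
```

An empty result means the first criterion holds for the given inputs; the other two criteria need human review and are not covered by this check.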

---

## Doramagic Context Augmentation

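The content below reinforces the main body of the Repomix/AI Context Pack. The Human Manual provides only a reading skeleton; the pitfall log is converted into working constraints that the host AI must follow.

## Human Manual Skeleton

Usage rule: this is only a reading route through the project and a set of salience signals, not an authority on facts. Specific facts must still be traced back to repo evidence / the Claim Graph.

Hard rules for the host AI:
- Do not treat page titles, section order, summaries, or importance as evidence of project facts.
- When explaining the Human Manual skeleton, state explicitly that it is only a reading route and a salience signal.
- Judgments about capabilities, installation, compatibility, runtime state, and risk must cite repo evidence, a source path, or the Claim Graph.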

- **Overview & System Architecture**: importance `high`
  - source_paths: README.md, trafilatura/__init__.py, trafilatura/core.py, HISTORY.md, pyproject.toml
- **Text & Metadata Extraction Engine**: importance `high`
  - source_paths: trafilatura/main_extractor.py, trafilatura/baseline.py, trafilatura/readability_lxml.py, trafilatura/htmlprocessing.py, trafilatura/metadata.py
- **Web Crawling, Downloads & URL Discovery**: importance `medium`
  - source_paths: trafilatura/downloads.py, trafilatura/sitemaps.py, trafilatura/feeds.py, trafilatura/spider.py, trafilatura/utils.py
- **CLI, Configuration & Known Issues**: importance `high`
  - source_paths: trafilatura/cli.py, trafilatura/cli_utils.py, trafilatura/settings.py, trafilatura/settings.cfg, docs/troubleshooting.rst

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `9068a9781276134d10f64b5426da4e5cb649e094`
- inspected_files: `README.md`, `pyproject.toml`, `docs/conf.py`

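Hard rules for the host AI:
- Without repo_clone_verified=true, do not claim to have read the source code.
- Without repo_inspection_verified=true, do not state judgments about README/docs/package files as facts.
- Without quick_start_verified=true, do not claim that the Quick Start has been run successfully.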
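
The three rules above gate specific statements on verification flags. The sketch below encodes them; the flag names come from this section, while the statement labels and the `may_state` helper are hypothetical.

```python
# Flags as recorded in this section. quick_start_verified is not listed
# there, so it is absent here and counts as not verified.
INSPECTION_FLAGS = {
    "repo_clone_verified": True,
    "repo_inspection_verified": True,
}

# Which flag each kind of statement depends on. The statement labels on
# the left are illustrative names, not pack fields.
REQUIRED_FLAG = {
    "has_read_source": "repo_clone_verified",
    "repo_file_judgment_as_fact": "repo_inspection_verified",
    "quick_start_ran": "quick_start_verified",
}


def may_state(statement: str, flags: dict) -> bool:
    """Allow a statement only when its gating flag is explicitly True."""
    return flags.get(REQUIRED_FLAG[statement]) is True
```

Under the recorded flags, `may_state("quick_start_ran", INSPECTION_FLAGS)` is `False`, which matches the pack's position that no Quick Start has been run.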

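## Doramagic Pitfall Constraints

These rules come from project-specific pitfalls found while Doramagic discovered, validated, or compiled the project. The host AI must treat them as working constraints, not as ordinary explanatory text.

### Constraint 1: Capability judgments depend on assumptions

- Trigger: README/documentation is current enough for a first validation pass.
- Host AI rule: turn the assumptions into a downstream validation checklist.
- Why it matters: if an assumption does not hold, the user does not get the promised capability.
- Evidence: capability.assumptions | https://github.com/adbar/trafilatura | README/documentation is current enough for a first validation pass.
- Hard boundary: do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.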

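### Constraint 2: Maintenance activity unknown

- Trigger: last_activity_observed is not recorded.
- Host AI rule: add signals from GitHub: the most recent commit, release, and issue/PR responses.
- Why it matters: new, abandoned, and active projects get lumped together, which lowers confidence in any recommendation.
- Evidence: evidence.maintainer_signals | https://github.com/adbar/trafilatura | last_activity_observed missing
- Hard boundary: do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.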

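### Constraint 3: Downstream validation risk item (no_demo)

- Trigger: no_demo
- Evidence: downstream_validation.risk_items | https://github.com/adbar/trafilatura | no_demo; severity=medium
- Hard boundary: do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.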

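### Constraint 4: Scoring risk present

- Trigger: no_demo
- Why it matters: the risk affects whether the project is suitable for ordinary users to install.
- Evidence: risks.scoring_risks | https://github.com/adbar/trafilatura | no_demo; severity=medium
- Hard boundary: do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.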

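### Constraint 5: Issue/PR response quality unknown

- Trigger: issue_or_pr_quality=unknown
- Host AI rule: sample recent issues/PRs and judge whether they go unhandled for long periods.
- Why it matters: the user cannot tell whether anyone will maintain the project when problems come up.
- Evidence: evidence.maintainer_signals | https://github.com/adbar/trafilatura | issue_or_pr_quality=unknown
- Hard boundary: do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.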

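### Constraint 6: Release cadence unclear

- Trigger: release_recency=unknown
- Host AI rule: confirm that the latest release/tag and the README installation command are consistent.
- Why it matters: installation commands and documentation may lag behind the code, raising the chance that users hit problems.
- Evidence: evidence.maintainer_signals | https://github.com/adbar/trafilatura | release_recency=unknown
- Hard boundary: do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.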
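
Since Constraint 1 asks the host AI to turn assumptions into a downstream validation checklist, the constraints themselves can be kept as data and rendered as open items. A hedged sketch: the `Constraint` record and `open_checklist` helper are hypothetical, and only constraints that state a host AI rule are listed with one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Constraint:
    """Hypothetical record for one pitfall constraint."""
    number: int
    trigger: str
    host_ai_rule: Optional[str] = None
    # Set to True only when later verification evidence closes the item.
    closed_by_evidence: bool = False


def open_checklist(constraints: List[Constraint]) -> List[str]:
    """Render every constraint that is still open as a checklist line."""
    items = []
    for c in constraints:
        if c.closed_by_evidence:
            continue
        action = c.host_ai_rule or "no host AI rule recorded; keep the item open"
        items.append(f"[ ] Constraint {c.number} ({c.trigger}): {action}")
    return items


CONSTRAINTS = [
    Constraint(2, "last_activity_observed missing",
               "add recent commit, release, and issue/PR response signals"),
    Constraint(4, "no_demo"),
    Constraint(6, "release_recency=unknown",
               "confirm latest release/tag matches the README install command"),
]
checklist = open_checklist(CONSTRAINTS)
```

Keeping `closed_by_evidence` false by default mirrors the hard boundary: an item leaves the checklist only when someone records evidence that it is closed.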
