# marker - Doramagic AI Context Pack

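> Positioning: a pre-install preview and decision asset. It helps the host AI get off to a good start, but it does not mean the target project has been installed, executed, or verified.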

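## Sufficiency Principle

- **Sufficiency, not compression**: An AI Context Pack should be complete enough for the host AI to understand the project's value, capability boundaries, entry points, risks, and evidence sources before starting work. It may be organized in layers, but the shortest possible summary is not the goal.
- **Compression policy**: Compress only noise and repetition; never compress context that affects judgment or the quality of the first working session.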

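## How the Host AI Should Use This

You are reading the AI Context Pack that Doramagic compiled for marker. Treat it as pre-work context: use it to help the user understand who the project suits, what it can do, how to get started, what must be verified after installation, and where the risks lie. Do not claim that you have installed, run, or executed the target project.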

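## Claim Consumption Rules

- **Source of facts**: Repo Evidence + Claim/Evidence Graph; the Human Wiki contributes only salience, terminology, and narrative structure.
- **Minimum status for facts**: `supported`
- `supported`: may be used as a project fact, but the answer must cite the claim_id and evidence path.
- `weak`: usable only as a low-confidence lead; the user must be asked to verify further.
- `inferred`: usable only for risk notes or open questions; must not be presented as a project fact.
- `unverified`: must not be used as a fact; state explicitly that evidence is insufficient.
- `contradicted`: the conflicting sources must be shown; do not pick one version on the user's behalf.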
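The status rules above can be read as a small gate in front of every statement the host AI makes. The sketch below is only an illustration of that reading: the `Claim` dataclass, the `HANDLING` labels, and `render_claim` are hypothetical names, not part of Doramagic or marker, and the output wording is an assumption.

```python
from dataclasses import dataclass

# Status names come from the rules above; the handling labels are hypothetical.
HANDLING = {
    "supported": "fact",
    "weak": "lead",
    "inferred": "risk",
    "unverified": "insufficient",
    "contradicted": "conflict",
}


@dataclass
class Claim:
    claim_id: str
    status: str
    evidence_path: str
    text: str


def render_claim(claim: Claim) -> str:
    """Turn one claim into a sentence that respects its status."""
    cite = f"[{claim.claim_id}, {claim.evidence_path}]"
    # Unknown statuses are treated like `unverified`, the safest default.
    handling = HANDLING.get(claim.status, "insufficient")
    if handling == "fact":
        return f"{claim.text} {cite}"
    if handling == "lead":
        return f"Low-confidence lead, please verify: {claim.text} {cite}"
    if handling == "risk":
        return f"Open question or risk, not a project fact: {claim.text} {cite}"
    if handling == "conflict":
        return f"Sources conflict, both versions need review: {claim.text} {cite}"
    return f"Insufficient evidence for: {claim.text}"
```

Only the `supported` branch emits the text as a plain fact, and it always carries the claim_id and evidence path, which is the citation requirement stated above.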

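## Who It Suits Best

- **AI researchers or builders of research-oriented agents**: The README is explicitly organized around research, experiment, or paper workflows. Evidence: `README.md` Claim: `clm_0011` supported 0.86
- **Developers already using a host AI such as Claude, Codex, Cursor, or Gemini**: The README or plugin configuration mentions multiple host AIs. Evidence: `README.md` Claim: `clm_0012` supported 0.86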

## What It Can Do

- **Multi-format Document Conversion** (previewable before install): Converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files into markdown, JSON, chunks, and HTML output formats. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Multi-language Document Support** (previewable before install): Processes documents in all languages, with OCR support via Surya-OCR. Evidence: `README.md`, `data/examples/markdown/thinkpython/thinkpython.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Image Extraction** (previewable before install): Extracts and saves images embedded within documents during conversion. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Artifact Removal** (previewable before install): Removes headers, footers, and other artifacts from documents during conversion. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **LLM-Enhanced Accuracy (Hybrid Mode)** (requires post-install verification): Uses an LLM alongside the core models to boost accuracy for table merging, inline math handling, form extraction, and value extraction. Supports Gemini and Ollama models. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Cross-Platform Hardware Support** (previewable before install): Runs on GPU (CUDA), CPU, or Apple MPS (Metal Performance Shaders) for inference. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Batch Processing** (previewable before install): Processes multiple PDF pages or files in parallel for improved throughput. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Structured Extraction (Beta)** (previewable before install): Performs structured extraction given a JSON schema for extracting specific data structures from documents. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Extensibility** (previewable before install): Allows customization with user-defined formatting and logic via extensions. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Cloud Deployment Ready** (previewable before install): Can be deployed to cloud platforms like Modal for scalable GPU-accelerated inference. Evidence: `examples/README.md` Claim: `clm_0010` supported 0.86

## How to Get Started

- `pip install marker-pdf` Evidence: `README.md` Claim: `clm_0013` supported 0.86, `clm_0014` supported 0.86
- `pip install marker-pdf[full]` Evidence: `README.md` Claim: `clm_0014` supported 0.86
- `pip install streamlit streamlit-ace` Evidence: `README.md` Claim: `clm_0015` supported 0.86
- `pip install -U uvicorn fastapi python-multipart` Evidence: `README.md` Claim: `clm_0016` supported 0.86
- `git clone https://github.com/VikParuchuri/marker.git` Evidence: `README.md` Claim: `clm_0017` supported 0.86
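Before running any of the commands above, it can help to check whether the package is already present in the current interpreter, without importing or executing it. The following is a minimal sketch using only the Python standard library. `probe_install` is a hypothetical helper. The distribution name `marker-pdf` comes from the commands above, and the module name `marker` is taken from the `pyproject.toml` excerpt in the evidence index. Treat both as assumptions to re-check against the real project.

```python
from importlib import metadata, util


def probe_install(dist_name: str, module_name: str) -> dict:
    """Report whether a distribution and its module are visible, without importing it."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "distribution": dist_name,
        "version": version,
        # find_spec only locates a top-level module; it does not run its code.
        "module_found": util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    print(probe_install("marker-pdf", "marker"))
```

A result with `version` set to `None` only means the distribution is not visible to this interpreter; it says nothing about other environments or about whether the project would work once installed.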

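## Pre-Continue Decision Card

- **Current recommendation**: Administrator/security approval required
- **Why**: Continuing may involve keys, accounts, external services, or sensitive context, so obtain administrator or security approval first.

### 30-Second Assessment

- **What to do now**: Obtain administrator/security approval
- **Minimal safe next step**: Run the Prompt Preview first; if credentials or an enterprise environment are involved, get approval before a trial install
- **Do not trust yet**: Real output quality cannot be trusted before installation.
- **Continuing will touch**: command execution, the local environment or project files, environment variables / API keys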

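### What You Can Trust Now

- **Audience signal: AI researchers or builders of research-oriented agents** (supported): Backed by a supported claim or project evidence, but this is still not the same as real post-install behavior. Evidence: `README.md` Claim: `clm_0011` supported 0.86
- **Audience signal: developers already using a host AI such as Claude, Codex, Cursor, or Gemini** (supported): Backed by a supported claim or project evidence, but this is still not the same as real post-install behavior. Evidence: `README.md` Claim: `clm_0012` supported 0.86
- **Capability present: Multi-format Document Conversion** (supported): You can trust that the project shows evidence of this capability; whether it fits your specific task still requires a trial or post-install verification. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86
- **Capability present: Multi-language Document Support** (supported): You can trust that the project shows evidence of this capability; whether it fits your specific task still requires a trial or post-install verification. Evidence: `README.md`, `data/examples/markdown/thinkpython/thinkpython.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86
- **Capability present: Image Extraction** (supported): You can trust that the project shows evidence of this capability; whether it fits your specific task still requires a trial or post-install verification. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86
- **Capability present: Artifact Removal** (supported): You can trust that the project shows evidence of this capability; whether it fits your specific task still requires a trial or post-install verification. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86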

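### What You Cannot Trust Yet

- **Real output quality cannot be trusted before installation.** (unverified): The Prompt Preview only demonstrates how the guidance works; it cannot prove result quality in a real project.
- **Host AI version compatibility cannot be trusted before installation.** (unverified): Loading rules and version differences across hosts such as Claude, Cursor, Codex, and Gemini must be verified in a real environment.
- **Do not assume it will leave existing host AI behavior untouched.** (inferred): Skills, plugins, and AGENTS/CLAUDE/GEMINI instructions may change the host AI's default behavior.
- **Do not assume a safe rollback is possible.** (unverified): Unless the project explicitly provides uninstall and restore instructions, verify in an isolated environment first.
- **Will the real installation be compatible with the user's current host AI version?** (unverified): Compatibility can only be verified in the actual host environment.
- **Does the project's output quality meet the user's specific task?** (unverified): A pre-install preview only shows the workflow and boundaries; it cannot replace a real evaluation.
- **Do the install commands need network access, elevated permissions, or global writes?** (unverified): This affects installation risk in both enterprise and personal environments. Evidence: `README.md`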

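### What Continuing Will Touch

- **Command execution**: package managers, network downloads, local plugin directories, project configuration, or the user's home directory. Reason: running even the first command may change the environment, so decide first whether it is worth running. Evidence: `README.md`
- **Local environment or project files**: installation output, plugin caches, project configuration, or local dependency directories. Reason: the write scope and rollback path cannot be proven before installation, so isolated verification is needed. Evidence: `README.md`
- **Environment variables / API keys**: The project's entry documentation explicitly mentions API key, token, secret, or account credential configuration. Reason: if a real installation needs credentials, use test credentials first and pass a permissions/compliance review. Evidence: `README.md`, `marker/settings.py`
- **Host AI context**: the AI Context Pack, Prompt Preview, Skill routing, risk rules, and project facts. Reason: importing context influences the host AI's later judgments, so unverified items must not be presented as facts.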

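### Minimal Safe Next Steps

- **Run the Prompt Preview first**: Use the interactive pre-install trial to judge whether the way of working fits; no authorization or environment change is needed. (Applies to: any project, especially when output quality is unknown.)
- **Trial-install only in an isolated directory or test account**: This keeps install commands from polluting your primary host AI, real projects, or home directory. (Applies when: there are signs of command execution, plugin configuration, or local writes.)
- **Do not use real production credentials**: Once an environment variable or API key enters the host or toolchain, it can create account and compliance risk. (Applies when: environment hints such as API, TOKEN, KEY, or SECRET appear.)
- **After installing, verify just one minimal task**: Verify loading, compatibility, output quality, and rollback first, then decide whether to adopt it more deeply. (Applies when: you are about to move from trial into a real workflow.)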

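### Exit Plan

- **Preserve the pre-install state**: Record the original host configuration and project state so you can later judge whether it is recoverable.
- **Record install commands and write paths**: Without explicit uninstall instructions, you at least need to know which directories or configuration need manual cleanup.
- **Be ready to revoke test API keys or tokens**: If a test credential leaks or is misused, you can cut losses quickly.
- **If there is no rollback path, stay out of your primary environment**: Irreversibility is a blocker before continuing; do not proceed on trust or luck.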

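## What Is Preview-Only

- Explaining who the project suits and what it can do
- Demonstrating a typical conversation flow based on the project documentation
- Helping the user decide whether it is worth installing or researching further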

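## What Must Be Verified After Installation

- Actually installing the Skill, plugin, or CLI
- Running scripts, modifying local files, or accessing external services
- Verifying real output quality, performance, and compatibility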

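## Boundary and Risk Card

- **Mistaking the pre-install preview for a real run**: The user may overestimate how much configuration, permission, and compatibility verification has been completed. Mitigation: clearly separate prompt_preview_can_do from runtime_required. Claim: `clm_0018` inferred 0.45
- **Command execution modifies the local environment**: Install commands may write to the user's home directory, the host plugin directory, or project configuration. Mitigation: run them in an isolated environment or test account first. Evidence: `README.md` Claim: `clm_0019` supported 0.86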
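- **To confirm**: Will the real installation be compatible with the user's current host AI version? Reason: compatibility can only be verified in the actual host environment.
- **To confirm**: Does the project's output quality meet the user's specific task? Reason: a pre-install preview only shows the workflow and boundaries; it cannot replace a real evaluation.
- **To confirm**: Do the install commands need network access, elevated permissions, or global writes? Reason: this affects installation risk in both enterprise and personal environments.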

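## Pre-Work Context

### Load Order

- First read how_to_use.host_ai_instruction to establish the boundaries of this pre-install decision asset.
- Read claim_graph_summary to confirm that facts come from the Claim/Evidence Graph rather than the Human Wiki narrative.
- Then read intended_users, capabilities, and quick_start_candidates to judge whether the user is a match.
- When a concrete task needs to be carried out, consult role_skill_index first, then evidence_index.
- For real installation, file modification, network access, performance, or compatibility questions, switch to risk_card and boundaries.runtime_required.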
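The load order above can be sketched as a small planner. This is a hedged illustration only: it assumes the pack is available as a nested dictionary whose keys match the section names listed above, and that dotted names such as `boundaries.runtime_required` denote nested keys. `load_plan` and `lookup` are hypothetical helpers, not a Doramagic API.

```python
from typing import Any, List

# Section names are taken from the load order above.
BASE_SECTIONS = [
    "how_to_use.host_ai_instruction",
    "claim_graph_summary",
    "intended_users",
    "capabilities",
    "quick_start_candidates",
]
TASK_SECTIONS = ["role_skill_index", "evidence_index"]
RUNTIME_SECTIONS = ["risk_card", "boundaries.runtime_required"]


def load_plan(has_task: bool = False, touches_runtime: bool = False) -> List[str]:
    """Return the sections to read, in order, for the current request."""
    plan = list(BASE_SECTIONS)
    if has_task:
        plan += TASK_SECTIONS
    if touches_runtime:
        plan += RUNTIME_SECTIONS
    return plan


def lookup(pack: dict, dotted_key: str) -> Any:
    """Walk a nested dict by dotted key; return None when any part is missing."""
    node: Any = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node
```

For example, `load_plan(has_task=True)` yields the five base sections followed by `role_skill_index` and `evidence_index`, and a missing section surfaces as `None` rather than as an invented value.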

### Task Routing

- **Multi-format Document Conversion**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Multi-language Document Support**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md`, `data/examples/markdown/thinkpython/thinkpython.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Image Extraction**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Artifact Removal**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **LLM-Enhanced Accuracy (Hybrid Mode)**: First explain that this capability requires post-install verification, then provide a pre-install checklist. Boundary: must be verified by a real installation or run. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Cross-Platform Hardware Support**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Batch Processing**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Structured Extraction (Beta)**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Extensibility**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, and others
- **Cloud Deployment Ready**: First use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: available as a pre-install Prompt experience. Evidence: `examples/README.md` Claim: `clm_0010` supported 0.86

### Context Size

- Total files: 197
- Important files covered: 40/197
- Evidence index entries: 66
- Role / Skill entries: 6

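### Handling Insufficient Evidence

- **missing_evidence**: State that evidence is insufficient and ask the user to provide the target file, README passage, or post-install verification record; do not fill in facts.
- **out_of_scope_request**: State that the task is outside the evidence scope of this AI Context Pack, and suggest the user consult the Human Manual or verify after a real installation.
- **runtime_request**: Provide a pre-install checklist and the source of each command, but do not run commands for the user or claim to have run them.
- **source_conflict**: Show both conflicting sources, mark the item as pending verification, and do not force a choice of one version.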
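The four cases above map naturally onto a lookup table. The sketch below shows one way a host-side wrapper might select a fallback reply. The four keys come from the list above, while `FALLBACKS`, `fallback_message`, and the exact reply wording are hypothetical and only illustrate the idea.

```python
# Keys come from the list above; the reply wording is illustrative only.
FALLBACKS = {
    "missing_evidence": (
        "Evidence is insufficient. Please share the target file, README "
        "passage, or post-install verification record."
    ),
    "out_of_scope_request": (
        "This task is outside the evidence scope of the AI Context Pack. "
        "Consult the Human Manual or verify after a real installation."
    ),
    "runtime_request": (
        "Here is a pre-install checklist and the source of each command. "
        "No command has been executed."
    ),
    "source_conflict": (
        "The sources conflict. Both versions are shown and marked as "
        "pending verification."
    ),
}


def fallback_message(kind: str) -> str:
    """Return the canned reply for a known fallback kind."""
    try:
        return FALLBACKS[kind]
    except KeyError:
        # Refuse unknown kinds instead of guessing a reply.
        raise ValueError(f"unknown fallback kind: {kind!r}") from None
```

Raising on an unknown kind keeps the wrapper from improvising a reply, which matches the rule that missing facts must not be filled in.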

## Prompt Recipes

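### Fit Assessment

- Goal: Decide whether this project fits the user's current task.
- Expected output: a fit conclusion, key reasons, evidence citations, what can be previewed before install, what must be verified after install, and a suggested next step.

```text
Based on marker's AI Context Pack, first ask me 3 essential questions, then judge whether it fits my task. The answer must cover: who it suits, what it can do, what it cannot do, whether it is worth installing, and where the evidence comes from. Every project fact must cite evidence_refs, source_paths, or a claim_id.
```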

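### Pre-Install Experience

- Goal: Let the user get a feel for the core workflow before installing, without dressing up the preview as real capability or a marketing promise.
- Expected output: an experience script with boundary labels, a post-install verification checklist, and cautious advice; no promises of real execution and no hard-sell language.

```text
Treat marker as a pre-install experience asset, not as an installed tool or a real runtime environment.

Output exactly four sections:
1. First ask me 3 essential questions.
2. Provide an "experience script": use the three labels [Previewable before install], [Must verify after install], and [Insufficient evidence] to show how it might guide the workflow.
3. Provide a post-install verification checklist: list the capabilities that can only be confirmed after a real installation, real host loading, and a real project run.
4. Provide cautious advice: you may only say "worth further research / a trial install", "provide more information before deciding", or "not recommended to continue"; do not endorse the project.

Hard boundaries:
- Do not claim to have installed, run, executed tests, modified files, or produced real results.
- Do not use promissory wording such as "adapts automatically", "guaranteed to pass", "a perfect fit", or "strongly recommend installing".
- When describing how it works after installation, use a conditional such as "If installation succeeds and the host loads the Skill correctly, it might ...".
- The experience script may only be written as sample dialogue or a hypothetical flow: use "might ask / might suggest / might show", and do not write "has written, has generated, has passed, is running, is generating".
- The Prompt Preview is not responsible for giving install commands; if the user is ready for a trial install, only advise reading the Quick Start and Risk Card first and verifying in an isolated environment.
- Every project fact must come from a supported claim, evidence_refs, or source_paths; inferred/unverified items may only appear as risks or open questions.
```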

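### Role / Skill Selection

- Goal: Pick the best-matching assets from the project's roles or Skills.
- Expected output: a list of candidate roles or Skills, each with its applicable scenario, evidence path, risk boundary, and whether post-install verification is needed.

```text
Read role_skill_index and recommend the 3-5 roles or Skills most relevant to my target task. For each recommendation, state the applicable scenario, likely output, risk boundary, and evidence_refs.
```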

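### Risk Pre-Check

- Goal: Identify environment, permission, rule-conflict, and quality risks before installing or adopting the project.
- Expected output: a checklist covering environment, permissions, dependencies, licensing, host conflicts, quality risks, and unknowns.

```text
Based on risk_card, boundaries, and quick_start_candidates, give me a pre-install risk checklist. Do not run commands for me; only explain what I should check, why, and what happens if a check fails.
```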

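### Host AI Kickoff Instruction

- Goal: Turn the project context into a host AI instruction to use before a conversation begins.
- Expected output: a kickoff instruction with clear boundaries and clear evidence citations, suitable for pasting into the host AI.

```text
Based on marker's AI Context Pack, generate a kickoff instruction that I can paste into my host AI. The instruction must honor not_runtime=true and must not claim that the project has been installed, run, or produced real results.
```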


## Role / Skill Index

- 6 role / Skill / project documentation entries are indexed in total.

- **Marker** (project_doc): Datalab State of the Art models for Document Intelligence Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `README.md`
- **Usage Examples** (project_doc): This directory contains examples of running marker in different contexts. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `examples/README.md`
- **Cla** (project_doc): This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Endless Labs, Inc. The term "you" shall mean the person or entity identified below. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `CLA.md`
- **An Aggregated Multicolumn Dilated Convolution Network for Perspective-Free Counting** (project_doc): An Aggregated Multicolumn Dilated Convolution Network for Perspective-Free Counting Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `data/examples/markdown/multicolcnn/multicolcnn.md`
- **Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity** (project_doc): Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `data/examples/markdown/switch_transformers/switch_trans.md`
- **Think Python** (project_doc): How to Think Like a Computer Scientist Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `data/examples/markdown/thinkpython/thinkpython.md`

## Evidence Index

- 66 evidence entries are indexed in total.

- **Marker** (documentation): Datalab State of the Art models for Document Intelligence Evidence: `README.md`
- **Usage Examples** (documentation): This directory contains examples of running marker in different contexts. Evidence: `examples/README.md`
- **License** (source_file): GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007 Evidence: `LICENSE`
- **Cla** (documentation): This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Endless Labs, Inc. The term "you" shall mean the person or entity identified below. Evidence: `CLA.md`
- **An Aggregated Multicolumn Dilated Convolution Network for Perspective-Free Counting** (documentation): An Aggregated Multicolumn Dilated Convolution Network for Perspective-Free Counting Evidence: `data/examples/markdown/multicolcnn/multicolcnn.md`
- **Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity** (documentation): Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity Evidence: `data/examples/markdown/switch_transformers/switch_trans.md`
- **Think Python** (documentation): How to Think Like a Computer Scientist Evidence: `data/examples/markdown/thinkpython/thinkpython.md`
- **Multicolcnn** (structured_config): { "children": { "id": "/page/0/Page/277", "block type": "Page", "html": " ", "polygon": 0.0, 0.0 , 612.0, 0.0 , 612.0, 792.0 , 0.0, 792.0 , "bbox": 0.0, 0.0, 612.0, 792.0 , "children": { "id": "/page/0/PageHeader/14", "block type": "PageHeader", "html": "", "polygon": 18.119998931884766, 211.199951171875 , 36.2599983215332, 211.199951171875 , 36.2599983215332, 559.2799987792969 , 18.119998931884766, 559.2799987792969 , "bbox": 18.119998931884766, 211.199951171875, 36.2599983215332, 559.2799987792969 , "children": null, "section hierarchy": {}, "images": {} }, { "id": "/page/0/SectionHeader/0", "block type": "SectionHeader", "html": " An Aggregated Multicolumn Dilated Convolution Network for… Evidence: `data/examples/json/multicolcnn.json`
- **Switch Trans** (structured_config): { "children": { "id": "/page/0/Page/164", "block type": "Page", "html": " ", "polygon": 0.0, 0.0 , 612.0, 0.0 , 612.0, 792.0 , 0.0, 792.0 , "bbox": 0.0, 0.0, 612.0, 792.0 , "children": { "id": "/page/0/PageHeader/0", "block type": "PageHeader", "html": "", "polygon": 90.0, 41.72613525390625 , 521.8120727539062, 41.72613525390625 , 521.8120727539062, 49.83837890625 , 90.0, 49.83837890625 , "bbox": 90.0, 41.72613525390625, 521.8120727539062, 49.83837890625 , "children": null, "section hierarchy": {}, "images": {} }, { "id": "/page/0/PageHeader/1", "block type": "PageHeader", "html": "", "polygon": 348.43359375, 42.369873046875 , 521.15625, 42.369873046875 , 521.15625, 49.74169921875 , 348.433… Evidence: `data/examples/json/switch_trans.json`
- **Thinkpython** (structured_config): { "children": { "id": "/page/0/Page/10", "block type": "Page", "html": " ", "polygon": 0.0, 0.0 , 612.0, 0.0 , 612.0, 792.0 , 0.0, 792.0 , "bbox": 0.0, 0.0, 612.0, 792.0 , "children": { "id": "/page/0/SectionHeader/0", "block type": "SectionHeader", "html": " Think Python ", "polygon": 398.935546875, 265.095703125 , 525.6013793945312, 265.095703125 , 525.6013793945312, 289.6333312988281 , 398.935546875, 289.6333312988281 , "bbox": 398.935546875, 265.095703125, 525.6013793945312, 289.6333312988281 , "children": null, "section hierarchy": { "2": "/page/0/SectionHeader/0" }, "images": {} }, { "id": "/page/0/SectionHeader/1", "block type": "SectionHeader", "html": " How to Think Like a Computer… Evidence: `data/examples/json/thinkpython.json`
- **Multicolcnn Meta** (structured_config): { "table of contents": { "title": "An Aggregated Multicolumn Dilated Convolution Network\nfor Perspective-Free Counting", "heading level": null, "page id": 0, "polygon": 117.5888671875, 105.9219970703125 , 477.371826171875, 105.9219970703125 , 477.371826171875, 138.201171875 , 117.5888671875, 138.201171875 }, { "title": "Abstract", "heading level": null, "page id": 0, "polygon": 144.1845703125, 232.4891357421875 , 190.48028564453125, 232.4891357421875 , 190.48028564453125, 244.4443359375 , 144.1845703125, 244.4443359375 }, { "title": "1. Introduction", "heading level": null, "page id": 0, "polygon": 50.016357421875, 512.06591796875 , 128.49609375, 512.06591796875 , 128.49609375, 524.0211181… Evidence: `data/examples/markdown/multicolcnn/multicolcnn_meta.json`
- **Switch Trans Meta** (structured_config): { "table of contents": { "title": "Switch Transformers: Scaling to Trillion Parameter Models\nwith Simple and Efficient Sparsity", "heading level": null, "page id": 0, "polygon": 93.0849609375, 101.5679931640625 , 515.77734375, 101.5679931640625 , 515.77734375, 133.84625244140625 , 93.0849609375, 133.84625244140625 }, { "title": "William Fedus\u2217", "heading level": null, "page id": 0, "polygon": 89.2001953125, 151.53192138671875 , 174.6650390625, 151.53192138671875 , 174.6650390625, 164.20635986328125 , 89.2001953125, 164.20635986328125 }, { "title": "Barret Zoph\u2217", "heading level": null, "page id": 0, "polygon": 90.00000762939453, 181.7108154296875 , 165.849609375, 181.710815429687… Evidence: `data/examples/markdown/switch_transformers/switch_trans_meta.json`
- **Thinkpython Meta** (structured_config): { "table of contents": { "title": "Think Python", "heading level": null, "page id": 0, "polygon": 398.935546875, 265.095703125 , 525.6013793945312, 265.095703125 , 525.6013793945312, 289.6333312988281 , 398.935546875, 289.6333312988281 }, { "title": "How to Think Like a Computer Scientist", "heading level": null, "page id": 0, "polygon": 267.3017578125, 306.861328125 , 525.6033325195312, 306.861328125 , 525.6033325195312, 323.876953125 , 267.3017578125, 323.876953125 }, { "title": "Think Python", "heading level": null, "page id": 2, "polygon": 398.63671875, 264.90234375 , 525.6013793945312, 264.90234375 , 525.6013793945312, 289.6333312988281 , 398.63671875, 289.6333312988281 }, { "title": "… Evidence: `data/examples/markdown/thinkpython/thinkpython_meta.json`
- **Cla** (structured_config): { "signedContributors": { "name": "korakot", "id": 3155646, "comment id": 2143359366, "created at": "2024-06-01T08:25:52Z", "repoId": 712111618, "pullRequestNo": 161 }, { "name": "tosaddler", "id": 13705399, "comment id": 2144014410, "created at": "2024-06-02T20:40:52Z", "repoId": 712111618, "pullRequestNo": 165 }, { "name": "q2333gh", "id": 32679742, "comment id": 2156122900, "created at": "2024-06-08T18:01:39Z", "repoId": 712111618, "pullRequestNo": 176 }, { "name": "q2333gh", "id": 32679742, "comment id": 2156614334, "created at": "2024-06-09T13:48:49Z", "repoId": 712111618, "pullRequestNo": 176 }, { "name": "aniketinamdar", "id": 79044809, "comment id": 2157453610, "created at": "2024-0… Evidence: `signatures/version1/cla.json`
- **Byte-compiled / optimized / DLL files**（source_file）：private.py .DS Store local.env experiments test data training wandb .dat report.json benchmark data debug data temp.md temp conversion results uploads /cache 证据：`.gitignore`
- **.Pre Commit Config**（source_file）：repos: - repo: https://github.com/astral-sh/ruff-pre-commit rev: v0.9.10 hooks: - id: ruff types or: python, pyi args: --fix - id: ruff-format types or: python, pyi 证据：`.pre-commit-config.yaml`
- **Model License**（source_file）：AI PUBS OPEN RAIL-M LICENSE MODIFIED 证据：`MODEL_LICENSE`
- **Verify Scores**（source_file）：def verify scores file path ⋮---- data = json.load file ⋮---- raw scores = data "scores" k for k in data "scores" marker scores = r "marker" "heuristic" "score" for r in raw scores marker score = sum marker scores / len marker scores ⋮---- def verify table scores file path ⋮---- avg = sum r "marker score" for r in data "marker" / len data ⋮---- parser = argparse.ArgumentParser description="Verify benchmark scores" ⋮---- args = parser.parse args 证据：`benchmarks/verify_scores.py`
- **data/.gitignore**（source_file）：latex pdfs references 证据：`data/.gitignore`
- **Latex To Md**（source_file）：FILES=$ find latex -name " .tex" for f in $FILES do echo "Processing $f file..." base name=$ basename "$f" .tex out file="references/${base name}.md" pandoc --wrap=none \ --no-highlight \ --strip-comments \ --from=latex \ --to=commonmark x+pipe tables \ "$f" \ -o "$out file" sed -i .bak 's/ / /g' "$out file" sed -i .bak 's/ / /g' "$out file" sed -i .bak 's/ / /g' "$out file" sed -i .bak 's/ / /g' "$out file" sed -i.bak -E 's/ \\cite //g; s/ //g; s/\{ ^} \}//g; s/\\cite\{ ^} \}//g' "$out file" sed -i.bak -E ' s/ \\cite //g; Remove \cite commands inside backticks s/::: //g; Remove the leading ::: for content markers s/\ //g; Remove opening square bracket s/\ //g; Remove closing square bracket… 证据：`data/latex_to_md.sh`
- **Create/load models**（source_file）：app = modal.App "datalab-marker-modal-demo" GPU TYPE = "L40S" MODEL PATH PREFIX = "/root/.cache/datalab/models" ⋮---- image = ⋮---- models volume = modal.Volume.from name "marker-models-modal-demo", create if missing=True ⋮---- def setup models with cache check logger, commit volume=False ⋮---- models dir exists = os.path.exists MODEL PATH PREFIX models dir contents = os.listdir MODEL PATH PREFIX if models dir exists else ⋮---- Create/load models models = create model dict ⋮---- contents = os.listdir MODEL PATH PREFIX ⋮---- Commit volume if requested for download function ⋮---- def download models ⋮---- logger = logging.getLogger name ⋮---- models = setup models with cache check logger, com… 证据：`examples/marker_modal_deployment.py`
- **Logger**（source_file）：def configure logging ⋮---- logger = get logger ⋮---- handler = logging.StreamHandler formatter = logging.Formatter ⋮---- def get logger 证据：`marker/logger.py`
- **Output**（source_file）：def unwrap outer tag html: str ⋮---- soup = BeautifulSoup html, "html.parser" contents = list soup.contents ⋮---- def json to html block: JSONBlockOutput BlockOutput ⋮---- child html = json to html child for child in block.children child ids = child.id for child in block.children ⋮---- soup = BeautifulSoup block.html, "html.parser" content refs = soup.find all "content-ref" ⋮---- src id = ref.attrs "src" ⋮---- child soup = BeautifulSoup ⋮---- def output exists output dir: str, fname base: str ⋮---- exts = "md", "html", "json" ⋮---- def text from rendered rendered: BaseModel ⋮---- from marker.renderers.chunk import ChunkOutput Has an import from this file ⋮---- def convert if not rgb image:… 证据：`marker/output.py`
- **General models**（source_file）：class Settings BaseSettings ⋮---- BASE DIR: str = os.path.dirname os.path.dirname os.path.abspath file OUTPUT DIR: str = os.path.join BASE DIR, "conversion results" FONT DIR: str = os.path.join BASE DIR, "static", "fonts" DEBUG DATA FOLDER: str = os.path.join BASE DIR, "debug data" ARTIFACT URL: str = "https://models.datalab.to/artifacts" FONT NAME: str = "GoNotoCurrent-Regular.ttf" FONT PATH: str = os.path.join FONT DIR, FONT NAME LOGLEVEL: str = "INFO" ⋮---- OUTPUT ENCODING: str = "utf-8" OUTPUT IMAGE FORMAT: str = "JPEG" ⋮---- GOOGLE API KEY: Optional str = "" ⋮---- General models TORCH DEVICE: Optional str = ⋮---- None Note: MPS device does not work for text detection, and will default… 证据：`marker/settings.py`
- **Modification of unwrap math from surya.recognition**（source_file）：OPENING TAG REGEX = re.compile r" ? " CLOSING TAG REGEX = re.compile r" " TAG MAPPING = { ⋮---- def strings to classes items: List str - List type ⋮---- classes = ⋮---- module = import module module name ⋮---- def classes to strings items: List type - List str ⋮---- def verify config keys obj ⋮---- annotations = inspect.get annotations obj. class ⋮---- none vals = "" ⋮---- value = getattr obj, attr name ⋮---- def assign config cls, config: BaseModel dict None ⋮---- cls name = cls. class . name ⋮---- dict config = config.dict ⋮---- dict config = config ⋮---- split k = k.removeprefix cls name + " " ⋮---- def parse range str range str: str - List int ⋮---- range lst = range str.split "," page… 证据：`marker/util.py`
- **Optional dependencies for documents**（source_file）：tool.poetry name = "marker-pdf" version = "1.10.2" description = "Convert documents to markdown with high speed and accuracy." authors = "Vik Paruchuri " readme = "README.md" license = "GPL-3.0-or-later" repository = "https://github.com/VikParuchuri/marker" keywords = "pdf", "markdown", "ocr", "nlp" packages = {include = "marker"} include = "marker/scripts/ .sh", "marker/scripts/ .html", 证据：`pyproject.toml`
- **Pytest**（source_file）：pytest testpaths=tests markers = filename name : specify the filename for the pdf document fixture filterwarnings = ignore::Warning 证据：`pytest.ini`
- **Dataset**（source_file）：def build dataset bench dataset: datasets.Dataset, result: FullResult, score types: List str , max rows: int None = None - datasets.Dataset ⋮---- rows = ⋮---- row = { ⋮---- method cls = METHOD REGISTRY method md = result "markdown" idx method ⋮---- method img = method cls.render result "markdown" idx method ⋮---- method img = PIL.Image.new "RGB", 200, 200 ⋮---- row f"{method} {score type}" = -1.0 Missing score ⋮---- ds = datasets.Dataset.from list rows 证据：`benchmarks/overall/display/dataset.py`
- **Table**（source_file）：def write table title: str, rows: list, headers: list, out path: Path, filename: str ⋮---- table = tabulate.tabulate rows, headers=headers, tablefmt="github" ⋮---- document types = list result "averages by type" default method default score type .keys headers = "Document Type" ⋮---- document rows = k for k in document types ⋮---- avg score = sum result "averages by type" method score type doc type / max 1, len result "averages by type" method score type doc type ⋮---- headers = "Block Type" block types = list result "averages by block type" default method default score type .keys block score types = list result "averages by block type" default method .keys ⋮---- block rows = k for k in bloc… 证据：`benchmarks/overall/display/table.py`
- **Base**（source_file）：class Downloader ⋮---- cache path: Path = Path "cache" service: str ⋮---- def init self, api key, app id, max rows: int = 2200 ⋮---- def get html self, pdf bytes ⋮---- def upload ds self ⋮---- rows = ⋮---- data = json.load f ⋮---- out ds = datasets.Dataset.from list rows, features=datasets.Features { ⋮---- def generate data self ⋮---- max rows = self.max rows ⋮---- cache file = self.cache path / f"{idx}.json" ⋮---- pdf bytes = sample "pdf" ⋮---- out data = self.get html pdf bytes ⋮---- def call self 证据：`benchmarks/overall/download/base.py`
- **Llamaparse**（source_file）：class LlamaParseDownloader Downloader ⋮---- service = "llamaparse" ⋮---- def get html self, pdf bytes ⋮---- rand name = str time.time + ".pdf" start = time.time buff = io.BytesIO pdf bytes md = upload and parse file self.api key, rand name, buff end = time.time ⋮---- md = md.decode "utf-8" ⋮---- def upload and parse file api key: str, fname: str, buff, max retries: int = 180, delay: int = 1 ⋮---- headers = { ⋮---- files = { response = requests.post ⋮---- job id = response.json 'id' ⋮---- status response = requests.get ⋮---- result response = requests.get 证据：`benchmarks/overall/download/llamaparse.py`
- **Main**（source_file）：@click.command "Download data from inference services" @click.argument "service", type=click.Choice "mathpix", "llamaparse", "mistral" @click.option "--max rows", type=int, default=2200 @click.option "--api key", type=str, default=None @click.option "--app id", type=str, default=None def main service: str, max rows: int, api key: str, app id: str ⋮---- registry = { downloader = registry service api key, app id, max rows=max rows 证据：`benchmarks/overall/download/main.py`
- **Mathpix**（source_file）：class MathpixDownloader Downloader ⋮---- service = "mathpix" ⋮---- def get html self, pdf bytes ⋮---- headers = { start = time.time pdf id = mathpix request pdf bytes, headers status = mathpix status pdf id, headers ⋮---- md = "" ⋮---- md = mathpix results pdf id, headers end = time.time ⋮---- md = md.decode "utf-8" ⋮---- def mathpix request buffer, headers ⋮---- response = requests.post "https://api.mathpix.com/v3/pdf", data = response.json pdf id = data "pdf id" ⋮---- def mathpix status pdf id, headers ⋮---- max iters = 120 i = 0 status = "processing" status2 = "processing" ⋮---- response = requests.get f"https://api.mathpix.com/v3/converter/{pdf id}", status resp = response.json ⋮---- st… 证据：`benchmarks/overall/download/mathpix.py`
- **Mistral**（source_file）：class MistralDownloader Downloader ⋮---- service = "mistral" ⋮---- def get html self, pdf bytes ⋮---- rand name = str time.time + ".pdf" start = time.time buff = io.BytesIO pdf bytes md = upload and process file self.api key, rand name, buff end = time.time ⋮---- md = md.decode "utf-8" ⋮---- def upload and process file api key: str, fname: str, buff ⋮---- headers = { ⋮---- upload headers = headers.copy files = { ⋮---- upload response = requests.post ⋮---- file id = upload response.json 'id' ⋮---- url headers = headers.copy ⋮---- url response = requests.get ⋮---- signed url = url response.json 'url' ⋮---- ocr headers = headers.copy ⋮---- ocr data = { ocr response = requests.post ⋮---- result… 证据：`benchmarks/overall/download/mistral.py`
- **Elo**（source_file）：rating prompt = """ ⋮---- class ComparerSchema BaseModel ⋮---- image description: str version a description: str version b description: str comparison: str winner: Literal "version a", "version b" ⋮---- class Comparer ⋮---- def init self ⋮---- hydrated prompt = rating prompt.replace "{{version a}}", version a .replace "{{version b}}", version b ⋮---- rating = self.llm rater img, hydrated prompt ⋮---- def llm rater self, img: Image.Image, prompt: str ⋮---- response = self.llm response wrapper ⋮---- client = genai.Client ⋮---- responses = client.models.generate content output = responses.candidates 0 .content.parts 0 .text ⋮---- def display win rates table win rates: dict ⋮---- table = header… 证据：`benchmarks/overall/elo.py`
- **Replace placeholders** (source_file): def init self, kwargs ⋮---- @staticmethod def convert to md html: str ⋮---- md = MarkdownRenderer markdown = md.md cls.convert html ⋮---- def call self, sample - BenchmarkResult ⋮---- def render self, markdown: str ⋮---- @staticmethod def convert to html md: str ⋮---- block placeholders = inline placeholders = ⋮---- def block sub match ⋮---- content = match.group 1 placeholder = f"1BLOCKMATH{len block placeholders }1" ⋮---- def inline sub match ⋮---- placeholder = f"1INLINEMATH{len inline placeholders }1" ⋮---- md = re.sub r'\${2} . ? \${2}', block sub, md, flags=re.DOTALL md = re.sub r'\$ . ? \$', inline sub, md ⋮---- html = markdown2.markdown md, extras= 'tables' ⋮---- Replace placeholder… Evidence: `benchmarks/overall/methods/__init__.py`
- **Docling** (source_file): class DoclingMethod BaseMethod ⋮---- model dict: dict = None use llm: bool = False ⋮---- def call self, sample - BenchmarkResult ⋮---- pdf bytes = sample "pdf" converter = DocumentConverter ⋮---- start = time.time result = converter.convert f.name total = time.time - start Evidence: `benchmarks/overall/methods/docling.py`
- **Gt** (source_file): class GTMethod BaseMethod ⋮---- def call self, sample - BenchmarkResult ⋮---- gt blocks = json.loads sample "gt blocks" gt html = block "html" for block in gt blocks if len block "html" 0 gt markdown = self.convert to md block for block in gt html ⋮---- def render self, html: List str - Image.Image ⋮---- joined = "\n\n".join html html = f""" Evidence: `benchmarks/overall/methods/gt.py`
- **Llamaparse** (source_file): class LlamaParseMethod BaseMethod ⋮---- llamaparse ds: datasets.Dataset = None ⋮---- def call self, sample - BenchmarkResult ⋮---- uuid = sample "uuid" data = None ⋮---- data = row Evidence: `benchmarks/overall/methods/llamaparse.py`
- **Marker** (source_file): class MarkerMethod BaseMethod ⋮---- model dict: dict = None use llm: bool = False ⋮---- def call self, sample - BenchmarkResult ⋮---- pdf bytes = sample "pdf" parser = ConfigParser { ⋮---- block converter = PdfConverter ⋮---- start = time.time rendered = block converter f.name total = time.time - start Evidence: `benchmarks/overall/methods/marker.py`
- **Mathpix** (source_file): class MathpixMethod BaseMethod ⋮---- mathpix ds: datasets.Dataset = None ⋮---- def call self, sample - BenchmarkResult ⋮---- uuid = sample "uuid" data = None ⋮---- data = row Evidence: `benchmarks/overall/methods/mathpix.py`
- **Mistral** (source_file): class MistralMethod BaseMethod ⋮---- mistral ds: datasets.Dataset = None ⋮---- def call self, sample - BenchmarkResult ⋮---- uuid = sample "uuid" data = None ⋮---- data = row Evidence: `benchmarks/overall/methods/mistral.py`
- **Apply the chat template and processor** (source_file): def convert single page filename: str, model, processor, device ⋮---- image base64 = render pdf to base64png filename, 1, target longest image dim=1024 ⋮---- anchor text = get anchor text filename, 1, pdf engine="pdfreport", target length=4000 prompt = build finetuning prompt anchor text ⋮---- messages = ⋮---- Apply the chat template and processor text = processor.apply chat template messages, tokenize=False, add generation prompt=True main image = Image.open BytesIO base64.b64decode image base64 ⋮---- inputs = processor inputs = {key: value.to device for key, value in inputs.items } ⋮---- output = model.generate ⋮---- prompt length = inputs "input ids" .shape 1 new tokens = output :, promp… Evidence: `benchmarks/overall/methods/olmocr.py`
- **Schema** (source_file): class BenchmarkResult TypedDict ⋮---- markdown: str List str time: float None Evidence: `benchmarks/overall/methods/schema.py`
- **Ensure marker is always first** (source_file): def get method scores benchmark dataset: datasets.Dataset, methods: List str , score types: List str , artifacts: dict, max rows=None - FullResult ⋮---- bench scores = {} averages by type = defaultdict lambda: defaultdict lambda: defaultdict list averages by block type = defaultdict lambda: defaultdict lambda: defaultdict list average times = defaultdict list markdown by method = defaultdict dict total rows = len benchmark dataset ⋮---- total rows = min max rows, total rows ⋮---- doc type = sample "classification" gt cls = METHOD REGISTRY "gt" gt blocks = json.loads sample "gt blocks" gt md = gt cls artifacts sample "markdown" ⋮---- out data = defaultdict dict ⋮---- method cls = METHOD REGI… Evidence: `benchmarks/overall/overall.py`
- **Registry** (source_file): SCORE REGISTRY = { ⋮---- METHOD REGISTRY = { Evidence: `benchmarks/overall/registry.py`
- **Schema** (source_file): AVG TYPE = Dict str, Dict str, Dict str, List float ⋮---- class FullResult TypedDict ⋮---- scores: Dict int, Dict str, Dict str, BlockScores averages by type: AVG TYPE averages by block type: AVG TYPE average times: Dict str, List float markdown: Dict int, Dict str, str Evidence: `benchmarks/overall/schema.py`
- **Init** (source_file): class BaseScorer ⋮---- def init self ⋮---- def call self, sample, gt markdown: List str , method markdown: str - BlockScores Evidence: `benchmarks/overall/scorers/__init__.py`
- **Replace image urls with a generic tag** (source_file): class MarkdownCleaner ⋮---- def init self ⋮---- def call self, markdown ⋮---- markdown = self.normalize markdown markdown ⋮---- pattern = r' ? ", "\n" markdown = re.sub r" . ? ", r"\1", markdown markdown = re.sub r" . ? ", r"\1", markdown markdown = re.sub r" . ? ", r"\1", markdown Remove span tags and keep content ⋮---- Clean up markdown formatting markdown = re.sub r"\s+", " ", markdown markdown = re.sub r"\n+", "\n", markdown markdown = re.sub "\\.+", ".", ⋮---- markdown Replace repeated periods with a single period, like in table of contents markdown = re.sub " +", " ", markdown Replace repeated headers with a single header markdown = markdown.encode .decode 'unicode-escape', errors="ig… Evidence: `benchmarks/overall/scorers/clean.py`
- **Heuristic** (source_file): class HeuristicScorer BaseScorer ⋮---- def call self, sample, gt markdown: List str , method markdown: str - BlockScores ⋮---- gt markdown = self.clean input block for block in gt markdown method markdown = self.clean input method markdown ⋮---- alignments = self.find fuzzy alignments method markdown, gt markdown scores = alignment "score" for alignment in alignments ⋮---- orders = alignment "start" for alignment in alignments correct order = list range len gt markdown actual order = sorted range len gt markdown , key=lambda x: orders x order score = self.kendall tau correct order, actual order ⋮---- gt weights = len g for g in gt markdown weighted scores = score weight for score, weight in… Evidence: `benchmarks/overall/scorers/heuristic.py`
- **Llm** (source_file): rating prompt = """ ⋮---- comparison keys = "comparison" description keys = "image description", "markdown description" text keys = comparison keys + description keys score keys = "overall", "text", "formatting", "section headers", "tables", "forms", "equations", ⋮---- class LLMScorer BaseScorer ⋮---- def call self, sample, gt markdown: List str , markdown: str - BlockScores ⋮---- pdf bytes = sample "pdf" ⋮---- doc = pdfium.PdfDocument f.name img = doc 0 .render scale=96/72 .to pil ⋮---- def llm rater self, img: Image.Image, markdown: str - BlockScores ⋮---- null scores = {k: 1 for k in score keys} text scores = {k: "" for k in text keys} ⋮---- req keys = text keys + score keys properties =… Evidence: `benchmarks/overall/scorers/llm.py`
- **Schema** (source_file): class BlockScores TypedDict ⋮---- score: float specific scores: Dict str, float List float Evidence: `benchmarks/overall/scorers/schema.py`
- **Gemini** (source_file): prompt = """ ⋮---- class TableSchema BaseModel ⋮---- table html: str ⋮---- def gemini table rec image: Image.Image ⋮---- client = genai.Client ⋮---- image bytes = BytesIO ⋮---- responses = client.models.generate content ⋮---- output = responses.candidates 0 .content.parts 0 .text Evidence: `benchmarks/table/gemini.py`
- **Normalize the bboxes** (source_file): def extract tables children: List JSONBlockOutput ⋮---- tables = ⋮---- def fix table html table html: str - str ⋮---- marker table soup = BeautifulSoup table html, 'html.parser' tbody = marker table soup.find 'tbody' ⋮---- marker table html = str marker table soup marker table html = marker table html.replace "\n", " " Fintabnet uses spaces instead of newlines ⋮---- def inference tables dataset, use llm: bool, table rec batch size: int None, max rows: int, use gemini: bool ⋮---- models = create model dict config parser = ConfigParser {'output format': 'json', "use llm": use llm, "table rec batch size": table rec batch size, "disable tqdm": True} total unaligned = 0 results = ⋮---- iteration… Evidence: `benchmarks/table/inference.py`
- **Sets self.name and self.children** (source_file): def wrap table html table html:str - str ⋮---- class TableTree Tree ⋮---- def init self, tag, colspan=None, rowspan=None, content=None, children ⋮---- Sets self.name and self.children ⋮---- def bracket self ⋮---- """Show tree using brackets notation""" ⋮---- result = '"tag": %s, "colspan": %d, "rowspan": %d, "text": %s' % \ ⋮---- result = '"tag": %s' % self.tag ⋮---- class CustomConfig Config ⋮---- @staticmethod def maximum sequences ⋮---- def normalized distance self, sequences ⋮---- def rename self, node1, node2 ⋮---- def tokenize node ⋮---- def tree convert html node, convert cell=False, parent=None ⋮---- tokens = ⋮---- cell = tokens 1:-1 .copy ⋮---- cell = new node = TableTree node.tag,… Evidence: `benchmarks/table/scoring.py`
- **Table** (source_file): def update teds score result, prefix: str = "marker" ⋮---- score = similarity eval html prediction, ground truth ⋮---- start = time.time ⋮---- dataset = datasets.load dataset dataset, split='train' dataset = dataset.shuffle seed=0 ⋮---- marker results = list ⋮---- avg score = sum r "marker score" for r in marker results / len marker results headers = "Avg score", "Total tables" data = f"{avg score:.3f}", len marker results gemini results = None ⋮---- gemini results = list avg gemini score = sum r "gemini score" for r in gemini results / len gemini results ⋮---- table = tabulate data , headers=headers, tablefmt="github" ⋮---- results = { ⋮---- out path = Path result path Evidence: `benchmarks/table/table.py`
- **Main** (source_file): def get next pdf ds: datasets.Dataset, i: int ⋮---- pdf = ds i "pdf" filename = ds i "filename" ⋮---- i = 0 ⋮---- ds = datasets.load dataset "datalab-to/pdfs", split="train" model dict = create model dict ⋮---- times = ⋮---- pages = 0 chars = 0 ⋮---- min time = time.time ⋮---- pdf doc = pdfium.PdfDocument pdf page count = len pdf doc ⋮---- page range chunks = list range 0, page count, chunksize ⋮---- chunk end = min chunk start + chunksize, page count page range = list range chunk start, chunk end ⋮---- block converter = PdfConverter start = time.time rendered = block converter f.name ⋮---- total = time.time - start ⋮---- max gpu vram = torch.cuda.max memory reserved / 1024 3 max time = tim… Evidence: `benchmarks/throughput/main.py`
- **Init** (source_file): class BaseBuilder ⋮---- def init self, config: Optional BaseModel dict = None ⋮---- def call self, data, args, kwargs Evidence: `marker/builders/__init__.py`
- **Document** (source_file): class DocumentBuilder BaseBuilder ⋮---- lowres image dpi: Annotated highres image dpi: Annotated disable ocr: Annotated ⋮---- def call self, provider: PdfProvider, layout builder: LayoutBuilder, line builder: LineBuilder, ocr builder: OcrBuilder ⋮---- document = self.build document provider ⋮---- def build document self, provider: PdfProvider ⋮---- PageGroupClass: PageGroup = get block class BlockTypes.Page lowres images = provider.get images provider.page range, self.lowres image dpi highres images = provider.get images provider.page range, self.highres image dpi initial pages = DocumentClass: Document = get block class BlockTypes.Document Evidence: `marker/builders/document.py`
- **Layout** (source_file): class LayoutBuilder BaseBuilder ⋮---- layout batch size: Annotated force layout block: Annotated disable tqdm: Annotated expand block types: Annotated max expand frac: Annotated ⋮---- def init self, layout model: LayoutPredictor, config=None ⋮---- def call self, document: Document, provider: PdfProvider ⋮---- layout results = self.forced layout document.pages ⋮---- layout results = self.surya layout document.pages ⋮---- def get batch size self ⋮---- def forced layout self, pages: List PageGroup - List LayoutResult ⋮---- layout results = ⋮---- def surya layout self, pages: List PageGroup - List LayoutResult ⋮---- layout results = self.layout model ⋮---- def expand layout blocks self, documen… Evidence: `marker/builders/layout.py`
- For the remaining 6 evidence entries, see `AI_CONTEXT_PACK.json` or `EVIDENCE_INDEX.json`.
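
The `benchmarks/overall/methods/__init__.py` entry in the list above outlines a placeholder technique: math spans are swapped for tokens before the markdown-to-HTML step, then restored afterwards. The following is a minimal Python sketch of that idea only. The function names `protect_math` and `restore_math` are hypothetical, and the sketch restores the original `$...$` text rather than reproducing whatever markup the benchmark code actually emits.

```python
import re


def protect_math(md: str):
    """Swap $$...$$ and $...$ spans for placeholder tokens (hypothetical helper)."""
    block_placeholders = []
    inline_placeholders = []

    def block_sub(match):
        placeholder = f"1BLOCKMATH{len(block_placeholders)}1"
        block_placeholders.append(match.group(1))
        return placeholder

    def inline_sub(match):
        placeholder = f"1INLINEMATH{len(inline_placeholders)}1"
        inline_placeholders.append(match.group(1))
        return placeholder

    # Handle block math first so "$$" is not read as two inline delimiters.
    md = re.sub(r"\${2}(.*?)\${2}", block_sub, md, flags=re.DOTALL)
    md = re.sub(r"\$(.*?)\$", inline_sub, md)
    return md, block_placeholders, inline_placeholders


def restore_math(text: str, block_placeholders, inline_placeholders) -> str:
    """Put the saved math spans back in place of their tokens."""

    def block_back(match):
        return "$$" + block_placeholders[int(match.group(1))] + "$$"

    def inline_back(match):
        return "$" + inline_placeholders[int(match.group(1))] + "$"

    text = re.sub(r"1BLOCKMATH(\d+)1", block_back, text)
    return re.sub(r"1INLINEMATH(\d+)1", inline_back, text)
```

Any markdown converter could run between the two calls; the tokens contain no markdown syntax, so the converter should leave them alone.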

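## Rules the Host AI Must Follow

- **Treat this asset as pre-work context, not as a runtime environment**: The AI Context Pack contains only an evidence-backed understanding of the project, not the executable state of the target project. Evidence: `README.md`, `examples/README.md`, `LICENSE`
- **When answering the user, separate what can be previewed from what can only be verified after installation**: The value of a pre-install experience comes from reducing mistaken installs and misjudgments, not from posing as a real run. Evidence: `README.md`, `examples/README.md`, `LICENSE`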

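## Questions the User Should Answer Before Starting

- Which host AI or local environment do you plan to use it in?
- Do you only want to try the workflow first, or are you preparing for a real installation?
- What matters most to you: installation cost, output quality, or conflicts with your existing rules?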

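## Acceptance Criteria

- Every capability claim can be traced back to a file path in evidence_refs.
- AI_CONTEXT_PACK.md does not present a preview as a real run.
- The user can understand within 3 minutes who it suits, what it can do, how to start, and where the risk boundaries are.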
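
The first criterion lends itself to a mechanical check. The sketch below assumes a hypothetical in-memory shape for the pack: a list of claim dictionaries with `claim_id` and `evidence_paths` keys, plus a collection of `evidence_refs` paths. The actual schema of `AI_CONTEXT_PACK.json` is not shown in this document and may differ.

```python
def untraceable_claims(claims, evidence_refs):
    """Return the ids of claims whose evidence paths are missing from evidence_refs.

    `claims` and `evidence_refs` use a hypothetical shape chosen for illustration.
    """
    refs = set(evidence_refs)
    missing = []
    for claim in claims:
        paths = claim.get("evidence_paths", [])
        # A claim with no paths, or with any unknown path, fails the criterion.
        if not paths or not set(paths) <= refs:
            missing.append(claim["claim_id"])
    return missing
```

An empty result would mean every claim in the sample traces back to a listed path; it would not show that the cited file actually supports the claim.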

---

## Doramagic Context Augmentation

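The content below reinforces the main body of the Repomix/AI Context Pack. The Human Manual provides only a reading skeleton; the pitfall log is converted into working constraints that the host AI must follow.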

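## Human Manual Skeleton

Usage rule: this section is only a reading route and a set of salience signals for the project, not a factual authority. Specific facts must still be traced back to repo evidence or the Claim Graph.

Hard rules for the host AI:
- Do not treat page titles, chapter order, summaries, or importance as evidence of project facts.
- When explaining the Human Manual skeleton, state explicitly that it is only a reading route and salience signal.
- Judgments about capabilities, installation, compatibility, runtime status, and risk must cite repo evidence, a source path, or the Claim Graph.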

- **Marker Overview**: importance `high`
  - source_paths: README.md, marker/__init__.py, pyproject.toml
- **Installation Guide**: importance `high`
  - source_paths: pyproject.toml, marker/settings.py, marker/utils/gpu.py
- **System Architecture**: importance `high`
  - source_paths: marker/providers/__init__.py, marker/builders/__init__.py, marker/processors/__init__.py, marker/renderers/__init__.py, marker/converters/__init__.py
- **Converters in Detail**: importance `high`
  - source_paths: marker/converters/pdf.py, marker/converters/table.py, marker/converters/ocr.py, marker/converters/extraction.py, marker/scripts/convert.py
- **File Format Providers**: importance `medium`
  - source_paths: marker/providers/registry.py, marker/providers/pdf.py, marker/providers/image.py, marker/providers/html.py, marker/providers/epub.py
- **Processors in Detail**: importance `high`
  - source_paths: marker/processors/__init__.py, marker/processors/text.py, marker/processors/table.py, marker/processors/equation.py, marker/processors/llm/__init__.py
- **Renderers and Output Formats**: importance `high`
  - source_paths: marker/renderers/markdown.py, marker/renderers/json.py, marker/renderers/html.py, marker/renderers/chunk.py, marker/output.py
- **LLM Integration and Hybrid Mode**: importance `high`
  - source_paths: marker/services/__init__.py, marker/services/gemini.py, marker/services/claude.py, marker/services/openai.py, marker/services/ollama.py

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `6ae38895d6e11cbc8fb4a60a0750b3bac479e304`
- inspected_files: `pyproject.toml`, `README.md`, `examples/README.md`, `examples/marker_modal_deployment.py`

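Hard rules for the host AI:
- Without repo_clone_verified=true, do not claim to have read the source code.
- Without repo_inspection_verified=true, do not present judgments drawn from README/docs/package files as facts.
- Without quick_start_verified=true, do not claim that the Quick Start has been run successfully.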
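
The three rules above amount to a gate keyed on verification flags. A minimal Python sketch of such a gate follows; the class `InspectionEvidence`, the function `allowed_statements`, and the result keys are hypothetical names. The sketch also assumes that a flag absent from the pack (as `quick_start_verified` is in this section) should be treated as false.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InspectionEvidence:
    """Verification flags; anything not stated in the pack defaults to False."""

    repo_clone_verified: bool = False
    repo_inspection_verified: bool = False
    quick_start_verified: bool = False


def allowed_statements(evidence: InspectionEvidence) -> dict:
    """Map each kind of statement to whether the host AI may make it."""
    return {
        "has_read_source": evidence.repo_clone_verified,
        "readme_judgments_as_facts": evidence.repo_inspection_verified,
        "quick_start_ran": evidence.quick_start_verified,
    }
```

With the flags listed in this section (clone and inspection verified, Quick Start not mentioned), the gate would permit the first two kinds of statement and block the third.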

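## Doramagic Pitfall Constraints

These rules come from project-specific pitfalls found during Doramagic's discovery, verification, or compilation process. The host AI must treat them as working constraints, not as ordinary explanatory text.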

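### Constraint 1: Source evidence: [BUG: Breaking]

- Trigger: GitHub community evidence shows an installation-related issue for this project that is pending verification: [BUG: Breaking]
- Host AI rule: The source issue is still open; the Pack Agent needs to recheck whether it still affects the current version.
- Why it matters: It may block installation or the first run.
- Evidence: community_evidence:github | cevd_5e263773fc84449f88bdf5f4ec5dfeba | https://github.com/datalab-to/marker/issues/1032 | The source discussion mentions python-related conditions; recheck before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.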

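### Constraint 2: Source evidence: [BUG: Breaking] Marker is 20x+ slower since v1.9.0+ in Mac

- Trigger: GitHub community evidence shows an installation-related issue for this project that is pending verification: [BUG: Breaking] Marker is 20x+ slower since v1.9.0+ in Mac
- Host AI rule: The source issue is still open; the Pack Agent needs to recheck whether it still affects the current version.
- Why it matters: It may affect upgrade, migration, or version choice.
- Evidence: community_evidence:github | cevd_310c8ea2147f416597bcff9cc1438928 | https://github.com/datalab-to/marker/issues/960 | The source discussion mentions python-related conditions; recheck before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.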
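
One way to turn this constraint into a pre-install check is to compare a version string against the v1.9.0 threshold named in the issue title. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the distribution name `marker-pdf` is assumed rather than confirmed by this pack. The check only flags versions for re-verification on macOS; it does not say whether the issue has been fixed in any release.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Threshold taken from the issue title ("since v1.9.0+").
SLOWDOWN_THRESHOLD = (1, 9, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract leading numeric components, padded to three parts."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(1).split("."))
    return parts + (0,) * (3 - len(parts))


def needs_mac_speed_recheck(version: str) -> bool:
    """True when the version is at or above the threshold in the issue title."""
    return parse_version(version) >= SLOWDOWN_THRESHOLD


def installed_version(dist_name: str = "marker-pdf") -> Optional[str]:
    """Look up an installed distribution; the default name is an assumption."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A `True` result is a prompt to measure conversion speed on the target Mac before committing to that version, nothing more.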

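### Constraint 3: Source evidence: [BUG: Breaking] missing dependency: psutil

- Trigger: GitHub community evidence shows an installation-related issue for this project that is pending verification: [BUG: Breaking] missing dependency: psutil
- Host AI rule: The source issue is still open; the Pack Agent needs to recheck whether it still affects the current version.
- Why it matters: It may affect upgrade, migration, or version choice.
- Evidence: community_evidence:github | cevd_7b635cb675114e8fa34251c940ce4a92 | https://github.com/datalab-to/marker/issues/818 | The source discussion mentions python-related conditions; recheck before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.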
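
A quick way to recheck this constraint in a target environment is to ask Python whether the module is importable. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical. It reports only what is present in the current environment and does not establish whether the current marker release declares `psutil` as a dependency.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means psutil is importable here.
    print(missing_modules(["psutil"]))
```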

### Constraint 4: Failure mode: installation: [BUG: Breaking]

- Trigger: Developers should check this installation risk before relying on the project: [BUG: Breaking]
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: [BUG: Breaking]. Context: Observed when using python, linux
- Why it matters: Developers may fail before the first successful local run: [BUG: Breaking]
- Evidence: failure_mode_cluster:github_issue | fmev_163a0175abe51d147c94b98207ebbc97 | https://github.com/datalab-to/marker/issues/1032 | [BUG: Breaking]
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.

### Constraint 5: Failure mode: installation: [BUG: Breaking] Marker is 20x+ slower since v1.9.0+ in Mac

- Trigger: Developers should check this installation risk before relying on the project: [BUG: Breaking] Marker is 20x+ slower since v1.9.0+ in Mac
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: [BUG: Breaking] Marker is 20x+ slower since v1.9.0+ in Mac. Context: Observed when using python, cuda
- Why it matters: Developers may fail before the first successful local run: [BUG: Breaking] Marker is 20x+ slower since v1.9.0+ in Mac
- Evidence: failure_mode_cluster:github_issue | fmev_17d2e6761f9590ea709f9a3c31258ef2 | https://github.com/datalab-to/marker/issues/960 | [BUG: Breaking] Marker is 20x+ slower since v1.9.0+ in Mac
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.

### Constraint 6: Failure mode: installation: [BUG: Breaking] missing dependency: psutil

- Trigger: Developers should check this installation risk before relying on the project: [BUG: Breaking] missing dependency: psutil
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: [BUG: Breaking] missing dependency: psutil. Context: Observed when using python, docker
- Why it matters: Developers may fail before the first successful local run: [BUG: Breaking] missing dependency: psutil
- Evidence: failure_mode_cluster:github_issue | fmev_7f0847da6ce4e932ebb7040531b8c2fb | https://github.com/datalab-to/marker/issues/818 | [BUG: Breaking] missing dependency: psutil
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.

### Constraint 7: Failure mode: configuration: Minor fixes

- Trigger: Developers should check this configuration risk before relying on the project: Minor fixes
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: Minor fixes. Context: Source discussion did not expose a precise runtime context.
- Why it matters: Upgrade or migration may change expected behavior: Minor fixes
- Evidence: failure_mode_cluster:github_release | fmev_9bd20653f46c11589bffedacd93a5209 | https://github.com/datalab-to/marker/releases/tag/v1.10.1 | Minor fixes
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.

### Constraint 8: Failure mode: configuration: [BUG: Breaking]torch.AcceleratorError: index 8192 is out of bounds: 2, range 0 to 4756

- Trigger: Developers should check this configuration risk before relying on the project: [BUG: Breaking]torch.AcceleratorError: index 8192 is out of bounds: 2, range 0 to 4756
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: [BUG: Breaking]torch.AcceleratorError: index 8192 is out of bounds: 2, range 0 to 4756. Context: Observed when using python
- Why it matters: Developers may misconfigure credentials, environment, or host setup: [BUG: Breaking]torch.AcceleratorError: index 8192 is out of bounds: 2, range 0 to 4756
- Evidence: failure_mode_cluster:github_issue | fmev_7f14e833dcd25affa00b77c20fbaedad | https://github.com/datalab-to/marker/issues/1036 | [BUG: Breaking]torch.AcceleratorError: index 8192 is out of bounds: 2, range 0 to 4756
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.

### Constraint 9: Failure mode: configuration: [FEAT] Export converter: save extracted tables and structured content to SQLite / CSV

- Trigger: Developers should check this configuration risk before relying on the project: [FEAT] Export converter: save extracted tables and structured content to SQLite / CSV
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: [FEAT] Export converter: save extracted tables and structured content to SQLite / CSV. Context: Observed when using python
- Why it matters: Developers may misconfigure credentials, environment, or host setup: [FEAT] Export converter: save extracted tables and structured content to SQLite / CSV
- Evidence: failure_mode_cluster:github_issue | fmev_e892a62132667ffb55c57d798a583266 | https://github.com/datalab-to/marker/issues/1035 | [FEAT] Export converter: save extracted tables and structured content to SQLite / CSV
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.

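### Constraint 10: Capability judgments depend on assumptions

- Trigger: README/documentation is current enough for a first validation pass.
- Host AI rule: Turn the assumption into a downstream verification checklist.
- Why it matters: If the assumption does not hold, the user will not get the promised capability.
- Evidence: capability.assumptions | github_repo:712111618 | https://github.com/datalab-to/marker | README/documentation is current enough for a first validation pass.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows that it has been closed.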
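
Each constraint's Evidence line above carries four pipe-separated fields. A host AI that wants to cite them individually could split the line as sketched below. The field names (`kind`, `evidence_id`, `url`, `note`) and the function name are hypothetical labels chosen for illustration; the pack itself does not name these fields.

```python
from typing import NamedTuple


class EvidenceRef(NamedTuple):
    """Hypothetical field names for the four pipe-separated parts."""

    kind: str
    evidence_id: str
    url: str
    note: str


def parse_evidence_line(line: str) -> EvidenceRef:
    """Split a '- Evidence: a | b | c | d' line into its four parts."""
    body = line.strip()
    prefix = "- Evidence:"
    if body.startswith(prefix):
        body = body[len(prefix):]
    # Limit the split so a '|' inside the trailing note stays in the note.
    parts = [part.strip() for part in body.split("|", 3)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    return EvidenceRef(*parts)
```

Limiting the split to three separators keeps the free-text note intact even when an issue title happens to contain a pipe character.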
