# olmocr - Doramagic AI Context Pack

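> Positioning: a pre-install preview and decision asset. It helps the host AI get off to a good start, but it does not mean the target project has been installed, executed, or verified.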

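## Sufficiency Principle

- **Sufficiency, not compression**: An AI Context Pack should be complete enough for the host AI to understand the project's value, capability boundaries, entry points, risks, and evidence sources before starting work; it may be organized in layers, but the shortest possible summary is not the goal.
- **Compression strategy**: Compress only noise and repetition, never context that affects judgment or the quality of the initial work.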

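## How the Host AI Should Use This

You are reading the AI Context Pack that Doramagic compiled for olmocr. Treat it as pre-work context: help the user understand who the project suits, what it can do, how to start, what must be verified after installation, and where the risks lie. Do not claim that you have installed, run, or executed the target project.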

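## Claim Consumption Rules

- **Fact sources**: Repo Evidence + Claim/Evidence Graph; the Human Wiki supplies only salience, terminology, and narrative structure.
- **Minimum status for facts**: `supported`
- `supported`: May be used as a project fact, but the answer must cite the claim_id and the evidence path.
- `weak`: May be used only as a low-confidence lead; the user must be asked to verify it further.
- `inferred`: May be used only for risk warnings or open questions; it must not be presented as a project fact.
- `unverified`: Must not be used as a fact; state clearly that the evidence is insufficient.
- `contradicted`: The conflicting sources must be shown; do not pick one version on the user's behalf.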
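
The status rules above can be sketched as a small lookup. This is a minimal illustration, not part of the pack's tooling: the `Claim` record shape, the policy labels, and the function names are all hypothetical, and only the five status names, the claim IDs, and the evidence paths come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical record shape; the pack does not define a schema."""
    claim_id: str
    status: str
    confidence: float
    evidence_path: str = ""


# Allowed usage per status, paraphrasing the consumption rules.
# The label strings on the right are illustrative names, not pack fields.
POLICY = {
    "supported": "fact",
    "weak": "low_confidence_lead",
    "inferred": "risk_or_open_question",
    "unverified": "insufficient_evidence",
    "contradicted": "show_conflict",
}


def usage_for(claim: Claim) -> str:
    """Return how a claim may be used; unknown statuses count as unverified."""
    return POLICY.get(claim.status, "insufficient_evidence")


def cite(claim: Claim) -> str:
    """Format a citation; only supported claims may be cited as facts."""
    if usage_for(claim) != "fact":
        raise ValueError(f"{claim.claim_id} is {claim.status}, not a citable fact")
    return f"{claim.claim_id} ({claim.evidence_path})"
```

Treating an unknown status as `unverified` is a conservative choice made for this sketch; the pack itself does not say how unlisted statuses should be handled.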

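## Who It Suits Best

- **AI researchers or builders of research-oriented agents**: The README is explicitly organized around research, experiment, or paper workflows. Evidence: `README.md` Claim: `clm_0002` supported 0.86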

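## What It Can Do

- **Command-line launch or installation flow** (requires post-install verification): The project documentation contains executable commands; real use requires running them locally or in the host environment. Evidence: `README.md` Claim: `clm_0001` supported 0.86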

## How to Start

- `pip install olmocr` Evidence: `README.md` Claim: `clm_0003` supported 0.86, `clm_0004` supported 0.86, `clm_0006` supported 0.86, `clm_0007` supported 0.86, etc.
- `pip install olmocr[gpu] --extra-index-url https://download.pytorch.org/whl/cu128` Evidence: `README.md` Claim: `clm_0004` supported 0.86
- `pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl` Evidence: `README.md` Claim: `clm_0005` supported 0.86
- `pip install olmocr[beaker]` Evidence: `README.md` Claim: `clm_0006` supported 0.86
- `pip install olmocr[bench]` Evidence: `README.md` Claim: `clm_0007` supported 0.86
- `pip install olmocr[gpu,beaker] --extra-index-url https://download.pytorch.org/whl/cu128` Evidence: `README.md` Claim: `clm_0008` supported 0.86
- `pip install olmocr[gpu,bench] --extra-index-url https://download.pytorch.org/whl/cu128` Evidence: `README.md` Claim: `clm_0009` supported 0.86
- `curl -o olmocr-sample.pdf https://olmocr.allenai.org/papers/olmocr_3pg_sample.pdf` Evidence: `README.md` Claim: `clm_0010` supported 0.86
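
The `pip` variants listed above differ only in which extras are requested and whether the cu128 index is added. The sketch below composes the command string for review without running anything. The helper name `install_command` is hypothetical, and the rule that the index URL accompanies the `gpu` extra is an assumption read off the listed commands, not a documented requirement.

```python
import shlex

# Index URL exactly as it appears in the commands listed above.
CU128_INDEX = "https://download.pytorch.org/whl/cu128"


def install_command(extras=()):
    """Build (but do not run) a pip command for the given olmocr extras."""
    extras = list(extras)
    target = "olmocr" + (f"[{','.join(extras)}]" if extras else "")
    cmd = ["pip", "install", target]
    # Assumption: in the listed commands, the cu128 index appears
    # whenever the gpu extra is requested, and only then.
    if "gpu" in extras:
        cmd += ["--extra-index-url", CU128_INDEX]
    # shlex.join quotes the bracketed extras so the string is shell-safe.
    return shlex.join(cmd)


print(install_command(["gpu", "bench"]))
```

Because `shlex.join` quotes the square brackets, the printed string can be pasted into shells that would otherwise treat `[gpu,bench]` as a glob pattern. Whether to run it at all remains a post-install verification decision.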

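## Pre-Continuation Decision Card

- **Current recommendation**: Start with a role-matching trial
- **Why**: This project looks more like a role library; the core risk is choosing the wrong role or mistaking role copy for execution capability. Try role matching with Prompt Preview first, then decide whether to import it into a sandbox.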

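### 30-Second Assessment

- **What to do now**: Start with a role-matching trial
- **Minimal safe next step**: Try role matching with Prompt Preview first; import in isolation only once satisfied
- **Do not trust yet**: Role quality and task fit cannot be taken on trust.
- **Continuing will touch**: Role selection bias, command execution, the local environment or project files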

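### What You Can Trust Now

- **Audience lead: AI researchers or builders of research-oriented agents** (supported): Backed by a supported claim or project evidence, but this still does not equal real installed behavior. Evidence: `README.md` Claim: `clm_0002` supported 0.86
- **Capability exists: command-line launch or installation flow** (supported): You can trust that the project contains leads for this capability; whether it suits your specific task still requires a trial or post-install verification. Evidence: `README.md` Claim: `clm_0001` supported 0.86
- **Quick Start / install command leads exist** (supported): You can trust that the project documentation contains launch or installation entry points; do not run them directly in your primary environment on that basis. Evidence: `README.md` Claim: `clm_0003` supported 0.86, `clm_0004` supported 0.86, `clm_0006` supported 0.86, `clm_0007` supported 0.86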

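### What You Cannot Trust Yet

- **Role quality and task fit cannot be taken on trust.** (unverified): A role library shows that many roles exist; it does not show that every role suits your specific task, or that a role produces high-quality results.
- **Role copy must not be treated as real execution capability.** (unverified): Before installation you can only judge whether the role description matches the task profile, not whether it can complete the task inside the host AI.
- **Real output quality cannot be trusted before installation.** (unverified): Prompt Preview shows only the guidance style; it cannot prove result quality in a real project.
- **Host AI version compatibility cannot be trusted before installation.** (unverified): Loading rules and version differences across hosts such as Claude, Cursor, Codex, and Gemini must be verified in a real environment.
- **That existing host AI behavior will not be polluted cannot be taken on trust.** (inferred): Skills, plugins, and AGENTS/CLAUDE/GEMINI instructions may change the host AI's default behavior.
- **Safe rollback cannot be assumed.** (unverified): Unless the project explicitly provides uninstall and restore instructions, verify in an isolated environment first.
- **Is the real installation compatible with the user's current host AI version?** (unverified): Compatibility can only be verified in the actual host environment.
- **Does the project's output quality meet the user's specific task?** (unverified): A pre-install preview shows only the flow and boundaries; it cannot replace a real evaluation.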

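### What Continuing Will Touch

- **Role selection bias**: The user's judgment about which expert role should handle the task. Reason: Choosing the wrong role makes the AI answer from the wrong professional perspective, wasting time or misleading decisions.
- **Command execution**: Package managers, network downloads, local plugin directories, project configuration, or the user's home directory. Reason: Running the very first command may already change the environment; decide first whether it is worth running. Evidence: `README.md`
- **Local environment or project files**: Installation results, plugin caches, project configuration, or local dependency directories. Reason: Before installation, the write scope and rollback method cannot be proven, so isolated verification is required. Evidence: `README.md`
- **Host AI context**: The AI Context Pack, Prompt Preview, Skill routing, risk rules, and project facts. Reason: Imported context affects the host AI's later judgments; unverified items must not be presented as facts.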

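### Minimal Safe Next Step

- **Run Prompt Preview first**: Use an interactive trial to validate the task profile and role match before importing the whole role library. (Applies to: any project, especially when output quality is unknown.)
- **Trial-install only in an isolated directory or test account**: Keep install commands from polluting the primary host AI, real projects, or the user's home directory. (Applies to: cases with leads for command execution, plugin configuration, or local writes.)
- **After installation, verify only one minimal task**: Verify loading, compatibility, output quality, and rollback first, then decide whether to use it in depth. (Applies to: moving from trial into a real workflow.)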

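### Exit Strategy

- **Preserve the pre-install state**: Record the original host configuration and project state so that you can later judge whether recovery is possible.
- **Keep a record of the original role choice**: If the output drifts off topic, return to the task-profile stage and choose a role again instead of pressing on with the wrong one.
- **Record install commands and write paths**: Without explicit uninstall instructions, you at least need to know which directories or configuration files require manual cleanup.
- **If there is no rollback path, stay out of the primary environment**: Irreversibility is a blocker for continuing; do not proceed on trust or luck.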
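
One way to record write paths, sketched under stated assumptions: snapshot the file listing of an isolated directory before and after an install step and diff the two. The helper names (`snapshot`, `new_paths`, `demo`) are hypothetical, and `demo` writes a placeholder file instead of running any installer.

```python
import tempfile
from pathlib import Path


def snapshot(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def new_paths(before, after):
    """List files present in the later snapshot but not in the earlier one."""
    return sorted(after - before)


def demo():
    """Simulate an install step inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as sandbox:
        before = snapshot(sandbox)
        # Stand-in for an install command; no real installer is run here.
        (Path(sandbox) / "installed.txt").write_text("placeholder")
        after = snapshot(sandbox)
        return new_paths(before, after)
```

This comparison detects only newly created files. Modified or deleted files would need content hashes or timestamps, which the sketch leaves out.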

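## Preview-Only Items

- Explaining who the project suits and what it can do
- Demonstrating a typical conversation flow based on the project documentation
- Helping the user judge whether the project is worth installing or researching further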

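## Items That Require Post-Install Verification

- Actually installing a Skill, plugin, or CLI
- Executing scripts, modifying local files, or accessing external services
- Verifying real output quality, performance, and compatibility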

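## Boundary and Risk Card

- **Mistaking a pre-install preview for a real run**: The user may overestimate how much configuration, permission, and compatibility verification the project has already completed. Handling: Clearly separate prompt_preview_can_do from runtime_required. Claim: `clm_0011` inferred 0.45
- **Command execution modifies the local environment**: Install commands may write to the user's home directory, the host plugin directory, or project configuration. Handling: Run them first in an isolated environment or test account. Evidence: `README.md` Claim: `clm_0012` supported 0.86
- **To be confirmed**: Is the real installation compatible with the user's current host AI version? Reason: Compatibility can only be verified in the actual host environment.
- **To be confirmed**: Does the project's output quality meet the user's specific task? Reason: A pre-install preview shows only the flow and boundaries; it cannot replace a real evaluation.
- **To be confirmed**: Do the install commands require network access, elevated permissions, or global writes? Reason: This affects installation risk in both enterprise and personal environments.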

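## Pre-Work Context

### Load Order

- First read how_to_use.host_ai_instruction to establish the boundaries of this pre-install decision asset.
- Read claim_graph_summary to confirm that facts come from the Claim/Evidence Graph rather than the Human Wiki narrative.
- Then read intended_users, capabilities, and quick_start_candidates to judge whether the user is a match.
- When a specific task must be carried out, consult role_skill_index first, then evidence_index.
- For questions about real installation, file modification, network access, performance, or compatibility, switch to risk_card and boundaries.runtime_required.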

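### Task Routing

- **Command-line launch or installation flow**: First state that this capability requires post-install verification, then provide a pre-install checklist. Boundary: Must be verified by a real installation or run. Evidence: `README.md` Claim: `clm_0001` supported 0.86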

### Context Size

- Total files: 315
- Important file coverage: 40/315
- Evidence index entries: 76
- Role / Skill entries: 29

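### Handling Insufficient Evidence

- **missing_evidence**: State that the evidence is insufficient and ask the user for the target file, README passage, or post-install verification record; do not fill in facts.
- **out_of_scope_request**: State that the task falls outside the evidence covered by this AI Context Pack, and suggest that the user consult the Human Manual or verify after a real installation.
- **runtime_request**: Provide the pre-install checklist and the source of the commands, but do not execute commands for the user or claim to have executed them.
- **source_conflict**: Show the conflicting sources side by side, mark them as pending verification, and do not force a choice of one version.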
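
The four cases above amount to a dispatch table. The following is a minimal sketch: the enum, the `respond` function, and the fallback for unrecognized cases are hypothetical, and only the four case names and their handling come from the list above.

```python
from enum import Enum


class Gap(Enum):
    """The four insufficient-evidence cases named in this section."""
    MISSING_EVIDENCE = "missing_evidence"
    OUT_OF_SCOPE_REQUEST = "out_of_scope_request"
    RUNTIME_REQUEST = "runtime_request"
    SOURCE_CONFLICT = "source_conflict"


# Short paraphrases of the handling rules; wording is illustrative.
ACTIONS = {
    Gap.MISSING_EVIDENCE: "Say evidence is insufficient and ask for the file, README passage, or verification record.",
    Gap.OUT_OF_SCOPE_REQUEST: "Say the task is outside the pack's evidence and point to the Human Manual or a real install.",
    Gap.RUNTIME_REQUEST: "Give the pre-install checklist and command sources; do not execute or claim execution.",
    Gap.SOURCE_CONFLICT: "Show both sources, mark as pending verification, and do not pick a version.",
}


def respond(case: str) -> str:
    """Map a case name to its handling rule."""
    try:
        gap = Gap(case)
    except ValueError:
        # Assumption: unrecognized cases fall back to the most conservative rule.
        gap = Gap.MISSING_EVIDENCE
    return ACTIONS[gap]
```

Falling back to `missing_evidence` for unknown inputs is a design choice of this sketch; it keeps the host from inventing facts when a request cannot be classified.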

## Prompt Recipes

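### Fit Assessment

- Goal: Judge whether this project suits the user's current task.
- Expected output: A fit conclusion, key reasons, evidence citations, what can be previewed before installation, what must be verified after installation, and suggested next steps.

```text
Based on the AI Context Pack for olmocr, first ask me 3 essential questions, then judge whether it fits my task. The answer must cover: who it suits, what it can do, what it cannot do, whether it is worth installing, and where the evidence comes from. Every project fact must cite evidence_refs, source_paths, or a claim_id.
```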

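### Pre-Install Experience

- Goal: Let the user get a feel for the core workflow before installing, without presenting the preview as real capability or a marketing promise.
- Expected output: An experience script with boundary labels, a post-install verification checklist, and cautious advice; no promises of real execution and no hard-sell wording.

```text
Treat olmocr as a pre-install experience asset, not as an installed tool or a real runtime environment.

Output exactly four sections:
1. First ask me 3 essential questions.
2. Give an "experience script": use the three labels [Previewable before install], [Must verify after install], and [Insufficient evidence] to show how it might guide the workflow.
3. Give a post-install verification checklist: list the capabilities that can only be confirmed after a real installation, real host loading, and a real project run.
4. Give cautious advice: you may only say "worth further research / a trial install", "provide more information before deciding", or "not recommended to continue"; do not endorse the project.

Hard boundaries:
- Do not claim to have installed, run, executed tests, modified files, or produced real results.
- Do not use promissory wording such as "adapts automatically", "guaranteed to pass", "perfect fit", or "strongly recommend installing".
- When describing post-install behavior, you must use a conditional such as "If installation succeeds and the host loads the Skill correctly, it might ...".
- The experience script may only be written as "sample lines / a hypothetical flow": use "might ask / might suggest / might show", not "has written, has generated, has passed, is running, is generating".
- Prompt Preview does not supply install commands; if the user is ready for a trial install, only prompt them to read the Quick Start and Risk Card first and to verify in an isolated environment.
- All project facts must come from supported claims, evidence_refs, or source_paths; inferred/unverified items may serve only as risks or open questions.

```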

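### Role / Skill Selection

- Goal: Pick the best-matching assets from the project's roles or Skills.
- Expected output: A list of candidate roles or Skills, each with its applicable scenario, evidence path, risk boundary, and whether post-install verification is required.

```text
Read role_skill_index and recommend the 3-5 roles or Skills most relevant to my target task. For each recommendation, state the applicable scenario, likely output, risk boundary, and evidence_refs.
```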

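### Risk Pre-Check

- Goal: Identify environment, permission, rule-conflict, and quality risks before installing or adopting the project.
- Expected output: A checklist covering environment, permissions, dependencies, licensing, host conflicts, quality risks, and unknowns.

```text
Based on risk_card, boundaries, and quick_start_candidates, give me a pre-install risk checklist. Do not execute commands for me; only explain what I should check, why, and what the impact of a failure would be.
```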

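### Host AI Kickoff Instruction

- Goal: Turn the project context into an instruction for the host AI before a conversation begins.
- Expected output: A kickoff instruction with clear boundaries and explicit evidence citations, suitable for pasting into the host AI.

```text
Based on the AI Context Pack for olmocr, generate a kickoff instruction that I can paste into the host AI. The instruction must respect not_runtime=true and must not claim that the project has been installed, run, or produced real results.
```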

## Role / Skill Index

- 29 role / Skill / project documentation entries indexed in total.

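- **News** (project_doc): A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `README.md`
- **olmOCR-Bench** (project_doc): Dataset Link: https://huggingface.co/datasets/allenai/olmOCR-bench Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/README.md`
- **olmOCR Training Guide** (project_doc): This guide provides comprehensive instructions for training olmOCR models, including what you need to reproduce https://huggingface.co/allenai/olmOCR-2-7B-1025-FP8 on your own hardware. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/train/README.md`
- **elo rating** (project_doc): Calculates elo rating of olmOCR vs other tools. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `scripts/elo/README.md`
- **Changelog** (project_doc): All notable changes to this project will be documented in this file. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `CHANGELOG.md`
- **GitHub Release Process** (project_doc): 1. Update the version in olmocr/version.py . Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `RELEASE_PROCESS.md`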
- **Blank Book Pg1 Pg1 Repeat1** (project_doc): Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/blank_book_pg1_pg1_repeat1.md`
- **Buildingnotes Pg1 Repeat1** (project_doc): Master - 7 1/4 - 36" Master Bath - 7 1/4 - 30" Laundry - 4 3/4 - 36" Bath - 7 1/4 - 24" MUD - 7 - 36" UTIL - 8 1/4 - 36" DOWN BATH - 7 1/4 - 32" BUT KIT - 6 3/4 - 30 PANTRY - 4 3/4 - 24 6 WEST - 32 9/8 - 32 6 WEST BATH 5" - 24" Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/buildingnotes_pg1_repeat1.md`
- **Discoverworld Crazy Table4 Pg1 Repeat1** (project_doc): Table 4: Baseline model performance on each of the three scoring metrics task completion, task process, explanatory knowledge discovery across all 24 DISCOVERY WORLD tasks. Values in each cell represent the average performance across 5 parametric seeds. Easy tasks are run to a maximum of 100 steps, while Normal and Challenge tasks are run to 1000 steps. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/discoverworld_crazy_table4_pg1_repeat1.md`
- **Earnings Pg1 Repeat1** (project_doc): Recently Issued Accounting Pronouncements Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/earnings_pg1_repeat1.md`
- **Ff0F0B22C55D8B90Dd77D153F48E144Fc9Db Pg2 Pg1 Repeat1** (project_doc): Lassa Fever in Post-Conflict Sierra Leone Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff0f0b22c55d8b90dd77d153f48e144fc9db_pg2_pg1_repeat1.md`
- **Ff1Fc6A205Ad039139Ce566851B6B260C929 Pg1 Pg1 Repeat1** (project_doc): RTG Degradation Primer and Application to MMRTG Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff1fc6a205ad039139ce566851b6b260c929_pg1_pg1_repeat1.md`
- **Ff3D6E051903Fe5Ca9Bc172Ece14964C5632 Pg1 Pg1 Repeat1** (project_doc): بررسی دیدگاه و نظرات کتابداران و اعضای هیئت علمی دانشگاه شیراز در بره گیری از فناوری شبکه‌های پی سیم در کتابخانه‌های دانشگاهی Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff3d6e051903fe5ca9bc172ece14964c5632_pg1_pg1_repeat1.md`
- **Ff4F7Dad78081Cff727D19Ab51C181D4A661 Pg1 Pg1 Repeat1** (project_doc): Molecular markers of breast cancer metastasis Weigelt, B. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff4f7dad78081cff727d19ab51c181d4a661_pg1_pg1_repeat1.md`
- **Ff518B1240A66978F22035528Ccb029450B5 Pg2 Pg1 Repeat1** (project_doc): Prophet of the Jubilee, translated and edited by Ronald D. Dennis Religious Studies Center, Brigham Young University, 1997 Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff518b1240a66978f22035528ccb029450b5_pg2_pg1_repeat1.md`
- **Ffaac214730D2B8C2Ec842E3618Ccb9C4259 Pg1 Pg1 Repeat1** (project_doc): Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ffaac214730d2b8c2ec842e3618ccb9c4259_pg1_pg1_repeat1.md`
- **Fff590Bed29A2854Ac1F874Dad5752Ede1Aa Pg1 Pg1 Repeat1** (project_doc): Lake Shore Cryotronics, Inc. 575 McCorkle Blvd. Westerville, Ohio 43082-8888 USA Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/fff590bed29a2854ac1f874dad5752ede1aa_pg1_pg1_repeat1.md`
- **Lincoln Letter Pg1 Repeat1** (project_doc): Major General Hitchcock, Commissioner of Exchanges, is authorized and directed to offer Brigadier General Trimble, now a prisoner of war in Fort McHenry, in exchange for Major White, who is held as a prisoner at Richmond. He is also directed to send forward the offer of exchange by Henry M. Warfield, Esq. of Baltimore, under a flag of truce, and give him a pass to City Point. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/lincoln_letter_pg1_repeat1.md`
- **Math 2503 04086 Pg1 Repeat1** (project_doc): Proof. Let $S$ be the generating set associated with $D$ as described in Proposition 2.5. By the circulant diagonalization theorem, the spectrum of $G R D = \Gamma R, S $ is the multiset $\{\lambda g\} {g \in R}$ where Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/math_2503_04086_pg1_repeat1.md`
- **The 20 Most Important Mathematical Equations** (project_doc): The 20 Most Important Mathematical Equations Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mathfuncs_colswitch_pg1_repeat1.md`
- **The 20 Most Important Mathematical Equations** (project_doc): The 20 Most Important Mathematical Equations Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mathfuncs_pg1_repeat1.md`
- **Mattsnotes Pg1 Repeat1** (project_doc): CodeText: SE, whatever we've scraped Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mattsnotes_pg1_repeat1.md`
- **Mattsnotes Pg2 Repeat1** (project_doc): P1: 100% Source code P2: 80% code 20% language Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mattsnotes_pg2_repeat1.md`
- **Mattsnotes Pg3 Repeat1** (project_doc): - Pick Arch like OLMO-IB - OR replicate a 3D model - Follow standard LR flow Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mattsnotes_pg3_repeat1.md`
- **Multi Column Miss Pg1 Repeat1** (project_doc): stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry’s public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment disruption of its efforts to take its place alongside other industries—often with considerable social credit—in the hope that it might gain by association. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/multi_column_miss_pg1_repeat1.md`
- **Olmo2 Pg4 Pg1 Repeat1** (project_doc): Table 1 Composition of the pretraining data for OLMo 2. The OLMo 2 1124 Mix is composed of StarCoder Li et al., 2023b; Kocetkov et al., 2022 , peS2o Soldaini and Lo, 2023 , web text from DCLM Li et al., 2024 and Wiki come from Dolma 1.7 Soldaini et al., 2024 . arXiv comes from Red-Pajama Together AI, 2023 , while OpenWebMath Paster et al., 2023 and Algebraic Stack come from ProofPile II Azerbayev et al., 2023 . Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/olmo2-pg4_pg1_repeat1.md`
- **Openstax Caculus Pg 273 Pg1 Repeat1** (project_doc): For the following exercises, the given functions represent the position of a particle traveling along a horizontal line. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/openstax_caculus_pg_273_pg1_repeat1.md`
- **Small Page Size Pg1 Repeat1** (project_doc): any—was very trifling. Since the use of bones has, however, become general, the turnip crop has been, in many instances, ten-fold, and in few less than four or five-fold its former bulk. All the succeeding crops of grain and seeds have been amazingly increased, and, upon the four or five-shift system, there is no doubt the land will go on progressively improving, requiring a less quantity of bones annually, from its… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/small_page_size_pg1_repeat1.md`
- **Test Graphical Text Pg1 Repeat1** (project_doc): THE POWER OF STORYTELLING FOR LEADERS ดร.วิทย์ สิทธิเวคิน Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/test-graphical-text_pg1_repeat1.md`

## Evidence Index

- 76 evidence entries indexed in total.

- **News** (documentation): A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format. Evidence: `README.md`
- **olmOCR-Bench** (documentation): Dataset Link: https://huggingface.co/datasets/allenai/olmOCR-bench Evidence: `olmocr/bench/README.md`
- **olmOCR Training Guide** (documentation): This guide provides comprehensive instructions for training olmOCR models, including what you need to reproduce https://huggingface.co/allenai/olmOCR-2-7B-1025-FP8 on your own hardware. Evidence: `olmocr/train/README.md`
- **elo rating** (documentation): Calculates elo rating of olmOCR vs other tools. Evidence: `scripts/elo/README.md`
- **License** (source_file): Apache License Version 2.0, January 2004 https://www.apache.org/licenses/ Evidence: `LICENSE`
- **Build from the base olmocr image** (source_file): Build from the base olmocr image FROM alleninstituteforai/olmocr:latest Evidence: `Dockerfile.with-model`
- **Datatypes** (source_file): @dataclass frozen=True class PdfOutput ⋮---- path: str text: str total pdf pages: int processed pdf pages: int def mk dolma doc self, kwargs - str ⋮---- metadata = { id = hashlib.sha1 self.text.encode .hexdigest dolma doc = { Evidence: `olmocr/datatypes.py`
- **Read headers** (source_file): logger = logging.getLogger name ⋮---- server logger = logging.getLogger "vllm" ⋮---- console handler = logging.StreamHandler ⋮---- workspace s3 = boto3.client "s3" pdf s3 = boto3.client "s3" metrics = MetricsKeeper window=60 5 tracker = WorkerTracker vllm queued requests = None TEMPERATURE BY ATTEMPT = 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 0.9, 1.0 pdf render max workers limit = asyncio.BoundedSemaphore int float os.environ.get "BEAKER ASSIGNED CPU COUNT", max 1, multiprocessing.cpu count - 2 max concurrent requests limit = asyncio.BoundedSemaphore 1 get pdf filter = cache lambda: PdfFilter languages to keep={Language.ENGLISH, None}, apply download spam check=True, apply form check=True ⋮---- @data… Evidence: `olmocr/pipeline.py`
- **Fall back for local files** (source_file): logger = logging.getLogger name ⋮---- def parse s3 path s3 path: str - tuple str, str ⋮---- parsed = urlparse s3 path bucket = parsed.netloc key = parsed.path.lstrip "/" ⋮---- def expand s3 glob s3 client, s3 glob: str - dict str, str ⋮---- parsed = urlparse s3 glob ⋮---- raw path = parsed.path.lstrip "/" ⋮---- first wildcard = min raw path.index wc for wc in " ", " ", " " if wc in raw path prefix = raw path :first wildcard paginator = s3 client.get paginator "list objects v2" matched = {} ⋮---- key = obj "Key" ⋮---- resp = s3 client.head object Bucket=bucket, Key=raw path ⋮---- check prefix = raw path if raw path.endswith "/" else raw path + "/" ⋮---- def get s3 bytes s3 client, s3 path: s… Evidence: `olmocr/s3_utils.py`
- **Create done flag in done flags dir** (source_file): logger = logging.getLogger name WORKER LOCKS DIR = "worker locks" DONE FLAGS DIR = "done flags" ⋮---- @dataclass class WorkItem ⋮---- hash: str work paths: List str class Backend abc.ABC ⋮---- @abc.abstractmethod async def load index lines self - List str ⋮---- @abc.abstractmethod async def save index lines self, lines: List str - None ⋮---- @abc.abstractmethod async def get completed hashes self - Set str ⋮---- @abc.abstractmethod async def is completed self, work hash: str - bool ⋮---- @abc.abstractmethod async def is worker lock taken self, work hash: str, worker lock timeout secs: int = 1800 - bool ⋮---- @abc.abstractmethod async def create worker lock self, work hash: str - None ⋮----… Evidence: `olmocr/work_queue.py`
- **See https://setuptools.pypa.io/en/latest/userguide/quickstart.html for more project configuration options.** (source_file): build-system requires = "setuptools", "wheel" build-backend = "setuptools.build meta" Evidence: `pyproject.toml`
- **Define an inner function to evaluate a single test** (source_file): candidate errors = test failures = test type breakdown = {} all test scores = test results = {} candidate name = os.path.basename candidate folder pdf to md files = {} all files = list glob.glob os.path.join candidate folder, " / .md" , recursive=True ⋮---- md base = os.path.splitext pdf name 0 md regex = re.compile rf"^{re.escape md base } pg\d+ repeat\d+\.md$" md files = f for f in all files if md regex.match os.path.relpath f, candidate folder ⋮---- Define an inner function to evaluate a single test def process test test: BasePDFTest - Tuple float, str, str, List str , Tuple bool, str ⋮---- local errors = test failure = None pdf name = test.pdf Initialize the test results structure if ne… Evidence: `olmocr/bench/benchmark.py`
- **Prompts** (source_file): def build basic prompt - str def build openai silver data prompt no document anchoring base text: str - str def claude response format schema - dict Evidence: `olmocr/bench/prompts.py`
- **Create a new PDF with just the requested page** (source_file): marker converter = None def run marker pdf path: str, page num: int = 1 - str ⋮---- config = { config parser = ConfigParser config marker converter = PdfConverter pdf to process = pdf path temp file = None ⋮---- reader = PdfReader pdf path ⋮---- Create a new PDF with just the requested page writer = PdfWriter pypdf uses 0-based indexing, so subtract 1 from page num ⋮---- Save the extracted page to a temporary file temp file = tempfile.NamedTemporaryFile suffix=".pdf", delete=False ⋮---- pdf to process = temp file.name ⋮---- rendered = marker converter pdf to process Evidence: `olmocr/bench/runners/run_marker.py`
- **We leave the server running for potential reuse** (source_file): logger = logging.getLogger "olmocr runner" ⋮---- @dataclass class Args ⋮---- model: str = "allenai/olmOCR-2-7B-1025-FP8" server: str = "http://localhost:30044/v1" port: int = 30044 model chat template: str = "qwen2-vl" max model len: int = 16384 guided decoding: bool = False gpu memory utilization: float = 0.8 target longest image dim: int = 1288 target anchor text len: int = -1 max page retries: int = 8 max page error rate: float = 0.004 tensor parallel size: int = 1 data parallel size: int = 1 server check lock = asyncio.Lock async def run olmocr pipeline pdf path: str, page num: int = 1, model: str = "allenai/olmOCR-2-7B-1025-FP8" - Optional str ⋮---- metrics = MetricsKeeper window=60 5… Evidence: `olmocr/bench/runners/run_olmocr_pipeline.py`
- **Run Server** (source_file): image base64 = render pdf to base64png pdf path, page num=page num, target longest image dim=target longest image dim anchor text = get anchor text pdf path, page num, pdf engine="pdfreport" ⋮---- prompt = build openai silver data prompt anchor text ⋮---- prompt = build openai silver data prompt no document anchoring anchor text ⋮---- prompt = build finetuning prompt anchor text ⋮---- prompt = build basic prompt ⋮---- prompt = build no anchoring v4 yaml prompt ⋮---- prompt = build openai silver data prompt v3 simple width, height ⋮---- request = { url = f"{endpoint.rstrip '/' }/chat/completions" ⋮---- response = await client.post url, json=request ⋮---- data = response.json choice = data "c… Evidence: `olmocr/bench/runners/run_server.py`
- **Split Table Tests** (source_file): RELATIONSHIP FIELDS = "up", "down", "left", "right", "top heading", "left heading" def base test to dict test: BasePDFTest - dict ⋮---- result = {} ⋮---- value = getattr test, field.name ⋮---- def test to dict minimal test: TableTest - dict ⋮---- """ Convert a TableTest to a dict, removing empty relationship fields. """ result = { ⋮---- value = getattr test, field ⋮---- def split table test test: TableTest - list dict ⋮---- active relationships = ⋮---- split tests = ⋮---- def main ⋮---- parser = argparse.ArgumentParser description="Split table tests with multiple relationships into individual tests" ⋮---- args = parser.parse args input path = Path args.input file output path = Path args.out… Evidence: `olmocr/bench/scripts/split_table_tests.py`
- **Return the merged images along with other elements** (source_file): options = { scores = {label: get document coherency text for label, text in options.items } best option label = max scores, key=scores.get best option = options best option label ⋮---- def get pdftotext local pdf path: str, page: int - str ⋮---- pdftotext result = subprocess.run ⋮---- def get pypdf raw local pdf path: str, page: int - str ⋮---- reader = PdfReader local pdf path pypage = reader.pages page - 1 ⋮---- def get pdfium local pdf path: str, page: int - str ⋮---- pdf = pdfium.PdfDocument local pdf path textpage = pdf page - 1 .get textpage ⋮---- def transform point x, y, m ⋮---- x new = m 0 x + m 2 y + m 4 y new = m 1 x + m 3 y + m 5 ⋮---- def mult m: List float , n: List float - Li… Evidence: `olmocr/prompts/anchor.py`
- **Validate rotation correction is one of the allowed values** (source_file): def build openai silver data prompt base text: str - str def build openai silver data prompt v2 base text: str - str def build openai silver data prompt v2 simple page width: int, page height: int - str def build openai silver data prompt v3 simple page width: int, page height: int - str ⋮---- @dataclass frozen=True class PageResponse ⋮---- primary language: Optional str is rotation valid: bool rotation correction: int is table: bool is diagram: bool natural text: Optional str def post init self ⋮---- Validate rotation correction is one of the allowed values ⋮---- Type checks ⋮---- def openai response format schema - dict def build finetuning prompt base text: str - str def build no anchori… Evidence: `olmocr/prompts/prompts.py`
- **Process all subscript tags** (source_file): total input tokens = 0 total output tokens = 0 def get git commit hash ⋮---- result = subprocess.run "git", "rev-parse", "HEAD" , stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True ⋮---- SUPERSCRIPT MAP = { SUBSCRIPT MAP = { def convert superscripts subscripts element ⋮---- sup text = sup.get text unicode text = "".join SUPERSCRIPT MAP.get char, char for char in sup text ⋮---- Process all subscript tags ⋮---- sub text = sub.get text unicode text = "".join SUBSCRIPT MAP.get char, char for char in sub text ⋮---- def download s3 pdf path, local path ⋮---- """Download a PDF from S3 or copy from local path.""" ⋮---- Check if it's a local path ⋮---- It's a local file, just copy… Evidence: `olmocr/synth/mine_html_templates.py`
- **Load YAML with OmegaConf for better features** (source_file): @dataclass class PipelineStepConfig ⋮---- name: str enabled: bool = True ⋮---- @dataclass class FrontMatterParserConfig PipelineStepConfig ⋮---- name: str = "FrontMatterParser" use page response class: bool = True ⋮---- @dataclass class PDFRendererConfig PipelineStepConfig ⋮---- name: str = "PDFRenderer" target longest image dim: int = 1024 ⋮---- @dataclass class StaticLengthDocumentAnchoringConfig PipelineStepConfig ⋮---- name: str = "StaticLengthDocumentAnchoring" target anchor text len: int = 6000 ⋮---- @dataclass class FinetuningPromptConfig PipelineStepConfig ⋮---- name: str = "FinetuningPrompt" ⋮---- @dataclass class NewYamlFinetuningPromptWithAnchoringConfig PipelineStepConfig ⋮----… Evidence: `olmocr/train/config.py`
- **Test that document anchoring works** (source_file): logger = logging.getLogger name def validate pdf pair md path: Path - Tuple Optional Dict str, Path , Optional Tuple Path, str ⋮---- pdf path = md path.with suffix ".pdf" ⋮---- pdf path = pdf path.resolve ⋮---- reader = PdfReader str pdf path num pages = len reader.pages ⋮---- Test that document anchoring works ⋮---- @dataclass frozen=True, slots=True class PipelineStep ABC ⋮---- """Abstract base class for pipeline steps.""" ⋮---- @abstractmethod def call self, sample: Sample - Optional Sample ⋮---- """Process a sample and return the modified sample, or None to skip this sample.""" ⋮---- class BaseMarkdownPDFDataset Dataset ⋮---- """Base dataset class that loads and verifies markdown-PDF pa… Evidence: `olmocr/train/dataloader.py`
- **Log a formatted summary to console** (source_file): logger = logging.getLogger name bench type filter: Optional List str = None def make type stats class DetailedRewardLogger ⋮---- def init self def clear self def add batch stats self, batch detailed stats: List Optional Dict ⋮---- jsonl name = os.path.basename stats "jsonl file" ⋮---- def get summary stats self - Dict ⋮---- summary = {"bench reward/total completions": self.accumulated stats "total completions" } ⋮---- def get batch summary self, batch detailed stats: List Optional Dict - Dict ⋮---- summary = { ⋮---- def gather across ranks self ⋮---- world size = dist.get world size gathered = None world size stats to send = { ⋮---- merged = self.accumulated stats ⋮---- def log to wandb sel… Evidence: `olmocr/train/grpo_train.py`
- **Muon** (source_file): def zeropower via newtonschulz5 G, steps: int ⋮---- X = G.bfloat16 ⋮---- X = X.mT X = X / X.norm dim= -2, -1 , keepdim=True + 1e-7 ⋮---- A = X @ X.mT B = b A + c A @ A X = a X + B @ X ⋮---- def muon update grad, momentum, beta=0.95, ns steps=5, nesterov=True ⋮---- update = grad.lerp momentum, beta if nesterov else momentum ⋮---- update = update.view len update , -1 update = zeropower via newtonschulz5 update, steps=ns steps ⋮---- class Muon torch.optim.Optimizer ⋮---- def init self, params, lr=0.02, weight decay=0, momentum=0.95 ⋮---- defaults = dict lr=lr, weight decay=weight decay, momentum=momentum ⋮---- params = sorted params, key=lambda x: x.size , reverse=True ⋮---- @torch.no grad def… Evidence: `olmocr/train/muon.py`
- **Save model** (source_file): logger = logging.getLogger name def prepare lora model model: torch.nn.Module, model cfg - torch.nn.Module ⋮---- lora kwargs = dict ⋮---- lora config = LoraConfig lora kwargs model = get peft model model, lora config ⋮---- base model = getattr model, "base model", None ⋮---- inner model = getattr base model, "model", None ⋮---- def is lora checkpoint checkpoint dir: str - bool class QwenDataCollator ⋮---- def init self, max token len: Optional int = None def call self, examples ⋮---- batch = {"input ids": , "attention mask": , "labels": , "pixel values": , "image grid thw": } ⋮---- input ids = torch.from numpy example "input ids" if isinstance example "input ids" , np.ndarray else example "… Evidence: `olmocr/train/train.py`
- **Read status line** (source_file): logger = logging.getLogger name ⋮---- server logger = logging.getLogger "vllm" ⋮---- file handler = logging.FileHandler "olmocr-pipeline-debug.log", mode="a" ⋮---- console handler = logging.StreamHandler ⋮---- SERVER PORT = 30024 metrics = MetricsKeeper window=60 5 class PIIClassification BaseModel ⋮---- primary language: str = Field ..., description="Primary language as a two-letter code" document type: str = Field ..., description="Basic summary of document type classification" is resume cv: Optional bool = Field None, description="True if the document is a page from a resume or cv" is academic paper: Optional bool = None is textbook: Optional bool = None is news article: Optional bool =… Evidence: `scripts/pii/rich_tagging_pipeline.py`
- **Read status line** (source_file): logger = logging.getLogger name ⋮---- server logger = logging.getLogger "vllm" ⋮---- file handler = logging.FileHandler "olmocr-pipeline-debug.log", mode="a" ⋮---- console handler = logging.StreamHandler ⋮---- SERVER PORT = 30024 metrics = MetricsKeeper window=60 5 class PIIClassification BaseModel ⋮---- primary language: str = Field ..., description="Primary language as a two-letter code" document type: str = Field ..., description="Basic summary of document type classification" is resume cv: Optional bool = Field None, description="True if the document is a page from a resume or cv" is academic paper: Optional bool = None is textbook: Optional bool = None is news article: Optional bool =… Evidence: `scripts/pii/tagging_pipeline.py`
- **Changelog** (documentation): All notable changes to this project will be documented in this file. Evidence: `CHANGELOG.md`
- **GitHub Release Process** (documentation): 1. Update the version in olmocr/version.py . Evidence: `RELEASE_PROCESS.md`
- **Remove markdown bold formatting or for bold** (source_file): test = False class TestType str, Enum ⋮---- BASELINE = "baseline" PRESENT = "present" ABSENT = "absent" ORDER = "order" TABLE = "table" MATH = "math" FORMAT = "format" FOOTNOTE = "footnote" class TestChecked str, Enum ⋮---- VERIFIED = "verified" REJECTED = "rejected" class ValidationError Exception def normalize text md content: str - str ⋮---- md content = re.sub r" ", " ", md content Remove markdown bold formatting or for bold md content = re.sub r"\ \ . ? \ \ ", r"\1", md content md content = re.sub r" . ? ", r"\1", md content md content = re.sub r" ", "", md content Remove tags if they exist md content = re.sub r" ", "", md content Remove tags if they exist Remove markdown italics forma… Evidence: `olmocr/bench/tests.py`
- **Buildingnotes Pg1 Repeat1** (documentation): Master - 7 1/4 - 36" Master Bath - 7 1/4 - 30" Laundry - 4 3/4 - 36" Bath - 7 1/4 - 24" MUD - 7 - 36" UTIL - 8 1/4 - 36" DOWN BATH - 7 1/4 - 32" BUT KIT - 6 3/4 - 30 PANTRY - 4 3/4 - 24 6 WEST - 32 9/8 - 32 6 WEST BATH 5" - 24" Evidence: `olmocr/bench/sample_data/olmocr_pipeline/buildingnotes_pg1_repeat1.md`
- **Discoverworld Crazy Table4 Pg1 Repeat1** (documentation): Table 4: Baseline model performance on each of the three scoring metrics task completion, task process, explanatory knowledge discovery across all 24 DISCOVERY WORLD tasks. Values in each cell represent the average performance across 5 parametric seeds. Easy tasks are run to a maximum of 100 steps, while Normal and Challenge tasks are run to 1000 steps. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/discoverworld_crazy_table4_pg1_repeat1.md`
- **Earnings Pg1 Repeat1** (documentation): Recently Issued Accounting Pronouncements Evidence: `olmocr/bench/sample_data/olmocr_pipeline/earnings_pg1_repeat1.md`
- **Ff0F0B22C55D8B90Dd77D153F48E144Fc9Db Pg2 Pg1 Repeat1** (documentation): Lassa Fever in Post-Conflict Sierra Leone Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff0f0b22c55d8b90dd77d153f48e144fc9db_pg2_pg1_repeat1.md`
- **Ff1Fc6A205Ad039139Ce566851B6B260C929 Pg1 Pg1 Repeat1** (documentation): RTG Degradation Primer and Application to MMRTG Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff1fc6a205ad039139ce566851b6b260c929_pg1_pg1_repeat1.md`
- **Ff3D6E051903Fe5Ca9Bc172Ece14964C5632 Pg1 Pg1 Repeat1** (documentation): بررسی دیدگاه و نظرات کتابداران و اعضای هیئت علمی دانشگاه شیراز در بره گیری از فناوری شبکه‌های پی سیم در کتابخانه‌های دانشگاهی Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff3d6e051903fe5ca9bc172ece14964c5632_pg1_pg1_repeat1.md`
- **Ff4F7Dad78081Cff727D19Ab51C181D4A661 Pg1 Pg1 Repeat1** (documentation): Molecular markers of breast cancer metastasis Weigelt, B. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff4f7dad78081cff727d19ab51c181d4a661_pg1_pg1_repeat1.md`
- **Ff518B1240A66978F22035528Ccb029450B5 Pg2 Pg1 Repeat1** (documentation): Prophet of the Jubilee, translated and edited by Ronald D. Dennis Religious Studies Center, Brigham Young University, 1997 Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/ff518b1240a66978f22035528ccb029450b5_pg2_pg1_repeat1.md`
- **Fff590Bed29A2854Ac1F874Dad5752Ede1Aa Pg1 Pg1 Repeat1** (documentation): Lake Shore Cryotronics, Inc. 575 McCorkle Blvd. Westerville, Ohio 43082-8888 USA Evidence: `olmocr/bench/sample_data/olmocr_pipeline/headers_footers/fff590bed29a2854ac1f874dad5752ede1aa_pg1_pg1_repeat1.md`
- **Lincoln Letter Pg1 Repeat1** (documentation): Major General Hitchcock, Commissioner of Exchanges, is authorized and directed to offer Brigadier General Trimble, now a prisoner of war in Fort McHenry, in exchange for Major White, who is held as a prisoner at Richmond. He is also directed to send forward the offer of exchange by Henry M. Warfield, Esq. of Baltimore, under a flag of truce, and give him a pass to City Point. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/lincoln_letter_pg1_repeat1.md`
- **Math 2503 04086 Pg1 Repeat1** (documentation): Proof. Let $S$ be the generating set associated with $D$ as described in Proposition 2.5. By the circulant diagonalization theorem, the spectrum of $G R D = \Gamma R, S $ is the multiset $\{\lambda g\} {g \in R}$ where Evidence: `olmocr/bench/sample_data/olmocr_pipeline/math_2503_04086_pg1_repeat1.md`
- **The 20 Most Important Mathematical Equations** (documentation): The 20 Most Important Mathematical Equations Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mathfuncs_colswitch_pg1_repeat1.md`
- **The 20 Most Important Mathematical Equations** (documentation): The 20 Most Important Mathematical Equations Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mathfuncs_pg1_repeat1.md`
- **Mattsnotes Pg1 Repeat1** (documentation): CodeText: SE, whatever we've scraped Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mattsnotes_pg1_repeat1.md`
- **Mattsnotes Pg2 Repeat1** (documentation): P1: 100% Source code P2: 80% code 20% language Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mattsnotes_pg2_repeat1.md`
- **Mattsnotes Pg3 Repeat1** (documentation): - Pick Arch like OLMO-IB - OR replicate a 3D model - Follow standard LR flow Evidence: `olmocr/bench/sample_data/olmocr_pipeline/mattsnotes_pg3_repeat1.md`
- **Multi Column Miss Pg1 Repeat1** (documentation): stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry’s public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment disruption of its efforts to take its place alongside other industries—often with considerable social credit—in the hope that it might gain by association. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/multi_column_miss_pg1_repeat1.md`
- **Olmo2 Pg4 Pg1 Repeat1** (documentation): Table 1 Composition of the pretraining data for OLMo 2. The OLMo 2 1124 Mix is composed of StarCoder Li et al., 2023b; Kocetkov et al., 2022 , peS2o Soldaini and Lo, 2023 , web text from DCLM Li et al., 2024 and Wiki come from Dolma 1.7 Soldaini et al., 2024 . arXiv comes from Red-Pajama Together AI, 2023 , while OpenWebMath Paster et al., 2023 and Algebraic Stack come from ProofPile II Azerbayev et al., 2023 . Evidence: `olmocr/bench/sample_data/olmocr_pipeline/olmo2-pg4_pg1_repeat1.md`
- **Openstax Caculus Pg 273 Pg1 Repeat1** (documentation): For the following exercises, the given functions represent the position of a particle traveling along a horizontal line. Evidence: `olmocr/bench/sample_data/olmocr_pipeline/openstax_caculus_pg_273_pg1_repeat1.md`
- **Small Page Size Pg1 Repeat1** (documentation): any—was very trifling. Since the use of bones has, however, become general, the turnip crop has been, in many instances, ten-fold, and in few less than four or five-fold its former bulk. All the succeeding crops of grain and seeds have been amazingly increased, and, upon the four or five-shift system, there is no doubt the land will go on progressively improving, requiring a less quantity of bones annually, from its increased fertility and power." Evidence: `olmocr/bench/sample_data/olmocr_pipeline/small_page_size_pg1_repeat1.md`
- **Let's not copy any bash scripts from the scripts folder over, otherwise trashing the docker image too much with recent…** (source_file): .git .github .mypy cache .pytest cache .venv pycache .egg-info Evidence: `.dockerignore`
- **ml stuff** (source_file): ml stuff wandb/ histogram.png .json dolma previews/ s2 previews/ gnarly previews/ s2orc previews/ s2orc previews 3200/ sample200 vllm/ sample200 sglang/ pdelfin testset/ localworkspace/ math data/ math data big/ gpt4otestset/ old train/ gpt4otestset output/ pdfs/ olmOCR-bench/ table data / /synth / dolma samples/ old train/ filtered items/ filtered items prefilter/ augraphy cache/ / .html html templates / olmocr-synthmix / scoreelo.csv debug.log birrpipeline-debug.log beakerpipeline-debug.log olmocr-pipeline-debug.log Evidence: `.gitignore`
- **.Readthedocs** (source_file): version: 2 sphinx: configuration: docs/source/conf.py fail on warning: true python: version: "3.8" install: - method: pip path: . extra requirements: - dev Evidence: `.readthedocs.yaml`
- **Workaround for installing fonts, which are needed for good rendering of documents** (source_file): ENV PYTHON VERSION=3.12 ENV CUSTOM PY="/usr/bin/python${PYTHON VERSION}" Evidence: `Dockerfile`
- **Makefile** (source_file): .PHONY : docs docs : rm -rf docs/build/ sphinx-autobuild -b html --watch olmocr/ docs/source/ docs/build/ Evidence: `Makefile`
- **Test Graphical Text Pg1 Repeat1** (documentation): THE POWER OF STORYTELLING FOR LEADERS ดร.วิทย์ สิทธิเวคิน Evidence: `olmocr/bench/sample_data/olmocr_pipeline/test-graphical-text_pg1_repeat1.md`
- **Check** (source_file): logger = logging.getLogger name def check poppler version ⋮---- result = subprocess.run "pdftoppm", "-h" , stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True ⋮---- def check sglang version def check torch gpu available min gpu memory: int = 15 1024 3 ⋮---- gpu memory = torch.cuda.get device properties 0 .total memory Evidence: `olmocr/check.py`
- **Run img2pdf with all images as arguments** (source_file): def convert image to pdf bytes image files: Union str, List str - bytes ⋮---- image files = image files ⋮---- Run img2pdf with all images as arguments result = subprocess.run "img2pdf" + image files, check=True, capture output=True ⋮---- def is png file path ⋮---- header = f.read 8 ⋮---- def is jpeg file path ⋮---- header = f.read 2 Evidence: `olmocr/image_utils.py`
- **Sort metrics alphabetically for consistency** (source_file): class MetricsKeeper ⋮---- def init self, window=60 5 def add metrics self, kwargs ⋮---- current time = time.time ⋮---- def str self ⋮---- elapsed time = current time - self.start time window time = min self.window, elapsed time if elapsed time 0 else 1 header = f"{'Metric Name': 25} {'Recently tokens/sec ': 25}" separator = "-" len header lines = header, separator Sort metrics alphabetically for consistency ⋮---- total = self.total metrics key window = self.window sum.get key, 0 total rate = total / elapsed time if elapsed time 0 else 0 window rate = window / window time if window time 0 else 0 line = f"{key: 25.2f} {window rate: 25.2f}" ⋮---- def get total metrics self def get metrics summ… Evidence: `olmocr/metrics.py`
- **Normalize all whitespace to single spaces** (source_file): class RepeatDetector ⋮---- def init self, max ngram size: int = 10 def add letters self, new str: str def ngram repeats self - list int ⋮---- result = 0 self.max ngram size ⋮---- Normalize all whitespace to single spaces text = re.sub r"\s+", " ", self.data For each n-gram size ⋮---- Get the last n-gram target = text -size: Count backwards from the end to find repeats count = 0 pos = len text - size Start position for previous n-gram ⋮---- pos -= size Move back by the size of the n-gram ⋮---- class RepeatDetectorTest unittest.TestCase ⋮---- def test basicTest1 self ⋮---- d = RepeatDetector max ngram size=3 ⋮---- def test basicTest2 self def test longer sequence self def test no repeats self… Evidence: `olmocr/repeatdetect.py`
- For the remaining 16 evidence items, see `AI_CONTEXT_PACK.json` or `EVIDENCE_INDEX.json`.

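## Rules the host AI must follow

- **Treat this asset as pre-work context, not as a runtime environment.** The AI Context Pack contains only an evidence-backed understanding of the project, not the target project's executable state. Evidence: `README.md`, `olmocr/bench/README.md`, `olmocr/train/README.md`
- **When answering users, distinguish what can be previewed from what can only be verified after installation.** The value of a pre-install experience lies in reducing mistaken installs and misjudgments, not in posing as a real run. Evidence: `README.md`, `olmocr/bench/README.md`, `olmocr/train/README.md`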

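## Questions the user should answer before starting

- In which host AI or local environment do you plan to use it?
- Do you only want to try out the workflow first, or are you preparing for a real installation?
- What matters most to you: installation cost, output quality, or conflicts with your existing rules?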

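## Acceptance criteria

- Every capability claim can be traced back to a file path in evidence_refs.
- AI_CONTEXT_PACK.md does not present a preview as a real run.
- The user can understand within 3 minutes who the project suits, what it can do, how to start, and where the risk boundaries are.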

---

## Doramagic Context Augmentation

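The content below reinforces the main Repomix/AI Context Pack body. The Human Manual provides only a reading skeleton; the pitfall log is converted into working constraints that the host AI must follow.

## Human Manual skeleton

Usage rule: this is only a reading route through the project and a set of salience signals, not a factual authority. Specific facts must still be traced back to repo evidence / Claim Graph.

Hard rules for the host AI:
- Do not treat page titles, section order, summaries, or importance as evidence of project facts.
- When explaining the Human Manual skeleton, state explicitly that it is only a reading route / salience signal.
- Judgments about capabilities, installation, compatibility, runtime status, and risk must cite repo evidence, a source path, or the Claim Graph.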

- **Project overview and installation guide**: importance `high`
  - source_paths: README.md, Dockerfile, Dockerfile.with-model, pyproject.toml, docs/source/installation.md
- **Inference pipeline and cluster execution (Pipeline)**: importance `high`
  - source_paths: olmocr/pipeline.py, olmocr/prompts/prompts.py, olmocr/prompts/anchor.py, olmocr/work_queue.py, olmocr/datatypes.py
- **olmOCR-Bench evaluation suite**: importance `high`
  - source_paths: olmocr/bench/benchmark.py, olmocr/bench/README.md, olmocr/bench/tests.py, olmocr/bench/runners/run_server.py, olmocr/bench/runners/run_olmocr_pipeline.py
- **Training, data preparation, and synthetic data**: importance `medium`
  - source_paths: olmocr/train/train.py, olmocr/train/grpo_train.py, olmocr/train/dataloader.py, olmocr/train/config.py, olmocr/train/muon.py

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `f7cfe4c22098b154c76b6ec950d1c0a464eecf8d`
- inspected_files: `Dockerfile`, `pyproject.toml`, `README.md`, `docs/source/overview.md`, `docs/source/CHANGELOG.md`, `docs/source/conf.py`, `docs/source/index.md`, `docs/source/CONTRIBUTING.md`, `docs/source/installation.md`

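Hard rules for the host AI:
- Without repo_clone_verified=true, do not claim to have read the source code.
- Without repo_inspection_verified=true, do not present judgments based on README/docs/package files as facts.
- Without quick_start_verified=true, do not claim that the Quick Start has been run successfully.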
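
The three flag rules above amount to a small gate on what a host AI may say. The following is a minimal sketch of that gate, assuming the flags are available as booleans; the `VerificationFlags` class, the claim-kind names, and `allowed_claims` are hypothetical and are not part of olmocr or of any Doramagic tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationFlags:
    """Verification flags; field names mirror the rules above, defaults are conservative."""
    repo_clone_verified: bool = False
    repo_inspection_verified: bool = False
    quick_start_verified: bool = False


# Hypothetical claim kinds, each mapped to the flag it requires.
REQUIRED_FLAG = {
    "read_source": "repo_clone_verified",
    "state_docs_as_fact": "repo_inspection_verified",
    "quick_start_ran": "quick_start_verified",
}


def allowed_claims(flags: VerificationFlags) -> set[str]:
    """Return the claim kinds that the given flags permit."""
    return {kind for kind, flag in REQUIRED_FLAG.items() if getattr(flags, flag)}


# This pack records the first two flags as true and does not record the third,
# so the third is left at its conservative default of False.
pack_flags = VerificationFlags(repo_clone_verified=True, repo_inspection_verified=True)
print(sorted(allowed_claims(pack_flags)))
```

Defaulting every flag to `False` means a missing flag is treated the same as an explicit `false`, which matches the "without ...=true" wording of the rules.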

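## Doramagic Pitfall Constraints

These rules come from project-specific pitfalls that Doramagic found while discovering, validating, or compiling the project. The host AI must treat them as working constraints, not as ordinary explanatory text.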

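### Constraint 1: Source evidence: olmocr.bench scoring: `partial_ratio` falsely matches when candidate is near-empty (e.g. single `\n`)

- Trigger: GitHub community evidence shows an installation-related issue pending verification in this project: olmocr.bench scoring: `partial_ratio` falsely matches when candidate is near-empty (e.g. single `\n`)
- Why it matters: May increase the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/461 | The source discussion mentions Python-related conditions, which should be rechecked before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.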
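
To make the failure mode in the issue title concrete, here is a minimal sketch of why a partial-match score can be fooled by a near-empty candidate. It does not use olmocr's scoring code or the library function the issue refers to: `naive_partial_ratio` is a simplified stand-in built on the standard library's `difflib`, and `guarded_score` with its `min_chars` threshold is a hypothetical user-side check, not a fix that exists in the project. The reference string is made up.

```python
from difflib import SequenceMatcher


def naive_partial_ratio(a: str, b: str) -> float:
    """Best similarity of the shorter string against any same-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0.0
    width = len(shorter)
    return max(
        SequenceMatcher(None, shorter, longer[i:i + width]).ratio()
        for i in range(len(longer) - width + 1)
    )


def guarded_score(candidate: str, reference: str, min_chars: int = 10) -> float:
    """Score 0.0 for candidates with too little non-whitespace content (hypothetical threshold)."""
    if len(candidate.strip()) < min_chars:
        return 0.0
    return naive_partial_ratio(candidate, reference)


reference = "Title line\nBody text of the page"  # made-up reference text
print(naive_partial_ratio("\n", reference))  # a lone newline matches one window exactly
print(guarded_score("\n", reference))        # the guard refuses to score it
```

The sketch only illustrates the mechanism named in the issue title; whether and how the project addresses it still has to be checked against the issue itself.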

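### Constraint 2: Source evidence: configurable timeout for HTTP client in server method

- Trigger: GitHub community evidence shows a configuration-related issue pending verification in this project: configurable timeout for HTTP client in server method
- Why it matters: May increase the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/455 | A usage condition pending verification, exposed by source type github_issue.
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.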

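### Constraint 3: Capability judgment depends on assumptions

- Trigger: README/documentation is current enough for a first validation pass.
- Host AI rule: Turn the assumptions into a downstream verification checklist.
- Why it matters: If the assumptions do not hold, the user will not get the promised capability.
- Evidence: capability.assumptions | github_repo:858798469 | https://github.com/allenai/olmocr | README/documentation is current enough for a first validation pass.
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.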

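### Constraint 4: Source evidence: Fail to parse b4c3c4ac3d6f7b52a993cec7ca8b3ad43cecabad_page_3.pdf

- Trigger: GitHub community evidence shows a maintenance/version-related issue pending verification in this project: Fail to parse b4c3c4ac3d6f7b52a993cec7ca8b3ad43cecabad_page_3.pdf
- Why it matters: May increase the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/463 | The source discussion mentions Python-related conditions, which should be rechecked before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.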

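### Constraint 5: Source evidence: Model allenai/olmOCR-2-7B-1025 on DeepInfra will be deprecated on 2026-05-07

- Trigger: GitHub community evidence shows a maintenance/version-related issue pending verification in this project: Model allenai/olmOCR-2-7B-1025 on DeepInfra will be deprecated on 2026-05-07
- Why it matters: May affect upgrade, migration, or version selection.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/460 | A usage condition pending verification, exposed by source type github_issue.
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.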

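### Constraint 6: Maintenance activity unknown

- Trigger: last_activity_observed was not recorded.
- Host AI rule: Add signals from recent GitHub commits, releases, and issue/PR responses.
- Why it matters: New, abandoned, and active projects get mixed together, which lowers trust in recommendations.
- Evidence: evidence.maintainer_signals | github_repo:858798469 | https://github.com/allenai/olmocr | last_activity_observed missing
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.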

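### Constraint 7: Downstream validation risk item: no_demo

- Trigger: no_demo
- Evidence: downstream_validation.risk_items | github_repo:858798469 | https://github.com/allenai/olmocr | no_demo; severity=medium
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.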

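### Constraint 8: Scoring risk present

- Trigger: no_demo
- Why it matters: The risk affects whether the project is suitable for ordinary users to install.
- Evidence: risks.scoring_risks | github_repo:858798469 | https://github.com/allenai/olmocr | no_demo; severity=medium
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.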

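### Constraint 9: Source evidence: Writing markdown error : 'gbk' codec can't encode character '\u1eca' in position 3419

- Trigger: GitHub community evidence shows a security/permission-related issue pending verification in this project: Writing markdown error : 'gbk' codec can't encode character '\u1eca' in position 3419
- Why it matters: May affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/459 | The source discussion mentions Python-related conditions, which should be rechecked before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.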
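
The error text in the issue title matches general Python behaviour: when `open()` is called in text mode without an `encoding` argument, Python falls back to the locale's preferred encoding, which can be GBK on some Windows setups. The sketch below shows only that general behaviour and a user-side way to avoid it in one's own code. It makes no claim about where olmocr writes Markdown or whether the issue has been addressed; the file name `page.md` and the sample text are made up.

```python
import os
import tempfile

text = "Sample markdown with \u1eca"  # U+1ECA is the character named in the issue title

# GBK has no mapping for this character, so encoding fails. This is the same
# failure an implicit GBK locale default would hit when writing the text.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
print("gbk can encode:", gbk_ok)

# Passing encoding="utf-8" explicitly makes the write independent of the locale default.
path = os.path.join(tempfile.mkdtemp(), "page.md")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
print("round trip ok:", round_trip == text)
```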

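### Constraint 10: Source evidence: [bug] badly formed help string

- Trigger: GitHub community evidence shows a security/permission-related issue pending verification in this project: [bug] badly formed help string
- Why it matters: May affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/451 | The source discussion mentions Python-related conditions, which should be rechecked before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or negligible unless later verification evidence clearly shows it has been closed.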
