# markitdown - Doramagic AI Context Pack

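> Positioning: a pre-install preview and decision asset. It gives the host AI a good starting point, but it does not mean the target project has been installed, executed, or verified.

## Sufficiency Principle

- **Sufficiency, not compression**: An AI Context Pack should be complete enough for the host AI to understand the project's value, capability boundaries, entry points, risks, and evidence sources before starting work. It can be organized in layers, but the shortest possible summary is not the goal.
- **Compression strategy**: Compress only noise and repetition; never compress context that affects judgment or the quality of the start.

## How the Host AI Should Use This

You are reading the AI Context Pack that Doramagic compiled for markitdown. Treat it as pre-work context: help the user understand who it suits, what it can do, how to start, what must be verified after installation, and where the risks are. Do not claim that you have installed, run, or executed the target project.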

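## Claim Consumption Rules

- **Source of facts**: Repo Evidence + Claim/Evidence Graph; the Human Wiki supplies only salience, terminology, and narrative structure.
- **Minimum status for a fact**: `supported`
- `supported`: may be used as a project fact, but the answer must cite the claim_id and evidence path.
- `weak`: only a low-confidence lead; the user must be asked to verify further.
- `inferred`: may be used only for risk notes or open questions, never presented as a project fact.
- `unverified`: must not be used as a fact; state explicitly that the evidence is insufficient.
- `contradicted`: show the conflicting sources; do not pick one version on the user's behalf.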
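
The status rules above can be sketched as a small gate that decides how a statement may be presented. This is a minimal illustration, not part of the pack itself: the `Claim` dataclass, `usable_as_fact`, and `render` are hypothetical names, and the output wording is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimStatus(Enum):
    SUPPORTED = "supported"
    WEAK = "weak"
    INFERRED = "inferred"
    UNVERIFIED = "unverified"
    CONTRADICTED = "contradicted"


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; field names are illustrative only."""
    claim_id: str
    status: ClaimStatus
    evidence_paths: tuple[str, ...] = ()


def usable_as_fact(claim: Claim) -> bool:
    """Only `supported` claims that carry an evidence path count as facts."""
    return claim.status is ClaimStatus.SUPPORTED and bool(claim.evidence_paths)


def render(claim: Claim, text: str) -> str:
    """Wrap a statement according to the consumption rule for its status."""
    if usable_as_fact(claim):
        paths = ", ".join(claim.evidence_paths)
        return f"{text} [{claim.claim_id}; evidence: {paths}]"
    if claim.status is ClaimStatus.WEAK:
        return f"Low-confidence lead, please verify: {text}"
    if claim.status is ClaimStatus.INFERRED:
        return f"Risk or open question (not a project fact): {text}"
    if claim.status is ClaimStatus.CONTRADICTED:
        return f"Sources conflict, both versions must be shown: {text}"
    # `unverified`, or `supported` without any evidence path to cite.
    return f"Insufficient evidence: {text}"
```

For example, `render(Claim("clm_0019", ClaimStatus.SUPPORTED, ("README.md",)), "Targets host-AI users")` keeps the statement and appends the claim id and evidence path, while an `inferred` claim is downgraded to a risk note.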

## Who It Suits Best

- **Developers already using a host AI such as Claude/Codex/Cursor/Gemini**: the README or plugin configuration mentions multiple host AIs. Evidence: `README.md` Claim: `clm_0019` supported 0.86

## What It Can Do

- **PDF_Conversion** (previewable before install): Converts PDF files to Markdown, extracting text, tables, and images with fallback handling for malformed PDFs. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown/src/markitdown/converters/_pdf_converter.py`, `packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Word_DOCX_Conversion** (previewable before install): Converts Microsoft Word .docx files to Markdown, preserving document structure including headings, lists, tables, and images. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown/src/markitdown/converters/_docx_converter.py`, `packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **PowerPoint_PPTX_Conversion** (previewable before install): Converts PowerPoint .pptx presentations to Markdown with slide titles, content, notes, tables, and chart data extraction. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Excel_XLSX_Conversion** (previewable before install): Converts Excel .xlsx and .xls files to Markdown tables with per-sheet output and embedded image OCR support. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Image_Conversion** (previewable before install): Converts images to Markdown with EXIF metadata extraction and optional LLM-based caption generation. Evidence: `packages/markitdown/src/markitdown/converters/_image_converter.py` Claim: `clm_0005` supported 0.86
- **Audio_Transcription** (previewable before install): Transcribes audio files (WAV, MP3, MP4) to text using speech recognition, with EXIF metadata extraction. Evidence: `packages/markitdown/src/markitdown/converters/_audio_converter.py`, `packages/markitdown/pyproject.toml` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **HTML_Conversion** (previewable before install): Converts HTML documents and web pages to clean Markdown, preserving links, lists, and structure. Evidence: `packages/markitdown/src/markitdown/converters/_html_converter.py` Claim: `clm_0007` supported 0.86
- **Text_Formats_Conversion** (previewable before install): Converts text-based formats (CSV, JSON, XML) to Markdown with structured output for tabular data. Evidence: `packages/markitdown/src/markitdown/converters/_csv_converter.py`, `README.md` Claim: `clm_0008` supported 0.86, `clm_0010` supported 0.86, `clm_0011` supported 0.86, `clm_0018` supported 0.86
- **Epub_Conversion** (previewable before install): Converts EPUB ebooks to Markdown, extracting metadata and chapter content while preserving structure. Evidence: `packages/markitdown/src/markitdown/converters/_epub_converter.py` Claim: `clm_0009` supported 0.86
- **YouTube_Transcription** (previewable before install): Fetches and converts YouTube video transcripts to Markdown text. Evidence: `packages/markitdown/pyproject.toml`, `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **ZIP_Archive_Processing** (previewable before install): Recursively processes ZIP archives, converting each contained file to Markdown and combining results. Evidence: `README.md` Claim: `clm_0008` supported 0.86, `clm_0010` supported 0.86, `clm_0011` supported 0.86, `clm_0018` supported 0.86
- **Jupyter_Notebook_Conversion** (previewable before install): Converts Jupyter .ipynb notebooks to Markdown, preserving code cells and outputs. Evidence: `packages/markitdown/src/markitdown/converters/_ipynb_converter.py` Claim: `clm_0012` supported 0.86
- **Azure_Document_Intelligence_Integration** (requires post-install verification): Optional integration with Azure Document Intelligence for enhanced document analysis with OCR and formula extraction. Evidence: `packages/markitdown/src/markitdown/converters/_doc_intel_converter.py`, `packages/markitdown/pyproject.toml` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Azure_Content_Understanding_Integration** (requires post-install verification): Optional integration with Azure Content Understanding service for multi-modal document analysis. Evidence: `packages/markitdown/src/markitdown/converters/_cu_converter.py`, `packages/markitdown/pyproject.toml` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Plugin_Architecture** (previewable before install): Extensible plugin system using entry points that allows registering custom converters with priority-based selection. Evidence: `packages/markitdown/src/markitdown/_markitdown.py`, `packages/markitdown-sample-plugin/pyproject.toml`, `packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py` Claim: `clm_0015` supported 0.86
- **LLM_Vision_OCR_Plugin** (requires post-install verification): OCR plugin using LLM Vision to extract text from images embedded in PDF, DOCX, PPTX, and XLSX files. Evidence: `packages/markitdown-ocr/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py`, `packages/markitdown-ocr/src/markitdown_ocr/_plugin.py`, `packages/markitdown-ocr/README.md` Claim: `clm_0016` supported 0.86
- **MCP_Server_Interface** (previewable before install): Model Context Protocol server exposing the convert_to_markdown tool via STDIO, HTTP, and SSE transports. Evidence: `packages/markitdown-mcp/src/markitdown_mcp/__main__.py`, `packages/markitdown-mcp/pyproject.toml`, `packages/markitdown-mcp/README.md` Claim: `clm_0017` supported 0.86
- **Command_Line_Interface** (previewable before install): Command-line utility for converting files to Markdown with support for plugins, streaming, and output redirection. Evidence: `packages/markitdown/src/markitdown/__main__.py`, `packages/markitdown/pyproject.toml`, `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.

## How to Start

- `git clone git@github.com:microsoft/markitdown.git` Evidence: `README.md` Claim: `clm_0020` supported 0.86
- `pip install -e 'packages/markitdown[all]'` Evidence: `README.md` Claim: `clm_0021` supported 0.86, `clm_0027` supported 0.86
- `pip install 'markitdown[pdf, docx, pptx]'` Evidence: `README.md` Claim: `clm_0022` supported 0.86
- `pip install markitdown-ocr` Evidence: `README.md` Claim: `clm_0023` supported 0.86
- `pip install openai  # or any OpenAI-compatible client` Evidence: `README.md` Claim: `clm_0024` supported 0.86
- `pip install hatch  # Other ways of installing hatch: https://hatch.pypa.io/dev/install/` Evidence: `README.md` Claim: `clm_0025` supported 0.86
- `pip install markitdown[all]` Evidence: `packages/markitdown/README.md` Claim: `clm_0026` supported 0.86
- `pip install -e packages/markitdown[all]` Evidence: `packages/markitdown/README.md` Claim: `clm_0021` supported 0.86, `clm_0027` supported 0.86
- `pip install markitdown-mcp` Evidence: `packages/markitdown-mcp/README.md` Claim: `clm_0028` supported 0.86
- `npx @modelcontextprotocol/inspector` Evidence: `packages/markitdown-mcp/README.md` Claim: `clm_0029` supported 0.86

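## Pre-Continue Decision Card

- **Current recommendation**: start with a permission-sandboxed trial
- **Why**: the project has install commands, host configuration, or signs of local writes. Going straight into your main environment is not advised; try installing in an isolated environment first.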
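
One common way to get an isolated trial is a throwaway virtual environment. The sketch below only builds the commands and does not run them, so the user can review each one first. The function name `sandbox_commands` and the `.venv-trial` directory are hypothetical; `python -m venv` and `python -m pip install` are standard Python tooling, and `markitdown[all]` is the extra named in the package README.

```python
import sys
from pathlib import Path


def sandbox_commands(workdir: Path, package: str = "markitdown[all]") -> list[list[str]]:
    """Return the commands for an isolated trial install, without executing them."""
    venv_dir = workdir / ".venv-trial"  # hypothetical directory name
    on_windows = sys.platform == "win32"
    bin_dir = venv_dir / ("Scripts" if on_windows else "bin")
    venv_python = bin_dir / ("python.exe" if on_windows else "python")
    return [
        # 1. Create the virtual environment with the current interpreter.
        [sys.executable, "-m", "venv", str(venv_dir)],
        # 2. Install into that environment only, never the global site-packages.
        [str(venv_python), "-m", "pip", "install", package],
    ]


if __name__ == "__main__":
    for command in sandbox_commands(Path.cwd()):
        print(" ".join(command))
```

Printing rather than executing keeps the decision with the user; deleting the `.venv-trial` directory afterwards removes the installed packages, though it does not undo anything a package may have written elsewhere.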

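### 30-Second Decision

- **What to do now**: start with a permission-sandboxed trial
- **Minimum safe next step**: run the Prompt Preview first; if you still want to install, do it only in an isolated environment
- **Don't trust yet**: tool permission boundaries cannot be trusted before installation.
- **Continuing will touch**: command execution, the local environment or project files, and the host AI context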

### What You Can Trust Now

- **Audience signal: developers already using a host AI such as Claude/Codex/Cursor/Gemini** (supported): backed by a supported claim or project evidence, but this still does not equal real install results. Evidence: `README.md` Claim: `clm_0019` supported 0.86
- **Capability present: PDF_Conversion** (supported): you can trust that the project contains signs of this capability; whether it fits your specific task still needs a trial or post-install verification. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown/src/markitdown/converters/_pdf_converter.py`, `packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86
- **Capability present: Word_DOCX_Conversion** (supported): you can trust that the project contains signs of this capability; whether it fits your specific task still needs a trial or post-install verification. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown/src/markitdown/converters/_docx_converter.py`, `packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86
- **Capability present: PowerPoint_PPTX_Conversion** (supported): you can trust that the project contains signs of this capability; whether it fits your specific task still needs a trial or post-install verification. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86
- **Capability present: Excel_XLSX_Conversion** (supported): you can trust that the project contains signs of this capability; whether it fits your specific task still needs a trial or post-install verification. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86
- **Capability present: Image_Conversion** (supported): you can trust that the project contains signs of this capability; whether it fits your specific task still needs a trial or post-install verification. Evidence: `packages/markitdown/src/markitdown/converters/_image_converter.py` Claim: `clm_0005` supported 0.86

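### What You Cannot Trust Yet

- **Tool permission boundaries cannot be trusted before installation.** (unverified): MCP/tool projects typically touch files, the network, the browser, or external APIs; permissions and logs must be checked for real.
- **Real output quality cannot be trusted before installation.** (unverified): the Prompt Preview only shows how guidance works; it cannot prove result quality on a real project.
- **Host AI version compatibility cannot be trusted before installation.** (unverified): loading rules and version differences across hosts such as Claude, Cursor, Codex, and Gemini must be verified in a real environment.
- **That it will not contaminate existing host AI behavior cannot be taken on trust.** (inferred): Skill, plugin, and AGENTS/CLAUDE/GEMINI instructions may change the host AI's default behavior.
- **Safe rollback cannot be assumed.** (unverified): unless the project explicitly provides uninstall and restore instructions, verify in an isolated environment first.
- **After a real install, is it compatible with the user's current host AI version?** (unverified): compatibility can only be verified in the actual host environment.
- **Does the project's output quality meet the user's specific task?** (unverified): a pre-install preview can only show the workflow and boundaries; it cannot replace a real evaluation.
- **Do the install commands need network access, permissions, or global writes?** (unverified): this affects install risk in both enterprise and personal environments. Evidence: `README.md`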

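### What Continuing Will Touch

- **Command execution**: package managers, network downloads, local plugin directories, project configuration, or the user's home directory. Reason: running even the first command can change the environment, so decide whether it is worth running first. Evidence: `README.md`, `packages/markitdown-mcp/README.md`, `packages/markitdown-ocr/README.md`, `packages/markitdown/README.md`
- **Local environment or project files**: install artifacts, plugin caches, project configuration, or local dependency directories. Reason: the write scope and rollback method cannot be proven before installation, so isolated verification is needed. Evidence: `packages/markitdown-ocr/README.md`, `packages/markitdown-ocr/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py`, `packages/markitdown-ocr/src/markitdown_ocr/_plugin.py`, etc.
- **Host AI context**: the AI Context Pack, Prompt Preview, Skill routing, risk rules, and project facts. Reason: imported context influences the host AI's later judgments, so unverified items must not be presented as facts.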

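### Minimum Safe Next Step

- **Run the Prompt Preview first**: use the interactive pre-install trial to judge whether the way of working fits; it needs no authorization and changes nothing in the environment. (Applies to: any project, especially when output quality is unknown.)
- **Trial-install only in an isolated directory or test account**: keep install commands from contaminating your main host AI, real projects, or home directory. (Applies to: any sign of command execution, plugin configuration, or local writes.)
- **After installing, verify just one minimal task**: check loading, compatibility, output quality, and rollback before deciding whether to go deeper. (Applies to: moving from trial into a real workflow.)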
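
A single minimal task after a trial install could look like the sketch below. It assumes the Python API shown in the project README at the time of writing (`MarkItDown().convert(path).text_content`); confirm that against the version you actually installed. The `smoke_test` function and its optional `converter` parameter are hypothetical, added so the check can be exercised with a stand-in object before anything is installed.

```python
from typing import Any, Optional


def smoke_test(path: str, converter: Optional[Any] = None) -> str:
    """Convert one file and return its Markdown, failing loudly on empty output."""
    if converter is None:
        # Third-party import: this only works inside an environment where
        # markitdown has really been installed.
        from markitdown import MarkItDown

        converter = MarkItDown()
    markdown = converter.convert(path).text_content
    if not markdown.strip():
        raise ValueError(f"conversion of {path!r} produced no text")
    return markdown
```

Running it on one representative file from your own task exercises loading and basic output in a single step; it says nothing about output quality on other formats, performance, or rollback, which still need their own checks.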

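### Exit Plan

- **Preserve the pre-install state**: record the original host configuration and project state so you can later tell whether recovery is possible.
- **Record install commands and write paths**: without explicit uninstall instructions, you at least need to know which directories or configuration must be cleaned up by hand.
- **No rollback path, no main environment**: being unable to roll back is a blocker before continuing; do not proceed on trust or luck.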

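## What Can Only Be Previewed

- Explaining who the project suits and what it can do
- Demonstrating a typical conversation flow based on the project documentation
- Helping the user judge whether it is worth installing or researching further

## What Must Be Verified After Installation

- Actually installing the Skill, plugin, or CLI
- Executing scripts, modifying local files, or accessing external services
- Verifying real output quality, performance, and compatibility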

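## Boundary and Risk Card

- **Mistaking the pre-install preview for a real run**: the user may overestimate how much configuration, permission, and compatibility verification has already been done. Mitigation: clearly distinguish prompt_preview_can_do from runtime_required. Claim: `clm_0033` inferred 0.45
- **Command execution modifies the local environment**: install commands may write to the user's home directory, the host plugin directory, or project configuration. Mitigation: run them first in an isolated environment or test account. Evidence: `README.md`, `packages/markitdown-mcp/README.md`, `packages/markitdown-ocr/README.md`, `packages/markitdown/README.md` Claim: `clm_0034` supported 0.86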
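- **To confirm**: After a real install, is it compatible with the user's current host AI version? Reason: compatibility can only be verified in the actual host environment.
- **To confirm**: Does the project's output quality meet the user's specific task? Reason: a pre-install preview can only show the workflow and boundaries; it cannot replace a real evaluation.
- **To confirm**: Do the install commands need network access, permissions, or global writes? Reason: this affects install risk in both enterprise and personal environments.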

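## Pre-Work Context

### Load Order

- Read how_to_use.host_ai_instruction first to establish the boundaries of this pre-install decision asset.
- Read claim_graph_summary to confirm that facts come from the Claim/Evidence Graph, not the Human Wiki narrative.
- Then read intended_users, capabilities, and quick_start_candidates to judge whether the user is a match.
- When a specific task needs to be carried out, check role_skill_index first, then evidence_index.
- For real installation, file modification, network access, performance, or compatibility questions, switch to risk_card and boundaries.runtime_required.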
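
The unconditional part of this order (the first three steps) can be sketched as a walk over dotted keys. The nested-dict shape of the pack is an assumption made for illustration; the key names come from the list above, while `lookup` and `iter_sections` are hypothetical helpers.

```python
from typing import Any, Iterator

# Sections read up front, in order. The role/evidence indexes and the risk
# card are consulted on demand, so they are not part of this fixed sequence.
LOAD_ORDER = [
    "how_to_use.host_ai_instruction",
    "claim_graph_summary",
    "intended_users",
    "capabilities",
    "quick_start_candidates",
]


def lookup(pack: dict, dotted: str) -> Any:
    """Resolve a dotted key such as 'how_to_use.host_ai_instruction'."""
    node: Any = pack
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def iter_sections(pack: dict) -> Iterator[tuple[str, Any]]:
    """Yield (key, value) for each present section, in the prescribed order."""
    for dotted in LOAD_ORDER:
        value = lookup(pack, dotted)
        if value is not None:
            yield dotted, value
```

Missing sections are skipped rather than guessed at, which matches the pack's rule of not filling in facts when evidence is absent.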

### Task Routing

- **PDF_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown/src/markitdown/converters/_pdf_converter.py`, `packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Word_DOCX_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown/src/markitdown/converters/_docx_converter.py`, `packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **PowerPoint_PPTX_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Excel_XLSX_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Image_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/converters/_image_converter.py` Claim: `clm_0005` supported 0.86
- **Audio_Transcription**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/converters/_audio_converter.py`, `packages/markitdown/pyproject.toml` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **HTML_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/converters/_html_converter.py` Claim: `clm_0007` supported 0.86
- **Text_Formats_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/converters/_csv_converter.py`, `README.md` Claim: `clm_0008` supported 0.86, `clm_0010` supported 0.86, `clm_0011` supported 0.86, `clm_0018` supported 0.86
- **Epub_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/converters/_epub_converter.py` Claim: `clm_0009` supported 0.86
- **YouTube_Transcription**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/pyproject.toml`, `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **ZIP_Archive_Processing**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `README.md` Claim: `clm_0008` supported 0.86, `clm_0010` supported 0.86, `clm_0011` supported 0.86, `clm_0018` supported 0.86
- **Jupyter_Notebook_Conversion**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/converters/_ipynb_converter.py` Claim: `clm_0012` supported 0.86
- **Azure_Document_Intelligence_Integration**: first state that this capability requires post-install verification, then give a pre-install checklist. Boundary: must be verified after a real install or run. Evidence: `packages/markitdown/src/markitdown/converters/_doc_intel_converter.py`, `packages/markitdown/pyproject.toml` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Azure_Content_Understanding_Integration**: first state that this capability requires post-install verification, then give a pre-install checklist. Boundary: must be verified after a real install or run. Evidence: `packages/markitdown/src/markitdown/converters/_cu_converter.py`, `packages/markitdown/pyproject.toml` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
- **Plugin_Architecture**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/_markitdown.py`, `packages/markitdown-sample-plugin/pyproject.toml`, `packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py` Claim: `clm_0015` supported 0.86
- **LLM_Vision_OCR_Plugin**: first state that this capability requires post-install verification, then give a pre-install checklist. Boundary: must be verified after a real install or run. Evidence: `packages/markitdown-ocr/pyproject.toml`, `packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py`, `packages/markitdown-ocr/src/markitdown_ocr/_plugin.py`, `packages/markitdown-ocr/README.md` Claim: `clm_0016` supported 0.86
- **MCP_Server_Interface**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown-mcp/src/markitdown_mcp/__main__.py`, `packages/markitdown-mcp/pyproject.toml`, `packages/markitdown-mcp/README.md` Claim: `clm_0017` supported 0.86
- **Command_Line_Interface**: first use role_skill_index / evidence_index to help the user pick an available role, Skill, or workflow. Boundary: pre-install Prompt preview is possible. Evidence: `packages/markitdown/src/markitdown/__main__.py`, `packages/markitdown/pyproject.toml`, `README.md` Claim: `clm_0001` supported 0.86, `clm_0002` supported 0.86, `clm_0003` supported 0.86, `clm_0004` supported 0.86, etc.
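
The routing table above reduces to two groups plus a fallback. The sketch below encodes it as sets; the capability names are taken from the list, while the `route` function, the returned labels, and the choice to send unknown capabilities to a missing-evidence path are illustrative assumptions.

```python
# Capabilities the pack marks as needing a real install or run to verify.
RUNTIME_REQUIRED = frozenset({
    "Azure_Document_Intelligence_Integration",
    "Azure_Content_Understanding_Integration",
    "LLM_Vision_OCR_Plugin",
})

# Capabilities the pack marks as previewable before installation.
PREVIEWABLE = frozenset({
    "PDF_Conversion", "Word_DOCX_Conversion", "PowerPoint_PPTX_Conversion",
    "Excel_XLSX_Conversion", "Image_Conversion", "Audio_Transcription",
    "HTML_Conversion", "Text_Formats_Conversion", "Epub_Conversion",
    "YouTube_Transcription", "ZIP_Archive_Processing",
    "Jupyter_Notebook_Conversion", "Plugin_Architecture",
    "MCP_Server_Interface", "Command_Line_Interface",
})


def route(capability: str) -> str:
    """Map a capability name to the kind of answer the host AI should give."""
    if capability in RUNTIME_REQUIRED:
        # State the post-install boundary first, then give a checklist.
        return "runtime_required"
    if capability in PREVIEWABLE:
        # Pick a role, Skill, or workflow via the indexes.
        return "prompt_preview"
    # Not in the pack at all: treat as missing evidence rather than guessing.
    return "missing_evidence"
```

Keeping the two groups disjoint makes the boundary explicit: a capability is either previewable or needs a real run, never both.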

### Context Scale

- Total files: 108
- Important-file coverage: 40/108
- Evidence index entries: 68
- Role / Skill entries: 15
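
As a quick reading aid, the coverage figure works out to roughly 37% of the repository's files. The helper below is a hypothetical one-liner that only restates the arithmetic on the numbers listed above.

```python
def coverage_percent(covered: int, total: int) -> float:
    """Return covered/total as a percentage, rejecting an empty total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * covered / total


print(f"{coverage_percent(40, 108):.1f}%")  # 37.0%
```

Well under half of the files are covered, so questions about anything outside the indexed evidence should be handled as missing evidence rather than answered from the pack.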

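### Handling Insufficient Evidence

- **missing_evidence**: state that the evidence is insufficient and ask the user for the target file, README section, or post-install verification record; do not fill in facts.
- **out_of_scope_request**: state that the task is outside the evidence scope of this AI Context Pack, and suggest the user consult the Human Manual or verify after a real install.
- **runtime_request**: give a pre-install checklist and the source of each command, but do not run commands for the user or claim they have been run.
- **source_conflict**: show both conflicting sources, mark the item as pending verification, and do not force a choice of one version.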
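
These four cases can be kept as a lookup table so that an unrecognized case fails loudly instead of silently falling through. The policy strings paraphrase the list above; `FALLBACK_POLICIES` and `fallback_response` are hypothetical names.

```python
FALLBACK_POLICIES = {
    "missing_evidence": (
        "Evidence is insufficient. Ask for the target file, README section, "
        "or post-install verification record; do not fill in facts."
    ),
    "out_of_scope_request": (
        "Outside the pack's evidence scope. Suggest the Human Manual or "
        "verification after a real install."
    ),
    "runtime_request": (
        "Give a pre-install checklist and command sources; do not execute "
        "commands or claim they were executed."
    ),
    "source_conflict": (
        "Show both sources of the conflict, mark as pending verification, "
        "and do not pick a version."
    ),
}


def fallback_response(kind: str) -> str:
    """Return the handling policy for a fallback kind, or raise if unknown."""
    try:
        return FALLBACK_POLICIES[kind]
    except KeyError:
        raise ValueError(f"unknown fallback kind: {kind!r}") from None
```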

## Prompt Recipes

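### Fit Assessment

- Goal: judge whether this project fits the user's current task.
- Expected output: fit conclusion, key reasons, evidence citations, what can be previewed before install, what must be verified after install, and suggested next steps.

```text
Based on the AI Context Pack for markitdown, first ask me 3 necessary questions, then judge whether it fits my task. The answer must cover: who it suits, what it can do, what it cannot do, whether it is worth installing, and where the evidence comes from. Every project fact must cite evidence_refs, source_paths, or a claim_id.
```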

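### Pre-Install Preview

- Goal: let the user get a feel for the core workflow before installing, without presenting the preview as real capability or a marketing promise.
- Expected output: a preview script with boundary labels, a post-install verification checklist, and a cautious recommendation; no promises of real execution and no hard-sell wording.

```text
Treat markitdown as a pre-install preview asset, not an installed tool or a real runtime environment.

Output exactly four parts:
1. First ask me 3 necessary questions.
2. Give a "preview script": use the three labels [previewable before install], [must verify after install], and [insufficient evidence] to show how it might guide the workflow.
3. Give a post-install verification checklist: list which capabilities can only be confirmed after a real install, real host loading, and a real project run.
4. Give a cautious recommendation: you may only say "worth further research / a trial install", "provide more information before deciding", or "not recommended to continue"; do not endorse the project.

Hard boundaries:
- Do not claim to have installed, run, executed tests, modified files, or produced real results.
- Do not use promise-like wording such as "adapts automatically", "guaranteed to pass", "perfect fit", or "strongly recommend installing".
- When describing how it works after installation, use a conditional such as "If installation succeeds and the host loads the Skill correctly, it may ...".
- The preview script may only be written as "sample lines / hypothetical flow": use "may ask / may suggest / may show"; do not write "has written, has generated, has passed, is running, is generating".
- The Prompt Preview does not provide install commands; if the user is ready to trial-install, only advise reading the Quick Start and Risk Card first and verifying in an isolated environment.
- Every project fact must come from a supported claim, evidence_refs, or source_paths; inferred/unverified items may only appear as risks or items to confirm.
```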

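### Role / Skill Selection

- Goal: pick the best-matching assets from the project's roles or Skills.
- Expected output: a list of candidate roles or Skills, each with its applicable scenario, evidence path, risk boundary, and whether post-install verification is needed.

```text
Read role_skill_index and recommend the 3-5 roles or Skills most relevant to my target task. For each recommendation, state the applicable scenario, likely output, risk boundary, and evidence_refs.
```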

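### Risk Pre-Check

- Goal: identify environment, permission, rule-conflict, and quality risks before installing or adopting.
- Expected output: a checklist covering environment, permissions, dependencies, licensing, host conflicts, quality risks, and unknowns.

```text
Based on risk_card, boundaries, and quick_start_candidates, give me a pre-install risk checklist. Do not run commands for me; only explain what I should check, why, and what the impact of a failure would be.
```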

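### Host AI Kickoff Instruction

- Goal: turn the project context into an instruction for the host AI before a conversation begins.
- Expected output: a kickoff instruction with clear boundaries and clear evidence citations, ready to paste into the host AI.

```text
Based on the AI Context Pack for markitdown, generate a kickoff instruction that I can paste into the host AI. The instruction must respect not_runtime=true and must not claim that the project has been installed, run, or produced real results.
```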


## Role / Skill Index

- 15 role / Skill / project-doc entries indexed in total.

- **MarkItDown** (project_doc): README badges: PyPI version (https://pypi.org/project/markitdown/), PyPI downloads, Built by AutoGen Team (https://github.com/microsoft/autogen). Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `README.md`
- **MarkItDown-MCP** (project_doc): [!IMPORTANT] The MarkItDown-MCP package is meant for local use, with local trusted agents. In particular, when running the MCP server with Streamable HTTP or SSE, it binds to localhost by default, and is not exposed to other machines on the network or Internet. In this configuration, it is meant to be a direct alternative to the STDIO transport, which may be more convenient in some cases. DO NOT bind the server to ot… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown-mcp/README.md`
- **MarkItDown OCR Plugin** (project_doc): LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown-ocr/README.md`
- **MarkItDown Sample Plugin** (project_doc): README badges: PyPI version (https://pypi.org/project/markitdown-sample-plugin/), PyPI downloads, Built by AutoGen Team (https://github.com/microsoft/autogen). Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown-sample-plugin/README.md`
- **MarkItDown** (project_doc): [!TIP] MarkItDown is a Python package and command-line utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.). For more information, and full documentation, see the project README.md (https://github.com/microsoft/markitdown) on GitHub. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/README.md`
- **THIRD-PARTY SOFTWARE NOTICES AND INFORMATION** (project_doc): THIRD-PARTY SOFTWARE NOTICES AND INFORMATION Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/ThirdPartyNotices.md`
- **Microsoft Open Source Code of Conduct** (project_doc): Microsoft Open Source Code of Conduct Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `CODE_OF_CONDUCT.md`
- **Security** (project_doc): Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include Microsoft (https://github.com/Microsoft), Azure (https://github.com/Azure), DotNet (https://github.com/dotnet), AspNet (https://github.com/aspnet) and Xamarin (https://github.com/xamarin). Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `SECURITY.md`
- **TODO: The maintainer of this repo has not yet edited this file** (project_doc): TODO: The maintainer of this repo has not yet edited this file Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `SUPPORT.md`
- **Medrpt 2024 Pat 3847 Medical Report Scan** (project_doc): Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/tests/test_files/expected_outputs/MEDRPT-2024-PAT-3847_medical_report_scan.md`
- **Receipt 2024 Txn 98765 Retail Purchase** (project_doc): TECHMART ELECTRONICS 4567 Innovation Blvd San Francisco, CA 94103 415 555-0199 Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/tests/test_files/expected_outputs/RECEIPT-2024-TXN-98765_retail_purchase.md`
- **Repair 2022 Inv 001 Multipage** (project_doc): ZAVA AUTO REPAIR Certified Collision Repair 123 Main Street, Redmond, WA 98052 Phone: 425 000-0000 Preliminary Estimate ID: EST-1008 Customer Information Vehicle Information -------------------- ------------------- --- ------------------- ----------------- Insured name Gabriel Diaz Year 2022 Claim SF-1008 Make Jeep Policy POL-2022-555 Model Grand Cherokee Phone 425 111-1111 Trim Limited Email gabriel@contoso.com VIN… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/tests/test_files/expected_outputs/REPAIR-2022-INV-001_multipage.md`
- **Sparse 2024 Inv 1234 Borderless Table** (project_doc): INVENTORY RECONCILIATION REPORT Report ID: SPARSE-2024-INV-1234 Warehouse: Distribution Center East Report Date: 2024-11-15 Prepared By: Sarah Martinez Product Code Location Expected Actual Variance Status ------------ -------- -------- ------ -------- -------- SKU-8847 A-12 450 B-07 289 -23 SKU-9201 780 778 OK C-15 +15 SKU-4563 D-22 156 CRITICAL 180 -24 SKU-7728 A-08 920 935 +15 OK Variance Analysis: Summary Statis… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/tests/test_files/expected_outputs/SPARSE-2024-INV-1234_borderless_table.md`
- **Movie Theater Booking 2024** (project_doc): BOOKING ORDER Print Date 12/15/2024 14:30:22 Page 1 of 1 STARLIGHT CINEMAS Orders Order / Rev: 2024-12-5678 Cinema: Downtown Multiplex ------------ -------------- --- --- ---------------- --- ------------------ Alt Order : SC-WINTER-2024 Primary Contact: Sarah Johnson Product Desc: Holiday Movie Marathon Package Location: NYC-01 Estimate: EST-456 Region: NORTHEAST -------------------- ----------------------- --- ---… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/tests/test_files/expected_outputs/movie-theater-booking-2024.md`
- **Test** (project_doc): Large language models (LLMs) are becoming a crucial building block in developing powerful agents that utilize LLMs for reasoning, tool usage, and adapting to new observations (Yao et al., 2022; Xi et al., 2023; Wang et al., 2023b) in many real-world tasks. Given the expanding tasks that could benefit from LLMs and the growing task complexity, an intuitive approach to scale up the power of agents is to use multiple agent… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `packages/markitdown/tests/test_files/expected_outputs/test.md`
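
The MarkItDown-MCP entry in this index says the server binds to localhost by default and should not be bound to other interfaces without understanding the security implications. A pre-flight check along those lines could look like the sketch below. It uses only the standard `ipaddress` module; `is_loopback_host` is a hypothetical helper and not part of markitdown-mcp.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """Return True only for 'localhost' or a literal loopback IP address."""
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname: assume it is not loopback rather than resolve it.
        return False


for candidate in ("127.0.0.1", "::1", "0.0.0.0", "192.168.1.10"):
    verdict = "local only" if is_loopback_host(candidate) else "review before binding"
    print(f"{candidate}: {verdict}")
```

The check deliberately avoids DNS resolution, so an unknown hostname is flagged for review instead of being trusted.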

## Evidence Index

- 68 evidence entries indexed in total.

- **MarkItDown** (documentation): README badges: PyPI version (https://pypi.org/project/markitdown/), PyPI downloads, Built by AutoGen Team (https://github.com/microsoft/autogen). Evidence: `README.md`
- **MarkItDown-MCP** (documentation): [!IMPORTANT] The MarkItDown-MCP package is meant for local use, with local trusted agents. In particular, when running the MCP server with Streamable HTTP or SSE, it binds to localhost by default, and is not exposed to other machines on the network or Internet. In this configuration, it is meant to be a direct alternative to the STDIO transport, which may be more convenient in some cases. DO NOT bind the server to other interfaces unless you understand the security implications (see the README's security considerations section) of doing so. Evidence: `packages/markitdown-mcp/README.md`
- **MarkItDown OCR Plugin** (documentation): LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files. Evidence: `packages/markitdown-ocr/README.md`
- **MarkItDown Sample Plugin** (documentation): README badges: PyPI version (https://pypi.org/project/markitdown-sample-plugin/), PyPI downloads, Built by AutoGen Team (https://github.com/microsoft/autogen). Evidence: `packages/markitdown-sample-plugin/README.md`
- **MarkItDown** (documentation): [!TIP] MarkItDown is a Python package and command-line utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.). For more information, and full documentation, see the project README.md (https://github.com/microsoft/markitdown) on GitHub. Evidence: `packages/markitdown/README.md`
- **License** (source_file): Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: Evidence: `LICENSE`
- **License** (source_file): Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: Evidence: `packages/markitdown-ocr/LICENSE`
- **THIRD-PARTY SOFTWARE NOTICES AND INFORMATION** (documentation): THIRD-PARTY SOFTWARE NOTICES AND INFORMATION Evidence: `packages/markitdown/ThirdPartyNotices.md`
- **Runtime dependency** (source_file): `ENV DEBIAN_FRONTEND=noninteractive`, `ENV EXIFTOOL_PATH=/usr/bin/exiftool`, `ENV FFMPEG_PATH=/usr/bin/ffmpeg` Evidence: `Dockerfile`
- **Runtime dependency** (source_file): `ENV DEBIAN_FRONTEND=noninteractive`, `ENV EXIFTOOL_PATH=/usr/bin/exiftool`, `ENV FFMPEG_PATH=/usr/bin/ffmpeg`, `ENV MARKITDOWN_ENABLE_PLUGINS=True` Evidence: `packages/markitdown-mcp/Dockerfile`
- **Pyproject** (source_file): `[build-system]` with `requires = ["hatchling"]` and `build-backend = "hatchling.build"` Evidence: `packages/markitdown-mcp/pyproject.toml`
- **Init** (source_file): `__all__ =` Evidence: `packages/markitdown-mcp/src/markitdown_mcp/__init__.py`
- **Main** (source_file): mcp = FastMCP "markitdown" ⋮---- @mcp.tool async def convert to markdown uri: str - str ⋮---- def check plugins enabled - bool ⋮---- def create starlette app mcp server: Server, , debug: bool = False - Starlette ⋮---- sse = SseServerTransport "/messages/" session manager = StreamableHTTPSessionManager ⋮---- async def handle sse request: Request - None ⋮---- @contextlib.asynccontextmanager async def lifespan app: Starlette - AsyncIterator None ⋮---- def main ⋮---- mcp server = mcp. mcp server ⋮---- parser = argparse.ArgumentParser description="Run a MarkItDown MCP server" ⋮---- args = parser.parse args ⋮---- use http = args.http or args.sse ⋮---- host = args.host if args.host else "127.0.0.1… Evidence: `packages/markitdown-mcp/src/markitdown_mcp/__main__.py`
- **Core dependencies - matches the file-format libraries markitdown already uses** (source_file): `[build-system]` with `requires = ["hatchling"]` and `build-backend = "hatchling.build"` Evidence: `packages/markitdown-ocr/pyproject.toml`
- **Init** (source_file): `__all__ =` Evidence: `packages/markitdown-ocr/src/markitdown_ocr/__init__.py`
- **Standard conversion without OCR** (source_file): dependency exc info = None ⋮---- dependency exc info = sys.exc info ⋮---- PLACEHOLDER = "MARKITDOWNOCRBLOCK{}" ⋮---- class DocxConverterWithOCR HtmlConverter ⋮---- def init self, ocr service: Optional LLMVisionOCRService = None ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- ocr service: Optional LLMVisionOCRService = ⋮---- image ocr map = self. extract and ocr images file stream, ocr service ⋮---- pre process stream = pre process docx file stream html result = mammoth.convert to html ⋮---- md result = self. html converter.convert string md = md result.markdown ⋮---- placeholder = PLACEHOLDER.format i ocr block = f" Image OCR \n{raw te… Evidence: `packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py`
- **Ocr Service** (source_file): @dataclass class OCRResult ⋮---- text: str confidence: float None = None backend used: str None = None error: str None = None ⋮---- class LLMVisionOCRService ⋮---- content type: str None = None ⋮---- content type = stream info.mimetype ⋮---- img = Image.open image stream fmt = img.format.lower if img.format else "png" content type = f"image/{fmt}" ⋮---- content type = "image/png" ⋮---- base64 image = base64.b64encode image stream.read .decode "utf-8" data uri = f"data:{content type};base64,{base64 image}" ⋮---- actual prompt = prompt or self.default prompt response = self.client.chat.completions.create ⋮---- text = response.choices 0 .message.content Evidence: `packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py`
- **No OCR, just extract text** (source_file): dependency exc info = None ⋮---- dependency exc info = sys.exc info ⋮---- def extract images from page page: Any - list dict ⋮---- images info = ⋮---- images = ⋮---- images = page.images ⋮---- images = page.objects.get "image", ⋮---- all objs = page.objects ⋮---- potential imgs = all objs.get obj type, ⋮---- images = potential imgs ⋮---- img stream = None y pos = 0 ⋮---- img bytes = img dict "stream" .get data ⋮---- pil img = Image.open io.BytesIO img bytes ⋮---- pil img = pil img.convert "RGB" ⋮---- img stream = io.BytesIO ⋮---- y pos = img dict.get "top", 0 ⋮---- x0 = img dict.get "x0", 0 y0 = img dict.get "top", 0 x1 = img dict.get "x1", 0 y1 = img dict.get "bottom", 0 y pos = y0 ⋮---- b… Evidence: `packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py`
- **Plugin** (source_file): plugin interface version = 1 ⋮---- def register converters markitdown: MarkItDown, kwargs: Any - None ⋮---- llm client = kwargs.get "llm client" llm model = kwargs.get "llm model" llm prompt = kwargs.get "llm prompt" ⋮---- ocr service: LLMVisionOCRService None = None ⋮---- ocr service = LLMVisionOCRService ⋮---- PRIORITY OCR ENHANCED = -1.0 Evidence: `packages/markitdown-ocr/src/markitdown_ocr/_plugin.py`
- **Format extracted content using unified OCR block format** (source_file): dependency exc info = None ⋮---- dependency exc info = sys.exc info ⋮---- class PptxConverterWithOCR DocumentConverter ⋮---- def init self, ocr service: Optional LLMVisionOCRService = None ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- ocr service: Optional LLMVisionOCRService = llm client = kwargs.get "llm client" ⋮---- presentation = pptx.Presentation file stream md content = "" slide num = 0 ⋮---- title = slide.shapes.title ⋮---- def get shape content shape, kwargs ⋮---- image stream = io.BytesIO shape.image.blob ⋮---- llm description = "" ⋮---- image filename = shape.image.filename image extension = None ⋮---- image extension = os… Evidence: `packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py`
- **Perform OCR**（source_file）：xlsx dependency exc info = None ⋮---- xlsx dependency exc info = sys.exc info ⋮---- class XlsxConverterWithOCR DocumentConverter ⋮---- def init self, ocr service: Optional LLMVisionOCRService = None ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- ocr service: Optional LLMVisionOCRService = ⋮---- kwargs without ocr = {k: v for k, v in kwargs.items if k != "ocr service"} ⋮---- sheets = pd.read excel file stream, sheet name=None, engine="openpyxl" md content = "" ⋮---- html content = sheets sheet name .to html index=False ⋮---- wb = load workbook file stream ⋮---- sheet = wb sheet name ⋮---- df = pd.read excel html content = df.to html in… 证据：`packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py`
- **IMPORTANT: MarkItDown will look for this entry point to find the plugin.**（source_file）：build-system requires = "hatchling" build-backend = "hatchling.build" 证据：`packages/markitdown-sample-plugin/pyproject.toml`
- **Init**（source_file）：all = 证据：`packages/markitdown-sample-plugin/src/markitdown_sample_plugin/__init__.py`
- **Plugin**（source_file）：plugin interface version = ⋮---- ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ".rtf" ⋮---- def register converters markitdown: MarkItDown, kwargs ⋮---- class RtfConverter DocumentConverter ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- encoding = stream info.charset or locale.getpreferredencoding stream data = file stream.read .decode encoding 证据：`packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py`
- **=1.2.0b1 required for to llm input helper used by ContentUnderstandingConverter**（source_file）：build-system requires = "hatchling" build-backend = "hatchling.build" 证据：`packages/markitdown/pyproject.toml`
- **Init**（source_file）：all = 证据：`packages/markitdown/src/markitdown/__init__.py`
- **Parse the charset**（source_file）：def main ⋮---- parser = argparse.ArgumentParser ⋮---- cloud group = parser.add mutually exclusive group ⋮---- args = parser.parse args ⋮---- extension hint = args.extension ⋮---- extension hint = extension hint.strip .lower ⋮---- extension hint = "." + extension hint ⋮---- extension hint = None ⋮---- mime type hint = args.mime type ⋮---- mime type hint = mime type hint.strip ⋮---- mime type hint = None ⋮---- Parse the charset charset hint = args.charset ⋮---- charset hint = charset hint.strip ⋮---- charset hint = codecs.lookup charset hint .name ⋮---- charset hint = None ⋮---- stream info = None ⋮---- stream info = StreamInfo ⋮---- List installed plugins, then exit ⋮---- plugin entry points… 证据：`packages/markitdown/src/markitdown/__main__.py`
- **Base Converter**（source_file）：class DocumentConverterResult ⋮---- @property def text content self - str ⋮---- @text content.setter def text content self, markdown: str ⋮---- def str self - str ⋮---- class DocumentConverter ⋮---- kwargs: Any, Options to pass to the converter ⋮---- """ Convert a document to Markdown text. Parameters: - file stream: The file-like object to convert. Must support seek , tell , and read methods. - stream info: The StreamInfo object containing metadata about the file mimetype, extension, charset, set - kwargs: Additional keyword arguments for the converter. Returns: - DocumentConverterResult: The result of the conversion, which includes the title and markdown content. Raises: - FileConversionE… 证据：`packages/markitdown/src/markitdown/_base_converter.py`
- **Exceptions**（source_file）：MISSING DEPENDENCY MESSAGE = """{converter} recognized the input as a potential {extension} file, but the dependencies needed to read {extension} files have not been installed. To resolve this error, include the optional dependency {feature} or all when installing MarkItDown. For example: ⋮---- class MarkItDownException Exception ⋮---- class MissingDependencyException MarkItDownException ⋮---- class UnsupportedFormatException MarkItDownException ⋮---- class FailedConversionAttempt object ⋮---- def init self, converter: Any, exc info: Optional tuple = None ⋮---- class FileConversionException MarkItDownException ⋮---- message = "File conversion failed." ⋮---- message = f"File conversion faile… 证据：`packages/markitdown/src/markitdown/_exceptions.py`
- **Local path or url**（source_file）：PRIORITY SPECIFIC FILE FORMAT = PRIORITY GENERIC FILE FORMAT = ⋮---- plugins: Union None, List Any = None ⋮---- def load plugins - Union None, List Any ⋮---- plugins = ⋮---- tb = traceback.format exc ⋮---- @dataclass kw only=True, frozen=True class ConverterRegistration ⋮---- """A registration of a converter with its priority and other metadata.""" ⋮---- converter: DocumentConverter priority: float ⋮---- class MarkItDown ⋮---- """ In preview An extremely simple text-based document reader, suitable for LLM use. This reader will convert common file-types or webpages to Markdown.""" ⋮---- requests session = kwargs.get "requests session" ⋮---- def enable builtins self, kwargs - None ⋮---- candi… 证据：`packages/markitdown/src/markitdown/_markitdown.py`
- **Stream Info**（source_file）：@dataclass kw only=True, frozen=True class StreamInfo ⋮---- mimetype: Optional str = None extension: Optional str = None charset: Optional str = None filename: Optional local path: Optional str = None url: Optional str = None ⋮---- def copy and update self, args, kwargs ⋮---- new info = asdict self 证据：`packages/markitdown/src/markitdown/_stream_info.py`
- **Uri Utils**（source_file）：def file uri to path file uri: str - Tuple str None, str ⋮---- parsed = urlparse file uri ⋮---- netloc = parsed.netloc if parsed.netloc else None path = os.path.abspath url2pathname parsed.path ⋮---- def parse data uri uri: str - Tuple str None, Dict str, str , bytes ⋮---- meta = header 5: parts = meta.split ";" ⋮---- is base64 = False ⋮---- is base64 = True ⋮---- mime type = None ⋮---- mime type = parts.pop 0 ⋮---- attributes: Dict str, str = {} ⋮---- content = base64.b64decode data if is base64 else unquote to bytes data 证据：`packages/markitdown/src/markitdown/_uri_utils.py`
- **Init**（source_file）：all = 证据：`packages/markitdown/src/markitdown/converters/__init__.py`
- **Add metadata**（source_file）：ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ⋮---- class AudioConverter DocumentConverter ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- md content = "" ⋮---- Add metadata metadata = exiftool metadata ⋮---- audio format = "wav" ⋮---- audio format = "mp3" ⋮---- audio format = "mp4" ⋮---- audio format = None ⋮---- transcript = transcribe audio file stream, audio format=audio format 证据：`packages/markitdown/src/markitdown/converters/_audio_converter.py`
- **Read the file content**（source_file）：ACCEPTED MIME TYPE PREFIXES = ACCEPTED FILE EXTENSIONS = ".csv" ⋮---- class CsvConverter DocumentConverter ⋮---- def init self ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- Read the file content ⋮---- content = file stream.read .decode stream info.charset ⋮---- content = str from bytes file stream.read .best ⋮---- Parse CSV content reader = csv.reader io.StringIO content rows = list reader ⋮---- Create markdown table markdown table = ⋮---- Add header row ⋮---- Add separator row ⋮---- Add data rows ⋮---- Make sure row has the same number of columns as header ⋮---- Truncate if row has… 证据：`packages/markitdown/src/markitdown/converters/_csv_converter.py`
- **Create CU client**（source_file）：dependency exc info = None ⋮---- dependency exc info = sys.exc info ⋮---- class AzureKeyCredential ⋮---- class TokenCredential ⋮---- class ContentUnderstandingClient ⋮---- class UserAgentPolicy ⋮---- class DefaultAzureCredential ⋮---- def to llm input args, kwargs ⋮---- class ContentUnderstandingFileType str, Enum ⋮---- PDF = "pdf" DOCX = "docx" PPTX = "pptx" XLSX = "xlsx" HTML = "html" TXT = "txt" MD = "md" RTF = "rtf" XML = "xml" ⋮---- EML = "eml" MSG = "msg" ⋮---- JPEG = "jpeg" PNG = "png" BMP = "bmp" TIFF = "tiff" HEIF = "heif" ⋮---- MP4 = "mp4" M4V = "m4v" MOV = "mov" AVI = "avi" MKV = "mkv" WEBM = "webm" FLV = "flv" WMV = "wmv" ⋮---- WAV = "wav" MP3 = "mp3" M4A = "m4a" FLAC = "flac" O… 证据：`packages/markitdown/src/markitdown/converters/_cu_converter.py`
- **Types that don't support ocr**（source_file）：dependency exc info = None ⋮---- dependency exc info = sys.exc info ⋮---- class AzureKeyCredential ⋮---- class TokenCredential ⋮---- class DocumentIntelligenceClient ⋮---- class AnalyzeDocumentRequest ⋮---- class AnalyzeResult ⋮---- class DocumentAnalysisFeature ⋮---- class DefaultAzureCredential ⋮---- CONTENT FORMAT = "markdown" ⋮---- class DocumentIntelligenceFileType str, Enum ⋮---- DOCX = "docx" PPTX = "pptx" XLSX = "xlsx" HTML = "html" ⋮---- PDF = "pdf" JPEG = "jpeg" PNG = "png" BMP = "bmp" TIFF = "tiff" ⋮---- def get mime type prefixes types: List DocumentIntelligenceFileType - List str ⋮---- prefixes: List str = ⋮---- def get file extensions types: List DocumentIntelligenceFileType -… 证据：`packages/markitdown/src/markitdown/converters/_doc_intel_converter.py`
- **Check: the dependencies**（source_file）：dependency exc info = None ⋮---- dependency exc info = sys.exc info ⋮---- ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ".docx" ⋮---- class DocxConverter HtmlConverter ⋮---- def init self ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- Check: the dependencies ⋮---- style map = kwargs.get "style map", None pre process stream = pre process docx file stream 证据：`packages/markitdown/src/markitdown/converters/_docx_converter.py`
- **Extract and convert the content**（source_file）：ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ".epub" ⋮---- MIME TYPE MAPPING = { ⋮---- class EpubConverter HtmlConverter ⋮---- def init self ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- container dom = minidom.parse z.open "META-INF/container.xml" opf path = container dom.getElementsByTagName "rootfile" 0 .getAttribute ⋮---- opf dom = minidom.parse z.open opf path metadata: Dict str, Any = { ⋮---- manifest = { ⋮---- spine items = opf dom.getElementsByTagName "itemref" spine order = item.getAttribute "idref" for item in spine items ⋮---- base path = "/".join spine… 证据：`packages/markitdown/src/markitdown/converters/_epub_converter.py`
- **Pop our own keyword before forwarding the rest to markdownify.**（source_file）：ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ⋮---- class HtmlConverter DocumentConverter ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- Pop our own keyword before forwarding the rest to markdownify. strict=True raises RecursionError instead of falling back to plain text. strict: bool = kwargs.pop "strict", False ⋮---- encoding = "utf-8" if stream info.charset is None else stream info.charset soup = BeautifulSoup file stream, "html.parser", from encoding=encoding ⋮---- body elm = soup.find "body" webpage text = "" ⋮---- webpage text = CustomMarkdownify kwargs .conver… 证据：`packages/markitdown/src/markitdown/converters/_html_converter.py`
- **Add metadata**（source_file）：ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ".jpg", ".jpeg", ".png" ⋮---- class ImageConverter DocumentConverter ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- md content = "" ⋮---- Add metadata metadata = exiftool metadata ⋮---- llm client = kwargs.get "llm client" llm model = kwargs.get "llm model" ⋮---- llm description = self. get llm description ⋮---- prompt = "Write a detailed caption for this image." ⋮---- Get the content type content type = stream info.mimetype ⋮---- content type = "application/octet-stream" ⋮---- cur pos = file stream.tell ⋮---- base64 imag… 证据：`packages/markitdown/src/markitdown/converters/_image_converter.py`
- **Read further to see if it's a notebook**（source_file）：CANDIDATE MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ".ipynb" ⋮---- class IpynbConverter DocumentConverter ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- Read further to see if it's a notebook cur pos = file stream.tell ⋮---- encoding = stream info.charset or "utf-8" notebook content = file stream.read .decode encoding ⋮---- notebook content = file stream.read .decode encoding=encoding ⋮---- def convert self, notebook content: dict - DocumentConverterResult ⋮---- md output = title = None ⋮---- cell type = cell.get "cell type", "" source lines = cell.get "source", ⋮---- Extract the first heading as title if not already found… 证据：`packages/markitdown/src/markitdown/converters/_ipynb_converter.py`
- **No next line to merge with, keep as is**（source_file）：PARTIAL NUMBERING PATTERN = re.compile r"^\.\d+$" ⋮---- def merge partial numbering lines text: str - str ⋮---- """ Post-process extracted text to merge MasterFormat-style partial numbering with the following text line. MasterFormat documents use partial numbering like: .1 The intent of this Request for Proposal... .2 Available information relative to... Some PDF extractors split these into separate lines: .1 The intent of this Request for Proposal... This function merges them back together. """ lines = text.split "\n" result lines: list str = i = 0 ⋮---- line = lines i stripped = line.strip ⋮---- j = i + 1 ⋮---- next line = lines j .strip ⋮---- i = j + 1 Skip past the merged line ⋮---- No… 证据：`packages/markitdown/src/markitdown/converters/_pdf_converter.py`
- **Check the dependencies**（source_file）：dependency exc info = None ⋮---- dependency exc info = sys.exc info ⋮---- ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ".pptx" ⋮---- class PptxConverter DocumentConverter ⋮---- def init self ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- Check the dependencies ⋮---- presentation = pptx.Presentation file stream md content = "" slide num = 0 ⋮---- title = slide.shapes.title ⋮---- def get shape content shape, kwargs ⋮---- llm description = "" alt text = "" ⋮---- Potentially generate a description using an LLM llm client = kwargs.get "llm client" llm model = kwargs.get… 证据：`packages/markitdown/src/markitdown/converters/_pptx_converter.py`
- **Check for precise mimetypes and file extensions**（source_file）：PRECISE MIME TYPE PREFIXES = ⋮---- PRECISE FILE EXTENSIONS = ".rss", ".atom" ⋮---- CANDIDATE MIME TYPE PREFIXES = ⋮---- CANDIDATE FILE EXTENSIONS = ⋮---- class RssConverter DocumentConverter ⋮---- def init self ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- Check for precise mimetypes and file extensions ⋮---- def check xml self, file stream: BinaryIO - bool ⋮---- cur pos = file stream.tell ⋮---- doc = minidom.parse file stream ⋮---- def feed type self, doc: Any - str None ⋮---- root = doc.getElementsByTagName "feed" 0 ⋮---- feed type = self. feed type doc ⋮---- def parse atom type self, doc: Document - DocumentConverterResult ⋮---- t… 证据：`packages/markitdown/src/markitdown/converters/_rss_converter.py`
- **Not a Wikipedia URL**（source_file）：ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ⋮---- url = stream info.url or "" mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- Not a Wikipedia URL ⋮---- Not HTML content ⋮---- kwargs: Any, Options to pass to the converter ⋮---- Parse the stream encoding = "utf-8" if stream info.charset is None else stream info.charset soup = bs4.BeautifulSoup file stream, "html.parser", from encoding=encoding ⋮---- body elm = soup.find "div", {"id": "mw-content-text"} title elm = soup.find "span", {"class": "mw-page-title-main"} ⋮---- webpage text = "" main title = None if soup.title is None else soup.title.string ⋮---- What's the title ⋮--… 证据：`packages/markitdown/src/markitdown/converters/_wikipedia_converter.py`
- **Check the dependencies**（source_file）：xlsx dependency exc info = None ⋮---- xlsx dependency exc info = sys.exc info ⋮---- xls dependency exc info = None ⋮---- xls dependency exc info = sys.exc info ⋮---- ACCEPTED XLSX MIME TYPE PREFIXES = ACCEPTED XLSX FILE EXTENSIONS = ".xlsx" ⋮---- ACCEPTED XLS MIME TYPE PREFIXES = ACCEPTED XLS FILE EXTENSIONS = ".xls" ⋮---- class XlsxConverter DocumentConverter ⋮---- def init self ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- Check the dependencies ⋮---- sheets = pd.read excel file stream, sheet name=None, engine="openpyxl" md content = "" ⋮---- html content = sheets s .to html index… 证据：`packages/markitdown/src/markitdown/converters/_xlsx_converter.py`
- **Not a YouTube URL**（source_file）：IS YOUTUBE TRANSCRIPT CAPABLE = True ⋮---- IS YOUTUBE TRANSCRIPT CAPABLE = False ⋮---- ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ⋮---- class YouTubeConverter DocumentConverter ⋮---- url = stream info.url or "" mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- url = unquote url url = url.replace r"\?", "?" .replace r"\=", "=" ⋮---- Not a YouTube URL ⋮---- Not HTML content ⋮---- kwargs: Any, Options to pass to the converter ⋮---- Parse the stream encoding = "utf-8" if stream info.charset is None else stream info.charset soup = bs4.BeautifulSoup file stream, "html.parser", from encoding=encoding ⋮---- metadata: Dict str, str… 证据：`packages/markitdown/src/markitdown/converters/_youtube_converter.py`
- **Zip Converter**（source_file）：ACCEPTED MIME TYPE PREFIXES = ⋮---- ACCEPTED FILE EXTENSIONS = ".zip" ⋮---- class ZipConverter DocumentConverter ⋮---- mimetype = stream info.mimetype or "" .lower extension = stream info.extension or "" .lower ⋮---- kwargs: Any, Options to pass to the converter ⋮---- file path = stream info.url or stream info.local path or stream info.filename md content = f"Content from the zip file {file path} :\n\n" ⋮---- z file stream = io.BytesIO zipObj.read name z file stream info = StreamInfo result = self. markitdown.convert stream 证据：`packages/markitdown/src/markitdown/converters/_zip_converter.py`
- **Microsoft Open Source Code of Conduct**（documentation）：Microsoft Open Source Code of Conduct 证据：`CODE_OF_CONDUCT.md`
- **Security**（documentation）：Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include Microsoft https://github.com/Microsoft , Azure https://github.com/Azure , DotNet https://github.com/dotnet , AspNet https://github.com/aspnet and Xamarin https://github.com/xamarin . 证据：`SECURITY.md`
- **TODO: The maintainer of this repo has not yet edited this file**（documentation）：TODO: The maintainer of this repo has not yet edited this file 证据：`SUPPORT.md`
- **Test Docx Converter**（source_file）：TEST DATA DIR = Path file .parent / "ocr test data" ⋮---- MOCK TEXT = "MOCK OCR TEXT 12345" ⋮---- class MockOCRService ⋮---- @pytest.fixture scope="module" def svc - MockOCRService ⋮---- def convert filename: str, ocr service: MockOCRService - str ⋮---- path = TEST DATA DIR / filename ⋮---- converter = DocxConverterWithOCR ⋮---- def test docx image start svc: MockOCRService - None ⋮---- expected = ⋮---- def test docx image middle svc: MockOCRService - None ⋮---- def test docx image end svc: MockOCRService - None ⋮---- def test docx multiple images svc: MockOCRService - None ⋮---- def test docx multipage svc: MockOCRService - None ⋮---- def test docx complex layout svc: MockOCRService - None… 证据：`packages/markitdown-ocr/tests/test_docx_converter.py`
- **---------------------------------------------------------------------------**（source_file）：TEST DATA DIR = Path file .parent / "ocr test data" ⋮---- MOCK TEXT = "MOCK OCR TEXT 12345" OCR BLOCK = f" Image OCR \n{ MOCK TEXT}\n End OCR " PAGE 1 SCANNED = f" ⋮---- class MockOCRService: ⋮---- @pytest.fixture scope="module" def svc - MockOCRService ⋮---- def convert filename: str, ocr service: MockOCRService - str ⋮---- path = TEST DATA DIR / filename ⋮---- converter = PdfConverterWithOCR ⋮---- def test pdf image start svc: MockOCRService - None ⋮---- expected = ⋮---- def test pdf image middle svc: MockOCRService - None ⋮---- def test pdf image end svc: MockOCRService - None ⋮---- def test pdf multiple images svc: MockOCRService - None ⋮---- def test pdf complex layout svc: MockOCRServ… 证据：`packages/markitdown-ocr/tests/test_pdf_converter.py`
- **Test Pptx Converter**（source_file）：TEST DATA DIR = Path file .parent / "ocr test data" ⋮---- MOCK TEXT = "MOCK OCR TEXT 12345" OCR BLOCK = f" Image OCR \n{ MOCK TEXT}\n End OCR " ⋮---- class MockOCRService ⋮---- self, noqa: ANN101 ⋮---- @pytest.fixture scope="module" def svc - MockOCRService ⋮---- def convert filename: str, ocr service: MockOCRService - str ⋮---- path = TEST DATA DIR / filename ⋮---- converter = PptxConverterWithOCR ⋮---- def test pptx image start svc: MockOCRService - None ⋮---- expected = ⋮---- def test pptx image middle svc: MockOCRService - None ⋮---- def test pptx image end svc: MockOCRService - None ⋮---- def test pptx multiple images svc: MockOCRService - None ⋮---- def test pptx complex layout svc: M… 证据：`packages/markitdown-ocr/tests/test_pptx_converter.py`
- **Test Xlsx Converter**（source_file）：TEST DATA DIR = Path file .parent / "ocr test data" ⋮---- MOCK TEXT = "MOCK OCR TEXT 12345" OCR BLOCK = f" Image OCR \n{ MOCK TEXT}\n End OCR " IMG SECTION = " ⋮---- class MockOCRService: ⋮---- @pytest.fixture scope="module" def svc - MockOCRService ⋮---- def convert filename: str, ocr service: MockOCRService - str ⋮---- path = TEST DATA DIR / filename ⋮---- converter = XlsxConverterWithOCR ⋮---- def test xlsx image start svc: MockOCRService - None ⋮---- expected = ⋮---- def test xlsx image middle svc: MockOCRService - None ⋮---- def test xlsx image end svc: MockOCRService - None ⋮---- def test xlsx multiple images svc: MockOCRService - None ⋮---- def test xlsx complex layout svc: MockOCRSe… 证据：`packages/markitdown-ocr/tests/test_xlsx_converter.py`
- **Test Sample Plugin**（source_file）：TEST FILES DIR = os.path.join os.path.dirname file , "test files" ⋮---- RTF TEST STRINGS = { ⋮---- def test converter - None ⋮---- converter = RtfConverter result = converter.convert ⋮---- def test markitdown - None ⋮---- md = MarkItDown enable plugins=True result = md.convert os.path.join TEST FILES DIR, "test.rtf" 证据：`packages/markitdown-sample-plugin/tests/test_sample_plugin.py`
- **---------------------------------------------------------------------------**（source_file）：def make converter file types=None, analyzer id=None, analyzer modality=None ⋮---- conv = ContentUnderstandingConverter. new ContentUnderstandingConverter ⋮---- types = file types if file types is not None else ALL FILE TYPES ⋮---- class TestAcceptsExtension ⋮---- def test accepts supported extensions self, ext ⋮---- conv = make converter ⋮---- def test rejects unsupported extensions self, ext ⋮---- --------------------------------------------------------------------------- accepts tests — MIME-based ⋮---- class TestAcceptsMime ⋮---- """Test accepts for MIME type matching.""" ⋮---- def test accepts supported mimetypes self, mime ⋮---- def test rejects unsupported mimetypes self, mime ⋮----… 证据：`packages/markitdown/tests/test_cu_converter.py`
- **Devcontainer**（structured_config）：// For format details, see https://aka.ms/devcontainer.json. For config options, see the // README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile { "name": "Existing Dockerfile", "build": { // Sets the run context to one level up instead of the .devcontainer folder. "context": "..", // Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename. "dockerfile": "../Dockerfile", "args": { "INSTALL GIT": "true" } }, 证据：`.devcontainer/devcontainer.json`
- **.dockerignore**（source_file）：!packages/ 证据：`.dockerignore`
- The remaining 8 evidence items are listed in `AI_CONTEXT_PACK.json` or `EVIDENCE_INDEX.json`.
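
The evidence entry for `_pdf_converter.py` above quotes a docstring describing how MasterFormat-style partial numbering (`.1`, `.2`) is merged back onto the following text line. The sketch below is a reconstruction from that truncated excerpt, not the project's actual implementation. It assumes that blank lines between the number and its text are skipped; the excerpt does not confirm that detail.

```python
import re

# Pattern as quoted in the evidence excerpt: a line that is only ".<digits>".
PARTIAL_NUMBERING_PATTERN = re.compile(r"^\.\d+$")


def merge_partial_numbering_lines(text: str) -> str:
    """Merge a standalone '.N' line with the next non-empty line."""
    lines = text.split("\n")
    result_lines: list[str] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        if PARTIAL_NUMBERING_PATTERN.match(stripped):
            # Assumption: look ahead past blank lines for the text to merge.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                result_lines.append(f"{stripped} {lines[j].strip()}")
                i = j + 1  # Skip past the merged line.
                continue
        # Not a partial number, or no next line to merge with: keep as is.
        result_lines.append(line)
        i += 1
    return "\n".join(result_lines)
```

For example, `merge_partial_numbering_lines(".1\nThe intent of this Request")` returns `".1 The intent of this Request"`, while a trailing `.3` with nothing after it is left unchanged.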

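## Rules the Host AI Must Follow

- **Treat this asset as pre-work context, not as a runtime environment.** The AI Context Pack contains only an evidence-based understanding of the project; it does not contain any executable state of the target project. Evidence: `README.md`, `packages/markitdown-mcp/README.md`, `packages/markitdown-ocr/README.md`
- **When answering users, distinguish what can be previewed from what can only be verified after installation.** The value of a pre-install experience lies in reducing mistaken installs and misjudgments, not in posing as a real run. Evidence: `README.md`, `packages/markitdown-mcp/README.md`, `packages/markitdown-ocr/README.md`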

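## Questions the User Should Answer Before Starting

- In which host AI or local environment do you plan to use it?
- Do you only want to try the workflow first, or are you preparing a real installation?
- What matters most to you: installation cost, output quality, or conflicts with your existing rules?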

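## Acceptance Criteria

- Every capability claim can be traced back to a file path in evidence_refs.
- AI_CONTEXT_PACK.md does not present a preview as a real run.
- A user can understand within 3 minutes who it suits, what it can do, how to start, and where the risk boundaries are.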
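
The first criterion can be checked mechanically. The sketch below is illustrative only: it assumes each claim is a dictionary with `claim_id` and `evidence_refs` keys, which is a hypothetical shape and not the documented schema of `AI_CONTEXT_PACK.json`. The sample records are made up for the demonstration.

```python
def claims_missing_evidence(claims):
    """Return the claim_id of every claim whose evidence_refs is empty or absent."""
    return [c["claim_id"] for c in claims if not c.get("evidence_refs")]


# Hypothetical sample records, for illustration only.
sample_claims = [
    {"claim_id": "clm_demo_ok", "evidence_refs": ["packages/markitdown/pyproject.toml"]},
    {"claim_id": "clm_demo_empty", "evidence_refs": []},
    {"claim_id": "clm_demo_absent"},
]

missing = claims_missing_evidence(sample_claims)
```

An empty result would mean every claim carries at least one evidence path; it says nothing about whether those paths actually support the claim.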

---

## Doramagic Context Augmentation

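The content below reinforces the main body of the Repomix/AI Context Pack. The Human Manual provides only a reading skeleton; the pitfall log is converted into working constraints that the host AI must follow.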

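## Human Manual Skeleton

Usage rule: this is only a reading route through the project and a set of salience signals, not an authority on facts. Specific facts must still be traced back to repo evidence or the Claim Graph.

Hard rules for the host AI:
- Do not treat page titles, section order, summaries, or importance values as evidence of project facts.
- When explaining the Human Manual skeleton, state explicitly that it is only a reading route and a salience signal.
- Judgments about capabilities, installation, compatibility, runtime status, and risk must cite repo evidence, a source path, or the Claim Graph.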

- **Home**: importance `high`
  - source_paths: README.md, packages/markitdown/src/markitdown/__about__.py
- **Installation Guide**: importance `high`
  - source_paths: README.md, packages/markitdown/pyproject.toml, Dockerfile
- **Command-Line Interface**: importance `high`
  - source_paths: README.md, packages/markitdown/src/markitdown/__main__.py
- **Architecture Overview**: importance `high`
  - source_paths: packages/markitdown/src/markitdown/_markitdown.py, packages/markitdown/src/markitdown/_base_converter.py, packages/markitdown/src/markitdown/converters/__init__.py, packages/markitdown/src/markitdown/_uri_utils.py, packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py
- **Python API Reference**: importance `high`
  - source_paths: packages/markitdown/src/markitdown/_markitdown.py, packages/markitdown/src/markitdown/__init__.py, packages/markitdown/src/markitdown/_stream_info.py, packages/markitdown/src/markitdown/_exceptions.py
- **Supported File Formats**: importance `high`
  - source_paths: packages/markitdown/src/markitdown/converters/_pdf_converter.py, packages/markitdown/src/markitdown/converters/_docx_converter.py, packages/markitdown/src/markitdown/converters/_pptx_converter.py, packages/markitdown/src/markitdown/converters/_xlsx_converter.py, packages/markitdown/src/markitdown/converters/_image_converter.py
- **Azure Integrations**: importance `high`
  - source_paths: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py, README.md, packages/markitdown/src/markitdown/converters/__init__.py
- **OCR Plugin**: importance `high`
  - source_paths: packages/markitdown-ocr/README.md, packages/markitdown-ocr/src/markitdown_ocr/_plugin.py, packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py, packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py, packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `e144e0a2be95b34df17433bac904e635f2c5e551`
- inspected_files: `Dockerfile`, `README.md`, `packages/markitdown-sample-plugin/pyproject.toml`, `packages/markitdown-sample-plugin/README.md`, `packages/markitdown-ocr/pyproject.toml`, `packages/markitdown-ocr/README.md`, `packages/markitdown-mcp/pyproject.toml`, `packages/markitdown-mcp/README.md`, `packages/markitdown/pyproject.toml`, `packages/markitdown/ThirdPartyNotices.md`, `packages/markitdown/README.md`, `packages/markitdown-sample-plugin/tests/test_sample_plugin.py`, `packages/markitdown-sample-plugin/tests/__init__.py`, `packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py`, `packages/markitdown-sample-plugin/src/markitdown_sample_plugin/__about__.py`, `packages/markitdown-sample-plugin/src/markitdown_sample_plugin/__init__.py`, `packages/markitdown-ocr/tests/test_pptx_converter.py`, `packages/markitdown-ocr/tests/test_xlsx_converter.py`, `packages/markitdown-ocr/tests/test_docx_converter.py`, `packages/markitdown-ocr/tests/test_pdf_converter.py`

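Hard rules for the host AI:
- Without repo_clone_verified=true, do not claim to have read the source code.
- Without repo_inspection_verified=true, do not present judgments drawn from README, docs, or package files as facts.
- Without quick_start_verified=true, do not claim that the Quick Start has been run successfully.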
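
The three rules above amount to a simple gate: each kind of statement is allowed only when its flag is true. A minimal sketch follows, assuming the flags are loaded into a small dataclass. The class name, function name, and dictionary keys are hypothetical and not part of the pack format. Because `quick_start_verified` does not appear in the evidence block of this pack, the sketch defaults it to false.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InspectionFlags:
    # Any flag that is not explicitly present is treated as false.
    repo_clone_verified: bool = False
    repo_inspection_verified: bool = False
    quick_start_verified: bool = False


def allowed_claims(flags: InspectionFlags) -> dict[str, bool]:
    """Map each kind of statement to whether the host AI may make it."""
    return {
        "has_read_source": flags.repo_clone_verified,
        "readme_judgments_as_fact": flags.repo_inspection_verified,
        "quick_start_ran": flags.quick_start_verified,
    }


# Flags as listed in this pack's evidence block; quick_start_verified is absent there.
pack_flags = InspectionFlags(repo_clone_verified=True, repo_inspection_verified=True)
permissions = allowed_claims(pack_flags)
```

With these flags, `permissions["quick_start_ran"]` is false, so the host AI should not say the Quick Start has been run.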

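## Doramagic Pitfall Constraints

These rules come from project-specific pitfalls found while Doramagic discovered, verified, or compiled this pack. The host AI must treat them as working constraints, not as ordinary explanatory text.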

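### Constraint 1: Source evidence: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux

- Trigger: GitHub community evidence shows an installation-related issue pending verification for this project: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- Host AI rule: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Why it matters: It may raise the cost of trial use and production adoption for new users.
- Evidence: community_evidence:github | cevd_f70b2e3ea5ed47418a4aeb9ef27230f9 | https://github.com/microsoft/markitdown/issues/1685 | The source discussion mentions Python-related conditions, which need to be re-checked before installation or trial use.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.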
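
Before trying audio conversion, a user can check whether either executable named in the warning is on `PATH`. The sketch below uses only the standard library; the function name is hypothetical. It reports presence only and does not show that the linked issue is resolved or that conversion will succeed.

```python
import shutil


def find_audio_backend(candidates=("ffmpeg", "avconv")):
    """Return the name of the first candidate executable found on PATH, or None."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


backend = find_audio_backend()
if backend is None:
    print("Neither ffmpeg nor avconv was found on PATH.")
else:
    print(f"Found audio backend: {backend}")
```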

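### Constraint 2: Source evidence: Unrecognized Arguments Error in markitdown CLI for undocumented arguments

- Trigger: GitHub community evidence shows a runtime-related issue pending verification for this project: Unrecognized Arguments Error in markitdown CLI for undocumented arguments
- Host AI rule: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Why it matters: It may raise the cost of trial use and production adoption for new users.
- Evidence: community_evidence:github | cevd_252ef0d45ac040688ffa066bc1b64ba0 | https://github.com/microsoft/markitdown/issues/1897 | A usage condition pending verification, surfaced by source type github_issue.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.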
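
The evidence excerpt for `__main__.py` shows that the CLI is built on `argparse.ArgumentParser`. As general background, the sketch below shows how `argparse` treats flags a parser does not define: `parse_args` exits with an "unrecognized arguments" error, while `parse_known_args` returns them separately. The parser here is a hypothetical stand-in with illustrative options, not markitdown's actual argument set, and it does not reproduce the linked issue.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser; the options are illustrative only.
    parser = argparse.ArgumentParser(prog="demo-cli")
    parser.add_argument("filename", nargs="?")
    parser.add_argument("--extension")
    return parser


def split_known_unknown(argv):
    """Parse argv, returning (namespace, list of unrecognized arguments)."""
    return build_parser().parse_known_args(argv)


known, unknown = split_known_unknown(
    ["doc.pdf", "--extension", "pdf", "--mystery-flag"]
)

# parse_args is strict: an undefined flag makes argparse exit with status 2.
try:
    build_parser().parse_args(["--mystery-flag"])
    strict_exit_code = None
except SystemExit as exc:
    strict_exit_code = exc.code
```

Here `unknown` holds the undefined flag, which lets a wrapper script inspect a command line before handing it to a strict parser.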

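### Constraint 3: Source evidence: bug: DOCX math converter crashes when oMath element is missing in malformed equations

- Trigger: GitHub community evidence shows a maintenance/version-related issue pending verification in this project: bug: DOCX math converter crashes when oMath element is missing in malformed equations
- Host AI rule: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Why it matters: It may block installation or the first run.
- Evidence: community_evidence:github | cevd_6e08b71ee29f46a98e6825a5d5b11e6e | https://github.com/microsoft/markitdown/issues/1979 | The source discussion mentions Python-related conditions, which need to be re-checked before installation or trial use.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.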

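### Constraint 4: Source evidence: bug: DOCX math converter crashes with NotImplementedError on unknown functions

- Trigger: GitHub community evidence shows a maintenance/version-related issue pending verification in this project: bug: DOCX math converter crashes with NotImplementedError on unknown functions
- Host AI rule: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Why it matters: It may block installation or the first run.
- Evidence: community_evidence:github | cevd_439f22f47a524773808819148caadca5 | https://github.com/microsoft/markitdown/issues/1982 | The source discussion mentions Python-related conditions, which need to be re-checked before installation or trial use.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.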

### Constraint 5: Failure mode: installation: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception

- Trigger: Developers should check this installation risk before relying on the project: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception. Context: Source discussion did not expose a precise runtime context.
- Why it matters: Developers may fail before the first successful local run: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- Evidence: failure_mode_cluster:github_issue | fmev_087a8a7b6538b2ce2b065ade73c555af | https://github.com/microsoft/markitdown/issues/1408 | Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.
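
If invalid Office Open XML input can come back as an apparent success, a caller may want its own pre-flight check rather than relying on an exception. The sketch below is a heuristic using only the standard library: it tests whether the bytes form a ZIP archive containing the `[Content_Types].xml` part that Office Open XML packages carry. The function name `looks_like_ooxml` is hypothetical, and the check does not reproduce or replace anything inside markitdown.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Heuristic: a readable ZIP archive that contains '[Content_Types].xml'."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    try:
        with zipfile.ZipFile(buffer) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```

A `False` result is a reason to reject the file up front. A `True` result only says the container looks plausible, not that conversion will succeed.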

### Constraint 6: Failure mode: installation: Support for .doc extensions

- Trigger: Developers should check this installation risk before relying on the project: Support for .doc extensions
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: Support for .doc extensions. Context: Observed when using Windows, Linux
- Why it matters: Developers may fail before the first successful local run: Support for .doc extensions
- Evidence: failure_mode_cluster:github_issue | fmev_d5a467d012987779306cb5c50725275b | https://github.com/microsoft/markitdown/issues/23 | Support for .doc extensions
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.
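
Because a file extension alone can mislead, it may help to tell legacy binary Word files apart from ZIP-based ones before choosing a conversion path. The sketch below inspects the leading bytes: legacy `.doc` files are typically OLE2 compound files, while `.docx` files are ZIP containers. The function name `classify_container` is hypothetical, and the sketch makes no claim about which formats markitdown currently accepts.

```python
# Signature of OLE2 compound files (used by legacy binary Office formats).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a typical ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def classify_container(header: bytes) -> str:
    """Classify a file by its leading bytes as 'ole2', 'zip', or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Passing the first eight bytes of a file is enough, for example `classify_container(open(path, "rb").read(8))`.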

### Constraint 7: Failure mode: installation: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux

- Trigger: Developers should check this installation risk before relying on the project: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux. Context: Observed when using Python, Windows, Linux
- Why it matters: Developers may fail before the first successful local run: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- Evidence: failure_mode_cluster:github_issue | fmev_1f9167a15a1eec72c8f79514f1b70b76 | https://github.com/microsoft/markitdown/issues/1685 | [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.

### Constraint 8: Failure mode: installation: v0.1.0

- Trigger: Developers should check this installation risk before relying on the project: v0.1.0
- Host AI rule: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.0. Context: Observed when using Python
- Why it matters: Upgrade or migration may change expected behavior: v0.1.0
- Evidence: failure_mode_cluster:github_release | fmev_1d5ae6ee21225356f45c36c20024dccd | https://github.com/microsoft/markitdown/releases/tag/v0.1.0 | v0.1.0
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.
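
Since behavior may differ across releases, it can be useful to record which version is actually installed before comparing against the v0.1.0 release notes. The sketch below uses the standard-library `importlib.metadata`. The distribution name `markitdown` is an assumption based on the package directory name, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "markitdown") -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A `None` result means the distribution is not installed in the current environment, so no version-specific statement should be made.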

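### Constraint 9: Source evidence: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception

- Trigger: GitHub community evidence shows an installation-related issue pending verification in this project: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- Host AI rule: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Why it matters: It may raise the cost of trial use for new users and of production adoption.
- Evidence: community_evidence:github | cevd_734e117518a3496eb3779e5f22b600b5 | https://github.com/microsoft/markitdown/issues/1408 | A usage condition pending verification, exposed by source type github_issue.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.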

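### Constraint 10: Source evidence: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)

- Trigger: GitHub community evidence shows an installation-related issue pending verification in this project: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)
- Host AI rule: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Why it matters: It may block installation or the first run.
- Evidence: community_evidence:github | cevd_77597bea6262485b9609d8fc5f50a69a | https://github.com/microsoft/markitdown/issues/1894 | The source discussion mentions Python-related conditions, which need to be re-checked before installation or trial use.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly proves that it has been closed.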
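
The failure class named in this issue, a format sniffer that raises on bytes it cannot decode, can be illustrated with a generic sniffer that treats a decode failure as "not this format". The sketch below is not markitdown code and does not describe how `IpynbConverter.accepts()` is implemented. The function `looks_like_notebook` and its use of the `nbformat` key as a marker are assumptions for illustration only.

```python
import json


def looks_like_notebook(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 JSON with an 'nbformat' key.

    Decode or parse failures mean 'not a notebook' and never propagate.
    """
    try:
        document = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    return isinstance(document, dict) and "nbformat" in document
```

With this shape, binary or non-UTF-8 input such as a PDF yields `False` instead of an exception, so other format checks can still run.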