# https://github.com/microsoft/markitdown 项目说明书

生成时间：2026-05-30 19:07:23 UTC

## 目录

- [项目概述](#page-overview)
- [支持的文件格式](#page-supported-formats)
- [系统架构](#page-architecture)
- [转换器系统](#page-converter-system)
- [安装与配置](#page-installation)
- [命令行使用](#page-cli-usage)
- [Python API](#page-python-api)
- [插件系统](#page-plugin-system)
- [OCR 插件](#page-ocr-plugin)
- [Azure 服务集成](#page-azure-integration)

<a id='page-overview'></a>

## 项目概述

### 相关页面

相关主题：[支持的文件格式](#page-supported-formats), [系统架构](#page-architecture)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)
- [packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)
- [packages/markitdown/src/markitdown/converters/_docx/pre_process.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_docx/pre_process.py)
</details>

# 项目概述

## 简介

MarkItDown 是一个由 Microsoft 开发的 Python 包和命令行工具，用于将各种文件格式转换为 Markdown 格式。其设计目标是便于文档索引和文本分析等场景使用。 资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

MarkItDown 的核心价值主张在于 Markdown 与 LLM 的天然兼容性。主流的大语言模型（如 OpenAI 的 GPT-4o）本身就能很好地理解和生成 Markdown，这种设计选择使得转换后的内容可以被 AI 模型高效处理。 资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## 核心功能

MarkItDown 支持多种文件格式的转换，具体包括：

| 类别 | 支持的格式 |
|------|-----------|
| 文档 | PDF、Word (DOCX)、Excel (XLSX)、PowerPoint (PPTX) |
| 媒体 | 图片（EXIF 元数据和 OCR）、音频（EXIF 元数据和语音转录） |
| Web | HTML、Wikipedia 页面 |
| 数据 | CSV、JSON、XML |
| 其他 | ZIP 文件（遍历内容）、YouTube URL、EPub |

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### 转换器架构

MarkItDown 采用插件化的转换器架构，每个文件格式由对应的转换器（Converter）处理。转换器通过 `DocumentConverter` 基类定义统一的接口：

- `accepts()`：判断转换器是否接受给定文件
- `convert()`：执行实际的格式转换

资料来源：[packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

### 插件系统

从 v0.1.0 开始，MarkItDown 引入了插件架构，允许第三方开发者扩展支持的格式。插件通过 `markitdown.plugin` 入口点注册，核心函数为 `register_converters()`。 资料来源：[v0.1.0 发布说明](https://github.com/microsoft/markitdown/releases/tag/v0.1.0)

插件接口版本定义为 `__plugin_interface_version__ = 1`，当前仅支持版本 1。 资料来源：[packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## 安装方式

### 标准安装

```bash
pip install markitdown[all]
```

`[all]` 标记安装所有可选依赖，包括完整功能集。 资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### 从源码安装

```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown[all]
```

### 前置要求

MarkItDown 需要 **Python 3.10 或更高版本**。建议使用虚拟环境以避免依赖冲突。 资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## 使用方式

### 命令行接口

MarkItDown 提供命令行工具 `markitdown`，支持多种调用方式：

```bash
# 基本用法
markitdown path-to-file.pdf > document.md

# 从 stdin 读取
cat example.pdf | markitdown

# 指定输出文件
markitdown example.pdf -o example.md

# 指定文件扩展名提示（从 stdin 读取时）
markitdown -x .pdf < example.pdf
```

资料来源：[packages/markitdown/src/markitdown/__main__.py:23-52](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### Python API

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```

默认情况下插件处于禁用状态。如需启用插件：

```python
md = MarkItDown(enable_plugins=True)
```

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## 高级功能

### LLM 图像描述

MarkItDown 支持使用大语言模型为图像生成描述，适用于 PowerPoint 和图片文件：

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="optional custom prompt"
)
result = md.convert("example.jpg")
print(result.text_content)
```

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Azure Document Intelligence

对于 PDF 文档，可以使用 Azure Document Intelligence 服务获得更高质量的转换：

```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```

Python API 用法：

```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
```

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Azure Content Understanding

Content Understanding 提供更高质量的转换，支持结构化字段提取（YAML front matter）和多模态处理（文档、图片、音频、视频）。 资料来源：[packages/markitdown/src/markitdown/converters/_cu_converter.py:1-20](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

安装方式：

```bash
pip install 'markitdown[az-content-understanding]'
```

使用示例：

```python
from markitdown import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # 只对 PDF 使用 CU
)
result = md.convert("document.pdf")
```

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### OCR 插件

`markitdown-ocr` 插件为 PDF、DOCX、PPTX 和 XLSX 转换器添加 OCR 支持，从嵌入图像中提取文本。 资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

安装：

```bash
pip install markitdown-ocr
pip install openai  # 或任何 OpenAI 兼容客户端
```

使用：

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

OCR 插件工作原理：

1. MarkItDown 通过 `markitdown.plugin` 入口点发现插件
2. 调用 `register_converters()`，转发所有参数包括 `llm_client` 和 `llm_model`
3. 插件创建 `LLMVisionOCRService` 并注册四个 OCR 增强转换器（优先级 -1.0，优先于内置转换器的 0.0）

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 系统架构

### 转换流程

```mermaid
graph TD
    A[输入文件/URL] --> B[MarkItDown 核心]
    B --> C{文件类型判断}
    C -->|PDF| D[PDFConverter]
    C -->|DOCX| E[DOCXConverter]
    C -->|XLSX| F[XLSXConverter]
    C -->|PPTX| G[PPTXConverter]
    C -->|HTML| H[HTMLConverter]
    C -->|图片| I[ImageConverter]
    C -->|音频| J[AudioConverter]
    C -->|ZIP| K[ZIPConverter]
    C -->|YouTube| L[YouTubeConverter]
    C -->|其他| M[文本转换器]
    
    D --> N[文本提取]
    E --> O[DOCX→HTML→Markdown]
    F --> P[表格处理]
    G --> Q[幻灯片处理]
    H --> R[HTML净化]
    I --> S[EXIF + OCR]
    J --> T[音频处理]
    K --> U[递归遍历]
    L --> V[字幕提取]
    M --> W[纯文本]
    
    N --> X[Markdown 输出]
    O --> X
    P --> X
    Q --> X
    R --> X
    S --> X
    T --> X
    U --> X
    V --> X
    W --> X
```

### 插件加载机制

```mermaid
graph TD
    A[MarkItDown 初始化] --> B{enable_plugins=True?}
    B -->|否| C[仅加载内置转换器]
    B -->|是| D[扫描 markitdown.plugin 入口点]
    D --> E[调用 register_converters]
    E --> F[插件转换器注册到转换器列表]
    C --> G[按优先级排序转换器]
    F --> G
    G --> H[文件转换时按序匹配]
```

## CLI 参数说明

| 参数 | 说明 |
|------|------|
| `-v, --version` | 显示版本号并退出 |
| `-o, --output` | 指定输出文件名，未指定则输出到 stdout |
| `-x, --extension` | 提供文件扩展名提示（从 stdin 读取时使用） |
| `-m, --mimetype` | 提供 MIME 类型提示 |
| `-c, --charset` | 提供字符集提示 |
| `-d, --use-docintel` | 使用 Azure Document Intelligence |
| `-e, --endpoint` | Document Intelligence 端点 |
| `--use-cu` | 使用 Azure Content Understanding |
| `--cu-endpoint` | Content Understanding 端点 |
| `--cu-analyzer` | Content Understanding 分析器 ID |
| `--cu-file-types` | 路由到 Content Understanding 的文件类型 |
| `-p, --use-plugins` | 启用第三方插件 |
| `--list-plugins` | 列出已安装的插件 |
| `--keep-data-uris` | 保留输出中的 data URI |
| `filename` | 要转换的文件（可选，为空则从 stdin 读取） |

资料来源：[packages/markitdown/src/markitdown/__main__.py:54-130](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

## 依赖分组

MarkItDown 将依赖组织为功能组，用户可以只安装需要的转换器： 资料来源：[v0.1.0 发布说明](https://github.com/microsoft/markitdown/releases/tag/v0.1.0)

| 功能组 | 说明 |
|--------|------|
| `markitdown` | 核心功能 |
| `markitdown[all]` | 所有可选依赖 |
| `markitdown[az-doc-intel]` | Azure Document Intelligence |
| `markitdown[az-content-understanding]` | Azure Content Understanding |

## 已知限制与问题

### 社区反馈的问题

| 问题 | 说明 | 相关版本 |
|------|------|----------|
| PDF 表格转换不完整 | 用户反馈 PDF 中的复杂表格未能正确转换为 Markdown | 持续优化中 |
| PDF 不支持结构化输出 | PDF 转换为基础文本，非高保真文档转换 | v0.1.6 新增 OCR 支持 |
| 不支持 .doc 格式 | 仅支持 .docx，旧的 .doc 格式不在支持范围内 | Issue #23 |
| Office Open XML 无效文件处理 | 无效的 DOCX/XLSX/PPTX 文件返回成功结果而非异常 | Issue #1408 |
| IpynbConverter UnicodeDecodeError | 非 ASCII 文件（如法文 PDF）导致解码错误 | Issue #1894 |
| Linux 下 pydub 警告 | 缺少 ffmpeg 或 avconv 时触发 RuntimeWarning | Issue #1685 |

资料来源：[Issue #293](https://github.com/microsoft/markitdown/issues/293)、[Issue #296](https://github.com/microsoft/markitdown/issues/296)、[Issue #23](https://github.com/microsoft/markitdown/issues/23)、[Issue #1408](https://github.com/microsoft/markitdown/issues/1408)、[Issue #1894](https://github.com/microsoft/markitdown/issues/1894)、[Issue #1685](https://github.com/microsoft/markitdown/issues/1685)

### 安全注意事项

> [!IMPORTANT]
> MarkItDown 以当前进程的权限执行 I/O 操作。与 `open()` 或 `requests.get()` 一样，它会访问进程本身可以访问的资源。在不受信任的环境中，请对输入进行清理，并调用最窄范围的转换函数（如 `convert_stream()` 或 `convert_local()`）。 资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## 版本历史

| 版本 | 主要变化 |
|------|----------|
| v0.1.6 | 新增 OCR 插件支持，修复 PDF 内存增长问题 |
| v0.1.5 | PDF 表格提取改进，支持对齐 Markdown，修复编号列表问题 |
| v0.1.4 | 更新 mammoth 和 pdfminer.six 以修复安全漏洞 |
| v0.1.3 | 为 Windows 固定 onnxruntime，MCP 服务器支持 |
| v0.1.2 | 新增 DOCX 数学公式渲染，CSV 到 Markdown 表格转换 |
| v0.1.1 | `convert_url` 重命名为 `convert_uri`，支持文件 URI 和 data URI |
| v0.1.0 | 插件架构，依赖分组，YouTube URL 支持 |

资料来源：[发布页面](https://github.com/microsoft/markitdown/releases)

## Docker 支持

MarkItDown 支持通过 Docker 运行：

```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## 参见

- [插件开发指南](packages/markitdown-sample-plugin/README.md) - 如何开发 MarkItDown 插件
- [OCR 插件文档](packages/markitdown-ocr/README.md) - LLM Vision OCR 插件详细文档
- [Azure Document Intelligence 集成](README.md) - 使用 Azure AI 服务进行文档转换
- [版本发布页面](https://github.com/microsoft/markitdown/releases) - 完整版本历史和变更日志

---

<a id='page-supported-formats'></a>

## 支持的文件格式

### 相关页面

相关主题：[项目概述](#page-overview), [转换器系统](#page-converter-system)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown/src/markitdown/converters/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/__init__.py)
- [packages/markitdown/src/markitdown/converters/_pdf_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pdf_converter.py)
- [packages/markitdown/src/markitdown/converters/_docx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_docx_converter.py)
- [packages/markitdown/src/markitdown/converters/_xlsx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_xlsx_converter.py)
- [packages/markitdown/src/markitdown/converters/_pptx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pptx_converter.py)
- [packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)
- [packages/markitdown/src/markitdown/converters/_zip_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_zip_converter.py)
- [packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)
</details>

# 支持的文件格式

MarkItDown 是一个 Python 包和命令行工具，用于将各种文件格式转换为 Markdown 格式。本文档详细说明 MarkItDown 支持的文件格式、转换机制以及各格式的处理方式。

## 概述

MarkItDown 支持多种文件格式的转换，涵盖文档、图片、音频、视频以及基于文本的格式。转换器采用插件化架构设计，内置转换器处理核心格式，第三方插件可扩展更多格式支持。

资料来源：[packages/markitdown/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/README.md)

## 支持格式总览

### 核心支持格式

| 格式类别 | 文件扩展名 | 支持状态 | 说明 |
|---------|-----------|---------|------|
| **PDF** | `.pdf` | ✅ 支持 | 文本提取（结构化有限），支持 OCR 插件 |
| **Word** | `.docx` | ✅ 支持 | 完整格式转换，支持数学公式 |
| **Excel** | `.xlsx` | ✅ 支持 | 工作表和表格转换 |
| **PowerPoint** | `.pptx` | ✅ 支持 | 幻灯片内容提取 |
| **图片** | `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.webp` | ✅ 支持 | EXIF 元数据和 OCR |
| **音频** | `.mp3`, `.wav`, `.ogg`, `.m4a` | ✅ 支持 | 元数据和语音转录 |
| **HTML** | `.html`, `.htm` | ✅ 支持 | 网页内容提取 |
| **CSV** | `.csv` | ✅ 支持 | 转换为 Markdown 表格 |
| **JSON** | `.json` | ✅ 支持 | 结构化文本输出 |
| **XML** | `.xml` | ✅ 支持 | 结构化文本输出 |
| **Jupyter Notebook** | `.ipynb` | ✅ 支持 | 笔记本格式转换 |
| **ZIP** | `.zip` | ✅ 支持 | 递归处理内部文件 |
| **EPUB** | `.epub` | ✅ 支持 | 电子书格式 |
| **RSS/Atom** | `.rss`, `.atom`, `.xml` | ✅ 支持 | 订阅源转换 |
| **Wikipedia** | 特定 URL | ✅ 支持 | Wikipedia 页面提取 |

### 已知限制

| 格式 | 限制说明 | 相关问题 |
|-----|---------|---------|
| **PDF** | 表格识别和编号列表支持有限，可能提取为纯文本 | [#293](https://github.com/microsoft/markitdown/issues/293), [#296](https://github.com/microsoft/markitdown/issues/296) |
| **.doc** | 不支持旧版 Word 格式（`.doc`），仅支持 `.docx` | [#23](https://github.com/microsoft/markitdown/issues/23) |
| **音频** | Linux 系统需要 ffmpeg 或 avconv，否则会有 RuntimeWarning | [#1685](https://github.com/microsoft/markitdown/issues/1685) |
| **.ipynb** | 非 ASCII 文件可能触发 UnicodeDecodeError | [#1894](https://github.com/microsoft/markitdown/issues/1894) |

## 转换器架构

### 转换器注册机制

MarkItDown 使用动态转换器注册系统，每个文件格式由对应的 `DocumentConverter` 子类处理。

```mermaid
graph TD
    A[文件输入] --> B[StreamInfo 检测]
    B --> C{文件类型判断}
    C -->|PDF| D[PdfConverter]
    C -->|DOCX| E[DocxConverter]
    C -->|XLSX| F[XlsxConverter]
    C -->|PPTX| G[PptxConverter]
    C -->|图片| H[ImageConverter]
    C -->|HTML| I[HtmlConverter]
    C -->|ZIP| J[ZipConverter]
    C -->|其他| K[通用转换器]
    
    D --> L[Markdown 输出]
    E --> L
    F --> L
    G --> L
    H --> L
    I --> L
    J --> L
    K --> L
    
    style A fill:#e1f5fe
    style L fill:#c8e6c9
```

资料来源：[packages/markitdown/src/markitdown/converters/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/__init__.py)

### DocumentConverter 基类

所有转换器继承自 `DocumentConverter` 基类，需要实现以下核心方法：

| 方法 | 说明 |
|-----|------|
| `accepts(file_stream, stream_info, **kwargs)` | 判断转换器是否接受该文件流 |
| `convert(file_stream, stream_info, **kwargs)` | 执行文件到 Markdown 的转换 |
| `PRIORITY_SPECIFIC_FILE_FORMAT` | 特定文件格式优先级（100.0） |
| `PRIORITY_GENERIC_FILE_FORMAT` | 通用文件格式优先级（50.0） |

资料来源：[packages/markitdown/src/markitdown/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__init__.py)

## 文档格式详解

### PDF 转换

PDF 转换器使用 `pdfminer.six` 库提取文本内容。

**支持的功能：**
- 文本段落提取
- 标题检测（基于字体大小）
- 表格提取（对齐 Markdown 支持）
- 编号列表识别

**限制：**
- PDF 是可视化格式，结构信息有限
- 表格转换可能不完美（社区反馈 [#293](https://github.com/microsoft/markitdown/issues/293)）
- 扫描的 PDF 需要 OCR 支持

资料来源：[packages/markitdown/src/markitdown/converters/_pdf_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pdf_converter.py)

**使用 Azure Document Intelligence 提升 PDF 质量：**

```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("document.pdf")
print(result.text_content)
```

### Word 文档 (DOCX)

Word 转换器使用 `mammoth` 库处理 DOCX 文件。

**支持的功能：**
- 段落和标题提取
- 列表（有序和无序）
- 表格转换
- 图片提取（带描述）
- **数学公式渲染**（从 v0.1.2 支持）
- 链接处理

**已知问题：**
- 无效的 Office Open XML 文件会返回成功状态，但 `text_content` 包含错误消息
- 仅支持 `.docx` 格式，不支持旧版 `.doc` 格式

资料来源：[packages/markitdown/src/markitdown/converters/_docx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_docx_converter.py)

### Excel 电子表格 (XLSX)

Excel 转换器将工作表数据转换为 Markdown 表格格式。

**输出示例：**

```markdown
## Sheet1
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| data1   | data2   | data3   |
| data4   | data5   | data6   |
```

**特性：**
- 多工作表处理
- 保持原始列结构
- 空单元格处理

资料来源：[packages/markitdown/src/markitdown/converters/_xlsx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_xlsx_converter.py)

### PowerPoint 演示文稿 (PPTX)

PowerPoint 转换器提取幻灯片内容和备注。

**输出结构：**
- 每张幻灯片作为独立章节
- 提取文本内容和形状
- 图片和备注信息

资料来源：[packages/markitdown/src/markitdown/converters/_pptx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pptx_converter.py)

## 图片与媒体格式

### 图片转换

图片转换器支持 EXIF 元数据提取和可选的 LLM 描述生成。

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="描述这张图片的内容"
)
result = md.convert("photo.jpg")
```

**支持的图片格式：**
- JPEG (.jpg, .jpeg)
- PNG (.png)
- GIF (.gif)
- BMP (.bmp)
- WebP (.webp)

### 音频转换

音频文件支持 EXIF 元数据提取和语音转录。

**依赖要求：**
- Linux 系统需要安装 `ffmpeg` 或 `avconv`
- 缺少时会产生 RuntimeWarning

```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg
```

> ⚠️ 社区反馈：Linux 环境下缺少 ffmpeg 会触发 RuntimeWarning，影响用户体验。相关问题：[#1685](https://github.com/microsoft/markitdown/issues/1685)

## 基于文本的格式

### HTML 与 Wikipedia

HTML 转换器使用 BeautifulSoup 解析网页内容。

**特殊处理：**
- Wikipedia 页面使用专门的 WikipediaConverter
- 提取主文档内容，排除侧边栏和导航

```mermaid
graph LR
    A[HTML 输入] --> B{是否为 Wikipedia?}
    B -->|是| C[WikipediaConverter]
    B -->|否| D[HtmlConverter]
    C --> E[提取主内容]
    D --> F[完整 HTML 处理]
    E --> G[Markdown 输出]
    F --> G
```

资料来源：[packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/converters/_wikipedia_converter.py)

### CSV 转换

CSV 文件直接转换为 Markdown 表格格式。

```python
md = MarkItDown()
result = md.convert("data.csv")
# 输出 Markdown 表格
```

### JSON 和 XML

结构化数据格式转换为易读的文本表示。

### RSS 和 Atom

订阅源格式转换为 Markdown，支持 .rss、.atom 和 .xml 扩展名。

资料来源：[packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)

## ZIP 文件处理

ZIP 转换器递归处理压缩包内的所有文件。

**处理流程：**

```mermaid
graph TD
    A[ZIP 文件] --> B[提取到临时目录]
    B --> C{遍历文件}
    C -->|文件| D[根据扩展名选择转换器]
    C -->|完成| E[合并结果]
    D --> F[PDF/DOCX/图片/...]
    F --> C
    E --> G[Markdown 输出]
    
    style A fill:#fff3e0
    style G fill:#c8e6c9
```

**输出格式：**

```markdown
Content from the zip file `example.zip`:

## File: docs/readme.txt
[文件内容]

## File: images/example.jpg
ImageSize: 1920x1080
DateTimeOriginal: 2024-02-15 14:30:00

## File: data/report.xlsx
[Excel 内容]
```

资料来源：[packages/markitdown/src/markitdown/converters/_zip_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_zip_converter.py)

## Azure 集成

### Azure Document Intelligence

使用 Azure AI 服务提升文档转换质量：

```bash
markitdown path-to-file.pdf -d -e "<document_intelligence_endpoint>"
```

**优势：**
- 更高质量的布局分析
- 更准确的表格提取
- OCR 支持

### Azure Content Understanding

从 v0.1.5 开始支持，提供更高级的多模态提取：

```python
from markitdown import MarkItDown, ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],
)
```

**支持的文件类型：**
- PDF 文档
- 图片
- 音频文件
- 视频文件

## OCR 插件

markitdown-ocr 插件为 PDF、DOCX、PPTX 和 XLSX 添加 OCR 功能。

**安装：**

```bash
pip install markitdown-ocr
pip install openai  # 或任何 OpenAI 兼容客户端
```

**使用：**

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
```

**命令行使用：**

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

> ⚠️ 社区反馈：CLI 参数 `--llm-client` 和 `--llm-model` 在某些版本中可能未被识别，请确保使用最新版。相关问题：[#1897](https://github.com/microsoft/markitdown/issues/1897)

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 插件扩展

### 插件架构

MarkItDown 支持通过插件扩展支持的格式。插件使用 `markitdown.plugin` 入口点注册。

**开发示例：**

```python
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    def __init__(self, priority=DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT):
        super().__init__(priority=priority)

    def accepts(self, file_stream, stream_info, **kwargs) -> bool:
        return stream_info.extension == ".rtf"

    def convert(self, file_stream, stream_info, **kwargs) -> DocumentConverterResult:
        # 实现转换逻辑
        raise NotImplementedError()

__plugin_interface_version__ = 1

def register_converters(markitdown: MarkItDown, **kwargs):
    markitdown.register_converter(RtfConverter())
```

**pyproject.toml 配置：**

```toml
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
```

资料来源：[packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

### 列出已安装插件

```bash
markitdown --list-plugins
```

输出示例：

```
Installed MarkItDown 3rd-party Plugins:

  * markitdown-ocr   (package: markitdown_ocr)
```

## 常见问题

### 为什么 PDF 转换质量不如预期？

PDF 是可视化格式，许多结构信息（如语义标题、表格边界）需要通过启发式方法推断。从 v0.1.5 开始，表格提取有所改进，使用对齐 Markdown 格式。相关讨论：[#293](https://github.com/microsoft/markitdown/issues/293), [#296](https://github.com/microsoft/markitdown/issues/296)

### 如何转换 .doc 文件？

MarkItDown 不支持旧版 `.doc` 格式，仅支持基于 Office Open XML 的 `.docx` 格式。建议将 `.doc` 文件在 Microsoft Word 中另存为 `.docx` 后转换。相关讨论：[#23](https://github.com/microsoft/markitdown/issues/23)

### Jupyter Notebook 转换失败怎么办？

如果 Jupyter Notebook 文件包含非 ASCII 字符，转换可能因 UnicodeDecodeError 失败。这是已知的 bug，参见 [#1894](https://github.com/microsoft/markitdown/issues/1894)。

### Office Open XML 文件显示成功但内容为错误消息？

对于无效的 DOCX/XLSX/PPTX 文件，MarkItDown 可能返回成功状态但 `text_content` 包含 `"This is not a valid Office Open XML file."` 消息。相关讨论：[#1408](https://github.com/microsoft/markitdown/issues/1408)

## 安装选项

MarkItDown 提供模块化安装，可按需选择功能：

```bash
# 仅核心功能
pip install markitdown

# 包含所有功能
pip install markitdown[all]

# 按功能组安装
pip install markitdown[docx]      # Word 支持
pip install markitdown[xlsx]      # Excel 支持
pip install markitdown[pptx]      # PowerPoint 支持
pip install markitdown[pdf]        # PDF 支持
pip install markitdown[image]      # 图片支持
pip install markitdown[audio]      # 音频支持
pip install markitdown[az-doc-intel]  # Azure Document Intelligence
pip install markitdown[az-content-understanding]  # Azure Content Understanding
```

## 相关链接

- [项目主页](https://github.com/microsoft/markitdown)
- [OCR 插件文档](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [插件开发指南](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [Azure Document Intelligence 配置](https://learn.microsoft.com/azure/ai-services/document-intelligence/)
- [Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/)

---

<a id='page-architecture'></a>

## 系统架构

### 相关页面

相关主题：[转换器系统](#page-converter-system), [插件系统](#page-plugin-system)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
- [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)
</details>

# 系统架构

## 概述

MarkItDown 是一个轻量级的 Python 工具库，用于将各种文件格式转换为 Markdown 格式，专为 LLM 和相关文本分析管道设计。该项目的核心架构采用**插件式转换器模式**，允许通过统一的接口扩展支持的文件格式。资料来源：[README.md:1-20]()

MarkItDown 的设计理念强调：
- **统一的转换接口**：所有转换器都继承自 `DocumentConverter` 基类
- **插件可扩展性**：支持第三方开发者创建自定义转换器
- **灵活的依赖管理**：依赖项按功能分组，支持选择性安装
- **多后端支持**：内置转换器、云服务集成（Azure Document Intelligence、Azure Content Understanding）

## 核心组件

MarkItDown 的核心架构由以下主要组件构成：

### MarkItDown 主类

`MarkItDown` 类是整个系统的入口点，负责：
- 管理所有内置和插件转换器的注册
- 路由文件到合适的转换器
- 协调 LLM 客户端配置
- 处理 URI 转换（本地文件、URL、数据 URI）

```python
from markitdown import MarkItDown

md = MarkItDown(
    enable_plugins=True,        # 启用第三方插件
    llm_client=OpenAI(),        # LLM 客户端用于图像描述
    llm_model="gpt-4o",        # LLM 模型名称
    docintel_endpoint="...",    # Azure Document Intelligence 端点
    cu_endpoint="..."          # Azure Content Understanding 端点
)
```

资料来源：[packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

### DocumentConverter 基类

所有转换器必须继承 `DocumentConverter` 基类，该类定义了转换器的基本接口：

| 方法 | 说明 |
|------|------|
| `accepts(file_stream, stream_info, **kwargs)` | 判断转换器是否接受给定文件流 |
| `convert(file_stream, stream_info, **kwargs)` | 执行文件到 Markdown 的转换 |
| `PRIORITY_*` 常量 | 转换器优先级（用于插件覆盖内置转换器）|

转换器优先级定义：
- `PRIORITY_MAXIMUM = 1000`：最高优先级
- `PRIORITY_SPECIFIC_FILE_FORMAT = 500`：特定文件格式
- `PRIORITY_DEFAULT = 0`：默认优先级
- `PRIORITY_MINIMUM = -1000`：最低优先级

资料来源：[packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

### StreamInfo 数据模型

`StreamInfo` 存储文件流元信息，用于转换器的 `accepts()` 方法判断文件类型：

```python
@dataclass
class StreamInfo:
    url: Optional[str] = None           # 源 URL
    extension: Optional[str] = None     # 文件扩展名
    mimetype: Optional[str] = None       # MIME 类型
    charset: Optional[str] = None        # 字符编码
```

资料来源：[packages/markitdown/src/markitdown/_stream_info.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_stream_info.py)

## 转换器体系

MarkItDown 内置了针对多种文件格式的转换器，采用分层架构处理不同类型的文档。

### 内置转换器

| 转换器 | 支持格式 | 依赖库 |
|--------|----------|--------|
| PdfConverter | .pdf | pdfminer.six |
| PptxConverter | .pptx | python-pptx |
| DocxConverter | .docx | mammoth |
| XlsxConverter | .xlsx | openpyxl |
| ImageConverter | .jpg, .png, .gif 等 | Pillow |
| AudioConverter | .mp3, .wav 等 | pydub, speech_recognition |
| HtmlConverter | .html, .htm | BeautifulSoup |
| CsvConverter | .csv | 内置 |
| JsonConverter | .json | 内置 |
| XmlConverter | .xml | 内置 |
| RssConverter | .rss, .atom | defusedxml, BeautifulSoup |
| WikipediaConverter | Wikipedia HTML 页面 | BeautifulSoup |
| ZipConverter | .zip | 内置（遍历内容）|
| IpynbConverter | .ipynb | 内置 |
| EPubConverter | .epub | 内置 |
| YoutubeConverter | YouTube URL | yt-dlp |

资料来源：[README.md:40-55]()

### 内置转换器的注册流程

```mermaid
graph TD
    A[MarkItDown.__init__] --> B[注册内置转换器]
    B --> C[加载 _converters 目录下的所有转换器]
    C --> D{enable_plugins?}
    D -->|是| E[通过 entry_points 加载第三方插件]
    D -->|否| F[跳过插件加载]
    E --> G[调用 register_converters 钩子]
    G --> H[注册插件提供的转换器]
    F --> I[返回 MarkItDown 实例]
    H --> I
```

## 插件机制

MarkItDown 的插件系统基于 Python 的 `entry_points` 机制实现，允许第三方开发者扩展支持的格式。

### 插件接口版本

当前支持的插件接口版本为 `__plugin_interface_version__ = 1`，定义在插件包的根级别。

### 插件开发规范

开发 MarkItDown 插件需要：

1. **创建 DocumentConverter 子类**
2. **实现 `register_converters()` 函数**
3. **配置 `pyproject.toml` entry point**

```python
# 示例：RTF 转换器插件
__plugin_interface_version__ = 1

def register_converters(markitdown: MarkItDown, **kwargs):
    markitdown.register_converter(RtfConverter())
```

Entry point 配置：
```toml
[project.entry-points."markitdown.plugin"]
my_plugin = "my_plugin_package"
```

资料来源：[packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

### markitdown-ocr 插件示例

`markitdown-ocr` 是官方提供的 OCR 增强插件，展示了完整的插件开发模式：

```mermaid
graph LR
    A[PDF/DOCX/PPTX/XLSX] --> B[markitdown-ocr 插件]
    B --> C{llm_client 提供?}
    C -->|是| D[使用 LLM Vision OCR]
    C -->|否| E[回退到内置转换器]
    D --> F[提取嵌入图像文本]
    E --> G[标准文本提取]
```

该插件的工作流程：
1. MarkItDown 发现插件通过 `markitdown.plugin` entry point
2. 调用 `register_converters()` 并传递 `llm_client` 和 `llm_model`
3. 创建 `LLMVisionOCRService` 执行 OCR
4. 使用优先级覆盖内置转换器

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 工作流程

### 转换请求处理流程

```mermaid
sequenceDiagram
    participant 用户
    participant MarkItDown
    participant ConverterRegistry
    participant Converter
    
    用户->>MarkItDown: convert(uri)
    MarkItDown->>MarkItDown: convert_uri(uri)
    MarkItDown->>MarkItDown: _get_file_info(uri)
    MarkItDown->>ConverterRegistry: 遍历注册转换器
    ConverterRegistry->>Converter: accepts(stream_info)
    Converter-->>ConverterRegistry: bool
    ConverterRegistry->>MarkItDown: 返回匹配的转换器
    MarkItDown->>Converter: convert(stream, stream_info, **kwargs)
    Converter-->>MarkItDown: DocumentConverterResult
    MarkItDown-->>用户: 返回结果
```

### 文件类型检测顺序

转换器的 `accepts()` 方法按注册顺序被调用，第一个返回 `True` 的转换器将被使用。检测逻辑基于：

1. **文件扩展名**：`stream_info.extension`
2. **MIME 类型**：`stream_info.mimetype`
3. **文件内容检测**：部分转换器会读取文件流进行内容分析
4. **URL 模式**：如 WikipediaConverter 依赖 URL 匹配

例如，`RssConverter` 的 `accepts()` 方法首先检查精确的 MIME 类型和扩展名，然后对 XML 文件进行深度内容检测：

```python
def accepts(self, file_stream, stream_info, **kwargs):
    mimetype = (stream_info.mimetype or "").lower()
    extension = (stream_info.extension or "").lower()
    
    # 精确匹配
    if extension in PRECISE_FILE_EXTENSIONS:
        return True
    for prefix in PRECISE_MIME_TYPE_PREFIXES:
        if mimetype.startswith(prefix):
            return True
    
    # XML 文件需要内容检测
    if extension in CANDIDATE_FILE_EXTENSIONS:
        return self._check_xml(file_stream)
```

资料来源：[packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)

## Azure 集成

MarkItDown 支持两种 Azure AI 服务集成，用于高质量文档转换。

### Azure Document Intelligence

适用于 PDF 和 Office 文档的云端解析：

```python
md = MarkItDown(docintel_endpoint="<endpoint>")
result = md.convert("document.pdf")
```

CLI 用法：
```bash
markitdown document.pdf -d -e "<document_intelligence_endpoint>"
```

### Azure Content Understanding

提供更高质量的多模态提取，支持结构化字段提取（YAML front matter）：

```python
from markitdown import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],
)
```

支持的格式路由：
- 文档：PDF、DOCX、PPTX、XLSX
- 图像：JPEG、PNG、GIF、BMP
- 音频：MP3、WAV、M4A
- 视频：MP4、MOV

资料来源：[packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

## 命令行接口

MarkItDown 提供完整的 CLI 功能，入口点位于 `__main__.py`：

### CLI 参数说明

| 参数 | 说明 | 示例 |
|------|------|------|
| `filename` | 输入文件（支持 stdin） | `markitdown file.pdf` |
| `-o, --output` | 输出文件路径 | `-o output.md` |
| `-x, --extension` | 文件扩展名提示（stdin 模式） | `-x pdf` |
| `-m, --mimetype` | MIME 类型提示 | `-m application/pdf` |
| `-c, --charset` | 字符集提示 | `-c utf-8` |
| `-d, --use-docintel` | 使用 Document Intelligence | `-d -e <endpoint>` |
| `--use-cu` | 使用 Content Understanding | `--use-cu --cu-endpoint <url>` |
| `--cu-file-types` | 路由到 CU 的文件类型 | `--cu-file-types pdf,jpeg` |
| `-p, --use-plugins` | 启用第三方插件 | `-p` |
| `--list-plugins` | 列出已安装插件 | `--list-plugins` |
| `--keep-data-uris` | 保留 data URI | `--keep-data-uris` |
| `-v, --version` | 显示版本 | `-v` |

资料来源：[packages/markitdown/src/markitdown/__main__.py:30-120](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### CLI 初始化流程

```mermaid
graph TD
    A[markitdown 命令] --> B[argparse 解析参数]
    B --> C{--list-plugins?}
    C -->|是| D[列出插件并退出]
    C -->|否| E{--use-docintel?}
    E -->|是| F[创建 MarkItDown<br/>docintel_endpoint]
    E -->|否| G{--use-cu?}
    G -->|是| H[创建 MarkItDown<br/>cu_endpoint + cu_file_types]
    G -->|否| I[创建 MarkItDown<br/>enable_plugins]
    F --> J[执行转换]
    H --> J
    I --> J
```

## 安全考虑

> [!IMPORTANT]
> MarkItDown 执行 I/O 操作时使用当前进程的权限。类似于 `open()` 或 `requests.get()`，它会访问进程本身可访问的资源。

在不受信任的环境中使用时，建议：
- **对输入进行清理**：验证文件来源和类型
- **使用最窄的转换方法**：如 `convert_stream()` 或 `convert_local()`
- **限制文件访问**：确保进程权限范围最小化

## 常见问题与已知限制

### 社区反馈的架构相关问题

| 问题 | 描述 | 影响范围 |
|------|------|----------|
| UnicodeDecodeError in IpynbConverter | 非 ASCII 文件触发未捕获异常 | #1894 |
| Office Open XML 无效文件处理 | 无效文件返回成功而非异常 | #1408 |
| pydub RuntimeWarning | Linux 上缺少 ffmpeg 时警告 | #1685 |
| CLI 未记录参数 | `--llm-client` 等参数未在 CLI 中记录 | #1897 |

### 架构限制说明

1. **PDF 转换**：基础 PDF 转换输出纯文本，非结构化 Markdown。对于表格、页眉页脚等结构化内容支持有限。
2. **.doc 格式**：仅支持 .docx，不支持旧版 .doc 格式。
3. **音频转换**：依赖 ffmpeg，在 Linux 系统上可能需要额外安装。

## 相关文档

- [安装指南](../Installation) — MarkItDown 安装与依赖管理
- [插件开发](./Plugin-Development) — 第三方插件开发教程
- [Azure 集成](./Azure-Integration) — Azure AI 服务配置
- [CLI 参考](./CLI-Reference) — 命令行工具完整参考
- [API 参考](./API-Reference) — Python API 详细文档

---

<a id='page-converter-system'></a>

## 转换器系统

### 相关页面

相关主题：[系统架构](#page-architecture)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)
- [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
- [packages/markitdown/src/markitdown/converters/_ipynb_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_ipynb_converter.py)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)
- [packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
</details>

# 转换器系统

## 概述

MarkItDown 的转换器系统（Converter System）是一个基于插件架构的文档转换框架，负责将各种格式的文档转换为 Markdown 格式。该系统采用**优先级驱动的选择机制**，允许内置转换器和第三方插件动态注册自己的文档转换器。

转换器系统的核心设计理念：

- **模块化**：每种文档格式由独立的转换器类处理
- **可扩展性**：通过插件机制支持第三方转换器
- **优先级机制**：插件可以替换或覆盖内置转换器
- **流式处理**：支持从文件流、URL、本地路径等多种来源读取文档

资料来源：[packages/markitdown/src/markitdown/_base_converter.py:1-50](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

---

## 架构设计

### 系统组件

```mermaid
graph TD
    subgraph "核心层"
        MD[MarkItDown 主类]
        BC[BaseConverter 基类]
    end
    
    subgraph "内置转换器"
        PDF[PDFConverter]
        DOCX[DocxConverter]
        PPTX[PptxConverter]
        XLSX[XlsxConverter]
        HTML[HtmlConverter]
        RSS[RssConverter]
        WIKI[WikipediaConverter]
        IPYNB[IpynbConverter]
        CU[CUConverter]
    end
    
    subgraph "插件扩展"
        OCR[markitdown-ocr]
        CUSTOM[自定义插件]
    end
    
    MD --> BC
    MD --> PDF
    MD --> DOCX
    MD --> PPTX
    MD --> XLSX
    MD --> HTML
    MD --> RSS
    MD --> WIKI
    MD --> IPYNB
    MD --> CU
    
    MD -.->|动态注册| OCR
    MD -.->|动态注册| CUSTOM
    
    style MD fill:#e1f5fe
    style BC fill:#fff3e0
```

### 转换器注册流程

```mermaid
sequenceDiagram
    participant U as 用户代码
    participant MD as MarkItDown
    participant PL as 插件发现
    participant RG as 转换器注册
    participant CNV as 转换器集合
    
    U->>MD: new MarkItDown(enable_plugins=True)
    MD->>PL: 发现 markitdown.plugin 入口点
    PL-->>MD: 返回插件列表
    loop 每个插件
        MD->>PL: 调用 register_converters()
        PL->>RG: 注册自定义转换器
        RG->>CNV: 按优先级插入转换器
    end
    MD-->>U: MarkItDown 实例就绪
```

---

## 基类设计

### DocumentConverter 基类

所有转换器都必须继承自 `DocumentConverter` 基类，定义在 [`_base_converter.py`](packages/markitdown/src/markitdown/_base_converter.py) 中。

| 属性/方法 | 类型 | 说明 |
|-----------|------|------|
| `PRIORITY_DEFAULT` | float | 默认优先级 0.0 |
| `PRIORITY_SPECIFIC_FILE_FORMAT` | float | 特定文件格式优先级 100.0 |
| `priority` | float | 转换器优先级，值越大优先级越高 |
| `accepts()` | method | 判断转换器是否接受该文件 |
| `convert()` | method | 执行实际的文档转换 |

#### accepts() 方法签名

```python
def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool:
    """
    判断此转换器是否能够处理给定的文件流
    
    Args:
        file_stream: 文件二进制流
        stream_info: 文件元信息（扩展名、MIME类型等）
        **kwargs: 传递给转换器的额外参数
    
    Returns:
        bool: 如果转换器可以处理此文件返回 True
    """
```

#### convert() 方法签名

```python
def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    """
    将文件流转换为 Markdown 格式
    
    Args:
        file_stream: 文件二进制流
        stream_info: 文件元信息
        **kwargs: 传递给转换器的额外参数
    
    Returns:
        DocumentConverterResult: 包含转换结果的 Result 对象
    """
```

资料来源：[packages/markitdown/src/markitdown/_base_converter.py:14-80](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

### StreamInfo 数据类

`StreamInfo` 封装了文件的元信息，用于转换器判断是否接受该文件。

| 字段 | 类型 | 说明 |
|------|------|------|
| `url` | str | 文件来源 URL（可选） |
| `extension` | str | 文件扩展名（如 ".pdf"） |
| `mimetype` | str | MIME 类型（如 "application/pdf"） |
| `charset` | str | 字符编码（可选） |

资料来源：[packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

---

## 内置转换器

MarkItDown 提供了多种内置转换器，覆盖常见的文档格式。

### 转换器列表

| 转换器 | 文件扩展名 | MIME 类型 | 优先级 | 依赖库 |
|--------|-----------|-----------|--------|--------|
| `PdfConverter` | .pdf | application/pdf | 100.0 | pdfminer.six, pdfplumber |
| `DocxConverter` | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | 100.0 | mammoth |
| `PptxConverter` | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | 100.0 | python-pptx |
| `XlsxConverter` | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 100.0 | openpyxl |
| `WikipediaConverter` | .html/.htm | text/html | 50.0 | beautifulsoup4 |
| `RssConverter` | .rss, .atom, .xml | application/rss+xml, application/atom+xml | 50.0 | beautifulsoup4, defusedxml |
| `IpynbConverter` | .ipynb | application/x-ipynb+json | 100.0 | - |
| `HtmlConverter` | .html, .htm | text/html | 10.0 | beautifulsoup4 |
| `CsvConverter` | .csv | text/csv | 100.0 | - |
| `CUConverter` | * | * | -1000.0 | azure-ai-contentunderstanding |

### 转换器实现示例：IpynbConverter

`IpynbConverter` 负责将 Jupyter Notebook 文件（.ipynb）转换为 Markdown。

```python
class IpynbConverter(DocumentConverter):
    """将 Jupyter Notebook (.ipynb) 转换为 Markdown"""
    
    def __init__(self):
        super().__init__(priority=DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT)
    
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
        extension = (stream_info.extension or "").lower()
        mimetype = (stream_info.mimetype or "").lower()
        
        # 检查文件扩展名
        if extension == ".ipynb":
            return True
        
        # 检查 MIME 类型
        if mimetype == "application/x-ipynb+json":
            return True
        
        return False
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> DocumentConverterResult:
        # 解析 JSON 并转换为 Markdown
        content = json.loads(file_stream.read())
        # ... 转换逻辑
```

资料来源：[packages/markitdown/src/markitdown/converters/_ipynb_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_ipynb_converter.py)

### WikipediaConverter

`WikipediaConverter` 是专门处理 Wikipedia 页面的转换器，它从 HTML 中提取主要内容区域（`<div id="mw-content-text">`）。

```python
class WikipediaConverter(DocumentConverter):
    """专门处理 Wikipedia 页面的转换器"""
    
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
        url = stream_info.url or ""
        
        # 只处理来自 Wikipedia 的内容
        if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
            return False
        # ...
```

资料来源：[packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)

### Azure Content Understanding 转换器

`CUConverter` 是一个特殊的转换器，用于与 Azure Content Understanding 服务集成，提供高质量的多模态文档提取。

```python
class CUConverter(DocumentConverter):
    """使用 Azure Content Understanding 的转换器"""
    
    def __init__(
        self,
        cu_endpoint: str,
        cu_file_types: Optional[List[ContentUnderstandingFileType]] = None,
        **kwargs,
    ):
        super().__init__(priority=-1000.0)  # 低优先级作为后备
        # ...
```

资料来源：[packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

---

## 插件系统

### 插件架构概述

MarkItDown 的插件系统允许第三方开发者扩展转换器功能。插件通过 Python 入口点机制注册。

```mermaid
graph LR
    subgraph "插件包结构"
        PE["pyproject.toml<br/>[project.entry-points<br/>'markitdown.plugin']"]
        RC["register_converters()"]
        CV["Converter 实现"]
    end
    
    PE --> RC
    RC --> CV
```

### 开发自定义插件

#### 1. 实现 DocumentConverter 子类

```python
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    
    def __init__(self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT):
        super().__init__(priority=priority)
    
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
        extension = (stream_info.extension or "").lower()
        return extension == ".rtf"
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> DocumentConverterResult:
        # 实现 RTF 到 Markdown 的转换逻辑
        content = file_stream.read().decode("utf-8")
        markdown = convert_rtf_to_markdown(content)
        return DocumentConverterResult(text_content=markdown)
```

#### 2. 创建注册函数

```python
__plugin_interface_version__ = 1  # 当前仅支持版本 1

def register_converters(markitdown: MarkItDown, **kwargs):
    """
    在 MarkItDown 实例化时被调用，用于注册插件提供的转换器
    """
    markitdown.register_converter(RtfConverter())
```

#### 3. 配置入口点

在 `pyproject.toml` 中添加：

```toml
[project.entry-points."markitdown.plugin"]
my_rtf_plugin = "my_package.my_rtf_plugin"
```

资料来源：[packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

### markitdown-ocr 插件

`markitdown-ocr` 是官方提供的 OCR 插件，为 PDF、DOCX、PPTX 和 XLSX 转换器添加图像文字识别功能。

#### 特性

- **嵌入式图像 OCR**：从 PDF、DOCX、PPTX、XLSX 中提取嵌入图像并进行 OCR
- **全页 OCR 回退**：对扫描版 PDF 自动检测并进行全页 OCR
- **上下文保持**：保持文档结构，插入提取的文本

#### 工作原理

```mermaid
sequenceDiagram
    participant MD as MarkItDown
    participant OCR as OCR 插件
    participant LLM as LLM Vision API
    participant BUILTIN as 内置转换器
    
    Note over MD,OCR: 初始化阶段
    MD->>OCR: enable_plugins=True, llm_client, llm_model
    OCR->>OCR: 创建 LLMVisionOCRService
    OCR->>MD: 注册优先级为 -1.0 的 OCR 转换器
    
    Note over MD,BUILTIN: 转换阶段（优先级 -1.0 < 0.0）
    MD->>OCR: 调用 accepts() 检查文件
    OCR->>BUILTIN: 转发给内置转换器
    BUILTIN-->>OCR: 返回结果
    OCR->>OCR: 提取嵌入图像
    loop 每个图像
        OCR->>LLM: 发送图像到 LLM Vision
        LLM-->>OCR: 返回识别的文本
    end
    OCR-->>MD: 返回包含 OCR 文本的结果
```

#### 安装与使用

```bash
pip install markitdown-ocr
pip install openai  # 或其他 OpenAI 兼容客户端
```

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

---

## 转换器选择机制

### 优先级系统

转换器通过优先级决定处理顺序：

| 优先级范围 | 用途 |
|-----------|------|
| > 100.0 | 特定文件格式转换器 |
| 10.0 - 100.0 | 专用转换器（如 Wikipedia） |
| 0.0 | 默认转换器 |
| -1000.0 | 后备/云服务转换器 |

### 选择流程

```mermaid
flowchart TD
    A[开始转换] --> B{启用插件?}
    B -->|是| C[加载所有插件转换器]
    B -->|否| D[仅使用内置转换器]
    C --> E[按优先级排序所有转换器]
    D --> E
    E --> F{遍历转换器}
    F -->|还有转换器| G[调用 accepts]
    G --> H{返回 True?}
    H -->|是| I[调用 convert]
    I --> J[返回结果]
    H -->|否| F
    J --> K[结束]
    F -->|无更多转换器| L[抛出异常]
    L --> K
```

---

## 使用方法

### Python API

#### 基本用法

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

#### 启用插件

```python
md = MarkItDown(enable_plugins=True)
result = md.convert("document_with_images.pdf")
```

#### 使用 Azure Document Intelligence

```python
md = MarkItDown(
    enable_plugins=False,
    docintel_endpoint="https://your-resource.cognitiveservices.azure.com/",
)
result = md.convert("document.pdf")
```

### 命令行接口

```bash
# 基本转换
markitdown document.pdf -o output.md

# 启用插件
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o

# 使用 Document Intelligence
markitdown document.pdf -d -e "https://your-endpoint.cognitiveservices.azure.com/"

# 列出已安装插件
markitdown --list-plugins
```

---

## 常见问题与故障排除

### 问题 1：IpynbConverter 的 UnicodeDecodeError

**问题描述**：`IpynbConverter.accepts()` 在处理非 ASCII 文件时抛出 `UnicodeDecodeError`，导致整个转换管道崩溃。

**影响版本**：受影响的版本（参见 [Issue #1894](https://github.com/microsoft/markitdown/issues/1894)）

**原因**：`accepts()` 方法直接读取文件流并解码，未处理编码错误。

**解决方案**：

```python
def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
    extension = (stream_info.extension or "").lower()
    
    if extension == ".ipynb":
        try:
            # 使用 errors='ignore' 避免编码错误
            content = file_stream.read().decode("utf-8", errors="ignore")
            json.loads(content)  # 验证 JSON 格式
            return True
        except (json.JSONDecodeError, UnicodeDecodeError):
            return False
    return False
```

### 问题 2：Office Open XML 无效文件处理

**问题描述**：当转换无效的 DOCX、XLSX 或 PPTX 文件时，MarkItDown 返回成功结果，错误信息作为文本内容返回，而非抛出异常。

**影响**：难以区分成功转换和失败转换。

**参考**：[Issue #1408](https://github.com/microsoft/markitdown/issues/1408)

### 问题 3：pydub 音频警告

**问题描述**：在 Linux 系统上运行时，`pydub` 依赖会发出 `RuntimeWarning: Couldn't find ffmpeg or avconv` 警告。

**解决方案**：安装 ffmpeg

```bash
# Debian/Ubuntu
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

**参考**：[Issue #1685](https://github.com/microsoft/markitdown/issues/1685)

---

## 安全注意事项

> [!IMPORTANT]
> MarkItDown 以当前进程的权限执行 I/O 操作，类似于 `open()` 或 `requests.get()`。它会访问进程本身可访问的资源。

**建议**：

1. 在不受信任的环境中运行时，对输入进行清理
2. 使用最窄的转换方法：
   - `convert_stream()` - 仅处理给定的文件流
   - `convert_local()` - 仅处理本地文件
   - `convert_uri()` - 处理 URI（包括 file:// 和 data://）

---

## 配置选项

### MarkItDown 构造函数参数

| 参数 | 类型 | 默认值 | 说明 |
|------|------|--------|------|
| `enable_plugins` | bool | False | 是否启用第三方插件 |
| `llm_client` | Any | None | LLM 客户端实例（用于图像描述和 OCR） |
| `llm_model` | str | None | LLM 模型名称 |
| `llm_prompt` | str | None | 自定义 LLM 提示词 |
| `docintel_endpoint` | str | None | Azure Document Intelligence 端点 |
| `cu_endpoint` | str | None | Azure Content Understanding 端点 |
| `cu_file_types` | List[ContentUnderstandingFileType] | None | 仅使用 CU 的文件类型 |

### CLI 参数

| 参数 | 简写 | 说明 |
|------|------|------|
| `--version` | `-v` | 显示版本号 |
| `--output` | `-o` | 输出文件路径 |
| `--extension` | `-x` | 从 stdin 读取时的文件扩展名提示 |
| `--list-plugins` | `-l` | 列出已安装的插件 |
| `--use-plugins` | `-p` | 启用第三方插件 |
| `--use-docintel` | `-d` | 使用 Azure Document Intelligence |
| `--endpoint` | `-e` | 指定服务端点 |
| `--use-cu` | - | 使用 Azure Content Understanding |
| `--cu-endpoint` | - | Content Understanding 端点 |
| `--llm-client` | - | LLM 客户端类型（如 openai） |
| `--llm-model` | - | LLM 模型名称 |
| `--llm-prompt` | - | 自定义 LLM 提示词 |

---

## 参见

- [MarkItDown 主文档](../README)
- [markitdown-ocr 插件文档](../markitdown-ocr)
- [开发自定义插件指南](../markitdown-sample-plugin)
- [CLI 使用指南](./CLI-Usage)
- [Azure 集成配置](./Azure-Integration)

---

<a id='page-installation'></a>

## 安装与配置

### 相关页面

相关主题：[命令行使用](#page-cli-usage), [Python API](#page-python-api)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown/pyproject.toml](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/pyproject.toml)
- [Dockerfile](https://github.com/microsoft/markitdown/blob/main/Dockerfile)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
</details>

# 安装与配置

本文档详细介绍 MarkItDown 的安装方法、依赖管理、配置选项以及常见问题的解决方案。MarkItDown 是一个轻量级的 Python 工具，用于将各种文件格式转换为 Markdown，以便与 LLM 和文本分析管道配合使用。

## 系统要求

### Python 版本要求

MarkItDown 要求 **Python 3.10 或更高版本**。项目代码中使用了类型提示和现代 Python 特性，低于 3.10 的版本无法正常运行。

```bash
# 检查 Python 版本
python3 --version
```

### 操作系统兼容性

MarkItDown 支持以下操作系统：

| 操作系统 | 支持状态 | 备注 |
|---------|---------|------|
| Linux | ✅ 完全支持 | 音频转换需要 ffmpeg |
| macOS | ✅ 完全支持 | 需要安装 ffmpeg 用于音频处理 |
| Windows | ✅ 完全支持 | onnxruntime 已针对 Windows 进行了特殊处理 |

> ⚠️ **注意**：在 Linux 系统上使用音频功能时，如果未安装 ffmpeg 或 avconv，可能会收到 RuntimeWarning 警告。建议安装 ffmpeg 以避免此问题。资料来源：[github.com/microsoft/markitdown/issues/1685](https://github.com/microsoft/markitdown/issues/1685)

## 安装方法

### 方法一：通过 pip 安装（推荐）

#### 安装完整版本（包含所有功能）

```bash
pip install markitdown[all]
```

完整安装包含所有内置转换器和可选依赖项，包括：
- PDF 转换器
- PowerPoint 转换器
- Word 转换器
- Excel 转换器
- 图片转换器（包含 EXIF 元数据和 OCR）
- 音频转换器（包含语音转录）
- Azure Document Intelligence 集成
- Azure Content Understanding 集成

#### 安装基础版本

```bash
pip install markitdown
```

基础安装仅包含核心功能和最常用的转换器。

#### 按需安装特定功能

MarkItDown 将依赖项组织为功能组，可以选择性安装所需的转换器和功能：

| 功能组 | 安装命令 | 包含内容 |
|-------|---------|---------|
| PDF 支持 | `pip install markitdown[pdf]` | PDF 文本提取和表格提取 |
| Office 文档 | `pip install markitdown[office]` | DOCX, PPTX, XLSX 转换 |
| 图片支持 | `pip install markitdown[image]` | 图片元数据和 OCR |
| 音频支持 | `pip install markitdown[audio]` | 音频元数据和语音转录 |
| Azure 集成 | `pip install markitdown[az-doc-intel]` | Azure Document Intelligence |
| 内容理解 | `pip install markitdown[az-content-understanding]` | Azure Content Understanding |
| 所有功能 | `pip install markitdown[all]` | 包含上述所有功能 |

资料来源：[packages/markitdown/pyproject.toml](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/pyproject.toml)

### 方法二：从源码安装

对于需要最新功能或希望参与开发的用户，可以从源码安装：

```bash
# 克隆仓库
git clone git@github.com:microsoft/markitdown.git
cd markitdown

# 创建虚拟环境（推荐）
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# 或: .venv\Scripts\activate  # Windows

# 安装为可编辑模式
pip install -e packages/markitdown[all]
```

如果使用 `uv` 工具管理虚拟环境：

```bash
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install -e packages/markitdown[all]
```

### 方法三：Docker 部署

MarkItDown 提供官方 Docker 镜像，支持无依赖安装：

```bash
# 构建镜像
docker build -t markitdown:latest .

# 使用 Docker 运行
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```

Docker 方式适合在不希望安装 Python 依赖的环境中快速部署使用。

资料来源：[Dockerfile](https://github.com/microsoft/markitdown/blob/main/Dockerfile)

## 插件系统安装

MarkItDown 支持通过插件扩展功能。插件通过 `markitdown.plugin` 入口点组进行注册。

### 官方 OCR 插件

`markitdown-ocr` 插件为 PDF、DOCX、PPTX 和 XLSX 转换器添加 OCR 支持，使用 LLM Vision 从嵌入式图片中提取文本。

**安装步骤：**

```bash
# 安装插件
pip install markitdown-ocr

# 安装 OpenAI 兼容客户端（用于 LLM 调用）
pip install openai
```

**使用方法：**

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

或者在 Python API 中：

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
```

> ⚠️ **已知问题**：在某些情况下，CLI 参数组合可能未被正确识别。例如，运行 `markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o` 时可能出现"Unrecognized Arguments"错误。资料来源：[github.com/microsoft/markitdown/issues/1897](https://github.com/microsoft/markitdown/issues/1897)

### 自定义插件开发

MarkItDown 提供了示例插件项目 `markitdown-sample-plugin`，展示了如何开发自定义转换器：

1. 实现 `DocumentConverter` 子类
2. 导出 `__plugin_interface_version__` 变量
3. 实现 `register_converters()` 函数
4. 在 `pyproject.toml` 中配置入口点

```toml
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
```

资料来源：[packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

### 查看已安装插件

```bash
markitdown --list-plugins
```

此命令列出所有已安装的第三方插件，帮助用户了解可用的扩展功能。

## 命令行配置

### 常用 CLI 参数

MarkItDown CLI 提供多种参数用于控制转换行为：

| 参数 | 简写 | 说明 | 示例 |
|-----|-----|------|------|
| `--version` | `-v` | 显示版本号 | `markitdown --version` |
| `--output` | `-o` | 指定输出文件名 | `markitdown file.pdf -o output.md` |
| `--extension` | `-x` | 提供文件扩展名提示（从 stdin 读取时） | `markitdown -x pdf < input.bin` |
| `--list-plugins` | 无 | 列出已安装的插件 | `markitdown --list-plugins` |
| `--use-plugins` | `-p` | 启用第三方插件 | `markitdown --use-plugins file.pdf` |
| `--use-docintel` | `-d` | 使用 Azure Document Intelligence | `markitdown --use-docintel file.pdf` |
| `--endpoint` | `-e` | 指定服务终结点 | `markitdown -e "https://..." file.pdf` |

资料来源：[packages/markitdown/src/markitdown/__main__.py:1-100](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### CLI 使用示例

#### 基本转换

```bash
# 从文件转换
markitdown example.pdf > output.md

# 指定输出文件
markitdown example.xlsx -o spreadsheet.md

# 从标准输入读取
cat example.docx | markitdown > output.md
```

#### 使用 Azure Document Intelligence

```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```

#### 使用 Azure Content Understanding

```bash
markitdown document.pdf -o output.md --use-cu --cu-endpoint "<content_understanding_endpoint>"
```

#### 使用插件

```bash
# 列出插件
markitdown --list-plugins

# 启用插件进行转换
markitdown --use-plugins document_with_images.pdf --llm-client openai --llm-model gpt-4o
```

## Python API 配置

### 基础配置

MarkItDown 的核心配置通过 `MarkItDown` 类的构造函数完成：

```python
from markitdown import MarkItDown

# 基础使用
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

### 完整配置选项

| 参数 | 类型 | 默认值 | 说明 |
|-----|------|--------|------|
| `enable_plugins` | `bool` | `False` | 是否启用第三方插件 |
| `llm_client` | `Any` | `None` | LLM 客户端实例（如 OpenAI 客户端） |
| `llm_model` | `str` | `None` | LLM 模型名称（如 "gpt-4o"） |
| `llm_prompt` | `str` | `None` | 自定义 LLM 提示词 |
| `docintel_endpoint` | `str` | `None` | Azure Document Intelligence 终结点 |
| `docintel_key` | `str` | `None` | Azure Document Intelligence API 密钥 |
| `cu_endpoint` | `str` | `None` | Azure Content Understanding 终结点 |
| `cu_file_types` | `list` | `None` | 仅使用 Content Understanding 的文件类型 |

资料来源：[packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### 使用 LLM 进行图片描述

MarkItDown 支持使用 LLM 为图片生成描述（目前支持 pptx 和图片文件）：

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="可选的自定义提示词"
)
result = md.convert("example.jpg")
print(result.text_content)
```

### 使用 Azure Document Intelligence

```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
```

### 使用 Azure Content Understanding

```python
from markitdown import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # 仅 PDF 使用 CU
)
result = md.convert("document.pdf")
print(result.text_content)
```

## 环境变量配置

### MCP 服务器环境变量

当 MarkItDown 作为 MCP 服务器运行时，可以从环境变量读取配置：

| 环境变量 | 说明 |
|---------|------|
| `MARKITDOWN_ENABLE_PLUGINS` | 设置为 "1" 或 "true" 以启用插件 |

```bash
export MARKITDOWN_ENABLE_PLUGINS=1
markitdown --server
```

## 虚拟环境配置

### 创建虚拟环境

强烈建议使用虚拟环境来安装 MarkItDown，以避免依赖冲突：

**使用标准 Python：**

```bash
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# 或
.venv\Scripts\activate  # Windows
```

**使用 uv 工具：**

```bash
uv venv --python 3.12 .venv
source .venv/bin/activate
```

### 在虚拟环境中安装

```bash
# 激活虚拟环境后
pip install markitdown[all]

# 或使用 uv
uv pip install markitdown[all]
```

## 依赖项说明

### 核心依赖

MarkItDown 的核心功能依赖于以下包：

| 依赖包 | 版本要求 | 用途 |
|-------|---------|------|
| Python | ≥3.10 | 运行环境 |

### 可选依赖组

在 `pyproject.toml` 中定义的可选依赖组：

```toml
[project.optional-dependencies]
pdf = ["pdfminer.six", "pdfplumber", "pymupdf"]
office = ["mammoth", "python-pptx", "openpyxl"]
image = ["Pillow", "exif"]
audio = ["pydub", "SpeechRecognition"]
html = ["beautifulsoup4", "html5lib"]
epub = ["ebooklib"]
doc-intel = ["azure-ai-formrecognizer"]
az-cu = ["azure-ai-contentunderstanding"]
all = [
    "pdfminer.six",
    "pdfplumber",
    "pymupdf",
    "mammoth",
    "python-pptx",
    "openpyxl",
    "Pillow",
    "exif",
    "pydub",
    "SpeechRecognition",
    "beautifulsoup4",
    "html5lib",
    "ebooklib",
    "azure-ai-formrecognizer",
    "azure-ai-contentunderstanding",
]
```

资料来源：[packages/markitdown/pyproject.toml](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/pyproject.toml)

## 常见配置问题与解决方案

### 问题一：CLI 参数识别错误

**问题描述**：运行 `markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o` 时出现"Unrecognized Arguments"错误。

**解决方案**：确保使用正确的参数格式。某些参数组合可能需要通过环境变量或 Python API 设置。

资料来源：[github.com/microsoft/markitdown/issues/1897](https://github.com/microsoft/markitdown/issues/1897)

### 问题二：Linux 系统缺少 ffmpeg

**问题描述**：在 Linux 系统上运行时出现 `RuntimeWarning: Couldn't find ffmpeg or avconv` 警告。

**解决方案**：安装 ffmpeg

```bash
# Debian/Ubuntu
sudo apt-get install ffmpeg

# Fedora
sudo dnf install ffmpeg

# macOS
brew install ffmpeg
```

资料来源：[github.com/microsoft/markitdown/issues/1685](https://github.com/microsoft/markitdown/issues/1685)

### 问题三：非 ASCII 文件名处理

**问题描述**：处理包含非 ASCII 字符文件名的文件时可能出现 `UnicodeDecodeError`。

**解决方案**：确保 Python 运行环境使用 UTF-8 编码：

```bash
export PYTHONIOENCODING=utf-8
```

### 问题四：Azure 服务认证

**问题描述**：使用 Azure Document Intelligence 或 Content Understanding 时认证失败。

**解决方案**：可以通过多种方式配置凭据：

```python
# 使用密钥
from azure.core.credentials import AzureKeyCredential

md = MarkItDown(
    docintel_endpoint="<endpoint>",
    docintel_key="<key>"  # 或通过 AzureKeyCredential
)

# 使用 DefaultAzureCredential（自动从环境读取）
from azure.identity import DefaultAzureCredential
```

## 安全注意事项

> [!IMPORTANT]
> MarkItDown 会以当前进程的权限执行 I/O 操作。就像 `open()` 或 `requests.get()` 一样，它将访问进程本身可以访问的资源。在不受信任的环境中，请务必对输入进行清理，并调用最窄的 `convert_*` 函数（如 `convert_stream()` 或 `convert_local()`）以降低安全风险。

### 安全最佳实践

1. **最小权限原则**：在受控环境中运行，不要以管理员权限运行 MarkItDown
2. **输入验证**：在使用前验证所有文件输入
3. **选择合适的转换方法**：
   - `convert_stream()` - 最安全，适合处理不可信输入
   - `convert_local()` - 用于本地文件
   - `convert_uri()` - 用于 URI 资源
   - `convert()` - 通用方法，自动检测资源类型

## 升级与迁移

### 从旧版本升级

```bash
pip install --upgrade markitdown[all]
```

### 版本 0.1.x 重要变更

| 版本 | 重要变更 |
|-----|---------|
| 0.1.1 | `convert_url` 重命名为 `convert_uri`，支持 data 和 file URI |
| 0.1.2 | 新增 CSV 到 Markdown 表格转换；DOCX 数学公式渲染 |
| 0.1.3 | MCP 服务器支持；Windows 上 onnxruntime 固定 |
| 0.1.4 | mammoth 升级到 1.11.0（安全修复）；pdfminer.six 升级 |
| 0.1.5 | PDF 表格提取改进；支持对齐的 Markdown 表格 |
| 0.1.6 | 新增 OCR 层服务；修复 PDF 转换中的 O(n) 内存增长 |

## 相关文档

- [README 主页](https://github.com/microsoft/markitdown/blob/main/README.md)
- [markitdown-ocr 插件文档](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [自定义插件开发指南](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [Azure Document Intelligence 集成](../azure-integration/)
- [支持的转换器列表](../converters/)

---

<a id='page-cli-usage'></a>

## 命令行使用

### 相关页面

相关主题：[安装与配置](#page-installation)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
</details>

# 命令行使用

MarkItDown 提供了一个功能完整的命令行界面（CLI），允许用户在不编写 Python 代码的情况下直接将各种文件格式转换为 Markdown 文档。本页面详细说明 CLI 的所有功能、选项和使用模式。

## 概述

MarkItDown CLI 是用户与 MarkItDown 交互的主要方式之一。通过命令行，用户可以：

- 将本地文件转换为 Markdown 格式
- 从标准输入读取内容进行转换
- 支持多种文件格式的自动识别
- 集成 Azure Document Intelligence 进行高级文档处理
- 集成 Azure Content Understanding 进行多模态内容提取
- 启用第三方插件扩展功能
- 配置 LLM 客户端用于图像描述和 OCR 功能

```mermaid
graph TD
    A[用户执行 markitdown 命令] --> B{是否指定文件名?}
    B -->|是| C[读取本地文件]
    B -->|否| D[从 stdin 读取]
    C --> E[识别文件类型]
    D --> E
    E --> F[选择合适的转换器]
    F --> G[执行转换]
    G --> H[输出 Markdown]
```

## 基本语法

MarkItDown CLI 的基本语法如下：

```bash
markitdown [选项] [文件名]
```

如果未指定文件名，MarkItDown 将从标准输入（stdin）读取内容。转换结果默认输出到标准输出（stdout）。

资料来源：[packages/markitdown/src/markitdown/__main__.py:27-40]()

### 输入方式

MarkItDown 支持多种输入方式：

| 输入方式 | 命令示例 | 说明 |
|---------|---------|------|
| 文件参数 | `markitdown document.pdf` | 直接指定要转换的文件 |
| 标准输入 | `cat document.pdf \| markitdown` | 通过管道传输文件内容 |
| 重定向输入 | `markitdown < document.pdf` | 使用输入重定向 |
| 保存输出 | `markitdown document.pdf -o output.md` | 指定输出文件 |
| 重定向输出 | `markitdown document.pdf > output.md` | 使用输出重定向 |

## 命令行选项

### 全局选项

| 选项 | 简写 | 说明 | 示例 |
|------|------|------|------|
| `--version` | `-v` | 显示版本号并退出 | `markitdown --version` |
| `--help` | `-h` | 显示帮助信息 | `markitdown --help` |
| `--output` | `-o` | 指定输出文件名 | `markitdown file.pdf -o result.md` |
| `--extension` | `-x` | 提供文件扩展名提示（从 stdin 读取时） | `cat file \| markitdown -x pdf` |
| `--hint-mime-type` | `-m` | 提供 MIME 类型提示 | `markitdown -m application/pdf` |
| `--hint-charset` | 无 | 指定字符编码 | `markitdown -c utf-8` |

资料来源：[packages/markitdown/src/markitdown/__main__.py:51-80]()

### 插件相关选项

| 选项 | 说明 | 示例 |
|------|------|------|
| `--list-plugins` | 列出所有已安装的第三方插件 | `markitdown --list-plugins` |
| `--use-plugins` | 启用第三方插件 | `markitdown file.pdf --use-plugins` |

#### 列出已安装的插件

```bash
markitdown --list-plugins
```

输出示例：

```
Installed MarkItDown 3rd-party Plugins:

  * markitdown-ocr      (package: markitdown_ocr)

Use the -p (or --use-plugins) option to enable 3rd-party plugins.
```

如果没有安装任何插件，将显示：

```
Installed MarkItDown 3rd-party Plugins:

  * No 3rd-party plugins installed.

Find plugins by searching for the hashtag #markitdown-plugin on GitHub.
```

资料来源：[packages/markitdown/src/markitdown/__main__.py:115-135]()

### Azure Document Intelligence 选项

| 选项 | 说明 | 必需 |
|------|------|------|
| `--use-docintel` | 启用 Azure Document Intelligence | 是 |
| `--endpoint` | Document Intelligence 服务端点 | 是（使用 `--use-docintel` 时） |

#### 使用 Document Intelligence 的完整示例

```bash
markitdown path-to-file.pdf -d -e "https://your-resource.cognitiveservices.azure.com/"
```

注意：使用 Document Intelligence 时必须同时指定文件名。

资料来源：[packages/markitdown/src/markitdown/__main__.py:137-155]()

### Azure Content Understanding 选项

| 选项 | 说明 | 必需 |
|------|------|------|
| `--use-cu` | 启用 Azure Content Understanding | 是 |
| `--cu-endpoint` | Content Understanding 服务端点 | 是（使用 `--use-cu` 时） |
| `--cu-file-types` | 指定使用 Content Understanding 的文件类型 | 否 |

#### 使用 Content Understanding 的完整示例

```bash
markitdown path-to-file.pdf --use-cu --cu-endpoint "https://your-resource.cognitiveservices.azure.com/"
```

可以通过环境变量 `MARKITDOWN_CU_ENDPOINT` 和 `MARKITDOWN_CU_FILE_TYPES` 设置默认值。

资料来源：[packages/markitdown/src/markitdown/converters/_cu_converter.py:1-30]()

### LLM 相关选项

这些选项主要用于配置图像描述和 OCR 功能（通过 markitdown-ocr 插件）。

| 选项 | 说明 |
|------|------|
| `--llm-client` | 指定 LLM 客户端类型（如 `openai`） |
| `--llm-model` | 指定要使用的 LLM 模型（如 `gpt-4o`） |
| `--llm-prompt` | 自定义 LLM 提示词（可选） |

#### 使用 LLM 进行图像描述和 OCR

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

⚠️ **注意**：根据社区反馈（Issue #1897），`--llm-client`、`--llm-model` 和 `--llm-prompt` 选项在当前版本中可能未被正确注册到 argparse。如果遇到"unrecognized arguments"错误，请确保使用正确版本的 MarkItDown，或考虑通过 Python API 使用这些功能。

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 常见使用场景

### 基本文件转换

```bash
# PDF 转换
markitdown document.pdf > output.md

# Word 文档转换
markitdown document.docx -o output.md

# PowerPoint 转换
markitdown presentation.pptx -o output.md

# Excel 转换
markitdown spreadsheet.xlsx -o output.md
```

### 使用管道输入

当需要从其他命令获取输入时：

```bash
# 从 curl 获取网页内容
curl -s https://example.com/page.html | markitdown -x html > output.md

# 从 gzip 解压的 PDF
gzip -dc document.pdf.gz | markitdown -x pdf > output.md
```

### 结合 Azure 服务使用

```bash
# 使用 Document Intelligence 进行 PDF 转换
markitdown document.pdf -d -e "https://your-resource.cognitiveservices.azure.com/" -o output.md

# 使用 Content Understanding 进行多模态转换
markitdown audio.mp3 --use-cu --cu-endpoint "https://your-resource.cognitiveservices.azure.com/" -o output.md
```

### 使用插件

```bash
# 列出可用插件
markitdown --list-plugins

# 启用插件进行 OCR
markitdown scanned.pdf --use-plugins --llm-client openai --llm-model gpt-4o -o output.md
```

## Docker 使用

MarkItDown 也可以通过 Docker 运行，这对于在没有 Python 环境的系统上使用很有帮助。

### 构建 Docker 镜像

```sh
docker build -t markitdown:latest .
```

### 运行 Docker 容器

```sh
# 基本用法
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

# 指定文件
docker run --rm -v /path/to/files:/data markitdown:latest /data/document.pdf > output.md
```

注意：Docker 镜像可能不包含所有可选依赖（如 ffmpeg），在使用音频转换功能时可能需要扩展基础镜像。

## 退出码

MarkItDown CLI 使用以下退出码：

| 退出码 | 含义 |
|--------|------|
| 0 | 成功转换 |
| 1 | 一般错误 |
| 2 | 参数错误或缺少必需参数 |

## 错误处理

### 缺少必需参数

如果使用需要端点的功能（如 Document Intelligence）但未提供端点：

```
Error: Document Intelligence Endpoint is required when using Document Intelligence.
```

如果未提供文件名：

```
Error: Filename is required when using Document Intelligence.
```

### 不支持的格式

当尝试转换不支持的文件格式时，MarkItDown 会输出友好的错误信息：

```
Error: Unsupported file format: .xyz
```

### 文件不存在

```bash
markitdown nonexistent.pdf
# Error: [Errno 2] No such file or directory: 'nonexistent.pdf'
```

## 环境变量

MarkItDown CLI 支持以下环境变量：

| 环境变量 | 说明 | 对应 CLI 选项 |
|---------|------|---------------|
| `MARKITDOWN_ENABLE_PLUGINS` | 是否启用插件 | `--use-plugins` |
| `MARKITDOWN_CU_ENDPOINT` | Content Understanding 端点 | `--cu-endpoint` |
| `MARKITDOWN_CU_FILE_TYPES` | Content Understanding 文件类型 | `--cu-file-types` |

```bash
# 通过环境变量启用插件
export MARKITDOWN_ENABLE_PLUGINS=1
markitdown document.pdf -o output.md
```

资料来源：[v0.1.3 Release Notes](https://github.com/microsoft/markitdown/releases/tag/v0.1.3)

## 与 Python API 的对比

| 特性 | CLI | Python API |
|------|-----|------------|
| 快速转换 | ✅ | ✅ |
| 脚本集成 | ✅ | ✅ |
| 精细控制 | 有限 | 完整 |
| 自定义转换器 | ❌ | ✅ |
| 复杂配置 | 有限 | ✅ |

对于需要精细控制或自定义转换逻辑的场景，建议使用 Python API。

## 故障排除

### RuntimeWarning: Couldn't find ffmpeg or avconv

在 Linux 系统上运行 MarkItDown 时，如果看到以下警告：

```
RuntimeWarning: Couldn't find ffmpeg or avconv
```

这是因为 `pydub` 依赖需要 ffmpeg 来处理音频文件。解决方案：

1. 安装 ffmpeg：
   ```bash
   # Debian/Ubuntu
   sudo apt-get install ffmpeg
   
   # macOS
   brew install ffmpeg
   ```

2. 或者忽略警告继续使用（音频功能将不可用）

资料来源：[Issue #1685](https://github.com/microsoft/markitdown/issues/1685)

### unrecognized arguments 错误

如果在运行以下命令时遇到"unrecognized arguments"错误：

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

这可能是因为某些 LLM 相关参数在 argparse 中未正确注册。作为临时解决方案，可以使用 Python API：

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document.pdf")
print(result.text_content)
```

资料来源：[Issue #1897](https://github.com/microsoft/markitdown/issues/1897)

### 非 ASCII 文件名问题

在处理包含非 ASCII 字符（如法语音符）的文件名时，可能会遇到 `UnicodeDecodeError`。这是因为 `IpynbConverter.accepts()` 方法在检测文件类型时尝试解码文件流。

资料来源：[Issue #1894](https://github.com/microsoft/markitdown/issues/1894)

## 参见

- [Python API 使用指南](./Python-API.md)
- [插件开发指南](./Plugins.md)
- [支持的格式](./Supported-Formats.md)
- [Azure 集成](./Azure-Integration.md)
- [安全注意事项](./Security.md)

---

<a id='page-python-api'></a>

## Python API

### 相关页面

相关主题：[安装与配置](#page-installation), [命令行使用](#page-cli-usage), [Azure 服务集成](#page-azure-integration)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
- [packages/markitdown/src/markitdown/_stream_info.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_stream_info.py)
- [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)
- [packages/markitdown/src/markitdown/_uri_utils.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_uri_utils.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)
- [packages/markitdown/src/markitdown/converters/_pdf_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pdf_converter.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
</details>

# Python API

MarkItDown 提供了一套完整的 Python API，允许开发者在自己的 Python 应用中集成文档到 Markdown 的转换功能。本页详细说明了 MarkItDown 的核心类、转换方法、参数配置以及常见的使用模式。

## 概述

MarkItDown 的 Python API 以 `MarkItDown` 类为核心，提供了多种转换入口方法，支持从文件路径、URL、文件流等不同来源读取文档，并将其转换为 Markdown 格式。API 设计遵循"按需使用"原则，开发者可以根据实际场景选择最合适的转换方法。

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:1-100]()

## 核心类

### MarkItDown 类

`MarkItDown` 是整个 API 的主入口类，负责管理转换器注册、处理文档转换请求、以及协调各种插件和服务。

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.markdown)
```

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:40-80]()

#### 构造函数参数

| 参数 | 类型 | 默认值 | 说明 |
|------|------|--------|------|
| `enable_plugins` | `bool` | `False` | 是否启用第三方插件 |
| `docintel_endpoint` | `str` | `None` | Azure Document Intelligence 端点 |
| `docintel_api_key` | `str` | `None` | Azure Document Intelligence API 密钥 |
| `cu_endpoint` | `str` | `None` | Azure Content Understanding 端点 |
| `cu_analyzer` | `str` | `None` | Content Understanding 分析器 ID |
| `cu_file_types` | `list` | `None` | 指定使用 Content Understanding 的文件类型 |
| `llm_client` | `Any` | `None` | LLM 客户端实例（如 OpenAI） |
| `llm_model` | `str` | `None` | LLM 模型名称 |
| `llm_prompt` | `str` | `None` | 自定义 LLM 提示词 |
| `keep_data_uris` | `bool` | `False` | 是否保留输出中的 data URI |

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:40-80]()

### DocumentConverterResult 类

转换结果的容器类，封装了 Markdown 文本和可选的文档元数据。

```python
@dataclass
class DocumentConverterResult:
    markdown: str                          # 转换后的 Markdown 文本
    title: Optional[str] = None           # 文档标题
```

**属性说明：**

| 属性 | 类型 | 说明 |
|------|------|------|
| `markdown` | `str` | 转换后的 Markdown 内容 |
| `title` | `str \| None` | 提取的文档标题 |
| `text_content` | `str` | `markdown` 的别名（已弃用，建议使用 `markdown`） |

资料来源：[packages/markitdown/src/markitdown/_base_converter.py:10-45]()

### StreamInfo 类

`StreamInfo` 用于传递文件流的元信息，帮助转换器识别文件类型和编码。

```python
@dataclass
class StreamInfo:
    url: Optional[str] = None             # 来源 URL
    extension: Optional[str] = None       # 文件扩展名（如 ".pdf"）
    mimetype: Optional[str] = None         # MIME 类型
    charset: Optional[str] = None          # 字符编码（如 "utf-8"）
```

资料来源：[packages/markitdown/src/markitdown/_stream_info.py:1-30]()

## 转换方法

MarkItDown 提供了多个转换方法，从不同来源读取文档：

```mermaid
graph TD
    A[MarkItDown 转换入口] --> B[convert]
    A --> C[convert_local]
    A --> D[convert_uri]
    A --> E[convert_stream]
    
    B --> F[路径/URL/文件对象]
    C --> G[本地文件路径]
    D --> H[URI 字符串]
    E --> I[文件流 + StreamInfo]
    
    F --> J[自动路由]
    G --> J
    H --> J
    I --> J
    
    J --> K[选择合适的 Converter]
    K --> L[执行转换]
    L --> M[返回 DocumentConverterResult]
```

### convert() 方法

通用转换方法，自动根据输入类型选择最佳处理方式。

```python
from markitdown import MarkItDown

md = MarkItDown()

# 从本地文件转换
result = md.convert("document.pdf")

# 从 URL 转换
result = md.convert("https://example.com/document.docx")

# 从文件对象转换
with open("document.pdf", "rb") as f:
    result = md.convert(f)
```

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:100-200]()

### convert_local() 方法

仅用于本地文件系统路径的转换方法。在不受信任的环境中使用此方法更安全，因为它限制了可以访问的资源范围。

```python
md = MarkItDown()
result = md.convert_local("/path/to/document.pdf")
```

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:200-250]()

### convert_uri() 方法

用于处理 URI 字符串的转换方法，支持 file URI 和 data URI。

```python
md = MarkItDown()

# 文件 URI
result = md.convert_uri("file:///path/to/document.pdf")

# Data URI（Base64 编码）
result = md.convert_uri("data:text/plain;base64,SGVsbG8gV29ybGQ=")

# 远程 URL（v0.1.1+）
result = md.convert_uri("https://example.com/document.pdf")
```

> **注意**：`convert_url()` 在 v0.1.1 中被重命名为 `convert_uri`，但仍保留为别名以保持向后兼容。

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:250-300]()
资料来源：[packages/markitdown/src/markitdown/_uri_utils.py:1-50]()

### convert_stream() 方法

最低级别的转换方法，直接接收文件流和流元信息。

```python
from markitdown import MarkItDown
from markitdown._stream_info import StreamInfo

md = MarkItDown()

with open("document.pdf", "rb") as f:
    stream_info = StreamInfo(
        extension=".pdf",
        mimetype="application/pdf"
    )
    result = md.convert_stream(f, stream_info)
```

此方法最适合需要自定义流元信息的场景，或在处理内存中的二进制数据时使用。

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:300-350]()

## 插件系统

MarkItDown 支持通过插件扩展功能，插件通过 `entry_points` 机制注册。

### 启用插件

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")  # 插件将被自动调用
```

### 列出已安装插件

```bash
markitdown --list-plugins
```

### 插件开发

开发者可以创建自定义转换器插件：

```python
# packages/markitdown-sample-plugin/README.md
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    def __init__(self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT):
        super().__init__(priority=priority)

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
        # 检查文件是否为 RTF 格式
        return stream_info.extension == ".rtf"

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> DocumentConverterResult:
        # 实现转换逻辑
        return DocumentConverterResult(markdown="# Converted RTF Content")
```

插件需要在 `pyproject.toml` 中声明入口点：

```toml
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
```

资料来源：[packages/markitdown-sample-plugin/README.md:1-100]()

## LLM 集成

MarkItDown 支持使用 LLM 进行图像描述和 OCR（光学字符识别）。

### 基本用法

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

# 转换图像文件
result = md.convert("image.png")
print(result.markdown)
```

### 自定义提示词

```python
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure."
)
```

### 兼容的 LLM 客户端

支持任何遵循 OpenAI API 的客户端：

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="your-api-key",
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_version="2024-02-01"
)

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)
```

资料来源：[packages/markitdown/src/markitdown/_markitdown.py:40-80]()

## Azure 集成

### Azure Document Intelligence

使用 Azure Document Intelligence 进行高质量文档转换：

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://your-resource.cognitiveservices.azure.com/",
    docintel_api_key="your-api-key"
)

result = md.convert("document.pdf")
print(result.markdown)
```

**支持的格式：**

| 格式 | OCR 支持 |
|------|----------|
| PDF | ✓ |
| DOCX | ✗ |
| PPTX | ✗ |
| XLSX | ✗ |
| HTML | ✗ |
| JPEG | ✓ |
| PNG | ✓ |
| BMP | ✓ |
| TIFF | ✓ |

资料来源：[packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:1-80]()

### Azure Content Understanding

提供更高级的多模态提取功能：

```python
from markitdown import MarkItDown
from markitdown.converters._doc_intel_converter import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="https://your-resource.cognitiveservices.azure.com/",
    cu_file_types=[ContentUnderstandingFileType.PDF]  # 仅对 PDF 使用 CU
)

result = md.convert("document.pdf")
print(result.markdown)
```

资料来源：[packages/markitdown/src/markitdown/converters/_cu_converter.py:1-100]()

## 转换器架构

MarkItDown 使用转换器链模式处理不同文件格式：

```mermaid
graph LR
    A[输入文件] --> B[StreamInfo 提取]
    B --> C{文件类型检测}
    
    C -->|PDF| D[PdfConverter]
    C -->|DOCX| E[DocxConverter]
    C -->|PPTX| F[PptxConverter]
    C -->|XLSX| G[XlsxConverter]
    C -->|HTML| H[WikipediaConverter]
    C -->|RSS| I[RSSConverter]
    C -->|其他| Z[未知格式处理]
    
    D --> J[DocumentConverterResult]
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J
    Z --> J
```

### 内置转换器

| 转换器 | 支持格式 | 说明 |
|--------|----------|------|
| `PdfConverter` | PDF | 使用 pdfplumber 和 pdfminer |
| `DocxConverter` | DOCX | 使用 mammoth 库 |
| `PptxConverter` | PPTX | 使用 python-pptx 库 |
| `XlsxConverter` | XLSX | 使用 openpyxl 库 |
| `WikipediaConverter` | HTML | 专门处理 Wikipedia 页面 |
| `RSSConverter` | XML | 解析 RSS/Atom 订阅源 |
| `IpynbConverter` | IPYNB | Jupyter Notebook 转换 |

资料来源：[packages/markitdown/src/markitdown/converters/_pdf_converter.py:1-50]()

## 常见问题与限制

### 已知的社区问题

#### 非 ASCII 文件的 UnicodeDecodeError

`IpynbConverter.accepts()` 方法在处理包含非 ASCII 字符的文件时可能抛出 `UnicodeDecodeError`，导致整个转换管道崩溃。

资料来源：[https://github.com/microsoft/markitdown/issues/1894]()

**临时解决方案：**

```python
try:
    result = md.convert("notebook.ipynb")
except UnicodeDecodeError:
    # 使用替代方法处理
    pass
```

#### Office Open XML 无效文件处理

当转换无效的 DOCX、XLSX 或 PPTX 文件时，MarkItDown 返回成功结果，但 `text_content` 中包含错误消息 `"This is not a valid Office Open XML file."`，而不是抛出异常。

资料来源：[https://github.com/microsoft/markitdown/issues/1408]()

**检测方法：**

```python
result = md.convert("invalid_file.docx")
if "not a valid Office Open XML" in result.markdown:
    print("Warning: Invalid file format")
```

#### Linux 上的 pydub 警告

在 Linux 系统上运行时，如果系统中没有安装 ffmpeg 或 avconv，`pydub` 会触发 `RuntimeWarning`，可能导致音频转换功能失败。

资料来源：[https://github.com/microsoft/markitdown/issues/1685]()

### PDF 转换限制

PDF 转换功能在处理以下内容时存在已知限制：

- 复杂表格结构可能无法正确提取（参见 issue #293）
- PDF 不是真正的 Markdown 格式，转换结果为纯文本（参见 issue #296）
- PDF 不支持 `.doc` 格式，只支持 `.docx`（参见 issue #23）

资料来源：[https://github.com/microsoft/markitdown/issues/293]()
资料来源：[https://github.com/microsoft/markitdown/issues/296]()

## 安全考虑

> **重要提示**：MarkItDown 使用当前进程的权限执行 I/O 操作。像 `open()` 或 `requests.get()` 一样，它会访问进程本身可以访问的资源。在不受信任的环境中使用时，请务必对输入进行清理，并调用最窄的 `convert_*` 方法（如 `convert_stream()` 或 `convert_local()`）。

资料来源：[packages/markitdown/README.md:1-30]()

## 完整示例

### 基本文档转换

```python
from markitdown import MarkItDown

md = MarkItDown()

# 转换多种格式
files = ["document.pdf", "report.docx", "slides.pptx", "data.xlsx"]

for file_path in files:
    try:
        result = md.convert(file_path)
        print(f"=== {file_path} ===")
        print(result.markdown[:500])  # 打印前 500 字符
        print()
    except Exception as e:
        print(f"Error converting {file_path}: {e}")
```

### 使用 LLM 进行图像描述

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail."
)

result = md.convert("screenshot.png")
print(result.markdown)
```

### 批量处理与插件

```python
from markitdown import MarkItDown
import glob

# 启用插件支持
md = MarkItDown(enable_plugins=True)

# 查找所有文档
documents = glob.glob("documents/*")

for doc_path in documents:
    result = md.convert(doc_path)
    
    # 保存结果
    output_path = doc_path.rsplit(".", 1)[0] + ".md"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(result.markdown)
    
    print(f"Converted: {doc_path} -> {output_path}")
```

### 使用 Azure Document Intelligence

```python
import os
from markitdown import MarkItDown

# 从环境变量读取配置
endpoint = os.environ.get("DOCINTEL_ENDPOINT")
api_key = os.environ.get("DOCINTEL_API_KEY")

md = MarkItDown(
    docintel_endpoint=endpoint,
    docintel_api_key=api_key
)

result = md.convert("scanned_document.pdf")
print(result.markdown)
```

## 另请参阅

- [命令行界面文档](./CLI-Usage) - CLI 使用方法和参数说明
- [插件开发指南](./Plugin-Development) - 如何开发自定义转换器插件
- [Azure 集成指南](./Azure-Integration) - Azure Document Intelligence 和 Content Understanding 配置
- [支持的格式列表](./Supported-Formats) - 完整支持的文件格式说明

---

<a id='page-plugin-system'></a>

## 插件系统

### 相关页面

相关主题：[系统架构](#page-architecture), [OCR 插件](#page-ocr-plugin)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)
- [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
</details>

# 插件系统

## 概述

MarkItDown 的插件系统是自 v0.1.0 版本引入的重要架构特性，旨在允许第三方开发者扩展 MarkItDown 的文档转换能力。该系统通过标准化的入口点（Entry Points）机制实现插件的自动发现和注册，使开发者无需修改 MarkItDown 核心代码即可添加新的文件格式支持或替换内置转换器。

插件系统的核心价值在于：

- **可扩展性**：支持添加任意新的文件格式转换器
- **模块化**：将功能代码与核心库分离
- **灵活性**：插件可以选择替换或增强内置转换器
- **解耦性**：插件通过标准接口与 MarkItDown 核心通信

资料来源：[v0.1.0 Release Notes](https://github.com/microsoft/markitdown/releases/tag/v0.1.0)

## 架构设计

### 整体架构图

```mermaid
graph TD
    A[用户代码 / CLI] --> B[MarkItDown 核心]
    B --> C[转换器注册表]
    C --> D[内置转换器集合]
    C --> E[第三方插件转换器]
    
    F[插件发现机制] --> E
    F --> |entry_points| G[已安装插件包]
    
    H[LLM 客户端] --> E
    I[其他配置参数] --> B
    
    style F fill:#f9f,stroke:#333
    style E fill:#ff9,stroke:#333
```

### 核心组件

| 组件 | 职责 | 源码位置 |
|------|------|----------|
| `MarkItDown` | 核心类，管理转换器注册和转换流程 | [`_markitdown.py:30-80`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py#L30-L80) |
| `DocumentConverter` | 转换器基类，定义插件接口 | [`_base_converter.py`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py) |
| `StreamInfo` | 文件流信息封装，包含扩展名、MIME类型等 | [`_stream_info.py`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_stream_info.py) |
| 插件入口点 | 通过 `markitdown.plugin` 组实现自动发现 | [`pyproject.toml`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/pyproject.toml) |

### 转换器优先级机制

MarkItDown 使用优先级机制来决定使用哪个转换器处理文件。内置转换器遵循以下优先级规则：

```python
class DocumentConverter:
    # 优先级常量
    PRIORITY_MAXIMUM = float("inf")      # 最高优先级
    PRIORITY_DEFAULT = 0.0               # 默认优先级
    PRIORITY_SPECIFIC_FILE_FORMAT = -1.0 # 特定文件格式
    PRIORITY_AMBIGUOUS_FILE_FORMAT = -2.0 # 模糊文件格式
    PRIORITY_FALLBACK = -10.0            # 回退转换器
```

插件转换器可以使用自定义优先级。当多个转换器都接受同一文件时，优先级最高的转换器将被选中。

资料来源：[`_base_converter.py`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

## 插件接口规范

### DocumentConverter 基类

所有插件必须继承 `DocumentConverter` 基类并实现以下核心方法：

```python
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class YourConverter(DocumentConverter):
    def __init__(self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT):
        super().__init__(priority=priority)

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        """
        判断此转换器是否接受给定文件流
        返回 True 表示可以处理此文件
        """
        # 检查文件格式
        extension = stream_info.extension.lower()
        mimetype = stream_info.mimetype or ""
        
        if extension == ".yourext":
            return True
        return False

    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        """
        执行文件到 Markdown 的转换
        """
        # 实现转换逻辑
        markdown_content = "..."
        return DocumentConverterResult(
            text_content=markdown_content,
            title=None,
            authors=[],
            images=[]
        )
```

资料来源：[`packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)

### 插件注册函数

每个插件包必须导出 `register_converters` 函数，MarkItDown 在初始化时会调用此函数注册转换器：

```python
from markitdown import MarkItDown

def register_converters(markitdown: MarkItDown, **kwargs):
    """
    在 MarkItDown 实例创建时调用，用于注册插件提供的转换器
    
    Args:
        markitdown: MarkItDown 实例
        **kwargs: 传递给 MarkItDown 的参数（llm_client, llm_model 等）
    """
    markitdown.register_converter(YourConverter())
```

### 插件接口版本

插件必须声明其使用的接口版本：

```python
# 当前支持的版本为 1
__plugin_interface_version__ = 1
```

资料来源：[`packages/markitdown-sample-plugin/README.md`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## 入口点配置

插件包需要在 `pyproject.toml` 中配置入口点，以便 MarkItDown 能够自动发现：

```toml
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
```

其中：
- `"markitdown.plugin"` 是固定的入口点组名
- `sample_plugin` 是插件标识符，可自定义
- `markitdown_sample_plugin` 是包含插件代码的包名

资料来源：[`packages/markitdown-sample-plugin/pyproject.toml`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/pyproject.toml)

## 使用方式

### CLI 使用

```bash
# 列出已安装的插件
markitdown --list-plugins

# 启用插件进行转换
markitdown --use-plugins document.pdf
```

### Python API 使用

```python
from markitdown import MarkItDown

# 启用插件支持
md = MarkItDown(enable_plugins=True)

# 转换文件
result = md.convert("document.pdf")
print(result.text_content)
```

### 传递 LLM 客户端参数

插件可以接收通过 `MarkItDown` 构造函数传递的 LLM 相关参数：

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="自定义提示词"  # 可选
)

result = md.convert("document.pdf")
```

资料来源：[`_markitdown.py`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

## 官方插件示例

### markitdown-ocr 插件

`markitdown-ocr` 是官方提供的 OCR 插件，展示了如何利用 LLM Vision 能力增强内置转换器：

| 功能 | 说明 |
|------|------|
| PDF OCR | 提取 PDF 中嵌入图片的文字 |
| DOCX OCR | 提取 Word 文档中图片的文字 |
| PPTX OCR | 提取 PowerPoint 中图片的文字 |
| XLSX OCR | 提取 Excel 中嵌入图片的文字 |

资料来源：[`packages/markitdown-ocr/README.md`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 插件开发指南

### 开发流程

```mermaid
graph LR
    A[创建插件包结构] --> B[实现 DocumentConverter]
    B --> C[配置入口点]
    C --> D[安装插件]
    D --> E[测试插件]
    E --> F{发布分享?}
    F -->|是| G[发布到 PyPI]
    F -->|否| H[内部使用]
```

### 完整示例

以下是一个完整的 RTF 文件转换器插件示例：

```python
# markitdown_sample_plugin/_plugin.py
from typing import BinaryIO, Any
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

__plugin_interface_version__ = 1

class RtfConverter(DocumentConverter):
    def __init__(self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT):
        super().__init__(priority=priority)

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        return stream_info.extension == ".rtf"

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        content = file_stream.read().decode("utf-8", errors="ignore")
        # 实际实现中需要 RTF 解析逻辑
        return DocumentConverterResult(text_content=f"# RTF Document\n\n{content}")

def register_converters(markitdown: MarkItDown, **kwargs):
    markitdown.register_converter(RtfConverter())
```

资料来源：[`packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)

### 安装本地插件

```bash
# 进入插件目录
cd packages/markitdown-sample-plugin

# 以开发模式安装
pip install -e .

# 验证插件是否已注册
markitdown --list-plugins
```

## 常见问题

### CLI 参数未识别错误

**问题描述**：使用 `--llm-client` 和 `--llm-model` 参数时报 "Unrecognized Arguments" 错误。

**原因**：这些参数是插件参数，不在 MarkItDown CLI 的内置参数列表中。参数由插件的 `register_converters` 函数接收。

**解决方案**：目前 CLI 层面对插件参数的支持有限，建议通过 Python API 使用这些参数：

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o"
)
```

资料来源：[Issue #1897](https://github.com/microsoft/markitdown/issues/1897)

### 插件未被发现

**排查步骤**：

1. 确认入口点配置正确
2. 检查包是否已正确安装：`pip show your-plugin`
3. 验证插件代码无语法错误
4. 确认 `enable_plugins=True` 已设置

## 安全考虑

> [!IMPORTANT]
> MarkItDown 以当前进程的权限执行 I/O 操作。在不受信任的环境中使用时，请对输入进行清理，并使用最窄的 `convert_*` 函数（如 `convert_stream()` 或 `convert_local()`）。

插件系统额外引入以下安全考量：

| 风险 | 缓解措施 |
|------|----------|
| 恶意插件代码 | 仅安装来自可信来源的插件 |
| 文件系统访问 | 插件可能在转换过程中读写文件 |
| 依赖冲突 | 定期更新插件依赖以修复安全漏洞 |

## 配置参数参考

### MarkItDown 构造函数参数

| 参数 | 类型 | 默认值 | 说明 |
|------|------|--------|------|
| `enable_plugins` | `bool` | `False` | 是否启用第三方插件 |
| `llm_client` | `Any` | `None` | LLM 客户端实例 |
| `llm_model` | `str` | `None` | LLM 模型名称 |
| `llm_prompt` | `str` | `None` | 自定义 LLM 提示词 |
| `docintel_endpoint` | `str` | `None` | Azure Document Intelligence 端点 |
| `cu_endpoint` | `str` | `None` | Azure Content Understanding 端点 |

### CLI 参数

| 参数 | 说明 |
|------|------|
| `-p`, `--use-plugins` | 启用第三方插件 |
| `--list-plugins` | 列出已安装的插件 |
| `-o`, `--output` | 输出文件路径 |
| `-x`, `--extension` | 文件扩展名提示（从 stdin 读取时） |

资料来源：[`__main__.py`](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

## See Also

- [项目主文档](../README)
- [快速开始指南](./Quick-Start)
- [支持的文档格式](./Supported-Formats)
- [Azure Content Understanding 集成](./Azure-Content-Understanding)
- [markitdown-ocr 插件文档](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-ocr)
- [示例插件源码](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin)

---

<a id='page-ocr-plugin'></a>

## OCR 插件

### 相关页面

相关主题：[插件系统](#page-plugin-system), [Python API](#page-python-api)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
</details>

# OCR 插件

## 概述

markitdown-ocr 是 MarkItDown 的官方插件包，通过 LLM Vision 技术为 PDF、DOCX、PPTX 和 XLSX 文件中的嵌入式图像提供 OCR（光学字符识别）支持。该插件利用 MarkItDown 已有的 `llm_client` 和 `llm_model` 参数模式，无需引入新的机器学习库或二进制依赖项，即可实现图像文字提取功能。

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 功能特性

| 功能 | 说明 |
|------|------|
| **增强型 PDF 转换器** | 从 PDF 内嵌图像中提取文字，支持扫描件 PDF 的全页 OCR 回退 |
| **增强型 DOCX 转换器** | 对 Word 文档中的图像执行 OCR |
| **增强型 PPTX 转换器** | 对 PowerPoint 演示文稿中的图像执行 OCR |
| **增强型 XLSX 转换器** | 对 Excel 电子表格中的图像执行 OCR |
| **上下文保持** | 在插入提取文字时维持文档结构和流程 |
| **智能回退** | 当 LLM 调用失败时，转换继续进行而不中断 |
| **无额外依赖** | 使用 MarkItDown 原生的 LLM 客户端，无需额外 ML 库 |

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 系统架构

### 组件关系图

```mermaid
graph TD
    A[MarkItDown 实例] -->|enable_plugins=True| B[插件发现机制]
    B -->|entry_points| C[markitdown-ocr 插件]
    C -->|register_converters| D[OCR 增强型转换器注册]
    
    D --> E[PdfConverterWithOCR]
    D --> F[DocxConverterWithOCR]
    D --> G[PptxConverterWithOCR]
    D --> H[XlsxConverterWithOCR]
    
    E --> I[LLMVisionOCRService]
    F --> I
    G --> I
    H --> I
    
    I -->|llm_client| J[OpenAI 兼容客户端]
    I -->|llm_model| K[GPT-4o 等模型]
    
    L[文档文件] -->|转换请求| E
    L -->|转换请求| F
    L -->|转换请求| G
    L -->|转换请求| H
    
    E -->|优先级 -1.0| M[内置转换器 优先级 0.0]
    F --> M
    G --> M
    H --> M
```

### 工作流程

```mermaid
sequenceDiagram
    participant User as 用户
    participant MD as MarkItDown
    participant Plugin as OCR 插件
    participant OCR as LLMVisionOCRService
    participant LLM as LLM API
    
    User->>MD: MarkItDown(enable_plugins=True,<br/>llm_client=..., llm_model=...)
    MD->>Plugin: 发现并加载插件
    Plugin->>Plugin: register_converters()
    Note over Plugin: 创建 OCR 增强型转换器<br/>优先级设为 -1.0
    
    User->>MD: convert(document.pdf)
    MD->>Plugin: 选择优先级最高的转换器
    Plugin->>OCR: 处理文档
    OCR->>OCR: 提取嵌入式图像
    
    alt 有嵌入式图像
        OCR->>LLM: 发送图像 + 提取提示
        LLM-->>OCR: 返回提取的文本
        OCR-->>Plugin: 文本内容
    else 扫描件 PDF
        OCR->>OCR: 300 DPI 渲染页面
        OCR->>LLM: 发送全页图像
        LLM-->>OCR: 返回提取的文本
    end
    
    Plugin-->>MD: DocumentConverterResult
    MD-->>User: Markdown 输出
```

## 安装指南

### 前置条件

- Python 3.10 或更高版本
- MarkItDown 已安装：`pip install markitdown`
- OpenAI 兼容的 LLM 客户端

### 安装步骤

```bash
# 安装 markitdown-ocr 插件
pip install markitdown-ocr

# 安装 OpenAI 客户端（或其他兼容客户端）
pip install openai
```

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 使用方法

### 命令行接口

使用插件时，需要同时启用插件并配置 LLM 参数：

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

> [!IMPORTANT]
> CLI 参数 `--llm-client` 和 `--llm-model` 需要明确指定，否则插件会静默跳过 OCR 流程，回退到标准内置转换器。

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

### Python API

#### 基础用法

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

#### 自定义提示词

对于专业文档，可以覆盖默认的提取提示词：

```python
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)

result = md.convert("document_with_tables.pdf")
```

#### Azure OpenAI 兼容客户端

插件支持任何遵循 OpenAI API 规范的客户端：

```python
from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="your-api-key",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
```

## 支持的文件格式

### PDF

| 功能 | 说明 |
|------|------|
| **嵌入式图像 OCR** | 通过 `page.images` 提取图像，按垂直阅读顺序与周围文本交错插入 |
| **扫描件检测** | 自动检测无文本提取内容的页面，渲染为 300 DPI 全页图像 |
| **格式容错** | 处理格式损坏的 PDF（如截断 EOF），尝试使用 PyMuPDF 页面渲染恢复内容 |

### DOCX

| 功能 | 说明 |
|------|------|
| **图像提取** | 通过 `doc.part.rels` 提取图像 |
| **流程控制** | OCR 在 DOCX→HTML→Markdown 管道执行前运行 |
| **占位符注入** | 将占位符令牌注入 HTML，防止 markdown 转换器转义 OCR 标记 |

资料来源：[packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py)

### PPTX 和 XLSX

| 功能 | 说明 |
|------|------|
| **内嵌图像处理** | 提取演示文稿和电子表格中的图像 |
| **上下文保持** | 维持原有文档结构和流程 |

## 插件注册机制

markitdown-ocr 通过 Python entry point 机制注册为 MarkItDown 插件。包在 `pyproject.toml` 中定义入口点：

```toml
[project.entry-points."markitdown.plugin"]
markitdown_ocr = "markitdown_ocr"
```

当 `MarkItDown(enable_plugins=True, ...)` 被调用时：

1. MarkItDown 通过 `markitdown.plugin` entry point 组发现插件
2. 调用 `register_converters()`，转发所有 kwargs（包括 `llm_client` 和 `llm_model`）
3. 插件创建 `LLMVisionOCRService`
4. 四个 OCR 增强型转换器以 **优先级 -1.0** 注册——在优先级 0.0 的内置转换器之前

资料来源：[packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## 常见问题与故障排除

### 问题一：命令行参数未被识别

**问题描述**：运行 `markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o` 时报错"Unrecognized Arguments"。

**原因**：README 中的 CLI 示例参数可能与当前版本不匹配。请查阅 `markitdown --help` 获取支持的实际参数列表。

资料来源：[社区 Issue #1897](https://github.com/microsoft/markitdown/issues/1897)

### 问题二：RuntimeWarning 关于 ffmpeg

**问题描述**：在 Linux 系统上运行时出现 `RuntimeWarning: Couldn't find ffmpeg or avconv`。

**说明**：此警告来自 pydub 依赖项，与 markitdown-ocr 插件本身无关，但可能影响音频转文字功能。

资料来源：[社区 Issue #1685](https://github.com/microsoft/markitdown/issues/1685)

### 问题三：OCR 未执行

**检查清单**：

1. 确认已安装 markitdown-ocr：`pip show markitdown-ocr`
2. 确认启用了插件：`--use-plugins`
3. 确认提供了 `llm_client` 和 `llm_model`
4. 如未提供 `llm_client`，插件会静默跳过 OCR，回退到标准转换器

### 问题四：Office Open XML 文件转换异常

**问题描述**：转换无效的 DOCX、XLSX 或 PPTX 文件时返回成功，但 `text_content` 中包含 "This is not a valid Office Open XML file." 字符串。

**说明**：这是已知行为——MarkItDown 对无效文件返回成功结果而非抛出异常，难以与真正的成功结果区分。

资料来源：[社区 Issue #1408](https://github.com/microsoft/markitdown/issues/1408)

## 安全注意事项

> [!IMPORTANT]
> MarkItDown 以当前进程的权限执行 I/O 操作。与 `open()` 或 `requests.get()` 一样，它会访问进程本身可访问的资源。在不受信任的环境中，请对输入进行清理，并调用最窄的 `convert_*` 函数（如 `convert_stream()` 或 `convert_local()`）以满足使用需求。

## 配置参数参考

### MarkItDown 构造函数参数

| 参数 | 类型 | 默认值 | 说明 |
|------|------|--------|------|
| `enable_plugins` | `bool` | `False` | 是否启用插件加载 |
| `llm_client` | `Any` | `None` | OpenAI 兼容的 LLM 客户端实例 |
| `llm_model` | `str` | `None` | LLM 模型名称（如 "gpt-4o"） |
| `llm_prompt` | `str` | `None` | 自定义 OCR 提示词 |

### OCR 插件行为配置

| 场景 | llm_client 提供 | llm_client 未提供 |
|------|-----------------|-------------------|
| 文档转换 | 执行 OCR 处理图像 | 静默跳过 OCR，使用标准转换器 |
| API 错误 | 继续处理，跳过失败图像 | 回退到标准转换器 |

## 技术实现细节

### DOCX 转换器的占位符机制

DocxConverterWithOCR 在处理图像时使用占位符令牌，确保 mammoth 库不会破坏 OCR 标记：

```python
# 占位符注入到 HTML，防止 mammoth 转义
_PLACEHOLDER = "MARKITDOWNOCRBLOCK{}"
```

资料来源：[packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py)

### 优先级机制

| 转换器类型 | 优先级 | 说明 |
|-----------|--------|------|
| OCR 增强型转换器 | -1.0 | 优先于内置转换器被选中 |
| 内置转换器 | 0.0 | 默认转换器 |

MarkItDown 选择优先级最高（数值最大）的转换器处理文件，因此 OCR 插件的优先级设为 -1.0，确保先于内置转换器尝试处理。

## 扩展阅读

- [MarkItDown 主文档](https://github.com/microsoft/markitdown) — 核心功能和使用指南
- [插件开发指南](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin) — 开发自定义转换器插件
- [Azure Content Understanding 集成](./azure-content-understanding.md) — 云端高级文档理解服务
- [Azure Document Intelligence](./azure-document-intelligence.md) — Azure AI 文档智能服务

---

<a id='page-azure-integration'></a>

## Azure 服务集成

### 相关页面

相关主题：[Python API](#page-python-api), [命令行使用](#page-cli-usage)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
</details>

# Azure 服务集成

## 概述

MarkItDown 提供了与 Microsoft Azure 服务的深度集成，通过 Azure Document Intelligence 和 Azure Content Understanding 两个转换器实现高质量的文档提取。这两个服务主要面向需要云端处理能力的企业用户，特别是在处理复杂文档结构、扫描件、以及多模态内容（音频、视频）时提供更优的转换效果。

Azure 服务集成的核心价值在于：
- **云端布局分析**：利用 Azure 的机器学习模型进行文档布局识别
- **结构化字段提取**：从文档中提取特定领域的结构化数据
- **多模态支持**：处理音频、视频等非传统文档格式
- **OCR 增强**：对扫描 PDF 和图片中的文字进行识别

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## 架构设计

### 转换器层次结构

MarkItDown 采用插件化的转换器架构，Azure 服务作为特定的高优先级转换器实现。架构如下：

```mermaid
graph TD
    A[MarkItDown.convert] --> B{检测文件类型}
    B --> C[内置转换器]
    B --> D[Document Intelligence]
    B --> E[Content Understanding]
    
    C --> F[本地处理]
    D --> G[Azure Doc Intel API]
    E --> H[Azure CU API]
    
    G --> I[Markdown 输出]
    H --> I
    
    style D fill:#0078d4,color:#fff
    style E fill:#0078d4,color:#fff
```

### 依赖加载机制

MarkItDown 使用延迟依赖加载模式处理 Azure SDK，当缺少依赖时不会直接失败，而是在实际调用时抛出明确的错误信息。

```mermaid
graph LR
    A[导入模块] --> B{azure.ai.contentunderstanding 可用?}
    B -->|是| C[加载 ContentUnderstandingClient]
    B -->|否| D[保存异常信息]
    D --> E[记录 MissingDependencyException]
    C --> F[初始化转换器]
    E --> G[运行时检测]
```

资料来源：[packages/markitdown/src/markitdown/converters/_cu_converter.py:1-40](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

## Azure Document Intelligence 集成

### 功能概述

Azure Document Intelligence（原 Form Recognizer）提供基于云的文档分析服务，能够识别文档布局、提取键值对、表格和选择标记。MarkItDown 通过 `_doc_intel_converter.py` 实现了与该服务的集成。

主要支持能力：
- 文档布局分析
- 表格提取
- 键值对提取
- 多语言文档支持

### 安装配置

需要安装相关依赖包：

```bash
pip install 'markitdown[az-doc-intel]'
```

### 命令行使用

```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```

其中 `-d` 参数启用 Document Intelligence，`-e` 参数指定服务端点。

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Python API 使用

```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
```

### 身份验证配置

Document Intelligence 支持多种身份验证凭据：

| 凭据类型 | 说明 | 使用场景 |
|---------|------|---------|
| AzureKeyCredential | API 密钥认证 | 快速开始、本地开发 |
| DefaultAzureCredential | 自动发现凭据 | Azure 环境部署 |
| TokenCredential | 自定义令牌认证 | 企业 SSO 集成 |

资料来源：[packages/markitdown/src/markitdown/converters/_doc_intel_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)

## Azure Content Understanding 集成

### 功能概述

Azure Content Understanding 是更高级的云端服务，提供多模态文档处理能力，支持文档、图片、音频和视频的统一处理。该服务通过结构化分析器（Analyzer）实现领域特定的字段提取，结果以 YAML 前置matter 格式输出。

资料来源：[packages/markitdown/src/markitdown/converters/_cu_converter.py:1-30](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

### 安装配置

```bash
pip install 'markitdown[az-content-understanding]'
```

### 支持的文件类型与分析器映射

Content Understanding 服务根据文件类型自动选择合适的分析器：

| 文件类型 | 自动选择分析器 | 说明 |
|---------|---------------|------|
| PDF、DOCX、PPTX | prebuilt-documentSearch | 文档全文检索 |
| 图片文件 | prebuilt-documentSearch | 图片文档处理 |
| 视频文件 | prebuilt-videoSearch | 视频内容分析 |
| 音频文件 | prebuilt-audioSearch | 音频转录分析 |

### 命令行使用

```bash
markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"
```

### Python API 使用

#### 自动分析器选择

```python
from markitdown import MarkItDown

# 零配置 — 根据文件类型自动选择分析器
md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")

result = md.convert("report.pdf")   # 文档 → prebuilt-documentSearch
result = md.convert("meeting.mp4")  # 视频 → prebuilt-videoSearch
result = md.convert("call.wav")     # 音频 → prebuilt-audioSearch

print(result.markdown)
```

#### 使用自定义分析器

```python
from markitdown import MarkItDown, ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # 仅 PDF 使用 CU
)
result = md.convert("document.pdf")
print(result.markdown)
```

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### 输出格式

Content Understanding 的输出包含两部分：

1. **YAML 前置matter**：包含分析器提取的结构化字段
2. **Markdown 主体内容**：文档的降级格式表示

```yaml
---
fields:
  invoice_date: "2024-01-15"
  total_amount: "$1,234.56"
  vendor_name: "Example Corp"
---
# 文档内容 Markdown 表示
```

## 与内置转换器的对比

| 能力 | 内置转换器 | Azure Document Intelligence | Azure Content Understanding |
|------|----------|-----------------------------|------------------------------|
| 文档转换 | 离线、格式特定提取 | 云端布局提取 | 云端多模态提取 |
| 结构化字段 | 不可用 | 不通过此集成暴露 | YAML 前置matter 来自分析器字段 |
| 自定义分析器 | 不可用 | 此集成中不可配置 | 支持使用 `cu_analyzer_id` |
| 音频和视频 | 基本音频，无视频 | 不支持 | 音频和视频分析器 |
| 成本 | 仅本地计算 | Azure API 调用计费 | Azure API 调用计费 |
| 离线支持 | 是 | 否 | 否 |

资料来源：[README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## 命令行接口详解

### 完整参数列表

| 参数 | 说明 | 必需 | 示例 |
|------|------|------|------|
| `--use-docintel` 或 `-d` | 启用 Document Intelligence | 与 `-e` 配合 | `--use-docintel` |
| `--use-cu` | 启用 Content Understanding | 与 `--cu-endpoint` 配合 | `--use-cu` |
| `--endpoint` 或 `-e` | Document Intelligence 服务端点 | 使用 Doc Intel 时必填 | `-e "https://xxx.cognitiveservices.azure.com/"` |
| `--cu-endpoint` | Content Understanding 服务端点 | 使用 CU 时必填 | `--cu-endpoint "https://xxx.contentunderstanding.azure.com"` |
| `--cu-analyzer-id` | 指定分析器 ID | 可选 | `--cu-analyzer-id "custom-analyzer"` |
| `--list-plugins` | 列出已安装插件 | 否 | `--list-plugins` |
| `--use-plugins` 或 `-p` | 启用第三方插件 | 否 | `--use-plugins` |

资料来源：[packages/markitdown/src/markitdown/__main__.py:50-120](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### CLI 参数处理流程

```mermaid
graph TD
    A[markitdown CLI] --> B{--list-plugins?}
    B -->|是| C[列出插件并退出]
    B -->|否| D{--use-docintel?}
    D -->|是| E{--endpoint 存在?}
    E -->|是| F{filename 存在?}
    F -->|是| G[创建 MarkItDown with docintel_endpoint]
    F -->|否| H[错误: 需要文件名]
    E -->|否| I[错误: 需要端点]
    D -->|否| J{--use-cu?}
    J -->|是| K{--cu-endpoint 存在?}
    K -->|是| L{filename 存在?}
    L -->|是| M[创建 MarkItDown with cu_endpoint]
    L -->|否| N[错误: 需要文件名]
    K -->|否| O[错误: 需要 CU 端点]
    J -->|否| P[创建标准 MarkItDown]
    G --> Q[执行转换]
    M --> Q
    P --> Q
```

## 与 OCR 插件的协同

markitdown-ocr 插件可以与 Azure Document Intelligence 协同工作，提供更完整的文档处理能力：

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

当同时启用 Document Intelligence 和 OCR 插件时，转换流程如下：

```mermaid
graph LR
    A[原始 PDF] --> B{内置转换器}
    A --> C[Doc Intelligence]
    A --> D[OCR 插件]
    
    B --> E[基础文本提取]
    C --> F[布局分析文本]
    D --> G[图片 OCR 文本]
    
    E --> H[结果合并]
    F --> H
    G --> H
```

资料来源：[packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## 常见问题与故障排除

### 1. 端点参数缺失

**错误信息**：
```
Document Intelligence Endpoint is required when using Document Intelligence.
```

**解决方案**：确保同时提供 `--use-docintel` 和 `--endpoint` 参数。

```bash
# 错误用法
markitdown test.pdf --use-docintel

# 正确用法
markitdown test.pdf --use-docintel -e "https://xxx.cognitiveservices.azure.com/"
```

### 2. 依赖未安装

**错误信息**：
```
MissingDependencyException: azure.ai.contentunderstanding is not installed
```

**解决方案**：安装相应的可选依赖

```bash
# Document Intelligence
pip install 'markitdown[az-doc-intel]'

# Content Understanding
pip install 'markitdown[az-content-understanding]'
```

### 3. 身份验证失败

| 问题现象 | 可能原因 | 解决方案 |
|---------|---------|---------|
| 401 Unauthorized | API 密钥错误 | 检查 endpoint 和 API 密钥配置 |
| 403 Forbidden | 权限不足 | 确认 Azure 资源已启用相应 API |
| 认证类型不匹配 | 凭据设置问题 | 使用正确的凭据类型 |

### 4. 网络连接问题

Azure 服务需要网络访问。离线环境下请使用内置转换器，或考虑在隔离网络中部署 Azure AI 服务。

## 安全注意事项

> [!IMPORTANT]
> MarkItDown 的 Azure 集成会通过网络与 Microsoft Azure 服务通信。在此过程中：
> - 文档内容会被发送到 Azure 服务端进行处理
> - 确保在可信的网络环境中使用
> - 对于敏感文档，建议先评估 Azure 服务的合规性要求
> - API 密钥和凭据应通过环境变量或安全的密钥管理服务管理

## 适用场景建议

### 选择内置转换器的场景

- 本地处理，无网络需求
- 基本的文档格式转换
- 对延迟敏感的生产环境
- 处理公开或非敏感文档

### 选择 Document Intelligence 的场景

- 需要高质量的表格提取
- 处理扫描件和图片文档
- 企业级文档处理流水线
- 需要 Azure 平台的合规和监控

### 选择 Content Understanding 的场景

- 音频和视频内容处理
- 需要结构化字段提取（发票、合同等）
- 多模态文档的统一处理
- 使用自定义分析器的领域特定需求

## 参见

- [主项目文档](../README.md)
- [Python API 使用指南](./Python-API.md)
- [插件系统](./Plugins.md)
- [命令行工具](./CLI-Usage.md)
- [markitdown-ocr 插件](../markitdown-ocr/README.md)

---

<!-- evidence_pipeline_checked: true -->
<!-- evidence_injected: true -->

---

## Doramagic 踩坑日志

项目：microsoft/markitdown

摘要：发现 30 个潜在踩坑项，其中 4 个为 high/blocking；最高优先级：安装坑 - 来源证据：[Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux。

## 1. 安装坑 · 来源证据：[Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux

- 严重度：high
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：[Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_f70b2e3ea5ed47418a4aeb9ef27230f9 | https://github.com/microsoft/markitdown/issues/1685 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 2. 运行坑 · 来源证据：Unrecognized Arguments Error in markitdown CLI for undocumented arguments

- 严重度：high
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个运行相关的待验证问题：Unrecognized Arguments Error in markitdown CLI for undocumented arguments
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_252ef0d45ac040688ffa066bc1b64ba0 | https://github.com/microsoft/markitdown/issues/1897 | 来源类型 github_issue 暴露的待验证使用条件。

## 3. 维护坑 · 来源证据：bug: DOCX math converter crashes when oMath element is missing in malformed equations

- 严重度：high
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个维护/版本相关的待验证问题：bug: DOCX math converter crashes when oMath element is missing in malformed equations
- 对用户的影响：可能阻塞安装或首次运行。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_6e08b71ee29f46a98e6825a5d5b11e6e | https://github.com/microsoft/markitdown/issues/1979 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 4. 维护坑 · 来源证据：bug: DOCX math converter crashes with NotImplementedError on unknown functions

- 严重度：high
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个维护/版本相关的待验证问题：bug: DOCX math converter crashes with NotImplementedError on unknown functions
- 对用户的影响：可能阻塞安装或首次运行。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_439f22f47a524773808819148caadca5 | https://github.com/microsoft/markitdown/issues/1982 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 5. 安装坑 · 失败模式：installation: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this installation risk before relying on the project: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- 对用户的影响：Developers may fail before the first successful local run: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception. Context: Source discussion did not expose a precise runtime context.
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_087a8a7b6538b2ce2b065ade73c555af | https://github.com/microsoft/markitdown/issues/1408 | Office Open XML: Invalid Files Return Success with Error Message Instead of Exception

## 6. 安装坑 · 失败模式：installation: Support for .doc extensions

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this installation risk before relying on the project: Support for .doc extensions
- 对用户的影响：Developers may fail before the first successful local run: Support for .doc extensions
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: Support for .doc extensions. Context: Observed when using windows, linux
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_d5a467d012987779306cb5c50725275b | https://github.com/microsoft/markitdown/issues/23 | Support for .doc extensions

## 7. 安装坑 · 失败模式：installation: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this installation risk before relying on the project: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- 对用户的影响：Developers may fail before the first successful local run: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux. Context: Observed when using python, windows, linux
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_1f9167a15a1eec72c8f79514f1b70b76 | https://github.com/microsoft/markitdown/issues/1685 | [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux

## 8. 安装坑 · 失败模式：installation: v0.1.0

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this installation risk before relying on the project: v0.1.0
- 对用户的影响：Upgrade or migration may change expected behavior: v0.1.0
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: v0.1.0. Context: Observed when using python
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_release | fmev_1d5ae6ee21225356f45c36c20024dccd | https://github.com/microsoft/markitdown/releases/tag/v0.1.0 | v0.1.0

## 9. 安装坑 · 来源证据：Office Open XML: Invalid Files Return Success with Error Message Instead of Exception

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_734e117518a3496eb3779e5f22b600b5 | https://github.com/microsoft/markitdown/issues/1408 | 来源类型 github_issue 暴露的待验证使用条件。

## 10. 安装坑 · 来源证据：bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)
- 对用户的影响：可能阻塞安装或首次运行。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_77597bea6262485b9609d8fc5f50a69a | https://github.com/microsoft/markitdown/issues/1894 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 11. 配置坑 · 失败模式：configuration: Enhancement: Add MCP server support for document processing

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this configuration risk before relying on the project: Enhancement: Add MCP server support for document processing
- 对用户的影响：Developers may misconfigure credentials, environment, or host setup: Enhancement: Add MCP server support for document processing
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: Enhancement: Add MCP server support for document processing. Context: Source discussion did not expose a precise runtime context.
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_969d5f508051e086435b78736eae3e88 | https://github.com/microsoft/markitdown/issues/2004 | Enhancement: Add MCP server support for document processing

## 12. 配置坑 · 失败模式：configuration: v0.1.2

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this configuration risk before relying on the project: v0.1.2
- 对用户的影响：Upgrade or migration may change expected behavior: v0.1.2
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: v0.1.2. Context: Observed when using python
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_release | fmev_076605feea6e0b4830282709121d3c90 | https://github.com/microsoft/markitdown/releases/tag/v0.1.2 | v0.1.2

## 13. 配置坑 · 失败模式：configuration: v0.1.2a1

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this configuration risk before relying on the project: v0.1.2a1
- 对用户的影响：Upgrade or migration may change expected behavior: v0.1.2a1
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: v0.1.2a1. Context: Observed when using python
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_release | fmev_22fa2fa9d8ed93f594844ce5550fc4d8 | https://github.com/microsoft/markitdown/releases/tag/v0.1.2a1 | v0.1.2a1

## 14. 配置坑 · 来源证据：Enhancement: Add MCP server support for document processing

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个配置相关的待验证问题：Enhancement: Add MCP server support for document processing
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_94fcd5bbf87541d1ab988bae7c501a95 | https://github.com/microsoft/markitdown/issues/2004 | 来源类型 github_issue 暴露的待验证使用条件。

## 15. 能力坑 · 能力判断依赖假设

- 严重度：medium
- 证据强度：source_linked
- 发现：README/documentation is current enough for a first validation pass.
- 对用户的影响：假设不成立时，用户拿不到承诺的能力。
- 建议检查：将假设转成下游验证清单。
- 防护动作：假设必须转成验证项；没有验证结果前不能写成事实。
- 证据：capability.assumptions | github_repo:888092115 | https://github.com/microsoft/markitdown | README/documentation is current enough for a first validation pass.

## 16. 运行坑 · 失败模式：runtime: bug: DOCX math converter crashes when oMath element is missing in malformed equations

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this runtime risk before relying on the project: bug: DOCX math converter crashes when oMath element is missing in malformed equations
- 对用户的影响：Developers may hit a documented source-backed failure mode: bug: DOCX math converter crashes when oMath element is missing in malformed equations
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: bug: DOCX math converter crashes when oMath element is missing in malformed equations. Context: Observed when using python
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_2d85aabe3c00f8d53d781ac03dd69f62 | https://github.com/microsoft/markitdown/issues/1979 | bug: DOCX math converter crashes when oMath element is missing in malformed equations

## 17. 运行坑 · 失败模式：runtime: bug: DOCX math converter crashes with NotImplementedError on unknown functions

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this runtime risk before relying on the project: bug: DOCX math converter crashes with NotImplementedError on unknown functions
- 对用户的影响：Developers may hit a documented source-backed failure mode: bug: DOCX math converter crashes with NotImplementedError on unknown functions
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: bug: DOCX math converter crashes with NotImplementedError on unknown functions. Context: Observed when using python
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_3ca154355492e590afae4917a8e9c7af | https://github.com/microsoft/markitdown/issues/1982 | bug: DOCX math converter crashes with NotImplementedError on unknown functions

## 18. 运行坑 · 失败模式：runtime: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this runtime risk before relying on the project: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)
- 对用户的影响：Developers may hit a documented source-backed failure mode: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.). Context: Observed when using python, windows
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_2603b970a28eceb8da6246b000a927d3 | https://github.com/microsoft/markitdown/issues/1894 | bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)

## 19. 运行坑 · 失败模式：runtime: v0.1.3

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this runtime risk before relying on the project: v0.1.3
- 对用户的影响：Upgrade or migration may change expected behavior: v0.1.3
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: v0.1.3. Context: Observed when using windows
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_release | fmev_994386694bb3b31fb731336f58573ff3 | https://github.com/microsoft/markitdown/releases/tag/v0.1.3 | v0.1.3

## 20. 运行坑 · 失败模式：runtime: v0.1.5

- 严重度：medium
- 证据强度：source_linked
- 发现：Developers should check this runtime risk before relying on the project: v0.1.5
- 对用户的影响：Upgrade or migration may change expected behavior: v0.1.5
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: v0.1.5. Context: Observed when using windows
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_release | fmev_38cc2743269efc75c24242abb0e2746c | https://github.com/microsoft/markitdown/releases/tag/v0.1.5 | v0.1.5

## 21. 运行坑 · 来源证据：Timeout needed

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个运行相关的待验证问题：Timeout needed
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_ba28a1cc5c004225b80d2ef380e51a77 | https://github.com/microsoft/markitdown/issues/2000 | 来源类型 github_issue 暴露的待验证使用条件。

## 22. 维护坑 · 维护活跃度未知

- 严重度：medium
- 证据强度：source_linked
- 发现：未记录 last_activity_observed。
- 对用户的影响：新项目、停更项目和活跃项目会被混在一起，推荐信任度下降。
- 建议检查：补 GitHub 最近 commit、release、issue/PR 响应信号。
- 防护动作：维护活跃度未知时，推荐强度不能标为高信任。
- 证据：evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown | last_activity_observed missing

## 23. 安全/权限坑 · 下游验证发现风险项

- 严重度：medium
- 证据强度：source_linked
- 发现：no_demo
- 对用户的影响：下游已经要求复核，不能在页面中弱化。
- 建议检查：进入安全/权限治理复核队列。
- 防护动作：下游风险存在时必须保持 review/recommendation 降级。
- 证据：downstream_validation.risk_items | github_repo:888092115 | https://github.com/microsoft/markitdown | no_demo; severity=medium

## 24. 安全/权限坑 · 存在评分风险

- 严重度：medium
- 证据强度：source_linked
- 发现：no_demo
- 对用户的影响：风险会影响是否适合普通用户安装。
- 建议检查：把风险写入边界卡，并确认是否需要人工复核。
- 防护动作：评分风险必须进入边界卡，不能只作为内部分数。
- 证据：risks.scoring_risks | github_repo:888092115 | https://github.com/microsoft/markitdown | no_demo; severity=medium

## 25. 能力坑 · 失败模式：conceptual: Unrecognized Arguments Error in markitdown CLI for undocumented arguments

- 严重度：low
- 证据强度：source_linked
- 发现：Developers should check this conceptual risk before relying on the project: Unrecognized Arguments Error in markitdown CLI for undocumented arguments
- 对用户的影响：Developers may hit a documented source-backed failure mode: Unrecognized Arguments Error in markitdown CLI for undocumented arguments
- 建议检查：复核 source-backed failure mode cluster，并把适用版本和验证路径写入资产。
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_d7ddfa04bce33d2ca53c58ce9f0265c0 | https://github.com/microsoft/markitdown/issues/1897 | Unrecognized Arguments Error in markitdown CLI for undocumented arguments

## 26. 运行坑 · 失败模式：performance: Timeout needed

- 严重度：low
- 证据强度：source_linked
- 发现：Developers should check this performance risk before relying on the project: Timeout needed
- 对用户的影响：Developers may hit a documented source-backed failure mode: Timeout needed
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: Timeout needed. Context: Source discussion did not expose a precise runtime context.
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_issue | fmev_acdf8e881ef175760bcc59b92eae1aef | https://github.com/microsoft/markitdown/issues/2000 | Timeout needed

## 27. 运行坑 · 失败模式：performance: Version 0.1.6

- 严重度：low
- 证据强度：source_linked
- 发现：Developers should check this performance risk before relying on the project: Version 0.1.6
- 对用户的影响：Upgrade or migration may change expected behavior: Version 0.1.6
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: Version 0.1.6. Context: Source discussion did not expose a precise runtime context.
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_release | fmev_037f240b8fd9da8ecfc973a1f7eae18c | https://github.com/microsoft/markitdown/releases/tag/v0.1.6 | Version 0.1.6

## 28. 维护坑 · issue/PR 响应质量未知

- 严重度：low
- 证据强度：source_linked
- 发现：issue_or_pr_quality=unknown。
- 对用户的影响：用户无法判断遇到问题后是否有人维护。
- 建议检查：抽样最近 issue/PR，判断是否长期无人处理。
- 防护动作：issue/PR 响应未知时，必须提示维护风险。
- 证据：evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown | issue_or_pr_quality=unknown

## 29. 维护坑 · 发布节奏不明确

- 严重度：low
- 证据强度：source_linked
- 发现：release_recency=unknown。
- 对用户的影响：安装命令和文档可能落后于代码，用户踩坑概率升高。
- 建议检查：确认最近 release/tag 和 README 安装命令是否一致。
- 防护动作：发布节奏未知或过期时，安装说明必须标注可能漂移。
- 证据：evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown | release_recency=unknown

## 30. 维护坑 · 失败模式：maintenance: Version 0.1.5b1

- 严重度：low
- 证据强度：source_linked
- 发现：Developers should check this maintenance risk before relying on the project: Version 0.1.5b1
- 对用户的影响：Upgrade or migration may change expected behavior: Version 0.1.5b1
- 建议检查：Before packaging this project, run the relevant install/config/quickstart check for: Version 0.1.5b1. Context: Source discussion did not expose a precise runtime context.
- 防护动作：State this as source-backed community evidence, not as Doramagic reproduction.
- 证据：failure_mode_cluster:github_release | fmev_fe15b901250727fa3263b3b5af451b94 | https://github.com/microsoft/markitdown/releases/tag/v0.1.5b1 | Version 0.1.5b1

<!-- canonical_name: microsoft/markitdown; human_manual_source: deepwiki_human_wiki -->