markitdown Project Manual

Doramagic Project Pack · Project Manual

A Python tool for converting files and office documents to Markdown.

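Project Overview and System Architecture

MarkItDown is a lightweight Python utility library maintained by Microsoft's AutoGen team. Its core goal is to convert many file formats to Markdown for use by downstream large language models (LLMs) and text analysis pipelines. The design is inspired by textract, but it places more emphasis on preserving document structure as Markdown, including headings, lists, tables, and links (README.md).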

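1. Project Positioning and Goals

1.1 Use Cases

LLM indexing and retrieval: normalize heterogeneous sources such as PDF and Office documents into Markdown text so that vector indexes are easier to build.
Text analysis: preserved semantic structure helps paragraph-level and document-level NLP processing.
Readability as a secondary goal: the output is usually reasonably human-friendly, but the project states explicitly that it is not a high-fidelity document conversion solution and mainly targets LLM consumption (README.md).

1.2 Supported Input Formats

According to the main repository README.md, MarkItDown currently supports the following inputs:

Category	Formats
Office documents	PDF, PowerPoint (`.pptx`), Word (`.docx`), Excel (`.xlsx`)
Multimedia	Images (EXIF metadata + OCR), audio (EXIF metadata + speech transcription)
Web	HTML, YouTube URLs
Structured text	CSV, JSON, XML
Containers and e-books	ZIP (iterates over contents), EPUB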

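Format gaps raised by the community:

- No OneNote support: in Issue #47 users ask for support for .one files; the official roadmap does not currently include it.

- No native support for the legacy .doc format: in Issue #23 users request this extension; the extension allowlist in _docx_converter.py currently recognizes only .docx and does not handle binary .doc directly.

Source: https://github.com/microsoft/markitdown / Project Manual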

Built-in Format Converters and File Format Support

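Overview

MarkItDown is a Python tool and command-line utility for converting a variety of file formats to Markdown. Its design goal is to give large language models (LLMs) and text analysis pipelines text input that is clearly structured and token-efficient. The project ships with a set of format converters (DocumentConverter). Each converter handles one format or one family of formats, and a unified registration and dispatch mechanism selects the right one automatically when MarkItDown.convert() is called.

Source: README.md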

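Supported File Formats at a Glance

Format	Extension / MIME	Converter file	Installed by default	Notes
PDF	`.pdf`	`_pdf_converter.py`	Yes	No OCR by default; can be enabled through a plugin
PowerPoint	`.pptx`	`_pptx_converter.py`	Yes	Includes speaker notes, tables, and embedded images
Word	`.docx`	`_docx_converter.py`	Yes	Legacy `.doc` is not officially supported
Excel	`.xlsx`	`_xlsx_converter.py`	Yes	Parsed with OpenPyXL
HTML	`.html` / `.htm`	`_html_converter.py`	Yes	Built on `markdownify`
Wikipedia HTML	Same as above, matched by URL	`_wikipedia_converter.py`	Yes	Matches `*.wikipedia.org` only
Images	`.jpg` / `.png`, etc.	`_image_converter.py`	Yes	EXIF + optional LLM description
Audio	`.wav` / `.mp3`, etc.	`_audio_converter.py`	Yes	Metadata + optional speech transcription
EPUB	`.epub`	`_epub_converter.py`	Yes	Parsed with ebooklib
ZIP	`.zip`	`_zip_converter.py`	Yes	Recursively iterates over the archive contents
Text formats	`.csv` / `.json` / `.xml`	`_csv_converter.py` for CSV (`CsvConverter`); JSON and XML appear to be handled as text	Yes	CSV is emitted as a Markdown table
YouTube URL	(none)	`_youtube_converter.py`	Yes	Fetches the transcript
Azure Content Understanding	Multimodal	`_cu_converter.py`	No	Optional dependency `az-content-understanding`

Source: packages/markitdown/src/markitdown/converters/__init__.py

The community has repeatedly asked about support for OneNote (#47) and legacy .doc (#23). As of the current version, MarkItDown still has no built-in converter for OneNote or .doc.

Source: README.md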

Cloud Services, LLM, and AI Integration

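Overview

MarkItDown is a lightweight Python utility that converts many kinds of files to Markdown for use with LLMs and related text analysis pipelines. For cases where local, offline conversion quality is not good enough, MarkItDown provides a cloud and AI integration layer: Azure Document Intelligence, Azure Content Understanding, and LLM-vision-based image description and OCR.

Source: README.md

The integration layer aims to keep the local converters lightweight and easy to deploy while letting users opt in to cloud multimodal extraction. This addresses problems the community reports frequently, such as reconstructing complex PDF tables (see issue #293) and recognizing PDF layout and headers/footers (see issue #296).

Integration Architecture Overview

MarkItDown's AI and cloud integrations fall into three complementary layers:

Cloud layout analysis layer: document layout recognition, field extraction, and multimodal content understanding through Azure-hosted services.
LLM vision description layer: semantic descriptions for PPTX files and images, generated through any OpenAI-compatible llm_client.
LLM vision OCR layer: OCR text extraction from embedded images and scanned PDFs through the markitdown-ocr plugin.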

graph TD
    A[User input file] --> B{MarkItDown conversion dispatcher}
    B -->|enable_plugins=False| C[Built-in converters]
    B -->|enable_plugins=True| D[Plugin discovery<br/>markitdown.plugin]
    B -->|docintel_endpoint| E[Azure Document Intelligence]
    B -->|cu_endpoint| F[Azure Content Understanding]
    B -->|llm_client + llm_model| G[LLM vision description / OCR]

    C --> C1[PDF / DOCX / PPTX / XLSX / ...]
    D --> D1[markitdown-ocr plugin]
    E --> E1[Cloud layout analysis]
    F --> F1[Multimodal content understanding]
    G --> G1[Image description / OCR text extraction]

    C1 --> H[Markdown output]
    D1 --> H
    E1 --> H
    F1 --> H
    G1 --> H

Source: packages/markitdown/src/markitdown/converters/_cu_converter.py, packages/markitdown-ocr/README.md

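Azure Document Intelligence Integration

Purpose and Positioning

Azure Document Intelligence (formerly Form Recognizer) is a cloud layout extraction service. Handing a document to Azure-hosted pretrained models for layout analysis can yield better recognition of headings, paragraphs, tables, and multi-column structure than local pdfplumber / pdfminer. This responds directly to community feedback that PDF tables are not converted correctly (issue #293) and that overall PDF recognition is insufficient (issue #296).

Source: README.md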

Usage

From the CLI:

markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"

From the Python API:

from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)

Source: README.md

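How It Works

_doc_intel_converter.py implements a DocumentConverter subclass that does the following:

In accepts(), it uses stream_info.mimetype and extension to decide whether the input is one of the target types that Document Intelligence supports, such as PDF, DOCX, PPTX, XLSX, or HTML.
In convert(), it reads the file stream into bytes and calls the Azure Document Intelligence SDK to start a layout analysis request.
It serializes the structured result returned by the cloud service into Markdown text.

Source: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py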
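The accepts() step described above can be sketched as follows. This is a minimal illustration, not the converter's actual code: `StreamInfoSketch`, the extension set, and the MIME prefixes are assumptions chosen for the example, and the real converter may check a different list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfoSketch:
    """Simplified, hypothetical stand-in for the stream metadata object."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None


# Assumed values for illustration only.
ACCEPTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}
ACCEPTED_MIME_PREFIXES = (
    "application/pdf",
    "application/vnd.openxmlformats-officedocument",
    "text/html",
)


def accepts(stream_info: StreamInfoSketch) -> bool:
    """Return True when either the extension or the MIME type matches."""
    extension = (stream_info.extension or "").lower()
    mimetype = (stream_info.mimetype or "").lower()
    if extension in ACCEPTED_EXTENSIONS:
        return True
    # Prefix matching tolerates parameters such as "; charset=utf-8".
    return any(mimetype.startswith(prefix) for prefix in ACCEPTED_MIME_PREFIXES)
```

Checking both signals lets the converter handle streams that arrive without a file name (MIME type only) as well as local files whose MIME type was never detected (extension only).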

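Capability Comparison

Capability	Built-in PDF converter	Azure Document Intelligence
Execution mode	Fully offline, format-specific	Cloud-hosted layout extraction
Table recognition	Coordinate-based heuristics	Deep learning models
Scanned PDFs	Not supported (text extraction only)	OCR and layout analysis supported
Cost	Local compute only	Incurs Azure API charges

Source: README.md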

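Azure Content Understanding Integration

Purpose and Positioning

Azure Content Understanding (CU) is a multimodal content analysis service that goes a step beyond Document Intelligence. Besides documents, it supports unified analysis of images, audio, and video, and it can embed the fields extracted by an analyzer into the Markdown output as YAML front matter.

Source: packages/markitdown/src/markitdown/converters/_cu_converter.py

Core Features

Capability	Built-in converters	Azure Document Intelligence	Azure Content Understanding
Document conversion	Offline, format-specific	Cloud layout extraction	Cloud multimodal extraction
Structured fields	Not available	Not exposed by this integration	Emitted as YAML front matter
Custom analyzer	Not supported	Not configured in this integration	Supported via `cu_analyzer_id`
Audio and video	Basic audio only, no video	Not supported	Audio and video analyzers supported
Cost	Local compute only	Billed	Billed

Source: README.md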

Usage

CLI:

markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"

Python API (zero-configuration, analyzer chosen automatically):

from markitdown import MarkItDown

md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
result = md.convert("report.pdf")   # document -> prebuilt-documentSearch
result = md.convert("meeting.mp4")  # video -> prebuilt-videoSearch
result = md.convert("call.wav")     # audio -> prebuilt-audioSearch
print(result.markdown)

Using a custom analyzer (for domain-specific field extraction):

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_analyzer_id="<custom_analyzer_id>",
)

Restricting file types (only PDFs go through CU):

from markitdown import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],
)

Source: README.md

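Optional Dependency

The CU integration lives in the optional dependency markitdown[az-content-understanding], so that the Azure SDK is not forced on every user. Install it with:

pip install markitdown[az-content-understanding]

At the top of _cu_converter.py, a try / except ImportError block defers loading of azure.ai.contentunderstanding and related dependencies. If loading fails, the module stores the exception information and raises MissingDependencyException only when the converter is actually invoked. It also provides stub classes so that importing the module does not break.

Source: packages/markitdown/src/markitdown/converters/_cu_converter.py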
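The deferred-import pattern described above can be sketched in a self-contained way. The module name `hypothetical_cu_sdk`, the `CloudConverterSketch` class, and the locally defined `MissingDependencyException` are stand-ins invented for this example; they only mirror the behavior the text describes and are not the library's code.

```python
import importlib
from typing import Any, Optional


class MissingDependencyException(Exception):
    """Local stand-in for the exception type named in the text."""


# Hypothetical module name, chosen so that the import fails in this sketch.
_OPTIONAL_MODULE = "hypothetical_cu_sdk"

_dependency_error: Optional[ImportError] = None
try:
    _sdk: Any = importlib.import_module(_OPTIONAL_MODULE)
except ImportError as exc:
    _sdk = None
    _dependency_error = exc  # remember why the import failed


class CloudConverterSketch:
    """Can always be constructed; fails only when conversion is attempted."""

    def convert(self, data: bytes) -> str:
        if _dependency_error is not None:
            raise MissingDependencyException(
                f"{_OPTIONAL_MODULE} is required for this converter"
            ) from _dependency_error
        return _sdk.analyze(data)  # hypothetical SDK call


converter = CloudConverterSketch()  # works even without the optional SDK
```

Raising at call time rather than import time means users who never touch the cloud path are not affected by the missing package, while users who do get an error that still carries the original ImportError as its cause.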

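Key Difference from the Other Integrations

Exposing YAML front matter is what sets the CU integration apart: neither the built-in converters nor the Document Intelligence integration expose analyzer fields as structured data. This makes CU the preferred option when analysis results need to feed downstream structured processing, for example retrieval-augmented generation (RAG) or knowledge base construction.

Source: README.md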
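A downstream consumer would typically separate the front matter from the Markdown body before further processing. The sketch below assumes the conventional `---` delimiters; the field names in the sample (`invoice_id`, `total`) are invented for illustration, and the front matter is returned as raw text so that any YAML parser can be applied afterwards.

```python
from typing import Tuple


def split_front_matter(markdown: str) -> Tuple[str, str]:
    """Split '---'-delimited front matter from the Markdown body.

    Returns (front_matter_text, body). If no complete front matter block
    is found, the front matter is empty and the input is returned as-is.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return front, body
    return "", markdown  # opening delimiter without a closing one


# Hypothetical output shape, for illustration only.
sample = "---\ninvoice_id: INV-001\ntotal: 42.50\n---\n\n# Invoice\n\nBody text."
front, body = split_front_matter(sample)
```

Here `front` holds the two field lines and `body` starts at the `# Invoice` heading, so the fields can go to a structured store while the body goes to the text index.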

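LLM Vision Description Integration

Purpose and Positioning

For inputs that contain images (in particular standalone image files such as .jpg and .png, and images embedded in .pptx), MarkItDown can generate semantic image descriptions through any OpenAI-compatible LLM client, so that visual content is folded into the Markdown as text.

Source: README.md

Usage

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="optional custom prompt",
)
result = md.convert("example.jpg")
print(result.text_content)

llm_prompt is optional. It overrides the default image description prompt, which allows domain-specific or stylized output.

Source: README.md

Where It Is Implemented

This capability is implemented in converters/_llm_caption.py as part of the built-in converter set, and it is active only when llm_client is not None. Under the hood it calls the model through a multimodal chat completion endpoint and uses the returned text as the Markdown stand-in for the image.

Source: packages/markitdown/src/markitdown/converters/_llm_caption.py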

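markitdown-ocr Plugin: LLM Vision OCR

Plugin Positioning

markitdown-ocr is the officially maintained OCR plugin. It uses the same llm_client / llm_model pattern as LLM description, but focuses on extracting text from images embedded in PDF, DOCX, PPTX, and XLSX files and from scanned PDFs. The plugin corresponds to this change in the latest release (v0.1.6):

"Add OCR layer service for embedded images and PDF scans" (PR #1541)

Source: packages/markitdown-ocr/README.md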

Installation and Usage

Installation:

pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client

Python API:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)

CLI:

markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o

If no llm_client is provided, the plugin still loads, but OCR is silently skipped and conversion falls back to the standard built-in converters.

Source: packages/markitdown-ocr/README.md

Custom Prompt

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)

Source: packages/markitdown-ocr/README.md

Any OpenAI-Compatible Client

from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="...",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)

Source: packages/markitdown-ocr/README.md

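How the Plugin Works

graph TD
    A[MarkItDown enable_plugins=True] --> B[Discover plugins via the markitdown.plugin entry point]
    B --> C[Call register_converters, forwarding llm_client / llm_model]
    C --> D[Plugin constructs LLMVisionOCRService]
    D --> E[Register 4 OCR-enhanced converters at priority -1.0]
    E --> F[Built-in converters at priority 0.0]

    G[User calls md.convert file] --> H{OCR converter accepts}
    H -->|True| I[Extract embedded images]
    I --> J[Call LLM vision OCR]
    J --> K[Insert OCR text inline into HTML / Markdown]
    K --> L[Emit Markdown]

    H -->|False| F
    J -->|Failure| M[Continue conversion, skipping only that image]

Source: packages/markitdown-ocr/README.md

The OCR-enhanced converters are registered at priority -1.0, ahead of the built-in converters at 0.0, so they are matched first. A failed LLM call does not abort the overall conversion.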
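The effect of the priority values can be illustrated with a small dispatch sketch in which lower numbers are tried first. The `Registration` type, the converter names, and the tie-breaking by registration order are simplifications assumed for this example; they are not the library's internal data structures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    """Hypothetical registry entry: a name, a priority, and an accepts test."""
    name: str
    priority: float
    accepts: Callable[[str], bool]


def pick_converter(registrations: List[Registration], extension: str) -> Optional[str]:
    """Return the first accepting converter, trying lower priorities first."""
    # sorted() is stable, so entries with equal priority keep their order here.
    for registration in sorted(registrations, key=lambda r: r.priority):
        if registration.accepts(extension):
            return registration.name
    return None


registry = [
    Registration("builtin-pdf", 0.0, lambda ext: ext == ".pdf"),
    Registration("builtin-docx", 0.0, lambda ext: ext == ".docx"),
    Registration("ocr-pdf", -1.0, lambda ext: ext == ".pdf"),
]
```

With this registry, a `.pdf` input resolves to `ocr-pdf` even though it was registered last, while a `.docx` input, which the OCR entry does not accept, falls through to `builtin-docx`.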

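Implementation Details by Format

#### PDF

Embedded images are extracted through page.images / page XObjects, and the OCR text is interleaved with the surrounding text in vertical reading order.
Scanned PDFs (pages with no extractable text) are detected automatically: the page is rendered at 300 DPI and sent to the LLM as a full-page image.
Malformed PDFs that pdfplumber / pdfminer cannot open (for example, a truncated EOF) fall back to PyMuPDF for rendering, so the content can still be recovered.

Source: packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py, packages/markitdown-ocr/README.md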
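The interleaving and scanned-page checks described above can be sketched without any PDF library. `PageItem`, `interleave`, and `looks_scanned` are hypothetical names, and the `[OCR]` prefix in the sample is only an illustration; the plugin's real data model and output markers may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageItem:
    """Hypothetical page element: vertical offset from the page top, plus text."""
    top: float
    text: str


def interleave(text_lines: List[PageItem], ocr_blocks: List[PageItem]) -> str:
    """Merge native text and OCR text in top-to-bottom reading order."""
    merged = sorted(text_lines + ocr_blocks, key=lambda item: item.top)
    return "\n".join(item.text for item in merged)


def looks_scanned(extracted_page_text: str) -> bool:
    """Treat a page with no extractable text as a scanned page."""
    return not extracted_page_text.strip()


native = [PageItem(10.0, "Quarterly report"), PageItem(200.0, "Figure 1: revenue")]
ocr = [PageItem(100.0, "[OCR] Q1 12, Q2 15, Q3 19")]
page_markdown = interleave(native, ocr)
```

Sorting on the vertical offset places the OCR text of an image between the heading above it and the caption below it, which is the ordering a reader of the original page would see.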

graph TD
    A[PDF input] --> B{Parseable by pdfplumber?}
    B -->|Yes| C[Extract embedded images]
    C --> D[Call LLM OCR for each image]
    D --> E[Insert text inline]
    B -->|No, exception| F[Render with PyMuPDF]
    F --> G{Page has extractable text?}
    G -->|No, scanned| H[Render full page at 300 DPI, then LLM OCR]
    G -->|Yes| I[Follow the regular path]
    E --> J[Markdown output]
    H --> J
    I --> J

Source: packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py

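#### DOCX

Images are extracted through doc.part.rels.
OCR runs before the DOCX → HTML → Markdown pipeline: placeholders (such as MARKITDOWNOCRBLOCK{}) are injected into the HTML so that the subsequent mammoth-based conversion to Markdown does not escape the OCR markers.
Each placeholder is designed as a single token containing no special Markdown characters, which ensures that mammoth leaves it untouched.

Source: packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py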
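The placeholder technique can be sketched with regular expressions alone. The token format follows the MARKITDOWNOCRBLOCK{} pattern quoted above; the helper names, the `<img>` matching, and the final OCR text format are assumptions made for this illustration and do not reproduce the plugin's code.

```python
import re
from typing import List, Tuple

PLACEHOLDER = "MARKITDOWNOCRBLOCK{}"  # token pattern quoted in the text


def inject_placeholders(html_text: str) -> Tuple[str, int]:
    """Replace each <img> tag with a numbered placeholder token."""
    count = 0

    def replace_image(match: "re.Match[str]") -> str:
        nonlocal count
        token = PLACEHOLDER.format(count)
        count += 1
        return token

    return re.sub(r"<img\b[^>]*>", replace_image, html_text), count


def restore_ocr(markdown: str, ocr_texts: List[str]) -> str:
    """Swap placeholder tokens for OCR text after the Markdown conversion."""
    return re.sub(
        r"MARKITDOWNOCRBLOCK(\d+)",
        lambda match: ocr_texts[int(match.group(1))],
        markdown,
    )


html_with_tokens, image_count = inject_placeholders(
    '<p>Intro</p><img src="a.png"><p>End</p>'
)
final_markdown = restore_ocr(
    "Intro\n\nMARKITDOWNOCRBLOCK0\n\nEnd", ["*[Image OCR] Total: 42*"]
)
```

Because the token is plain alphanumeric text, an HTML-to-Markdown step has nothing to escape in it, and the asterisks in the OCR text are only introduced after that step has finished.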

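Memory Optimization

The latest version (v0.1.6) fixes O(n) memory growth during PDF conversion: page.close() is now called as soon as pdfplumber finishes each page, which releases the cached page data. The fix was submitted in PR #1612.

Source: packages/markitdown/src/markitdown/converters/_pdf_converter.py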

Custom Plugin Development

For users who want to extend the set of converters, MarkItDown provides a plugin mechanism. Any Python package that implements a DocumentConverter subclass and exports it through the markitdown.plugin entry point is discovered automatically.

from typing import BinaryIO, Any
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    def __init__(self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT):
        super().__init__(priority=priority)

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        # Decide whether the input is an RTF file
        ...

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Conversion logic
        ...

Source: packages/markitdown-sample-plugin/README.md

Controlling Plugins from the CLI

# List installed plugins
markitdown --list-plugins

# Convert with plugins enabled
markitdown --use-plugins path-to-file.pdf

Community plugins can be found by searching GitHub for the #markitdown-plugin tag.

Source: README.md

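Security Considerations

Because both the cloud service and LLM integrations send data to external endpoints, MarkItDown's security guidance must be followed:

Sandbox inputs: in untrusted environments, strictly validate user-supplied file paths, URIs, and network targets.
Use the narrowest API: if you only need to read local files, prefer convert_local() or convert_stream() over the general-purpose convert().
Heed the network binding warning: v0.1.6 updated the security warning about binding to non-local interfaces (PR submitted by @afourney); take particular care when deploying a self-hosted service.

Source: README.md, packages/markitdown/README.md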
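One way to apply the path-validation advice is to resolve every user-supplied path against an allowed base directory before handing it to a local-only API. The helper below is a sketch under that assumption; `resolve_inside` is a hypothetical name and is not part of MarkItDown.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # Reject "../" traversal and absolute paths that land outside the base.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes the allowed directory: {user_path!r}")
    return candidate
```

A validated path could then be passed to convert_local(), which keeps the call on the narrowest API mentioned above instead of the general convert() entry point that also accepts URIs.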

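Common Questions and Community Feedback

Community issue	MarkItDown's answer
#293 Inaccurate PDF table conversion	Enable Azure Document Intelligence for cloud layout recognition, or use Azure Content Understanding to emit structured fields as YAML
#296 Insufficient recognition of PDF headers/footers/tables	Enable a cloud layout analysis service; in addition, v0.1.6 fixes memory growth with `page.close()`, which improves handling of large files
#23 Is `.doc` supported?	Not yet; a custom binary-format converter can be built on the plugin mechanism if needed
#47 Is OneNote supported?	Not yet; it can be added by implementing a custom `DocumentConverter`
#1179 Installing via brew	The project publishes to PyPI only; the community brew formula is maintained by a third party

Source: README.md, packages/markitdown-ocr/README.md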

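Integration Selection Guide

graph TD
    A[Need AI-enhanced conversion?] -->|No| B[Use built-in converters]
    A -->|Yes| C{What needs enhancing?}
    C -->|Image description only| D[llm_client + llm_model]
    C -->|Embedded image OCR / scanned PDF| E[markitdown-ocr plugin]
    C -->|Complex PDF/DOCX layout| F{Azure integration?}
    F -->|Yes| G{Need structured fields?}
    G -->|Yes| H[Azure Content Understanding]
    G -->|No| I[Azure Document Intelligence]
    F -->|No| J[markitdown-ocr or LLM Caption]

Source: README.md, packages/markitdown/src/markitdown/converters/_cu_converter.py

Scenario	Recommended option	Notes
Offline, privacy-sensitive	Built-in converters	No external services, no network calls
Semantic description of ordinary images	`llm_client` + `llm_model`	Any OpenAI-compatible client
Documents with embedded images	`markitdown-ocr` plugin	Also handles scanned PDFs
High-fidelity tables / complex layout	Azure Document Intelligence	Billed per call
Multimodal + structured fields	Azure Content Understanding	Supports audio and video; YAML front matter
Private or unusual formats	Custom plugin	Built on the `DocumentConverter` interface

Source: README.md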
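The decision flow above can also be written as a small function, which may be convenient when the choice has to be made in configuration code. The function name, the `target` labels, and the returned strings are hypothetical; they simply encode the branches of the flowchart.

```python
def recommend_integration(
    needs_ai: bool,
    target: str = "",
    azure_available: bool = False,
    needs_structured_fields: bool = False,
) -> str:
    """Mirror the selection flowchart; all labels are illustrative."""
    if not needs_ai:
        return "built-in converters"
    if target == "image_caption":
        return "llm_client + llm_model"
    if target == "embedded_ocr":
        return "markitdown-ocr plugin"
    if target == "complex_layout":
        if not azure_available:
            return "markitdown-ocr or LLM caption"
        if needs_structured_fields:
            return "Azure Content Understanding"
        return "Azure Document Intelligence"
    raise ValueError(f"unknown target: {target!r}")
```

For example, a complex-layout document with Azure available and structured fields required maps to Azure Content Understanding, while the same document without Azure access falls back to the OCR plugin or LLM captioning.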

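Plugin System, MCP Server, and Extensibility

MarkItDown is a lightweight Python tool and command-line utility for converting many file formats (PDF, Office, images, audio, HTML, and more) to Markdown. Beyond the built-in format converters, MarkItDown offers clear extension points that let the community and third parties contribute new converters as "plugins", or embed MarkItDown in an MCP (Model Context Protocol) ...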

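1. Overview

This page focuses on MarkItDown's plugin mechanism, the official sample plugin, the markitdown-ocr OCR plugin, and MCP server integration, and discusses how they relate to cloud extensions such as Azure Content Understanding.

Source: README.md, packages/markitdown/README.md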

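Failure Modes and Pitfall Log

This section keeps the project-specific risks that Doramagic collected during discovery, validation, and compilation, rather than treating community discussion as decoration.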

Pitfall Log

Project: microsoft/markitdown

Summary: 30 potential pitfalls found, 6 of them high/blocking. Highest priority: Installation pitfall, source evidence: Word Document table conversion issue.

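1. Installation pitfall · Source evidence: Word Document table conversion issue

Severity: high
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified installation-related issue in this project: Word Document table conversion issue
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/20 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

2. Installation pitfall · Source evidence: [BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)

Severity: high
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified installation-related issue in this project: [BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)
User impact: May affect upgrade, migration, or version selection.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/2019 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

3. Installation pitfall · Source evidence: cloned repo and pip install fails

Severity: high
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified installation-related issue in this project: cloned repo and pip install fails
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/1489 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

4. Configuration pitfall · Source evidence: Enhancement: Add MCP server support for document processing

Severity: high
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified configuration-related issue in this project: Enhancement: Add MCP server support for document processing
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/2004 | Unverified usage condition surfaced by a github_issue source.

5. Configuration pitfall · Source evidence: How to use it in Windows 11？

Severity: high
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified configuration-related issue in this project: How to use it in Windows 11？
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/2106 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

6. Maintenance pitfall · Source evidence: [Feature] Use HTML Tables Instead of Markdown Syntax for Better Table Support

Severity: high
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified maintenance/version-related issue in this project: [Feature] Use HTML Tables Instead of Markdown Syntax for Better Table Support
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/1211 | The source discussion mentions Python-related conditions; re-check before installing or trialing.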

7. Installation pitfall · Failure mode: installation: Support for .doc extensions

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this installation risk before relying on the project: Support for .doc extensions
User impact: Developers may fail before the first successful local run: Support for .doc extensions
Evidence: failure_mode_cluster:github_issue | https://github.com/microsoft/markitdown/issues/23 | Support for .doc extensions

8. Installation pitfall · Failure mode: installation: XLSX Conversion Fails: SheetView parameter 'showZeroes' incompatible with openpyxl 3.1.0+

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this installation risk before relying on the project: XLSX Conversion Fails: SheetView parameter 'showZeroes' incompatible with openpyxl 3.1.0+
User impact: Developers may fail before the first successful local run: XLSX Conversion Fails: SheetView parameter 'showZeroes' incompatible with openpyxl 3.1.0+
Evidence: failure_mode_cluster:github_issue | https://github.com/microsoft/markitdown/issues/2063 | XLSX Conversion Fails: SheetView parameter 'showZeroes' incompatible with openpyxl 3.1.0+

9. Installation pitfall · Failure mode: installation: cloned repo and pip install fails

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this installation risk before relying on the project: cloned repo and pip install fails
User impact: Developers may fail before the first successful local run: cloned repo and pip install fails
Evidence: failure_mode_cluster:github_issue | https://github.com/microsoft/markitdown/issues/1489 | cloned repo and pip install fails

10. Installation pitfall · Failure mode: installation: v0.1.0

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this installation risk before relying on the project: v0.1.0
User impact: Upgrade or migration may change expected behavior: v0.1.0
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.0 | v0.1.0

11. Installation pitfall · Source evidence: XLSX Conversion Fails: SheetView parameter 'showZeroes' incompatible with openpyxl 3.1.0+

Severity: medium
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified installation-related issue in this project: XLSX Conversion Fails: SheetView parameter 'showZeroes' incompatible with openpyxl 3.1.0+
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/2063 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

12. Configuration pitfall · Failure mode: configuration: Enhancement: Add MCP server support for document processing

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this configuration risk before relying on the project: Enhancement: Add MCP server support for document processing
User impact: Developers may misconfigure credentials, environment, or host setup: Enhancement: Add MCP server support for document processing
Evidence: failure_mode_cluster:github_issue | https://github.com/microsoft/markitdown/issues/2004 | Enhancement: Add MCP server support for document processing

13. Configuration pitfall · Failure mode: configuration: v0.1.2

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this configuration risk before relying on the project: v0.1.2
User impact: Upgrade or migration may change expected behavior: v0.1.2
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.2 | v0.1.2

14. Configuration pitfall · Failure mode: configuration: v0.1.2a1

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this configuration risk before relying on the project: v0.1.2a1
User impact: Upgrade or migration may change expected behavior: v0.1.2a1
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.2a1 | v0.1.2a1

15. Configuration pitfall · Source evidence: bug: IpynbConverter loses document title when cell source is a string instead of list

Severity: medium
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified configuration-related issue in this project: bug: IpynbConverter loses document title when cell source is a string instead of list
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/2115 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

16. Capability pitfall · Capability assessment depends on an assumption

Severity: medium
Evidence strength: source_linked
Finding: README/documentation is current enough for a first validation pass.
User impact: If the assumption does not hold, users will not get the promised capability.
Evidence: capability.assumptions | github_repo:888092115 | https://github.com/microsoft/markitdown | README/documentation is current enough for a first validation pass.

17. Runtime pitfall · Failure mode: runtime: v0.1.3

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this runtime risk before relying on the project: v0.1.3
User impact: Upgrade or migration may change expected behavior: v0.1.3
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.3 | v0.1.3

18. Runtime pitfall · Failure mode: runtime: v0.1.5

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this runtime risk before relying on the project: v0.1.5
User impact: Upgrade or migration may change expected behavior: v0.1.5
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.5 | v0.1.5

19. Runtime pitfall · Source evidence: bug: consecutive partial numbers (.1 followed by .2) wrongly merged into '.1 .2'

Severity: medium
Evidence strength: source_linked
Finding: GitHub community evidence points to an unverified runtime-related issue in this project: bug: consecutive partial numbers (.1 followed by .2) wrongly merged into '.1 .2'
User impact: May raise the cost of trial and production adoption for new users.
Evidence: community_evidence:github | https://github.com/microsoft/markitdown/issues/2114 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

20. Maintenance pitfall · Failure mode: migration: [BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)

Severity: medium
Evidence strength: source_linked
Finding: Developers should check this migration risk before relying on the project: [BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)
User impact: Developers may hit a documented source-backed failure mode: [BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)
Evidence: failure_mode_cluster:github_issue | https://github.com/microsoft/markitdown/issues/2019 | [BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)

21. Maintenance pitfall · Maintenance activity unknown

Severity: medium
Evidence strength: source_linked
Finding: last_activity_observed was not recorded.
User impact: New, abandoned, and active projects get lumped together, which lowers confidence in the recommendation.
Evidence: evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown | last_activity_observed missing

Severity: medium
Evidence strength: source_linked
Finding: no_demo
Evidence: downstream_validation.risk_items | github_repo:888092115 | https://github.com/microsoft/markitdown | no_demo; severity=medium

23. Security/permissions pitfall · Scoring risk present

Severity: medium
Evidence strength: source_linked
Finding: no_demo
User impact: The risk affects whether the project is suitable for ordinary users to install.
Evidence: risks.scoring_risks | github_repo:888092115 | https://github.com/microsoft/markitdown | no_demo; severity=medium

24. Runtime pitfall · Failure mode: performance: Version 0.1.6

Severity: low
Evidence strength: source_linked
Finding: Developers should check this performance risk before relying on the project: Version 0.1.6
User impact: Upgrade or migration may change expected behavior: Version 0.1.6
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.6 | Version 0.1.6

25. Maintenance pitfall · Issue/PR response quality unknown

Severity: low
Evidence strength: source_linked
Finding: issue_or_pr_quality=unknown.
User impact: Users cannot tell whether anyone will respond when they run into a problem.
Evidence: evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown | issue_or_pr_quality=unknown

26. Maintenance pitfall · Release cadence unclear

Severity: low
Evidence strength: source_linked
Finding: release_recency=unknown.
User impact: Install commands and documentation may lag behind the code, which raises the chance of users hitting problems.
Evidence: evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown | release_recency=unknown

27. Maintenance pitfall · Failure mode: maintenance: Version 0.1.4

Severity: low
Evidence strength: source_linked
Finding: Developers should check this maintenance risk before relying on the project: Version 0.1.4
User impact: Upgrade or migration may change expected behavior: Version 0.1.4
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.4 | Version 0.1.4

28. Maintenance pitfall · Failure mode: maintenance: Version 0.1.5b1

Severity: low
Evidence strength: source_linked
Finding: Developers should check this maintenance risk before relying on the project: Version 0.1.5b1
User impact: Upgrade or migration may change expected behavior: Version 0.1.5b1
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.5b1 | Version 0.1.5b1

29. Maintenance pitfall · Failure mode: maintenance: v0.1.0a6

Severity: low
Evidence strength: source_linked
Finding: Developers should check this maintenance risk before relying on the project: v0.1.0a6
User impact: Upgrade or migration may change expected behavior: v0.1.0a6
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.0a6 | v0.1.0a6

30. Maintenance pitfall · Failure mode: maintenance: v0.1.1

Severity: low
Evidence strength: source_linked
Finding: Developers should check this maintenance risk before relying on the project: v0.1.1
User impact: Upgrade or migration may change expected behavior: v0.1.1
Evidence: failure_mode_cluster:github_release | https://github.com/microsoft/markitdown/releases/tag/v0.1.1 | v0.1.1

Source: Doramagic discovery, validation, and compilation records