# https://github.com/microsoft/markitdown Project Manual

Generated at: 2026-05-30 19:16:05 UTC

## Table of Contents

- [Home](#home)
- [Installation Guide](#installation)
- [Command-Line Interface](#cli-usage)
- [Architecture Overview](#architecture)
- [Python API Reference](#python-api)
- [Supported File Formats](#supported-formats)
- [Azure Integrations](#azure-integrations)
- [OCR Plugin](#ocr-plugin)
- [MCP Server](#mcp-server)
- [Plugin Development Guide](#plugin-development)

<a id='home'></a>

## Home

### Related Pages

Related topics: [Installation Guide](#installation), [Architecture Overview](#architecture), [Supported File Formats](#supported-formats)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)
- [packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)
- [packages/markitdown/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/README.md)
- [packages/markitdown/ThirdPartyNotices.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/ThirdPartyNotices.md)
</details>

# MarkItDown Home

MarkItDown is a lightweight Python utility and command-line tool for converting various document formats into Markdown. It is designed primarily for use with Large Language Models (LLMs) and text analysis pipelines, extracting content while preserving document structure including headings, lists, tables, links, and other semantic elements. Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Overview

MarkItDown provides a unified interface for converting files to Markdown, abstracting away the complexity of handling different file formats. The tool is particularly valuable for:

- **LLM Integration**: Feeding documents to language models that understand Markdown natively
- **Document Indexing**: Creating searchable indexes from various document types
- **Text Analysis**: Extracting structured text content for downstream processing

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like `open()` or `requests.get()`, it accesses resources that the process itself can access. Sanitize your inputs in untrusted environments, and use the narrowest conversion function for your use case (e.g., `convert_stream()`, or `convert_local()`). Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Supported File Formats

MarkItDown supports conversion from numerous formats, organized by the built-in converters. Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

| Format | Description | Notes |
|--------|-------------|-------|
| PDF | Portable Document Format | Basic text extraction; see [Known Limitations](#known-limitations) |
| PowerPoint (.pptx) | Microsoft PowerPoint presentations | Supports images and math equations |
| Word (.docx) | Microsoft Word documents | Supports images and math equations |
| Excel (.xlsx) | Microsoft Excel spreadsheets | Table extraction supported |
| Images | JPEG, PNG, GIF, etc. | EXIF metadata extraction; OCR via plugin |
| Audio | MP3, WAV, etc. | EXIF metadata and speech transcription |
| HTML | Web pages | Includes Wikipedia-specific processing |
| CSV | Comma-separated values | Converted to Markdown tables |
| JSON | JavaScript Object Notation | Structured text output |
| XML | Extensible Markup Language | RSS feeds handled by a dedicated converter |
| EPUB | Electronic publications | E-book content extraction |
| ZIP | Archive files | Iterates over contents |
| YouTube URLs | Video content | Transcript extraction; requires the `youtube-transcription` feature group |

### Why Markdown?

Markdown is extremely close to plain text with minimal markup, yet provides a way to represent important document structure. Mainstream LLMs such as OpenAI's GPT-4o natively understand Markdown, making it an efficient format for document consumption by AI systems. As a side benefit, Markdown conventions are also highly token-efficient compared to other structured formats. Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Architecture

MarkItDown uses a converter-based architecture that allows for extensibility through plugins while providing a consistent interface for all supported formats.

### Core Components

```mermaid
graph TD
    A[MarkItDown API] --> B[Converter Registry]
    B --> C[Built-in Converters]
    B --> D[Plugin Converters]
    
    C --> C1[PDF Converter]
    C --> C2[DOCX Converter]
    C --> C3[PPTX Converter]
    C --> C4[XLSX Converter]
    C --> C5[Image Converter]
    C --> C6[Audio Converter]
    C --> C7[HTML Converter]
    C --> C8[Wikipedia Converter]
    C --> C9[RSS Converter]
    C --> C10[CSV Converter]
    C --> C11[EPUB Converter]
    C --> C12[YouTube Converter]
    
    D --> D1[OCR Plugin]
    D --> D2[Custom Plugins]
    
    E[Azure Integrations] -.->|Optional| B
    E --> E1[Document Intelligence]
    E --> E2[Content Understanding]
```

### Conversion Pipeline

```mermaid
graph LR
    A[Input File/Stream/URI] --> B[Stream Info Extraction]
    B --> C{Hint Extension?}
    C -->|Yes| D[Use Extension Hint]
    C -->|No| E[Detect from Content]
    D --> F[Find Matching Converter]
    E --> F
    F --> G{Converter Found?}
    G -->|Yes| H[Execute Conversion]
    G -->|No| I[Return Error]
    H --> J[DocumentConverterResult]
```
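
The hint-or-detect branch in the diagram can be sketched in a few lines. This is an illustration only, not MarkItDown's detection code: `guess_extension` and the `MAGIC_PREFIXES` table are hypothetical names, and a real detector covers far more formats.

```python
from typing import Optional

# Hypothetical magic-byte table for this sketch only.
MAGIC_PREFIXES = {
    b"%PDF-": ".pdf",
    b"PK\x03\x04": ".zip",  # also the container for DOCX, XLSX, PPTX and EPUB
    b"\x89PNG\r\n\x1a\n": ".png",
}


def guess_extension(head: bytes, hint: Optional[str] = None) -> Optional[str]:
    """Prefer the caller's extension hint; otherwise sniff the leading bytes."""
    if hint:
        # Normalise "pdf" and ".PDF" to ".pdf"
        return "." + hint.lower().lstrip(".")
    for prefix, extension in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return extension
    return None  # the caller would report that no converter matches
```

A ZIP signature alone cannot distinguish DOCX from XLSX, which is one reason an explicit extension hint is useful for stream input.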

### Converter Base Class

All converters inherit from `DocumentConverter` and implement two key methods:

1. **`accepts(file_stream, stream_info, **kwargs)`**: Determines if the converter can handle the given input
2. **`convert(file_stream, stream_info, **kwargs)`**: Performs the actual conversion to Markdown

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
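
To make the two-method contract concrete, here is a toy registry that asks each converter whether it accepts the input and runs the first one that does. It is a sketch of the pattern under simplifying assumptions, not the library's code: `ToyStreamInfo`, `ToyConverter`, and `convert_with_registry` are hypothetical stand-ins, and results are plain strings rather than `DocumentConverterResult` objects.

```python
import io
from typing import BinaryIO, List, Optional


class ToyStreamInfo:
    """Hypothetical stand-in for the stream metadata passed to converters."""

    def __init__(self, extension: Optional[str] = None):
        self.extension = extension


class ToyConverter:
    """Mirrors the two-method shape: accepts() decides, convert() does the work."""

    def accepts(self, file_stream: BinaryIO, stream_info: ToyStreamInfo, **kwargs) -> bool:
        raise NotImplementedError

    def convert(self, file_stream: BinaryIO, stream_info: ToyStreamInfo, **kwargs) -> str:
        raise NotImplementedError


class PlainTextConverter(ToyConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        return stream_info.extension == ".txt"

    def convert(self, file_stream, stream_info, **kwargs):
        return file_stream.read().decode("utf-8")


class CsvConverter(ToyConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        return stream_info.extension == ".csv"

    def convert(self, file_stream, stream_info, **kwargs):
        # Naive comma split; quoted fields are out of scope for this sketch.
        text = file_stream.read().decode("utf-8")
        rows = [line.split(",") for line in text.splitlines() if line]
        header, body = rows[0], rows[1:]
        lines = ["| " + " | ".join(header) + " |"]
        lines.append("| " + " | ".join("---" for _ in header) + " |")
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)


def convert_with_registry(converters: List[ToyConverter], file_stream, stream_info) -> str:
    """Run the first converter whose accepts() returns True."""
    for converter in converters:
        if converter.accepts(file_stream, stream_info):
            return converter.convert(file_stream, stream_info)
    raise ValueError(f"no converter accepts extension {stream_info.extension!r}")


registry = [PlainTextConverter(), CsvConverter()]
table = convert_with_registry(registry, io.BytesIO(b"a,b\n1,2\n"), ToyStreamInfo(".csv"))
```

Separating `accepts()` from `convert()` lets the registry probe many converters cheaply before committing to one.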

### Plugin System

The plugin architecture uses Python entry points to discover and load converters at runtime. Source: [packages/markitdown/src/markitdown/__main__.py:1-35](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

Plugins must implement:

```python
from markitdown import MarkItDown

__plugin_interface_version__ = 1


def register_converters(markitdown: MarkItDown, **kwargs):
    """Called during MarkItDown instantiation to register plugin converters."""
    # YourCustomConverter is a placeholder for your DocumentConverter subclass
    markitdown.register_converter(YourCustomConverter())
```

Entry point configuration in `pyproject.toml`:

```toml
[project.entry-points."markitdown.plugin"]
your_plugin = "your_package_name"
```

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## Installation

### Prerequisites

MarkItDown requires **Python 3.10 or higher**. Using a virtual environment is recommended to avoid dependency conflicts. Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

```bash
# Using standard Python venv
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows

# Using uv
uv venv --python 3.12 .venv
source .venv/bin/activate
```

### Installation Options

| Command | Description |
|---------|-------------|
| `pip install markitdown` | Core package only |
| `pip install 'markitdown[all]'` | All dependencies included |
| `pip install 'markitdown[pdf]'` | PDF conversion support |
| `pip install 'markitdown[docx]'` | Word document support |
| `pip install 'markitdown[pptx]'` | PowerPoint support |
| `pip install 'markitdown[xlsx]'` | Excel support |
| `pip install 'markitdown[images]'` | Image processing |
| `pip install 'markitdown[audio-transcription]'` | Audio transcription |
| `pip install 'markitdown[html]'` | HTML parsing |
| `pip install 'markitdown[az-doc-intel]'` | Azure Document Intelligence |
| `pip install 'markitdown[az-content-understanding]'` | Azure Content Understanding |
| `pip install markitdown-ocr` | OCR plugin for embedded images |

From source:

```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
```

Source: [packages/markitdown/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/README.md)

## Usage

### Command-Line Interface

#### Basic Usage

```bash
# Convert a file and output to stdout
markitdown path-to-file.pdf > document.md

# Convert and save to a specific file
markitdown example.xlsx -o output.md

# Read from stdin
cat document.pdf | markitdown > output.md

# Provide file extension hint (for stdin input)
cat document.pdf | markitdown -x .pdf > output.md
```

Source: [packages/markitdown/src/markitdown/__main__.py:1-50](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

#### CLI Options Reference

| Option | Description |
|--------|-------------|
| `-v, --version` | Show version number and exit |
| `-o, --output FILE` | Output file name (default: stdout) |
| `-x, --extension EXT` | Hint file extension (for stdin input) |
| `-d, --use-docintel` | Use Azure Document Intelligence |
| `-e, --endpoint URL` | Azure Document Intelligence endpoint |
| `--use-cu` | Use Azure Content Understanding |
| `--cu-endpoint URL` | Content Understanding endpoint |
| `--cu-analyzer ID` | Content Understanding analyzer ID |
| `--cu-file-types TYPES` | Comma-separated file types for CU routing |
| `-p, --use-plugins` | Enable 3rd-party plugins |
| `--list-plugins` | List installed plugins |
| `--keep-data-uris` | Keep base64-encoded images in output |

Source: [packages/markitdown/src/markitdown/__main__.py:50-130](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
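
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical convenience (`build_markitdown_argv` is not part of MarkItDown) and uses only flags listed in the table above.

```python
from typing import Optional


def build_markitdown_argv(
    input_path: Optional[str] = None,
    output: Optional[str] = None,
    extension: Optional[str] = None,
    use_plugins: bool = False,
    keep_data_uris: bool = False,
) -> list[str]:
    """Assemble a markitdown command line from the documented options."""
    argv = ["markitdown"]
    if input_path is not None:
        argv.append(input_path)  # omit to read from stdin
    if output is not None:
        argv += ["-o", output]
    if extension is not None:
        argv += ["-x", extension]  # extension hint, mainly for stdin input
    if use_plugins:
        argv.append("--use-plugins")
    if keep_data_uris:
        argv.append("--keep-data-uris")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`; passing a list rather than a shell string avoids quoting problems with unusual file names.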

### Python API

#### Basic Usage

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```

#### MarkItDown Constructor Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enable_plugins` | bool | False | Enable 3rd-party plugin loading |
| `docintel_endpoint` | str | None | Azure Document Intelligence endpoint |
| `docintel_model_id` | str | None | Document Intelligence model ID |
| `llm_client` | object | None | OpenAI-compatible LLM client |
| `llm_model` | str | None | LLM model name (e.g., "gpt-4o") |
| `llm_prompt` | str | None | Custom prompt for LLM operations |
| `cu_endpoint` | str | None | Azure Content Understanding endpoint |
| `cu_file_types` | list | None | File types to route to CU |

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

#### Convert Methods

```python
from markitdown import MarkItDown, StreamInfo

md = MarkItDown()

# Convert local file
result = md.convert("document.pdf")

# Convert URI (http, https, data:, file:)
result = md.convert_uri("https://example.com/page.html")

# Convert a binary file stream; the extension is passed as a hint via StreamInfo
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, stream_info=StreamInfo(extension=".pdf"))

# Convert local file (no URL resolution)
result = md.convert_local("document.pdf")
```

> [!NOTE]
> `convert_url` is an alias for `convert_uri` maintained for backward compatibility. Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

#### Return Value

The `convert()` method returns a `DocumentConverterResult` object with the following attributes:

| Attribute | Type | Description |
|-----------|------|-------------|
| `markdown` | str | The converted Markdown content |
| `title` | str or None | Document title (if extractable) |
| `text_content` | str | Alias for `markdown`, kept for backward compatibility |

## Azure Integrations

### Azure Document Intelligence

Use Azure Document Intelligence for higher-quality PDF and image extraction with layout analysis.

```bash
markitdown document.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```

Python API:

```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<endpoint>")
result = md.convert("document.pdf")
```

For Azure Key Credentials:

```python
from markitdown import MarkItDown
from azure.core.credentials import AzureKeyCredential

md = MarkItDown(
    docintel_endpoint="<endpoint>",
    docintel_credential=AzureKeyCredential("<api_key>")
)
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Azure Content Understanding

Azure Content Understanding provides structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers. Install with:

```bash
pip install 'markitdown[az-content-understanding]'
```

```python
from markitdown import MarkItDown
from markitdown import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # Route only PDFs to CU
)
```

Source: [packages/markitdown/src/markitdown/converters/_cu_converter.py:1-40](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

#### When to Use Content Understanding

| Capability | Built-in | Azure Document Intelligence | Azure Content Understanding |
|------------|----------|----------------------------|-----------------------------|
| Basic PDF conversion | ✓ | ✓ | ✓ |
| Layout analysis | Basic | ✓ | ✓ |
| Structured field extraction | ✗ | ✗ | ✓ |
| Image files | EXIF only | ✗ | ✓ |
| Audio files | Basic transcription | ✗ | ✓ |
| Video files | ✗ | ✗ | ✓ |
| YAML front matter | ✗ | ✗ | ✓ |

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## OCR Plugin (markitdown-ocr)

The `markitdown-ocr` plugin adds OCR support for extracting text from images embedded in PDF, DOCX, PPTX, and XLSX files using LLM Vision. Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

### Installation

```bash
pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client
```

### Usage

```bash
markitdown document_with_images.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

Python API:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

Custom prompt:

```python
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)
```

Works with any OpenAI-compatible client:

```python
from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="...",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)
```

> [!NOTE]
> If no `llm_client` is provided, the plugin loads but silently skips OCR, falling back to the standard built-in converter. Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
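
Because the fallback is silent, a small guard at construction time can surface the problem early. The helper below is a hypothetical sketch (`ocr_kwargs` is not part of MarkItDown or the plugin); it only assembles the keyword arguments shown in the examples above and warns when no client is supplied.

```python
import warnings
from typing import Any, Optional


def ocr_kwargs(llm_client: Optional[Any], llm_model: Optional[str]) -> dict[str, Any]:
    """Build keyword arguments for MarkItDown(...), warning if OCR would be skipped."""
    kwargs: dict[str, Any] = {"enable_plugins": True}
    if llm_client is None:
        # Without a client the plugin loads but performs no OCR
        warnings.warn("No llm_client supplied; markitdown-ocr will skip OCR.", stacklevel=2)
        return kwargs
    kwargs["llm_client"] = llm_client
    if llm_model is not None:
        kwargs["llm_model"] = llm_model
    return kwargs
```

The result would be unpacked into the constructor, for example `MarkItDown(**ocr_kwargs(OpenAI(), "gpt-4o"))`.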

## Known Limitations

### PDF Conversion

MarkItDown's built-in PDF converter performs basic text extraction and may not reliably recognize:

- Headings and footers
- Tables (see [Issue #293](https://github.com/microsoft/markitdown/issues/293))
- Complex layouts

For improved PDF handling, consider:

1. **Azure Document Intelligence** (`-d` flag): Better layout analysis
2. **Azure Content Understanding**: Highest quality extraction with structured output
3. **OCR Plugin**: For scanned PDFs and embedded images

Source: [Community Issue #296](https://github.com/microsoft/markitdown/issues/296)

### Microsoft Word (.doc vs .docx)

MarkItDown only supports `.docx` format (Office Open XML). Legacy `.doc` files are not supported. Source: [Community Issue #23](https://github.com/microsoft/markitdown/issues/23)

### Audio Transcription

Audio transcription requires `ffmpeg` or `avconv` to be installed on the system. On Linux, ensure one of these is available in the system PATH. Source: [Community Issue #1685](https://github.com/microsoft/markitdown/issues/1685)
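
A quick way to confirm that one of these tools is visible to Python is to look it up on the PATH. This is a minimal sketch; `find_audio_backend` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_audio_backend() -> Optional[str]:
    """Return the path of the first of ffmpeg/avconv found on PATH, else None."""
    for tool in ("ffmpeg", "avconv"):
        path = shutil.which(tool)
        if path:
            return path
    return None
```

A return value of `None` suggests installing `ffmpeg` before attempting audio conversion.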

### Error Handling Behavior

When converting invalid Office Open XML files (DOCX, XLSX, PPTX), MarkItDown returns a successful result with the error message in the text content rather than raising an exception. This may make it difficult to distinguish between successful and failed conversions. Source: [Community Issue #1408](https://github.com/microsoft/markitdown/issues/1408)
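
One way to reduce this ambiguity is to validate the file before conversion. Office Open XML files are ZIP packages that contain a `[Content_Types].xml` entry, so a cheap structural check is possible with the standard library. The sketch below is a pre-check under that assumption, not a full validator, and `looks_like_ooxml` is a hypothetical helper name.

```python
import io
import zipfile
from typing import BinaryIO


def looks_like_ooxml(stream: BinaryIO) -> bool:
    """Cheap structural check: a ZIP package containing [Content_Types].xml."""
    start = stream.tell()
    try:
        with zipfile.ZipFile(stream) as package:
            return "[Content_Types].xml" in package.namelist()
    except zipfile.BadZipFile:
        return False
    finally:
        stream.seek(start)  # leave the stream where the caller had it
```

The stream must be seekable. A file that fails this check can be rejected before it is handed to `convert_stream()`.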

## Version History

| Version | Key Changes |
|---------|-------------|
| 0.1.6 | OCR layer service for embedded images and PDF scans; Fixed O(n) memory growth in PDF conversion |
| 0.1.5 | PDF table extraction with aligned Markdown; Fix for partially numbered lists; Wide table support |
| 0.1.4 | Security updates: mammoth 1.11.0, pdfminer.six 20251107 |
| 0.1.3 | ONNXRuntime pinning on Windows; MCP server environment variable support |
| 0.1.2 | Math equation rendering in DOCX; Azure Document Intelligence credentials; CSV to Markdown tables |
| 0.1.1 | `convert_url` renamed to `convert_uri`; Support for file and data URIs |
| 0.1.0 | Plugin architecture; Feature groups for dependencies; 3rd-party extensibility |

Source: [Release Notes](https://github.com/microsoft/markitdown/releases)

## Docker

MarkItDown can be run in a Docker container:

```bash
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Security Considerations

> [!IMPORTANT]
> MarkItDown performs I/O operations with the privileges of the running process. In untrusted environments:
>
> 1. **Sanitize inputs** before passing them to MarkItDown
> 2. **Use narrowest conversion methods**: Prefer `convert_local()` or `convert_stream()` over `convert_uri()` to limit network access
> 3. **Restrict file permissions** on the process running MarkItDown
>
> The tool will access any resources that the process user can access, including local files and network resources.

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
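
As an illustration of input sanitization, the sketch below confines user-supplied paths to a base directory before calling `convert_local()`. The helper names `resolve_inside` and `convert_untrusted` are hypothetical, and this check addresses path traversal only; it is not a complete security boundary.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate


def convert_untrusted(base_dir: str, user_path: str) -> str:
    """Convert a file only if it lies inside base_dir."""
    # Imported lazily so the path check can be used without markitdown installed
    from markitdown import MarkItDown

    safe_path = resolve_inside(base_dir, user_path)
    return MarkItDown().convert_local(str(safe_path)).text_content
```

Resolving both paths before comparing them means `..` segments and symlinks are taken into account.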

## See Also

- [Plugin Development Guide](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md) - Creating custom MarkItDown converters
- [OCR Plugin Documentation](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md) - LLM Vision OCR for embedded images
- [Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) - Microsoft documentation
- [Azure Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource) - Setup guide

---

<a id='installation'></a>

## Installation Guide

### Related Pages

Related topics: [Home](#home), [Command-Line Interface](#cli-usage)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
- [packages/markitdown/pyproject.toml](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/pyproject.toml)
- [Dockerfile](https://github.com/microsoft/markitdown/blob/main/Dockerfile)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
</details>

# Installation Guide

This guide covers all supported methods for installing MarkItDown, including Python package installation, Docker deployment, and plugin setup. Choose the method that best fits your environment and use case.

## Prerequisites

Before installing MarkItDown, ensure your environment meets the following requirements.

### Python Version

MarkItDown requires **Python 3.10 or higher**. You can verify your Python version by running:

```bash
python --version
# or
python3 --version
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
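
The same check can be made from inside Python, for example at the top of a setup script. This is a minimal sketch; `python_is_supported` is a hypothetical helper name.

```python
import sys

MINIMUM_PYTHON = (3, 10)  # from the requirement stated above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON
```

Calling `python_is_supported()` with no arguments checks the running interpreter.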

### Optional: Virtual Environment

Using a virtual environment is recommended to avoid dependency conflicts with other Python packages. Any of the following methods works:

| Method | Commands |
|--------|----------|
| Standard Python | `python -m venv .venv && source .venv/bin/activate` |
| uv | `uv venv --python=3.12 .venv && source .venv/bin/activate` |
| conda | `conda create -n markitdown python=3.12 && conda activate markitdown` |

> **Note:** When using `uv`, ensure you use `uv pip install` rather than `pip install` to install packages within the virtual environment.

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Installation Methods

### Install from PyPI (Recommended)

The simplest installation method uses pip to install MarkItDown from PyPI.

#### Full Installation (All Features)

To install MarkItDown with all optional dependencies for maximum functionality:

```bash
pip install 'markitdown[all]'
```

This installs every converter and dependency group, enabling support for PDF, DOCX, PPTX, XLSX, images, audio, and more.

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

#### Selective Installation

Install only the converters you need by specifying feature groups:

```bash
pip install 'markitdown[pdf,docx,pptx]'
```

| Feature Group | Description | File Formats Supported |
|---------------|-------------|------------------------|
| `pdf` | PDF converter | .pdf |
| `docx` | Word document converter | .docx |
| `pptx` | PowerPoint converter | .pptx |
| `xlsx` | Excel converter | .xlsx |
| `html` | HTML/Wikipedia converter | .html, .htm |
| `image` | Image converter with EXIF/OCR | .jpg, .png, .gif, etc. |
| `audio-transcription` | Audio converter with transcription | .mp3, .wav, .ogg, etc. |
| `epub` | E-book converter | .epub |
| `az-content-understanding` | Azure Content Understanding integration | All formats |
| `all` | All feature groups | All formats |

Source: [packages/markitdown/pyproject.toml](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/pyproject.toml)

### Install from Source

For development, testing, or customization, install MarkItDown from the GitHub repository.

```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
```

The `-e` flag installs the package in editable mode, which is useful for development as changes to the source code take effect immediately without reinstallation.

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Docker Installation

The repository includes a Dockerfile for containerized deployments. This method requires no Python installation on the host system. Run the build command from the repository root.

#### Building the Image

```bash
docker build -t markitdown:latest .
```

#### Running MarkItDown in Docker

Convert a file and output to stdout:

```bash
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```

Or with a bind mount for larger files (arguments after the image name are passed to the `markitdown` entry point):

```bash
docker run --rm -v "$(pwd)":/data markitdown:latest /data/input.pdf -o /data/output.md
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Plugin Installation

MarkItDown uses a plugin architecture that allows extending functionality through separate packages.

### markitdown-ocr Plugin

The `markitdown-ocr` plugin adds OCR support for extracting text from embedded images in PDF, DOCX, PPTX, and XLSX files using LLM Vision. This is particularly useful for scanned PDFs or documents containing image-based text.

#### Installation

```bash
pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client
```

#### Usage

After installation, enable plugins when creating a `MarkItDown` instance:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

#### CLI Usage

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

> **Important:** The `--llm-client` and `--llm-model` CLI arguments require the plugin to be installed. Without the plugin, these arguments will cause an "Unrecognized Arguments" error. See [Issue #1897](https://github.com/microsoft/markitdown/issues/1897) for details.

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

### Third-Party Plugins

To discover third-party plugins, search GitHub for the hashtag `#markitdown-plugin`.

#### Listing Installed Plugins

```bash
markitdown --list-plugins
```

Sample output:
```
Installed MarkItDown 3rd-party Plugins:

  * sample_plugin     (package: markitdown_sample_plugin)

Use the -p (or --use-plugins) option to enable 3rd-party plugins.
```

#### Developing Custom Plugins

For developing a custom plugin, see the [Sample Plugin](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md) documentation. A plugin must:

1. Implement a `DocumentConverter` class with `accepts()` and `convert()` methods
2. Export `__plugin_interface_version__ = 1`
3. Export a `register_converters()` function
4. Define an entry point in `pyproject.toml`:

```toml
[project.entry-points."markitdown.plugin"]
my_plugin = "my_plugin_package"
```

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## Azure Service Integration

### Azure Document Intelligence

For higher-quality PDF conversion using Azure's Document Intelligence service:

```bash
pip install 'markitdown[az-doc-intel]'
```

Configure the endpoint when converting:

```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```

Or in Python:

```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
```

For setup instructions, see [Azure Document Intelligence documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource).

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Azure Content Understanding

For advanced multi-modal extraction with structured field output (YAML front matter):

```bash
pip install 'markitdown[az-content-understanding]'
```

Configure via Python:

```python
from markitdown import ContentUnderstandingFileType
from markitdown import MarkItDown

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # only PDFs use CU
)
```

For more information, see [Azure Content Understanding documentation](https://learn.microsoft.com/azure/ai-services/content-understanding/).

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Verifying Installation

After installation, verify that MarkItDown is correctly installed by checking the version:

```bash
markitdown --version
```

Or test the Python API (this assumes a `test.txt` file exists in the current directory):

```bash
python -c "from markitdown import MarkItDown; print(MarkItDown().convert('test.txt').text_content)"
```
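
If no sample file is at hand, the installed version can also be read from package metadata without running a conversion. A minimal sketch, where `installed_version` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "markitdown") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None
```

A `None` result means the package is not installed in the active environment.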

### Installation Architecture

```mermaid
graph TD
    A[MarkItDown Installation] --> B[Core Package<br/>markitdown]
    A --> C[Optional Dependencies]
    A --> D[Plugins]
    
    C --> C1[PDF: pdfminer.six]
    C --> C2[DOCX: mammoth]
    C --> C3[PPTX: python-pptx]
    C --> C4[XLSX: openpyxl]
    C --> C5[Image: Pillow]
    C --> C6[Audio: pydub, speechrecognition]
    
    D --> D1[markitdown-ocr]
    D --> D2[Custom Plugins<br/>#markitdown-plugin]
    
    B --> E[CLI Interface<br/>markitdown command]
    B --> F[Python API<br/>MarkItDown class]
```
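
To see which of the optional dependencies in the diagram are importable in the current environment, the import system can be probed without loading the packages. The mapping below is an assumption for illustration: its keys are labels from the diagram rather than guaranteed feature-group names, and `available_features` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names for the packages shown in the diagram above
FEATURE_MODULES = {
    "pdf": ["pdfminer"],  # distributed as pdfminer.six
    "docx": ["mammoth"],
    "pptx": ["pptx"],  # distributed as python-pptx
    "xlsx": ["openpyxl"],
    "image": ["PIL"],  # distributed as Pillow
    "audio": ["pydub", "speech_recognition"],
}


def available_features(feature_modules=FEATURE_MODULES) -> dict[str, bool]:
    """Map each label to True when all of its modules can be found."""
    return {
        feature: all(find_spec(module) is not None for module in modules)
        for feature, modules in feature_modules.items()
    }
```

`find_spec` locates a top-level module without importing it, so the check stays cheap even for heavy packages.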

## Troubleshooting

### RuntimeWarning: ffmpeg/avconv not found

On Linux systems, if you see this warning when converting audio files:

```
RuntimeWarning: Couldn't find ffmpeg or avconv
```

Install ffmpeg to enable audio conversion:

```bash
# Debian/Ubuntu
sudo apt install ffmpeg

# Fedora
sudo dnf install ffmpeg

# macOS
brew install ffmpeg
```

Source: [Issue #1685](https://github.com/microsoft/markitdown/issues/1685)

### Unrecognized Arguments Error with --llm-client

If you encounter an "Unrecognized Arguments" error when using `--llm-client` or `--llm-model`:

```
markitdown: error: unrecognized arguments: --llm-client openai
```

Install the `markitdown-ocr` plugin first:

```bash
pip install markitdown-ocr
```

Source: [Issue #1897](https://github.com/microsoft/markitdown/issues/1897)

### Dependency Conflicts

If you encounter dependency conflicts with other packages, create a dedicated virtual environment:

```bash
python -m venv markitdown-env
source markitdown-env/bin/activate
pip install 'markitdown[all]'
```

## Next Steps

After installation, see the following guides:

- [Command-Line Interface](#cli-usage) - Converting files from the terminal
- [Python API Reference](#python-api) - Converting files from Python code
- [Plugin Development Guide](#plugin-development) - Creating custom MarkItDown plugins
- [Troubleshooting](#troubleshooting) - Solving common issues

## See Also

- [MarkItDown README](https://github.com/microsoft/markitdown/blob/main/README.md) - Project overview and quick start
- [markitdown-ocr Plugin](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md) - OCR plugin documentation
- [Sample Plugin](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md) - Plugin development guide
- [Azure Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/) - Cloud-based document conversion
- [Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) - Advanced multi-modal extraction

---

<a id='cli-usage'></a>

## Command-Line Interface

### Related Pages

Related topics: [Installation Guide](#installation), [Python API Reference](#python-api)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
</details>

# Command-Line Interface

MarkItDown provides a command-line interface (CLI) for converting various file formats to Markdown directly from the terminal. The CLI is installed as an entry point of the `markitdown` package, so the `markitdown` command is available in whichever Python environment the package is installed into.

## Overview

The MarkItDown CLI enables users to convert documents without writing Python code. It wraps the `MarkItDown` Python class and provides a streamlined interface for common conversion tasks including file conversion, streaming input, plugin management, and integration with Azure AI services.

```mermaid
graph TD
    A["markitdown CLI"] --> B{Input Type}
    B -->|File Path| C["convert_local()"]
    B -->|Stdin| D["convert_stream()"]
    B -->|URL| E["convert_uri()"]
    C --> F["MarkItDown Engine"]
    D --> F
    E --> F
    F --> G{Converter Selection}
    G -->|Built-in| H["PDF, DOCX, PPTX, XLSX..."]
    G -->|Plugin| I["3rd-party Converters"]
    G -->|Azure| J["Doc Intel / Content Understanding"]
    H --> K["Markdown Output"]
    I --> K
    J --> K
```

## Installation

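The CLI ships with the `markitdown` package itself; there is no separate CLI component to install. Installing the `[all]` extra also pulls in the optional dependencies for every supported format:

```bash
# Install with all optional dependencies (quotes keep shells such as zsh from expanding the brackets)
pip install 'markitdown[all]'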

# Verify CLI is available
markitdown --version
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Basic Usage

### Converting a File

The simplest use case is converting a local file:

```bash
markitdown path/to/document.pdf > output.md
```

Or using the output flag:

```bash
markitdown path/to/document.pdf -o output.md
```

### Reading from Stdin

MarkItDown can read from standard input when no filename is provided. Use the `-x` flag to specify the file extension when reading from stdin:

```bash
cat document.pdf | markitdown -x pdf > output.md
```

### Docker Usage

For environments without Python installed, use Docker:

```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Command Reference

### Global Options

| Option | Short | Description |
|--------|-------|-------------|
| `--version` | `-v` | Show the version number and exit |
| `--output` | `-o` | Output file name. If not provided, output is written to stdout |
| `--extension` | `-x` | Provide a hint about the file extension when reading from stdin |
| `--list-plugins` | | List installed 3rd-party plugins and exit |

### Plugin Options

| Option | Short | Description |
|--------|-------|-------------|
| `--use-plugins` | `-p` | Enable 3rd-party plugin support during conversion |

### LLM Options

| Option | Description |
|--------|-------------|
| `--llm-client` | Specify the LLM client to use (e.g., `openai`) for image descriptions |
| `--llm-model` | Specify the LLM model to use (e.g., `gpt-4o`) |
| `--llm-prompt` | Custom prompt for LLM-based feature extraction |

### Azure Document Intelligence Options

| Option | Short | Description |
|--------|-------|-------------|
| `--use-docintel` | `-d` | Use Azure Document Intelligence for conversion |
| `--endpoint` | `-e` | Azure Document Intelligence endpoint URL (required with `--use-docintel`) |

### Azure Content Understanding Options

| Option | Short | Description |
|--------|-------|-------------|
| `--use-cu` | | Use Azure Content Understanding for conversion |
| `--cu-endpoint` | | Content Understanding endpoint URL (required with `--use-cu`) |
| `--cu-file-types` | | Comma-separated list of file types to process with Content Understanding |

Source: [packages/markitdown/src/markitdown/__main__.py:20-120](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

## Common Usage Patterns

### Standard File Conversion

Convert supported file types (PDF, DOCX, PPTX, XLSX, images, audio, etc.):

```bash
markitdown document.pdf -o document.md
markitdown presentation.pptx -o presentation.md
markitdown spreadsheet.xlsx -o spreadsheet.md
```

### Plugin Usage

List available plugins:

```bash
markitdown --list-plugins
```

Convert using 3rd-party plugins:

```bash
markitdown document.rtf --use-plugins -o document.md
```

### LLM-Enhanced Conversion

Use LLM vision for image descriptions (currently supports PPTX and images):

```bash
markitdown image.jpg --llm-client openai --llm-model gpt-4o -o image.md
```

### Azure Document Intelligence

For higher-quality PDF and Office document conversion:

```bash
markitdown document.pdf -d -e "https://your-resource.cognitiveservices.azure.com/" -o document.md
```

### Azure Content Understanding

For structured field extraction with YAML front matter:

```bash
markitdown invoice.pdf --use-cu --cu-endpoint "https://your-cu-endpoint" -o invoice.md
```

Limit Content Understanding to specific file types:

```bash
markitdown document.pdf --use-cu --cu-endpoint "..." --cu-file-types PDF -o document.md
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## CLI Architecture

The CLI is implemented in `__main__.py` and follows a clear initialization pattern:

```mermaid
graph LR
    A["Argument Parsing<br/>(argparse)"] --> B{"Mode Selection"}
    B -->|--list-plugins| C["List Plugins & Exit"]
    B -->|--use-docintel| D["Document Intelligence Mode"]
    B -->|--use-cu| E["Content Understanding Mode"]
    B -->|Default| F["Standard Conversion Mode"]
    
    D --> G["MarkItDown(<br/>docintel_endpoint)"]
    E --> H["MarkItDown(<br/>cu_endpoint, cu_file_types)"]
    F --> I["MarkItDown(<br/>enable_plugins)"]
    
    G --> J["Conversion & Output"]
    H --> J
    I --> J
```

### Argument Parsing Flow

Source: [packages/markitdown/src/markitdown/__main__.py:20-150](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

The CLI uses `argparse` with `RawDescriptionHelpFormatter` to provide formatted help output including usage examples:

```python
import argparse
from textwrap import dedent

parser = argparse.ArgumentParser(
    description="Convert various file formats to markdown.",
    prog="markitdown",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    usage=dedent("""
        SYNTAX:
            markitdown <OPTIONAL: FILENAME>
            If FILENAME is empty, markitdown reads from stdin.
    """).strip(),
)
```
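
The optional `FILENAME` behavior can be reproduced in isolation. The following is a minimal sketch, not the project's parser: `build_parser` is a hypothetical helper, and only a few flags from the tables above are included. It relies on `nargs="?"`, which lets `argparse` leave the positional argument as `None` when it is omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced stand-in for the markitdown argument parser (hypothetical)."""
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("-o", "--output", help="Output file; stdout if omitted.")
    parser.add_argument("-x", "--extension", help="Extension hint for stdin input.")
    parser.add_argument("-p", "--use-plugins", action="store_true")
    # nargs="?" makes FILENAME optional; None signals "read from stdin".
    parser.add_argument("filename", nargs="?")
    return parser


stdin_args = build_parser().parse_args(["-x", "pdf"])
file_args = build_parser().parse_args(["report.docx", "-o", "out.md", "-p"])
print(stdin_args.filename, stdin_args.extension)
print(file_args.filename, file_args.output, file_args.use_plugins)
```

A caller can then branch on `args.filename is None` to decide between reading stdin and opening a file.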

### Mode Initialization

Based on the flags provided, the CLI initializes `MarkItDown` with different configurations:

```python
# Document Intelligence mode
if args.use_docintel:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins, 
        docintel_endpoint=args.endpoint
    )

# Content Understanding mode
elif args.use_cu:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins,
        cu_endpoint=args.cu_endpoint,
        cu_file_types=args.cu_file_types
    )

# Standard mode with optional plugins
else:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins,
        llm_client=llm_client,
        llm_model=args.llm_model,
        llm_prompt=args.llm_prompt
    )
```

Source: [packages/markitdown/src/markitdown/__main__.py:100-145](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
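
The branching above amounts to choosing a set of constructor keyword arguments. As a rough sketch under stated assumptions, the selection can be written as a pure function that is easy to test without any Azure resources. `build_constructor_kwargs` is a hypothetical helper; the keyword names are taken from the snippet above, and the "endpoint is required" checks mirror the option tables rather than any confirmed behavior of the real CLI.

```python
from typing import Any, Dict, Optional


def build_constructor_kwargs(
    use_plugins: bool = False,
    use_docintel: bool = False,
    endpoint: Optional[str] = None,
    use_cu: bool = False,
    cu_endpoint: Optional[str] = None,
) -> Dict[str, Any]:
    """Map parsed CLI flags to keyword arguments for the converter (sketch)."""
    kwargs: Dict[str, Any] = {"enable_plugins": use_plugins}
    if use_docintel:
        # The option table marks --endpoint as required in this mode.
        if not endpoint:
            raise ValueError("--endpoint is required with --use-docintel")
        kwargs["docintel_endpoint"] = endpoint
    elif use_cu:
        # The option table marks --cu-endpoint as required in this mode.
        if not cu_endpoint:
            raise ValueError("--cu-endpoint is required with --use-cu")
        kwargs["cu_endpoint"] = cu_endpoint
    return kwargs


print(build_constructor_kwargs(use_docintel=True, endpoint="https://example.invalid/"))
```

Keeping the mapping separate from object construction makes the mode precedence (Document Intelligence first, then Content Understanding, then standard) explicit.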

## Plugin Integration

### Listing Plugins

The CLI discovers plugins via Python entry points in the `markitdown.plugin` group:

```bash
markitdown --list-plugins
```

Example output:
```
Installed MarkItDown 3rd-party Plugins:

  * markitdown-ocr  (package: markitdown_ocr)

Use the -p (or --use-plugins) option to enable 3rd-party plugins.
```
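
Entry-point discovery can also be inspected from Python with the standard library. This is a minimal sketch, assuming only the group name `markitdown.plugin` given above; `discover_plugin_names` is a hypothetical helper and does not reproduce the CLI's output format.

```python
from importlib.metadata import entry_points
from typing import List

PLUGIN_GROUP = "markitdown.plugin"


def discover_plugin_names(group: str = PLUGIN_GROUP) -> List[str]:
    """Return the names of installed entry points in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a mapping of group to entries.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


print(discover_plugin_names())
```

An empty list means no package in the current environment advertises an entry point in that group.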

### Enabling Plugins

To use 3rd-party plugins, pass the `--use-plugins` flag:

```bash
markitdown document.pdf --use-plugins -o output.md
```

### markitdown-ocr Plugin

The `markitdown-ocr` plugin adds OCR support for embedded images in PDF, DOCX, PPTX, and XLSX files:

```bash
markitdown document_with_images.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

> [!NOTE]
> If no `llm_client` is provided, the plugin loads but OCR is silently skipped, falling back to the standard converter.

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## Known Limitations and Issues

### Unrecognized Arguments Error

A known issue exists where certain command-line arguments may not be recognized as expected. The example in the `markitdown-ocr` documentation shows:

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

`unrecognized arguments` is the error that `argparse` prints when a flag is not defined by the parser. If the command above fails with that message for `--llm-client` or `--llm-model`, the installed `markitdown` CLI does not define those flags, regardless of which conversion mode is selected.

**Workaround**: Check which flags the installed version accepts, and confirm that the package and its optional dependencies are installed:

```bash
# List the flags defined by the installed CLI
markitdown --help

# Verify markitdown is properly installed
pip show markitdown

# Reinstall with all dependencies
pip install 'markitdown[all]'
```

Reference: [Issue #1897](https://github.com/microsoft/markitdown/issues/1897)

### Missing ffmpeg Warning on Linux

When processing audio files, a `RuntimeWarning` may appear if `ffmpeg` or `avconv` is not installed:

```
RuntimeWarning: Couldn't find ffmpeg or avconv
```

**Workaround**: Install ffmpeg on Linux systems:

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# Fedora
sudo dnf install ffmpeg
```

Reference: [Issue #1685](https://github.com/microsoft/markitdown/issues/1685)
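
Before converting audio, a script can check whether either tool is on the `PATH`. A minimal sketch using only the standard library; `audio_backend_available` is a hypothetical helper name.

```python
import shutil


def audio_backend_available() -> bool:
    """Return True if ffmpeg or avconv can be found on the PATH."""
    return any(shutil.which(tool) is not None for tool in ("ffmpeg", "avconv"))


if not audio_backend_available():
    print("ffmpeg/avconv not found; audio conversion may be limited")
```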

## Security Considerations

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like `open()` or `requests.get()`, it will access resources that the process itself can access.

For untrusted environments:
- Sanitize inputs before passing them to the CLI
- Use the narrowest conversion function needed for your use case
- Consider running in sandboxed environments when processing untrusted files

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Error (file not found, unsupported format, conversion failed) |
| 2 | Invalid arguments or missing required parameters |
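
Exit code 2 for invalid arguments follows the `argparse` convention: the parser prints a usage message and raises `SystemExit(2)`. The sketch below demonstrates only that convention with a stand-in parser; `exit_code_for` is a hypothetical helper, and the other rows of the table are not exercised here.

```python
import argparse

# Stand-in parser with an optional positional FILENAME, as in the CLI.
parser = argparse.ArgumentParser(prog="markitdown-sketch")
parser.add_argument("filename", nargs="?")


def exit_code_for(argv):
    """Return the exit code argparse would produce for the given arguments."""
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return 0


print(exit_code_for(["document.pdf"]))
print(exit_code_for(["--no-such-flag"]))
```

The second call also writes an `unrecognized arguments` message to stderr, which is the same message discussed under Known Limitations.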

## See Also

- [MarkItDown Main Documentation](https://github.com/microsoft/markitdown#readme)
- [Plugin Development Guide](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [markitdown-ocr Plugin](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [Python API Reference](#python-api)

---

<a id='architecture'></a>

## Architecture Overview

### Related Pages

Related topics: [Home](#home), [Python API Reference](#python-api), [Plugin Development Guide](#plugin-development)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
- [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)
- [packages/markitdown/src/markitdown/converters/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/__init__.py)
- [packages/markitdown/src/markitdown/_uri_utils.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_uri_utils.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)
- [packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)
</details>

# Architecture Overview

MarkItDown is a lightweight Python utility for converting various file formats to Markdown, designed primarily for consumption by Large Language Models (LLMs) and text analysis pipelines. This page provides a comprehensive technical overview of MarkItDown's architecture, including its core components, converter system, plugin architecture, and data flow.

## High-Level Architecture

MarkItDown employs a **converter-based architecture** where each file format is handled by a specialized converter that implements a common interface. The system uses a priority-based selection mechanism to allow converters to be overridden or extended through plugins.

```mermaid
graph TD
    subgraph "Client Layer"
        CLI["CLI<br/>(__main__.py)"]
        API["Python API<br/>(MarkItDown class)"]
    end

    subgraph "Core Engine"
        MD["MarkItDown<br/>Orchestrator"]
        SI["StreamInfo<br/>Metadata"]
        EP["Entry Points<br/>(Plugin Discovery)"]
    end

    subgraph "Converter System"
        BC["Base Converter<br/>(DocumentConverter)"]
        PDF["PdfConverter"]
        DOCX["DocxConverter"]
        PPTX["PptxConverter"]
        XLSX["XlsxConverter"]
        IMG["ImageConverter"]
        AUD["AudioConverter"]
        CU["ContentUnderstandingConverter"]
    end

    subgraph "Output"
        DCR["DocumentConverterResult"]
        MD_OUT["Markdown Output"]
    end

    CLI --> API
    API --> MD
    MD --> EP
    MD --> SI
    EP --> BC
    BC --> PDF
    BC --> DOCX
    BC --> PPTX
    BC --> XLSX
    BC --> IMG
    BC --> AUD
    BC --> CU
    MD --> DCR
    DCR --> MD_OUT
```

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

## Core Components

### MarkItDown Orchestrator

The `MarkItDown` class serves as the main orchestrator for the conversion pipeline. It is responsible for:

- **Converter discovery and registration** via Python entry points
- **File format detection** using `StreamInfo` metadata
- **Converter selection and execution** based on priority
- **Result aggregation** into a standardized output format

Source: [packages/markitdown/src/markitdown/_markitdown.py:1-50](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

#### Key Methods

| Method | Description |
|--------|-------------|
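| `convert(source)` | Convert a local path, URI, or stream by dispatching to the narrower methods below |
| `convert_local(path)` | Convert a file on the local filesystem |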
| `convert_stream(stream, stream_info)` | Convert a stream with metadata |
| `convert_uri(uri)` | Convert a file, data, or remote URI |
| `convert_url(url)` | Alias for `convert_uri` |
| `register_converter(converter)` | Register a custom converter |
| `list_converters()` | List all registered converters |

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

### StreamInfo

The `StreamInfo` class encapsulates metadata about the file being converted:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    extension: Optional[str] = None
    mimetype: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None
```

This metadata is used by converters to determine whether they can handle a given file through their `accepts()` method.

Source: [packages/markitdown/src/markitdown/_stream_info.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_stream_info.py)
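
To illustrate how an `accepts()` check might consume this metadata, here is a minimal sketch. `StreamInfoSketch` is a stand-in that copies the four fields shown above, and `accepts_pdf` is a hypothetical function; the sketch assumes extensions are stored with a leading dot, which is an assumption rather than something stated on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreamInfoSketch:
    """Stand-in carrying the four metadata fields shown above."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None


def accepts_pdf(info: StreamInfoSketch) -> bool:
    """Accept a stream if either the extension or the MIME type says PDF."""
    extension = (info.extension or "").lower()
    mimetype = (info.mimetype or "").lower()
    return extension == ".pdf" or mimetype.startswith("application/pdf")


print(accepts_pdf(StreamInfoSketch(extension=".PDF")))
print(accepts_pdf(StreamInfoSketch(mimetype="text/html")))
```

Checking both fields lets a converter accept input that arrives with only one hint, such as stdin with `-x` or an HTTP response with a `Content-Type` header.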

### DocumentConverterResult

The `DocumentConverterResult` class represents the output of a successful conversion:

```python
class DocumentConverterResult:
    text_content: str          # The Markdown output
    attachments: List[Any]     # Extracted images or other attachments
    metadata: Dict[str, Any]   # Conversion metadata
```

Source: [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

## Converter System

### Base Converter Interface

All converters inherit from `DocumentConverter`, which defines the contract for file conversion:

```python
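from abc import ABC, abstractmethod
from typing import Any, BinaryIO


class DocumentConverter(ABC):
    # Priority constants; lower values are tried first
    PRIORITY_SPECIFIC_FILE_FORMAT = 0.0   # e.g. .docx, .pdf, .xlsx
    PRIORITY_GENERIC_FILE_FORMAT = 10.0   # near catch-all converters, e.g. text/*

    def __init__(self, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT):
        self.priority = priority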

    @abstractmethod
    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any
    ) -> bool:
        """Determine if this converter can handle the file."""
        pass

    @abstractmethod
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any
    ) -> DocumentConverterResult:
        """Convert the file to Markdown."""
        pass
```

Source: [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

### Priority System

Converters are tried in ascending priority order, so **lower values are tried first**. A plugin can therefore run ahead of a built-in converter by registering with a priority below `0.0`:

| Priority | Usage |
|---------------|-------|
| `< 0.0` | Converters that must run before the built-ins (for example, `markitdown-ocr` registers at `-1.0`) |
| `0.0` (`PRIORITY_SPECIFIC_FILE_FORMAT`) | Format-specific converters such as PDF, DOCX, and XLSX |
| `10.0` (`PRIORITY_GENERIC_FILE_FORMAT`) | Near catch-all converters, such as those for `text/*` MIME types |

Source: [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

### Built-in Converters

The following converters are included with the core package:

| Converter | File Types | Priority |
|-----------|------------|----------|
| `PdfConverter` | `.pdf` | 0.0 |
| `DocxConverter` | `.docx` | 0.0 |
| `PptxConverter` | `.pptx` | 0.0 |
| `XlsxConverter` | `.xlsx` | 0.0 |
| `ImageConverter` | `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.webp` | 0.0 |
| `AudioConverter` | `.mp3`, `.wav`, `.m4a`, `.flac` | 0.0 |
| `WikipediaConverter` | Wikipedia URLs | 0.0 |
| `CsvConverter` | `.csv` | 0.0 |
| `JsonConverter` | `.json` | 0.0 |
| `XmlConverter` | `.xml` | 0.0 |
| `HtmlConverter` | `.html`, `.htm` | 0.0 |
| `EpubConverter` | `.epub` | 0.0 |
| `YouTubeConverter` | YouTube URLs | 0.0 |
| `IpynbConverter` | `.ipynb` | 0.0 |
| `ZipConverter` | `.zip` | 0.0 |
| `ContentUnderstandingConverter` | All (when enabled) | 0.0 |

Source: [packages/markitdown/src/markitdown/converters/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/__init__.py)

## Plugin Architecture

Version 0.1.0 introduced a plugin-based architecture that allows third-party developers to extend MarkItDown's capabilities.

### Plugin Discovery Mechanism

Plugins are discovered through Python entry points in the `markitdown.plugin` group:

```toml
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
```

Source: [packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)

### Plugin Interface

Each plugin must implement and export the following:

```python
# The version of the plugin interface that this plugin uses
__plugin_interface_version__ = 1

def register_converters(markitdown: MarkItDown, **kwargs):
    """Called during construction of MarkItDown instances."""
    markitdown.register_converter(YourCustomConverter())
```

Source: [packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)

### Plugin Registration Flow

```mermaid
sequenceDiagram
    participant User
    participant MarkItDown
    participant EntryPoints
    participant Plugin
    participant Converter

    User->>MarkItDown: MarkItDown(enable_plugins=True)
    MarkItDown->>EntryPoints: entry_points(group="markitdown.plugin")
    EntryPoints-->>MarkItDown: List of plugins
    loop For each plugin
        MarkItDown->>Plugin: load_plugin()
        Plugin->>Plugin: register_converters()
        Plugin->>MarkItDown: register_converter(Converter)
    end
    User->>MarkItDown: convert(file)
    MarkItDown->>Converter: accepts(stream, info)
    Converter-->>MarkItDown: True
    MarkItDown->>Converter: convert(stream, info)
    Converter-->>MarkItDown: DocumentConverterResult
```

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

### Plugin Parameters

When plugins are loaded, the following parameters are forwarded from the `MarkItDown` constructor:

| Parameter | Type | Purpose |
|-----------|------|---------|
| `llm_client` | OpenAI-compatible | LLM client for image descriptions and OCR |
| `llm_model` | `str` | Model name for LLM calls |
| `llm_prompt` | `str` | Custom prompt for LLM extraction |
| `docintel_endpoint` | `str` | Azure Document Intelligence endpoint |
| `docintel_credential` | `AzureKeyCredential` | Authentication for Document Intelligence |

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

### Example: markitdown-ocr Plugin

The `markitdown-ocr` plugin demonstrates the plugin architecture. It registers OCR-enhanced converters at priority `-1.0`. Because converters with lower priority values are tried first, they run ahead of the built-in converters registered at `0.0`:

```python
# Inside register_converters()
markitdown.register_converter(OcrPdfConverter(priority=-1.0))
markitdown.register_converter(OcrDocxConverter(priority=-1.0))
markitdown.register_converter(OcrPptxConverter(priority=-1.0))
markitdown.register_converter(OcrXlsxConverter(priority=-1.0))
```

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
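
The effect of registering at `-1.0` can be sketched with a small selection loop. This is an illustration under the assumption that converters are tried in ascending priority order and that the first one to accept the input wins; `Registration` and `pick_converter` are hypothetical names, and matching by extension stands in for a real `accepts()` call.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Registration:
    """A converter name, its priority, and the extensions it accepts."""
    name: str
    priority: float
    extensions: Tuple[str, ...]


def pick_converter(registrations: List[Registration], extension: str) -> Optional[str]:
    """Return the first accepting converter, trying lower priorities first."""
    for registration in sorted(registrations, key=lambda r: r.priority):
        if extension in registration.extensions:
            return registration.name
    return None


registrations = [
    Registration("PdfConverter", 0.0, (".pdf",)),
    Registration("OcrPdfConverter", -1.0, (".pdf",)),
    Registration("DocxConverter", 0.0, (".docx",)),
]
print(pick_converter(registrations, ".pdf"))
print(pick_converter(registrations, ".docx"))
```

With both PDF converters registered, the OCR variant is selected for `.pdf`, while `.docx` still falls through to the built-in converter.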

## Conversion Pipeline

### URI Handling

MarkItDown supports multiple input sources through a unified URI handling system:

```mermaid
graph TD
    URI["convert_uri(uri)"]
    URI_TYPES{"URI Type?"}
    FILE["file:///path/to/file"]
    DATA["data:application/pdf;base64,..."]
    HTTP["https://example.com/file.pdf"]
    YT["https://youtube.com/watch?v=..."]

    URI --> URI_TYPES
    URI_TYPES -->|"file:"| FILE
    URI_TYPES -->|"data:"| DATA
    URI_TYPES -->|"http/https"| HTTP
    URI_TYPES -->|"youtube.com"| YT

    FILE --> STREAM["_convert_stream()"]
    DATA --> STREAM
    HTTP --> FETCH["Fetch content"]
    FETCH --> STREAM
    YT --> STREAM

    STREAM --> ACCEPT["Try each converter"]
    ACCEPT --> RESULT["DocumentConverterResult"]
```

Source: [packages/markitdown/src/markitdown/_uri_utils.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_uri_utils.py)
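
The scheme dispatch in the diagram can be approximated with the standard library. The following is a sketch, not the contents of `_uri_utils.py`: `classify_uri` and `decode_data_uri` are hypothetical helpers, and the data URI parsing covers only the common `;base64` and percent-encoded forms.

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes, urlparse


def classify_uri(uri: str) -> str:
    """Return 'file', 'data', or 'http' depending on the URI scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("file", "data"):
        return scheme
    if scheme in ("http", "https"):
        return "http"
    raise ValueError(f"Unsupported URI scheme: {scheme!r}")


def decode_data_uri(uri: str) -> Tuple[str, bytes]:
    """Return (mimetype, payload bytes) for a data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("Not a data URI")
    header, _, payload = uri[len("data:"):].partition(",")
    parts = header.split(";")
    mimetype = parts[0] or "text/plain"  # default type for data URIs
    if "base64" in parts[1:]:
        return mimetype, base64.b64decode(payload)
    return mimetype, unquote_to_bytes(payload)


print(classify_uri("https://example.com/file.pdf"))
print(decode_data_uri("data:text/plain;base64,aGVsbG8="))
```

Once the bytes and MIME type are known, every branch can hand a stream plus metadata to the same conversion path.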

### Stream Processing

The conversion pipeline processes files as streams to support:

- **Piped input** (stdin)
- **Remote URLs** (HTTP/HTTPS)
- **Data URIs** (embedded content)
- **File paths** (local)

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
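
Piped input is typically not seekable, while format detection often needs to rewind the stream. One common way to reconcile the two is to buffer such input in memory first. The sketch below shows that general technique and is not taken from the MarkItDown source; `ensure_seekable` and `PipeLike` are hypothetical names.

```python
import io
from typing import BinaryIO


class PipeLike(io.BytesIO):
    """Hypothetical stand-in for a pipe such as stdin: readable but not seekable."""

    def seekable(self) -> bool:
        return False


def ensure_seekable(stream: BinaryIO) -> BinaryIO:
    """Return the stream unchanged if seekable, else an in-memory copy."""
    if stream.seekable():
        return stream
    # Reads the remaining bytes into memory so callers can rewind freely.
    return io.BytesIO(stream.read())


buffered = ensure_seekable(PipeLike(b"%PDF-1.7 example bytes"))
print(buffered.seekable())
```

The trade-off is memory: the whole input is held in RAM, which matters for very large files.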

## Azure Integration

### Azure Document Intelligence

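MarkItDown can optionally route conversion through Azure Document Intelligence for enhanced extraction from PDF and Office documents:

```python
from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

md = MarkItDown(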
    docintel_endpoint="<endpoint>",
    docintel_credential=AzureKeyCredential("<key>")
)
result = md.convert("document.pdf")
```

Source: [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### Azure Content Understanding

For multi-modal documents (audio, video, structured fields), MarkItDown supports Azure Content Understanding:

```python
from markitdown import MarkItDown, ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF]
)
```

Source: [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

## Command-Line Interface

The CLI is implemented in `__main__.py` and provides the following interface:

```mermaid
graph LR
    CLI["markitdown CLI"]
    OPTS["Options"]
    CONV["Conversion<br/>Modes"]

    subgraph OPTS
        V["-v, --version"]
        O["-o, --output"]
        X["-x, --extension"]
        P["-p, --use-plugins"]
        L["--list-plugins"]
        M["-m, --mime-type"]
        C["-c, --charset"]
    end

    subgraph CONV
        STD["Standard"]
        DOCINTEL["--use-docintel"]
        CU["--use-cu"]
    end

    CLI --> OPTS
    CLI --> CONV
```

Source: [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### CLI Options

| Option | Description |
|--------|-------------|
| `-v, --version` | Show version number |
| `-o, --output FILE` | Output file path |
| `-x, --extension EXT` | Hint file extension (for stdin) |
| `-m, --mime-type MIME` | Hint the MIME type of the input |
| `-c, --charset CHARSET` | Hint the charset of the input (for example, UTF-8) |
| `-p, --use-plugins` | Enable plugins |
| `--list-plugins` | List installed plugins |
| `-d, --use-docintel` | Use Azure Document Intelligence |
| `-e, --endpoint URL` | Document Intelligence endpoint |
| `--use-cu` | Use Azure Content Understanding |
| `--cu-endpoint URL` | Content Understanding endpoint |

Source: [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

## Known Limitations and Failure Modes

Based on community-reported issues, be aware of the following:

### 1. CLI Unrecognized Arguments

The CLI may not recognize all documented arguments. Issue [#1897](https://github.com/microsoft/markitdown/issues/1897) reports that the `--llm-client` and `--llm-model` arguments shown in documentation are not properly recognized.

### 2. UnicodeDecodeError in IpynbConverter

The `IpynbConverter.accepts()` method decodes the stream as UTF-8, which raises `UnicodeDecodeError` when the file contains bytes that are not valid UTF-8. See issue [#1894](https://github.com/microsoft/markitdown/issues/1894).
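
A caller who wants to screen inputs before conversion can guard the decode step. The sketch below is a defensive pre-check, not the converter's own logic: `looks_like_notebook` is a hypothetical helper, and testing for the `nbformat` and `cells` keys is a simple heuristic based on the notebook JSON format.

```python
def looks_like_notebook(data: bytes) -> bool:
    """Best-effort notebook check that never raises UnicodeDecodeError."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that are not valid UTF-8 cannot be a standard notebook file.
        return False
    return '"nbformat"' in text and '"cells"' in text


notebook = b'{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}'
print(looks_like_notebook(notebook))
print(looks_like_notebook(b"\xff\xfe\x00binary"))
```

Valid UTF-8 text containing non-ASCII characters decodes without error; only invalid byte sequences take the `except` branch.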

### 3. Invalid Office Open XML Files

Invalid DOCX, XLSX, or PPTX files return success with an error message in `text_content` rather than raising an exception. See issue [#1408](https://github.com/microsoft/markitdown/issues/1408).
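
Because a failed conversion may not raise, callers who need a hard failure can validate the container first. DOCX, XLSX, and PPTX files are ZIP packages that contain a `[Content_Types].xml` part, so a cheap structural check is possible with the standard library. This is a sketch of such a pre-check; `is_ooxml_package` and `make_minimal_package` are hypothetical helpers, and passing the check does not guarantee the document is otherwise valid.

```python
import io
import zipfile


def is_ooxml_package(data: bytes) -> bool:
    """Return True if the bytes are a ZIP archive with a [Content_Types].xml part."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        return "[Content_Types].xml" in archive.namelist()


def make_minimal_package() -> bytes:
    """Build a tiny in-memory archive that passes the structural check."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
    return out.getvalue()


print(is_ooxml_package(make_minimal_package()))
print(is_ooxml_package(b"this is plain text and not a zip archive"))
```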

### 4. Audio Processing on Linux

When neither `ffmpeg` nor `avconv` is installed, `pydub` emits a `RuntimeWarning`, and audio conversion may be affected. See issue [#1685](https://github.com/microsoft/markitdown/issues/1685).

### 5. PDF Table Extraction

Tables in PDF files may not be converted properly. Community reports indicate that complex table structures can be problematic. See issue [#293](https://github.com/microsoft/markitdown/issues/293).

## Security Considerations

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like `open()` or `requests.get()`, it will access resources that the process itself can access.

For untrusted environments:

1. **Sanitize inputs** before passing them to MarkItDown
2. **Use narrow conversion functions** such as `convert_stream()` or `convert_local()` when possible
3. **Be cautious with URL inputs** as they may trigger network requests

Source: [packages/markitdown/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/README.md)
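
One way to sanitize path inputs is to resolve them against an allowed directory and reject anything that escapes it. The sketch below shows that general technique; `resolve_inside` is a hypothetical helper and is not part of MarkItDown. The resulting path could then be passed to a narrow function such as `convert_local()`.

```python
import tempfile
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting paths that escape it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # After resolution, the base directory must be the path itself or an ancestor.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"Path escapes the allowed directory: {user_path!r}")
    return candidate


allowed = tempfile.gettempdir()
print(resolve_inside(allowed, "uploads/report.pdf"))
```

Resolving before comparing handles `..` segments and absolute paths supplied by the user, both of which would otherwise bypass a simple prefix check.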

## Dependency Groups

MarkItDown organizes dependencies into feature groups:

| Group | Description |
|-------|-------------|
| `pdf` | PDF conversion (pdfminer.six) |
| `docx` | Word document conversion (mammoth) |
| `pptx` | PowerPoint conversion (python-pptx) |
| `xlsx` | Excel conversion (openpyxl) |
| `image` | Image processing |
| `audio-transcription` | Audio transcription (pydub, SpeechRecognition) |
| `youtube-transcription` | YouTube transcript retrieval (youtube-transcript-api) |
| `az-doc-intel` | Azure Document Intelligence |
| `az-content-understanding` | Azure Content Understanding |
| `all` | All optional dependencies |

Install with: `pip install 'markitdown[all]'`

Source: [packages/markitdown/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/README.md)
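
To see which optional libraries are present in the current environment, the import system can be queried without importing anything. A minimal sketch: `available_libraries` is a hypothetical helper, and the mapping lists the import names of libraries named in the table above (for example, `pdfminer.six` is imported as `pdfminer` and `python-pptx` as `pptx`).

```python
from importlib.util import find_spec
from typing import Dict

# Distribution name -> top-level import name.
LIBRARY_MODULES = {
    "pdfminer.six": "pdfminer",
    "mammoth": "mammoth",
    "python-pptx": "pptx",
    "openpyxl": "openpyxl",
    "pydub": "pydub",
}


def available_libraries() -> Dict[str, bool]:
    """Report which optional libraries can be imported in this environment."""
    return {
        name: find_spec(module) is not None
        for name, module in LIBRARY_MODULES.items()
    }


for name, present in available_libraries().items():
    print(f"{name}: {'installed' if present else 'missing'}")
```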

## Class Diagram

```mermaid
classDiagram
    class DocumentConverter {
        <<abstract>>
        +float priority
        +accepts(file_stream, stream_info, **kwargs) bool
        +convert(file_stream, stream_info, **kwargs) DocumentConverterResult
    }

    class MarkItDown {
        +bool enable_plugins
        +str llm_model
        +convert(file_path) DocumentConverterResult
        +convert_uri(uri) DocumentConverterResult
        +convert_stream(stream, info) DocumentConverterResult
        +register_converter(converter)
    }

    class StreamInfo {
        +Optional~str~ extension
        +Optional~str~ mimetype
        +Optional~str~ charset
        +Optional~str~ url
    }

    class DocumentConverterResult {
        +str text_content
        +List attachments
        +Dict metadata
    }

    MarkItDown "1" --> "*" DocumentConverter : registers
    MarkItDown --> StreamInfo : creates
    MarkItDown --> DocumentConverterResult : returns
    DocumentConverter --> DocumentConverterResult : returns
    DocumentConverter ..> StreamInfo : uses
```

## See Also

- [MarkItDown README](https://github.com/microsoft/markitdown) - Official project documentation
- [markitdown-ocr Plugin](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md) - OCR plugin documentation
- [Sample Plugin Guide](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md) - Creating custom plugins
- [Azure Document Intelligence Setup](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource) - Azure integration guide
- [Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) - Content Understanding integration

---

<a id='python-api'></a>

## Python API Reference

### Related Pages

Related topics: [Architecture Overview](#architecture), [Command-Line Interface](#cli-usage), [Azure Integrations](#azure-integrations)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
- [packages/markitdown/src/markitdown/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__init__.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown/src/markitdown/_stream_info.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_stream_info.py)
- [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)
- [packages/markitdown/src/markitdown/_exceptions.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_exceptions.py)
- [packages/markitdown/src/markitdown/converters/_pdf_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pdf_converter.py)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
</details>

# Python API Reference

MarkItDown provides a comprehensive Python API for converting various file formats to Markdown. This reference documents the core classes, methods, and configuration options available to developers integrating MarkItDown into their Python applications.

## Overview

The MarkItDown Python API enables programmatic document conversion with support for:

- **Local files** - Convert files from the filesystem
- **URLs/URIs** - Convert remote documents via HTTP or file URIs
- **Streams** - Convert from file-like objects with optional metadata hints
- **Azure services** - Integration with Document Intelligence and Content Understanding
- **Plugin architecture** - Extensible converter system for custom formats

Source: [packages/markitdown/src/markitdown/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__init__.py)

## Core Classes

### MarkItDown

The main entry point for the Python API. Create an instance with desired configuration and call `convert()` to transform documents to Markdown.

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.markdown)
```

#### Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enable_plugins` | `bool` | `False` | Enable loading of 3rd-party plugins via the `markitdown.plugin` entry point group |
| `llm_client` | `Any` | `None` | OpenAI-compatible LLM client for image descriptions (used with `llm_model`) |
| `llm_model` | `str` | `None` | Model name for LLM-based image description generation |
| `llm_prompt` | `str` | `None` | Custom prompt template for LLM image description |
| `docintel_endpoint` | `str` | `None` | Azure Document Intelligence endpoint URL |
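| `docintel_api_key` | `str` | `None` | Azure Document Intelligence API key (alternative to `docintel_credential`) |
| `docintel_credential` | `AzureKeyCredential` or `TokenCredential` | `None` | Azure credential object for Document Intelligence |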
| `cu_endpoint` | `str` | `None` | Azure Content Understanding endpoint URL |
| `cu_api_key` | `str` | `None` | Azure Content Understanding API key |
| `cu_analyzer` | `str` | `None` | Specific Content Understanding analyzer ID to use |
| `cu_file_types` | `List[ContentUnderstandingFileType]` | `None` | Filter which file types route to Content Understanding |

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

#### Conversion Methods

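##### `convert(source: Union[str, Path, requests.Response, BinaryIO], **kwargs) -> DocumentConverterResult`

The primary conversion method. It inspects `source` and routes it to the appropriate handler: strings that start with `http:`, `https:`, `file:`, or `data:` are treated as URIs, other strings and `Path` objects as local files, and binary file-like objects as streams.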

```python
# Local file
result = md.convert("document.docx")

# URL
result = md.convert("https://example.com/document.pdf")

# File URI
result = md.convert("file:///path/to/document.pdf")

# Data URI
result = md.convert("data:text/plain;base64,SGVsbG8=")
```

The method supports optional keyword arguments that override the instance defaults for a single conversion call.

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
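
The routing described above can be approximated with the standard library alone. This is a minimal sketch, not MarkItDown's implementation: `classify_source` and its labels are hypothetical names, and the only assumption is that the four input kinds shown in the example are told apart by their scheme prefix.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return a coarse label for a source string (hypothetical helper)."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if scheme == "file":
        return "file-uri"
    if scheme == "data":
        return "data-uri"
    # Anything else, including Windows drive letters such as "C:\\docs",
    # is treated as a local filesystem path.
    return "local"


for s in (
    "document.docx",
    "https://example.com/document.pdf",
    "file:///path/to/document.pdf",
    "data:text/plain;base64,SGVsbG8=",
):
    print(s, "->", classify_source(s))
```

A check like this is also useful on the caller's side: in untrusted environments it lets an application reject URL inputs before they ever reach `convert()`.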

##### `convert_uri(uri: str, **kwargs) -> DocumentConverterResult`

Explicitly converts a URI (file, data, or HTTP/HTTPS URL). The `convert_url` method remains as a deprecated alias for backward compatibility.

```python
result = md.convert_uri("file:///path/to/document.pdf")
```

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

##### `convert_local(file_path: str, **kwargs) -> DocumentConverterResult`

Converts a local file by path. This method is preferred when the file source is known to be local, as it bypasses URI parsing.

```python
result = md.convert_local("./documents/report.pdf")
```

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

##### `convert_stream(stream: BinaryIO, *, stream_info: Optional[StreamInfo] = None, **kwargs) -> DocumentConverterResult`

Converts from a file-like object. Use `stream_info` to provide hints about the file type when the stream lacks inherent type information.

```python
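from markitdown import MarkItDown, StreamInfo

md = MarkItDown()
with open("document.pdf", "rb") as f:
    # Pass the hint by keyword so the call works whether or not it is keyword-only
    result = md.convert_stream(f, stream_info=StreamInfo(extension=".pdf"))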
```

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

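##### `register_converter(converter: DocumentConverter, *, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT) -> None`

Registers a custom converter at runtime, adding support for additional file formats. The optional `priority` controls where the converter sits in the selection order.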

```python
md = MarkItDown()
md.register_converter(MyCustomConverter())
```

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

---

### DocumentConverterResult

The result object returned by all conversion methods.

| Attribute | Type | Description |
|-----------|------|-------------|
| `markdown` | `str` | The converted Markdown content (primary attribute) |
| `title` | `Optional[str]` | Document title, if extracted |
| `text_content` | `str` | **Deprecated** alias for `markdown`. New code should use `markdown` directly. |

Source: [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

```python
result = md.convert("document.docx")

# Preferred (v0.1.0+)
print(result.markdown)

# Deprecated (still works for backward compatibility)
print(result.text_content)
```

---

### StreamInfo

Provides metadata hints about file streams being converted. Used primarily with `convert_stream()` when the file type cannot be inferred from the stream itself.

| Attribute | Type | Description |
|-----------|------|-------------|
| `extension` | `Optional[str]` | File extension including the dot (e.g., `.pdf`) |
| `mimetype` | `Optional[str]` | MIME type (e.g., `application/pdf`) |
| `charset` | `Optional[str]` | Character encoding (e.g., `utf-8`) |
| `url` | `Optional[str]` | Source URL, if applicable |

Source: [packages/markitdown/src/markitdown/_stream_info.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_stream_info.py)

```python
from markitdown import MarkItDown, StreamInfo

md = MarkItDown()
stream_info = StreamInfo(
    extension=".pdf",
    mimetype="application/pdf",
    charset="utf-8"
)
```
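
When only a filename is available (for example, from an upload form), the hints can be derived with the standard library. The sketch below is an illustration under stated assumptions: `guess_hints` is a hypothetical helper, not part of MarkItDown, and it fills only the `extension` and `mimetype` fields from the table above.

```python
import mimetypes
from pathlib import PurePath


def guess_hints(filename: str) -> dict:
    """Derive StreamInfo-style hints from a filename (hypothetical helper)."""
    # Normalize the extension to lowercase; use None when there is no suffix.
    extension = PurePath(filename).suffix.lower() or None
    mimetype, _encoding = mimetypes.guess_type(filename)
    return {"extension": extension, "mimetype": mimetype}


print(guess_hints("Quarterly Report.PDF"))
```

The resulting dictionary could then be unpacked into the constructor, as in `StreamInfo(**guess_hints(name))`, before calling `convert_stream()`.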

---

### DocumentConverter (Abstract Base Class)

The base class for all document converters. Custom converters must inherit from this class and implement the `accepts()` and `convert()` methods.

```python
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class MyFormatConverter(DocumentConverter):
    """Converts hypothetical `.myformat` files to Markdown."""
    
    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        # Return True if this converter can handle the input
        return stream_info.extension == ".myformat"
    
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        # Implement conversion logic
        markdown = "..."  # Your conversion logic
        return DocumentConverterResult(markdown=markdown)
```

Source: [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)

#### Converter Priority Constants

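Converters are tried in ascending priority order, so lower values run first. The priority is supplied when a converter is registered with `register_converter()`, and both constants can be imported from the `markitdown` package.

| Constant | Value | Description |
|----------|-------|-------------|
| `PRIORITY_SPECIFIC_FILE_FORMAT` | `0.0` | Default; for converters that target one specific format, such as `.docx` or `.pdf` |
| `PRIORITY_GENERIC_FILE_FORMAT` | `10.0` | For catch-all converters (plain text, generic HTML, ZIP) that should be tried after the specific ones |

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)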

---

## Exception Handling

### MissingDependencyException

Raised when a converter needs an optional dependency that is not installed, for example when converting a PDF without the PDF extra or using Azure services without the corresponding package.

```python
from markitdown import MarkItDown, MissingDependencyException

try:
    # The constructor stays inside the try block because Azure-backed
    # converters may check their dependencies when they are created.
    md = MarkItDown(docintel_endpoint="https://...")
    result = md.convert("document.pdf")
except MissingDependencyException as e:
    print(f"Missing dependency: {e}")
```

Source: [packages/markitdown/src/markitdown/_exceptions.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_exceptions.py)

---

## Built-in Converters

MarkItDown includes converters for the following formats out of the box:

| Format | File Extensions | Notes |
|--------|---------------|-------|
| PDF | `.pdf` | Text extraction with table support |
| Word (OOXML) | `.docx` | Headings, tables, and basic text formatting via `mammoth` |
| Excel (OOXML) | `.xlsx` | Sheet and table conversion |
| PowerPoint (OOXML) | `.pptx` | Slide content extraction |
| CSV | `.csv` | Table-to-Markdown conversion |
| Markdown | `.md`, `.markdown` | Pass-through |
| HTML | `.html`, `.htm` | HTML-to-Markdown conversion |
| Wikipedia | HTML from `*.wikipedia.org` | Article-only content extraction |
| RSS/Atom | `.rss`, `.atom`, `.xml` | Feed and entry parsing |
| Images | `.jpg`, `.jpeg`, `.png`, `.gif`, `.webp` | LLM-based description (requires `llm_client`) |
| Audio | `.mp3`, `.wav`, `.ogg`, `.flac` | Audio transcription (requires ffmpeg) |
| Video | `.mp4`, `.avi`, `.mkv`, `.mov` | Frame extraction and LLM description |
| Jupyter Notebook | `.ipynb` | Notebook cell extraction |

Source: [packages/markitdown/src/markitdown/converters/_pdf_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pdf_converter.py)

---

## LLM Integration

### Image Description

MarkItDown can generate descriptions for images using LLM Vision models. This works for images embedded in documents and standalone image files.

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    llm_client=OpenAI(),
    llm_model="gpt-4o"
)

result = md.convert("chart.png")
print(result.markdown)  # Contains LLM-generated image description
```

### Custom LLM Prompt

Override the default extraction prompt for specialized document types:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure with pipe separators."
)

result = md.convert("document_scan.png")
```

### Azure OpenAI Compatibility

Any OpenAI-compatible client works with MarkItDown:

```python
from markitdown import MarkItDown
from openai import AzureOpenAI

md = MarkItDown(
    llm_client=AzureOpenAI(
        api_key="your-api-key",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01"
    ),
    llm_model="gpt-4o"
)
```

---

## Azure Service Integration

### Azure Document Intelligence

For high-quality PDF and Office document conversion using Azure AI services:

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://your-resource.cognitiveservices.azure.com/",
    docintel_api_key="your-api-key"
)

result = md.convert("complex_document.pdf")
```

Alternatively, pass a credential object such as `AzureKeyCredential` through `docintel_credential`:

```python
from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://your-resource.cognitiveservices.azure.com/",
    docintel_credential=AzureKeyCredential("your-api-key")
)
```

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

### Azure Content Understanding

Content Understanding provides multi-modal extraction with structured field output:

```python
from markitdown import MarkItDown
from markitdown.converters._cu_converter import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="https://your-endpoint.azure.com/",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # Only route PDFs to CU
    cu_analyzer="your-analyzer-id"  # Optional: specific analyzer
)

result = md.convert("document.pdf")
```

Source: [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

---

## Plugin Architecture

MarkItDown supports a plugin system for extending converter capabilities. Plugins are discovered via the `markitdown.plugin` entry point group.

### Creating a Custom Plugin

1. Create a package implementing `DocumentConverter`:

```python
# my_markitdown_plugin/__init__.py
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo, MarkItDown
from typing import BinaryIO, Any

class RtfConverter(DocumentConverter):
    """Converts RTF files to Markdown."""
    
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        return stream_info.extension == ".rtf"
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Implement RTF to Markdown conversion
        markdown = "..."
        return DocumentConverterResult(markdown=markdown)

__plugin_interface_version__ = 1

def register_converters(markitdown: MarkItDown, **kwargs):
    markitdown.register_converter(RtfConverter())
```

2. Define the entry point in `pyproject.toml`:

```toml
[project.entry-points."markitdown.plugin"]
my_rtf_plugin = "my_markitdown_plugin"
```

3. Install and use:

```bash
pip install -e .
markitdown --list-plugins
markitdown --use-plugins document.rtf
```

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
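
To check from Python which plugins are installed, the entry point group can be queried directly. This is a minimal sketch using only `importlib.metadata`; `list_plugin_names` is a hypothetical helper, and the group name is the one given in the `pyproject.toml` example above.

```python
from importlib.metadata import entry_points


def list_plugin_names(group: str = "markitdown.plugin") -> list:
    """Return the sorted names of installed entry points in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


print(list_plugin_names())
```

An empty list means no package in the current environment declares an entry point in that group, which is the first thing to check when a plugin does not show up under `--list-plugins`.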

---

## Data Flow

The following diagram illustrates the conversion pipeline:

```mermaid
graph TD
    A[User Code] --> B[MarkItDown.convert]
    B --> C{Input Type Detection}
    
    C -->|Local File| D[convert_local]
    C -->|URL/HTTP| E[convert_uri - URL Handler]
    C -->|File URI| E
    C -->|Data URI| E
    C -->|Binary Stream| F[convert_stream]
    
    D --> G[StreamInfo Generation]
    E --> G
    F --> G
    
    G --> H[Converter Selection Loop]
    
    H --> I{Converter.accepts ?}
    I -->|True| J[Converter.convert]
    I -->|False| K{Next Converter}
    K -->|Exists| I
    K -->|None| L[UnsupportedFormatException]
    
    J --> M[DocumentConverterResult]
    
    H1[Built-in Converters]
    H2[Plugin Converters]
    
    H --> H1
    H --> H2
    
    L --> M1[Error]
    
    M --> N[Return to User]
```

---

## Common Usage Patterns

### Basic File Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.markdown)
```

### Batch Conversion

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()
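output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)  # write_text() needs an existing directory

for doc in Path("documents").glob("**/*"):
    if doc.is_file():
        result = md.convert(str(doc))
        # Files that share a stem (report.pdf, report.docx) overwrite each other here.
        output = output_dir / f"{doc.stem}.md"
        output.write_text(result.markdown, encoding="utf-8")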
```

### Streaming Input with Type Hints

```python
from markitdown import MarkItDown, StreamInfo

md = MarkItDown()
stream_info = StreamInfo(
    extension=".pdf",
    mimetype="application/pdf"
)

with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, stream_info=stream_info)
    print(result.markdown)
```

### Using Plugins

```python
from markitdown import MarkItDown

# Enable plugin discovery when the instance is created
md = MarkItDown(enable_plugins=True)

# Or enable plugins later on an instance created without them
md = MarkItDown()
md.enable_plugins()
result = md.convert("file.xyz")
```

---

## Known Limitations and Caveats

### Unicode Handling in Jupyter Notebooks

The `IpynbConverter.accepts()` method reads raw file streams and decodes them to check if the file is a Jupyter notebook. When the decode fails (non-ASCII bytes in the file), the exception may propagate uncaught and crash the conversion pipeline. Source: [Issue #1894](https://github.com/microsoft/markitdown/issues/1894)
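
Until the issue is resolved upstream, a caller can pre-screen input with a tolerant check. The sketch below is not the library's code: `looks_like_notebook` is a hypothetical helper, and the `nbformat` key test is an assumption about what identifies a notebook.

```python
def looks_like_notebook(data: bytes, charset: str = "utf-8") -> bool:
    """Return False instead of raising when the bytes cannot be decoded."""
    try:
        text = data.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes or an unknown charset: treat as "not a notebook".
        return False
    # Assumption: notebooks carry both of these top-level JSON keys.
    return "nbformat" in text and "nbformat_minor" in text


print(looks_like_notebook(b'{"nbformat": 4, "nbformat_minor": 5, "cells": []}'))
print(looks_like_notebook(b"\xff\xfe\x00"))
```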

### Office Open XML Error Handling

When converting invalid DOCX, XLSX, or PPTX files, MarkItDown returns a successful `DocumentConverterResult` with the string `"This is not a valid Office Open XML file."` in the markdown content, rather than raising an exception. This makes it difficult to distinguish between successful and failed conversions programmatically. Source: [Issue #1408](https://github.com/microsoft/markitdown/issues/1408)
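
Because no exception is raised, callers that need to detect this case have to inspect the output text. A minimal sketch, assuming the message quoted above appears verbatim in the result; `looks_like_failed_ooxml` is a hypothetical helper, not a MarkItDown API.

```python
OOXML_ERROR_MARKER = "This is not a valid Office Open XML file."


def looks_like_failed_ooxml(markdown: str) -> bool:
    """Heuristic check for the invalid-OOXML message in converted output."""
    # Caveat: a valid document that quotes this sentence is a false positive.
    return OOXML_ERROR_MARKER in markdown


print(looks_like_failed_ooxml("This is not a valid Office Open XML file."))
print(looks_like_failed_ooxml("# Report\n\nAll good."))
```

In application code the check would be applied to `result.markdown` after converting a `.docx`, `.xlsx`, or `.pptx` file.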

### Audio Processing on Linux

When running MarkItDown on Linux systems without ffmpeg or avconv installed, a `RuntimeWarning` is triggered by the pydub dependency, which may cause audio-to-markdown conversion features to fail silently. Source: [Issue #1685](https://github.com/microsoft/markitdown/issues/1685)
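
A quick pre-flight check can surface the missing tool before any audio file is converted. This sketch uses only `shutil.which`; `find_audio_backend` is a hypothetical helper and the tool names are the two mentioned above.

```python
import shutil


def find_audio_backend():
    """Return the path of ffmpeg or avconv if either is on PATH, else None."""
    for tool in ("ffmpeg", "avconv"):
        path = shutil.which(tool)
        if path:
            return path
    return None


backend = find_audio_backend()
print("audio backend:", backend or "not found; install ffmpeg before converting audio")
```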

---

## Security Considerations

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like `open()` or `requests.get()`, it will access resources that the process itself can access. Sanitize your inputs in untrusted environments, and call the narrowest `convert_*` function needed for your use case (e.g., `convert_stream()`, or `convert_local()`).

When using MarkItDown in untrusted environments:

1. Use `convert_local()` for known local files rather than `convert()` which may follow arbitrary URLs
2. Validate file paths before passing them to the API
3. Consider running with restricted filesystem permissions
4. Disable plugin loading (`enable_plugins=False`) unless explicitly required

Source: [packages/markitdown/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/README.md)
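
Path validation (item 2 above) can be sketched as a containment check against an allowed base directory. `resolve_inside` and the `/srv/uploads` directory are hypothetical; the sketch only shows the technique and is not part of MarkItDown.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path and ensure it stays inside base_dir.

    Raises ValueError for paths that escape base_dir, for example
    through ".." segments or by being absolute.
    """
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"{user_path!r} is outside {base}")
    return candidate


print(resolve_inside("/srv/uploads", "docs/report.pdf"))
try:
    resolve_inside("/srv/uploads", "../secret.txt")
except ValueError as exc:
    print("rejected:", exc)
```

The validated path can then be passed to `convert_local()` as `str(path)`, which keeps URL handling out of the picture entirely.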

---

## See Also

- [CLI Reference](#cli-usage) - Command-line interface documentation
- [OCR Plugin](#ocr-plugin) - LLM Vision OCR for embedded images and scanned PDFs
- [Plugin Development Guide](#plugin-development) - Creating custom MarkItDown plugins
- [Azure Integrations](#azure-integrations) - Document Intelligence and Content Understanding setup

---

<a id='supported-formats'></a>

## Supported File Formats

### Related Pages

Related topics: [Home](#home), [Azure Integrations](#azure-integrations), [OCR Plugin](#ocr-plugin)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown/src/markitdown/converters/_pdf_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pdf_converter.py)
- [packages/markitdown/src/markitdown/converters/_docx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_docx_converter.py)
- [packages/markitdown/src/markitdown/converters/_pptx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pptx_converter.py)
- [packages/markitdown/src/markitdown/converters/_xlsx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_xlsx_converter.py)
- [packages/markitdown/src/markitdown/converters/_image_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_image_converter.py)
- [packages/markitdown/src/markitdown/converters/_audio_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_audio_converter.py)
- [packages/markitdown/src/markitdown/converters/_html_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_html_converter.py)
- [packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)
- [packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)
- [packages/markitdown/src/markitdown/converters/_zip_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_zip_converter.py)
- [packages/markitdown/src/markitdown/converters/_csv_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_csv_converter.py)
- [packages/markitdown/src/markitdown/converters/_json_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_json_converter.py)
- [packages/markitdown/src/markitdown/converters/_epub_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_epub_converter.py)
- [packages/markitdown/src/markitdown/converters/_ipynb_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_ipynb_converter.py)
- [packages/markitdown/src/markitdown/converters/_youtube_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_youtube_converter.py)
</details>

# Supported File Formats

MarkItDown provides a unified interface for converting a wide variety of file formats into Markdown. The conversion pipeline uses a plugin-based architecture where each file format is handled by a dedicated converter. When you call `MarkItDown().convert()`, the system iterates through registered converters in priority order until one accepts the input.

## Overview

MarkItDown supports the following high-level categories of file formats:

| Category | Formats | Primary Converter |
|----------|---------|-------------------|
| Documents | PDF, DOCX, PPTX, XLSX | Built-in converters using pdfminer.six, mammoth, python-pptx, openpyxl |
| Media | Images (JPEG, PNG, GIF, WebP, BMP, TIFF) | Built-in with EXIF metadata and LLM Vision OCR |
| Media | Audio (MP3, WAV, M4A, OGG, FLAC) | Built-in with EXIF metadata and transcription |
| Web | HTML, Wikipedia, RSS/Atom | Built-in converters using BeautifulSoup |
| Data | CSV, JSON, XML | Built-in converters |
| Archives | ZIP | Built-in converter with recursive processing |
| eBooks | EPUB | Built-in converter |
| Notebooks | Jupyter Notebook (IPYNB) | Built-in converter |
| URLs | YouTube Videos | Built-in converter with transcript extraction |

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Converter Architecture

MarkItDown uses a converter registration system where each format is handled by a class implementing the `DocumentConverter` interface. Converters are registered with a priority value that determines the order in which they are tried.

```mermaid
graph TD
    A[MarkItDown.convert] --> B[Get registered converters]
    B --> C[Sort by priority, lowest value first]
    C --> D{Loop through converters}
    D --> E{Converter.accepts?}
    E -->|Yes| F[Converter.convert]
    E -->|No| G[Next converter]
    F --> H[Return DocumentConverterResult]
    G --> D
    D -->|All fail| I[UnsupportedFormatException]
```

Each converter implements two key methods:

1. **`accepts(file_stream, stream_info, **kwargs)`** - Returns `True` if the converter can handle the input
2. **`convert(file_stream, stream_info, **kwargs)`** - Performs the actual conversion to Markdown

Source: [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)
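
The selection loop can be modeled in a few lines of plain Python. This is a simplified sketch, not the library's code: `Registration` and `pick_converter` are hypothetical names, `accepts` takes only an extension, and the sketch assumes that lower priority values are tried first. Verify the ordering against `_markitdown.py` for your installed version.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    name: str
    priority: float
    accepts: Callable[[str], bool]  # stand-in for DocumentConverter.accepts


def pick_converter(registrations: List[Registration], extension: str) -> Optional[str]:
    """Return the name of the first registration that accepts the extension."""
    # Assumption: lower priority values are tried first. sorted() is stable,
    # so registrations with equal priority keep their list order.
    for reg in sorted(registrations, key=lambda r: r.priority):
        if reg.accepts(extension):
            return reg.name
    return None  # the real pipeline reports an unsupported format here


registry = [
    Registration("plain-text", 10.0, lambda ext: ext in {".txt", ".html"}),
    Registration("html", 0.0, lambda ext: ext == ".html"),
]
print(pick_converter(registry, ".html"))
print(pick_converter(registry, ".xyz"))
```

The example shows why a generic converter needs a later slot than a specific one: both accept `.html`, and the ordering decides which one handles the file.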

## Document Formats

### PDF

PDF conversion extracts text content while preserving document structure including headings, paragraphs, lists, and tables.

**Supported Extensions:** `.pdf`

**Dependencies:** `pdfminer.six`

**Features:**
- Text extraction with layout preservation
- Table extraction with aligned Markdown output
- Heading and list recognition
- Support for numbered and bulleted lists

**Known Limitations:**
- Scanned PDFs with no extractable text require the `markitdown-ocr` plugin for full-page OCR
- Complex table structures may not convert perfectly (see [Issue #293](https://github.com/microsoft/markitdown/issues/293))
- PDF is converted to text/Markdown, not high-fidelity reproduction (see [Issue #296](https://github.com/microsoft/markitdown/issues/296))

**Memory Optimization:** In version 0.1.6, PDF conversion was fixed to prevent O(n) memory growth by properly calling `page.close()` after processing each page.

Source: [packages/markitdown/src/markitdown/converters/_pdf_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pdf_converter.py)

### Microsoft Word (DOCX)

Word documents are converted using the `mammoth` library, which extracts text and converts it to Markdown.

**Supported Extensions:** `.docx`

**Dependencies:** `mammoth`

**Features:**
- Heading extraction and conversion
- Paragraph and text formatting preservation
- Table extraction
- Math equation rendering (OMML to LaTeX)
- Image extraction with optional LLM Vision descriptions
- Linked image handling

**Known Limitations:**
- The legacy `.doc` format is not supported (see [Issue #23](https://github.com/microsoft/markitdown/issues/23))
- Invalid DOCX files return a success result with an error message in `markdown` rather than raising an exception (see [Issue #1408](https://github.com/microsoft/markitdown/issues/1408))

Source: [packages/markitdown/src/markitdown/converters/_docx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_docx_converter.py)

### Microsoft PowerPoint (PPTX)

PowerPoint presentations are converted slide by slide, with each slide rendered as a Markdown section.

**Supported Extensions:** `.pptx`

**Dependencies:** `python-pptx`

**Features:**
- Slide-by-slide conversion
- Title and content extraction
- Bullet point preservation
- Image extraction with optional LLM Vision descriptions
- Table extraction

Source: [packages/markitdown/src/markitdown/converters/_pptx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_pptx_converter.py)

### Microsoft Excel (XLSX)

Excel spreadsheets are converted with each sheet represented as a separate Markdown section with tables.

**Supported Extensions:** `.xlsx`, `.xlsm`

**Dependencies:** `openpyxl`

**Features:**
- Multi-sheet support
- Table extraction with header row identification
- Cell value preservation including formulas (as displayed values)
- Named sheet sections in output

**Known Limitations:**
- Invalid XLSX files return a success result with an error message rather than raising an exception (see [Issue #1408](https://github.com/microsoft/markitdown/issues/1408))

Source: [packages/markitdown/src/markitdown/converters/_xlsx_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_xlsx_converter.py)

## Media Formats

### Images

Images are converted by extracting metadata and optionally generating descriptions using LLM Vision.

**Supported Extensions:** `.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`

**Dependencies:** `Pillow` (for metadata), LLM client for descriptions

**Features:**
- EXIF metadata extraction (camera info, GPS, date/time)
- LLM Vision image descriptions (requires `llm_client` and `llm_model`)
- Dimension reporting

**Usage Example:**
```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail."
)
result = md.convert("photo.jpg")
print(result.markdown)
```

Source: [packages/markitdown/src/markitdown/converters/_image_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_image_converter.py)

### Audio

Audio files are converted by extracting metadata and optionally transcribing speech.

**Supported Extensions:** `.mp3`, `.wav`, `.m4a`, `.ogg`, `.flac`

**Dependencies:** `pydub`, `SpeechRecognition` (for transcription)

**Features:**
- EXIF/metadata extraction
- Speech-to-text transcription using Google Speech Recognition
- Format detection

**Known Limitations:**
- On Linux systems, a `RuntimeWarning` may appear if `ffmpeg` or `avconv` is not installed (see [Issue #1685](https://github.com/microsoft/markitdown/issues/1685)). Install ffmpeg to resolve:
  ```bash
  # Ubuntu/Debian
  sudo apt-get install ffmpeg
  # macOS
  brew install ffmpeg
  ```

Source: [packages/markitdown/src/markitdown/converters/_audio_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_audio_converter.py)

## Web Formats

### HTML

HTML files are converted to Markdown using BeautifulSoup with customizable rendering options.

**Supported Extensions:** `.html`, `.htm`

**Accepted MIME Types:** `text/html`, `application/xhtml+xml`

Source: [packages/markitdown/src/markitdown/converters/_html_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_html_converter.py)

### Wikipedia

Wikipedia pages are specially handled to extract only the main article content, stripping navigation and sidebar elements.

**Accepted URLs:** `*.wikipedia.org/*`

**Features:**
- Article title extraction
- Main content extraction (excluding sidebars, navigation)
- HTML conversion pipeline

Source: [packages/markitdown/src/markitdown/converters/_wikipedia_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py)

### RSS and Atom Feeds

RSS and Atom feeds are converted with each item represented as a section in the output.

**Supported Extensions:** `.rss`, `.atom`, `.xml`

**Accepted MIME Types:**
- Precise: `application/rss`, `application/rss+xml`, `application/atom`, `application/atom+xml`
- Candidate: `text/xml`, `application/xml`

Source: [packages/markitdown/src/markitdown/converters/_rss_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_rss_converter.py)
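
The split between precise and candidate MIME types suggests a two-tier acceptance check: precise types are accepted outright, while candidate types require a look at the content. The sketch below illustrates that idea with `xml.etree.ElementTree`; `accepts_feed` is a hypothetical function, and the root-element test is an assumption rather than the converter's actual logic.

```python
import xml.etree.ElementTree as ET

PRECISE_TYPES = {
    "application/rss", "application/rss+xml",
    "application/atom", "application/atom+xml",
}
CANDIDATE_TYPES = {"text/xml", "application/xml"}


def accepts_feed(mimetype: str, data: bytes) -> bool:
    """Two-tier check: trust precise types, inspect candidate types."""
    mimetype = mimetype.lower()
    if mimetype in PRECISE_TYPES:
        return True
    if mimetype not in CANDIDATE_TYPES:
        return False
    try:
        # For untrusted input, prefer a hardened parser such as defusedxml.
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    # Strip an XML namespace such as "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in ("rss", "feed")


print(accepts_feed("text/xml", b"<rss version='2.0'><channel/></rss>"))
print(accepts_feed("text/xml", b"<html/>"))
```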

## Data Formats

### CSV

CSV files are converted to Markdown tables with automatic header detection.

**Supported Extensions:** `.csv`

Source: [packages/markitdown/src/markitdown/converters/_csv_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_csv_converter.py)
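
The shape of the output can be illustrated with the `csv` module. This is a minimal sketch, not the converter's implementation: `csv_to_markdown` is a hypothetical function, and it simply assumes the first row is the header.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, treating row 1 as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    lines = []
    for i, row in enumerate(rows):
        # Pad or trim so every row has as many cells as the header.
        cells = (row + [""] * width)[:width]
        # Escape pipes so cell content cannot break the table.
        cells = [c.replace("|", "\\|") for c in cells]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:
            lines.append("| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


print(csv_to_markdown("name,qty\nwidget,3\ngadget,5\n"))
```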

### JSON

JSON files are converted with basic formatting to maintain readability.

**Supported Extensions:** `.json`

Source: [packages/markitdown/src/markitdown/converters/_json_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_json_converter.py)

### XML

XML files are parsed and converted to readable Markdown format.

**Supported Extensions:** `.xml`

Source: [packages/markitdown/src/markitdown/converters/_xml_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_xml_converter.py)

## Archive Formats

### ZIP Files

ZIP archives are recursively processed, with each contained file converted using the appropriate converter.

**Supported Extensions:** `.zip`

**Accepted MIME Types:** `application/zip`

**Features:**
- Recursive conversion of all contained files
- File path preservation in output headings
- Support for nested archives

**Output Format:**
```markdown
Content from the zip file `example.zip`:

## File: docs/readme.txt

[Content of readme.txt]

## File: images/example.jpg

ImageSize: 1920x1080
Description: [Image description]

## File: data/report.xlsx

[Converted Excel content]
```

Source: [packages/markitdown/src/markitdown/converters/_zip_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_zip_converter.py)
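
The output layout above can be reproduced with the standard `zipfile` module. The sketch below is illustrative only: `zip_to_markdown` and the `convert_member` callback are hypothetical names, and the callback stands in for whatever converter would handle each member.

```python
import io
import zipfile


def zip_to_markdown(data: bytes, name: str, convert_member) -> str:
    """Walk a ZIP archive and emit one '## File:' section per converted member.

    convert_member(path, raw_bytes) is a hypothetical callback that returns
    Markdown text, or None when no converter handles the member.
    """
    parts = [f"Content from the zip file `{name}`:"]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = convert_member(info.filename, archive.read(info))
            if text is None:  # unsupported member: skip it
                continue
            parts.append(f"## File: {info.filename}\n\n{text.strip()}")
    return "\n\n".join(parts)


def plain_text_only(path: str, raw: bytes):
    # Toy converter for the demo: handles .txt members only.
    return raw.decode("utf-8") if path.endswith(".txt") else None


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("docs/readme.txt", "Hello from the archive.\n")
    zf.writestr("images/logo.bin", b"\x00\x01")

print(zip_to_markdown(buffer.getvalue(), "example.zip", plain_text_only))
```

Nested archives could be handled by having the callback call `zip_to_markdown` again for `.zip` members.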

## eBook and Notebook Formats

### EPUB

EPUB e-books are converted with chapter-by-chapter extraction.

**Supported Extensions:** `.epub`

**Features:**
- Chapter extraction
- Content preservation
- Metadata extraction

Source: [packages/markitdown/src/markitdown/converters/_epub_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_epub_converter.py)

### Jupyter Notebooks

Jupyter notebooks (`.ipynb` files) are converted preserving both code cells and markdown cells.

**Supported Extensions:** `.ipynb`

**Known Limitations:**
- `IpynbConverter.accepts()` may raise `UnicodeDecodeError` on files containing non-ASCII bytes (see [Issue #1894](https://github.com/microsoft/markitdown/issues/1894)). This is particularly relevant when processing files created from non-English PDFs.

Source: [packages/markitdown/src/markitdown/converters/_ipynb_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_ipynb_converter.py)

## URL-Based Conversions

### YouTube Videos

YouTube video URLs can be converted to extract transcripts and metadata.

**Supported URLs:** YouTube video URLs of the form `https://www.youtube.com/watch?v=VIDEO_ID`

**Features:**
- Automatic transcript extraction
- Video metadata (title, description)
- Subtitle/script preservation

**Usage:**
```bash
markitdown "https://www.youtube.com/watch?v=VIDEO_ID"
```

Source: [packages/markitdown/src/markitdown/converters/_youtube_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_youtube_converter.py)

## Cloud-Based Conversions

### Azure Document Intelligence

For PDFs requiring advanced layout analysis, Azure Document Intelligence provides higher-quality extraction.

**Installation:** `pip install 'markitdown[az-doc-intel]'`

**Usage:**
```bash
markitdown document.pdf -d -e "<document_intelligence_endpoint>"
```

Source: [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)

### Azure Content Understanding

Azure Content Understanding provides structured field extraction with YAML front matter output.

**Installation:** `pip install 'markitdown[az-content-understanding]'`

**Supported File Types via CU:**
- PDF (with structured extraction)
- Images
- Audio
- Video

Source: [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

## Plugin-Extended Formats

### OCR Plugin (markitdown-ocr)

The `markitdown-ocr` plugin extends PDF, DOCX, PPTX, and XLSX converters with LLM Vision OCR for embedded images.

**Installation:**
```bash
pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client
```

**Supported Formats with OCR:**
- PDF (embedded images and scanned pages)
- DOCX (inline images)
- PPTX (inline images)
- XLSX (inline images)

**Usage:**
```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

**Important:** Some installed versions of the `markitdown` CLI reject `--llm-client` and `--llm-model` as unrecognized arguments (see [Issue #1897](https://github.com/microsoft/markitdown/issues/1897)). If that happens, configure the plugin through the Python API instead.

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## Unsupported Formats

The following formats are explicitly not supported:

| Format | Extension | Notes |
|--------|-----------|-------|
| Legacy Word | `.doc` | Only `.docx` is supported (see [Issue #23](https://github.com/microsoft/markitdown/issues/23)) |
| OneNote | `.one` | No current support (see [Issue #47](https://github.com/microsoft/markitdown/issues/47)) |
| RTF | `.rtf` | Requires a custom plugin |

## Converter Priority System

Each converter is registered with a priority value that determines when it is tried during conversion:

| Priority Value | Meaning | Example |
|----------------|---------|---------|
| `-1.0` (Plugin) | Runs before built-in converters | markitdown-ocr converters |
| `PRIORITY_SPECIFIC_FILE_FORMAT` (`0.0`) | Specific file format | DOCX, PDF converters |
| `PRIORITY_GENERIC_FILE_FORMAT` (`10.0`) | Generic, near catch-all formats | ZIP, HTML, plain text converters |

Converters with lower priority values are tried first. The first converter whose `accepts()` returns `True` is used for conversion.
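
A minimal sketch of priority-ordered dispatch follows. It assumes that lower values are tried first, which is the ordering implied by plugin converters at `-1.0` running ahead of the built-in ones. The `Registration` class, `pick_converter`, and the `10.0` value given to the catch-all converter are illustrative and are not MarkItDown's internal types.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    name: str
    priority: float
    accepts: Callable[[str], bool]


def pick_converter(registrations: List[Registration], filename: str) -> Optional[str]:
    """Return the name of the first converter, in priority order, that accepts the file."""
    # sorted() is stable, so registrations with equal priority keep their list order.
    for reg in sorted(registrations, key=lambda r: r.priority):
        if reg.accepts(filename):
            return reg.name
    return None


registrations = [
    Registration("plain-text", 10.0, lambda f: True),           # catch-all
    Registration("pdf", 0.0, lambda f: f.endswith(".pdf")),     # built-in
    Registration("pdf-with-ocr", -1.0, lambda f: f.endswith(".pdf")),  # plugin
]

print(pick_converter(registrations, "report.pdf"))  # pdf-with-ocr
print(pick_converter(registrations, "notes.txt"))   # plain-text
```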

## Feature Comparison by Format

| Format | Text | Tables | Images | Math | Metadata | Notes |
|--------|------|--------|--------|------|----------|-------|
| PDF | ✓ | ✓ | ✓* | - | ✓ | *With OCR plugin |
| DOCX | ✓ | ✓ | ✓ | ✓ | - | Math via OMML→LaTeX |
| PPTX | ✓ | ✓ | ✓ | - | - | Slide-based |
| XLSX | ✓ | ✓ | ✓ | - | - | Sheet-based |
| Images | - | - | ✓ | - | ✓ | EXIF metadata |
| Audio | ✓** | - | - | - | ✓ | **Transcription |
| HTML | ✓ | ✓ | ✓ | - | - | |
| EPUB | ✓ | ✓ | ✓ | - | ✓ | |
| CSV | - | ✓ | - | - | - | |
| JSON | ✓ | - | - | - | - | Formatted |

## See Also

- [Installation Guide](#installation) - Installing MarkItDown and dependencies
- [CLI Usage](#cli-usage) - Command-line interface reference
- [Python API](#python-api) - Python library usage
- [Plugin Development](#plugin-development) - Creating custom converters
- [Azure Integrations](#azure-integrations) - Document Intelligence and Content Understanding

---

<a id='azure-integrations'></a>

## Azure Integrations

### Related Pages

Related topics: [Python API Reference](#python-api), [Supported File Formats](#supported-formats)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)
- [packages/markitdown/src/markitdown/converters/_cu_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
</details>

MarkItDown provides two Azure-based integration options for enhanced document conversion: **Azure Document Intelligence** and **Azure Content Understanding**. Both integrations leverage cloud-based AI services to provide higher-quality extraction than built-in offline converters, but they serve different use cases and offer distinct capabilities.

## Overview

MarkItDown's Azure integrations are implemented as separate converter classes that can be enabled when needed. They rely on optional dependencies that must be installed explicitly:

```bash
# Document Intelligence
pip install 'markitdown[az-doc-intel]'

# Content Understanding
pip install 'markitdown[az-content-understanding]'
```

```mermaid
graph TD
    A[Input Document] --> B{MarkItDown Instance}
    B --> C{Feature Flag Check}
    
    C -->|--use-docintel| D[Document Intelligence Converter]
    C -->|--use-cu| E[Content Understanding Converter]
    C -->|No Azure flag| F[Built-in Converters]
    
    D --> G[Azure Document Intelligence Service]
    E --> H[Azure Content Understanding Service]
    
    F --> I[Offline PDF/Office Extractors]
    
    G --> J[Markdown Output]
    H --> K[Markdown + YAML Front Matter]
    I --> J
```

Source: [packages/markitdown/src/markitdown/__main__.py:86-130](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

## Azure Document Intelligence

Azure Document Intelligence (formerly Form Recognizer) provides cloud-based layout analysis and OCR for document conversion. It is particularly useful for scanned PDFs and complex document layouts.

### Supported File Types

| File Type | Description | Notes |
|-----------|-------------|-------|
| `pdf` | PDF documents | Full OCR support for scanned documents |
| `docx` | Word documents | Enhanced layout preservation |
| `pptx` | PowerPoint presentations | Slide structure extraction |
| `xlsx` | Excel spreadsheets | Table and data extraction |
| `html` | HTML documents | Markup interpretation |

Source: [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:1-50](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)

### Installation and Configuration

1. Create an Azure Document Intelligence resource in the Azure portal
2. Obtain the endpoint URL and API key (or configure managed identity)

```bash
pip install 'markitdown[az-doc-intel]'
```

### CLI Usage

```bash
markitdown document.pdf -o output.md --use-docintel -e "https://<your-resource>.cognitiveservices.azure.com/"
```

| Argument | Short | Description |
|----------|-------|-------------|
| `--use-docintel` | `-d` | Enable Document Intelligence converter |
| `--endpoint` | `-e` | Document Intelligence endpoint URL |

Source: [packages/markitdown/src/markitdown/__main__.py:86-100](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### Python API

```python
from markitdown import MarkItDown

# Endpoint only; credentials are resolved automatically (see Authentication)
md = MarkItDown(docintel_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("document.pdf")
print(result.text_content)
```

The `MarkItDown` class accepts the following Document Intelligence keyword arguments:

| Parameter | Type | Description |
|-----------|------|-------------|
| `docintel_endpoint` | `str` | Azure Document Intelligence endpoint URL |
| `docintel_credential` | `AzureKeyCredential` or `TokenCredential` | Credential used to authenticate; a default credential is resolved when omitted |
| `docintel_file_types` | list of `DocumentIntelligenceFileType` | File types routed to Document Intelligence |
| `docintel_api_version` | `str` | Document Intelligence API version |

Source: [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:100-150](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)

### Internal Architecture

The Document Intelligence converter (`_doc_intel_converter.py`) operates as follows:

```mermaid
sequenceDiagram
    participant App as MarkItDown
    participant DI as DocumentIntelligenceClient
    participant Azure as Azure Document Intelligence Service
    
    App->>DI: Create client with endpoint + credentials
    App->>DI: begin_analyze_document(file_stream, features=[OCR_HIGH_RESOLUTION, STYLE_FONT])
    DI->>Azure: POST request with document
    Azure-->>DI: AnalyzeResult with markdown content
    DI-->>App: DocumentConverterResult with markdown
```

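The converter requests optional analysis features through the `DocumentAnalysisFeature` enum, for example high-resolution OCR and font style extraction:

```python
from azure.ai.documentintelligence.models import DocumentAnalysisFeature

# The enum has no plain OCR member; high-resolution OCR is an opt-in feature
features = [DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, DocumentAnalysisFeature.STYLE_FONT]
```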

Source: [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:50-100](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)

## Azure Content Understanding

Azure Content Understanding provides higher-quality, multi-modal extraction with structured field output. It supports documents, images, audio, and video files through prebuilt or custom analyzers.

### Key Capabilities

| Capability | Description |
|------------|-------------|
| **Multi-modal support** | Documents, images, audio, and video |
| **Structured field extraction** | YAML front matter from analyzer fields |
| **Prebuilt analyzers** | Domain-specific extraction (search, contracts, etc.) |
| **Custom analyzers** | User-defined extraction patterns |
| **Cloud-based OCR** | Higher-quality text recognition for scanned documents |

Source: [packages/markitdown/src/markitdown/converters/_cu_converter.py:1-50](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)

### When to Use Content Understanding

| Use Case | Recommendation |
|----------|----------------|
| Scanned PDFs requiring high-quality OCR | Use Content Understanding |
| Complex tables and multi-page documents | Use Content Understanding |
| Audio transcription | Use Content Understanding |
| Video analysis | Use Content Understanding |
| Domain-specific field extraction | Use Content Understanding with custom analyzer |
| Simple text extraction | Use built-in converters |
| Basic audio transcription | Built-in converters are sufficient |

### Installation

```bash
pip install 'markitdown[az-content-understanding]'
```

### CLI Usage

```bash
markitdown document.pdf --use-cu --cu-endpoint "https://<your-resource>.cognitiveservices.azure.com/"
```

| Argument | Description |
|----------|-------------|
| `--use-cu` | Enable Content Understanding converter |
| `--cu-endpoint` | Content Understanding endpoint URL |

Source: [packages/markitdown/src/markitdown/__main__.py:100-130](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### Python API

**Zero-config usage (auto-selects analyzer):**

```python
from markitdown import MarkItDown

md = MarkItDown(cu_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("report.pdf")
print(result.markdown)
```

**With a custom analyzer:**

```python
from markitdown import MarkItDown
from markitdown.converters._cu_converter import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    cu_file_types=[ContentUnderstandingFileType.PDF],
    cu_analyzer_id="your-custom-analyzer-id"
)
result = md.convert("contract.pdf")
print(result.markdown)
```

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

### Supported File Types

The Content Understanding converter supports automatic analyzer selection based on file type:

| File Type | Auto-selected Analyzer |
|-----------|----------------------|
| PDF, DOCX, PPTX | `prebuilt-documentSearch` |
| Images (JPG, PNG) | `prebuilt-documentSearch` |
| Video (MP4, MOV) | `prebuilt-videoSearch` |
| Audio (WAV, MP3) | `prebuilt-audioSearch` |
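
The table can be read as a lookup from file extension to analyzer ID. The sketch below encodes exactly the rows above. The dictionary and the `default_analyzer` helper are illustrative and are not the converter's own selection code, which may inspect MIME types or cover more extensions.

```python
from pathlib import Path
from typing import Optional

# Extension-to-analyzer mapping transcribed from the table above.
ANALYZER_BY_EXTENSION = {
    ".pdf": "prebuilt-documentSearch",
    ".docx": "prebuilt-documentSearch",
    ".pptx": "prebuilt-documentSearch",
    ".jpg": "prebuilt-documentSearch",
    ".png": "prebuilt-documentSearch",
    ".mp4": "prebuilt-videoSearch",
    ".mov": "prebuilt-videoSearch",
    ".wav": "prebuilt-audioSearch",
    ".mp3": "prebuilt-audioSearch",
}


def default_analyzer(filename: str) -> Optional[str]:
    """Return the analyzer ID for a file name, or None if the type is not listed."""
    return ANALYZER_BY_EXTENSION.get(Path(filename).suffix.lower())


print(default_analyzer("report.pdf"))
print(default_analyzer("clip.MOV"))
print(default_analyzer("README"))
```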

### Output Format

Content Understanding output includes YAML front matter with extracted fields:

```yaml
---
extracted_fields:
  - field_name: value
  - another_field: data
---
# Markdown content follows
```

The converter uses `to_llm_input()` from the Azure CU SDK to serialize analyzer fields as YAML front matter.

Source: [packages/markitdown/src/markitdown/converters/_cu_converter.py:50-150](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_cu_converter.py)
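
Downstream code often needs the front matter and the Markdown body separately. The following sketch splits them with plain string handling; `split_front_matter` is a hypothetical helper, and it assumes the front matter is delimited by `---` lines as in the example above.

```python
from typing import Optional, Tuple


def split_front_matter(markdown: str) -> Tuple[Optional[str], str]:
    """Return (front_matter, body); front_matter is None when no block is present."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return front, body
    return None, markdown  # unterminated block: treat everything as Markdown


SAMPLE = (
    "---\n"
    "extracted_fields:\n"
    "  - field_name: value\n"
    "---\n"
    "# Markdown content follows\n"
)

front, body = split_front_matter(SAMPLE)
print(front)
print(body)
```

The returned `front` string can then be passed to a YAML parser such as PyYAML's `yaml.safe_load` if structured access to the fields is needed.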

## Comparison: Built-in vs. Azure Integrations

| Feature | Built-in Converters | Document Intelligence | Content Understanding |
|---------|---------------------|----------------------|----------------------|
| **Processing** | Offline, local | Cloud-based | Cloud-based |
| **PDF OCR** | None (text layer only; OCR requires the `markitdown-ocr` plugin) | High-quality | High-quality |
| **Video support** | None | None | Yes |
| **Audio support** | Basic transcription | Not supported | Full analysis |
| **Structured fields** | None | Not exposed | YAML front matter |
| **Custom analyzers** | No | No | Yes |
| **Cost** | Local compute only | Per-API call | Per-API call |
| **Dependencies** | No Azure SDK packages | `azure-ai-documentintelligence` | `azure-ai-contentunderstanding` |

## Authentication

Both Azure integrations support multiple authentication methods:

### API Key Authentication

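```python
from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://<resource>.cognitiveservices.azure.com/",
    docintel_credential=AzureKeyCredential("your-api-key"),
)
```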

### Azure Managed Identity

The converters use `DefaultAzureCredential` when no explicit credentials are provided, which automatically discovers credentials from:

- Environment variables
- Managed identity (Azure resources)
- Visual Studio Code credentials
- Azure CLI credentials
- Interactive browser login (excluded from `DefaultAzureCredential` by default; must be enabled explicitly)

Source: [packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:20-40](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py)

## Security Considerations

> [!IMPORTANT]
> Azure integrations perform I/O with the privileges of the current process. When using cloud-based converters, document content is transmitted to Azure services for processing. Ensure that:
> - Input files are from trusted sources
> - Network connections to Azure endpoints are secure
> - Appropriate Azure role-based access controls are configured

The CLI entry point (`__main__.py`), rather than the `MarkItDown` class, checks that an endpoint is supplied whenever `--use-docintel` is set:

```python
if args.use_docintel:
    if args.endpoint is None:
        _exit_with_error("Document Intelligence Endpoint is required...")
```

Source: [packages/markitdown/src/markitdown/__main__.py:86-100](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
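
The same "flag requires endpoint" rule can be expressed in a standalone `argparse` program. This is a sketch for illustration, not a copy of `__main__.py`: the program name, the error text, and the `parse_args` helper are assumptions.

```python
import argparse


def parse_args(argv):
    """Parse a reduced set of CLI arguments and enforce the endpoint requirement."""
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename")
    parser.add_argument("-d", "--use-docintel", action="store_true")
    parser.add_argument("-e", "--endpoint")
    args = parser.parse_args(argv)
    if args.use_docintel and args.endpoint is None:
        # parser.error() prints a usage message and exits with status 2.
        parser.error("--use-docintel requires --endpoint")
    return args


args = parse_args(["doc.pdf", "-d", "-e", "https://example.invalid/"])
print(args.use_docintel, args.endpoint)
```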

## Common Issues

### Missing Endpoint Error

If the CLI exits with an error about a missing endpoint, or rejects an Azure flag as unrecognized, check that each flag is paired with its endpoint argument:

```bash
# Correct: --use-docintel requires --endpoint
markitdown doc.pdf --use-docintel --endpoint "https://..."

# Correct: --use-cu requires --cu-endpoint
markitdown doc.pdf --use-cu --cu-endpoint "https://..."
```

A similar "unrecognized arguments" report, concerning the OCR plugin's `--llm-client` and `--llm-model` flags, is tracked in [GitHub Issue #1897](https://github.com/microsoft/markitdown/issues/1897).

### Missing Dependency Errors

If you see import errors for Azure packages, verify the optional dependencies are installed:

```bash
# For Document Intelligence
pip install 'markitdown[az-doc-intel]'

# For Content Understanding
pip install 'markitdown[az-content-understanding]'
```

## See Also

- [README.md](https://github.com/microsoft/markitdown/blob/main/README.md) - Main project documentation
- [markitdown-ocr Plugin](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md) - LLM Vision OCR for embedded images
- [Sample Plugin](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md) - Creating custom converters
- [Azure Document Intelligence Documentation](https://learn.microsoft.com/azure/ai-services/document-intelligence/)
- [Azure Content Understanding Documentation](https://learn.microsoft.com/azure/ai-services/content-understanding/)

---

<a id='ocr-plugin'></a>

## OCR Plugin

### Related Pages

Related topics: [Supported File Formats](#supported-formats), [Plugin Development Guide](#plugin-development)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
- [packages/markitdown-ocr/src/markitdown_ocr/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_plugin.py)
- [packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py)
- [packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py)
- [packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py)
- [packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py)
- [packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py)
</details>

The **MarkItDown OCR Plugin** (`markitdown-ocr`) is an official plugin that adds Optical Character Recognition (OCR) capabilities to MarkItDown. It extends the built-in converters for PDF, DOCX, PPTX, and XLSX files to extract text from embedded images using LLM Vision, providing enhanced document conversion for files containing scanned content or images with text.

## Overview

MarkItDown's built-in converters handle native text extraction from documents, but many documents contain text embedded in images rather than as accessible text. The OCR plugin addresses this gap by:

- Extracting images from document formats (PDF, DOCX, PPTX, XLSX)
- Sending images to LLM Vision models for text extraction
- Inserting extracted text inline with document content in reading order
- Providing full-page OCR fallback for scanned PDFs

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## Features

The OCR plugin provides the following capabilities:

| Feature | Description |
|---------|-------------|
| **Enhanced PDF Converter** | Extracts text from images within PDFs, with full-page OCR fallback for scanned documents |
| **Enhanced DOCX Converter** | OCR for images embedded in Word documents |
| **Enhanced PPTX Converter** | OCR for images embedded in PowerPoint presentations |
| **Enhanced XLSX Converter** | OCR for images embedded in Excel spreadsheets |
| **Context Preservation** | Maintains document structure and flow when inserting extracted text |
| **LLM Vision Integration** | Uses OpenAI-compatible LLM clients for image-to-text conversion |
| **Malformed PDF Handling** | Retries problematic PDFs with PyMuPDF when pdfplumber/pdfminer fail |

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## Architecture

### Plugin Registration Flow

The OCR plugin uses MarkItDown's plugin architecture with a priority-based replacement strategy. Converters are registered at priority `-1.0`, which causes them to run **before** the built-in converters at priority `0.0`, effectively replacing the standard conversion behavior when the plugin is enabled.

```mermaid
graph TD
    A[User Creates MarkItDown Instance<br/>enable_plugins=True] --> B[MarkItDown Discovers Plugin<br/>via markitdown.plugin entry point]
    B --> C[Calls register_converters<br/>with all kwargs]
    C --> D[Plugin Creates LLMVisionOCRService<br/>from llm_client/llm_model]
    D --> E[Registers 4 OCR Converters<br/>at priority -1.0]
    E --> F[Built-in Converters Remain<br/>at priority 0.0 as fallback]
    
    G[File Conversion Request] --> H{Which Converter<br/>Accepts First?}
    H -->|OCR Converter| I[Extract Images from Document]
    H -->|Built-in Converter| J[Standard Conversion]
    I --> K[Send Images to LLM Vision]
    K --> L[Insert Extracted Text Inline]
    L --> M[Return Markdown Result]
    J --> N[Return Standard Markdown]
```

Source: [packages/markitdown-ocr/src/markitdown_ocr/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_plugin.py)

### Component Overview

| Component | File | Purpose |
|-----------|------|---------|
| `LLMVisionOCRService` | `_ocr_service.py` | Core OCR service that handles LLM Vision API calls |
| `PdfConverterWithOCR` | `_pdf_converter_with_ocr.py` | Enhanced PDF converter with image extraction and full-page OCR |
| `DocxConverterWithOCR` | `_docx_converter_with_ocr.py` | Enhanced DOCX converter with image extraction |
| `PptxConverterWithOCR` | `_pptx_converter_with_ocr.py` | Enhanced PPTX converter with image extraction |
| `XlsxConverterWithOCR` | `_xlsx_converter_with_ocr.py` | Enhanced XLSX converter with image extraction |

Source: [packages/markitdown-ocr/src/markitdown_ocr/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_plugin.py)

## Installation

### Prerequisites

- Python 3.10 or higher
- MarkItDown core package installed
- An OpenAI-compatible LLM client (e.g., `openai`, `AzureOpenAI`)

### Install the Plugin

```bash
pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client
```

To verify installation and see available plugins:

```bash
markitdown --list-plugins
```

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## Usage

### Command-Line Interface

Enable the OCR plugin using the `--use-plugins` flag along with LLM configuration:

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

> [!IMPORTANT]
> The `--llm-client` and `--llm-model` arguments must be passed when using the OCR plugin via CLI. Without an `llm_client`, the plugin loads but OCR is silently skipped, falling back to the standard built-in converter.

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

### Python API

#### Basic Usage with OpenAI

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

#### Using Azure OpenAI

```python
from markitdown import MarkItDown
from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="your-api-key",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.docx")
print(result.text_content)
```

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

#### Custom Extraction Prompt

Override the default prompt for specialized document types:

```python
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)

result = md.convert("document_with_tables.pdf")
```

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

#### Fallback Behavior

If no `llm_client` is provided, the plugin still loads but OCR is silently skipped:

```python
# Plugin loads but OCR is skipped - falls back to standard converter
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")  # Standard conversion without OCR
```

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## Configuration Options

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `enable_plugins` | `bool` | Yes | `False` | Enable plugin loading |
| `llm_client` | `OpenAI-compatible` | Yes* | `None` | LLM client for Vision OCR |
| `llm_model` | `str` | Yes* | `None` | Model name (e.g., `gpt-4o`) |
| `llm_prompt` | `str` | No | System default | Custom prompt for text extraction |

*Required for OCR to function; without these, the plugin falls back to standard converters.

Source: [packages/markitdown-ocr/src/markitdown_ocr/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_plugin.py)

## Supported File Formats

### PDF

| PDF Type | OCR Behavior |
|----------|--------------|
| **Text-based PDFs** | Extracts embedded images and OCRs them inline with surrounding text |
| **Scanned PDFs** | Detected automatically when no extractable text exists; each page rendered at 300 DPI and sent to LLM |
| **Malformed PDFs** | Retried with PyMuPDF rendering if pdfplumber/pdfminer fail |

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

### Office Documents (DOCX, PPTX, XLSX)

For Office Open XML formats, images are extracted via document part relationships and OCR is performed before the conversion pipeline:

1. Images are extracted from the document archive
2. Each image is processed through the LLM Vision OCR service
3. Placeholder tokens are injected into the content
4. The standard conversion pipeline executes with OCR placeholders preserved

Source: [packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py)

## How It Works

### PDF Conversion Flow

```mermaid
graph TD
    A[PDF Document Input] --> B{Contains extractable text?}
    B -->|Yes| C[Extract text from PDF]
    B -->|No| D[Detected as Scanned PDF]
    C --> E{Contains embedded images?}
    E -->|Yes| F[Extract images by position]
    E -->|No| G[Return standard result]
    F --> H[Send each image to LLM Vision]
    D --> I[Render each page at 300 DPI]
    I --> H
    H --> J[Interleave extracted text<br/>with OCR results in reading order]
    J --> K[Return Markdown with<br/>OCR text inline]
    G --> K
```

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

### DOCX/PPTX/XLSX Conversion Flow

```mermaid
graph TD
    A[Office Document Input] --> B[Extract images from<br/>document part relationships]
    B --> C[Process each image<br/>through LLM Vision OCR]
    C --> D[Generate placeholder tokens<br/>MARKITDOWNOCRBLOCK{id}]
    D --> E[Inject placeholders into<br/>HTML/Document content]
    E --> F[Run standard conversion pipeline]
    F --> G[Replace placeholders with<br/>OCR-extracted text]
    G --> H[Return Markdown result]
```

Source: [packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py)
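
The placeholder round trip in the diagram can be sketched in a few lines. The example assumes the `{id}` part of `MARKITDOWNOCRBLOCK{id}` is a running integer and that images appear as `<img src="...">` tags in intermediate HTML. Both are assumptions for illustration, and the helper names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

IMG_TAG = re.compile(r'<img\b[^>]*\bsrc="([^"]+)"[^>]*>')
TOKEN = re.compile(r"MARKITDOWNOCRBLOCK(\d+)")


def inject_placeholders(html: str, ocr_by_src: Dict[str, str]) -> Tuple[str, List[str]]:
    """Swap each <img> that has OCR text for a numbered placeholder token."""
    texts: List[str] = []

    def swap(match):
        text = ocr_by_src.get(match.group(1))
        if text is None:
            return match.group(0)  # no OCR result: leave the tag untouched
        texts.append(text)
        return f"MARKITDOWNOCRBLOCK{len(texts) - 1}"

    return IMG_TAG.sub(swap, html), texts


def restore_placeholders(markdown: str, texts: List[str]) -> str:
    """Replace tokens by index; matching the full number avoids BLOCK1/BLOCK10 clashes."""
    return TOKEN.sub(lambda m: texts[int(m.group(1))], markdown)


html = '<p>Intro</p><img src="chart.png"><p>Outro</p>'
with_tokens, texts = inject_placeholders(html, {"chart.png": "Q3 revenue: 42"})
# Stand-in for the real HTML-to-Markdown step: turn <p> tags into newlines.
markdown = re.sub(r"</?p>", "\n", with_tokens).strip()
print(restore_placeholders(markdown, texts))
```

Using opaque tokens lets the standard conversion pipeline run unchanged, since the tokens pass through it as ordinary text.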

## Common Issues and Troubleshooting

### Unrecognized Arguments Error

If the CLI reports an "Unrecognized Arguments" error for `--llm-client` or `--llm-model`, the installed `markitdown` CLI does not define those flags. The CLI examples in the documentation may not match every released version, so check which versions of `markitdown` and `markitdown-ocr` are installed.

**Workaround**: Use the Python API for more reliable argument handling.

Source: [Community Issue #1897](https://github.com/microsoft/markitdown/issues/1897)

### Office Open XML Validation

When converting invalid DOCX, XLSX, or PPTX files, MarkItDown may return a successful result containing the message `"This is not a valid Office Open XML file."` in `text_content` rather than raising an exception. This is a known limitation of the underlying converters.

**Workaround**: Validate files before conversion or check `text_content` for error strings.

Source: [Community Issue #1408](https://github.com/microsoft/markitdown/issues/1408)
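
Both halves of this workaround can be sketched with the standard library. The helper names below are hypothetical. The pre-check relies on the fact that Office Open XML packages are ZIP archives containing a `[Content_Types].xml` part; it is a cheap sanity test, not a full validation.

```python
import zipfile

OOXML_ERROR = "This is not a valid Office Open XML file."


def looks_like_ooxml(path: str) -> bool:
    """Pre-check: the file is a ZIP archive with a [Content_Types].xml part."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "[Content_Types].xml" in archive.namelist()


def is_error_result(text_content: str) -> bool:
    """Post-check: detect the error message returned as ordinary content."""
    return OOXML_ERROR in text_content
```

Calling `looks_like_ooxml()` before conversion avoids the silent failure, and `is_error_result(result.text_content)` catches it afterwards for inputs that cannot be inspected in advance.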

### LLM Call Failures

If an LLM call fails during OCR processing, the conversion continues without that specific image's text. The plugin is designed to be resilient to partial failures.

Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
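
The behavior described above amounts to isolating each LLM call so that one failure cannot abort the whole conversion. A minimal sketch, assuming a hypothetical `describe` callable that wraps the LLM request; it is not the plugin's actual code.

```python
from typing import Callable, Iterable


def ocr_images(images: Iterable[bytes], describe: Callable[[bytes], str]) -> list[str]:
    """Run `describe` on each image, skipping images whose call fails."""
    texts = []
    for image in images:
        try:
            texts.append(describe(image))
        except Exception:
            # A failed call drops only this image's text; the rest still convert.
            continue
    return texts
```

In a real integration you would probably log the exception before continuing, so that missing image text can be traced back to the failed call.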

## Memory Management

Recent releases (v0.1.6+) address O(n) memory growth during PDF conversion by properly calling `page.close()` after processing each PDF page. Ensure you are running the latest version for optimal memory efficiency.

Source: [Release Notes v0.1.6](https://github.com/microsoft/markitdown/releases/tag/v0.1.6)

## See Also

- [MarkItDown Documentation](https://github.com/microsoft/markitdown#readme): Main project documentation
- [Sample Plugin](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin): Reference implementation for creating custom plugins
- [Azure Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource): Alternative cloud-based document conversion
- [Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/): Higher-quality multi-modal extraction with structured field output
- [Plugin Development Guide](#plugin-development): Understanding MarkItDown's plugin system

---

<a id='mcp-server'></a>

## MCP Server

### Related Pages

Related topics: [Python API Reference](#python-api), [Plugin Development Guide](#plugin-development)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/README.md)
- [packages/markitdown-mcp/src/markitdown_mcp/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/src/markitdown_mcp/__main__.py)
- [packages/markitdown-mcp/src/markitdown_mcp/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/src/markitdown_mcp/__init__.py)
- [packages/markitdown-mcp/Dockerfile](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/Dockerfile)
</details>

# MCP Server

The MarkItDown MCP (Model Context Protocol) server lets AI coding assistants and LLM-powered tools call MarkItDown's document conversion through the standardized Model Context Protocol. Assistants can then process documents without custom tool implementations or external scripts.

## Overview

The MCP Server acts as a bridge between AI assistants (such as Claude, Cursor, or other MCP-compatible clients) and the MarkItDown Python library. It exposes MarkItDown's conversion functionality as MCP tools that AI assistants can invoke directly.

```mermaid
graph LR
    AI[AI Assistant<br/>Claude, Cursor, etc.] -->|MCP Protocol| MCP[MarkItDown<br/>MCP Server]
    MCP -->|Convert| MD[MarkItDown<br/>Python Library]
    MD -->|Documents| DOC[PDF, DOCX, PPTX<br/>XLSX, Images, etc.]
    DOC -->|Markdown| AI
```

**Key capabilities provided through MCP:**

- Convert local files to Markdown format
- Process files from URLs (HTTP/HTTPS)
- Process data URIs (base64-encoded content)
- Support for Azure Document Intelligence integration
- Plugin system integration for extended formats
- Configurable LLM clients for image descriptions

Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/README.md)

## Installation

### From PyPI

The MCP server can be installed as a separate package:

```bash
pip install markitdown-mcp
```

### Using Docker

The package ships a `Dockerfile` for isolated execution. Build the image from the `packages/markitdown-mcp` directory:

```bash
docker build -t markitdown-mcp:latest .
```

To run the container:

```bash
docker run -it --rm markitdown-mcp:latest
```

Source: [packages/markitdown-mcp/Dockerfile](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/Dockerfile)

### From Source

For development or customization:

```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown-mcp
```

## Configuration

### Environment Variables

The MCP server reads configuration from environment variables, allowing flexible deployment without code changes.

| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `MARKITDOWN_ENABLE_PLUGINS` | Enable 3rd-party plugin support | `false` |
| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure Document Intelligence endpoint URL | Not set |
| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure Document Intelligence API key | Not set |

The server respects the `MARKITDOWN_ENABLE_PLUGINS` environment variable during initialization, allowing plugin support to be toggled without modifying the server code.

Source: [packages/markitdown-mcp/src/markitdown_mcp/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/src/markitdown_mcp/__init__.py)
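
Reading a boolean flag from the environment can be sketched as follows. The function name is hypothetical, and the accepted truthy spellings (`true`, `1`, `yes`) are an assumption for this example; consult the server source for the exact rule it applies.

```python
import os


def plugins_enabled(environ=os.environ) -> bool:
    """Interpret MARKITDOWN_ENABLE_PLUGINS, defaulting to disabled."""
    value = environ.get("MARKITDOWN_ENABLE_PLUGINS", "false")
    # Assumed truthy spellings; anything else leaves plugins disabled.
    return value.strip().lower() in ("true", "1", "yes")
```

Defaulting to disabled means third-party code is only loaded when the deployment explicitly opts in.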

### Server Startup Options

By default the server communicates over STDIO. The following options switch to HTTP transport and control network binding:

| Option | Description |
|--------|-------------|
| `--http` | Use Streamable HTTP and SSE transport instead of STDIO |
| `--sse` | Deprecated alias for `--http` |
| `--host` | Host address to bind to (default: `127.0.0.1`); requires `--http` |
| `--port` | Port number to listen on (default: `3001`); requires `--http` |

> [!WARNING]
> The server binds to localhost by default. Binding to non-local interfaces (`0.0.0.0`) should only be done in trusted environments with proper network access controls.

Source: [packages/markitdown-mcp/src/markitdown_mcp/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/src/markitdown_mcp/__main__.py)

## MCP Tools

The server exposes a single tool to MCP clients:

### `convert_to_markdown`

Converts the resource identified by a URI to Markdown.

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uri` | string | Yes | An `http:`, `https:`, `file:`, or `data:` URI identifying the resource to convert |

**Returns:** Markdown-formatted text content extracted from the document.

Plugin support is controlled by the `MARKITDOWN_ENABLE_PLUGINS` environment variable rather than by a tool parameter.

Source: [packages/markitdown-mcp/src/markitdown_mcp/__init__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/src/markitdown_mcp/__init__.py)
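
Clients often need to turn a local path or in-memory bytes into a URI before calling the conversion tool. A minimal sketch using only the standard library; the helper names `file_uri` and `data_uri` are hypothetical.

```python
import base64
from pathlib import Path


def file_uri(path: str) -> str:
    """Build a file: URI from a local path (resolved to an absolute path first)."""
    return Path(path).resolve().as_uri()


def data_uri(payload: bytes, mimetype: str = "application/octet-stream") -> str:
    """Embed raw bytes in a base64-encoded data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mimetype};base64,{encoded}"
```

`Path.as_uri()` only accepts absolute paths, which is why the helper resolves the path first. Data URIs are convenient for small payloads but grow by roughly a third under base64 encoding.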

## Architecture

### Component Flow

```mermaid
sequenceDiagram
    participant Client as AI Assistant
    participant MCPServer as MCP Server
    participant MarkItDown as MarkItDown Core
    participant Plugin as Plugin System
    participant Azure as Azure Services
    
    Client->>MCPServer: Call convert_to_markdown(uri)
    MCPServer->>MarkItDown: Forward conversion request
    alt Plugin Enabled
        MCPServer->>Plugin: Load enabled plugins
        Plugin->>MarkItDown: Register custom converters
    end
    alt Azure Document Intelligence
        MarkItDown->>Azure: Call Document Intelligence API
        Azure-->>MarkItDown: Extracted content
    end
    MarkItDown-->>MCPServer: Markdown result
    MCPServer-->>Client: Return text_content
```

### Request Processing Pipeline

1. **Client Request**: AI assistant invokes an MCP tool with parameters
2. **Server Validation**: MCP server validates input parameters
3. **MarkItDown Initialization**: Creates `MarkItDown` instance with appropriate configuration
4. **Format Detection**: Determines file type from extension or URI scheme
5. **Conversion**: Routes to appropriate converter based on file type
6. **Result Serialization**: Returns markdown content via MCP protocol

Source: [packages/markitdown-mcp/src/markitdown_mcp/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/src/markitdown_mcp/__main__.py)

## Supported Input Formats

The MCP server inherits all format support from the core MarkItDown library:

| Category | Formats |
|----------|---------|
| **Documents** | PDF, DOCX, PPTX, XLSX, ODT, EPUB, Jupyter notebooks |
| **Images** | JPG, PNG, GIF, BMP, WebP (with OCR and EXIF metadata extraction) |
| **Audio** | MP3, WAV, FLAC (with EXIF metadata and speech transcription) |
| **Web** | HTML, Wikipedia pages |
| **Data** | CSV, JSON, XML, RSS/Atom feeds |
| **Archives** | ZIP (iterates over contents) |
| **Video** | Via Azure Content Understanding (when configured) |

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)

## Integration with Azure Services

### Azure Document Intelligence

When `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` is configured, the MCP server can leverage Azure's cloud-based document extraction for higher quality results:

```bash
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-api-key"
```

This is particularly beneficial for:
- Scanned PDF documents
- Complex table structures
- Handwritten content
- Multi-language documents

Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/README.md)

### LLM-Based Image Processing

When the MCP server is used with MarkItDown's built-in LLM support, image descriptions can be generated for embedded images in documents:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
```

This same configuration pattern applies when initializing the MCP server, enabling AI assistants to get both document conversion and intelligent image descriptions.

## Security Considerations

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like `open()` or `requests.get()`, it accesses resources that the process itself can access.

**Recommendations for untrusted environments:**

1. **Sanitize inputs**: Validate file paths and URLs before passing to the MCP server
2. **Use narrow conversion functions**: Prefer `convert_stream()` or `convert_local()` when possible
3. **Restrict network access**: Run the MCP server in a sandboxed environment
4. **Limit file access**: Use container isolation (Docker) to restrict filesystem access

Source: [README.md](https://github.com/microsoft/markitdown/blob/main/README.md)
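
The first recommendation, input sanitization, might look like the sketch below. It is one possible policy, not part of MarkItDown: the `is_allowed` helper and the scheme allowlist are assumptions to adapt to your deployment. It requires Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

# Assumption: only these schemes are acceptable; tighten or widen as needed.
ALLOWED_SCHEMES = {"https", "file"}


def is_allowed(uri: str, allowed_root: Path) -> bool:
    """Accept https: URIs, and file: URIs that resolve inside allowed_root."""
    parsed = urlparse(uri)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    if parsed.scheme == "file":
        # Resolve symlinks and ".." segments before comparing against the root.
        target = Path(url2pathname(parsed.path)).resolve()
        return target.is_relative_to(allowed_root.resolve())
    return True
```

Resolving the path before the comparison is what defeats traversal attempts such as `file:///srv/docs/../../etc/passwd`.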

## Usage Examples

### Claude Desktop Integration

Add the following to your Claude Desktop configuration. Claude Desktop launches the server over STDIO, so no host or port arguments are needed:

```json
{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp",
      "args": []
    }
  }
}
```

### Python Client Usage

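```python
# Example MCP client calling the MarkItDown conversion tool over STDIO.
# Assumes the `mcp` Python SDK and `markitdown-mcp` are both installed.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the server as a subprocess and talk to it over STDIO
    server = StdioServerParameters(command="markitdown-mcp", args=[])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Tool call to convert a PDF
            result = await session.call_tool(
                "convert_to_markdown",
                arguments={"uri": "file:///path/to/document.pdf"},
            )

            # Tool results arrive as a list of content items
            markdown_content = result.content[0].text
            print(markdown_content)


asyncio.run(main())
```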

### Using with Azure Document Intelligence

```bash
# Set environment variables
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="YOUR-KEY"

# Run MCP server with Azure integration
markitdown-mcp --http --host 127.0.0.1 --port 3001
```

Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-mcp/README.md)

## Troubleshooting

### Server Connection Issues

| Symptom | Solution |
|---------|----------|
| Connection refused | Ensure the server was started with `--http` and that `--host`/`--port` match the client settings |
| Timeout errors | Check firewall rules; server binds to localhost by default |
| Plugin not found | Install plugin package and set `MARKITDOWN_ENABLE_PLUGINS=true` |

### Conversion Failures

| Error | Cause | Resolution |
|-------|-------|------------|
| Unsupported format | File type not recognized | Check supported formats list |
| Plugin error | Plugin failed during conversion | Enable debug logging; check plugin compatibility |
| Azure auth failure | Invalid credentials | Verify `AZURE_DOCUMENT_INTELLIGENCE_KEY` |

### Common Issues

**Issue**: `UnicodeDecodeError` on non-ASCII files  
**Context**: Reported in [GitHub Issue #1894](https://github.com/microsoft/markitdown/issues/1894)  
**Status**: This affects the core library; ensure you're using the latest version.

**Issue**: Office Open XML files return success with error message  
**Context**: Reported in [GitHub Issue #1408](https://github.com/microsoft/markitdown/issues/1408)  
**Note**: Invalid DOCX/XLSX/PPTX files may return `"This is not a valid Office Open XML file."` in `text_content` rather than raising an exception.

**Issue**: RuntimeWarning about ffmpeg on Linux  
**Context**: Reported in [GitHub Issue #1685](https://github.com/microsoft/markitdown/issues/1685)  
**Resolution**: Install the ffmpeg system package to enable audio conversion features.

## See Also

- [Main README](https://github.com/microsoft/markitdown/blob/main/README.md): Project overview and core documentation
- [Plugin Development Guide](#plugin-development): Guide to creating custom plugins
- [OCR Plugin](#ocr-plugin): LLM Vision OCR for embedded images
- [Azure Integrations](#azure-integrations): Advanced cloud-based extraction, including Azure Content Understanding

---

<a id='plugin-development'></a>

## Plugin Development Guide

### Related Pages

Related topics: [Architecture Overview](#architecture), [OCR Plugin](#ocr-plugin), [MCP Server](#mcp-server)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)
- [packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)
- [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py)
- [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)
- [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
- [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)
</details>

# Plugin Development Guide

This guide explains how to create custom plugins for MarkItDown to extend its document conversion capabilities. The plugin architecture allows developers to add support for new file formats or override existing converters with custom implementations.

## Overview

MarkItDown uses a plugin-based architecture that enables third-party developers to extend its document conversion capabilities. Plugins can register custom `DocumentConverter` implementations that handle specific file formats. The system supports:

- **New file format support** — Add converters for formats not natively supported (e.g., RTF, EPUB variants)
- **Converter replacement** — Override built-in converters with priority-based precedence
- **LLM integration** — Pass through LLM client credentials to enable AI-powered features in plugins

Plugins are discovered via Python entry points and loaded dynamically when `MarkItDown(enable_plugins=True)` is instantiated. Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## Architecture

The plugin system is built around the following core components:

```mermaid
graph TD
    A[User Code] -->|MarkItDown enable_plugins=True| B[MarkItDown Core]
    B -->|Discovers via entry_points| C[Plugin Entry Point Group<br/>markitdown.plugin]
    C --> D[Plugin Package<br/>markitdown_sample_plugin]
    D -->|Calls| E[register_converters function]
    E -->|Registers| F[DocumentConverter Subclass<br/>RtfConverter]
    B -->|Stores| F
    F --> G[Conversion Pipeline]
    G --> H[Markdown Output]
    
    I[LLM Client<br/>llm_client, llm_model] -.->|Forwarded via kwargs| E
```

### Component Responsibilities

| Component | Responsibility | Source |
|-----------|---------------|--------|
| `MarkItDown` | Core orchestrator, plugin discovery, converter dispatch | [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py) |
| `DocumentConverter` | Base class for all converters | [packages/markitdown/src/markitdown/_base_converter.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py) |
| Entry Point Group | Plugin discovery mechanism (`markitdown.plugin`) | [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md) |
| `register_converters()` | Plugin callback to register converters | [packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py) |

## Plugin Interface

### Interface Version

Plugins must declare the interface version they target. Currently, only version `1` is supported:

```python
__plugin_interface_version__ = 1
```

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

### Required Exports

A valid plugin package must export:

| Symbol | Type | Description |
|--------|------|-------------|
| `__plugin_interface_version__` | `int` | Plugin interface version (must be `1`) |
| `register_converters(markitdown, **kwargs)` | `function` | Called to register converters |

## Creating a Document Converter

### Step 1: Subclass DocumentConverter

Create a new converter class that inherits from `DocumentConverter`:

```python
from typing import BinaryIO, Any
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    """Converts RTF files to Markdown."""

    # No constructor is needed here: priority is supplied at registration
    # time via `MarkItDown.register_converter(converter, priority=...)`.
```

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

### Step 2: Implement the `accepts()` Method

The `accepts()` method determines whether this converter can handle a given file stream. Return `True` if the converter should process the file, and leave the stream position where you found it so that later readers see the full content:

```python
def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool:
    # Check whether the stream starts with the RTF signature "{\rtf".
    # The signature is 5 bytes long, so read exactly that many.
    signature = b"{\\rtf"
    start = file_stream.tell()
    header = file_stream.read(len(signature))
    file_stream.seek(start)  # Restore the original stream position
    return header == signature
```

### Step 3: Implement the `convert()` Method

The `convert()` method performs the actual conversion from the file format to Markdown:

```python
def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    # Read and decode the RTF content
    content = file_stream.read().decode('utf-8', errors='replace')

    # Convert RTF to Markdown. `_rtf_to_markdown` is a placeholder for
    # your own implementation-specific helper.
    markdown_content = self._rtf_to_markdown(content)

    # The result takes the Markdown text (and optionally a `title`)
    return DocumentConverterResult(markdown=markdown_content)
```

### Priority System

MarkItDown defines two priority constants that control the order in which converters are tried:

| Priority Constant | Value | Use Case |
|-------------------|-------|----------|
| `PRIORITY_SPECIFIC_FILE_FORMAT` | `0.0` | Converters for a specific format (e.g., `.docx`, `.pdf`) or specific pages (e.g., Wikipedia) |
| `PRIORITY_GENERIC_FILE_FORMAT` | `10.0` | Near catch-all converters for broad MIME types such as `text/*` |

Lower priority values are tried first. The priority is passed when registering a converter (`markitdown.register_converter(converter, priority=...)`) and defaults to `PRIORITY_SPECIFIC_FILE_FORMAT`. Converters are sorted with a stable sort and the most recently registered converter comes first among equals, so a plugin converter registered at the default priority is consulted before the built-in converters of the same priority.

Source: [packages/markitdown/src/markitdown/_markitdown.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_markitdown.py)

## Plugin Registration

### The `register_converters()` Function

Create a `register_converters()` function in your plugin package. This function is called during `MarkItDown` instantiation:

```python
from markitdown import MarkItDown

def register_converters(markitdown: MarkItDown, **kwargs):
    """
    Called during construction of MarkItDown instances to register
    converters provided by plugins.
    """
    # Simply create and attach an RtfConverter instance
    markitdown.register_converter(RtfConverter())
```

Source: [packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py)

### Handling LLM Client Credentials

Plugins can receive and use LLM credentials passed through `MarkItDown()`:

```python
def register_converters(markitdown: MarkItDown, **kwargs):
    llm_client = kwargs.get('llm_client')
    llm_model = kwargs.get('llm_model')
    llm_prompt = kwargs.get('llm_prompt')
    
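    # Use LLM credentials if available.
    # `LLMVisionConverter` and `BasicRtfConverter` are placeholder names
    # standing in for your plugin's own converter classes.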
    if llm_client and llm_model:
        converter = LLMVisionConverter(
            llm_client=llm_client,
            llm_model=llm_model,
            llm_prompt=llm_prompt
        )
    else:
        converter = BasicRtfConverter()
    
    markitdown.register_converter(converter)
```

This pattern is used by the `markitdown-ocr` plugin to enable LLM-powered OCR for images. Source: [packages/markitdown-ocr/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-ocr/README.md)

## Entry Points Configuration

Configure the entry point in your plugin's `pyproject.toml`:

```toml
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
```

| Field | Value | Description |
|-------|-------|-------------|
| Group | `"markitdown.plugin"` | Fixed entry point group name |
| Key | `sample_plugin` | Plugin identifier (can be any unique name) |
| Value | `"markitdown_sample_plugin"` | Fully qualified package name |

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## CLI Integration

### Listing Installed Plugins

Use the `--list-plugins` flag to see all installed third-party plugins:

```bash
markitdown --list-plugins
```

Output format:
```
Installed MarkItDown 3rd-party Plugins:

  * sample_plugin     (package: markitdown_sample_plugin)

Use the -p (or --use-plugins) option to enable 3rd-party plugins.
```

Source: [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)

### Enabling Plugins

Pass the `--use-plugins` flag to enable third-party plugins:

```bash
markitdown --use-plugins document.rtf -o output.md
```

### Plugin Discovery Mechanism

Plugins are discovered using Python's `importlib.metadata.entry_points()`:

```python
from importlib.metadata import entry_points

plugin_entry_points = list(entry_points(group="markitdown.plugin"))
```

Source: [packages/markitdown/src/markitdown/__main__.py](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/__main__.py)
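
Building on the discovery call above, a loader could import each entry point and keep only plugins that declare the supported interface version. This is a sketch under stated assumptions, not MarkItDown's own loader: the `discover_plugins` helper is hypothetical, and the real loader may differ, for example in how it reports plugins that fail to import. The `group=` keyword requires Python 3.10+.

```python
from importlib.metadata import entry_points


def discover_plugins(candidates=None) -> dict:
    """Load each plugin and keep those that declare interface version 1."""
    if candidates is None:
        candidates = entry_points(group="markitdown.plugin")
    plugins = {}
    for entry_point in candidates:
        module = entry_point.load()  # imports the plugin package
        if getattr(module, "__plugin_interface_version__", None) == 1:
            plugins[entry_point.name] = module
    return plugins
```

Accepting `candidates` as a parameter keeps the function testable without installing a real plugin package.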

## Complete Plugin Example

Below is a minimal but complete plugin structure:

```
markitdown_rtf_plugin/
├── pyproject.toml
└── src/
    └── markitdown_rtf_plugin/
        ├── __init__.py
        └── _plugin.py
```

### `pyproject.toml`

```toml
[project]
name = "markitdown-rtf-plugin"
version = "0.1.0"

[project.entry-points."markitdown.plugin"]
rtf = "markitdown_rtf_plugin"
```

### `src/markitdown_rtf_plugin/_plugin.py`

```python
"""RTF to Markdown converter plugin for MarkItDown."""

from typing import BinaryIO, Any
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo

__plugin_interface_version__ = 1

class RtfConverter(DocumentConverter):
    """Converts RTF files to Markdown format."""

    # RTF files begin with this 5-byte signature
    RTF_SIGNATURE = b"{\\rtf"

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        extension = (stream_info.extension or "").lower()
        if extension == ".rtf":
            return True

        # Fall back to the RTF signature, restoring the stream position afterwards
        start = file_stream.tell()
        header = file_stream.read(len(self.RTF_SIGNATURE))
        file_stream.seek(start)
        return header == self.RTF_SIGNATURE

    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        content = file_stream.read().decode('utf-8', errors='replace')
        # RTF to Markdown conversion logic here
        markdown = self._convert_rtf_to_markdown(content)

        return DocumentConverterResult(markdown=markdown)
    
    def _convert_rtf_to_markdown(self, content: str) -> str:
        # Simplified conversion logic
        # Replace RTF formatting with Markdown equivalents
        markdown = content
        # ... conversion implementation
        return markdown


def register_converters(markitdown, **kwargs):
    """Register the RTF converter with MarkItDown."""
    markitdown.register_converter(RtfConverter())
```

### `src/markitdown_rtf_plugin/__init__.py`

```python
from ._plugin import register_converters, __plugin_interface_version__
```

## Installation and Testing

### Installing the Plugin

Install the plugin in development mode:

```bash
pip install -e .
```

### Verifying Installation

List installed plugins:

```bash
markitdown --list-plugins
```

### Testing the Plugin

Convert a file using the plugin:

```bash
markitdown --use-plugins document.rtf -o output.md
```

## Finding Plugins

To discover available third-party plugins:

1. Search GitHub for the hashtag `#markitdown-plugin`
2. Check PyPI for packages with `markitdown-plugin` keyword
3. Review the [Community Plugins](#see-also) section

Source: [packages/markitdown-sample-plugin/README.md](https://github.com/microsoft/markitdown/blob/main/packages/markitdown-sample-plugin/README.md)

## Best Practices

### Error Handling

Implement robust error handling in your converter:

```python
def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    try:
        content = file_stream.read()
        markdown = self._convert_to_markdown(content)
        return DocumentConverterResult(markdown=markdown)
    except SpecificFormatError:
        # `SpecificFormatError` stands in for your parser's own exception type.
        # Re-raise so that the caller sees the failure.
        raise
    except Exception as e:
        # Log and provide a fallback
        return DocumentConverterResult(
            markdown=f"[Conversion error: {e}]"
        )
```

### Stream Position Management

Always reset stream position after reading for format detection:

```python
def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
    # Remember the current position, read to check the format, then restore it
    start = file_stream.tell()
    header = file_stream.read(512)
    file_stream.seek(start)  # Reset for next reader
    return self._is_our_format(header)
```
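
One way to verify this behavior in a unit test is to compare `tell()` before and after the read. The sketch below uses a hypothetical `peek` helper and an in-memory stream; restoring to the recorded offset instead of offset `0` also covers streams that an earlier reader has already advanced.

```python
import io


def peek(stream, size: int) -> bytes:
    """Read up to `size` bytes and restore the stream's original position."""
    start = stream.tell()
    try:
        return stream.read(size)
    finally:
        stream.seek(start)


stream = io.BytesIO(b"{\\rtf1\\ansi Hello}")
stream.read(1)            # simulate an earlier reader having advanced the stream
header = peek(stream, 4)  # inspect the next four bytes without consuming them
```

The `try`/`finally` guarantees the position is restored even if the read raises.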

### Priority Selection Guidelines

| Scenario | Recommended Priority |
|----------|---------------------|
| Adding new format support | `PRIORITY_SPECIFIC_FILE_FORMAT` (0.0, the default) |
| Overriding a built-in converter for a specific format | `PRIORITY_SPECIFIC_FILE_FORMAT`, since later registrations at equal priority are tried first, or a lower value |
| Catch-all fallback for broad MIME types | `PRIORITY_GENERIC_FILE_FORMAT` (10.0) |

## Common Issues

### Plugin Not Discovered

Ensure the entry point is correctly configured in `pyproject.toml`:

```toml
[project.entry-points."markitdown.plugin"]
your_plugin_name = "your_package_name"
```

### Converter Not Called

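1. Verify that `accepts()` returns `True` for your file type
2. Check that the priority value is low enough for your converter to be tried before competing converters
3. Ensure plugins are enabled (`MarkItDown(enable_plugins=True)` or `--use-plugins` on the CLI) so that `register_converters()` is called during `MarkItDown` instantiation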

### CLI Unrecognized Arguments

As noted in [Issue #1897](https://github.com/microsoft/markitdown/issues/1897), the CLI may not document all plugin-specific arguments. If your plugin adds CLI options, document them separately in your plugin's documentation.

## See Also

- [MarkItDown README](https://github.com/microsoft/markitdown) — Main project documentation
- [Sample Plugin Repository](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin) — Reference implementation
- [markitdown-ocr Plugin](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-ocr) — Production plugin example with LLM integration
- [Base Converter Class](https://github.com/microsoft/markitdown/blob/main/packages/markitdown/src/markitdown/_base_converter.py) — DocumentConverter API reference
- [Built-in Converters](https://github.com/microsoft/markitdown/tree/main/packages/markitdown/src/markitdown/converters) — Reference implementations for PDF, DOCX, PPTX, XLSX, and more

---


## Pitfall Log

Project: microsoft/markitdown

Summary: Found 30 structured pitfall item(s), including 4 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_f70b2e3ea5ed47418a4aeb9ef27230f9 | https://github.com/microsoft/markitdown/issues/1685

## 2. Runtime risk - Runtime risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_252ef0d45ac040688ffa066bc1b64ba0 | https://github.com/microsoft/markitdown/issues/1897

## 3. Maintenance risk - Maintenance risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_6e08b71ee29f46a98e6825a5d5b11e6e | https://github.com/microsoft/markitdown/issues/1979

## 4. Maintenance risk - Maintenance risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_439f22f47a524773808819148caadca5 | https://github.com/microsoft/markitdown/issues/1982

## 5. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- User impact: Developers may fail before the first successful local run: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception. Context: Source discussion did not expose a precise runtime context.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_087a8a7b6538b2ce2b065ade73c555af | https://github.com/microsoft/markitdown/issues/1408
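
Because the linked issue title says invalid Office Open XML files can yield a success result rather than an exception, a caller may want to validate the container before conversion. The sketch below is one possible pre-check, not part of markitdown: it relies only on the fact that an Office Open XML package is a ZIP archive containing a `[Content_Types].xml` part. The function names `looks_like_ooxml` and `make_sample_package` are hypothetical.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Return True if data is a ZIP archive that contains [Content_Types].xml.

    Hypothetical helper: a cheap structural check, not a full validation.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    try:
        with zipfile.ZipFile(buffer) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        # The trailer looked like a ZIP but the archive could not be opened.
        return False


def make_sample_package() -> bytes:
    """Build a tiny in-memory ZIP that passes the check (for illustration only)."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
    return out.getvalue()
```

A caller could reject input early when `looks_like_ooxml` returns `False` instead of inspecting the converted text for an error message.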

## 6. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: Support for .doc extensions
- User impact: Developers may fail before the first successful local run: Support for .doc extensions
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: Support for .doc extensions. Context: Observed when using Windows, Linux.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_d5a467d012987779306cb5c50725275b | https://github.com/microsoft/markitdown/issues/23

## 7. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- User impact: Developers may fail before the first successful local run: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux. Context: Observed when using Python, Windows, Linux.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_1f9167a15a1eec72c8f79514f1b70b76 | https://github.com/microsoft/markitdown/issues/1685
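
The warning quoted in the issue title indicates that neither `ffmpeg` nor `avconv` was found. A minimal sketch of an environment check, assuming the tool only needs to be discoverable on `PATH`, is shown below. The helper name `find_audio_backend` and its injectable `which` parameter are hypothetical and exist only to make the check easy to test.

```python
import shutil
from typing import Callable, Optional


def find_audio_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first audio tool found on PATH, or None.

    Hypothetical helper: checks the two executable names mentioned in the
    warning text, in that order.
    """
    for name in ("ffmpeg", "avconv"):
        path = which(name)
        if path:
            return path
    return None
```

Running `find_audio_backend()` during setup and reporting a clear message when it returns `None` may be easier to act on than a runtime warning.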

## 8. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: v0.1.0
- User impact: Upgrade or migration may change expected behavior: v0.1.0
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.0. Context: Observed when using Python.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_1d5ae6ee21225356f45c36c20024dccd | https://github.com/microsoft/markitdown/releases/tag/v0.1.0
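
Since several items in this log are tied to specific releases, it can help to record which version is actually installed before reproducing anything. The sketch below uses the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the default distribution name `markitdown` is an assumption taken from the project name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "markitdown") -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None
```

Logging this value alongside a bug report makes it easier to match observed behavior to the release notes linked above.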

## 9. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_734e117518a3496eb3779e5f22b600b5 | https://github.com/microsoft/markitdown/issues/1408

## 10. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_77597bea6262485b9609d8fc5f50a69a | https://github.com/microsoft/markitdown/issues/1894

## 11. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: Enhancement: Add MCP server support for document processing
- User impact: Developers may misconfigure credentials, environment, or host setup: Enhancement: Add MCP server support for document processing
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: Enhancement: Add MCP server support for document processing. Context: Source discussion did not expose a precise runtime context.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_969d5f508051e086435b78736eae3e88 | https://github.com/microsoft/markitdown/issues/2004

## 12. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: v0.1.2
- User impact: Upgrade or migration may change expected behavior: v0.1.2
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.2. Context: Observed when using Python.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_076605feea6e0b4830282709121d3c90 | https://github.com/microsoft/markitdown/releases/tag/v0.1.2

## 13. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: v0.1.2a1
- User impact: Upgrade or migration may change expected behavior: v0.1.2a1
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.2a1. Context: Observed when using Python.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_22fa2fa9d8ed93f594844ce5550fc4d8 | https://github.com/microsoft/markitdown/releases/tag/v0.1.2a1

## 14. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_94fcd5bbf87541d1ab988bae7c501a95 | https://github.com/microsoft/markitdown/issues/2004

## 15. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | github_repo:888092115 | https://github.com/microsoft/markitdown

## 16. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: bug: DOCX math converter crashes when oMath element is missing in malformed equations
- User impact: Developers may hit a documented source-backed failure mode: bug: DOCX math converter crashes when oMath element is missing in malformed equations
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: bug: DOCX math converter crashes when oMath element is missing in malformed equations. Context: Observed when using Python.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_2d85aabe3c00f8d53d781ac03dd69f62 | https://github.com/microsoft/markitdown/issues/1979

## 17. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: bug: DOCX math converter crashes with NotImplementedError on unknown functions
- User impact: Developers may hit a documented source-backed failure mode: bug: DOCX math converter crashes with NotImplementedError on unknown functions
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: bug: DOCX math converter crashes with NotImplementedError on unknown functions. Context: Observed when using Python.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_3ca154355492e590afae4917a8e9c7af | https://github.com/microsoft/markitdown/issues/1982

## 18. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)
- User impact: Developers may hit a documented source-backed failure mode: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.)
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.). Context: Observed when using Python, Windows.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_2603b970a28eceb8da6246b000a927d3 | https://github.com/microsoft/markitdown/issues/1894
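
The failure named in the issue title is a `UnicodeDecodeError` raised while a converter inspects a file that is not valid text in the expected encoding. The sketch below illustrates the general defensive pattern for content sniffing and is not the project's code: decode with `errors="replace"` so that arbitrary bytes never raise. The function name `sniff_contains` and the `"nbformat"` marker are assumptions chosen for illustration.

```python
def sniff_contains(head: bytes, marker: str = '"nbformat"') -> bool:
    """Return True if the marker appears in the leading bytes of a file.

    Hypothetical helper: undecodable bytes are replaced with U+FFFD rather
    than raising UnicodeDecodeError, so binary input simply fails the check.
    """
    text = head.decode("utf-8", errors="replace")
    return marker in text
```

With this pattern, a binary or Latin-1 encoded file is reported as a non-match and the next format check can run.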

## 19. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v0.1.3
- User impact: Upgrade or migration may change expected behavior: v0.1.3
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.3. Context: Observed when using Windows.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_994386694bb3b31fb731336f58573ff3 | https://github.com/microsoft/markitdown/releases/tag/v0.1.3

## 20. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v0.1.5
- User impact: Upgrade or migration may change expected behavior: v0.1.5
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.5. Context: Observed when using Windows.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_38cc2743269efc75c24242abb0e2746c | https://github.com/microsoft/markitdown/releases/tag/v0.1.5

## 21. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_ba28a1cc5c004225b80d2ef380e51a77 | https://github.com/microsoft/markitdown/issues/2000

## 22. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown

## 23. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: `no_demo`
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | github_repo:888092115 | https://github.com/microsoft/markitdown

## 24. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: `no_demo`
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | github_repo:888092115 | https://github.com/microsoft/markitdown

## 25. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this conceptual risk before relying on the project: Unrecognized Arguments Error in markitdown CLI for undocumented arguments
- User impact: Developers may hit a documented source-backed failure mode: Unrecognized Arguments Error in markitdown CLI for undocumented arguments
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_d7ddfa04bce33d2ca53c58ce9f0265c0 | https://github.com/microsoft/markitdown/issues/1897

## 26. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Timeout needed
- User impact: Developers may hit a documented source-backed failure mode: Timeout needed
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: Timeout needed. Context: Source discussion did not expose a precise runtime context.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_acdf8e881ef175760bcc59b92eae1aef | https://github.com/microsoft/markitdown/issues/2000
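
The issue title asks for a timeout. Until the library offers one, a caller can bound how long it waits for any conversion callable. The sketch below is a generic standard-library pattern, not a markitdown API, and the names `run_with_timeout` and `slow_task` are hypothetical. Note the limitation stated in the comments: a thread cannot be killed, so this bounds the wait and not the work itself.

```python
import concurrent.futures
import time


def run_with_timeout(func, *args, timeout_s: float = 30.0, **kwargs):
    """Run func in a worker thread and wait at most timeout_s seconds.

    Raises concurrent.futures.TimeoutError if the result is not ready in time.
    The worker thread is NOT terminated on timeout; it keeps running in the
    background until func returns. For a hard limit, a separate process that
    can be killed is the safer choice.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on the worker; let it finish on its own.
        executor.shutdown(wait=False)


def slow_task(seconds: float) -> str:
    """Stand-in for a long-running conversion (illustration only)."""
    time.sleep(seconds)
    return "done"
```

In practice the callable passed to `run_with_timeout` would be whatever conversion function the caller already uses, with the timeout chosen per input size.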

## 27. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Version 0.1.6
- User impact: Upgrade or migration may change expected behavior: Version 0.1.6
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: Version 0.1.6. Context: Source discussion did not expose a precise runtime context.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_037f240b8fd9da8ecfc973a1f7eae18c | https://github.com/microsoft/markitdown/releases/tag/v0.1.6

## 28. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown

## 29. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:888092115 | https://github.com/microsoft/markitdown

## 30. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: Version 0.1.5b1
- User impact: Upgrade or migration may change expected behavior: Version 0.1.5b1
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: Version 0.1.5b1. Context: Source discussion did not expose a precise runtime context.
- Guardrail: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_fe15b901250727fa3263b3b5af451b94 | https://github.com/microsoft/markitdown/releases/tag/v0.1.5b1

<!-- canonical_name: microsoft/markitdown; human_manual_source: deepwiki_human_wiki -->