Doramagic Project Pack · Human Manual
markitdown
Home
Related topics: Installation Guide, Architecture Overview, Supported File Formats
MarkItDown Home
MarkItDown is a lightweight Python utility and command-line tool for converting various document formats into Markdown. It is designed primarily for use with Large Language Models (LLMs) and text analysis pipelines, extracting content while preserving document structure including headings, lists, tables, links, and other semantic elements. Source: README.md
Overview
MarkItDown provides a unified interface for converting files to Markdown, abstracting away the complexity of handling different file formats. The tool is particularly valuable for:
- LLM Integration: Feeding documents to language models that understand Markdown natively
- Document Indexing: Creating searchable indexes from various document types
- Text Analysis: Extracting structured text content for downstream processing
[!IMPORTANT]
MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it accesses resources that the process itself can access. Sanitize your inputs in untrusted environments, and use the narrowest conversion function for your use case (e.g., convert_stream() or convert_local()). Source: README.md
Supported File Formats
MarkItDown supports conversion from numerous formats, organized by the built-in converters. Source: README.md
| Format | Description | Notes |
|---|---|---|
| PDF | Portable Document Format | Basic text extraction; see Known Limitations |
| PowerPoint (.pptx) | Microsoft PowerPoint presentations | Supports images and math equations |
| Word (.docx) | Microsoft Word documents | Supports images and math equations |
| Excel (.xlsx) | Microsoft Excel spreadsheets | Table extraction supported |
| Images | JPEG, PNG, GIF, etc. | EXIF metadata extraction; OCR via plugin |
| Audio | MP3, WAV, etc. | EXIF metadata and speech transcription |
| HTML | Web pages | Includes Wikipedia-specific processing |
| CSV | Comma-separated values | Converted to Markdown tables |
| JSON | JavaScript Object Notation | Structured text output |
| XML | Extensible Markup Language | RSS feeds fully supported |
| EPUB | Electronic publications | E-book content extraction |
| ZIP | Archive files | Iterates over contents |
| YouTube URLs | Video content | Video metadata and transcript retrieval |
Why Markdown?
Markdown is extremely close to plain text with minimal markup, yet provides a way to represent important document structure. Mainstream LLMs such as OpenAI's GPT-4o natively understand Markdown, making it an efficient format for document consumption by AI systems. As a side benefit, Markdown conventions are also highly token-efficient compared to other structured formats. Source: README.md
Architecture
MarkItDown uses a converter-based architecture that allows for extensibility through plugins while providing a consistent interface for all supported formats.
Core Components
graph TD
A[MarkItDown API] --> B[Converter Registry]
B --> C[Built-in Converters]
B --> D[Plugin Converters]
C --> C1[PDF Converter]
C --> C2[DOCX Converter]
C --> C3[PPTX Converter]
C --> C4[XLSX Converter]
C --> C5[Image Converter]
C --> C6[Audio Converter]
C --> C7[HTML Converter]
C --> C8[Wikipedia Converter]
C --> C9[RSS Converter]
C --> C10[CSV Converter]
C --> C11[EPUB Converter]
C --> C12[YouTube Converter]
D --> D1[OCR Plugin]
D --> D2[Custom Plugins]
E[Azure Integrations] -.->|Optional| B
E --> E1[Document Intelligence]
E --> E2[Content Understanding]
Conversion Pipeline
graph LR
A[Input File/Stream/URI] --> B[Stream Info Extraction]
B --> C{Hint Extension?}
C -->|Yes| D[Use Extension Hint]
C -->|No| E[Detect from Content]
D --> F[Find Matching Converter]
E --> F
F --> G{Converter Found?}
G -->|Yes| H[Execute Conversion]
G -->|No| I[Return Error]
H --> J[DocumentConverterResult]
Converter Base Class
All converters inherit from DocumentConverter and implement two key methods:
- accepts(file_stream, stream_info, **kwargs): Determines if the converter can handle the given input
- convert(file_stream, stream_info, **kwargs): Performs the actual conversion to Markdown
Source: packages/markitdown-sample-plugin/README.md
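The accepts()/convert() contract and the pipeline above can be illustrated with a small stand-alone sketch. It does not import MarkItDown: `StreamInfo`, `ConverterResult`, `CsvConverter`, and `convert_with` are simplified stand-ins invented for this example, and the real classes may differ in fields and behavior.

```python
import csv
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class StreamInfo:
    """Hypothetical stand-in for the metadata handed to converters."""
    extension: Optional[str] = None


@dataclass
class ConverterResult:
    """Hypothetical stand-in for DocumentConverterResult."""
    markdown: str


class CsvConverter:
    """Toy converter that follows the accepts()/convert() contract."""

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
        # Decide from the extension hint only; a real converter may also sniff content.
        return (stream_info.extension or "").lower() == ".csv"

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> ConverterResult:
        text = file_stream.read().decode("utf-8")
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            return ConverterResult(markdown="")
        header, *body = rows
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return ConverterResult(markdown="\n".join(lines))


def convert_with(converters, file_stream: BinaryIO, stream_info: StreamInfo) -> ConverterResult:
    """Try each converter in order; the first one that accepts the input wins."""
    for converter in converters:
        if converter.accepts(file_stream, stream_info):
            return converter.convert(file_stream, stream_info)
    raise ValueError(f"No converter found for extension {stream_info.extension!r}")


result = convert_with(
    [CsvConverter()],
    io.BytesIO(b"name,qty\napple,3\n"),
    StreamInfo(extension=".csv"),
)
print(result.markdown)
```

The ordered loop in `convert_with` is only one possible selection strategy; it is used here because it makes the "Converter Found?" branch of the pipeline easy to see.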
Plugin System
The plugin architecture uses Python entry points to discover and load converters at runtime. Source: packages/markitdown/src/markitdown/__main__.py:1-35
Plugins must implement:
__plugin_interface_version__ = 1
def register_converters(markitdown: MarkItDown, **kwargs):
    """Called during MarkItDown instantiation to register plugin converters."""
    markitdown.register_converter(YourCustomConverter())
Entry point configuration in pyproject.toml:
[project.entry-points."markitdown.plugin"]
your_plugin = "your_package_name"
Source: packages/markitdown-sample-plugin/README.md
Installation
Prerequisites
MarkItDown requires Python 3.10 or higher. Using a virtual environment is recommended to avoid dependency conflicts. Source: README.md
# Using standard Python venv
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or: .venv\Scripts\activate # Windows
# Using uv
uv venv --python 3.12 .venv
source .venv/bin/activate
Installation Options
| Command | Description |
|---|---|
| pip install markitdown | Core package only |
| pip install markitdown[all] | All dependencies included |
| pip install markitdown[pdf] | PDF conversion support |
| pip install markitdown[docx] | Word document support |
| pip install markitdown[pptx] | PowerPoint support |
| pip install markitdown[xlsx] | Excel support |
| pip install markitdown[images] | Image processing |
| pip install markitdown[audio] | Audio transcription |
| pip install markitdown[html] | HTML parsing |
| pip install markitdown[az-doc-intel] | Azure Document Intelligence |
| pip install markitdown[az-content-understanding] | Azure Content Understanding |
| pip install markitdown-ocr | OCR plugin for embedded images |
From source:
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown[all]
Source: packages/markitdown/README.md
Usage
Command-Line Interface
#### Basic Usage
# Convert a file and redirect stdout to a file
markitdown path-to-file.pdf > document.md
# Convert and save to a specific file
markitdown example.xlsx -o output.md
# Read from stdin
cat document.pdf | markitdown > output.md
# Provide file extension hint (for stdin input)
cat document.pdf | markitdown -x .pdf > output.md
Source: packages/markitdown/src/markitdown/__main__.py:1-50
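When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the -o and -x options shown above; `build_markitdown_command` is a hypothetical helper, not part of MarkItDown.

```python
import shlex
from typing import Optional


def build_markitdown_command(
    input_path: Optional[str] = None,
    output: Optional[str] = None,
    extension: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list for the markitdown CLI (hypothetical helper)."""
    argv = ["markitdown"]
    if input_path is not None:
        argv.append(input_path)
    if output is not None:
        argv += ["-o", output]
    if extension is not None:
        argv += ["-x", extension]
    return argv


argv = build_markitdown_command("example.xlsx", output="output.md")
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` in an environment where the markitdown command is installed; passing a list rather than a shell string avoids quoting problems with unusual file names.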
#### CLI Options Reference
| Option | Description |
|---|---|
| -v, --version | Show version number and exit |
| -o, --output FILE | Output file name (default: stdout) |
| -x, --extension EXT | Hint file extension (for stdin input) |
| -d, --use-docintel | Use Azure Document Intelligence |
| -e, --endpoint URL | Azure Document Intelligence endpoint |
| --use-cu | Use Azure Content Understanding |
| --cu-endpoint URL | Content Understanding endpoint |
| --cu-analyzer ID | Content Understanding analyzer ID |
| --cu-file-types TYPES | Comma-separated file types for CU routing |
| -p, --use-plugins | Enable 3rd-party plugins |
| --list-plugins | List installed plugins |
| --keep-data-uris | Keep base64-encoded images in output |
Source: packages/markitdown/src/markitdown/__main__.py:50-130
Python API
#### Basic Usage
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
#### MarkItDown Constructor Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_plugins | bool | False | Enable 3rd-party plugin loading |
| docintel_endpoint | str | None | Azure Document Intelligence endpoint |
| docintel_model_id | str | None | Document Intelligence model ID |
| llm_client | object | None | OpenAI-compatible LLM client |
| llm_model | str | None | LLM model name (e.g., "gpt-4o") |
| llm_prompt | str | None | Custom prompt for LLM operations |
| cu_endpoint | str | None | Azure Content Understanding endpoint |
| cu_file_types | list | None | File types to route to CU |
Source: README.md
#### Convert Methods
from markitdown import MarkItDown
md = MarkItDown()
# Convert local file
result = md.convert("document.pdf")
# Convert URI (http, https, data:, file:)
result = md.convert_uri("https://example.com/page.html")
# Convert file stream
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, extension=".pdf")
# Convert local file (no URL resolution)
result = md.convert_local("document.pdf")
[!NOTE]
convert_url is an alias for convert_uri maintained for backward compatibility. Source: README.md
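Choosing the narrowest method for a string input can be made explicit. The sketch below classifies an input by URI scheme and returns the name of the method to call; `pick_convert_method` and `URI_SCHEMES` are names invented for this example, and the scheme list simply mirrors the comment in the snippet above.

```python
from urllib.parse import urlparse

# Schemes listed for convert_uri in the snippet above.
URI_SCHEMES = {"http", "https", "data", "file"}


def pick_convert_method(source: str) -> str:
    """Return the name of the narrowest conversion method for a string input."""
    scheme = urlparse(source).scheme.lower()
    if scheme in URI_SCHEMES:
        return "convert_uri"
    # Anything else, including Windows drive letters, is treated as a local path.
    return "convert_local"


print(pick_convert_method("document.pdf"))
print(pick_convert_method("https://example.com/page.html"))
```

Given a MarkItDown instance `md`, the call would then be `getattr(md, pick_convert_method(source))(source)`. In untrusted environments it may be preferable to reject URI inputs outright rather than dispatch them.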
#### Return Value
The convert() method returns a DocumentConverterResult object with the following attributes:
| Attribute | Type | Description |
|---|---|---|
| text_content | str | The converted Markdown content |
| title | str | Document title (if extractable) |
| markdown | str | Alias for text_content |
Azure Integrations
Azure Document Intelligence
Use Azure Document Intelligence for higher-quality PDF and image extraction with layout analysis.
markitdown document.pdf -o document.md -d -e "<document_intelligence_endpoint>"
Python API:
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<endpoint>")
result = md.convert("document.pdf")
For Azure Key Credentials:
from markitdown import MarkItDown
from azure.core.credentials import AzureKeyCredential
md = MarkItDown(
    docintel_endpoint="<endpoint>",
    docintel_credential=AzureKeyCredential("<api_key>")
)
Source: README.md
Azure Content Understanding
Azure Content Understanding provides structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers. Install with:
pip install 'markitdown[az-content-understanding]'
from markitdown import MarkItDown
from markitdown import ContentUnderstandingFileType
md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # Route only PDFs to CU
)
Source: packages/markitdown/src/markitdown/converters/_cu_converter.py:1-40
#### When to Use Content Understanding
| Capability | Built-in | Azure Document Intelligence | Azure Content Understanding |
|---|---|---|---|
| Basic PDF conversion | ✓ | ✓ | ✓ |
| Layout analysis | Basic | ✓ | ✓ |
| Structured field extraction | ✗ | ✗ | ✓ |
| Image files | EXIF only | ✗ | ✓ |
| Audio files | Basic transcription | ✗ | ✓ |
| Video files | ✗ | ✗ | ✓ |
| YAML front matter | ✗ | ✗ | ✓ |
Source: README.md
OCR Plugin (markitdown-ocr)
The markitdown-ocr plugin adds OCR support for extracting text from images embedded in PDF, DOCX, PPTX, and XLSX files using LLM Vision. Source: packages/markitdown-ocr/README.md
Installation
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
Usage
markitdown document_with_images.pdf --use-plugins --llm-client openai --llm-model gpt-4o
Python API:
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
Custom prompt:
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)
Works with any OpenAI-compatible client:
from openai import AzureOpenAI
md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="...",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)
[!NOTE]
If no llm_client is provided, the plugin loads but silently skips OCR, falling back to the standard built-in converter. Source: packages/markitdown-ocr/README.md
Known Limitations
PDF Conversion
MarkItDown's built-in PDF converter performs basic text extraction and may not reliably recognize:
- Headings and footers
- Tables (see Issue #293)
- Complex layouts
For improved PDF handling, consider:
- Azure Document Intelligence (-d flag): Better layout analysis
- Azure Content Understanding: Highest quality extraction with structured output
- OCR Plugin: For scanned PDFs and embedded images
Source: Community Issue #296
Microsoft Word (.doc vs .docx)
MarkItDown only supports .docx format (Office Open XML). Legacy .doc files are not supported. Source: Community Issue #23
Audio Transcription
Audio transcription requires ffmpeg or avconv to be installed on the system. On Linux, ensure one of these is available in the system PATH. Source: Community Issue #1685
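Whether one of these tools is reachable can be checked before attempting an audio conversion. This is a small standard-library sketch; `find_audio_backend` is a hypothetical helper and is not part of MarkItDown.

```python
import shutil
from typing import Optional


def find_audio_backend() -> Optional[str]:
    """Return the path of the first of ffmpeg/avconv found on PATH, else None."""
    for tool in ("ffmpeg", "avconv"):
        path = shutil.which(tool)
        if path:
            return path
    return None


backend = find_audio_backend()
print(backend or "Neither ffmpeg nor avconv found on PATH")
```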
Error Handling Behavior
When converting invalid Office Open XML files (DOCX, XLSX, PPTX), MarkItDown returns a successful result with the error message in the text content rather than raising an exception. This may make it difficult to distinguish between successful and failed conversions. Source: Community Issue #1408
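One way to work around this is to check the package structure before converting. Office Open XML files are ZIP archives that contain a `[Content_Types].xml` part, so a cheap pre-check is possible with the standard library. The helper `looks_like_ooxml` below is an illustrative heuristic written for this example, not a MarkItDown API, and passing it does not guarantee that the document is otherwise valid.

```python
import io
import zipfile
from typing import BinaryIO


def looks_like_ooxml(stream: BinaryIO) -> bool:
    """Heuristic: the stream is a ZIP archive containing [Content_Types].xml."""
    start = stream.tell()
    try:
        with zipfile.ZipFile(stream) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
    finally:
        # Restore the position so the stream can still be passed to a converter.
        stream.seek(start)


# Build a tiny in-memory archive to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
buffer.seek(0)

print(looks_like_ooxml(buffer))
print(looks_like_ooxml(io.BytesIO(b"not a zip file")))
```

Running the check first lets calling code raise its own exception for malformed input instead of inspecting the returned text for an error message.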
Version History
| Version | Key Changes |
|---|---|
| 0.1.6 | OCR layer service for embedded images and PDF scans; Fixed O(n) memory growth in PDF conversion |
| 0.1.5 | PDF table extraction with aligned Markdown; Fix for partially numbered lists; Wide table support |
| 0.1.4 | Security updates: mammoth 1.11.0, pdfminer.six 20251107 |
| 0.1.3 | ONNXRuntime pinning on Windows; MCP server environment variable support |
| 0.1.2 | Math equation rendering in DOCX; Azure Document Intelligence credentials; CSV to Markdown tables |
| 0.1.1 | convert_url renamed to convert_uri; Support for file and data URIs |
| 0.1.0 | Plugin architecture; Feature groups for dependencies; 3rd-party extensibility |
Source: Release Notes
Docker
MarkItDown can be run in a Docker container:
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Source: README.md
Security Considerations
[!IMPORTANT]
MarkItDown performs I/O operations with the privileges of the running process. In untrusted environments:
1. Sanitize inputs before passing them to MarkItDown
2. Use the narrowest conversion methods: Prefer convert_local() or convert_stream() over convert_uri() to limit network access
3. Restrict file permissions on the process running MarkItDown
The tool will access any resources that the process user can access, including local files and network resources.
Source: README.md
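As one example of input sanitization, user-supplied paths can be confined to a single directory before they reach a conversion call. The sketch below is an assumption about how an application might do this; `resolve_inside` and the `uploads` directory are hypothetical and not part of MarkItDown.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # Absolute paths and ".." segments both end up outside base and are rejected.
    if not candidate.is_relative_to(base):
        raise ValueError(f"{user_path!r} is outside {base}")
    return candidate


safe = resolve_inside("uploads", "report.pdf")
print(safe.name)
```

The returned path would then be passed to convert_local(), which keeps both the file-system scope and the conversion method as narrow as possible.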
See Also
- Plugin Development Guide - Creating custom MarkItDown converters
- OCR Plugin Documentation - LLM Vision OCR for embedded images
- Azure Content Understanding - Microsoft documentation
- Azure Document Intelligence - Setup guide
Source: https://github.com/microsoft/markitdown / Human Manual
Installation Guide
Related topics: Home, Command-Line Interface
Installation Guide
This guide covers all supported methods for installing MarkItDown, including Python package installation, Docker deployment, and plugin setup. Choose the method that best fits your environment and use case.
Prerequisites
Before installing MarkItDown, ensure your environment meets the following requirements.
Python Version
MarkItDown requires Python 3.10 or higher. You can verify your Python version by running:
python --version
# or
python3 --version
Source: README.md
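The same check can be done from inside Python, which is convenient in setup scripts. This is a minimal sketch; `MIN_PYTHON` and `python_is_supported` are names chosen for this example.

```python
import sys

# Minimum version stated in this guide.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


print(python_is_supported())
```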
Optional: Virtual Environment
It is recommended to use a virtual environment to avoid dependency conflicts with other Python packages. MarkItDown supports multiple methods for creating virtual environments:
| Method | Commands |
|---|---|
| Standard Python | python -m venv .venv && source .venv/bin/activate |
| uv | uv venv --python=3.12 .venv && source .venv/bin/activate |
| conda | conda create -n markitdown python=3.12 && conda activate markitdown |
Note: When using uv, ensure you use uv pip install rather than pip install to install packages within the virtual environment.
Source: README.md
Installation Methods
Install from PyPI (Recommended)
The simplest installation method uses pip to install MarkItDown from PyPI.
#### Full Installation (All Features)
To install MarkItDown with all optional dependencies for maximum functionality:
pip install 'markitdown[all]'
This installs every converter and dependency group, enabling support for PDF, DOCX, PPTX, XLSX, images, audio, and more.
Source: README.md
#### Selective Installation
Install only the converters you need by specifying feature groups:
pip install 'markitdown[pdf,docx,pptx]'
| Feature Group | Description | File Formats Supported |
|---|---|---|
| pdf | PDF converter | .pdf |
| docx | Word document converter | .docx |
| pptx | PowerPoint converter | .pptx, .ppt |
| xlsx | Excel converter | .xlsx, .xls |
| html | HTML/Wikipedia converter | .html, .htm |
| image | Image converter with EXIF/OCR | .jpg, .png, .gif, etc. |
| audio | Audio converter with transcription | .mp3, .wav, .ogg, etc. |
| epub | E-book converter | .epub |
| az-content-understanding | Azure Content Understanding integration | All formats |
| all | All feature groups | All formats |
Source: packages/markitdown/pyproject.toml
Install from Source
For development, testing, or customization, install MarkItDown from the GitHub repository.
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
The -e flag installs the package in editable mode, which is useful for development as changes to the source code take effect immediately without reinstallation.
Source: README.md
Docker Installation
MarkItDown provides a Docker image for containerized deployments. This method requires no Python installation on the host system.
#### Building the Image
docker build -t markitdown:latest .
#### Running MarkItDown in Docker
Convert a file and output to stdout:
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Or with a bind mount of the current directory for larger files:
docker run --rm -v $(pwd):/data markitdown:latest markitdown /data/input.pdf -o /data/output.md
Source: README.md
Plugin Installation
MarkItDown uses a plugin architecture that allows extending functionality through separate packages.
markitdown-ocr Plugin
The markitdown-ocr plugin adds OCR support for extracting text from embedded images in PDF, DOCX, PPTX, and XLSX files using LLM Vision. This is particularly useful for scanned PDFs or documents containing image-based text.
#### Installation
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
#### Usage
After installation, enable plugins when creating a MarkItDown instance:
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
#### CLI Usage
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
Important: The --llm-client and --llm-model CLI arguments require the plugin to be installed. Without the plugin, these arguments will cause an "Unrecognized Arguments" error. See Issue #1897 for details.
Source: packages/markitdown-ocr/README.md
Third-Party Plugins
To discover third-party plugins, search GitHub for the hashtag #markitdown-plugin.
#### Listing Installed Plugins
markitdown --list-plugins
Sample output:
Installed MarkItDown 3rd-party Plugins:
* sample_plugin (package: markitdown_sample_plugin)
Use the -p (or --use-plugins) option to enable 3rd-party plugins.
#### Developing Custom Plugins
For developing a custom plugin, see the Sample Plugin documentation. A plugin must:
- Implement a DocumentConverter class with accepts() and convert() methods
- Export __plugin_interface_version__ = 1
- Export a register_converters() function
- Define an entry point in pyproject.toml:
[project.entry-points."markitdown.plugin"]
my_plugin = "my_plugin_package"
Source: packages/markitdown-sample-plugin/README.md
Azure Service Integration
Azure Document Intelligence
For higher-quality PDF conversion using Azure's Document Intelligence service:
pip install 'markitdown[az-doc-intel]'
Configure the endpoint when converting:
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
Or in Python:
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
For setup instructions, see Azure Document Intelligence documentation.
Source: README.md
Azure Content Understanding
For advanced multi-modal extraction with structured field output (YAML front matter):
pip install 'markitdown[az-content-understanding]'
Configure via Python:
from markitdown import ContentUnderstandingFileType
from markitdown import MarkItDown
md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # only PDFs use CU
)
For more information, see Azure Content Understanding documentation.
Source: README.md
Verifying Installation
After installation, verify that MarkItDown is correctly installed by checking the version:
markitdown --version
Or test the Python API:
python -c "from markitdown import MarkItDown; print(MarkItDown().convert('test.txt').text_content)"
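The one-liner above needs a test.txt file in the current directory. A check that does not depend on any input file is to query the installed distribution metadata, as in this sketch; `installed_version` is a hypothetical helper built on the standard library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "markitdown") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "markitdown is not installed in this environment")
```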
Installation Architecture
graph TD
A[MarkItDown Installation] --> B[Core Package<br/>markitdown]
A --> C[Optional Dependencies]
A --> D[Plugins]
C --> C1[PDF: pdfminer.six]
C --> C2[DOCX: mammoth]
C --> C3[PPTX: python-pptx]
C --> C4[XLSX: openpyxl]
C --> C5[Image: Pillow]
C --> C6[Audio: pydub, speechrecognition]
D --> D1[markitdown-ocr]
D --> D2[Custom Plugins<br/>#markitdown-plugin]
B --> E[CLI Interface<br/>markitdown command]
B --> F[Python API<br/>MarkItDown class]
Troubleshooting
RuntimeWarning: ffmpeg/avconv not found
On Linux systems, if you see this warning when converting audio files:
RuntimeWarning: Couldn't find ffmpeg or avconv
Install ffmpeg to enable audio conversion:
# Debian/Ubuntu
sudo apt install ffmpeg
# Fedora
sudo dnf install ffmpeg
# macOS
brew install ffmpeg
Source: Issue #1685
Unrecognized Arguments Error with --llm-client
If you encounter an "Unrecognized Arguments" error when using --llm-client or --llm-model:
markitdown: error: unrecognized arguments: --llm-client openai
Install the markitdown-ocr plugin first:
pip install markitdown-ocr
Source: Issue #1897
Dependency Conflicts
If you encounter dependency conflicts with other packages, create a dedicated virtual environment:
python -m venv markitdown-env
source markitdown-env/bin/activate
pip install 'markitdown[all]'
Next Steps
After installation, see the following guides:
- Usage Guide - Converting files using CLI and Python API
- Plugin Development - Creating custom MarkItDown plugins
- Configuration Reference - All available configuration options
- Troubleshooting - Solving common issues
See Also
- MarkItDown README - Project overview and quick start
- markitdown-ocr Plugin - OCR plugin documentation
- Sample Plugin - Plugin development guide
- Azure Document Intelligence - Cloud-based document conversion
- Azure Content Understanding - Advanced multi-modal extraction
Command-Line Interface
Related topics: Installation Guide, Python API Reference
Command-Line Interface
MarkItDown provides a command-line interface (CLI) for converting various file formats to Markdown directly from the terminal. The CLI is installed as an entry point when you install the markitdown package, making the markitdown command available system-wide.
Overview
The MarkItDown CLI enables users to convert documents without writing Python code. It wraps the MarkItDown Python class and provides a streamlined interface for common conversion tasks including file conversion, streaming input, plugin management, and integration with Azure AI services.
graph TD
A["markitdown CLI"] --> B{Input Type}
B -->|File Path| C["convert_local()"]
B -->|Stdin| D["convert_stream()"]
B -->|URL| E["convert_uri()"]
C --> F["MarkItDown Engine"]
D --> F
E --> F
F --> G{Converter Selection}
G -->|Built-in| H["PDF, DOCX, PPTX, XLSX..."]
G -->|Plugin| I["3rd-party Converters"]
G -->|Azure| J["Doc Intel / Content Understanding"]
H --> K["Markdown Output"]
I --> K
J --> K
Installation
Ensure MarkItDown is installed with the CLI component:
# Install with all optional dependencies
pip install markitdown[all]
# Verify CLI is available
markitdown --version
Source: README.md
Basic Usage
Converting a File
The simplest use case is converting a local file:
markitdown path/to/document.pdf > output.md
Or using the output flag:
markitdown path/to/document.pdf -o output.md
Reading from Stdin
MarkItDown can read from standard input when no filename is provided. Use the -x flag to specify the file extension when reading from stdin:
cat document.pdf | markitdown -x pdf > output.md
Docker Usage
For environments without Python installed, use Docker:
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Source: README.md
Command Reference
Global Options
| Option | Short | Description |
|---|---|---|
| --version | -v | Show the version number and exit |
| --output | -o | Output file name. If not provided, output is written to stdout |
| --extension | -x | Provide a hint about the file extension when reading from stdin |
| --list-plugins | | List installed 3rd-party plugins and exit |
Plugin Options
| Option | Short | Description |
|---|---|---|
| --use-plugins | -p | Enable 3rd-party plugin support during conversion |
LLM Options
| Option | Description |
|---|---|
| --llm-client | Specify the LLM client to use (e.g., openai) for image descriptions |
| --llm-model | Specify the LLM model to use (e.g., gpt-4o) |
| --llm-prompt | Custom prompt for LLM-based feature extraction |
Azure Document Intelligence Options
| Option | Short | Description |
|---|---|---|
| --use-docintel | -d | Use Azure Document Intelligence for conversion |
| --endpoint | -e | Azure Document Intelligence endpoint URL (required with --use-docintel) |
Azure Content Understanding Options
| Option | Short | Description |
|---|---|---|
| --use-cu | | Use Azure Content Understanding for conversion |
| --cu-endpoint | | Content Understanding endpoint URL (required with --use-cu) |
| --cu-file-types | | Comma-separated list of file types to process with Content Understanding |
Source: packages/markitdown/src/markitdown/__main__.py:20-120
Common Usage Patterns
Standard File Conversion
Convert supported file types (PDF, DOCX, PPTX, XLSX, images, audio, etc.):
markitdown document.pdf -o document.md
markitdown presentation.pptx -o presentation.md
markitdown spreadsheet.xlsx -o spreadsheet.md
Plugin Usage
List available plugins:
markitdown --list-plugins
Convert using 3rd-party plugins:
markitdown document.rtf --use-plugins -o document.md
LLM-Enhanced Conversion
Use LLM vision for image descriptions (currently supports PPTX and images):
markitdown image.jpg --llm-client openai --llm-model gpt-4o -o image.md
Azure Document Intelligence
For higher-quality PDF and Office document conversion:
markitdown document.pdf -d -e "https://your-resource.cognitiveservices.azure.com/" -o document.md
Azure Content Understanding
For structured field extraction with YAML front matter:
markitdown invoice.pdf --use-cu --cu-endpoint "https://your-cu-endpoint" -o invoice.md
Limit Content Understanding to specific file types:
markitdown document.pdf --use-cu --cu-endpoint "..." --cu-file-types PDF -o document.md
Source: README.md
CLI Architecture
The CLI is implemented in __main__.py and follows a clear initialization pattern:
graph LR
A["Argument Parsing<br/>(argparse)"] --> B{"Mode Selection"}
B -->|--list-plugins| C["List Plugins & Exit"]
B -->|--use-docintel| D["Document Intelligence Mode"]
B -->|--use-cu| E["Content Understanding Mode"]
B -->|Default| F["Standard Conversion Mode"]
D --> G["MarkItDown(<br/>docintel_endpoint)"]
E --> H["MarkItDown(<br/>cu_endpoint, cu_file_types)"]
F --> I["MarkItDown(<br/>enable_plugins)"]
G --> J["Conversion & Output"]
H --> J
I --> J
Argument Parsing Flow
Source: packages/markitdown/src/markitdown/__main__.py:20-150
The CLI uses argparse with RawDescriptionHelpFormatter to provide formatted help output including usage examples:
parser = argparse.ArgumentParser(
    description="Convert various file formats to markdown.",
    prog="markitdown",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    usage=dedent("""
        SYNTAX:
            markitdown <OPTIONAL: FILENAME>
            If FILENAME is empty, markitdown reads from stdin.
        """).strip(),
)
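To see how such a parser behaves, the sketch below declares a few of the options from the Command Reference and parses a sample command line. It is a simplified reconstruction for illustration, not the code in __main__.py; `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser covering a subset of the documented options."""
    parser = argparse.ArgumentParser(
        prog="markitdown",
        description="Convert various file formats to markdown.",
    )
    # Optional positional argument: stdin is used when it is omitted.
    parser.add_argument("filename", nargs="?", help="input file")
    parser.add_argument("-o", "--output", help="output file name (default: stdout)")
    parser.add_argument("-x", "--extension", help="file extension hint for stdin input")
    parser.add_argument("-p", "--use-plugins", action="store_true",
                        help="enable 3rd-party plugins")
    parser.add_argument("--list-plugins", action="store_true",
                        help="list installed plugins and exit")
    return parser


args = build_parser().parse_args(["example.xlsx", "-o", "output.md"])
print(args.filename, args.output, args.use_plugins)
```

With argparse, an option that has not been declared on the parser is rejected with an "unrecognized arguments" error, which is the same class of message described under Known Limitations and Issues.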
Mode Initialization
Based on the flags provided, the CLI initializes MarkItDown with different configurations:
# Document Intelligence mode
if args.use_docintel:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins,
        docintel_endpoint=args.endpoint
    )
# Content Understanding mode
elif args.use_cu:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins,
        cu_endpoint=args.cu_endpoint,
        cu_file_types=args.cu_file_types
    )
# Standard mode with optional plugins
else:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins,
        llm_client=llm_client,
        llm_model=args.llm_model,
        llm_prompt=args.llm_prompt
    )
Source: packages/markitdown/src/markitdown/__main__.py:100-145
Plugin Integration
Listing Plugins
The CLI discovers plugins via Python entry points in the markitdown.plugin group:
markitdown --list-plugins
Example output:
Installed MarkItDown 3rd-party Plugins:
* markitdown-ocr (package: markitdown_ocr)
Use the -p (or --use-plugins) option to enable 3rd-party plugins.
Enabling Plugins
To use 3rd-party plugins, pass the --use-plugins flag:
markitdown document.pdf --use-plugins -o output.md
markitdown-ocr Plugin
The markitdown-ocr plugin adds OCR support for embedded images in PDF, DOCX, PPTX, and XLSX files:
markitdown document_with_images.pdf --use-plugins --llm-client openai --llm-model gpt-4o
[!NOTE]
If no llm_client is provided, the plugin loads but OCR is silently skipped, falling back to the standard converter.
Source: packages/markitdown-ocr/README.md
Known Limitations and Issues
Unrecognized Arguments Error
A known issue exists where certain command-line arguments may not be recognized as expected. The example in the markitdown-ocr documentation shows:
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
However, the --llm-client and --llm-model flags are parsed but may not be properly handled depending on the conversion mode selected. When this occurs, an "Unrecognized Arguments" error is displayed.
Workaround: Ensure the correct argument format is used and that required dependencies are installed:
# Verify markitdown is properly installed
pip show markitdown
# Reinstall with all dependencies
pip install markitdown[all]
Reference: Issue #1897
Missing ffmpeg Warning on Linux
When processing audio files, a RuntimeWarning may appear if ffmpeg or avconv is not installed:
RuntimeWarning: Couldn't find ffmpeg or avconv
Workaround: Install ffmpeg on Linux systems:
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Fedora
sudo dnf install ffmpeg
Reference: Issue #1685
Security Considerations
[!IMPORTANT]
MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access resources that the process itself can access.
For untrusted environments:
- Sanitize inputs before passing them to the CLI
- Use the narrowest conversion function needed for your use case
- Consider running in sandboxed environments when processing untrusted files
Source: README.md
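One way to sanitize a user-supplied path is to resolve it and reject anything that falls outside an allowed directory. The following is a sketch under that assumption, not a MarkItDown feature: `resolve_inside` is a hypothetical helper, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path and make sure it stays inside base_dir.

    Raises ValueError for paths that escape base_dir, for example via "..".
    """
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # Python 3.9+
        raise ValueError(f"{user_path!r} is outside {base}")
    return candidate
```

The returned path can then be handed to the CLI or to a local-only conversion function. Absolute inputs such as `/etc/passwd` are rejected too, because joining them onto the base directory discards the base.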
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (file not found, unsupported format, conversion failed) |
| 2 | Invalid arguments or missing required parameters |
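A calling script can branch on these codes. The sketch below assumes the meanings in the table above; `EXIT_MEANINGS` and `run_and_explain` are hypothetical names, and the wrapper works for any command, not only `markitdown`.

```python
import subprocess
from typing import List, Tuple

# Meanings copied from the exit code table above.
EXIT_MEANINGS = {
    0: "Success",
    1: "Error (file not found, unsupported format, conversion failed)",
    2: "Invalid arguments or missing required parameters",
}


def run_and_explain(argv: List[str]) -> Tuple[int, str]:
    """Run a command and return its exit code with a readable meaning."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    meaning = EXIT_MEANINGS.get(completed.returncode, "Unknown exit code")
    return completed.returncode, meaning
```

For example, `run_and_explain(["markitdown", "document.pdf", "-o", "output.md"])` returns the code together with its description, provided the CLI is installed and on `PATH`.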
See Also
Source: https://github.com/microsoft/markitdown / Human Manual
Architecture Overview
Related topics: Home, Python API Reference, Plugin Development Guide
MarkItDown is a lightweight Python utility for converting various file formats to Markdown, designed primarily for consumption by Large Language Models (LLMs) and text analysis pipelines. This page provides a comprehensive technical overview of MarkItDown's architecture, including its core components, converter system, plugin architecture, and data flow.
High-Level Architecture
MarkItDown employs a converter-based architecture where each file format is handled by a specialized converter that implements a common interface. The system uses a priority-based selection mechanism to allow converters to be overridden or extended through plugins.
graph TD
subgraph "Client Layer"
CLI["CLI<br/>(__main__.py)"]
API["Python API<br/>(MarkItDown class)"]
end
subgraph "Core Engine"
MD["MarkItDown<br/>Orchestrator"]
SI["StreamInfo<br/>Metadata"]
EP["Entry Points<br/>(Plugin Discovery)"]
end
subgraph "Converter System"
BC["Base Converter<br/>(DocumentConverter)"]
PDF["PdfConverter"]
DOCX["DocxConverter"]
PPTX["PptxConverter"]
XLSX["XlsxConverter"]
IMG["ImageConverter"]
AUD["AudioConverter"]
CU["ContentUnderstandingConverter"]
end
subgraph "Output"
DCR["DocumentConverterResult"]
MD_OUT["Markdown Output"]
end
CLI --> API
API --> MD
MD --> EP
MD --> SI
EP --> BC
BC --> PDF
BC --> DOCX
BC --> PPTX
BC --> XLSX
BC --> IMG
BC --> AUD
BC --> CU
MD --> DCR
DCR --> MD_OUT
Source: packages/markitdown/src/markitdown/_markitdown.py
Core Components
MarkItDown Orchestrator
The MarkItDown class serves as the main orchestrator for the conversion pipeline. It is responsible for:
- Converter discovery and registration via Python entry points
- File format detection using `StreamInfo` metadata
- Converter selection and execution based on priority
- Result aggregation into a standardized output format
Source: packages/markitdown/src/markitdown/_markitdown.py:1-50
#### Key Methods
| Method | Description |
|---|---|
| `convert(file_path)` | Convert a local file to Markdown |
| `convert_stream(stream, stream_info)` | Convert a stream with metadata |
| `convert_uri(uri)` | Convert a file, data, or remote URI |
| `convert_url(url)` | Alias for `convert_uri` |
| `register_converter(converter)` | Register a custom converter |
| `list_converters()` | List all registered converters |
Source: packages/markitdown/src/markitdown/_markitdown.py
StreamInfo
The StreamInfo class encapsulates metadata about the file being converted:
from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamInfo:
    extension: Optional[str] = None
    mimetype: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None
This metadata is used by converters to determine whether they can handle a given file through their accepts() method.
Source: packages/markitdown/src/markitdown/_stream_info.py
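An `accepts()` check typically compares the extension and MIME type hints against the formats a converter understands. The sketch below uses a local stand-in dataclass that mirrors the fields listed above so that it runs without MarkItDown installed. `accepts_csv` and the accepted values are illustrative assumptions, not the library's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    """Local stand-in that mirrors the fields listed above."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None


ACCEPTED_EXTENSIONS = {".csv"}
ACCEPTED_MIME_PREFIXES = ("text/csv", "application/csv")


def accepts_csv(info: StreamInfo) -> bool:
    """Accept when either the extension or the MIME type looks like CSV."""
    extension = (info.extension or "").lower()
    mimetype = (info.mimetype or "").lower()
    if extension in ACCEPTED_EXTENSIONS:
        return True
    # Prefix matching tolerates parameters such as "; charset=utf-8".
    return mimetype.startswith(ACCEPTED_MIME_PREFIXES)
```

Treating missing hints as empty strings keeps the check safe when a stream arrives with no metadata at all.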
DocumentConverterResult
The DocumentConverterResult class represents the output of a successful conversion:
from typing import Any, Dict, List

class DocumentConverterResult:
    text_content: str  # The Markdown output
    attachments: List[Any]  # Extracted images or other attachments
    metadata: Dict[str, Any]  # Conversion metadata
Source: packages/markitdown/src/markitdown/_base_converter.py
Converter System
Base Converter Interface
All converters inherit from DocumentConverter, which defines the contract for file conversion:
from abc import ABC, abstractmethod
from typing import Any, BinaryIO

class DocumentConverter(ABC):
    # Priority constants (lower values are tried first)
    PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
    PRIORITY_GENERIC_FILE_FORMAT = 10.0

    def __init__(self, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT):
        self.priority = priority

    @abstractmethod
    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any
    ) -> bool:
        """Determine if this converter can handle the file."""
        pass

    @abstractmethod
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any
    ) -> DocumentConverterResult:
        """Convert the file to Markdown."""
        pass
Source: packages/markitdown/src/markitdown/_base_converter.py
Priority System
Converters are tried in order of priority, with lower values tried first. This allows a plugin to run ahead of a built-in converter by registering with a lower value:
| Priority | Usage |
|---|---|
| `-1.0` | Plugin converters that must run before the built-ins (used by markitdown-ocr) |
| `0.0` (`PRIORITY_SPECIFIC_FILE_FORMAT`) | Default for format-specific built-in converters such as PDF and DOCX |
| `10.0` (`PRIORITY_GENERIC_FILE_FORMAT`) | Generic, near catch-all converters such as plain text |
Source: packages/markitdown/src/markitdown/_base_converter.py
Built-in Converters
The following converters are included with the core package:
| Converter | File Types | Priority |
|---|---|---|
| `PdfConverter` | .pdf | 0.0 |
| `DocxConverter` | .docx | 0.0 |
| `PptxConverter` | .pptx | 0.0 |
| `XlsxConverter` | .xlsx | 0.0 |
| `ImageConverter` | .jpg, .jpeg, .png, .gif, .bmp, .webp | 0.0 |
| `AudioConverter` | .mp3, .wav, .m4a, .flac | 0.0 |
| `WikipediaConverter` | Wikipedia URLs | 0.0 |
| `CsvConverter` | .csv | 0.0 |
| `JsonConverter` | .json | 0.0 |
| `XmlConverter` | .xml | 0.0 |
| `HtmlConverter` | .html, .htm | 0.0 |
| `EpubConverter` | .epub | 0.0 |
| `YouTubeConverter` | YouTube URLs | 0.0 |
| `IpynbConverter` | .ipynb | 0.0 |
| `ZipConverter` | .zip | 0.0 |
| `ContentUnderstandingConverter` | All (when enabled) | 0.0 |
Source: packages/markitdown/src/markitdown/converters/__init__.py
Plugin Architecture
Version 0.1.0 introduced a plugin-based architecture that allows third-party developers to extend MarkItDown's capabilities.
Plugin Discovery Mechanism
Plugins are discovered through Python entry points in the markitdown.plugin group:
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
Source: packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py
Plugin Interface
Each plugin must implement and export the following:
from markitdown import MarkItDown

# The version of the plugin interface that this plugin uses
__plugin_interface_version__ = 1

def register_converters(markitdown: MarkItDown, **kwargs):
    """Called during construction of MarkItDown instances."""
    markitdown.register_converter(YourCustomConverter())
Source: packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py
Plugin Registration Flow
sequenceDiagram
participant User
participant MarkItDown
participant EntryPoints
participant Plugin
participant Converter
User->>MarkItDown: new MarkItDown(enable_plugins=True)
MarkItDown->>EntryPoints: entry_points(group="markitdown.plugin")
EntryPoints-->>MarkItDown: List of plugins
loop For each plugin
MarkItDown->>Plugin: load_plugin()
Plugin->>Plugin: register_converters()
Plugin->>MarkItDown: register_converter(Converter)
end
User->>MarkItDown: convert(file)
MarkItDown->>Converter: accepts(stream, info)
Converter-->>MarkItDown: True
MarkItDown->>Converter: convert(stream, info)
Converter-->>MarkItDown: DocumentConverterResult
Source: packages/markitdown/src/markitdown/_markitdown.py
Plugin Parameters
When plugins are loaded, the following parameters are forwarded from the MarkItDown constructor:
| Parameter | Type | Purpose |
|---|---|---|
| `llm_client` | OpenAI-compatible | LLM client for image descriptions and OCR |
| `llm_model` | str | Model name for LLM calls |
| `llm_prompt` | str | Custom prompt for LLM extraction |
| `docintel_endpoint` | str | Azure Document Intelligence endpoint |
| `docintel_credential` | AzureKeyCredential | Authentication for Document Intelligence |
Source: packages/markitdown/src/markitdown/_markitdown.py
Example: markitdown-ocr Plugin
The markitdown-ocr plugin demonstrates the plugin architecture. It registers OCR-enhanced converters at priority -1.0. Because lower priority values are tried first, these converters run before the built-in converters registered at 0.0:
# Inside register_converters()
markitdown.register_converter(OcrPdfConverter(priority=-1.0))
markitdown.register_converter(OcrDocxConverter(priority=-1.0))
markitdown.register_converter(OcrPptxConverter(priority=-1.0))
markitdown.register_converter(OcrXlsxConverter(priority=-1.0))
Source: packages/markitdown-ocr/README.md
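The ordering implied by this example, where a converter at -1.0 runs ahead of a built-in at 0.0, can be sketched with a small stand-in registry. `Registration` and `pick_converter` are hypothetical names for illustration only. The sketch keeps registration order for equal priorities, and the library's tie-breaking may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    """Hypothetical record pairing a converter name with its priority."""
    name: str
    priority: float
    accepts: Callable[[str], bool]


def pick_converter(registrations: List[Registration], extension: str) -> Optional[str]:
    """Return the first accepting converter, trying lower priority values first."""
    # sorted() is stable, so equal priorities keep their registration order.
    for reg in sorted(registrations, key=lambda r: r.priority):
        if reg.accepts(extension):
            return reg.name
    return None


registry = [
    Registration("PdfConverter", 0.0, lambda ext: ext == ".pdf"),
    Registration("OcrPdfConverter", -1.0, lambda ext: ext == ".pdf"),
    Registration("DocxConverter", 0.0, lambda ext: ext == ".docx"),
]

print(pick_converter(registry, ".pdf"))   # OcrPdfConverter
print(pick_converter(registry, ".docx"))  # DocxConverter
print(pick_converter(registry, ".rtf"))   # None
```

Although the OCR converter is registered after the built-in PDF converter, its lower priority value moves it to the front of the search.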
Conversion Pipeline
URI Handling
MarkItDown supports multiple input sources through a unified URI handling system:
graph TD
URI["convert_uri(uri)"]
URI_TYPES{"URI Type?"}
FILE["file:///path/to/file"]
DATA["data:application/pdf;base64,..."]
HTTP["https://example.com/file.pdf"]
YT["https://youtube.com/watch?v=..."]
URI --> URI_TYPES
URI_TYPES -->|"file:"| FILE
URI_TYPES -->|"data:"| DATA
URI_TYPES -->|"http/https"| HTTP
URI_TYPES -->|"youtube.com"| YT
FILE --> STREAM["_convert_stream()"]
DATA --> STREAM
HTTP --> FETCH["Fetch content"]
FETCH --> STREAM
YT --> STREAM
STREAM --> ACCEPT["Try each converter"]
ACCEPT --> RESULT["DocumentConverterResult"]
Source: packages/markitdown/src/markitdown/_uri_utils.py
Stream Processing
The conversion pipeline processes files as streams to support:
- Piped input (stdin)
- Remote URLs (HTTP/HTTPS)
- Data URIs (embedded content)
- File paths (local)
Source: packages/markitdown/src/markitdown/_markitdown.py
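The scheme-based routing described in this section can be illustrated with the standard library. This is a sketch, not the code in `_uri_utils.py`: `classify_uri` and `decode_data_uri` are hypothetical helpers, and YouTube links are simply reported as `http` here because they share the same scheme.

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes, urlparse


def classify_uri(uri: str) -> str:
    """Return 'file', 'data', 'http', or 'unknown' based on the URI scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme in ("file", "data"):
        return scheme
    return "unknown"


def decode_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split a data: URI into (mimetype, payload bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, _, payload = uri[len("data:"):].partition(",")
    parts = header.split(";")
    mimetype = parts[0] or "text/plain"
    if "base64" in parts[1:]:
        return mimetype, base64.b64decode(payload)
    # Without the base64 marker the payload is percent-encoded.
    return mimetype, unquote_to_bytes(payload)
```

Once the payload is available as bytes, it can be wrapped in `io.BytesIO` and handed to a stream-based conversion function together with the detected MIME type.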
Azure Integration
Azure Document Intelligence
For PDF files, MarkItDown can optionally use Azure Document Intelligence for enhanced extraction:
from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="<endpoint>",
    docintel_credential=AzureKeyCredential("<key>"),
)
result = md.convert("document.pdf")
Source: packages/markitdown/src/markitdown/__main__.py
Azure Content Understanding
For multi-modal documents (audio, video, structured fields), MarkItDown supports Azure Content Understanding:
from markitdown import MarkItDown, ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF]
)
Source: packages/markitdown/src/markitdown/converters/_cu_converter.py
Command-Line Interface
The CLI is implemented in __main__.py and provides the following interface:
graph LR
CLI["markitdown CLI"]
OPTS["Options"]
CONV["Conversion<br/>Modes"]
subgraph OPTS
V["-v, --version"]
O["-o, --output"]
X["-x, --extension"]
P["-p, --use-plugins"]
L["-l, --list-plugins"]
M["-m, --mime-type"]
end
subgraph CONV
STD["Standard"]
DOCINTEL["--use-docintel"]
CU["--use-cu"]
end
CLI --> OPTS
CLI --> CONV
Source: packages/markitdown/src/markitdown/__main__.py
CLI Options
| Option | Description |
|---|---|
| `-v`, `--version` | Show version number |
| `-o`, `--output` FILE | Output file path |
| `-x`, `--extension` EXT | Hint file extension (for stdin) |
| `-m`, `--mime-type` MIME_TYPE | Hint for the MIME type |
| `-c`, `--charset` CHARSET | Hint for the charset |
| `-p`, `--use-plugins` | Enable plugins |
| `-l`, `--list-plugins` | List installed plugins |
| `--use-docintel` | Use Azure Document Intelligence |
| `--endpoint` URL | Document Intelligence endpoint |
| `--use-cu` | Use Azure Content Understanding |
| `--cu-endpoint` URL | Content Understanding endpoint |
Source: packages/markitdown/src/markitdown/__main__.py
Known Limitations and Failure Modes
Based on community-reported issues, be aware of the following:
1. CLI Unrecognized Arguments
The CLI may not recognize all documented arguments. Issue #1897 reports that the --llm-client and --llm-model arguments shown in documentation are not properly recognized.
2. UnicodeDecodeError in IpynbConverter
The IpynbConverter.accepts() method decodes the input as UTF-8, which can raise UnicodeDecodeError on files that contain non-ASCII bytes that are not valid UTF-8. See issue #1894.
3. Invalid Office Open XML Files
Invalid DOCX, XLSX, or PPTX files return success with an error message in text_content rather than raising an exception. See issue #1408.
4. Audio Processing on Linux
When ffmpeg or avconv is not installed, pydub raises a RuntimeWarning that may affect audio conversion. See issue #1685.
5. PDF Table Extraction
Tables in PDF files may not be converted properly. Community reports indicate that complex table structures can be problematic. See issue #293.
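Because an invalid Office file can come back as a "successful" result (item 3), a caller may want to validate the input before conversion. Office Open XML files are ZIP packages that contain a `[Content_Types].xml` part, so a cheap structural check is possible with the standard library. `looks_like_ooxml` is a hypothetical helper, and passing this check does not guarantee that the document itself is well formed.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Return True if data is a ZIP archive containing [Content_Types].xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```

A caller can raise its own exception when the check fails instead of inspecting `text_content` for an error message afterwards.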
Security Considerations
[!IMPORTANT]
MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access resources that the process itself can access.
For untrusted environments:
- Sanitize inputs before passing them to MarkItDown
- Use narrow conversion functions such as `convert_stream()` or `convert_local()` when possible
- Be cautious with URL inputs as they may trigger network requests
Source: packages/markitdown/README.md
Dependency Groups
MarkItDown organizes dependencies into feature groups:
| Group | Description |
|---|---|
| `pdf` | PDF conversion (pdfminer.six) |
| `docx` | Word document conversion (mammoth) |
| `pptx` | PowerPoint conversion (python-pptx) |
| `xlsx` | Excel conversion (openpyxl) |
| `image` | Image processing |
| `audio-transcription` | Audio transcription (pydub) |
| `youtube-transcription` | YouTube transcript retrieval |
| `az-doc-intel` | Azure Document Intelligence |
| `az-content-understanding` | Azure Content Understanding |
| `all` | All optional dependencies |
Install with: pip install 'markitdown[all]'
Source: packages/markitdown/README.md
Class Diagram
classDiagram
class DocumentConverter {
<<abstract>>
+float priority
+accepts(file_stream, stream_info, **kwargs) bool
+convert(file_stream, stream_info, **kwargs) DocumentConverterResult
}
class MarkItDown {
+bool enable_plugins
+str llm_model
+convert(file_path) DocumentConverterResult
+convert_uri(uri) DocumentConverterResult
+convert_stream(stream, info) DocumentConverterResult
+register_converter(converter)
}
class StreamInfo {
+Optional~str~ extension
+Optional~str~ mimetype
+Optional~str~ charset
+Optional~str~ url
}
class DocumentConverterResult {
+str text_content
+List attachments
+Dict metadata
}
MarkItDown "1" --> "*" DocumentConverter : registers
MarkItDown --> StreamInfo : creates
MarkItDown --> DocumentConverterResult : returns
DocumentConverter --> DocumentConverterResult : returns
DocumentConverter ..> StreamInfo : uses
See Also
- MarkItDown README - Official project documentation
- markitdown-ocr Plugin - OCR plugin documentation
- Sample Plugin Guide - Creating custom plugins
- Azure Document Intelligence Setup - Azure integration guide
- Azure Content Understanding - Content Understanding integration
Source: https://github.com/microsoft/markitdown / Human Manual
Python API Reference
Related topics: Architecture Overview, Command-Line Interface, Azure Integrations
MarkItDown provides a comprehensive Python API for converting various file formats to Markdown. This reference documents the core classes, methods, and configuration options available to developers integrating MarkItDown into their Python applications.
Overview
The MarkItDown Python API enables programmatic document conversion with support for:
- Local files - Convert files from the filesystem
- URLs/URIs - Convert remote documents via HTTP or file URIs
- Streams - Convert from file-like objects with optional metadata hints
- Azure services - Integration with Document Intelligence and Content Understanding
- Plugin architecture - Extensible converter system for custom formats
Source: packages/markitdown/src/markitdown/__init__.py
Core Classes
MarkItDown
The main entry point for the Python API. Create an instance with desired configuration and call convert() to transform documents to Markdown.
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
#### Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_plugins` | bool | False | Enable loading of 3rd-party plugins via the `markitdown.plugin` entry point group |
| `llm_client` | Any | None | OpenAI-compatible LLM client for image descriptions (used with `llm_model`) |
| `llm_model` | str | None | Model name for LLM-based image description generation |
| `llm_prompt` | str | None | Custom prompt template for LLM image description |
| `docintel_endpoint` | str | None | Azure Document Intelligence endpoint URL |
| `docintel_api_key` | str | None | Azure Document Intelligence API key (alternative to credential) |
| `cu_endpoint` | str | None | Azure Content Understanding endpoint URL |
| `cu_api_key` | str | None | Azure Content Understanding API key |
| `cu_analyzer` | str | None | Specific Content Understanding analyzer ID to use |
| `cu_file_types` | List[ContentUnderstandingFileType] | None | Filter which file types route to Content Understanding |
Source: packages/markitdown/src/markitdown/_markitdown.py
#### Conversion Methods
##### convert(uri: str, **kwargs) -> DocumentConverterResult
The primary conversion method that automatically detects the input type and routes to the appropriate handler.
# Local file
result = md.convert("document.docx")
# URL
result = md.convert("https://example.com/document.pdf")
# File URI
result = md.convert("file:///path/to/document.pdf")
# Data URI
result = md.convert("data:text/plain;base64,SGVsbG8=")
The method supports optional keyword arguments that override the instance defaults for a single conversion call.
Source: packages/markitdown/src/markitdown/_markitdown.py
##### convert_uri(uri: str, **kwargs) -> DocumentConverterResult
Explicitly converts a URI (file, data, or HTTP/HTTPS URL). The convert_url method remains as a deprecated alias for backward compatibility.
result = md.convert_uri("file:///path/to/document.pdf")
Source: packages/markitdown/src/markitdown/_markitdown.py
##### convert_local(file_path: str, **kwargs) -> DocumentConverterResult
Converts a local file by path. This method is preferred when the file source is known to be local, as it bypasses URI parsing.
result = md.convert_local("./documents/report.pdf")
Source: packages/markitdown/src/markitdown/_markitdown.py
##### convert_stream(stream: BinaryIO, stream_info: Optional[StreamInfo] = None, **kwargs) -> DocumentConverterResult
Converts from a file-like object. Use stream_info to provide hints about the file type when the stream lacks inherent type information.
from markitdown import MarkItDown, StreamInfo

md = MarkItDown()
with open("document.pdf", "rb") as f:
    stream_info = StreamInfo(extension=".pdf")
    result = md.convert_stream(f, stream_info=stream_info)
Source: packages/markitdown/src/markitdown/_markitdown.py
##### register_converter(converter: DocumentConverter) -> None
Registers a custom converter. This allows adding support for additional file formats at runtime.
md = MarkItDown()
md.register_converter(MyCustomConverter())
Source: https://github.com/microsoft/markitdown / Human Manual
Supported File Formats
Related topics: Home, Azure Integrations, OCR Plugin
MarkItDown provides a unified interface for converting a wide variety of file formats into Markdown. The conversion pipeline uses a plugin-based architecture where each file format is handled by a dedicated converter. When you call MarkItDown().convert(), the system iterates through registered converters in priority order until one accepts the input.
Overview
MarkItDown supports the following high-level categories of file formats:
| Category | Formats | Primary Converter |
|---|---|---|
| Documents | PDF, DOCX, PPTX, XLSX | Built-in converters using pdfminer.six, mammoth |
| Media | Images (JPEG, PNG, GIF, WebP, BMP, TIFF) | Built-in with EXIF metadata and LLM Vision OCR |
| Media | Audio (MP3, WAV, M4A, OGG, FLAC) | Built-in with EXIF metadata and transcription |
| Web | HTML, Wikipedia, RSS/Atom | Built-in converters using BeautifulSoup |
| Data | CSV, JSON, XML | Built-in converters |
| Archives | ZIP | Built-in converter with recursive processing |
| eBooks | EPUB | Built-in converter |
| Notebooks | Jupyter Notebook (IPYNB) | Built-in converter |
| URLs | YouTube Videos | Built-in converter with transcript extraction |
Source: README.md
Converter Architecture
MarkItDown uses a converter registration system where each format is handled by a class implementing the DocumentConverter interface. Converters are registered with a priority value that determines the order in which they are tried.
graph TD
A[MarkItDown.convert] --> B[Get registered converters]
B --> C[Sort by priority, lowest value first]
C --> D{Loop through converters}
D --> E{Converter.accepts?}
E -->|Yes| F[Converter.convert]
E -->|No| G[Next converter]
F --> H[Return DocumentConverterResult]
G --> D
D -->|All fail| I[UnsupportedFormatException]
Each converter implements two key methods:
- `accepts(file_stream, stream_info, **kwargs)` - Returns `True` if the converter can handle the input
- `convert(file_stream, stream_info, **kwargs)` - Performs the actual conversion to Markdown
Source: packages/markitdown/src/markitdown/_base_converter.py
Document Formats
PDF
PDF conversion extracts text content while preserving document structure, including headings, paragraphs, lists, and tables.
Supported Extensions: .pdf
Dependencies: pdfminer.six
Features:
- Text extraction with layout preservation
- Table extraction with aligned Markdown output
- Heading and list recognition
- Support for numbered and bulleted lists
Known Limitations:
- Scanned PDFs with no extractable text require the `markitdown-ocr` plugin for full-page OCR
- Complex table structures may not convert perfectly (see Issue #293)
- PDF is converted to text/Markdown, not a high-fidelity reproduction (see Issue #296)
Memory Optimization: In version 0.1.6, PDF conversion was fixed to prevent O(n) memory growth by properly calling page.close() after processing each page.
Source: packages/markitdown/src/markitdown/converters/_pdf_converter.py
Microsoft Word (DOCX)
Word documents are converted using the mammoth library, which extracts text and converts it to Markdown.
Supported Extensions: .docx
Dependencies: mammoth
Features:
- Heading extraction and conversion
- Paragraph and text formatting preservation
- Table extraction
- Math equation rendering (OMML to LaTeX)
- Image extraction with optional LLM Vision descriptions
- Linked image handling
Known Limitations:
- The legacy `.doc` format is not supported (see Issue #23)
- Invalid DOCX files return a success result with an error message in `text_content` rather than raising an exception (see Issue #1408)
Source: packages/markitdown/src/markitdown/converters/_docx_converter.py
Microsoft PowerPoint (PPTX)
PowerPoint presentations are converted slide by slide, with each slide rendered as a Markdown section.
Supported Extensions: .pptx
Dependencies: python-pptx
Features:
- Slide-by-slide conversion
- Title and content extraction
- Bullet point preservation
- Image extraction with optional LLM Vision descriptions
- Table extraction
Source: packages/markitdown/src/markitdown/converters/_pptx_converter.py
Microsoft Excel (XLSX)
Excel spreadsheets are converted with each sheet represented as a separate Markdown section with tables.
Supported Extensions: .xlsx, .xlsm
Dependencies: openpyxl
Features:
- Multi-sheet support
- Table extraction with header row identification
- Cell value preservation including formulas (as displayed values)
- Named sheet sections in output
Known Limitations:
- Invalid XLSX files return a success result with an error message rather than raising an exception (see Issue #1408)
Source: packages/markitdown/src/markitdown/converters/_xlsx_converter.py
Media Formats
Images
Images are converted by extracting metadata and optionally generating descriptions using LLM Vision.
Supported Extensions: .jpg, .jpeg, .png, .gif, .webp, .bmp, .tiff, .tif
Dependencies: Pillow (for metadata), LLM client for descriptions
Features:
- EXIF metadata extraction (camera info, GPS, date/time)
- LLM Vision image descriptions (requires `llm_client` and `llm_model`)
- Dimension reporting
Usage Example:
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail."
)
result = md.convert("photo.jpg")
print(result.text_content)
Source: packages/markitdown/src/markitdown/converters/_image_converter.py
Audio
Audio files are converted by extracting metadata and optionally transcribing speech.
Supported Extensions: .mp3, .wav, .m4a, .ogg, .flac
Dependencies: pydub, speech-recognition (for transcription)
Features:
- EXIF/metadata extraction
- Speech-to-text transcription using Google Speech Recognition
- Format detection
Known Limitations:
- On Linux systems, a `RuntimeWarning` may appear if `ffmpeg` or `avconv` is not installed (see Issue #1685). Install ffmpeg to resolve:
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
Source: packages/markitdown/src/markitdown/converters/_audio_converter.py
Web Formats
HTML
HTML files are converted to Markdown using BeautifulSoup with customizable rendering options.
Supported Extensions: .html, .htm
Accepted MIME Types: text/html, application/xhtml+xml
Source: packages/markitdown/src/markitdown/converters/_html_converter.py
Wikipedia
Wikipedia pages are specially handled to extract only the main article content, stripping navigation and sidebar elements.
Accepted URLs: *.wikipedia.org/*
Features:
- Article title extraction
- Main content extraction (excluding sidebars, navigation)
- HTML conversion pipeline
Source: packages/markitdown/src/markitdown/converters/_wikipedia_converter.py
RSS and Atom Feeds
RSS and Atom feeds are converted with each item represented as a section in the output.
Supported Extensions: .rss, .atom, .xml
Accepted MIME Types:
- Precise: `application/rss`, `application/rss+xml`, `application/atom`, `application/atom+xml`
- Candidate: `text/xml`, `application/xml`
Source: packages/markitdown/src/markitdown/converters/_rss_converter.py
Data Formats
CSV
CSV files are converted to Markdown tables with automatic header detection.
Supported Extensions: .csv
Source: packages/markitdown/src/markitdown/converters/_csv_converter.py
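The shape of this conversion can be sketched with the standard `csv` module. This is an illustration only, not the code in `_csv_converter.py`: `csv_to_markdown` is a hypothetical helper that always treats the first row as the header.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows and escape pipes so every line has the same column count.
    padded = [
        [cell.replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    lines = ["| " + " | ".join(padded[0]) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


print(csv_to_markdown("name,qty\napple,3\n"))
```

Escaping the pipe character matters because an unescaped `|` inside a cell would be read as a column separator.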
JSON
JSON files are converted with basic formatting to maintain readability.
Supported Extensions: .json
Source: packages/markitdown/src/markitdown/converters/_json_converter.py
XML
XML files are parsed and converted to readable Markdown format.
Supported Extensions: .xml
Source: packages/markitdown/src/markitdown/converters/_xml_converter.py
Archive Formats
ZIP Files
ZIP archives are recursively processed, with each contained file converted using the appropriate converter.
Supported Extensions: .zip
Accepted MIME Types: application/zip
Features:
- Recursive conversion of all contained files
- File path preservation in output headings
- Support for nested archives
Output Format:
Content from the zip file `example.zip`:
## File: docs/readme.txt
[Content of readme.txt]
## File: images/example.jpg
ImageSize: 1920x1080
Description: [Image description]
## File: data/report.xlsx
[Converted Excel content]
Source: packages/markitdown/src/markitdown/converters/_zip_converter.py
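The output layout above can be reproduced with a short standard-library sketch. It is an assumption-laden stand-in rather than the converter's implementation: `zip_to_markdown` is a hypothetical helper that decodes every member as UTF-8 text instead of dispatching to a per-format converter, and it does not recurse into nested archives.

```python
import io
import zipfile


def zip_to_markdown(data: bytes, archive_name: str) -> str:
    """Emit one '## File:' section per archive member, decoding members as text."""
    sections = [f"Content from the zip file `{archive_name}`:"]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Stand-in for per-file conversion: treat every member as UTF-8 text.
            body = archive.read(info).decode("utf-8", errors="replace")
            sections.append(f"## File: {info.filename}\n{body.strip()}")
    return "\n\n".join(sections)
```

Keeping the member path in each heading lets downstream consumers trace a passage back to the file it came from.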
eBook and Notebook Formats
EPUB
EPUB e-books are converted with chapter-by-chapter extraction.
Supported Extensions: .epub
Features:
- Chapter extraction
- Content preservation
- Metadata extraction
Source: packages/markitdown/src/markitdown/converters/_epub_converter.py
Jupyter Notebooks
Jupyter notebooks (.ipynb files) are converted preserving both code cells and markdown cells.
Supported Extensions: .ipynb
Known Limitations:
- `IpynbConverter.accepts()` may raise `UnicodeDecodeError` on files containing non-ASCII bytes (see Issue #1894). This is particularly relevant when processing files created from non-English PDFs.
Source: packages/markitdown/src/markitdown/converters/_ipynb_converter.py
URL-Based Conversions
YouTube Videos
YouTube video URLs can be converted to extract transcripts and metadata.
Supported URLs: YouTube video and playlist URLs
Features:
- Automatic transcript extraction
- Video metadata (title, description)
- Subtitle/script preservation
Usage:
markitdown "https://www.youtube.com/watch?v=VIDEO_ID"
Source: packages/markitdown/src/markitdown/converters/_youtube_converter.py
Cloud-Based Conversions
Azure Document Intelligence
For PDFs requiring advanced layout analysis, Azure Document Intelligence provides higher-quality extraction.
Installation: pip install 'markitdown[az-doc-intel]'
Usage:
markitdown document.pdf -d -e "<document_intelligence_endpoint>"
Source: packages/markitdown/src/markitdown/converters/_docintel_converter.py
Azure Content Understanding
Azure Content Understanding provides structured field extraction with YAML front matter output.
Installation: pip install 'markitdown[az-content-understanding]'
Supported File Types via CU:
- PDF (with structured extraction)
- Images
- Audio
- Video
Source: packages/markitdown/src/markitdown/converters/_cu_converter.py
Plugin-Extended Formats
OCR Plugin (markitdown-ocr)
The markitdown-ocr plugin extends PDF, DOCX, PPTX, and XLSX converters with LLM Vision OCR for embedded images.
Installation:
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
Supported Formats with OCR:
- PDF (embedded images and scanned pages)
- DOCX (inline images)
- PPTX (inline images)
- XLSX (inline images)
Usage:
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
Important: Issue #1897 reports that the CLI may not recognize the --llm-client and --llm-model arguments. If the command above fails with an "unrecognized arguments" error, pass llm_client and llm_model through the Python API instead.
Source: packages/markitdown-ocr/README.md
Unsupported Formats
The following formats are explicitly not supported:
| Format | Extension | Notes |
|---|---|---|
| Legacy Word | .doc | Only .docx is supported (see Issue #23) |
| OneNote | .one | No current support (see Issue #47) |
| RTF | .rtf | Requires a custom plugin |
Converter Priority System
Each converter has a priority value that determines when it is tried during the conversion process:
| Priority Value | Meaning | Example |
|---|---|---|
| `-1.0` (plugin) | Runs before the built-in converters | markitdown-ocr converters |
| `PRIORITY_SPECIFIC_FILE_FORMAT` (`0.0`) | Specific file format | DOCX, PDF converters |
| `PRIORITY_GENERIC_FILE_FORMAT` (`10.0`) | Generic, near catch-all format | ZIP, HTML, plain text converters |
Converters with lower priority values are tried first. When a converter returns True from accepts(), it is used for conversion.
Feature Comparison by Format
| Format | Text | Tables | Images | Math | Metadata | Notes |
|---|---|---|---|---|---|---|
| PDF | ✓ | ✓ | ✓* | - | ✓ | *With OCR plugin |
| DOCX | ✓ | ✓ | ✓ | ✓ | - | Math via OMML→LaTeX |
| PPTX | ✓ | ✓ | ✓ | - | - | Slide-based |
| XLSX | ✓ | ✓ | ✓ | - | - | Sheet-based |
| Images | - | - | ✓ | - | ✓ | EXIF metadata |
| Audio | ✓** | - | - | - | ✓ | **Transcription |
| HTML | ✓ | ✓ | ✓ | - | - | |
| EPUB | ✓ | ✓ | ✓ | - | ✓ | |
| CSV | - | ✓ | - | - | - | |
| JSON | ✓ | - | - | - | - | Formatted |
See Also
- Installation Guide - Installing MarkItDown and dependencies
- CLI Usage - Command-line interface reference
- Python API - Python library usage
- Plugin Development - Creating custom converters
- Azure Integration - Document Intelligence and Content Understanding
Source: https://github.com/microsoft/markitdown / Human Manual
Azure Integrations
Related topics: Python API Reference, Supported File Formats
MarkItDown provides two Azure-based integration options for enhanced document conversion: Azure Document Intelligence and Azure Content Understanding. Both integrations leverage cloud-based AI services to provide higher-quality extraction than built-in offline converters, but they serve different use cases and offer distinct capabilities.
Overview
MarkItDown's Azure integrations are implemented as separate converter classes that are enabled on demand. Each one requires an optional dependency group that must be installed explicitly:
# Document Intelligence
pip install 'markitdown[docintel]'
# Content Understanding
pip install 'markitdown[az-content-understanding]'
graph TD
A[Input Document] --> B{MarkItDown Instance}
B --> C{Feature Flag Check}
C -->|--use-docintel| D[Document Intelligence Converter]
C -->|--use-cu| E[Content Understanding Converter]
C -->|No Azure flag| F[Built-in Converters]
D --> G[Azure Document Intelligence Service]
E --> H[Azure Content Understanding Service]
F --> I[Offline PDF/Office Extractors]
G --> J[Markdown Output]
H --> K[Markdown + YAML Front Matter]
I --> J
Source: packages/markitdown/src/markitdown/__main__.py:86-130
Azure Document Intelligence
Azure Document Intelligence (formerly Form Recognizer) provides cloud-based layout analysis and OCR for document conversion. It is particularly useful for scanned PDFs and complex document layouts.
Supported File Types
| File Type | Description | Notes |
|---|---|---|
| `pdf` | PDF documents | Full OCR support for scanned documents |
| `docx` | Word documents | Enhanced layout preservation |
| `pptx` | PowerPoint presentations | Slide structure extraction |
| `xlsx` | Excel spreadsheets | Table and data extraction |
| `html` | HTML documents | Markup interpretation |
Source: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:1-50
Installation and Configuration
- Create an Azure Document Intelligence resource in the Azure portal.
- Obtain the endpoint URL and API key (or configure managed identity).
- Install the optional dependency group:
pip install 'markitdown[docintel]'
CLI Usage
markitdown document.pdf -o output.md --use-docintel -e "https://<your-resource>.cognitiveservices.azure.com/"
| Argument | Short | Description |
|---|---|---|
| `--use-docintel` | `-d` | Enable Document Intelligence converter |
| `--endpoint` | `-e` | Document Intelligence endpoint URL |
Source: packages/markitdown/src/markitdown/__main__.py:86-100
Python API
from markitdown import MarkItDown
# Only the endpoint is passed here; authentication (API key or managed identity) must be configured separately
md = MarkItDown(docintel_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("document.pdf")
print(result.text_content)
The MarkItDown class accepts the following Document Intelligence parameters:
| Parameter | Type | Description |
|---|---|---|
| `docintel_endpoint` | str | Azure Document Intelligence endpoint URL |
| `docintel_api_key` | str | API key for authentication |
| `docintel_use_custom_model` | bool | Whether to use a custom model |
| `docintel_model_id` | str | Custom model identifier |
Source: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:100-150
Internal Architecture
The Document Intelligence converter (_doc_intel_converter.py) operates as follows:
sequenceDiagram
participant App as MarkItDown
participant DI as DocumentIntelligenceClient
participant Azure as Azure Document Intelligence Service
App->>DI: Create client with endpoint + credentials
App->>DI: analyze_document(file_stream, features=[OCR, STYLE])
DI->>Azure: POST request with document
Azure-->>DI: AnalyzeResult with markdown content
DI-->>App: DocumentConverterResult with markdown
The converter uses the DocumentAnalysisFeature enum to enable OCR and layout analysis:
from azure.ai.documentintelligence.models import DocumentAnalysisFeature
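# The SDK enum has no plain OCR member; OCR_HIGH_RESOLUTION is the OCR-related feature
features = [DocumentAnalysisFeature.OCR_HIGH_RESOLUTION]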
Source: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:50-100
Azure Content Understanding
Azure Content Understanding provides higher-quality, multi-modal extraction with structured field output. It supports documents, images, audio, and video files through prebuilt or custom analyzers.
Key Capabilities
| Capability | Description |
|---|---|
| Multi-modal support | Documents, images, audio, and video |
| Structured field extraction | YAML front matter from analyzer fields |
| Prebuilt analyzers | Domain-specific extraction (search, contracts, etc.) |
| Custom analyzers | User-defined extraction patterns |
| Cloud-based OCR | Higher-quality text recognition for scanned documents |
Source: packages/markitdown/src/markitdown/converters/_cu_converter.py:1-50
When to Use Content Understanding
| Use Case | Recommendation |
|---|---|
| Scanned PDFs requiring high-quality OCR | Use Content Understanding |
| Complex tables and multi-page documents | Use Content Understanding |
| Audio transcription | Use Content Understanding |
| Video analysis | Use Content Understanding |
| Domain-specific field extraction | Use Content Understanding with custom analyzer |
| Simple text extraction | Use built-in converters |
| Basic audio transcription | Built-in converters are sufficient |
Installation
pip install 'markitdown[az-content-understanding]'
CLI Usage
markitdown document.pdf --use-cu --cu-endpoint "https://<your-resource>.cognitiveservices.azure.com/"
| Argument | Description |
|---|---|
| `--use-cu` | Enable Content Understanding converter |
| `--cu-endpoint` | Content Understanding endpoint URL |
Source: packages/markitdown/src/markitdown/__main__.py:100-130
Python API
Zero-config usage (auto-selects analyzer):
from markitdown import MarkItDown
md = MarkItDown(cu_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("report.pdf")
print(result.markdown)
With a custom analyzer:
from markitdown import MarkItDown
from markitdown.converters._cu_converter import ContentUnderstandingFileType
md = MarkItDown(
    cu_endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    cu_file_types=[ContentUnderstandingFileType.PDF],
    cu_analyzer_id="your-custom-analyzer-id",
)
result = md.convert("contract.pdf")
print(result.markdown)
Source: README.md
Supported File Types
The Content Understanding converter supports automatic analyzer selection based on file type:
| File Type | Auto-selected Analyzer |
|---|---|
| PDF, DOCX, PPTX | prebuilt-documentSearch |
| Images (JPG, PNG) | prebuilt-documentSearch |
| Video (MP4, MOV) | prebuilt-videoSearch |
| Audio (WAV, MP3) | prebuilt-audioSearch |
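The mapping in the table above can be pictured as a lookup keyed on file extension. The sketch below is illustrative only: `select_analyzer` and `ANALYZER_BY_EXTENSION` are hypothetical names, the extension list covers just the formats named in the table, and the converter's real selection logic may differ.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup built from the table above; not the converter's own code.
ANALYZER_BY_EXTENSION = {
    ".pdf": "prebuilt-documentSearch",
    ".docx": "prebuilt-documentSearch",
    ".pptx": "prebuilt-documentSearch",
    ".jpg": "prebuilt-documentSearch",
    ".png": "prebuilt-documentSearch",
    ".mp4": "prebuilt-videoSearch",
    ".mov": "prebuilt-videoSearch",
    ".wav": "prebuilt-audioSearch",
    ".mp3": "prebuilt-audioSearch",
}


def select_analyzer(filename: str) -> Optional[str]:
    """Return the analyzer ID for a filename, or None if the type is not listed."""
    # Lower-case the suffix so that "clip.MP4" and "clip.mp4" behave the same.
    return ANALYZER_BY_EXTENSION.get(Path(filename).suffix.lower())
```

Returning `None` for unlisted types leaves the caller free to fall back to a built-in converter or to require an explicit analyzer ID.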
Output Format
Content Understanding output includes YAML front matter with extracted fields.
OCR Plugin
Related topics: Supported File Formats, Plugin Development Guide
The MarkItDown OCR Plugin (markitdown-ocr) is an official plugin that adds Optical Character Recognition (OCR) capabilities to MarkItDown. It extends the built-in converters for PDF, DOCX, PPTX, and XLSX files to extract text from embedded images using LLM Vision, providing enhanced document conversion for files containing scanned content or images with text.
Overview
MarkItDown's built-in converters handle native text extraction from documents, but many documents contain text embedded in images rather than as accessible text. The OCR plugin addresses this gap by:
- Extracting images from document formats (PDF, DOCX, PPTX, XLSX)
- Sending images to LLM Vision models for text extraction
- Inserting extracted text inline with document content in reading order
- Providing full-page OCR fallback for scanned PDFs
Source: packages/markitdown-ocr/README.md
Features
The OCR plugin provides the following capabilities:
| Feature | Description |
|---|---|
| Enhanced PDF Converter | Extracts text from images within PDFs, with full-page OCR fallback for scanned documents |
| Enhanced DOCX Converter | OCR for images embedded in Word documents |
| Enhanced PPTX Converter | OCR for images embedded in PowerPoint presentations |
| Enhanced XLSX Converter | OCR for images embedded in Excel spreadsheets |
| Context Preservation | Maintains document structure and flow when inserting extracted text |
| LLM Vision Integration | Uses OpenAI-compatible LLM clients for image-to-text conversion |
| Malformed PDF Handling | Retries problematic PDFs with PyMuPDF when pdfplumber/pdfminer fail |
Source: packages/markitdown-ocr/README.md
Architecture
Plugin Registration Flow
The OCR plugin uses MarkItDown's plugin architecture with a priority-based replacement strategy. Converters are registered at priority -1.0, which causes them to run before the built-in converters at priority 0.0, effectively replacing the standard conversion behavior when the plugin is enabled.
graph TD
A[User Creates MarkItDown Instance<br/>enable_plugins=True] --> B[MarkItDown Discovers Plugin<br/>via markitdown.plugin entry point]
B --> C[Calls register_converters<br/>with all kwargs]
C --> D[Plugin Creates LLMVisionOCRService<br/>from llm_client/llm_model]
D --> E[Registers 4 OCR Converters<br/>at priority -1.0]
E --> F[Built-in Converters Remain<br/>at priority 0.0 as fallback]
G[File Conversion Request] --> H{Which Converter<br/>Accepts First?}
H -->|OCR Converter| I[Extract Images from Document]
H -->|Built-in Converter| J[Standard Conversion]
I --> K[Send Images to LLM Vision]
K --> L[Insert Extracted Text Inline]
L --> M[Return Markdown Result]
J --> N[Return Standard Markdown]
Source: packages/markitdown-ocr/src/markitdown_ocr/_plugin.py
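The replacement strategy described above can be sketched as a priority-ordered dispatch. This is a simplified model, not MarkItDown's implementation: `Registration` and `pick_converter` are hypothetical names, and the sketch assumes only what the text states, namely that a converter at -1.0 is tried before one at 0.0.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    name: str
    priority: float
    accepts: Callable[[str], bool]  # simplified: decides on the file extension only


def pick_converter(registrations: List[Registration], extension: str) -> Optional[str]:
    """Return the name of the first accepting converter, lowest priority value first."""
    # sorted() is stable, so in this sketch equal priorities keep registration order.
    for registration in sorted(registrations, key=lambda r: r.priority):
        if registration.accepts(extension):
            return registration.name
    return None


registry = [
    Registration("builtin-pdf", 0.0, lambda ext: ext == ".pdf"),
    Registration("ocr-pdf", -1.0, lambda ext: ext == ".pdf"),
    Registration("builtin-docx", 0.0, lambda ext: ext == ".docx"),
]
```

With this registry, a `.pdf` request resolves to `ocr-pdf` even though the built-in converter was registered first, while `.docx` still reaches the built-in converter because no OCR entry claims it.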
Component Overview
| Component | File | Purpose |
|---|---|---|
| `LLMVisionOCRService` | `_ocr_service.py` | Core OCR service that handles LLM Vision API calls |
| `PdfConverterWithOCR` | `_pdf_converter_with_ocr.py` | Enhanced PDF converter with image extraction and full-page OCR |
| `DocxConverterWithOCR` | `_docx_converter_with_ocr.py` | Enhanced DOCX converter with image extraction |
| `PptxConverterWithOCR` | `_pptx_converter_with_ocr.py` | Enhanced PPTX converter with image extraction |
| `XlsxConverterWithOCR` | `_xlsx_converter_with_ocr.py` | Enhanced XLSX converter with image extraction |
Source: packages/markitdown-ocr/src/markitdown_ocr/_plugin.py
Installation
Prerequisites
- Python 3.10 or higher
- MarkItDown core package installed
- An OpenAI-compatible LLM client (e.g., `openai`, `AzureOpenAI`)
Install the Plugin
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
To verify installation and see available plugins:
markitdown --list-plugins
Source: packages/markitdown-ocr/README.md
Usage
Command-Line Interface
Enable the OCR plugin using the --use-plugins flag along with LLM configuration:
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
[!IMPORTANT]
The `--llm-client` and `--llm-model` arguments must be passed when using the OCR plugin via the CLI. Without an `llm_client`, the plugin loads but OCR is silently skipped, falling back to the standard built-in converter.
Source: packages/markitdown-ocr/README.md
Python API
#### Basic Usage with OpenAI
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
Source: packages/markitdown-ocr/README.md
#### Using Azure OpenAI
from markitdown import MarkItDown
from openai import AzureOpenAI
md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="your-api-key",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.docx")
print(result.text_content)
Source: packages/markitdown-ocr/README.md
#### Custom Extraction Prompt
Override the default prompt for specialized document types:
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)
result = md.convert("document_with_tables.pdf")
Source: packages/markitdown-ocr/README.md
#### Fallback Behavior
If no llm_client is provided, the plugin still loads but OCR is silently skipped:
from markitdown import MarkItDown

# Plugin loads but OCR is skipped; conversion falls back to the standard converter
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")  # Standard conversion without OCR
Source: packages/markitdown-ocr/README.md
Configuration Options
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `enable_plugins` | bool | Yes | False | Enable plugin loading |
| `llm_client` | OpenAI-compatible | Yes* | None | LLM client for Vision OCR |
| `llm_model` | str | Yes* | None | Model name (e.g., gpt-4o) |
| `llm_prompt` | str | No | System default | Custom prompt for text extraction |
*Required for OCR to function; without these, the plugin falls back to standard converters.
Source: packages/markitdown-ocr/src/markitdown_ocr/_plugin.py
Supported File Formats
| PDF Type | OCR Behavior |
|---|---|
| Text-based PDFs | Extracts embedded images and OCRs them inline with surrounding text |
| Scanned PDFs | Detected automatically when no extractable text exists; each page rendered at 300 DPI and sent to LLM |
| Malformed PDFs | Retried with PyMuPDF rendering if pdfplumber/pdfminer fail |
Source: packages/markitdown-ocr/README.md
Office Documents (DOCX, PPTX, XLSX)
For Office Open XML formats, images are extracted via document part relationships and OCR is performed before the conversion pipeline:
- Images are extracted from the document archive
- Each image is processed through the LLM Vision OCR service
- Placeholder tokens are injected into the content
- The standard conversion pipeline executes with OCR placeholders preserved
Source: packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py
How It Works
PDF Conversion Flow
graph TD
A[PDF Document Input] --> B{Contains extractable text?}
B -->|Yes| C[Extract text from PDF]
B -->|No| D[Detected as Scanned PDF]
C --> E{Contains embedded images?}
E -->|Yes| F[Extract images by position]
E -->|No| G[Return standard result]
F --> H[Send each image to LLM Vision]
D --> I[Render each page at 300 DPI]
I --> H
H --> J[Interleave extracted text<br/>with OCR results in reading order]
J --> K[Return Markdown with<br/>OCR text inline]
G --> K
Source: packages/markitdown-ocr/README.md
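The interleaving step in the flow above amounts to merging two lists of positioned blocks and sorting them into reading order. The following is a rough sketch under stated assumptions: `Block` and `interleave` are hypothetical names, each block is assumed to carry a page index and a vertical offset measured from the top of the page, and multi-column layouts are ignored.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Block:
    page: int    # zero-based page index
    top: float   # distance from the top of the page; smaller means higher up
    text: str


def interleave(native: Iterable[Block], ocr: Iterable[Block]) -> str:
    """Merge natively extracted text with OCR results in top-to-bottom page order."""
    ordered = sorted([*native, *ocr], key=lambda block: (block.page, block.top))
    # Skip blocks whose text is empty, for example images where OCR found nothing.
    return "\n\n".join(block.text.strip() for block in ordered if block.text.strip())
```

Sorting on `(page, top)` keeps an image's OCR text between the paragraphs that surround it on the page, which is the behavior the flow describes as inserting OCR text inline.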
DOCX/PPTX/XLSX Conversion Flow
graph TD
A[Office Document Input] --> B[Extract images from<br/>document part relationships]
B --> C[Process each image<br/>through LLM Vision OCR]
C --> D[Generate placeholder tokens<br/>MARKITDOWNOCRBLOCK{id}]
D --> E[Inject placeholders into<br/>HTML/Document content]
E --> F[Run standard conversion pipeline]
F --> G[Replace placeholders with<br/>OCR-extracted text]
G --> H[Return Markdown result]
Source: packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py
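The placeholder round trip in the flow above can be illustrated with a small sketch. The token prefix comes from the diagram; everything else is an assumption made for illustration, including the helper names `make_token` and `restore_ocr_text` and the choice of a decimal integer as the block ID.

```python
import re
from typing import Dict

# Prefix taken from the diagram; a numeric ID is assumed for this sketch.
TOKEN_PATTERN = re.compile(r"MARKITDOWNOCRBLOCK(\d+)")


def make_token(block_id: int) -> str:
    """Build the placeholder that stands in for one image's OCR text."""
    return f"MARKITDOWNOCRBLOCK{block_id}"


def restore_ocr_text(markdown: str, ocr_texts: Dict[int, str]) -> str:
    """Swap every placeholder for its OCR text; unknown IDs become empty strings."""
    def substitute(match: "re.Match[str]") -> str:
        return ocr_texts.get(int(match.group(1)), "")

    # Matching the full digit run avoids confusing block 1 with block 10.
    return TOKEN_PATTERN.sub(substitute, markdown)
```

A plain-text token is used because it has to survive the standard conversion pipeline unchanged, so that the final substitution can find it in the Markdown output.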
Common Issues and Troubleshooting
Unrecognized Arguments Error
If you encounter an "Unrecognized Arguments" error when using CLI arguments like --llm-client and --llm-model, ensure you have installed the correct version of the plugin. The CLI example shown in documentation may differ between versions.
Workaround: Use the Python API for more reliable argument handling.
Source: Community Issue #1897
Office Open XML Validation
When converting invalid DOCX, XLSX, or PPTX files, MarkItDown may return a successful result containing the message "This is not a valid Office Open XML file." in text_content rather than raising an exception. This is a known limitation of the underlying converters.
Workaround: Validate files before conversion or check text_content for error strings.
Source: Community Issue #1408
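Both workarounds can be approximated with the standard library. The sketch below assumes only two things: that Office Open XML packages are ZIP archives containing a `[Content_Types].xml` part, and that the error string is the one quoted above. The helper names `looks_like_ooxml` and `reports_ooxml_error` are hypothetical.

```python
import io
import zipfile

OOXML_ERROR = "This is not a valid Office Open XML file."


def looks_like_ooxml(data: bytes) -> bool:
    """Cheap pre-check: a ZIP archive that contains the content-types part."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return "[Content_Types].xml" in archive.namelist()


def reports_ooxml_error(text_content: str) -> bool:
    """Post-check: detect the error message returned in place of real content."""
    return OOXML_ERROR in text_content
```

The pre-check is deliberately shallow. It rejects obviously wrong inputs before conversion but does not prove that the package is a well-formed DOCX, XLSX, or PPTX file, so the post-check is still worth keeping.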
LLM Call Failures
If an LLM call fails during OCR processing, the conversion continues without that specific image's text. The plugin is designed to be resilient to partial failures.
Source: packages/markitdown-ocr/README.md
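The resilience described above follows a common pattern: wrap each per-image call so that one failure does not abort the batch. A minimal sketch, with the hypothetical helper `ocr_images` and a caller-supplied `ocr` callable standing in for the LLM Vision request:

```python
from typing import Callable, Dict, Mapping


def ocr_images(images: Mapping[str, bytes], ocr: Callable[[bytes], str]) -> Dict[str, str]:
    """Run OCR on every image, omitting the ones whose call raises."""
    texts: Dict[str, str] = {}
    for image_id, data in images.items():
        try:
            texts[image_id] = ocr(data)
        except Exception:
            # A failed call only costs this image's text; the rest still convert.
            continue
    return texts
```

Comparing the keys of the result with the keys of the input shows which images were skipped, which is useful when silent omissions would otherwise go unnoticed.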
Memory Management
Recent releases (v0.1.6+) address O(n) memory growth during PDF conversion by properly calling page.close() after processing each PDF page. Ensure you are running the latest version for optimal memory efficiency.
Source: Release Notes v0.1.6
See Also
- MarkItDown Documentation — Main project documentation
- Sample Plugin — Reference implementation for creating custom plugins
- Azure Document Intelligence — Alternative cloud-based document conversion
- Azure Content Understanding — Higher-quality multi-modal extraction with structured field output
- Plugin Architecture — Understanding MarkItDown's plugin system
MCP Server
Related topics: Python API Reference, Plugin Development Guide
# MCP Server
The MarkItDown MCP (Model Context Protocol) Server is a component that enables AI coding assistants and LLM-powered tools to directly utilize MarkItDown's document conversion capabilities through the standardized MCP protocol. This integration allows AI assistants to process documents without requiring custom tool implementations or external script execution.
## Overview
The MCP Server acts as a bridge between AI assistants (such as Claude, Cursor, or other MCP-compatible clients) and the MarkItDown Python library. It exposes MarkItDown's conversion functionality as MCP tools that AI assistants can invoke directly.
graph LR
AI[AI Assistant<br/>Claude, Cursor, etc.] -->|MCP Protocol| MCP[MarkItDown<br/>MCP Server]
MCP -->|Convert| MD[MarkItDown<br/>Python Library]
MD -->|Documents| DOC[PDF, DOCX, PPTX<br/>XLSX, Images, etc.]
DOC -->|Markdown| AI
**Key capabilities provided through MCP:**
- Convert local files to Markdown format
- Process files from URLs (HTTP/HTTPS)
- Process data URIs (base64-encoded content)
- Support for Azure Document Intelligence integration
- Plugin system integration for extended formats
- Configurable LLM clients for image descriptions
Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/README.md)
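For the data URI capability listed above, a client has to base64-encode the bytes itself before handing them to the server. The sketch below shows one way to build such a URI with the standard library; `to_data_uri` is a hypothetical helper, and the default MIME type is an assumption.

```python
import base64


def to_data_uri(data: bytes, mime_type: str = "application/octet-stream") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Base64 inflates the payload by roughly a third, so data URIs are best reserved for small inputs that are not reachable by path or URL.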
## Installation
### From PyPI
The MCP server can be installed as a separate package:
pip install markitdown-mcp
### Using Docker
Pre-built Docker images are available for isolated execution:
docker pull ghcr.io/microsoft/markitdown-mcp:latest
To run the container:
docker run --rm -i ghcr.io/microsoft/markitdown-mcp:latest
Source: [packages/markitdown-mcp/Dockerfile](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/Dockerfile)
### From Source
For development or customization:
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown-mcp
## Configuration
### Environment Variables
The MCP server reads configuration from environment variables, allowing flexible deployment without code changes.
| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `MARKITDOWN_ENABLE_PLUGINS` | Enable 3rd-party plugin support | `false` |
| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure Document Intelligence endpoint URL | Not set |
| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure Document Intelligence API key | Not set |
The server respects the `MARKITDOWN_ENABLE_PLUGINS` environment variable during initialization, allowing plugin support to be toggled without modifying the server code.
Source: [packages/markitdown-mcp/src/markitdown_mcp/__init__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__init__.py)
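Reading a boolean flag from the environment usually means normalizing a string. The sketch below shows one plausible way to interpret `MARKITDOWN_ENABLE_PLUGINS`; the helper name `plugins_enabled` and the set of accepted truthy spellings are assumptions, so check the server source for the exact values it recognizes.

```python
import os
from typing import Mapping

# Assumed truthy spellings; the server may accept a different set.
TRUTHY = {"1", "true", "yes"}


def plugins_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Interpret MARKITDOWN_ENABLE_PLUGINS, defaulting to disabled when unset."""
    value = environ.get("MARKITDOWN_ENABLE_PLUGINS", "false")
    return value.strip().lower() in TRUTHY
```

Accepting the environment as a parameter keeps the function easy to test without mutating `os.environ`.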
### Server Startup Options
When running the MCP server directly, several startup parameters control its behavior.
| Option | Description |
|--------|-------------|
| `--host` | Host address to bind the server (default: `127.0.0.1`) |
| `--port` | Port number to listen on (default: `8000`) |
| `--log-level` | Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
> [!WARNING]
> The server binds to localhost by default. Binding to non-local interfaces (`0.0.0.0`) should only be done in trusted environments with proper network access controls.
Source: [packages/markitdown-mcp/src/markitdown_mcp/__main__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__main__.py)
## MCP Tools
The server exposes the following tools to MCP clients:
### `markitdown_convert`
Converts a document to Markdown format.
**Parameters:**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `source` | string | Yes | File path, URL, or data URI to convert |
| `use_plugins` | boolean | No | Enable 3rd-party plugins (default: `false`) |
**Returns:** Markdown-formatted text content extracted from the document.
### `markitdown_list_plugins`
Lists all installed 3rd-party plugins available to MarkItDown.
**Parameters:** None required
**Returns:** List of installed plugin names and their package locations.
### `markitdown_get_version`
Returns the current version of MarkItDown.
**Parameters:** None required
**Returns:** Version string (e.g., `"0.1.6"`).
Source: [packages/markitdown-mcp/src/markitdown_mcp/__init__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__init__.py)
## Architecture
### Component Flow
sequenceDiagram
participant Client as AI Assistant
participant MCPServer as MCP Server
participant MarkItDown as MarkItDown Core
participant Plugin as Plugin System
participant Azure as Azure Services
Client->>MCPServer: Call markitdown_convert(source)
MCPServer->>MarkItDown: Forward conversion request
alt Plugin Enabled
    MCPServer->>Plugin: Load enabled plugins
    Plugin->>MarkItDown: Register custom converters
end
alt Azure Document Intelligence
    MarkItDown->>Azure: Call Document Intelligence API
    Azure-->>MarkItDown: Extracted content
end
MarkItDown-->>MCPServer: Markdown result
MCPServer-->>Client: Return text_content
### Request Processing Pipeline
1. **Client Request**: AI assistant invokes an MCP tool with parameters
2. **Server Validation**: MCP server validates input parameters
3. **MarkItDown Initialization**: Creates `MarkItDown` instance with appropriate configuration
4. **Format Detection**: Determines file type from extension or URI scheme
5. **Conversion**: Routes to appropriate converter based on file type
6. **Result Serialization**: Returns markdown content via MCP protocol
Source: [packages/markitdown-mcp/src/markitdown_mcp/__main__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__main__.py)
## Supported Input Formats
The MCP server inherits all format support from the core MarkItDown library:
| Category | Formats |
|----------|---------|
| **Documents** | PDF, DOCX, PPTX, XLSX, ODT |
| **Images** | JPG, PNG, GIF, BMP, WebP (with OCR and EXIF extraction) |
| **Audio** | MP3, WAV, FLAC (with EXIF and transcription) |
| **Web** | HTML, Wikipedia pages |
| **Data** | CSV, JSON, XML, RSS/Atom feeds |
| **Archives** | ZIP (iterates over contents) |
| **E-books and notebooks** | EPUB, Jupyter notebooks |
| **Video** | Via Azure Content Understanding (when configured) |
Source: [README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/README.md)
## Integration with Azure Services
### Azure Document Intelligence
When `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` is configured, the MCP server can leverage Azure's cloud-based document extraction for higher quality results:
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-api-key"
This is particularly beneficial for:
- Scanned PDF documents
- Complex table structures
- Handwritten content
- Multi-language documents
Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/README.md)
### LLM-Based Image Processing
When the MCP server is used with MarkItDown's built-in LLM support, image descriptions can be generated for embedded images in documents:
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
This same configuration pattern applies when initializing the MCP server, enabling AI assistants to get both document conversion and intelligent image descriptions.
## Security Considerations
> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like `open()` or `requests.get()`, it accesses resources that the process itself can access.
**Recommendations for untrusted environments:**
1. **Sanitize inputs**: Validate file paths and URLs before passing to the MCP server
2. **Use narrow conversion functions**: Prefer `convert_stream()` or `convert_local()` when possible
3. **Restrict network access**: Run the MCP server in a sandboxed environment
4. **Limit file access**: Use container isolation (Docker) to restrict filesystem access
Source: [README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/README.md)
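The first recommendation can be made concrete with a small allow-list check that runs before a source string ever reaches the server. This is a sketch under stated assumptions, not a complete defense: `is_allowed_url` and `is_within_root` are hypothetical helpers, the HTTPS-only policy is one possible choice, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path
from typing import Iterable
from urllib.parse import urlparse


def is_allowed_url(url: str, allowed_hosts: Iterable[str]) -> bool:
    """Accept only HTTPS URLs whose host is on an explicit allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in set(allowed_hosts)


def is_within_root(root: Path, candidate: str) -> bool:
    """Accept only paths that stay inside root after resolving '..' and symlinks."""
    resolved_root = root.resolve()
    target = (resolved_root / candidate).resolve()
    return target.is_relative_to(resolved_root)
```

Resolving before comparing is what defeats `../` traversal and absolute-path inputs. Checks like these complement, rather than replace, the sandboxing and container isolation listed above.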
## Usage Examples
### Claude Desktop Integration
Add to your Claude Desktop configuration:
{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp",
      "args": ["--host", "127.0.0.1", "--port", "8000"]
    }
  }
}
### Python Client Usage
# Example MCP client calling markitdown_convert.
# call_mcp_tool is a placeholder for your MCP client's tool-invocation coroutine.

# Tool call to convert a PDF
tool_request = {
    "name": "markitdown_convert",
    "arguments": {
        "source": "/path/to/document.pdf",
        "use_plugins": True,
    },
}


async def convert_pdf() -> str:
    # Process response
    response = await call_mcp_tool(tool_request)
    return response["text_content"]
### Using with Azure Document Intelligence
# Set environment variables
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="YOUR-KEY"
# Run MCP server with Azure integration
markitdown-mcp --host 127.0.0.1 --port 8000
Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/README.md)
## Troubleshooting
### Server Connection Issues
| Symptom | Solution |
|---------|----------|
| Connection refused | Ensure server is running and `--host`/`--port` are correct |
| Timeout errors | Check firewall rules; server binds to localhost by default |
| Plugin not found | Install plugin package and set `MARKITDOWN_ENABLE_PLUGINS=true` |
### Conversion Failures
| Error | Cause | Resolution |
|-------|-------|------------|
| Unsupported format | File type not recognized | Check supported formats list |
| Plugin error | Plugin failed during conversion | Enable debug logging; check plugin compatibility |
| Azure auth failure | Invalid credentials | Verify `AZURE_DOCUMENT_INTELLIGENCE_KEY` |
### Common Issues
**Issue**: `UnicodeDecodeError` on non-ASCII files
**Context**: Reported in [GitHub Issue #1894](https://github.com/microsoft/markitdown/issues/1894)
**Status**: This affects the core library; ensure you're using the latest version.
**Issue**: Office Open XML files return success with error message
**Context**: Reported in [GitHub Issue #1408](https://github.com/microsoft/markitdown/issues/1408)
**Note**: Invalid DOCX/XLSX/PPTX files may return `"This is not a valid Office Open XML file."` in text_content rather than raising an exception.
**Issue**: RuntimeWarning about ffmpeg on Linux
**Context**: Reported in [GitHub Issue #1685](https://github.com/microsoft/markitdown/issues/1685)
**Resolution**: Install ffmpeg system package for audio conversion features
## See Also
- [Main README](README.md) - Project overview and core documentation
- [Plugin Development](markitdown-sample-plugin.md) - Guide to creating custom plugins
- [OCR Plugin](markitdown-ocr.md) - LLM Vision OCR for embedded images
- [Azure Content Understanding](markitdown-content-understanding.md) - Advanced cloud-based extraction
Plugin Development Guide
Related topics: Architecture Overview, OCR Plugin, MCP Server
This guide explains how to create custom plugins for MarkItDown to extend its document conversion capabilities. The plugin architecture allows developers to add support for new file formats or override existing converters with custom implementations.
Overview
MarkItDown uses a plugin-based architecture that enables third-party developers to extend its document conversion capabilities. Plugins can register custom DocumentConverter implementations that handle specific file formats. The system supports:
- New file format support — Add converters for formats not natively supported (e.g., RTF, EPUB variants)
- Converter replacement — Override built-in converters with priority-based precedence
- LLM integration — Pass through LLM client credentials to enable AI-powered features in plugins
Plugins are discovered via Python entry points and loaded dynamically when MarkItDown(enable_plugins=True) is instantiated. Source: packages/markitdown-sample-plugin/README.md
Architecture
The plugin system is built around the following core components:
graph TD
A[User Code] -->|MarkItDown enable_plugins=True| B[MarkItDown Core]
B -->|Discovers via entry_points| C[Plugin Entry Point Group<br/>markitdown.plugin]
C --> D[Plugin Package<br/>markitdown_sample_plugin]
D -->|Calls| E[register_converters function]
E -->|Registers| F[DocumentConverter Subclass<br/>RtfConverter]
B -->|Stores| F
F --> G[Conversion Pipeline]
G --> H[Markdown Output]
I[LLM Client<br/>llm_client, llm_model] -.->|Forwarded via kwargs| E
Component Responsibilities
| Component | Responsibility | Source |
|---|---|---|
| `MarkItDown` | Core orchestrator, plugin discovery, converter dispatch | packages/markitdown/src/markitdown/_markitdown.py |
| `DocumentConverter` | Base class for all converters | packages/markitdown/src/markitdown/_base_converter.py |
| Entry Point Group | Plugin discovery mechanism (`markitdown.plugin`) | packages/markitdown-sample-plugin/README.md |
| `register_converters()` | Plugin callback to register converters | packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py |
Plugin Interface
Interface Version
Plugins must declare the interface version they target. Currently, only version 1 is supported:
__plugin_interface_version__ = 1
Source: packages/markitdown-sample-plugin/README.md
Required Exports
A valid plugin package must export:
| Symbol | Type | Description |
|---|---|---|
| `__plugin_interface_version__` | int | Plugin interface version (must be 1) |
| `register_converters(markitdown, **kwargs)` | function | Called to register converters |
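The two required exports can be checked mechanically before a plugin is published. The sketch below is a hypothetical pre-flight check, not MarkItDown's loader: `is_valid_plugin` is an invented name, and the stand-in modules are built with `SimpleNamespace` purely for demonstration.

```python
from types import SimpleNamespace

SUPPORTED_INTERFACE_VERSION = 1


def is_valid_plugin(module) -> bool:
    """Check that a module-like object exposes both required plugin exports."""
    version = getattr(module, "__plugin_interface_version__", None)
    register = getattr(module, "register_converters", None)
    return version == SUPPORTED_INTERFACE_VERSION and callable(register)


# Stand-ins for imported plugin modules, used only to exercise the check.
good_plugin = SimpleNamespace(
    __plugin_interface_version__=1,
    register_converters=lambda markitdown, **kwargs: None,
)
missing_callback = SimpleNamespace(__plugin_interface_version__=1)
```

Here `is_valid_plugin(good_plugin)` passes, while `missing_callback` fails because it declares a version but no `register_converters` callable.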
Creating a Document Converter
Step 1: Subclass DocumentConverter
Create a new converter class that inherits from DocumentConverter:
from typing import BinaryIO, Any
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    def __init__(
        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
    ):
        super().__init__(priority=priority)
Source: packages/markitdown-sample-plugin/README.md
Step 2: Implement the `accepts()` Method
The accepts() method determines whether this converter can handle a given file stream. Return True if the converter should process the file:
def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool:
    # Peek at the leading bytes and compare them with the RTF signature.
    # The signature is five bytes long, so read five bytes rather than four.
    cur_pos = file_stream.tell()  # Remember where the stream started
    header = file_stream.read(5)
    file_stream.seek(cur_pos)  # Restore the position for convert()
    return header.startswith(b'{\\rtf')
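The signature check can be tried on in-memory streams without installing anything. This is a minimal sketch using only the standard library. `looks_like_rtf` is a hypothetical helper, not part of markitdown. Note that the RTF signature is five bytes long, so the read must cover at least five bytes for the comparison to succeed.

```python
import io
from typing import BinaryIO

RTF_SIGNATURE = b"{\\rtf"  # five bytes: '{', '\', 'r', 't', 'f'


def looks_like_rtf(file_stream: BinaryIO) -> bool:
    """Peek at the signature and leave the stream where it was found."""
    cur_pos = file_stream.tell()
    try:
        header = file_stream.read(len(RTF_SIGNATURE))
    finally:
        file_stream.seek(cur_pos)  # restore even if read() raises
    return header == RTF_SIGNATURE


rtf = io.BytesIO(b"{\\rtf1\\ansi Hello}")
text = io.BytesIO(b"plain text")
print(looks_like_rtf(rtf), rtf.tell())
print(looks_like_rtf(text), text.tell())
```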
Step 3: Implement the `convert()` Method
The convert() method performs the actual conversion from the file format to Markdown:
def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    # Read and decode the RTF content
    content = file_stream.read().decode('utf-8', errors='replace')
    # Convert RTF to Markdown; _rtf_to_markdown() is a placeholder for
    # your own implementation
    markdown_content = self._rtf_to_markdown(content)
    # In markitdown 0.1.x the result takes `markdown` (and an optional
    # `title`); it has no `metadata` parameter
    return DocumentConverterResult(markdown=markdown_content)
Priority System
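MarkItDown uses priority values to decide the order in which registered converters are tried. Two constants are defined:
| Priority Constant | Value | Use Case |
|---|---|---|
| `PRIORITY_SPECIFIC_FILE_FORMAT` | 0.0 | Converters for one specific format (e.g., .docx, .pdf, .xlsx); the default |
| `PRIORITY_GENERIC_FILE_FORMAT` | 10.0 | Near catch-all converters (e.g., plain text, HTML, ZIP) |
Lower priority values are tried first. Converters with equal priority are tried in reverse registration order, so a plugin registered at the default priority runs before a built-in converter at the same priority. Depending on the release, the constants are exposed as DocumentConverter class attributes (as in Step 1) or as module-level names in the markitdown package, and the priority is passed either to the converter's constructor or to register_converter().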
Source: packages/markitdown/src/markitdown/_base_converter.py
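How priorities turn into a trial order can be simulated with a plain list. The sketch below assumes the ordering described by the `register_converter()` docstring in recent markitdown releases: lower values are tried first, the sort is stable, and newer registrations are placed ahead of older ones. The `Registration` class, the helper functions, and the converter names are stand-ins, not MarkItDown internals, and the constant values 0.0 and 10.0 are assumed to match the library.

```python
from dataclasses import dataclass

PRIORITY_SPECIFIC_FILE_FORMAT = 0.0   # assumed library value
PRIORITY_GENERIC_FILE_FORMAT = 10.0   # assumed library value


@dataclass
class Registration:
    name: str
    priority: float


def register(registry: list, name: str,
             priority: float = PRIORITY_SPECIFIC_FILE_FORMAT) -> None:
    # Newest registrations go to the front of the list.
    registry.insert(0, Registration(name, priority))


def trial_order(registry: list) -> list:
    # sorted() is stable, so equal priorities keep their newest-first order.
    return [r.name for r in sorted(registry, key=lambda r: r.priority)]


registry = []
register(registry, "builtin_plain_text", PRIORITY_GENERIC_FILE_FORMAT)
register(registry, "builtin_docx")
register(registry, "plugin_rtf")  # registered last, default priority
print(trial_order(registry))
```

Under these assumptions the plugin is tried before the built-in converter of equal priority, and the generic converter is tried last.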
Plugin Registration
The `register_converters()` Function
Create a register_converters() function in your plugin package. This function is called during MarkItDown instantiation:
from markitdown import MarkItDown

def register_converters(markitdown: MarkItDown, **kwargs):
    """
    Called during construction of MarkItDown instances to register
    converters provided by plugins.
    """
    # Simply create and attach an RtfConverter instance
    markitdown.register_converter(RtfConverter())
Source: packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py
Handling LLM Client Credentials
Plugins can receive and use LLM credentials passed through MarkItDown():
def register_converters(markitdown: MarkItDown, **kwargs):
    llm_client = kwargs.get('llm_client')
    llm_model = kwargs.get('llm_model')
    llm_prompt = kwargs.get('llm_prompt')
    # Use LLM credentials if available. LLMVisionConverter and
    # BasicRtfConverter are placeholder names for your own converter classes.
    if llm_client and llm_model:
        converter = LLMVisionConverter(
            llm_client=llm_client,
            llm_model=llm_model,
            llm_prompt=llm_prompt
        )
    else:
        converter = BasicRtfConverter()
    markitdown.register_converter(converter)
This pattern is used by the markitdown-ocr plugin to enable LLM-powered OCR for images. Source: packages/markitdown-ocr/README.md
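The forwarding of keyword arguments can be exercised without markitdown or a real LLM client. In this sketch every class is a hypothetical stand-in: `FakeMarkItDown` only records what is registered, and `LlmConverter` and `BasicConverter` stand for whatever converters a plugin would provide. The keyword names `llm_client`, `llm_model`, and `llm_prompt` are the ones used in the text above.

```python
from typing import Any, Optional


class BasicConverter:
    """Hypothetical converter that needs no LLM."""


class LlmConverter:
    """Hypothetical converter that keeps a client and model name."""

    def __init__(self, client: Any, model: str, prompt: Optional[str] = None):
        self.client, self.model, self.prompt = client, model, prompt


class FakeMarkItDown:
    """Stand-in that only records registered converters."""

    def __init__(self) -> None:
        self.converters = []

    def register_converter(self, converter: Any) -> None:
        self.converters.insert(0, converter)


def register_converters(markitdown: FakeMarkItDown, **kwargs: Any) -> None:
    client = kwargs.get("llm_client")
    model = kwargs.get("llm_model")
    # Fall back to the LLM-free converter unless both values were supplied.
    if client is not None and model:
        converter = LlmConverter(client, model, kwargs.get("llm_prompt"))
    else:
        converter = BasicConverter()
    markitdown.register_converter(converter)


with_llm, without_llm = FakeMarkItDown(), FakeMarkItDown()
register_converters(with_llm, llm_client=object(), llm_model="example-model")
register_converters(without_llm)
print(type(with_llm.converters[0]).__name__)
print(type(without_llm.converters[0]).__name__)
```

Reading optional settings with `kwargs.get()` keeps the plugin usable when the caller passes no LLM configuration at all.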
Entry Points Configuration
Configure the entry point in your plugin's pyproject.toml:
[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"
| Field | Value | Description |
|---|---|---|
| Group | "markitdown.plugin" | Fixed entry point group name |
| Key | sample_plugin | Plugin identifier (can be any unique name) |
| Value | "markitdown_sample_plugin" | Fully qualified package name |
Source: packages/markitdown-sample-plugin/README.md
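To see how the three fields fit together, the same entry point can be built by hand with the standard library's `importlib.metadata.EntryPoint`. This is only an illustration of how the declaration is interpreted. Installed plugins are normally read from package metadata rather than constructed like this.

```python
from importlib.metadata import EntryPoint

# Mirrors the pyproject.toml declaration: group, key (name), and value.
ep = EntryPoint(
    name="sample_plugin",
    value="markitdown_sample_plugin",
    group="markitdown.plugin",
)

# The value has no ":attribute" suffix, so it refers to a whole module.
print(ep.group, ep.name, ep.module, ep.attr)
```

Calling `ep.load()` would import the named module, so it only succeeds once the plugin package is installed.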
CLI Integration
Listing Installed Plugins
Use the --list-plugins flag to see all installed third-party plugins:
markitdown --list-plugins
Output format:
Installed MarkItDown 3rd-party Plugins:
* sample_plugin (package: markitdown_sample_plugin)
Use the -p (or --use-plugins) option to enable 3rd-party plugins.
Source: packages/markitdown/src/markitdown/__main__.py
Enabling Plugins
Pass the --use-plugins flag to enable third-party plugins:
markitdown --use-plugins document.rtf -o output.md
Plugin Discovery Mechanism
Plugins are discovered using Python's importlib.metadata.entry_points():
from importlib.metadata import entry_points
plugin_entry_points = list(entry_points(group="markitdown.plugin"))
Source: packages/markitdown/src/markitdown/__main__.py
Complete Plugin Example
Below is a minimal but complete plugin structure:
markitdown_rtf_plugin/
├── pyproject.toml
└── src/
    └── markitdown_rtf_plugin/
        ├── __init__.py
        └── _plugin.py
`pyproject.toml`
[project]
name = "markitdown-rtf-plugin"
version = "0.1.0"
[project.entry-points."markitdown.plugin"]
rtf = "markitdown_rtf_plugin"
`src/markitdown_rtf_plugin/_plugin.py`
"""RTF to Markdown converter plugin for MarkItDown."""
from typing import BinaryIO, Any

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo

__plugin_interface_version__ = 1


class RtfConverter(DocumentConverter):
    """Converts RTF files to Markdown format."""

    # No __init__ is defined: in markitdown 0.1.x the priority is an argument
    # of MarkItDown.register_converter(), not of the converter's constructor.

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        extension = (stream_info.extension or "").lower()
        if extension == ".rtf":
            return True
        # Fall back to the five-byte RTF signature, restoring the position
        cur_pos = file_stream.tell()
        header = file_stream.read(5)
        file_stream.seek(cur_pos)
        return header.startswith(b'{\\rtf')

    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        content = file_stream.read().decode('utf-8', errors='replace')
        # RTF to Markdown conversion logic here
        markdown = self._convert_rtf_to_markdown(content)
        return DocumentConverterResult(markdown=markdown)

    def _convert_rtf_to_markdown(self, content: str) -> str:
        # Placeholder: returns the input unchanged. Replace RTF formatting
        # with Markdown equivalents here.
        return content


def register_converters(markitdown, **kwargs):
    """Register the RTF converter with MarkItDown."""
    markitdown.register_converter(RtfConverter())
`src/markitdown_rtf_plugin/__init__.py`
from ._plugin import register_converters, __plugin_interface_version__
Installation and Testing
Installing the Plugin
Install the plugin in development mode:
pip install -e .
Verifying Installation
List installed plugins:
markitdown --list-plugins
Testing the Plugin
Convert a file using the plugin:
markitdown --use-plugins document.rtf -o output.md
Finding Plugins
To discover available third-party plugins:
- Search GitHub for the hashtag `#markitdown-plugin`
- Check PyPI for packages with the `markitdown-plugin` keyword
- Review the Community Plugins section
Source: packages/markitdown-sample-plugin/README.md
Best Practices
Error Handling
Implement robust error handling in your converter:
def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    try:
        content = file_stream.read()
        markdown = self._convert_to_markdown(content)
        return DocumentConverterResult(markdown=markdown)
    except SpecificFormatError:
        # A format-specific error (placeholder name) that callers should see
        raise
    except Exception as e:
        # Fallback: return a placeholder document. This hides the failure
        # from callers, so log it and use this pattern sparingly.
        return DocumentConverterResult(markdown=f"[Conversion error: {e}]")
Stream Position Management
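Always restore the stream position after reading for format detection, because convert() may be called immediately afterwards and expects the stream at its original position:
def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
    cur_pos = file_stream.tell()  # Save the current position
    header = file_stream.read(512)  # Read to check the format
    file_stream.seek(cur_pos)  # Restore it for the next reader
    return self._is_our_format(header)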
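Restoring the saved position, rather than seeking to zero, matters when the stream does not start at offset 0. The sketch below shows this with an in-memory stream. `peek` is a hypothetical helper and the byte values are made up for the demonstration.

```python
import io
from typing import BinaryIO


def peek(file_stream: BinaryIO, size: int) -> bytes:
    """Read up to `size` bytes, then restore the caller's position."""
    cur_pos = file_stream.tell()
    try:
        return file_stream.read(size)
    finally:
        file_stream.seek(cur_pos)


stream = io.BytesIO(b"HEADERpayload")
stream.seek(6)  # pretend an earlier reader already consumed the header
print(peek(stream, 3), stream.tell())
```

A hard-coded `seek(0)` would have rewound this stream to the header and handed the next reader the wrong bytes.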
Priority Selection Guidelines
| Scenario | Recommended Priority |
|---|---|
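| Adding support for a specific format | PRIORITY_SPECIFIC_FILE_FORMAT (0.0) |
| Adding a near catch-all converter (e.g., a broad MIME-type family) | PRIORITY_GENERIC_FILE_FORMAT (10.0) |
| Taking precedence over a built-in converter for the same format | PRIORITY_SPECIFIC_FILE_FORMAT (0.0) usually suffices, because lower values are tried first and later registrations win ties; use a lower value to be explicit |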
Common Issues
Plugin Not Discovered
Ensure the entry point is correctly configured in pyproject.toml:
[project.entry-points."markitdown.plugin"]
your_plugin_name = "your_package_name"
Converter Not Called
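- Verify `accepts()` returns `True` for your file type
- Check that the priority gives your converter precedence over competing converters
- Ensure plugins are enabled (`enable_plugins=True` or `--use-plugins`) so that `register_converters()` is called during `MarkItDown` instantiation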
CLI Unrecognized Arguments
As noted in Issue #1897, the CLI may not document all plugin-specific arguments. If your plugin adds CLI options, document them separately in your plugin's documentation.
See Also
- MarkItDown README — Main project documentation
- Sample Plugin Repository — Reference implementation
- markitdown-ocr Plugin — Production plugin example with LLM integration
- Base Converter Class — DocumentConverter API reference
- Built-in Converters — Reference implementations for PDF, DOCX, PPTX, XLSX, and more
Source: https://github.com/microsoft/markitdown / Human Manual
Doramagic Pitfall Log
Found 30 structured pitfall item(s), including 4 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_f70b2e3ea5ed47418a4aeb9ef27230f9 | https://github.com/microsoft/markitdown/issues/1685
2. Runtime risk: Runtime risk requires verification
- Severity: high
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_252ef0d45ac040688ffa066bc1b64ba0 | https://github.com/microsoft/markitdown/issues/1897
3. Maintenance risk: Maintenance risk requires verification
- Severity: high
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_6e08b71ee29f46a98e6825a5d5b11e6e | https://github.com/microsoft/markitdown/issues/1979
4. Maintenance risk: Maintenance risk requires verification
- Severity: high
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_439f22f47a524773808819148caadca5 | https://github.com/microsoft/markitdown/issues/1982
5. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- User impact: Developers may fail before the first successful local run: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | fmev_087a8a7b6538b2ce2b065ade73c555af | https://github.com/microsoft/markitdown/issues/1408
6. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: Support for .doc extensions
- User impact: Developers may fail before the first successful local run: Support for .doc extensions
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Support for .doc extensions. Context: Observed when using windows, linux
- Evidence: failure_mode_cluster:github_issue | fmev_d5a467d012987779306cb5c50725275b | https://github.com/microsoft/markitdown/issues/23
7. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- User impact: Developers may fail before the first successful local run: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux. Context: Observed when using python, windows, linux
- Evidence: failure_mode_cluster:github_issue | fmev_1f9167a15a1eec72c8f79514f1b70b76 | https://github.com/microsoft/markitdown/issues/1685
8. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: v0.1.0
- User impact: Upgrade or migration may change expected behavior: v0.1.0
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.0. Context: Observed when using python
- Evidence: failure_mode_cluster:github_release | fmev_1d5ae6ee21225356f45c36c20024dccd | https://github.com/microsoft/markitdown/releases/tag/v0.1.0
9. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_734e117518a3496eb3779e5f22b600b5 | https://github.com/microsoft/markitdown/issues/1408
10. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | cevd_77597bea6262485b9609d8fc5f50a69a | https://github.com/microsoft/markitdown/issues/1894
11. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: Enhancement: Add MCP server support for document processing
- User impact: Developers may misconfigure credentials, environment, or host setup: Enhancement: Add MCP server support for document processing
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Enhancement: Add MCP server support for document processing. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | fmev_969d5f508051e086435b78736eae3e88 | https://github.com/microsoft/markitdown/issues/2004
12. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: v0.1.2
- User impact: Upgrade or migration may change expected behavior: v0.1.2
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.2. Context: Observed when using python
- Evidence: failure_mode_cluster:github_release | fmev_076605feea6e0b4830282709121d3c90 | https://github.com/microsoft/markitdown/releases/tag/v0.1.2
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using markitdown with real data or production workflows.
- Enhancement: Add MCP server support for document processing - github / github_issue
- bug: DOCX math converter crashes when oMath element is missing in malfor - github / github_issue
- Timeout needed - github / github_issue
- Support for .doc extensions - github / github_issue
- bug: DOCX math converter crashes with NotImplementedError on unknown fun - github / github_issue
- Unrecognized Arguments Error in markitdown CLI for undocumented argument - github / github_issue
- bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII fil - github / github_issue
- Office Open XML: Invalid Files Return Success with Error Message Instead - github / github_issue
- [[Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Li](https://github.com/microsoft/markitdown/issues/1685) - github / github_issue
- Version 0.1.6 - github / github_release
- v0.1.5 - github / github_release
- Version 0.1.5b1 - github / github_release
Source: Project Pack community evidence and pitfall evidence