Doramagic Project Pack · Human Manual

markitdown

Home

Related topics: Installation Guide, Architecture Overview, Supported File Formats

MarkItDown Home

MarkItDown is a lightweight Python utility and command-line tool for converting various document formats into Markdown. It is designed primarily for use with Large Language Models (LLMs) and text analysis pipelines, extracting content while preserving document structure including headings, lists, tables, links, and other semantic elements. Source: README.md

Overview

MarkItDown provides a unified interface for converting files to Markdown, abstracting away the complexity of handling different file formats. The tool is particularly valuable for:

  • LLM Integration: Feeding documents to language models that understand Markdown natively
  • Document Indexing: Creating searchable indexes from various document types
  • Text Analysis: Extracting structured text content for downstream processing
[!IMPORTANT]
MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it accesses resources that the process itself can access. Sanitize your inputs in untrusted environments, and use the narrowest conversion function for your use case (e.g., convert_stream(), or convert_local()). Source: README.md

Supported File Formats

MarkItDown supports conversion from numerous formats, organized by the built-in converters. Source: README.md

| Format | Description | Notes |
|---|---|---|
| PDF | Portable Document Format | Basic text extraction; see Known Limitations |
| PowerPoint (.pptx) | Microsoft PowerPoint presentations | Supports images and math equations |
| Word (.docx) | Microsoft Word documents | Supports images and math equations |
| Excel (.xlsx) | Microsoft Excel spreadsheets | Table extraction supported |
| Images | JPEG, PNG, GIF, etc. | EXIF metadata extraction; OCR via plugin |
| Audio | MP3, WAV, etc. | EXIF metadata and speech transcription |
| HTML | Web pages | Includes Wikipedia-specific processing |
| CSV | Comma-separated values | Converted to Markdown tables |
| JSON | JavaScript Object Notation | Structured text output |
| XML | Extensible Markup Language | RSS feeds fully supported |
| EPUB | Electronic publications | E-book content extraction |
| ZIP | Archive files | Iterates over contents |
| YouTube URLs | Video content | Fetches video transcripts (youtube-transcription feature group) |

Why Markdown?

Markdown is extremely close to plain text with minimal markup, yet provides a way to represent important document structure. Mainstream LLMs such as OpenAI's GPT-4o natively understand Markdown, making it an efficient format for document consumption by AI systems. As a side benefit, Markdown conventions are also highly token-efficient compared to other structured formats. Source: README.md

Architecture

MarkItDown uses a converter-based architecture that allows for extensibility through plugins while providing a consistent interface for all supported formats.

Core Components

graph TD
    A[MarkItDown API] --> B[Converter Registry]
    B --> C[Built-in Converters]
    B --> D[Plugin Converters]
    
    C --> C1[PDF Converter]
    C --> C2[DOCX Converter]
    C --> C3[PPTX Converter]
    C --> C4[XLSX Converter]
    C --> C5[Image Converter]
    C --> C6[Audio Converter]
    C --> C7[HTML Converter]
    C --> C8[Wikipedia Converter]
    C --> C9[RSS Converter]
    C --> C10[CSV Converter]
    C --> C11[EPUB Converter]
    C --> C12[YouTube Converter]
    
    D --> D1[OCR Plugin]
    D --> D2[Custom Plugins]
    
    E[Azure Integrations] -.->|Optional| B
    E --> E1[Document Intelligence]
    E --> E2[Content Understanding]

Conversion Pipeline

graph LR
    A[Input File/Stream/URI] --> B[Stream Info Extraction]
    B --> C{Hint Extension?}
    C -->|Yes| D[Use Extension Hint]
    C -->|No| E[Detect from Content]
    D --> F[Find Matching Converter]
    E --> F
    F --> G{Converter Found?}
    G -->|Yes| H[Execute Conversion]
    G -->|No| I[Return Error]
    H --> J[DocumentConverterResult]
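
The hint-versus-detection branch in the pipeline above can be sketched with the standard library alone. This is a minimal illustration, not MarkItDown's detection code: `guess_extension` and the `MAGIC_PREFIXES` table are hypothetical names introduced here, and the real library may use a different detection strategy.

```python
from typing import Optional

# Hypothetical magic-byte table, for illustration only.
MAGIC_PREFIXES = {
    b"%PDF-": ".pdf",
    b"PK\x03\x04": ".zip",  # ZIP container, also used by .docx/.pptx/.xlsx
}


def guess_extension(head: bytes, hint: Optional[str] = None) -> Optional[str]:
    """Return a normalized extension, preferring the caller's hint."""
    hint = (hint or "").strip().lower()
    if hint:
        # A usable hint wins; normalize "pdf" and ".pdf" to the same value.
        return hint if hint.startswith(".") else "." + hint
    # No hint: fall back to sniffing the first bytes of the content.
    for prefix, extension in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return extension
    return None  # the caller would report that no converter was found
```

Returning `None` corresponds to the "Return Error" branch of the diagram: the caller, not the detector, decides how to report an unknown format.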

Converter Base Class

All converters inherit from DocumentConverter and implement two key methods:

  1. accepts(file_stream, stream_info, **kwargs): Determines whether the converter can handle the given input
  2. convert(file_stream, stream_info, **kwargs): Performs the actual conversion to Markdown

Source: packages/markitdown-sample-plugin/README.md
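
The two-method contract can be illustrated without installing MarkItDown. The sketch below is a stdlib-only model of the pattern: `SketchStreamInfo`, `PlainTextSketchConverter`, and `convert_with_first_match` are hypothetical stand-ins, not the library's classes, and a real converter would subclass DocumentConverter and return a DocumentConverterResult.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional, Sequence


@dataclass(frozen=True)
class SketchStreamInfo:
    """Hypothetical stand-in for the stream metadata passed to converters."""
    extension: Optional[str] = None


class PlainTextSketchConverter:
    """Toy converter that only models the accepts()/convert() contract."""

    def accepts(self, file_stream: BinaryIO, stream_info: SketchStreamInfo, **kwargs) -> bool:
        # Decide from metadata alone, without consuming the stream.
        return stream_info.extension == ".txt"

    def convert(self, file_stream: BinaryIO, stream_info: SketchStreamInfo, **kwargs) -> str:
        # Plain text is already valid Markdown, so decode and return it.
        return file_stream.read().decode("utf-8")


def convert_with_first_match(converters: Sequence, file_stream: BinaryIO,
                             stream_info: SketchStreamInfo) -> str:
    """Run the first converter whose accepts() returns True."""
    for converter in converters:
        if converter.accepts(file_stream, stream_info):
            return converter.convert(file_stream, stream_info)
    raise ValueError(f"no converter accepts extension {stream_info.extension!r}")


print(convert_with_first_match(
    [PlainTextSketchConverter()], io.BytesIO(b"# Title"), SketchStreamInfo(".txt")
))
```

Keeping accepts() cheap and side-effect free matters in this design, because several converters may be asked about the same stream before one of them converts it.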

Plugin System

The plugin architecture uses Python entry points to discover and load converters at runtime. Source: packages/markitdown/src/markitdown/__main__.py:1-35

Plugins must implement:

__plugin_interface_version__ = 1

def register_converters(markitdown: MarkItDown, **kwargs):
    """Called during MarkItDown instantiation to register plugin converters."""
    markitdown.register_converter(YourCustomConverter())

Entry point configuration in pyproject.toml:

[project.entry-points."markitdown.plugin"]
your_plugin = "your_package_name"

Source: packages/markitdown-sample-plugin/README.md

Installation

Prerequisites

MarkItDown requires Python 3.10 or higher. Using a virtual environment is recommended to avoid dependency conflicts. Source: README.md

# Using standard Python venv
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows

# Using uv
uv venv --python 3.12 .venv
source .venv/bin/activate

Installation Options

| Command | Description |
|---|---|
| pip install markitdown | Core package only |
| pip install 'markitdown[all]' | All dependencies included |
| pip install 'markitdown[pdf]' | PDF conversion support |
| pip install 'markitdown[docx]' | Word document support |
| pip install 'markitdown[pptx]' | PowerPoint support |
| pip install 'markitdown[xlsx]' | Excel support |
| pip install 'markitdown[images]' | Image processing |
| pip install 'markitdown[audio-transcription]' | Audio transcription |
| pip install 'markitdown[html]' | HTML parsing |
| pip install 'markitdown[az-doc-intel]' | Azure Document Intelligence |
| pip install 'markitdown[az-content-understanding]' | Azure Content Understanding |
| pip install markitdown-ocr | OCR plugin for embedded images |

From source:

git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'

Source: packages/markitdown/README.md

Usage

Command-Line Interface

#### Basic Usage

# Convert a file and output to stdout
markitdown path-to-file.pdf > document.md

# Convert and save to a specific file
markitdown example.xlsx -o output.md

# Read from stdin
cat document.pdf | markitdown > output.md

# Provide file extension hint (for stdin input)
cat document.pdf | markitdown -x .pdf > output.md

Source: packages/markitdown/src/markitdown/__main__.py:1-50

#### CLI Options Reference

| Option | Description |
|---|---|
| -v, --version | Show version number and exit |
| -o, --output FILE | Output file name (default: stdout) |
| -x, --extension EXT | Hint file extension (for stdin input) |
| -d, --use-docintel | Use Azure Document Intelligence |
| -e, --endpoint URL | Azure Document Intelligence endpoint |
| --use-cu | Use Azure Content Understanding |
| --cu-endpoint URL | Content Understanding endpoint |
| --cu-analyzer ID | Content Understanding analyzer ID |
| --cu-file-types TYPES | Comma-separated file types for CU routing |
| -p, --use-plugins | Enable 3rd-party plugins |
| --list-plugins | List installed plugins |
| --keep-data-uris | Keep base64-encoded images in output |

Source: packages/markitdown/src/markitdown/__main__.py:50-130

Python API

#### Basic Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)

#### MarkItDown Constructor Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_plugins | bool | False | Enable 3rd-party plugin loading |
| docintel_endpoint | str | None | Azure Document Intelligence endpoint |
| docintel_model_id | str | None | Document Intelligence model ID |
| llm_client | object | None | OpenAI-compatible LLM client |
| llm_model | str | None | LLM model name (e.g., "gpt-4o") |
| llm_prompt | str | None | Custom prompt for LLM operations |
| cu_endpoint | str | None | Azure Content Understanding endpoint |
| cu_file_types | list | None | File types to route to CU |

Source: README.md

#### Convert Methods

from markitdown import MarkItDown, StreamInfo

md = MarkItDown()

# Convert a path, URI, or stream (the input type is detected automatically)
result = md.convert("document.pdf")

# Convert a URI (http, https, data:, file:)
result = md.convert_uri("https://example.com/page.html")

# Convert a binary file stream, passing the extension as a hint
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, stream_info=StreamInfo(extension=".pdf"))

# Convert a local file only (no URL resolution)
result = md.convert_local("document.pdf")
[!NOTE]
convert_url is an alias for convert_uri maintained for backward compatibility. Source: README.md
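
Choosing the narrowest method for a given input can be made explicit in calling code. The helper below is a hypothetical sketch, not part of MarkItDown: it only returns the name of the method listed above that matches the input's shape, so the caller can refuse categories it does not want to allow, such as remote URIs.

```python
from urllib.parse import urlparse

# URI schemes that the text above lists for convert_uri().
URI_SCHEMES = {"http", "https", "data", "file"}


def narrowest_method(source) -> str:
    """Return the name of the most specific convert method for this input."""
    if hasattr(source, "read"):
        # File-like objects never need path or network resolution.
        return "convert_stream"
    scheme = urlparse(str(source)).scheme.lower()
    if scheme in URI_SCHEMES:
        return "convert_uri"
    # Everything else is treated as a local path (including "C:\\..." paths,
    # whose drive letter parses as an unknown single-letter scheme).
    return "convert_local"
```

A caller could then dispatch with `getattr(md, narrowest_method(source))(source)`, or raise when the result is `"convert_uri"` in an environment that must not touch the network.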

#### Return Value

Each convert method returns a DocumentConverterResult object with the following attributes:

| Attribute | Type | Description |
|---|---|---|
| markdown | str | The converted Markdown content |
| text_content | str | Alias for markdown, kept for backward compatibility |
| title | str or None | Document title (if extractable) |

Azure Integrations

Azure Document Intelligence

Use Azure Document Intelligence for higher-quality PDF and image extraction with layout analysis.

markitdown document.pdf -o document.md -d -e "<document_intelligence_endpoint>"

Python API:

from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<endpoint>")
result = md.convert("document.pdf")

For Azure Key Credentials:

from markitdown import MarkItDown
from azure.core.credentials import AzureKeyCredential

md = MarkItDown(
    docintel_endpoint="<endpoint>",
    docintel_credential=AzureKeyCredential("<api_key>")
)

Source: README.md

Azure Content Understanding

Azure Content Understanding provides structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers. Install with:

pip install 'markitdown[az-content-understanding]'

Python API:

from markitdown import MarkItDown, ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # Route only PDFs to CU
)

Source: packages/markitdown/src/markitdown/converters/_cu_converter.py:1-40

#### When to Use Content Understanding

CapabilityBuilt-inAzure Document IntelligenceAzure Content Understanding
Basic PDF conversion
Layout analysisBasic
Structured field extraction
Image filesEXIF only
Audio filesBasic transcription
Video files
YAML front matter

Source: README.md

OCR Plugin (markitdown-ocr)

The markitdown-ocr plugin adds OCR support for extracting text from images embedded in PDF, DOCX, PPTX, and XLSX files using LLM Vision. Source: packages/markitdown-ocr/README.md

Installation

pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client

Usage

markitdown document_with_images.pdf --use-plugins --llm-client openai --llm-model gpt-4o

Python API:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)

Custom prompt:

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)

Works with any OpenAI-compatible client:

from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="...",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)
[!NOTE]
If no llm_client is provided, the plugin loads but silently skips OCR, falling back to the standard built-in converter. Source: packages/markitdown-ocr/README.md

Known Limitations

PDF Conversion

MarkItDown's built-in PDF converter performs basic text extraction and may not reliably recognize:

  • Headings and footers
  • Tables (see Issue #293)
  • Complex layouts

For improved PDF handling, consider:

  1. Azure Document Intelligence (-d flag): Better layout analysis
  2. Azure Content Understanding: Highest quality extraction with structured output
  3. OCR Plugin: For scanned PDFs and embedded images

Source: Community Issue #296

Microsoft Word (.doc vs .docx)

MarkItDown only supports .docx format (Office Open XML). Legacy .doc files are not supported. Source: Community Issue #23

Audio Transcription

Audio transcription requires ffmpeg or avconv to be installed on the system. On Linux, ensure one of these is available in the system PATH. Source: Community Issue #1685

Error Handling Behavior

When converting invalid Office Open XML files (DOCX, XLSX, PPTX), MarkItDown returns a successful result with the error message in the text content rather than raising an exception. This may make it difficult to distinguish between successful and failed conversions. Source: Community Issue #1408
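
Because a failed conversion can look like a successful one, callers may want to validate Office Open XML input before converting it. The sketch below relies only on the fact that OOXML files are ZIP archives containing a `[Content_Types].xml` member; `looks_like_ooxml` is a hypothetical helper, and passing this check does not guarantee that the document itself is well formed.

```python
import zipfile
from typing import BinaryIO


def looks_like_ooxml(stream: BinaryIO) -> bool:
    """True if the stream is a ZIP archive containing [Content_Types].xml."""
    start = stream.tell()
    try:
        if not zipfile.is_zipfile(stream):
            return False
        stream.seek(start)
        with zipfile.ZipFile(stream) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
    finally:
        # Leave the stream where the caller had it, ready for conversion.
        stream.seek(start)
```

Running this check first lets the caller raise its own exception for invalid files instead of inspecting the returned text for an error message.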

Version History

| Version | Key Changes |
|---|---|
| 0.1.6 | OCR layer service for embedded images and PDF scans; Fixed O(n) memory growth in PDF conversion |
| 0.1.5 | PDF table extraction with aligned Markdown; Fix for partially numbered lists; Wide table support |
| 0.1.4 | Security updates: mammoth 1.11.0, pdfminer.six 20251107 |
| 0.1.3 | ONNXRuntime pinning on Windows; MCP server environment variable support |
| 0.1.2 | Math equation rendering in DOCX; Azure Document Intelligence credentials; CSV to Markdown tables |
| 0.1.1 | convert_url renamed to convert_uri; Support for file and data URIs |
| 0.1.0 | Plugin architecture; Feature groups for dependencies; 3rd-party extensibility |

Source: Release Notes

Docker

MarkItDown can be run in a Docker container:

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

Source: README.md

Security Considerations

[!IMPORTANT]
MarkItDown performs I/O operations with the privileges of the running process. In untrusted environments:
1. Sanitize inputs before passing them to MarkItDown
2. Use narrowest conversion methods: Prefer convert_local() or convert_stream() over convert_uri() to limit network access
3. Restrict file permissions on the process running MarkItDown
The tool will access any resources that the process user can access, including local files and network resources.

Source: README.md
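
One way to sanitize path inputs is to confine them to a directory the service is allowed to read. The sketch below is an illustration under that assumption: `resolve_inside` is a hypothetical helper, not a MarkItDown API, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, so the containment
    # check below also rejects absolute paths outside the base directory.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"{user_path!r} is outside the allowed directory")
    return candidate
```

The resolved path can then be handed to convert_local(), which keeps URL handling out of the picture entirely.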

Source: https://github.com/microsoft/markitdown / Human Manual

Installation Guide

Related topics: Home, Command-Line Interface

This guide covers all supported methods for installing MarkItDown, including Python package installation, Docker deployment, and plugin setup. Choose the method that best fits your environment and use case.

Prerequisites

Before installing MarkItDown, ensure your environment meets the following requirements.

Python Version

MarkItDown requires Python 3.10 or higher. You can verify your Python version by running:

python --version
# or
python3 --version

Source: README.md

Optional: Virtual Environment

Using a virtual environment is recommended to avoid dependency conflicts with other Python packages. Any of the following methods works:

| Method | Commands |
|---|---|
| Standard Python | python -m venv .venv && source .venv/bin/activate |
| uv | uv venv --python=3.12 .venv && source .venv/bin/activate |
| conda | conda create -n markitdown python=3.12 && conda activate markitdown |

Note: When using uv, install packages with uv pip install rather than pip install so that they land in the virtual environment.

Source: README.md

Installation Methods

The simplest installation method uses pip to install MarkItDown from PyPI.

#### Full Installation (All Features)

To install MarkItDown with all optional dependencies for maximum functionality:

pip install 'markitdown[all]'

This installs every converter and dependency group, enabling support for PDF, DOCX, PPTX, XLSX, images, audio, and more.

Source: README.md

#### Selective Installation

Install only the converters you need by specifying feature groups:

pip install 'markitdown[pdf,docx,pptx]'

| Feature Group | Description | File Formats Supported |
|---|---|---|
| pdf | PDF converter | .pdf |
| docx | Word document converter | .docx |
| pptx | PowerPoint converter | .pptx |
| xlsx | Excel converter | .xlsx (legacy .xls files use the separate xls group) |
| html | HTML/Wikipedia converter | .html, .htm |
| image | Image converter with EXIF/OCR | .jpg, .png, .gif, etc. |
| audio-transcription | Audio converter with transcription | .mp3, .wav, .ogg, etc. |
| epub | E-book converter | .epub |
| az-content-understanding | Azure Content Understanding integration | All formats |
| all | All feature groups | All formats |

Source: packages/markitdown/pyproject.toml

Install from Source

For development, testing, or customization, install MarkItDown from the GitHub repository.

git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'

The -e flag installs the package in editable mode, which is useful for development as changes to the source code take effect immediately without reinstallation.

Source: README.md

Docker Installation

The repository includes a Dockerfile for containerized deployments. Only Docker is required on the host system; no Python installation is needed.

#### Building the Image

docker build -t markitdown:latest .

#### Running MarkItDown in Docker

Convert a file and output to stdout:

docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

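Or mount the current directory into the container (a bind mount), which avoids piping large files through stdin. If the image already sets markitdown as its entrypoint, drop the repeated markitdown from the command: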

docker run --rm -v $(pwd):/data markitdown:latest markitdown /data/input.pdf -o /data/output.md

Source: README.md

Plugin Installation

MarkItDown uses a plugin architecture that allows extending functionality through separate packages.

markitdown-ocr Plugin

The markitdown-ocr plugin adds OCR support for extracting text from embedded images in PDF, DOCX, PPTX, and XLSX files using LLM Vision. This is particularly useful for scanned PDFs or documents containing image-based text.

#### Installation

pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client

#### Usage

After installation, enable plugins when creating a MarkItDown instance:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)

#### CLI Usage

markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
Important: The --llm-client and --llm-model CLI arguments require the plugin to be installed. Without the plugin, these arguments will cause an "Unrecognized Arguments" error. See Issue #1897 for details.

Source: packages/markitdown-ocr/README.md

Third-Party Plugins

To discover third-party plugins, search GitHub for the hashtag #markitdown-plugin.

#### Listing Installed Plugins

markitdown --list-plugins

Sample output:

Installed MarkItDown 3rd-party Plugins:

  * sample_plugin     (package: markitdown_sample_plugin)

Use the -p (or --use-plugins) option to enable 3rd-party plugins.

#### Developing Custom Plugins

For developing a custom plugin, see the Sample Plugin documentation. A plugin must:

  1. Implement a DocumentConverter class with accepts() and convert() methods
  2. Export __plugin_interface_version__ = 1
  3. Export a register_converters() function
  4. Define an entry point in pyproject.toml:
[project.entry-points."markitdown.plugin"]
my_plugin = "my_plugin_package"

Source: packages/markitdown-sample-plugin/README.md

Azure Service Integration

Azure Document Intelligence

For higher-quality PDF conversion using Azure's Document Intelligence service:

pip install 'markitdown[az-doc-intel]'

Configure the endpoint when converting:

markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"

Or in Python:

from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")

For setup instructions, see Azure Document Intelligence documentation.

Source: README.md

Azure Content Understanding

For advanced multi-modal extraction with structured field output (YAML front matter):

pip install 'markitdown[az-content-understanding]'

Configure via Python:

from markitdown import ContentUnderstandingFileType
from markitdown import MarkItDown

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # only PDFs use CU
)

For more information, see Azure Content Understanding documentation.

Source: README.md

Verifying Installation

After installation, verify that MarkItDown is correctly installed by checking the version:

markitdown --version

Or test the Python API:

python -c "from markitdown import MarkItDown; print(MarkItDown().convert('test.txt').text_content)"
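
The same check can be done from Python without importing the package itself, which is handy in setup scripts. This is a small sketch using the standard library's distribution metadata; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "markitdown") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "markitdown is not installed in this environment")
```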

Installation Architecture

graph TD
    A[MarkItDown Installation] --> B[Core Package<br/>markitdown]
    A --> C[Optional Dependencies]
    A --> D[Plugins]
    
    C --> C1[PDF: pdfminer.six]
    C --> C2[DOCX: mammoth]
    C --> C3[PPTX: python-pptx]
    C --> C4[XLSX: openpyxl]
    C --> C5[Image: Pillow]
    C --> C6[Audio: pydub, speechrecognition]
    
    D --> D1[markitdown-ocr]
    D --> D2[Custom Plugins<br/>#markitdown-plugin]
    
    B --> E[CLI Interface<br/>markitdown command]
    B --> F[Python API<br/>MarkItDown class]

Troubleshooting

RuntimeWarning: ffmpeg/avconv not found

On Linux systems, if you see this warning when converting audio files:

RuntimeWarning: Couldn't find ffmpeg or avconv

Install ffmpeg to enable audio conversion:

# Debian/Ubuntu
sudo apt install ffmpeg

# Fedora
sudo dnf install ffmpeg

# macOS
brew install ffmpeg

Source: Issue #1685
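
A script can check for the audio backend up front instead of waiting for the warning. The sketch below only looks for the two executables named in the warning on the PATH; `find_audio_backend` is a hypothetical helper and does not verify that the binary actually works.

```python
import shutil
from typing import Optional, Sequence


def find_audio_backend(candidates: Sequence[str] = ("ffmpeg", "avconv")) -> Optional[str]:
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    backend = find_audio_backend()
    print(backend or "Neither ffmpeg nor avconv was found on PATH")
```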

Unrecognized Arguments Error with --llm-client

If you encounter an "Unrecognized Arguments" error when using --llm-client or --llm-model:

markitdown: error: unrecognized arguments: --llm-client openai

Install the markitdown-ocr plugin first:

pip install markitdown-ocr

Source: Issue #1897

Dependency Conflicts

If you encounter dependency conflicts with other packages, create a dedicated virtual environment:

python -m venv markitdown-env
source markitdown-env/bin/activate
pip install 'markitdown[all]'

Command-Line Interface

Related topics: Installation Guide, Python API Reference

MarkItDown provides a command-line interface (CLI) for converting various file formats to Markdown directly from the terminal. The CLI is installed as an entry point of the markitdown package, so the markitdown command is available in whichever Python environment the package was installed into.

Overview

The MarkItDown CLI enables users to convert documents without writing Python code. It wraps the MarkItDown Python class and provides a streamlined interface for common conversion tasks including file conversion, streaming input, plugin management, and integration with Azure AI services.

graph TD
    A["markitdown CLI"] --> B{Input Type}
    B -->|File Path| C["convert_local()"]
    B -->|Stdin| D["convert_stream()"]
    B -->|URL| E["convert_uri()"]
    C --> F["MarkItDown Engine"]
    D --> F
    E --> F
    F --> G{Converter Selection}
    G -->|Built-in| H["PDF, DOCX, PPTX, XLSX..."]
    G -->|Plugin| I["3rd-party Converters"]
    G -->|Azure| J["Doc Intel / Content Understanding"]
    H --> K["Markdown Output"]
    I --> K
    J --> K

Installation

The CLI ships with the core markitdown package. Installing the all extra additionally pulls in every optional converter dependency:

# Install with all optional dependencies
pip install 'markitdown[all]'

# Verify CLI is available
markitdown --version

Source: README.md

Basic Usage

Converting a File

The simplest use case is converting a local file:

markitdown path/to/document.pdf > output.md

Or using the output flag:

markitdown path/to/document.pdf -o output.md

Reading from Stdin

MarkItDown can read from standard input when no filename is provided. Use the -x flag to specify the file extension when reading from stdin:

cat document.pdf | markitdown -x pdf > output.md

Docker Usage

For environments without Python installed, use Docker:

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

Source: README.md

Command Reference

Global Options

| Option | Short | Description |
|---|---|---|
| --version | -v | Show the version number and exit |
| --output | -o | Output file name. If not provided, output is written to stdout |
| --extension | -x | Provide a hint about the file extension when reading from stdin |
| --list-plugins | | List installed 3rd-party plugins and exit |

Plugin Options

| Option | Short | Description |
|---|---|---|
| --use-plugins | -p | Enable 3rd-party plugin support during conversion |

LLM Options

| Option | Description |
|---|---|
| --llm-client | Specify the LLM client to use (e.g., openai) for image descriptions |
| --llm-model | Specify the LLM model to use (e.g., gpt-4o) |
| --llm-prompt | Custom prompt for LLM-based feature extraction |

Azure Document Intelligence Options

| Option | Short | Description |
|---|---|---|
| --use-docintel | -d | Use Azure Document Intelligence for conversion |
| --endpoint | -e | Azure Document Intelligence endpoint URL (required with --use-docintel) |

Azure Content Understanding Options

| Option | Short | Description |
|---|---|---|
| --use-cu | | Use Azure Content Understanding for conversion |
| --cu-endpoint | | Content Understanding endpoint URL (required with --use-cu) |
| --cu-file-types | | Comma-separated list of file types to process with Content Understanding |

Source: packages/markitdown/src/markitdown/__main__.py:20-120

Common Usage Patterns

Standard File Conversion

Convert supported file types (PDF, DOCX, PPTX, XLSX, images, audio, etc.):

markitdown document.pdf -o document.md
markitdown presentation.pptx -o presentation.md
markitdown spreadsheet.xlsx -o spreadsheet.md

Plugin Usage

List available plugins:

markitdown --list-plugins

Convert using 3rd-party plugins:

markitdown document.rtf --use-plugins -o document.md

LLM-Enhanced Conversion

Use LLM vision for image descriptions (currently supports PPTX and images):

markitdown image.jpg --llm-client openai --llm-model gpt-4o -o image.md

Azure Document Intelligence

For higher-quality PDF and Office document conversion:

markitdown document.pdf -d -e "https://your-resource.cognitiveservices.azure.com/" -o document.md

Azure Content Understanding

For structured field extraction with YAML front matter:

markitdown invoice.pdf --use-cu --cu-endpoint "https://your-cu-endpoint" -o invoice.md

Limit Content Understanding to specific file types:

markitdown document.pdf --use-cu --cu-endpoint "..." --cu-file-types PDF -o document.md

Source: README.md

CLI Architecture

The CLI is implemented in __main__.py and follows a clear initialization pattern:

graph LR
    A["Argument Parsing<br/>(argparse)"] --> B{"Mode Selection"}
    B -->|--list-plugins| C["List Plugins & Exit"]
    B -->|--use-docintel| D["Document Intelligence Mode"]
    B -->|--use-cu| E["Content Understanding Mode"]
    B -->|Default| F["Standard Conversion Mode"]
    
    D --> G["MarkItDown(<br/>docintel_endpoint)"]
    E --> H["MarkItDown(<br/>cu_endpoint, cu_file_types)"]
    F --> I["MarkItDown(<br/>enable_plugins)"]
    
    G --> J["Conversion & Output"]
    H --> J
    I --> J

Argument Parsing Flow

Source: packages/markitdown/src/markitdown/__main__.py:20-150

The CLI uses argparse with RawDescriptionHelpFormatter to provide formatted help output including usage examples:

import argparse
from textwrap import dedent

parser = argparse.ArgumentParser(
    description="Convert various file formats to markdown.",
    prog="markitdown",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    usage=dedent("""
        SYNTAX:
            markitdown <OPTIONAL: FILENAME>
            If FILENAME is empty, markitdown reads from stdin.
    """).strip(),
)

Mode Initialization

Based on the flags provided, the CLI initializes MarkItDown with different configurations:

# Document Intelligence mode
if args.use_docintel:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins, 
        docintel_endpoint=args.endpoint
    )

# Content Understanding mode
elif args.use_cu:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins,
        cu_endpoint=args.cu_endpoint,
        cu_file_types=args.cu_file_types
    )

# Standard mode with optional plugins
else:
    markitdown = MarkItDown(
        enable_plugins=args.use_plugins,
        llm_client=llm_client,
        llm_model=args.llm_model,
        llm_prompt=args.llm_prompt
    )

Source: packages/markitdown/src/markitdown/__main__.py:100-145

Plugin Integration

Listing Plugins

The CLI discovers plugins via Python entry points in the markitdown.plugin group:

markitdown --list-plugins

Example output:

Installed MarkItDown 3rd-party Plugins:

  * markitdown-ocr  (package: markitdown_ocr)

Use the -p (or --use-plugins) option to enable 3rd-party plugins.

Enabling Plugins

To use 3rd-party plugins, pass the --use-plugins flag:

markitdown document.pdf --use-plugins -o output.md

markitdown-ocr Plugin

The markitdown-ocr plugin adds OCR support for embedded images in PDF, DOCX, PPTX, and XLSX files:

markitdown document_with_images.pdf --use-plugins --llm-client openai --llm-model gpt-4o
[!NOTE]
If no llm_client is provided, the plugin loads but OCR is silently skipped, falling back to the standard converter.

Source: packages/markitdown-ocr/README.md

Known Limitations and Issues

Unrecognized Arguments Error

Some installations reject the LLM flags. The example in the markitdown-ocr documentation is:

markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o

argparse reports "unrecognized arguments" only for flags that the parser does not define, so this error means that the installed markitdown CLI does not register --llm-client and --llm-model.

Workaround: Confirm which version is installed and reinstall with all dependencies:

# Verify markitdown is properly installed
pip show markitdown

# Reinstall with all dependencies
pip install 'markitdown[all]'

If the flags are still rejected, supply llm_client and llm_model through the MarkItDown constructor in the Python API instead.

Reference: Issue #1897

Missing ffmpeg Warning on Linux

When processing audio files, a RuntimeWarning may appear if ffmpeg or avconv is not installed:

RuntimeWarning: Couldn't find ffmpeg or avconv

Workaround: Install ffmpeg on Linux systems:

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Fedora
sudo dnf install ffmpeg

Reference: Issue #1685

Security Considerations

[!IMPORTANT]
MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access resources that the process itself can access.

For untrusted environments:

  • Sanitize inputs before passing them to the CLI
  • Use the narrowest conversion function needed for your use case
  • Consider running in sandboxed environments when processing untrusted files

Source: README.md

Exit Codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (file not found, unsupported format, conversion failed) |
| 2 | Invalid arguments or missing required parameters |

Source: https://github.com/microsoft/markitdown / Human Manual

Architecture Overview

Related topics: Home, Python API Reference, Plugin Development Guide

MarkItDown is a lightweight Python utility for converting various file formats to Markdown, designed primarily for consumption by Large Language Models (LLMs) and text analysis pipelines. This page provides a comprehensive technical overview of MarkItDown's architecture, including its core components, converter system, plugin architecture, and data flow.

High-Level Architecture

MarkItDown employs a converter-based architecture where each file format is handled by a specialized converter that implements a common interface. The system uses a priority-based selection mechanism to allow converters to be overridden or extended through plugins.

graph TD
    subgraph "Client Layer"
        CLI["CLI<br/>(__main__.py)"]
        API["Python API<br/>(MarkItDown class)"]
    end

    subgraph "Core Engine"
        MD["MarkItDown<br/>Orchestrator"]
        SI["StreamInfo<br/>Metadata"]
        EP["Entry Points<br/>(Plugin Discovery)"]
    end

    subgraph "Converter System"
        BC["Base Converter<br/>(DocumentConverter)"]
        PDF["PdfConverter"]
        DOCX["DocxConverter"]
        PPTX["PptxConverter"]
        XLSX["XlsxConverter"]
        IMG["ImageConverter"]
        AUD["AudioConverter"]
        CU["ContentUnderstandingConverter"]
    end

    subgraph "Output"
        DCR["DocumentConverterResult"]
        MD_OUT["Markdown Output"]
    end

    CLI --> API
    API --> MD
    MD --> EP
    MD --> SI
    EP --> BC
    BC --> PDF
    BC --> DOCX
    BC --> PPTX
    BC --> XLSX
    BC --> IMG
    BC --> AUD
    BC --> CU
    MD --> DCR
    DCR --> MD_OUT

Source: packages/markitdown/src/markitdown/_markitdown.py

Core Components

MarkItDown Orchestrator

The MarkItDown class serves as the main orchestrator for the conversion pipeline. It is responsible for:

  • Converter discovery and registration via Python entry points
  • File format detection using StreamInfo metadata
  • Converter selection and execution based on priority
  • Result aggregation into a standardized output format

Source: packages/markitdown/src/markitdown/_markitdown.py:1-50

#### Key Methods

| Method | Description |
| --- | --- |
| convert(file_path) | Convert a local file path or URI to Markdown; the input type is detected automatically |
| convert_stream(stream, stream_info) | Convert a stream with metadata |
| convert_uri(uri) | Convert a file, data, or remote URI |
| convert_url(url) | Deprecated alias for convert_uri |
| register_converter(converter) | Register a custom converter |
| list_converters() | List all registered converters |

Source: packages/markitdown/src/markitdown/_markitdown.py

StreamInfo

The StreamInfo class encapsulates metadata about the file being converted:

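from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamInfo:
    extension: Optional[str] = None   # e.g. ".pdf"
    mimetype: Optional[str] = None    # e.g. "application/pdf"
    charset: Optional[str] = None     # e.g. "utf-8"
    url: Optional[str] = None         # source URL, if any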

This metadata is used by converters to determine whether they can handle a given file through their accepts() method.

Source: packages/markitdown/src/markitdown/_stream_info.py
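
As a rough illustration of how an accepts() check might consult this metadata, the sketch below uses a stand-in StreamInfo dataclass that mirrors the fields listed above. The names ACCEPTED_EXTENSIONS, ACCEPTED_MIME_PREFIXES, and accepts_pdf are hypothetical and not part of MarkItDown; the real converters may apply different rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    """Stand-in mirroring the fields listed above (not the real class)."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None


# Hypothetical acceptance rules for a PDF-only converter.
ACCEPTED_EXTENSIONS = {".pdf"}
ACCEPTED_MIME_PREFIXES = ("application/pdf", "application/x-pdf")


def accepts_pdf(stream_info: StreamInfo) -> bool:
    """Return True when the extension or MIME type hints at a PDF."""
    extension = (stream_info.extension or "").lower()
    mimetype = (stream_info.mimetype or "").lower()
    if extension in ACCEPTED_EXTENSIONS:
        return True
    # str.startswith accepts a tuple of candidate prefixes.
    return mimetype.startswith(ACCEPTED_MIME_PREFIXES)
```

Checking the extension first and the MIME type second keeps the test cheap; neither hint requires reading the stream itself.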

DocumentConverterResult

The DocumentConverterResult class represents the output of a successful conversion:

class DocumentConverterResult:
    text_content: str          # The Markdown output
    attachments: List[Any]     # Extracted images or other attachments
    metadata: Dict[str, Any]   # Conversion metadata

Source: packages/markitdown/src/markitdown/_base_converter.py

Converter System

Base Converter Interface

All converters inherit from DocumentConverter, which defines the contract for file conversion:

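from abc import ABC, abstractmethod
from typing import Any, BinaryIO

class DocumentConverter(ABC):
    # Priority constants (lower values are tried first)
    PRIORITY_SPECIFIC_FILE_FORMAT = 0.0   # default for format-specific converters
    PRIORITY_GENERIC_FILE_FORMAT = 10.0   # catch-all converters, tried last

    def __init__(self, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT):
        self.priority = priority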

    @abstractmethod
    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any
    ) -> bool:
        """Determine if this converter can handle the file."""
        pass

    @abstractmethod
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any
    ) -> DocumentConverterResult:
        """Convert the file to Markdown."""
        pass

Source: packages/markitdown/src/markitdown/_base_converter.py

Priority System

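Converters are tried in ascending order of priority value, so a lower value means the converter is tried earlier. Registering a converter below 0.0 lets a plugin run ahead of the built-in converter for the same format:

| Priority | Usage |
| --- | --- |
| < 0.0 | Plugin converters that should run before the built-ins (markitdown-ocr uses -1.0) |
| 0.0 (PRIORITY_SPECIFIC_FILE_FORMAT) | Built-in format-specific converters (default) |
| 10.0 (PRIORITY_GENERIC_FILE_FORMAT) | Generic catch-all converters, tried last |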

Source: packages/markitdown/src/markitdown/_base_converter.py

Built-in Converters

The following converters are included with the core package:

| Converter | File Types | Priority |
| --- | --- | --- |
| PdfConverter | .pdf | 0.0 |
| DocxConverter | .docx | 0.0 |
| PptxConverter | .pptx | 0.0 |
| XlsxConverter | .xlsx | 0.0 |
| ImageConverter | .jpg, .jpeg, .png, .gif, .bmp, .webp | 0.0 |
| AudioConverter | .mp3, .wav, .m4a, .flac | 0.0 |
| WikipediaConverter | Wikipedia URLs | 0.0 |
| CsvConverter | .csv | 0.0 |
| JsonConverter | .json | 0.0 |
| XmlConverter | .xml | 0.0 |
| HtmlConverter | .html, .htm | 0.0 |
| EpubConverter | .epub | 0.0 |
| YouTubeConverter | YouTube URLs | 0.0 |
| IpynbConverter | .ipynb | 0.0 |
| ZipConverter | .zip | 0.0 |
| ContentUnderstandingConverter | All (when enabled) | 0.0 |

Source: packages/markitdown/src/markitdown/converters/__init__.py

Plugin Architecture

Version 0.1.0 introduced a plugin-based architecture that allows third-party developers to extend MarkItDown's capabilities.

Plugin Discovery Mechanism

Plugins are discovered through Python entry points in the markitdown.plugin group:

[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"

Source: packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py

Plugin Interface

Each plugin must implement and export the following:

# The version of the plugin interface that this plugin uses
__plugin_interface_version__ = 1

def register_converters(markitdown: MarkItDown, **kwargs):
    """Called during construction of MarkItDown instances."""
    markitdown.register_converter(YourCustomConverter())

Source: packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py

Plugin Registration Flow

sequenceDiagram
    participant User
    participant MarkItDown
    participant EntryPoints
    participant Plugin
    participant Converter

    User->>MarkItDown: MarkItDown(enable_plugins=True)
    MarkItDown->>EntryPoints: entry_points(group="markitdown.plugin")
    EntryPoints-->>MarkItDown: List of plugins
    loop For each plugin
        MarkItDown->>Plugin: load_plugin()
        Plugin->>Plugin: register_converters()
        Plugin->>MarkItDown: register_converter(Converter)
    end
    User->>MarkItDown: convert(file)
    MarkItDown->>Converter: accepts(stream, info)
    Converter-->>MarkItDown: True
    MarkItDown->>Converter: convert(stream, info)
    Converter-->>MarkItDown: DocumentConverterResult

Source: packages/markitdown/src/markitdown/_markitdown.py

Plugin Parameters

When plugins are loaded, the following parameters are forwarded from the MarkItDown constructor:

| Parameter | Type | Purpose |
| --- | --- | --- |
| llm_client | OpenAI-compatible | LLM client for image descriptions and OCR |
| llm_model | str | Model name for LLM calls |
| llm_prompt | str | Custom prompt for LLM extraction |
| docintel_endpoint | str | Azure Document Intelligence endpoint |
| docintel_credential | AzureKeyCredential | Authentication for Document Intelligence |

Source: packages/markitdown/src/markitdown/_markitdown.py

Example: markitdown-ocr Plugin

The markitdown-ocr plugin demonstrates the plugin architecture. It registers OCR-enhanced converters at priority -1.0, which allows them to take precedence over built-in converters:

# Inside register_converters()
markitdown.register_converter(OcrPdfConverter(priority=-1.0))
markitdown.register_converter(OcrDocxConverter(priority=-1.0))
markitdown.register_converter(OcrPptxConverter(priority=-1.0))
markitdown.register_converter(OcrXlsxConverter(priority=-1.0))

Source: packages/markitdown-ocr/README.md
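
The effect of registering at -1.0 can be illustrated with a toy registry. This is a sketch under a stated assumption: lower priority values are tried first, which is what lets a -1.0 registration run ahead of built-ins at 0.0. Converter, Registry, and their methods are hypothetical stand-ins, not MarkItDown classes.

```python
from typing import List, Optional, Tuple


class Converter:
    """Hypothetical stand-in for a DocumentConverter."""

    def __init__(self, name: str, extensions: Tuple[str, ...]):
        self.name = name
        self.extensions = extensions

    def accepts(self, extension: str) -> bool:
        return extension.lower() in self.extensions


class Registry:
    """Toy registry; assumes lower priority values are tried first."""

    def __init__(self) -> None:
        self._entries: List[Tuple[float, Converter]] = []

    def register(self, converter: Converter, priority: float = 0.0) -> None:
        self._entries.append((priority, converter))

    def select(self, extension: str) -> Optional[Converter]:
        # sorted() is stable, so equal priorities keep registration order.
        for _, converter in sorted(self._entries, key=lambda entry: entry[0]):
            if converter.accepts(extension):
                return converter
        return None


registry = Registry()
registry.register(Converter("PdfConverter", (".pdf",)), priority=0.0)
registry.register(Converter("OcrPdfConverter", (".pdf",)), priority=-1.0)

print(registry.select(".pdf").name)  # OcrPdfConverter
```

Even though the built-in stand-in was registered first, the OCR stand-in wins because selection is driven by the priority value rather than registration order.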

Conversion Pipeline

URI Handling

MarkItDown supports multiple input sources through a unified URI handling system:

graph TD
    URI["convert_uri(uri)"]
    URI_TYPES{"URI Type?"}
    FILE["file:///path/to/file"]
    DATA["data:application/pdf;base64,..."]
    HTTP["https://example.com/file.pdf"]
    YT["https://youtube.com/watch?v=..."]

    URI --> URI_TYPES
    URI_TYPES -->|"file:"| FILE
    URI_TYPES -->|"data:"| DATA
    URI_TYPES -->|"http/https"| HTTP
    URI_TYPES -->|"youtube.com"| YT

    FILE --> STREAM["_convert_stream()"]
    DATA --> STREAM
    HTTP --> FETCH["Fetch content"]
    FETCH --> STREAM
    YT --> STREAM

    STREAM --> ACCEPT["Try each converter"]
    ACCEPT --> RESULT["DocumentConverterResult"]

Source: packages/markitdown/src/markitdown/_uri_utils.py
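
The branching in the diagram can be approximated with the standard library. This is a minimal sketch rather than the contents of _uri_utils.py; classify_uri and decode_data_uri are hypothetical names, and the data URI handling follows RFC 2397 only in its common forms.

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes, urlparse


def classify_uri(uri: str) -> str:
    """Return 'file', 'data', 'http', or 'unsupported' for a URI string."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme in ("file", "data"):
        return scheme
    return "unsupported"


def decode_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split a data: URI into (mimetype, payload bytes)."""
    if not uri.lower().startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = uri[5:].partition(",")
    if not sep:
        raise ValueError("data URI is missing the ',' separator")
    parts = header.split(";")
    mimetype = parts[0] or "text/plain"  # RFC 2397 default media type
    if "base64" in parts[1:]:
        return mimetype, base64.b64decode(payload)
    # Without the base64 marker the payload is percent-encoded.
    return mimetype, unquote_to_bytes(payload)


print(classify_uri("file:///path/to/file"))
print(decode_data_uri("data:text/plain;base64,SGVsbG8="))
```

Whatever the branch, the result is a byte stream plus a MIME hint, which is the shape the stream-based conversion step expects.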

Stream Processing

The conversion pipeline processes files as streams to support:

  • Piped input (stdin)
  • Remote URLs (HTTP/HTTPS)
  • Data URIs (embedded content)
  • File paths (local)

Source: packages/markitdown/src/markitdown/_markitdown.py

Azure Integration

Azure Document Intelligence

For PDF files, MarkItDown can optionally use Azure Document Intelligence for enhanced extraction:

from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="<endpoint>",
    docintel_credential=AzureKeyCredential("<key>")
)
result = md.convert("document.pdf")

Source: packages/markitdown/src/markitdown/__main__.py

Azure Content Understanding

For multi-modal documents (audio, video, structured fields), MarkItDown supports Azure Content Understanding:

from markitdown import MarkItDown, ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF]
)

Source: packages/markitdown/src/markitdown/converters/_cu_converter.py

Command-Line Interface

The CLI is implemented in __main__.py and provides the following interface:

graph LR
    CLI["markitdown CLI"]
    OPTS["Options"]
    CONV["Conversion<br/>Modes"]

    subgraph OPTS
        V["-v, --version"]
        O["-o, --output"]
        X["-x, --extension"]
        P["-p, --use-plugins"]
        L["-l, --list-plugins"]
        M["-m, --mime-type"]
    end

    subgraph CONV
        STD["Standard"]
        DOCINTEL["--use-docintel"]
        CU["--use-cu"]
    end

    CLI --> OPTS
    CLI --> CONV

Source: packages/markitdown/src/markitdown/__main__.py

CLI Options

| Option | Description |
| --- | --- |
| -v, --version | Show version number |
| -o, --output FILE | Output file path |
| -x, --extension EXT | Hint file extension (for stdin) |
| -m, --mime-type MIME_TYPE | Hint for the MIME type |
| -c, --charset CHARSET | Hint for the character set (e.g. UTF-8) |
| -p, --use-plugins | Enable plugins |
| -l, --list-plugins | List installed plugins |
| --use-docintel | Use Azure Document Intelligence |
| --endpoint URL | Document Intelligence endpoint |
| --use-cu | Use Azure Content Understanding |
| --cu-endpoint URL | Content Understanding endpoint |

Source: packages/markitdown/src/markitdown/__main__.py

Known Limitations and Failure Modes

Based on community-reported issues, be aware of the following:

1. CLI Unrecognized Arguments

The CLI may not recognize all documented arguments. Issue #1897 reports that the --llm-client and --llm-model arguments shown in documentation are not properly recognized.

2. UnicodeDecodeError in IpynbConverter

The IpynbConverter.accepts() method decodes the stream as UTF-8, which raises UnicodeDecodeError when the input contains bytes that are not valid UTF-8. See issue #1894.

3. Invalid Office Open XML Files

Invalid DOCX, XLSX, or PPTX files return success with an error message in text_content rather than raising an exception. See issue #1408.

4. Audio Processing on Linux

When ffmpeg or avconv is not installed, pydub emits a RuntimeWarning and audio conversion may be affected. See issue #1685.

5. PDF Table Extraction

Tables in PDF files may not be converted properly. Community reports indicate that complex table structures can be problematic. See issue #293.
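
Because invalid Office Open XML files (item 3) can come back as a "successful" result, a caller may want a cheap structural check before converting. The sketch below relies on the fact that DOCX, XLSX, and PPTX files are ZIP packages containing a [Content_Types].xml part; looks_like_ooxml is a hypothetical helper, not a MarkItDown function, and passing this check does not guarantee the file is fully valid.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Cheap structural check: a ZIP archive containing [Content_Types].xml."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    try:
        with zipfile.ZipFile(buffer) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        # is_zipfile only looks for the end-of-archive record; opening can still fail.
        return False
```

Rejecting inputs that fail this check lets the caller raise its own exception rather than inspecting text_content for an error message.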

Security Considerations

[!IMPORTANT]
MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access resources that the process itself can access.

For untrusted environments:

  1. Sanitize inputs before passing them to MarkItDown
  2. Use narrow conversion functions such as convert_stream() or convert_local() when possible
  3. Be cautious with URL inputs as they may trigger network requests

Source: packages/markitdown/README.md
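
One way to apply the first and third points is to validate inputs before they reach MarkItDown. This is a sketch of a possible policy, not something the library provides; ALLOWED_SCHEMES, is_allowed_url, and resolve_inside are hypothetical names, and the allow-list values are placeholders.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical policy: only HTTPS; no file:, data:, or plain http.
ALLOWED_SCHEMES = {"https"}


def is_allowed_url(url: str, allowed_hosts: set) -> bool:
    """Accept only URLs whose scheme and host are explicitly allow-listed."""
    parsed = urlparse(url)
    return parsed.scheme.lower() in ALLOWED_SCHEMES and parsed.hostname in allowed_hosts


def resolve_inside(root: str, candidate: str) -> Path:
    """Resolve candidate under root, rejecting paths that escape it."""
    root_path = Path(root).resolve()
    # An absolute candidate replaces root_path here and is rejected below.
    target = (root_path / candidate).resolve()
    if root_path != target and root_path not in target.parents:
        raise ValueError(f"{candidate!r} escapes {root_path}")
    return target
```

A path returned by resolve_inside can then be handed to a narrow entry point such as convert_local(), and a URL is fetched only after is_allowed_url returns True.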

Dependency Groups

MarkItDown organizes dependencies into feature groups:

| Group | Description |
| --- | --- |
| pdf | PDF conversion (pdfminer.six) |
| docx | Word document conversion (mammoth) |
| pptx | PowerPoint conversion (python-pptx) |
| xlsx | Excel conversion (openpyxl) |
| image | Image processing |
| audio-transcription | Audio transcription (pydub, SpeechRecognition) |
| youtube-transcription | YouTube transcript retrieval (youtube-transcript-api) |
| az-doc-intel | Azure Document Intelligence |
| az-content-understanding | Azure Content Understanding |
| all | All optional dependencies |

Install with: pip install 'markitdown[all]'

Source: packages/markitdown/README.md

Class Diagram

classDiagram
    class DocumentConverter {
        <<abstract>>
        +float priority
        +accepts(file_stream, stream_info, **kwargs) bool
        +convert(file_stream, stream_info, **kwargs) DocumentConverterResult
    }

    class MarkItDown {
        +bool enable_plugins
        +str llm_model
        +convert(file_path) DocumentConverterResult
        +convert_uri(uri) DocumentConverterResult
        +convert_stream(stream, info) DocumentConverterResult
        +register_converter(converter)
    }

    class StreamInfo {
        +Optional~str~ extension
        +Optional~str~ mimetype
        +Optional~str~ charset
        +Optional~str~ url
    }

    class DocumentConverterResult {
        +str text_content
        +List attachments
        +Dict metadata
    }

    MarkItDown "1" --> "*" DocumentConverter : registers
    MarkItDown --> StreamInfo : creates
    MarkItDown --> DocumentConverterResult : returns
    DocumentConverter --> DocumentConverterResult : returns
    DocumentConverter ..> StreamInfo : uses

Python API Reference

Related topics: Architecture Overview, Command-Line Interface, Azure Integrations

MarkItDown provides a comprehensive Python API for converting various file formats to Markdown. This reference documents the core classes, methods, and configuration options available to developers integrating MarkItDown into their Python applications.

Overview

The MarkItDown Python API enables programmatic document conversion with support for:

  • Local files - Convert files from the filesystem
  • URLs/URIs - Convert remote documents via HTTP or file URIs
  • Streams - Convert from file-like objects with optional metadata hints
  • Azure services - Integration with Document Intelligence and Content Understanding
  • Plugin architecture - Extensible converter system for custom formats

Source: packages/markitdown/src/markitdown/__init__.py

Core Classes

MarkItDown

The main entry point for the Python API. Create an instance with desired configuration and call convert() to transform documents to Markdown.

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

#### Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_plugins | bool | False | Enable loading of 3rd-party plugins via the markitdown.plugin entry point group |
| llm_client | Any | None | OpenAI-compatible LLM client for image descriptions (used with llm_model) |
| llm_model | str | None | Model name for LLM-based image description generation |
| llm_prompt | str | None | Custom prompt template for LLM image description |
| docintel_endpoint | str | None | Azure Document Intelligence endpoint URL |
| docintel_api_key | str | None | Azure Document Intelligence API key (alternative to credential) |
| cu_endpoint | str | None | Azure Content Understanding endpoint URL |
| cu_api_key | str | None | Azure Content Understanding API key |
| cu_analyzer | str | None | Specific Content Understanding analyzer ID to use |
| cu_file_types | List[ContentUnderstandingFileType] | None | Filter which file types route to Content Understanding |

Source: packages/markitdown/src/markitdown/_markitdown.py

#### Conversion Methods

##### convert(uri: str, **kwargs) -> DocumentConverterResult

The primary conversion method that automatically detects the input type and routes to the appropriate handler.

# Local file
result = md.convert("document.docx")

# URL
result = md.convert("https://example.com/document.pdf")

# File URI
result = md.convert("file:///path/to/document.pdf")

# Data URI
result = md.convert("data:text/plain;base64,SGVsbG8=")

The method supports optional keyword arguments that override the instance defaults for a single conversion call.

Source: packages/markitdown/src/markitdown/_markitdown.py

##### convert_uri(uri: str, **kwargs) -> DocumentConverterResult

Explicitly converts a URI (file, data, or HTTP/HTTPS URL). The convert_url method remains as a deprecated alias for backward compatibility.

result = md.convert_uri("file:///path/to/document.pdf")

Source: packages/markitdown/src/markitdown/_markitdown.py

##### convert_local(file_path: str, **kwargs) -> DocumentConverterResult

Converts a local file by path. This method is preferred when the file source is known to be local, as it bypasses URI parsing.

result = md.convert_local("./documents/report.pdf")

Source: packages/markitdown/src/markitdown/_markitdown.py

##### convert_stream(stream: BinaryIO, stream_info: Optional[StreamInfo] = None, **kwargs) -> DocumentConverterResult

Converts from a file-like object. Use stream_info to provide hints about the file type when the stream lacks inherent type information.

from markitdown import MarkItDown, StreamInfo

md = MarkItDown()
with open("document.pdf", "rb") as f:
    stream_info = StreamInfo(extension=".pdf")
    result = md.convert_stream(f, stream_info=stream_info)

Source: packages/markitdown/src/markitdown/_markitdown.py

##### register_converter(converter: DocumentConverter) -> None

Registers a custom converter. This allows adding support for additional file formats at runtime.

md = MarkItDown()
md.register_converter(MyCustomConverter())

Source: packages/markitdown/src/markitdown/_markitdown.py

Supported File Formats

Related topics: Home, Azure Integrations, OCR Plugin

MarkItDown provides a unified interface for converting a wide variety of file formats into Markdown. The conversion pipeline uses a plugin-based architecture where each file format is handled by a dedicated converter. When you call MarkItDown().convert(), the system iterates through registered converters in priority order until one accepts the input.

Overview

MarkItDown supports the following high-level categories of file formats:

| Category | Formats | Primary Converter |
| --- | --- | --- |
| Documents | PDF, DOCX, PPTX, XLSX | Built-in converters using pdfminer.six, mammoth |
| Media | Images (JPEG, PNG, GIF, WebP, BMP, TIFF) | Built-in with EXIF metadata and LLM Vision OCR |
| Media | Audio (MP3, WAV, M4A, OGG, FLAC) | Built-in with EXIF metadata and transcription |
| Web | HTML, Wikipedia, RSS/Atom | Built-in converters using BeautifulSoup |
| Data | CSV, JSON, XML | Built-in converters |
| Archives | ZIP | Built-in converter with recursive processing |
| eBooks | EPUB | Built-in converter |
| Notebooks | Jupyter Notebook (IPYNB) | Built-in converter |
| URLs | YouTube Videos | Built-in converter with transcript extraction |

Source: README.md

Converter Architecture

MarkItDown uses a converter registration system where each format is handled by a class implementing the DocumentConverter interface. Converters are registered with a priority value that determines the order in which they are tried.

graph TD
    A[MarkItDown.convert] --> B[Get registered converters]
    B --> C["Sort by priority (lowest value first)"]
    C --> D{Loop through converters}
    D --> E{Converter.accepts?}
    E -->|Yes| F[Converter.convert]
    E -->|No| G[Next converter]
    F --> H[Return DocumentConverterResult]
    G --> D
    D -->|All fail| I[UnsupportedFormatException]

Each converter implements two key methods:

  1. accepts(file_stream, stream_info, **kwargs) - Returns True if the converter can handle the input
  2. convert(file_stream, stream_info, **kwargs) - Performs the actual conversion to Markdown

Source: packages/markitdown/src/markitdown/_base_converter.py

Document Formats

PDF

PDF conversion extracts text content while preserving document structure including headings, paragraphs, lists, and tables.

Supported Extensions: .pdf

Dependencies: pdfminer.six

Features:

  • Text extraction with layout preservation
  • Table extraction with aligned Markdown output
  • Heading and list recognition
  • Support for numbered and bulleted lists

Known Limitations:

  • Scanned PDFs with no extractable text require the markitdown-ocr plugin for full-page OCR
  • Complex table structures may not convert perfectly (see Issue #293)
  • PDF is converted to text/Markdown, not high-fidelity reproduction (see Issue #296)

Memory Optimization: In version 0.1.6, PDF conversion was fixed to prevent O(n) memory growth by properly calling page.close() after processing each page.

Source: packages/markitdown/src/markitdown/converters/_pdf_converter.py

Microsoft Word (DOCX)

Word documents are converted using the mammoth library, which extracts text and converts it to Markdown.

Supported Extensions: .docx

Dependencies: mammoth

Features:

  • Heading extraction and conversion
  • Paragraph and text formatting preservation
  • Table extraction
  • Math equation rendering (OMML to LaTeX)
  • Image extraction with optional LLM Vision descriptions
  • Linked image handling

Known Limitations:

  • The legacy .doc format is not supported (see Issue #23)
  • Invalid DOCX files return a success result with an error message in text_content rather than raising an exception (see Issue #1408)

Source: packages/markitdown/src/markitdown/converters/_docx_converter.py

Microsoft PowerPoint (PPTX)

PowerPoint presentations are converted slide by slide, with each slide rendered as a Markdown section.

Supported Extensions: .pptx

Dependencies: python-pptx

Features:

  • Slide-by-slide conversion
  • Title and content extraction
  • Bullet point preservation
  • Image extraction with optional LLM Vision descriptions
  • Table extraction

Source: packages/markitdown/src/markitdown/converters/_pptx_converter.py

Microsoft Excel (XLSX)

Excel spreadsheets are converted with each sheet represented as a separate Markdown section with tables.

Supported Extensions: .xlsx, .xlsm

Dependencies: openpyxl

Features:

  • Multi-sheet support
  • Table extraction with header row identification
  • Cell value preservation including formulas (as displayed values)
  • Named sheet sections in output

Known Limitations:

  • Invalid XLSX files return a success result with an error message rather than raising an exception (see Issue #1408)

Source: packages/markitdown/src/markitdown/converters/_xlsx_converter.py

Media Formats

Images

Images are converted by extracting metadata and optionally generating descriptions using LLM Vision.

Supported Extensions: .jpg, .jpeg, .png, .gif, .webp, .bmp, .tiff, .tif

Dependencies: Pillow (for metadata), LLM client for descriptions

Features:

  • EXIF metadata extraction (camera info, GPS, date/time)
  • LLM Vision image descriptions (requires llm_client and llm_model)
  • Dimension reporting

Usage Example:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail."
)
result = md.convert("photo.jpg")
print(result.text_content)

Source: packages/markitdown/src/markitdown/converters/_image_converter.py

Audio

Audio files are converted by extracting metadata and optionally transcribing speech.

Supported Extensions: .mp3, .wav, .m4a, .ogg, .flac

Dependencies: pydub, speech-recognition (for transcription)

Features:

  • EXIF/metadata extraction
  • Speech-to-text transcription using Google Speech Recognition
  • Format detection

Known Limitations:

  • On Linux systems, a RuntimeWarning may appear if ffmpeg or avconv is not installed (see Issue #1685). Install ffmpeg to resolve:

# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

Source: packages/markitdown/src/markitdown/converters/_audio_converter.py

Web Formats

HTML

HTML files are converted to Markdown using BeautifulSoup with customizable rendering options.

Supported Extensions: .html, .htm

Accepted MIME Types: text/html, application/xhtml+xml

Source: packages/markitdown/src/markitdown/converters/_html_converter.py

Wikipedia

Wikipedia pages are specially handled to extract only the main article content, stripping navigation and sidebar elements.

Accepted URLs: *.wikipedia.org/*

Features:

  • Article title extraction
  • Main content extraction (excluding sidebars, navigation)
  • HTML conversion pipeline

Source: packages/markitdown/src/markitdown/converters/_wikipedia_converter.py

RSS and Atom Feeds

RSS and Atom feeds are converted with each item represented as a section in the output.

Supported Extensions: .rss, .atom, .xml

Accepted MIME Types:

  • Precise: application/rss, application/rss+xml, application/atom, application/atom+xml
  • Candidate: text/xml, application/xml

Source: packages/markitdown/src/markitdown/converters/_rss_converter.py
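
The split between precise and candidate MIME types suggests a two-step check: precise types are accepted outright, while generic XML types need a look at the content. The sketch below illustrates that idea with the standard library and is not the code in _rss_converter.py; is_feed is a hypothetical name, and xml.etree.ElementTree should not be used on untrusted XML without additional hardening.

```python
import xml.etree.ElementTree as ET

PRECISE_MIME_TYPES = ("application/rss", "application/rss+xml",
                      "application/atom", "application/atom+xml")
CANDIDATE_MIME_TYPES = ("text/xml", "application/xml")


def is_feed(mimetype: str, content: bytes) -> bool:
    """Accept precise feed types outright; sniff content for generic XML types."""
    mimetype = (mimetype or "").lower()
    if mimetype.startswith(PRECISE_MIME_TYPES):
        return True
    if not mimetype.startswith(CANDIDATE_MIME_TYPES):
        return False
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return False
    # Strip any "{namespace}" prefix, e.g. "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in ("rss", "feed")
```

Sniffing only for candidate types means an ordinary XML document served as text/xml is left for another converter.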

Data Formats

CSV

CSV files are converted to Markdown tables with automatic header detection.

Supported Extensions: .csv

Source: packages/markitdown/src/markitdown/converters/_csv_converter.py
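
The shape of the output can be sketched with the standard csv module. This is an illustration only, assuming the first row is the header (the built-in converter's header detection may be more involved); csv_to_markdown is a hypothetical name.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Escape pipes and pad short rows so every line has the same column count.
    padded = [[cell.replace("|", "\\|") for cell in row] + [""] * (width - len(row))
              for row in rows]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(csv_to_markdown("name,qty\nwidget,3\n"))
```

Padding matters because Markdown renderers generally expect each row to match the header's column count.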

JSON

JSON files are converted with basic formatting to maintain readability.

Supported Extensions: .json

Source: packages/markitdown/src/markitdown/converters/_json_converter.py

XML

XML files are parsed and converted to readable Markdown format.

Supported Extensions: .xml

Source: packages/markitdown/src/markitdown/converters/_xml_converter.py

Archive Formats

ZIP Files

ZIP archives are recursively processed, with each contained file converted using the appropriate converter.

Supported Extensions: .zip

Accepted MIME Types: application/zip

Features:

  • Recursive conversion of all contained files
  • File path preservation in output headings
  • Support for nested archives

Output Format:

Content from the zip file `example.zip`:

## File: docs/readme.txt

[Content of readme.txt]

## File: images/example.jpg

ImageSize: 1920x1080
Description: [Image description]

## File: data/report.xlsx

[Converted Excel content]

Source: packages/markitdown/src/markitdown/converters/_zip_converter.py
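
The layout above can be reproduced with a short loop over the archive members. This is a simplified sketch, not the code in _zip_converter.py: render_zip is a hypothetical name, and instead of dispatching each member to a format-specific converter it only renders UTF-8 text and summarizes everything else.

```python
import io
import zipfile


def render_zip(data: bytes, archive_name: str) -> str:
    """Emit one '## File:' section per archive member, in archive order."""
    sections = [f"Content from the zip file `{archive_name}`:"]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            raw = archive.read(info)
            # Placeholder for per-format conversion: only UTF-8 text is rendered.
            try:
                body = raw.decode("utf-8").strip()
            except UnicodeDecodeError:
                body = f"[{len(raw)} bytes of binary content]"
            sections.append(f"## File: {info.filename}\n\n{body}")
    return "\n\n".join(sections)
```

When processing untrusted archives, a real implementation would also want to cap member sizes and nesting depth before reading anything into memory.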

eBook and Notebook Formats

EPUB

EPUB e-books are converted with chapter-by-chapter extraction.

Supported Extensions: .epub

Features:

  • Chapter extraction
  • Content preservation
  • Metadata extraction

Source: packages/markitdown/src/markitdown/converters/_epub_converter.py

Jupyter Notebooks

Jupyter notebooks (.ipynb files) are converted preserving both code cells and markdown cells.

Supported Extensions: .ipynb

Known Limitations:

  • IpynbConverter.accepts() may raise UnicodeDecodeError on files containing non-ASCII bytes (see Issue #1894). This is particularly relevant when processing files created from non-English PDFs.

Source: packages/markitdown/src/markitdown/converters/_ipynb_converter.py
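
Callers who hit this limitation can pre-screen input themselves. The sketch below shows a tolerant sniff that never raises on undecodable bytes; it assumes a notebook can be recognized by the "nbformat" and "nbformat_minor" keys, and looks_like_notebook is a hypothetical helper rather than the library's fix.

```python
def looks_like_notebook(data: bytes) -> bool:
    """Sniff for notebook markers without raising on undecodable bytes."""
    # errors="replace" maps invalid sequences to U+FFFD instead of raising.
    text = data.decode("utf-8", errors="replace")
    return '"nbformat"' in text and '"nbformat_minor"' in text
```

Binary inputs such as PDFs simply fail the marker check and fall through to other converters.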

URL-Based Conversions

YouTube Videos

YouTube video URLs can be converted to extract transcripts and metadata.

Supported URLs: YouTube video and playlist URLs

Features:

  • Automatic transcript extraction
  • Video metadata (title, description)
  • Subtitle/script preservation

Usage:

markitdown "https://www.youtube.com/watch?v=VIDEO_ID"

Source: packages/markitdown/src/markitdown/converters/_youtube_converter.py

Cloud-Based Conversions

Azure Document Intelligence

For PDFs requiring advanced layout analysis, Azure Document Intelligence provides higher-quality extraction.

Installation: pip install 'markitdown[az-doc-intel]'

Usage:

markitdown document.pdf -d -e "<document_intelligence_endpoint>"

Source: packages/markitdown/src/markitdown/converters/_docintel_converter.py

Azure Content Understanding

Azure Content Understanding provides structured field extraction with YAML front matter output.

Installation: pip install 'markitdown[az-content-understanding]'

Supported File Types via CU:

  • PDF (with structured extraction)
  • Images
  • Audio
  • Video

Source: packages/markitdown/src/markitdown/converters/_cu_converter.py

Plugin-Extended Formats

OCR Plugin (markitdown-ocr)

The markitdown-ocr plugin extends PDF, DOCX, PPTX, and XLSX converters with LLM Vision OCR for embedded images.

Installation:

pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client

Supported Formats with OCR:

  • PDF (embedded images and scanned pages)
  • DOCX (inline images)
  • PPTX (inline images)
  • XLSX (inline images)

Usage:

markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o

Important: Issue #1897 reports that the CLI can reject --llm-client and --llm-model as unrecognized arguments. If that happens, pass llm_client and llm_model to the MarkItDown constructor with enable_plugins=True instead.

Source: packages/markitdown-ocr/README.md

Unsupported Formats

The following formats are explicitly not supported:

| Format | Extension | Notes |
| --- | --- | --- |
| Legacy Word | .doc | Only .docx is supported (see Issue #23) |
| OneNote | .one | No current support (see Issue #47) |
| RTF | .rtf | Requires a custom plugin |

Converter Priority System

Each converter has a priority value that determines when it is tried during the conversion process:

| Priority Value | Meaning | Example |
| --- | --- | --- |
| -1.0 (Plugin) | Runs before built-in converters | markitdown-ocr converters |
| 0.0 (PRIORITY_SPECIFIC_FILE_FORMAT) | Specific file format | DOCX, PDF converters |
| 10.0 (PRIORITY_GENERIC_FILE_FORMAT) | Generic fallback formats | ZIP, HTML, plain text converters |

Converters with lower priority values are tried first. The first converter whose accepts() returns True is used for the conversion.

Feature Comparison by Format

FormatTextTablesImagesMathMetadataNotes
PDF✓*-*With OCR plugin
DOCX-Math via OMML→LaTeX
PPTX--Slide-based
XLSX--Sheet-based
Images---EXIF metadata
Audio✓**---**Transcription
HTML--
EPUB-
CSV----
JSON----Formatted

Azure Integrations

Related topics: Python API Reference, Supported File Formats

MarkItDown provides two Azure-based integration options for enhanced document conversion: Azure Document Intelligence and Azure Content Understanding. Both integrations leverage cloud-based AI services to provide higher-quality extraction than built-in offline converters, but they serve different use cases and offer distinct capabilities.

Overview

MarkItDown's Azure integrations are implemented as separate converter classes that are enabled only on request. Each relies on optional dependencies that must be installed explicitly:

# Document Intelligence
pip install 'markitdown[docintel]'

# Content Understanding
pip install 'markitdown[az-content-understanding]'
graph TD
    A[Input Document] --> B{MarkItDown Instance}
    B --> C{Feature Flag Check}
    
    C -->|--use-docintel| D[Document Intelligence Converter]
    C -->|--use-cu| E[Content Understanding Converter]
    C -->|No Azure flag| F[Built-in Converters]
    
    D --> G[Azure Document Intelligence Service]
    E --> H[Azure Content Understanding Service]
    
    F --> I[Offline PDF/Office Extractors]
    
    G --> J[Markdown Output]
    H --> K[Markdown + YAML Front Matter]
    I --> J

Source: packages/markitdown/src/markitdown/__main__.py:86-130

Azure Document Intelligence

Azure Document Intelligence (formerly Form Recognizer) provides cloud-based layout analysis and OCR for document conversion. It is particularly useful for scanned PDFs and complex document layouts.

Supported File Types

| File Type | Description | Notes |
|-----------|-------------|-------|
| pdf | PDF documents | Full OCR support for scanned documents |
| docx | Word documents | Enhanced layout preservation |
| pptx | PowerPoint presentations | Slide structure extraction |
| xlsx | Excel spreadsheets | Table and data extraction |
| html | HTML documents | Markup interpretation |

Source: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:1-50

Installation and Configuration

  1. Create an Azure Document Intelligence resource in the Azure portal
  2. Obtain the endpoint URL and API key (or configure managed identity)
  3. Install the Document Intelligence extra:

pip install 'markitdown[docintel]'

CLI Usage

markitdown document.pdf -o output.md --use-docintel -e "https://<your-resource>.cognitiveservices.azure.com/"

| Argument | Short | Description |
|----------|-------|-------------|
| `--use-docintel` | `-d` | Enable Document Intelligence converter |
| `--endpoint` | `-e` | Document Intelligence endpoint URL |

Source: packages/markitdown/src/markitdown/__main__.py:86-100

Python API

from markitdown import MarkItDown

# Using the endpoint only; supply credentials separately (see the parameters below)
md = MarkItDown(docintel_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("document.pdf")
print(result.text_content)

The MarkItDown class accepts the following Document Intelligence parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `docintel_endpoint` | str | Azure Document Intelligence endpoint URL |
| `docintel_api_key` | str | API key for authentication |
| `docintel_use_custom_model` | bool | Whether to use a custom model |
| `docintel_model_id` | str | Custom model identifier |

Source: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:100-150

Internal Architecture

The Document Intelligence converter (_doc_intel_converter.py) operates as follows:

sequenceDiagram
    participant App as MarkItDown
    participant DI as DocumentIntelligenceClient
    participant Azure as Azure Document Intelligence Service
    
    App->>DI: Create client with endpoint + credentials
    App->>DI: analyze_document(file_stream, features=[OCR, STYLE])
    DI->>Azure: POST request with document
    Azure-->>DI: AnalyzeResult with markdown content
    DI-->>App: DocumentConverterResult with markdown

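The converter uses the DocumentAnalysisFeature enum to request optional analysis features such as high-resolution OCR. The enum has no plain OCR member; the selection below is illustrative, and the features the converter actually requests may vary by file type:

from azure.ai.documentintelligence.models import DocumentAnalysisFeature

# High-resolution OCR plus font style extraction
features = [DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, DocumentAnalysisFeature.STYLE_FONT]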

Source: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py:50-100

Azure Content Understanding

Azure Content Understanding provides higher-quality, multi-modal extraction with structured field output. It supports documents, images, audio, and video files through prebuilt or custom analyzers.

Key Capabilities

| Capability | Description |
|------------|-------------|
| Multi-modal support | Documents, images, audio, and video |
| Structured field extraction | YAML front matter from analyzer fields |
| Prebuilt analyzers | Domain-specific extraction (search, contracts, etc.) |
| Custom analyzers | User-defined extraction patterns |
| Cloud-based OCR | Higher-quality text recognition for scanned documents |

Source: packages/markitdown/src/markitdown/converters/_cu_converter.py:1-50

When to Use Content Understanding

| Use Case | Recommendation |
|----------|----------------|
| Scanned PDFs requiring high-quality OCR | Use Content Understanding |
| Complex tables and multi-page documents | Use Content Understanding |
| Audio transcription | Use Content Understanding |
| Video analysis | Use Content Understanding |
| Domain-specific field extraction | Use Content Understanding with custom analyzer |
| Simple text extraction | Use built-in converters |
| Basic audio transcription | Built-in converters are sufficient |

Installation

pip install 'markitdown[az-content-understanding]'

CLI Usage

markitdown document.pdf --use-cu --cu-endpoint "https://<your-resource>.cognitiveservices.azure.com/"

| Argument | Description |
|----------|-------------|
| `--use-cu` | Enable Content Understanding converter |
| `--cu-endpoint` | Content Understanding endpoint URL |

Source: packages/markitdown/src/markitdown/__main__.py:100-130

Python API

Zero-config usage (auto-selects analyzer):

from markitdown import MarkItDown

md = MarkItDown(cu_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("report.pdf")
print(result.markdown)

With a custom analyzer:

from markitdown import MarkItDown
from markitdown.converters._cu_converter import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    cu_file_types=[ContentUnderstandingFileType.PDF],
    cu_analyzer_id="your-custom-analyzer-id"
)
result = md.convert("contract.pdf")
print(result.markdown)

Source: README.md

Supported File Types

The Content Understanding converter supports automatic analyzer selection based on file type:

| File Type | Auto-selected Analyzer |
|-----------|------------------------|
| PDF, DOCX, PPTX | prebuilt-documentSearch |
| Images (JPG, PNG) | prebuilt-documentSearch |
| Video (MP4, MOV) | prebuilt-videoSearch |
| Audio (WAV, MP3) | prebuilt-audioSearch |
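
The auto-selection table can be read as a lookup keyed by file extension. The following is a minimal sketch, not the converter's implementation: `ANALYZER_BY_EXTENSION` and `select_analyzer` are hypothetical names, and the extension spellings are assumptions derived from the table.

```python
from pathlib import PurePath
from typing import Optional

# Analyzer IDs come from the table above; the extension keys are assumed spellings.
ANALYZER_BY_EXTENSION = {
    ".pdf": "prebuilt-documentSearch",
    ".docx": "prebuilt-documentSearch",
    ".pptx": "prebuilt-documentSearch",
    ".jpg": "prebuilt-documentSearch",
    ".png": "prebuilt-documentSearch",
    ".mp4": "prebuilt-videoSearch",
    ".mov": "prebuilt-videoSearch",
    ".wav": "prebuilt-audioSearch",
    ".mp3": "prebuilt-audioSearch",
}


def select_analyzer(filename: str, override: Optional[str] = None) -> Optional[str]:
    """Return an analyzer ID for a file, preferring an explicit override."""
    if override:
        return override
    # Extensions are compared case-insensitively; unknown types yield None.
    return ANALYZER_BY_EXTENSION.get(PurePath(filename).suffix.lower())
```

Passing an override mirrors the custom analyzer case: the explicit ID takes precedence over the file type.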

Output Format

Content Understanding output begins with YAML front matter that holds the extracted analyzer fields, followed by the Markdown body.
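
The exact fields depend on the analyzer, so none are listed here. As a minimal sketch, assuming the output starts with a front matter block delimited by `---` lines, the two parts can be separated without a YAML library. `split_front_matter` and the sample string are illustrative and not part of MarkItDown.

```python
from typing import Tuple


def split_front_matter(markdown: str) -> Tuple[str, str]:
    """Split '---' delimited front matter from the Markdown body.

    Returns (front_matter, body); front_matter is empty if none is present.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return front, body
    return "", markdown  # no closing delimiter: treat everything as body


# Made-up sample; real field names depend on the analyzer in use.
sample = "---\ntitle: Example\n---\n\n# Heading\n\nBody text."
front, body = split_front_matter(sample)
```

The front matter string can then be handed to a YAML parser of your choice if structured access to the fields is needed.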

Source: https://github.com/microsoft/markitdown / Human Manual

OCR Plugin

Related topics: Supported File Formats, Plugin Development Guide

The MarkItDown OCR Plugin (markitdown-ocr) is an official plugin that adds Optical Character Recognition (OCR) capabilities to MarkItDown. It extends the built-in converters for PDF, DOCX, PPTX, and XLSX files to extract text from embedded images using LLM Vision, providing enhanced document conversion for files containing scanned content or images with text.

Overview

MarkItDown's built-in converters handle native text extraction from documents, but many documents contain text embedded in images rather than as accessible text. The OCR plugin addresses this gap by:

  • Extracting images from document formats (PDF, DOCX, PPTX, XLSX)
  • Sending images to LLM Vision models for text extraction
  • Inserting extracted text inline with document content in reading order
  • Providing full-page OCR fallback for scanned PDFs

Source: packages/markitdown-ocr/README.md

Features

The OCR plugin provides the following capabilities:

| Feature | Description |
|---------|-------------|
| Enhanced PDF Converter | Extracts text from images within PDFs, with full-page OCR fallback for scanned documents |
| Enhanced DOCX Converter | OCR for images embedded in Word documents |
| Enhanced PPTX Converter | OCR for images embedded in PowerPoint presentations |
| Enhanced XLSX Converter | OCR for images embedded in Excel spreadsheets |
| Context Preservation | Maintains document structure and flow when inserting extracted text |
| LLM Vision Integration | Uses OpenAI-compatible LLM clients for image-to-text conversion |
| Malformed PDF Handling | Retries problematic PDFs with PyMuPDF when pdfplumber/pdfminer fail |

Source: packages/markitdown-ocr/README.md

Architecture

Plugin Registration Flow

The OCR plugin uses MarkItDown's plugin architecture with a priority-based replacement strategy. Converters are registered at priority -1.0, which causes them to run before the built-in converters at priority 0.0, effectively replacing the standard conversion behavior when the plugin is enabled.

graph TD
    A[User Creates MarkItDown Instance<br/>enable_plugins=True] --> B[MarkItDown Discovers Plugin<br/>via markitdown.plugin entry point]
    B --> C[Calls register_converters<br/>with all kwargs]
    C --> D[Plugin Creates LLMVisionOCRService<br/>from llm_client/llm_model]
    D --> E[Registers 4 OCR Converters<br/>at priority -1.0]
    E --> F[Built-in Converters Remain<br/>at priority 0.0 as fallback]
    
    G[File Conversion Request] --> H{Which Converter<br/>Accepts First?}
    H -->|OCR Converter| I[Extract Images from Document]
    H -->|Built-in Converter| J[Standard Conversion]
    I --> K[Send Images to LLM Vision]
    K --> L[Insert Extracted Text Inline]
    L --> M[Return Markdown Result]
    J --> N[Return Standard Markdown]

Source: packages/markitdown-ocr/src/markitdown_ocr/_plugin.py

Component Overview

| Component | File | Purpose |
|-----------|------|---------|
| `LLMVisionOCRService` | `_ocr_service.py` | Core OCR service that handles LLM Vision API calls |
| `PdfConverterWithOCR` | `_pdf_converter_with_ocr.py` | Enhanced PDF converter with image extraction and full-page OCR |
| `DocxConverterWithOCR` | `_docx_converter_with_ocr.py` | Enhanced DOCX converter with image extraction |
| `PptxConverterWithOCR` | `_pptx_converter_with_ocr.py` | Enhanced PPTX converter with image extraction |
| `XlsxConverterWithOCR` | `_xlsx_converter_with_ocr.py` | Enhanced XLSX converter with image extraction |

Source: packages/markitdown-ocr/src/markitdown_ocr/_plugin.py

Installation

Prerequisites

  • Python 3.10 or higher
  • MarkItDown core package installed
  • An OpenAI-compatible LLM client (e.g., openai, AzureOpenAI)

Install the Plugin

pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client

To verify installation and see available plugins:

markitdown --list-plugins

Source: packages/markitdown-ocr/README.md

Usage

Command-Line Interface

Enable the OCR plugin using the --use-plugins flag along with LLM configuration:

markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
[!IMPORTANT]
The --llm-client and --llm-model arguments must be passed when using the OCR plugin via CLI. Without an llm_client, the plugin loads but OCR is silently skipped, falling back to the standard built-in converter.

Source: packages/markitdown-ocr/README.md

Python API

#### Basic Usage with OpenAI

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)

Source: packages/markitdown-ocr/README.md

#### Using Azure OpenAI

from markitdown import MarkItDown
from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="your-api-key",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.docx")
print(result.text_content)

Source: packages/markitdown-ocr/README.md

#### Custom Extraction Prompt

Override the default prompt for specialized document types:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)

result = md.convert("document_with_tables.pdf")

Source: packages/markitdown-ocr/README.md

#### Fallback Behavior

If no llm_client is provided, the plugin still loads but OCR is silently skipped:

from markitdown import MarkItDown

# Plugin loads but OCR is skipped - falls back to standard converter
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")  # Standard conversion without OCR

Source: packages/markitdown-ocr/README.md
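
Because the fallback is silent, it can help to fail loudly in your own code when OCR is expected. The following is a minimal sketch; `require_ocr_kwargs` is a hypothetical helper, not part of the plugin, and it only assembles the keyword arguments shown in the examples above.

```python
from typing import Any, Dict, Optional


def require_ocr_kwargs(llm_client: Optional[Any], llm_model: Optional[str]) -> Dict[str, Any]:
    """Build MarkItDown keyword arguments, refusing to continue without OCR settings."""
    if llm_client is None or not llm_model:
        raise ValueError("OCR requires both llm_client and llm_model")
    return {"enable_plugins": True, "llm_client": llm_client, "llm_model": llm_model}


# Usage (assumes markitdown, markitdown-ocr and openai are installed):
#   md = MarkItDown(**require_ocr_kwargs(OpenAI(), "gpt-4o"))
```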

Configuration Options

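| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `enable_plugins` | bool | Yes | False | Enable plugin loading |
| `llm_client` | OpenAI-compatible | Yes* | None | LLM client for Vision OCR |
| `llm_model` | str | Yes* | None | Model name (e.g., gpt-4o) |
| `llm_prompt` | str | No | System default | Custom prompt for text extraction |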

*Required for OCR to function; without these, the plugin falls back to standard converters.

Source: packages/markitdown-ocr/src/markitdown_ocr/_plugin.py

Supported File Formats

PDF

| PDF Type | OCR Behavior |
|----------|--------------|
| Text-based PDFs | Extracts embedded images and OCRs them inline with surrounding text |
| Scanned PDFs | Detected automatically when no extractable text exists; each page rendered at 300 DPI and sent to LLM |
| Malformed PDFs | Retried with PyMuPDF rendering if pdfplumber/pdfminer fail |

Source: packages/markitdown-ocr/README.md

Office Documents (DOCX, PPTX, XLSX)

For Office Open XML formats, images are extracted via document part relationships and OCR is performed before the conversion pipeline:

  1. Images are extracted from the document archive
  2. Each image is processed through the LLM Vision OCR service
  3. Placeholder tokens are injected into the content
  4. The standard conversion pipeline executes with OCR placeholders preserved

Source: packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py

How It Works

PDF Conversion Flow

graph TD
    A[PDF Document Input] --> B{Contains extractable text?}
    B -->|Yes| C[Extract text from PDF]
    B -->|No| D[Detected as Scanned PDF]
    C --> E{Contains embedded images?}
    E -->|Yes| F[Extract images by position]
    E -->|No| G[Return standard result]
    F --> H[Send each image to LLM Vision]
    D --> I[Render each page at 300 DPI]
    I --> H
    H --> J[Interleave extracted text<br/>with OCR results in reading order]
    J --> K[Return Markdown with<br/>OCR text inline]
    G --> K

Source: packages/markitdown-ocr/README.md
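
The branching in the PDF flow can be summarized as a small decision function. This is a sketch of the flow's logic only: `plan_pdf_ocr` and its step labels are hypothetical, and the real converter works on page objects rather than counts.

```python
from typing import List


def plan_pdf_ocr(has_text: bool, image_count: int, page_count: int) -> List[str]:
    """Return the processing steps the flow above implies for one PDF."""
    if not has_text:
        # Scanned PDF: every page is rendered and sent to the vision model.
        return [f"ocr-page-{n}" for n in range(1, page_count + 1)]
    if image_count == 0:
        # Text-only PDF: nothing to OCR, so the standard result is returned.
        return ["standard-conversion"]
    # Text-based PDF with images: OCR each image, then interleave with the text.
    image_steps = [f"ocr-image-{n}" for n in range(1, image_count + 1)]
    return ["extract-text"] + image_steps + ["interleave"]
```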

DOCX/PPTX/XLSX Conversion Flow

graph TD
    A[Office Document Input] --> B[Extract images from<br/>document part relationships]
    B --> C[Process each image<br/>through LLM Vision OCR]
    C --> D[Generate placeholder tokens<br/>MARKITDOWNOCRBLOCK{id}]
    D --> E[Inject placeholders into<br/>HTML/Document content]
    E --> F[Run standard conversion pipeline]
    F --> G[Replace placeholders with<br/>OCR-extracted text]
    G --> H[Return Markdown result]

Source: packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py
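
The placeholder round trip can be sketched as follows. The token shape follows the diagram above; `inject_placeholders`, `restore_placeholders`, and the stand-in conversion step are hypothetical and simplified, not the plugin's code.

```python
import re
from typing import Dict, List, Tuple

TOKEN_TEMPLATE = "MARKITDOWNOCRBLOCK{id}"  # token shape taken from the diagram above


def inject_placeholders(ocr_texts: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Create one placeholder token per OCR result and remember its text."""
    tokens = [TOKEN_TEMPLATE.format(id=index) for index in range(len(ocr_texts))]
    return tokens, dict(zip(tokens, ocr_texts))


def restore_placeholders(markdown: str, mapping: Dict[str, str]) -> str:
    """Replace every known placeholder token in the Markdown with its OCR text."""
    pattern = re.compile(r"MARKITDOWNOCRBLOCK\d+")
    # Unknown tokens are left as they are instead of being dropped.
    return pattern.sub(lambda match: mapping.get(match.group(0), match.group(0)), markdown)


tokens, mapping = inject_placeholders(["Total: 42", "Figure 1 caption"])
# Stand-in for the standard conversion step, which leaves the tokens untouched.
converted = f"# Report\n\n{tokens[0]}\n\nSome text.\n\n{tokens[1]}"
final = restore_placeholders(converted, mapping)
```

A plausible reason for plain alphanumeric tokens is that they pass through an HTML-to-Markdown conversion without being escaped or reformatted, so they can be found again afterwards.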

Common Issues and Troubleshooting

Unrecognized Arguments Error

If you encounter an "Unrecognized Arguments" error when using CLI arguments like --llm-client and --llm-model, ensure you have installed the correct version of the plugin. The CLI example shown in documentation may differ between versions.

Workaround: Use the Python API for more reliable argument handling.

Source: Community Issue #1897

Office Open XML Validation

When converting invalid DOCX, XLSX, or PPTX files, MarkItDown may return a successful result containing the message "This is not a valid Office Open XML file." in text_content rather than raising an exception. This is a known limitation of the underlying converters.

Workaround: Validate files before conversion or check text_content for error strings.

Source: Community Issue #1408
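
Both workarounds can be sketched with the standard library. The pre-check relies on the fact that Office Open XML files are ZIP archives containing a `[Content_Types].xml` entry; `looks_like_ooxml` and `conversion_failed` are hypothetical helper names, and the check is a heuristic rather than full validation.

```python
import io
import zipfile
from typing import BinaryIO

ERROR_MARKER = "This is not a valid Office Open XML file."  # message quoted above


def looks_like_ooxml(stream: BinaryIO) -> bool:
    """Cheap pre-check: is this a ZIP archive with a [Content_Types].xml entry?"""
    start = stream.tell()
    try:
        if not zipfile.is_zipfile(stream):
            return False
        stream.seek(start)
        with zipfile.ZipFile(stream) as archive:
            return "[Content_Types].xml" in archive.namelist()
    finally:
        stream.seek(start)  # leave the stream where we found it


def conversion_failed(text_content: str) -> bool:
    """Post-check: detect the error message returned in place of an exception."""
    return ERROR_MARKER in text_content


not_office = io.BytesIO(b"plain bytes, not a ZIP archive")
precheck = looks_like_ooxml(not_office)
```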

LLM Call Failures

If an LLM call fails during OCR processing, the conversion continues without that specific image's text. The plugin is designed to be resilient to partial failures.

Source: packages/markitdown-ocr/README.md
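
The partial-failure behavior amounts to isolating each image's LLM call. The following is a minimal sketch of that pattern, assuming one call per image; `ocr_images` and `fake_llm` are hypothetical and stand in for the plugin's service and a real vision model.

```python
from typing import Callable, Dict


def ocr_images(images: Dict[str, bytes], extract_text: Callable[[bytes], str]) -> Dict[str, str]:
    """Run OCR per image, skipping images whose call raises."""
    results: Dict[str, str] = {}
    for name, data in images.items():
        try:
            results[name] = extract_text(data)
        except Exception:
            # One failed call must not abort the whole conversion.
            continue
    return results


def fake_llm(data: bytes) -> str:
    """Hypothetical stand-in for an LLM Vision call; fails on empty input."""
    if not data:
        raise RuntimeError("simulated LLM failure")
    return data.decode("utf-8").upper()


texts = ocr_images({"a.png": b"hello", "b.png": b"", "c.png": b"world"}, fake_llm)
```

In production code you would likely log the skipped image so that missing OCR text can be traced later.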

Memory Management

Recent releases (v0.1.6+) address O(n) memory growth during PDF conversion by properly calling page.close() after processing each PDF page. Ensure you are running the latest version for optimal memory efficiency.

Source: Release Notes v0.1.6

Source: https://github.com/microsoft/markitdown / Human Manual

MCP Server

Related topics: Python API Reference, Plugin Development Guide

# MCP Server

The MarkItDown MCP (Model Context Protocol) Server is a component that enables AI coding assistants and LLM-powered tools to directly utilize MarkItDown's document conversion capabilities through the standardized MCP protocol. This integration allows AI assistants to process documents without requiring custom tool implementations or external script execution.

## Overview

The MCP Server acts as a bridge between AI assistants (such as Claude, Cursor, or other MCP-compatible clients) and the MarkItDown Python library. It exposes MarkItDown's conversion functionality as MCP tools that AI assistants can invoke directly.

graph LR
    AI[AI Assistant<br/>Claude, Cursor, etc.] -->|MCP Protocol| MCP[MarkItDown<br/>MCP Server]
    MCP -->|Convert| MD[MarkItDown<br/>Python Library]
    MD -->|Documents| DOC[PDF, DOCX, PPTX<br/>XLSX, Images, etc.]
    DOC -->|Markdown| AI


**Key capabilities provided through MCP:**

- Convert local files to Markdown format
- Process files from URLs (HTTP/HTTPS)
- Process data URIs (base64-encoded content)
- Support for Azure Document Intelligence integration
- Plugin system integration for extended formats
- Configurable LLM clients for image descriptions

Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/README.md)
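
For the data URI case, the source string can be built from raw bytes with the standard library. This is a minimal sketch; `to_data_uri` is a hypothetical helper, and the media type you pass is your own choice rather than something the server prescribes.

```python
import base64


def to_data_uri(payload: bytes, media_type: str = "application/octet-stream") -> str:
    """Encode bytes as a base64 data URI suitable for passing as a source string."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


uri = to_data_uri(b"hi", "text/plain")
```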

## Installation

### From PyPI

The MCP server can be installed as a separate package:

pip install markitdown-mcp


### Using Docker

Pre-built Docker images are available for isolated execution:

docker pull ghcr.io/microsoft/markitdown-mcp:latest


To run the container:

docker run --rm -i ghcr.io/microsoft/markitdown-mcp:latest


Source: [packages/markitdown-mcp/Dockerfile](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/Dockerfile)

### From Source

For development or customization:

git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown-mcp


## Configuration

### Environment Variables

The MCP server reads configuration from environment variables, allowing flexible deployment without code changes.

| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `MARKITDOWN_ENABLE_PLUGINS` | Enable 3rd-party plugin support | `false` |
| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure Document Intelligence endpoint URL | Not set |
| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure Document Intelligence API key | Not set |

The server respects the `MARKITDOWN_ENABLE_PLUGINS` environment variable during initialization, allowing plugin support to be toggled without modifying the server code.

Source: [packages/markitdown-mcp/src/markitdown_mcp/__init__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__init__.py)
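
Reading a boolean flag from the environment typically looks like the sketch below. The accepted spellings (`true`, `1`, `yes`) and the function name `plugins_enabled` are assumptions for illustration; check the server source for the exact values it recognizes.

```python
import os
from typing import Mapping, Optional


def plugins_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret MARKITDOWN_ENABLE_PLUGINS as a boolean (accepted spellings are assumed)."""
    env = os.environ if environ is None else environ
    value = env.get("MARKITDOWN_ENABLE_PLUGINS", "false")
    # Whitespace and letter case are ignored; anything unrecognized counts as false.
    return value.strip().lower() in {"true", "1", "yes"}
```

Accepting an explicit mapping keeps the function easy to test without touching the real process environment.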

### Server Startup Options

When running the MCP server directly, several startup parameters control its behavior.

| Option | Description |
|--------|-------------|
| `--host` | Host address to bind the server (default: `127.0.0.1`) |
| `--port` | Port number to listen on (default: `8000`) |
| `--log-level` | Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |

> [!WARNING]
> The server binds to localhost by default. Binding to non-local interfaces (`0.0.0.0`) should only be done in trusted environments with proper network access controls.

Source: [packages/markitdown-mcp/src/markitdown_mcp/__main__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__main__.py)

## MCP Tools

The server exposes the following tools to MCP clients:

### `markitdown_convert`

Converts a document to Markdown format.

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `source` | string | Yes | File path, URL, or data URI to convert |
| `use_plugins` | boolean | No | Enable 3rd-party plugins (default: `false`) |

**Returns:** Markdown-formatted text content extracted from the document.

### `markitdown_list_plugins`

Lists all installed 3rd-party plugins available to MarkItDown.

**Parameters:** None required

**Returns:** List of installed plugin names and their package locations.

### `markitdown_get_version`

Returns the current version of MarkItDown.

**Parameters:** None required

**Returns:** Version string (e.g., `"0.1.6"`).

Source: [packages/markitdown-mcp/src/markitdown_mcp/__init__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__init__.py)

## Architecture

### Component Flow

sequenceDiagram
    participant Client as AI Assistant
    participant MCPServer as MCP Server
    participant MarkItDown as MarkItDown Core
    participant Plugin as Plugin System
    participant Azure as Azure Services

    Client->>MCPServer: Call markitdown_convert(source)
    MCPServer->>MarkItDown: Forward conversion request
    alt Plugin Enabled
        MCPServer->>Plugin: Load enabled plugins
        Plugin->>MarkItDown: Register custom converters
    end
    alt Azure Document Intelligence
        MarkItDown->>Azure: Call Document Intelligence API
        Azure-->>MarkItDown: Extracted content
    end
    MarkItDown-->>MCPServer: Markdown result
    MCPServer-->>Client: Return text_content


### Request Processing Pipeline

1. **Client Request**: AI assistant invokes an MCP tool with parameters
2. **Server Validation**: MCP server validates input parameters
3. **MarkItDown Initialization**: Creates `MarkItDown` instance with appropriate configuration
4. **Format Detection**: Determines file type from extension or URI scheme
5. **Conversion**: Routes to appropriate converter based on file type
6. **Result Serialization**: Returns markdown content via MCP protocol

Source: [packages/markitdown-mcp/src/markitdown_mcp/__main__.py](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/src/markitdown_mcp/__main__.py)
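
Step 4 of the pipeline, format detection from an extension or URI scheme, can be sketched with `urllib.parse`. `classify_source` is a hypothetical function, and the real library also inspects MIME types and file content, which this sketch leaves out.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse


def classify_source(source: str) -> Tuple[str, str]:
    """Return (kind, extension) where kind is 'url', 'data', or 'file'."""
    parsed = urlparse(source)
    scheme = parsed.scheme.lower()
    if scheme in ("http", "https"):
        # Use the URL path only, so query strings do not pollute the extension.
        return "url", PurePosixPath(parsed.path).suffix.lower()
    if scheme == "data":
        return "data", ""  # the type comes from the media type, not an extension
    if scheme == "file":
        return "file", PurePosixPath(parsed.path).suffix.lower()
    # No recognized scheme: treat the whole string as a local path.
    return "file", PurePosixPath(source).suffix.lower()
```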

## Supported Input Formats

The MCP server inherits all format support from the core MarkItDown library:

| Category | Formats |
|----------|---------|
| **Documents** | PDF, DOCX, PPTX, XLSX, ODT |
| **Images** | JPG, PNG, GIF, BMP, WebP (with OCR and EXIF extraction) |
| **Audio** | MP3, WAV, FLAC (with EXIF and transcription) |
| **Web** | HTML, Wikipedia pages |
| **Data** | CSV, JSON, XML, RSS/Atom feeds |
| **Archives** | ZIP (iterates over contents) |
| **E-books and notebooks** | EPUB, Jupyter notebooks |
| **Video** | Via Azure Content Understanding (when configured) |

Source: [README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/README.md)

## Integration with Azure Services

### Azure Document Intelligence

When `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` is configured, the MCP server can leverage Azure's cloud-based document extraction for higher quality results:

export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-api-key"


This is particularly beneficial for:
- Scanned PDF documents
- Complex table structures
- Handwritten content
- Multi-language documents

Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/README.md)

### LLM-Based Image Processing

When the MCP server is used with MarkItDown's built-in LLM support, image descriptions can be generated for embedded images in documents:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)


This same configuration pattern applies when initializing the MCP server, enabling AI assistants to get both document conversion and intelligent image descriptions.

## Security Considerations

> [!IMPORTANT]
> MarkItDown performs I/O with the privileges of the current process. Like `open()` or `requests.get()`, it accesses resources that the process itself can access.

**Recommendations for untrusted environments:**

1. **Sanitize inputs**: Validate file paths and URLs before passing to the MCP server
2. **Use narrow conversion functions**: Prefer `convert_stream()` or `convert_local()` when possible
3. **Restrict network access**: Run the MCP server in a sandboxed environment
4. **Limit file access**: Use container isolation (Docker) to restrict filesystem access

Source: [README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/README.md)
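
The first recommendation can be sketched as an allow-list check applied before a source string reaches the server. `is_allowed_source` is a hypothetical helper and the policy (HTTPS only, local paths confined to one directory) is an assumed example; adapt it to your own threat model.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_allowed_source(source: str, allowed_root: Path, allow_http: bool = False) -> bool:
    """Accept HTTPS URLs and local paths that resolve inside allowed_root."""
    scheme = urlparse(source).scheme.lower()
    if "://" in source or scheme == "data":
        # Any URI is rejected unless it is HTTPS (or HTTP when explicitly allowed).
        return scheme == "https" or (allow_http and scheme == "http")
    # Resolve symlinks and '..' segments before comparing against the root.
    candidate = (allowed_root / source).resolve()
    return candidate.is_relative_to(allowed_root.resolve())
```

Resolving the path before the comparison is what stops `../` traversal and absolute paths from escaping the allowed directory.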

## Usage Examples

### Claude Desktop Integration

Add to your Claude Desktop configuration:

{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp",
      "args": ["--host", "127.0.0.1", "--port", "8000"]
    }
  }
}


### Python Client Usage

async def convert_pdf(call_mcp_tool):
    # Example MCP client calling markitdown_convert.
    # call_mcp_tool is a placeholder for your MCP client's tool-invocation coroutine.

    # Tool call to convert a PDF
    tool_request = {
        "name": "markitdown_convert",
        "arguments": {
            "source": "/path/to/document.pdf",
            "use_plugins": True,
        },
    }

    # Process response
    response = await call_mcp_tool(tool_request)
    return response["text_content"]


### Using with Azure Document Intelligence

# Set environment variables
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="YOUR-KEY"

# Run MCP server with Azure integration
markitdown-mcp --host 127.0.0.1 --port 8000


Source: [packages/markitdown-mcp/README.md](https://github.com/microsoft/markitdown/blob/e144e0a2be95b34df17433bac904e635f2c5e551/packages/markitdown-mcp/README.md)

## Troubleshooting

### Server Connection Issues

| Symptom | Solution |
|---------|----------|
| Connection refused | Ensure server is running and `--host`/`--port` are correct |
| Timeout errors | Check firewall rules; server binds to localhost by default |
| Plugin not found | Install plugin package and set `MARKITDOWN_ENABLE_PLUGINS=true` |

### Conversion Failures

| Error | Cause | Resolution |
|-------|-------|------------|
| Unsupported format | File type not recognized | Check supported formats list |
| Plugin error | Plugin failed during conversion | Enable debug logging; check plugin compatibility |
| Azure auth failure | Invalid credentials | Verify `AZURE_DOCUMENT_INTELLIGENCE_KEY` |

### Common Issues

**Issue**: `UnicodeDecodeError` on non-ASCII files  
**Context**: Reported in [GitHub Issue #1894](https://github.com/microsoft/markitdown/issues/1894)  
**Status**: This affects the core library; ensure you're using the latest version.

**Issue**: Office Open XML files return success with error message  
**Context**: Reported in [GitHub Issue #1408](https://github.com/microsoft/markitdown/issues/1408)  
**Note**: Invalid DOCX/XLSX/PPTX files may return `"This is not a valid Office Open XML file."` in text_content rather than raising an exception.

**Issue**: RuntimeWarning about ffmpeg on Linux  
**Context**: Reported in [GitHub Issue #1685](https://github.com/microsoft/markitdown/issues/1685)  
**Resolution**: Install ffmpeg system package for audio conversion features

## See Also

- [Main README](README.md) — Project overview and core documentation
- [Plugin Development](markitdown-sample-plugin.md) — Guide to creating custom plugins
- [OCR Plugin](markitdown-ocr.md) — LLM Vision OCR for embedded images
- [Azure Content Understanding](markitdown-content-understanding.md) — Advanced cloud-based extraction

Source: https://github.com/microsoft/markitdown / Human Manual

Plugin Development Guide

Related topics: Architecture Overview, OCR Plugin, MCP Server

This guide explains how to create custom plugins for MarkItDown to extend its document conversion capabilities. The plugin architecture allows developers to add support for new file formats or override existing converters with custom implementations.

Overview

MarkItDown uses a plugin-based architecture that enables third-party developers to extend its document conversion capabilities. Plugins can register custom DocumentConverter implementations that handle specific file formats. The system supports:

  • New file format support — Add converters for formats not natively supported (e.g., RTF, EPUB variants)
  • Converter replacement — Override built-in converters with priority-based precedence
  • LLM integration — Pass through LLM client credentials to enable AI-powered features in plugins

Plugins are discovered via Python entry points and loaded dynamically when MarkItDown(enable_plugins=True) is instantiated. Source: packages/markitdown-sample-plugin/README.md

Architecture

The plugin system is built around the following core components:

graph TD
    A[User Code] -->|MarkItDown enable_plugins=True| B[MarkItDown Core]
    B -->|Discovers via entry_points| C[Plugin Entry Point Group<br/>markitdown.plugin]
    C --> D[Plugin Package<br/>markitdown_sample_plugin]
    D -->|Calls| E[register_converters function]
    E -->|Registers| F[DocumentConverter Subclass<br/>RtfConverter]
    B -->|Stores| F
    F --> G[Conversion Pipeline]
    G --> H[Markdown Output]
    
    I[LLM Client<br/>llm_client, llm_model] -.->|Forwarded via kwargs| E

Component Responsibilities

| Component | Responsibility | Source |
|-----------|----------------|--------|
| `MarkItDown` | Core orchestrator, plugin discovery, converter dispatch | packages/markitdown/src/markitdown/_markitdown.py |
| `DocumentConverter` | Base class for all converters | packages/markitdown/src/markitdown/_base_converter.py |
| Entry Point Group | Plugin discovery mechanism (`markitdown.plugin`) | packages/markitdown-sample-plugin/README.md |
| `register_converters()` | Plugin callback to register converters | packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py |

Plugin Interface

Interface Version

Plugins must declare the interface version they target. Currently, only version 1 is supported:

__plugin_interface_version__ = 1

Source: packages/markitdown-sample-plugin/README.md

Required Exports

A valid plugin package must export:

| Symbol | Type | Description |
|--------|------|-------------|
| `__plugin_interface_version__` | int | Plugin interface version (must be 1) |
| `register_converters(markitdown, **kwargs)` | function | Called to register converters |
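
A quick self-check for these exports can be written against any module-like object. `missing_plugin_exports` is a hypothetical helper for plugin authors, not something MarkItDown provides, and the `SimpleNamespace` object below merely stands in for an imported plugin module.

```python
from types import SimpleNamespace
from typing import Any, List


def missing_plugin_exports(module: Any) -> List[str]:
    """Return a list of problems with a plugin module's required exports."""
    problems: List[str] = []
    if getattr(module, "__plugin_interface_version__", None) != 1:
        problems.append("__plugin_interface_version__ must be 1")
    if not callable(getattr(module, "register_converters", None)):
        problems.append("register_converters must be callable")
    return problems


# Stand-in for an imported plugin module (illustrative only).
fake_plugin = SimpleNamespace(
    __plugin_interface_version__=1,
    register_converters=lambda markitdown, **kwargs: None,
)
problems = missing_plugin_exports(fake_plugin)
```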

Creating a Document Converter

Step 1: Subclass DocumentConverter

Create a new converter class that inherits from DocumentConverter:

from typing import BinaryIO, Any
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    def __init__(
        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
    ):
        super().__init__(priority=priority)

Source: packages/markitdown-sample-plugin/README.md

Step 2: Implement the `accepts()` Method

The accepts() method determines whether this converter can handle a given file stream. Return True if the converter should process the file:

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool:
    # Check if the file stream is an RTF file.
    # The RTF signature "{\rtf" is five bytes long, so read five bytes.
    cur_pos = file_stream.tell()
    header = file_stream.read(5)
    file_stream.seek(cur_pos)  # Restore the position for the next reader
    return header.startswith(b'{\\rtf')
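
The signature check can be tried in isolation with an in-memory stream. This is a small sketch rather than the sample plugin's code: `looks_like_rtf` and `RTF_SIGNATURE` are hypothetical names. Note that the signature `{\rtf` is five bytes long, so at least five bytes must be read for the comparison to succeed.

```python
import io
from typing import BinaryIO

RTF_SIGNATURE = b"{\\rtf"  # five bytes: '{', '\', 'r', 't', 'f'


def looks_like_rtf(file_stream: BinaryIO) -> bool:
    """Peek at the next bytes of the stream without disturbing its position."""
    start = file_stream.tell()
    try:
        header = file_stream.read(len(RTF_SIGNATURE))
    finally:
        file_stream.seek(start)  # always hand the stream back where we found it
    return header == RTF_SIGNATURE


rtf = io.BytesIO(b"{\\rtf1\\ansi Hello}")
print(looks_like_rtf(rtf), rtf.tell())          # True 0
print(looks_like_rtf(io.BytesIO(b"%PDF-1.7")))  # False
```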

Step 3: Implement the `convert()` Method

The convert() method performs the actual conversion from the file format to Markdown:

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    # Read and decode the RTF content
    content = file_stream.read().decode('utf-8', errors='replace')

    # Convert RTF to Markdown (_rtf_to_markdown is a placeholder you implement)
    markdown_content = self._rtf_to_markdown(content)

    # DocumentConverterResult takes the Markdown text and an optional title
    return DocumentConverterResult(markdown=markdown_content, title=None)
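
The conversion helper itself is left to the plugin author. As a rough illustration only, the sketch below strips RTF control words with regular expressions. `naive_rtf_to_text` is a hypothetical function, not part of MarkItDown, and it deliberately ignores font tables, hex escapes, and escaped braces, so a real plugin would use a proper RTF parser.

```python
import re


def naive_rtf_to_text(rtf: str) -> str:
    """Very rough RTF-to-text conversion; good enough for a demo only."""
    # Treat paragraph breaks (\par, but not \pard) as newlines.
    text = re.sub(r"\\par(?![a-z]) ?", "\n", rtf)
    # Drop remaining control words such as \rtf1 or \ansi, including an
    # optional numeric parameter and one trailing delimiter space.
    text = re.sub(r"\\[a-z]+-?\d* ?", "", text)
    # Remove group braces and surrounding whitespace.
    return text.replace("{", "").replace("}", "").strip()


print(naive_rtf_to_text(r"{\rtf1\ansi Hello\par World}"))
```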

Priority System

MarkItDown defines two priority constants that control the order in which converters are tried:

| Priority Constant | Value | Use Case |
| --- | --- | --- |
| PRIORITY_SPECIFIC_FILE_FORMAT | 0.0 | Converters for a specific format (e.g., .docx, .pdf, .xlsx); the default |
| PRIORITY_GENERIC_FILE_FORMAT | 10.0 | Near catch-all converters (e.g., for text/* MIME types) |

Lower priority values are tried first. Among converters with equal priority, the most recently registered one is tried first, so a plugin registered at PRIORITY_SPECIFIC_FILE_FORMAT is consulted before built-in converters at the same priority.

Source: packages/markitdown/src/markitdown/_base_converter.py

Plugin Registration

The `register_converters()` Function

Create a register_converters() function in your plugin package. This function is called during MarkItDown instantiation:

from markitdown import MarkItDown

def register_converters(markitdown: MarkItDown, **kwargs):
    """
    Called during construction of MarkItDown instances to register
    converters provided by plugins.
    """
    # Simply create and attach an RtfConverter instance
    markitdown.register_converter(RtfConverter())

Source: packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py

Handling LLM Client Credentials

Plugins can receive and use LLM credentials passed through MarkItDown():

def register_converters(markitdown: MarkItDown, **kwargs):
    # These keys are present only if the caller passed them to MarkItDown()
    llm_client = kwargs.get('llm_client')
    llm_model = kwargs.get('llm_model')
    llm_prompt = kwargs.get('llm_prompt')

    # LLMVisionConverter and BasicRtfConverter are placeholder class names
    # for converters defined by your plugin.
    if llm_client and llm_model:
        converter = LLMVisionConverter(
            llm_client=llm_client,
            llm_model=llm_model,
            llm_prompt=llm_prompt
        )
    else:
        # Fall back to a converter that does not need an LLM
        converter = BasicRtfConverter()

    markitdown.register_converter(converter)

This pattern is used by the markitdown-ocr plugin to enable LLM-powered OCR for images. Source: packages/markitdown-ocr/README.md
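
The keyword forwarding can be seen end to end with a toy host. This sketch assumes nothing about MarkItDown's internals beyond what the text states (keyword arguments given at construction reach `register_converters()`). `ToyHost` is a hypothetical stand-in class, and the registered "converters" are plain dictionaries.

```python
from typing import Any, Callable, Dict, List


class ToyHost:
    """Hypothetical stand-in for MarkItDown: forwards its kwargs to each plugin."""

    def __init__(self, plugins: List[Callable[..., None]], **kwargs: Any) -> None:
        self.converters: List[Dict[str, Any]] = []
        for register in plugins:
            register(self, **kwargs)  # the same kwargs reach every plugin

    def register_converter(self, converter: Dict[str, Any]) -> None:
        self.converters.append(converter)


def register_converters(host: ToyHost, **kwargs: Any) -> None:
    client = kwargs.get("llm_client")
    model = kwargs.get("llm_model")
    # Choose the LLM-backed variant only when both pieces are available.
    mode = "llm" if client is not None and model else "basic"
    host.register_converter({"mode": mode, "model": model})


plain = ToyHost([register_converters])
with_llm = ToyHost([register_converters], llm_client=object(), llm_model="example-model")
print(plain.converters[0]["mode"], with_llm.converters[0]["mode"])  # basic llm
```

Reading optional settings with `kwargs.get()` keeps the plugin usable when the caller supplies no LLM configuration at all.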

Entry Points Configuration

Configure the entry point in your plugin's pyproject.toml:

[project.entry-points."markitdown.plugin"]
sample_plugin = "markitdown_sample_plugin"

| Field | Value | Description |
| --- | --- | --- |
| Group | "markitdown.plugin" | Fixed entry point group name |
| Key | sample_plugin | Plugin identifier (can be any unique name) |
| Value | "markitdown_sample_plugin" | Fully qualified package name |

Source: packages/markitdown-sample-plugin/README.md
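
To see how an entry point's value resolves to an importable module, the standard library's `EntryPoint` class can be constructed by hand. In this sketch the stdlib `json` module stands in for a plugin package (an assumption made so the snippet runs without any plugin installed).

```python
from importlib.metadata import EntryPoint

# "json" stands in for a plugin package name such as "markitdown_sample_plugin".
ep = EntryPoint(name="sample_plugin", value="json", group="markitdown.plugin")

module = ep.load()  # imports the module named by the entry point's value
print(ep.group, ep.name, module.__name__)  # markitdown.plugin sample_plugin json
```

If the value names a package that cannot be imported, `load()` raises an import error, which is a common cause of a plugin that is listed but never registers a converter.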

CLI Integration

Listing Installed Plugins

Use the --list-plugins flag to see all installed third-party plugins:

markitdown --list-plugins

Output format:

Installed MarkItDown 3rd-party Plugins:

  * sample_plugin     (package: markitdown_sample_plugin)

Use the -p (or --use-plugins) option to enable 3rd-party plugins.

Source: packages/markitdown/src/markitdown/__main__.py

Enabling Plugins

Pass the --use-plugins flag to enable third-party plugins:

markitdown --use-plugins document.rtf -o output.md

Plugin Discovery Mechanism

Plugins are discovered using Python's importlib.metadata.entry_points():

from importlib.metadata import entry_points

plugin_entry_points = list(entry_points(group="markitdown.plugin"))

Source: packages/markitdown/src/markitdown/__main__.py

Complete Plugin Example

Below is a minimal but complete plugin structure:

markitdown_rtf_plugin/
├── pyproject.toml
└── src/
    └── markitdown_rtf_plugin/
        ├── __init__.py
        └── _plugin.py

`pyproject.toml`

[project]
name = "markitdown-rtf-plugin"
version = "0.1.0"

[project.entry-points."markitdown.plugin"]
rtf = "markitdown_rtf_plugin"

`src/markitdown_rtf_plugin/_plugin.py`

"""RTF to Markdown converter plugin for MarkItDown."""

from typing import BinaryIO, Any
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo

__plugin_interface_version__ = 1

class RtfConverter(DocumentConverter):
    """Converts RTF files to Markdown format."""

    def __init__(self):
        super().__init__(priority=DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT)

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        extension = (stream_info.extension or "").lower()
        if extension == ".rtf":
            return True
        
        # Check the five-byte RTF signature, then restore the stream position
        cur_pos = file_stream.tell()
        header = file_stream.read(5)
        file_stream.seek(cur_pos)
        return header.startswith(b'{\\rtf')

    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        content = file_stream.read().decode('utf-8', errors='replace')
        # RTF to Markdown conversion logic here
        markdown = self._convert_rtf_to_markdown(content)
        
        # DocumentConverterResult takes the Markdown text and an optional title
        return DocumentConverterResult(markdown=markdown, title=None)
    
    def _convert_rtf_to_markdown(self, content: str) -> str:
        # Simplified conversion logic
        # Replace RTF formatting with Markdown equivalents
        markdown = content
        # ... conversion implementation
        return markdown


def register_converters(markitdown, **kwargs):
    """Register the RTF converter with MarkItDown."""
    markitdown.register_converter(RtfConverter())

`src/markitdown_rtf_plugin/__init__.py`

from ._plugin import register_converters, __plugin_interface_version__

Installation and Testing

Installing the Plugin

Install the plugin in development mode:

pip install -e .

Verifying Installation

List installed plugins:

markitdown --list-plugins

Testing the Plugin

Convert a file using the plugin:

markitdown --use-plugins document.rtf -o output.md

Finding Plugins

To discover available third-party plugins:

  1. Search GitHub for the hashtag #markitdown-plugin
  2. Check PyPI for packages with the markitdown-plugin keyword
  3. Review the Community Plugins section

Source: packages/markitdown-sample-plugin/README.md

Best Practices

Error Handling

Implement robust error handling in your converter:

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    try:
        content = file_stream.read()
        # _convert_to_markdown is a placeholder for your conversion logic
        markdown = self._convert_to_markdown(content)
        return DocumentConverterResult(markdown=markdown)
    except SpecificFormatError:
        # SpecificFormatError stands in for your parser's own exception type;
        # re-raise so callers can tell that the input was malformed
        raise
    except Exception as e:
        # Fallback: return a placeholder result. Prefer re-raising when
        # callers must be able to detect failed conversions.
        return DocumentConverterResult(markdown=f"[Conversion error: {e}]")

Stream Position Management

Always restore the stream position after reading for format detection:

def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
    cur_pos = file_stream.tell()
    header = file_stream.read(512)  # Read to check format
    file_stream.seek(cur_pos)  # Restore position for the next reader
    return self._is_our_format(header)  # Placeholder detection helper
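
One way to make the restore step hard to forget is a context manager that records the position on entry and seeks back on exit, even if detection raises. This is a sketch of that idea, and `preserve_position` is a hypothetical helper rather than a MarkItDown API.

```python
import io
from contextlib import contextmanager
from typing import BinaryIO, Iterator


@contextmanager
def preserve_position(stream: BinaryIO) -> Iterator[BinaryIO]:
    """Restore the stream's position on exit, even if the body raises."""
    start = stream.tell()
    try:
        yield stream
    finally:
        stream.seek(start)


stream = io.BytesIO(b"0123456789")
stream.read(3)  # simulate an earlier reader leaving the cursor at offset 3
with preserve_position(stream) as s:
    header = s.read(4)
print(header, stream.tell())  # b'3456' 3
```

Seeking back to the recorded position, rather than to offset 0, keeps the helper correct when the stream was not at its start to begin with.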

Priority Selection Guidelines

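| Scenario | Recommended Priority |
| --- | --- |
| Adding support for a specific format | PRIORITY_SPECIFIC_FILE_FORMAT (0.0) |
| Adding a broad, catch-all fallback | PRIORITY_GENERIC_FILE_FORMAT (10.0) |
| Taking precedence over a built-in converter | Same value as the built-in (later registrations win ties) or a lower value |

Lower values are tried first.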

Common Issues

Plugin Not Discovered

Ensure the entry point is correctly configured in pyproject.toml:

[project.entry-points."markitdown.plugin"]
your_plugin_name = "your_package_name"

Converter Not Called

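  1. Verify accepts() returns True for your file type
  2. Check that the converter's priority places it ahead of any other converter that accepts the same file
  3. Ensure plugins are enabled (MarkItDown(enable_plugins=True) or --use-plugins) so that register_converters() is called during MarkItDown instantiation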
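
The first check can be automated with a small harness that feeds sample bytes to `accepts()` and also confirms the stream position was left untouched. This is a sketch under stated assumptions: `check_accepts` and `ToyRtfConverter` are hypothetical, and a `SimpleNamespace` stands in for the real `StreamInfo` object.

```python
import io
from types import SimpleNamespace
from typing import Any


def check_accepts(converter: Any, data: bytes, stream_info: Any) -> bool:
    """Call converter.accepts() and fail loudly if it moves the stream position."""
    stream = io.BytesIO(data)
    accepted = converter.accepts(stream, stream_info)
    if stream.tell() != 0:
        raise AssertionError("accepts() did not restore the stream position")
    return bool(accepted)


class ToyRtfConverter:
    """Hypothetical converter used only to exercise the helper."""

    def accepts(self, file_stream, stream_info, **kwargs):
        return (stream_info.extension or "").lower() == ".rtf"


info = SimpleNamespace(extension=".RTF", mimetype=None)
print(check_accepts(ToyRtfConverter(), b"{\\rtf1 Hi}", info))  # True
```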

CLI Unrecognized Arguments

As noted in Issue #1897, the CLI may not document all plugin-specific arguments. If your plugin adds CLI options, document them separately in your plugin's documentation.

Source: https://github.com/microsoft/markitdown / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 30 structured pitfall item(s), including 4 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

  • Severity: high
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | cevd_f70b2e3ea5ed47418a4aeb9ef27230f9 | https://github.com/microsoft/markitdown/issues/1685

2. Runtime risk: Runtime risk requires verification

  • Severity: high
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | cevd_252ef0d45ac040688ffa066bc1b64ba0 | https://github.com/microsoft/markitdown/issues/1897

3. Maintenance risk: Maintenance risk requires verification

  • Severity: high
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | cevd_6e08b71ee29f46a98e6825a5d5b11e6e | https://github.com/microsoft/markitdown/issues/1979

4. Maintenance risk: Maintenance risk requires verification

  • Severity: high
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | cevd_439f22f47a524773808819148caadca5 | https://github.com/microsoft/markitdown/issues/1982

5. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
  • User impact: Developers may fail before the first successful local run: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Office Open XML: Invalid Files Return Success with Error Message Instead of Exception. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | fmev_087a8a7b6538b2ce2b065ade73c555af | https://github.com/microsoft/markitdown/issues/1408

6. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: Support for .doc extensions
  • User impact: Developers may fail before the first successful local run: Support for .doc extensions
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Support for .doc extensions. Context: Observed when using Windows, Linux
  • Evidence: failure_mode_cluster:github_issue | fmev_d5a467d012987779306cb5c50725275b | https://github.com/microsoft/markitdown/issues/23

7. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
  • User impact: Developers may fail before the first successful local run: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: RuntimeWarning from pydub: "Couldn't find ffmpeg or avconv" on Linux. Context: Observed when using Python, Windows, Linux
  • Evidence: failure_mode_cluster:github_issue | fmev_1f9167a15a1eec72c8f79514f1b70b76 | https://github.com/microsoft/markitdown/issues/1685

8. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: v0.1.0
  • User impact: Upgrade or migration may change expected behavior: v0.1.0
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.0. Context: Observed when using Python
  • Evidence: failure_mode_cluster:github_release | fmev_1d5ae6ee21225356f45c36c20024dccd | https://github.com/microsoft/markitdown/releases/tag/v0.1.0

9. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | cevd_734e117518a3496eb3779e5f22b600b5 | https://github.com/microsoft/markitdown/issues/1408

10. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | cevd_77597bea6262485b9609d8fc5f50a69a | https://github.com/microsoft/markitdown/issues/1894

11. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project: Enhancement: Add MCP server support for document processing
  • User impact: Developers may misconfigure credentials, environment, or host setup: Enhancement: Add MCP server support for document processing
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Enhancement: Add MCP server support for document processing. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | fmev_969d5f508051e086435b78736eae3e88 | https://github.com/microsoft/markitdown/issues/2004

12. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project: v0.1.2
  • User impact: Upgrade or migration may change expected behavior: v0.1.2
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.1.2. Context: Observed when using Python
  • Evidence: failure_mode_cluster:github_release | fmev_076605feea6e0b4830282709121d3c90 | https://github.com/microsoft/markitdown/releases/tag/v0.1.2

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (the number of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using markitdown with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence