Software Development & Delivery · Public

olmocr

Toolkit for linearizing PDFs for LLM datasets/training

Best fit: Users who want source-backed project understanding before installing it.

Check whether this project matches your task before installing it.

What it can do: skill, recipe, host_instruction, eval, preflight

Review the portable capability path.

Before continuing: Verify in a sandbox

Do not treat a preview pack as a proven local install.

GitHub snapshot: 17k stars

1.4k forks · 16 contributors

Doramagic.ai · Last verification date: 2026-07-29 · Verification method: source evidence, semantic profile, public page gate, and static build acceptance.

Publication status · 2026-07-29

What is olmocr?

Toolkit for linearizing PDFs for LLM datasets/training
Best fit: Users who want source-backed project understanding before installing it.
Not for: Users who want to skip sandbox verification or who cannot accept configuration, permission, or maintenance overhead.
Capability added to an AI workflow: skill, recipe, host_instruction, eval, preflight
First safe verification step: Verify the smallest path in an isolated environment and keep a rollback path.
Verification state: source, Quick Start, and sandbox install checks are recorded as passed.
Top risk: May increase setup, validation, or first-run risk for the user.
Evidence base: https://github.com/allenai/olmocr, https://github.com/allenai/olmocr#readme, Human Manual, Pitfall Log

Quick decision

Use this section to decide whether the project is worth a deeper read.

Best for: Users who want source-backed project understanding before installing it.

Match the project to your task before installing it.

Capability: skill, recipe, host_instruction, eval, preflight

Toolkit for linearizing PDFs for LLM datasets/training

Repository: allenai/olmocr

17k stars · 1.4k forks

What it can do

Translate the upstream project into concrete capabilities the user can judge before installing.

Installation and Platform Support

Related topics: Pipeline and Inference Modes

Source: https://github.com/allenai/olmocr / Human Manual

Pipeline and Inference Modes

Related topics: Installation and Platform Support, Benchmark Suite and OCR Evaluation

Source: https://github.com/allenai/olmocr / Human Manual

Benchmark Suite and OCR Evaluation

Related topics: Pipeline and Inference Modes, Model Training, Filtering, and Synthetic Data

Source: https://github.com/allenai/olmocr / Human Manual

Model Training, Filtering, and Synthetic Data

Related topics: Pipeline and Inference Modes, Benchmark Suite and OCR Evaluation

Source: https://github.com/allenai/olmocr / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Source: Doramagic discovery, validation, and Project Pack records

Sources: https://github.com/allenai/olmocr, Human Manual, Project Pack evidence, and downstream validation signals.

Community Discussion Evidence

Project-level external discussion stays visible on the detail page, not only inside the manual.

Stars: 17k

Forks: 1.4k

Contributors: 16

License: unknown

12 source-linked items

Review these external discussions before using olmocr with real data or production workflows. They are review inputs, not standalone proof that the project is production-ready.

01 · Fail to parse b4c3c4ac3d6f7b52a993cec7ca8b3ad43cecabad_page_3.pdf · github / github_issue
02 · olmocr.bench scoring: `partial_ratio` falsely matches when candidate is · github / github_issue
03 · Model allenai/olmOCR-2-7B-1025 on DeepInfra will be deprecated on 2026-0 · github / github_issue
04 · Writing markdown error : 'gbk' codec can't encode character '\u1eca' in · github / github_issue
05 · configurable timeout for HTTP client in server method · github / github_issue
06 · numpy is missing from [bench] dependencies · github / github_issue
07 · [bug] badly formed help string · github / github_issue
08 · v0.4.27 · github / github_release
09 · v0.4.25 · github / github_release
10 · v0.4.24 · github / github_release
11 · v0.4.21 · github / github_release
12 · v0.4.20 · github / github_release
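
Item 04 above reports a 'gbk' codec error while writing markdown. The sketch below is not olmocr's code and says nothing about how the project handles the issue; it only illustrates the general failure mode, assuming the output is written on a system whose default text encoding is GBK. The file name `page.md` and the sample string are hypothetical.

```python
import tempfile
from pathlib import Path

# Sample text containing the character named in the issue title.
text = "Sample with \u1eca"

# GBK has no mapping for this character, so encoding with it raises.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Passing encoding="utf-8" explicitly avoids depending on the locale default.
out_path = Path(tempfile.mkdtemp()) / "page.md"  # hypothetical output file
out_path.write_text(text, encoding="utf-8")
round_trip = out_path.read_text(encoding="utf-8")

print(gbk_ok, round_trip == text)
```

If you hit a similar error in your own scripts around olmocr output, check which encoding your file writes use before assuming the upstream project is at fault.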

How to start

Only source-backed commands are shown here. Verify them in an isolated environment first.

Try the prompt first

Test the workflow without installing the upstream project.

preview

Read the Human Manual

Understand inputs, outputs, limits, and failure modes.

manual

Take context to your AI host

Use the compiled assets in your preferred AI environment.

context

Run sandbox verification

Confirm install commands and rollback before using a primary environment.

verify

pip install olmocr

Official start command · https://github.com/allenai/olmocr#readme · verified: yes
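
One way to follow the sandbox advice above is to install into a throwaway virtual environment. The sketch below is a minimal example, assuming Python's standard `venv` module is acceptable isolation for your case; the directory name `olmocr-sandbox` and the helper names are hypothetical. It prints the commands by default and only executes them when `dry_run=False`.

```python
import os
import subprocess
import sys
from pathlib import Path


def sandbox_commands(env_dir: Path, package: str = "olmocr") -> list[list[str]]:
    """Return the commands that create a throwaway venv and install into it."""
    # The stdlib venv module uses Scripts/ on Windows and bin/ elsewhere.
    bin_dir = env_dir / ("Scripts" if os.name == "nt" else "bin")
    env_python = bin_dir / ("python.exe" if os.name == "nt" else "python")
    return [
        [sys.executable, "-m", "venv", str(env_dir)],
        [str(env_python), "-m", "pip", "install", package],
    ]


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: shows what would be executed without touching the system.
run(sandbox_commands(Path("olmocr-sandbox")))
```

Rollback in this setup is deleting the environment directory, which leaves the primary Python installation untouched. This covers only the Python package; any system dependencies listed in the manual are outside what a virtual environment isolates.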

Human Manual

8+ sections · Human Manual

olmocr Manual

olmOCR is a vision-language-model-based OCR pipeline that can be run either against a local NVIDIA GPU or against any remote OpenAI-API-compatible inference server. Installation is split i...

https://github.com/allenai/olmocr Project Manual
Table of Contents
Installation and Platform Support
Related Pages
Overview
System Dependencies
Python Installation
Platform Support and Known Limitations

AI Context Pack and portable assets

After deciding to continue, take the project context into your own AI host.

Complete pack plus user-owned assets

These files are planning and verification assets for Claude Code, Codex, Gemini, Cursor, ChatGPT, and other AI hosts.

Bundle: Complete Project Pack
Asset: AI Context Pack
Asset: Boundary & Risk Card
Asset: Human Manual
Asset: Pitfall Log
Asset: Prompt Preview
Asset: Quick Start
Evidence: REPO_INSPECTION.json

Preflight checks

Treat this page as a planning asset, not proof that your local environment is ready.

The manual is generated from source-linked project files and Doramagic validation signals.
Community evidence warnings stay visible instead of being converted into marketing claims.
This English page is indexable because the locale quality gate passed and explicit English index approval is enabled.
Use the upstream repository as the final authority for installation commands, license, and version-specific behavior.