# https://github.com/allenai/olmocr Project Manual

Generated at: 2026-06-13 23:23:00 UTC

## Table of Contents

- [Installation and Platform Support](#page-1)
- [Pipeline and Inference Modes](#page-2)
- [Benchmark Suite and OCR Evaluation](#page-3)
- [Model Training, Filtering, and Synthetic Data](#page-4)

<a id='page-1'></a>

## Installation and Platform Support

### Related Pages

Related topics: [Pipeline and Inference Modes](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/allenai/olmocr/blob/main/README.md)
- [pyproject.toml](https://github.com/allenai/olmocr/blob/main/pyproject.toml)
- [Dockerfile](https://github.com/allenai/olmocr/blob/main/Dockerfile)
- [Dockerfile.with-model](https://github.com/allenai/olmocr/blob/main/Dockerfile.with-model)
- [docs/source/installation.md](https://github.com/allenai/olmocr/blob/main/docs/source/installation.md)
- [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md)
- [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md)
</details>

# Installation and Platform Support

## Overview

olmOCR is a vision-language-model-based OCR pipeline that can be run either against a local NVIDIA GPU or against any remote OpenAI-API-compatible inference server. Installation is split into two tiers — a lightweight CPU-only install for remote-server use, and a heavier install that pulls in PyTorch, vLLM, and other GPU dependencies for local inference. The project is officially tuned for Linux + NVIDIA GPUs; community discussions show that macOS, ROCm, and Windows paths exist but are not first-class supported. Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)

## System Dependencies

olmOCR renders PDF pages to images before sending them to the VLM, which requires a working `poppler-utils` install plus several TrueType font packages so that rendered glyphs look correct. The README specifies the Ubuntu/Debian install line:

```bash
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts \
  fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```

Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)

On non-Ubuntu systems the user is expected to translate this list to the equivalent platform package manager. Skipping the font packages is a common cause of degraded OCR output on documents that use Caladea, Carlito, MS Core Fonts, or other non-default faces.

## Python Installation

The project explicitly recommends installing into a fresh Conda environment because the local-GPU dependency set (PyTorch, flash-attn, vLLM) is fragile in pre-existing environments. Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)

```bash
conda create -n olmocr python=3.11
conda activate olmocr
```

Python 3.11 is the version called out in the README. After activation, the user picks one of two `pip install` options:

| Option | Command | When to use |
|---|---|---|
| Remote inference only | `pip install olmocr` | You have a vLLM / OpenAI-API server running elsewhere. Avoids the ~2 GB+ of GPU dependencies. |
| Local GPU inference | `pip install olmocr[gpu]` (or the full `pip install .` per the README) | You want olmOCR to spawn a local vLLM instance. Requires a recent NVIDIA GPU. |

Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)

The optional extras groups are also relevant for downstream workflows:

- `[train]` — extra packages required by the RLVR training pipeline documented in `olmocr/train/README.md`. That README additionally pins `transformers==4.52.4` and `flash-attn>=2.8.0.post2 --no-build-isolation`. Source: [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md)
- `[bench]` — used to run `python -m olmocr.bench.benchmark`. Community issue #452 reports that `numpy` was missing from this group and had to be installed manually, so users hitting `ModuleNotFoundError: numpy` on first run should `pip install numpy` explicitly. Source: [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md)

## Platform Support and Known Limitations

```mermaid
flowchart LR
    A[Choose install path] --> B{Remote vLLM<br/>server available?}
    B -- Yes --> C[pip install olmocr<br/>Linux/macOS/Windows]
    B -- No --> D{NVIDIA GPU<br/>on Linux?}
    D -- Yes --> E[pip install olmocr gpu extras<br/>poppler-utils + fonts]
    D -- No --> F[Community paths:<br/>macOS/ROCm/CPU]
```

The README states that local inference is tested on a recent NVIDIA GPU. Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)

Platforms beyond this are community-driven:

- **macOS / Apple Silicon.** Issue #33 has 28 comments requesting official support. As of release v0.4.27 there is no first-class macOS install path; the recommended workaround is the remote-server install against a vLLM endpoint running on Linux. Source: community context, issue #33.
- **AMD ROCm.** Issue #111 documents users attempting a 7900 XTX / ROCm install hitting OOM in vLLM and empty responses from a GGUF/ollama fallback. Source: community context, issue #111.
- **Windows.** The CLI works under Windows 11 + Conda + Python 3.11, but issue #459 shows that the default Windows console codepage is `gbk` and corrupts non-ASCII output. The accepted workaround in that thread is to set `PYTHONIOENCODING=utf-8` and run `chcp 65001` from PowerShell before launching olmOCR. Source: community context, issue #459.

The training stack documented in `olmocr/train/README.md` is explicitly specialized to an 8×H100 Beaker cluster and is not expected to work elsewhere without modification. Source: [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md)

## Docker and Common Installation Pitfalls

For users who do not want to manage the system dependencies by hand, a `Dockerfile` and a `Dockerfile.with-model` variant are shipped at the repository root. The `with-model` image bundles the `allenai/olmOCR-2-7B-1025-FP8` weights so that the container can run local inference out of the box. Community issue #161 also provides a community-maintained CUDA 12.1 / Ubuntu 22.04 Dockerfile for users who want a known-good base. Source: [Dockerfile](https://github.com/allenai/olmocr/blob/main/Dockerfile), [Dockerfile.with-model](https://github.com/allenai/olmocr/blob/main/Dockerfile.with-model); community context, issue #161.

The most common first-run failures, drawn from the issue tracker and the in-repo docs, are:

1. **Missing `poppler-utils` or fonts** — pages render as boxes or with missing glyphs; install the apt packages listed above.
2. **Wrong Python environment** — PyTorch or flash-attn wheels installed into a non-3.11 env cause opaque import errors; recreate the conda env.
3. **`numpy` / missing `[bench]` extras** — covered in issue #452.
4. **Malformed CLI help string** — issue #451 documents `argparse` raising on `--help` under Python 3.14, so use Python 3.11 for now. Source: community context, issue #451.
5. **HTTP client timeout in server runner** — `olmocr/bench/runners/run_server.py` defaults to 300 s, which is too short for slow endpoints; issue #455 tracks a request to make this configurable. Source: community context, issue #455.

## See Also

- [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md) — Training-specific environment setup, including dataset preparation.
- [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md) — Benchmark harness, including the documented document categories.
- [docs/source/installation.md](https://github.com/allenai/olmocr/blob/main/docs/source/installation.md) — Sphinx-rendered installation guide (mirrors the README).
- GitHub issues #33, #111, #161, #451, #452, #455, #459 — Platform and install-related discussions referenced above.

---

<a id='page-2'></a>

## Pipeline and Inference Modes

### Related Pages

Related topics: [Installation and Platform Support](#page-1), [Benchmark Suite and OCR Evaluation](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [olmocr/pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- [olmocr/work_queue.py](https://github.com/allenai/olmocr/blob/main/olmocr/work_queue.py)
- [olmocr/datatypes.py](https://github.com/allenai/olmocr/blob/main/olmocr/datatypes.py)
- [olmocr/s3_utils.py](https://github.com/allenai/olmocr/blob/main/olmocr/s3_utils.py)
- [olmocr/prompts/prompts.py](https://github.com/allenai/olmocr/blob/main/olmocr/prompts/prompts.py)
- [olmocr/image_utils.py](https://github.com/allenai/olmocr/blob/main/olmocr/image_utils.py)
- [README.md](https://github.com/allenai/olmocr/blob/main/README.md)
</details>

# Pipeline and Inference Modes

The olmOCR pipeline (`olmocr/pipeline.py`) is the central batch inference system that converts millions of PDF pages into structured Markdown using a fine-tuned vision-language model. It is invoked either as `python -m olmocr.pipeline <workspace> ...` or simply `olmocr <workspace> ...`, and exposes a single CLI surface that fans out into several execution and inference backends. The pipeline's responsibility spans PDF ingestion, page-group scheduling, model dispatch (local or remote), retry handling, and writing Dolma-compatible output plus optional Markdown files [Source: [olmocr/pipeline.py:1-50]()] [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

## Pipeline Responsibilities and Workspace Layout

A pipeline run is anchored to a single `workspace` argument — a local folder or an `s3://bucket/prefix/` path. The workspace stores the work queue, partial results, and the final Dolma-format files (`results/*.jsonl.gz`) preserving the input PDF folder structure. The pipeline is described as a "Manager for running millions of PDFs through a batch inference pipeline" in its top-level help text [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

Key behaviors of the workspace:

- Acts as a work queue when running on S3, allowing multiple worker nodes to coordinate via shared prefixes [Source: [olmocr/work_queue.py:1-80]()] [Source: [olmocr/s3_utils.py:1-120]()] .
- Local workspaces behave identically, but contention is serialized at the filesystem level.
- Pages are grouped (`--pages_per_group`, default small batch) and each group is dispatched independently. Failed groups are retried (`--max_page_retries`) and a configurable error budget (`--max_page_error_rate`) terminates the run when exceeded [Source: [olmocr/pipeline.py:100-180]()] .
- When `--markdown` is passed, an additional `markdown/` folder is populated with human-readable outputs alongside the Dolma JSONL stream [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

## Local vLLM Inference Mode

In the default mode, the pipeline spawns a local vLLM server and runs inference in-process. This requires the GPU-enabled install (`pip install olmocr[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/`) and exposes a number of vLLM passthrough arguments [Source: [olmocr/pipeline.py:200-260]()] [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

Notable local-mode flags include:

| Flag | Purpose |
|------|---------|
| `--model` | Path or HF repo id (default `allenai/olmOCR-7B-0725-FP8`) |
| `--gpu-memory-utilization` | Fraction of VRAM allocated to KV-cache (forwarded to `vllm serve`) |
| `--max_model_len` | Upper bound on KV-cache tokens; lower it if vLLM fails to start |
| `--tensor-parallel-size` / `-tp` | Tensor parallel degree for vLLM |
| `--data-parallel-size` / `-dp` | Data parallel degree |
| `--port` | Port for the local vLLM server |
| `--apply_filter` | Runs an additional quality filter pass on the parsed output |
| `--guided_decoding` | Enables guided decoding for YAML-typed model outputs |

The pipeline also controls how PDF pages are rendered before being sent to the model: `--target_longest_image_dim` controls the rendered image size and `--target_anchor_text_len` caps the amount of anchor text used (legacy models only) [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

## Remote Inference Server Mode (OpenAI-Compatible)

A second mode targets external vLLM servers, DeepInfra, or any OpenAI-compatible endpoint. In this mode the pipeline skips spawning a local vLLM instance and simply issues chat-completions requests over HTTP. This works with the lightweight `pip install olmocr` install (no GPU dependencies) [Source: [olmocr/pipeline.py:300-360]()] [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

Key arguments for remote/server mode:

- `--server`: URL of an external vLLM or compatible server, e.g. `http://remote-host:8000/v1`. When set, the pipeline does not start a local server.
- `--api_key`: Bearer token passed via `Authorization` header.
- `--max_concurrent_requests`: Caps in-flight HTTP requests to the provider.
- `--workers`: Limits page groups processed simultaneously; setting to `1` ensures one group finishes before the next starts — useful for providers with low concurrency caps.
- `--pages_per_group`: Smaller groups help on providers with strict concurrent-request ceilings.
- `--model`: Model identifier (provider-specific; e.g. `allenai/olmOCR-2-7B-1025` on DeepInfra).

Example invocation: `olmocr ./localworkspace --server http://remote-server:8000/v1 --model allenai/olmOCR-2-7B-1025-FP8 --markdown --pdfs *.pdf` [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

## Multi-Node / Beaker Cluster Mode

For very large jobs, the pipeline can coordinate multiple workers against an S3-backed workspace. The first worker seeds the queue: `olmocr s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf`. Subsequent workers simply point at the same prefix and pull from the shared queue [Source: [olmocr/work_queue.py:80-180]()] [Source: [olmocr/s3_utils.py:120-220]()] [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

Alternatively, the pipeline can be submitted directly to Beaker with `--beaker`, `--beaker_workspace`, `--beaker_cluster`, `--beaker_gpus`, and `--beaker_priority`. This path is documented as the standard way to run multi-node jobs at AI2 [Source: [README.md](https://github.com/allenai/olmocr/blob/main/README.md)].

```mermaid
flowchart LR
    A[User / CLI] --> B[olmocr.pipeline]
    B --> C{Inference Mode}
    C -- local GPU --> D[Spawn vLLM]
    C -- --server --> E[Remote vLLM / OpenAI API]
    C -- --beaker / S3 --> F[Shared S3 Workspace]
    D --> G[Page Groups -> Model]
    E --> G
    F --> G
    G --> H[Dolma JSONL + Markdown]
```

## Common Failure Modes and Community-Reported Issues

Several issues trace directly back to pipeline/inference behavior:

- **Empty Markdown output on specific PDFs** (issue #463): the pipeline reports success and writes the markdown file but the body is empty, indicating model output failed validation for that page rather than a crash [Source: [issue #463](https://github.com/allenai/olmocr/issues/463)].
- **Encoding errors when writing Markdown** (issue #459): Windows + GBK locale cannot encode certain Unicode codepoints emitted by the model. Setting `PYTHONIOENCODING=utf-8` or `chcp 65001` before launching the pipeline resolves it [Source: [issue #459](https://github.com/allenai/olmocr/issues/459)].
- **Hard-coded 300s timeout in server runner** (issue #455): the bench HTTP client in `olmocr/bench/runners/run_server.py` uses a fixed timeout, which is insufficient for some larger page groups; community has requested making it configurable [Source: [issue #455](https://github.com/allenai/olmocr/issues/455)].
- **DeepInfra model deprecation** (issue #460): `allenai/olmOCR-2-7B-1025` is scheduled for deprecation on DeepInfra on 2026-05-07; users running the remote mode should plan for a self-hosted vLLM endpoint [Source: [issue #460](https://github.com/allenai/olmocr/issues/460)].

A known release-line note: v0.4.27 fixed a "queue bug in long queues" relevant to S3-coordinated multi-node runs, so users on older versions should upgrade before scaling out [Source: [release v0.4.27](https://github.com/allenai/olmocr/releases/tag/v0.4.27)].

## See Also

- [Training Guide](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md) — for fine-tuning new olmOCR models.
- [olmOCR-Bench](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md) — for evaluating pipeline output.
- [README.md](https://github.com/allenai/olmocr/blob/main/README.md) — top-level usage documentation.

---

<a id='page-3'></a>

## Benchmark Suite and OCR Evaluation

### Related Pages

Related topics: [Pipeline and Inference Modes](#page-2), [Model Training, Filtering, and Synthetic Data](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md)
- [olmocr/bench/benchmark.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/benchmark.py)
- [olmocr/bench/tests.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/tests.py)
- [olmocr/bench/prompts.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/prompts.py)
- [olmocr/bench/runners/run_server.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_server.py)
- [olmocr/bench/runners/run_olmocr_pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_olmocr_pipeline.py)
- [olmocr/bench/scripts/workspace_to_bench.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/scripts/workspace_to_bench.py)
- [README.md](https://github.com/allenai/olmocr/blob/main/README.md)
- [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md)
</details>

# Benchmark Suite and OCR Evaluation

## Purpose and Scope

`olmOCR-Bench` is the project's machine-checkable evaluation harness for document-level OCR systems. It does **not** rely on edit distance or fuzzy text similarity; instead it asserts discrete, hand-authored "facts" about each PDF page and grades whether an OCR pipeline recovered them. The design intent, stated in the bench README, is that every test case should be "very simple, unambiguous, and machine-checkable, similar to a unit test" so that alternate-but-correct transcriptions are not penalized.

Source: [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md)

The benchmark operates on **single-page PDFs** so that digital metadata is preserved, and accepts Markdown or plain-text output from any OCR tool. This deliberately decouples evaluation from any specific engine, allowing third-party OCR pipelines (Marker, MinerU, Mistral OCR API, DeepSeek-OCR, etc.) to be ranked on the same scoreboard published in the project root [README.md](https://github.com/allenai/olmocr/blob/main/README.md).

## Document Types and Test Categories

The suite defines seven categories that historically challenge OCR systems. Each category is sourced via a different acquisition strategy documented in [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md):

| Category | Code | Source strategy |
| --- | --- | --- |
| arXiv Math | AR | Single-TeX-source arXiv math papers, validated with KaTeX |
| Old Scans Math | OSM | Public-domain math textbooks from the Internet Archive, manually annotated |
| Tables | TA | Internally crawled PDFs filtered for tables with Gemini-Flash-2.0 |
| Old Scans | OS | Library of Congress letters with existing transcriptions |
| Headers & Footers | HF | DocLayout-YOLO "abandon" regions; tests expect text to be **absent** |
| Multi Column | MC | Claude-Sonnet-3.7 HTML renders of multi-article pages |
| Long Tiny Text | LTT | Dense small-print archive pages (dictionaries, references) |

Tests are stored as JSONL files in the `bench_data/` directory of the dataset (`allenai/olmOCR-bench` on Hugging Face). The header/footer category is unique in that it asserts a string must *not* appear in the linearized output, since correctly excluding recurring page furniture is a strong signal of OCR quality.

## Benchmark Principles and Scoring

The README enumerates rules that shape how candidate outputs are scored. These are worth understanding before debugging a low score:

- Output is treated as plain-text Unicode in a natural reading order; Markdown syntax is **ignored** when matching (so `**enlightenment**` still matches `enlightenment`).
- Matching is **not position-sensitive**, except for header/footer tests, which constrain search to the first or last N characters.
- Tables may be Markdown or HTML `<table>`; math must be delimited by `$`, `$$`, `\(`, or `\[` and must render with KaTeX.
- All Unicode is normalized to NFC; hyphens, single quotes, and double quotes are folded to ASCII variants before comparison.
- Each test belongs to one of the seven categories plus a "Base" category used for sanity checks.

The actual scoring logic lives in `TextPresenceTest.run` and related classes inside `olmocr/bench/tests.py`. As reported in [issue #461](https://github.com/allenai/olmocr/issues/461), `partial_ratio` from RapidFuzz can produce false positives when the candidate `md_content` is much shorter than the queried text (e.g., a single `\n`), because the partial ratio then compares against a near-empty string. Patches to the test logic must guard against this degenerate case.

## Running the Benchmark

A typical run follows the recipe documented in the bench README:

```bash
# 1. Install GPU extras for local inference
pip install olmocr[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

# 2. Convert benchmark PDFs into model output using a runner
python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data

# 3. Score the outputs
python -m olmocr.bench.benchmark --dir ./olmOCR-bench/bench_data
```

Two runners are shipped under `olmocr/bench/runners/`:

- `run_olmocr_pipeline.py` invokes the local `olmocr.pipeline` to render each PDF into Markdown inside a workspace directory. The helper `olmocr/bench/scripts/workspace_to_bench.py` then copies those outputs into the directory layout the grader expects.
- `run_server.py` posts the PDFs to an OpenAI-compatible endpoint (local vLLM, DeepInfra, or any compatible API). [Issue #455](https://github.com/allenai/olmocr/issues/455) tracks a feature request to expose a configurable HTTP timeout, since the current default of 300 seconds is insufficient for slow backends or large pages.

The CLI entry point in `olmocr/bench/benchmark.py` has surfaced at least two regressions: a malformed `argparse` help string ([issue #451](https://github.com/allenai/olmocr/issues/451)) and a missing `numpy` import in the `[bench]` extras ([issue #452](https://github.com/allenai/olmocr/issues/452)). Install the dependency with `pip install numpy` if the benchmark entry point raises `ModuleNotFoundError` immediately on launch.

For annotation review, the README points to an internal tool:

```bash
python -m olmocr.bench.review_app --port 5000 --debug \
    ./olmOCR-bench/bench_data/multi_column.jsonl --force
```

## Common Failure Modes Seen by Users

The community has filed several reports that intersect with the benchmark workflow:

- **Empty markdown output for a specific page.** Running `olmocr output/ --pdfs bench_data/pdfs/headers_footers/<hash>_page_3.pdf --model /root/olmOCR-2-7B-1025-FP8/ --markdown` produced an empty Markdown file ([issue #463](https://github.com/allenai/olmocr/issues/463)). Because header/footer tests assert *absence*, an empty candidate actually scores well on HF tests but fails Base and content-bearing categories, which is a useful diagnostic signal.
- **GBK encoding crash on Windows.** [Issue #459](https://github.com/allenai/olmocr/issues/459) describes a `'gbk' codec can't encode character '\u1eca'` error during Markdown writing. Setting `PYTHONIOENCODING=utf-8` (or `chcp 65001` in PowerShell) before invoking the pipeline resolves the failure when running under the default Windows codepage.
- **DeepInfra deprecation.** [Issue #460](https://github.com/allenai/olmocr/issues/460) notes that the hosted `allenai/olmOCR-2-7B-1025` snapshot will be retired on 2026-05-07. Users scoring against the server runner should pin to a self-hosted model or migrate to the new release.
- **Scoring inversion for near-empty candidates.** As noted above, [issue #461](https://github.com/allenai/olmocr/issues/461) reports that `TextPresenceTest.run` in [olmocr/bench/tests.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/tests.py) can invert its decision when the candidate collapses to whitespace. When investigating "all tests pass" results for a broken page, check the candidate file directly.

## Integration with Training

The same JSONL schema is reused inside the GRPO trainer. Per [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md), `mine_html_templates.py` synthesizes additional benchmark-style cases (`olmocr-synthmix-1025`), and the trainer script accepts a `--reward_bench` weight so that bench-style presence/absence tests participate directly in the reinforcement-learning reward signal. This is the mechanism behind the v0.4.0 release's ~4-point jump on the published scoreboard, where the strongest olmOCR release (v0.4.0) reaches an Overall score of 82.4±1.1.

## See Also

- olmOCR Pipeline and Inference
- Training olmOCR Models (GRPO + Synthetic Data)
- Dolma Output Viewer and Workspace Format
- Issue #451, #452, #455, #459, #460, #461, #463 on GitHub for known bench regressions and platform quirks.

---

<a id='page-4'></a>

## Model Training, Filtering, and Synthetic Data

### Related Pages

Related topics: [Pipeline and Inference Modes](#page-2), [Benchmark Suite and OCR Evaluation](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md)
- [olmocr/train/train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- [olmocr/train/grpo_train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/grpo_train.py)
- [olmocr/train/config.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/config.py)
- [olmocr/train/dataloader.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/dataloader.py)
- [olmocr/data/buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- [olmocr/data/prepare_olmocrmix.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/prepare_olmocrmix.py)
- [olmocr/data/filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/filter.py)
- [olmocr/synth/mine_html_templates.py](https://github.com/allenai/olmocr/blob/main/olmocr/synth/mine_html_templates.py)
- [olmocr/pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md)
- [README.md](https://github.com/allenai/olmocr/blob/main/README.md)
</details>

# Model Training, Filtering, and Synthetic Data

## Overview

olmOCR's model training, filtering, and synthetic data pipeline is the machinery that produces fine-tuned OCR models such as `allenai/olmOCR-2-7B-1025-FP8` (referenced in [README.md](https://github.com/allenai/olmocr/blob/main/README.md)). The pipeline combines three concerns:

1. **Training** — supervised fine-tuning and GRPO-based reinforcement learning on PDF/markdown pairs, orchestrated in [olmocr/train/](https://github.com/allenai/olmocr/tree/main/olmocr/train).
2. **Filtering and data preparation** — converting raw PDFs and human/LLM-generated markdown into the single-page layout that the trainer consumes, handled in [olmocr/data/](https://github.com/allenai/olmocr/tree/main/olmocr/data).
3. **Synthetic data** — generating additional training examples from HTML templates to bolster coverage of hard document types, handled in [olmocr/synth/](https://github.com/allenai/olmocr/tree/main/olmocr/synth).

The README frames the broader workflow as: "Processing millions of PDFs through a finetuned model using VLLM" via [olmocr/pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py), and a "Filtering language identification" helper, both of which feed back into the next training round.

```mermaid
flowchart LR
    A[Raw PDFs] --> B[olmocr/pipeline.py]
    B --> C[olmocr.data.filter]
    C --> D[prepare_workspace / prepare_olmocrmix]
    E[mine_html_templates.py] --> F[Synthetic Markdown]
    D --> G[olmocr-synthmix-1025]
    F --> G
    G --> H[train.py / grpo_train.py]
    H --> I[Fine-tuned olmOCR Model]
```

## Training Pipeline

The training stack lives under [olmocr/train/](https://github.com/allenai/olmocr/tree/main/olmocr/train) and exposes two entry points:

- `train.py` — supervised fine-tuning, described in [olmocr/train/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md).
- `grpo_train.py` — reinforcement learning with unit-test rewards (see [olmocr2 bibtex entry in README.md](https://github.com/allenai/olmocr/blob/main/README.md), titled "olmOCR 2: Unit Test Rewards for Document OCR").

### Environment and dependencies

Training extends the base install with the `[train]` extra plus pinned versions:

```bash
pip install .[train]
pip install transformers==4.52.4
pip install flash-attn>=2.8.0.post2 --no-build-isolation
```

Source: [olmocr/train/README.md:21-25](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md).

### Hardware layout

The training guide states the run is "performed on an 8xH100 GPU node. One GPU is dedicated to running VLLM, while the other 7 are used to run training." Because the harness splits one GPU off for vLLm-based generation, a single H100 node is the minimum viable target, which is why community members have asked for macOS and ROCm support (issues [#33](https://github.com/allenai/olmocr/issues/33) and [#111](https://github.com/allenai/olmocr/issues/111)).

### GRPO hyperparameters

The provided training script ([scripts/train/grpotrainer-beaker-multi-gpu-augusta.sh](https://github.com/allenai/olmocr/blob/main/scripts/train/grpotrainer-beaker-multi-gpu-augusta.sh), as quoted in the training README) configures the run with the following key parameters, surfaced as flags on `grpo_train.py`:

| Flag | Example value | Meaning |
|---|---|---|
| `--reward_bench` | `1.0` | Weight for olmOCR-bench unit-test reward |
| `--reward_front_matter` | `1.0` | Weight for YAML front-matter validity |
| `--reward_eos` | `1.0` | Weight for proper end-of-sequence behavior |
| `--beta` | `0.01` | KL penalty in GRPO objective |
| `--learning_rate` | `2e-6` | Optimizer learning rate |
| `--gradient_accumulation_steps` | `28` | Effective batch-size multiplier |
| `--num-gpus` | `8` | Total GPUs on the Beaker node |

Source: [olmocr/train/README.md:9-12](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md).

### Data loader and config

`dataloader.py` is responsible for reading single-page PDFs and their paired markdown annotations, while `config.py` centralizes the YAML-driven hyperparameters consumed by `train.py` and `grpo_train.py`. The training README explicitly requires "Each PDF needs to be a single page only!", which means long documents must be split before training, and the markdown front matter follows a defined schema:

```markdown
---
primary_language: en
is_rotation_valid: True
rotation_correction: 0
is_table: False
is_diagram: False
---
```

Source: [olmocr/train/README.md:33-46](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md). The fields `is_table` and `is_diagram` line up with the page-classification features the inference path uses, and `rotation_correction` corresponds to the deskew step in [olmocr/pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py).

## Data Preparation and Filtering

The data side has two recommended entry points documented in the training README:

1. **`prepare_olmocrmix.py`** — pulls a subset of [allenai/olmOCR-mix-1025](https://huggingface.co/datasets/allenai/olmOCR-mix-1025) and unpacks it into the single-page PDF / markdown layout the trainer expects. Example invocation:

   ```bash
   python -m olmocr.data.prepare_olmocrmix \
     --dataset-path allenai/olmOCR-mix-1025 \
     --destination ~/olmOCR-mix-1025-extracted \
     --subset 01_books --split eval
   ```

   Source: [olmocr/train/README.md:55-60](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md).

2. **`prepare_workspace`** — reuses the user's own olmOCR inference output. After running `python -m olmocr.pipeline ./localworkspace --pdfs /home/username/pdfs/*.pdf`, the workspace contains Dolma-formatted JSON; `prepare_workspace` reshapes that into the PDF/markdown layout, which is convenient for fine-tuning on a domain-specific corpus.

   Source: [olmocr/train/README.md:62-66](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md).

Filtering sits in front of this: the main [README.md](https://github.com/allenai/olmocr/blob/main/README.md) links to a "Filtering language identification" tool, exposed as [olmocr/data/filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/filter.py). This stage drops pages whose detected language does not match `primary_language` in the front matter and removes noisy rows before they reach the dataloader. `buildsilver.py` is the higher-level utility that turns raw PDFs into a "silver" training set by running an earlier checkpoint over them and keeping the outputs that pass the filters.

## Synthetic Data

The synthetic-data module in [olmocr/synth/mine_html_templates.py](https://github.com/allenai/olmocr/blob/main/olmocr/synth/mine_html_templates.py) generates additional training documents from HTML templates. The training README states that the resulting mix is "in the same format as other olmOCR-bench test cases," which means a synthetic dataset can be scored directly with [olmocr/bench/](https://github.com/allenai/olmocr/tree/main/olmocr/bench) and folded into `grpo_train.py` as `train_bench_data_folder` (e.g. `/data/jakep/grpo_data_mixes/olmocr-synthmix-1025-v2-rotate10p/bench_data`).

Source: [olmocr/train/README.md:6-8](https://github.com/allenai/olmocr/blob/main/olmocr/train/README.md).

Synthetic data is most useful for the document categories olmOCR historically struggles with — the same seven categories documented in [olmocr/bench/README.md](https://github.com/allenai/olmocr/blob/main/olmocr/bench/README.md) (ArXiv math, Old Scans math, Tables, Old Scans, Headers/Footers, Multi-Column, Long Tiny Text). When a category is under-represented in human-labeled data, mined templates can be converted into `(single_page_pdf, markdown)` pairs and pushed through the same filter + GRPO loop.

## Community Notes and Failure Modes

- **Platform support gap.** The training stack requires NVIDIA GPUs and CUDA-specific kernels such as `flash-attn`; this is the root cause of the long-running macOS (issue [#33](https://github.com/allenai/olmocr/issues/33)) and ROCm (issue [#111](https://github.com/allenai/olmocr/issues/111)) requests. The community has produced unofficial CUDA-based Dockerfiles (issue [#161](https://github.com/allenai/olmocr/issues/161)) and a Gradio-based test harness (issue [#75](https://github.com/allenai/olmocr/issues/75)) to lower the bar, but training itself is not yet cross-platform.
- **Encoding on Windows.** When running downstream pipeline stages on Windows, non-ASCII characters can fail with `'gbk' codec can't encode character` (issue [#459](https://github.com/allenai/olmocr/issues/459)). Setting `PYTHONIOENCODING=utf-8` and `chcp 65001` is the documented workaround.
- **External model hosting deprecation.** `allenai/olmOCR-2-7B-1025` hosted on DeepInfra is being deprecated on 2026-05-07 (issue [#460](https://github.com/allenai/olmocr/issues/460)); users who cannot migrate to local vLLM should re-export the model through a self-hosted OpenAI-compatible endpoint, which `pipeline.py` consumes via the `--server` flag.
- **Scoring side-effects during training.** Reward signals that include olmOCR-bench tests are sensitive to near-empty candidates — `TextPresenceTest.run` in [olmocr/bench/tests.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/tests.py) can falsely pass when a candidate is just `"\n"` (issue [#461](https://github.com/allenai/olmocr/issues/461)). Practitioners using `--reward_bench` should sanity-check their reward values during GRPO rollouts.
- **Unparsed PDFs surface as training noise.** Issue [#463](https://github.com/allenai/olmocr/issues/463) shows that some pages render to an empty markdown; these should be filtered out by the language/rotation filters in `filter.py` before being added to the next training mix, otherwise the GRPO reward may collapse.

## See Also

- [Pipeline and Inference](./Pipeline-and-Inference.md)
- [olmOCR-Bench Evaluation](./OlmOCR-Bench-Evaluation.md)
- [Data Formats and Workspace Layout](./Data-Formats-and-Workspace-Layout.md)

---

<!-- evidence_pipeline_checked: true -->
<!-- evidence_injected: true -->

---

## Pitfall Log

Project: allenai/olmocr

Summary: Found 13 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/461

## 2. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/455

## 3. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | github_repo:858798469 | https://github.com/allenai/olmocr

## 4. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/463

## 5. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/460

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:858798469 | https://github.com/allenai/olmocr

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | github_repo:858798469 | https://github.com/allenai/olmocr

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | github_repo:858798469 | https://github.com/allenai/olmocr

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/459

## 10. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/451

## 11. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/allenai/olmocr/issues/452

## 12. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:858798469 | https://github.com/allenai/olmocr

## 13. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:858798469 | https://github.com/allenai/olmocr

<!-- canonical_name: allenai/olmocr; human_manual_source: deepwiki_human_wiki -->
