opendataloader-pdf Manual

Doramagic Project Pack · Human Manual

opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

Project Overview and System Architecture

Related topics: Core Processing Pipeline and PDF Element Detection, Language SDKs, CLI, and Build/Operations

Purpose and Scope

OpenDataLoader PDF is a multi-language PDF extraction and conversion toolkit that turns PDF documents into structured outputs (JSON, Markdown, HTML, plain text, and tagged PDF) for downstream use in RAG pipelines, content-safety audits, and accessibility workflows. The repository is organized as a polyglot project: a Java core that does the heavy lifting, plus thin wrappers and servers in Python, Node.js/TypeScript, and an MCP (Model Context Protocol) server for AI agents.

The Java core exposes a public API via Config.java, which centralizes reading order, output format, hybrid-backend selection, image handling, page separators, and content-safety flags. The standalone tagging path is provided by AutoTagger.java, which returns an in-memory tagged PDDocument without intermediate files.

The Python package ships a hybrid backend server (Docling-based) and an MCP server, while the Node package mirrors the Java option surface as TypeScript types. Community discussions (issue #449) highlight a recurring ask to extend the same pipeline to DOCX, and issue #500 surfaces a request to make the hybrid backend's batch size configurable. Both requests point to users who expect consistent conversion quality across input formats and tuning options that hold up on large workloads.

System Architecture

The project is structured as a Java core that performs extraction, with optional Python-side acceleration and language-specific bindings on top.

flowchart LR
    subgraph Clients
        CLI["Java CLI<br/>(opendataloader-pdf-cli)"]
        PY["Python CLI / API<br/>(opendataloader-pdf)"]
        NODE["Node / TypeScript<br/>(opendataloader-pdf)"]
        MCP["MCP Server<br/>(opendataloader-pdf-mcp)"]
        RAG["RAG examples<br/>(examples/python/rag)"]
    end

    subgraph JavaCore["Java Core (opendataloader-pdf-core)"]
        API["OpenDataLoaderPDF API"]
        CFG["Config / FilterConfig"]
        AT["AutoTagger"]
        PROC["DocumentProcessors<br/>(Hybrid, AutoTagging, ...)"]
    end

    subgraph HybridBackend["Hybrid Backend (optional)"]
        HS["opendataloader-pdf-hybrid<br/>(FastAPI + Docling)"]
    end

    CLI --> API
    PY --> API
    NODE --> API
    MCP --> PY
    RAG --> PY
    API --> CFG
    API --> PROC
    AT --> API
    PROC -. HTTP .-> HS

Source: CLIOptions.java, hybrid_server.py, server.py.

Java Core Layer

The core layer lives under java/opendataloader-pdf-core and contains the API surface and processors:

Config.java — single configuration object covering output formats (json, text, html, markdown, tagged-pdf), reading order (off vs xycut), hybrid backend selection (off, docling-fast, hancom-ai), image output (off, embedded, external), page separators, strikethrough detection, and HTML-in-Markdown toggles.
FilterConfig.java — controls content-safety filters (hidden text, out-of-page, tiny text, hidden OCG) and sensitive-data sanitization (emails, phone numbers, IPs, credit cards, URLs) with regex-based SanitizationRules.
CLIOptions.java — defines CLI flags via Apache Commons CLI and a single OPTION_DEFINITIONS list that is the source of truth for both the command line and the JSON export of options.
AutoTagger.java — entry point for standalone PDF auto-tagging, returning a PDDocument and supporting saveTo(...).
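
The regex-based sanitization that FilterConfig is described as providing can be pictured as an ordered list of pattern rules applied to extracted text. The sketch below is in Python rather than Java and is only illustrative: the patterns, the RULES list, and the bracketed replacement labels are assumptions, not the project's actual SanitizationRules.

```python
import re

# Hypothetical, simplified rules: (label, pattern). The real SanitizationRules
# in FilterConfig.java may use different patterns and replacement text.
RULES = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("URL", re.compile(r"https?://[^\s)]+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def sanitize(text: str) -> str:
    """Replace every match with a bracketed label, applying rules in order."""
    for label, pattern in RULES:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(sanitize("Mail bob@example.com or visit https://example.com/x from 10.0.0.1"))
```

Rule order matters in a design like this: a broad pattern placed early can consume text that a later, narrower rule was meant to match.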

Wrapper and Server Layer

The Java core is consumed by language-specific wrappers that all share the same option vocabulary:

Python wrapper: cli_options_generated.py is generated from the Java option list, ensuring the Python CLI, the MCP server, and the RAG examples expose a uniform surface.
Hybrid server: hybrid_server.py is a FastAPI service that keeps a singleton Docling DocumentConverter and returns DoclingDocument JSON; the Java side then renders markdown/HTML. Failed pages are detected by union of error-message parsing and gap detection on the returned pages dict.
MCP server: server.py exposes a convert_pdf tool with parameters for format, page ranges, headers/footers, hybrid backend, OCR strategy, and image output, mapping them onto the underlying Python wrapper. Installation snippets for Claude Desktop, Claude Code, OpenAI Codex, Cursor, and Windsurf live in the MCP README.
Node/TypeScript: convert-options.generated.ts declares the same option set (hybrid, hybridMode, hybridUrl, imageOutput, tableMethod, readingOrder, separators, etc.) so JavaScript consumers get type-safe access to the same flags.

Processing Pipeline

At a high level, every entry point — CLI, Python API, MCP tool, RAG example — feeds a Config into the Java OpenDataLoaderPDF.processFile pipeline, which:

Loads the PDF and applies FilterConfig rules (hidden text, out-of-page, tiny text, hidden OCG, sensitive-data sanitization).
Detects structure (reading order via xycut or COS order) and, when a hybrid backend is selected (docling-fast or hancom-ai), forwards page batches to the Python FastAPI server; triage modes are auto (dynamic) or full (all pages to backend).
Optionally runs OCR / formula / picture-description enrichment on the hybrid server (configurable engine: easyocr, tesseract, tesserocr, rapidocr, ocrmac; configurable device: auto, cpu, cuda, mps, xpu).
Renders the selected output format and writes results to the output directory, with images embedded as Base64 or written externally to imageDir as png/jpeg.

RAG consumers, illustrated in examples/python/rag/README.md, receive pre-chunked records with text and metadata (type, page, bbox, source) ready for embedding into vector stores such as Chroma, FAISS, Pinecone, or Weaviate.

Configuration Model

| Layer | Class / File | Responsibility |
| --- | --- | --- |
| Output + layout | Config.java | Format, reading order, page separators, image mode, hybrid backend |
| Content safety | FilterConfig.java | Hidden/OCG/tiny filters, sensitive-data regex rules |
| CLI surface | CLIOptions.java | Apache Commons CLI mapping + JSON export |
| Hybrid runtime | hybrid_server.py | Docling singleton, OCR, formula/picture enrichment, failed-page detection |
| AI-agent tool | server.py | MCP `convert_pdf` tool wrapping the Python API |
| JS binding | convert-options.generated.ts | TypeScript types mirroring the Java options |

Community-Reported Concerns

DOCX support (issue #449): users want a single application that handles PDF and DOCX with the same fidelity; the current pipeline is PDF-centric, with the Java core using verapdf PDDocument and the hybrid server running Docling's PDF pipeline.
Reproducible Maven builds (issue #551): downstream packagers (e.g. NixOS) need deterministic dependency resolution, since the Java core declares all PDF extraction dependencies via Maven.
Configurable hybrid batch size (issue #500): BACKEND_CHUNK_SIZE is currently a hardcoded 50 in HybridDocumentProcessor; making it a config value would let operators trade memory for latency.
Pre-compiled JAR for PyPI sdist (issue #435): conda-forge recipes want a Java-free build path, so the Python wheel should ship a pre-built JAR rather than recompiling on every install.
Hybrid runtime evolution (release v2.4.7): full-page DLA renders are now persisted for evidence overlays, and per-node ai_score / pdfua_tag metadata is emitted in JSON, reflecting the project's direction toward richer, audit-ready output.

Core Processing Pipeline and PDF Element Detection

Related topics: Project Overview and System Architecture, Hybrid AI Mode, Output Generators, and JSON Schema

Purpose and Scope

The core processing pipeline is the Java heart of OpenDataLoader PDF. It ingests a PDF document, detects structural elements (headings, paragraphs, lists, tables, images, hidden text), establishes reading order, and emits structured outputs such as JSON, Markdown, HTML, text, annotated PDF, and tagged PDF. The pipeline is exposed to other layers (Python wrapper, Node CLI, MCP server) through a small set of public API entry points, while keeping the per-element detection logic in a dedicated processors package.

The public API is intentionally thin: OpenDataLoaderPDF#processFile performs extraction and output in a single call, while DocumentProcessor#extractContents returns an ExtractionResult that downstream generators (such as OutputWriter and AutoTagger) can consume without re-running extraction. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java:1-1. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java:1-1.

Pipeline Architecture

The pipeline follows a two-phase model: extraction (a single, potentially expensive pass over the PDF that detects elements and computes reading order) followed by output generation (cheap, format-specific serialisation). This split is what enables AutoTagger to reuse the same ExtractionResult for tagged-PDF output without re-parsing the source document. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java:1-1.
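
The extract-once, write-many split can be illustrated with a small Python sketch. Every name here (Extraction, extract, WRITERS, convert) is hypothetical and stands in for the Java classes only loosely; the point is the shape of the pattern, not the project's API.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Extraction:
    """Hypothetical stand-in for ExtractionResult: elements in reading order."""
    elements: List[dict] = field(default_factory=list)


def extract(pages: List[str]) -> Extraction:
    """Phase 1 (the expensive pass in a real pipeline): run once per document."""
    return Extraction(
        [{"type": "paragraph", "page": i + 1, "text": t} for i, t in enumerate(pages)]
    )


def to_json(result: Extraction) -> str:
    return json.dumps(result.elements)


def to_markdown(result: Extraction) -> str:
    return "\n\n".join(e["text"] for e in result.elements)


WRITERS: Dict[str, Callable[[Extraction], str]] = {
    "json": to_json,
    "markdown": to_markdown,
}


def convert(pages: List[str], formats: List[str]) -> Dict[str, str]:
    """Phase 2 (cheap): serialise the same result once per requested format."""
    result = extract(pages)  # extraction happens exactly once
    return {fmt: WRITERS[fmt](result) for fmt in formats}
```

Adding another output format in this arrangement means adding a writer function, with no change to the extraction pass.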

flowchart LR
    A[Input PDF] --> B[DocumentProcessor.extractContents]
    B --> C{Element Detection}
    C --> D[HeadingProcessor]
    C --> E[TableBorderProcessor / ClusterTableProcessor]
    C --> F[ListProcessor]
    C --> G[HiddenTextProcessor]
    C --> H[Structure-Tree Reader]
    D --> I["Reading Order (xycut)"]
    E --> I
    F --> I
    G --> I
    H --> I
    I --> J[ExtractionResult]
    J --> K[OutputWriter]
    J --> L[AutoTagger]
    K --> M[JSON / Markdown / HTML / Text]
    L --> N[Tagged / Annotated PDF]

When the hybrid backend is enabled, a parallel path is taken: pages are triaged by HybridDocumentProcessor and dispatched to an external server (such as docling-fast or hancom-ai) which returns a DoclingDocument JSON. That JSON is merged with locally detected content before output generation. The hybrid server itself is a FastAPI app written in Python and shipped as opendataloader-pdf-hybrid. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:1-1.

PDF Element Detection

Element detection lives in the processors package and is split into specialised classes that each handle a single structural concern:

HeadingProcessor — classifies text spans as headings based on font size, weight, and surrounding whitespace, producing section markers used by Markdown and JSON outputs.
TableBorderProcessor — the default table detector. It uses visible border lines (default table method) to infer cell boundaries and merges neighbouring text spans into rows and columns. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1.
ClusterTableProcessor — an alternative table detector (cluster table method) that combines border analysis with spatial clustering of text. It is intended for borderless tables that still exhibit a regular layout. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1.
ListProcessor — recognises bulleted and numbered lists, normalising their markers and indentation for Markdown and JSON output.
HiddenTextProcessor — flags text that is hidden through colour matching, tiny font size, off-page placement, or hidden OCG layers. Filter behaviour is configurable through --content-safety-off.
Structure-Tree Reader — when --use-struct-tree is set, the pipeline consults the PDF's logical structure tree (tagged PDF) for reading order. By default this is disabled and reading order is reconstructed using the xycut algorithm. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:1-1.

The choice between default and cluster table detection, as well as between structure-tree and xycut reading order, is the main per-element configuration surface exposed to users. Source: node/opendataloader-pdf/src/convert-options.generated.ts:1-1.
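
For readers unfamiliar with the xycut name, the following is a minimal, generic recursive XY-cut over bounding boxes, written in Python. It is a textbook-style sketch and not the project's implementation: the box layout (x0, y0, x1, y1) with y growing downward, the min_gap threshold, and the rule of cutting along the axis with the widest empty band are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward (assumed)


def _widest_gap(boxes: List[Box], lo: int, hi: int) -> Tuple[float, Optional[float]]:
    """Return (gap size, cut position) of the widest empty band along one axis."""
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_gap, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest extent covered so far
    for start, end in intervals[1:]:
        if start - reach > best_gap:
            best_gap, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_gap, best_cut


def xy_cut(boxes: List[Box], min_gap: float = 1.0) -> List[Box]:
    """Order boxes by recursively splitting at the widest whitespace band."""
    if len(boxes) <= 1:
        return list(boxes)
    gap_x, cut_x = _widest_gap(boxes, 0, 2)
    gap_y, cut_y = _widest_gap(boxes, 1, 3)
    widest = max(gap_x, gap_y)
    if widest <= 0 or widest < min_gap:
        # No clean cut: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if gap_y >= gap_x:
        first = [b for b in boxes if b[3] <= cut_y]   # above the horizontal cut
        second = [b for b in boxes if b[3] > cut_y]
    else:
        first = [b for b in boxes if b[2] <= cut_x]   # left of the vertical cut
        second = [b for b in boxes if b[2] > cut_x]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)
```

On a page with a full-width header above two columns, this sketch first separates the header, then splits the columns at the gutter, and only then orders lines within each column, which is why a gutter narrower than the line spacing can lead to interleaved output.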

Configuration Surface

Configuration flows from the CLI / Node / Python layers into the Java Config object. The fields most relevant to element detection are summarised below.

| Configuration | CLI flag | Default | Effect |
| --- | --- | --- | --- |
| Table detection method | `--table-method` | `default` | Selects `TableBorderProcessor` or `ClusterTableProcessor` |
| Reading order | `--reading-order` | `xycut` | Reconstructed order vs. tagged structure tree |
| Structure tree usage | `--use-struct-tree` | disabled | Enables structure-tree reader |
| Hidden-text filter | `--content-safety-off` | all enabled | Disables hidden-text, off-page, tiny, or hidden-OCG filters |
| Sanitisation | `--sanitize` | disabled | Replaces emails, phones, IPs, cards, URLs |
| Line breaks | `--keep-line-breaks` | collapsed | Preserves original line breaks |
| Hybrid backend | `--hybrid` | `off` | Routes pages through docling-fast or hancom-ai |
| Hybrid timeout | `--hybrid-timeout` | `0` (no timeout) | Per-request timeout for the hybrid server |

Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:1-1. Source: node/opendataloader-pdf/src/convert-options.generated.ts:1-1.

Integration With the Hybrid Backend

When --hybrid is enabled, the pipeline is no longer purely local. HybridDocumentProcessor triages pages into batches and forwards them to a running hybrid server. The default batch size is currently 50 (a hard-coded constant flagged for configurability in community discussion #500). Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:1-1.
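
The batching step can be sketched in a few lines of Python. The constant value 50 comes from the text above; the chunk_pages helper and its size parameter are hypothetical and only show what a configurable batch size might look like, not how HybridDocumentProcessor is written.

```python
from typing import Iterator, List, Sequence

BACKEND_CHUNK_SIZE = 50  # value quoted in the text; hard-coded in the Java source


def chunk_pages(pages: Sequence[int], size: int = BACKEND_CHUNK_SIZE) -> Iterator[List[int]]:
    """Yield consecutive groups of at most `size` page numbers."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(pages), size):
        yield list(pages[start:start + size])


if __name__ == "__main__":
    for batch in chunk_pages(range(1, 121)):
        print(batch[0], batch[-1], len(batch))
```

A smaller size lowers the memory the backend needs per request at the cost of more HTTP round trips, which is the trade-off issue #500 asks to expose.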

The hybrid server is a FastAPI app that owns a single DocumentConverter instance and returns DoclingDocument JSON. It supports multiple OCR engines (EasyOCR, Tesseract, RapidOCR, ocrmac) and accelerator devices (CPU, CUDA, MPS, XPU). Failures are surfaced as a partial_success status with explicit failed_pages so the Java pipeline can fall back to local extraction when --hybrid-fallback is set. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1.

The Python MCP server wraps the same conversion pipeline behind a Model Context Protocol interface so AI agents can request conversions programmatically; its convert_pdf tool accepts the hybrid backend as one of its parameters. Source: python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py:1-1.

Common Failure Modes

Reading order on two-column layouts — xycut handles most cases, but documents without structure trees can still produce interleaved columns. Enabling --use-struct-tree or switching to a tagged PDF source usually helps.
Borderless tables — --table-method default will miss them; switch to --table-method cluster.
Hybrid backend timeouts or OOM — Docling may emit std::bad_alloc or Page N: <error> messages that surface as partial_success with a populated failed_pages list. Re-running with --hybrid-fallback retries locally. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1.
Maven reproducibility — packaging consumers (notably NixOS, see issue #551) report non-reproducible dependency resolution because the Java core is built from source.

Hybrid AI Mode, Output Generators, and JSON Schema

Related topics: Core Processing Pipeline and PDF Element Detection, Language SDKs, CLI, and Build/Operations

OpenDataLoader PDF exposes three tightly related subsystems that determine what comes out of a conversion run: a Hybrid AI Mode that optionally routes pages to an external ML backend, a family of Output Generators that emit Markdown, HTML, plain text, and PDF artifacts, and a JSON Schema that carries the structural metadata used by downstream RAG pipelines.

1. Hybrid AI Mode

Hybrid Mode lets the Java core delegate pages to a separate HTTP service (Docling or Hancom AI) while keeping the Java-side text extraction, structure tree, and content-safety checks in charge. The mode is controlled by the hybrid option in java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java and exposed to CLI, Node, and Python clients through auto-generated option definitions such as python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py and node/opendataloader-pdf/src/convert-options.generated.ts.

flowchart LR
    PDF[Input PDF] --> Triage{Triage}
    Triage -- "auto (dynamic)" --> Java[Java extractors]
    Triage -- "full" --> Backend[Hybrid backend]
    Java --> Merge[Merge + generators]
    Backend --> Merge
    Merge --> Out[JSON / MD / HTML / PDF]

Two triage strategies are available. HYBRID_MODE_AUTO ("auto") inspects each page and decides whether to send it to the backend, while HYBRID_MODE_FULL ("full") skips triage and forwards every page. Source: Config.java. The Python quick-start runs the bundled FastAPI server via opendataloader-pdf-hybrid --port 5002 (defined in hybrid_server.py), and remote servers can be targeted with --hybrid-url. A BACKEND_CHUNK_SIZE constant batches pages sent per request — community issue #500 requests that this hardcoded limit become configurable. For Hancom AI backends two extra knobs control region-list overlap with TSR (hybrid-hancom-ai-regionlist-strategy) and OCR fallback behavior (hybrid-hancom-ai-ocr-strategy), documented in convert-options.generated.ts.
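
The difference between the two triage modes can be shown with a short Python sketch. The mode strings match the constants named above; the triage function and the needs_backend predicate are hypothetical, since the source does not describe the per-page heuristic that auto mode applies.

```python
from typing import Callable, Dict, List

HYBRID_MODE_AUTO = "auto"
HYBRID_MODE_FULL = "full"


def triage(
    pages: List[int],
    mode: str,
    needs_backend: Callable[[int], bool],  # hypothetical per-page heuristic
) -> Dict[str, List[int]]:
    """Split pages into those sent to the backend and those handled locally."""
    if mode == HYBRID_MODE_FULL:
        # Full mode skips triage: every page goes to the backend.
        return {"backend": list(pages), "local": []}
    if mode == HYBRID_MODE_AUTO:
        backend = [p for p in pages if needs_backend(p)]
        local = [p for p in pages if not needs_backend(p)]
        return {"backend": backend, "local": local}
    raise ValueError(f"unknown hybrid mode: {mode!r}")
```

For example, `triage([1, 2, 3, 4], "auto", lambda p: p % 2 == 0)` routes pages 2 and 4 to the backend and keeps 1 and 3 local, while `"full"` routes all four.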

The hybrid server itself returns a structured JSON payload with status, document, errors, failed_pages, and processing_time. Failed pages are detected by combining two strategies, parsing "Page N:" messages from Docling errors and gap-detecting pages missing from the response, so that partial successes surface actionable diagnostics. Source: hybrid_server.py. When the backend fails, the hybrid-fallback option lets the Java pipeline recover by extracting the affected pages locally.

2. Output Generators

Every conversion can emit one or more artifacts through flags in CLIOptions.java and Config.java. The format option accepts json, text, html, markdown, markdown-with-html, and tagged-pdf; the Python wrapper builds this list from the legacy boolean flags in wrapper.py (--json, --markdown, --html, --pdf).

| Output mode | Key options | Notes |
| --- | --- | --- |
| JSON | `--no-json` to disable | Default-on, carries structural metadata |
| Markdown | `--markdown-page-separator`, `--markdown-with-html`, `--markdown-with-images` | HTML mode allows complex row-spans |
| HTML | `--html-page-separator` | Rich layout output |
| Plain text | `--text-page-separator`, `--keep-line-breaks` | Whitespace-faithful |
| Tagged PDF | `--format tagged-pdf` | Re-tags the input |

All three text-style generators share a PAGE_NUMBER_STRING placeholder ("%page-number%") that resolves to the current page index, defined in Config.java. Image handling is governed by --image-output (off, embedded, external) and --image-format (png, jpeg); only the external mode honors --image-dir. Source: convert-options.generated.ts.
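
Placeholder substitution of this kind is simple to picture. In the Python sketch below, the placeholder string is the one quoted above; the helper names, the 1-based numbering, and the choice to emit the separator before each page are assumptions rather than confirmed behaviour of the Java generators.

```python
from typing import List

PAGE_NUMBER_STRING = "%page-number%"  # placeholder quoted from Config.java


def render_separator(template: str, page_number: int) -> str:
    """Substitute the placeholder with the given page number."""
    return template.replace(PAGE_NUMBER_STRING, str(page_number))


def join_pages(pages: List[str], template: str) -> str:
    """Assumed layout: one rendered separator line ahead of each page's text."""
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(render_separator(template, number))
        parts.append(text)
    return "\n".join(parts)


if __name__ == "__main__":
    print(join_pages(["first page", "second page"], "<!-- page %page-number% -->"))
```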

The MCP server in python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py exposes a convert_pdf tool that accepts a format parameter and the same hybrid/image/separator options, mapping format strings to file extensions (.json, .md, .html, .txt) and writing into a temporary directory before returning the content. Setup snippets for Claude Desktop, Claude Code, Codex, Cursor, and Windsurf are documented in python/opendataloader-pdf-mcp/README.md.
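
The format-to-extension mapping described for the MCP tool might look roughly like the Python sketch below. The four extensions are the ones listed above; assigning `.md` to the two Markdown variants, the EXT_MAP name, and the output_name helper are assumptions for illustration and are not taken from server.py.

```python
from pathlib import PurePosixPath

# Assumed mapping: the text names the extensions, not the exact pairing.
EXT_MAP = {
    "json": ".json",
    "text": ".txt",
    "html": ".html",
    "markdown": ".md",
    "markdown-with-html": ".md",
    "markdown-with-images": ".md",
}


def output_name(pdf_name: str, fmt: str) -> str:
    """Return the expected output file name for a given input PDF and format."""
    if fmt not in EXT_MAP:
        raise ValueError(f"unsupported format: {fmt!r}")
    return PurePosixPath(pdf_name).stem + EXT_MAP[fmt]
```

Validating the format against an explicit map, rather than building the extension from the format string, rejects unknown formats before any conversion work starts.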

3. JSON Schema

The JSON output is the canonical structured form. Each node carries text plus a metadata block (e.g. type, page, bbox, source) suitable for embedding and retrieval, as shown in examples/python/rag/README.md:

{
  "text": "Language model pretraining has led to significant...",
  "metadata": {
    "type": "paragraph",
    "page": 1,
    "bbox": [108.0, 526.2, 286.5, 592.8],
    "source": "1901.03003.pdf"
  }
}

Release v2.4.7 extended the schema with per-node ai_score and pdfua_tag fields (PR #530) and exposed the raw Docling object ID when hybrid mode is enabled, so that downstream consumers can correlate hybrid-backed content with the original PDF objects. The JSON emitter is on by default; it can be disabled via --no-json (or isGenerateJSON = false) per CLIOptions.java.

Language SDKs, CLI, and Build/Operations

Related topics: Project Overview and System Architecture, Hybrid AI Mode, Output Generators, and JSON Schema

The OpenDataLoader PDF project ships a single Java extraction engine and exposes it through several parallel surfaces: a Java SDK with a stable programmatic API, an Apache Commons CLI front-end, a Python SDK that bundles a FastAPI hybrid backend, a Node.js/TypeScript binding, and an MCP (Model Context Protocol) server for AI agents. This page explains how those surfaces relate, where their option definitions originate, and the operational concerns (packaging, reproducibility, batch sizing) that recur in community discussion.

Java SDK and CLI Surface

The Java core module under java/opendataloader-pdf-core/ is the canonical source of every configuration option. The Config class encapsulates all knobs (output formats, separators, reading order, hybrid backend selection, image handling) used by both the library API and the CLI. Public string constants on Config define the accepted enum-like values, e.g. READING_ORDER_OFF / READING_ORDER_XYCUT and HYBRID_OFF for the no-backend mode. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:42-49

The CLI is built on Apache Commons CLI. CLIOptions is explicitly documented as a stable integration surface for downstream consumers (such as opendataloader-pdfua); its defineOptions(), addAllTo(), and the parser helpers are the only members guaranteed to remain backward compatible. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:24-32

To prevent drift between CLI flags, the Java API, the Python wrapper, and the Node binding, the project keeps a single OPTION_DEFINITIONS list inside CLIOptions. Each entry carries the long flag, short flag, type, default, and Javadoc-derived description, and is rendered into every downstream surface from that list. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:74-95

Higher-level Java entry points sit alongside Config:

FilterConfig toggles content-safety filters (hidden text, out-of-page content, tiny text, hidden OCGs) and ships a default set of SanitizationRule patterns for emails, phone numbers, IPs, credit cards, and URLs. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java:22-39
AutoTagger returns an in-memory tagged PDDocument without writing intermediate files, exposing the tag(inputPath, config) and shutdown() lifecycle. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java:32-44
OutputWriter separates extraction from file emission, enabling a two-phase pipeline where an ExtractionResult is reused across multiple output format writes. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java:23-35

Python SDK and Hybrid Server

The Python package wraps the Java JAR. Because the Java side is the source of truth for every flag, cli_options_generated.py is an auto-generated mirror of OPTION_DEFINITIONS; the same pattern is used for convert-options.generated.ts on the Node side. This guarantees that --markdown-page-separator, --hybrid-mode, --image-output, and the rest of the ~30 flags remain identical across surfaces. Source: python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py:1-30
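
The idea of rendering one option list into a downstream surface can be sketched with Python's argparse. The three entries below use flag names and defaults that appear in this manual, but the dictionary layout, the descriptions, and the build_parser helper are hypothetical; the generated file in the repository may be structured quite differently.

```python
import argparse
from typing import List

# Hypothetical excerpt; the real OPTION_DEFINITIONS list lives in CLIOptions.java.
OPTION_DEFINITIONS = [
    {"name": "table-method", "type": "string", "default": "default",
     "description": "Table detection method"},
    {"name": "hybrid", "type": "string", "default": "off",
     "description": "Hybrid backend to route pages through"},
    {"name": "keep-line-breaks", "type": "boolean", "default": False,
     "description": "Preserve original line breaks"},
]


def build_parser(definitions: List[dict]) -> argparse.ArgumentParser:
    """Create one argparse flag per definition so surfaces cannot drift apart."""
    parser = argparse.ArgumentParser(prog="example-cli")
    for opt in definitions:
        flag = "--" + opt["name"]
        if opt["type"] == "boolean":
            parser.add_argument(flag, action="store_true",
                                default=opt["default"], help=opt["description"])
        else:
            parser.add_argument(flag, default=opt["default"], help=opt["description"])
    return parser


if __name__ == "__main__":
    print(build_parser(OPTION_DEFINITIONS).parse_args(["--table-method", "cluster"]))
```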

The hybrid server is a FastAPI application that hosts a single DocumentConverter singleton and returns Docling's JSON document. The CLI entry point opendataloader-pdf-hybrid accepts --port, --host, --ocr-lang, --ocr-engine, --psm, --device, --enrich-formula, --enrich-picture-description, and --max-file-size. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-50

Inside the server, a _build_response helper merges two failure-detection strategies when Docling reports partial_success: it parses "Page N:" substrings out of error messages and computes the set gap between expected and present pages in the JSON output. The union is what the caller actually receives as failed_pages, which the Java side uses to decide which pages to fall back on. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:55-110
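
The union of the two detection strategies can be expressed compactly. This Python sketch follows the description above, but the function name, its parameters, and the exact regular expression are assumptions and are not copied from _build_response.

```python
import re
from typing import Iterable, List

PAGE_ERROR = re.compile(r"Page (\d+):")  # assumed shape of the "Page N:" prefix


def detect_failed_pages(
    errors: Iterable[str],
    expected_pages: Iterable[int],
    returned_pages: Iterable[int],
) -> List[int]:
    """Combine pages named in error messages with pages missing from the output."""
    from_errors = {int(n) for msg in errors for n in PAGE_ERROR.findall(msg)}
    from_gaps = set(expected_pages) - set(returned_pages)
    return sorted(from_errors | from_gaps)


if __name__ == "__main__":
    print(detect_failed_pages(["Page 3: std::bad_alloc"], range(1, 6), [1, 2, 4]))
```

Using both signals covers the case where a page fails without leaving a parseable message, as well as the case where an error is reported for a page that still appears in the output.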

The RAG example shows the consumer shape: each chunk is a {text, metadata} dict where metadata includes type, page, bbox, and source. The bbox is the layout coordinate from the Java processor, so downstream vector stores can index geometry alongside text. Source: examples/python/rag/README.md:21-34

{
  "text": "Language model pretraining has led to significant...",
  "metadata": {
    "type": "paragraph",
    "page": 1,
    "bbox": [108.0, 526.2, 286.5, 592.8],
    "source": "1901.03003.pdf"
  }
}
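
A consumer of records in this shape might group text by page or use the geometry for filtering. The Python sketch below assumes the bbox order is [x0, y0, x1, y1]; that ordering and both helper names are assumptions, since the README excerpt shows only the values.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_page(chunks: List[dict]) -> Dict[int, List[str]]:
    """Collect chunk texts per page, preserving the order they arrive in."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for chunk in chunks:
        pages[chunk["metadata"]["page"]].append(chunk["text"])
    return dict(pages)


def bbox_area(chunk: dict) -> float:
    """Area of the chunk's bounding box, assuming [x0, y0, x1, y1] order."""
    x0, y0, x1, y1 = chunk["metadata"]["bbox"]
    return (x1 - x0) * (y1 - y0)


if __name__ == "__main__":
    sample = {
        "text": "Language model pretraining has led to significant...",
        "metadata": {"type": "paragraph", "page": 1,
                     "bbox": [108.0, 526.2, 286.5, 592.8],
                     "source": "1901.03003.pdf"},
    }
    print(group_by_page([sample]), bbox_area(sample))
```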

Node.js/TypeScript and MCP Surfaces

The Node binding reuses the same generated schema. convert-options.generated.ts exposes every CLI flag with TypeScript types and JSDoc descriptions copied from the Java OptionDefinition source (e.g. hybrid?: string, hybridMode?: string, hybridTimeout?: string). Source: node/opendataloader-pdf/src/convert-options.generated.ts:1-30

The MCP server (opendataloader-pdf-mcp) targets Claude Desktop, Claude Code, OpenAI Codex, Cursor, and Windsurf. It launches via uvx opendataloader-pdf-mcp and forwards tool invocations to the Java/Python pipeline. Its convert_pdf tool accepts the same option set as the CLI plus an output_dir and validates format against an explicit ext_map of json / text / html / markdown / markdown-with-html / markdown-with-images. Source: python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py:55-80, and python/opendataloader-pdf-mcp/README.md:21-58

Build, Operations, and Community Concerns

flowchart LR
  A[OPTION_DEFINITIONS in CLIOptions.java] --> B[Java Config / CLI]
  A --> C[cli_options_generated.py]
  A --> D[convert-options.generated.ts]
  B --> E[Python wrapper / uvx]
  C --> E
  D --> F[Node SDK]
  B --> G[MCP server]
  E --> G
  E --> H[FastAPI hybrid_server]
  H --> I[Docling DocumentConverter]

Three operational pain points show up repeatedly in the issue tracker:

Reproducible Maven resolution (#551): packaging opendataloader-pdf inside a Nix fixed-output derivation requires the full transitive Maven graph to be reproducible. Because Java is the implementation language and option generation walks the dependency tree, any non-deterministic resolution propagates into the PyPI and conda-forge artifacts.
Pre-compiled JAR in sdist (#435): conda-forge currently needs Maven plus OpenJDK at build time to recompile the JAR from sources. Shipping the JAR inside the PyPI source distribution would remove that barrier for the scientific Python ecosystem; this is the same path the Node and MCP surfaces already rely on.
Configurable hybrid batch size (#500): the value BACKEND_CHUNK_SIZE = 50 is hardcoded in HybridDocumentProcessor, which sets the page grouping the Java side sends per HTTP request to hybrid_server.py. The community request is to expose it as a CLI/API option, alongside the existing --hybrid-timeout flag.

Until those land, the practical guidance is: pin Java and dependency versions when packaging, start the hybrid server with --max-file-size, set a tuned --hybrid-timeout on the client for large PDFs, and use the --hybrid-fallback flag to keep a Java-only path available when the Docling backend is unreachable.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 15 structured pitfall item(s), including 3 high/blocking item(s). Top priority: Configuration risk - Configuration risk requires verification.

1. Configuration risk: Configuration risk requires verification

Severity: high
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/566

2. Capability evidence risk: Capability evidence risk requires verification

Severity: high
Finding: Project evidence flags a capability evidence risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/414

3. Runtime risk: Runtime risk requires verification

Severity: high
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/428

4. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/440

5. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/528

6. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/578

7. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/584

8. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/548

9. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/opendataloader-project/opendataloader-pdf

10. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/581

11. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/opendataloader-project/opendataloader-pdf

12. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: No demo was found for this project (record flag: no_demo).
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/opendataloader-project/opendataloader-pdf

Source: Doramagic discovery, validation, and Project Pack records
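
The recommended check repeated across the findings above (reproduce the install and quickstart path in an isolated environment) can be sketched with Python's standard-library venv module. This is a minimal sketch, not the project's documented procedure: the helper names create_isolated_env and install_command are hypothetical, and the package name passed to pip is an assumption that should be confirmed against the official README.

```python
import os
import venv
from pathlib import Path


def create_isolated_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create a fresh virtual environment and return the path to its interpreter."""
    # clear=True wipes any previous contents so each reproduction starts clean.
    venv.EnvBuilder(with_pip=with_pip, clear=True).create(env_dir)
    # The interpreter location differs between Windows and POSIX layouts.
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(python: Path, package: str) -> list[str]:
    """Build (but do not run) the pip command that installs into that environment."""
    # The package name is supplied by the caller; nothing is assumed here.
    return [str(python), "-m", "pip", "install", package]
```

A possible usage is to call create_isolated_env(Path("odl-check")), build the command with install_command(python, "opendataloader-pdf") (package name assumed), and pass it to subprocess.run(cmd, check=True) before walking through the quickstart. Keeping command construction separate from execution makes it easy to log exactly what was run when comparing results against a linked issue.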

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page)
Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using opendataloader-pdf with real data or production workflows.

Add Simplified Chinese README translation - github / github_issue
JSON output does not preserve table header cells (TH) — all cells serial - github / github_issue
unneccessary extra "-" and dot added in md files generated from pdf for - github / github_issue
missing character - github / github_issue
Integrated Lifecycle Management for Hybrid Server - github / github_issue
Community source 6 - github / github_issue
why it runs very slowly in CPU mode ？ - github / github_issue
Poor OCR extraction on low-resolution scanned PDF flyer in hybrid Doclin - github / github_issue
Local extraction of an image-only / scanned PDF succeeds silently with n - github / github_issue
Type3 fonts with non-1/1000 /FontMatrix: numbers lose a repeated digit a - github / github_issue
Release v2.4.7 - github / github_release
Release v2.4.6 - github / github_release

Source: Project Pack community evidence and pitfall evidence