Doramagic Project Pack · Human Manual
opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
Project Overview and System Architecture
Related topics: Core Processing Pipeline and PDF Element Detection, Language SDKs, CLI, and Build/Operations
Purpose and Scope
OpenDataLoader PDF is a multi-language PDF extraction and conversion toolkit that turns PDF documents into structured outputs (JSON, Markdown, HTML, plain text, and tagged PDF) for downstream use in RAG pipelines, content-safety audits, and accessibility workflows. The repository is organized as a polyglot project: a Java core that does the heavy lifting, plus thin wrappers and servers in Python, Node.js/TypeScript, and an MCP (Model Context Protocol) server for AI agents.
The Java core exposes a public API via Config.java, which centralizes reading order, output format, hybrid-backend selection, image handling, page separators, and content-safety flags. The standalone tagging path is provided by AutoTagger.java, which returns an in-memory tagged PDDocument without intermediate files.
The Python package ships a hybrid backend server (Docling-based) and an MCP server, while the Node package mirrors the Java option surface as TypeScript types. Community discussions (issue #449) highlight a recurring ask to extend the same pipeline to DOCX, and issue #500 surfaces a request to make the hybrid backend's batch size configurable. Both reflect a user base that expects the same conversion guarantees across input formats and tuning knobs that survive at scale.
System Architecture
The project is structured as a Java core that performs extraction, with optional Python-side acceleration and language-specific bindings on top.
flowchart LR
subgraph Clients
CLI["Java CLI<br/>(opendataloader-pdf-cli)"]
PY["Python CLI / API<br/>(opendataloader-pdf)"]
NODE["Node / TypeScript<br/>(opendataloader-pdf)"]
MCP["MCP Server<br/>(opendataloader-pdf-mcp)"]
RAG["RAG examples<br/>(examples/python/rag)"]
end
subgraph JavaCore["Java Core (opendataloader-pdf-core)"]
API["OpenDataLoaderPDF API"]
CFG["Config / FilterConfig"]
AT["AutoTagger"]
PROC["DocumentProcessors<br/>(Hybrid, AutoTagging, ...)"]
end
subgraph HybridBackend["Hybrid Backend (optional)"]
HS["opendataloader-pdf-hybrid<br/>(FastAPI + Docling)"]
end
CLI --> API
PY --> API
NODE --> API
MCP --> PY
RAG --> PY
API --> CFG
API --> PROC
AT --> API
PROC -. HTTP .-> HS
Source: CLIOptions.java, hybrid_server.py, server.py.
Java Core Layer
The core layer lives under java/opendataloader-pdf-core and contains the API surface and processors:
- Config.java: single configuration object covering output formats (json, text, html, markdown, tagged-pdf), reading order (off vs xycut), hybrid backend selection (off, docling-fast, hancom-ai), image output (off, embedded, external), page separators, strikethrough detection, and HTML-in-Markdown toggles.
- FilterConfig.java: controls content-safety filters (hidden text, out-of-page, tiny text, hidden OCG) and sensitive-data sanitization (emails, phone numbers, IPs, credit cards, URLs) with regex-based SanitizationRules.
- CLIOptions.java: defines CLI flags via Apache Commons CLI and a single OPTION_DEFINITIONS list that is the source of truth for both the command line and the JSON export of options.
- AutoTagger.java: entry point for standalone PDF auto-tagging, returning a PDDocument and supporting saveTo(...).
Wrapper and Server Layer
The Java core is consumed by language-specific wrappers that all share the same option vocabulary:
- Python wrapper: cli_options_generated.py is generated from the Java option list, ensuring the Python CLI, the MCP server, and the RAG examples expose a uniform surface.
- Hybrid server: hybrid_server.py is a FastAPI service that keeps a singleton Docling DocumentConverter and returns DoclingDocument JSON; the Java side then renders markdown/HTML. Failed pages are detected by the union of error-message parsing and gap detection on the returned pages dict.
- MCP server: server.py exposes a convert_pdf tool with parameters for format, page ranges, headers/footers, hybrid backend, OCR strategy, and image output, mapping them onto the underlying Python wrapper. Installation snippets for Claude Desktop, Claude Code, OpenAI Codex, Cursor, and Windsurf live in the MCP README.
- Node/TypeScript: convert-options.generated.ts declares the same option set (hybrid, hybridMode, hybridUrl, imageOutput, tableMethod, readingOrder, separators, etc.) so JavaScript consumers get type-safe access to the same flags.
Processing Pipeline
At a high level, every entry point — CLI, Python API, MCP tool, RAG example — feeds a Config into the Java OpenDataLoaderPDF.processFile pipeline, which:
- Loads the PDF and applies FilterConfig rules (hidden text, out-of-page, tiny text, hidden OCG, sensitive-data sanitization).
- Detects structure (reading order via xycut or COS order) and, when a hybrid backend is selected (docling-fast or hancom-ai), forwards page batches to the Python FastAPI server; triage modes are auto (dynamic) or full (all pages to backend).
- Optionally runs OCR / formula / picture-description enrichment on the hybrid server (configurable engine: easyocr, tesseract, tesserocr, rapidocr, ocrmac; configurable device: auto, cpu, cuda, mps, xpu).
- Renders the selected output format and writes results to the output directory, with images embedded as Base64 or written externally to imageDir as png/jpeg.
RAG consumers, illustrated in examples/python/rag/README.md, receive pre-chunked records with text and metadata (type, page, bbox, source) ready for embedding into vector stores such as Chroma, FAISS, Pinecone, or Weaviate.
Configuration Model
| Layer | Class / File | Responsibility |
|---|---|---|
| Output + layout | Config.java | Format, reading order, page separators, image mode, hybrid backend |
| Content safety | FilterConfig.java | Hidden/ocg/tiny filters, sensitive-data regex rules |
| CLI surface | CLIOptions.java | Apache Commons CLI mapping + JSON export |
| Hybrid runtime | hybrid_server.py | Docling singleton, OCR, formula/picture enrichment, failed-page detection |
| AI-agent tool | server.py | MCP convert_pdf tool wrapping the Python API |
| JS binding | convert-options.generated.ts | TypeScript types mirroring the Java options |
Community-Reported Concerns
- DOCX support (issue #449): users want a single application that handles PDF and DOCX with the same fidelity; the current pipeline is PDF-centric, with the Java core using verapdf PDDocument and the hybrid server running Docling's PDF pipeline.
- Reproducible Maven builds (issue #551): downstream packagers (e.g. NixOS) need deterministic dependency resolution, since the Java core declares all PDF extraction dependencies via Maven.
- Configurable hybrid batch size (issue #500): BACKEND_CHUNK_SIZE is currently a hardcoded 50 in HybridDocumentProcessor; making it a config value would let operators trade memory for latency.
- Pre-compiled JAR for PyPI sdist (issue #435): conda-forge recipes want a Java-free build path, so the Python wheel should ship a pre-built JAR rather than recompiling on every install.
- Hybrid runtime evolution (release v2.4.7): full-page DLA renders are now persisted for evidence overlays, and per-node ai_score / pdfua_tag metadata is emitted in JSON, reflecting the project's direction toward richer, audit-ready output.
See Also
- Hybrid Backend and Docling Integration: how the Java core talks to the FastAPI/Docling server.
- Configuration Reference: full enumeration of Config and FilterConfig fields.
- MCP Server Setup: installing and configuring the AI-agent tool.
- RAG Pipeline Examples: chunk schemas and vector-store integration.
Source: https://github.com/opendataloader-project/opendataloader-pdf / Human Manual
Core Processing Pipeline and PDF Element Detection
Related topics: Project Overview and System Architecture, Hybrid AI Mode, Output Generators, and JSON Schema
Purpose and Scope
The core processing pipeline is the Java heart of OpenDataLoader PDF. It ingests a PDF document, detects structural elements (headings, paragraphs, lists, tables, images, hidden text), establishes reading order, and emits structured outputs such as JSON, Markdown, HTML, text, annotated PDF, and tagged PDF. The pipeline is exposed to other layers (Python wrapper, Node CLI, MCP server) through a small set of public API entry points, while keeping the per-element detection logic in a dedicated processors package.
The public API is intentionally thin: OpenDataLoaderPDF#processFile performs extraction and output in a single call, while DocumentProcessor#extractContents returns an ExtractionResult that downstream generators (such as OutputWriter and AutoTagger) can consume without re-running extraction. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java:1-1. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java:1-1.
Pipeline Architecture
The pipeline follows a two-phase model: extraction (a single, potentially expensive pass over the PDF that detects elements and computes reading order) followed by output generation (cheap, format-specific serialisation). This split is what enables AutoTagger to reuse the same ExtractionResult for tagged-PDF output without re-parsing the source document. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java:1-1.
flowchart LR
A[Input PDF] --> B[DocumentProcessor.extractContents]
B --> C{Element Detection}
C --> D[HeadingProcessor]
C --> E[TableBorderProcessor / ClusterTableProcessor]
C --> F[ListProcessor]
C --> G[HiddenTextProcessor]
C --> H[Structure-Tree Reader]
D --> I["Reading Order (xycut)"]
E --> I
F --> I
G --> I
H --> I
I --> J[ExtractionResult]
J --> K[OutputWriter]
J --> L[AutoTagger]
K --> M[JSON / Markdown / HTML / Text]
L --> N[Tagged / Annotated PDF]
When the hybrid backend is enabled, a parallel path is taken: pages are triaged by HybridDocumentProcessor and dispatched to an external server (such as docling-fast or hancom-ai) which returns a DoclingDocument JSON. That JSON is merged with locally detected content before output generation. The hybrid server itself is a FastAPI app written in Python and shipped as opendataloader-pdf-hybrid. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:1-1.
PDF Element Detection
Element detection lives in the processors package and is split into specialised classes that each handle a single structural concern:
- HeadingProcessor: classifies text spans as headings based on font size, weight, and surrounding whitespace, producing section markers used by Markdown and JSON outputs.
- TableBorderProcessor: the default table detector. It uses visible border lines (default table method) to infer cell boundaries and merges neighbouring text spans into rows and columns. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1.
- ClusterTableProcessor: an alternative table detector (cluster table method) that combines border analysis with spatial clustering of text. It is intended for borderless tables that still exhibit a regular layout. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1.
- ListProcessor: recognises bulleted and numbered lists, normalising their markers and indentation for Markdown and JSON output.
- HiddenTextProcessor: flags text that is hidden through colour matching, tiny font size, off-page placement, or hidden OCG layers. Filter behaviour is configurable through --content-safety-off.
- Structure-Tree Reader: when --use-struct-tree is set, the pipeline consults the PDF's logical structure tree (tagged PDF) for reading order. By default this is disabled and reading order is reconstructed using the xycut algorithm. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:1-1.
The choice between default and cluster table detection, as well as between structure-tree and xycut reading order, is the main per-element configuration surface exposed to users. Source: node/opendataloader-pdf/src/convert-options.generated.ts:1-1.
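To make the idea behind xycut-style reading order concrete, here is a minimal sketch of a recursive XY-cut over bounding boxes. It is a textbook simplification and not the project's implementation: the box layout `(x0, y0, x1, y1)` with a top-left origin, the `min_gap` threshold, and the "try columns before rows" rule are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x0, y0, x1, y1) with the origin at the top-left,
# so a smaller y0 means "higher on the page".
Box = Tuple[float, float, float, float]


def _split(boxes: List[Box], axis: str, min_gap: float) -> Optional[Tuple[List[Box], List[Box]]]:
    """Split boxes at the widest empty gap along one axis, or return None."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, best_idx = min_gap, None
    reach = ordered[0][hi]  # furthest extent covered by the boxes seen so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap > best_gap:
            best_gap, best_idx = gap, i
        reach = max(reach, ordered[i][hi])
    if best_idx is None:
        return None
    return ordered[:best_idx], ordered[best_idx:]


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Return boxes in reading order by recursively cutting at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try a vertical cut (columns) first, then a horizontal cut (rows).
    for axis in ("x", "y"):
        parts = _split(boxes, axis, min_gap)
        if parts is not None:
            first, second = parts
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No usable gap: fall back to top-to-bottom, left-to-right ordering.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


title = (50.0, 40.0, 550.0, 60.0)
left_1 = (50.0, 80.0, 290.0, 200.0)
left_2 = (50.0, 220.0, 290.0, 400.0)
right_1 = (310.0, 80.0, 550.0, 300.0)
order = xy_cut([right_1, left_2, title, left_1])
```

With a full-width title above two columns, the title blocks any vertical cut at the top level, so the first cut is horizontal; the two columns are then separated by a vertical cut and each is read top to bottom.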
Configuration Surface
Configuration flows from the CLI / Node / Python layers into the Java Config object. The fields most relevant to element detection are summarised below.
| Configuration | CLI flag | Default | Effect |
|---|---|---|---|
| Table detection method | --table-method | default | Selects TableBorderProcessor or ClusterTableProcessor |
| Reading order | --reading-order | xycut | Reconstructed order vs. tagged structure tree |
| Structure tree usage | --use-struct-tree | disabled | Enables structure-tree reader |
| Hidden-text filter | --content-safety-off | all enabled | Disables hidden-text, off-page, tiny, or hidden-OCG filters |
| Sanitisation | --sanitize | disabled | Replaces emails, phones, IPs, cards, URLs |
| Line breaks | --keep-line-breaks | collapsed | Preserves original line breaks |
| Hybrid backend | --hybrid | off | Routes pages through docling-fast or hancom-ai |
| Hybrid timeout | --hybrid-timeout | 0 (no timeout) | Per-request timeout for the hybrid server |
Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:1-1. Source: node/opendataloader-pdf/src/convert-options.generated.ts:1-1.
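As an illustration of how the flags in the table could be assembled by a calling script, the sketch below turns a plain dictionary into an argument list. `build_args` is a hypothetical helper, not part of any OpenDataLoader package, and whether a given flag is a boolean switch or takes a value is inferred from the table rather than confirmed against the CLI.

```python
from typing import Dict, List, Union

OptionValue = Union[str, int, bool]


def build_args(options: Dict[str, OptionValue]) -> List[str]:
    """Convert {"table_method": "cluster"} style options into CLI arguments.

    Hypothetical helper: booleans become bare switches (omitted when False),
    everything else becomes "--flag value".
    """
    args: List[str] = []
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        # bool must be checked before other types because bool is a subclass of int.
        if isinstance(value, bool):
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


example_args = build_args(
    {
        "table_method": "cluster",
        "use_struct_tree": True,
        "sanitize": False,
        "hybrid": "docling-fast",
    }
)
```

Keeping the option names in one dictionary mirrors the single-source-of-truth idea the project applies to its option definitions, at the cost of no compile-time checking of flag names.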
Integration With the Hybrid Backend
When --hybrid is enabled, the pipeline is no longer purely local. HybridDocumentProcessor triages pages into batches and forwards them to a running hybrid server. The default batch size is currently 50 (a hard-coded constant flagged for configurability in community discussion #500). Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:1-1.
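The batching described above can be sketched as a simple fixed-size chunker. Only the value 50 and the name BACKEND_CHUNK_SIZE come from the text; the function `chunk_pages` and its `chunk_size` parameter are hypothetical and show what a configurable batch size might look like, not how HybridDocumentProcessor is written.

```python
from typing import Iterator, List

BACKEND_CHUNK_SIZE = 50  # value cited in the text as currently hard-coded


def chunk_pages(page_numbers: List[int], chunk_size: int = BACKEND_CHUNK_SIZE) -> Iterator[List[int]]:
    """Yield consecutive batches of at most chunk_size page numbers."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(page_numbers), chunk_size):
        yield page_numbers[start:start + chunk_size]


batches = list(chunk_pages(list(range(1, 121))))
```

A smaller chunk size means more HTTP requests with less data held per request, which is the memory-versus-latency trade-off raised in issue #500.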
The hybrid server is a FastAPI app that owns a single DocumentConverter instance and returns DoclingDocument JSON. It supports multiple OCR engines (EasyOCR, Tesseract, RapidOCR, ocrmac) and accelerator devices (CPU, CUDA, MPS, XPU). Failures are surfaced as a partial_success status with explicit failed_pages so the Java pipeline can fall back to local extraction when --hybrid-fallback is set. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1.
The hybrid option is also exposed through the Python MCP server, which wraps the Java CLI behind a Model Context Protocol interface so AI agents can request conversions programmatically. Source: python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py:1-1.
Common Failure Modes
- Reading order on two-column layouts: xycut handles most cases, but documents without structure trees can still produce interleaved columns. Enabling --use-struct-tree or switching to a tagged PDF source usually helps.
- Borderless tables: --table-method default will miss them; switch to --table-method cluster.
- Hybrid backend timeouts or OOM: Docling may emit std::bad_alloc or Page N: <error> messages that surface as partial_success with a populated failed_pages list. Re-running with --hybrid-fallback retries locally. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1.
- Maven reproducibility: packaging consumers (notably NixOS, see issue #551) report non-reproducible dependency resolution because the Java core is built from source.
See Also
- Hybrid backend configuration and the opendataloader-pdf-hybrid server
- Output formats: JSON schema, Markdown rendering rules, tagged PDF structure
- Python wrapper (convert() API) and MCP server integration
- RAG examples in examples/python/rag/
Hybrid AI Mode, Output Generators, and JSON Schema
Related topics: Core Processing Pipeline and PDF Element Detection, Language SDKs, CLI, and Build/Operations
OpenDataLoader PDF exposes three tightly related subsystems that determine what comes out of a conversion run: a Hybrid AI Mode that optionally routes pages to an external ML backend, a family of Output Generators that emit Markdown, HTML, plain text, and PDF artifacts, and a JSON Schema that carries the structural metadata used by downstream RAG pipelines.
1. Hybrid AI Mode
Hybrid Mode lets the Java core delegate pages to a separate HTTP service (Docling or Hancom AI) while keeping the Java-side text extraction, structure tree, and content-safety checks in charge. The mode is controlled by the hybrid option in java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java and exposed to CLI, Node, and Python clients through auto-generated option definitions such as python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py and node/opendataloader-pdf/src/convert-options.generated.ts.
flowchart LR
PDF[Input PDF] --> Triage{Triage}
Triage -- "auto (dynamic)" --> Java[Java extractors]
Triage -- "full" --> Backend[Hybrid backend]
Java --> Merge[Merge + generators]
Backend --> Merge
Merge --> Out[JSON / MD / HTML / PDF]
Two triage strategies are available. HYBRID_MODE_AUTO ("auto") inspects each page and decides whether to send it to the backend, while HYBRID_MODE_FULL ("full") skips triage and forwards every page. Source: Config.java. The Python quick-start runs the bundled FastAPI server via opendataloader-pdf-hybrid --port 5002 (defined in hybrid_server.py), and remote servers can be targeted with --hybrid-url. A BACKEND_CHUNK_SIZE constant batches pages sent per request; community issue #500 requests that this hardcoded limit become configurable. For Hancom AI backends two extra knobs control region-list overlap with TSR (hybrid-hancom-ai-regionlist-strategy) and OCR fallback behavior (hybrid-hancom-ai-ocr-strategy), documented in convert-options.generated.ts.
The hybrid server itself returns a structured JSON payload with status, document, errors, failed_pages, and processing_time. Failed pages are detected by combining two strategies — parsing "Page N:" messages from Docling errors and gap-detecting pages missing from the response — so that partial successes surface actionable diagnostics. Source: hybrid_server.py. hybrid-fallback allows the Java pipeline to recover if the backend fails.
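From the caller's side, a payload with these fields can drive the fallback decision. The sketch below is a hypothetical consumer, not the Java pipeline's logic: the field names come from the text, while the function name and the rule "retry every requested page when no document came back" are assumptions.

```python
from typing import Any, Dict, List


def pages_to_retry_locally(payload: Dict[str, Any], requested_pages: List[int]) -> List[int]:
    """Pick the pages a caller might re-extract locally after a hybrid request.

    Assumed policy: if the backend returned no document at all, retry every
    requested page; otherwise retry only the pages listed in failed_pages.
    """
    if payload.get("document") is None:
        return sorted(requested_pages)
    failed = set(payload.get("failed_pages") or [])
    return sorted(page for page in requested_pages if page in failed)


partial = {
    "status": "partial_success",
    "document": {"pages": {"1": {}, "2": {}, "4": {}}},
    "errors": ["Page 3: conversion failed"],
    "failed_pages": [3],
    "processing_time": 1.2,
}
retry = pages_to_retry_locally(partial, [1, 2, 3, 4])
```

Restricting the retry to failed pages keeps the backend's results for the pages that did succeed, which is the point of surfacing partial successes rather than failing the whole request.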
2. Output Generators
Every conversion can emit one or more artifacts through flags in CLIOptions.java and Config.java. The format option accepts json, text, html, markdown, markdown-with-html, and tagged-pdf; the Python wrapper builds this list from the legacy boolean flags in wrapper.py (--json, --markdown, --html, --pdf).
| Output mode | Key options | Notes |
|---|---|---|
| JSON | --no-json to disable | Default-on, carries structural metadata |
| Markdown | --markdown-page-separator, --markdown-with-html, --markdown-with-images | HTML mode allows complex row-spans |
| HTML | --html-page-separator | Rich layout output |
| Plain text | --text-page-separator, --keep-line-breaks | Whitespace-faithful |
| Tagged PDF | --format tagged-pdf | Re-tags the input |
All three text-style generators share a PAGE_NUMBER_STRING placeholder ("%page-number%") that resolves to the current page index, defined in Config.java. Image handling is governed by --image-output (off, embedded, external) and --image-format (png, jpeg); only the external mode honors --image-dir. Source: convert-options.generated.ts.
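The placeholder substitution can be pictured with a few lines of string handling. Only the placeholder text "%page-number%" comes from the source; the helper names, the 1-based page numbering, and placing the separator before each page are assumptions of this sketch.

```python
from typing import List

PAGE_NUMBER_STRING = "%page-number%"  # placeholder text cited from Config.java


def render_separator(template: str, page_number: int) -> str:
    """Replace every placeholder occurrence with the given page number."""
    return template.replace(PAGE_NUMBER_STRING, str(page_number))


def join_pages(pages: List[str], template: str) -> str:
    """Join page texts, emitting a rendered separator before each page.

    Assumes 1-based page numbers; the real generators may differ.
    """
    parts: List[str] = []
    for number, text in enumerate(pages, start=1):
        parts.append(render_separator(template, number))
        parts.append(text)
    return "\n".join(parts)


joined = join_pages(["first page", "second page"], "--- page %page-number% ---")
```

A separator template without the placeholder is passed through unchanged, so the same code path serves both numbered and fixed separators.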
The MCP server in python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py exposes a convert_pdf tool that accepts a format parameter and the same hybrid/image/separator options, mapping format strings to file extensions (.json, .md, .html, .txt) and writing into a temporary directory before returning the content. Setup snippets for Claude Desktop, Claude Code, Codex, Cursor, and Windsurf are documented in python/opendataloader-pdf-mcp/README.md.
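A format-to-extension lookup of the kind described here might look like the following. This is an illustrative sketch rather than the contents of server.py: the dictionary pairs each format with the obvious extension from the list above, and `output_filename` is a hypothetical helper.

```python
from pathlib import PurePath

# Assumed pairing of format names with the extensions listed in the text.
EXT_MAP = {
    "json": ".json",
    "markdown": ".md",
    "html": ".html",
    "text": ".txt",
}


def output_filename(pdf_name: str, fmt: str) -> str:
    """Derive the output file name for a format, rejecting unknown formats."""
    if fmt not in EXT_MAP:
        raise ValueError(f"unsupported format: {fmt!r}")
    return PurePath(pdf_name).stem + EXT_MAP[fmt]


markdown_name = output_filename("report.pdf", "markdown")
```

Validating against an explicit map means an unsupported format fails fast with a clear message instead of producing a file with an unexpected name.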
3. JSON Schema
The JSON output is the canonical structured form. Each node carries text plus a metadata block (e.g. type, page, bbox, source) suitable for embedding and retrieval, as shown in examples/python/rag/README.md:
{
"text": "Language model pretraining has led to significant...",
"metadata": {
"type": "paragraph",
"page": 1,
"bbox": [108.0, 526.2, 286.5, 592.8],
"source": "1901.03003.pdf"
}
}
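Records of this shape map naturally onto the (id, text, metadata) rows that most vector stores accept. The sketch below assumes only the four metadata keys shown above; the ID scheme and the function name are made up for illustration, and no particular vector-store client is used.

```python
from typing import Any, Dict, List, Tuple

Row = Tuple[str, str, Dict[str, Any]]


def to_store_rows(records: List[Dict[str, Any]]) -> List[Row]:
    """Turn {text, metadata} records into (id, text, metadata) rows.

    The ID format "<source>:p<page>:<index>" is a hypothetical convention.
    """
    rows: List[Row] = []
    for index, record in enumerate(records):
        meta = record["metadata"]
        row_id = f'{meta["source"]}:p{meta["page"]}:{index}'
        rows.append((row_id, record["text"], meta))
    return rows


sample = [
    {
        "text": "Language model pretraining has led to significant...",
        "metadata": {
            "type": "paragraph",
            "page": 1,
            "bbox": [108.0, 526.2, 286.5, 592.8],
            "source": "1901.03003.pdf",
        },
    }
]
rows = to_store_rows(sample)
```

Carrying the metadata through unchanged keeps page and bbox available at query time, so a retrieved chunk can be traced back to its location in the PDF.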
Release v2.4.7 extended the schema with per-node ai_score and pdfua_tag fields (PR #530) and exposed the raw Docling object ID when hybrid mode is enabled, so that downstream consumers can correlate hybrid-backed content with the original PDF objects. The JSON emitter is on by default; it can be disabled via --no-json (or isGenerateJSON = false) per CLIOptions.java.
Language SDKs, CLI, and Build/Operations
Related topics: Project Overview and System Architecture, Hybrid AI Mode, Output Generators, and JSON Schema
The OpenDataLoader PDF project ships a single Java extraction engine and exposes it through several parallel surfaces: a Java SDK with a stable programmatic API, an Apache Commons CLI front-end, a Python SDK that bundles a FastAPI hybrid backend, a Node.js/TypeScript binding, and an MCP (Model Context Protocol) server for AI agents. This page explains how those surfaces relate, where their option definitions originate, and the operational concerns (packaging, reproducibility, batch sizing) that recur in community discussion.
Java SDK and CLI Surface
The Java core module under java/opendataloader-pdf-core/ is the canonical source of every configuration option. The Config class encapsulates all knobs (output formats, separators, reading order, hybrid backend selection, image handling) used by both the library API and the CLI. Public string constants on Config define the accepted enum-like values, e.g. READING_ORDER_OFF / READING_ORDER_XYCUT and HYBRID_OFF for the no-backend mode. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:42-49
The CLI is built on Apache Commons CLI. CLIOptions is explicitly documented as a stable integration surface for downstream consumers (such as opendataloader-pdfua); its defineOptions(), addAllTo(), and the parser helpers are the only members guaranteed to remain backward compatible. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:24-32
To prevent drift between CLI flags, the Java API, the Python wrapper, and the Node binding, the project keeps a single OPTION_DEFINITIONS list inside CLIOptions. Each entry carries the long flag, short flag, type, default, and Javadoc-derived description, and is rendered into every downstream surface from that list. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:74-95
Higher-level Java entry points sit alongside Config:
- FilterConfig toggles content-safety filters (hidden text, out-of-page content, tiny text, hidden OCGs) and ships a default set of SanitizationRule patterns for emails, phone numbers, IPs, credit cards, and URLs. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java:22-39
- AutoTagger returns an in-memory tagged PDDocument without writing intermediate files, exposing the tag(inputPath, config) and shutdown() lifecycle. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java:32-44
- OutputWriter separates extraction from file emission, enabling a two-phase pipeline where an ExtractionResult is reused across multiple output format writes. Source: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java:23-35
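Regex-based sanitization of the kind attributed to FilterConfig can be illustrated in a few lines. The two patterns and the replacement tokens below are simplified assumptions chosen for readability; they are not the project's default rules and would miss many real-world variants.

```python
import re
from typing import List, Pattern, Tuple

# Illustrative rules only: (compiled pattern, replacement token).
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def sanitize(text: str) -> str:
    """Apply each rule in order, replacing every match with its token."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


cleaned = sanitize("Mail bob@example.com from 10.0.0.1")
```

Rules run in list order, so a more specific pattern should be placed before a broader one that could match part of the same text.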
Python SDK and Hybrid Server
The Python package wraps the Java JAR. Because the Java side is the source of truth for every flag, cli_options_generated.py is an auto-generated mirror of OPTION_DEFINITIONS; the same pattern is used for convert-options.generated.ts on the Node side. This guarantees that --markdown-page-separator, --hybrid-mode, --image-output, and the rest of the ~30 flags remain identical across surfaces. Source: python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py:1-30
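The generate-from-one-list pattern can be sketched as follows. The `OptionDef` fields, the `ConvertOptions` interface name, and the rendering functions are hypothetical stand-ins for the real generator; only the kebab-case flags and their camelCase TypeScript counterparts (for example hybrid-mode and hybridMode) are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OptionDef:
    """Hypothetical, reduced stand-in for one OPTION_DEFINITIONS entry."""
    long_flag: str      # kebab-case flag without leading dashes
    type_name: str      # TypeScript type, e.g. "string" or "boolean"
    description: str


def to_camel(flag: str) -> str:
    """hybrid-mode -> hybridMode (assumes lowercase kebab-case input)."""
    head, *rest = flag.split("-")
    return head + "".join(part.capitalize() for part in rest)


def to_snake(flag: str) -> str:
    """hybrid-mode -> hybrid_mode, the usual Python keyword-argument form."""
    return flag.replace("-", "_")


def render_typescript(defs: List[OptionDef]) -> str:
    """Render an interface with one optional, documented field per option."""
    lines = ["export interface ConvertOptions {"]
    for d in defs:
        lines.append(f"  /** {d.description} */")
        lines.append(f"  {to_camel(d.long_flag)}?: {d.type_name};")
    lines.append("}")
    return "\n".join(lines)


DEFS = [
    OptionDef("hybrid", "string", "Hybrid backend to use"),
    OptionDef("hybrid-mode", "string", "Triage mode"),
    OptionDef("image-output", "string", "Image output mode"),
]
ts_source = render_typescript(DEFS)
```

Because every surface is derived from the same list, adding or renaming a flag in one place changes all bindings on the next generation run, which is what prevents drift.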
The hybrid server is a FastAPI application that hosts a single DocumentConverter singleton and returns Docling's JSON document. The CLI entry point opendataloader-pdf-hybrid accepts --port, --host, --ocr-lang, --ocr-engine, --psm, --device, --enrich-formula, --enrich-picture-description, and --max-file-size. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-50
Inside the server, a _build_response helper merges two failure-detection strategies when Docling reports partial_success: it parses "Page N:" substrings out of error messages and computes the set gap between expected and present pages in the JSON output. The union is what the caller actually receives as failed_pages, which the Java side uses to decide which pages to fall back on. Source: python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:55-110
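The two detection strategies and their union can be sketched independently of Docling. The function below is an illustration of the described technique, not the body of _build_response: its name, its parameters, and the exact regular expression are assumptions.

```python
import re
from typing import Iterable, List

# Matches the "Page N:" prefix the text says appears in Docling error messages.
PAGE_ERROR = re.compile(r"Page (\d+):")


def detect_failed_pages(
    errors: Iterable[str],
    expected_pages: Iterable[int],
    present_pages: Iterable[int],
) -> List[int]:
    """Union of pages named in error messages and pages missing from the output."""
    from_messages = {int(n) for message in errors for n in PAGE_ERROR.findall(message)}
    from_gaps = set(expected_pages) - set(present_pages)
    return sorted(from_messages | from_gaps)


failed = detect_failed_pages(
    errors=["Page 2: std::bad_alloc"],
    expected_pages=range(1, 6),
    present_pages=[1, 3, 5],
)
```

Here page 2 is reported by an error message and page 4 is simply absent from the output, so neither strategy alone would have found both.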
The RAG example shows the consumer shape: each chunk is a {text, metadata} dict where metadata includes type, page, bbox, and source. The bbox is the layout coordinate from the Java processor, so downstream vector stores can index geometry alongside text. Source: examples/python/rag/README.md:21-34
{
"text": "Language model pretraining has led to significant...",
"metadata": {
"type": "paragraph",
"page": 1,
"bbox": [108.0, 526.2, 286.5, 592.8],
"source": "1901.03003.pdf"
}
}
Node.js/TypeScript and MCP Surfaces
The Node binding reuses the same generated schema. convert-options.generated.ts exposes every CLI flag with TypeScript types and JSDoc descriptions copied from the Java OptionDefinition source (e.g. hybrid?: string, hybridMode?: string, hybridTimeout?: string). Source: node/opendataloader-pdf/src/convert-options.generated.ts:1-30
The MCP server (opendataloader-pdf-mcp) targets Claude Desktop, Claude Code, OpenAI Codex, Cursor, and Windsurf. It launches via uvx opendataloader-pdf-mcp and forwards tool invocations to the Java/Python pipeline. Its convert_pdf tool accepts the same option set as the CLI plus an output_dir and validates format against an explicit ext_map of json / text / html / markdown / markdown-with-html / markdown-with-images. Source: python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py:55-80, and python/opendataloader-pdf-mcp/README.md:21-58
Build, Operations, and Community Concerns
flowchart LR
A[OPTION_DEFINITIONS in CLIOptions.java] --> B[Java Config / CLI]
A --> C[cli_options_generated.py]
A --> D[convert-options.generated.ts]
B --> E[Python wrapper / uvx]
C --> E
D --> F[Node SDK]
B --> G[MCP server]
E --> G
E --> H[FastAPI hybrid_server]
H --> I[Docling DocumentConverter]
Three operational pain points show up repeatedly in the issue tracker:
- Reproducible Maven resolution (#551): packaging opendataloader-pdf inside a Nix fixed-output derivation requires the full transitive Maven graph to be reproducible. Because Java is the implementation language and option generation walks the dependency tree, any non-deterministic resolution propagates into the PyPI and conda-forge artifacts.
- Pre-compiled JAR in sdist (#435): conda-forge currently needs Maven plus OpenJDK at build time to recompile the JAR from sources. Shipping the JAR inside the PyPI source distribution would remove that barrier for the scientific Python ecosystem; this is the same path the Node and MCP surfaces already rely on.
- Configurable hybrid batch size (#500): the value BACKEND_CHUNK_SIZE = 50 is hardcoded in HybridDocumentProcessor, which sets the page grouping the Java side sends per HTTP request to hybrid_server.py. The community request is to expose it as a CLI/API option, alongside the existing --hybrid-timeout flag.
Until those land, the practical guidance is: pin Java/dependency versions when packaging, run the hybrid server with --max-file-size and a tuned --hybrid-timeout for large PDFs, and use the --hybrid-fallback flag to keep a Java-only path available when the Docling backend is unreachable.
See Also
- Architecture and Extraction Pipeline
- Hybrid Backend (Docling) Integration
- Output Formats and JSON Schema
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 15 structured pitfall item(s), including 3 high/blocking item(s). Top priority: Configuration risk - Configuration risk requires verification.
1. Configuration risk: Configuration risk requires verification
- Severity: high
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/566
2. Capability evidence risk: Capability evidence risk requires verification
- Severity: high
- Finding: Project evidence flags a capability evidence risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/414
3. Runtime risk: Runtime risk requires verification
- Severity: high
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/428
4. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/440
5. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/528
6. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/578
7. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/584
8. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/548
9. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/opendataloader-project/opendataloader-pdf
10. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/581
11. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/opendataloader-project/opendataloader-pdf
12. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: The validation record carries the flag no_demo, which suggests no demo was available to exercise.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/opendataloader-project/opendataloader-pdf
Source: Doramagic discovery, validation, and Project Pack records
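Every item above recommends the same check: reproduce the install and quickstart path in an isolated environment. A minimal sketch of the install half follows, using only the Python standard library. It assumes the PyPI package name matches the repository name (opendataloader-pdf); the helper names env_python, install_command, and reproduce_install are hypothetical and not part of the project. Because the Java core does the extraction work, a Java runtime may also be required on the machine. Confirm that against the official README.

```python
import subprocess
import sys
import venv
from pathlib import Path

# Assumption: the PyPI package name mirrors the repository name.
PACKAGE = "opendataloader-pdf"


def env_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(python: Path, package: str = PACKAGE) -> list[str]:
    """Build the pip command that installs the package into that environment."""
    return [str(python), "-m", "pip", "install", package]


def reproduce_install(env_dir: Path) -> int:
    """Create a fresh environment, install the package, return pip's exit code."""
    # clear=True wipes any previous contents so each run starts clean.
    venv.create(env_dir, with_pip=True, clear=True)
    result = subprocess.run(install_command(env_python(env_dir)), check=False)
    return result.returncode


if __name__ == "__main__":
    import tempfile

    # A temporary directory keeps the check away from any existing setup.
    with tempfile.TemporaryDirectory() as tmp:
        code = reproduce_install(Path(tmp) / "odl-env")
        print("install exit code:", code)
```

A non-zero exit code points to an installation problem worth comparing against the linked installation issues before moving on to the quickstart itself.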
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using opendataloader-pdf with real data or production workflows.
- Add Simplified Chinese README translation - github / github_issue
- JSON output does not preserve table header cells (TH) — all cells serial - github / github_issue
- unneccessary extra "-" and dot added in md files generated from pdf for - github / github_issue
- missing character - github / github_issue
- Integrated Lifecycle Management for Hybrid Server - github / github_issue
- Community source 6 - github / github_issue
- why it runs very slowly in CPU mode ? - github / github_issue
- Poor OCR extraction on low-resolution scanned PDF flyer in hybrid Doclin - github / github_issue
- Local extraction of an image-only / scanned PDF succeeds silently with n - github / github_issue
- Type3 fonts with non-1/1000 /FontMatrix: numbers lose a repeated digit a - github / github_issue
- Release v2.4.7 - github / github_release
- Release v2.4.6 - github / github_release
Source: Project Pack community evidence and pitfall evidence