# https://github.com/opendataloader-project/opendataloader-pdf Project Manual

Generated at: 2026-06-22 14:40:31 UTC

## Table of Contents

- [Project Overview and System Architecture](#page-1)
- [Core Processing Pipeline and PDF Element Detection](#page-2)
- [Hybrid AI Mode, Output Generators, and JSON Schema](#page-3)
- [Language SDKs, CLI, and Build/Operations](#page-4)

<a id='page-1'></a>

## Project Overview and System Architecture

### Related Pages

Related topics: [Core Processing Pipeline and PDF Element Detection](#page-2), [Language SDKs, CLI, and Build/Operations](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java)
- [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py)
- [python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py)
- [python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py)
- [python/opendataloader-pdf-mcp/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/README.md)
- [node/opendataloader-pdf/src/convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts)
- [examples/python/rag/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/examples/python/rag/README.md)
</details>

# Project Overview and System Architecture

## Purpose and Scope

OpenDataLoader PDF is a multi-language PDF extraction and conversion toolkit that turns PDF documents into structured outputs (JSON, Markdown, HTML, plain text, and tagged PDF) for downstream use in RAG pipelines, content-safety audits, and accessibility workflows. The repository is organized as a polyglot project: a Java core that does the heavy lifting, plus thin wrappers and servers in Python, Node.js/TypeScript, and an MCP (Model Context Protocol) server for AI agents.

The Java core exposes a public API via [Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java), which centralizes reading order, output format, hybrid-backend selection, image handling, page separators, and content-safety flags. The standalone tagging path is provided by [AutoTagger.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java), which returns an in-memory tagged `PDDocument` without intermediate files.

The Python package ships a hybrid backend server (Docling-based) and an MCP server, while the Node package mirrors the Java option surface as TypeScript types. Community discussions highlight two recurring requests: extending the same pipeline to DOCX (issue #449) and making the hybrid backend's batch size configurable (issue #500). Both reflect a user base that expects consistent conversion quality across input formats and tuning parameters that can be adjusted for large workloads.

## System Architecture

The project is structured as a Java core that performs extraction, with optional Python-side acceleration and language-specific bindings on top.

```mermaid
flowchart LR
    subgraph Clients
        CLI["Java CLI<br/>(opendataloader-pdf-cli)"]
        PY["Python CLI / API<br/>(opendataloader-pdf)"]
        NODE["Node / TypeScript<br/>(opendataloader-pdf)"]
        MCP["MCP Server<br/>(opendataloader-pdf-mcp)"]
        RAG["RAG examples<br/>(examples/python/rag)"]
    end

    subgraph JavaCore["Java Core (opendataloader-pdf-core)"]
        API["OpenDataLoaderPDF API"]
        CFG["Config / FilterConfig"]
        AT["AutoTagger"]
        PROC["DocumentProcessors<br/>(Hybrid, AutoTagging, ...)"]
    end

    subgraph HybridBackend["Hybrid Backend (optional)"]
        HS["opendataloader-pdf-hybrid<br/>(FastAPI + Docling)"]
    end

    CLI --> API
    PY --> API
    NODE --> API
    MCP --> PY
    RAG --> PY
    API --> CFG
    API --> PROC
    AT --> API
    PROC -. HTTP .-> HS
```

Source: [CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java), [hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py), [server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py).

### Java Core Layer

The core layer lives under `java/opendataloader-pdf-core` and contains the API surface and processors:

- [Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java) — single configuration object covering output formats (json, text, html, markdown, tagged-pdf), reading order (`off` vs `xycut`), hybrid backend selection (`off`, `docling-fast`, `hancom-ai`), image output (`off`, `embedded`, `external`), page separators, strikethrough detection, and HTML-in-Markdown toggles.
- [FilterConfig.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java) — controls content-safety filters (hidden text, out-of-page, tiny text, hidden OCG) and sensitive-data sanitization (emails, phone numbers, IPs, credit cards, URLs) with regex-based `SanitizationRule`s.
- [CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java) — defines CLI flags via Apache Commons CLI and a single `OPTION_DEFINITIONS` list that is the source of truth for both the command line and the JSON export of options.
- [AutoTagger.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java) — entry point for standalone PDF auto-tagging, returning a `PDDocument` and supporting `saveTo(...)`.
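
The regex-based sanitization attributed to `FilterConfig` above can be illustrated with a minimal Python sketch. The `SanitizationRule` dataclass, the two patterns, and the replacement tokens are assumptions made for illustration; the actual rules live in the Java class and may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SanitizationRule:
    """Hypothetical stand-in for a regex-based rule; not the Java class."""
    name: str
    pattern: re.Pattern
    replacement: str


# Illustrative patterns only; deliberately simple, not production-grade.
RULES = [
    SanitizationRule("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    SanitizationRule("ipv4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def sanitize(text: str, rules=RULES) -> str:
    """Apply each rule in order, replacing every match with its token."""
    for rule in rules:
        text = rule.pattern.sub(rule.replacement, text)
    return text
```

Keeping each rule as data (name, pattern, replacement) rather than as code makes it straightforward to enable, disable, or extend individual rules.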

### Wrapper and Server Layer

The Java core is consumed by language-specific wrappers that all share the same option vocabulary:

- Python wrapper: [cli_options_generated.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py) is generated from the Java option list, ensuring the Python CLI, the MCP server, and the RAG examples expose a uniform surface.
- Hybrid server: [hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py) is a FastAPI service that keeps a singleton Docling `DocumentConverter` and returns DoclingDocument JSON; the Java side then renders Markdown/HTML. Failed pages are detected by taking the union of two signals: pages named in parsed error messages and gaps in the returned `pages` dict.
- MCP server: [server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py) exposes a `convert_pdf` tool with parameters for format, page ranges, headers/footers, hybrid backend, OCR strategy, and image output, mapping them onto the underlying Python wrapper. Installation snippets for Claude Desktop, Claude Code, OpenAI Codex, Cursor, and Windsurf live in the [MCP README](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/README.md).
- Node/TypeScript: [convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts) declares the same option set (`hybrid`, `hybridMode`, `hybridUrl`, `imageOutput`, `tableMethod`, `readingOrder`, separators, etc.) so JavaScript consumers get type-safe access to the same flags.

## Processing Pipeline

At a high level, every entry point — CLI, Python API, MCP tool, RAG example — feeds a `Config` into the Java `OpenDataLoaderPDF.processFile` pipeline, which:

1. Loads the PDF and applies `FilterConfig` rules (hidden text, out-of-page, tiny text, hidden OCG, sensitive-data sanitization).
2. Detects structure (reading order via `xycut` or COS order) and, when a hybrid backend is selected (`docling-fast` or `hancom-ai`), forwards page batches to the Python FastAPI server; triage modes are `auto` (dynamic) or `full` (all pages to backend).
3. Optionally runs OCR / formula / picture-description enrichment on the hybrid server (configurable engine: `easyocr`, `tesseract`, `tesserocr`, `rapidocr`, `ocrmac`; configurable device: `auto`, `cpu`, `cuda`, `mps`, `xpu`).
4. Renders the selected output format and writes results to the output directory, with images embedded as Base64 or written externally to `imageDir` as `png`/`jpeg`.

RAG consumers, illustrated in [examples/python/rag/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/examples/python/rag/README.md), receive pre-chunked records with `text` and `metadata` (type, page, bbox, source) ready for embedding into vector stores such as Chroma, FAISS, Pinecone, or Weaviate.

## Configuration Model

| Layer | Class / File | Responsibility |
|---|---|---|
| Output + layout | [Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java) | Format, reading order, page separators, image mode, hybrid backend |
| Content safety | [FilterConfig.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java) | Hidden/ocg/tiny filters, sensitive-data regex rules |
| CLI surface | [CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java) | Apache Commons CLI mapping + JSON export |
| Hybrid runtime | [hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py) | Docling singleton, OCR, formula/picture enrichment, failed-page detection |
| AI-agent tool | [server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py) | MCP `convert_pdf` tool wrapping the Python API |
| JS binding | [convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts) | TypeScript types mirroring the Java options |

## Community-Reported Concerns

- **DOCX support (issue #449)**: users want a single application that handles PDF and DOCX with the same fidelity; the current pipeline is PDF-centric, with the Java core using `verapdf` `PDDocument` and the hybrid server running Docling's PDF pipeline.
- **Reproducible Maven builds (issue #551)**: downstream packagers (e.g. NixOS) need deterministic dependency resolution, since the Java core declares all PDF extraction dependencies via Maven.
- **Configurable hybrid batch size (issue #500)**: `BACKEND_CHUNK_SIZE` is currently a hardcoded `50` in `HybridDocumentProcessor`; making it a config value would let operators trade memory for latency.
- **Pre-compiled JAR for PyPI sdist (issue #435)**: conda-forge recipes want a Java-free build path, so the Python wheel should ship a pre-built JAR rather than recompiling on every install.
- **Hybrid runtime evolution (release v2.4.7)**: full-page DLA renders are now persisted for evidence overlays, and per-node `ai_score` / `pdfua_tag` metadata is emitted in JSON, reflecting the project's direction toward richer, audit-ready output.

## See Also

- [Hybrid Backend and Docling Integration](#) — how the Java core talks to the FastAPI/Docling server.
- [Configuration Reference](#) — full enumeration of `Config` and `FilterConfig` fields.
- [MCP Server Setup](#) — installing and configuring the AI-agent tool.
- [RAG Pipeline Examples](#) — chunk schemas and vector-store integration.

---

<a id='page-2'></a>

## Core Processing Pipeline and PDF Element Detection

### Related Pages

Related topics: [Project Overview and System Architecture](#page-1), [Hybrid AI Mode, Output Generators, and JSON Schema](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java)
- [python/opendataloader-pdf/src/opendataloader_pdf/wrapper.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/wrapper.py)
- [python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py)
</details>

# Core Processing Pipeline and PDF Element Detection

## Purpose and Scope

The core processing pipeline is the Java heart of OpenDataLoader PDF. It ingests a PDF document, detects structural elements (headings, paragraphs, lists, tables, images, hidden text), establishes reading order, and emits structured outputs such as JSON, Markdown, HTML, text, annotated PDF, and tagged PDF. The pipeline is exposed to other layers (Python wrapper, Node CLI, MCP server) through a small set of public API entry points, while keeping the per-element detection logic in a dedicated `processors` package.

The public API is intentionally thin: `OpenDataLoaderPDF#processFile` performs extraction and output in a single call, while `DocumentProcessor#extractContents` returns an `ExtractionResult` that downstream generators (such as `OutputWriter` and `AutoTagger`) can consume without re-running extraction. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OpenDataLoaderPDF.java:1-1](). Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java:1-1]().

## Pipeline Architecture

The pipeline follows a two-phase model: **extraction** (a single, potentially expensive pass over the PDF that detects elements and computes reading order) followed by **output generation** (cheap, format-specific serialisation). This split is what enables `AutoTagger` to reuse the same `ExtractionResult` for tagged-PDF output without re-parsing the source document. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java:1-1]().

```mermaid
flowchart LR
    A[Input PDF] --> B[DocumentProcessor.extractContents]
    B --> C{Element Detection}
    C --> D[HeadingProcessor]
    C --> E[TableBorderProcessor / ClusterTableProcessor]
    C --> F[ListProcessor]
    C --> G[HiddenTextProcessor]
    C --> H[Structure-Tree Reader]
    D --> I["Reading Order (xycut)"]
    E --> I
    F --> I
    G --> I
    H --> I
    I --> J[ExtractionResult]
    J --> K[OutputWriter]
    J --> L[AutoTagger]
    K --> M[JSON / Markdown / HTML / Text]
    L --> N[Tagged / Annotated PDF]
```
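
The two-phase split described above can be sketched in a few lines of Python. This is an illustration of the idea only: `Extractor`, `to_text`, and `to_markdown` are hypothetical names, and the `ExtractionResult` here is a toy stand-in for the Java type of the same name.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    """Toy container for detected elements in reading order."""
    elements: list = field(default_factory=list)


class Extractor:
    """Counts how many times the expensive extraction pass runs."""

    def __init__(self):
        self.runs = 0

    def extract(self, pages):
        self.runs += 1
        # Pretend each page yields exactly one paragraph element.
        return ExtractionResult(
            [{"type": "paragraph", "page": i + 1, "text": t} for i, t in enumerate(pages)]
        )


def to_text(result):
    """Cheap generator: one line per element."""
    return "\n".join(e["text"] for e in result.elements)


def to_markdown(result):
    """Cheap generator: blank line between elements."""
    return "\n\n".join(e["text"] for e in result.elements)


extractor = Extractor()
result = extractor.extract(["First page.", "Second page."])
outputs = {"text": to_text(result), "markdown": to_markdown(result)}
```

Both outputs are produced from the same `result`, so the extraction pass runs once regardless of how many formats are requested.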

When the **hybrid backend** is enabled, a parallel path is taken: pages are triaged by `HybridDocumentProcessor` and dispatched to an external server (such as docling-fast or hancom-ai) which returns a `DoclingDocument` JSON. That JSON is merged with locally detected content before output generation. The hybrid server itself is a FastAPI app written in Python and shipped as `opendataloader-pdf-hybrid`. Source: [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1](). Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:1-1]().

## PDF Element Detection

Element detection lives in the `processors` package and is split into specialised classes that each handle a single structural concern:

- **HeadingProcessor** — classifies text spans as headings based on font size, weight, and surrounding whitespace, producing section markers used by Markdown and JSON outputs.
- **TableBorderProcessor** — the default table detector. It uses visible border lines (`default` table method) to infer cell boundaries and merges neighbouring text spans into rows and columns. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1]().
- **ClusterTableProcessor** — an alternative table detector (`cluster` table method) that combines border analysis with spatial clustering of text. It is intended for borderless tables that still exhibit a regular layout. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1]().
- **ListProcessor** — recognises bulleted and numbered lists, normalising their markers and indentation for Markdown and JSON output.
- **HiddenTextProcessor** — flags text that is hidden through colour matching, tiny font size, off-page placement, or hidden OCG layers. Filter behaviour is configurable through `--content-safety-off`.
- **Structure-Tree Reader** — when `--use-struct-tree` is set, the pipeline consults the PDF's logical structure tree (tagged PDF) for reading order. By default this is disabled and reading order is reconstructed using the `xycut` algorithm. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:1-1]().

The choice between default and cluster table detection, as well as between structure-tree and xycut reading order, is the main per-element configuration surface exposed to users. Source: [node/opendataloader-pdf/src/convert-options.generated.ts:1-1]().
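
For readers unfamiliar with XY-cut, the following is a simplified recursive version written for illustration. It is not the project's `xycut` implementation, which may use different thresholds and tie-breaking. The sketch assumes boxes are `(x0, y0, x1, y1)` tuples with `y` growing downward, and it always splits at the widest empty gap.

```python
def _gaps(intervals):
    """Return (gap_size, cut_position) for each empty gap in a 1-D projection."""
    gaps = []
    intervals = sorted(intervals)
    reach = intervals[0][1]
    for start, end in intervals[1:]:
        if start > reach:
            gaps.append((start - reach, (start + reach) / 2))
        reach = max(reach, end)
    return gaps


def xy_cut(boxes):
    """Order boxes by recursively cutting at the widest empty gap."""
    if len(boxes) <= 1:
        return list(boxes)
    best_x = max(_gaps([(b[0], b[2]) for b in boxes]), default=None)
    best_y = max(_gaps([(b[1], b[3]) for b in boxes]), default=None)
    if best_x is None and best_y is None:
        # No clean cut exists: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if best_y is None or (best_x is not None and best_x[0] >= best_y[0]):
        cut, axis = best_x[1], 0  # vertical cut: left part is read first
    else:
        cut, axis = best_y[1], 1  # horizontal cut: top part is read first
    first = [b for b in boxes if b[axis + 2] <= cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first) + xy_cut(second)
```

On a page with a full-width title above two columns, the first cut separates the title, the second separates the columns, and each column is then read top to bottom.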

## Configuration Surface

Configuration flows from the CLI / Node / Python layers into the Java `Config` object. The fields most relevant to element detection are summarised below.

| Configuration | CLI flag | Default | Effect |
|---|---|---|---|
| Table detection method | `--table-method` | `default` | Selects `TableBorderProcessor` or `ClusterTableProcessor` |
| Reading order | `--reading-order` | `xycut` | `xycut` reconstruction vs. `off` |
| Structure tree usage | `--use-struct-tree` | disabled | Enables structure-tree reader |
| Hidden-text filter | `--content-safety-off` | all enabled | Disables hidden-text, off-page, tiny, or hidden-OCG filters |
| Sanitisation | `--sanitize` | disabled | Replaces emails, phones, IPs, cards, URLs |
| Line breaks | `--keep-line-breaks` | collapsed | Preserves original line breaks |
| Hybrid backend | `--hybrid` | `off` | Routes pages through docling-fast or hancom-ai |
| Hybrid timeout | `--hybrid-timeout` | `0` (no timeout) | Per-request timeout for the hybrid server |

Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:1-1](). Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:1-1](). Source: [node/opendataloader-pdf/src/convert-options.generated.ts:1-1]().
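
As a rough illustration of how a wrapper layer might translate an options dictionary into the flags listed in the table, consider the sketch below. `build_args` is a hypothetical helper, and the snake_case-to-flag naming rule and the position of the input path are assumptions; the real wrappers are generated from the Java option list and may behave differently.

```python
def build_args(input_pdf, options):
    """Turn an options dict into CLI arguments (hypothetical helper)."""
    args = [input_pdf]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean switch, e.g. --use-struct-tree
        elif value is False or value is None:
            continue  # omitted flags keep their defaults
        else:
            args.extend([flag, str(value)])
    return args


args = build_args("report.pdf", {
    "table_method": "cluster",
    "use_struct_tree": True,
    "keep_line_breaks": False,
    "hybrid": "docling-fast",
})
```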

## Integration With the Hybrid Backend

When `--hybrid` is enabled, the pipeline is no longer purely local. `HybridDocumentProcessor` triages pages into batches and forwards them to a running hybrid server. The default batch size is currently 50 (a hard-coded constant flagged for configurability in community discussion #500). Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:1-1]().
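
The batching behaviour can be pictured with a short sketch. Only the value 50 comes from the text above; the function name `chunk_pages` and the idea of exposing the size as a parameter are assumptions, the latter mirroring what issue #500 asks for.

```python
BACKEND_CHUNK_SIZE = 50  # value quoted in the text above


def chunk_pages(page_numbers, size=BACKEND_CHUNK_SIZE):
    """Split page numbers into consecutive batches of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [page_numbers[i:i + size] for i in range(0, len(page_numbers), size)]


batches = chunk_pages(list(range(1, 121)))
```

A 120-page document therefore yields three requests, and a smaller `size` would lower peak memory per request at the cost of more round trips.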

The hybrid server is a FastAPI app that owns a single `DocumentConverter` instance and returns DoclingDocument JSON. It supports multiple OCR engines (EasyOCR, Tesseract, RapidOCR, ocrmac) and accelerator devices (CPU, CUDA, MPS, XPU). Failures are surfaced as a `partial_success` status with explicit `failed_pages` so the Java pipeline can fall back to local extraction when `--hybrid-fallback` is set. Source: [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1]().
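
On the client side, the fallback decision might look like the sketch below. The field names `status` and `failed_pages` and the `partial_success` value come from the text; the `"success"` status string, the function name, and the choice to redo the whole batch for any other status are assumptions made for illustration.

```python
def pages_needing_local_extraction(response, requested_pages, fallback_enabled):
    """Return the pages to re-extract locally after a hybrid response."""
    status = response.get("status")
    if status == "success":  # assumed name for the fully successful case
        return []
    if not fallback_enabled:
        raise RuntimeError(f"hybrid backend reported {status!r}")
    if status == "partial_success":
        failed = set(response.get("failed_pages", []))
        return [p for p in requested_pages if p in failed]
    # Any other status: redo the whole batch locally.
    return list(requested_pages)
```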

The Python MCP server exposes the same options, including the hybrid backend, behind a Model Context Protocol interface so AI agents can request conversions programmatically; it maps tool parameters onto the underlying Python wrapper. Source: [python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py:1-1]().

## Common Failure Modes

- **Reading order on two-column layouts** — `xycut` handles most cases, but documents without structure trees can still produce interleaved columns. Enabling `--use-struct-tree` or switching to a tagged PDF source usually helps.
- **Borderless tables** — `--table-method default` will miss them; switch to `--table-method cluster`.
- **Hybrid backend timeouts or OOM** — Docling may emit `std::bad_alloc` or `Page N: <error>` messages that surface as `partial_success` with a populated `failed_pages` list. Re-running with `--hybrid-fallback` retries locally. Source: [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-1]().
- **Maven reproducibility** — packaging consumers (notably NixOS, see issue #551) report non-reproducible dependency resolution because the Java core is built from source.

## See Also

- Hybrid backend configuration and the `opendataloader-pdf-hybrid` server
- Output formats: JSON schema, Markdown rendering rules, tagged PDF structure
- Python wrapper (`convert()` API) and MCP server integration
- RAG examples in `examples/python/rag/`

---

<a id='page-3'></a>

## Hybrid AI Mode, Output Generators, and JSON Schema

### Related Pages

Related topics: [Core Processing Pipeline and PDF Element Detection](#page-2), [Language SDKs, CLI, and Build/Operations](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java)
- [node/opendataloader-pdf/src/convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts)
- [python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py)
- [python/opendataloader-pdf/src/opendataloader_pdf/wrapper.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/wrapper.py)
- [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py)
- [python/opendataloader-pdf-mcp/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/README.md)
- [python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py)
- [examples/python/rag/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/examples/python/rag/README.md)
</details>

# Hybrid AI Mode, Output Generators, and JSON Schema

OpenDataLoader PDF exposes three tightly related subsystems that determine what comes out of a conversion run: a **Hybrid AI Mode** that optionally routes pages to an external ML backend, a family of **Output Generators** that emit Markdown, HTML, plain text, and PDF artifacts, and a **JSON Schema** that carries the structural metadata used by downstream RAG pipelines.

## 1. Hybrid AI Mode

Hybrid Mode lets the Java core delegate pages to a separate HTTP service (Docling or Hancom AI) while keeping the Java-side text extraction, structure tree, and content-safety checks in charge. The mode is controlled by the `hybrid` option in [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java) and exposed to CLI, Node, and Python clients through auto-generated option definitions such as [python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py) and [node/opendataloader-pdf/src/convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts).

```mermaid
flowchart LR
    PDF[Input PDF] --> Triage{Triage}
    Triage -- "auto (dynamic)" --> Java[Java extractors]
    Triage -- "full" --> Backend[Hybrid backend]
    Java --> Merge[Merge + generators]
    Backend --> Merge
    Merge --> Out[JSON / MD / HTML / PDF]
```

Two triage strategies are available. `HYBRID_MODE_AUTO` ("auto") inspects each page and decides whether to send it to the backend, while `HYBRID_MODE_FULL` ("full") skips triage and forwards every page. Source: [Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java). The Python quick-start runs the bundled FastAPI server via `opendataloader-pdf-hybrid --port 5002` (defined in [hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py)), and remote servers can be targeted with `--hybrid-url`. A `BACKEND_CHUNK_SIZE` constant batches pages sent per request — community issue #500 requests that this hardcoded limit become configurable. For Hancom AI backends two extra knobs control region-list overlap with TSR (`hybrid-hancom-ai-regionlist-strategy`) and OCR fallback behavior (`hybrid-hancom-ai-ocr-strategy`), documented in [convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts).

The hybrid server itself returns a structured JSON payload with `status`, `document`, `errors`, `failed_pages`, and `processing_time`. Failed pages are detected by combining two strategies — parsing "Page N:" messages from Docling errors and gap-detecting pages missing from the response — so that partial successes surface actionable diagnostics. Source: [hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py). `hybrid-fallback` allows the Java pipeline to recover if the backend fails.
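
The two detection strategies combine naturally as a set union. The sketch below is an approximation under stated assumptions: the regular expression, the function name, and the shape of `returned_pages` (a dict keyed by page number) are illustrative and not copied from `hybrid_server.py`.

```python
import re

PAGE_ERROR = re.compile(r"Page (\d+):")


def detect_failed_pages(errors, returned_pages, requested_pages):
    """Union of pages named in error messages and pages missing from the response."""
    from_errors = {int(m.group(1)) for msg in errors for m in PAGE_ERROR.finditer(msg)}
    returned = {int(p) for p in returned_pages}
    from_gaps = {p for p in requested_pages if p not in returned}
    return sorted(from_errors | from_gaps)
```

Using both signals matters because a page can fail silently (it is simply absent from the response) or fail loudly while still leaving a partial entry behind.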

## 2. Output Generators

Every conversion can emit one or more artifacts through flags in [CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java) and [Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java). The `format` option accepts `json`, `text`, `html`, `markdown`, `markdown-with-html`, and `tagged-pdf`; the Python wrapper builds this list from the legacy boolean flags in [wrapper.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/wrapper.py) (`--json`, `--markdown`, `--html`, `--pdf`).

| Output mode | Key options | Notes |
|---|---|---|
| JSON | `--no-json` to disable | Default-on, carries structural metadata |
| Markdown | `--markdown-page-separator`, `--markdown-with-html`, `--markdown-with-images` | HTML mode allows complex row-spans |
| HTML | `--html-page-separator` | Rich layout output |
| Plain text | `--text-page-separator`, `--keep-line-breaks` | Whitespace-faithful |
| Tagged PDF | `--format tagged-pdf` | Re-tags the input |

All three text-style generators share a `PAGE_NUMBER_STRING` placeholder (`"%page-number%"`) that resolves to the current page index, defined in [Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java). Image handling is governed by `--image-output` (`off`, `embedded`, `external`) and `--image-format` (`png`, `jpeg`); only the `external` mode honors `--image-dir`. Source: [convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts).
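
Placeholder substitution of this kind is simple to picture. In the sketch below only the placeholder string comes from the text; the function name, the 1-based numbering, and the choice to emit the separator before each page are assumptions.

```python
PAGE_NUMBER_STRING = "%page-number%"  # placeholder quoted in the text above


def join_pages(pages, separator):
    """Join page texts, emitting the separator before each page with its number."""
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(separator.replace(PAGE_NUMBER_STRING, str(number)))
        parts.append(text)
    return "\n".join(parts)
```

For example, `join_pages(["alpha", "beta"], "--- page %page-number% ---")` labels each page with its own number.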

The MCP server in [python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py) exposes a `convert_pdf` tool that accepts a `format` parameter and the same hybrid/image/separator options, mapping format strings to file extensions (`.json`, `.md`, `.html`, `.txt`) and writing into a temporary directory before returning the content. Setup snippets for Claude Desktop, Claude Code, Codex, Cursor, and Windsurf are documented in [python/opendataloader-pdf-mcp/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/README.md).

## 3. JSON Schema

The JSON output is the canonical structured form. Each node carries `text` plus a `metadata` block (e.g. `type`, `page`, `bbox`, `source`) suitable for embedding and retrieval, as shown in [examples/python/rag/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/examples/python/rag/README.md):

```json
{
  "text": "Language model pretraining has led to significant...",
  "metadata": {
    "type": "paragraph",
    "page": 1,
    "bbox": [108.0, 526.2, 286.5, 592.8],
    "source": "1901.03003.pdf"
  }
}
```

Release v2.4.7 extended the schema with per-node `ai_score` and `pdfua_tag` fields (PR #530) and exposed the raw Docling object ID when hybrid mode is enabled, so that downstream consumers can correlate hybrid-backed content with the original PDF objects. The JSON emitter is on by default; it can be disabled via `--no-json` (or `isGenerateJSON = false`) per [CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java).
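
A consumer might use the `metadata` block to select and group nodes before embedding. The following is a minimal sketch under the assumption that the nodes have already been loaded into a flat list of `{text, metadata}` dicts; the function name and the second sample node are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_text_by_page(
    nodes: Iterable[dict], wanted_types: Tuple[str, ...] = ("paragraph",)
) -> Dict[int, List[str]]:
    """Collect node text per page, keeping only the requested node types."""
    pages = defaultdict(list)
    for node in nodes:
        meta = node.get("metadata", {})
        if meta.get("type") in wanted_types:
            pages[meta.get("page")].append(node["text"])
    return dict(pages)


nodes = [
    {
        "text": "Language model pretraining has led to significant...",
        "metadata": {
            "type": "paragraph",
            "page": 1,
            "bbox": [108.0, 526.2, 286.5, 592.8],
            "source": "1901.03003.pdf",
        },
    },
    # Hypothetical second node, added only to show the type filter at work.
    {
        "text": "Abstract",
        "metadata": {"type": "heading", "page": 1, "source": "1901.03003.pdf"},
    },
]

paragraphs_by_page = group_text_by_page(nodes)
```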

## See Also

- [CLI and configuration reference](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java)
- [Hybrid backend server (`opendataloader-pdf-hybrid`)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py)
- [MCP server setup](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/README.md)
- [RAG example pipeline](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/examples/python/rag/README.md)

---

<a id='page-4'></a>

## Language SDKs, CLI, and Build/Operations

### Related Pages

Related topics: [Project Overview and System Architecture](#page-1), [Hybrid AI Mode, Output Generators, and JSON Schema](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java)
- [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java)
- [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py)
- [python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py)
- [node/opendataloader-pdf/src/convert-options.generated.ts](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/node/opendataloader-pdf/src/convert-options.generated.ts)
- [python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py)
- [python/opendataloader-pdf-mcp/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/python/opendataloader-pdf-mcp/README.md)
- [examples/python/rag/README.md](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/examples/python/rag/README.md)
</details>

# Language SDKs, CLI, and Build/Operations

The OpenDataLoader PDF project ships a single Java extraction engine and exposes it through several parallel surfaces: a Java SDK with a stable programmatic API, an Apache Commons CLI front-end, a Python SDK that bundles a FastAPI hybrid backend, a Node.js/TypeScript binding, and an MCP (Model Context Protocol) server for AI agents. This page explains how those surfaces relate, where their option definitions originate, and the operational concerns (packaging, reproducibility, batch sizing) that recur in community discussion.

## Java SDK and CLI Surface

The Java core module under `java/opendataloader-pdf-core/` is the canonical source of every configuration option. The `Config` class encapsulates all knobs (output formats, separators, reading order, hybrid backend selection, image handling) used by both the library API and the CLI. Public string constants on `Config` define the accepted enum-like values, e.g. `READING_ORDER_OFF` / `READING_ORDER_XYCUT` and `HYBRID_OFF` for the no-backend mode. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java:42-49]()

The CLI is built on Apache Commons CLI. `CLIOptions` is explicitly documented as a **stable integration surface** for downstream consumers (such as `opendataloader-pdfua`); its `defineOptions()`, `addAllTo()`, and the parser helpers are the only members guaranteed to remain backward compatible. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:24-32]()

To prevent drift between CLI flags, the Java API, the Python wrapper, and the Node binding, the project keeps a single `OPTION_DEFINITIONS` list inside `CLIOptions`. Each entry carries the long flag, short flag, type, default, and Javadoc-derived description, and is rendered into every downstream surface from that list. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/cli/CLIOptions.java:74-95]()
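
The single-source-of-truth idea can be sketched in a few lines of Python. This is not the project's generator: the `OptionDefinition` dataclass, its field names, and the two sample entries (including their defaults and descriptions) are assumptions made for illustration. The point is only that one list can drive both a CLI parser and a generated TypeScript interface.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OptionDefinition:
    long_flag: str
    short_flag: Optional[str]
    kind: str            # "string" or "boolean" in this sketch
    default: object
    description: str


# Illustrative entries; the real list lives in CLIOptions.java.
OPTION_DEFINITIONS = [
    OptionDefinition("hybrid-timeout", None, "string", None, "Hybrid backend timeout"),
    OptionDefinition("keep-line-breaks", None, "boolean", False, "Keep line breaks"),
]


def build_parser(definitions: List[OptionDefinition]) -> argparse.ArgumentParser:
    """Render the definitions into an argparse parser."""
    parser = argparse.ArgumentParser(prog="example")
    for d in definitions:
        names = [f"--{d.long_flag}"]
        if d.short_flag:
            names.insert(0, f"-{d.short_flag}")
        if d.kind == "boolean":
            parser.add_argument(*names, action="store_true",
                                default=bool(d.default), help=d.description)
        else:
            parser.add_argument(*names, default=d.default, help=d.description)
    return parser


def to_camel(flag: str) -> str:
    """Convert a kebab-case flag name to camelCase."""
    head, *rest = flag.split("-")
    return head + "".join(part.capitalize() for part in rest)


def render_typescript(definitions: List[OptionDefinition]) -> str:
    """Render the same definitions as a TypeScript interface."""
    lines = ["export interface ConvertOptions {"]
    for d in definitions:
        ts_type = "boolean" if d.kind == "boolean" else "string"
        lines.append(f"  /** {d.description} */")
        lines.append(f"  {to_camel(d.long_flag)}?: {ts_type};")
    lines.append("}")
    return "\n".join(lines)
```

Because both renderers read the same list, adding or renaming an entry changes every surface in one step, which is the drift-prevention property described above.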

Higher-level Java entry points sit alongside `Config`:

- `FilterConfig` toggles content-safety filters (hidden text, out-of-page content, tiny text, hidden OCGs) and ships a default set of `SanitizationRule` patterns for emails, phone numbers, IPs, credit cards, and URLs. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/FilterConfig.java:22-39]()
- `AutoTagger` returns an in-memory tagged `PDDocument` without writing intermediate files, exposing the `tag(inputPath, config)` and `shutdown()` lifecycle. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/AutoTagger.java:32-44]()
- `OutputWriter` separates extraction from file emission, enabling a two-phase pipeline where an `ExtractionResult` is reused across multiple output format writes. Source: [java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/OutputWriter.java:23-35]()
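
To make the sanitization idea from the `FilterConfig` item concrete, here is a small Python analogue of pattern-based rules. It is a sketch only: the `Rule` class, the regular expressions, and the replacement tokens are illustrative assumptions and are not the patterns shipped with the Java library.

```python
import re
from dataclasses import dataclass
from typing import List, Pattern


@dataclass(frozen=True)
class Rule:
    pattern: Pattern[str]
    replacement: str


# Deliberately simple patterns for three of the categories named above.
DEFAULT_RULES: List[Rule] = [
    Rule(re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[email]"),
    Rule(re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[ip]"),
    Rule(re.compile(r"https?://\S+"), "[url]"),
]


def sanitize(text: str, rules: List[Rule] = DEFAULT_RULES) -> str:
    """Apply each rule in order, replacing every match with its token."""
    for rule in rules:
        text = rule.pattern.sub(rule.replacement, text)
    return text
```

Rules run in list order, so a caller who adds overlapping patterns should place the more specific ones first.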

## Python SDK and Hybrid Server

The Python package wraps the Java JAR. Because the Java side is the source of truth for every flag, `cli_options_generated.py` is an auto-generated mirror of `OPTION_DEFINITIONS`; the same pattern is used for `convert-options.generated.ts` on the Node side. This guarantees that `--markdown-page-separator`, `--hybrid-mode`, `--image-output`, and the rest of the ~30 flags remain identical across surfaces. Source: [python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py:1-30]()

The hybrid server is a FastAPI application that hosts a single `DocumentConverter` singleton and returns Docling's JSON document. The CLI entry point `opendataloader-pdf-hybrid` accepts `--port`, `--host`, `--ocr-lang`, `--ocr-engine`, `--psm`, `--device`, `--enrich-formula`, `--enrich-picture-description`, and `--max-file-size`. Source: [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:1-50]()

Inside the server, a `_build_response` helper merges two failure-detection strategies when Docling reports `partial_success`: it parses "Page N:" substrings out of error messages **and** computes the set gap between expected and present pages in the JSON output. The union is what the caller actually receives as `failed_pages`, which the Java side uses to decide which pages to fall back on. Source: [python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py:55-110]()
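
The union of the two strategies can be sketched as follows. This is an approximation written from the description above, not the body of `_build_response`; the function name, its parameters, and the assumption of 1-based page numbers are all hypothetical.

```python
import re
from typing import Iterable, List

# Matches the "Page N:" prefix described above and captures N.
PAGE_ERROR_RE = re.compile(r"Page (\d+):")


def detect_failed_pages(
    errors: Iterable[str], present_pages: Iterable[int], total_pages: int
) -> List[int]:
    """Combine error-message parsing with gap detection."""
    # Strategy 1: page numbers mentioned in error messages.
    from_errors = {int(n) for msg in errors for n in PAGE_ERROR_RE.findall(msg)}
    # Strategy 2: expected pages that are absent from the response.
    expected = set(range(1, total_pages + 1))
    from_gaps = expected - set(present_pages)
    return sorted(from_errors | from_gaps)
```

Using both signals covers the case where a page appears in the output but is reported as errored, as well as the case where a page silently goes missing without any error message.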

The RAG example shows the consumer shape: each chunk is a `{text, metadata}` dict where metadata includes `type`, `page`, `bbox`, and `source`. The bbox is the layout coordinate from the Java processor, so downstream vector stores can index geometry alongside text. Source: [examples/python/rag/README.md:21-34]()

```python
{
  "text": "Language model pretraining has led to significant...",
  "metadata": {
    "type": "paragraph",
    "page": 1,
    "bbox": [108.0, 526.2, 286.5, 592.8],
    "source": "1901.03003.pdf"
  }
}
```

## Node.js/TypeScript and MCP Surfaces

The Node binding reuses the same generated schema. `convert-options.generated.ts` exposes every CLI flag with TypeScript types and JSDoc descriptions copied from the Java `OptionDefinition` source (e.g. `hybrid?: string`, `hybridMode?: string`, `hybridTimeout?: string`). Source: [node/opendataloader-pdf/src/convert-options.generated.ts:1-30]()

The MCP server (`opendataloader-pdf-mcp`) targets Claude Desktop, Claude Code, OpenAI Codex, Cursor, and Windsurf. It launches via `uvx opendataloader-pdf-mcp` and forwards tool invocations to the Java/Python pipeline. Its `convert_pdf` tool accepts the same option set as the CLI plus an `output_dir` and validates `format` against an explicit `ext_map` of `json / text / html / markdown / markdown-with-html / markdown-with-images`. Source: [python/opendataloader-pdf-mcp/src/opendataloader_pdf_mcp/server.py:55-80](), and [python/opendataloader-pdf-mcp/README.md:21-58]()
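
Validation against an extension map might look like the sketch below. The keys are the format names listed above; the assumption that every Markdown variant maps to `.md`, as well as the function name and error wording, are illustrative and not taken from `server.py`.

```python
# Format name -> output file extension. The three Markdown variants
# sharing ".md" is an assumption made for this sketch.
EXT_MAP = {
    "json": ".json",
    "text": ".txt",
    "html": ".html",
    "markdown": ".md",
    "markdown-with-html": ".md",
    "markdown-with-images": ".md",
}


def output_extension(fmt: str) -> str:
    """Return the file extension for a format, rejecting unknown names."""
    try:
        return EXT_MAP[fmt]
    except KeyError:
        allowed = ", ".join(sorted(EXT_MAP))
        raise ValueError(f"Unsupported format {fmt!r}; expected one of: {allowed}") from None
```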

## Build, Operations, and Community Concerns

```mermaid
flowchart LR
  A[OPTION_DEFINITIONS in CLIOptions.java] --> B[Java Config / CLI]
  A --> C[cli_options_generated.py]
  A --> D[convert-options.generated.ts]
  B --> E[Python wrapper / uvx]
  C --> E
  D --> F[Node SDK]
  B --> G[MCP server]
  E --> G
  E --> H[FastAPI hybrid_server]
  H --> I[Docling DocumentConverter]
```

Three operational pain points show up repeatedly in the issue tracker:

- **Reproducible Maven resolution (#551):** packaging `opendataloader-pdf` inside a Nix fixed-output derivation requires the full transitive Maven graph to be reproducible. Because Java is the implementation language and option generation walks the dependency tree, any non-deterministic resolution propagates into the PyPI and conda-forge artifacts.
- **Pre-compiled JAR in sdist (#435):** conda-forge currently needs Maven plus OpenJDK at build time to recompile the JAR from sources. Shipping the JAR inside the PyPI source distribution would remove that barrier for the scientific Python ecosystem; this is the same path the Node and MCP surfaces already rely on.
- **Configurable hybrid batch size (#500):** the value `BACKEND_CHUNK_SIZE = 50` is hardcoded in `HybridDocumentProcessor`, which sets the page grouping the Java side sends per HTTP request to `hybrid_server.py`. Community request is to expose it as a CLI/API option, alongside the existing `--hybrid-timeout` flag.

Until those land, the practical guidance is: pin Java/dependency versions when packaging, run the hybrid server with `--max-file-size` and a tuned `--hybrid-timeout` for large PDFs, and use the `--hybrid-fallback` flag to keep a Java-only path available when the Docling backend is unreachable.
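
For intuition about the batch-size item above, the sketch below splits a document into fixed-size page groups, one per backend request. The value 50 comes from the text; the function name, the 1-based contiguous ranges, and the idea of making the size a parameter are assumptions, and the actual grouping in `HybridDocumentProcessor` may differ.

```python
from typing import Iterator

BACKEND_CHUNK_SIZE = 50  # value quoted in the issue discussion above


def page_chunks(total_pages: int, chunk_size: int = BACKEND_CHUNK_SIZE) -> Iterator[range]:
    """Yield contiguous 1-based page ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        yield range(start, min(start + chunk_size, total_pages + 1))
```

With the size exposed as a parameter, a 120-page document produces three requests at the default and fewer, larger requests when a caller raises the value, which is the trade-off against `--hybrid-timeout` that the request is about.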

## See Also

- [Architecture and Extraction Pipeline](Architecture-and-Extraction-Pipeline.md)
- [Hybrid Backend (Docling) Integration](Hybrid-Backend-Integration.md)
- [Output Formats and JSON Schema](Output-Formats-and-JSON-Schema.md)

---


## Pitfall Log

Project: opendataloader-project/opendataloader-pdf

Summary: Found 15 structured pitfall item(s), including 3 high/blocking item(s). Top priority: Configuration risk - Configuration risk requires verification.

## 1. Configuration risk - Configuration risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/566

## 2. Capability evidence risk - Capability evidence risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a capability evidence risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/414

## 3. Runtime risk - Runtime risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/428

## 4. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/440

## 5. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/528

## 6. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/578

## 7. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/584

## 8. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/548

## 9. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/opendataloader-project/opendataloader-pdf

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/opendataloader-project/opendataloader-pdf/issues/581

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/opendataloader-project/opendataloader-pdf

## 12. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/opendataloader-project/opendataloader-pdf

## 13. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/opendataloader-project/opendataloader-pdf

## 14. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/opendataloader-project/opendataloader-pdf

## 15. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/opendataloader-project/opendataloader-pdf

<!-- canonical_name: opendataloader-project/opendataloader-pdf; human_manual_source: deepwiki_human_wiki -->
