kreuzberg Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

kreuzberg

Introduction & Capabilities

Related topics: Workspace Layout & Crate Structure, Language Bindings, FFI & Polyglot, Deployment Modes & Serving

Kreuzberg is a multi-language document extraction library that turns unstructured files (PDFs, Office documents, images, HTML, email, archives, and plain text) into structured text, metadata, and page-level elements. It is published as a Rust core (kreuzberg crate) with first-class bindings for Python, TypeScript/Node, Ruby, Go, and Java, plus an HTTP service wrapper for containerized deployments. Source: README.md:1-40.

The project positions itself as a batteries-included extraction pipeline: file-type detection, MIME routing, text extraction, optional OCR, chunking, embedding, reranking, and keyword extraction are all available behind a single extract_file / extract_file_sync entry point. Source: docs/index.md:1-30.

What Kreuzberg Solves

Most "extract text from a PDF" libraries stop at raw text. Kreuzberg aims to provide the full post-processing pipeline so callers do not have to glue together separate tools:

Format-specific extractors (PDF, DOCX, XLSX, PPTX, HTML, EPUB, Markdown, plain text, email, images, archives).
Optional OCR backends for scanned PDFs and images (Tesseract, PaddleOCR, EasyOCR, GOT-OCR).
Markdown-aware chunking that preserves page boundaries and section headings.
Sentence-level pruning and document reranking for retrieval pipelines.
Embedding generation for downstream vector search.
A consistent ExtractionResult envelope with content, mime_type, metadata, tables, elements, chunks, detected_languages, and processing_warnings.

Source: docs/features.md:1-60, README.md:40-80.
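To make the envelope concrete, here is a minimal sketch of its shape as a Python dataclass. The field names come from the list above; the field types, the defaults, and the `ExtractionResultSketch` and `summarize` names are assumptions for illustration, not the library's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractionResultSketch:
    """Illustrative stand-in for the ExtractionResult envelope (types are assumed)."""

    content: str
    mime_type: str
    metadata: dict[str, Any] = field(default_factory=dict)
    tables: list[Any] = field(default_factory=list)
    elements: list[Any] = field(default_factory=list)
    chunks: list[Any] = field(default_factory=list)
    detected_languages: list[str] = field(default_factory=list)
    processing_warnings: list[Any] = field(default_factory=list)


def summarize(result: ExtractionResultSketch) -> str:
    """One-line summary a caller might log after extraction."""
    return (
        f"{result.mime_type}: {len(result.content)} chars, "
        f"{len(result.chunks)} chunks, {len(result.processing_warnings)} warnings"
    )
```

Because every format returns the same envelope, a helper like `summarize` would not need to branch on the input type.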

Core Capabilities

Extraction Pipeline

The pipeline runs detect → extract → post-process in a single call. File-type detection uses both extension and a magic-byte probe so ambiguous inputs are routed correctly. Source: docs/index.md:30-60.

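import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    # extract_file is awaitable, so run it inside an event loop
    result = await extract_file("report.pdf", config=ExtractionConfig())
    print(result.content)
    print(result.metadata["pages"])

asyncio.run(main())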

Source: docs/getting-started.md:20-50.
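The extension-plus-magic-byte detection described above can be sketched in a few lines. This is not Kreuzberg's implementation: the lookup tables, the `detect_mime` name, and the precedence (magic bytes first, extension as fallback) are assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical lookup tables; the library's real tables are not shown in this manual.
MAGIC_PREFIXES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]
EXTENSION_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".txt": "text/plain",
    ".html": "text/html",
}


def detect_mime(filename: str, head: bytes) -> str:
    """Prefer a magic-byte match, then fall back to the file extension."""
    for prefix, mime in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return mime
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "application/octet-stream")
```

With this ordering, a PDF that was saved with a `.txt` extension is still routed to the PDF handler, which is the kind of ambiguous input the probe is meant to catch.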

OCR Backends

OCR is pluggable. The default OcrConfig() selects Tesseract, and VlmFallbackPolicy controls whether a vision-language model is consulted as a fallback. Recent community reports show that the default OcrConfig() ships with the fallback policy disabled, but the Python binding serializes it as the bare string "disabled", which the Rust deserializer rejects (ValueError: invalid type: string "disabled", expected internally tagged enum VlmFallbackPolicy). Until a fix lands, users are advised to set the fallback policy explicitly rather than rely on the default when the config crosses the FFI boundary. Source: community issue #1150.

The Tesseract shim is compiled by crates/kreuzberg-tesseract/build.rs. The vendored Tesseract 5.x headers require C++17, but the build script does not currently pin -std=c++17, so the build fails on toolchains whose default C++ dialect predates C++17 (community issue #1151).

A separate distribution bug (community issue #1145) bakes the build-runner's TESSDATA path into the released binary; runtime resolution of tessdata is being tracked.

PDF-Specific Behavior

PDF parsing relies on the pdfium rendering pipeline. In v5.x the cheap page-count helper PdfPageIterator(path).__len__() from 4.9 was removed; the only page-oriented API exposed today is render_pdf_page_to_png(bytes, page_index, dpi), so counting pages requires render-probing each index until an error (community issue #1153).

Two text-fidelity bugs are actively tracked: PDF ligature glyphs (e.g. the ft ligature) being mapped to C0 control characters such as U+0003 (community issue #1135), and section headings that appear in the markdown chunker output but are absent from result.elements for the same page (community issue #1098). Wide single-page PDFs can also trigger chunking validation warnings where the page boundary byte offset exceeds the extracted text length (community issue #1148).

Chunking, Embeddings, Reranking, and Pruning

Beyond extraction, Kreuzberg exposes a retrieval-oriented toolkit:

Capability	Purpose
Chunker	Splits content into Markdown-aware chunks while preserving page boundaries.
Embedder	Generates vector embeddings via configurable ONNX/HuggingFace models.
Reranker	Re-scores retrieved chunks for relevance.
Pruner (proposed)	Sentence-level pruning using `open_provence` / Provence to drop off-topic sentences within a document.
Keywords	Language-aware keyword extraction.

Sources: docs/features.md:60-120, community issues #1144 (pruning) and #1149 (PaddleOCR-VL 1.6 / PP-OCRv6 model support).

Deployment Surface

Kreuzberg ships in three runtime shapes:

Library — Rust crate plus language bindings installed via pip install kreuzberg, npm install @kreuzberg/node, gem install kreuzberg, go get github.com/kreuzberg-dev/kreuzberg, or Maven coordinates.
CLI — kreuzberg extract <path> for ad-hoc extraction and batch jobs.
Service — A containerized HTTP server (kreuzberg serve) used for sidecar or microservice deployments.

A known operational issue (community issue #1147) is that the Docker image does not currently trap SIGTERM/SIGINT, so docker stop always hits the 10-second grace period and escalates to SIGKILL. Users relying on graceful shutdown should run the process under an init supervisor until the upstream fix lands.

Enterprise networks also surface a constraint: the embedded HTTP client (reqwest + rustls) does not accept a custom CA bundle, so ONNX / embedding / reranker model downloads from HuggingFace Hub fail behind TLS-MITM proxies (community issue #1146). Custom CA injection is a tracked feature request.

When to Reach for Kreuzberg

Reach for Kreuzberg when you need a single library to cover heterogeneous office and scanned-document inputs, want page-aware chunking without bolting on a separate splitter, and plan to feed the output into an embedding or reranking stage. If your workload is exclusively typed text or you only need raw PDF text, a smaller library may suffice; Kreuzberg's value is the integrated pipeline, the consistent ExtractionResult envelope, and the breadth of supported backends.

Source: docs/architecture.md:1-40, README.md:80-120.

Workspace Layout & Crate Structure

Related topics: Extraction Pipeline & Format Handlers, Language Bindings, FFI & Polyglot

Kreuzberg is published as a multi-crate Cargo workspace that pairs a pure-Rust extraction core with native shims and per-language bindings. The top-level Cargo.toml declares the workspace members and shared dependency versions, while individual crates own narrow responsibilities: document parsing, OCR backends, FFI surfaces, and CLIs. This separation lets the team evolve low-level C/C++ integrations independently from the public API that Python, Node, and Go consumers see.

Workspace Root and Member Layout

The workspace root at Cargo.toml lists every member under the crates/ directory and pins a single resolver version so all crates resolve dependencies consistently. crates/ groups code by capability rather than by language:

kreuzberg — the core library that orchestrates extraction
kreuzberg-ffi — the C-ABI surface consumed by pyo3 and napi-rs bindings
kreuzberg-cli — the standalone command-line binary
kreuzberg-tesseract — the vendored Tesseract C++ shim
kreuzberg-pdfium-render — the PDFium rendering wrapper used for rasterization
kreuzberg-onnx, kreuzberg-embeddings, kreuzberg-reranker — feature-gated ONNX/HF model integrations

The Python and Node packages in packages/ are thin wrappers around the FFI crate; they contain no extraction logic of their own. Source: Cargo.toml:1-60.

Core Library and Feature Flags

crates/kreuzberg/src/lib.rs exposes the ExtractionConfig, OcrConfig, and the entry points extract_file / extract_file_sync. Heavy capabilities are gated behind Cargo features so a minimal build does not pull in OCR or ML dependencies. The default feature set typically enables PDF text extraction and chunking; ocr-tesseract, embeddings, reranker, and vlm are opt-in. Source: crates/kreuzberg/Cargo.toml:1-80.

The core crate re-exports the FFI-facing types so the binding layer does not duplicate definitions. Async runtimes are abstracted behind a thin trait so that tokio stays an implementation detail of the core crate.

Native Shims and the FFI Layer

The OCR backend is the most complex piece of native integration. crates/kreuzberg-tesseract/build.rs compiles a small C++ shim against the vendored Tesseract 5.x headers. Because those headers require C++17, the build script must force the standard rather than rely on the host compiler's default — failure to do so breaks builds on toolchains that default to C++14 (issue #1151). Source: crates/kreuzberg-tesseract/build.rs:1-40.

The shim is intentionally minimal: it exposes a flat extern "C" API (tesseract_new, tesseract_recognize, tesseract_destroy) that the Rust wrapper in crates/kreuzberg-tesseract/src/lib.rs adapts to safe Rust. The Cargo.toml of this crate declares the cc build dependency and the C++ standard explicitly. Source: crates/kreuzberg-tesseract/Cargo.toml:1-50.

crates/kreuzberg-ffi re-exports these safe wrappers across a #[no_mangle] extern "C" boundary. Both pyo3 and napi-rs consume the same symbols, which is why a single core change propagates to every language binding.

CLI, Container, and Build Orchestration

The CLI crate at crates/kreuzberg-cli/src/main.rs is a thin front-end that parses arguments and forwards them to the core library. It is the canonical example of how to embed the library in a standalone binary and is the same binary shipped inside the official Docker image. Source: crates/kreuzberg-cli/src/main.rs:1-120.

Task orchestration lives in Taskfile.yml, which defines tasks for build, test, lint, bench, bindgen, and release. The release task coordinates cross-compilation, wheel building for Python, and npm packaging for Node — all of which depend on the FFI crate being built first.

The Dockerfile installs system Tesseract data and sets TESSDATA_PREFIX so the runtime can locate language packs. It is also where signal handling for the CLI process is determined: issue #1147 reports that the image ignores the SIGTERM sent by docker stop. Source: Dockerfile:1-80.

Crate Dependency Graph

flowchart TD
    Core[kreuzberg<br/>core lib]
    FFI[kreuzberg-ffi<br/>C-ABI]
    CLI[kreuzberg-cli<br/>binary]
    Tess[kreuzberg-tesseract<br/>C++ shim]
    PDF[kreuzberg-pdfium-render]
    ML[kreuzberg-onnx /<br/>embeddings / reranker]
    Py[packages/kreuzberg-python]
    Node[packages/kreuzberg-node]

    Core --> Tess
    Core --> PDF
    Core --> ML
    FFI --> Core
    CLI --> Core
    Py --> FFI
    Node --> FFI

Community-relevant implications of this layout:

Removing PdfPageIterator in 5.x means the PDF page-count path now lives in kreuzberg-pdfium-render and is reached only through render-probing (issue #1153). Source: crates/kreuzberg-pdfium-render/src/lib.rs:1-200.
The TESSDATA_PREFIX baked at build time (issue #1145) crosses the boundary between kreuzberg-tesseract and the container image; resolving it at runtime is a workspace-wide concern, not a single-crate fix.
HF/ONNX model downloads (issue #1146) flow through the kreuzberg-onnx and kreuzberg-embeddings crates; any custom-CA support must be added there so reqwest + rustls trusts enterprise CAs uniformly.
Sentence-level pruning (issue #1144) would slot in as a sibling to kreuzberg-reranker, mirroring its prune_async surface.

This layered structure — core, native shims, FFI, language wrappers, CLI/container — keeps each layer independently testable and makes it possible to ship Python wheels, Node packages, and a standalone binary from the same workspace.

Source: https://github.com/kreuzberg-dev/kreuzberg / Human Manual

Extraction Pipeline & Format Handlers

Related topics: OCR Backends & Configuration, Plugin System, Enrichment & Embeddings, Known Issues, Limitations & Migration Notes

1. Purpose and Scope

The Extraction Pipeline & Format Handlers subsystem is the entry-point stack that turns an input file path or byte payload into the rich ExtractionResult that downstream chunking, embedding, and reranking stages consume. It owns three responsibilities:

Content-type dispatch — selecting the appropriate extractor based on MIME type and file extension.
Per-format extraction — invoking a dedicated handler (PDF, Office, HTML, plain text, OCR-only, image, etc.) that knows how to materialize text, metadata, and page-level structure from a given container.
Cross-format orchestration — chaining extraction with optional OCR fallback, post-processing validation, and element-level annotation before the result is returned.

Source: crates/kreuzberg/src/lib.rs:1-80

2. Pipeline Architecture

The pipeline is staged so that each format handler can be swapped or replaced without touching the orchestration code. At the top level the public API (extract_file, extract_bytes, extract_file_sync, extract_bytes_sync) routes into a shared kernel that performs MIME inference, extractor selection, OCR fallback, and result validation.

flowchart TD
    A[extract_file / extract_bytes] --> B[MIME Detection]
    B --> C{Format Handler}
    C -->|PDF| D[pdf.rs]
    C -->|Office| E[office/*.rs]
    C -->|HTML| F[html.rs]
    C -->|Image| G[ocr via tesseract]
    C -->|Plain text| H[plain_text.rs]
    C -->|Markdown| I[markdown.rs]
    D --> J[Validation + Elements]
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J
    J --> K[ExtractionResult]

Source: crates/kreuzberg/src/pipeline.rs:1-120 — crates/kreuzberg/src/extraction.rs:1-90

Key orchestration concerns:

Synchronous vs async — sync entry points run the pipeline on a blocking thread; async variants preserve tokio cancellation semantics. Both share the same extractor registry.
Configuration injection — ExtractionConfig (including the embedded OcrConfig and PdfConfig) is threaded through every stage, letting handlers observe OCR backend, language hints, and chunking strategy.
Validation & warnings — after extraction the kernel verifies that page-boundary byte offsets land inside the produced content string; mismatches are surfaced as ProcessingWarning entries with source: "chunking" (Processing warning for single-page very wide PDF during chunking, [issue #1148]).
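The boundary check in the last item can be sketched as follows. The `WarningSketch` type, the `check_page_boundaries` name, the message wording, and the assumption that offsets are measured in UTF-8 bytes are all illustrative; only the `source: "chunking"` value comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class WarningSketch:
    """Hypothetical stand-in for a ProcessingWarning entry."""

    source: str
    message: str


def check_page_boundaries(
    content: str, boundaries: list[tuple[int, int]]
) -> list[WarningSketch]:
    """Return one warning per (start, end) byte range that falls outside the content."""
    # Offsets are byte offsets, so compare against the encoded length (UTF-8 assumed).
    content_len = len(content.encode("utf-8"))
    warnings = []
    for page, (start, end) in enumerate(boundaries, start=1):
        if start > end or end > content_len:
            warnings.append(
                WarningSketch(
                    source="chunking",
                    message=(
                        f"page {page}: boundary {start}..{end} "
                        f"exceeds content length {content_len}"
                    ),
                )
            )
    return warnings
```

Comparing against the byte length rather than the character count matters for non-ASCII text, where the two differ.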

3. Format Handlers (Extractors)

The extractors module is an open registry: every format implements the Extractor trait and registers itself against a MIME/extension key resolved by mime.rs.

Handler	Inputs	Notable behavior
`pdf.rs`	PDF bytes, optional password	Renders pages for OCR; ligature glyph handling is currently buggy (`U+0003` for `ft`, [issue #1135]) and section-heading elements can be omitted from `result.elements` while still appearing in markdown ([issue #1098])
`office/*.rs`	DOCX, XLSX, PPTX, etc.	Streams XML parts, preserves structure
`html.rs`	HTML / XHTML	Strips scripts, produces element tree
`markdown.rs`	Markdown text	Pass-through to chunker; future footnote/citation parsing ([issue #649])
`image.rs`	PNG, JPEG, TIFF	Forwards to OCR backend
`plain_text.rs`	`.txt`, `.md`, `.log`	Byte-buffer passthrough with encoding detection

Source: crates/kreuzberg/src/extractors/mod.rs:1-60 — crates/kreuzberg/src/extractors/pdf.rs:1-150 — crates/kreuzberg/src/extractors/markdown.rs:1-80 — crates/kreuzberg/src/mime.rs:1-100
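The registry idea can be illustrated with a small Python analogue. The Rust code uses an Extractor trait; here an extractor is simplified to a plain callable, and `register`, `dispatch`, and `extract_plain_text` are hypothetical names used only for this sketch.

```python
from typing import Callable

# Simplification: an extractor is any callable from raw bytes to text.
Extractor = Callable[[bytes], str]

_REGISTRY: dict[str, Extractor] = {}


def register(mime_type: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers an extractor under a MIME type key."""

    def decorator(func: Extractor) -> Extractor:
        _REGISTRY[mime_type] = func
        return func

    return decorator


@register("text/plain")
def extract_plain_text(data: bytes) -> str:
    """Pass bytes through, replacing undecodable sequences."""
    return data.decode("utf-8", errors="replace")


def dispatch(mime_type: str, data: bytes) -> str:
    """Look up the handler for a MIME type and run it."""
    try:
        extractor = _REGISTRY[mime_type]
    except KeyError:
        raise ValueError(f"no extractor registered for {mime_type}") from None
    return extractor(data)
```

Adding a format then means registering one more function; the dispatch code does not change, which is the property an open registry is meant to provide.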

3.1 PDF Handling

PDF is the most complex extractor. In the 5.x line the cheap PdfPageIterator from 4.9 was intentionally removed ([issue #1153]); the remaining public surface is render_pdf_page_to_png(bytes, page_index, dpi). Counting pages therefore requires a render-probe loop until an error is observed. The handler cooperates with the OCR subsystem when a page is image-only and produces per-page elements consumed by the chunker.
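A render-probe loop of the kind described above might look like the sketch below. It takes the render function as a parameter so it stays independent of how the binding exposes render_pdf_page_to_png. The DPI value of 10, the `max_pages` cap, and catching a broad Exception (the concrete error type is not documented here) are assumptions.

```python
from typing import Callable

# Matches the documented shape render_pdf_page_to_png(bytes, page_index, dpi).
RenderFn = Callable[[bytes, int, int], bytes]


def count_pages_by_probing(
    render: RenderFn, pdf_bytes: bytes, max_pages: int = 10_000
) -> int:
    """Render successive page indexes at low DPI until the renderer raises."""
    for index in range(max_pages):
        try:
            # Low DPI keeps each probe cheap; only success or failure matters.
            render(pdf_bytes, index, 10)
        except Exception:
            # Assumption: any error at this index means "past the last page".
            return index
    return max_pages
```

Note the trade-off: because any exception ends the loop, a corrupt page in the middle of a document would be reported as the end of the file. A caller who needs to distinguish the two cases would have to inspect the error.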

3.2 OCR Integration

When the extractor decides OCR is required, control passes to the OCR module, which delegates to kreuzberg-tesseract. The C++ shim is compiled by crates/kreuzberg-tesseract/build.rs. Known operational issues:

The shim must be built with -std=c++17; the build script currently inherits the compiler default, causing breakage where the default dialect is older ([issue #1151]).
A released binary that statically references the build runner's tessdata path cannot find models on the target machine ([issue #1145]).
The Python binding serializes the VlmFallbackPolicy of a default OcrConfig() as the bare string "disabled", which the Rust deserializer rejects because it expects an internally tagged enum ([issue #1150]).
Support for newer PaddleOCR document models (PaddleOCR-VL 1.6, PP-OCRv6) is tracked as a feature request ([issue #1149]).

Source: crates/kreuzberg/src/ocr/mod.rs:1-140 — crates/kreuzberg-tesseract/build.rs:1-80

4. Known Limitations and Community-Reported Issues

The pipeline exposes several recurring pain points worth knowing when extending or debugging:

Container lifecycle — the Docker image does not respond to SIGTERM, so docker stop hits the 10-second timeout ([issue #1147]).
Network trust — model downloads for ONNX/embedding/reranker use reqwest + rustls with no custom-CA injection, so corporate TLS-MITM proxies break them ([issue #1146]).
Cheap page count — absent in 5.x; users must paginate via render probes ([issue #1153]).
Glyph mapping — PDF ligatures resolve to C0 control characters ([issue #1135]).
Element-vs-markdown divergence — headings visible in markdown chunks can be missing from result.elements ([issue #1098]).
Chunking validation — very wide single-page PDFs trigger Page boundary exceeds content length warnings ([issue #1148]).

Together these define the current contract of the extraction pipeline: accurate MIME-driven dispatch, format-specific extraction, OCR-backed image and PDF paths, and an explicit warning surface for the cases where the pipeline cannot fully reconcile input and output.

OCR Backends & Configuration

Related topics: Extraction Pipeline & Format Handlers, Known Issues, Limitations & Migration Notes

Purpose and Scope

OCR (optical character recognition) is one of the optional extraction layers in kreuzberg. It activates only for documents — primarily rasterized PDFs and images — whose embedded text layer is absent or insufficient. A single Rust-side abstraction sits between the extraction pipeline and the concrete OCR engines, so the rest of the codebase sees a uniform OcrBackend trait regardless of which engine produced the text. Configuration is delivered to that pipeline as an OcrConfig struct that the Python bindings (and other language bindings) deserialize from the user-facing API.

Source: crates/xberg/src/ocr/mod.rs:1-40

Supported Backends

Kreuzberg ships with three first-party OCR backends, each in its own module under crates/xberg/src/ocr/backends/:

Backend	Runtime	Typical use case	File
Tesseract (native)	C FFI → Tesseract 5.x shared library	Desktop/server extraction	`tesseract.rs`
Tesseract (WASM)	Tesseract compiled to `wasm32-unknown-unknown` via the C++ shim	Browser / edge runtimes	`tesseract_wasm.rs`
PaddleOCR	Python call-out (or ONNX) to PaddleOCR / PaddleOCR-VL	Layout-aware document OCR	`paddleocr.rs`

The tesseract_wasm backend reuses the same TesseractBackend wrapper as the native variant; only the underlying engine handle differs (crates/xberg/src/ocr/backends/tesseract.rs:1-80).

Tesseract (native)

The native backend is built and loaded dynamically. The build glue is in crates/kreuzberg-tesseract/build.rs, which compiles a small C++ shim against the vendored Tesseract 5.x headers and emits cargo:rustc-link-lib=tesseract plus a cdylib/staticlib. The shim must be compiled with the C++ standard that those headers require, which is C++17.

Source: crates/kreuzberg-tesseract/build.rs:1-60

A known regression in v5.0.0-rc.30 is that build.rs does not pin -std=c++17, so any toolchain whose default dialect is older than C++17 fails the build. The fix is to forward cxx_std_flags / a manual c++17 flag to the cc::Build invocation (kreuzberg-dev/kreuzberg#1151).

PaddleOCR

PaddleOCR is exposed as an async backend because the first call downloads model weights via the HF/ONNX downloader. Model variants such as PaddleOCR-VL 1.5 are first-class; community requests for PaddleOCR-VL 1.6 and PP-OCRv6 are tracked separately (kreuzberg-dev/kreuzberg#1149). The backend is structured around a Tokio task that caches the model handle in an OnceCell so subsequent pages do not pay the download cost.

Source: crates/xberg/src/ocr/backends/paddleocr.rs:1-90
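The load-once caching described above is a Rust OnceCell pattern. The sketch below is a Python analogue of the same idea, not the backend's code: `LazyModel`, its `loader` argument, and the `load_count` counter are hypothetical names introduced for illustration.

```python
import asyncio


class LazyModel:
    """Loads a model at most once, even when several pages request it concurrently."""

    def __init__(self, loader):
        # loader: hypothetical coroutine function that downloads and loads weights.
        self._loader = loader
        self._handle = None
        self._lock = asyncio.Lock()
        self.load_count = 0  # exposed only so the behaviour can be observed

    async def get(self):
        if self._handle is None:
            async with self._lock:
                # Re-check after acquiring the lock: another task may have loaded it.
                if self._handle is None:
                    self._handle = await self._loader()
                    self.load_count += 1
        return self._handle
```

The second check inside the lock is what keeps concurrent first calls from each triggering a download.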

Configuration

The user-facing configuration object is OcrConfig. It carries:

The backend enum (Tesseract, TesseractWasm, PaddleOcr, plus a Disabled sentinel).
Backend-specific sub-structs gated by cfg attributes — e.g. TesseractConfig { language, psm, oem, tessdata_dir } and PaddleOcrConfig { model_variant, use_angle_cls, lang }.
A vlm_fallback: VlmFallbackPolicy field that controls whether a Vision-Language Model is consulted when OCR confidence is low.

Source: crates/xberg/src/ocr/config.rs:1-120

VlmFallbackPolicy is an internally tagged enum with the variants Disabled, OnLowConfidence, and Always. Calling extract_file with a default-constructed OcrConfig() currently raises a ValueError because the Python binding serialises the Disabled variant as a bare string rather than the tagged object form the deserialiser expects (kreuzberg-dev/kreuzberg#1150). Workarounds until the fix lands: construct the enum value explicitly, or upgrade to a build whose bindings emit the tag.

The Python layer mirrors OcrConfig as a Pydantic model (crates/xberg/src/ocr/python_bindings.rs:1-80).
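The difference between the two wire shapes is easy to see in JSON. In serde's internally tagged representation the variant name travels in a tag field of an object. The tag field name `"type"` and the lowercase variant spellings below are assumptions, since this manual does not show the actual schema.

```python
import json

# Assumed wire spellings of the three variants.
VARIANTS = {"disabled", "on_low_confidence", "always"}


def as_bare_string(variant: str) -> str:
    """The shape the issue reports being sent: a plain JSON string."""
    return json.dumps(variant)


def as_internally_tagged(variant: str, tag: str = "type") -> str:
    """An internally tagged shape: an object whose tag field names the variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    return json.dumps({tag: variant})
```

A deserializer that expects the second shape sees a string where it wants an object, which matches the "invalid type: string" wording of the reported error.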

Tessdata and Model Resolution

A recurring source of silent failures has been the hard-coding of build-time paths into the released binary. Tesseract resolves its language models via TESSDATA_PREFIX, and earlier 5.x releases baked the CI runner's path into the binary, so any other host would get the error "could not create TXT output file: No such file or directory" from Tesseract even with valid input (kreuzberg-dev/kreuzberg#1145).

The intended resolution order in tessdata_manager.rs is now:

1. Explicit tessdata_dir from OcrConfig.
2. The TESSDATA_PREFIX / TESSDATA_PATH environment variable.
3. A user-configurable cache directory under the platform data dir (directories::ProjectDirs::data_dir().join("kreuzberg/tessdata")).

Source: crates/xberg/src/ocr/tessdata_manager.rs:1-100
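The resolution order above can be sketched as a small pure function. This is a Python illustration, not a port of tessdata_manager.rs: the `resolve_tessdata_dir` name, passing the cache directory in as an argument, and checking TESSDATA_PREFIX before TESSDATA_PATH are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tessdata_dir(
    explicit_dir: Optional[str],
    cache_dir: Path,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick the tessdata directory: explicit config, then environment, then cache."""
    # 1. Explicit tessdata_dir from the config wins.
    if explicit_dir:
        return Path(explicit_dir)
    # 2. Environment variables (the order between the two is an assumption).
    for name in ("TESSDATA_PREFIX", "TESSDATA_PATH"):
        value = env.get(name)
        if value:
            return Path(value)
    # 3. Fall back to the cache directory under the platform data dir.
    return cache_dir
```

Taking `env` as a parameter keeps the function testable and makes it clear that nothing is read at build time, which is the property the earlier releases lacked.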

For PaddleOCR and ONNX-backed embeddings the equivalent manager is model_downloader.rs. It uses reqwest + rustls and currently does not honour a custom CA bundle, which breaks downloads behind corporate TLS-MITM proxies (kreuzberg-dev/kreuzberg#1146). The remediation tracked on that issue exposes a custom_ca_bundle: Option<PathBuf> knob on the relevant config structs and forwards it into a rustls::ClientConfig.

Source: crates/xberg/src/ocr/model_downloader.rs:1-90

Cross-cutting Concerns

Selection at extraction time. The orchestrator in mod.rs looks at mime_type, OcrConfig.backend, and the presence/absence of a text layer to decide whether OCR runs, and which backend is invoked.
Async behaviour. Native Tesseract wraps blocking FFI in tokio::task::spawn_blocking; PaddleOCR runs the model in-process but yields between pages.
WASM parity. The WASM backend shares the same call sites as native Tesseract, so configs are portable across desktop and browser deployments provided the bundle is built with the matching feature flag.
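The blocking-call pattern in the async item has a close Python analogue: `asyncio.to_thread` plays the role that `tokio::task::spawn_blocking` plays in Rust. The sketch below uses a stand-in `blocking_ocr` function rather than a real OCR call; both function names are hypothetical.

```python
import asyncio
import time


def blocking_ocr(page: bytes) -> str:
    """Stand-in for a blocking native call; a real backend would run OCR here."""
    time.sleep(0.01)
    return f"text for {len(page)} bytes"


async def ocr_pages(pages: list[bytes]) -> list[str]:
    """Run each blocking call on a worker thread so the event loop stays responsive."""
    tasks = (asyncio.to_thread(blocking_ocr, page) for page in pages)
    return await asyncio.gather(*tasks)
```

Results come back in the same order as the input pages, because `asyncio.gather` preserves argument order regardless of which thread finishes first.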

Refer to mod.rs for the dispatch logic and config.rs for the canonical list of fields; both are the entry points for any new backend or option.

Language Bindings, FFI & Polyglot

Related topics: Workspace Layout & Crate Structure, Plugin System, Enrichment & Embeddings

Kreuzberg is built as a layered system: a Rust core extraction engine, a stable C ABI surface (kreuzberg-ffi), and language-specific bindings that re-export the ABI to Python and Node.js. A separate kreuzberg-tesseract crate bridges the Rust OCR layer to Tesseract's C++ implementation through bindgen + a cc-compiled shim. Every other language integration ultimately goes through one of these crates.

Architecture

The binding stack is organized into three concentric layers (engine, C ABI, bindings) plus a native shim, with the C ABI acting as the contract between the engine and any downstream consumer.

Layer	Crate	Role
Engine	`kreuzberg`	Pure-Rust extractors, OCR backends, chunking, embeddings
C ABI	`kreuzberg-ffi`	`extern "C"` functions, opaque pointers, error codes
Bindings	`kreuzberg-py`, `kreuzberg-node`	Idiomatic Python (PyO3) and TypeScript (napi-rs) APIs
Native shim	`kreuzberg-tesseract`	C++ wrapper for vendored Tesseract headers

The FFI crate exports a flat set of #[no_mangle] pub unsafe extern "C" fn … declarations in lib.rs, while generated headers in include/kreuzberg.h describe the binary contract for non-Rust callers (C/C++, Go, Ruby, JVM via JNI, etc.). Source: crates/kreuzberg-ffi/src/lib.rs:1-40

C ABI Surface

kreuzberg-ffi is the single source of truth for cross-language behavior. Functions follow a consistent pattern: input bytes are passed as *const c_char + length, configuration is deserialized from JSON, and results are returned through out-parameters with a typed error code. Cancellation tokens are tracked through thread-local state in cancellation.rs, exposed to bindings as opaque handles. Source: crates/kreuzberg-ffi/src/cancellation.rs:1-60

Configuration loading is done in the binding layer, not inside the C ABI, so JSON schema mismatches surface immediately in the host language. The loader.rs module accepts a flat JSON object representing ExtractionConfig and produces a normalized form before any extraction call. Source: crates/kreuzberg-ffi/src/config/loader.rs:1-80

Public functions exported today include the synchronous extract_file_sync, extract_bytes_sync, the async pair (extract_file, extract_bytes), helpers for rendering PDF pages (render_pdf_page_to_png), and config getters/setters for the global registry. Memory returned across the boundary is freed via matching kreuzberg_*_free functions declared in kreuzberg.h. Source: crates/kreuzberg-ffi/include/kreuzberg.h:1-120

Python Bindings (PyO3)

kreuzberg-py wraps the C ABI through PyO3 submodules. The top-level __init__.py re-exports the typed dataclasses (ExtractionConfig, OcrConfig, ChunkingConfig, …) and the extract_file / extract_file_sync entry points. Pydantic-style enums (e.g. VlmFallbackPolicy) are translated to #[pyclass] enums on the Rust side and exposed with both Python and JSON representations.

The bindings faithfully forward every config field, including defaults. This is also where a recent issue surfaced: the default OcrConfig() instance serializes VlmFallbackPolicy.Disabled as the bare string "disabled", but the engine expects the internally tagged enum representation, so users who pass a default OcrConfig() to extract_file_sync get ValueError: invalid type: string "disabled". The fix requires either shipping an untagged enum adapter at the FFI boundary or sending the discriminator explicitly. Source: crates/kreuzberg-py/src/lib.rs:1-160

Long-running calls release the GIL via Python::allow_threads so that synchronous OCR does not block other Python threads. Async entry points return pyo3_asyncio-style awaitables backed by tokio::spawn on the Rust runtime.

Node.js Bindings (napi-rs)

kreuzberg-node is generated from the same C ABI but exposed through napi-rs for ergonomic TypeScript. Type declarations in index.d.ts describe extractFile, extractBytes, configuration interfaces, and the renderPdfPageToPng helper. Buffers cross the boundary as Buffer on the JS side and Vec<u8> on the Rust side through a thin wrapper that copies once and hands ownership back to the JS runtime.

Errors from the C ABI's typed error codes (e.g. KREUZBERG_ERROR_INVALID_CONFIG) are translated into a discriminated KreuzbergError union in index.d.ts, preserving the diagnostic context for try / catch consumers. Source: crates/kreuzberg-node/src/lib.rs:1-120

Native Shim: Tesseract C++

kreuzberg-tesseract is the only native (non-pure-Rust) layer in the project. build.rs invokes bindgen against the vendored Tesseract 5.x headers and compiles a thin C++ shim with the cc crate that wraps tesseract::TessBaseAPI. The shim is the C ABI surface that Rust calls into.

A regression in v5.0.0-rc.30 removed the explicit -std=c++17 flag, falling back to the compiler's default dialect. Because the Tesseract 5.x headers use C++17 constructs, this fails on toolchains whose default standard is older. The fix is to pass -std=c++17 (or higher) explicitly to the cc::Build invocation in build.rs, for example through its flag method, rather than relying on defaults. Source: crates/kreuzberg-tesseract/build.rs:1-60

The shim also encodes the build runner's tessdata directory path at link time, so binaries produced on CI cannot find Tesseract's trained models on a different machine. Workaround patterns exposed in the community include passing TESSDATA_PREFIX at runtime or making the path a OnceCell-initialized constant resolved at first use. Source: crates/kreuzberg-tesseract/src/lib.rs:1-80

Known Polyglot Pain Points

Three recurring issues span the boundary layer:

Enum/JSON drift — see OcrConfig() default above. Any internally-tagged enum needs the binding layer to embed the discriminant, not the value. Source: crates/kreuzberg-ffi/src/config/loader.rs:120-180
TLS/CA customization — the embedded reqwest + rustls stack does not currently accept a custom CA bundle, so model-download features fail behind corporate TLS-MITM proxies. The fix is to expose SSL_CERT_FILE style configuration through the FFI.
Long-lived filesystem paths — released binaries bake in build-time paths (TESSDATA, model cache). Resolving these at runtime from environment variables is the recommended pattern.

Each of these surfaces first in user-facing bindings (Python, Node) even when the underlying cause lives two layers down in Rust, which is why the FFI crate's contract must be the boundary at which the fix is anchored.

Plugin System, Enrichment & Embeddings

Related topics: Extraction Pipeline & Format Handlers, OCR Backends & Configuration

Overview and Architectural Role

The Plugin System is kreuzberg's extension surface, letting integrators add custom extraction, post-processing, and validation logic without modifying the core extraction pipeline. It is designed around three separable trait families that compose into a single synchronous or asynchronous extraction flow, alongside a registry responsible for discovery, ordering, and lifecycle management.

Extractor plugins decide *what* content to pull from bytes (e.g. PDF, DOCX, OCR-rendered PNG, VLM-fallback outputs).
Processor plugins operate on the produced text/elements to *enrich* them (chunking, embeddings, reranking, pruning).
Validator plugins enforce *quality* constraints on intermediate or final outputs (chunk-boundary checks, ligature normalization, element completeness).

The registry acts as the single source of truth for which implementations are active during a run, and orchestrator code consults it before each phase.

Source: crates/xberg/src/plugins/mod.rs:1-80 Source: crates/xberg/src/plugins/traits.rs:1-120

Plugin Trait Hierarchy

traits.rs defines shared marker types, error types, and async-or-sync execution semantics. Each concrete trait in this module re-exports a canonical process(...) (or equivalent) entry point plus an init/shutdown lifecycle hook. The hierarchy is intentionally flat: a plugin only implements the trait that matches its role rather than inheriting from a base class.

Trait	Module	Purpose
`Extractor`	`plugins/extractor/trait.rs`	Pull bytes or rendered representations from a source
`Processor`	`plugins/processor/trait.rs`	Transform extracted content (chunking, embeddings)
`Validator`	`plugins/validator/trait.rs`	Assert invariants on extracted/chunked payloads

Source: crates/xberg/src/plugins/extractor/trait.rs:1-60 Source: crates/xberg/src/plugins/processor/trait.rs:1-60 Source: crates/xberg/src/plugins/validator/trait.rs:1-60

Each trait carries its own configuration struct (e.g. ExtractorConfig, ProcessorConfig) so plugins can read user-supplied options without globals. Trait bounds require Send + Sync so the registered plugins can run from worker threads in the async extraction path.

Source: crates/xberg/src/plugins/traits.rs:120-200

Registry, Discovery, and Ordering

registry/mod.rs maintains a process-wide map from plugin identifier to trait-object instance, separated per phase. Registration can be explicit (calling register::<MyExtractor>()) or driven from configuration files; lookup is by MIME-type / file-extension for extractors and by name for processors and validators.

Key behaviors observed in the module:

*Idempotent registration*: registering the same identifier twice returns the existing handle, preventing duplicate side effects (e.g. double-loading ONNX models in embedding plugins).
*Priority ordering*: extractors sort by descending priority, with built-ins carrying fixed weights so user plugins can override or insert earlier.
*Async-friendly handles*: registration stores Arc<dyn Trait> to share across threads without re-construction.

Source: crates/xberg/src/plugins/registry/mod.rs:1-120
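The idempotent-registration and priority-ordering behaviors listed above can be illustrated with a small model. This is a hedged sketch in Python rather than the Rust registry itself; `ExtractorRegistry`, `ExtractorEntry`, and the priority values are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ExtractorEntry:
    name: str
    priority: int
    mime_types: frozenset
    plugin: object


class ExtractorRegistry:
    """Toy registry: one entry per identifier, looked up by MIME type."""

    def __init__(self) -> None:
        self._by_name: Dict[str, ExtractorEntry] = {}

    def register(
        self,
        name: str,
        factory: Callable[[], object],
        *,
        priority: int,
        mime_types: Iterable[str],
    ) -> ExtractorEntry:
        existing = self._by_name.get(name)
        if existing is not None:
            # Idempotent: the factory is not called again, so expensive
            # setup (for example loading a model) happens only once.
            return existing
        entry = ExtractorEntry(name, priority, frozenset(mime_types), factory())
        self._by_name[name] = entry
        return entry

    def candidates(self, mime_type: str) -> List[ExtractorEntry]:
        # Highest priority first, so a user plugin with a larger weight
        # is consulted before a built-in.
        matches = [e for e in self._by_name.values() if mime_type in e.mime_types]
        return sorted(matches, key=lambda e: e.priority, reverse=True)
```

Registering a user extractor with a priority above the built-in weight places it first in `candidates(...)` for the same MIME type, which is the override path the ordering rule describes.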

flowchart LR
    A[Extraction Request] --> B[Extractor Registry]
    B -->|match MIME| C[Extractor Plugin]
    C --> D[Raw Content + Elements]
    D --> E[Processor Registry]
    E -->|chunking| F[Text Chunker]
    E -->|embeddings| G[Embedding Plugin]
    E -->|rerank/prune| H[Reranker Plugin]
    D --> I[Validator Registry]
    I --> J{All Validators Pass?}
    J -->|no| K[Emit processing_warnings]
    J -->|yes| L[Return ExtractionResult]

Source: crates/xberg/src/plugins/mod.rs:80-160 Source: crates/xberg/src/plugins/registry/mod.rs:120-220

Enrichment: Processors, Embeddings, and Pruning

Processors are where the bulk of user value lives. Built-in processor implementations cover:

Text chunking — produces overlapping chunks that preserve page boundaries. Validators may attach warnings when boundaries exceed extracted-text length (see issue #1148 for the related behavior).
Embedding generation — downloads HF/ONNX embedding/reranker models at first use and caches them on disk.
Sentence-level pruning — drops off-topic sentences using a Provence-class model, mirroring reranker semantics (issue #1144).
Markdown enrichment — footnote/citation parsing (issue #649).

Embedding and reranker plugins share a download/cache layer that uses reqwest + rustls. Issue #1146 documents that this stack does not currently honor custom CAs, which breaks deployments behind a corporate TLS-MITM proxy. Workaround guidance is to pre-warm the cache directory or supply a mirror via configuration rather than relying on the embedded client.

Source: crates/xberg/src/plugins/processor/trait.rs:60-180 Source: crates/xberg/src/plugins/mod.rs:160-260 Source: crates/xberg/src/plugins/registry/mod.rs:220-320

Validators and Result Post-processing

Validators run after extraction and after each processor whose output they declare interest in. The Validator trait exposes a non-destructive validate(...) returning a structured report that the orchestrator folds into result.metadata or result.processing_warnings. Common validators check:

Chunk-boundary byte offsets vs. text length.
PDF ligature-to-glyph mappings (issue #1135: ligatures are currently emitted as C0 control characters; a validator can flag affected output for downstream consumers).
Heading presence parity between markdown output and result.elements (issue #1098 reports that the two can diverge; wrap the check in a validator when strict parity is required).

Source: crates/xberg/src/plugins/validator/trait.rs:60-160 Source: crates/xberg/src/plugins/traits.rs:200-280
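A non-destructive validator of the kind described above might look like the following sketch. It covers only the chunk-boundary check; `ValidationReport` and `validate_page_boundaries` are hypothetical names, and the assumption that offsets are UTF-8 byte offsets is inferred from the "byte offsets" wording rather than confirmed by the source.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ValidationReport:
    warnings: Tuple[str, ...]

    @property
    def ok(self) -> bool:
        return not self.warnings


def validate_page_boundaries(
    text: str, boundaries: Sequence[Tuple[int, int]]
) -> ValidationReport:
    """Check (start, end) byte offsets per page against the text length.

    The input is never modified; problems are only reported.
    """
    limit = len(text.encode("utf-8"))  # assumed: offsets count UTF-8 bytes
    warnings = []
    for page, (start, end) in enumerate(boundaries, start=1):
        if start < 0 or start > end:
            warnings.append(f"page {page}: invalid range {start}..{end}")
        elif end > limit:
            warnings.append(
                f"page {page}: end offset {end} exceeds text length {limit}"
            )
    return ValidationReport(tuple(warnings))
```

An orchestrator could then append `report.warnings` to its warning list and return the result unchanged, which matches the "report, do not mutate" contract of validate(...).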

Putting It Together

To extend kreuzberg, integrators implement the relevant trait, call the typed registration API on the registry, and optionally ship configuration through the normal config layer. The orchestrator in plugins/mod.rs walks extractors → processors → validators in that order, threading shared state through an extractor-context object so plugins can record warnings, attach metadata, and request follow-on work without re-parsing input.

For richer workflows such as OCR-backed VLM pipelines, processors are free to invoke extractors recursively through the registry, which keeps the chain explicit and observable through the same warning/metadata surfaces.

Source: crates/xberg/src/plugins/mod.rs:260-360 Source: crates/xberg/src/plugins/registry/mod.rs:320-420

Deployment Modes & Serving

Related topics: Introduction & Capabilities, Known Issues, Limitations & Migration Notes

Kreuzberg ships as a multi-surface extraction engine. Beyond the in-process library API used by the Python, Node, and Go bindings, the project exposes two long-lived serving surfaces, an HTTP REST API and an MCP (Model Context Protocol) server, plus a standalone CLI for batch and scripting use. This section describes how these three runtimes are wired, how their configuration is loaded, and what operators must control at runtime (paths, certificates, model caches, shutdown semantics) so that a released binary behaves the same on a developer laptop, a CI runner, and a containerized production host.

High-Level Role and Scope

The library core in crates/kreuzberg is engine-agnostic; extraction logic, OCR backends, chunkers, embedding models, and the reranker are composed at startup. The serving crates sit above that core and decide three things:

Surface — whether work arrives over HTTP, MCP/JSON-RPC, or stdin/argv from the CLI.
Concurrency model — the HTTP API uses an async router with tokio; the MCP server uses its own JSON-RPC framing over stdio or TCP; the CLI is synchronous, one-document-per-invocation.
Configuration resolution — every surface funnels through the same ServerConfigLoader, so a single server.json plus environment-variable overrides is honored consistently.

Source: crates/kreuzberg/src/api/mod.rs:1-40

Deployment Modes

1. HTTP REST API

The HTTP surface is built on axum and exposed through api::router. Routes such as POST /extract, POST /extract/batch, and GET /health are registered with explicit handlers; the router is intentionally thin and delegates extraction to the library core.

The startup module (api::startup) is the only place that knows how to bind to a host/port, install tracing, load config, and assemble state. It exposes a run() entry point that resolves the bind address from configuration, then awaits the server's with_graceful_shutdown future so SIGTERM can drain in-flight requests.

This last point matters for operators: a container that ignores SIGTERM will hit Docker's 10-second kill window and lose requests mid-extraction. Wiring with_graceful_shutdown into axum::serve is what makes docker stop behave correctly.

Source: crates/kreuzberg/src/api/startup.rs:30-95
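The drain-on-SIGTERM behavior can be modeled in a few lines. The sketch below uses Python's asyncio instead of axum and tokio, so it illustrates the pattern only; `drain` and `install_sigterm_handler` are hypothetical helpers, and the grace period stands in for the shutdown timeout discussed above.

```python
import asyncio
import signal
from typing import Set


async def drain(in_flight: Set[asyncio.Task], grace_period: float) -> int:
    """Wait up to grace_period seconds for running requests to finish.

    Tasks still running after the deadline are cancelled. Returns the
    number of tasks that had to be cancelled.
    """
    if not in_flight:
        return 0
    _done, pending = await asyncio.wait(in_flight, timeout=grace_period)
    for task in pending:
        task.cancel()
    if pending:
        # Let cancelled tasks unwind before the process exits.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


def install_sigterm_handler(
    loop: asyncio.AbstractEventLoop, stop: asyncio.Event
) -> None:
    """Set `stop` when SIGTERM arrives (Unix event loops only)."""
    loop.add_signal_handler(signal.SIGTERM, stop.set)
```

A server loop would await `stop.wait()`, stop accepting new connections, and then call `drain(...)` with a grace period shorter than the container runtime's kill timeout so that the process exits on its own.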

2. MCP Server

The MCP server (mcp::server) speaks JSON-RPC 2.0 over either stdio or a configurable TCP listener. It registers a small set of tools — extract_file, extract_bytes, detect_mime_type, and health_check — that map 1-to-1 onto the library core.

Because MCP clients (e.g. Claude Desktop, agent runtimes) launch the server as a subprocess, stdio is the default transport. The server implements a clean shutdown on EOF / SIGTERM, which is the contract those clients rely on to terminate sessions.

Source: crates/kreuzberg/src/mcp/server.rs:20-80
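For readers unfamiliar with the framing, the sketch below shows a generic JSON-RPC 2.0 request handler for one line of input. It is a simplification and not kreuzberg's server: it dispatches directly on the method name, whereas a real MCP server wraps tool invocation in the protocol's own methods, and `handle_request` is a hypothetical name. The error codes are the ones defined by the JSON-RPC 2.0 specification, except -32000, which falls in the implementation-defined range.

```python
import json
from typing import Any, Callable, Dict


def _error(req_id: Any, code: int, message: str) -> str:
    return json.dumps(
        {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}
    )


def handle_request(line: str, tools: Dict[str, Callable[..., Any]]) -> str:
    """Decode one JSON-RPC 2.0 request and return the encoded response."""
    try:
        request = json.loads(line)
    except json.JSONDecodeError:
        return _error(None, -32700, "Parse error")
    if not isinstance(request, dict) or "method" not in request:
        return _error(None, -32600, "Invalid Request")

    req_id = request.get("id")
    tool = tools.get(request["method"])
    if tool is None:
        return _error(req_id, -32601, "Method not found")

    params = request.get("params") or {}
    try:
        # JSON-RPC allows params by position (array) or by name (object).
        result = tool(*params) if isinstance(params, list) else tool(**params)
    except Exception as exc:  # report tool failures as server errors
        return _error(req_id, -32000, str(exc))
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})
```

Over stdio, a loop would read one request per line, write the returned string followed by a newline, and exit when the input reaches EOF.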

3. CLI

kreuzberg-cli is a synchronous binary. Each invocation parses args, loads a single config (CLI flags > env > file), and calls into the core once. It does not start an event loop; it is intended for scripts, CI smoke tests, and one-off conversions, not as a daemon.

Source: crates/kreuzberg-cli/src/main.rs:15-60

Configuration & Runtime Resolution

All serving surfaces resolve configuration through core::server_config::loader::ServerConfigLoader. The loader applies a precedence chain:

CLI flags / programmatic overrides
Environment variables (KREUZBERG_*)
Project server.json / kreuzberg.toml
Built-in defaults
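The precedence chain above amounts to a layered merge in which later layers win. The following is a minimal sketch of that idea, not the ServerConfigLoader itself: `resolve_config`, the `DEFAULTS` values, and the lower-casing of variable names are assumptions made for the example, while the `KREUZBERG_` prefix comes from the list.

```python
from typing import Any, Dict, Mapping

# Hypothetical defaults, used only for this illustration.
DEFAULTS: Dict[str, Any] = {"host": "127.0.0.1", "port": 8000}


def resolve_config(
    cli: Mapping[str, Any],
    env: Mapping[str, str],
    file_values: Mapping[str, Any],
    defaults: Mapping[str, Any] = DEFAULTS,
    prefix: str = "KREUZBERG_",
) -> Dict[str, Any]:
    """Merge layers so that CLI > environment > file > defaults."""
    env_values = {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }
    merged = dict(defaults)
    # Apply the lowest-priority layer first; each later layer overrides it.
    for layer in (file_values, env_values, cli):
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```

Environment values arrive as strings, so a real loader would also need to coerce them to the target types; the sketch leaves them untouched to keep the merge order visible.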

A consistent loader matters because several runtime decisions depend on it:

Decision	Where it is resolved	Why it matters
OCR backend / tessdata path	`ServerConfig::ocr`	Path must be resolved at runtime; baking the build-runner path into the binary breaks OCR on any other host. `Source: crates/kreuzberg/src/core/server_config/loader.rs:50-120`
Embedding / reranker model source	`ServerConfig::models`	Decides whether models are pulled from HuggingFace Hub or a local mirror. The HTTP client used for the pull must trust the operator's CA bundle when sitting behind a TLS-MITM proxy. `Source: crates/kreuzberg/src/api/startup.rs:60-110`
Bind address & graceful-shutdown timeout	`ServerConfig::server`	Controls `addr:port` and `shutdown_grace_period_secs`, the latter being what determines whether `docker stop` returns within Docker's 10s timeout. `Source: crates/kreuzberg/src/api/router.rs:1-45`

Containerized Deployment & Common Pitfalls

The official Dockerfile builds a release binary and runs it under the default HTTP mode. When deploying the image, three issues recur in community reports:

SIGTERM handling. Without with_graceful_shutdown, the process is killed at the 10-second Docker default and may corrupt cached state.
Tessdata path portability. Tesseract language files must be discoverable at runtime via the resolved TESSDATA_PREFIX, not hard-coded to the build host.
Custom CA trust for model downloads. Operators behind corporate TLS-MITM proxies need the option to inject a CA bundle; otherwise HF/ONNX model fetches fail and embedding/reranker features silently degrade.

Source: Dockerfile:1-60

Choosing a Mode

Use the HTTP API when a long-lived service must accept multi-tenant requests, be horizontally scaled, and benefit from graceful drain. Use the MCP server when the consumer is an LLM agent or IDE that already speaks JSON-RPC. Use the CLI when work is one-shot or scripted. All three share the same config, model cache, and OCR resolution, so moving between them is a deployment decision, not a behavior change.

Known Issues, Limitations & Migration Notes

Related topics: Extraction Pipeline & Format Handlers, OCR Backends & Configuration, Deployment Modes & Serving

Purpose and Scope

This page consolidates the high-impact issues, breaking changes, and unresolved limitations that affect users of kreuzberg 5.x. It complements the API documentation by documenting behaviours that appear only at integration, build, or deployment time. The content is sourced from open issue reports and the current 5.x source tree so that downstream consumers can plan migrations from the 4.x series and avoid known failure modes. Source: crates/xberg/src/pdf/mod.rs:1-1.

Migration Notes: 4.x → 5.x

PDF Page Counting Has Become Expensive

In 4.9, PdfPageIterator(path).__len__() returned the page count by reading the PDF structure without rasterising. 5.x removed PdfPageIterator, leaving render_pdf_page_to_png(bytes, page_index, dpi) as the only per-page entry point. Counting pages now requires render-probing each index until a R… error is raised, which is materially more expensive. Source: crates/xberg/src/pdf/mod.rs:1-1.
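The render-probing workaround can be sketched as below. The renderer is passed in as a callable with the (bytes, page_index, dpi) shape given above, so the snippet makes no claim about how the function is imported; `count_pages_by_probing`, the low default DPI, and the `max_pages` guard are hypothetical. Because the exact error type is not stated, the sketch treats any exception as "past the last page", which can also hide genuine rendering failures.

```python
from typing import Callable


def count_pages_by_probing(
    render_page: Callable[[bytes, int, int], bytes],
    pdf_bytes: bytes,
    dpi: int = 36,
    max_pages: int = 10_000,
) -> int:
    """Count pages by rendering index 0, 1, 2, ... until rendering fails.

    A low DPI is used on the assumption that it makes each probe cheaper;
    max_pages bounds the loop if the renderer never raises.
    """
    index = 0
    while index < max_pages:
        try:
            render_page(pdf_bytes, index, dpi)
        except Exception:
            # Assumed to mean the index is out of range.
            break
        index += 1
    return index
```

The cost is one render per page plus one failed probe, which is why this is materially more expensive than reading the count from the document structure.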

Tessdata Path Is Now Build-Relative

Earlier releases embedded the build runner's tessdata directory into the produced binary, so OCR silently broke on any host other than the CI builder. The current 5.x branch resolves the directory at runtime, but downstream users with hard-coded paths in deployment manifests may need to set the new variable explicitly. Source: crates/xberg/src/ocr/tessdata_manager.rs:1-1.

OCR Default Config Serialisation

A default OcrConfig() passed to extract_file / extract_file_sync raises ValueError: invalid type: string "disabled", expected internally tagged enum VlmFallbackPolicy. The Python binding serialises the enum's default as the bare string "disabled", but the core crate expects an internally tagged representation. Users relying on the previous implicit default must construct OcrConfig fields explicitly when migrating. Source: crates/xberg/src/core/config/ocr.rs:1-1.
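The mismatch between the two representations can be made concrete with a small conversion helper. This is an illustration of the serialisation shapes only, not a supported workaround: `to_internally_tagged` is hypothetical, and the tag key `"type"` is an assumption, since the text does not say which field name the core crate uses.

```python
import json
from typing import Any, Dict, Union


def to_internally_tagged(
    policy: Union[str, Dict[str, Any]], tag_key: str = "type"
) -> Dict[str, Any]:
    """Convert a bare variant name into an internally tagged object.

    "disabled" becomes {"type": "disabled"}; a dict that already carries
    the tag is returned as a copy.
    """
    if isinstance(policy, str):
        return {tag_key: policy}
    if tag_key not in policy:
        raise ValueError(f"missing discriminant field {tag_key!r}")
    return dict(policy)


# The bare string is what the binding is reported to emit; the object
# form is the shape an internally tagged enum expects.
bare = json.dumps("disabled")
tagged = json.dumps(to_internally_tagged("disabled"))
```

Here `bare` is a JSON string and `tagged` is a JSON object, which is exactly the difference the "invalid type: string" error message points at.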

Chunking Boundary Offsets

Single-page very wide PDFs trigger the warning Validation error: Page boundary byte offset exceeds the extracted text length. The chunker should clamp or skip invalid boundaries before reporting; until then, downstream code that asserts len(progress) == n_pages will see spurious processingWarnings. Source: crates/xberg/src/pdf/mod.rs:1-1.

Known Bugs & Limitations

Lossy PDF Ligature Mapping

When extracting text from PDFs, typographic ligature glyphs (notably the ft ligature) are emitted as C0 control characters such as U+0002 and U+0003. The mapping table appears to leave ligature glyph IDs unmapped instead of substituting their component letters, breaking downstream search and regex pipelines. Source: crates/xberg/src/pdf/mod.rs:1-1.
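Until the mapping is fixed, extracted text can be scanned for stray C0 control characters and, where the intended letters are known, remapped. The sketch below is a post-processing illustration; `find_c0_controls` and `remap_c0_controls` are hypothetical helpers, and the replacement table is supplied by the caller because the source does not state which control character stands for which ligature.

```python
import re
from typing import Dict, List, Tuple

# C0 controls other than tab, line feed, and carriage return.
_C0_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_c0_controls(text: str) -> List[Tuple[int, str]]:
    """Return (index, code point label) for each suspicious control char."""
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in _C0_CONTROLS.finditer(text)
    ]


def remap_c0_controls(text: str, mapping: Dict[str, str]) -> str:
    """Replace control characters found in `mapping`; leave others as-is.

    The mapping is caller-supplied and document-specific (an assumption of
    this sketch), for example {"\x02": "ft"} once verified by inspection.
    """
    return _C0_CONTROLS.sub(
        lambda match: mapping.get(match.group(), match.group()), text
    )
```

Running `find_c0_controls` first gives the positions to inspect against the original PDF before committing to a replacement table.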

Missing Elements in `result.elements`

A section heading physically present on page *N* of a PDF appears correctly in the markdown chunker output (for example, ### 3.4. Pharmacokinetics …) but the corresponding element is absent from result.elements for that page. Markdown emission and structured element extraction are not kept in sync. Source: crates/xberg/src/pdf/mod.rs:1-1.

Docker Signal Handling

The container image does not register a handler for SIGTERM, so docker stop and docker compose down consistently hit Docker's 10-second SIGKILL timeout. Operators relying on graceful shutdown must wrap the entrypoint with tini or invoke the binary through dumb-init until the upstream image is fixed. Source: crates/xberg/src/ocr/tessdata_manager.rs:1-1.

TLS-MITM Model Downloads

The reqwest + rustls client used for HuggingFace / ONNX downloads does not trust a custom CA, so model fetch fails behind any corporate TLS-MITM proxy. There is currently no public hook for injecting a CA bundle; organisations behind an intercepting proxy must pre-populate the model cache out-of-band. Source: crates/xberg/src/model_download.rs:1-1.

Build, Toolchain & Distribution

Tesseract C++ Shim Requires `-std=c++17`

crates/kreuzberg-tesseract/build.rs compiles the C++ shim with the compiler's default dialect rather than a pinned -std=c++17. Vendored Tesseract 5.x headers require C++17, so the shim fails on toolchains whose default predates that standard. Until the build script is corrected, build hosts should export CXXFLAGS=-std=c++17 before running cargo build; this relies on the shim being compiled through the cc crate, which reads CXXFLAGS from the environment. Rust linker flags such as -C link-arg do not change the dialect used to compile the C++ translation unit, so they are not a substitute. Source: crates/kreuzberg-tesseract/build.rs:1-1 and crates/kreuzberg-tesseract/src/shim.cpp:1-1.

Tessdata Resolution at Runtime

Runners should not assume the compiled-in path; release artefacts must read tessdata from a deployment-time location. Source: crates/xberg/src/ocr/tessdata_manager.rs:1-1.

Issue Summary

Area	Symptom	Workaround
PDF page count	No cheap `__len__` API	Render-probe indices
Default `OcrConfig`	`ValueError` on `VlmFallbackPolicy`	Construct fields explicitly
PDF ligatures	C0 control chars in output	Post-extract glyph remap
`result.elements`	Missing headings per page	Use markdown output
Docker stop	10 s SIGKILL	Wrap with `tini`
TLS-MITM	HF downloads fail	Pre-warm model cache
Tesseract shim	Fails on old toolchains	Set `CXXFLAGS=-std=c++17`

Pending Feature Requests

PaddleOCR-VL 1.6 / PP-OCRv6: first-class support for the newer PaddleOCR document models, requested as a plug-and-play upgrade from 1.5.
Sentence-level pruning (prune_async / open_provence): a Provence-based sibling to the reranker, which would drop off-topic sentences within a document.
Markdown footnote / citation API: parse [^label] anchors and [^label]: content definitions, plus a structured citation convention for knowledge management.

Tracking and prioritisation for the items above live in the linked issues; they are listed here so that integrators can plan around them rather than discover them at integration time.

Doramagic Pitfall Log

Found 38 structured pitfall item(s), including 1 high/blocking item(s). Top priority: Security or permission risk - Security or permission risk requires verification.

1. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1144

2. Installation risk: Installation risk requires verification

Severity: medium
Finding: Developers should check this installation risk before relying on the project: bug: HF/ONNX model download fails behind corporate TLS-MITM — no custom CA support
User impact: Developers may fail before the first successful local run: bug: HF/ONNX model download fails behind corporate TLS-MITM — no custom CA support
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: bug: HF/ONNX model download fails behind corporate TLS-MITM — no custom CA support. Context: Observed when using windows, macos, linux
Evidence: failure_mode_cluster:github_issue | https://github.com/kreuzberg-dev/kreuzberg/issues/1146

3. Installation risk: Installation risk requires verification

Severity: medium
Finding: Developers should check this installation risk before relying on the project: bug: kreuzberg maps PDF ligature glyphs to C0 control characters
User impact: Developers may fail before the first successful local run: bug: kreuzberg maps PDF ligature glyphs to C0 control characters
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: bug: kreuzberg maps PDF ligature glyphs to C0 control characters. Context: Observed during installation or first-run setup.
Evidence: failure_mode_cluster:github_issue | https://github.com/kreuzberg-dev/kreuzberg/issues/1135

4. Installation risk: Installation risk requires verification

Severity: medium
Finding: Developers should check this installation risk before relying on the project: feat: support PaddleOCR-VL 1.6 and PP-OCRv6 models
User impact: Developers may fail before the first successful local run: feat: support PaddleOCR-VL 1.6 and PP-OCRv6 models
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: feat: support PaddleOCR-VL 1.6 and PP-OCRv6 models. Context: Observed when using python
Evidence: failure_mode_cluster:github_issue | https://github.com/kreuzberg-dev/kreuzberg/issues/1149

5. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1132

6. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1147

7. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1145

8. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1148

9. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1135

10. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1098

11. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/649

12. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/kreuzberg-dev/kreuzberg/issues/1149

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using kreuzberg with real data or production workflows.

No cheap PDF page count in 5.x (PdfPageIterator removed) - github / github_issue
v5.0.0-rc.30: Tesseract C++ shim built without -std=c++17 (fails where c - github / github_issue
Default OcrConfig() raises ValueError (VlmFallbackPolicy "disabled") - github / github_issue
feat: support PaddleOCR-VL 1.6 and PP-OCRv6 models - github / github_issue
bug: Docker container does not respond to stop signals, resulting in tim - github / github_issue
bug: Processing warning for single-page very wide PDF during chunking - github / github_issue
feat: markdown footnote and citation parsing API - github / github_issue
bug: OCR bakes the build-runner's TESSDATA path into the released binary - github / github_issue
bug: HF/ONNX model download fails behind corporate TLS-MITM — no custom - github / github_issue
feat: sentence-level pruning (prune_async / open_provence) mirroring the - github / github_issue
bug: section heading present in markdown output but missing from `result - github / github_issue
bug: kreuzberg maps PDF ligature glyphs to C0 control characters - github / github_issue

Source: Project Pack community evidence and pitfall evidence