Match the project to your task before installing it.
Vector Retrieval and RAG · Public
kreuzberg
Vector retrieval project for checking embedding storage, query semantics, RAG integration, data boundaries, and rollback.
What it can do: Vector database setup checks, embedding model boundaries, collection management, query acceptance, and deletion guidance. Review the portable capability path.
Before continuing: Verify in a sandbox. Do not treat a preview pack as a proven local install.
GitHub snapshot: 8.5k stars · 501 forks · 46 contributors
Doramagic.ai · Last verification date: 2026-07-05 · Verification method: source evidence, semantic profile, public page gate, and static build acceptance.
Publication status · 2026-07-05
What is kreuzberg?
- kreuzberg is a vector database, retrieval, or RAG storage component for AI applications.
- Best fit: Developers connecting knowledge bases, documents, or app data to semantic retrieval or RAG workflows.
- Not for: One-off model API calls, or environments that cannot isolate indexed data, credentials, and persistence paths.
- Capability added to an AI workflow: Vector database setup checks, embedding model boundaries, collection management, query acceptance, and deletion guidance
- First safe verification step: Verify create, query, delete, and rollback with a small public text sample before using real data.
- Verification state: source, Quick Start, and sandbox install checks are recorded as passed.
- Top risk: May increase setup, validation, or first-run risk for the user.
- Evidence base: https://github.com/kreuzberg-dev/kreuzberg, https://github.com/kreuzberg-dev/kreuzberg#readme, Human Manual, Pitfall Log
01
Quick decision
Use this section to decide whether the project is worth a deeper read.
Vector retrieval project for checking embedding storage, query semantics, RAG integration, data boundaries, and rollback.
02
What it can do
Translate the upstream project into concrete capabilities the user can judge before installing.
Introduction & Capabilities
Related topics: Workspace Layout & Crate Structure, Language Bindings, FFI & Polyglot, Deployment Modes & Serving
Sources: docs/features.md:60-120, community issues #1144 (pruning) and #1149 (PaddleOCR-VL 1.6 / PP-OCRv6 model support).
Workspace Layout & Crate Structure
Related topics: Extraction Pipeline & Format Handlers, Language Bindings, FFI & Polyglot
Source: https://github.com/kreuzberg-dev/kreuzberg / Human Manual
Extraction Pipeline & Format Handlers
Related topics: OCR Backends & Configuration, Plugin System, Enrichment & Embeddings, Known Issues, Limitations & Migration Notes
Source: https://github.com/kreuzberg-dev/kreuzberg / Human Manual
OCR Backends & Configuration
Related topics: Extraction Pipeline & Format Handlers, Known Issues, Limitations & Migration Notes
Source: https://github.com/kreuzberg-dev/kreuzberg / Human Manual
Language Bindings, FFI & Polyglot
Related topics: Workspace Layout & Crate Structure, Plugin System, Enrichment & Embeddings
Source: https://github.com/kreuzberg-dev/kreuzberg / Human Manual
Sources: https://github.com/kreuzberg-dev/kreuzberg, Human Manual, Project Pack evidence, and downstream validation signals.
03
Community Discussion Evidence
Project-level external discussion stays visible on the detail page, not only inside the manual.
12 source-linked items. Review these external discussions before using kreuzberg with real data or production workflows. They are review inputs, not standalone proof that the project is production-ready.
- 01 · No cheap PDF page count in 5.x (PdfPageIterator removed) · github / github_issue
- 02 · v5.0.0-rc.30: Tesseract C++ shim built without -std=c++17 (fails where c · github / github_issue
- 03 · Default OcrConfig() raises ValueError (VlmFallbackPolicy "disabled") · github / github_issue
- 04 · feat: support PaddleOCR-VL 1.6 and PP-OCRv6 models · github / github_issue
- 05 · bug: Docker container does not respond to stop signals, resulting in tim · github / github_issue
- 06 · bug: Processing warning for single-page very wide PDF during chunking · github / github_issue
- 07 · feat: markdown footnote and citation parsing API · github / github_issue
- 08 · bug: OCR bakes the build-runner's TESSDATA path into the released binary · github / github_issue
- 09 · bug: HF/ONNX model download fails behind corporate TLS-MITM — no custom · github / github_issue
- 10 · feat: sentence-level pruning (prune_async / open_provence) mirroring the · github / github_issue
- 11 · bug: section heading present in markdown output but missing from `result · github / github_issue
- 12 · bug: kreuzberg maps PDF ligature glyphs to C0 control characters · github / github_issue
04
How to start
Only source-backed commands are shown here. Verify them in an isolated environment first.
- Try the prompt first (preview): Test the workflow without installing the upstream project.
- Read the Human Manual (manual): Understand inputs, outputs, limits, and failure modes.
- Take context to your AI host (context): Use the compiled assets in your preferred AI environment.
- Run sandbox verification (verify): Confirm install commands and rollback before using a primary environment.
pip install kreuzberg
Official start command · https://github.com/kreuzberg-dev/kreuzberg#readme · verified: yes
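One way to run the sandbox verification step is to install into a throwaway virtual environment, so that deleting the directory is the rollback. The sketch below is a minimal illustration using only the Python standard library; the helper names `venv_python`, `install_command`, and `verify_in_sandbox` are hypothetical and not part of kreuzberg. The only detail taken from this page is the `pip install kreuzberg` command.

```python
import os
import subprocess
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(env_dir: Path, package: str = "kreuzberg") -> list[str]:
    """Build the pip command that installs `package` inside the sandbox only."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", package]


def verify_in_sandbox(package: str = "kreuzberg") -> bool:
    """Create a throwaway venv, install the package, and report success.

    The temporary directory is removed when the block exits, so nothing
    is left behind in the primary environment (the rollback).
    """
    with tempfile.TemporaryDirectory(prefix="sandbox-") as tmp:
        env_dir = Path(tmp) / "venv"
        # with_pip=True bootstraps pip into the new environment.
        venv.EnvBuilder(with_pip=True).create(env_dir)
        result = subprocess.run(install_command(env_dir, package))
        return result.returncode == 0
```

Calling `verify_in_sandbox()` needs network access and returns `True` only if pip exits with status 0. It checks that the install succeeds, not that the package behaves correctly on your data.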
05
Human Manual
The English page must expose the real manual, not a short placeholder.
8+ sections · Human Manual
kreuzberg Manual
The Plugin System is kreuzberg's extension surface, letting integrators add custom extraction, post-processing, and validation logic without modifying the core extraction pipeline. It is d...
Open the full manual
- https://github.com/kreuzberg-dev/kreuzberg Project Manual
- Table of Contents
- Introduction & Capabilities
- Related Pages
- What Kreuzberg Solves
- Core Capabilities
- Extraction Pipeline
- OCR Backends
06
AI Context Pack and portable assets
After deciding to continue, take the project context into your own AI host.
Complete pack plus user-owned assets
These files are planning and verification assets for Claude Code, Codex, Gemini, Cursor, ChatGPT, and other AI hosts.
07
Preflight checks
Treat this page as a planning asset, not proof that your local environment is ready.
- The manual is generated from source-linked project files and Doramagic validation signals.
- Community evidence warnings stay visible instead of being converted into marketing claims.
- This English page is indexable because the locale quality gate passed and explicit English index approval is enabled.
- Use the upstream repository as the final authority for installation commands, license, and version-specific behavior.
08
Pitfall Log and verification risks
Doramagic surfaces high-risk items before users treat a candidate capability as verified.
Security or permission risk requires verification
May increase setup, validation, or first-run risk for the user.
Installation risk requires verification
Developers may fail before the first successful local run: bug: HF/ONNX model download fails behind corporate TLS-MITM — no custom CA support
Installation risk requires verification
Developers may fail before the first successful local run: bug: kreuzberg maps PDF ligature glyphs to C0 control characters
Installation risk requires verification
Developers may fail before the first successful local run: feat: support PaddleOCR-VL 1.6 and PP-OCRv6 models
Installation risk requires verification
May increase setup, validation, or first-run risk for the user.