Doramagic Project Pack · Human Manual

PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

Repository Overview and System Architecture

Related topics: Core Pipelines and Models (PP-OCR, PP-StructureV3, PaddleOCR-VL), Deployment, SDKs, and Integrations

1. Purpose and Scope

PaddleOCR is a multilingual, production-grade OCR and document-parsing toolkit built on top of PaddlePaddle. As described in the top-level README.md, the project "converts PDF documents and images into structured, LLM-ready data (JSON/Markdown) with industry-leading accuracy," and is positioned as the bedrock for RAG and Agentic applications, with 70k+ stars and adoption by projects such as Dify, RAGFlow, and Cherry Studio.

The repository organizes the system into three concentric layers:

  1. Algorithm core — PP-OCR family, PP-Structure, and the PaddleOCR-VL vision-language models.
  2. Engine layer — Python, C++, Paddle-Lite, ONNX, and PaddleServing inference stacks.
  3. SDK layer — first-party clients for Python, TypeScript, Go, and a browser bundle (paddleocr-js).

This separation allows the same model zoo to be reused across research notebooks, server-side services, and edge devices.

2. Capability Pillars

2.1 Scene OCR (PP-OCRv6)

The PP-OCR series is the global multilingual text-spotting flagship. According to the release notes embedded in README.md, PP-OCRv6 supports 50 languages with a single unified model and is the default model family for the text_type: general pipeline. The C++ reference configuration in deploy/cpp_infer/src/configs/OCR.yaml shows the canonical end-to-end recipe: a DocPreprocessor sub-pipeline (orientation classification + unwarping) followed by TextDetection (PP-OCRv6_medium_det), TextLineOrientation, and TextRecognition (PP-OCRv6_medium_rec).

2.2 Intelligent Document Parsing (PP-Structure & PaddleOCR-VL)

The repository exposes two complementary document-parsing approaches, both summarized in README.md:

  • PP-StructureV3 — structure-aware conversion into Markdown or JSON, preserving cell-level coordinates.
  • PaddleOCR-VL-1.6 (0.9B) — a NaViT-style dynamic-resolution VLM fused with ERNIE-4.5-0.3B, achieving 96.3% accuracy on OmniDocBench v1.6 and supporting 109–111 languages depending on minor version.

The structural sub-modules live under ppstructure/:

| Sub-module | Path | Purpose | Source |
|---|---|---|---|
| Layout analysis | ppstructure/layout/ | Region segmentation (text/title/figure/table) via PP-PicoDet | ppstructure/layout/README.md |
| Layout recovery | ppstructure/recovery/ | Restore images/PDFs into editable Word files | ppstructure/recovery/README.md |
| KIE | ppstructure/kie/ | Key Information Extraction via VI-LayoutXLM (SER + RE) | ppstructure/kie/README.md |

2.3 Community-driven demand

Community issues show which capabilities users push hardest on. Issue #1048, "Multilingual OCR Development Plan" (72 comments), drove the consolidation toward a single multi-language model in PP-OCRv6. Issue #1663 discusses the padding around detected text crops. In the C++ YAML, that padding is governed mainly by the detection post-processing parameter unclip_ratio, while limit_side_len and max_side_limit control how the image is resized before detection.

3. Deployment Topology

The deploy/README.md lists five official deployment schemes: Python inference, C++ inference, PaddleServing, Paddle-Lite (ARM CPU/OpenCL ARM GPU), and Paddle2ONNX. The pipelines themselves are defined in YAML (see the pipeline_name: OCR example in deploy/cpp_infer/src/configs/OCR.yaml), and each scheme targets a different runtime:

  • C++ Inference: high-throughput server-side path built on Paddle Inference, which can optionally use TensorRT for GPU acceleration.
  • Paddle-Lite — mobile/IoT path documented in deploy/lite/readme.md, which targets ARM7/ARM8 phones and depends on cross-compilation toolchains.
  • Paddle2ONNX — produces interoperable ONNX models for non-Paddle runtimes.
flowchart LR
    A[Image / PDF] --> B[DocPreprocessor]
    B --> C[TextDetection]
    C --> D[TextLineOrientation]
    D --> E[TextRecognition]
    E --> F[Structured Output]
    subgraph "Document-level extensions"
    F --> G[PP-StructureV3]
    F --> I[KIE: SER + RE]
    end
    A --> H[PaddleOCR-VL: end-to-end VLM]

4. SDK and Multi-Language Bindings

The api_sdk/ directory contains the official client SDKs that wrap the hosted PaddleOCR Cloud API over HTTP. Per api_sdk/README.md, the supported languages and their source locations are:

| Language | Source location | Notes |
|---|---|---|
| Python | paddleocr/ | Reference SDK, tested via pytest tests/api_client/ |
| TypeScript | api_sdk/typescript/ | Node ≥ 18, built with tsup, tested with vitest (package.json) |
| Go | api_sdk/go/ | Tested via go test ./... |

A separate browser-oriented JavaScript client lives in paddleocr-js/. Its package.json declares vitest, eslint, and prettier as development tooling. Those dependencies show how the package is tested and linted, not how it performs OCR. Before relying on either behavior, confirm from its documentation whether it calls the official API endpoint or runs models in the browser.

The SDK layer intentionally decouples clients from model evolution. The server may upgrade from PP-OCRv5 to PP-OCRv6 (as it did between v3.6 and v3.7.0) without breaking the TypeScript or Go clients, as long as the JSON contract is preserved. Recent issues such as #18194 (PaddleOCR-VL HPS: returnMarkdownImages=false is ineffective with the default PaddleX 3.6 SDK) show that contract drift between the PaddleX inference SDK and the hosted API is an active integration risk. Track it when pinning versions.
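
One way to act on this when pinning versions is a pre-flight check in client code. The sketch below is hypothetical. The function names are illustrative. Treating every 3.6.x release as affected is an assumption drawn only from the issue title, not from a confirmed list of fixed versions.

```python
import warnings


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '3.6.1' into (3, 6, 1); trailing non-digits such as 'rc1' are dropped."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_markdown_images_option(sdk_version: str, return_markdown_images: bool) -> bool:
    """Return True if returnMarkdownImages=False may be silently ignored.

    Assumption: every PaddleX 3.6.x SDK is treated as affected by issue #18194.
    """
    at_risk = (not return_markdown_images) and parse_version(sdk_version)[:2] == (3, 6)
    if at_risk:
        warnings.warn(
            f"PaddleX SDK {sdk_version}: returnMarkdownImages=False may be ignored "
            "(PaddleOCR issue #18194); inspect the response before relying on it."
        )
    return at_risk
```

For example, check_markdown_images_option("3.6.0", False) returns True and emits a warning, while the same call with "3.7.0" returns False. A check like this only flags the risk. The reliable test is still to inspect an actual response.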

5. Configuration Reference (PP-OCR C++ pipeline)

The single most representative configuration in the repo is the C++ inference YAML for PP-OCR. Excerpted from deploy/cpp_infer/src/configs/OCR.yaml:

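| Section | Key | Value | Purpose |
|---|---|---|---|
| Top level | text_type | general | Selects the OCR pipeline |
| DocPreprocessor | use_doc_orientation_classify | True | Enables PP-LCNet_x1_0_doc_ori |
| DocPreprocessor | use_doc_unwarping | True | Enables UVDoc |
| TextDetection | model_name | PP-OCRv6_medium_det | Default detector |
| TextDetection | limit_side_len / max_side_limit | 64 / 4000 | With limit_type: min, the short side is scaled up to at least 64 px and the long side is capped at 4000 px before detection |
| TextDetection | thresh / box_thresh / unclip_ratio | 0.3 / 0.6 / 1.5 | DB-style post-processing: pixel binarization threshold, minimum box score, and box expansion ratio (unclip_ratio sets the crop padding discussed in #1663) |
| TextRecognition | model_name | PP-OCRv6_medium_rec | Default recognizer |
| TextRecognition | batch_size | 6 | Throughput knob |
| TextRecognition | score_thresh | 0.0 | Minimum recognition confidence; 0.0 keeps every result |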
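
The resizing keys are easier to reason about with numbers. The sketch below is modeled on the resize rule commonly placed in front of DB-style text detectors. It scales the image so the chosen side meets limit_side_len, caps the long side at max_side_limit, and snaps each side to a multiple of 32. The exact ordering and rounding inside PaddleX may differ, so treat this as an approximation rather than the runtime's code.

```python
def det_resize_shape(h: int, w: int, limit_side_len: int = 64,
                     limit_type: str = "min", max_side_limit: int = 4000) -> tuple[int, int]:
    """Approximate the (height, width) a detector input is resized to."""
    ratio = 1.0
    if limit_type == "min" and min(h, w) < limit_side_len:
        # Upscale so the short side reaches at least limit_side_len.
        ratio = limit_side_len / min(h, w)
    elif limit_type == "max" and max(h, w) > limit_side_len:
        # Downscale so the long side is at most limit_side_len.
        ratio = limit_side_len / max(h, w)
    new_h, new_w = int(h * ratio), int(w * ratio)

    # Hard cap on the long side, regardless of limit_type.
    if max(new_h, new_w) > max_side_limit:
        cap = max_side_limit / max(new_h, new_w)
        new_h, new_w = int(new_h * cap), int(new_w * cap)

    # DB-style backbones expect dimensions divisible by 32 (minimum 32).
    new_h = max(int(round(new_h / 32) * 32), 32)
    new_w = max(int(round(new_w / 32) * 32), 32)
    return new_h, new_w
```

Under these assumptions, a 40 x 300 text strip is upscaled to 64 x 480. A 100 x 10000 panorama is capped at 4000 px on its long side, giving 32 x 4000.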

Every model_dir: null entry means PaddleX resolves the model from its model zoo and downloads it on first use. The other pipeline YAMLs in the project generally follow the same convention.

Source: https://github.com/PaddlePaddle/PaddleOCR / Human Manual

Core Pipelines and Models (PP-OCR, PP-StructureV3, PaddleOCR-VL)

Related topics: Repository Overview and System Architecture, Deployment, SDKs, and Integrations, Configuration, Training, and Customization

1. Purpose and Scope

PaddleOCR exposes three primary solution families through its Python pipeline layer, each addressing a different class of document-understanding workload:

  • PP-OCRv6: a fast, multilingual scene-text spotting stack (~34.5M parameters) that covers 50 languages with a single unified model (Source: README.md).
  • PP-StructureV3: a structure-aware converter that turns complex PDFs and images into Markdown or JSON with fine-grained coordinates for table cells and text blocks (Source: README.md).
  • PaddleOCR-VL (0.9B): the flagship vision-language model for document parsing, achieving 96.3% on OmniDocBench v1.6 with structured Markdown/JSON output (Source: README.md).

The implementation surface is the paddleocr/_pipelines/ package, which contains thin orchestration classes — ocr.py, pp_structurev3.py, paddleocr_vl.py, plus auxiliary modules such as doc_understanding.py, formula_recognition.py, and seal_recognition.py. These pipelines share a common input/output contract so users can switch between them without rewriting client code.

2. Pipeline Architecture

The three pipelines are complementary rather than overlapping. The relationship is shown below.

graph LR
    A[Image / PDF Input] --> B{Use case}
    B -->|Scene text| C[PP-OCRv6]
    B -->|Structured PDF / layout| D[PP-StructureV3]
    B -->|VLM parsing| E[PaddleOCR-VL]
    C --> F[Text + boxes]
    D --> G[Markdown / JSON + cells]
    E --> H[Markdown / JSON elements]
    D -. layout .-> I[ppstructure/layout]
    D -. table .-> J[ppstructure/table]
    D -. recovery .-> K[ppstructure/recovery]
    D -. KIE .-> L[ppstructure/kie]

PP-OCRv6 is the high-throughput path for plain text extraction. PP-StructureV3 produces document-level Markdown/JSON with explicit cell and block coordinates by composing four PP-Structure subsystems:

  • layout analysis (Source: ppstructure/layout/README.md)
  • table recognition (Source: ppstructure/table/README.md)
  • layout recovery (Source: ppstructure/recovery/README.md)
  • Key Information Extraction (Source: ppstructure/kie/README.md)

PaddleOCR-VL collapses detection, recognition, layout, table, and formula tasks into a single end-to-end model for cases where maximum accuracy on irregular layouts is required.

3. PP-OCRv6 Configuration

The canonical C++ configuration mirrors the Python pipeline and exposes every module name as a tunable parameter (Source: deploy/cpp_infer/src/configs/OCR.yaml).

| Module | Default model | Key knobs |
|---|---|---|
| DocOrientationClassify | PP-LCNet_x1_0_doc_ori | toggled via use_doc_preprocessor |
| DocUnwarping | UVDoc | toggled via use_doc_preprocessor |
| TextDetection | PP-OCRv6_medium_det | thresh, box_thresh, unclip_ratio, limit_side_len |
| TextLineOrientation | PP-LCNet_x1_0_textline_ori | use_textline_orientation, batch_size |
| TextRecognition | PP-OCRv6_medium_rec | score_thresh, batch_size |

Two top-level flags control the doc preprocessor and textline orientation stages. A deployment can therefore disable orientation handling for already-clean scans without removing the YAML keys. The same composition is reflected in paddleocr/_pipelines/ocr.py, the Python entry point exposed to users (Source: paddleocr/_pipelines/ocr.py).

One failure mode reported by the community is images that yield no recognized text. See issue "图片识别没有文字输出" ("no text output from image recognition"), #17974. Configuration is a frequent cause. Start by checking three things:

  • whether thresh, box_thresh, or score_thresh is discarding faint text
  • whether limit_side_len suits very small text
  • whether the preprocessing and orientation flags match the input
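
To make the detection knobs concrete, the sketch below reproduces the post-processing formulas from the DB (Differentiable Binarization) paper:

  • thresh binarizes the probability map.
  • box_thresh rejects candidate boxes whose mean probability is too low.
  • Each surviving shrunk region is grown by a distance D = A × unclip_ratio / L, where A is the region's area and L its perimeter.

The function names are illustrative. Expanding only axis-aligned rectangles is a simplification; PaddleOCR's Python post-processor offsets arbitrary polygons with pyclipper.

```python
import math


def binarize(prob_map: list[list[float]], thresh: float = 0.3) -> list[list[int]]:
    """Pixel-level text mask: 1 where the probability exceeds thresh."""
    return [[1 if p > thresh else 0 for p in row] for row in prob_map]


def box_score(prob_map: list[list[float]], x0: int, y0: int, x1: int, y1: int) -> float:
    """Mean probability inside [x0, x1) x [y0, y1); boxes below box_thresh are dropped."""
    values = [prob_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)


def polygon_area_perimeter(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Shoelace area and perimeter of a closed polygon."""
    area2, perimeter = 0.0, 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2.0, perimeter


def unclip_distance(points: list[tuple[float, float]], unclip_ratio: float = 1.5) -> float:
    """DB expansion distance D = area * unclip_ratio / perimeter."""
    area, perimeter = polygon_area_perimeter(points)
    return area * unclip_ratio / perimeter


def expand_box(x0: float, y0: float, x1: float, y1: float, unclip_ratio: float = 1.5):
    """Bounding box of a round-join offset of a rectangle: grow every side by D."""
    d = unclip_distance([(x0, y0), (x1, y0), (x1, y1), (x0, y1)], unclip_ratio)
    return x0 - d, y0 - d, x1 + d, y1 + d
```

A 100 x 20 region gets D = 2000 × 1.5 / 240 = 12.5 px of growth on every side. Lowering unclip_ratio to 1.2 shrinks that to 10 px, which is one lever relevant to the crop-padding discussion in #1663.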

4. PP-StructureV3 and Auxiliary Pipelines

pp_structurev3.py is the orchestrator that wires the four ppstructure/* submodules into a single end-to-end document-parsing call (Source: paddleocr/_pipelines/pp_structurev3.py).

Its main inputs are:

  • an image or PDF directory
  • model directories for layout, table, and KIE
  • dictionary paths

Its outputs are Markdown plus an HTML table string and per-element JSON (Source: ppstructure/recovery/README.md).

Specialized pipelines complement it:

  • doc_understanding.py: language-model-based semantic parsing of detected regions.
  • formula_recognition.py: converts mathematical expressions to LaTeX.
  • seal_recognition.py: handles stamp / seal text extraction, a capability highlighted in the v3.4.0 release notes (Source: README.md).

KIE is built on top of LayoutXLM and VI-LayoutXLM. It supports Semantic Entity Recognition (SER) and Relation Extraction (RE), and integrates the PP-OCR inference engine for OCR preprocessing (Source: ppstructure/kie/README.md). On the Chinese XFUND benchmark, VI-LayoutXLM reaches 93.19% Hmean at 15.49 ms / image (Source: ppstructure/kie/README.md).
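
Hmean in these KIE results is the harmonic mean of precision and recall, which is the same as F1. The helper below shows the arithmetic. It is a generic sketch, not code taken from PaddleOCR's metric classes.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (equivalent to F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hmean_from_counts(tp: int, fp: int, fn: int) -> float:
    """Hmean from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return hmean(precision, recall)


# 90 correct entities, 10 spurious, 20 missed: P = 0.9, R ≈ 0.818.
print(round(hmean_from_counts(90, 10, 20), 4))  # prints 0.8571
```

The harmonic mean is pulled toward the smaller of the two values, so a high Hmean requires precision and recall to both be high.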

5. PaddleOCR-VL and the v3.7.0 Stack

paddleocr_vl.py wraps the PaddleOCR-VL-0.9B model, which combines a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to handle text, tables, formulas, and charts in 109+ languages (Source: README.md). The model is the recommended default when users need unified element recognition without per-task model switching.

The v3.7.0 release notes (June 2026) highlight that PP-OCRv6 achieves +4.6% detection and +5.1% recognition improvements over PP-OCRv5_server while "surpassing mainstream VLMs (Qwen3-VL-235B, GPT-5.5) with only 34.5M parameters" (Source: README.md). This positioning is explicitly aimed at users who previously assumed VLMs were always superior.

A known incompatibility is that returnMarkdownImages=false is currently ineffective under the default PaddleX 3.6 SDK (community issue #18194). Until the issue is resolved, callers relying on HPS output should verify that option against their installed SDK version.

6. Deployment and SDK Surface

PaddleOCR ships multiple runtime targets so the same pipeline can be reached from different stacks: Python inference, C++ inference, Paddle Serving, Paddle-Lite, and Paddle2ONNX (Source: deploy/README.md).

When choosing among the three core pipelines, a practical rule of thumb applies (Source: README.md):

  • Use PP-OCRv6 when speed and language breadth matter most.
  • Use PP-StructureV3 when downstream consumers need cell-level coordinates, recoverable Word output, or KIE.
  • Use PaddleOCR-VL when document layouts are highly irregular (skewed, warped, photographed) and structured Markdown is the primary deliverable.
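
The rule of thumb can be written down as a small routing function, for example to document the choice in an ingestion service. Everything here is hypothetical. The DocumentNeeds fields and the order of the checks are one reading of the rule above, not a PaddleOCR API, and the returned strings are plain labels.

```python
from dataclasses import dataclass


@dataclass
class DocumentNeeds:
    """Hypothetical summary of what a caller needs from a parsing run."""
    needs_cell_coordinates: bool = False
    needs_word_output: bool = False
    needs_kie: bool = False
    irregular_layout: bool = False      # skewed, warped, or photographed pages
    markdown_is_primary: bool = False


def choose_pipeline(needs: DocumentNeeds) -> str:
    """Encode the manual's rule of thumb; the check order is an assumption."""
    if needs.needs_cell_coordinates or needs.needs_word_output or needs.needs_kie:
        return "PP-StructureV3"
    if needs.irregular_layout and needs.markdown_is_primary:
        return "PaddleOCR-VL"
    # Default: speed and language breadth.
    return "PP-OCRv6"
```

choose_pipeline(DocumentNeeds(irregular_layout=True, markdown_is_primary=True)) returns "PaddleOCR-VL", while a default DocumentNeeds() falls through to "PP-OCRv6". The PP-StructureV3 requirements are checked first as a design choice: cell coordinates, Word recovery, and KIE are hard requirements that this manual attributes to PP-StructureV3.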

Deployment, SDKs, and Integrations

Related topics: Repository Overview and System Architecture, Core Pipelines and Models (PP-OCR, PP-StructureV3, PaddleOCR-VL)

1. Overview

PaddleOCR is a multilingual, document-parsing OCR toolkit that converts PDFs and images into structured, LLM-ready Markdown or JSON. Beyond its core inference engines, the project ships a layered deployment and integration surface that targets three audiences: server-side integrators who need REST or gRPC serving, application developers who consume Python/TypeScript/Go/JavaScript SDKs, and edge/mobile teams that deploy via Paddle-Lite or native Android. Source: README.md.

The repository organizes this surface into five concrete sub-trees: deploy/ for native and serving targets, api_sdk/ for the official PaddleOCR Cloud API client packages, paddleocr-js/ for the browser-oriented JavaScript client, ppstructure/ for downstream document-AI modules, and deploy/android_demo/ for the on-device Android sample.

2. Server-Side Deployment

2.1 Paddle Deployment Matrix

PaddleOCR supports a range of server-side deployment options through the deploy/ directory. According to deploy/README.md, the supported schemes are:

| Deployment target | Use case | Source path |
|---|---|---|
| Python inference | Quick prototyping, batch scripts | doc/doc_en/inference_ppocr_en.md |
| C++ inference | High-throughput production servers | deploy/cpp_infer/readme.md |
| Paddle Serving (Python/C++) | REST/gRPC microservice | deploy/pdserving/README.md |
| Paddle2ONNX | Export to ONNX for cross-framework use | deploy/paddle2onnx/readme.md |
| Paddle-Lite | ARM CPU / OpenCL ARM GPU | deploy/lite/readme.md |

The deployment overview explicitly notes that the PaddlePaddle runtime "provides a variety of deployment schemes to meet the deployment requirements of different scenarios" and refers users to the diagram at ../doc/deployment_en.png for selection guidance. Source: deploy/README.md.

2.2 Paddle-Lite Mobile Path

For on-device deployment, deploy/lite/readme.md describes a two-phase flow: (1) prepare a cross-compilation environment (Docker, Linux, or other supported toolchains) and a Paddle-Lite toolchain, then (2) optimize the inference model with Paddle-Lite's converter and run the resulting model on an ARM7/ARM8 phone. Paddle-Lite itself is positioned as "a lightweight inference engine for PaddlePaddle" that targets mobile and IoT form factors, supporting cross-platform hardware acceleration. Source: deploy/lite/readme.md.

3. Official API SDKs

3.1 Multi-Language Client Packages

The api_sdk/ directory hosts the first-party SDKs that wrap the hosted PaddleOCR Cloud API. The package locations are summarized in api_sdk/README.md:

| Language | Source location | User docs |
|---|---|---|
| Python | ../paddleocr | docs/version3.x/inference_deployment/serving/paddleocr_official_api/python.md |
| TypeScript | api_sdk/typescript | docs/version3.x/inference_deployment/serving/paddleocr_official_api/typescript.md |
| Go | api_sdk/go | docs/version3.x/inference_deployment/serving/paddleocr_official_api/go.md |

Each language binding is validated through its own test runner: python -m pytest tests/api_client/, npm run lint && npm test for TypeScript, and go test ./... for Go. Source: api_sdk/README.md.

3.2 TypeScript and JavaScript Build Profiles

The TypeScript SDK is built with tsup and runs on Node >=18, using @types/node ^25.9.1 for typings and vitest as its test runner. Its package.json keywords include paddleocr, ocr, document-parsing, api-sdk, typescript, and official-api. Source: api_sdk/typescript/package.json.

The browser-oriented paddleocr-js/ package uses vitest ^3.2.4 for testing, eslint with typescript-eslint ^8.57.2 for linting, and prettier ^3.8.1 for formatting, with lint-staged configured to run eslint --fix and prettier --write on staged files. Source: paddleocr-js/package.json.

4. Edge and Mobile: Android Demo

The Android sample under deploy/android_demo/ ships a native C++ pipeline that performs polygon clipping for text-region processing. The C++ source bundles a C++ translation of the Clipper polygon-clipping library (originally written in Delphi). It is exposed via ocr_clipper.hpp with the namespace ClipperLib and version string CLIPPER_VERSION "6.4.2". Source: deploy/android_demo/app/src/main/cpp/ocr_clipper.hpp.

The companion ocr_clipper.cpp implements the scanline machinery (TEdge edge records over IntPoint coordinates) behind the clip operations ctIntersection, ctUnion, ctDifference, and ctXor. It also defines constants such as pi = 3.141592653589793238 and def_arc_tolerance = 0.25. These primitives are the geometric foundation the on-device pipeline uses to merge, intersect, or offset text polygons before recognition. Source: deploy/android_demo/app/src/main/cpp/ocr_clipper.cpp.
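
Clipper's boolean operations and offsetting depend on polygon orientation, and ClipperLib exposes Area and Orientation helpers for it. The Python sketch below mirrors that convention: orientation is true when the signed area is non-negative. It applies the shoelace formula to integer points. It is an illustration of the geometry, not a port of ocr_clipper.cpp, and the helper names are chosen for this example.

```python
Point = tuple[int, int]


def signed_area2(path: list[Point]) -> int:
    """Twice the signed shoelace area of a closed polygon.

    Integer coordinates (like ClipperLib's IntPoint) keep the result exact.
    """
    total = 0
    for i, (x1, y1) in enumerate(path):
        x2, y2 = path[(i + 1) % len(path)]
        total += x1 * y2 - x2 * y1
    return total


def area(path: list[Point]) -> float:
    """Signed area; positive for counter-clockwise paths when Y points up."""
    return signed_area2(path) / 2


def orientation(path: list[Point]) -> bool:
    """True when the signed area is non-negative (Clipper-style convention)."""
    return signed_area2(path) >= 0


def normalize(path: list[Point]) -> list[Point]:
    """Return the path with non-negative orientation, reversing it if needed."""
    return list(path) if orientation(path) else list(reversed(path))
```

Normalizing orientation before clipping or offsetting keeps outer contours and holes distinguishable, which is why Clipper-based code checks it before combining text polygons.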

5. PP-Structure Downstream Modules

The ppstructure/ tree extends PaddleOCR into document-AI workflows and is tightly coupled to deployment, since the same pipelines can be served through the Python inference or C++ paths.

  • Layout analysis provides Chinese, English, and table-region detection built on PaddleDetection's PP-PicoDet. Models are available in ppstructure/docs/models_list_en.md, and the README documents the PubLayNet and CDLA pre-training data download commands. Source: ppstructure/layout/README.md.
  • Key Information Extraction (KIE) combines text detection, text recognition, semantic entity recognition (SER), and optional relationship extraction (RE) on top of the VI-LayoutXLM backbone, with pretrained models published in configs/kie/layoutlm_series/. Source: ppstructure/kie/README.md.
  • Layout recovery offers two strategies for restoring an editable Word file: a pdf2docx-based path for standard PDFs and an image-format PDF path that combines layout analysis, table recognition, and rule-based parsing. Source: ppstructure/recovery/README.md.

6. Ecosystem Integrations

PaddleOCR is consumed by several top-tier open-source projects; the README badges list RAGFlow (deep document understanding), Pathway (real-time analytics and LLM pipelines), MinerU (multi-type document to Markdown), Umi-OCR (batch offline OCR), Cherry Studio (multi-LLM desktop client), and Haystack (deepset's RAG framework). These integrations typically consume the Python wheel directly or the PaddleOCR-VL/PP-OCRv6 model checkpoints, depending on the host project's deployment shape. Source: README.md.

7. Common Failure Modes

Community-reported issues that intersect with the deployment and SDK surface include:

  • PaddleOCR-VL HPS option ignored on PaddleX 3.6: returnMarkdownImages=false does not take effect with the default PaddleX 3.6 SDK, requiring either an SDK upgrade or a workaround. Source: Issue #18194.
  • No text output for image input: Symptom of misconfigured detection or recognition parameters at the SDK or serving layer. Source: Issue #17974.
  • Windows + torch compatibility: OSError [WinError 127] when installing torch on Windows, which is a prerequisite for some PaddleOCR-VL pipelines. Source: Issue #14979.
  • Detection crop padding sensitivity: Long detection crops with large surrounding padding (≈5 px) degrade recognition; a tighter 1–2 px bounding box via OpenCV post-processing is the community-recommended workaround. Source: Issue #1663.
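
The tight re-crop workaround from #1663 can be sketched without any imaging library: find the rows and columns that contain "ink" and keep a 1 to 2 px margin around them. This simplified illustration assumes dark text on a light background and a fixed ink threshold, and the function names are hypothetical. The issue itself suggests doing this with OpenCV, where inverse thresholding followed by cv2.boundingRect gives an equivalent box.

```python
def tight_crop_box(gray: list[list[int]], ink_thresh: int = 128, margin: int = 2):
    """Box (x0, y0, x1, y1), end-exclusive, around dark pixels plus a margin.

    Returns None when no pixel is darker than ink_thresh.
    """
    h = len(gray)
    w = len(gray[0]) if h else 0
    rows = [y for y in range(h) if any(v < ink_thresh for v in gray[y])]
    cols = [x for x in range(w) if any(gray[y][x] < ink_thresh for y in range(h))]
    if not rows or not cols:
        return None
    # Clamp the margin to the image borders.
    x0 = max(cols[0] - margin, 0)
    y0 = max(rows[0] - margin, 0)
    x1 = min(cols[-1] + 1 + margin, w)
    y1 = min(rows[-1] + 1 + margin, h)
    return x0, y0, x1, y1


def apply_crop(gray: list[list[int]], box: tuple[int, int, int, int]) -> list[list[int]]:
    """Slice the image rows and columns to the given box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in gray[y0:y1]]
```

In a pipeline, this step would sit between detection cropping and recognition, replacing each wide-padded crop with apply_crop(crop, tight_crop_box(crop, margin=1)) whenever a box is found.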

Configuration, Training, and Customization

Related topics: Core Pipelines and Models (PP-OCR, PP-StructureV3, PaddleOCR-VL), Deployment, SDKs, and Integrations

Overview and Scope

PaddleOCR is a multilingual OCR and document-parsing toolkit that ships a layered configuration and training system. Users can adopt pretrained models out of the box, or retrain and customize virtually every component to fit domain-specific data: text detection, recognition, layout analysis, table recognition, key information extraction (KIE), and VLM-based parsing.

The customization surface is exposed through three primary channels:

  • YAML pipeline definitions
  • configuration files for individual modules
  • per-language scripts under ppstructure/

Source: [README.md].

The project supports PP-OCRv6, PaddleOCR-VL, and PP-StructureV3 as headline models, and provides unified configuration paths for them. Customization typically follows a "config first, then train, then deploy" pattern.

Pipeline Configuration

PaddleOCR's production pipeline is described by a single YAML file that maps model names, module names, and hyperparameters. The canonical example is the C++ inference configuration (Source: [deploy/cpp_infer/src/configs/OCR.yaml]):

pipeline_name: OCR
text_type: general
use_doc_preprocessor: True
use_textline_orientation: True

SubPipelines:
  DocPreprocessor:
    pipeline_name: doc_preprocessor
    use_doc_orientation_classify: True
    use_doc_unwarping: True
    SubModules:
      DocOrientationClassify:
        module_name: doc_text_orientation
        model_name: PP-LCNet_x1_0_doc_ori
        model_dir: null
      DocUnwarping:
        module_name: image_unwarping
        model_name: UVDoc
        model_dir: null

SubModules:
  TextDetection:
    module_name: text_detection
    model_name: PP-OCRv6_medium_det
    model_dir: null
    limit_side_len: 64
    limit_type: min
    max_side_limit: 4000
    thresh: 0.3
    box_thresh: 0.6
    unclip_ratio: 1.5
  TextLineOrientation:
    module_name: textline_orientation
    model_name: PP-LCNet_x1_0_textline_ori
    model_dir: null
  TextRecognition:
    module_name: text_recognition
    model_name: PP-OCRv6_medium_rec
    model_dir: null
    batch_size: 6
    score_thresh: 0.0
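
The two top-level flags and the DocPreprocessor sub-flags decide which stages run. The sketch below mirrors the YAML as a Python dict, to stay within the standard library, and lists the enabled stages in order. The gating logic is an assumption about how the flags compose, based on the descriptions in this manual rather than on PaddleX source. The TextLineOrientation model name comes from the module table in the Core Pipelines section.

```python
OCR_CONFIG = {
    "pipeline_name": "OCR",
    "use_doc_preprocessor": True,
    "use_textline_orientation": True,
    "SubPipelines": {
        "DocPreprocessor": {
            "use_doc_orientation_classify": True,
            "use_doc_unwarping": True,
            "SubModules": {
                "DocOrientationClassify": {"model_name": "PP-LCNet_x1_0_doc_ori"},
                "DocUnwarping": {"model_name": "UVDoc"},
            },
        }
    },
    "SubModules": {
        "TextDetection": {"model_name": "PP-OCRv6_medium_det"},
        "TextLineOrientation": {"model_name": "PP-LCNet_x1_0_textline_ori"},
        "TextRecognition": {"model_name": "PP-OCRv6_medium_rec"},
    },
}


def enabled_stages(cfg: dict) -> list[tuple[str, str]]:
    """Ordered (stage, model_name) pairs left switched on by the flags."""
    stages = []
    if cfg.get("use_doc_preprocessor", False):
        pre = cfg["SubPipelines"]["DocPreprocessor"]
        pre_mods = pre["SubModules"]
        if pre.get("use_doc_orientation_classify", False):
            stages.append(("DocOrientationClassify", pre_mods["DocOrientationClassify"]["model_name"]))
        if pre.get("use_doc_unwarping", False):
            stages.append(("DocUnwarping", pre_mods["DocUnwarping"]["model_name"]))
    mods = cfg["SubModules"]
    stages.append(("TextDetection", mods["TextDetection"]["model_name"]))
    if cfg.get("use_textline_orientation", False):
        stages.append(("TextLineOrientation", mods["TextLineOrientation"]["model_name"]))
    stages.append(("TextRecognition", mods["TextRecognition"]["model_name"]))
    return stages
```

Setting use_doc_preprocessor and use_textline_orientation to False leaves only TextDetection and TextRecognition. That matches the advice to skip orientation handling for already-clean scans.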

Key configuration patterns observed in the YAML:

| Field | Purpose | Example value |
|---|---|---|
| pipeline_name | Declares the high-level pipeline | OCR, doc_preprocessor |
| use_doc_preprocessor | Toggles orientation classification + unwarping | True |
| model_name | Selects a pretrained model checkpoint | PP-OCRv6_medium_det |
| module_name | Maps a model to its runtime module | text_detection |
| limit_side_len / thresh / box_thresh | Detection hyperparameters | 64, 0.3, 0.6 |
| unclip_ratio | Expansion ratio for detected polygons | 1.5 |
| batch_size / score_thresh | Recognition throughput and confidence gate | 6, 0.0 |

Swapping model_name is the primary way to switch between server, mobile, and multilingual variants. Setting model_dir: null defers model resolution to the runtime, while a populated model_dir points the runtime at a local model directory instead of the default download. Source: [deploy/cpp_infer/src/configs/OCR.yaml:1-39].
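
Overrides of this kind are easy to apply programmatically before handing the config to a runtime. In the sketch below, apply_override and resolve_model_path are hypothetical helpers. The cache_root / model_name layout used for null entries is an illustrative assumption, not PaddleX's actual download location.

```python
from pathlib import Path
from typing import Optional


def apply_override(cfg: dict, module: str, model_name: Optional[str] = None,
                   model_dir: Optional[str] = None) -> dict:
    """Return a copy of cfg with one SubModule's model_name/model_dir replaced."""
    new_cfg = dict(cfg)
    # Copy each SubModule entry so the original config is left untouched.
    new_cfg["SubModules"] = {name: dict(entry) for name, entry in cfg["SubModules"].items()}
    entry = new_cfg["SubModules"][module]
    if model_name is not None:
        entry["model_name"] = model_name
    if model_dir is not None:
        entry["model_dir"] = model_dir
    return new_cfg


def resolve_model_path(entry: dict, cache_root: Path) -> Path:
    """A populated model_dir wins; null falls back to a per-model cache folder (hypothetical layout)."""
    model_dir = entry.get("model_dir")
    if model_dir:
        return Path(model_dir)
    return cache_root / entry["model_name"]
```

apply_override(cfg, "TextDetection", model_dir="/models/my_det") points TextDetection at a fine-tuned export and leaves the original dict untouched. That makes it easy to keep a baseline and a customized config side by side.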

flowchart LR
    A[YAML Pipeline] --> B[DocPreprocessor]
    B --> C[TextDetection]
    C --> D[TextLineOrientation]
    D --> E[TextRecognition]
    E --> F[Structured Output]
    G[Custom model_dir] --> C
    G --> E

Training Workflows

PaddleOCR exposes a uniform "download pretrained weights → prepare data → train → export → infer" loop. Each sub-module follows it.
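
For the text detection and recognition models, this loop maps onto the repository's tools/train.py and tools/export_model.py scripts. Both take a YAML config via -c and key=value overrides via -o. The sketch below only builds the argument lists. The config path and directory names are placeholders, not files shipped in the repo, and the Global.* keys should be checked against the config you start from. Layout models train through PaddleDetection instead.

```python
import shlex


def build_commands(config: str, pretrained: str, output_dir: str,
                   inference_dir: str) -> list[list[str]]:
    """argv lists for the train -> export steps of the training loop."""
    train = ["python", "tools/train.py", "-c", config,
             "-o", f"Global.pretrained_model={pretrained}",
             f"Global.save_model_dir={output_dir}"]
    # Export the best checkpoint written by training into an inference model.
    export = ["python", "tools/export_model.py", "-c", config,
              "-o", f"Global.pretrained_model={output_dir}/best_accuracy",
              f"Global.save_inference_dir={inference_dir}"]
    return [train, export]


if __name__ == "__main__":
    # Placeholder paths; run the printed commands from the PaddleOCR repo root,
    # e.g. with subprocess.run(argv, check=True).
    for argv in build_commands("configs/det/my_det.yml", "./pretrain_models/my_det",
                               "./output/my_det", "./inference/my_det"):
        print(shlex.join(argv))
```

Keeping the commands as argv lists rather than shell strings avoids quoting problems when paths contain spaces. shlex.join is used only for display.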

Layout Analysis. Training relies on PaddleDetection's PP-PicoDet detector. The repository documents pretrained downloads such as picodet_lcnet_x1_0_fgd_layout.pdparams for the PubLayNet dataset, and notes that Chinese CDLA and table-specific variants exist for other document types. FGD distillation is supported for accuracy improvements. Source: [ppstructure/layout/README.md].

Key Information Extraction (KIE). The KIE pipeline extends layout analysis with semantic entity recognition (SER) and relationship extraction (RE). The repository ships LayoutXLM and VI-LayoutXLM configurations under configs/kie/, with a re_layoutxlm_xfund_zh.yml example reported at 74.83% accuracy. Customization paths include UDML knowledge distillation and textline sorting to fit reading order. Source: [ppstructure/kie/README.md].

Layout Recovery. For PDF-to-Word recovery, two strategies are available:

  • a pdf2docx-based path for standard PDFs
  • an image-driven path for image-based PDFs, combining layout analysis, table recognition, and rule-based parsing

Users choose between them based on input format. Source: [ppstructure/recovery/README.md].

Customization and Deployment Surfaces

Beyond core training, PaddleOCR is customizable along several axes:

  • Multilingual switching. A single PP-OCRv6 model supports 50 languages (including Chinese, English, Japanese, and 46 Latin-script languages), removing the need to swap checkpoints per locale. Source: [README.md].
  • VLM parsing. PaddleOCR-VL integrates a NaViT-style visual encoder with ERNIE-4.5-0.3B. PaddleOCR-VL-1.5 reaches 94.5% on OmniDocBench, supports 111 languages, and adds PP-DocLayoutV3 for irregular layouts (skew, warping, scanning, illumination, screen photography).
  • Deployment targets. Customization extends to deployment: Python inference, C++ inference (deploy/cpp_infer), Paddle Serving, Paddle-Lite for ARM/OpenCL, and Paddle2ONNX for cross-framework export. Source: [deploy/README.md].
  • Mobile deployment. Paddle-Lite requires a cross-compilation toolchain, then Paddle-Lite's model optimization step, and finally a phone-side runner. The documentation walks through each step. Source: [deploy/lite/readme.md].
  • API SDKs. Official SDKs in Python, TypeScript, and Go enable service integration. The TypeScript SDK requires Node ≥ 18 and bundles tsup/vitest tooling. Source: [api_sdk/README.md, api_sdk/typescript/package.json].

Common Failure Modes from the Community

Three patterns from community discussions are worth flagging when customizing:

  • Border/whitespace sensitivity in recognition. Issue #1663 reports that when detection crops carry wide (≈5px) borders, recognition accuracy degrades noticeably compared to tight 1–2px crops, because training data was synthesized with tight borders. The proposed mitigation is to post-process detected crops (e.g., re-crop to a tight bounding rectangle) before recognition.
  • Silent recognition failures. Issue #17974 documents cases where images yield no text output, often traced to pipeline configuration (e.g., use_textline_orientation disabled, aggressive score_thresh, or an inappropriate limit_side_len for tiny text). Verifying the YAML and lowering thresholds typically restores output.
  • SDK/HPS parameter drift. Issue #18194 reports that the PaddleOCR-VL HPS option returnMarkdownImages=false is ignored under the default PaddleX 3.6 SDK, illustrating that SDK-side configuration must be validated against the installed runtime, not just the latest docs.

See Also

  • PaddleOCR-VL and PaddleOCR-VL-1.5 release notes — flagship VLM-based document parsing
  • PP-OCRv6 architecture — unified multilingual OCR engine
  • PP-StructureV3 — structure-aware Markdown/JSON conversion with cell-level coordinates
  • deploy/README.md — deployment options matrix
  • api_sdk/README.md — multi-language SDK layout

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 14 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: Link Checker Report
  • User impact: Developers may fail before the first successful local run: Link Checker Report
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Link Checker Report. Context: Observed when using python
  • Evidence: failure_mode_cluster:github_issue | https://github.com/PaddlePaddle/PaddleOCR/issues/18134

2. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/PaddlePaddle/PaddleOCR/issues/17974

3. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/PaddlePaddle/PaddleOCR/issues/18157

4. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/PaddlePaddle/PaddleOCR/issues/18194

5. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/PaddlePaddle/PaddleOCR/issues/17974

6. Configuration risk: requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project. Related report: "Link Checker Report".
  • User impact: Developers may misconfigure credentials, the environment, or the host setup. Related report: "Link Checker Report".
  • Recommended check: Before packaging this project, run the relevant install, configuration, or quickstart check for "Link Checker Report". Context: the source discussion does not state a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/PaddlePaddle/PaddleOCR/issues/18157

7. Configuration risk: requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project. Related report: "PaddleOCR-VL HPS: returnMarkdownImages=false is ineffective with default PaddleX 3.6 SDK".
  • User impact: Developers may misconfigure credentials, the environment, or the host setup. Related report: "PaddleOCR-VL HPS: returnMarkdownImages=false is ineffective with default PaddleX 3.6 SDK".
  • Recommended check: Before packaging this project, run the relevant install, configuration, or quickstart check for "PaddleOCR-VL HPS: returnMarkdownImages=false is ineffective with default PaddleX 3.6 SDK". Context: observed when using Python and Docker.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/PaddlePaddle/PaddleOCR/issues/18194

8. Capability evidence risk: requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/PaddlePaddle/PaddleOCR

9. Maintenance risk: requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/PaddlePaddle/PaddleOCR

10. Security or permission risk: requires verification

  • Severity: medium
  • Finding: no_demo (no demo is recorded for the project).
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | https://github.com/PaddlePaddle/PaddleOCR

11. Security or permission risk: requires verification

  • Severity: medium
  • Finding: no_demo (no demo is recorded for the project).
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | https://github.com/PaddlePaddle/PaddleOCR

12. Runtime risk: requires verification

  • Severity: low
  • Finding: Developers should check this performance risk in release v3.7.0 before relying on the project.
  • User impact: Upgrading or migrating to v3.7.0 may change expected behavior.
  • Recommended check: Before packaging this project, run the relevant install, configuration, or quickstart check against v3.7.0. Context: observed when using CUDA.
  • Evidence: failure_mode_cluster:github_release | https://github.com/PaddlePaddle/PaddleOCR/releases/tag/v3.7.0

Source: Doramagic discovery, validation, and Project Pack records
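Most items above recommend reproducing the official install and quickstart path in an isolated environment. The sketch below shows one way to do that with Python's standard-library `venv` and `subprocess` modules. It is an illustration, not part of PaddleOCR or Doramagic: the helper names (`env_python`, `build_check_commands`, `run_isolated_check`) are hypothetical, the package list `paddlepaddle`, `paddleocr` is an assumption, and `import paddleocr` is only a minimal stand-in for the official quickstart command. Take the exact PaddlePaddle build (CPU or GPU, matching CUDA version) from the official installation guide.

```python
import subprocess
import sys
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment (hypothetical helper)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def build_check_commands(python: Path, packages: list[str]) -> list[list[str]]:
    """Assemble the install step plus a minimal import smoke test (hypothetical helper)."""
    return [
        [str(python), "-m", "pip", "install", *packages],
        # Stand-in for the official quickstart: it only proves the package imports.
        [str(python), "-c", "import paddleocr"],
    ]


def run_isolated_check(env_dir: Path, packages: list[str], dry_run: bool = True) -> list[list[str]]:
    """Create a clean venv and, unless dry_run is set, execute the check commands."""
    # clear=True wipes leftovers from a previous attempt; pip is bootstrapped
    # only when the commands will actually be executed.
    venv.EnvBuilder(clear=True, with_pip=not dry_run).create(env_dir)
    commands = build_check_commands(env_python(env_dir), packages)
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Assumed package list; replace with the builds named in the official install guide.
    for cmd in run_isolated_check(Path("paddleocr-check-env"), ["paddlepaddle", "paddleocr"]):
        print(" ".join(cmd))
```

By default the script performs a dry run: it creates an empty environment and prints the commands without installing anything. Passing `dry_run=False` executes them, and the first failing step raises `subprocess.CalledProcessError`, which is the failure you want to surface before packaging.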

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 project-level external discussion links are exposed on this manual page.

Use: review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using PaddleOCR with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence