# https://github.com/Unstructured-IO/unstructured Project Manual

Generated at: 2026-06-21 09:22:50 UTC

## Table of Contents

- [Overview, Installation, and Quick Start](#page-1)
- [Document Partitioning Pipeline](#page-2)
- [Elements, Chunking, and Output Formats](#page-3)
- [Embeddings, Connectors, and Metrics](#page-4)

<a id='page-1'></a>

## Overview, Installation, and Quick Start

### Related Pages

Related topics: [Document Partitioning Pipeline](#page-2), [Elements, Chunking, and Output Formats](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/Unstructured-IO/unstructured/blob/main/README.md)
- [example-docs/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/README.md)
- [unstructured/partition/utils/ocr_models/ocr_interface.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py)
- [unstructured/partition/utils/ocr_models/tesseract_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/tesseract_ocr.py)
- [unstructured/partition/utils/ocr_models/paddle_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/paddle_ocr.py)
- [unstructured/partition/utils/ocr_models/google_vision_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/google_vision_ocr.py)
- [unstructured/partition/utils/sorting.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/sorting.py)
- [unstructured/partition/utils/config.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/config.py)
- [scripts/performance/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/performance/README.md)
- [unstructured/embed/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/embed/README.md)
</details>

# Overview, Installation, and Quick Start

## What is `unstructured`

The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents — PDFs, HTML, Word documents, XML, and many more. The library is purpose-built to streamline the data processing workflow for Large Language Models (LLMs): its modular partitioning functions and connectors form a cohesive system that turns unstructured data into structured, element-typed outputs suitable for downstream pipelines such as RAG, embedding, and chunking.

Source: [README.md:3-15]()

The recommended entry point is `unstructured.partition.auto.partition`, which detects the file type and routes it to a format-specific partitioning function. Each file type (PDF, DOCX, HTML, …) has its own partitioner that returns a list of `Element` objects (e.g., `Title`, `NarrativeText`, `Table`, `ListItem`).

## Installation

There are three supported installation paths described in the project README. The choice depends on whether you want a quick local install, a reproducible container, or a full development environment.

### 1. Install from PyPI

The base package is published to PyPI. Extras pull in document-type-specific dependencies:

```bash
pip install "unstructured[docx]"
pip install "unstructured[pdf]"
```

For Windows users running `conda`, the README points to the dedicated Windows install guide in the project docs.

Source: [README.md:55-78]()

### 2. Run the library in a container

A multi-platform Docker image is built for every push to `main`, tagged with the short commit hash and the application version. This is the fastest way to get a working environment without worrying about OS-level dependencies such as `poppler` or Tesseract language packs.

```bash
docker pull unstructuredio/unstructured:latest
```

The README also documents platform selection (e.g., `--platform linux/amd64`) for Apple silicon users.

Source: [README.md:62-78]()

### 3. Install for local development

The repo ships a `Makefile` and an optional `pre-commit` configuration. The typical dev workflow is:

```bash
make install
pre-commit install       # optional, for auto-formatting on commit
make check               # show lint/format diffs without applying
make tidy                # apply lint/format fixes
make docker-start-dev    # run a dev container with the repo mounted
```

Source: [README.md:30-50]()

```mermaid
flowchart LR
    A[Choose install path] --> B{Use container?}
    B -- yes --> C[docker pull<br/>unstructuredio/unstructured]
    B -- no --> D{pip or dev?}
    D -- pip --> E["pip install<br/>'unstructured[pdf,docx,...]'"]
    D -- dev --> F[make install<br/>pre-commit install]
    E --> G[Run partition&#40;&#41;]
    F --> G
    C --> G
```

## Quick Start

The canonical quick start, taken directly from the README, uses the auto-router and the included `layout-parser-paper.pdf` example:

```python
from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

print("\n\n".join([str(el) for el in elements]))
```

The output is a list of `Element` objects. Stringifying an element yields its text only; the detected type is carried by the element's class (for example `Title` or `NarrativeText`).

Source: [README.md:82-105]()
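To see which element types a document produced, one option is to tally class names next to the text. The sketch below assumes that `str(el)` returns the element's text and that the class name carries the type. `summarize_elements` is a hypothetical helper, and the two small classes are stand-ins so the snippet runs without `unstructured` installed.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name and pair each type with its text."""
    counts = Counter(type(el).__name__ for el in elements)
    lines = [f"[{type(el).__name__}] {el}" for el in elements]
    return counts, lines


# Stand-in classes so the sketch runs without `unstructured` installed;
# with the real library, pass the list returned by partition() instead.
class Title:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class NarrativeText(Title):
    pass


counts, lines = summarize_elements(
    [Title("LayoutParser"), NarrativeText("Recent advances ..."), NarrativeText("We present ...")]
)
print(counts)
print("\n".join(lines))
```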

The repository also ships example documents that are useful as smoke tests:

| File | Format | Purpose |
|------|--------|---------|
| `example-docs/layout-parser-paper.pdf` | PDF | Tests PDF partitioning and OCR strategies |
| `example-docs/example-10k.html` | HTML | Tests HTML partitioning and table metadata |
| `example-docs/factbook.xml` / `factbook.xsl` | XML / XSL | Tests stylesheet-aware XML partitioning |

Source: [example-docs/README.md:3-21]()

For production batch ingestion across many sources, the README points to a separate companion repository, [unstructured-ingest](https://github.com/Unstructured-IO/unstructured-ingest).

## Configuration, OCR Agents, and Diagnostics

### OCR agent selection

`partition_pdf` and other partitioners can delegate text extraction from images to an OCR backend. The selection is driven by the `OCR_AGENT` environment variable, parsed by `OCRAgent.get_agent()` in the OCR interface. The whitelist of supported agents is defined as module-level constants.

Source: [unstructured/partition/utils/ocr_models/ocr_interface.py:18-34]()

The currently shipped implementations are:

| Class | Backend | Notes |
|-------|---------|-------|
| `OCRAgentTesseract` | Tesseract (via `unstructured_pytesseract`) | Default; forces `OMP_THREAD_LIMIT=1` for performance. Source: [unstructured/partition/utils/ocr_models/tesseract_ocr.py:30-50]() |
| `OCRAgentPaddle` | PaddleOCR (via `unstructured_paddleocr`) | Auto-selects CPU/GPU; uses MKL-DNN when available. Source: [unstructured/partition/utils/ocr_models/paddle_ocr.py:24-45]() |
| `OCRAgentGoogleVision` | Google Cloud Vision | Configurable endpoint via `GOOGLEVISION_API_ENDPOINT`. Source: [unstructured/partition/utils/ocr_models/google_vision_ocr.py:26-44]() |

Community request [#2467](https://github.com/Unstructured-IO/unstructured/issues/2467) asks for a way to deactivate OCR while keeping `strategy="hi_res"` — useful for text-based PDFs that still need layout analysis but not Tesseract overhead.

### Environment configuration

Runtime knobs (OCR backend, temp directories, agent cache size, etc.) live in `ENVConfig` and are read from environment variables. Constants used for `OCR_AGENT` aliases (e.g., `OCR_AGENT_TESSERACT`, `OCR_AGENT_PADDLE_OLD`) are imported from `unstructured.partition.utils.constants`.

Source: [unstructured/partition/utils/config.py:14-58](), [unstructured/partition/utils/ocr_models/ocr_interface.py:20-28]()
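An environment-backed config object of this kind can be sketched as a class whose properties read `os.environ` on access. This is an illustration of the pattern only, not the code in `config.py`: the class name `EnvConfigSketch`, the helper names, the environment variable name `OCR_AGENT_CACHE_SIZE`, and the default of `1` are all assumptions.

```python
import os


class EnvConfigSketch:
    """Hypothetical stand-in for an env-backed config object."""

    def _get_int(self, name, default):
        raw = os.environ.get(name)
        return int(raw) if raw not in (None, "") else default

    def _get_str(self, name, default):
        return os.environ.get(name, default)

    @property
    def OCR_AGENT_CACHE_SIZE(self):
        # The variable name and the default of 1 are assumptions for this sketch.
        return self._get_int("OCR_AGENT_CACHE_SIZE", 1)

    @property
    def OCR_AGENT(self):
        # Falls back to the Tesseract agent, which this page describes as the default.
        return self._get_str(
            "OCR_AGENT",
            "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract",
        )


config = EnvConfigSketch()
os.environ["OCR_AGENT_CACHE_SIZE"] = "4"
print(config.OCR_AGENT_CACHE_SIZE)  # read at access time, so it reflects the new value
```

Reading at access time rather than at import time means a value set after the config object is created is still picked up.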

### Diagnostics

As of release `0.22.26`, the `unstructured doctor` CLI command prints the resolved environment and dependency status, which helps when triaging install or OCR issues.

Source: release notes for `0.22.26`.

### Telemetry

Telemetry is **off by default**. To opt in, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `=1`) before importing `unstructured`. To opt out, set `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` to any non-empty value — opt-out takes precedence over opt-in.

Source: [README.md:152-160]()
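The precedence rule above can be written as a small predicate. This is a sketch of the documented behaviour, not the library's own check; `telemetry_enabled` is a hypothetical name, and tolerating mixed case or surrounding whitespace in the opt-in value is an assumption.

```python
def telemetry_enabled(env):
    """Apply the documented precedence: opt-out wins over opt-in, default is off."""
    if env.get("DO_NOT_TRACK") or env.get("SCARF_NO_ANALYTICS"):
        return False
    opt_in = env.get("UNSTRUCTURED_TELEMETRY_ENABLED", "")
    # Accepting surrounding whitespace and mixed case is an assumption of this sketch.
    return opt_in.strip().lower() in {"true", "1"}


print(telemetry_enabled({}))
print(telemetry_enabled({"UNSTRUCTURED_TELEMETRY_ENABLED": "true"}))
print(telemetry_enabled({"UNSTRUCTURED_TELEMETRY_ENABLED": "1", "DO_NOT_TRACK": "1"}))
```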

## Performance and Embedding

For performance work, the `scripts/performance/` directory ships two helpers: a benchmark script that publishes partitioning timings to S3 (gated by `PUBLISH_RESULTS=true`), and a profiling script that uses `py-spy` to emit `speedscope`-viewable flame graphs.

Source: [scripts/performance/README.md:3-37]()

The `unstructured.embed` module has been **moved to** [`unstructured-ingest`](https://github.com/Unstructured-IO/unstructured-ingest) and is marked unmaintained; it will be removed from this repository in the near future.

Source: [unstructured/embed/README.md:1-7]()

## Common Pitfalls (from community)

- **Two-column PDFs**: `partition_pdf` does not always recover reading order on two-column layouts. Tracking issue: [#356](https://github.com/Unstructured-IO/unstructured/issues/356). Element ordering is post-processed by `recursive_xy_cut` in the XY-cut sort mode, or by a simpler coordinate sort in `SORT_MODE_BASIC`.
  Source: [unstructured/partition/utils/sorting.py:1-40](), [unstructured/partition/utils/xycut.py:1-20]()
- **NumPy 2.0 compatibility**: Requires bumping `onnxruntime` to 1.19 and patching type/sort behavior. Tracking issue: [#3684](https://github.com/Unstructured-IO/unstructured/issues/3684).
- **`pdfminer.six` pin**: `unstructured[pdf]==0.16.11` pulls in a `pdfminer.six` version that breaks `langchain_community.document_loaders`. Pin a compatible version until upstream fixes it. Tracking issue: [#3982](https://github.com/Unstructured-IO/unstructured/issues/3982).
- **Numbers in `metadata.text_as_html`**: Table HTML serialization can render large integers in scientific notation. Tracking issue: [#3871](https://github.com/Unstructured-IO/unstructured/issues/3871).

## See Also

- [Core Functionality: Partitioning](https://docs.unstructured.io/open-source/core-functionality/partitioning)
- [Concepts: Document Elements](https://docs.unstructured.io/open-source/concepts/document-elements)
- [Connectors / Ingest CLI](https://docs.unstructured.io/open-source/ingest/overview)
- Companion repo: [Unstructured-IO/unstructured-ingest](https://github.com/Unstructured-IO/unstructured-ingest)

---

<a id='page-2'></a>

## Document Partitioning Pipeline

### Related Pages

Related topics: [Overview, Installation, and Quick Start](#page-1), [Elements, Chunking, and Output Formats](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/Unstructured-IO/unstructured/blob/main/README.md)
- [unstructured/partition/utils/constants.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/constants.py)
- [unstructured/partition/utils/ocr_models/ocr_interface.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py)
- [unstructured/partition/utils/ocr_models/tesseract_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/tesseract_ocr.py)
- [unstructured/partition/utils/ocr_models/paddle_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/paddle_ocr.py)
- [unstructured/partition/utils/ocr_models/google_vision_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/google_vision_ocr.py)
- [unstructured/partition/utils/sorting.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/sorting.py)
- [unstructured/partition/utils/speech_to_text/speech_to_text_interface.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/speech_to_text/speech_to_text_interface.py)
- [unstructured/partition/utils/speech_to_text/whisper_stt.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/speech_to_text/whisper_stt.py)
- [scripts/performance/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/performance/README.md)

</details>

# Document Partitioning Pipeline

## Overview and Entry Points

The `unstructured` library provides a unified pipeline for converting heterogeneous unstructured files (PDFs, images, HTML, audio, and many more formats) into a normalized stream of structured `Element` objects suitable for downstream LLM ingestion. From the README, the canonical entry point is `unstructured.partition.auto.partition`, which detects file type and routes to the appropriate file-specific partitioner. The library advertises partitioning as its "core functionality" with a full list of options documented in the [partitioning docs](https://docs.unstructured.io/open-source/core-functionality/partitioning). Source: [README.md]().

The pipeline is intentionally pluggable: rather than coupling detection, OCR, layout analysis, and ordering, the project exposes abstract base classes for each role and ships concrete implementations for popular services. This lets users swap OCR backends or sorting algorithms without modifying the higher-level partitioner code, and it lets maintainers add new strategies incrementally.

## Partitioning Strategies and Source Identifiers

A single document may be parsed with different cost/quality trade-offs. The strategy namespace is declared as a class of string constants in [unstructured/partition/utils/constants.py:18-22]() with the four values `AUTO`, `FAST`, `OCR_ONLY`, and `HI_RES`. `FAST` skips model inference and uses only the text layer; `HI_RES` runs full layout detection plus OCR; `OCR_ONLY` bypasses the layout model and OCRs everything; `AUTO` lets the partitioner pick per file. Source: [unstructured/partition/utils/constants.py:18-22]().
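A thin wrapper can reject a mistyped strategy before any heavy dependencies load. The sketch below assumes the four constants map to the lowercase strings `auto`, `fast`, `ocr_only`, and `hi_res`, and that the PDF extra is installed when the function is actually called. `partition_pdf_with_strategy` is a hypothetical helper name.

```python
VALID_STRATEGIES = ("auto", "fast", "ocr_only", "hi_res")


def partition_pdf_with_strategy(path, strategy="auto"):
    """Validate the strategy string, then delegate to partition_pdf."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}")
    # Imported lazily so the validation above works even without the PDF extra.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, strategy=strategy)
```

A call such as `partition_pdf_with_strategy("example-docs/layout-parser-paper.pdf", "fast")` would then use only the text layer, per the description above.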

The same module defines a `Source` enum that tags every produced element with the subsystem that produced its text — currently `PDFMINER`, `OCR_TESSERACT`, `OCR_PADDLE`, and `OCR_GOOGLEVISION`. This provenance tag is the basis of the `detection_origin` metadata field referenced in the 0.23.0 release notes ("add enrichment origins metadata field"), enabling users to trace why a piece of text appeared in the output. A second enum, `OCRMode`, distinguishes `INDIVIDUAL_BLOCKS` from `FULL_PAGE`, which controls whether per-block layout coordinates are preserved. Source: [unstructured/partition/utils/constants.py:1-30]().

## OCR Subsystem

The OCR layer is abstracted behind `OCRAgent` in [unstructured/partition/utils/ocr_models/ocr_interface.py](). The class exposes two main methods — `get_text_from_image` for raw string extraction and `get_layout_from_image` for layout-aware regions. Agent selection is environment-driven: `OCR_AGENT` resolves to a fully-qualified module/class name, validated against `OCR_AGENT_MODULES_WHITELIST` to prevent arbitrary import. Instances are cached with `functools.lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE)` so a single agent object is reused across calls, avoiding repeated model loads. Source: [unstructured/partition/utils/ocr_models/ocr_interface.py:18-42]().
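The selection pattern described here (resolve a dotted name, check the module against a whitelist, import it, cache the instance) can be sketched with the standard library alone. This is not the code in `ocr_interface.py`: `get_instance` and `MODULES_WHITELIST` are hypothetical names, and `json.decoder.JSONDecoder` is just a stand-in class that happens to be importable everywhere.

```python
import functools
import importlib

# Stand-in whitelist; the real one lists OCR agent modules.
MODULES_WHITELIST = ("json.decoder",)


@functools.lru_cache(maxsize=1)
def get_instance(qualified_name):
    """Import `module.Class` only if the module is whitelisted, then instantiate it."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if module_name not in MODULES_WHITELIST:
        raise ValueError(f"module {module_name!r} is not in the whitelist")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()


first = get_instance("json.decoder.JSONDecoder")
second = get_instance("json.decoder.JSONDecoder")
print(first is second)  # the cached instance is reused
```

Checking the whitelist before calling `importlib.import_module` is what keeps an arbitrary environment value from triggering an arbitrary import.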

Three concrete implementations ship in-tree:

1. **Tesseract** (`OCRAgentTesseract`) — Wraps `unstructured_pytesseract` and forces `OMP_THREAD_LIMIT=1` to avoid contention in multi-threaded hosts. It returns sorted text and exposes an HOCR namespace for parsing word-level geometry. Source: [unstructured/partition/utils/ocr_models/tesseract_ocr.py:17-44]().
2. **PaddleOCR** (`OCRAgentPaddle`) — Loads PaddlePaddle with `disable_signal_handler()` and prefers GPU when available, falling back to CPU with MKL-DNN when supported. Source: [unstructured/partition/utils/ocr_models/paddle_ocr.py:21-39]().
3. **Google Cloud Vision** (`OCRAgentGoogleVision`) — Calls `document_text_detection`, with the API endpoint overrideable via `GOOGLEVISION_API_ENDPOINT` and language hints passed through `ImageContext`. Source: [unstructured/partition/utils/ocr_models/google_vision_ocr.py:21-38]().

Community issue [#2467](https://github.com/Unstructured-IO/unstructured/issues/2467) tracks a request to optionally disable OCR inside `hi_res` to speed up text-heavy PDFs, which would require a new flag on the OCR abstraction rather than a partitioner-level change.

## Layout Sorting and Reading Order

Once regions are detected, the partitioner must decide their reading order. The XY-cut algorithm and a fallback "basic" mode are exposed in [unstructured/partition/utils/sorting.py](). The module converts element coordinates to `(left, top, right, bottom)` bounding boxes via `coordinates_to_bbox`, applies `shrink_bbox` for tolerance, then dispatches to either `recursive_xy_cut` or `recursive_xy_cut_swapped` depending on the configured primary cut direction. Source: [unstructured/partition/utils/sorting.py:18-44]().

The three sort modes `SORT_MODE_XY_CUT`, `SORT_MODE_BASIC`, and `SORT_MODE_DONT` are exported from `constants.py`. Community issue [#356](https://github.com/Unstructured-IO/unstructured/issues/356) highlights a known limitation: two-column PDFs are sometimes read out of order because the XY-cut heuristic can fail on dense academic layouts, motivating ongoing improvements to the sort step.
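To show why sort mode matters on two-column pages, here is a deliberately simplified XY-cut written from the general idea of the algorithm (split at whitespace gaps, recurse). It is not the implementation in `xycut.py`; `xy_cut`, `_split_by_gaps`, and the `x_first` flag are hypothetical names, and boxes are plain `(left, top, right, bottom)` tuples.

```python
def _split_by_gaps(boxes, lo, hi):
    """Group boxes separated by empty gaps along one axis (lo/hi are tuple indices)."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups, current, reach = [], [ordered[0]], ordered[0][hi]
    for box in ordered[1:]:
        if box[lo] > reach:  # a gap: no earlier box extends this far
            groups.append(current)
            current = [box]
        else:
            current.append(box)
        reach = max(reach, box[hi])
    groups.append(current)
    return groups


def xy_cut(boxes, x_first=True):
    """Return boxes in reading order by recursively cutting at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    axes = [(0, 2), (1, 3)] if x_first else [(1, 3), (0, 2)]
    for lo, hi in axes:
        groups = _split_by_gaps(boxes, lo, hi)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g, x_first)]
    # No gap on either axis: fall back to top-then-left order.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# Two columns, two paragraphs each: A and B on the left, C and D on the right.
A, B = (0, 0, 40, 10), (0, 20, 40, 30)
C, D = (60, 0, 100, 10), (60, 20, 100, 30)

xy_order = xy_cut([C, A, D, B])
basic_order = sorted([C, A, D, B], key=lambda b: (b[1], b[0]))
print(xy_order)
print(basic_order)
```

The cut-based order reads the left column before the right one, while the plain top-then-left sort interleaves the columns row by row.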

## Multimodal Partitioning (Speech-to-Text)

Audio and video files are handled through a parallel speech-to-text (STT) pipeline declared in [unstructured/partition/utils/speech_to_text/speech_to_text_interface.py](). `SpeechToTextAgent` mirrors `OCRAgent` in spirit: a module path is resolved from the `STT_AGENT` env var (defaulting to Whisper), validated against `STT_AGENT_MODULES_WHITELIST`, and cached via `lru_cache`. The interface returns a list of `TranscriptionSegment` typed dicts (`text`, `start`, `end`). Source: [unstructured/partition/utils/speech_to_text/speech_to_text_interface.py:17-58]().

The default Whisper agent (`SpeechToTextAgentWhisper`) snapshots model size, device, and FP16 from env vars (`WHISPER_MODEL_SIZE`, `WHISPER_DEVICE`, `WHISPER_FP16`) at construction time. A per-instance `threading.Lock` serializes calls because `whisper.model.transcribe()` is not documented as thread-safe, which puts a hidden throughput ceiling on concurrent workloads. For true parallelism the docstring recommends process-based concurrency. The agent may return empty or whitespace-only segments; the audio partitioner is the single place that strips and drops them, keeping agent implementations simple. Source: [unstructured/partition/utils/speech_to_text/whisper_stt.py:25-50](), [unstructured/partition/utils/speech_to_text/speech_to_text_interface.py:13-22]().
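The two responsibilities described above (serializing calls inside the agent, cleaning segments in the partitioner) can be sketched separately. The class `LockedAgentSketch`, the function `drop_empty_segments`, and `fake_backend` are hypothetical; only the `text`, `start`, and `end` keys come from the interface described in this section.

```python
import threading
from typing import List, TypedDict


class TranscriptionSegment(TypedDict):
    text: str
    start: float
    end: float


class LockedAgentSketch:
    """Hypothetical agent that serializes calls to a non-thread-safe backend."""

    def __init__(self, backend):
        self._backend = backend
        self._lock = threading.Lock()

    def transcribe(self, path: str) -> List[TranscriptionSegment]:
        with self._lock:  # only one transcription runs at a time per instance
            return self._backend(path)


def drop_empty_segments(segments: List[TranscriptionSegment]) -> List[TranscriptionSegment]:
    """Strip each segment's text and discard those left empty."""
    cleaned = []
    for seg in segments:
        text = seg["text"].strip()
        if text:
            cleaned.append({"text": text, "start": seg["start"], "end": seg["end"]})
    return cleaned


def fake_backend(path):
    # Stand-in for a real transcription call; returns one usable and one blank segment.
    return [
        {"text": " hello ", "start": 0.0, "end": 1.2},
        {"text": "   ", "start": 1.2, "end": 2.0},
    ]


agent = LockedAgentSketch(fake_backend)
segments = drop_empty_segments(agent.transcribe("talk.wav"))
print(segments)
```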

## Pipeline Flow

```mermaid
flowchart LR
    A[Input file] --> B[partition function]
    B --> C{File type?}
    C -->|PDF/Image| D[Layout detection]
    D --> E{strategy?}
    E -->|fast| F[Text layer extraction]
    E -->|hi_res| G[Model + OCR]
    G --> H[OCRAgent<br/>Tesseract/Paddle/Vision]
    F --> I[XY-cut sorting]
    H --> I
    I --> J[Element stream]
    C -->|Audio| K[SpeechToTextAgent<br/>Whisper]
    K --> J
```

## Operational Concerns

- **Telemetry is off by default**; opt in with `UNSTRUCTURED_TELEMETRY_ENABLED=true`, opt out with any non-empty value of `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS`. Source: [README.md]().
- **Dependency selection matters**: `unstructured[pdf]==0.16.11` resolved `pdfminer.six 20250327` and broke imports — see community issue [#3982](https://github.com/Unstructured-IO/unstructured/issues/3982). Pin compatible versions in production environments.
- **Performance tooling**: `scripts/performance/README.md` documents `benchmark.sh` and `profile.sh`, plus an integration with [speedscope](https://www.speedscope.app/) for flame-graph inspection of partition runs. Source: [scripts/performance/README.md:1-15]().
- **Confidence scores** are not yet exposed on extracted elements; feature request [#4320](https://github.com/Unstructured-IO/unstructured/issues/4320) proposes surfacing per-element model confidence to help RAG pipelines triage low-quality extractions such as rotated tables or scanned handwriting.
- **Output formatting**: feature request [#3525](https://github.com/Unstructured-IO/unstructured/issues/3525) requests Markdown extraction alongside JSON, since JSON is not always ideal as a direct LLM input.
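For the dependency-pinning item above, a startup check can flag a known-problematic version before partitioning is attempted. This sketch uses only `importlib.metadata`; `installed_version`, `flagged_packages`, and the `KNOWN_BAD` table are hypothetical, and the single entry reflects the `pdfminer.six` version named in the issue.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


KNOWN_BAD = {"pdfminer.six": {"20250327"}}  # version named in the issue described above


def flagged_packages(known_bad=KNOWN_BAD):
    """List (name, version) pairs whose installed version is in the known-bad set."""
    flagged = []
    for name, bad_versions in known_bad.items():
        version = installed_version(name)
        if version in bad_versions:
            flagged.append((name, version))
    return flagged


print(flagged_packages())
```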

## See Also

- [`unstructured.partition.auto.partition` entry point](https://docs.unstructured.io/open-source/core-functionality/partitioning)
- [OCR agent implementations](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/partition/utils/ocr_models)
- [Layout sorting utilities](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/sorting.py)
- [Speech-to-text agents](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/partition/utils/speech_to_text)
- Release notes for 0.22.x and 0.23.x documenting recent pipeline fixes (table chunking with `isolate_table`, dense-PDF text recovery, AcroForm field extraction)

---

<a id='page-3'></a>

## Elements, Chunking, and Output Formats

### Related Pages

Related topics: [Document Partitioning Pipeline](#page-2), [Embeddings, Connectors, and Metrics](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/Unstructured-IO/unstructured/blob/main/README.md)
- [example-docs/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/README.md)
- [unstructured/partition/utils/ocr_models/ocr_interface.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py)
- [unstructured/partition/utils/ocr_models/tesseract_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/tesseract_ocr.py)
- [unstructured/partition/utils/ocr_models/paddle_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/paddle_ocr.py)
- [unstructured/partition/utils/ocr_models/google_vision_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/google_vision_ocr.py)
- [unstructured/partition/utils/sorting.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/sorting.py)
- [unstructured/partition/utils/xycut.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/xycut.py)
- [unstructured/partition/utils/config.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/config.py)
- [unstructured/partition/utils/constants.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/constants.py)
- [scripts/performance/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/performance/README.md)
- [unstructured/embed/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/embed/README.md)
</details>

# Elements, Chunking, and Output Formats

The `unstructured` library is designed around three tightly connected concepts: **Elements** (the typed content objects produced by partitioning), **Chunking** (grouping elements into LLM-friendly units), and **Output Formats** (how those elements and chunks are serialized for downstream consumers). This page documents the structure of each layer, the configuration knobs available, and known limitations observed in community discussions.

## Element Model and Coordinates

Partitioning functions produce a list of `Element` objects, each carrying a typed classification (e.g., `Title`, `NarrativeText`, `Table`, `Image`) plus optional positional and provenance metadata. The `Source` enum in `unstructured/partition/utils/constants.py` declares where an element's text was derived from, including `PDFMINER`, `OCR_TESSERACT`, `OCR_PADDLE`, and `OCR_GOOGLEVISION`. Source: [unstructured/partition/utils/constants.py:Source](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/constants.py)

Spatial layout matters for downstream sorting and chunking. `CoordinatesMetadata` is normalized into `(left, top, right, bottom)` bounding boxes via `coordinates_to_bbox`, and `shrink_bbox` contracts a box by a configurable factor while preserving its top-left corner. Source: [unstructured/partition/utils/sorting.py:coordinates_to_bbox](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/sorting.py)
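The two box operations can be illustrated in a few lines. This is a sketch of the described behaviour, not a copy of `sorting.py`: the names `bbox_from_points` and `shrink_keep_top_left` are hypothetical, and taking the min/max over all corner points is an assumption about how the box is derived.

```python
def bbox_from_points(points):
    """Collapse a sequence of (x, y) corner points into (left, top, right, bottom)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def shrink_keep_top_left(bbox, factor):
    """Scale width and height by `factor`, keeping the top-left corner fixed."""
    left, top, right, bottom = bbox
    return (left, top, left + (right - left) * factor, top + (bottom - top) * factor)


corners = ((10, 20), (10, 60), (110, 60), (110, 20))
box = bbox_from_points(corners)
shrunk = shrink_keep_top_left(box, 0.5)
print(box)
print(shrunk)
```

Shrinking boxes slightly before sorting opens small gaps between regions that touch or barely overlap, which gap-based ordering relies on.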

When the layout detection backend produces overlapping or rotated boxes, the `xy-cut` algorithm (`recursive_xy_cut`, `recursive_xy_cut_swapped`) is invoked to produce a deterministic reading order. Source: [unstructured/partition/utils/xycut.py:recursive_xy_cut](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/xycut.py)

## OCR Backends and Element Provenance

Elements originating from image regions rely on one of three OCR agents, all implementing the `OCRAgent` abstract base class:

| Backend | Module | Notes |
|---|---|---|
| Tesseract | `unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract` | Default; sets `OMP_THREAD_LIMIT=1` to avoid multi-threading performance problems. Source: [tesseract_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/tesseract_ocr.py) |
| PaddleOCR | `unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle` | GPU auto-detected via `paddle.device.cuda.device_count()`. Source: [paddle_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/paddle_ocr.py) |
| Google Vision | `unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision` | Honors `GOOGLEVISION_API_ENDPOINT` env var. Source: [google_vision_ocr.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/google_vision_ocr.py) |

Selection is governed by the `OCR_AGENT` environment variable, and the resolved module must appear in the `OCR_AGENT_MODULES_WHITELIST` constant (overridable via env var of the same name). Source: [unstructured/partition/utils/ocr_models/ocr_interface.py:OCRAgent.get_agent](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py) and [constants.py:OCR_AGENT_MODULES_WHITELIST](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/constants.py)
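Because `OCR_AGENT` takes the fully qualified class path, a small lookup table helps avoid typos. The dotted paths below are copied from the table above; the short aliases (`tesseract`, `paddle`, `google_vision`) and the helper `select_ocr_agent` are hypothetical conveniences for this sketch.

```python
import os

OCR_AGENT_PATHS = {
    "tesseract": "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract",
    "paddle": "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    "google_vision": "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision",
}


def select_ocr_agent(alias, env=os.environ):
    """Write the fully qualified agent path for `alias` into OCR_AGENT."""
    try:
        env["OCR_AGENT"] = OCR_AGENT_PATHS[alias]
    except KeyError:
        raise ValueError(f"unknown OCR alias {alias!r}") from None
    return env["OCR_AGENT"]


demo_env = {}
print(select_ocr_agent("paddle", demo_env))
```

In real use the variable should be set before the first partitioning call, since this page describes the agent as being resolved from the environment at selection time.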

## Chunking Pipeline

Chunking groups adjacent elements so that downstream embeddings and LLM context windows operate on coherent units rather than individual lines. Releases between 0.22.23 and 0.23.1 added several refinements that affect chunk content:

- **First table chunk preserves column/row span** (0.22.23) to keep tabular structure intact.
- **Table chunking option** (0.22.30): opt-in table-aware chunking.
- **Renamed `isolate_tables` → `isolate_table`** (0.22.31) for singular consistency.
- **AcroForm field text extraction** in PDFs (0.23.1) so form values flow into chunks rather than being dropped.

Chunk sizing is controlled through keyword arguments on the chunking functions (for example `max_characters`). Environment-level switches that shape the upstream partitioning step, such as the OCR agent cache size and global working directory, are read by `ENVConfig`. Source: [unstructured/partition/utils/config.py:ENVConfig](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/config.py)
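The core idea of chunking, packing consecutive element texts up to a size limit, can be shown with a greedy loop. This is a simplified illustration rather than the library's chunker, which also handles titles, tables, and overlap; `chunk_texts` and its `separator` parameter are hypothetical.

```python
def chunk_texts(texts, max_characters=500, separator="\n\n"):
    """Greedily pack consecutive texts into chunks of at most max_characters.

    A single text longer than the limit is kept whole in this sketch.
    """
    chunks, current = [], ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_texts(["Intro", "First paragraph.", "Second paragraph."], max_characters=25)
print(chunks)
```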

```mermaid
flowchart LR
    A[partition*] --> B[Element list]
    B --> C[Chunking]
    C --> D[PreChunkFilter]
    D --> E[TextPreChunk]
    E --> F[Chunk elements]
    F --> G[Output: JSON / NDJSON]
```

Telemetry around this pipeline is **off by default** and gated by `UNSTRUCTURED_TELEMETRY_ENABLED`. Source: [README.md:Analytics](https://github.com/Unstructured-IO/unstructured/blob/main/README.md)

## Output Formats and Serialization

By default, `partition("...")` returns a Python list of `Element` objects; calling `str()` on an element yields its text. For inter-process serialization, the library emits JSON-compatible dicts and NDJSON streams. NDJSON file-type detection was hardened in 0.22.27. Source: [Release 0.22.27](https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.27)
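NDJSON is simply one JSON object per line, so a round trip needs only the standard `json` module. The sketch below works on plain dicts; the `type` and `text` keys in the sample records are illustrative, and `to_ndjson` / `from_ndjson` are hypothetical helper names.

```python
import json


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON."""
    return "\n".join(json.dumps(r) for r in records)


def from_ndjson(payload):
    """Parse NDJSON text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]


records = [
    {"type": "Title", "text": "LayoutParser"},
    {"type": "NarrativeText", "text": "Line one\nline two"},
]
payload = to_ndjson(records)
print(payload)
```

Newlines inside string values are escaped by `json.dumps`, so each record still occupies exactly one line.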

Table elements expose HTML through `chunk.metadata.text_as_html`. Community issue #3871 reports that numeric cell values can be emitted in scientific notation (e.g., `478923` → `4.7e+05`) when this serializer coerces types — a known gap that affects downstream RAG pipelines consuming structured tables. Source: [Issue #3871](https://github.com/Unstructured-IO/unstructured/issues/3871)
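Until the serializer behaviour changes, a pipeline can at least detect affected cells. The sketch below scans an HTML string for `<td>` cells whose whole content looks like scientific notation. The regular expression and `find_scientific_cells` are hypothetical, and the approach only flags values; it cannot recover digits already lost.

```python
import re

# Matches values such as 4.7e+05 or 1E-3 that fill an entire <td> cell.
SCI_CELL = re.compile(r"<td[^>]*>\s*([+-]?\d+(?:\.\d+)?[eE][+-]?\d+)\s*</td>")


def find_scientific_cells(html):
    """Return the scientific-notation strings found in table cells."""
    return SCI_CELL.findall(html)


html = "<table><tr><td>Revenue</td><td>4.7e+05</td><td>1200</td></tr></table>"
found = find_scientific_cells(html)
print(found)
```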

Issue #3525 requests a Markdown output mode alongside JSON so that partitioning results can be fed directly to LLMs without losing metadata fidelity. Source: [Issue #3525](https://github.com/Unstructured-IO/unstructured/issues/3525)

## Community Concerns and Known Limitations

Several recurring community themes influence how Elements, Chunking, and Output Formats should be used:

- **PDF two-column reading order** (#356): elements are extracted but not always in correct order; `xy-cut` is the primary mitigation.
- **NumPy 2.0 compatibility** (#3684): pending `onnxruntime` upgrade may shift element-coordinate sorting behavior.
- **Confidence scores** (#4320): feature request to surface per-element confidence in metadata so downstream filters can drop low-quality text from chunks.
- **`pdfminer.six` pinning** (#3982): mismatched `pdfminer.six 20250327` causes import errors, forcing `partition_pdf` to fail before elements are produced.
- **OCR deactivation** (#2467): no clean opt-out flag to skip OCR while keeping `strategy="hi_res"` table detection.
- **Embedded embedding module** is deprecated; the README redirects to `unstructured-ingest`. Source: [unstructured/embed/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/embed/README.md)

## Operational Tooling

The `scripts/performance/` directory provides benchmarking and `py-spy`-based profiling, allowing teams to verify that element extraction and chunking remain stable across commits. Source: [scripts/performance/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/performance/README.md)

## See Also

- PDF Partitioning Strategies (auto, fast, ocr_only, hi_res) — defined in `unstructured/partition/utils/constants.py:PartitionStrategy`
- OCR Agent Interface — `unstructured/partition/utils/ocr_models/ocr_interface.py`
- Environment Configuration — `unstructured/partition/utils/config.py`

---

<a id='page-4'></a>

## Embeddings, Connectors, and Metrics

### Related Pages

Related topics: [Document Partitioning Pipeline](#page-2), [Elements, Chunking, and Output Formats](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/Unstructured-IO/unstructured/blob/main/README.md)
- [unstructured/embed/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/embed/README.md)
- [scripts/performance/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/performance/README.md)
- [unstructured/partition/utils/config.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/config.py)
- [unstructured/partition/utils/constants.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/constants.py)
- [unstructured/partition/utils/ocr_models/ocr_interface.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/ocr_interface.py)
</details>

# Embeddings, Connectors, and Metrics

The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents so they can be fed downstream into LLM pipelines. Three supporting subsystems are commonly referenced from this repository: an `embed` module for generating vector embeddings, a family of `connectors` for moving data between sources and destinations, and a metrics/telemetry surface for observing behavior. This page documents the **current state** of each of these subsystems as reflected in the source tree of this repository, including the migration history of the embed module and the deprecation notice that affects users.

## Embeddings Module Status

The `unstructured/embed` directory historically contained client wrappers for third-party embedding providers (OpenAI, Voyage AI, Vertex AI, Bedrock, HuggingFace, OctoAI). As of the current revision, its README carries an explicit notice that the module has moved and will be removed.

> "Project has been moved to: [Unstructured Ingest](https://github.com/Unstructured-IO/unstructured-ingest). This python module will be removed from this repo in the near future." Source: [unstructured/embed/README.md:1-5]()

This means:

- New embedding work should target the `unstructured-ingest` repository rather than this one.
- Anyone importing from `unstructured.embed.*` in production code should plan for removal and migrate to the equivalent provider modules under `unstructured_ingest`.
- The connector-style embedding integrations referenced by the deprecated README are no longer the canonical path; downstream RAG users typically obtain embeddings through ingest pipelines or by passing elements to a vector store directly.

If you must call the embed helpers while the directory still exists, expect limited maintenance: security fixes may not be backported. Migrate before the module is removed.
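While the directory still exists, a guarded availability check can surface the pending removal early instead of failing at import time. The sketch below uses only the standard library; `module_available` and `warn_if_legacy_embed` are hypothetical helper names for illustration, not part of `unstructured` or `unstructured-ingest`.

```python
import importlib.util
import warnings


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located on the import path.

    find_spec imports parent packages but not the target module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


def warn_if_legacy_embed() -> bool:
    """Warn when the legacy embed package is still importable (hypothetical helper)."""
    present = module_available("unstructured.embed")
    if present:
        warnings.warn(
            "unstructured.embed is slated for removal; migrate to unstructured-ingest.",
            DeprecationWarning,
            stacklevel=2,
        )
    return present
```

Calling `warn_if_legacy_embed()` once at application start-up is one way to make the dependency visible in logs before the module disappears.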

## Connectors

Ingest and destination connectors are **not** maintained in this repository. The top-level README links users to a sibling project for batch processing:

> "[Batch Processing](https://github.com/Unstructured-IO/unstructured-ingest) | Ingesting batches of documents through Unstructured" Source: [README.md](https://github.com/Unstructured-IO/unstructured/blob/main/README.md)

Documentation references for connector behavior live at `docs.unstructured.io/open-source/ingest/overview` rather than in this repo. From the perspective of `unstructured` proper, connectors therefore fall outside the supported API surface: file-type partitioning (PDF, DOCX, HTML, etc.) is handled here, while the connectors that read/write S3, Azure, GCS, databases, and chat platforms are owned by `unstructured-ingest`.

For users who arrived at this repo looking for connector configuration, the practical answer is: install and configure `unstructured-ingest` separately, then chain it with `unstructured.partition.*` outputs.

## Metrics and Telemetry

### Anonymous Telemetry (Default Off)

`unstructured` ships with anonymous usage telemetry that is **off by default**. Opt-in and opt-out are controlled via environment variables, with opt-out always taking precedence.

| Variable | Effect | Notes |
|---|---|---|
| `UNSTRUCTURED_TELEMETRY_ENABLED` | Set to `true` / `1` to opt in | Must be set before importing `unstructured` |
| `DO_NOT_TRACK` | Any non-empty value opts out | Overrides opt-in |
| `SCARF_NO_ANALYTICS` | Any non-empty value opts out | Overrides opt-in |

Source: [README.md](https://github.com/Unstructured-IO/unstructured/blob/main/README.md): "Telemetry is off by default. To opt in, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` ... opt-out takes precedence. Unset the variable or leave it empty if you do not want to opt out."

In practice, the rule is simple: leave telemetry unset, or explicitly set one of the opt-out variables, and no events are emitted. If you choose to opt in, do so before the first `import unstructured` so the dispatcher initializes in the correct mode.
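The precedence rule in the table can be written down as a small pure function. This is a sketch of the documented behavior, not the library's actual implementation: `telemetry_enabled` is a hypothetical name, and treating the opt-in value case-insensitively is an assumption.

```python
import os
from typing import Mapping

OPT_OUT_VARS = ("DO_NOT_TRACK", "SCARF_NO_ANALYTICS")
OPT_IN_VAR = "UNSTRUCTURED_TELEMETRY_ENABLED"


def telemetry_enabled(env: Mapping[str, str]) -> bool:
    """Apply the documented rule: off by default, opt-out beats opt-in."""
    # Any non-empty opt-out variable disables telemetry unconditionally.
    if any(env.get(name, "") != "" for name in OPT_OUT_VARS):
        return False
    # Opt-in requires an explicit `true` or `1` (case-insensitivity is assumed).
    return env.get(OPT_IN_VAR, "").strip().lower() in {"true", "1"}


if __name__ == "__main__":
    # Inspect the current process environment.
    print(telemetry_enabled(os.environ))
```

Passing the environment as a mapping keeps the rule easy to check in isolation, for example in a deployment test that asserts telemetry stays off.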

### Environment-Configurable Knobs

A broader set of operational parameters (model sizes, OCR agent selection, cache sizes, working-directory toggles) is exposed through `ENVConfig` rather than a public API. Source: [unstructured/partition/utils/config.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/config.py)

For example, `OCR_AGENT_CACHE_SIZE` controls how the OCR agent factory memoizes per-language instances:

```python
@staticmethod
@lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE)
def get_instance(ocr_agent_module: str, language: str) -> "OCRAgent":
    ...
```

Source: [unstructured/partition/utils/ocr_models/ocr_interface.py:38-39]()

OCR agents themselves are selectable via the `OCR_AGENT` environment variable, with module paths whitelisted to prevent arbitrary code loading. Source: [unstructured/partition/utils/constants.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/constants.py) and [unstructured/partition/utils/ocr_models/ocr_interface.py:14-15]().
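The combination of a whitelist check and a size-bounded cache can be illustrated with a self-contained toy. Everything here is a stand-in: `ALLOWED_AGENT_MODULES`, `ToyAgent`, and `build_factory` are hypothetical names, the whitelist entries are made up, and the default cache size of 1 is an assumption for this sketch rather than a statement about the library.

```python
import os
from functools import lru_cache
from typing import Callable

# Hypothetical whitelist; the real one lives in the library's constants module.
ALLOWED_AGENT_MODULES = frozenset({"toy_agents.fast", "toy_agents.accurate"})


class ToyAgent:
    """Stand-in for an OCR agent; it only records its configuration."""

    def __init__(self, module: str, language: str) -> None:
        self.module = module
        self.language = language


def build_factory(cache_size: int) -> Callable[[str, str], ToyAgent]:
    """Return a memoized factory holding at most `cache_size` agents."""

    @lru_cache(maxsize=cache_size)
    def get_instance(module: str, language: str) -> ToyAgent:
        # Reject anything outside the whitelist before constructing an agent.
        if module not in ALLOWED_AGENT_MODULES:
            raise ValueError(f"OCR agent module {module!r} is not whitelisted")
        return ToyAgent(module, language)

    return get_instance


# Cache size read from the environment; the fallback of 1 is assumed here.
get_instance = build_factory(int(os.environ.get("OCR_AGENT_CACHE_SIZE", "1")))
```

Because the cache key is the `(module, language)` pair, a workload that alternates between several languages with a small cache will rebuild agents on each switch; raising the cache size trades memory for fewer re-initializations.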

### Performance Benchmarking and Profiling

For users who need to **measure** partitioning performance, a self-contained toolkit ships under `scripts/performance/`. Source: [scripts/performance/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/performance/README.md)

The toolkit separates two concerns:

- **Benchmarking** — runs partitioning against a fixed corpus and records timings keyed by architecture, instance type, and git hash. Optional S3 publication is gated by `PUBLISH_RESULTS=true`. Iteration count is controlled by `NUM_ITERATIONS`. Docker execution is opt-in via `DOCKER_TEST=true`.
- **Profiling** — uses `py-spy` to capture CPU/allocations across called functions during a partition run. Results are viewable with `speedscope` (local install via `npm install -g speedscope`) or by uploading the `.speedscope` file to https://www.speedscope.app/.
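The benchmarking idea (repeat a task, then record timings alongside the metadata that makes runs comparable) can be sketched in a few lines. This is not the repository's benchmark script: `benchmark` is a hypothetical helper, the fallback of 3 iterations is an assumption, and the git hash is passed in by the caller rather than discovered.

```python
import os
import platform
import statistics
import time
from typing import Callable, Dict, Optional


def benchmark(
    task: Callable[[], object],
    git_hash: str,
    iterations: Optional[int] = None,
) -> Dict[str, object]:
    """Time `task` repeatedly and tag the result with run metadata."""
    if iterations is None:
        # NUM_ITERATIONS mirrors the variable named above; 3 is an assumed fallback.
        iterations = int(os.environ.get("NUM_ITERATIONS", "3"))
    iterations = max(1, iterations)

    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)

    return {
        "architecture": platform.machine(),
        "instance_type": os.environ.get("INSTANCE_TYPE", "unknown"),
        "git_hash": git_hash,
        "iterations": iterations,
        "mean_seconds": statistics.mean(timings),
        "min_seconds": min(timings),
    }
```

In real use, `task` would wrap a partitioning call on a fixed document; a stand-in such as `lambda: sum(range(1000))` is enough to check the harness itself.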

## Where These Subsystems Fit in the Pipeline

```mermaid
flowchart LR
    A[Raw Documents] --> B[partition_* / partition]
    B --> C[Element objects]
    C --> D[unstructured-ingest]
    D --> E[Destination<br/>vector store / DB]
    E --> F[LLM / RAG]
    C -.opt-in.-> G[Telemetry sink]
    C -.bench.-> H[scripts/performance]
```

This diagram reflects the current responsibilities split: partitioning stays in `unstructured`, connectors and embedding-provider integrations live in `unstructured-ingest`, telemetry is an opt-in background signal, and benchmarking is a developer-side measurement tool rather than a runtime component.

## Common Failure Modes

- **`ImportError` from `unstructured.embed.*` after a refactor.** The directory is slated for removal; switch to the equivalent provider in `unstructured-ingest`. Source: [unstructured/embed/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/embed/README.md)
- **OCR agent not picked up from environment.** The module path must appear in `OCR_AGENT_MODULES_WHITELIST`; otherwise `OCRAgent.get_instance` raises `ValueError`. Source: [unstructured/partition/utils/ocr_models/ocr_interface.py:42-46]()
- **Telemetry still firing after opt-out.** Ensure `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` is set to a non-empty value before the Python process imports `unstructured`. Source: [README.md](https://github.com/Unstructured-IO/unstructured/blob/main/README.md)
- **Benchmark numbers not comparable across runs.** Set `INSTANCE_TYPE` and a known git hash, and pin `NUM_ITERATIONS`, so S3-published results can be aggregated meaningfully. Source: [scripts/performance/README.md](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/performance/README.md)

## See Also

- [Partitioning overview](https://docs.unstructured.io/open-source/core-functionality/partitioning)
- [unstructured-ingest repository](https://github.com/Unstructured-IO/unstructured-ingest)
- [Ingest connectors overview](https://docs.unstructured.io/open-source/ingest/overview)
- [Document elements concepts](https://docs.unstructured.io/open-source/concepts/document-elements)

---


## Pitfall Log

Project: Unstructured-IO/unstructured

Summary: Found 22 structured pitfall item(s), including 3 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/Unstructured-IO/unstructured/issues/3871

## 2. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/Unstructured-IO/unstructured/issues/4320

## 3. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: packet_text.keyword_scan | https://github.com/Unstructured-IO/unstructured

## 4. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: runtime_trace
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Repro command: `docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest`
- Evidence: identity.distribution | https://github.com/Unstructured-IO/unstructured

## 5. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/Unstructured-IO/unstructured

## 6. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: Number getting converted into scientific notation in metadata.text_as_html
- User impact: Developers may hit a documented source-backed failure mode: Number getting converted into scientific notation in metadata.text_as_html
- Evidence: failure_mode_cluster:github_issue | https://github.com/Unstructured-IO/unstructured/issues/3871

## 7. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/Unstructured-IO/unstructured

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/Unstructured-IO/unstructured

## 10. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: [Feature Request] Add document layout analysis confidence scores
- User impact: Developers may hit a documented source-backed failure mode: [Feature Request] Add document layout analysis confidence scores
- Evidence: failure_mode_cluster:github_issue | https://github.com/Unstructured-IO/unstructured/issues/4320

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured

## 12. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured

## 13. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.23
- User impact: Upgrade or migration may change expected behavior: 0.22.23
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.23

## 14. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.27
- User impact: Upgrade or migration may change expected behavior: 0.22.27
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.27

## 15. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.28
- User impact: Upgrade or migration may change expected behavior: 0.22.28
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.28

## 16. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.29
- User impact: Upgrade or migration may change expected behavior: 0.22.29
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.29

## 17. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.30
- User impact: Upgrade or migration may change expected behavior: 0.22.30
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.30

## 18. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.31
- User impact: Upgrade or migration may change expected behavior: 0.22.31
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.31

## 19. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.32
- User impact: Upgrade or migration may change expected behavior: 0.22.32
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.32

## 20. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.22.6
- User impact: Upgrade or migration may change expected behavior: 0.22.6
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.22.6

## 21. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.23.0
- User impact: Upgrade or migration may change expected behavior: 0.23.0
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.23.0

## 22. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 0.23.1
- User impact: Upgrade or migration may change expected behavior: 0.23.1
- Evidence: failure_mode_cluster:github_release | https://github.com/Unstructured-IO/unstructured/releases/tag/0.23.1

<!-- canonical_name: Unstructured-IO/unstructured; human_manual_source: deepwiki_human_wiki -->
