Doramagic Project Pack · Human Manual
unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Overview, Installation, and Quick Start
Related topics: Document Partitioning Pipeline, Elements, Chunking, and Output Formats
What is `unstructured`
The unstructured library provides open-source components for ingesting and pre-processing images and text documents — PDFs, HTML, Word documents, XML, and many more. The library is purpose-built to streamline the data processing workflow for Large Language Models (LLMs): its modular partitioning functions and connectors form a cohesive system that turns unstructured data into structured, element-typed outputs suitable for downstream pipelines such as RAG, embedding, and chunking.
Source: README.md:3-15
The recommended entry point is unstructured.partition.auto.partition, which detects the file type and routes it to a format-specific partitioning function. Each file type (PDF, DOCX, HTML, …) has its own partitioner that returns a list of Element objects (e.g., Title, NarrativeText, Table, ListItem).
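As a rough illustration of working with the returned list, the sketch below tallies elements by class name. It assumes only that each element is an instance of a class named after its type (Title, NarrativeText, and so on). The stand-in classes and the count_by_type helper are hypothetical, included so the snippet runs without the library installed.

```python
from collections import Counter
from typing import Iterable


def count_by_type(elements: Iterable[object]) -> Counter:
    """Tally elements by their class name (e.g. Title, NarrativeText)."""
    return Counter(type(el).__name__ for el in elements)


# Hypothetical stand-ins for the library's element classes, used only so
# this sketch runs on its own. With the library installed, pass the list
# returned by partition() instead.
class _StandInElement:
    def __init__(self, text: str) -> None:
        self.text = text


class Title(_StandInElement):
    pass


class NarrativeText(_StandInElement):
    pass


sample = [Title("Intro"), NarrativeText("Some body text."), NarrativeText("More.")]
counts = count_by_type(sample)
print(counts.most_common())
```

A tally like this is a quick sanity check after partitioning: a document that comes back with no Title elements, or with almost everything in one class, is often worth a closer look.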
Installation
There are three supported installation paths described in the project README. The choice depends on whether you want a quick local install, a reproducible container, or a full development environment.
1. Install from PyPI
The base package is published to PyPI. Extras pull in document-type-specific dependencies:
pip install "unstructured[docx]"
pip install "unstructured[pdf]"
For Windows users running conda, the README points to the dedicated Windows install guide in the project docs.
Source: README.md:55-78
2. Run the library in a container
A multi-platform Docker image is built for every push to main, tagged with the short commit hash and the application version. This is the fastest way to get a working environment without worrying about OS-level dependencies such as poppler or Tesseract language packs.
docker pull unstructuredio/unstructured:latest
The README also documents platform selection (e.g., --platform linux/amd64) for Apple silicon users.
Source: README.md:62-78
3. Install for local development
The repo ships a Makefile and an optional pre-commit configuration. The typical dev workflow is:
make install
pre-commit install # optional, for auto-formatting on commit
make check # show lint/format diffs without applying
make tidy # apply lint/format fixes
make docker-start-dev # run a dev container with the repo mounted
Source: README.md:30-50
flowchart LR
A[Choose install path] --> B{Use container?}
B -- yes --> C[docker pull<br/>unstructuredio/unstructured]
B -- no --> D{pip or dev?}
D -- pip --> E[pip install<br/>'unstructured[pdf,docx,...]']
D -- dev --> F[make install<br/>pre-commit install]
E --> G[Run partition()]
F --> G
C --> G
Quick Start
The canonical quick start, taken directly from the README, uses the auto-router and the included layout-parser-paper.pdf example:
from unstructured.partition.auto import partition
elements = partition("example-docs/layout-parser-paper.pdf")
print("\n\n".join([str(el) for el in elements]))
The output is a sequence of Element objects that, when stringified, render as the document text with the detected element types inline.
Source: README.md:82-105
The repository also ships example documents that are useful as smoke tests:
| File | Format | Purpose |
|---|---|---|
| example-docs/layout-parser-paper.pdf | PDF | Tests PDF partitioning and OCR strategies |
| example-docs/example-10k.html | HTML | Tests HTML partitioning and table metadata |
| example-docs/factbook.xml / factbook.xsl | XML / XSL | Tests stylesheet-aware XML partitioning |
Source: example-docs/README.md:3-21
For production batch ingestion across many sources, the README points to a separate companion repository, unstructured-ingest.
Configuration, OCR Agents, and Diagnostics
OCR agent selection
partition_pdf and other partitioners can delegate text extraction from images to an OCR backend. The selection is driven by the OCR_AGENT environment variable, parsed by OCRAgent.get_agent() in the OCR interface. The whitelist of supported agents is defined as module-level constants.
Source: unstructured/partition/utils/ocr_models/ocr_interface.py:18-34
The currently shipped implementations are:
| Class | Backend | Notes |
|---|---|---|
| OCRAgentTesseract | Tesseract (via unstructured_pytesseract) | Default; forces OMP_THREAD_LIMIT=1 for performance. Source: unstructured/partition/utils/ocr_models/tesseract_ocr.py:30-50 |
| OCRAgentPaddle | PaddleOCR (via unstructured_paddleocr) | Auto-selects CPU/GPU; uses MKL-DNN when available. Source: unstructured/partition/utils/ocr_models/paddle_ocr.py:24-45 |
| OCRAgentGoogleVision | Google Cloud Vision | Configurable endpoint via GOOGLEVISION_API_ENDPOINT. Source: unstructured/partition/utils/ocr_models/google_vision_ocr.py:26-44 |
Community request #2467 asks for a way to deactivate OCR while keeping strategy="hi_res" — useful for text-based PDFs that still need layout analysis but not Tesseract overhead.
Environment configuration
Runtime knobs (OCR backend, temp directories, agent cache size, etc.) live in ENVConfig and are read from environment variables. Constants used for OCR_AGENT aliases (e.g., OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE_OLD) are imported from unstructured.partition.utils.constants.
Source: unstructured/partition/utils/config.py:14-58 Source: unstructured/partition/utils/ocr_models/ocr_interface.py:20-28
Diagnostics
Per release 0.22.26, a new unstructured doctor CLI command was added that prints the resolved environment and dependency status — helpful when triaging install or OCR issues.
Source: Release notes referenced from the community context.
Telemetry
Telemetry is off by default. To opt in, set UNSTRUCTURED_TELEMETRY_ENABLED=true (or =1) before importing unstructured. To opt out, set DO_NOT_TRACK or SCARF_NO_ANALYTICS to any non-empty value — opt-out takes precedence over opt-in.
Source: README.md:152-160
Performance and Embedding
For performance work, the scripts/performance/ directory ships two helpers: a benchmark script that publishes partitioning timings to S3 (gated by PUBLISH_RESULTS=true), and a profiling script that uses py-spy to emit speedscope-viewable flame graphs.
Source: scripts/performance/README.md:3-37
The unstructured.embed module has been moved to unstructured-ingest and is marked unmaintained; it will be removed from this repository in the near future.
Source: unstructured/embed/README.md:1-7
Common Pitfalls (from community)
- Two-column PDFs: partition_pdf does not always recover reading order on two-column layouts. Tracking issue: #356. Element ordering is post-processed by the recursive_xy_cut and SORT_MODE_BASIC helpers. Source: unstructured/partition/utils/sorting.py:1-40, unstructured/partition/utils/xycut.py:1-20
- NumPy 2.0 compatibility: Requires bumping onnxruntime to 1.19 and patching type/sort behavior. Tracking issue: #3684.
- pdfminer.six pin: unstructured[pdf]==0.16.11 pulls in a pdfminer.six version that breaks langchain_community.document_loaders. Pin a compatible version until upstream fixes it. Tracking issue: #3982.
- Numbers in metadata.text_as_html: Table HTML serialization can render large integers in scientific notation. Tracking issue: #3871.
Source: https://github.com/Unstructured-IO/unstructured / Human Manual
Document Partitioning Pipeline
Related topics: Overview, Installation, and Quick Start, Elements, Chunking, and Output Formats
Overview and Entry Points
The unstructured library provides a unified pipeline for converting heterogeneous unstructured files (PDFs, images, HTML, audio, and many more formats) into a normalized stream of structured Element objects suitable for downstream LLM ingestion. From the README, the canonical entry point is unstructured.partition.auto.partition, which detects file type and routes to the appropriate file-specific partitioner. The library advertises partitioning as its "core functionality" with a full list of options documented in the partitioning docs. Source: README.md.
The pipeline is intentionally pluggable: rather than coupling detection, OCR, layout analysis, and ordering, the project exposes abstract base classes for each role and ships concrete implementations for popular services. This lets users swap OCR backends or sorting algorithms without modifying the higher-level partitioner code, and it lets maintainers add new strategies incrementally.
Partitioning Strategies and Source Identifiers
A single document may be parsed with different cost/quality trade-offs. The strategy namespace is declared as a class of string constants in unstructured/partition/utils/constants.py:18-22 with the four values AUTO, FAST, OCR_ONLY, and HI_RES. FAST skips model inference and uses only the text layer; HI_RES runs full layout detection plus OCR; OCR_ONLY bypasses the layout model and OCRs everything; AUTO lets the partitioner pick per file. Source: unstructured/partition/utils/constants.py:18-22.
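A minimal sketch of selecting a strategy explicitly, assuming the lowercase string values auto, fast, ocr_only, and hi_res used elsewhere in this manual. The partition_with_strategy wrapper is a hypothetical helper, not part of the library; it validates the name first and imports the library lazily, so the validation itself has no dependencies.

```python
from typing import Any, List

# The four strategy values named above.
STRATEGIES = ("auto", "fast", "ocr_only", "hi_res")


def partition_with_strategy(path: str, strategy: str = "auto") -> List[Any]:
    """Validate the strategy name, then delegate to the auto-router."""
    if strategy not in STRATEGIES:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {STRATEGIES}"
        )
    # Imported lazily so the validation above works without the library.
    from unstructured.partition.auto import partition

    return partition(filename=path, strategy=strategy)
```

A call such as partition_with_strategy("example-docs/layout-parser-paper.pdf", "fast") would use only the text layer, while "hi_res" pays for layout detection and OCR.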
The same module defines a Source enum that tags every produced element with the subsystem that produced its text — currently PDFMINER, OCR_TESSERACT, OCR_PADDLE, and OCR_GOOGLEVISION. This provenance tag is the basis of the detection_origin metadata field referenced in the 0.23.0 release notes ("add enrichment origins metadata field"), enabling users to trace why a piece of text appeared in the output. A second enum, OCRMode, distinguishes INDIVIDUAL_BLOCKS from FULL_PAGE, which controls whether per-block layout coordinates are preserved. Source: unstructured/partition/utils/constants.py:1-30.
OCR Subsystem
The OCR layer is abstracted behind OCRAgent in unstructured/partition/utils/ocr_models/ocr_interface.py. The class exposes two main methods — get_text_from_image for raw string extraction and get_layout_from_image for layout-aware regions. Agent selection is environment-driven: OCR_AGENT resolves to a fully-qualified module/class name, validated against OCR_AGENT_MODULES_WHITELIST to prevent arbitrary import. Instances are cached with functools.lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE) so a single agent object is reused across calls, avoiding repeated model loads. Source: unstructured/partition/utils/ocr_models/ocr_interface.py:18-42.
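The selection pattern described above (resolve a qualified name, check it against a whitelist, cache the instance) can be sketched with the standard library alone. This is an illustration of the pattern, not the library's code: ALLOWED_MODULES, get_agent, and the cache size of 4 are hypothetical stand-ins, and stdlib classes take the place of real OCR agents.

```python
import importlib
from functools import lru_cache

# Hypothetical whitelist; the real one is OCR_AGENT_MODULES_WHITELIST.
ALLOWED_MODULES = ("collections", "decimal")


@lru_cache(maxsize=4)  # stand-in for env_config.OCR_AGENT_CACHE_SIZE
def get_agent(qualified_name: str) -> object:
    """Import `module.ClassName`, but only from whitelisted modules."""
    module_name, class_name = qualified_name.rsplit(".", 1)
    if module_name not in ALLOWED_MODULES:
        raise ValueError(f"module {module_name!r} is not whitelisted")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()  # instantiate once, then reuse


first = get_agent("collections.OrderedDict")
second = get_agent("collections.OrderedDict")
print(first is second)  # the cached instance is reused
```

Checking the module name before calling importlib.import_module is what keeps an attacker-controlled environment variable from loading arbitrary code, and the cache means an expensive model load happens once per distinct argument set.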
Three concrete implementations ship in-tree:
- Tesseract (OCRAgentTesseract): wraps unstructured_pytesseract and forces OMP_THREAD_LIMIT=1 to avoid contention on multi-threaded hosts. It returns sorted text and exposes an HOCR namespace for parsing word-level geometry. Source: unstructured/partition/utils/ocr_models/tesseract_ocr.py:17-44.
- PaddleOCR (OCRAgentPaddle): loads PaddlePaddle with disable_signal_handler() and prefers GPU when available, falling back to CPU with MKL-DNN when supported. Source: unstructured/partition/utils/ocr_models/paddle_ocr.py:21-39.
- Google Cloud Vision (OCRAgentGoogleVision): calls document_text_detection, with the API endpoint overridable via GOOGLEVISION_API_ENDPOINT and language hints passed through ImageContext. Source: unstructured/partition/utils/ocr_models/google_vision_ocr.py:21-38.
Community issue #2467 tracks a request to optionally disable OCR inside hi_res to speed up text-heavy PDFs, which would require a new flag on the OCR abstraction rather than a partitioner-level change.
Layout Sorting and Reading Order
Once regions are detected, the partitioner must decide their reading order. The XY-cut algorithm and a fallback "basic" mode are exposed in unstructured/partition/utils/sorting.py. The module converts element coordinates to (left, top, right, bottom) bounding boxes via coordinates_to_bbox, applies shrink_bbox for tolerance, then dispatches to either recursive_xy_cut or recursive_xy_cut_swapped depending on aspect ratio. Source: unstructured/partition/utils/sorting.py:18-44.
The three sort modes SORT_MODE_XY_CUT, SORT_MODE_BASIC, and SORT_MODE_DONT are exported from constants.py. Community issue #356 highlights a known limitation: two-column PDFs are sometimes read out of order because the XY-cut heuristic can fail on dense academic layouts, motivating ongoing improvements to the sort step.
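To make the reading-order idea concrete, here is a simplified XY-cut sketch over (left, top, right, bottom) boxes. It is not the library's recursive_xy_cut: the function names, the cut order (horizontal gaps first, then vertical), and the shrink helper are assumptions chosen for illustration. The shrink helper follows the description of keeping the top-left corner fixed.

```python
from typing import List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


def shrink(bbox: BBox, factor: float) -> BBox:
    """Scale width and height by `factor`, keeping the top-left corner."""
    left, top, right, bottom = bbox
    return (left, top, left + (right - left) * factor, top + (bottom - top) * factor)


def _split_by_gaps(boxes: Sequence[BBox], indices: List[int], lo: int, hi: int) -> List[List[int]]:
    """Group indices whose [lo, hi] intervals overlap along one axis."""
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]
    for i in ordered[1:]:
        if boxes[i][lo] > reach:  # blank gap found: start a new group
            groups.append([i])
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def xy_cut_order(boxes: Sequence[BBox], indices: Optional[List[int]] = None) -> List[int]:
    """Return box indices in an estimated reading order."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    rows = _split_by_gaps(boxes, indices, 1, 3)  # cut along horizontal gaps
    if len(rows) > 1:
        return [i for group in rows for i in xy_cut_order(boxes, group)]
    cols = _split_by_gaps(boxes, indices, 0, 2)  # cut along vertical gaps
    if len(cols) > 1:
        return [i for group in cols for i in xy_cut_order(boxes, group)]
    # No clean cut: fall back to a basic top-then-left sort.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


# Two columns with staggered blocks, listed out of order.
page = [(50, 0, 90, 20), (0, 12, 40, 30), (0, 0, 40, 10), (50, 22, 90, 30)]
print(xy_cut_order(page))
```

On this page a plain top-then-left sort would interleave the two columns, whereas the cut-based order finishes the left column before starting the right one. Dense layouts where no blank gap separates the columns defeat the heuristic, which matches the limitation reported in issue #356.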
Multimodal Partitioning (Speech-to-Text)
Audio and video files are handled through a parallel speech-to-text (STT) pipeline declared in unstructured/partition/utils/speech_to_text/speech_to_text_interface.py. SpeechToTextAgent mirrors OCRAgent in spirit: a module path is resolved from the STT_AGENT env var (defaulting to Whisper), validated against STT_AGENT_MODULES_WHITELIST, and cached via lru_cache. The interface returns a list of TranscriptionSegment typed dicts (text, start, end). Source: unstructured/partition/utils/speech_to_text/speech_to_text_interface.py:17-58.
The default Whisper agent (SpeechToTextAgentWhisper) snapshots model size, device, and FP16 from env vars (WHISPER_MODEL_SIZE, WHISPER_DEVICE, WHISPER_FP16) at construction time. A per-instance threading.Lock serializes calls because whisper.model.transcribe() is not documented as thread-safe — a hidden throughput ceiling under concurrent workloads. For true parallelism the docstring recommends process-based concurrency. The agent may return empty or whitespace-only segments; the audio partitioner is the single place that strips and drops them, keeping agent implementations simple. Source: unstructured/partition/utils/speech_to_text/whisper_stt.py:25-50, Source: unstructured/partition/utils/speech_to_text/speech_to_text_interface.py:13-22.
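The locking and cleanup behaviour described above can be sketched without Whisper itself. In the example below, SerializedTranscriber and drop_empty_segments are hypothetical names, and transcribe_fn stands in for whatever callable actually runs the model; only the TranscriptionSegment fields (text, start, end) come from the description above.

```python
import threading
from typing import Callable, List, TypedDict


class TranscriptionSegment(TypedDict):
    text: str
    start: float
    end: float


class SerializedTranscriber:
    """Wrap a transcribe callable that is not known to be thread-safe."""

    def __init__(self, transcribe_fn: Callable[[str], List[TranscriptionSegment]]) -> None:
        self._transcribe_fn = transcribe_fn
        self._lock = threading.Lock()  # one lock per instance

    def transcribe(self, path: str) -> List[TranscriptionSegment]:
        # Only one thread at a time may enter the underlying model call.
        with self._lock:
            return self._transcribe_fn(path)


def drop_empty_segments(segments: List[TranscriptionSegment]) -> List[TranscriptionSegment]:
    """Strip whitespace and discard segments with no remaining text."""
    cleaned: List[TranscriptionSegment] = []
    for seg in segments:
        text = seg["text"].strip()
        if text:
            cleaned.append({"text": text, "start": seg["start"], "end": seg["end"]})
    return cleaned
```

Because the lock belongs to the instance, threads sharing one agent queue up behind each other. Running several worker processes, each with its own agent, is the way to get real parallelism.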
Pipeline Flow
flowchart LR
A[Input file] --> B[partition function]
B --> C{File type?}
C -->|PDF/Image| D[Layout detection]
D --> E{strategy?}
E -->|fast| F[Text layer extraction]
E -->|hi_res| G[Model + OCR]
G --> H[OCRAgent<br/>Tesseract/Paddle/Vision]
F --> I[XY-cut sorting]
H --> I
I --> J[Element stream]
C -->|Audio| K[SpeechToTextAgent<br/>Whisper]
K --> J
Operational Concerns
- Telemetry is off by default; opt in with UNSTRUCTURED_TELEMETRY_ENABLED=true, opt out with any non-empty value of DO_NOT_TRACK or SCARF_NO_ANALYTICS. Source: README.md.
- Dependency selection matters: unstructured[pdf]==0.16.11 resolved pdfminer.six 20250327 and broke imports; see community issue #3982. Pin compatible versions in production environments.
- Performance tooling: scripts/performance/README.md documents benchmark.sh and profile.sh, plus an integration with speedscope for flame-graph inspection of partition runs. Source: scripts/performance/README.md:1-15.
- Confidence scores are not yet exposed on extracted elements; feature request #4320 proposes surfacing per-element model confidence to help RAG pipelines triage low-quality extractions such as rotated tables or scanned handwriting.
- Output formatting: feature request #3525 requests Markdown extraction alongside JSON, since JSON is not always ideal as a direct LLM input.
See Also
- unstructured.partition.auto.partition entry point
- OCR agent implementations
- Layout sorting utilities
- Speech-to-text agents
- Release notes for 0.22.x and 0.23.x documenting recent pipeline fixes (table chunking with isolate_table, dense-PDF text recovery, AcroForm field extraction)
Elements, Chunking, and Output Formats
Related topics: Document Partitioning Pipeline, Embeddings, Connectors, and Metrics
The unstructured library is designed around three tightly connected concepts: Elements (the typed content objects produced by partitioning), Chunking (grouping elements into LLM-friendly units), and Output Formats (how those elements and chunks are serialized for downstream consumers). This page documents the structure of each layer, the configuration knobs available, and known limitations observed in community discussions.
Element Model and Coordinates
Partitioning functions produce a list of Element objects, each carrying a typed classification (e.g., Title, NarrativeText, Table, Image) plus optional positional and provenance metadata. The Source enum in unstructured/partition/utils/constants.py declares where an element's text was derived from, including PDFMINER, OCR_TESSERACT, OCR_PADDLE, and OCR_GOOGLEVISION. Source: unstructured/partition/utils/constants.py:Source
Spatial layout matters for downstream sorting and chunking. CoordinatesMetadata is normalized into (left, top, right, bottom) bounding boxes via coordinates_to_bbox, and shrink_bbox contracts a box by a configurable factor while preserving its top-left corner. Source: unstructured/partition/utils/sorting.py:coordinates_to_bbox
When the layout detection backend produces overlapping or rotated boxes, the xy-cut algorithm (recursive_xy_cut, recursive_xy_cut_swapped) is invoked to produce a deterministic reading order. Source: unstructured/partition/utils/xycut.py:recursive_xy_cut
OCR Backends and Element Provenance
Elements originating from image regions rely on one of three OCR agents, all implementing the OCRAgent abstract base class:
| Backend | Module | Notes |
|---|---|---|
| Tesseract | unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract | Default; sets OMP_THREAD_LIMIT=1 for stability. Source: tesseract_ocr.py |
| PaddleOCR | unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle | GPU auto-detected via paddle.device.cuda.device_count(). Source: paddle_ocr.py |
| Google Vision | unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision | Honors GOOGLEVISION_API_ENDPOINT env var. Source: google_vision_ocr.py |
Selection is governed by the OCR_AGENT environment variable, and the resolved module must appear in the OCR_AGENT_MODULES_WHITELIST constant (overridable via env var of the same name). Source: unstructured/partition/utils/ocr_models/ocr_interface.py:OCRAgent.get_agent and constants.py:OCR_AGENT_MODULES_WHITELIST
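A short sketch of switching backends from Python, using the fully qualified class paths from the table above. The short keys (tesseract, paddle, google_vision) and the select_ocr_agent helper are hypothetical conveniences. The only documented mechanism is the OCR_AGENT environment variable itself, which is safest to set before any partitioning runs.

```python
import os

# Fully qualified class paths as listed in the table above.
OCR_AGENTS = {
    "tesseract": "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract",
    "paddle": "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    "google_vision": "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision",
}


def select_ocr_agent(name: str) -> str:
    """Point OCR_AGENT at one of the shipped agents and return its path."""
    path = OCR_AGENTS[name]  # KeyError for names outside the shipped set
    os.environ["OCR_AGENT"] = path
    return path


selected = select_ocr_agent("paddle")
print(selected)
```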
Chunking Pipeline
Chunking groups adjacent elements so that downstream embeddings and LLM context windows operate on coherent units rather than individual lines. Releases in the 0.22.x and 0.23.x series added several chunking refinements:
- Table chunking option (0.22.30): opt-in table-aware chunking.
- Renamed isolate_tables → isolate_table (0.22.31) for singular consistency.
- First table chunk preserves column/row span (0.22.23) to keep tabular structure intact.
- AcroForm field text extraction in PDFs (0.23.1) so form values flow into chunks rather than being dropped.
The chunking subsystem is configured through ENVConfig, which reads runtime switches such as the OCR agent cache size and global working directory. Source: unstructured/partition/utils/config.py:ENVConfig
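The idea of grouping adjacent elements into coherent units can be shown with a deliberately small sketch. This is not the library's chunking algorithm: the (type, text) tuples, the chunk_elements function, the rule that a Title starts a new chunk, and the default max_characters of 500 are all assumptions made for illustration. Keeping each table in a chunk of its own mirrors the table-isolation option mentioned above.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (element type, text): a stand-in for real elements


def chunk_elements(elements: List[Element], max_characters: int = 500) -> List[List[Element]]:
    """Group adjacent elements; Titles open a chunk, Tables stand alone."""
    chunks: List[List[Element]] = []
    current: List[Element] = []
    size = 0
    for kind, text in elements:
        starts_section = kind == "Title"
        is_table = kind == "Table"
        too_big = size + len(text) > max_characters
        if current and (starts_section or is_table or too_big):
            chunks.append(current)  # close the chunk in progress
            current, size = [], 0
        current.append((kind, text))
        size += len(text)
        if is_table:  # close immediately so the table stays on its own
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)
    return chunks


doc = [
    ("Title", "A"), ("NarrativeText", "aaa"), ("Table", "t"),
    ("NarrativeText", "bbb"), ("Title", "B"), ("NarrativeText", "ccc"),
]
for chunk in chunk_elements(doc):
    print([kind for kind, _ in chunk])
```

A single element longer than max_characters still becomes its own chunk here. Splitting oversized text is left out to keep the sketch short.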
flowchart LR
A[partition*] --> B[Element list]
B --> C[Chunking]
C --> D[PreChunkFilter]
D --> E[TextPreChunk]
E --> F[Chunk elements]
F --> G[Output: JSON / NDJSON]
Telemetry around this pipeline is off by default and gated by UNSTRUCTURED_TELEMETRY_ENABLED. Source: README.md:Analytics
Output Formats and Serialization
By default, partition("...") returns a Python list of Element objects that stringify to typed text. For inter-process serialization, the library emits JSON-compatible dicts and NDJSON streams. NDJSON file-type detection was hardened in 0.22.27. Source: Release 0.22.27
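For readers unfamiliar with NDJSON, the sketch below writes and reads one JSON object per line using only the standard library. It assumes each element has already been converted to a JSON-compatible dict. The field names in the sample records and the to_ndjson and from_ndjson helpers are hypothetical, not the library's serializer.

```python
import json
from typing import Dict, Iterable, List


def to_ndjson(records: Iterable[Dict]) -> str:
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def from_ndjson(payload: str) -> List[Dict]:
    """Parse NDJSON, skipping blank lines."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]


records = [
    {"type": "Title", "text": "Quarterly report"},
    {"type": "NarrativeText", "text": "Line one\nline two"},
]
payload = to_ndjson(records)
print(payload, end="")
```

Newlines inside a text field are escaped by json.dumps, so each record still occupies exactly one physical line and a consumer can stream the file line by line.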
Table elements expose HTML through chunk.metadata.text_as_html. Community issue #3871 reports that numeric cell values can be emitted in scientific notation (e.g., 478923 → 4.7e+05) when this serializer coerces types — a known gap that affects downstream RAG pipelines consuming structured tables. Source: Issue #3871
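Until the serializer issue is resolved, a downstream check can at least flag affected tables. The sketch below scans table HTML for cells whose entire content is a number in scientific notation. The regular expressions and the cells_in_scientific_notation helper are assumptions for illustration, and a full HTML parser would be more robust on nested markup.

```python
import re
from typing import List

# Matches the text inside <td>...</td> or <th>...</th> (not <thead>, <tr>, ...).
CELL = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
# A whole-cell number such as 4.7e+05 or 1E-3.
SCI = re.compile(r"[+-]?\d+(?:\.\d+)?[eE][+-]?\d+")


def cells_in_scientific_notation(html: str) -> List[str]:
    """Return table cells whose entire content is in scientific notation."""
    cells = (cell.strip() for cell in CELL.findall(html))
    return [cell for cell in cells if SCI.fullmatch(cell)]


html = (
    "<table><thead><tr><th>Item</th><th>Value</th></tr></thead>"
    "<tr><td>Revenue</td><td>4.7e+05</td></tr>"
    "<tr><td>Units</td><td>478923</td></tr></table>"
)
print(cells_in_scientific_notation(html))
```

A non-empty result is a signal to re-read the value from the source document, since the original digits cannot be recovered from the rounded notation.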
Issue #3525 requests a Markdown output mode alongside JSON so that partitioning results can be fed directly to LLMs without losing metadata fidelity. Source: Issue #3525
Community Concerns and Known Limitations
Several recurring community themes influence how Elements, Chunking, and Output Formats should be used:
- PDF two-column reading order (#356): elements are extracted but not always in correct order; xy-cut is the primary mitigation.
- NumPy 2.0 compatibility (#3684): pending onnxruntime upgrade may shift element-coordinate sorting behavior.
- Confidence scores (#4320): feature request to surface per-element confidence in metadata so downstream filters can drop low-quality text from chunks.
- pdfminer.six pinning (#3982): mismatched pdfminer.six 20250327 causes import errors, forcing partition_pdf to fail before elements are produced.
- OCR deactivation (#2467): no clean opt-out flag to skip OCR while keeping strategy="hi_res" table detection.
- The bundled embed module is deprecated; its README redirects to unstructured-ingest. Source: unstructured/embed/README.md
Operational Tooling
The scripts/performance/ directory provides benchmarking and py-spy-based profiling, allowing teams to verify that element extraction and chunking remain stable across commits. Source: scripts/performance/README.md
See Also
- PDF Partitioning Strategies (auto, fast, ocr_only, hi_res), defined in unstructured/partition/utils/constants.py:PartitionStrategy
- OCR Agent Interface: unstructured/partition/utils/ocr_models/ocr_interface.py
- Environment Configuration: unstructured/partition/utils/config.py
Embeddings, Connectors, and Metrics
Related topics: Document Partitioning Pipeline, Elements, Chunking, and Output Formats
The unstructured library provides open-source components for ingesting and pre-processing images and text documents so they can be fed downstream into LLM pipelines. Three supporting subsystems are commonly referenced from this repository: an embed module for generating vector embeddings, a family of connectors for moving data between sources and destinations, and a metrics/telemetry surface for observing behavior. This page documents the current state of each of these subsystems as reflected in the source tree of this repository, including the migration history of the embed module and the deprecation notice that affects users.
Embeddings Module Status
The unstructured/embed directory historically contained client wrappers for third-party embedding providers (OpenAI, Voyage AI, Vertex AI, Bedrock, HuggingFace, OctoAI). As of the current revision, this directory carries an explicit unmaintained notice.
"Project has been moved to: Unstructured Ingest. This python module will be removed from this repo in the near future." Source: unstructured/embed/README.md:1-5
This means:
- New embedding work should target the unstructured-ingest repository rather than this one.
- Anyone importing from unstructured.embed.* in production code should plan for removal and migrate to the equivalent provider modules under unstructured_ingest.
- The connector-style embedding integrations referenced by the deprecated README are no longer the canonical path; downstream RAG users typically obtain embeddings through ingest pipelines or by passing elements to a vector store directly.
If you must call the embed helpers while the directory still exists, expect limited maintenance — security fixes may not be backported. Plan migration before the planned removal.
Connectors
Ingest and destination connectors are not maintained in this repository. The top-level README links users to a sibling project for batch processing:
"Batch Processing | Ingesting batches of documents through Unstructured" Source: README.md
Documentation references for connector behavior live at docs.unstructured.io/open-source/ingest/overview rather than in this repo. From the perspective of unstructured proper, connectors therefore fall outside the supported API surface: file-type partitioning (PDF, DOCX, HTML, etc.) is handled here, while the connectors that read/write S3, Azure, GCS, databases, and chat platforms are owned by unstructured-ingest.
For users who arrived at this repo looking for connector configuration, the practical answer is: install and configure unstructured-ingest separately, then chain it with unstructured.partition.* outputs.
Metrics and Telemetry
Anonymous Telemetry (Default Off)
unstructured ships with anonymous usage telemetry that is off by default. Opt-in and opt-out are controlled via environment variables, with opt-out always taking precedence.
| Variable | Effect | Notes |
|---|---|---|
| UNSTRUCTURED_TELEMETRY_ENABLED | Set to true / 1 to opt in | Must be set before importing unstructured |
| DO_NOT_TRACK | Any non-empty value opts out | Overrides opt-in |
| SCARF_NO_ANALYTICS | Any non-empty value opts out | Overrides opt-in |
Source: README.md: "Telemetry is off by default. To opt in, set UNSTRUCTURED_TELEMETRY_ENABLED=true ... opt-out takes precedence. Unset the variable or leave it empty if you do not want to opt out."
In practice, the rule is simple: leave telemetry unset, or explicitly set one of the opt-out variables, and no events are emitted. If you choose to opt in, do so before the first import unstructured so the dispatcher initializes in the correct mode.
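The precedence rule can be written down as a small predicate. This is a sketch of the documented rule, not the library's implementation: the telemetry_enabled function is hypothetical, and treating the opt-in value case-insensitively is an assumption.

```python
from typing import Mapping


def telemetry_enabled(env: Mapping[str, str]) -> bool:
    """Apply the documented rule: opt-out beats opt-in, default is off."""
    # Any non-empty opt-out value disables telemetry outright.
    if env.get("DO_NOT_TRACK") or env.get("SCARF_NO_ANALYTICS"):
        return False
    # Assumption: "true" is matched case-insensitively.
    opt_in = env.get("UNSTRUCTURED_TELEMETRY_ENABLED", "").strip().lower()
    return opt_in in ("true", "1")


print(telemetry_enabled({}))
print(telemetry_enabled({"UNSTRUCTURED_TELEMETRY_ENABLED": "true"}))
print(telemetry_enabled({"UNSTRUCTURED_TELEMETRY_ENABLED": "1", "DO_NOT_TRACK": "1"}))
```

Passing os.environ to the function gives a quick way to confirm what a given process would do before unstructured is imported.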
Environment-Configurable Knobs
A broader set of operational parameters (model sizes, OCR agent selection, cache sizes, working-directory toggles) is exposed through ENVConfig rather than a public API. Source: unstructured/partition/utils/config.py
For example, OCR_AGENT_CACHE_SIZE controls how the OCR agent factory memoizes per-language instances:
@staticmethod
@lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE)
def get_instance(ocr_agent_module: str, language: str) -> "OCRAgent":
    ...
Source: unstructured/partition/utils/ocr_models/ocr_interface.py:38-39
OCR agents themselves are selectable via the OCR_AGENT environment variable, with module paths whitelisted to prevent arbitrary code loading. Source: unstructured/partition/utils/constants.py and unstructured/partition/utils/ocr_models/ocr_interface.py:14-15.
Performance Benchmarking and Profiling
For users who need to measure partitioning performance, a self-contained toolkit ships under scripts/performance/. Source: scripts/performance/README.md
The toolkit separates two concerns:
- Benchmarking: runs partitioning against a fixed corpus and records timings keyed by architecture, instance type, and git hash. Optional S3 publication is gated by PUBLISH_RESULTS=true. Iteration count is controlled by NUM_ITERATIONS. Docker execution is opt-in via DOCKER_TEST=true.
- Profiling: uses py-spy to capture CPU/allocations across called functions during a partition run. Results are viewable with speedscope (local install via npm install -g speedscope) or by uploading the .speedscope file to https://www.speedscope.app/.
Where These Subsystems Fit in the Pipeline
flowchart LR
A[Raw Documents] --> B[partition_* / partition]
B --> C[Element objects]
C --> D[unstructured-ingest]
D --> E[Destination<br/>vector store / DB]
E --> F[LLM / RAG]
C -.opt-in.-> G[Telemetry sink]
C -.bench.-> H[scripts/performance]
This diagram reflects the current split of responsibilities: partitioning stays in unstructured, connectors and embedding-provider integrations live in unstructured-ingest, telemetry is an opt-in background signal, and benchmarking is a developer-side measurement tool rather than a runtime component.
Common Failure Modes
- ImportError from unstructured.embed.* after a refactor. The directory is slated for removal; switch to the equivalent provider in unstructured-ingest. Source: unstructured/embed/README.md
- OCR agent not picked up from environment. The module path must appear in OCR_AGENT_MODULES_WHITELIST; otherwise OCRAgent.get_instance raises ValueError. Source: unstructured/partition/utils/ocr_models/ocr_interface.py:42-46
- Telemetry still firing after opt-out. Ensure DO_NOT_TRACK or SCARF_NO_ANALYTICS is set to a non-empty value before the Python process imports unstructured. Source: README.md
- Benchmark numbers not comparable across runs. Set INSTANCE_TYPE and a known git hash, and pin NUM_ITERATIONS, so S3-published results can be aggregated meaningfully. Source: scripts/performance/README.md
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 22 structured pitfall item(s), including 3 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/Unstructured-IO/unstructured/issues/3871
2. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/Unstructured-IO/unstructured/issues/4320
3. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: packet_text.keyword_scan | https://github.com/Unstructured-IO/unstructured
4. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: identity.distribution | https://github.com/Unstructured-IO/unstructured
5. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/Unstructured-IO/unstructured
6. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: Number getting converted into scientific notation in metadata.text_as_html
- User impact: Developers may hit a documented source-backed failure mode: Number getting converted into scientific notation in metadata.text_as_html
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Number getting converted into scientific notation in metadata.text_as_html. Context: observed when using Python.
- Evidence: failure_mode_cluster:github_issue | https://github.com/Unstructured-IO/unstructured/issues/3871
7. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured
8. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/Unstructured-IO/unstructured
9. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/Unstructured-IO/unstructured
10. Runtime risk: Runtime risk requires verification
- Severity: low
- Finding: Developers should check this performance risk before relying on the project: [Feature Request] Add document layout analysis confidence scores
- User impact: Developers may hit a documented source-backed failure mode: [Feature Request] Add document layout analysis confidence scores
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Feature Request] Add document layout analysis confidence scores. Context: observed when using Python.
- Evidence: failure_mode_cluster:github_issue | https://github.com/Unstructured-IO/unstructured/issues/4320
11. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured
12. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured
Source: Doramagic discovery, validation, and Project Pack records
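For the runtime risk in item 6 (numbers converted to scientific notation in `metadata.text_as_html`), a small check can help confirm whether your own documents are affected. The sketch below is not part of `unstructured`; it uses only the Python standard library, and the helper name `find_scientific_notation_cells` and the sample HTML are hypothetical. It assumes the table HTML is a plain string of `<td>`/`<th>` cells.

```python
import re
from html.parser import HTMLParser

# Matches a whole cell value written in scientific notation, e.g. 1.23e+05 or 4E-3.
SCI_NOTATION = re.compile(r"^[+-]?\d+(?:\.\d+)?[eE][+-]?\d+$")


class _CellCollector(HTMLParser):
    """Collects the text content of every <td> and <th> cell."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self._buffer = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self.cells.append("".join(self._buffer).strip())
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def find_scientific_notation_cells(table_html):
    """Return the cell values in table_html that look like scientific notation."""
    collector = _CellCollector()
    collector.feed(table_html)
    collector.close()
    return [cell for cell in collector.cells if SCI_NOTATION.match(cell)]


# Hypothetical sample: one cell holds a number rendered in scientific notation.
sample = "<table><tr><th>ID</th><td>1.23457e+11</td><td>42</td></tr></table>"
suspicious = find_scientific_notation_cells(sample)
print(suspicious)
```

To apply this to real output, pass the `metadata.text_as_html` string of each table element to the helper and compare any flagged cells against the source document. A flagged cell is only a hint: some documents legitimately contain values written in scientific notation.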
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using unstructured with real data or production workflows.
- Number getting converted into scientific notation in metadata.text_as_html - github / github_issue
- [[Feature Request] Add document layout analysis confidence scores](https://github.com/Unstructured-IO/unstructured/issues/4320) - github / github_issue
- 0.23.1 - github / github_release
- 0.23.0 - github / github_release
- 0.22.32 - github / github_release
- 0.22.31 - github / github_release
- 0.22.30 - github / github_release
- 0.22.29 - github / github_release
- 0.22.28 - github / github_release
- 0.22.27 - github / github_release
- 0.22.6 - github / github_release
- 0.22.23 - github / github_release
Source: Project Pack community evidence and pitfall evidence