Doramagic Project Pack · Human Manual

unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Overview, Installation, and Quick Start

Related topics: Document Partitioning Pipeline, Elements, Chunking, and Output Formats


What is `unstructured`

The unstructured library provides open-source components for ingesting and pre-processing images and text documents — PDFs, HTML, Word documents, XML, and many more. The library is purpose-built to streamline the data processing workflow for Large Language Models (LLMs): its modular partitioning functions and connectors form a cohesive system that turns unstructured data into structured, element-typed outputs suitable for downstream pipelines such as RAG, embedding, and chunking.

Source: README.md:3-15

The recommended entry point is unstructured.partition.auto.partition, which detects the file type and routes it to a format-specific partitioning function. Each file type (PDF, DOCX, HTML, …) has its own partitioner that returns a list of Element objects (e.g., Title, NarrativeText, Table, ListItem).

Installation

There are three supported installation paths described in the project README. The choice depends on whether you want a quick local install, a reproducible container, or a full development environment.

1. Install from PyPI

The base package is published to PyPI. Extras pull in document-type-specific dependencies:

pip install "unstructured[docx]"
pip install "unstructured[pdf]"

For Windows users running conda, the README points to the dedicated Windows install guide in the project docs.

Source: README.md:55-78

2. Run the library in a container

A multi-platform Docker image is built for every push to main, tagged with the short commit hash and the application version. This is the fastest way to get a working environment without worrying about OS-level dependencies such as poppler or Tesseract language packs.

docker pull unstructuredio/unstructured:latest

The README also documents platform selection (e.g., --platform linux/amd64) for Apple silicon users.

Source: README.md:62-78

3. Install for local development

The repo ships a Makefile and an optional pre-commit configuration. The typical dev workflow is:

make install
pre-commit install       # optional, for auto-formatting on commit
make check               # show lint/format diffs without applying
make tidy                # apply lint/format fixes
make docker-start-dev    # run a dev container with the repo mounted

Source: README.md:30-50

flowchart LR
    A[Choose install path] --> B{Use container?}
    B -- yes --> C[docker pull<br/>unstructuredio/unstructured]
    B -- no --> D{pip or dev?}
    D -- pip --> E[pip install<br/>'unstructured[pdf,docx,...]']
    D -- dev --> F[make install<br/>pre-commit install]
    E --> G["Run partition()"]
    F --> G
    C --> G

Quick Start

The canonical quick start, taken directly from the README, uses the auto-router and the included layout-parser-paper.pdf example:

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

print("\n\n".join([str(el) for el in elements]))

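The result is a list of Element objects. Each element stringifies to its text, so the join above prints the document text with a blank line between elements. The detected type is carried by the element object itself (its class, such as Title or NarrativeText) rather than by the string.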

Source: README.md:82-105
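A quick way to inspect a partition result is to count elements by type. The sketch below is illustrative only: the two small classes are stand-ins defined here so the example runs without the library installed, the sample strings are placeholders, and summarize is a hypothetical helper, not part of unstructured.

```python
from collections import Counter
from dataclasses import dataclass


# Stand-in element classes (assumption: real elements are typed objects
# whose string form is their text).
@dataclass
class Title:
    text: str

    def __str__(self) -> str:
        return self.text


@dataclass
class NarrativeText:
    text: str

    def __str__(self) -> str:
        return self.text


def summarize(elements) -> Counter:
    """Count elements by class name, e.g. Title or NarrativeText."""
    return Counter(type(el).__name__ for el in elements)


# Placeholder data standing in for the list returned by partition().
elements = [
    Title("A heading"),
    NarrativeText("First paragraph."),
    NarrativeText("Second paragraph."),
]

counts = summarize(elements)
print(dict(counts))
print("\n\n".join(str(el) for el in elements))
```

With the library installed, the same summarize call should work on the list returned by partition, since it relies only on each element's class name.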

The repository also ships example documents that are useful as smoke tests:

| File | Format | Purpose |
| --- | --- | --- |
| example-docs/layout-parser-paper.pdf | PDF | Tests PDF partitioning and OCR strategies |
| example-docs/example-10k.html | HTML | Tests HTML partitioning and table metadata |
| example-docs/factbook.xml / factbook.xsl | XML / XSL | Tests stylesheet-aware XML partitioning |

Source: example-docs/README.md:3-21

For production batch ingestion across many sources, the README points to a separate companion repository, unstructured-ingest.

Configuration, OCR Agents, and Diagnostics

OCR agent selection

partition_pdf and other partitioners can delegate text extraction from images to an OCR backend. The selection is driven by the OCR_AGENT environment variable, parsed by OCRAgent.get_agent() in the OCR interface. The whitelist of supported agents is defined as module-level constants.

Source: unstructured/partition/utils/ocr_models/ocr_interface.py:18-34

The currently shipped implementations are:

| Class | Backend | Notes |
| --- | --- | --- |
| OCRAgentTesseract | Tesseract (via unstructured_pytesseract) | Default; forces OMP_THREAD_LIMIT=1 for performance. Source: unstructured/partition/utils/ocr_models/tesseract_ocr.py:30-50 |
| OCRAgentPaddle | PaddleOCR (via unstructured_paddleocr) | Auto-selects CPU/GPU; uses MKL-DNN when available. Source: unstructured/partition/utils/ocr_models/paddle_ocr.py:24-45 |
| OCRAgentGoogleVision | Google Cloud Vision | Configurable endpoint via GOOGLEVISION_API_ENDPOINT. Source: unstructured/partition/utils/ocr_models/google_vision_ocr.py:26-44 |

Community request #2467 asks for a way to deactivate OCR while keeping strategy="hi_res" — useful for text-based PDFs that still need layout analysis but not Tesseract overhead.

Environment configuration

Runtime knobs (OCR backend, temp directories, agent cache size, etc.) live in ENVConfig and are read from environment variables. Constants used for OCR_AGENT aliases (e.g., OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE_OLD) are imported from unstructured.partition.utils.constants.

Source: unstructured/partition/utils/config.py:14-58, unstructured/partition/utils/ocr_models/ocr_interface.py:20-28
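The pattern of reading runtime knobs from environment variables can be sketched with a small class whose properties consult os.environ on each access. This is a minimal illustration, not the library's ENVConfig: the class name, the helper methods, and both default values are assumptions made for the example.

```python
import os


class EnvConfigSketch:
    """Hypothetical stand-in for an environment-backed config object."""

    def _get_string(self, var: str, default: str = "") -> str:
        return os.environ.get(var, default)

    def _get_int(self, var: str, default: int) -> int:
        raw = os.environ.get(var)
        # An unset or empty variable falls back to the default.
        return int(raw) if raw else default

    @property
    def OCR_AGENT(self) -> str:
        # Placeholder default, not necessarily the library's value.
        return self._get_string("OCR_AGENT", "tesseract")

    @property
    def OCR_AGENT_CACHE_SIZE(self) -> int:
        # Placeholder default, not necessarily the library's value.
        return self._get_int("OCR_AGENT_CACHE_SIZE", 1)


env_config = EnvConfigSketch()

os.environ["OCR_AGENT_CACHE_SIZE"] = "4"
print(env_config.OCR_AGENT_CACHE_SIZE)
```

Because the properties read the environment lazily, a value exported before the process starts, or set early in the program, is picked up without any explicit reload step.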

Diagnostics

Per release 0.22.26, a new unstructured doctor CLI command was added that prints the resolved environment and dependency status — helpful when triaging install or OCR issues.

Source: Release notes referenced from the community context.

Telemetry

Telemetry is off by default. To opt in, set UNSTRUCTURED_TELEMETRY_ENABLED=true (or =1) before importing unstructured. To opt out, set DO_NOT_TRACK or SCARF_NO_ANALYTICS to any non-empty value — opt-out takes precedence over opt-in.

Source: README.md:152-160
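The precedence rule above (opt-out beats opt-in, and the default is off) can be restated as a small pure function. This is a sketch of the documented rule only, not the library's implementation; the function name is hypothetical, and treating the opt-in value case-insensitively is an assumption.

```python
from typing import Mapping

OPT_OUT_VARS = ("DO_NOT_TRACK", "SCARF_NO_ANALYTICS")


def telemetry_enabled(env: Mapping[str, str]) -> bool:
    """Apply the documented rule: off by default, opt-out wins over opt-in."""
    # Any non-empty opt-out variable disables telemetry outright.
    if any(env.get(name) for name in OPT_OUT_VARS):
        return False
    # Assumption: "true" is matched case-insensitively; "1" also opts in.
    value = env.get("UNSTRUCTURED_TELEMETRY_ENABLED", "")
    return value.strip().lower() in {"true", "1"}


print(telemetry_enabled({}))
print(telemetry_enabled({"UNSTRUCTURED_TELEMETRY_ENABLED": "true"}))
print(telemetry_enabled({"UNSTRUCTURED_TELEMETRY_ENABLED": "1", "DO_NOT_TRACK": "1"}))
```

Passing os.environ to the function gives a quick check of what a given shell environment would resolve to under this reading of the rule.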

Performance and Embedding

For performance work, the scripts/performance/ directory ships two helpers: a benchmark script that publishes partitioning timings to S3 (gated by PUBLISH_RESULTS=true), and a profiling script that uses py-spy to emit speedscope-viewable flame graphs.

Source: scripts/performance/README.md:3-37

The unstructured.embed module has been moved to unstructured-ingest and is marked unmaintained; it will be removed from this repository in the near future.

Source: unstructured/embed/README.md:1-7

Common Pitfalls (from community)

Source: unstructured/partition/utils/sorting.py:1-40, unstructured/partition/utils/xycut.py:1-20

  • Two-column PDFs: partition_pdf does not always recover reading order on two-column layouts. Tracking issue: #356. Element ordering is post-processed by recursive_xy_cut and SORT_MODE_BASIC helpers.
  • Numpy 2.0 compatibility: Requires bumping onnxruntime to 1.19 and patching type/sort behavior. Tracking issue: #3684.
  • pdfminer.six pin: unstructured[pdf]==0.16.11 pulls in a pdfminer.six version that breaks langchain_community.document_loaders. Pin a compatible version until upstream fixes it. Tracking issue: #3982.
  • Numbers in metadata.text_as_html: Table HTML serialization can render large integers in scientific notation. Tracking issue: #3871.

See Also

Source: https://github.com/Unstructured-IO/unstructured / Human Manual

Document Partitioning Pipeline

Related topics: Overview, Installation, and Quick Start, Elements, Chunking, and Output Formats


Overview and Entry Points

The unstructured library provides a unified pipeline for converting heterogeneous unstructured files (PDFs, images, HTML, audio, and many more formats) into a normalized stream of structured Element objects suitable for downstream LLM ingestion. From the README, the canonical entry point is unstructured.partition.auto.partition, which detects file type and routes to the appropriate file-specific partitioner. The library advertises partitioning as its "core functionality" with a full list of options documented in the partitioning docs. Source: README.md.

The pipeline is intentionally pluggable: rather than coupling detection, OCR, layout analysis, and ordering, the project exposes abstract base classes for each role and ships concrete implementations for popular services. This lets users swap OCR backends or sorting algorithms without modifying the higher-level partitioner code, and it lets maintainers add new strategies incrementally.

Partitioning Strategies and Source Identifiers

A single document may be parsed with different cost/quality trade-offs. The strategy namespace is declared as a class of string constants in unstructured/partition/utils/constants.py:18-22 with the four values AUTO, FAST, OCR_ONLY, and HI_RES. FAST skips model inference and uses only the text layer; HI_RES runs full layout detection plus OCR; OCR_ONLY bypasses the layout model and OCRs everything; AUTO lets the partitioner pick per file. Source: unstructured/partition/utils/constants.py:18-22.
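A class of string constants like the one described can be mirrored in a few lines. The sketch below is a local mirror for illustration, not an import from the library: the lowercase string values follow the strategy="hi_res" spelling used elsewhere in this manual and are otherwise an assumption, and validate_strategy is a hypothetical helper.

```python
class PartitionStrategy:
    """Local mirror of the four strategy names (values are assumed)."""

    AUTO = "auto"
    FAST = "fast"
    OCR_ONLY = "ocr_only"
    HI_RES = "hi_res"


# One-line summaries taken from the prose above.
STRATEGY_NOTES = {
    PartitionStrategy.AUTO: "let the partitioner pick per file",
    PartitionStrategy.FAST: "skip model inference, use only the text layer",
    PartitionStrategy.OCR_ONLY: "bypass the layout model and OCR everything",
    PartitionStrategy.HI_RES: "run full layout detection plus OCR",
}


def validate_strategy(strategy: str) -> str:
    """Return the strategy unchanged, or raise if it is not a known name."""
    if strategy not in STRATEGY_NOTES:
        known = ", ".join(sorted(STRATEGY_NOTES))
        raise ValueError(f"unknown strategy {strategy!r}; expected one of: {known}")
    return strategy


print(validate_strategy("hi_res"), "->", STRATEGY_NOTES["hi_res"])
```

Validating the name before starting a long batch run avoids discovering a typo only after the first file has been opened.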

The same module defines a Source enum that tags every produced element with the subsystem that produced its text — currently PDFMINER, OCR_TESSERACT, OCR_PADDLE, and OCR_GOOGLEVISION. This provenance tag is the basis of the detection_origin metadata field referenced in the 0.23.0 release notes ("add enrichment origins metadata field"), enabling users to trace why a piece of text appeared in the output. A second enum, OCRMode, distinguishes INDIVIDUAL_BLOCKS from FULL_PAGE, which controls whether per-block layout coordinates are preserved. Source: unstructured/partition/utils/constants.py:1-30.

OCR Subsystem

The OCR layer is abstracted behind OCRAgent in unstructured/partition/utils/ocr_models/ocr_interface.py. The class exposes two main methods — get_text_from_image for raw string extraction and get_layout_from_image for layout-aware regions. Agent selection is environment-driven: OCR_AGENT resolves to a fully-qualified module/class name, validated against OCR_AGENT_MODULES_WHITELIST to prevent arbitrary import. Instances are cached with functools.lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE) so a single agent object is reused across calls, avoiding repeated model loads. Source: unstructured/partition/utils/ocr_models/ocr_interface.py:18-42.
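The combination of a whitelist check and a cached factory can be sketched with the standard library alone. Everything here is illustrative: the whitelist holds standard-library module names as stand-ins for OCR agent modules, and get_instance is a simplified hypothetical factory, not the library's method.

```python
import importlib
from functools import lru_cache

# Hypothetical whitelist; stdlib modules stand in for OCR agent modules.
MODULES_WHITELIST = ("collections", "fractions")


@lru_cache(maxsize=1)
def get_instance(qualified_name: str):
    """Import module.Class by name, but only from whitelisted modules."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if module_name not in MODULES_WHITELIST:
        raise ValueError(f"module {module_name!r} is not whitelisted")
    module = importlib.import_module(module_name)
    # Instantiate once; later calls with the same name reuse the cached object.
    return getattr(module, class_name)()


first = get_instance("collections.OrderedDict")
second = get_instance("collections.OrderedDict")
print(first is second)
```

The whitelist check runs before any import, so a name outside the list is rejected without loading its module. lru_cache does not cache exceptions, so a rejected name is re-validated on every call.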

Three concrete implementations ship in-tree:

  1. Tesseract (OCRAgentTesseract) — Wraps unstructured_pytesseract and forces OMP_THREAD_LIMIT=1 to avoid contention in multi-threaded hosts. It returns sorted text and exposes an HOCR namespace for parsing word-level geometry. Source: unstructured/partition/utils/ocr_models/tesseract_ocr.py:17-44.
  2. PaddleOCR (OCRAgentPaddle) — Loads PaddlePaddle with disable_signal_handler() and prefers GPU when available, falling back to CPU with MKL-DNN when supported. Source: unstructured/partition/utils/ocr_models/paddle_ocr.py:21-39.
  3. Google Cloud Vision (OCRAgentGoogleVision) — Calls document_text_detection, with the API endpoint overrideable via GOOGLEVISION_API_ENDPOINT and language hints passed through ImageContext. Source: unstructured/partition/utils/ocr_models/google_vision_ocr.py:21-38.

Community issue #2467 tracks a request to optionally disable OCR inside hi_res to speed up text-heavy PDFs, which would require a new flag on the OCR abstraction rather than a partitioner-level change.

Layout Sorting and Reading Order

Once regions are detected, the partitioner must decide their reading order. The XY-cut algorithm and a fallback "basic" mode are exposed in unstructured/partition/utils/sorting.py. The module converts element coordinates to (left, top, right, bottom) bounding boxes via coordinates_to_bbox, applies shrink_bbox for tolerance, then dispatches to either recursive_xy_cut or recursive_xy_cut_swapped depending on aspect ratio. Source: unstructured/partition/utils/sorting.py:18-44.

The three sort modes SORT_MODE_XY_CUT, SORT_MODE_BASIC, and SORT_MODE_DONT are exported from constants.py. Community issue #356 highlights a known limitation: two-column PDFs are sometimes read out of order because the XY-cut heuristic can fail on dense academic layouts, motivating ongoing improvements to the sort step.
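The difference between a basic top-to-bottom sort and an XY-cut style ordering is easiest to see on a toy two-column page. The code below is a deliberately simplified sketch of the recursive cutting idea, not the library's recursive_xy_cut: it cuts only on completely empty gaps, ignores tolerance and aspect ratio, and uses hypothetical function names.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def split_on_gaps(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes separated by empty gaps along one axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest right/bottom edge seen so far
    for box in ordered[1:]:
        if box[axis] >= reach:  # empty gap found: start a new group
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], axis: int = 0, stalled: bool = False) -> List[Box]:
    """Order boxes by alternating vertical and horizontal cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    groups = split_on_gaps(boxes, axis)
    if len(groups) == 1:
        if stalled:
            # No cut possible on either axis: fall back to a basic sort.
            return sorted(boxes, key=lambda b: (b[1], b[0]))
        return xy_cut(boxes, 1 - axis, stalled=True)
    ordered: List[Box] = []
    for group in groups:
        ordered.extend(xy_cut(group, 1 - axis))
    return ordered


page = [
    (0, 0, 40, 10),     # left column, first block
    (60, 0, 100, 10),   # right column, first block
    (0, 20, 40, 30),    # left column, second block
    (60, 20, 100, 30),  # right column, second block
]

basic_order = sorted(page, key=lambda b: (b[1], b[0]))
xy_order = xy_cut(page)
print(basic_order)
print(xy_order)
```

On this page the basic sort interleaves the two columns row by row, while the cut-based ordering finishes the left column before starting the right one. The sketch also hints at why real layouts are harder: a full-width block above the columns removes the vertical gap, and the result then depends on which axis is cut first.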

Multimodal Partitioning (Speech-to-Text)

Audio and video files are handled through a parallel speech-to-text (STT) pipeline declared in unstructured/partition/utils/speech_to_text/speech_to_text_interface.py. SpeechToTextAgent mirrors OCRAgent in spirit: a module path is resolved from the STT_AGENT env var (defaulting to Whisper), validated against STT_AGENT_MODULES_WHITELIST, and cached via lru_cache. The interface returns a list of TranscriptionSegment typed dicts (text, start, end). Source: unstructured/partition/utils/speech_to_text/speech_to_text_interface.py:17-58.

The default Whisper agent (SpeechToTextAgentWhisper) snapshots model size, device, and FP16 from env vars (WHISPER_MODEL_SIZE, WHISPER_DEVICE, WHISPER_FP16) at construction time. A per-instance threading.Lock serializes calls because whisper.model.transcribe() is not documented as thread-safe — a hidden throughput ceiling under concurrent workloads. For true parallelism the docstring recommends process-based concurrency. The agent may return empty or whitespace-only segments; the audio partitioner is the single place that strips and drops them, keeping agent implementations simple. Source: unstructured/partition/utils/speech_to_text/whisper_stt.py:25-50, Source: unstructured/partition/utils/speech_to_text/speech_to_text_interface.py:13-22.
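The segment shape and the strip-and-drop step can be sketched as follows. The TypedDict mirrors the three keys named above; treating start and end as float seconds is an assumption, and drop_empty_segments is a hypothetical helper that illustrates the cleanup, not the audio partitioner's actual code.

```python
from typing import List, TypedDict


class TranscriptionSegment(TypedDict):
    text: str
    start: float  # assumed to be seconds from the start of the audio
    end: float


def drop_empty_segments(segments: List[TranscriptionSegment]) -> List[TranscriptionSegment]:
    """Strip whitespace from each segment and drop those left empty."""
    cleaned: List[TranscriptionSegment] = []
    for segment in segments:
        text = segment["text"].strip()
        if text:
            cleaned.append({"text": text, "start": segment["start"], "end": segment["end"]})
    return cleaned


# Placeholder segments standing in for an agent's raw output.
raw = [
    {"text": "  Hello there. ", "start": 0.0, "end": 1.5},
    {"text": "   ", "start": 1.5, "end": 2.0},
    {"text": "", "start": 2.0, "end": 2.2},
    {"text": "Goodbye.", "start": 2.2, "end": 3.0},
]

segments = drop_empty_segments(raw)
print(segments)
```

Doing the cleanup in one place means an agent can return whatever its backend produces, and callers still see only segments with usable text.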

Pipeline Flow

flowchart LR
    A[Input file] --> B[partition function]
    B --> C{File type?}
    C -->|PDF/Image| D[Layout detection]
    D --> E{strategy?}
    E -->|fast| F[Text layer extraction]
    E -->|hi_res| G[Model + OCR]
    G --> H[OCRAgent<br/>Tesseract/Paddle/Vision]
    F --> I[XY-cut sorting]
    H --> I
    I --> J[Element stream]
    C -->|Audio| K[SpeechToTextAgent<br/>Whisper]
    K --> J

Operational Concerns

  • Telemetry is off by default; opt in with UNSTRUCTURED_TELEMETRY_ENABLED=true, opt out with any non-empty value of DO_NOT_TRACK or SCARF_NO_ANALYTICS. Source: README.md.
  • Dependency selection matters: unstructured[pdf]==0.16.11 resolved pdfminer.six 20250327 and broke imports — see community issue #3982. Pin compatible versions in production environments.
  • Performance tooling: scripts/performance/README.md documents benchmark.sh and profile.sh, plus an integration with speedscope for flame-graph inspection of partition runs. Source: scripts/performance/README.md:1-15.
  • Confidence scores are not yet exposed on extracted elements; feature request #4320 proposes surfacing per-element model confidence to help RAG pipelines triage low-quality extractions such as rotated tables or scanned handwriting.
  • Output formatting: feature request #3525 requests Markdown extraction alongside JSON, since JSON is not always ideal as a direct LLM input.


Elements, Chunking, and Output Formats

Related topics: Document Partitioning Pipeline, Embeddings, Connectors, and Metrics


The unstructured library is designed around three tightly connected concepts: Elements (the typed content objects produced by partitioning), Chunking (grouping elements into LLM-friendly units), and Output Formats (how those elements and chunks are serialized for downstream consumers). This page documents the structure of each layer, the configuration knobs available, and known limitations observed in community discussions.

Element Model and Coordinates

Partitioning functions produce a list of Element objects, each carrying a typed classification (e.g., Title, NarrativeText, Table, Image) plus optional positional and provenance metadata. The Source enum in unstructured/partition/utils/constants.py declares where an element's text was derived from, including PDFMINER, OCR_TESSERACT, OCR_PADDLE, and OCR_GOOGLEVISION. Source: unstructured/partition/utils/constants.py:Source

Spatial layout matters for downstream sorting and chunking. CoordinatesMetadata is normalized into (left, top, right, bottom) bounding boxes via coordinates_to_bbox, and shrink_bbox contracts a box by a configurable factor while preserving its top-left corner. Source: unstructured/partition/utils/sorting.py:coordinates_to_bbox
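The two box operations described here are small enough to sketch directly. The functions below illustrate the geometry only: points_to_bbox is a hypothetical name that takes raw corner points rather than a CoordinatesMetadata object, and this shrink_bbox is a local re-implementation of the stated behavior (scale width and height, keep the top-left corner), not the library function.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def points_to_bbox(points: Sequence[Point]) -> Box:
    """Collapse corner points into an axis-aligned (left, top, right, bottom) box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def shrink_bbox(bbox: Box, factor: float) -> Box:
    """Scale width and height by factor while keeping the top-left corner fixed."""
    left, top, right, bottom = bbox
    return (
        left,
        top,
        left + (right - left) * factor,
        top + (bottom - top) * factor,
    )


corners = ((10, 20), (10, 60), (110, 60), (110, 20))
bbox = points_to_bbox(corners)
smaller = shrink_bbox(bbox, 0.5)
print(bbox)
print(smaller)
```

Shrinking before sorting gives neighbouring boxes a little slack, so regions that barely touch or overlap are less likely to block a cut between them.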

When the layout detection backend produces overlapping or rotated boxes, the xy-cut algorithm (recursive_xy_cut, recursive_xy_cut_swapped) is invoked to produce a deterministic reading order. Source: unstructured/partition/utils/xycut.py:recursive_xy_cut

OCR Backends and Element Provenance

Elements originating from image regions rely on one of three OCR agents, all implementing the OCRAgent abstract base class:

| Backend | Module | Notes |
| --- | --- | --- |
| Tesseract | unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract | Default; sets OMP_THREAD_LIMIT=1 for stability. Source: tesseract_ocr.py |
| PaddleOCR | unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle | GPU auto-detected via paddle.device.cuda.device_count(). Source: paddle_ocr.py |
| Google Vision | unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision | Honors GOOGLEVISION_API_ENDPOINT env var. Source: google_vision_ocr.py |

Selection is governed by the OCR_AGENT environment variable, and the resolved module must appear in the OCR_AGENT_MODULES_WHITELIST constant (overridable via env var of the same name). Source: unstructured/partition/utils/ocr_models/ocr_interface.py:OCRAgent.get_agent and constants.py:OCR_AGENT_MODULES_WHITELIST

Chunking Pipeline

Chunking groups adjacent elements so that downstream embeddings and LLM context windows operate on coherent units rather than individual lines. Releases in the 0.22.x and 0.23.x series added several chunking refinements:

  • Table chunking option (0.22.30) — opt-in table-aware chunking.
  • Renamed isolate_tables to isolate_table (0.22.31) for singular consistency.
  • First table chunk preserves column/row span (0.22.23) to keep tabular structure intact.
  • AcroForm field text extraction in PDFs (0.23.1) so form values flow into chunks rather than being dropped.
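The basic idea of chunking, grouping adjacent elements until a size budget is reached, can be sketched as a greedy loop over element texts. This is a generic illustration under stated assumptions, not the library's chunker: chunk_texts and its parameter names are hypothetical, and it ignores element types, tables, and metadata entirely.

```python
from typing import Iterable, List


def chunk_texts(texts: Iterable[str], max_characters: int = 500, separator: str = "\n\n") -> List[str]:
    """Greedily join adjacent texts without exceeding max_characters per chunk."""
    chunks: List[str] = []
    current = ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            # Adding this text would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_texts(["aaaa", "bbbb", "cccc"], max_characters=10, separator=" ")
print(chunks)
```

A single text longer than the budget is kept whole as its own chunk in this sketch; a production chunker would also need a rule for splitting oversized elements.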

The chunking subsystem is configured through ENVConfig, which reads runtime switches such as the OCR agent cache size and global working directory. Source: unstructured/partition/utils/config.py:ENVConfig

flowchart LR
    A[partition*] --> B[Element list]
    B --> C[Chunking]
    C --> D[PreChunkFilter]
    D --> E[TextPreChunk]
    E --> F[Chunk elements]
    F --> G[Output: JSON / NDJSON]

Telemetry around this pipeline is off by default and gated by UNSTRUCTURED_TELEMETRY_ENABLED. Source: README.md:Analytics

Output Formats and Serialization

By default, partition("...") returns a Python list of Element objects that stringify to typed text. For inter-process serialization, the library emits JSON-compatible dicts and NDJSON streams. NDJSON file-type detection was hardened in 0.22.27. Source: Release 0.22.27
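NDJSON is one JSON object per line, which makes it easy to stream element records between processes. The sketch below uses plain dictionaries with illustrative type and text keys in place of real serialized elements; to_ndjson and from_ndjson are hypothetical helpers built on the standard json module.

```python
import json
from typing import Dict, Iterable, List


def to_ndjson(records: Iterable[Dict]) -> str:
    """Serialize records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def from_ndjson(text: str) -> List[Dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Placeholder records standing in for serialized elements.
records = [
    {"type": "Title", "text": "A heading"},
    {"type": "NarrativeText", "text": "First paragraph."},
]

ndjson_text = to_ndjson(records)
print(ndjson_text, end="")
```

Since json.dumps escapes newlines inside strings, each record stays on one physical line even when an element's text spans several lines.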

Table elements expose HTML through chunk.metadata.text_as_html. Community issue #3871 reports that numeric cell values can be emitted in scientific notation (e.g., 4789234.7e+05) when this serializer coerces types — a known gap that affects downstream RAG pipelines consuming structured tables. Source: Issue #3871

Issue #3525 requests a Markdown output mode alongside JSON so that partitioning results can be fed directly to LLMs without losing metadata fidelity. Source: Issue #3525

Community Concerns and Known Limitations

Several recurring community themes influence how Elements, Chunking, and Output Formats should be used:

  • PDF two-column reading order (#356): elements are extracted but not always in correct order; xy-cut is the primary mitigation.
  • NumPy 2.0 compatibility (#3684): pending onnxruntime upgrade may shift element-coordinate sorting behavior.
  • Confidence scores (#4320): feature request to surface per-element confidence in metadata so downstream filters can drop low-quality text from chunks.
  • pdfminer.six pinning (#3982): mismatched pdfminer.six 20250327 causes import errors, forcing partition_pdf to fail before elements are produced.
  • OCR deactivation (#2467): no clean opt-out flag to skip OCR while keeping strategy="hi_res" table detection.
  • The bundled unstructured.embed module is deprecated; its README redirects to unstructured-ingest. Source: unstructured/embed/README.md

Operational Tooling

The scripts/performance/ directory provides benchmarking and py-spy-based profiling, allowing teams to verify that element extraction and chunking remain stable across commits. Source: scripts/performance/README.md


Embeddings, Connectors, and Metrics

Related topics: Document Partitioning Pipeline, Elements, Chunking, and Output Formats


The unstructured library provides open-source components for ingesting and pre-processing images and text documents so they can be fed downstream into LLM pipelines. Three supporting subsystems are commonly referenced from this repository: an embed module for generating vector embeddings, a family of connectors for moving data between sources and destinations, and a metrics/telemetry surface for observing behavior. This page documents the current state of each of these subsystems as reflected in the source tree of this repository, including the migration history of the embed module and the deprecation notice that affects users.

Embeddings Module Status

The unstructured/embed directory historically contained client wrappers for third-party embedding providers (OpenAI, Voyage AI, Vertex AI, Bedrock, HuggingFace, OctoAI). As of the current revision, this directory carries an explicit unmaintained notice.

"Project has been moved to: Unstructured Ingest. This python module will be removed from this repo in the near future." Source: unstructured/embed/README.md:1-5

This means:

  • New embedding work should target the unstructured-ingest repository rather than this one.
  • Anyone importing from unstructured.embed.* in production code should plan for removal and migrate to the equivalent provider modules under unstructured_ingest.
  • The connector-style embedding integrations referenced by the deprecated README are no longer the canonical path; downstream RAG users typically obtain embeddings through ingest pipelines or by passing elements to a vector store directly.

If you must call the embed helpers while the directory still exists, expect limited maintenance — security fixes may not be backported. Plan migration before the planned removal.

Connectors

Ingest and destination connectors are not maintained in this repository. The top-level README links users to a sibling project for batch processing:

"Batch Processing | Ingesting batches of documents through Unstructured" Source: README.md

Documentation references for connector behavior live at docs.unstructured.io/open-source/ingest/overview rather than in this repo. From the perspective of unstructured proper, connectors therefore fall outside the supported API surface: file-type partitioning (PDF, DOCX, HTML, etc.) is handled here, while the connectors that read/write S3, Azure, GCS, databases, and chat platforms are owned by unstructured-ingest.

For users who arrived at this repo looking for connector configuration, the practical answer is: install and configure unstructured-ingest separately, then chain it with unstructured.partition.* outputs.

Metrics and Telemetry

Anonymous Telemetry (Default Off)

unstructured ships with anonymous usage telemetry that is off by default. Opt-in and opt-out are controlled via environment variables, with opt-out always taking precedence.

| Variable | Effect | Notes |
| --- | --- | --- |
| UNSTRUCTURED_TELEMETRY_ENABLED | Set to true / 1 to opt in | Must be set before importing unstructured |
| DO_NOT_TRACK | Any non-empty value opts out | Overrides opt-in |
| SCARF_NO_ANALYTICS | Any non-empty value opts out | Overrides opt-in |

Source: README.md: "Telemetry is off by default. To opt in, set UNSTRUCTURED_TELEMETRY_ENABLED=true ... opt-out takes precedence. Unset the variable or leave it empty if you do not want to opt out."

In practice, the rule is simple: leave telemetry unset, or explicitly set one of the opt-out variables, and no events are emitted. If you choose to opt in, do so before the first import unstructured so the dispatcher initializes in the correct mode.

Environment-Configurable Knobs

A broader set of operational parameters (model sizes, OCR agent selection, cache sizes, working-directory toggles) is exposed through ENVConfig rather than a public API. Source: unstructured/partition/utils/config.py

For example, OCR_AGENT_CACHE_SIZE controls how the OCR agent factory memoizes per-language instances:

@staticmethod
@lru_cache(maxsize=env_config.OCR_AGENT_CACHE_SIZE)
def get_instance(ocr_agent_module: str, language: str) -> "OCRAgent":
    ...

Source: unstructured/partition/utils/ocr_models/ocr_interface.py:38-39

OCR agents themselves are selectable via the OCR_AGENT environment variable, with module paths whitelisted to prevent arbitrary code loading. Source: unstructured/partition/utils/constants.py and unstructured/partition/utils/ocr_models/ocr_interface.py:14-15.

Performance Benchmarking and Profiling

For users who need to measure partitioning performance, a self-contained toolkit ships under scripts/performance/. Source: scripts/performance/README.md

The toolkit separates two concerns:

  • Benchmarking — runs partitioning against a fixed corpus and records timings keyed by architecture, instance type, and git hash. Optional S3 publication is gated by PUBLISH_RESULTS=true. Iteration count is controlled by NUM_ITERATIONS. Docker execution is opt-in via DOCKER_TEST=true.
  • Profiling — uses py-spy to capture CPU/allocations across called functions during a partition run. Results are viewable with speedscope (local install via npm install -g speedscope) or by uploading the .speedscope file to https://www.speedscope.app/.
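The benchmarking half of this split boils down to timing the same call several times and recording the durations. The sketch below shows that loop in Python under stated assumptions: time_runs is a hypothetical helper, the workload is a placeholder for a partition call, and the fallback of 3 iterations is chosen here for the example and is not the script's default.

```python
import os
import time
from statistics import mean
from typing import Callable, List


def time_runs(func: Callable[[], object], iterations: int) -> List[float]:
    """Call func repeatedly and return the wall-clock duration of each run."""
    durations: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


# NUM_ITERATIONS is read from the environment; 3 is a placeholder fallback.
iterations = int(os.environ.get("NUM_ITERATIONS", "3"))

# Placeholder workload standing in for a partition call on a fixed document.
durations = time_runs(lambda: sum(range(10_000)), iterations)
print(f"{len(durations)} runs, mean {mean(durations):.6f}s")
```

Keeping the per-run durations rather than only the mean makes it possible to spot a slow first run caused by model loading or a cold cache.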

Where These Subsystems Fit in the Pipeline

flowchart LR
    A[Raw Documents] --> B[partition_* / partition]
    B --> C[Element objects]
    C --> D[unstructured-ingest]
    D --> E[Destination<br/>vector store / DB]
    E --> F[LLM / RAG]
    C -.opt-in.-> G[Telemetry sink]
    C -.bench.-> H[scripts/performance]

This diagram reflects the current responsibilities split: partitioning stays in unstructured, connectors and embedding-provider integrations live in unstructured-ingest, telemetry is an opt-in background signal, and benchmarking is a developer-side measurement tool rather than a runtime component.

Common Failure Modes

  • ImportError from unstructured.embed.* after a refactor. The directory is slated for removal; switch to the equivalent provider in unstructured-ingest. Source: unstructured/embed/README.md
  • OCR agent not picked up from environment. The module path must appear in OCR_AGENT_MODULES_WHITELIST; otherwise OCRAgent.get_instance raises ValueError. Source: unstructured/partition/utils/ocr_models/ocr_interface.py:42-46
  • Telemetry still firing after opt-out. Ensure DO_NOT_TRACK or SCARF_NO_ANALYTICS is set to a non-empty value before the Python process imports unstructured. Source: README.md
  • Benchmark numbers not comparable across runs. Set INSTANCE_TYPE and a known git hash, and pin NUM_ITERATIONS, so S3-published results can be aggregated meaningfully. Source: scripts/performance/README.md


Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.


Found 22 structured pitfall item(s), including 3 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

  • Severity: high
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/Unstructured-IO/unstructured/issues/3871

2. Installation risk: Installation risk requires verification

  • Severity: high
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/Unstructured-IO/unstructured/issues/4320

3. Security or permission risk: Security or permission risk requires verification

  • Severity: high
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: packet_text.keyword_scan | https://github.com/Unstructured-IO/unstructured

4. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: identity.distribution | https://github.com/Unstructured-IO/unstructured

5. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/Unstructured-IO/unstructured

6. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: Number getting converted into scientific notation in metadata.text_as_html
  • User impact: Developers may hit a documented source-backed failure mode: Number getting converted into scientific notation in metadata.text_as_html
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Number getting converted into scientific notation in metadata.text_as_html. Context: observed when using Python.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/Unstructured-IO/unstructured/issues/3871
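
One way to run the check for item 6 is to scan table HTML for cells that contain only a number in scientific notation. The sketch below is a minimal illustration using the Python standard library; the names `CellCollector` and `find_scientific_cells` are hypothetical helpers and not part of unstructured, and the regular expression is an assumption about what the converted values look like.

```python
import re
from html.parser import HTMLParser

# Matches a whole cell such as "1.23457e+11" or "3E5" (assumed pattern).
SCI_NOTATION = re.compile(r"^[+-]?\d+(?:\.\d+)?[eE][+-]?\d+$")


class CellCollector(HTMLParser):
    """Collect the text content of every <td> and <th> cell."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: list[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Append text only while inside a table cell.
        if self._in_cell:
            self.cells[-1] += data


def find_scientific_cells(table_html: str) -> list[str]:
    """Return cell values that consist solely of scientific notation."""
    parser = CellCollector()
    parser.feed(table_html)
    parser.close()
    stripped = (cell.strip() for cell in parser.cells)
    return [cell for cell in stripped if SCI_NOTATION.match(cell)]
```

To use it, pass the metadata.text_as_html string of each table element to `find_scientific_cells` and compare any hits against the source document. A hit is only a hint: a source table may legitimately contain values written in scientific notation.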

7. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured

8. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | https://github.com/Unstructured-IO/unstructured

9. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | https://github.com/Unstructured-IO/unstructured

10. Runtime risk: Runtime risk requires verification

  • Severity: low
  • Finding: Developers should check this performance risk before relying on the project: [Feature Request] Add document layout analysis confidence scores
  • User impact: Developers may hit a documented source-backed failure mode: [Feature Request] Add document layout analysis confidence scores
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Feature Request] Add document layout analysis confidence scores. Context: observed when using Python.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/Unstructured-IO/unstructured/issues/4320

11. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: issue_or_pr_quality=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured

12. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: release_recency=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/Unstructured-IO/unstructured

Source: Doramagic discovery, validation, and Project Pack records
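
Most items in this log share the same recommended check: reproduce the install and quickstart path in an isolated environment. A minimal sketch of that check follows, assuming the PyPI package name `unstructured` and a plain `import unstructured` as the smoke test. The helper names (`venv_python`, `install_commands`, `reproduce_install`) are hypothetical, and the real quickstart may need extras or system packages that this sketch does not install.

```python
import os
import subprocess
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    # Windows places executables in Scripts/, other platforms in bin/.
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_commands(env_dir: Path, package: str = "unstructured") -> list[list[str]]:
    """Build the commands for a clean install plus an import smoke test."""
    python = str(venv_python(env_dir))
    return [
        [python, "-m", "pip", "install", "--upgrade", "pip"],
        [python, "-m", "pip", "install", package],
        [python, "-c", "import unstructured"],
    ]


def reproduce_install(env_dir: Path) -> None:
    """Create a fresh environment and run each command, stopping on failure."""
    # clear=True wipes any existing directory so each run starts clean.
    venv.EnvBuilder(with_pip=True, clear=True).create(env_dir)
    for command in install_commands(env_dir):
        subprocess.run(command, check=True)
```

Calling `reproduce_install(Path("unstructured-check-env"))` needs network access and raises `subprocess.CalledProcessError` at the first failing step, which shows which stage of the install path broke. Building the commands in a separate function keeps them inspectable without running anything.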

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12. Count of project-level external discussion links exposed on this manual page.

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using unstructured with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence