# https://github.com/microsoft/presidio Project Manual

Generated at: 2026-06-20 18:52:11 UTC

## Table of Contents

- [Presidio Overview & System Architecture](#page-1)
- [Analyzer: PII Detection, NLP Engines & Recognizers](#page-2)
- [Anonymization, Image Redaction & DICOM Processing](#page-3)
- [Structured Data, CLI, Deployment & Extensibility](#page-4)

<a id='page-1'></a>

## Presidio Overview & System Architecture

### Related Pages

Related topics: [Analyzer: PII Detection, NLP Engines & Recognizers](#page-2), [Anonymization, Image Redaction & DICOM Processing](#page-3), [Structured Data, CLI, Deployment & Extensibility](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [presidio/README.md](https://github.com/microsoft/presidio/blob/main/presidio/README.md)
- [presidio-analyzer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/README.md)
- [presidio-anonymizer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md)
- [presidio-image-redactor/README.md](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/README.md)
- [presidio-structured/README.md](https://github.com/microsoft/presidio/blob/main/presidio-structured/README.md)
- [presidio-cli/README.md](https://github.com/microsoft/presidio/blob/main/presidio-cli/README.md)
</details>

# Presidio Overview & System Architecture

## 1. Purpose and Scope

Microsoft Presidio is a context-aware, pluggable, and customizable PII de-identification service for text and images. It is delivered as a monorepo containing several cooperating Python packages that together detect, redact, and anonymize personally identifiable information (PII) in unstructured text, structured/tabular data, and pixel data (images and DICOM medical files) ([presidio/README.md:1-5]()).

The top-level `presidio` package is a thin convenience meta-package that has no code of its own — it simply installs `presidio-analyzer` and `presidio-anonymizer` as dependencies ([presidio/README.md:7-11]()). All real functionality lives in the sub-packages.

Primary use cases observed across the repo include:

- Detecting PII in free-form text and substituting it with placeholders, hashes, or encrypted values.
- Redacting burnt-in text PHI in standard images and DICOM medical scans.
- Identifying and anonymizing PII in tabular data (pandas DataFrames) and JSON-like semi-structured records.
- Scanning source trees from the command line to flag PII inside code or documentation.

## 2. System Architecture

The repository is organized as a multi-package monorepo: each capability is its own installable Python package that exposes a Python API and, for some packages, a Dockerized REST service.

```mermaid
flowchart LR
    User[User / Application]
    subgraph Presidio["presidio monorepo"]
        Analyzer["presidio-analyzer<br/>Detect PII in text"]
        Anonymizer["presidio-anonymizer<br/>Replace / Encrypt / Hash"]
        ImgRed["presidio-image-redactor<br/>Images + DICOM"]
        Structured["presidio-structured<br/>Pandas / JSON tabular"]
        CLI["presidio-cli<br/>Source-tree scanner"]
        Meta["presidio<br/>meta-package (deps only)"]
    end
    User --> Analyzer
    Analyzer -->|RecognizerResult[]| Anonymizer
    User --> ImgRed
    User --> Structured
    User --> CLI
    Meta -.bundles.-> Analyzer
    Meta -.bundles.-> Anonymizer
```

Key architectural properties:

- **Separation of detection and transformation.** Detection is performed by `AnalyzerEngine` ([presidio-analyzer/README.md:51-58]()), transformation by `AnonymizerEngine` ([presidio-anonymizer/README.md:39-44]()). The boundary is the `RecognizerResult` list, which lets users swap one side without touching the other.
- **Recognizer extensibility.** Each predefined recognizer is responsible for one or more PII entity types using regex, NER, checksum validation, or external models ([presidio-analyzer/README.md:3-9]()).
- **Pluggable operators.** Anonymization is composed of small operators (`replace`, `redact`, `hash`, `mask`, `encrypt`, `custom`, `surrogate`, etc.) configured per entity ([presidio-anonymizer/README.md:11-37]()).
- **Multiple deployment shapes.** The same code path is exposed as a Python library, a `docker-compose` HTTP service, and (via the meta-package) a single `pip install presidio` ([presidio/README.md:17-23]()).

## 3. Core Packages

| Package | Responsibility | Entry point | Source |
|---|---|---|---|
| `presidio-analyzer` | Detect PII entities in unstructured text; supports spaCy NER, regex, LangExtract/LLM, HuggingFace NER, and country-specific recognizers | `AnalyzerEngine().analyze(text, language)` | [presidio-analyzer/README.md:55-60]() |
| `presidio-anonymizer` | Apply anonymization operators to detected spans; reversible via `Decrypt` deanonymizer | `AnonymizerEngine().anonymize(text, analyzer_results, operators=…)` | [presidio-anonymizer/README.md:41-48]() |
| `presidio-image-redactor` | OCR + PII detection on images; specialized DICOM pipeline using Tesseract and pydicom | `ImageRedactorEngine`, `DicomImageRedactorEngine` | [presidio-image-redactor/README.md:41-55]() |
| `presidio-structured` | Map DataFrame columns / JSON keys to entities, then anonymize values | `StructuredEngine` + `PandasAnalysisBuilder` | [presidio-structured/README.md:5-15]() |
| `presidio-cli` | Walk files in a directory and print PII hits with format options (`standard`, `github`, `colored`, `parsable`) | `presidio <path>` | [presidio-cli/README.md:79-94]() |

The analyzer ships with optional extras — e.g. `presidio-analyzer[langextract]` for LLM-based detection through Ollama or Azure OpenAI ([presidio-analyzer/README.md:15-23]()), and `presidio-anonymizer[ahds]` for the Azure Health Data Services surrogate operator ([presidio-anonymizer/README.md:31-37]()).

## 4. End-to-End Data Flow

For the canonical text path, the pipeline is:

1. **Analyze** — `AnalyzerEngine.analyze(text, entities=…, language="en")` runs every registered recognizer and returns a list of `RecognizerResult` (entity type, span, score) ([presidio-analyzer/README.md:51-60]()).
2. **Anonymize** — `AnonymizerEngine.anonymize(text, analyzer_results, operators=…)` resolves overlapping spans, applies the operator configured for each entity type (falling back to `DEFAULT`), and returns an `EngineResult` containing the anonymized text and the list of applied changes ([presidio-anonymizer/README.md:41-48]()).
3. **(Optional) Deanonymize** — Encrypted spans can be reversed with the `Decrypt` operator when the AES key is available ([presidio-anonymizer/README.md:53-56]()).
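
The first two steps above can be sketched with a toy model. This is a minimal, stdlib-only illustration of the detect-then-transform boundary, not Presidio's implementation: `Span`, `detect_emails`, and `replace_spans` are hypothetical names, and a single regex stands in for the full recognizer set.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Toy stand-in for a detection result: entity type, offsets, score."""
    entity_type: str
    start: int
    end: int
    score: float


# Deliberately simple pattern; real email detection is stricter.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect_emails(text: str) -> list:
    """Detection side: return spans only, never modify the text."""
    return [Span("EMAIL_ADDRESS", m.start(), m.end(), 1.0)
            for m in EMAIL_RE.finditer(text)]


def replace_spans(text: str, spans: list) -> str:
    """Transformation side: substitute each span with <ENTITY_TYPE>.

    Spans are applied right to left so earlier offsets stay valid.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


sample = "Contact jane.doe@example.com or bob@example.org today."
spans = detect_emails(sample)
redacted = replace_spans(sample, spans)
print(redacted)
```

Because the transformation step only sees the text and the span list, either side can be replaced without changing the other, which is the property the pipeline relies on.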

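For images, `ImageRedactorEngine` runs OCR (Tesseract), feeds the recognized text to `AnalyzerEngine`, and paints boxes over the matching regions of the original image. The DICOM variant applies the same steps to pixel data loaded through pydicom; it redacts burnt-in text only, not the DICOM metadata headers. The README also notes that long file paths on Windows may need to be enabled for DICOM batch jobs ([presidio-image-redactor/README.md:41-55, 87-95]()).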

For tabular data, `StructuredEngine` uses `PandasAnalysisBuilder().generate_analysis(df)` to derive a column→entity map, then delegates anonymization to `presidio-anonymizer` per cell ([presidio-structured/README.md:17-27]()).
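
The column-mapping idea can be illustrated without pandas. The sketch below is a simplified model under stated assumptions: rows are plain dictionaries, `detect_cell` is a hypothetical two-regex detector, and the majority vote per column is an assumed selection rule rather than the documented behavior of `PandasAnalysisBuilder`.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\+?\d[\d\s-]{6,}\d$")


def detect_cell(value: str):
    """Return a toy entity label for one cell, or None if nothing matches."""
    if EMAIL_RE.match(value):
        return "EMAIL_ADDRESS"
    if PHONE_RE.match(value):
        return "PHONE_NUMBER"
    return None


def build_column_map(rows):
    """Map each column to the entity type detected most often in its cells."""
    votes = {}
    for row in rows:
        for column, value in row.items():
            label = detect_cell(str(value))
            if label is not None:
                votes.setdefault(column, Counter())[label] += 1
    return {col: counter.most_common(1)[0][0] for col, counter in votes.items()}


def anonymize_rows(rows, column_map):
    """Replace every value in a mapped column with an <ENTITY_TYPE> placeholder."""
    return [
        {col: f"<{column_map[col]}>" if col in column_map else val
         for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"id": 1, "contact": "ann@example.com", "phone": "+1 555-010-9999"},
    {"id": 2, "contact": "raj@example.org", "phone": "555 010 1234"},
]
column_map = build_column_map(rows)
clean_rows = anonymize_rows(rows, column_map)
print(column_map)
```

Deciding once per column, rather than per cell, keeps the output schema stable: every value in a mapped column is transformed the same way even if an individual cell would not have matched on its own.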

## 5. Configuration, Extensibility, and Known Pitfalls

- **YAML recognizers.** Recognizers can be defined declaratively in YAML; recent releases moved predefined recognizers to a config-file model ([Release 2.2.355 changelog](https://github.com/microsoft/presidio/releases/tag/2.2.355)). A known sharp edge: omitted optional YAML fields can be serialized as explicit `None` during registry validation ([Issue #2080](https://github.com/microsoft/presidio/issues/2080)).
- **Country-specific recognizers.** Starting with 2.2.359 most country-specific recognizers default to *disabled* to reduce false positives outside their locale ([Release 2.2.359 notes](https://github.com/microsoft/presidio/releases/tag/2.2.359)). New PH-specific recognizers have been requested ([Issue #2015](https://github.com/microsoft/presidio/issues/2015)).
- **DICOM duplicate suppression.** `DicomImagePiiVerifyEngine._remove_duplicate_entities` had a bug where `sorted()` was called but its result discarded, so the *lowest*-scored entity won; this is now tracked ([Issue #2083](https://github.com/microsoft/presidio/issues/2083)).
- **DICOM evaluation notebook.** The bundled sample notebook has reported failures — users should pin a known-good commit or apply the patches linked from the issue thread ([Issue #1251](https://github.com/microsoft/presidio/issues/1251)).
- **GPU / deployment.** GPU acceleration is opt-in via `cupy-cuda12x`; containers moved to `gunicorn` in 2.2.356 ([Release 2.2.356 notes](https://github.com/microsoft/presidio/releases/tag/2.2.356)).

## 6. Getting Started

```python
# Minimal text pipeline
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

results = analyzer.analyze(text="My name is John Doe", language="en")
anonymized = anonymizer.anonymize(text="My name is John Doe", analyzer_results=results)
print(anonymized)
```
([presidio/README.md:25-33]())

```sh
pip install presidio                       # meta-package
pip install presidio-image-redactor        # OCR + DICOM
pip install presidio-structured            # pandas / JSON
pip install "presidio-analyzer[langextract]" # LLM-based detection (quoted so shells like zsh do not expand the brackets)
```
([presidio-analyzer/README.md:15-23](), [presidio-image-redactor/README.md:23-29](), [presidio-structured/README.md:11-13]())

## See Also

- Analyzer internals — recognizer registration, custom recognizers, NER/LLM backends.
- Anonymizer operators — overlap handling, encryption, AHDS surrogate.
- Image Redactor — DICOM pipeline, OCR thresholds, PII verification engine.
- Structured — DataFrame/JSON analysis builders and operator mappings.
- CLI — output formats and CI integration.

---

<a id='page-2'></a>

## Analyzer: PII Detection, NLP Engines & Recognizers

### Related Pages

Related topics: [Presidio Overview & System Architecture](#page-1), [Anonymization, Image Redaction & DICOM Processing](#page-3), [Structured Data, CLI, Deployment & Extensibility](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [presidio-analyzer/presidio_analyzer/analyzer_engine.py](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/analyzer_engine.py)
- [presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py)
- [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py)
- [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py)
- [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py)
- [presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py)
- [presidio-analyzer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/README.md)
</details>

# Analyzer: PII Detection, NLP Engines & Recognizers

The Presidio Analyzer is the detection half of the Presidio PII de-identification framework. It analyzes free-form text (and serves as a backend for the image, structured, and DICOM redactors) and returns a list of `RecognizerResult` objects describing where PII entities appear, what type they are, and a confidence score. This page documents how detection is wired together: the `AnalyzerEngine`, the pluggable NLP engine, and the recognizer registry.

## High-Level Architecture

The analyzer is composed of three cooperating subsystems: an NLP engine that provides tokenization, lemmatization, and (optionally) named entity recognition; a recognizer registry that holds a curated set of detectors per language; and the `AnalyzerEngine` itself, which orchestrates the call to each recognizer, merges the results, and returns them to the caller.

```mermaid
flowchart LR
    A[Caller] -->|analyze(text, entities, language)| B[AnalyzerEngine]
    B --> C[NLP Engine Provider]
    C -->|spaCy / stanza / transformers| D[NLP Engine]
    B --> E[Recognizer Registry]
    E -->|loads| F[Predefined Recognizers]
    E -->|loads| G[Custom Recognizers from YAML/Code]
    F --> H[EntityRecognizer subclasses]
    G --> H
    H -->|RecognizerResult list| B
    B -->|merged, scored results| A
```

The `AnalyzerEngine.__init__` wires the registry and the NLP engine together, and the `analyze` method is the single entry point for end users [Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]()].

## NLP Engine Provider

The NLP engine abstracts the linguistic backbone of the analyzer. It must provide tokenization, sentence splitting, lemmatization, and named entity recognition; the recognizers depend on these primitives to evaluate regexes against lemmas, to filter by POS tags, and to combine their own logic with model output.

The `NlpEngineProvider` reads an `nlp_engine_name` and an optional configuration dictionary, then instantiates one of three backends:

| Engine | Notes | Source |
|---|---|---|
| `spacy` (default) | Uses a spaCy model; recommended `en_core_web_lg` for English. Supports GPU acceleration via `cupy-cuda12x` on Linux. | [nlp_engine_provider.py]() |
| `stanza` | Stanford Stanza-based pipeline; used for languages not well served by spaCy. | [nlp_engine_provider.py]() |
| `transformers` | Wraps a Hugging Face NER model directly, useful when a custom transformer is preferred over spaCy/stanza. | [nlp_engine_provider.py]() |

GPU acceleration is controlled through an environment variable introduced in release 2.2.362. On macOS with Apple Silicon, MPS is not currently supported and PyTorch operations fall back to CPU [Source: [presidio-analyzer/README.md]()].

The provider raises a `ValueError` when an unknown engine name is supplied, which is the most common configuration failure when bootstrapping a new language [Source: [presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py]()].
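
The name-to-backend dispatch described above follows a common factory pattern. The sketch below is a generic illustration, not the code in `nlp_engine_provider.py`: `ToyEngine`, `ENGINE_FACTORIES`, and `create_engine` are hypothetical names, and the real backends are replaced by a placeholder class.

```python
from typing import Callable, Dict, Optional


class ToyEngine:
    """Placeholder for an NLP backend; it only records its name and config."""

    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = config


# Hypothetical table of backend constructors keyed by engine name.
ENGINE_FACTORIES: Dict[str, Callable[[dict], ToyEngine]] = {
    "spacy": lambda cfg: ToyEngine("spacy", cfg),
    "stanza": lambda cfg: ToyEngine("stanza", cfg),
    "transformers": lambda cfg: ToyEngine("transformers", cfg),
}


def create_engine(nlp_engine_name: str, config: Optional[dict] = None) -> ToyEngine:
    """Instantiate the backend registered under nlp_engine_name.

    Unknown names fail fast with a ValueError that lists the valid options.
    """
    try:
        factory = ENGINE_FACTORIES[nlp_engine_name]
    except KeyError:
        known = ", ".join(sorted(ENGINE_FACTORIES))
        raise ValueError(
            f"Unknown NLP engine '{nlp_engine_name}'. Expected one of: {known}"
        ) from None
    return factory(config or {})


engine = create_engine("spacy", {"lang_code": "en"})
print(engine.name)
```

Listing the accepted names in the error message makes a misspelled engine name straightforward to diagnose when bootstrapping a new language.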

## Recognizer Registry

The registry is the catalog of detectors the engine will invoke. `RecognizerRegistry.__init__` accepts an optional collection of recognizers, global regex flags, and a list of supported languages; the NLP engine is passed in when predefined recognizers are loaded rather than at construction time [Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]()].

The registry exposes two principal operations:

- `load_predefined_recognizers(...)` — populates the registry with Presidio's built-in recognizers (regex, deny-list, NER-based) filtered by language. Since release 2.2.359, most country-specific recognizers that expect English text are registered as disabled by default to avoid spurious false positives in non-target languages [Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]()].
- `add_recognizer(recognizer)` and `remove_recognizer(recognizer_name)` — add or remove a user-supplied recognizer; the object passed to `add_recognizer` must be an `EntityRecognizer` instance [Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]()].

Internally, recognizers are kept in a single list. The `get_recognizers` method returns the recognizers that match the requested language (and, optionally, the requested entities, plus any ad hoc recognizers supplied for the call), and `get_supported_entities` returns the deduplicated union of entity types across the matching recognizers [Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]()].
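
A toy registry makes the two query methods concrete. This is a simplified model under stated assumptions, not the `RecognizerRegistry` source: `ToyRecognizer` and `ToyRegistry` are hypothetical classes, and each recognizer is assumed to declare one language and a list of entity types.

```python
from dataclasses import dataclass


@dataclass
class ToyRecognizer:
    """Hypothetical recognizer record: a name, one language, some entities."""
    name: str
    supported_language: str
    supported_entities: list


class ToyRegistry:
    """Holds recognizers and answers language/entity queries."""

    def __init__(self, recognizers=None):
        self.recognizers = list(recognizers or [])

    def add_recognizer(self, recognizer):
        if not isinstance(recognizer, ToyRecognizer):
            raise ValueError("expected a ToyRecognizer instance")
        self.recognizers.append(recognizer)

    def get_recognizers(self, language, entities=None):
        """Return recognizers for a language, optionally filtered by entity."""
        matches = [r for r in self.recognizers
                   if r.supported_language == language]
        if entities is not None:
            wanted = set(entities)
            matches = [r for r in matches
                       if wanted & set(r.supported_entities)]
        return matches

    def get_supported_entities(self, language):
        """Deduplicated union of entity types, in first-seen order."""
        seen = []
        for recognizer in self.get_recognizers(language):
            for entity in recognizer.supported_entities:
                if entity not in seen:
                    seen.append(entity)
        return seen


registry = ToyRegistry([
    ToyRecognizer("CreditCard", "en", ["CREDIT_CARD"]),
    ToyRecognizer("Ner", "en", ["PERSON", "LOCATION"]),
    ToyRecognizer("EsNif", "es", ["ES_NIF"]),
])
registry.add_recognizer(ToyRecognizer("NerExtra", "en", ["PERSON"]))
print(registry.get_supported_entities("en"))
```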

### Loading Recognizers from Configuration

`RecognizerRegistryProvider` constructs one or more registries (one per language) from a configuration dictionary. It delegates to `RecognizersLoader` (in `recognizers_loader_utils.py`) to parse three kinds of entries: built-in recognizer names, Python import paths, and YAML-defined recognizer definitions. This is the recommended way to version-control recognizer configuration and to share setups across deployments [Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py]() and [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py]()].

> **Community note (issue #2080):** When recognizers with dedicated YAML config models are configured through analyzer YAML, omitted optional fields can be serialized as explicit `None` values during registry validation. The config dump path for these recognizers may therefore emit fields with `null` values that are indistinguishable from fields the user deliberately set to `null`. This is a known serialization quirk to be aware of when diffing configuration files.
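
The quirk described in the note is easy to reproduce with any config model that has optional fields. The sketch below is a generic illustration using a dataclass; `ToyRecognizerConfig` and `dump_without_none` are hypothetical and do not correspond to Presidio's actual config models.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyRecognizerConfig:
    """Hypothetical recognizer config with two optional fields."""
    name: str
    supported_language: str = "en"
    context: Optional[list] = None
    score: Optional[float] = None


def dump_without_none(config) -> dict:
    """Dump a config but drop fields whose value is None.

    Caveat: this also hides fields the user deliberately set to null,
    so it is only suitable for diffing, not for round-tripping.
    """
    return {key: value for key, value in asdict(config).items()
            if value is not None}


cfg = ToyRecognizerConfig(name="ZipCodeRecognizer")
raw_dump = asdict(cfg)                 # omitted fields appear as explicit None
compact_dump = dump_without_none(cfg)  # omitted fields are dropped
print(raw_dump)
print(compact_dump)
```

Comparing the compact dumps of two configurations avoids spurious differences caused by fields that were never set on either side.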

## AnalyzerEngine and the `analyze` Pipeline

`AnalyzerEngine` is the high-level façade. It lazily instantiates its default `NlpEngineProvider` and `RecognizerRegistry` if none is supplied, and forwards language-specific setup to a per-language registry cache, so calling `analyze` with several languages does not rebuild the recognizer catalog each time [Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]()].

The `analyze(text, entities, language, ...)` method executes the following steps:

1. Validate the requested `language` against the engine's configured `supported_languages` (which default to `["en"]`).
2. Resolve the registry for the requested language via the engine provider; this also resolves the correct NLP engine and returns the matching list of `EntityRecognizer` subclasses.
3. Iterate the recognizers, passing each one the text, the entity allow-list, and an analysis context. Each recognizer returns a list of `RecognizerResult` objects with `entity_type`, `start`, `end`, and `score`.
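4. Merge results across recognizers, dedupe overlaps, and drop results below the score threshold (`default_score_threshold`, which is `0` unless overridden in the constructor or per call through `score_threshold`).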
5. Return the final list of `RecognizerResult` objects to the caller [Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]() and [presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py]()].

The threshold parameter is the most common knob to tune in production. Lowering it surfaces more candidates (including borderline regex matches) at the cost of false positives; raising it tightens precision at the cost of recall. The `analyze` method also accepts a `return_decision_process` flag, which attaches an explanation of how each result and its score were produced and is useful when tuning the threshold [Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]()].
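
The precision/recall trade-off can be seen by filtering one candidate list at several thresholds. This is a minimal sketch, assuming results at or above the threshold are kept; `Result` and `apply_threshold` are hypothetical names and the scores are invented for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """Toy detection result with a confidence score."""
    entity_type: str
    start: int
    end: int
    score: float


def apply_threshold(results, score_threshold):
    """Keep results whose score is at or above the threshold (assumed rule)."""
    return [r for r in results if r.score >= score_threshold]


candidates = [
    Result("PERSON", 11, 19, 0.85),            # strong NER hit
    Result("PHONE_NUMBER", 30, 42, 0.4),       # regex match without context
    Result("US_DRIVER_LICENSE", 50, 58, 0.01), # very weak pattern match
]

for threshold in (0.0, 0.35, 0.5):
    kept = apply_threshold(candidates, threshold)
    print(threshold, [r.entity_type for r in kept])
```

Sweeping the threshold over a labeled sample like this is a cheap way to pick a value before changing the engine's configuration.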

## Recognizer Categories

Three recognizer families ship with Presidio and are typically mixed in a single registry:

- **Pattern-based recognizers** use regex and validation logic (e.g., Luhn checksum for credit cards, IBAN validation, URL parsing) and are deterministic and fast.
- **NLP-based recognizers** wrap the underlying NER model returned by the NLP engine and translate generic model labels (e.g., `PERSON`, `ORG`) into Presidio's entity taxonomy.
- **Context-aware and LLM-based recognizers** (e.g., the `BasicLangExtractRecognizer` and `AzureOpenAILangExtractRecognizer`) call out to a language model — local Ollama, Azure OpenAI, or Hugging Face NER — to detect entities that are difficult to capture with regexes. These require the optional `presidio-analyzer[langextract]` install, and they defer connectivity validation to the first call to `analyze()` [Source: [presidio-analyzer/README.md]() and [presidio-analyzer/presidio_analyzer/analyzer_engine.py]()].
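
As an illustration of the pattern-based family, the sketch below pairs a regex with a Luhn checksum so that only numerically valid card numbers are reported. It is a standalone example, not Presidio's credit card recognizer: `CARD_RE`, `luhn_valid`, and `find_cards` are hypothetical names and the regex is intentionally loose.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    checksum = 0
    # Double every second digit, counting from the rightmost digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return checksum % 10 == 0


def find_cards(text: str):
    """Return (start, end) offsets of regex matches that also pass Luhn."""
    return [(m.start(), m.end())
            for m in CARD_RE.finditer(text)
            if luhn_valid(m.group())]


sample = "Card 4111 1111 1111 1111 on file, ref 1234567890123456."
hits = find_cards(sample)
print([sample[start:end] for start, end in hits])
```

The regex alone would flag both numbers; the checksum removes the reference number, which is how validation logic keeps deterministic recognizers precise.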

Release 2.2.362 introduced a `HuggingFaceNerRecognizer` for direct NER model inference, providing a lower-friction alternative to spinning up an entire `transformers` NLP engine when only NER output is required.

## Common Failure Modes

- **No model installed for the requested language** — the most common initialization error; resolved by running `python -m spacy download <model>` for spaCy or installing the matching Stanza model.
- **Unsupported language** — `AnalyzerEngine.analyze` raises if `language` is not in the configured `supported_languages`. Pass `supported_languages=["en","es",...]` to the constructor to extend the set [Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]()].
- **Low recall on a target entity** — first check the default-score threshold, then verify the recognizer is not one of the country-specific recognizers disabled by default since 2.2.359; explicitly re-enable it via the registry or YAML configuration when targeting that locale [Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]()].
- **YAML recognizer loads with `None` fields** — known issue #2080; treat `null` YAML values as "unset" when diffing generated configuration dumps.

## See Also

- [Presidio Anonymizer](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md) — consumes `RecognizerResult` objects and applies reversible or irreversible operators.
- [Presidio Image Redactor](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/README.md) — calls the analyzer as an OCR post-processor for standard images and DICOM.
- [Presidio Structured](https://github.com/microsoft/presidio/blob/main/presidio-structured/README.md) — uses analyzer detection to map tabular columns to PII entities before anonymization.

---

<a id='page-3'></a>

## Anonymization, Image Redaction & DICOM Processing

### Related Pages

Related topics: [Presidio Overview & System Architecture](#page-1), [Analyzer: PII Detection, NLP Engines & Recognizers](#page-2), [Structured Data, CLI, Deployment & Extensibility](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [presidio-anonymizer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md)
- [presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py)
- [presidio-anonymizer/presidio_anonymizer/deanonymize_engine.py](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/presidio_anonymizer/deanonymize_engine.py)
- [presidio-anonymizer/presidio_anonymizer/batch_anonymizer_engine.py](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/presidio_anonymizer/batch_anonymizer_engine.py)
- [presidio-anonymizer/presidio_anonymizer/entities/conflict_resolution_strategy.py](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/presidio_anonymizer/entities/conflict_resolution_strategy.py)
- [presidio-anonymizer/presidio_anonymizer/operators/operators_factory.py](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/presidio_anonymizer/operators/operators_factory.py)
- [presidio-image-redactor/README.md](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/README.md)
- [presidio-image-redactor/presidio_image_redactor/image_redactor_engine.py](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/presidio_image_redactor/image_redactor_engine.py)
- [presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/presidio_image_redactor/dicom_image_redactor_engine.py)
- [presidio-image-redactor/presidio_image_redactor/dicom_image_pii_verify_engine.py](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/presidio_image_redactor/dicom_image_pii_verify_engine.py)
</details>

# Anonymization, Image Redaction & DICOM Processing

## Overview & Scope

Presidio ships three complementary capabilities that operate downstream of PII detection: text anonymization, optical image redaction, and DICOM (medical imaging) redaction. Together they cover the "de-identify" half of the Presidio pipeline — `presidio-analyzer` finds the PII, and these engines remove or mask it. The anonymizer works purely on text spans returned by the analyzer, while the image redactor extends the same recognizer pipeline to pixels and burnt-in text inside images and DICOM frames. Each package is independently installable and independently deployable as a REST service or Python library.

The text anonymizer also exposes a `DeanonymizeEngine` (the inverse operation), a batch engine for lists and dictionaries of values, and an overlap-conflict resolution strategy so that nested or intersecting PII spans produce stable, predictable output.

## Text Anonymization with `presidio-anonymizer`

The `AnonymizerEngine` consumes a list of `RecognizerResult` objects (typically emitted by `AnalyzerEngine`) and applies the operator configured for each entity type to produce the anonymized text. Operators are looked up via the factory in `operators/operators_factory.py`, with `DEFAULT` acting as a global fallback (`Source: [presidio-anonymizer/README.md]`).

### Built-in Operators

The anonymizer ships with a fixed set of built-in operators, each accepting specific parameters:

| Operator | Purpose | Key Parameters |
|---|---|---|
| `replace` | Substitute with a static value (defaults to `<entity_type>`) | `new_value` |
| `redact` | Remove the PII completely | — |
| `hash` | Cryptographic hash of the PII (sha256/sha512) | `hash_type` |
| `mask` | Replace with a repeated character | `chars_to_mask`, `masking_char`, `from_end` |
| `encrypt` | AES (Rijndael) encryption, reversible via deanonymizer | `key` (128/192/256-bit) |
| `custom` | Apply a user lambda to the PII string | `lambda` |
| `ahds surrogate` | Call Azure Health Data Services de-identification service | `endpoint`, `entities`, `input_locale`, `surrogate_locale` |

The `Encrypt` operator and the `Decrypt` deanonymizer together form a reversible pair; the key length must be 128, 192, or 256 bits and is provided as a string. The `AHDS surrogate` operator was added in release 2.2.360 and requires the `presidio-anonymizer[ahds]` extra (`Source: [presidio-anonymizer/README.md]`).
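
The per-entity lookup with a `DEFAULT` fallback can be sketched as a table of small functions. This is a simplified model, not the anonymizer's operator classes: the function names and the `(operator name, params)` config shape are assumptions for illustration, and only `replace`, `redact`, `hash`, and `mask` are modeled.

```python
import hashlib


def op_replace(value, entity_type, new_value=None):
    """Substitute a static value, defaulting to <ENTITY_TYPE>."""
    return new_value if new_value is not None else f"<{entity_type}>"


def op_redact(value, entity_type):
    """Remove the value entirely."""
    return ""


def op_hash(value, entity_type, hash_type="sha256"):
    """Hex digest of the value; only sha256 and sha512 are accepted here."""
    if hash_type not in ("sha256", "sha512"):
        raise ValueError(f"unsupported hash_type: {hash_type}")
    return hashlib.new(hash_type, value.encode("utf-8")).hexdigest()


def op_mask(value, entity_type, masking_char="*", chars_to_mask=4, from_end=False):
    """Overwrite up to chars_to_mask characters from the start or the end."""
    count = max(0, min(chars_to_mask, len(value)))
    if from_end:
        return value[:len(value) - count] + masking_char * count
    return masking_char * count + value[count:]


OPERATORS = {"replace": op_replace, "redact": op_redact,
             "hash": op_hash, "mask": op_mask}


def apply_operator(value, entity_type, config):
    """Pick the operator for entity_type, falling back to 'DEFAULT'."""
    name, params = config.get(entity_type,
                              config.get("DEFAULT", ("replace", {})))
    return OPERATORS[name](value, entity_type, **params)


config = {
    "DEFAULT": ("replace", {}),
    "PHONE_NUMBER": ("mask", {"chars_to_mask": 4, "from_end": True}),
    "EMAIL_ADDRESS": ("hash", {"hash_type": "sha256"}),
}
print(apply_operator("555-1234", "PHONE_NUMBER", config))
print(apply_operator("John Doe", "PERSON", config))
```

Entity types without an explicit entry, such as `PERSON` here, fall through to the `DEFAULT` operator, so a configuration only needs to list the exceptions.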

### Overlap Conflict Resolution

When two recognizers return overlapping spans, the `AnonymizerEngine` relies on the `ConflictResolutionStrategy` enum in `entities/conflict_resolution_strategy.py`. The behavior is documented as: full-span overlaps resolve by score (higher wins, ties are arbitrary); one span contained inside another resolves to the *longer* span even when its score is lower; and partial intersections are anonymized independently and concatenated in the output. This produces the well-known pattern where `George Washington` and `Washington State Park` collapse to `<PERSON><LOCATION>` (`Source: [presidio-anonymizer/README.md]`).
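
The three documented rules can be modeled in a few lines. The sketch below is an approximation under stated assumptions, not the anonymizer's code: `Span`, `resolve_conflicts`, and `render` are hypothetical, and trimming the earlier span at the start of the later one is an assumed way to obtain the concatenated output for partial intersections.

```python
from typing import NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies fully inside outer (identical offsets included)."""
    return outer.start <= inner.start and inner.end <= outer.end


def resolve_conflicts(spans):
    kept = []
    # Visit longer spans first, then higher scores: identical spans resolve
    # by score, and contained spans lose to the longer one regardless of score.
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), -s.score)):
        if not any(contains(k, span) for k in kept):
            kept.append(span)
    kept.sort(key=lambda s: s.start)
    # Partial intersections: trim the earlier span where the next one begins.
    resolved = []
    for index, span in enumerate(kept):
        if index + 1 < len(kept) and kept[index + 1].start < span.end:
            span = span._replace(end=kept[index + 1].start)
        resolved.append(span)
    return resolved


def render(text, spans):
    """Replace each span with <ENTITY_TYPE>, right to left."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "I went to George Washington State Park"
p = text.index("George")
l = text.index("Washington")
person = Span("PERSON", p, p + len("George Washington"), 0.85)
location = Span("LOCATION", l, l + len("Washington State Park"), 0.85)
print(render(text, resolve_conflicts([person, location])))
```

In this example neither span contains the other, so both survive and the output shows the adjacent `<PERSON><LOCATION>` placeholders described above.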

### Deanonymization & Batch Processing

`DeanonymizeEngine` is the symmetric counterpart and currently exposes a single `Decrypt` operator that uses AES to reverse the `Encrypt` operation (`Source: [presidio-anonymizer/presidio_anonymizer/deanonymize_engine.py]`). For dataset-scale workflows, `BatchAnonymizerEngine` applies the same per-entity operator pipeline to collections of values through `anonymize_list` and `anonymize_dict`, pairing each value with its corresponding list of `RecognizerResult` objects (`Source: [presidio-anonymizer/presidio_anonymizer/batch_anonymizer_engine.py]`).

## Image Redaction with `presidio-image-redactor`

The image redactor is a separate package that detects PII in raster images (PNG/JPG/etc.) and in DICOM medical frames, then draws filled rectangles over the detected regions. It depends on Tesseract OCR for text extraction and on `presidio-analyzer` for entity classification, so the same recognizer set used for plain text applies to pixels.

### Standard Image Redaction

`ImageRedactorEngine` is the entry point for ordinary raster images. Its primary method accepts a `PIL.Image` and an optional color fill (int or `(R,G,B)` tuple, default black) and returns a redacted image with rectangles drawn over the bounding boxes returned by the OCR + analyzer pipeline. A small HTTP service is exposed via `POST /redact`, accepting a multipart form with the image and a `data` field of the shape `{'color_fill':'0,0,0'}` (`Source: [presidio-image-redactor/README.md]`).
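
The box-filling step can be shown on a tiny in-memory image. This is a toy sketch, assuming pixels are stored as rows of RGB tuples and boxes as `(left, top, width, height)`; `parse_color_fill` and `redact_boxes` are hypothetical helpers and do not reflect how the package handles `PIL.Image` objects.

```python
def parse_color_fill(value: str):
    """Parse an 'R,G,B' string such as '0,0,0' into a tuple of ints."""
    parts = tuple(int(part) for part in value.split(","))
    if len(parts) != 3 or any(not 0 <= part <= 255 for part in parts):
        raise ValueError(f"expected 'R,G,B' with values 0-255, got {value!r}")
    return parts


def redact_boxes(pixels, boxes, fill):
    """Return a copy of `pixels` with every box painted in `fill`.

    Boxes that extend past the image edge are clipped.
    """
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    out = [list(row) for row in pixels]
    for left, top, box_width, box_height in boxes:
        for y in range(max(0, top), min(height, top + box_height)):
            for x in range(max(0, left), min(width, left + box_width)):
                out[y][x] = fill
    return out


WHITE = (255, 255, 255)
image = [[WHITE] * 4 for _ in range(3)]  # 4 pixels wide, 3 pixels tall
fill = parse_color_fill("0,0,0")
redacted = redact_boxes(image, [(1, 0, 2, 2)], fill)
for row in redacted:
    print(["#" if pixel == fill else "." for pixel in row])
```

Returning a copy rather than painting in place keeps the original pixels available, which is convenient when the caller wants to compare before and after.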

### DICOM Image Redaction

DICOM redaction is handled by `DicomImageRedactorEngine`, which works on `pydicom` datasets rather than `PIL.Image` objects. Four entry points are documented in the README: `redact(dicom_image, fill=...)` for in-memory redaction, `redact_and_return_bbox(...)` when callers also need the bounding boxes, `redact_from_file(input_path, output_dir, ...)` for single-file workflows that persist JSON bbox dumps, and `redact_from_directory(...)` for recursive batch jobs. Padding and `ocr_kwargs` (e.g. `ocr_threshold`) are configurable per call (`Source: [presidio-image-redactor/README.md]`).

> **Scope note:** the redactor scrubs burnt-in text in *pixel data only*; it does not touch the structured DICOM metadata headers. The README explicitly recommends pairing it with the [Tools for Health Data Anonymization](https://github.com/microsoft/Tools-for-Health-Data-Anonymization) package for metadata scrubbing (`Source: [presidio-image-redactor/README.md]`).

## Architecture & Data Flow

The end-to-end flow for both text and image modalities is uniform: detect → resolve conflicts → redact. The diagram below summarizes the image/DICOM path; text anonymization follows the same shape with the OCR step omitted.

```mermaid
flowchart LR
    A[Input: text / image / DICOM] --> B{Modality?}
    B -- Text --> C[AnalyzerEngine]
    B -- Image --> D[Tesseract OCR]
    B -- DICOM --> D
    D --> C
    C --> E[RecognizerResults]
    E --> F[ConflictResolutionStrategy]
    F --> G{Output sink}
    G -- Text --> H[AnonymizerEngine<br/>or BatchAnonymizerEngine]
    G -- Image --> I[ImageRedactorEngine]
    G -- DICOM --> J[DicomImageRedactorEngine]
    H --> K[Redacted text]
    I --> L[Redacted PNG]
    J --> M[Redacted DICOM + bbox JSON]
```

## Known Issues & Community Notes

Several recurring community issues are worth flagging before adopting the image/DICOM pipeline:

- **Sorted-result bug in PII verification** — `DicomImagePiiVerifyEngine._remove_duplicate_entities` calls `sorted()` on overlapping entity candidates but discards the sorted list, so the *lowest*-scored entity is kept instead of the highest. Track [issue #2083](https://github.com/microsoft/presidio/issues/2083) for the fix.
- **DICOM metadata edge cases** — Release 2.2.354 shipped a fix for "wrong condition for dicom metadata" ([#1347](https://github.com/microsoft/presidio/issues/1347)), so older versions may misclassify or skip header fields.
- **Evaluation notebook drift** — The `example_dicom_redactor_evaluation.ipynb` sample has historically failed with an error ([issue #1251](https://github.com/microsoft/presidio/issues/1251)); users running evaluation pipelines should pin a known-good commit or verify against the current notebook.
- **Country-specific recognizers** — As of release 2.2.359, country-specific predefined recognizers are disabled by default to reduce false positives. Adopters needing, for example, Philippines-specific PII (see [issue #2015](https://github.com/microsoft/presidio/issues/2015)) must opt them in explicitly.
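
The first item in the list describes a common Python pitfall: `sorted()` returns a new list and leaves its argument untouched. The sketch below is a generic illustration of that pattern, not the code from `DicomImagePiiVerifyEngine`; the function names and the candidate dictionaries are hypothetical.

```python
def keep_best_buggy(candidates):
    """Intends to keep the highest score, but discards the sorted list."""
    sorted(candidates, key=lambda c: c["score"], reverse=True)  # result unused
    return candidates[0]  # still the first item in the original order


def keep_best_fixed(candidates):
    """Bind the sorted result to a name before selecting from it."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return ranked[0]


overlapping = [
    {"entity_type": "DATE_TIME", "score": 0.30},
    {"entity_type": "PERSON", "score": 0.85},
]
print(keep_best_buggy(overlapping)["entity_type"])
print(keep_best_fixed(overlapping)["entity_type"])
```

With the buggy version the winner depends entirely on input order, so the lowest-scored candidate is kept whenever it happens to come first.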

For long file paths on Windows during DICOM batch jobs, the README also recommends [enabling Win32 long paths](https://learn.microsoft.com/en-us/answers/questions/293227/longpathsenabled.html) — a common practical failure point (`Source: [presidio-image-redactor/README.md]`).

## See Also

- [presidio-analyzer](../analyzer/index.md) — PII detection upstream of every operation described here
- [presidio-structured](../structured/index.md) — tabular/JSON de-identification that reuses the same recognizers
- [AHDS De-identification Service integration](../analyzer/ahds.md) — remote recognizer + `ahds surrogate` operator introduced in 2.2.360

---

<a id='page-4'></a>

## Structured Data, CLI, Deployment & Extensibility

### Related Pages

Related topics: [Presidio Overview & System Architecture](#page-1), [Analyzer: PII Detection, NLP Engines & Recognizers](#page-2), [Anonymization, Image Redaction & DICOM Processing](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [presidio-structured/README.md](https://github.com/microsoft/presidio/blob/main/presidio-structured/README.md)
- [presidio/README.md](https://github.com/microsoft/presidio/blob/main/presidio/README.md)
- [presidio-cli/README.md](https://github.com/microsoft/presidio/blob/main/presidio-cli/README.md)
- [presidio-analyzer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/README.md)
- [presidio-image-redactor/README.md](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/README.md)
- [presidio-anonymizer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md)
</details>

# Structured Data, CLI, Deployment & Extensibility

This page describes four cross-cutting capabilities of the Presidio ecosystem: the **Structured** module for tabular/semi-structured data, the **CLI** tool for batch text analysis, the supported **Deployment** topologies (Azure, Docker, HTTP), and the **Extensibility** surface (recognizers, operators, language models). It is intended for engineers integrating Presidio into pipelines and for operators who need to choose a deployment topology.

## 1. Structured Data with `presidio-structured`

`presidio-structured` extends Presidio beyond free text. It uses `presidio-analyzer` to detect PII at the column/key level, then uses `presidio-anonymizer` operators to redact values. The package exposes a `StructuredEngine` and a `PandasAnalysisBuilder`, both installed via `pip install presidio-structured` ([presidio-structured/README.md](https://github.com/microsoft/presidio/blob/main/presidio-structured/README.md)).

The typical pattern is to (1) build a `tabular_analysis` that maps columns/keys to detected entity types, (2) define an `OperatorConfig` map per entity, and (3) invoke the engine over a `pandas.DataFrame` or a JSON document. The `Faker` library can be wired in via `OperatorConfig("replace", {"new_value": fake.name()})` to produce realistic surrogates ([presidio-structured/README.md](https://github.com/microsoft/presidio/blob/main/presidio-structured/README.md)).
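
The three-step pattern can be pictured with a library-free sketch. It is a conceptual model only: `column_entities`, `operators`, and `anonymize_rows` are hypothetical names standing in for the analysis result, the `OperatorConfig` map, and the engine call, and it uses none of the `presidio-structured` API.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]

# Step 1 (stand-in for the analysis result): column -> detected entity type.
column_entities: Dict[str, str] = {"name": "PERSON", "email": "EMAIL_ADDRESS"}

# Step 2 (stand-in for the operator map): entity type -> transformation.
operators: Dict[str, Callable[[str], str]] = {
    "PERSON": lambda value: "<PERSON>",
    "EMAIL_ADDRESS": lambda value: "*" * len(value),
}


def anonymize_rows(
    rows: List[Row],
    column_entities: Dict[str, str],
    operators: Dict[str, Callable[[str], str]],
) -> List[Row]:
    """Step 3: apply each column's operator; other columns pass through."""
    result = []
    for row in rows:
        clean = dict(row)  # copy so the input rows are not mutated
        for column, entity in column_entities.items():
            operator = operators.get(entity)
            if column in clean and operator is not None:
                clean[column] = operator(clean[column])
        result.append(clean)
    return result


rows = [{"name": "John Doe", "email": "john@example.com", "city": "Oslo"}]
anonymized = anonymize_rows(rows, column_entities, operators)
print(anonymized)
```

Modelling operators as callables also highlights a detail of the `Faker` pattern: an expression such as `fake.name()` inside a configuration is evaluated once when the configuration is built, so every value would receive the same surrogate unless the generator is invoked per value.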

A notable capability added in release 2.2.354 is user-defined entity selection strategies, which let callers plug in their own logic for choosing which entities are anonymized per column ([Release 2.2.354](https://github.com/microsoft/presidio/releases/tag/2.2.354)). Community requests such as Philippines-specific predefined recognizers ([Issue #2015](https://github.com/microsoft/presidio/issues/2015)) indicate continued demand for country-specific recognizers that can be used from `presidio-structured`.

## 2. CLI: `presidio-cli`

`presidio-cli` is a thin wrapper around `presidio-analyzer` that scans files and directories for PII. It is installed with `pip install presidio-cli` and requires Python 3.10–3.13; building from source additionally requires `poetry` ([presidio-cli/README.md](https://github.com/microsoft/presidio/blob/main/presidio-cli/README.md)).

Configuration can be supplied via a YAML file (`-c`) or inline (`-d`). The YAML keys map directly to `presidio-analyzer` parameters:

| YAML key | Purpose | Example value |
|---|---|---|
| `language` | NLP language for analysis | `en` |
| `ignore` | Glob patterns to skip | `.git`, `*.cfg` |
| `entities` | Restrict detection to a subset | `[PERSON, EMAIL_ADDRESS]` |
| `allow` | Allow-list tokens that should not be flagged | list of strings |

The CLI supports four output formats selected with `-f` / `--format`: `standard`, `github` (CI annotations), `colored`, and `parsable` (one JSON object per finding). A fifth value, `auto`, picks `github` inside GitHub Actions and `colored` otherwise ([presidio-cli/README.md](https://github.com/microsoft/presidio/blob/main/presidio-cli/README.md)). Running `presidio .` scans the current directory, using a `.presidiocli` config if present; the `--help` flag lists every option.
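
The `auto` behaviour described here can be sketched as a small resolver. This is an illustration, not the CLI's actual code: `resolve_format` is a hypothetical helper, and detecting GitHub Actions through the `GITHUB_ACTIONS` environment variable (which GitHub sets to `true` on its runners) is an assumption about how the check might be made.

```python
import os
from typing import Mapping, Optional


def resolve_format(requested: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the concrete output format for a requested format name."""
    env = os.environ if env is None else env
    if requested != "auto":
        # An explicit choice always wins over auto-detection.
        return requested
    # Assumption: GitHub Actions runners export GITHUB_ACTIONS=true.
    return "github" if env.get("GITHUB_ACTIONS") == "true" else "colored"


print(resolve_format("auto", {"GITHUB_ACTIONS": "true"}))  # github
print(resolve_format("auto", {}))                          # colored
print(resolve_format("parsable", {"GITHUB_ACTIONS": "true"}))  # parsable
```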

## 3. Deployment Topologies

Presidio is shipped as multiple independently installable Python packages and as containerized services. The top-level `presidio` PyPI package is a meta-package that pulls in `presidio-analyzer` and `presidio-anonymizer` only — it contains no code of its own ([presidio/README.md](https://github.com/microsoft/presidio/blob/main/presidio/README.md)).

Three deployment paths are documented:

- **Azure one-click deploy.** Each sub-package ships a `deploytoazure.json` ARM template surfaced via a "Deploy to Azure" button in the README. Templates exist for `presidio-analyzer`, `presidio-anonymizer`, and `presidio-image-redactor` ([presidio-analyzer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/README.md), [presidio-anonymizer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md), [presidio-image-redactor/README.md](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/README.md)).
- **Docker Compose.** The image-redactor and anonymizer packages each provide a `docker-compose.yml` runnable with `docker-compose up -d` from the package directory ([presidio-image-redactor/README.md](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/README.md), [presidio-anonymizer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md)). Release 2.2.356 moved containers from the dev server to gunicorn for production parity ([Release 2.2.356](https://github.com/microsoft/presidio/releases/tag/2.2.356)).
- **HTTP API.** Once running, the image redactor exposes `POST /redact` accepting multipart form data with an image and a `color_fill` payload; the anonymizer exposes its endpoint as documented in the public API spec ([presidio-image-redactor/README.md](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/README.md), [presidio-anonymizer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md)).
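
As a rough illustration of the HTTP path, the sketch below builds (but does not send) a multipart `POST /redact` request using only the standard library. Several details are assumptions to verify against the image-redactor README: the base URL and port, the form field names `image` and `data`, and the JSON encoding of the `color_fill` payload. `build_redact_request` is a hypothetical helper.

```python
import json
import urllib.request
import uuid


def build_redact_request(
    image_bytes: bytes,
    color_fill: str = "0,0,0",
    base_url: str = "http://localhost:3000",  # assumed host and port
) -> urllib.request.Request:
    """Assemble a multipart/form-data request for the /redact endpoint."""
    boundary = uuid.uuid4().hex
    payload = json.dumps({"color_fill": color_fill})
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="data"\r\n\r\n',  # assumed field name
        payload.encode() + b"\r\n",
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="image"; filename="input.png"\r\n',
        b"Content-Type: image/png\r\n\r\n",
        image_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(
        f"{base_url}/redact", data=body, headers=headers, method="POST"
    )


request = build_redact_request(b"\x89PNG-placeholder-bytes")
print(request.get_method(), request.full_url)

# Sending requires a running service, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     redacted_png = response.read()
```

Keeping request construction separate from sending makes the payload easy to inspect in tests before pointing it at a real deployment.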

GPU acceleration is supported for analyzer workloads by installing the matching CUDA build of `cupy` (e.g. `cupy-cuda12x`) on Linux/NVIDIA; macOS/Apple Silicon currently falls back to CPU for PyTorch operations ([presidio-analyzer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/README.md)).

## 4. Extensibility

Presidio is designed to be extended along three axes: custom recognizers, custom anonymizer operators, and language-model backends.

- **Recognizers.** Predefined recognizers rely on regex, NER, and checksum logic, and can be replaced or augmented with custom classes that subclass the analyzer's `EntityRecognizer` contract. Release 2.2.355 moved predefined recognizers onto a config-file foundation, making it easier to override defaults without editing code ([Release 2.2.355](https://github.com/microsoft/presidio/releases/tag/2.2.355)).
- **Anonymizer operators.** Built-in operators include `replace`, `redact`, `hash` (sha256/sha512; md5 was deprecated in 2.2.358), `mask`, `encrypt` (AES), `custom` (lambda), and the **AHDS Surrogate** operator that calls the Azure Health Data Services de-identification service for medically-appropriate surrogates ([presidio-anonymizer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-anonymizer/README.md), [Release 2.2.358](https://github.com/microsoft/presidio/releases/tag/2.2.358)).
- **Language-model backends.** Release 2.2.362 introduced `HuggingFaceNerRecognizer` for direct NER model inference, and `presidio-analyzer` now ships LangExtract-backed recognizers (`BasicLangExtractRecognizer`, `AzureOpenAILangExtractRecognizer`) supporting Ollama and Azure OpenAI providers via the `presidio-analyzer[langextract]` extra ([Release 2.2.362](https://github.com/microsoft/presidio/releases/tag/2.2.362), [presidio-analyzer/README.md](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/README.md)).
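
To make the operator list more concrete, here is a library-free sketch of what `hash` and `mask` style transformations do conceptually. It is not Presidio's implementation: `hash_value` and `mask_value` are hypothetical helpers, the digest is unsalted, and the parameter names (`masking_char`, `chars_to_mask`, `from_end`) are assumed for illustration, so outputs may differ from the built-in operators.

```python
import hashlib


def hash_value(text: str, hash_type: str = "sha256") -> str:
    """Return a hex digest; only sha256 and sha512 are accepted here."""
    if hash_type not in ("sha256", "sha512"):
        raise ValueError(f"unsupported hash_type: {hash_type}")
    return hashlib.new(hash_type, text.encode("utf-8")).hexdigest()


def mask_value(
    text: str,
    masking_char: str = "*",
    chars_to_mask: int = 4,
    from_end: bool = False,
) -> str:
    """Replace `chars_to_mask` characters at the start (or end) of `text`."""
    count = max(0, min(chars_to_mask, len(text)))
    if from_end:
        return text[: len(text) - count] + masking_char * count
    return masking_char * count + text[count:]


print(hash_value("john@example.com"))
print(mask_value("4111111111111111", chars_to_mask=12))  # ************1111
print(mask_value("secret", chars_to_mask=2, from_end=True))  # secr**
```

Hashing keeps equal inputs linkable across records, while masking preserves part of the original value for readability; which trade-off is acceptable depends on the downstream use.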

### Known Extensibility Caveats

Community-reported issues show that the YAML-driven configuration path can serialize omitted optional recognizer fields as explicit `None`, which downstream registry validation may reject (issue #2080) ([Issue #2080](https://github.com/microsoft/presidio/issues/2080)). Similarly, the DICOM verification engine has had a sorting bug where `_remove_duplicate_entities` discarded its `sorted()` result and kept the lowest-scored entity (issue #2083) — a reminder that custom extensions should re-validate precedence logic when overriding built-ins ([Issue #2083](https://github.com/microsoft/presidio/issues/2083)). The DICOM redactor evaluation notebook has also been reported broken in the past (issue #1251); users integrating custom evaluation harnesses should pin notebook revisions accordingly ([Issue #1251](https://github.com/microsoft/presidio/issues/1251)).
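
One defensive workaround for the explicit-`None` problem is to strip `None`-valued keys from a configuration mapping before handing it to validation. The sketch below shows the idea on a plain dictionary. The `drop_none` helper and the example recognizer fields are hypothetical, and whether this is appropriate depends on how the installed version treats missing versus null fields.

```python
from typing import Any


def drop_none(value: Any) -> Any:
    """Recursively remove dictionary keys whose value is None."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(item) for item in value]
    return value


# Hypothetical recognizer configuration with omitted optionals serialized as None.
recognizer_config = {
    "name": "ExampleRecognizer",
    "supported_language": "en",
    "context": None,
    "patterns": [
        {"name": "example", "regex": r"\d{4}", "score": 0.5, "flags": None},
    ],
}

cleaned = drop_none(recognizer_config)
print(cleaned)
```

The function returns a new structure and leaves the input unchanged, so the original configuration remains available for debugging.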

## See Also

- Analyzer, Anonymizer, and Image Redactor package READMEs (linked above) for component-level details.
- [Release notes](https://github.com/microsoft/presidio/releases) for version-specific behavior changes (e.g. country recognizers defaulted to disabled in 2.2.359).
- Public API spec at `https://microsoft.github.io/presidio/api-docs/api-docs.html`.

---


## Pitfall Log

Project: microsoft/presidio

Summary: Found 10 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.

## 1. Runtime risk - Runtime risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1251

## 2. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2080

## 3. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/microsoft/presidio

## 4. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2083

## 5. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

## 6. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/microsoft/presidio

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/microsoft/presidio

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1882

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

<!-- canonical_name: microsoft/presidio; human_manual_source: deepwiki_human_wiki -->