presidio Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

presidio

An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.

Presidio Overview & System Architecture

Related topics: Analyzer: PII Detection, NLP Engines & Recognizers, Anonymization, Image Redaction & DICOM Processing, Structured Data, CLI, Deployment & Extensibility

1. Purpose and Scope

Microsoft Presidio is a context-aware, pluggable, and customizable PII de-identification service for text and images. It is delivered as a monorepo containing several cooperating Python packages that together detect, redact, and anonymize personally identifiable information (PII) in unstructured text, structured/tabular data, and pixel data (images and DICOM medical files) (presidio/README.md:1-5).

The top-level presidio package is a thin convenience meta-package that has no code of its own — it simply installs presidio-analyzer and presidio-anonymizer as dependencies (presidio/README.md:7-11). All real functionality lives in the sub-packages.

Primary use cases observed across the repo include:

Detecting PII in free-form text and substituting it with placeholders, hashes, or encrypted values.
Redacting burnt-in text PHI in standard images and DICOM medical scans.
Identifying and anonymizing PII in tabular data (pandas DataFrames) and JSON-like semi-structured records.
Scanning source trees from the command line to flag PII inside code or documentation.

2. System Architecture

The repository is organized as a monorepo: each capability is its own installable Python package that exposes a Python API and, for some packages, a Dockerized REST service.

flowchart LR
    User[User / Application]
    subgraph Presidio["presidio monorepo"]
        Analyzer["presidio-analyzer<br/>Detect PII in text"]
        Anonymizer["presidio-anonymizer<br/>Replace / Encrypt / Hash"]
        ImgRed["presidio-image-redactor<br/>Images + DICOM"]
        Structured["presidio-structured<br/>Pandas / JSON tabular"]
        CLI["presidio-cli<br/>Source-tree scanner"]
        Meta["presidio<br/>meta-package (deps only)"]
    end
    User --> Analyzer
    Analyzer -->|RecognizerResult[]| Anonymizer
    User --> ImgRed
    User --> Structured
    User --> CLI
    Meta -.bundles.-> Analyzer
    Meta -.bundles.-> Anonymizer

Key architectural properties:

Separation of detection and transformation. Detection is performed by AnalyzerEngine (presidio-analyzer/README.md:51-58), transformation by AnonymizerEngine (presidio-anonymizer/README.md:39-44). The boundary is the RecognizerResult list, which lets users swap one side without touching the other.
Recognizer extensibility. Each predefined recognizer is responsible for one or more PII entity types using regex, NER, checksum validation, or external models (presidio-analyzer/README.md:3-9).
Pluggable operators. Anonymization is composed of small operators (replace, redact, hash, mask, encrypt, custom, surrogate, etc.) configured per entity (presidio-anonymizer/README.md:11-37).
Multiple deployment shapes. The same code path is exposed as a Python library, a docker-compose HTTP service, and (via the meta-package) a single pip install presidio (presidio/README.md:17-23).

3. Core Packages

| Package | Responsibility | Entry point | Source |
| --- | --- | --- | --- |
| `presidio-analyzer` | Detect PII entities in unstructured text; supports spaCy NER, regex, LangExtract/LLM, HuggingFace NER, and country-specific recognizers | `AnalyzerEngine().analyze(text, language)` | presidio-analyzer/README.md:55-60 |
| `presidio-anonymizer` | Apply anonymization operators to detected spans; reversible via `Decrypt` deanonymizer | `AnonymizerEngine().anonymize(text, analyzer_results, operators=…)` | presidio-anonymizer/README.md:41-48 |
| `presidio-image-redactor` | OCR + PII detection on images; specialized DICOM pipeline using Tesseract and pydicom | `ImageRedactorEngine`, `DicomImageRedactorEngine` | presidio-image-redactor/README.md:41-55 |
| `presidio-structured` | Map DataFrame columns / JSON keys to entities, then anonymize values | `StructuredEngine` + `PandasAnalysisBuilder` | presidio-structured/README.md:5-15 |
| `presidio-cli` | Walk files in a directory and print PII hits with format options (`standard`, `github`, `colored`, `parsable`) | `presidio <path>` | presidio-cli/README.md:79-94 |

The analyzer ships with optional extras — e.g. presidio-analyzer[langextract] for LLM-based detection through Ollama or Azure OpenAI (presidio-analyzer/README.md:15-23), and presidio-anonymizer[ahds] for the Azure Health Data Services surrogate operator (presidio-anonymizer/README.md:31-37).

4. End-to-End Data Flow

For the canonical text path, the pipeline is:

Analyze: AnalyzerEngine.analyze(text, entities=…, language="en") runs every registered recognizer and returns a list of RecognizerResult (entity type, span, score) (presidio-analyzer/README.md:51-60).
Anonymize: AnonymizerEngine.anonymize(text, analyzer_results, operators=…) resolves overlapping spans, applies the operator configured for each entity type (falling back to DEFAULT), and returns an EngineResult holding the anonymized text (presidio-anonymizer/README.md:41-48).
(Optional) Deanonymize: Encrypted spans can be reversed with the Decrypt operator when the AES key is available (presidio-anonymizer/README.md:53-56).
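The split between the two steps can be illustrated without Presidio itself. The sketch below is a standard-library stand-in, not the library's implementation: `Span`, `EMAIL_RE`, and both functions are hypothetical names, and a single regex plays the role of the recognizers. It shows detection returning spans only, and transformation consuming those spans.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a RecognizerResult: entity type, span, score."""
    entity_type: str
    start: int
    end: int
    score: float


# Deliberately simple pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def analyze(text: str) -> list[Span]:
    """Detection step: report where PII is, never modify the text."""
    return [
        Span("EMAIL_ADDRESS", m.start(), m.end(), 0.9)
        for m in EMAIL_RE.finditer(text)
    ]


def anonymize(text: str, spans: list[Span]) -> str:
    """Transformation step: replace each span with an <ENTITY_TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


sample = "Contact jane@example.com or bob@example.org today"
spans = analyze(sample)
result = anonymize(sample, spans)
print(result)
```

Because the only contract between the two functions is the list of spans, either side can be swapped (a different detector, a different replacement policy) without touching the other.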

For images, ImageRedactorEngine runs OCR (Tesseract), feeds the recognized text to AnalyzerEngine, and paints boxes over the original image; the DICOM variant applies the same steps to pydicom pixel data, and its README also covers metadata and long Windows paths (presidio-image-redactor/README.md:41-55, 87-95).

For tabular data, StructuredEngine uses PandasAnalysisBuilder().generate_analysis(df) to derive a column→entity map, then delegates anonymization to presidio-anonymizer per cell (presidio-structured/README.md:17-27).

5. Configuration, Extensibility, and Known Pitfalls

YAML recognizers. Recognizers can be declared declaratively; recent releases moved predefined recognizers to a config-file model (Release 2.2.355 changelog). A known sharp edge: omitted optional YAML fields can be serialized as explicit None during registry validation (Issue #2080).
Country-specific recognizers. Starting with 2.2.359 most country-specific recognizers default to *disabled* to reduce false positives outside their locale (Release 2.2.359 notes). New PH-specific recognizers have been requested (Issue #2015).
DICOM duplicate suppression. DicomImagePiiVerifyEngine._remove_duplicate_entities had a bug where sorted() was called but its result discarded, so the *lowest*-scored entity won; this is now tracked (Issue #2083).
DICOM evaluation notebook. The bundled sample notebook has reported failures — users should pin a known-good commit or apply the patches linked from the issue thread (Issue #1251).
GPU / deployment. GPU acceleration is opt-in via cupy-cuda12x; containers moved to gunicorn in 2.2.356 (Release 2.2.356 notes).

6. Getting Started

# Minimal text pipeline
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

results = analyzer.analyze(text="My name is John Doe", language="en")
anonymized = anonymizer.anonymize(text="My name is John Doe", analyzer_results=results)
print(anonymized)

(presidio/README.md:25-33)

pip install presidio                       # meta-package
pip install presidio-image-redactor        # OCR + DICOM
pip install presidio-structured            # pandas / JSON
pip install "presidio-analyzer[langextract]" # LLM-based detection (quoted so shells such as zsh do not expand the brackets)

(presidio-analyzer/README.md:15-23, presidio-image-redactor/README.md:23-29, presidio-structured/README.md:11-13)

Analyzer: PII Detection, NLP Engines & Recognizers

Related topics: Presidio Overview & System Architecture, Anonymization, Image Redaction & DICOM Processing, Structured Data, CLI, Deployment & Extensibility

The Presidio Analyzer is the detection half of the Presidio PII de-identification framework. It analyzes free-form text (and serves as a backend for the image, structured, and DICOM redactors) and returns a list of RecognizerResult objects describing where PII entities appear, what type they are, and a confidence score. This page documents how detection is wired together: the AnalyzerEngine, the pluggable NLP engine, and the recognizer registry.

High-Level Architecture

The analyzer is composed of three cooperating subsystems: an NLP engine that provides tokenization, lemmatization, and (optionally) named entity recognition; a recognizer registry that holds a curated set of detectors per language; and the AnalyzerEngine itself, which orchestrates the call to each recognizer, merges the results, and returns them to the caller.

flowchart LR
    A[Caller] -->|analyze(text, entities, language)| B[AnalyzerEngine]
    B --> C[NLP Engine Provider]
    C -->|spaCy / stanza / transformers| D[NLP Engine]
    B --> E[Recognizer Registry]
    E -->|loads| F[Predefined Recognizers]
    E -->|loads| G[Custom Recognizers from YAML/Code]
    F --> H[EntityRecognizer subclasses]
    G --> H
    H -->|RecognizerResult list| B
    B -->|merged, scored results| A

AnalyzerEngine.__init__ wires the registry and the NLP engine together, and the analyze method is the single entry point for end users (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]).

NLP Engine Provider

The NLP engine abstracts the linguistic backbone of the analyzer. It must provide tokenization, sentence splitting, lemmatization, and named entity recognition; the recognizers depend on these primitives to evaluate regexes against lemmas, to filter by POS tags, and to combine their own logic with model output.

The NlpEngineProvider reads an nlp_engine_name and an optional configuration dictionary, then instantiates one of three backends:

| Engine | Notes | Source |
| --- | --- | --- |
| `spacy` (default) | Uses a spaCy model; recommended `en_core_web_lg` for English. Supports GPU acceleration via `cupy-cuda12x` on Linux. | nlp_engine_provider.py |
| `stanza` | Stanford Stanza-based pipeline; used for languages not well served by spaCy. | nlp_engine_provider.py |
| `transformers` | Wraps a Hugging Face NER model directly, useful when a custom transformer is preferred over spaCy/stanza. | nlp_engine_provider.py |

GPU acceleration is controlled via an environment variable introduced in release 2.2.362; on macOS with Apple Silicon, MPS is not currently supported and PyTorch operations fall back to CPU (Source: [presidio-analyzer/README.md]).

The provider raises a ValueError when an unknown engine name is supplied, which is the most common configuration failure when bootstrapping a new language (Source: [presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py]).

Recognizer Registry

The registry is the catalog of detectors the engine will invoke. Each registry is bound to a single language and a single NLP engine instance; RecognizerRegistry.__init__ accepts both, plus an optional list of recognizer classes, an optional list of custom recognizers, and a context dictionary (e.g., a list of supported entity types) (Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]).

The registry exposes two principal operations:

load_predefined_recognizers(...): populates the registry with Presidio's built-in recognizers (regex, deny-list, NER-based) filtered by language. Since release 2.2.359, most country-specific recognizers that expect English text are registered as disabled by default to avoid spurious false positives in non-target languages (Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]).
add_recognizer(recognizer) and add_custom_recognizer(custom_recognizer): append user-supplied recognizers, which are validated against the registry's supported entity list (Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]).

Internally, recognizers are kept in two ordered dictionaries: the standard recognizers map and a custom_recognizers map; both are iterated during analysis. The get_recognizers method returns a flat list combining both, and get_supported_entities returns the deduplicated union of all entity types across every loaded recognizer (Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]).

Loading Recognizers from Configuration

RecognizerRegistryProvider constructs one or more registries (one per language) from a configuration dictionary. It delegates to RecognizersLoader (in recognizers_loader_utils.py) to parse three kinds of entries: built-in recognizer names, Python import paths, and YAML-defined recognizer definitions. This is the recommended way to version-control recognizer configuration and to share setups across deployments (Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py and presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py]).

Community note (issue #2080): When recognizers with dedicated YAML config models are configured through analyzer YAML, omitted optional fields can be serialized as explicit None values during registry validation. The config dump path for these recognizers may therefore emit fields with null values that are indistinguishable from fields the user deliberately set to null. This is a known serialization quirk to be aware of when diffing configuration files.
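When diffing an authored configuration against a generated dump, one workaround is to normalize both sides by dropping keys whose value is None before comparing. The sketch below uses plain dictionaries; the field names (`context`, `flags`, and so on) are hypothetical placeholders rather than Presidio's actual schema, and the helper is not part of the library.

```python
def drop_none(value):
    """Recursively remove dict keys whose value is None.

    None entries that appear directly inside a list are left alone,
    because removing them would shift list positions.
    """
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


# What the user wrote (hypothetical field names).
authored = {
    "name": "ZipRecognizer",
    "supported_language": "en",
    "patterns": [{"name": "zip", "regex": r"\d{5}", "score": 0.3}],
}

# What a dump might look like once omitted optionals come back as None.
dumped = {
    "name": "ZipRecognizer",
    "supported_language": "en",
    "context": None,
    "patterns": [{"name": "zip", "regex": r"\d{5}", "score": 0.3, "flags": None}],
}

print(dumped == authored)             # raw comparison reports a difference
print(drop_none(dumped) == authored)  # normalized comparison does not
```

The trade-off is the one the note describes: after normalization, a field deliberately set to null can no longer be told apart from one that was simply omitted.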

AnalyzerEngine and the `analyze` Pipeline

AnalyzerEngine is the high-level façade. It lazily instantiates its default NlpEngineProvider and RecognizerRegistry if none is supplied, and forwards language-specific setup to a per-language registry cache, so calling analyze with several languages does not rebuild the recognizer catalog each time (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]).

The analyze(text, entities, language, ...) method executes the following steps:

Validate the language against the configured supported languages (["en"] unless supported_languages is passed to the constructor).
Resolve the registry for the requested language via the engine provider; this also resolves the correct NLP engine and returns the matching list of EntityRecognizer subclasses.
Iterate the recognizers, passing each one the text, the entity allow-list, and an analysis context. Each recognizer returns a list of RecognizerResult objects with entity_type, start, end, and score.
Merge results across recognizers, dedupe overlaps, and apply the score threshold: the per-call score_threshold if given, otherwise the engine's default_score_threshold (0 unless overridden in the constructor).
Return the final list of RecognizerResult objects to the caller (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py and presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py]).

The score threshold is the most common knob to tune in production. Lowering it surfaces more candidates (including borderline regex matches) at the cost of false positives; raising it tightens precision at the cost of recall. For debugging, analyze also accepts return_decision_process=True, which attaches an explanation of how each result was reached (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]).
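The precision/recall trade-off is easiest to see on a fixed set of candidate results. The sketch below is a standalone illustration: `Hit` and `apply_threshold` are hypothetical names, the scores are invented, and the "keep results at or above the threshold" rule is an assumption about the filter rather than a quote from the engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def apply_threshold(hits, threshold):
    """Keep only results whose score is at or above the threshold (assumed rule)."""
    return [h for h in hits if h.score >= threshold]


# Invented candidates: a checksum-validated match, an NER hit,
# and two weak pattern matches.
hits = [
    Hit("CREDIT_CARD", 10, 29, 1.0),
    Hit("PERSON", 0, 8, 0.85),
    Hit("PHONE_NUMBER", 40, 50, 0.4),
    Hit("DATE_TIME", 60, 64, 0.1),
]

for threshold in (0.0, 0.5, 0.9):
    kept = apply_threshold(hits, threshold)
    print(threshold, [h.entity_type for h in kept])
```

At 0.0 every candidate survives, at 0.5 the weak pattern matches disappear, and at 0.9 only the validated match remains.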

Recognizer Categories

Three recognizer families ship with Presidio and are typically mixed in a single registry:

Pattern-based recognizers use regex and validation logic (e.g., Luhn checksum for credit cards, IBAN validation, URL parsing) and are deterministic and fast.
NLP-based recognizers wrap the underlying NER model returned by the NLP engine and translate generic model labels (e.g., PERSON, ORG) into Presidio's entity taxonomy.
Context-aware and LLM-based recognizers (e.g., the BasicLangExtractRecognizer and AzureOpenAILangExtractRecognizer) call out to a language model (local Ollama, Azure OpenAI, or Hugging Face NER) to detect entities that are difficult to capture with regexes. These require the optional presidio-analyzer[langextract] install, and they defer connectivity validation to the first call to analyze() (Source: [presidio-analyzer/README.md and presidio-analyzer/presidio_analyzer/analyzer_engine.py]).
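As an illustration of the pattern-based family, the sketch below pairs a candidate regex with Luhn checksum validation. It is a minimal standalone example, not Presidio's credit-card recognizer: `CARD_RE`, `luhn_valid`, and `find_cards` are hypothetical names, and the regex is intentionally loose.

```python
import re

# Candidate pattern: 13 to 19 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Regex proposes candidates; the checksum rejects look-alikes."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]


text = "Valid: 4111 1111 1111 1111, typo: 4111 1111 1111 1112"
print(find_cards(text))
```

The regex alone would flag both numbers; the checksum step is what makes this kind of recognizer both fast and precise.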

Release 2.2.362 introduced a HuggingFaceNerRecognizer for direct NER model inference, providing a lower-friction alternative to spinning up an entire transformers NLP engine when only NER output is required.

Common Failure Modes

No model installed for the requested language: the most common initialization error; resolved by running python -m spacy download <model> for spaCy or installing the matching Stanza model.
Unsupported language: AnalyzerEngine.analyze raises if language is not in the configured supported_languages. Pass supported_languages=["en","es",...] to the constructor to extend the set (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]).
Low recall on a target entity: first check the score threshold (default_score_threshold or the per-call score_threshold), then verify the recognizer is not one of the country-specific recognizers disabled by default since 2.2.359; explicitly re-enable it via the registry or YAML configuration when targeting that locale (Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py]).
YAML recognizer loads with None fields: known issue #2080; treat null YAML values as "unset" when diffing generated configuration dumps.

Anonymization, Image Redaction & DICOM Processing

Related topics: Presidio Overview & System Architecture, Analyzer: PII Detection, NLP Engines & Recognizers, Structured Data, CLI, Deployment & Extensibility

Overview & Scope

Presidio ships three complementary capabilities that operate downstream of PII detection: text anonymization, optical image redaction, and DICOM (medical imaging) redaction. Together they cover the "de-identify" half of the Presidio pipeline — presidio-analyzer finds the PII, and these engines remove or mask it. The anonymizer works purely on text spans returned by the analyzer, while the image redactor extends the same recognizer pipeline to pixels and burnt-in text inside images and DICOM frames. Each package is independently installable and independently deployable as a REST service or Python library.

The text anonymizer also exposes a DeanonymizeEngine (the inverse operation), a batch engine (BatchAnonymizerEngine) for collections of values such as rows loaded from CSV or JSON, and an overlap-conflict resolution strategy so that nested or intersecting PII spans produce stable, predictable output.

Text Anonymization with `presidio-anonymizer`

The AnonymizerEngine consumes a list of RecognizerResult objects (typically emitted by AnalyzerEngine) and applies a per-entity operator chain to produce redacted text. Operators are looked up via the factory in operators/operators_factory.py and selected per entity type, with DEFAULT acting as a global fallback (Source: [presidio-anonymizer/README.md]).

Built-in Operators

The anonymizer ships with a fixed set of built-in operators, each accepting specific parameters:

| Operator | Purpose | Key Parameters |
| --- | --- | --- |
| `replace` | Substitute with a static value (defaults to `<entity_type>`) | `new_value` |
| `redact` | Remove the PII completely | (none) |
| `hash` | Cryptographic hash of the PII (sha256/sha512) | `hash_type` |
| `mask` | Replace with a repeated character | `chars_to_mask`, `masking_char`, `from_end` |
| `encrypt` | AES (Rijndael) encryption, reversible via deanonymizer | `key` (128/192/256-bit) |
| `custom` | Apply a user lambda to the PII string | `lambda` |
| `ahds surrogate` | Call Azure Health Data Services de-identification service | `endpoint`, `entities`, `input_locale`, `surrogate_locale` |

The Encrypt operator and the Decrypt deanonymizer together form a reversible pair; the key length must be 128, 192, or 256 bits and is provided as a string. The AHDS surrogate operator was added in release 2.2.360 and requires the presidio-anonymizer[ahds] extra (Source: [presidio-anonymizer/README.md]).
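The per-entity lookup with a DEFAULT fallback can be sketched with plain functions. This is an illustration of the idea using only the standard library, not the package's operator classes: the function signatures, the `OPERATORS` table, and the `(name, params)` config shape are assumptions made for the example, while the parameter names mirror the table above.

```python
import hashlib


def replace(pii, params, entity_type):
    # Falls back to an <ENTITY_TYPE> placeholder when no new_value is given.
    return params.get("new_value", f"<{entity_type}>")


def redact(pii, params, entity_type):
    return ""


def hash_(pii, params, entity_type):
    algo = params.get("hash_type", "sha256")
    return hashlib.new(algo, pii.encode("utf-8")).hexdigest()


def mask(pii, params, entity_type):
    n = min(params["chars_to_mask"], len(pii))
    fill = params["masking_char"] * n
    if params.get("from_end", False):
        return pii[:len(pii) - n] + fill
    return fill + pii[n:]


OPERATORS = {"replace": replace, "redact": redact, "hash": hash_, "mask": mask}


def apply_operator(pii, entity_type, config):
    """Pick the operator configured for the entity type, else the DEFAULT entry."""
    name, params = config.get(entity_type, config["DEFAULT"])
    return OPERATORS[name](pii, params, entity_type)


config = {
    "DEFAULT": ("replace", {}),
    "PERSON": ("replace", {"new_value": "<REDACTED_NAME>"}),
    "PHONE_NUMBER": ("mask", {"chars_to_mask": 4, "masking_char": "*", "from_end": True}),
}

print(apply_operator("212-555-0199", "PHONE_NUMBER", config))
print(apply_operator("Jane Doe", "PERSON", config))
print(apply_operator("jane@example.com", "EMAIL_ADDRESS", config))
```

EMAIL_ADDRESS has no entry of its own, so it is handled by the DEFAULT operator. Encryption is left out of the sketch because it needs a real AES implementation.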

Overlap Conflict Resolution

When two recognizers return overlapping spans, the AnonymizerEngine relies on the ConflictResolutionStrategy enum in entities/conflict_resolution_strategy.py. The behavior is documented as: full-span overlaps resolve by score (higher wins, ties are arbitrary); one span contained inside another resolves to the *longer* span even when its score is lower; and partial intersections are anonymized independently and concatenated in the output. This produces the well-known pattern where, in "I'm George Washington Square Park", the PERSON span "George Washington" and the LOCATION span "Washington Square Park" collapse to "I'm <PERSON><LOCATION>" (Source: [presidio-anonymizer/README.md]).
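The three rules can be reproduced in a few lines. The sketch below is one possible reading of the documented behavior, written from scratch: `Span`, `resolve`, and `anonymize` are hypothetical names, the scores are invented, and the library's own implementation may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


def resolve(spans):
    """Drop identical or contained spans; keep partial intersections."""
    kept = []
    # Longest first, then highest score, so a winner is always seen before a loser.
    for cand in sorted(spans, key=lambda s: (-(s.end - s.start), -s.score)):
        covered = any(k.start <= cand.start and cand.end <= k.end for k in kept)
        if not covered:
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)


def anonymize(text, spans):
    """Replace right to left, never re-consuming text a later span already took."""
    out = text
    last_start = len(text)
    for s in sorted(resolve(spans), key=lambda s: s.end, reverse=True):
        end = min(s.end, last_start)
        out = out[:s.start] + f"<{s.entity_type}>" + out[end:]
        last_start = s.start
    return out


text = "I'm George Washington Square Park"
person, location = "George Washington", "Washington Square Park"
spans = [
    Span("PERSON", text.index(person), text.index(person) + len(person), 0.85),
    Span("LOCATION", text.index(location), text.index(location) + len(location), 0.85),
]
print(anonymize(text, spans))
```

The two spans only partially intersect, so both survive `resolve` and the output concatenates the two placeholders. A span fully inside another would be dropped instead, regardless of its score.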

Deanonymization & Batch Processing

DeanonymizeEngine is the symmetric counterpart and currently exposes a single Decrypt operator that uses AES to reverse the Encrypt operation (Source: [presidio-anonymizer/presidio_anonymizer/deanonymize_engine.py]). For dataset-scale workflows, BatchAnonymizerEngine applies the same per-entity operator pipeline to in-memory collections (lists and dictionaries of values, such as rows loaded from CSV or JSON), using the RecognizerResult lists the analyzer produced for each value (Source: [presidio-anonymizer/presidio_anonymizer/batch_anonymizer_engine.py]).

Image Redaction with `presidio-image-redactor`

The image redactor is a separate package that detects PII in raster images (PNG/JPG/etc.) and in DICOM medical frames, then draws filled rectangles over the detected regions. It depends on Tesseract OCR for text extraction and on presidio-analyzer for entity classification, so the same recognizer set used for plain text applies to pixels.

Standard Image Redaction

ImageRedactorEngine is the entry point for ordinary raster images. Its primary method accepts a PIL.Image and an optional color fill (int or (R,G,B) tuple, default black) and returns a redacted image with rectangles drawn over the bounding boxes returned by the OCR + analyzer pipeline. A small HTTP service is exposed via POST /redact, accepting a multipart form with the image and a data field of the shape {'color_fill':'0,0,0'} (Source: [presidio-image-redactor/README.md]).
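The fill step itself is simple once bounding boxes are known. The sketch below is a dependency-free illustration on a nested-list "image": `parse_color_fill` and `fill_boxes` are hypothetical helpers, the `(left, top, width, height)` box layout is an assumption, and a real pipeline would operate on a PIL.Image instead.

```python
def parse_color_fill(raw: str):
    """Turn "0,0,0" into (0, 0, 0), or a single value such as "255" into 255."""
    parts = [int(p.strip()) for p in raw.split(",")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 3:
        return tuple(parts)
    raise ValueError(f"expected 1 or 3 channels, got {len(parts)}")


def fill_boxes(pixels, boxes, fill):
    """Return a copy of `pixels` with each (left, top, width, height) box filled."""
    out = [row[:] for row in pixels]
    for left, top, width, height in boxes:
        # Clamp to the image so boxes that overhang the edge do not raise.
        for y in range(max(top, 0), min(top + height, len(out))):
            for x in range(max(left, 0), min(left + width, len(out[y]))):
                out[y][x] = fill
    return out


white = (255, 255, 255)
image = [[white] * 6 for _ in range(4)]   # 6x4 all-white image
fill = parse_color_fill("0,0,0")          # same shape as the HTTP data field
redacted = fill_boxes(image, [(1, 1, 3, 2)], fill)

for row in redacted:
    print("".join("#" if px == fill else "." for px in row))
```

Returning a copy rather than painting in place keeps the original pixels available for comparison or verification.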

DICOM Image Redaction

DICOM redaction is handled by DicomImageRedactorEngine, which works on pydicom datasets rather than PIL.Image objects. Four entry points are documented in the README: redact(dicom_image, fill=...) for in-memory redaction, redact_and_return_bbox(...) when callers also need the bounding boxes, redact_from_file(input_path, output_dir, ...) for single-file workflows that persist JSON bbox dumps, and redact_from_directory(...) for recursive batch jobs. Padding and ocr_kwargs (e.g. ocr_threshold) are configurable per call (Source: [presidio-image-redactor/README.md]).

Scope note: the redactor scrubs burnt-in text in *pixel data only*; it does not touch the structured DICOM metadata headers. The README explicitly recommends pairing it with the Tools for Health Data Anonymization package for metadata scrubbing (Source: [presidio-image-redactor/README.md]).

Architecture & Data Flow

The end-to-end flow for both text and image modalities is uniform: detect → resolve conflicts → redact. The diagram below summarizes the image/DICOM path; text anonymization follows the same shape with the OCR step omitted.

flowchart LR
    A[Input: text / image / DICOM] --> B{Modality?}
    B -- Text --> C[AnalyzerEngine]
    B -- Image --> D[Tesseract OCR]
    B -- DICOM --> D
    D --> C
    C --> E[RecognizerResults]
    E --> F[ConflictResolutionStrategy]
    F --> G{Output sink}
    G -- Text --> H[AnonymizerEngine<br/>or BatchAnonymizerEngine]
    G -- Image --> I[ImageRedactorEngine]
    G -- DICOM --> J[DicomImageRedactorEngine]
    H --> K[Redacted text]
    I --> L[Redacted PNG]
    J --> M[Redacted DICOM + bbox JSON]

Known Issues & Community Notes

Several recurring community issues are worth flagging before adopting the image/DICOM pipeline:

Sorted-result bug in PII verification — DicomImagePiiVerifyEngine._remove_duplicate_entities calls sorted() on overlapping entity candidates but discards the sorted list, so the *lowest*-scored entity is kept instead of the highest. Track issue #2083 for the fix.
DICOM metadata edge cases — Release 2.2.354 shipped a fix for "wrong condition for dicom metadata" (#1347), so older versions may misclassify or skip header fields.
Evaluation notebook drift — The example_dicom_redactor_evaluation.ipynb sample has historically failed with an error (issue #1251); users running evaluation pipelines should pin a known-good commit or verify against the current notebook.
Country-specific recognizers — As of release 2.2.359, country-specific predefined recognizers are disabled by default to reduce false positives. Adopters needing, for example, Philippines-specific PII (see issue #2015) must opt them in explicitly.
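The sorted-result bug in the first item above comes from a common Python pitfall: sorted() returns a new list and leaves its argument untouched. The sketch below reproduces the pattern in isolation; the function names and the dictionary shape are hypothetical and are not copied from DicomImagePiiVerifyEngine.

```python
def keep_best_buggy(candidates):
    # Bug: the sorted copy is thrown away, so `candidates` keeps its input order.
    sorted(candidates, key=lambda c: c["score"], reverse=True)
    return candidates[0]


def keep_best_fixed(candidates):
    # Fix: bind the sorted result (or call candidates.sort(...) to sort in place).
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return ranked[0]


overlapping = [
    {"entity_type": "PERSON", "score": 0.40},
    {"entity_type": "PERSON", "score": 0.95},
]

print(keep_best_buggy(overlapping)["score"])
print(keep_best_fixed(overlapping)["score"])
```

The buggy version returns whichever candidate happens to come first in the input; in this sample that is the lowest-scored one.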

For long file paths on Windows during DICOM batch jobs, the README also recommends enabling Win32 long paths — a common practical failure point (Source: [presidio-image-redactor/README.md]).

Structured Data, CLI, Deployment & Extensibility

Related topics: Presidio Overview & System Architecture, Analyzer: PII Detection, NLP Engines & Recognizers, Anonymization, Image Redaction & DICOM Processing

This page describes four cross-cutting capabilities of the Presidio ecosystem: the Structured module for tabular/semi-structured data, the CLI tool for batch text analysis, the supported Deployment topologies (Azure, Docker, HTTP), and the Extensibility surface (recognizers, operators, language models). It is intended for engineers integrating Presidio into pipelines and for operators who need to choose a deployment topology.

1. Structured Data with `presidio-structured`

presidio-structured extends Presidio beyond free text. It uses presidio-analyzer to detect PII at the column/key level, then uses presidio-anonymizer operators to redact values. The package exposes a StructuredEngine and a PandasAnalysisBuilder, both installed via pip install presidio-structured (presidio-structured/README.md).

The typical pattern is to (1) build a tabular_analysis that maps columns/keys to detected entity types, (2) define an OperatorConfig map per entity, and (3) invoke the engine over a pandas.DataFrame or a JSON document. The Faker library can be wired in via OperatorConfig("replace", {"new_value": fake.name()}) to produce realistic surrogates (presidio-structured/README.md).
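The three steps can be sketched without pandas or Presidio. In the illustration below, rows are plain dictionaries, two regexes stand in for the analyzer, and lambdas stand in for OperatorConfig entries; `DETECTORS`, `generate_analysis`, and `anonymize_rows` are hypothetical names, and the "more than half of the values match" rule is an assumption made for the example.

```python
import re

# Stand-ins for the analyzer: one whole-value pattern per entity type.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "PHONE_NUMBER": re.compile(r"^\+?[\d\s()-]{7,}$"),
}


def generate_analysis(rows):
    """Step 1: map each column to the entity type most of its values match."""
    mapping = {}
    for column in rows[0]:
        values = [str(r[column]) for r in rows]
        for entity, pattern in DETECTORS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits > len(values) / 2:
                mapping[column] = entity
                break
    return mapping


def anonymize_rows(rows, mapping, operators):
    """Step 3: apply the operator for each mapped column, cell by cell."""
    return [
        {col: operators[mapping[col]](val) if col in mapping else val
         for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"id": 1, "email": "jane@example.com", "phone": "+1 212 555 0100"},
    {"id": 2, "email": "bob@example.org", "phone": "+1 212 555 0101"},
]

# Step 2: one operator per entity type.
operators = {
    "EMAIL_ADDRESS": lambda v: "<EMAIL_ADDRESS>",
    "PHONE_NUMBER": lambda v: "*" * len(v),
}

mapping = generate_analysis(rows)
anonymized = anonymize_rows(rows, mapping, operators)
print(mapping)
print(anonymized[0])
```

Deciding the entity type once per column, rather than once per cell, is what keeps the output consistent within a column even when individual values are ambiguous.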

A notable capability added in release 2.2.354 is user-defined entity selection strategies, allowing callers to plug in their own logic for choosing which entities get anonymized per column (Release 2.2.354). Community requests, such as Philippines-specific predefined recognizers, indicate ongoing expansion of country-specific recognizers usable from presidio-structured (Issue #2015).

2. CLI: `presidio-cli`

presidio-cli is a thin wrapper around presidio-analyzer that scans files and directories for PII. It is installed with pip install presidio-cli and requires Python 3.10–3.13 and poetry for source builds (presidio-cli/README.md).

Configuration can be supplied via a YAML file (-c) or inline (-d). The YAML keys map directly to presidio-analyzer parameters:

| YAML key | Purpose | Example value |
| --- | --- | --- |
| `language` | NLP language for analysis | `en` |
| `ignore` | Glob patterns to skip | `.git`, `*.cfg` |
| `entities` | Restrict detection to a subset | `[PERSON, EMAIL_ADDRESS]` |
| `allow` | Allow-list tokens that should not be flagged | list of strings |

The CLI supports four output formats selected with -f / --format: standard, github (CI annotations), colored, and parsable (one JSON object per finding). An additional auto value picks github inside GitHub Actions and colored otherwise (presidio-cli/README.md). Running presidio . scans the current directory using a .presidiocli config if present, and the --help flag enumerates every option.

3. Deployment Topologies

Presidio is shipped as multiple independently installable Python packages and as containerized services. The top-level presidio PyPI package is a meta-package that pulls in presidio-analyzer and presidio-anonymizer only — it contains no code of its own (presidio/README.md).

Three deployment paths are documented:

Azure one-click deploy. Each sub-package ships a deploytoazure.json ARM template surfaced via a "Deploy to Azure" button in the README. Templates exist for presidio-analyzer, presidio-anonymizer, and presidio-image-redactor (presidio-analyzer/README.md, presidio-anonymizer/README.md, presidio-image-redactor/README.md).
Docker Compose. The image-redactor and anonymizer packages each provide a docker-compose.yml runnable with docker-compose up -d from the package directory (presidio-image-redactor/README.md, presidio-anonymizer/README.md). Release 2.2.356 moved containers from the dev server to gunicorn for production parity (Release 2.2.356).
HTTP API. Once running, the image redactor exposes POST /redact accepting multipart form data with an image and a color_fill payload; the anonymizer exposes its endpoint as documented in the public API spec (presidio-image-redactor/README.md, presidio-anonymizer/README.md).

GPU acceleration is supported for analyzer workloads by installing the matching CUDA build of cupy (e.g. cupy-cuda12x) on Linux/NVIDIA; macOS/Apple Silicon currently falls back to CPU for PyTorch operations (presidio-analyzer/README.md).

4. Extensibility

Presidio is designed to be extended along three axes: custom recognizers, custom anonymizer operators, and language-model backends.

Recognizers. Predefined recognizers rely on regex, NER, and checksum logic, and can be replaced or augmented with custom classes that subclass the analyzer's EntityRecognizer contract. Release 2.2.355 moved predefined recognizers onto a config-file foundation, making it easier to override defaults without editing code (Release 2.2.355).
Anonymizer operators. Built-in operators include replace, redact, hash (sha256/sha512; md5 was deprecated in 2.2.358), mask, encrypt (AES), custom (lambda), and the AHDS Surrogate operator that calls the Azure Health Data Services de-identification service for medically-appropriate surrogates (presidio-anonymizer/README.md, Release 2.2.358).
Language-model backends. Release 2.2.362 introduced HuggingFaceNerRecognizer for direct NER model inference, and presidio-analyzer now ships LangExtract-backed recognizers (BasicLangExtractRecognizer, AzureOpenAILangExtractRecognizer) supporting Ollama and Azure OpenAI providers via the presidio-analyzer[langextract] extra (Release 2.2.362, presidio-analyzer/README.md).

Known Extensibility Caveats

Community-reported issues show that the YAML-driven configuration path can serialize omitted optional recognizer fields as explicit None, which downstream registry validation may reject (Issue #2080). Similarly, the DICOM verification engine has had a sorting bug where _remove_duplicate_entities discarded its sorted() result and kept the lowest-scored entity (Issue #2083), a reminder that custom extensions should re-validate precedence logic when overriding built-ins. The DICOM redactor evaluation notebook has also been reported broken in the past (Issue #1251); users integrating custom evaluation harnesses should pin notebook revisions accordingly.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 10 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.

1. Runtime risk: Runtime risk requires verification

Severity: high
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1251

2. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2080

3. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/microsoft/presidio

4. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2083

5. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

6. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/microsoft/presidio

7. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/microsoft/presidio

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1882

9. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

10. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 11. Count of project-level external discussion links exposed on this manual page.

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using presidio with real data or production workflows.

Sample presidio image evaluation notebook generates error - github / github_issue
bug(image-redactor): sorted() result discarded in `_remove_duplicate_e - github / github_issue
DocumentIntelligenceOCR support for Azure Identity (e.g DefaultAzureCred - github / github_issue
Include support for hinglish transcripts masking - github / github_issue
[Bug] Omitted YAML recognizer fields are passed as None (https://github.com/microsoft/presidio/issues/2080) - github / github_issue
Feature Request: Add Philippines (PH) country-specific predefined recogn - github / github_issue
Release 2.2.362 - github / github_release
Release 2.2.361 - github / github_release
Release 2.2.360 - github / github_release
2.2.359 - github / github_release
Capability evidence risk requires verification - GitHub / issue

Source: Project Pack community evidence and pitfall evidence
