Doramagic Project Pack · Human Manual

presidio

An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.

Presidio Overview & System Architecture

Related topics: Analyzer: PII Detection, NLP Engines & Recognizers, Anonymization, Image Redaction & DICOM Processing, Structured Data, CLI, Deployment & Extensibility

1. Purpose and Scope

Microsoft Presidio is a context-aware, pluggable, and customizable PII de-identification service for text and images. It is delivered as a monorepo containing several cooperating Python packages that together detect, redact, and anonymize personally identifiable information (PII) in unstructured text, structured/tabular data, and pixel data (images and DICOM medical files) (presidio/README.md:1-5).

The top-level presidio package is a thin convenience meta-package that has no code of its own — it simply installs presidio-analyzer and presidio-anonymizer as dependencies (presidio/README.md:7-11). All real functionality lives in the sub-packages.

Primary use cases observed across the repo include:

  • Detecting PII in free-form text and substituting it with placeholders, hashes, or encrypted values.
  • Redacting burnt-in text PHI in standard images and DICOM medical scans.
  • Identifying and anonymizing PII in tabular data (pandas DataFrames) and JSON-like semi-structured records.
  • Scanning source trees from the command line to flag PII inside code or documentation.

2. System Architecture

The repository is organized as a multi-package monorepo: each capability is its own installable Python package that exposes a Python API and, for some packages, a Dockerized REST service.

flowchart LR
    User[User / Application]
    subgraph Presidio["presidio monorepo"]
        Analyzer["presidio-analyzer<br/>Detect PII in text"]
        Anonymizer["presidio-anonymizer<br/>Replace / Encrypt / Hash"]
        ImgRed["presidio-image-redactor<br/>Images + DICOM"]
        Structured["presidio-structured<br/>Pandas / JSON tabular"]
        CLI["presidio-cli<br/>Source-tree scanner"]
        Meta["presidio<br/>meta-package (deps only)"]
    end
    User --> Analyzer
    Analyzer -->|RecognizerResult[]| Anonymizer
    User --> ImgRed
    User --> Structured
    User --> CLI
    Meta -.bundles.-> Analyzer
    Meta -.bundles.-> Anonymizer

Key architectural properties:

  • Separation of detection and transformation. Detection is performed by AnalyzerEngine (presidio-analyzer/README.md:51-58), transformation by AnonymizerEngine (presidio-anonymizer/README.md:39-44). The boundary is the RecognizerResult list, which lets users swap one side without touching the other.
  • Recognizer extensibility. Each predefined recognizer is responsible for one or more PII entity types using regex, NER, checksum validation, or external models (presidio-analyzer/README.md:3-9).
  • Pluggable operators. Anonymization is composed of small operators (replace, redact, hash, mask, encrypt, custom, surrogate, etc.) configured per entity (presidio-anonymizer/README.md:11-37).
  • Multiple deployment shapes. The same code path is exposed as a Python library, a docker-compose HTTP service, and (via the meta-package) a single pip install presidio (presidio/README.md:17-23).

3. Core Packages

Package | Responsibility | Entry point | Source
presidio-analyzer | Detect PII entities in unstructured text; supports spaCy NER, regex, LangExtract/LLM, HuggingFace NER, and country-specific recognizers | AnalyzerEngine().analyze(text, language) | presidio-analyzer/README.md:55-60
presidio-anonymizer | Apply anonymization operators to detected spans; reversible via Decrypt deanonymizer | AnonymizerEngine().anonymize(text, analyzer_results, operators=…) | presidio-anonymizer/README.md:41-48
presidio-image-redactor | OCR + PII detection on images; specialized DICOM pipeline using Tesseract and pydicom | ImageRedactorEngine, DicomImageRedactorEngine | presidio-image-redactor/README.md:41-55
presidio-structured | Map DataFrame columns / JSON keys to entities, then anonymize values | StructuredEngine + PandasAnalysisBuilder | presidio-structured/README.md:5-15
presidio-cli | Walk files in a directory and print PII hits with format options (standard, github, colored, parsable) | presidio <path> | presidio-cli/README.md:79-94

The analyzer ships with optional extras — e.g. presidio-analyzer[langextract] for LLM-based detection through Ollama or Azure OpenAI (presidio-analyzer/README.md:15-23), and presidio-anonymizer[ahds] for the Azure Health Data Services surrogate operator (presidio-anonymizer/README.md:31-37).

4. End-to-End Data Flow

For the canonical text path, the pipeline is:

  1. Analyze: AnalyzerEngine.analyze(text, entities=…, language="en") runs every registered recognizer and returns a list of RecognizerResult (entity type, span, score) (presidio-analyzer/README.md:51-60).
  2. Anonymize: AnonymizerEngine.anonymize(text, analyzer_results, operators=…) resolves overlapping spans, applies the operator configured for each entity type (falling back to DEFAULT), and returns the anonymized text together with the list of applied replacements (presidio-anonymizer/README.md:41-48).
  3. (Optional) Deanonymize: Encrypted spans can be reversed with the Decrypt operator when the AES key is available (presidio-anonymizer/README.md:53-56).
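The replacement step in the pipeline above can be sketched with the standard library alone. This is a minimal illustration, not Presidio's implementation: the Span class and replace_spans function are hypothetical stand-ins for RecognizerResult and the anonymizer, and the offsets are written by hand rather than detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detection result: type, offsets, score."""
    entity_type: str
    start: int
    end: int
    score: float


def replace_spans(text: str, spans: list) -> str:
    """Replace each span with <ENTITY_TYPE>, working right to left."""
    # Going from the last span to the first keeps the earlier offsets valid,
    # because an edit never shifts text that is still waiting to be replaced.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


# Hand-written offsets (assumption): "John Doe" occupies characters 11..18.
anonymized = replace_spans("My name is John Doe", [Span("PERSON", 11, 19, 0.85)])
```

Processing spans from the end of the string is the design choice that matters here: replacements change the string length, so a left-to-right pass would need to re-compute every later offset.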

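For images, ImageRedactorEngine runs OCR (Tesseract), feeds the recognized text to AnalyzerEngine, and paints boxes over the detected regions of the image; the DICOM variant applies the same steps to pixel data loaded through pydicom and redacts burnt-in text only, leaving metadata headers to a separate tool (presidio-image-redactor/README.md:41-55, 87-95). The README also recommends enabling Win32 long paths for DICOM batch jobs on Windows.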

For tabular data, StructuredEngine uses PandasAnalysisBuilder().generate_analysis(df) to derive a column→entity map, then delegates anonymization to presidio-anonymizer per cell (presidio-structured/README.md:17-27).
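The per-cell delegation described above can be pictured as follows. This is a sketch under stated assumptions: rows are plain dictionaries instead of a pandas DataFrame, the column-to-entity map is written by hand instead of being generated, and anonymize_rows and the operator callables are hypothetical names, not part of presidio-structured.

```python
def anonymize_rows(rows, column_entities, operators):
    """Apply the operator for each mapped column to every row.

    rows            -- list of dicts, one per record
    column_entities -- {"column name": "ENTITY_TYPE"}
    operators       -- {"ENTITY_TYPE": callable}, with a "DEFAULT" fallback
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's data is left untouched
        for column, entity in column_entities.items():
            if column in new_row:
                operator = operators.get(entity, operators["DEFAULT"])
                new_row[column] = operator(str(new_row[column]))
        result.append(new_row)
    return result


rows = [{"name": "John Doe", "email": "john@example.com", "city": "Oslo"}]
column_entities = {"name": "PERSON", "email": "EMAIL_ADDRESS"}
operators = {
    "PERSON": lambda value: "<PERSON>",
    "DEFAULT": lambda value: "<REDACTED>",
}
anonymized_rows = anonymize_rows(rows, column_entities, operators)
```

Columns missing from the map ("city" here) pass through unchanged, which is why the quality of the column-to-entity map decides what actually gets anonymized.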

5. Configuration, Extensibility, and Known Pitfalls

  • YAML recognizers. Recognizers can be declared declaratively; recent releases moved predefined recognizers to a config-file model (Release 2.2.355 changelog). A known sharp edge: omitted optional YAML fields can be serialized as explicit None during registry validation (Issue #2080).
  • Country-specific recognizers. Starting with 2.2.359 most country-specific recognizers default to *disabled* to reduce false positives outside their locale (Release 2.2.359 notes). New PH-specific recognizers have been requested (Issue #2015).
  • DICOM duplicate suppression. DicomImagePiiVerifyEngine._remove_duplicate_entities had a bug where sorted() was called but its result discarded, so the *lowest*-scored entity won; this is now tracked (Issue #2083).
  • DICOM evaluation notebook. The bundled sample notebook has reported failures — users should pin a known-good commit or apply the patches linked from the issue thread (Issue #1251).
  • GPU / deployment. GPU acceleration is opt-in via cupy-cuda12x; containers moved to gunicorn in 2.2.356 (Release 2.2.356 notes).

6. Getting Started

# Minimal text pipeline
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

results = analyzer.analyze(text="My name is John Doe", language="en")
anonymized = anonymizer.anonymize(text="My name is John Doe", analyzer_results=results)
print(anonymized)

(presidio/README.md:25-33)

pip install presidio                       # meta-package
pip install presidio-image-redactor        # OCR + DICOM
pip install presidio-structured            # pandas / JSON
pip install "presidio-analyzer[langextract]" # LLM-based detection (quoted so the shell does not expand the brackets)

(presidio-analyzer/README.md:15-23, presidio-image-redactor/README.md:23-29, presidio-structured/README.md:11-13)

See Also

  • Analyzer internals — recognizer registration, custom recognizers, NER/LLM backends.
  • Anonymizer operators — overlap handling, encryption, AHDS surrogate.
  • Image Redactor — DICOM pipeline, OCR thresholds, PII verification engine.
  • Structured — DataFrame/JSON analysis builders and operator mappings.
  • CLI — output formats and CI integration.

Source: https://github.com/microsoft/presidio / Human Manual

Analyzer: PII Detection, NLP Engines & Recognizers

Related topics: Presidio Overview & System Architecture, Anonymization, Image Redaction & DICOM Processing, Structured Data, CLI, Deployment & Extensibility

The Presidio Analyzer is the detection half of the Presidio PII de-identification framework. It analyzes free-form text (and serves as a backend for the image, structured, and DICOM redactors) and returns a list of RecognizerResult objects describing where PII entities appear, what type they are, and a confidence score. This page documents how detection is wired together: the AnalyzerEngine, the pluggable NLP engine, and the recognizer registry.

High-Level Architecture

The analyzer is composed of three cooperating subsystems: an NLP engine that provides tokenization, lemmatization, and (optionally) named entity recognition; a recognizer registry that holds a curated set of detectors per language; and the AnalyzerEngine itself, which orchestrates the call to each recognizer, merges the results, and returns them to the caller.

flowchart LR
    A[Caller] -->|analyze(text, entities, language)| B[AnalyzerEngine]
    B --> C[NLP Engine Provider]
    C -->|spaCy / stanza / transformers| D[NLP Engine]
    B --> E[Recognizer Registry]
    E -->|loads| F[Predefined Recognizers]
    E -->|loads| G[Custom Recognizers from YAML/Code]
    F --> H[EntityRecognizer subclasses]
    G --> H
    H -->|RecognizerResult list| B
    B -->|merged, scored results| A

The AnalyzerEngine.__init__ wires the registry and the NLP engine together, and the analyze method is the single entry point for end users (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]).

NLP Engine Provider

The NLP engine abstracts the linguistic backbone of the analyzer. It must provide tokenization, sentence splitting, lemmatization, and named entity recognition; the recognizers depend on these primitives to evaluate regexes against lemmas, to filter by POS tags, and to combine their own logic with model output.

The NlpEngineProvider reads an nlp_engine_name and an optional configuration dictionary, then instantiates one of three backends:

Engine | Notes | Source
spacy (default) | Uses a spaCy model; en_core_web_lg is recommended for English. Supports GPU acceleration via cupy-cuda12x on Linux. | nlp_engine_provider.py
stanza | Stanford Stanza-based pipeline; used for languages not well served by spaCy. | nlp_engine_provider.py
transformers | Wraps a Hugging Face NER model directly, useful when a custom transformer is preferred over spaCy/stanza. | nlp_engine_provider.py

GPU acceleration is controlled via an environment variable, as introduced in release 2.2.362; on macOS with Apple Silicon, MPS is not currently supported and PyTorch operations fall back to CPU (Source: [presidio-analyzer/README.md]).

The provider raises a ValueError when an unknown engine name is supplied, which is the most common configuration failure when bootstrapping a new language (Source: [presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py]).
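The failure mode above is easy to reproduce in miniature. The sketch below is not the NlpEngineProvider code: create_engine and SUPPORTED_ENGINES are hypothetical names, and the returned dictionary merely stands in for an engine instance. It only shows the validate-then-build shape that leads to the ValueError.

```python
SUPPORTED_ENGINES = ("spacy", "stanza", "transformers")


def create_engine(configuration: dict) -> dict:
    """Validate the engine name, then return a placeholder engine description."""
    name = configuration.get("nlp_engine_name")
    if name not in SUPPORTED_ENGINES:
        # Failing early, with the valid choices in the message, makes a typo
        # in the configuration obvious at start-up instead of at first use.
        raise ValueError(
            f"NLP engine {name!r} is not available; "
            f"choose one of {', '.join(SUPPORTED_ENGINES)}"
        )
    return {"engine": name, "models": configuration.get("models", [])}


engine = create_engine({"nlp_engine_name": "spacy"})

try:
    create_engine({"nlp_engine_name": "spaCy-large"})  # misspelled on purpose
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```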

Recognizer Registry

The registry is the catalog of detectors the engine will invoke. Each registry is bound to a single language and a single NLP engine instance; the RecognizerRegistry.__init__ accepts both, plus an optional list of recognizer classes, an optional list of custom recognizers, and a context dictionary (e.g., a list of supported entity types) Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].

The registry exposes two principal operations:

  • load_predefined_recognizers(...) — populates the registry with Presidio's built-in recognizers (regex, deny-list, NER-based) filtered by language. Since release 2.2.359, most country-specific recognizers that expect English text are registered as disabled by default to avoid spurious false positives in non-target languages Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].
  • add_recognizer(recognizer) and add_custom_recognizer(custom_recognizer) — append user-supplied recognizers, which are validated against the registry's supported entity list Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].

Internally, recognizers are kept in two ordered dictionaries: the standard recognizers map and a custom_recognizers map; both are iterated during analysis. The get_recognizers method returns a flat list combining both, and get_supported_entities returns the deduplicated union of all entity types across every loaded recognizer Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].

Loading Recognizers from Configuration

RecognizerRegistryProvider constructs one or more registries (one per language) from a configuration dictionary. It delegates to RecognizersLoader (in recognizers_loader_utils.py) to parse three kinds of entries: built-in recognizer names, Python import paths, and YAML-defined recognizer definitions. This is the recommended way to version-control recognizer configuration and to share setups across deployments Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py and presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py].

Community note (issue #2080): When recognizers with dedicated YAML config models are configured through analyzer YAML, omitted optional fields can be serialized as explicit None values during registry validation. The config dump path for these recognizers may therefore emit fields with null values that are indistinguishable from fields the user deliberately set to null. This is a known serialization quirk to be aware of when diffing configuration files.
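The serialization quirk in the note above can be illustrated without Presidio. In this sketch RecognizerConfig is a hypothetical dataclass, not one of the library's config models; it shows why a dump that includes unset optional fields cannot be told apart from one where the user wrote null on purpose.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RecognizerConfig:
    """Hypothetical recognizer definition with two optional fields."""
    name: str
    supported_language: str = "en"
    context: Optional[list] = None
    deny_list: Optional[list] = None


def dump_all(config: RecognizerConfig) -> dict:
    """Dump every field, so omitted optionals appear as explicit None."""
    return asdict(config)


def dump_without_none(config: RecognizerConfig) -> dict:
    """Dump only fields that carry a value, treating None as "unset"."""
    return {key: value for key, value in asdict(config).items() if value is not None}


omitted = RecognizerConfig(name="ZipCode")                   # context left out
explicit = RecognizerConfig(name="ZipCode", context=None)    # context set to null
same_dump = dump_all(omitted) == dump_all(explicit)
```

Filtering out None before diffing, as dump_without_none does, is one way to follow the advice of treating null as unset; it has the same blind spot, since a deliberate null is dropped too.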

AnalyzerEngine and the `analyze` Pipeline

AnalyzerEngine is the high-level façade. It lazily instantiates its default NlpEngineProvider and RecognizerRegistry if none is supplied, and forwards language-specific setup to a per-language registry cache, so calling analyze with several languages does not rebuild the recognizer catalog each time Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py].

The analyze(text, entities, language, ...) method executes the following steps:

  1. Validate the language against the configured supported languages (["en"] unless supported_languages is passed to the constructor).
  2. Resolve the registry for the requested language via the engine provider; this also resolves the correct NLP engine and returns the matching list of EntityRecognizer subclasses.
  3. Iterate the recognizers, passing each one the text, the entity allow-list, and an analysis context. Each recognizer returns a list of RecognizerResult objects with entity_type, start, end, and score.
  4. Merge results across recognizers, dedupe overlaps, and drop results whose score falls below the score threshold (the engine's default_score_threshold is 0 unless configured otherwise, so nothing is filtered out by default).
  5. Return the final list of RecognizerResult objects to the caller (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py and presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py]).

The score threshold is the most common knob to tune in production: pass score_threshold to analyze, or set default_score_threshold on the engine. Lowering it surfaces more candidates (including borderline regex matches) at the cost of false positives; raising it tightens precision at the cost of recall. The analyze method also accepts return_decision_process, which attaches an explanation of how each result was produced and helps when debugging unexpected matches (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]).
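The precision/recall trade-off of the score threshold can be seen on a handful of made-up results. The sketch below uses plain dictionaries and a hypothetical filter_by_score helper rather than RecognizerResult objects; the scores are invented for illustration.

```python
def filter_by_score(results, score_threshold):
    """Keep only results whose score meets or exceeds the threshold."""
    return [r for r in results if r["score"] >= score_threshold]


# Invented scores: a strong NER hit, a borderline regex hit, a weak hit.
results = [
    {"entity_type": "PERSON", "score": 0.85},
    {"entity_type": "PHONE_NUMBER", "score": 0.40},
    {"entity_type": "US_SSN", "score": 0.05},
]

permissive = filter_by_score(results, 0.0)   # everything survives
balanced = filter_by_score(results, 0.35)    # drops only the weakest hit
strict = filter_by_score(results, 0.6)       # keeps only the strong hit
```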

Recognizer Categories

Three recognizer families ship with Presidio and are typically mixed in a single registry:

  • Pattern-based recognizers use regex and validation logic (e.g., Luhn checksum for credit cards, IBAN validation, URL parsing) and are deterministic and fast.
  • NLP-based recognizers wrap the underlying NER model returned by the NLP engine and translate generic model labels (e.g., PERSON, ORG) into Presidio's entity taxonomy.
  • Context-aware and LLM-based recognizers (e.g., the BasicLangExtractRecognizer and AzureOpenAILangExtractRecognizer) call out to a language model — local Ollama, Azure OpenAI, or Hugging Face NER — to detect entities that are difficult to capture with regexes. These require the optional presidio-analyzer[langextract] install, and they defer connectivity validation to the first call to analyze() Source: [presidio-analyzer/README.md and presidio-analyzer/presidio_analyzer/analyzer_engine.py].
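To make the pattern-based family above concrete, here is a minimal sketch of a regex paired with a Luhn checksum. It does not reproduce Presidio's credit card recognizer: CARD_PATTERN, luhn_valid, and find_card_numbers are illustrative names, and the pattern only matches unbroken digit runs.

```python
import re

# Assumption: card numbers appear as 13 to 19 consecutive digits.
CARD_PATTERN = re.compile(r"\b\d{13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True when the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Regex finds candidates; the checksum discards random digit runs."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


hits = find_card_numbers("paid with 4111111111111111, ref 4111111111111112")
```

The checksum is what keeps a deterministic recognizer precise: both numbers match the regex, but only the first is a structurally valid card number.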

Release 2.2.362 introduced a HuggingFaceNerRecognizer for direct NER model inference, providing a lower-friction alternative to spinning up an entire transformers NLP engine when only NER output is required.

Common Failure Modes

  • No model installed for the requested language — the most common initialization error; resolved by running python -m spacy download <model> for spaCy or installing the matching Stanza model.
  • Unsupported language: AnalyzerEngine.analyze raises if language is not in the configured supported_languages. Pass supported_languages=["en","es",...] to the constructor to extend the set (Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py]).
  • Low recall on a target entity — first check the default-score threshold, then verify the recognizer is not one of the country-specific recognizers disabled by default since 2.2.359; explicitly re-enable it via the registry or YAML configuration when targeting that locale Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].
  • YAML recognizer loads with None fields — known issue #2080; treat null YAML values as "unset" when diffing generated configuration dumps.

See Also

  • Presidio Anonymizer — consumes RecognizerResult objects and applies reversible or irreversible operators.
  • Presidio Image Redactor — calls the analyzer as an OCR post-processor for standard images and DICOM.
  • Presidio Structured — uses analyzer detection to map tabular columns to PII entities before anonymization.

Anonymization, Image Redaction & DICOM Processing

Related topics: Presidio Overview & System Architecture, Analyzer: PII Detection, NLP Engines & Recognizers, Structured Data, CLI, Deployment & Extensibility

Overview & Scope

Presidio ships three complementary capabilities that operate downstream of PII detection: text anonymization, optical image redaction, and DICOM (medical imaging) redaction. Together they cover the "de-identify" half of the Presidio pipeline — presidio-analyzer finds the PII, and these engines remove or mask it. The anonymizer works purely on text spans returned by the analyzer, while the image redactor extends the same recognizer pipeline to pixels and burnt-in text inside images and DICOM frames. Each package is independently installable and independently deployable as a REST service or Python library.

The text anonymizer also exposes a Deanonymizer (the inverse operation), a batch engine for CSV/JSON files, and an overlap-conflict resolution strategy so that nested or intersecting PII spans produce stable, predictable output.

Text Anonymization with `presidio-anonymizer`

The AnonymizerEngine consumes a list of RecognizerResult objects (typically emitted by AnalyzerEngine) and applies a per-entity operator chain to produce redacted text. Operators are looked up via the factory in operators/operators_factory.py and selected per entity type, with DEFAULT acting as a global fallback (Source: [presidio-anonymizer/README.md]).

Built-in Operators

The anonymizer ships with a fixed set of built-in operators, each accepting specific parameters:

Operator | Purpose | Key Parameters
replace | Substitute with a static value (defaults to <entity_type>) | new_value
redact | Remove the PII completely | (none)
hash | Cryptographic hash of the PII (sha256/sha512) | hash_type
mask | Replace with a repeated character | chars_to_mask, masking_char, from_end
encrypt | AES (Rijndael) encryption, reversible via deanonymizer | key (128/192/256-bit)
custom | Apply a user lambda to the PII string | lambda
ahds surrogate | Call Azure Health Data Services de-identification service | endpoint, entities, input_locale, surrogate_locale

The Encrypt operator and the Decrypt deanonymizer together form a reversible pair; the key length must be 128, 192, or 256 bits and is provided as a string. The AHDS surrogate operator was added in release 2.2.360 and requires the presidio-anonymizer[ahds] extra (Source: [presidio-anonymizer/README.md]).
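The parameters listed for mask, hash, and encrypt can be explored with small standard-library stand-ins. None of these functions is Presidio code; mask, hash_value, and check_aes_key are hypothetical helpers that mirror the parameter names in the table, and the key check assumes the string key is measured in UTF-8 bytes.

```python
import hashlib


def mask(value: str, chars_to_mask: int, masking_char: str = "*", from_end: bool = False) -> str:
    """Replace chars_to_mask characters with masking_char, from the start or the end."""
    count = max(0, min(chars_to_mask, len(value)))
    if from_end:
        return value[:len(value) - count] + masking_char * count
    return masking_char * count + value[count:]


def hash_value(value: str, hash_type: str = "sha256") -> str:
    """Return the hex digest of the value using sha256 or sha512."""
    if hash_type not in ("sha256", "sha512"):
        raise ValueError(f"unsupported hash_type: {hash_type}")
    return hashlib.new(hash_type, value.encode("utf-8")).hexdigest()


def check_aes_key(key: str) -> int:
    """Return the key length in bits, or raise if it is not 128, 192, or 256."""
    bits = len(key.encode("utf-8")) * 8
    if bits not in (128, 192, 256):
        raise ValueError(f"AES key must be 128, 192, or 256 bits, got {bits}")
    return bits


masked_card = mask("4111111111111111", 12)
masked_mail = mask("john@example.com", 4, "#", from_end=True)
digest = hash_value("John Doe")
key_bits = check_aes_key("WmZq4t7w!z%C&F)J")  # 16 characters, so 128 bits
```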

Overlap Conflict Resolution

When two recognizers return overlapping spans, the AnonymizerEngine relies on the ConflictResolutionStrategy enum in entities/conflict_resolution_strategy.py. The documented behavior covers three cases: spans covering exactly the same text resolve by score (higher wins, ties are arbitrary); a span contained inside another resolves to the longer span even when the longer span's score is lower; and partially intersecting spans are anonymized independently, with the replacements concatenated in the output. The last case produces the well-known pattern where, in the README's example, an overlapping person span (George Washington) and location span (Washington State Park) collapse to <PERSON><LOCATION> (Source: [presidio-anonymizer/README.md]).
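The three overlap rules can be sketched in a few lines. This is an illustration of the documented behavior, not the AnonymizerEngine implementation: Span, resolve, and render are hypothetical names, and clipping the later span's start for partial intersections is an assumption made here so that the two replacements end up side by side.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


def resolve(spans):
    """Drop duplicate and contained spans, then clip partial intersections."""
    kept = []
    # Longest first, then highest score: a span's "winner" is always seen before it.
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), -s.score)):
        covered = any(k.start <= span.start and span.end <= k.end for k in kept)
        if not covered:
            kept.append(span)
    kept.sort(key=lambda s: s.start)
    clipped = []
    for span in kept:
        # Partial intersection: start the later span where the earlier one ends.
        if clipped and span.start < clipped[-1].end:
            span = replace(span, start=clipped[-1].end)
        clipped.append(span)
    return clipped


def render(text, spans):
    """Replace resolved spans with <ENTITY_TYPE>, right to left."""
    for span in sorted(resolve(spans), key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


# Phrase adapted for illustration; offsets are written by hand.
sentence = "I'm George Washington Square Park."
output = render(sentence, [Span("PERSON", 4, 21, 0.85), Span("LOCATION", 11, 33, 0.85)])
```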

Deanonymization & Batch Processing

The deanonymizer (DeanonymizeEngine) is the symmetric counterpart and currently exposes a single Decrypt operator that uses AES to reverse the Encrypt operation (Source: [presidio-anonymizer/presidio_anonymizer/deanonymize_engine.py]). For dataset-scale workflows, BatchAnonymizerEngine applies the same per-entity operator pipeline to lists and dictionaries of values (for example, rows loaded from a CSV or JSON file), consuming the per-item RecognizerResult output produced by the analyzer's batch engine (Source: [presidio-anonymizer/presidio_anonymizer/batch_anonymizer_engine.py]).

Image Redaction with `presidio-image-redactor`

The image redactor is a separate package that detects PII in raster images (PNG/JPG/etc.) and in DICOM medical frames, then draws filled rectangles over the detected regions. It depends on Tesseract OCR for text extraction and on presidio-analyzer for entity classification, so the same recognizer set used for plain text applies to pixels.

Standard Image Redaction

ImageRedactorEngine is the entry point for ordinary raster images. Its primary method accepts a PIL.Image and an optional color fill (int or (R,G,B) tuple, default black) and returns a redacted image with rectangles drawn over the bounding boxes returned by the OCR + analyzer pipeline. A small HTTP service is exposed via POST /redact, accepting a multipart form with the image and a data field of the shape {'color_fill':'0,0,0'} (Source: [presidio-image-redactor/README.md]).
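The final drawing step can be pictured on a toy image. The sketch below avoids PIL entirely: the image is a nested list of grayscale values, the bounding boxes are supplied by hand as (left, top, width, height) tuples, and redact_boxes and parse_color_fill are hypothetical helpers rather than ImageRedactorEngine methods.

```python
def parse_color_fill(value: str):
    """Turn a form value such as '0,0,0' into an (R, G, B) tuple, or '0' into an int."""
    parts = tuple(int(part) for part in value.split(","))
    return parts[0] if len(parts) == 1 else parts


def redact_boxes(pixels, boxes, fill=0):
    """Return a copy of the image with every bounding box painted over."""
    redacted = [row[:] for row in pixels]  # leave the original image intact
    height = len(redacted)
    width = len(redacted[0]) if redacted else 0
    for left, top, box_width, box_height in boxes:
        # Clamp to the image so boxes that run past the edge are still safe.
        for y in range(max(0, top), min(height, top + box_height)):
            for x in range(max(0, left), min(width, left + box_width)):
                redacted[y][x] = fill
    return redacted


image = [[255] * 4 for _ in range(3)]          # 4x3 white image
painted = redact_boxes(image, [(1, 0, 2, 2)])  # black box over the top middle
fill_color = parse_color_fill("0,0,0")
```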

DICOM Image Redaction

DICOM redaction is handled by DicomImageRedactorEngine, which works on pydicom datasets rather than PIL.Image objects. Four entry points are documented in the README: redact(dicom_image, fill=...) for in-memory redaction, redact_and_return_bbox(...) when callers also need the bounding boxes, redact_from_file(input_path, output_dir, ...) for single-file workflows that persist JSON bbox dumps, and redact_from_directory(...) for recursive batch jobs. Padding and ocr_kwargs (e.g. ocr_threshold) are configurable per call (Source: [presidio-image-redactor/README.md]).

Scope note: the redactor scrubs burnt-in text in *pixel data only*; it does not touch the structured DICOM metadata headers. The README explicitly recommends pairing it with the Tools for Health Data Anonymization package for metadata scrubbing (Source: [presidio-image-redactor/README.md]).

Architecture & Data Flow

The end-to-end flow for both text and image modalities is uniform: detect → resolve conflicts → redact. The diagram below summarizes the image/DICOM path; text anonymization follows the same shape with the OCR step omitted.

flowchart LR
    A[Input: text / image / DICOM] --> B{Modality?}
    B -- Text --> C[AnalyzerEngine]
    B -- Image --> D[Tesseract OCR]
    B -- DICOM --> D
    D --> C
    C --> E[RecognizerResults]
    E --> F[ConflictResolutionStrategy]
    F --> G{Output sink}
    G -- Text --> H[AnonymizerEngine<br/>or BatchAnonymizerEngine]
    G -- Image --> I[ImageRedactorEngine]
    G -- DICOM --> J[DicomImageRedactorEngine]
    H --> K[Redacted text]
    I --> L[Redacted PNG]
    J --> M[Redacted DICOM + bbox JSON]

Known Issues & Community Notes

Several recurring community issues are worth flagging before adopting the image/DICOM pipeline:

  • Sorted-result bug in PII verification: DicomImagePiiVerifyEngine._remove_duplicate_entities calls sorted() on overlapping entity candidates but discards the sorted list, so the *lowest*-scored entity is kept instead of the highest. Track issue #2083 for the fix.
  • DICOM metadata edge cases — Release 2.2.354 shipped a fix for "wrong condition for dicom metadata" (#1347), so older versions may misclassify or skip header fields.
  • Evaluation notebook drift — The example_dicom_redactor_evaluation.ipynb sample has historically failed with an error (issue #1251); users running evaluation pipelines should pin a known-good commit or verify against the current notebook.
  • Country-specific recognizers — As of release 2.2.359, country-specific predefined recognizers are disabled by default to reduce false positives. Adopters needing, for example, Philippines-specific PII (see issue #2015) must opt them in explicitly.
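The sorted-result bug listed first above belongs to a common class of Python mistakes: sorted() returns a new list and leaves its argument untouched. The sketch below reproduces the pattern on invented data; keep_best_buggy and keep_best_fixed are hypothetical functions, not the code from DicomImagePiiVerifyEngine.

```python
def keep_best_buggy(candidates):
    """Intends to keep the highest-scored candidate, but ignores the sort."""
    sorted(candidates, key=lambda c: c["score"], reverse=True)  # result discarded
    # The list is still in its original order, so this returns whichever
    # candidate happened to come first, which may be the lowest-scored one.
    return candidates[0]


def keep_best_fixed(candidates):
    """Keeps the highest-scored candidate by using the sorted result."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return ranked[0]


# Invented overlapping candidates, lowest score first.
overlapping = [
    {"entity_type": "LOCATION", "score": 0.4},
    {"entity_type": "PERSON", "score": 0.9},
]
buggy_choice = keep_best_buggy(overlapping)
fixed_choice = keep_best_fixed(overlapping)
```

An in-place candidates.sort(...) would also fix the bug, at the cost of mutating the caller's list.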

For long file paths on Windows during DICOM batch jobs, the README also recommends enabling Win32 long paths — a common practical failure point (Source: [presidio-image-redactor/README.md]).

See Also

  • presidio-analyzer — PII detection upstream of every operation described here
  • presidio-structured — tabular/JSON de-identification that reuses the same recognizers
  • AHDS De-identification Service integration — remote recognizer + ahds surrogate operator introduced in 2.2.360

Structured Data, CLI, Deployment & Extensibility

Related topics: Presidio Overview & System Architecture, Analyzer: PII Detection, NLP Engines & Recognizers, Anonymization, Image Redaction & DICOM Processing

This page describes four cross-cutting capabilities of the Presidio ecosystem: the Structured module for tabular/semi-structured data, the CLI tool for batch text analysis, the supported Deployment topologies (Azure, Docker, HTTP), and the Extensibility surface (recognizers, operators, language models). It is intended for engineers integrating Presidio into pipelines and for operators who need to choose a deployment topology.

1. Structured Data with `presidio-structured`

presidio-structured extends Presidio beyond free text. It uses presidio-analyzer to detect PII at the column/key level, then uses presidio-anonymizer operators to redact values. The package exposes a StructuredEngine and a PandasAnalysisBuilder, both installed via pip install presidio-structured (presidio-structured/README.md).

The typical pattern is to (1) build a tabular_analysis that maps columns/keys to detected entity types, (2) define an OperatorConfig map per entity, and (3) invoke the engine over a pandas.DataFrame or a JSON document. The Faker library can be wired in via OperatorConfig("replace", {"new_value": fake.name()}) to produce realistic surrogates (presidio-structured/README.md).

A notable capability added in release 2.2.354 is user-defined entity selection strategies, allowing callers to plug in their own logic for choosing which entities get anonymized per column (Release 2.2.354). Community requests such as Philippines-specific predefined recognizers (Issue #2015) point to ongoing expansion of the country-specific recognizers usable from presidio-structured.
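A pluggable entity selection strategy can be sketched as a function passed into the step that builds the column map. Everything here is hypothetical: the per-cell detections are invented, and most_common_entity and build_column_map are illustrative names, not the presidio-structured interface.

```python
from collections import Counter


def most_common_entity(cell_entities):
    """Default strategy: pick the entity detected in the most cells."""
    counts = Counter(entity for entity in cell_entities if entity is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def build_column_map(detections, strategy=most_common_entity):
    """Reduce per-cell detections to one entity per column using the strategy."""
    mapping = {}
    for column, cell_entities in detections.items():
        entity = strategy(cell_entities)
        if entity is not None:
            mapping[column] = entity
    return mapping


# Invented per-cell detections: None means nothing was found in that cell.
detections = {
    "name": ["PERSON", "PERSON", "LOCATION"],
    "notes": [None, None, None],
    "email": ["EMAIL_ADDRESS", None, "EMAIL_ADDRESS"],
}
default_map = build_column_map(detections)

# A user-defined strategy: flag a column as PERSON if any cell looked like one.
any_person = lambda cells: "PERSON" if "PERSON" in cells else None
custom_map = build_column_map(detections, strategy=any_person)
```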

2. CLI: `presidio-cli`

presidio-cli is a thin wrapper around presidio-analyzer that scans files and directories for PII. It is installed with pip install presidio-cli and requires Python 3.10–3.13 and poetry for source builds (presidio-cli/README.md).

Configuration can be supplied via a YAML file (-c) or inline (-d). The YAML keys map directly to presidio-analyzer parameters:

YAML key | Purpose | Example value
language | NLP language for analysis | en
ignore | Glob patterns to skip | .git, *.cfg
entities | Restrict detection to a subset | [PERSON, EMAIL_ADDRESS]
allow | Allow-list tokens that should not be flagged | list of strings

The CLI supports four output formats selected with -f / --format: standard, github (CI annotations), colored, and parsable (one JSON object per finding). A fifth value, auto, picks github inside GitHub Actions and colored otherwise (presidio-cli/README.md). Running presidio . scans the current directory using a .presidiocli config if present; the --help flag enumerates every option.
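How ignore patterns such as .git and *.cfg narrow a scan can be sketched with fnmatch. This is an assumption-laden illustration, not the presidio-cli matcher: is_ignored and files_to_scan are hypothetical helpers, and matching each path component against each pattern is a simplification of whatever glob semantics the CLI applies.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_ignored(path: str, patterns) -> bool:
    """Return True when any component of the path matches any ignore pattern."""
    parts = PurePosixPath(path).parts
    return any(fnmatch(part, pattern) for part in parts for pattern in patterns)


def files_to_scan(paths, patterns):
    """Keep only the paths that no ignore pattern excludes."""
    return [path for path in paths if not is_ignored(path, patterns)]


ignore = [".git", "*.cfg"]
candidates = [".git/config", "setup.cfg", "docs/readme.md", "src/app.py"]
scanned = files_to_scan(candidates, ignore)
```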

3. Deployment Topologies

Presidio is shipped as multiple independently installable Python packages and as containerized services. The top-level presidio PyPI package is a meta-package that pulls in presidio-analyzer and presidio-anonymizer only — it contains no code of its own (presidio/README.md).

Three deployment paths are documented: Azure, Docker, and the HTTP service.

GPU acceleration is supported for analyzer workloads by installing the matching CUDA build of cupy (e.g. cupy-cuda12x) on Linux/NVIDIA; macOS/Apple Silicon currently falls back to CPU for PyTorch operations (presidio-analyzer/README.md).

4. Extensibility

Presidio is designed to be extended along three axes: custom recognizers, custom anonymizer operators, and language-model backends.

  • Recognizers. Predefined recognizers rely on regex, NER, and checksum logic, and can be replaced or augmented with custom classes that subclass the analyzer's EntityRecognizer contract. Release 2.2.355 moved predefined recognizers onto a config-file foundation, making it easier to override defaults without editing code (Release 2.2.355).
  • Anonymizer operators. Built-in operators include replace, redact, hash (sha256/sha512; md5 was deprecated in 2.2.358), mask, encrypt (AES), custom (lambda), and the AHDS Surrogate operator that calls the Azure Health Data Services de-identification service for medically-appropriate surrogates (presidio-anonymizer/README.md, Release 2.2.358).
  • Language-model backends. Release 2.2.362 introduced HuggingFaceNerRecognizer for direct NER model inference, and presidio-analyzer now ships LangExtract-backed recognizers (BasicLangExtractRecognizer, AzureOpenAILangExtractRecognizer) supporting Ollama and Azure OpenAI providers via the presidio-analyzer[langextract] extra (Release 2.2.362, presidio-analyzer/README.md).

Known Extensibility Caveats

Community-reported issues show that the YAML-driven configuration path can serialize omitted optional recognizer fields as explicit None, which downstream registry validation may reject (Issue #2080). Similarly, the DICOM verification engine has had a sorting bug in which _remove_duplicate_entities discarded its sorted() result and kept the lowest-scored entity, a reminder that custom extensions should re-validate precedence logic when overriding built-ins (Issue #2083). The DICOM redactor evaluation notebook has also been reported broken in the past; users integrating custom evaluation harnesses should pin notebook revisions accordingly (Issue #1251).
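Both pitfalls are easy to reproduce in isolation. The sketch below is not Presidio's code: the function names and the dictionary shape of an entity are hypothetical, chosen only to show the general pattern of a discarded sorted() result and of None values leaking into a config.

```python
def pick_best_buggy(entities):
    # sorted() returns a new list; without an assignment the result is lost
    # and the input order decides which entity wins.
    sorted(entities, key=lambda e: e["score"], reverse=True)
    return entities[0]


def pick_best(entities):
    # Keep the sorted result, then take the highest-scored entity.
    ranked = sorted(entities, key=lambda e: e["score"], reverse=True)
    return ranked[0]


def drop_none_fields(config):
    # Omit unset optional fields instead of passing explicit None downstream.
    return {key: value for key, value in config.items() if value is not None}
```

When the input happens to arrive lowest score first, pick_best_buggy returns the lowest-scored entity, which matches the symptom described above. A regression test that feeds entities in ascending score order catches this class of bug.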

See Also

  • Analyzer, Anonymizer, and Image Redactor package READMEs (linked above) for component-level details.
  • Release notes for version-specific behavior changes (e.g. country recognizers defaulted to disabled in 2.2.359).
  • Public API spec at https://microsoft.github.io/presidio/api-docs/api-docs.html.

Source: https://github.com/microsoft/presidio / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 10 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.

1. Runtime risk: Runtime risk requires verification

  • Severity: high
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1251

2. Security or permission risk: Security or permission risk requires verification

  • Severity: high
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2080

3. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/microsoft/presidio

4. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2083

5. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

6. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | https://github.com/microsoft/presidio

7. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | https://github.com/microsoft/presidio

8. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1882

9. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: issue_or_pr_quality=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

10. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: release_recency=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 11 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using presidio with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence