Doramagic Project Pack · Human Manual
presidio
An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.
Presidio Overview & System Architecture
Related topics: Analyzer: PII Detection, NLP Engines & Recognizers, Anonymization, Image Redaction & DICOM Processing, Structured Data, CLI, Deployment & Extensibility
1. Purpose and Scope
Microsoft Presidio is a context-aware, pluggable, and customizable PII de-identification service for text and images. It is delivered as a monorepo containing several cooperating Python packages that together detect, redact, and anonymize personally identifiable information (PII) in unstructured text, structured/tabular data, and pixel data (images and DICOM medical files) (presidio/README.md:1-5).
The top-level presidio package is a thin convenience meta-package that has no code of its own — it simply installs presidio-analyzer and presidio-anonymizer as dependencies (presidio/README.md:7-11). All real functionality lives in the sub-packages.
Primary use cases observed across the repo include:
- Detecting PII in free-form text and substituting it with placeholders, hashes, or encrypted values.
- Redacting burnt-in text PHI in standard images and DICOM medical scans.
- Identifying and anonymizing PII in tabular data (pandas DataFrames) and JSON-like semi-structured records.
- Scanning source trees from the command line to flag PII inside code or documentation.
2. System Architecture
The repository is organized as a monorepo: each capability is its own installable Python package that exposes a Python API and, for some packages, a Dockerized REST service.
flowchart LR
User[User / Application]
subgraph Presidio["presidio monorepo"]
Analyzer["presidio-analyzer<br/>Detect PII in text"]
Anonymizer["presidio-anonymizer<br/>Replace / Encrypt / Hash"]
ImgRed["presidio-image-redactor<br/>Images + DICOM"]
Structured["presidio-structured<br/>Pandas / JSON tabular"]
CLI["presidio-cli<br/>Source-tree scanner"]
Meta["presidio<br/>meta-package (deps only)"]
end
User --> Analyzer
Analyzer -->|RecognizerResult[]| Anonymizer
User --> ImgRed
User --> Structured
User --> CLI
Meta -.bundles.-> Analyzer
Meta -.bundles.-> Anonymizer
Key architectural properties:
- Separation of detection and transformation. Detection is performed by `AnalyzerEngine` (presidio-analyzer/README.md:51-58), transformation by `AnonymizerEngine` (presidio-anonymizer/README.md:39-44). The boundary is the `RecognizerResult` list, which lets users swap one side without touching the other.
- Recognizer extensibility. Each predefined recognizer is responsible for one or more PII entity types using regex, NER, checksum validation, or external models (presidio-analyzer/README.md:3-9).
- Pluggable operators. Anonymization is composed of small operators (`replace`, `redact`, `hash`, `mask`, `encrypt`, `custom`, `surrogate`, etc.) configured per entity (presidio-anonymizer/README.md:11-37).
- Multiple deployment shapes. The same code path is exposed as a Python library, a `docker-compose` HTTP service, and (via the meta-package) a single `pip install presidio` (presidio/README.md:17-23).
3. Core Packages
| Package | Responsibility | Entry point | Source |
|---|---|---|---|
| `presidio-analyzer` | Detect PII entities in unstructured text; supports spaCy NER, regex, LangExtract/LLM, HuggingFace NER, and country-specific recognizers | `AnalyzerEngine().analyze(text, language)` | presidio-analyzer/README.md:55-60 |
| `presidio-anonymizer` | Apply anonymization operators to detected spans; reversible via the `Decrypt` deanonymizer | `AnonymizerEngine().anonymize(text, analyzer_results, operators=…)` | presidio-anonymizer/README.md:41-48 |
| `presidio-image-redactor` | OCR + PII detection on images; specialized DICOM pipeline using Tesseract and pydicom | `ImageRedactorEngine`, `DicomImageRedactorEngine` | presidio-image-redactor/README.md:41-55 |
| `presidio-structured` | Map DataFrame columns / JSON keys to entities, then anonymize values | `StructuredEngine` + `PandasAnalysisBuilder` | presidio-structured/README.md:5-15 |
| `presidio-cli` | Walk files in a directory and print PII hits with format options (standard, github, colored, parsable) | `presidio <path>` | presidio-cli/README.md:79-94 |
The analyzer ships with optional extras — e.g. presidio-analyzer[langextract] for LLM-based detection through Ollama or Azure OpenAI (presidio-analyzer/README.md:15-23), and presidio-anonymizer[ahds] for the Azure Health Data Services surrogate operator (presidio-anonymizer/README.md:31-37).
4. End-to-End Data Flow
For the canonical text path, the pipeline is:
- Analyze: `AnalyzerEngine.analyze(text, entities=…, language="en")` runs every registered recognizer and returns a list of `RecognizerResult` (entity type, span, score) (presidio-analyzer/README.md:51-60).
- Anonymize: `AnonymizerEngine.anonymize(text, analyzer_results, operators=…)` resolves overlapping spans, applies the operator configured for each entity type (falling back to the default operator), and produces an `EngineResult` (presidio-anonymizer/README.md:41-48).
- (Optional) Deanonymize: encrypted spans can be reversed with the `Decrypt` operator when the AES key is available (presidio-anonymizer/README.md:53-56).
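The two-stage shape of the text path can be sketched without Presidio installed. The block below is a toy illustration, not Presidio's implementation: `Span`, `detect`, and `transform` are hypothetical names, and a single e-mail regex stands in for the recognizer set. It only shows why a list of spans is a convenient boundary between detection and transformation.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-in for a detection result; the fields mirror the ones
# described above (entity type, span, score).
@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


# Deliberately simple e-mail pattern, used only for this illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def detect(text: str) -> list:
    """Detection step: report where e-mail-like substrings appear."""
    return [Span("EMAIL_ADDRESS", m.start(), m.end(), 1.0)
            for m in EMAIL_RE.finditer(text)]


def transform(text: str, spans: list) -> str:
    """Transformation step: swap each span for an <ENTITY_TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


sample = "Contact jane.doe@example.com or bob@example.org today."
spans = detect(sample)
redacted = transform(sample, spans)
print(redacted)
```

Because `transform` only needs offsets and entity types, either side can be replaced independently, which is the property the span-list boundary is meant to provide.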
For images, ImageRedactorEngine runs OCR (Tesseract), feeds the recognized text to AnalyzerEngine, and paints boxes over the original image; the DICOM variant additionally handles pydicom pixel data, metadata, and long Windows paths (presidio-image-redactor/README.md:41-55, 87-95).
For tabular data, StructuredEngine uses PandasAnalysisBuilder().generate_analysis(df) to derive a column→entity map, then delegates anonymization to presidio-anonymizer per cell (presidio-structured/README.md:17-27).
5. Configuration, Extensibility, and Known Pitfalls
- YAML recognizers. Recognizers can be declared in YAML; recent releases moved predefined recognizers to a config-file model (Release 2.2.355 changelog). A known sharp edge: omitted optional YAML fields can be serialized as explicit `None` during registry validation (Issue #2080).
- Country-specific recognizers. Starting with 2.2.359 most country-specific recognizers default to *disabled* to reduce false positives outside their locale (Release 2.2.359 notes). New PH-specific recognizers have been requested (Issue #2015).
- DICOM duplicate suppression. `DicomImagePiiVerifyEngine._remove_duplicate_entities` had a bug where `sorted()` was called but its result discarded, so the *lowest*-scored entity won; this is tracked in Issue #2083.
- DICOM evaluation notebook. The bundled sample notebook has reported failures; users should pin a known-good commit or apply the patches linked from the issue thread (Issue #1251).
- GPU / deployment. GPU acceleration is opt-in via `cupy-cuda12x`; containers moved to `gunicorn` in 2.2.356 (Release 2.2.356 notes).
6. Getting Started
# Minimal text pipeline
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text = "My name is John Doe"
results = analyzer.analyze(text=text, language="en")
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)  # the anonymized string
# Installation (shell)
pip install presidio # meta-package
pip install presidio-image-redactor # OCR + DICOM
pip install presidio-structured # pandas / JSON
pip install "presidio-analyzer[langextract]" # LLM-based detection (quoted so the shell does not expand the brackets)
(presidio-analyzer/README.md:15-23, presidio-image-redactor/README.md:23-29, presidio-structured/README.md:11-13)
See Also
- Analyzer internals — recognizer registration, custom recognizers, NER/LLM backends.
- Anonymizer operators — overlap handling, encryption, AHDS surrogate.
- Image Redactor — DICOM pipeline, OCR thresholds, PII verification engine.
- Structured — DataFrame/JSON analysis builders and operator mappings.
- CLI — output formats and CI integration.
Source: https://github.com/microsoft/presidio / Human Manual
Analyzer: PII Detection, NLP Engines & Recognizers
Related topics: Presidio Overview & System Architecture, Anonymization, Image Redaction & DICOM Processing, Structured Data, CLI, Deployment & Extensibility
The Presidio Analyzer is the detection half of the Presidio PII de-identification framework. It analyzes free-form text (and serves as a backend for the image, structured, and DICOM redactors) and returns a list of RecognizerResult objects describing where PII entities appear, what type they are, and a confidence score. This page documents how detection is wired together: the AnalyzerEngine, the pluggable NLP engine, and the recognizer registry.
High-Level Architecture
The analyzer is composed of three cooperating subsystems: an NLP engine that provides tokenization, lemmatization, and (optionally) named entity recognition; a recognizer registry that holds a curated set of detectors per language; and the AnalyzerEngine itself, which orchestrates the call to each recognizer, merges the results, and returns them to the caller.
flowchart LR
A[Caller] -->|analyze(text, entities, language)| B[AnalyzerEngine]
B --> C[NLP Engine Provider]
C -->|spaCy / stanza / transformers| D[NLP Engine]
B --> E[Recognizer Registry]
E -->|loads| F[Predefined Recognizers]
E -->|loads| G[Custom Recognizers from YAML/Code]
F --> H[EntityRecognizer subclasses]
G --> H
H -->|RecognizerResult list| B
B -->|merged, scored results| A
`AnalyzerEngine.__init__` wires the registry and the NLP engine together, and the `analyze` method is the single entry point for end users. Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py].
NLP Engine Provider
The NLP engine abstracts the linguistic backbone of the analyzer. It must provide tokenization, sentence splitting, lemmatization, and named entity recognition; the recognizers depend on these primitives to evaluate regexes against lemmas, to filter by POS tags, and to combine their own logic with model output.
The NlpEngineProvider reads an nlp_engine_name and an optional configuration dictionary, then instantiates one of three backends:
| Engine | Notes | Source |
|---|---|---|
| `spacy` (default) | Uses a spaCy model; `en_core_web_lg` is recommended for English. Supports GPU acceleration via `cupy-cuda12x` on Linux. | nlp_engine_provider.py |
| `stanza` | Stanford Stanza-based pipeline; used for languages not well served by spaCy. | nlp_engine_provider.py |
| `transformers` | Wraps a Hugging Face NER model directly, useful when a custom transformer is preferred over spaCy/stanza. | nlp_engine_provider.py |
GPU acceleration is controlled via environment variable as introduced in release 2.2.362; on macOS with Apple Silicon, MPS is not currently supported and PyTorch operations fall back to CPU Source: [presidio-analyzer/README.md].
The provider raises a ValueError when an unknown engine name is supplied, which is the most common configuration failure when bootstrapping a new language Source: [presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py].
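The fail-fast behaviour described above follows a common factory pattern, sketched below. Everything in the block is hypothetical (`ToyEngine`, `ENGINE_FACTORIES`, `create_engine`); it is not the `NlpEngineProvider` code, only an illustration of why a misspelled or differently-cased engine name surfaces as a `ValueError` at start-up rather than later during analysis.

```python
from typing import Optional


class ToyEngine:
    """Placeholder engine; real backends would load a language model here."""

    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = config


# Hypothetical factory table keyed by the three engine names listed above.
ENGINE_FACTORIES = {
    "spacy": lambda cfg: ToyEngine("spacy", cfg),
    "stanza": lambda cfg: ToyEngine("stanza", cfg),
    "transformers": lambda cfg: ToyEngine("transformers", cfg),
}


def create_engine(nlp_engine_name: str, config: Optional[dict] = None) -> ToyEngine:
    """Look the name up exactly; unknown names fail before any model is loaded."""
    try:
        factory = ENGINE_FACTORIES[nlp_engine_name]
    except KeyError:
        supported = ", ".join(sorted(ENGINE_FACTORIES))
        raise ValueError(
            f"Unknown NLP engine '{nlp_engine_name}'; expected one of: {supported}"
        ) from None
    return factory(config or {})


engine = create_engine("spacy", {"lang_code": "en"})
print(engine.name)
```

In this sketch the lookup is case-sensitive, so `"spaCy"` would be rejected; whether the real provider normalizes case is not stated in the source.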
Recognizer Registry
The registry is the catalog of detectors the engine will invoke. Each registry is bound to a single language and a single NLP engine instance; the RecognizerRegistry.__init__ accepts both, plus an optional list of recognizer classes, an optional list of custom recognizers, and a context dictionary (e.g., a list of supported entity types) Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].
The registry exposes two principal operations:
- `load_predefined_recognizers(...)`: populates the registry with Presidio's built-in recognizers (regex, deny-list, NER-based) filtered by language. Since release 2.2.359, most country-specific recognizers that expect English text are registered as disabled by default to avoid spurious false positives in non-target languages. Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].
- `add_recognizer(recognizer)` and `add_custom_recognizer(custom_recognizer)`: append user-supplied recognizers, which are validated against the registry's supported entity list. Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].
Internally, recognizers are kept in two ordered dictionaries: the standard recognizers map and a custom_recognizers map; both are iterated during analysis. The get_recognizers method returns a flat list combining both, and get_supported_entities returns the deduplicated union of all entity types across every loaded recognizer Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].
Loading Recognizers from Configuration
RecognizerRegistryProvider constructs one or more registries (one per language) from a configuration dictionary. It delegates to RecognizersLoader (in recognizers_loader_utils.py) to parse three kinds of entries: built-in recognizer names, Python import paths, and YAML-defined recognizer definitions. This is the recommended way to version-control recognizer configuration and to share setups across deployments Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py and presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py].
Community note (issue #2080): When recognizers with dedicated YAML config models are configured through analyzer YAML, omitted optional fields can be serialized as explicit `None` values during registry validation. The config dump path for these recognizers may therefore emit fields with `null` values that are indistinguishable from fields the user deliberately set to `null`. This is a known serialization quirk to be aware of when diffing configuration files.
AnalyzerEngine and the `analyze` Pipeline
AnalyzerEngine is the high-level façade. It lazily instantiates its default NlpEngineProvider and RecognizerRegistry if none is supplied, and forwards language-specific setup to a per-language registry cache, so calling analyze with several languages does not rebuild the recognizer catalog each time Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py].
The analyze(text, entities, language, ...) method executes the following steps:
- Validate the language against the configured supported languages, defaulting to `en` when not specified.
- Resolve the registry for the requested language via the engine provider; this also resolves the correct NLP engine and returns the matching list of `EntityRecognizer` subclasses.
- Iterate the recognizers, passing each one the text, the entity allow-list, and an analysis context. Each recognizer returns a list of `RecognizerResult` objects with `entity_type`, `start`, `end`, and `score`.
- Merge results across recognizers, dedupe overlaps, and apply the score threshold (the per-call `score_threshold` if given, otherwise the engine's `default_score_threshold`, which is `0` unless configured).
- Return the final list of `RecognizerResult` objects to the caller. Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py and presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py].
The `score_threshold` parameter is the most common knob to tune in production. Lowering it surfaces more candidates (including borderline regex matches) at the cost of false positives; raising it tightens precision at the cost of recall. `analyze` also accepts a `return_decision_process` flag, which attaches an explanation of how each result was scored and is useful when tuning thresholds. Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py].
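The precision/recall trade-off of the threshold can be seen on a handful of made-up candidates. The sketch below is illustrative only: `Hit` and `apply_threshold` are hypothetical names, the scores are invented, and the inclusive comparison (`>=`) is an assumption rather than a statement about Presidio's internals.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical detection candidate with a confidence score."""
    entity_type: str
    start: int
    end: int
    score: float


def apply_threshold(hits, score_threshold: float):
    """Keep only candidates scoring at or above the threshold (assumed inclusive)."""
    return [h for h in hits if h.score >= score_threshold]


# Invented candidates: one strong NER hit, one borderline, one weak regex match.
candidates = [
    Hit("PERSON", 11, 19, 0.85),
    Hit("PHONE_NUMBER", 30, 42, 0.40),
    Hit("US_SSN", 50, 61, 0.05),
]

for threshold in (0.0, 0.3, 0.6):
    kept = apply_threshold(candidates, threshold)
    print(threshold, [h.entity_type for h in kept])
```

Each step up in the threshold drops the weakest remaining candidate, which is the behaviour to keep in mind when a low-scoring pattern match matters for recall.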
Recognizer Categories
Three recognizer families ship with Presidio and are typically mixed in a single registry:
- Pattern-based recognizers use regex and validation logic (e.g., Luhn checksum for credit cards, IBAN validation, URL parsing) and are deterministic and fast.
- NLP-based recognizers wrap the underlying NER model returned by the NLP engine and translate generic model labels (e.g., `PERSON`, `ORG`) into Presidio's entity taxonomy.
- Context-aware and LLM-based recognizers (e.g., the `BasicLangExtractRecognizer` and `AzureOpenAILangExtractRecognizer`) call out to a language model (local Ollama, Azure OpenAI, or Hugging Face NER) to detect entities that are difficult to capture with regexes. These require the optional `presidio-analyzer[langextract]` install, and they defer connectivity validation to the first call to `analyze()`. Source: [presidio-analyzer/README.md and presidio-analyzer/presidio_analyzer/analyzer_engine.py].
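As an illustration of the pattern-based family above, the following sketch pairs a regex with a Luhn checksum, the standard validation for payment card numbers. It is a self-contained toy, not Presidio's credit card recognizer: `CARD_RE`, `luhn_valid`, and `find_card_numbers` are hypothetical names, and the regex is intentionally loose.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def find_card_numbers(text: str):
    """Regex proposes candidates; the checksum rejects look-alikes."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]


sample = "Cards: 4111 1111 1111 1111 and 4111 1111 1111 1112."
found = find_card_numbers(sample)
print(found)
```

The regex alone would flag both numbers; the checksum step is what keeps a deterministic recognizer from reporting arbitrary 16-digit strings.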
Release 2.2.362 introduced a HuggingFaceNerRecognizer for direct NER model inference, providing a lower-friction alternative to spinning up an entire transformers NLP engine when only NER output is required.
Common Failure Modes
- No model installed for the requested language: the most common initialization error; resolved by running `python -m spacy download <model>` for spaCy or installing the matching Stanza model.
- Unsupported language: `AnalyzerEngine.analyze` raises if `language` is not in the configured `supported_languages`. Pass `supported_languages=["en","es",...]` to the constructor to extend the set. Source: [presidio-analyzer/presidio_analyzer/analyzer_engine.py].
- Low recall on a target entity: first check the default score threshold, then verify the recognizer is not one of the country-specific recognizers disabled by default since 2.2.359; explicitly re-enable it via the registry or YAML configuration when targeting that locale. Source: [presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py].
- YAML recognizer loads with `None` fields: known issue #2080; treat `null` YAML values as "unset" when diffing generated configuration dumps.
See Also
- Presidio Anonymizer: consumes `RecognizerResult` objects and applies reversible or irreversible operators.
- Presidio Image Redactor: calls the analyzer as an OCR post-processor for standard images and DICOM.
- Presidio Structured: uses analyzer detection to map tabular columns to PII entities before anonymization.
Anonymization, Image Redaction & DICOM Processing
Related topics: Presidio Overview & System Architecture, Analyzer: PII Detection, NLP Engines & Recognizers, Structured Data, CLI, Deployment & Extensibility
Overview & Scope
Presidio ships three complementary capabilities that operate downstream of PII detection: text anonymization, optical image redaction, and DICOM (medical imaging) redaction. Together they cover the "de-identify" half of the Presidio pipeline — presidio-analyzer finds the PII, and these engines remove or mask it. The anonymizer works purely on text spans returned by the analyzer, while the image redactor extends the same recognizer pipeline to pixels and burnt-in text inside images and DICOM frames. Each package is independently installable and independently deployable as a REST service or Python library.
The text anonymizer also exposes a Deanonymizer (the inverse operation), a batch engine for CSV/JSON files, and an overlap-conflict resolution strategy so that nested or intersecting PII spans produce stable, predictable output.
Text Anonymization with `presidio-anonymizer`
The AnonymizerEngine consumes a list of RecognizerResult objects (typically emitted by AnalyzerEngine) and applies a per-entity operator chain to produce redacted text. Operators are looked up via the factory in operators/operators_factory.py and selected per entity type, with DEFAULT acting as a global fallback (Source: [presidio-anonymizer/README.md]).
Built-in Operators
The anonymizer ships with a fixed set of built-in operators, each accepting specific parameters:
| Operator | Purpose | Key Parameters |
|---|---|---|
| `replace` | Substitute with a static value (defaults to `<entity_type>`) | `new_value` |
| `redact` | Remove the PII completely | none |
| `hash` | Cryptographic hash of the PII (sha256/sha512) | `hash_type` |
| `mask` | Replace with a repeated character | `chars_to_mask`, `masking_char`, `from_end` |
| `encrypt` | AES (Rijndael) encryption, reversible via deanonymizer | `key` (128/192/256-bit) |
| `custom` | Apply a user lambda to the PII string | `lambda` |
| `ahds surrogate` | Call Azure Health Data Services de-identification service | `endpoint`, `entities`, `input_locale`, `surrogate_locale` |
The Encrypt operator and the Decrypt deanonymizer together form a reversible pair; the key length must be 128, 192, or 256 bits and is provided as a string. The AHDS surrogate operator was added in release 2.2.360 and requires the presidio-anonymizer[ahds] extra (Source: [presidio-anonymizer/README.md]).
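To make the operator parameters in the table concrete, here is a minimal pure-Python sketch of what `mask` and `hash` do to a single value. The functions `mask` and `hash_value` are hypothetical re-implementations for illustration: the clamping of `chars_to_mask` to the value length and the plain unsalted digest are assumptions, and the library's operators may differ in details such as salting or validation.

```python
import hashlib


def mask(value: str, chars_to_mask: int, masking_char: str = "*",
         from_end: bool = False) -> str:
    """Replace `chars_to_mask` characters with `masking_char`, from the start or the end."""
    # Assumption: requests longer than the value mask the whole value.
    n = max(0, min(chars_to_mask, len(value)))
    if n == 0:
        return value
    if from_end:
        return value[:len(value) - n] + masking_char * n
    return masking_char * n + value[n:]


def hash_value(value: str, hash_type: str = "sha256") -> str:
    """Return the hex digest of the value (unsalted, for illustration only)."""
    if hash_type not in ("sha256", "sha512"):
        raise ValueError(f"Unsupported hash_type: {hash_type}")
    return hashlib.new(hash_type, value.encode("utf-8")).hexdigest()


print(mask("4111111111111111", 12))              # keep the last four digits
print(mask("555-1234", 4, "#", from_end=True))   # hide the trailing digits
print(len(hash_value("John Doe")), len(hash_value("John Doe", "sha512")))
```

Masking preserves length and partial readability, while hashing yields a fixed-length token that stays consistent for equal inputs; which one fits depends on whether downstream consumers need to join on the value.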
Overlap Conflict Resolution
When two recognizers return overlapping spans, the AnonymizerEngine relies on the ConflictResolutionStrategy enum in entities/conflict_resolution_strategy.py. The behavior is documented as: full-span overlaps resolve by score (higher wins, ties are arbitrary); one span contained inside another resolves to the *longer* span even when its score is lower; and partial intersections are anonymized independently and concatenated in the output. This produces the well-known pattern where George Washington and Washington State Park collapse to <PERSON><LOCATION> (Source: [presidio-anonymizer/README.md]).
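The three documented rules can be sketched as a small resolver. This is an illustration of the rules as stated above, not the library's code: `Span`, `resolve_conflicts`, and `anonymize` are hypothetical names, and trimming the later span of a partial intersection so that it starts where the earlier one ends is an assumption chosen to reproduce the `<PERSON><LOCATION>` outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


def resolve_conflicts(spans):
    """Apply the documented rules: equal spans by score, containment by length."""
    # Longest first, then highest score, so the winner of each conflict is seen first.
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), -s.score))
    kept = []
    for s in ordered:
        # Drop spans equal to, or fully inside, a span that was already kept.
        if any(k.start <= s.start and s.end <= k.end for k in kept):
            continue
        kept.append(s)
    # Partial intersections: keep both, trimming the later span (assumption).
    kept.sort(key=lambda s: s.start)
    resolved = []
    for s in kept:
        if resolved and s.start < resolved[-1].end:
            s = Span(s.entity_type, resolved[-1].end, s.end, s.score)
        resolved.append(s)
    return resolved


def anonymize(text: str, spans) -> str:
    """Replace each resolved span with an <ENTITY_TYPE> placeholder, right to left."""
    for s in sorted(resolve_conflicts(spans), key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"<{s.entity_type}>" + text[s.end:]
    return text


park = "George Washington State Park"
partial = [Span("PERSON", 0, 17, 0.85), Span("LOCATION", 7, 28, 0.85)]
print(anonymize(park, partial))

call = "Call David Smith"
contained = [Span("PERSON", 5, 16, 0.5), Span("FIRST_NAME", 5, 10, 0.9)]
print(anonymize(call, contained))
```

Note that in the contained case the longer `PERSON` span wins despite its lower score, which matches the documented rule and is worth remembering when a high-confidence short match seems to disappear from the output.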
Deanonymization & Batch Processing
Deanonymizer is the symmetric counterpart and currently exposes a single Decrypt operator that uses AES to reverse the Encrypt operation (Source: [presidio-anonymizer/presidio_anonymizer/deanonymize_engine.py]). For dataset-scale workflows, BatchAnonymizerEngine accepts CSV or JSON files and applies the same per-entity operator pipeline row-by-row, writing the redacted output alongside an "analyzed" file containing the RecognizerResult metadata (Source: [presidio-anonymizer/presidio_anonymizer/batch_anonymizer_engine.py]).
Image Redaction with `presidio-image-redactor`
The image redactor is a separate package that detects PII in raster images (PNG/JPG/etc.) and in DICOM medical frames, then draws filled rectangles over the detected regions. It depends on Tesseract OCR for text extraction and on presidio-analyzer for entity classification, so the same recognizer set used for plain text applies to pixels.
Standard Image Redaction
ImageRedactorEngine is the entry point for ordinary raster images. Its primary method accepts a PIL.Image and an optional color fill (int or (R,G,B) tuple, default black) and returns a redacted image with rectangles drawn over the bounding boxes returned by the OCR + analyzer pipeline. A small HTTP service is exposed via POST /redact, accepting a multipart form with the image and a data field of the shape {'color_fill':'0,0,0'} (Source: [presidio-image-redactor/README.md]).
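The redaction step itself amounts to painting filled rectangles over bounding boxes. The sketch below shows that idea on a tiny grayscale image stored as a nested list, so it runs without Pillow or Tesseract. The function `redact_boxes` and the `(left, top, width, height)` box layout are hypothetical choices for this illustration and are not taken from the package.

```python
def redact_boxes(pixels, boxes, fill=0):
    """Return a copy of a grayscale image with every box filled with `fill`.

    `pixels` is a list of rows; each box is (left, top, width, height).
    Boxes that extend past the image edge are clipped.
    """
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    out = [row[:] for row in pixels]  # leave the input image untouched
    for left, top, box_w, box_h in boxes:
        for y in range(max(0, top), min(height, top + box_h)):
            for x in range(max(0, left), min(width, left + box_w)):
                out[y][x] = fill
    return out


# A 6x4 all-white image with one hypothetical OCR bounding box.
image = [[255] * 6 for _ in range(4)]
redacted = redact_boxes(image, [(1, 1, 3, 2)])
for row in redacted:
    print(row)
```

With real images the boxes would come from the OCR and analyzer stages, and the fill would be the configured color rather than a single grayscale value.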
DICOM Image Redaction
DICOM redaction is handled by DicomImageRedactorEngine, which works on pydicom datasets rather than PIL.Image objects. Four entry points are documented in the README: redact(dicom_image, fill=...) for in-memory redaction, redact_and_return_bbox(...) when callers also need the bounding boxes, redact_from_file(input_path, output_dir, ...) for single-file workflows that persist JSON bbox dumps, and redact_from_directory(...) for recursive batch jobs. Padding and ocr_kwargs (e.g. ocr_threshold) are configurable per call (Source: [presidio-image-redactor/README.md]).
Scope note: the redactor scrubs burnt-in text in *pixel data only*; it does not touch the structured DICOM metadata headers. The README explicitly recommends pairing it with the Tools for Health Data Anonymization package for metadata scrubbing (Source: [presidio-image-redactor/README.md]).
Architecture & Data Flow
The end-to-end flow for both text and image modalities is uniform: detect → resolve conflicts → redact. The diagram below summarizes the image/DICOM path; text anonymization follows the same shape with the OCR step omitted.
flowchart LR
A[Input: text / image / DICOM] --> B{Modality?}
B -- Text --> C[AnalyzerEngine]
B -- Image --> D[Tesseract OCR]
B -- DICOM --> D
D --> C
C --> E[RecognizerResults]
E --> F[ConflictResolutionStrategy]
F --> G{Output sink}
G -- Text --> H[AnonymizerEngine<br/>or BatchAnonymizerEngine]
G -- Image --> I[ImageRedactorEngine]
G -- DICOM --> J[DicomImageRedactorEngine]
H --> K[Redacted text]
I --> L[Redacted PNG]
J --> M[Redacted DICOM + bbox JSON]
Known Issues & Community Notes
Several recurring community issues are worth flagging before adopting the image/DICOM pipeline:
- Sorted-result bug in PII verification: `DicomImagePiiVerifyEngine._remove_duplicate_entities` calls `sorted()` on overlapping entity candidates but discards the sorted list, so the *lowest*-scored entity is kept instead of the highest. Track issue #2083 for the fix.
- DICOM metadata edge cases: Release 2.2.354 shipped a fix for "wrong condition for dicom metadata" (#1347), so older versions may misclassify or skip header fields.
- Evaluation notebook drift: The `example_dicom_redactor_evaluation.ipynb` sample has historically failed with an error (issue #1251); users running evaluation pipelines should pin a known-good commit or verify against the current notebook.
- Country-specific recognizers: As of release 2.2.359, country-specific predefined recognizers are disabled by default to reduce false positives. Adopters needing, for example, Philippines-specific PII (see issue #2015) must opt them in explicitly.
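The class of bug behind the sorted-result issue is easy to reproduce in isolation: `sorted()` returns a new list and leaves its argument unchanged, so ignoring the return value is a silent no-op. The sketch below uses invented data and hypothetical function names; it is not the code from `DicomImagePiiVerifyEngine`.

```python
# Invented overlapping candidates for the same region.
candidates = [
    {"entity_type": "PERSON", "score": 0.4},
    {"entity_type": "PERSON", "score": 0.9},
    {"entity_type": "PERSON", "score": 0.6},
]


def keep_best_buggy(entities):
    """Intends to keep the top-scored entity, but discards the sorted list."""
    items = list(entities)
    sorted(items, key=lambda e: e["score"], reverse=True)  # result thrown away
    return items[0]  # still the first item in input order


def keep_best_fixed(entities):
    """Assign the sorted result (or use list.sort()) before picking the winner."""
    items = sorted(entities, key=lambda e: e["score"], reverse=True)
    return items[0]


print(keep_best_buggy(candidates)["score"])
print(keep_best_fixed(candidates)["score"])
```

In this example the buggy version returns the first input item, which happens to be the lowest-scored one; in general it returns whatever order the caller supplied, so the symptom depends on upstream ordering.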
For long file paths on Windows during DICOM batch jobs, the README also recommends enabling Win32 long paths — a common practical failure point (Source: [presidio-image-redactor/README.md]).
See Also
- presidio-analyzer — PII detection upstream of every operation described here
- presidio-structured — tabular/JSON de-identification that reuses the same recognizers
- AHDS De-identification Service integration: remote recognizer + `ahds surrogate` operator introduced in 2.2.360
Structured Data, CLI, Deployment & Extensibility
Related topics: Presidio Overview & System Architecture, Analyzer: PII Detection, NLP Engines & Recognizers, Anonymization, Image Redaction & DICOM Processing
This page describes four cross-cutting capabilities of the Presidio ecosystem: the Structured module for tabular/semi-structured data, the CLI tool for batch text analysis, the supported Deployment topologies (Azure, Docker, HTTP), and the Extensibility surface (recognizers, operators, language models). It is intended for engineers integrating Presidio into pipelines and for operators who need to choose a deployment topology.
1. Structured Data with `presidio-structured`
presidio-structured extends Presidio beyond free text. It uses presidio-analyzer to detect PII at the column/key level, then uses presidio-anonymizer operators to redact values. The package exposes a StructuredEngine and a PandasAnalysisBuilder, both installed via pip install presidio-structured (presidio-structured/README.md).
The typical pattern is to (1) build a tabular_analysis that maps columns/keys to detected entity types, (2) define an OperatorConfig map per entity, and (3) invoke the engine over a pandas.DataFrame or a JSON document. The Faker library can be wired in via OperatorConfig("replace", {"new_value": fake.name()}) to produce realistic surrogates (presidio-structured/README.md).
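The three-step pattern can be sketched on plain Python rows, without pandas or Presidio. All names below (`column_to_entity`, `operators`, `anonymize_rows`) are hypothetical, and the column-to-entity map is written by hand here, whereas the package is described as deriving it through analysis.

```python
# Step 1 (assumed given): which column holds which entity type.
column_to_entity = {"name": "PERSON", "email": "EMAIL_ADDRESS"}

# Step 2: one operator per entity type; plain callables stand in for operator configs.
operators = {
    "PERSON": lambda value: "<PERSON>",
    "EMAIL_ADDRESS": lambda value: value[:1] + "***@" + value.split("@")[-1],
}


def anonymize_rows(rows, column_to_entity, operators,
                   default=lambda value: "<REDACTED>"):
    """Step 3: apply the mapped operator to every value in each mapped column."""
    result = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        for column, entity in column_to_entity.items():
            if column in new_row:
                new_row[column] = operators.get(entity, default)(new_row[column])
        result.append(new_row)
    return result


rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "city": "London"},
    {"name": "Alan Turing", "email": "alan@example.org", "city": "Wilmslow"},
]
anonymized_rows = anonymize_rows(rows, column_to_entity, operators)
for row in anonymized_rows:
    print(row)
```

Working at the column level means each cell is transformed by its column's entity type rather than re-analyzed individually, which is the main difference from the free-text path.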
A notable capability added in release 2.2.354 is user-defined entity selection strategies, allowing callers to plug in their own logic for choosing which entities get anonymized per column (Release 2.2.354). Community requests, such as Philippines-specific predefined recognizers, indicate ongoing expansion of country-specific recognizers usable from `presidio-structured` (Issue #2015).
2. CLI: `presidio-cli`
presidio-cli is a thin wrapper around presidio-analyzer that scans files and directories for PII. It is installed with pip install presidio-cli and requires Python 3.10–3.13 and poetry for source builds (presidio-cli/README.md).
Configuration can be supplied via a YAML file (-c) or inline (-d). The YAML keys map directly to presidio-analyzer parameters:
| YAML key | Purpose | Example value |
|---|---|---|
| `language` | NLP language for analysis | `en` |
| `ignore` | Glob patterns to skip | `.git`, `*.cfg` |
| `entities` | Restrict detection to a subset | `[PERSON, EMAIL_ADDRESS]` |
| `allow` | Allow-list tokens that should not be flagged | list of strings |
The CLI supports four output formats selected with -f / --format: standard, github (CI annotations), colored, and parsable (one JSON object per finding). The auto mode picks github inside GitHub Actions and colored otherwise (presidio-cli/README.md). presidio . runs against the current directory using a .presidiocli config if present; a --help flag enumerates every option.
3. Deployment Topologies
Presidio is shipped as multiple independently installable Python packages and as containerized services. The top-level presidio PyPI package is a meta-package that pulls in presidio-analyzer and presidio-anonymizer only — it contains no code of its own (presidio/README.md).
Three deployment paths are documented:
- Azure one-click deploy. Each sub-package ships a `deploytoazure.json` ARM template surfaced via a "Deploy to Azure" button in the README. Templates exist for `presidio-analyzer`, `presidio-anonymizer`, and `presidio-image-redactor` (presidio-analyzer/README.md, presidio-anonymizer/README.md, presidio-image-redactor/README.md).
- Docker Compose. The image-redactor and anonymizer packages each provide a `docker-compose.yml` runnable with `docker-compose up -d` from the package directory (presidio-image-redactor/README.md, presidio-anonymizer/README.md). Release 2.2.356 moved containers from the dev server to gunicorn for production parity (Release 2.2.356).
- HTTP API. Once running, the image redactor exposes `POST /redact` accepting multipart form data with an image and a `color_fill` payload; the anonymizer exposes its endpoint as documented in the public API spec (presidio-image-redactor/README.md, presidio-anonymizer/README.md).
GPU acceleration is supported for analyzer workloads by installing the matching CUDA build of cupy (e.g. cupy-cuda12x) on Linux/NVIDIA; macOS/Apple Silicon currently falls back to CPU for PyTorch operations (presidio-analyzer/README.md).
4. Extensibility
Presidio is designed to be extended along three axes: custom recognizers, custom anonymizer operators, and language-model backends.
- Recognizers. Predefined recognizers rely on regex, NER, and checksum logic, and can be replaced or augmented with custom classes that subclass the analyzer's EntityRecognizer contract. Release 2.2.355 moved predefined recognizers onto a config-file foundation, making it easier to override defaults without editing code (Release 2.2.355).
- Anonymizer operators. Built-in operators include replace, redact, hash (sha256/sha512; md5 was deprecated in 2.2.358), mask, encrypt (AES), custom (lambda), and the AHDS Surrogate operator that calls the Azure Health Data Services de-identification service for medically-appropriate surrogates (presidio-anonymizer/README.md, Release 2.2.358).
- Language-model backends. Release 2.2.362 introduced HuggingFaceNerRecognizer for direct NER model inference, and presidio-analyzer now ships LangExtract-backed recognizers (BasicLangExtractRecognizer, AzureOpenAILangExtractRecognizer) supporting Ollama and Azure OpenAI providers via the presidio-analyzer[langextract] extra (Release 2.2.362, presidio-analyzer/README.md).
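To make the first two axes concrete, the following standalone sketch shows the general idea of a regex-plus-checksum recognizer and of the replace, redact, mask, and hash operators. It deliberately does not import Presidio: every name here (CARD_PATTERN, luhn_valid, find_card_numbers, apply_operator) and every parameter name is hypothetical, chosen for illustration, and it is not the EntityRecognizer contract or the anonymizer's operator API.

```python
import hashlib
import re

# Candidate pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans that match the pattern and pass the checksum."""
    spans = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            spans.append((match.start(), match.end()))
    return spans


def apply_operator(text: str, start: int, end: int, operator: str, **params) -> str:
    """Rewrite text[start:end] using a simplified stand-in for an operator."""
    value = text[start:end]
    if operator == "replace":
        new = params.get("new_value", "<REDACTED>")
    elif operator == "redact":
        new = ""  # remove the detected span entirely
    elif operator == "mask":
        count = min(params.get("chars_to_mask", len(value)), len(value))
        new = params.get("masking_char", "*") * count + value[count:]
    elif operator == "hash":
        hash_type = params.get("hash_type", "sha256")
        if hash_type not in ("sha256", "sha512"):
            raise ValueError(f"unsupported hash type: {hash_type}")
        new = hashlib.new(hash_type, value.encode()).hexdigest()
    else:
        raise ValueError(f"unknown operator: {operator}")
    return text[:start] + new + text[end:]
```

When several spans are rewritten in one string, applying the operators from the last span to the first keeps earlier offsets valid. The checksum step is what separates a plausible-looking digit run from a likely card number, which is why pattern-only matching tends to produce more false positives.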
Known Extensibility Caveats
Community-reported issues point to three caveats. First, the YAML-driven configuration path can serialize omitted optional recognizer fields as explicit None, which downstream registry validation may reject (Issue #2080). Second, the DICOM verification engine has had a sorting bug in which _remove_duplicate_entities discarded its sorted() result and kept the lowest-scored entity; custom extensions should re-validate precedence logic when overriding built-ins (Issue #2083). Third, the DICOM redactor evaluation notebook has been reported broken in the past; users integrating custom evaluation harnesses should pin notebook revisions accordingly (Issue #1251).
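The first two caveats describe general Python pitfalls that are easy to reproduce in isolation. The sketch below is not Presidio's code: the function names, the dict-based entity shape, and the config keys are hypothetical, and it only illustrates the class of bug (a discarded sorted() result) and one possible defensive step (dropping None-valued fields before validation).

```python
def _keep_first_per_span(entities):
    """Keep the first entity seen for each (start, end) span."""
    seen = set()
    kept = []
    for entity in entities:
        span = (entity["start"], entity["end"])
        if span not in seen:
            seen.add(span)
            kept.append(entity)
    return kept


def remove_duplicates_buggy(entities):
    # BUG: sorted() returns a new list and leaves `entities` untouched,
    # so the input order (not the score) decides which duplicate survives.
    sorted(entities, key=lambda e: e["score"], reverse=True)
    return _keep_first_per_span(entities)


def remove_duplicates_fixed(entities):
    # Bind the sorted result so the highest-scored duplicate comes first.
    ranked = sorted(entities, key=lambda e: e["score"], reverse=True)
    return _keep_first_per_span(ranked)


def drop_none_fields(config):
    """Return a copy of a flat config dict without None-valued keys."""
    return {key: value for key, value in config.items() if value is not None}
```

A quick check with two entities covering the same span, the lower-scored one listed first, shows the buggy version keeping the lower score and the fixed version keeping the higher one. Whether dropping None values is the right handling for a given recognizer config depends on how the registry treats missing versus null fields, so treat drop_none_fields as a starting point only.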
See Also
- Analyzer, Anonymizer, and Image Redactor package READMEs (linked above) for component-level details.
- Release notes for version-specific behavior changes (e.g. country recognizers defaulted to disabled in 2.2.359).
- Public API spec at https://microsoft.github.io/presidio/api-docs/api-docs.html.
Source: https://github.com/microsoft/presidio / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 10 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.
1. Runtime risk: Runtime risk requires verification
- Severity: high
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1251
2. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2080
3. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/microsoft/presidio
4. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/2083
5. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio
6. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/microsoft/presidio
7. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/microsoft/presidio
8. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/microsoft/presidio/issues/1882
9. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio
10. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/microsoft/presidio
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using presidio with real data or production workflows.
- Sample presidio image evaluation notebook generates error - github / github_issue
- bug(image-redactor): sorted() result discarded in `_remove_duplicate_e - github / github_issue
- DocumentIntelligenceOCR support for Azure Identity (e.g DefaultAzureCred - github / github_issue
- Include support for hinglish transcripts masking - github / github_issue
- [[Bug] Omitted YAML recognizer fields are passed as None](https://github.com/microsoft/presidio/issues/2080) - github / github_issue
- Feature Request: Add Philippines (PH) country-specific predefined recogn - github / github_issue
- Release 2.2.362 - github / github_release
- Release 2.2.361 - github / github_release
- Release 2.2.360 - github / github_release
- 2.2.359 - github / github_release
- Capability evidence risk requires verification - GitHub / issue
Source: Project Pack community evidence and pitfall evidence