Getting started with Microsoft Presidio: Step-by-step detection and anonymization of personally identifiable information (PII)

In this tutorial, we will explore how to use Microsoft Presidio, an open-source framework designed to detect, analyze, and anonymize personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.
We will cover how to:
- Set up and install the necessary Presidio packages
- Detect common PII entities such as names, phone numbers, and credit card details
- Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)
- Create and register a custom anonymizer (such as a hash-based pseudonymizer)
- Reuse an anonymization mapping for consistent re-anonymization
Install the libraries
To get started with Presidio, you need to install the following key libraries:
- presidio-analyzer: This is the core library responsible for detecting PII entities in text using built-in and custom recognizers.
- presidio-anonymizer: This library provides tools for anonymizing detected PII using configurable operators (e.g., redact, replace, hash).
- spaCy NLP model (en_core_web_lg): Presidio uses spaCy under the hood for natural language processing tasks such as named entity recognition. The en_core_web_lg model provides highly accurate results and is recommended for English PII detection.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
If you are using Jupyter or Colab, you may need to restart the session after installing the libraries.
Presidio analyzer
Basic PII detection
In this block, we initialize the Presidio AnalyzerEngine and run a basic analysis to detect a U.S. phone number in a sample text. We also suppress warning logs from the Presidio library to keep the output clean.
AnalyzerEngine loads spaCy’s NLP pipeline and the predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.
import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)
from presidio_analyzer import AnalyzerEngine
# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()
# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)
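If you omit the entities argument, the analyzer checks the text against all entity types supported by its registered recognizers. The following is a minimal sketch (the sample sentence and the field inspection are our own additions, not part of the original walkthrough):
# Sketch: detect all supported entity types and inspect each result's fields
results = analyzer.analyze(
    text="John Smith's email is john@example.com and his phone is 212-555-5555",
    language="en",
)
for r in results:
    # Each RecognizerResult exposes the entity type, character offsets, and a confidence score
    print(r.entity_type, r.start, r.end, round(r.score, 2))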
Create a custom PII recognizer with a deny list (academic titles)
This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, which works well for detecting fixed terms such as academic titles (e.g., “Dr.”, “Prof.”, “Professor”). The recognizer is added to the Presidio registry and used by the analyzer to scan input text.
Although this tutorial only covers the deny-list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For these advanced methods, see the official documentation: Adding a custom recognizer.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry
# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)
# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)
# Step 3: Create analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)
# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
for result in results:
    print(result)
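To see exactly which spans were matched, you can slice the original text with each result's offsets; restricting the analysis to the custom entity keeps the output focused (a small sketch of our own, reusing the analyzer and text defined above):
# Sketch: analyze only the custom entity and print the matched spans
title_results = analyzer.analyze(text=text, entities=["ACADEMIC_TITLE"], language="en")
for r in title_results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score={r.score})")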
Presidio Anonymizer
This code block demonstrates how to use Presidio’s AnonymizerEngine to anonymize detected PII entities in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating the output of the Presidio Analyzer. These entities represent the names “Bond” and “James Bond” in the sample text.
We use the “replace” operator to substitute both names with a placeholder value (“BIP”), effectively anonymizing the sensitive data. This is done by passing an OperatorConfig with the desired anonymization policy (replace) to the AnonymizerEngine.
This pattern can easily be extended to apply other built-in operators such as “redact”, “hash”, or custom pseudonymization policies, as sketched after the code below.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
# Initialize the engine:
engine = AnonymizerEngine()
# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
print(result)
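As a sketch of the other built-in operators mentioned above, the same analyzer results can be anonymized with a redact or hash policy simply by swapping the OperatorConfig (the parameter names below follow the built-in redact and hash operators; treat this as an illustration rather than exhaustive usage):
# Sketch: apply other built-in operators to the same manually defined PERSON results
analyzer_results = [
    RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
    RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
]

# "redact" removes the matched text entirely
redacted = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=analyzer_results,
    operators={"PERSON": OperatorConfig("redact")},
)

# "hash" replaces the matched text with a hash of it
hashed = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=analyzer_results,
    operators={"PERSON": OperatorConfig("hash", {"hash_type": "sha256"})},
)

print(redacted.text)
print(hashed.text)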
Custom entity recognition, hash-based anonymization, and consistent re-anonymization with Presidio
In this example, we demonstrate how to:
- ✅ Use regex-based PatternRecognizers to define custom PII entities (e.g., Aadhaar and PAN numbers)
- 🔐 Anonymize the detected data using a custom hash-based operator (ReAnonymizer)
- ♻️ Re-anonymize the same values consistently by maintaining a mapping of original → hashed values
We implement a custom ReAnonymizer operator that checks whether a given value has already been hashed and reuses the same output for consistency. This is especially useful when anonymized data needs to retain some utility, for example, linking records through pseudonymous IDs.
Define a custom hash-based ReAnonymizer
This block defines a custom operator called ReAnonymizer that uses SHA-256 hashing to anonymize entities and ensures that the same input always receives the same anonymized output by storing previous results in a shared mapping dictionary.
from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict
class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")
        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")
        # Check if already hashed
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]
        # Hash and store (entity-type prefix + truncated SHA-256 digest; the exact format is a stylistic choice)
        hashed = "<" + entity_type + "_" + hashlib.sha256(text.encode()).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
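As a quick standalone check (our own addition, not part of the pipeline below), calling the operator directly shows that repeated inputs map to the same pseudonym:
# Sketch: the same input value always yields the same hashed output
op = ReAnonymizer()
shared_map = {}
print(op.operate("ABCDE1234F", {"entity_type": "IND_PAN", "entity_mapping": shared_map}))
print(op.operate("ABCDE1234F", {"entity_type": "IND_PAN", "entity_mapping": shared_map}))  # identical to the line above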
Define custom PII recognizers for PAN and Aadhaar numbers
We define two custom regex-based pattern recognizers, one for Indian PAN numbers and one for Aadhaar numbers. These will detect the custom PII entities in your text.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)
aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)
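Before wiring these recognizers into the engine, you can sanity-check them in isolation; PatternRecognizer exposes its own analyze method, and for pure regex patterns it does not need the spaCy pipeline (a small sketch with a made-up sample string):
# Sketch: run the PAN recognizer directly on a sample string
sample = "Customer PAN: ABCDE1234F"
for r in pan_recognizer.analyze(text=sample, entities=["IND_PAN"]):
    print(r.entity_type, sample[r.start:r.end], r.score)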
Set up the analyzer and anonymizer engines
Here we set up the Presidio AnalyzerEngine, register our custom recognizers, and add the custom ReAnonymizer operator to the AnonymizerEngine.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
# Initialize analyzer and register custom recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)
# Initialize anonymizer and add custom operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)
# Shared mapping dictionary for consistent re-anonymization
entity_mapping = {}
Analyze and anonymize the input text
We analyze two separate texts that contain the same PAN and Aadhaar values. The custom operator ensures they are anonymized consistently across both inputs.
from pprint import pprint
# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."
# Analyze and anonymize first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)
# Analyze and anonymize second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)
View the anonymized results and mappings
Finally, we print the anonymized outputs and inspect the mapping used to keep the hashes consistent across both texts.
print("📄 Original 1:", text1)
print("🔐 Anonymized 1:", anon1.text)
print("📄 Original 2:", text2)
print("🔐 Anonymized 2:", anon2.text)
print("n📦 Mapping used:")
pprint(entity_mapping)
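For reference, the mapping is keyed first by entity type and then by original value (this follows directly from mapping.setdefault(entity_type, {})[text] = hashed in the operator). A rough sketch of its shape, with placeholder hash suffixes and omitting any additional entity types the built-in recognizers may also have detected:
# Illustrative shape of entity_mapping after both runs (values are placeholders):
# {
#     "IND_PAN": {"ABCDE1234F": "<IND_PAN_...>"},
#     "AADHAAR": {"1234-5678-9123": "<AADHAAR_...>"},
#     ...
# }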

I am a civil engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in data science, especially neural networks and their applications in various fields.
