Getting started with Microsoft Presidio: Step-by-step detection and anonymization of personally identifiable information (PII)

In this tutorial, we will explore how to use Microsoft Presidio, an open-source framework designed to detect, analyze, and anonymize personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.
We will cover how to:
- Set up and install the necessary Presidio packages
- Detect common PII entities such as names, phone numbers, and credit card details
- Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)
- Create and register a custom anonymizer (such as a hash-based pseudonymizer)
- Reuse an anonymization mapping for consistent re-anonymization
Install the libraries
To get started with Presidio, you need to install the following key libraries:
- presidio-analyzer: This is the core library responsible for detecting PII entities in text using built-in and custom recognizers.
- presidio-anonymizer: This library provides tools for anonymizing detected PII using configurable operators (e.g., redact, replace, hash).
- spaCy NLP model (en_core_web_lg): Presidio uses spaCy under the hood for natural language processing tasks such as named entity recognition. The en_core_web_lg model provides highly accurate results and is recommended for English PII detection.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
If you are using Jupyter or Colab, you may need to restart the session after installing the libraries.
Presidio analyzer
Basic PII detection
In this block, we initialize the Presidio AnalyzerEngine and run a basic analysis to detect a U.S. phone number in a sample text. We also suppress warning logs from the Presidio library to keep the output clean.
AnalyzerEngine loads spaCy’s NLP pipeline and the predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.
import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)
from presidio_analyzer import AnalyzerEngine
# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()
# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)
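If you omit the entities argument, the analyzer checks the text against all entity types supported by its registered recognizers. The following is a minimal sketch (the sample sentence and the field inspection are our own additions, not part of the original walkthrough):
# Sketch: detect all supported entity types and inspect each result's fields
results = analyzer.analyze(
    text="John Smith's email is john@example.com and his phone is 212-555-5555",
    language="en",
)
for r in results:
    # Each RecognizerResult exposes the entity type, character offsets, and a confidence score
    print(r.entity_type, r.start, r.end, round(r.score, 2))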
Create a custom PII recognizer with a deny list (academic titles)
This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, which works well for detecting fixed terms such as academic titles (e.g., “Dr.”, “Prof.”, “Professor”). The recognizer is added to the Presidio registry and used by the analyzer to scan input text.
Although this tutorial only covers the deny-list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For these advanced methods, see the official documentation: Adding a custom recognizer.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry
# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)
# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)
# Step 3: Create analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)
# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
for result in results:
    print(result)
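To see exactly which spans were matched, you can slice the original text with each result's offsets; restricting the analysis to the custom entity keeps the output focused (a small sketch of our own, reusing the analyzer and text defined above):
# Sketch: analyze only the custom entity and print the matched spans
title_results = analyzer.analyze(text=text, entities=["ACADEMIC_TITLE"], language="en")
for r in title_results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score={r.score})")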
Presidio Anonymizer
This code block demonstrates how to use Presidio’s AnonymizerEngine to anonymize detected PII entities in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating the output of the Presidio Analyzer. These entities represent the names “Bond” and “James Bond” in the sample text.
We use the “replace” operator to substitute both names with a placeholder value (“BIP”), effectively anonymizing the sensitive data. This is done by passing an OperatorConfig with the desired anonymization policy (replace) to the AnonymizerEngine.
This pattern can easily be extended to apply other built-in operators such as “redact”, “hash”, or custom pseudonymization policies, as sketched after the code below.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
# Initialize the engine:
engine = AnonymizerEngine()
# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
print(result)
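As a sketch of the other built-in operators mentioned above, the same analyzer results can be anonymized with a redact or hash policy simply by swapping the OperatorConfig (the parameter names below follow the built-in redact and hash operators; treat this as an illustration rather than exhaustive usage):
# Sketch: apply other built-in operators to the same manually defined PERSON results
analyzer_results = [
    RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
    RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
]

# "redact" removes the matched text entirely
redacted = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=analyzer_results,
    operators={"PERSON": OperatorConfig("redact")},
)

# "hash" replaces the matched text with a hash of it
hashed = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=analyzer_results,
    operators={"PERSON": OperatorConfig("hash", {"hash_type": "sha256"})},
)

print(redacted.text)
print(hashed.text)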
Custom entity recognition, hash-based anonymization, and consistent re-anonymization with Presidio
In this example, we demonstrate how to:
- ✅ Use regex-based PatternRecognizers to define custom PII entities (e.g., Aadhaar and PAN numbers)
- 🔐 Anonymize the detected data using a custom hash-based operator (ReAnonymizer)
- ♻️ Re-anonymize the same values consistently by maintaining a mapping of original → hashed values
We implement a custom ReAnonymizer operator that checks whether a given value has already been hashed and reuses the same output for consistency. This is especially useful when anonymized data needs to retain some utility, for example, linking records through pseudonymous IDs.
Define a custom hash-based ReAnonymizer
This block defines a custom operator called ReAnonymizer that uses SHA-256 hashing to anonymize entities and ensures that the same input always receives the same anonymized output by storing previous results in a shared mapping dictionary.
from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict
class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")
        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")
        # Check if already hashed
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]
        # Hash and store (entity-type prefix + truncated SHA-256 digest; the exact format is a stylistic choice)
        hashed = "<" + entity_type + "_" + hashlib.sha256(text.encode()).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
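As a quick standalone check (our own addition, not part of the pipeline below), calling the operator directly shows that repeated inputs map to the same pseudonym:
# Sketch: the same input value always yields the same hashed output
op = ReAnonymizer()
shared_map = {}
print(op.operate("ABCDE1234F", {"entity_type": "IND_PAN", "entity_mapping": shared_map}))
print(op.operate("ABCDE1234F", {"entity_type": "IND_PAN", "entity_mapping": shared_map}))  # identical to the line above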
Define custom PII recognizers for PAN and Aadhaar numbers
We define two custom regex-based pattern recognizers, one for Indian PAN numbers and one for Aadhaar numbers. These will detect the custom PII entities in your text.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)
aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)
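Before wiring these recognizers into the engine, you can sanity-check them in isolation; PatternRecognizer exposes its own analyze method, and for pure regex patterns it does not need the spaCy pipeline (a small sketch with a made-up sample string):
# Sketch: run the PAN recognizer directly on a sample string
sample = "Customer PAN: ABCDE1234F"
for r in pan_recognizer.analyze(text=sample, entities=["IND_PAN"]):
    print(r.entity_type, sample[r.start:r.end], r.score)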
Set up the analyzer and anonymizer engines
Here we set up the Presidio AnalyzerEngine, register our custom recognizers, and add the custom ReAnonymizer operator to the AnonymizerEngine.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
# Initialize analyzer and register custom recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)
# Initialize anonymizer and add custom operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)
# Shared mapping dictionary for consistent re-anonymization
entity_mapping = {}
Analyze and anonymize the input text
We analyze two separate texts that contain the same PAN and Aadhaar values. The custom operator ensures they are anonymized consistently across both inputs.
from pprint import pprint
# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."
# Analyze and anonymize first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)
# Analyze and anonymize second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)
View the anonymized results and mappings
Finally, we print the anonymized outputs and inspect the mapping used to keep the hashes consistent across both texts.
print("📄 Original 1:", text1)
print("🔐 Anonymized 1:", anon1.text)
print("📄 Original 2:", text2)
print("🔐 Anonymized 2:", anon2.text)
print("n📦 Mapping used:")
pprint(entity_mapping)
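For reference, the mapping is keyed first by entity type and then by original value (this follows directly from mapping.setdefault(entity_type, {})[text] = hashed in the operator). A rough sketch of its shape, with placeholder hash suffixes and omitting any additional entity types the built-in recognizers may also have detected:
# Illustrative shape of entity_mapping after both runs (values are placeholders):
# {
#     "IND_PAN": {"ABCDE1234F": "<IND_PAN_...>"},
#     "AADHAAR": {"1234-5678-9123": "<AADHAAR_...>"},
#     ...
# }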

I am a civil engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in data science, especially neural networks and their applications in various fields.
