Not All De-Identification Tools Are Equal
When evaluating PHI de-identification tools, accuracy is everything. A 4% difference in detection rate might seem small—until you realize that 4% of a million-record dataset is 40,000 exposed records.
Recent benchmarks from ECIR 2025 reveal dramatic differences in PHI detection accuracy across leading tools.
The ECIR 2025 Benchmark Results
| Tool | F1-Score | Precision | Recall |
|---|---|---|---|
| John Snow Labs | 96% | 95% | 97% |
| Azure AI | 91% | 90% | 92% |
| AWS Comprehend Medical | 83% | 81% | 85% |
| GPT-4o | 79% | 82% | 76% |
The F1-score is the harmonic mean of precision (the share of detected entities that were genuinely PHI) and recall (the share of actual PHI entities that were detected). Both matter:
- Low precision = false positives (over-redaction)
- Low recall = false negatives (missed PII = breaches)
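Concretely, all three scores follow from entity-level counts of true positives, false positives, and false negatives. A minimal sketch (the counts below are illustrative, not taken from the benchmark):

```python
def detection_metrics(true_positives: int, false_positives: int, false_negatives: int):
    """Compute precision, recall, and F1 from entity-level counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 950 correct detections, 50 spurious redactions, 30 missed entities
p, r, f1 = detection_metrics(950, 50, 30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Note that F1 punishes whichever score is lower: a tool with 99% precision but 60% recall still scores poorly, which is the right behavior when every miss is a potential breach.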
Why the Gap Exists
Training Data Differences
| Tool | Training Focus |
|---|---|
| John Snow Labs | Healthcare-specific, clinical notes |
| Azure AI | General medical + clinical |
| AWS Comprehend | General medical entities |
| GPT-4o | Broad training, not healthcare-specific |
John Snow Labs' models are trained specifically on clinical documentation—the messy, abbreviated, context-dependent text that healthcare actually produces.
Entity Type Coverage
Not all tools detect the same entities:
| Entity | John Snow | Azure | AWS | GPT-4o |
|---|---|---|---|---|
| Patient names | Yes | Yes | Yes | Yes |
| Medical record numbers | Yes | Yes | Limited | Limited |
| Medication dosages | Yes | Yes | Yes | Partial |
| Procedure codes | Yes | Yes | Limited | No |
| Clinical abbreviations | Yes | Partial | No | Partial |
| Family member names | Yes | Yes | Partial | Partial |
Healthcare documents contain entities that general-purpose tools miss.
Context Handling
Consider this clinical note:
"Patient reports taking Smith's medication. Dr. Johnson recommends increasing dose."
A good PHI detector must:
- Recognize "Smith" as a medication brand, not a patient name
- Identify "Dr. Johnson" as a provider name requiring redaction
- Understand "Patient" refers to the subject, not a name
GPT-4o struggles with this context-dependent classification, which helps explain its 79% F1-score.
The Cost of Low Accuracy
Mathematical Impact
| Detection Rate | Records Processed | Exposed PHI Records |
|---|---|---|
| 96% | 1,000,000 | 40,000 |
| 91% | 1,000,000 | 90,000 |
| 83% | 1,000,000 | 170,000 |
| 79% | 1,000,000 | 210,000 |
Going from 79% to 96% accuracy reduces exposure by 170,000 records per million processed.
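The table's arithmetic is straightforward: missed PHI is (1 - detection rate) times the volume processed. A quick sketch, treating the benchmark F1-scores as a rough proxy for detection rate:

```python
def exposed_records(detection_rate: float, records_processed: int) -> int:
    """Missed PHI = (1 - detection rate) x records processed."""
    return round((1 - detection_rate) * records_processed)

# Reproduce the table above for one million records
for rate in (0.96, 0.91, 0.83, 0.79):
    print(f"{rate:.0%} -> {exposed_records(rate, 1_000_000):,} exposed")
```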
HIPAA Penalty Impact
HIPAA penalties scale with the number of affected individuals:
| Tier | Culpability | Penalty Per Violation |
|---|---|---|
| 1 | Lack of knowledge | $100 - $50,000 |
| 2 | Reasonable cause | $1,000 - $50,000 |
| 3 | Willful neglect (corrected) | $10,000 - $50,000 |
| 4 | Willful neglect (not corrected) | $50,000+ |
Knowingly deploying a tool with 79% accuracy when demonstrably more accurate options exist could arguably rise to "willful neglect."
How anonym.legal Compares
Our hybrid approach combines multiple detection methods:
Detection Pipeline
```
Input Text
    ↓
[Regex Patterns] - structured data (SSN, MRN, dates)
    ↓
[spaCy NER] - names, locations, organizations
    ↓
[Transformer Models] - context-dependent entities
    ↓
[Medical Dictionaries] - healthcare-specific terms
    ↓
Merged Results (highest confidence wins)
```
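The "highest confidence wins" merge step can be sketched as follows. This is an illustrative toy, not anonym.legal's actual implementation: only the regex and dictionary stages are shown (the NER and transformer stages would contribute entities in the same format), and the patterns, term list, and confidence values are made up.

```python
import re
from dataclasses import dataclass

@dataclass
class Entity:
    start: int
    end: int
    label: str
    confidence: float

# Stage 1: regex for structured identifiers (patterns illustrative, not exhaustive)
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def regex_stage(text: str) -> list[Entity]:
    return [Entity(m.start(), m.end(), label, 0.99)
            for label, pat in PATTERNS.items() for m in pat.finditer(text)]

# Stage 4: dictionary lookup for healthcare-specific terms (tiny toy list)
MEDICAL_TERMS = {"pt": "ABBREV", "dx": "ABBREV", "hx": "ABBREV"}

def dictionary_stage(text: str) -> list[Entity]:
    hits = []
    for m in re.finditer(r"\b\w+\b", text):
        label = MEDICAL_TERMS.get(m.group().lower())
        if label:
            hits.append(Entity(m.start(), m.end(), label, 0.80))
    return hits

def merge(*stages: list[Entity]) -> list[Entity]:
    """Highest confidence wins: keep an entity only if it doesn't overlap
    an already-kept entity with higher confidence."""
    merged: list[Entity] = []
    for ent in sorted([e for s in stages for e in s], key=lambda e: -e.confidence):
        if all(ent.end <= kept.start or ent.start >= kept.end for kept in merged):
            merged.append(ent)
    return sorted(merged, key=lambda e: e.start)

text = "Pt seen 03/14/2024, SSN 123-45-6789 on file."
for e in merge(regex_stage(text), dictionary_stage(text)):
    print(e.label, text[e.start:e.end])
```

Because every stage emits the same `Entity` shape, adding or swapping a detection method never changes the merge logic, which is part of why the hybrid design stays maintainable.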
Why Hybrid Works
| Method | Strengths | Weaknesses |
|---|---|---|
| Regex | Perfect for structured data | Can't handle context |
| spaCy | Fast, good for common entities | Limited medical vocabulary |
| Transformers | Context-aware, high accuracy | Slower, compute-intensive |
| Dictionaries | Complete medical terminology | Static, needs updates |
By combining all four, we achieve high accuracy without sacrificing speed.
Evaluating Detection Tools
Questions to Ask Vendors
1. What F1-score do you achieve on clinical notes?
   - Demand specific numbers, not "high accuracy"
   - Ask for third-party benchmark results
2. Which entity types do you detect?
   - Get the complete list
   - Verify all 18 HIPAA identifiers are covered
3. How do you handle clinical abbreviations?
   - "Pt" = patient
   - "Dx" = diagnosis
   - "Hx" = history
4. What about family member information?
   - "Mother has diabetes" contains PHI
   - Many tools miss this
5. Can you process clinical note formats?
   - Progress notes
   - Discharge summaries
   - Lab results
   - Radiology reports
Red Flags
- Refusing to provide accuracy metrics
- Only testing on clean, structured data
- No healthcare-specific training
- Limited entity type coverage
- No HIPAA Safe Harbor validation
Testing Methodology
If you need to evaluate tools yourself:
Step 1: Create Test Dataset
Include:
- Real clinical note formats (de-identified)
- All 18 HIPAA identifier types
- Edge cases (abbreviations, context-dependent)
- Multiple specialties (radiology, pathology, nursing)
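A test case in such a dataset might pair synthetic text with its expected annotations. A hypothetical record format (all names and identifiers below are fabricated):

```python
# One synthetic test case: fake clinical text plus the PHI spans a tool should find.
test_case = {
    "specialty": "radiology",
    "text": "Pt Jane Doe (MRN 00123456) seen 03/14/2024 for chest CT.",
    "expected_entities": [
        {"start": 3, "end": 11, "type": "NAME"},
        {"start": 17, "end": 25, "type": "MRN"},
        {"start": 32, "end": 42, "type": "DATE"},
    ],
}

# Sanity-check that each annotated span points at the intended surface text
surface = [test_case["text"][e["start"]:e["end"]]
           for e in test_case["expected_entities"]]
print(surface)
```

Validating span offsets like this before the comparison step catches annotation errors early, before they skew every tool's scores.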
Step 2: Gold Standard Annotation
Have human experts annotate:
- Every PHI instance
- Entity type for each
- Boundary positions (exact spans)
Step 3: Run Comparison
For each tool:
- Process test dataset
- Compare to gold standard
- Calculate precision, recall, F1
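Scoring a tool against the gold standard can be sketched with exact-span matching, where a prediction counts only if start, end, and entity type all agree (a simplifying assumption: real evaluations sometimes also report relaxed partial-match scores):

```python
def span_metrics(gold: set, predicted: set) -> dict:
    """Entity-level precision/recall/F1 under exact (start, end, type) matching."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Gold annotations and one tool's output as (start, end, entity_type) tuples
gold = {(0, 10, "NAME"), (15, 25, "DATE"), (30, 41, "MRN")}
pred = {(0, 10, "NAME"), (15, 25, "DATE"), (50, 55, "NAME")}  # missed the MRN, one false positive
print(span_metrics(gold, pred))
```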
Step 4: Analyze Failures
Categorize misses by:
- Entity type (which types are problematic?)
- Context (what situations cause failures?)
- Format (which document types are hard?)
Conclusion
The ECIR 2025 benchmarks prove that tool selection matters. A 17-point accuracy gap (96% vs. 79%) translates to hundreds of thousands of exposed records at scale.
When selecting a PHI detection tool:
- Demand specific accuracy metrics
- Verify all 18 HIPAA identifiers are covered
- Test on your actual document formats
- Consider hybrid approaches over single-method tools
Protect your patients and your organization.