The 50% Miss Rate Problem

A 2025 survey (arXiv:2509.14464) tested LLM tools on clinical records. The results were bad. These tools missed more than 50% of clinical PHI in multilingual documents. The cause is simple. LLMs are optimized to generate fluent text. They are not built for the high-recall detection task that HIPAA de-identification demands.

HIPAA Safe Harbor lists 18 protected identifier types. They include names, dates, phone numbers, SSNs, MRNs, health plan IDs, device IDs, and IP addresses. Each needs its own detection logic.

Clinical notes make this harder. Take this example: "Pt. John D., DOB 4/12/67, MRN 1234567, admitted 03/15/24, Dr. Smith ordered ECG." One sentence. Five protected identifiers. Most use short forms. A model built for clinical meaning often fails the detection task.

What LLMs Miss and Why

LLM tools fail on clinical records in predictable ways.

Short-form identifiers: Clinical notes use shorthand. DOB, MRN, and Pt. are common forms. A model tuned for clinical meaning may not flag "Pt. John D." as a name. Sensitive data extraction needs a different goal.

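Context-dependent dates: Not all date-like values pose the same risk. "Age 67" is a soft marker, and Safe Harbor lets ages of 89 and under stay in the data. "DOB 4/12/67" is a direct protected identifier. "03/15/24" as an admit date is protected too. Pattern matching alone cannot tell these apart. The detector needs the surrounding context.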
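A minimal sketch of context-aware date labelling, using the sample sentence from earlier. The cue words, the 20-character look-back window, and the function name `classify_dates` are assumptions made for illustration; a real detector would tune them to local note templates.

```python
import re

# Numeric dates such as 4/12/67 or 03/15/24 (illustrative, US-style only).
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

# Hypothetical cue words that tell us what kind of date follows.
CUES = {
    "birth_date": re.compile(r"\b(DOB|born)\b", re.IGNORECASE),
    "admission_date": re.compile(r"\b(admitted|admit|adm)\b", re.IGNORECASE),
}

def classify_dates(text, window=20):
    """Return (date, label) pairs, labelling each date by the closest cue before it."""
    results = []
    for m in DATE_RE.finditer(text):
        context = text[max(0, m.start() - window):m.start()]
        label, best_pos = "unlabelled_date", -1
        for name, cue in CUES.items():
            for c in cue.finditer(context):
                if c.start() > best_pos:
                    best_pos, label = c.start(), name
        results.append((m.group(), label))
    return results

note = "Pt. John D., DOB 4/12/67, MRN 1234567, admitted 03/15/24, Dr. Smith ordered ECG."
print(classify_dates(note))
# [('4/12/67', 'birth_date'), ('03/15/24', 'admission_date')]
```

Both dates still need removal. The label matters for audit records and for deciding how to transform each value. "Age 67" produces no match at all, which is the intended behavior for this pattern.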

Non-US formats: Cyberhaven (Q4 2025) found that 34.8% of all ChatGPT inputs contain sensitive data, including multilingual PII. In healthcare, this means non-US record IDs, regional date formats, and local health ID types. US-trained tools miss these consistently.

Custom hospital identifiers: Hospitals use their own MRN formats, staff IDs, and site codes. These are not in standard NER training data. A tool with no custom entity support will not find them.
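One way to picture custom entity support is a small registry of site-specific patterns. Every format below (`STM-` MRNs, `EMP` staff IDs, `FAC-` site codes) is invented for illustration and does not describe any real hospital or any particular product's configuration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class CustomEntity:
    label: str
    pattern: re.Pattern

# Hypothetical site formats; replace with the formats your institution actually uses.
SITE_ENTITIES = [
    CustomEntity("SITE_MRN", re.compile(r"\bSTM-\d{7}\b")),
    CustomEntity("STAFF_ID", re.compile(r"\bEMP\d{5}\b")),
    CustomEntity("FACILITY_CODE", re.compile(r"\bFAC-[A-Z]{2}\d{2}\b")),
]

def find_custom_entities(text, entities=SITE_ENTITIES):
    """Return sorted (start, end, label, matched_text) tuples for every custom hit."""
    hits = []
    for ent in entities:
        for m in ent.pattern.finditer(text):
            hits.append((m.start(), m.end(), ent.label, m.group()))
    return sorted(hits)

sample = "Seen at FAC-NW04 by EMP20931; chart STM-0045121."
for hit in find_custom_entities(sample):
    print(hit)
```

Keeping the patterns in data, separate from the matching loop, means a new site format is a one-line addition that can be reviewed and validated on its own.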

The Research Dataset Risk

A hospital building a research dataset from 500,000 notes faces a real compliance problem. HIPAA's Safe Harbor method requires removing all 18 identifier types, and its Expert Determination method requires that re-identification risk be "very small." A tool missing half of all protected identifiers cannot meet either bar.

Research archives are not clean data. Notes span many departments, time periods, and sometimes languages. A tool that works on billing data may fail on narrative notes. Sensitive data in free text has no field label.

IRB approval adds more demands. Institutions must show the method used, the identifier types removed, and the checks done. A tool missing half of all protected identifiers cannot meet those demands.

See our compliance overview and security practices for how anonym.legal supports HIPAA work.

The Three-Layer Fix

The 2025 survey found one clear pattern. The tools with the lowest miss rates used three detection layers.

Layer one — regex: Finds structured identifiers. SSNs, MRNs, phone numbers, health plan IDs. Reliable on fixed formats.

Layer two — NER: Uses transformer models. Finds names, dates, and sensitive data in narrative text. Works where regex cannot.

Layer three — custom entities: Handles site-specific forms. Proprietary MRN patterns, staff IDs, facility codes. No standard model covers these.

Pure ML tools degrade on short forms and non-English text. Pure regex tools miss sensitive data with no field label. Neither alone is enough.
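The three layers can be sketched as functions that each return character spans, with a merge step before redaction. This is an illustration under stated assumptions, not the implementation of any tool in the survey: `toy_ner_layer` is a hypothetical stand-in for a transformer NER model, and the facility code format is invented.

```python
import re
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Layer one: fixed-format patterns (illustrative, far from exhaustive).
REGEX_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN": re.compile(r"(?<=MRN )\d{6,10}\b"),
}

def regex_layer(text: str) -> List[Span]:
    return [(m.start(), m.end(), label)
            for label, pat in REGEX_PATTERNS.items()
            for m in pat.finditer(text)]

def toy_ner_layer(text: str) -> List[Span]:
    """Stand-in for a real NER model: flags a capitalised name after 'Pt.' or 'Dr.'."""
    pat = re.compile(r"(?:Pt\.|Dr\.)\s+([A-Z][a-z]+(?:\s[A-Z]\.)?)")
    return [(m.start(1), m.end(1), "NAME") for m in pat.finditer(text)]

def custom_layer(text: str) -> List[Span]:
    """Layer three: a hypothetical site-specific facility code."""
    pat = re.compile(r"\bFAC-[A-Z]{2}\d{2}\b")
    return [(m.start(), m.end(), "FACILITY_CODE") for m in pat.finditer(text)]

def detect(text: str, layers: Sequence[Callable[[str], List[Span]]]) -> List[Span]:
    """Run every layer and merge overlapping spans so nothing flagged stays exposed."""
    spans = sorted(s for layer in layers for s in layer(text))
    merged: List[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), prev[2])
        else:
            merged.append((start, end, label))
    return merged

def redact(text: str, spans: List[Span]) -> str:
    out, cursor = [], 0
    for start, end, label in spans:
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

note = "Pt. John D., DOB 4/12/67, MRN 1234567, admitted 03/15/24, Dr. Smith ordered ECG."
all_layers = [regex_layer, toy_ner_layer, custom_layer]
print(redact(note, detect(note, all_layers)))
print(redact(note, detect(note, [regex_layer])))  # regex alone leaves both names in place
```

Running the regex layer by itself removes the dates and the MRN but leaves "John D." and "Smith" untouched, which is the gap the NER layer exists to close.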

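Only the three-layer design reached sub-5% miss rates in the survey. That is a strong detection result. Treat it as a performance target and not as the legal bar, because Safe Harbor itself requires removing all 18 identifier types.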

See our guide on HIPAA Safe Harbor de-identification for research for next steps.

When This Approach Has Limits

A three-layer design dramatically outperforms a single LLM, but a sub-5% miss rate is a strong result, not a compliance guarantee. Be precise about what it does and does not mean.

A low miss rate is not a zero miss rate. Sub-5% means up to one in twenty PHI instances can still be missed. Across a large clinical corpus, that is a real volume of residual exposure. The architecture makes the risk small and measurable; it does not make it disappear. For any release of de-identified data, a human review step remains the control that catches what the pipeline misses. Automation reduces the review burden; it does not remove the obligation.
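A back-of-the-envelope sketch makes the scale concrete. The 500,000-note corpus is the research dataset example from earlier in this article. The average of five PHI instances per note is an assumption borrowed from the sample sentence, not a measured figure, and `expected_missed` is a hypothetical helper.

```python
def expected_missed(notes: int, phi_per_note: float, miss_rate: float) -> int:
    """Expected number of PHI instances left in the corpus after de-identification."""
    return round(notes * phi_per_note * miss_rate)

NOTES = 500_000        # corpus size from the research dataset example
PHI_PER_NOTE = 5       # assumed average, for illustration only

for rate in (0.50, 0.05, 0.01):
    missed = expected_missed(NOTES, PHI_PER_NOTE, rate)
    print(f"miss rate {rate:.0%}: about {missed:,} PHI instances remain")
```

Under these assumptions, a 5% miss rate still leaves about 125,000 identifiers in the corpus, against about 1,250,000 at a 50% miss rate. The improvement is tenfold, and the residual is still far too large to release without review.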

Safe Harbor is defined by 18 identifier categories, not by a miss-rate number. The HIPAA Safe Harbor method requires removing all eighteen specified identifier types; "sub-5% miss rate" is a measure of detection performance, not the legal standard itself. A dataset can score well on aggregate recall and still retain a category Safe Harbor requires gone. Where any actual knowledge of residual re-identification risk exists, Safe Harbor is not met regardless of the score, and Expert Determination by a qualified statistician is the alternative path.
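The gap between aggregate recall and category coverage is easy to show with a toy evaluation. The data below is synthetic and the exact-span matching rule is a simplification; `recall_by_category` is a hypothetical helper, not part of any named library.

```python
from collections import Counter

def recall_by_category(gold, predicted_spans):
    """gold: set of (doc_id, start, end, category); predicted_spans: set of (doc_id, start, end).

    A gold item counts as found only when a predicted span matches it exactly.
    """
    total, found = Counter(), Counter()
    for doc_id, start, end, category in gold:
        total[category] += 1
        if (doc_id, start, end) in predicted_spans:
            found[category] += 1
    return {category: found[category] / total[category] for category in total}

# Synthetic gold set: 19 names and 1 device ID in a single note.
gold = {("n1", i * 10, i * 10 + 5, "NAME") for i in range(19)}
gold.add(("n1", 500, 512, "DEVICE_ID"))

# A detector that finds every name and misses the device ID.
predicted = {(d, s, e) for d, s, e, c in gold if c == "NAME"}

per_category = recall_by_category(gold, predicted)
overall = sum(1 for d, s, e, c in gold if (d, s, e) in predicted) / len(gold)

print(f"aggregate recall: {overall:.2f}")
for category, recall in sorted(per_category.items()):
    print(f"{category}: {recall:.2f}")
```

Here the aggregate recall is 0.95, yet one of the identifier categories has a recall of zero. Reporting recall per category, not only in aggregate, is what exposes that kind of failure.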

Benchmark results do not transfer automatically to your data. The survey figure reflects the documents tested. Your institution's note templates, abbreviations, languages, and OCR quality differ, and performance can be lower on forms or languages underrepresented in the evaluation. Custom entity layers help with site-specific MRNs and facility codes, but they must be configured and validated against your actual records — treat the sub-5% figure as achievable under the right setup, not as a default you inherit.
