The 50% Miss Rate Problem
A 2025 survey of LLM-based de-identification tools (arXiv:2509.14464) found that general-purpose LLM tools miss more than 50% of clinical PHI in multilingual documents. This figure reflects a fundamental architectural mismatch: LLMs are designed for language understanding and generation, not for the structured, high-recall identification task that HIPAA de-identification requires.
The HIPAA Privacy Rule's Safe Harbor method requires removal of 18 specific identifier categories: names, geographic subdivisions smaller than a state, dates (other than year) directly related to an individual, phone numbers, fax numbers, email addresses, SSNs, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers and serial numbers, device identifiers, web URLs, IP addresses, biometric identifiers, full-face photographs, and any other unique identifying number or code. Some of these categories (SSNs, phone numbers, IP addresses) follow structured formats; others (names, geographic data) appear as free text. Each calls for its own detection logic.
Clinical notes are where the difficulty concentrates. Consider a typical clinical note fragment: "Pt. John D., DOB 4/12/67, MRN 1234567, presented to ED on 03/15/24 with chest pain. Prior Hx: HTN, DM. Dr. Smith ordered ECG." This short fragment contains a name, date of birth, MRN, admission date, and treating physician: five candidate identifiers, some in abbreviated form, embedded in clinical shorthand.
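To make the detection task concrete, here is a minimal sketch that pulls the five items out of the fragment above with hand-written regular expressions. The pattern table and the `find_phi` helper are hypothetical, and the patterns are tuned to this one example, not to clinical text in general.

```python
import re

NOTE = ("Pt. John D., DOB 4/12/67, MRN 1234567, presented to ED on 03/15/24 "
        "with chest pain. Prior Hx: HTN, DM. Dr. Smith ordered ECG.")

# Hypothetical patterns written for this single fragment only.
# Each pattern captures the identifier value in group 1.
PATTERNS = [
    ("PATIENT_NAME", re.compile(r"\bPt\.\s+([A-Z][a-z]+(?:\s+[A-Z]\.)?)")),
    ("DATE_OF_BIRTH", re.compile(r"\bDOB\s+(\d{1,2}/\d{1,2}/\d{2,4})")),
    ("MRN", re.compile(r"\bMRN\s+(\d{6,10})\b")),
    ("ENCOUNTER_DATE", re.compile(r"\bon\s+(\d{1,2}/\d{1,2}/\d{2,4})")),
    ("PROVIDER_NAME", re.compile(r"\bDr\.\s+([A-Z][a-z]+)")),
]

def find_phi(text):
    """Return (label, value, start, end) tuples sorted by position in the text."""
    hits = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((label, m.group(1), m.start(1), m.end(1)))
    return sorted(hits, key=lambda h: h[2])

for label, value, start, end in find_phi(NOTE):
    print(f"{label:15} {value!r} [{start}:{end}]")
```

Every pattern here depends on a cue such as "Pt.", "DOB", or "Dr." appearing immediately before the value. A note that writes the same facts without those cues would slip past all five patterns, which is why regex alone is not enough for narrative text.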
What LLMs Miss and Why
General-purpose LLMs fail on clinical PHI in predictable patterns.
Abbreviated identifiers: Clinical notes use standard abbreviations (DOB for date of birth, MRN for medical record number, Pt. for patient) that context-free NER may not recognize as PII markers. An LLM reading the note above for general comprehension understands the clinical meaning; an LLM tasked with PHI extraction may miss "Pt. John D." as a partial name pattern.
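Context-dependent dates: Dates in clinical notes carry specific HIPAA significance, and the surrounding context determines how each one is handled. "DOB 4/12/67" is PHI. "03/15/24" as an admission date is PHI. An age such as "Age 67" is a quasi-identifier: Safe Harbor only requires ages over 89 to be aggregated, but an age can still narrow down identity when combined with other details. Handling these cases requires context-aware date extraction, not just date pattern matching.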
Regional identifier formats: Research by Cyberhaven (Q4 2025) found that 34.8% of all ChatGPT inputs contain sensitive data including multilingual PII. In healthcare contexts, this includes non-US medical record formats, international date conventions, and country-specific health identifier formats that US-focused systems miss.
Custom institutional identifiers: Health systems use proprietary MRN formats, employee IDs, and facility codes that are not part of standard NER training data. A system without custom entity type support cannot detect these.
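Two of the failure patterns above, context-dependent dates and custom institutional identifiers, can be sketched in a few lines. The cue vocabulary, the `STM-` record format, and the employee ID format below are invented for illustration; a real deployment would substitute its own validated cue list and identifier formats.

```python
import re

DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

# Hypothetical cue words mapped to date roles.
DATE_CUES = {
    "dob": "DATE_OF_BIRTH",
    "born": "DATE_OF_BIRTH",
    "admitted": "ADMISSION_DATE",
    "presented": "ADMISSION_DATE",
    "discharged": "DISCHARGE_DATE",
}

def classify_dates(text, window=30):
    """Label each date by the nearest cue word in the preceding `window` characters."""
    results = []
    for m in DATE.finditer(text):
        context = text[max(0, m.start() - window):m.start()].lower()
        role, best_pos = "DATE_UNSPECIFIED", -1
        for cue, cue_role in DATE_CUES.items():
            pos = context.rfind(cue)
            if pos > best_pos:  # a later position means closer to the date
                role, best_pos = cue_role, pos
        results.append((role, m.group()))
    return results

# Registry for institution-specific entity types (formats are assumptions).
CUSTOM_ENTITIES = {}

def register_entity(label, pattern):
    """Add a custom entity type defined by a regular expression."""
    CUSTOM_ENTITIES[label] = re.compile(pattern)

def find_custom(text):
    """Return (label, value) pairs for every registered custom entity found."""
    return [(label, m.group())
            for label, rx in CUSTOM_ENTITIES.items()
            for m in rx.finditer(text)]

register_entity("FACILITY_MRN", r"\bSTM-\d{7}\b")
register_entity("EMPLOYEE_ID", r"\bE\d{5}\b")

print(classify_dates("Pt. John D., DOB 4/12/67, MRN 1234567, presented to ED on 03/15/24"))
print(find_custom("Chart STM-0042317 reviewed by E10422."))
```

The date classifier treats every date as PHI and uses context only to decide which kind it is. The registry keeps institution-specific formats out of the core detector, so a new MRN scheme can be added without retraining a model.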
The Research Dataset Compliance Problem
A hospital system building a de-identified research dataset from 500,000 clinical notes faces a compound risk. HIPAA recognizes two de-identification methods: Safe Harbor, which requires removing all 18 identifier categories, and Expert Determination, under which a qualified expert must conclude that the risk of re-identification is "very small." A system missing 50% of PHI produces a dataset that satisfies neither method, exposing the research institution to OCR enforcement and IRB compliance failures.
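A back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes five identifiers per note (the count in the sample fragment earlier, not a measured average) and assumes each identifier is missed independently. Both assumptions are simplifications, and the function name is hypothetical.

```python
def residual_phi_estimate(num_notes, identifiers_per_note, miss_rate):
    """Estimate residual PHI, assuming identifiers are missed independently."""
    # Probability that at least one identifier in a note goes undetected.
    p_note_leaks = 1 - (1 - miss_rate) ** identifiers_per_note
    return {
        "expected_missed_identifiers": num_notes * identifiers_per_note * miss_rate,
        "share_of_notes_with_residual_phi": p_note_leaks,
        "expected_notes_with_residual_phi": num_notes * p_note_leaks,
    }

# 500,000 notes, an assumed 5 identifiers each, 50% miss rate.
estimate = residual_phi_estimate(500_000, 5, 0.50)
for key, value in estimate.items():
    print(f"{key}: {value:,.3f}")
```

With these inputs the estimate is 1,250,000 missed identifiers, and about 96.9% of notes retain at least one. Because the per-note probability compounds across identifiers, even a much lower miss rate leaves a meaningful share of notes with residual PHI.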
The clinical notes in a research dataset are not uniform. They span different departments (cardiology, oncology, psychiatry), different documentation styles, different time periods, and — in multilingual health systems — different languages. A de-identification system that performs adequately on structured billing data may fail on unstructured psychiatric progress notes where PHI appears in narrative context rather than labeled fields.
The Hybrid Detection Requirement
The 2025 research survey identified the consistent pattern: systems with the highest PHI recall combine structured identifier detection (regex for SSNs, MRNs, phone numbers) with contextual NER (transformer-based models for names, dates in narrative context) and custom entity support (institution-specific identifiers).
Pure ML approaches achieve high recall on common identifiers in well-formatted text but degrade on abbreviations, rare identifier types, and non-English text. Pure regex approaches achieve high recall on structured identifiers but miss contextual PHI (a physician's name mentioned in a clinical narrative without a title prefix).
The hybrid three-tier architecture (regex for structured identifiers, NLP for contextual PHI, transformer models for cross-lingual and abbreviated forms) is the pattern the survey identifies as achieving sub-5% miss rates. Safe Harbor itself sets no numeric recall threshold, so a low miss rate is a prerequisite for compliance rather than proof of it.
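The general shape of such a hybrid pipeline can be sketched as a set of pluggable tiers whose spans are merged before redaction. This is an illustrative outline, not the architecture of any system in the survey. All names are hypothetical, and `stub_ner_tier` is a regex stand-in for a real contextual NER model.

```python
import re
from typing import Callable, List, NamedTuple

class Span(NamedTuple):
    start: int
    end: int
    label: str
    source: str

# Tier 1: structured identifiers (simplified US-style formats).
STRUCTURED = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN\s+\d{6,10}\b"),
}

def regex_tier(text: str) -> List[Span]:
    return [Span(m.start(), m.end(), label, "regex")
            for label, rx in STRUCTURED.items() for m in rx.finditer(text)]

def stub_ner_tier(text: str) -> List[Span]:
    """Placeholder for a contextual NER model; only flags a name after 'Dr.'."""
    return [Span(m.start(1), m.end(1), "NAME", "ner")
            for m in re.finditer(r"\bDr\.\s+([A-Z][a-z]+)", text)]

def merge(spans: List[Span]) -> List[Span]:
    """Union overlapping spans so redaction covers everything any tier flagged."""
    merged: List[Span] = []
    for s in sorted(spans):
        if merged and s.start < merged[-1].end:
            last = merged[-1]
            merged[-1] = Span(last.start, max(last.end, s.end), last.label, last.source)
        else:
            merged.append(s)
    return merged

def redact(text: str, tiers: List[Callable[[str], List[Span]]]) -> str:
    """Run every tier, merge the results, and replace each span with its label."""
    spans = merge([s for tier in tiers for s in tier(text)])
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(f"[{s.label}]")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)

sample = "Dr. Smith called 555-867-5309 re: MRN 1234567, SSN 123-45-6789."
print(redact(sample, [regex_tier, stub_ner_tier]))
```

Merging by union rather than by voting favors recall: a span flagged by any single tier is redacted. That matches the high-recall requirement discussed above, at the cost of more false positives.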
Sources: