HIPAA De-Identification Without a Regex PhD: AI-Assisted MRN Pattern Creation
Your hospital's Medical Record Number format doesn't exist in any standard PII tool. Here's how to add it in 5 minutes without writing a single line of regex.
Healthcare IT teams implementing HIPAA de-identification face a specific challenge that doesn't exist in other sectors: the identifier they most need to detect — the Medical Record Number — is defined by their own institution, not by any national standard.
The result: every implementation of HIPAA de-identification in a healthcare system requires custom configuration. Without custom configuration, MRNs pass through "de-identified" datasets undetected.
The Multi-Facility MRN Chaos
Healthcare networks built through years of acquisition contain facilities with legacy EHR systems — each with its own MRN format established decades ago:
- Memorial Hospital (Epic since 2015): MRN:XXXXXXX (7-digit numeric with prefix)
- St. Mary's (legacy Cerner system): PT-XXXXX (5-digit numeric with patient prefix)
- University Hospital (Meditech 6.0): UHN-XXXXXXXXXX (10-character alphanumeric)
- Affiliated clinic (standalone EMR): C\d{5} (C followed by 5 digits)
HIPAA Safe Harbor requires removing all 18 identifier categories, including "medical record numbers" (category 8). A de-identification tool that doesn't know these formats misses them entirely. The resulting "de-identified" dataset still contains every MRN, in all four facility formats.
ServiceNow's healthcare community specifically documents this pain point: healthcare IT teams attempting to identify PHI from HR work notes find that standard Presidio configurations detect SSNs and phone numbers while completely missing facility-specific MRNs.
The Regex Barrier
Building custom recognizers in Microsoft Presidio (the open-source foundation for many HIPAA tools) requires:
- Understanding the PatternRecognizer class
- Writing regex patterns in Python syntax
- Configuring YAML files for recognizer registration
- Understanding confidence scores and context words
- Testing with Python scripts
- Debugging failed recognizers
For healthcare IT professionals without Python backgrounds, this creates a substantial technical barrier. A compliance officer who knows exactly what the MRN:XXXXXXX format looks like still cannot configure a Presidio recognizer without either learning Python or waiting on an engineering ticket.
The typical result: the compliance gap remains open while the engineering ticket sits in a 6-8 week queue.
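To make the requirements listed above concrete, here is a minimal stdlib-only sketch of what a custom recognizer combines: a regex, a base confidence score, and context words that raise the score when they appear near a match. This is not Presidio's API. The class `SimpleRecognizer`, its fields, and the score values are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SimpleRecognizer:
    """Hypothetical stand-in for a pattern recognizer (not Presidio's class)."""
    entity: str
    regex: str
    base_score: float = 0.5        # assumed value, for illustration only
    context_boost: float = 0.35    # assumed value, for illustration only
    window: int = 20               # characters inspected before each match
    context_words: list = field(default_factory=list)

    def analyze(self, text):
        """Return (entity, matched_text, start, end, score) for every match."""
        results = []
        for m in re.finditer(self.regex, text):
            # Look just before the match for any supporting context word.
            before = text[max(0, m.start() - self.window):m.start()].lower()
            has_context = any(w.lower() in before for w in self.context_words)
            score = self.base_score + (self.context_boost if has_context else 0.0)
            results.append(
                (self.entity, m.group(), m.start(), m.end(), round(min(score, 1.0), 2))
            )
        return results


clinic_mrn = SimpleRecognizer(
    entity="CLINIC_MRN",
    regex=r"\bC\d{5}\b",
    context_words=["mrn", "medical record", "patient id"],
)

for hit in clinic_mrn.analyze("Patient ID C10482 seen in room C20001."):
    print(hit)
```

Both strings match the regex, but only the one preceded by "Patient ID" gets the boosted score. That is the role confidence scores and context words play in a real recognizer, and it is the part a non-programmer has to reason about on top of the regex itself.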
AI-Assisted Pattern Generation
The alternative: describe the pattern in plain language, receive a working regex.
Process:
- Open the custom entity builder
- Provide examples: "These look like MRN numbers from our system: MRN:1234567, MRN:9876543, MRN:0001234"
- AI generates pattern: MRN:\d{7}
- Test against 10 sample discharge summaries
- All MRNs detected? Save and apply.
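The article does not describe how the AI step derives a pattern from examples. As a rough illustration of the idea, the sketch below uses a simple deterministic heuristic instead of a model: keep the shared literal prefix, then generalize the rest into character classes with repeat counts. The helper names `char_class` and `infer_pattern` are hypothetical, and the heuristic assumes all examples share one layout.

```python
import itertools
import os
import re


def char_class(ch):
    """Map one character to a regex class, or to itself (escaped) for punctuation."""
    if ch.isdigit():
        return r"\d"
    if ch.isupper():
        return "[A-Z]"
    if ch.islower():
        return "[a-z]"
    return re.escape(ch)


def infer_pattern(examples):
    """Hypothetical helper: literal common prefix plus run-length-encoded classes."""
    prefix = os.path.commonprefix(examples)
    # Drop trailing digits from the shared prefix so that digits the examples
    # merely happen to share are generalized instead of being kept as literals.
    literal = prefix.rstrip("0123456789")
    shapes = {tuple(char_class(c) for c in ex[len(literal):]) for ex in examples}
    if len(shapes) != 1:
        raise ValueError("examples do not share one shape; describe them separately")
    (shape,) = shapes
    parts = []
    for cls, run in itertools.groupby(shape):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return re.escape(literal) + "".join(parts)


pattern = infer_pattern(["MRN:1234567", "MRN:9876543", "MRN:0001234"])
print(pattern)
print(bool(re.fullmatch(pattern, "MRN:5550123")))
```

A heuristic like this deliberately raises an error on mixed alphanumeric identifiers whose letter and digit positions vary between examples. That is the kind of case where a plain-language description ("10 characters, uppercase letters or digits") carries information the examples alone do not.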
For the multi-facility network with four MRN formats:
- Memorial Hospital: describe format → MRN:\d{7}
- St. Mary's: describe format → PT-\d{5}
- University Hospital: describe format → UHN-[A-Z0-9]{10}
- Affiliated clinic: describe format → C\d{5}
Create four custom entities, group into a "Network MRN Detection" preset, apply to all document processing. Total time: one afternoon of compliance officer work.
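As a sketch of what such a preset amounts to, the four patterns can be combined into one alternation with named groups so that every hit reports which facility format matched. The entity names, the `redact_mrns` helper, and the added `\b` word boundaries are assumptions for illustration, not part of any specific product.

```python
import re

# The four facility formats from the article, each wrapped in word boundaries.
NETWORK_MRN_PRESET = {
    "MEMORIAL_MRN": r"\bMRN:\d{7}\b",
    "ST_MARYS_MRN": r"\bPT-\d{5}\b",
    "UNIVERSITY_MRN": r"\bUHN-[A-Z0-9]{10}\b",
    "CLINIC_MRN": r"\bC\d{5}\b",
}

# One alternation with named groups: the group that matched names the entity.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{rx})" for name, rx in NETWORK_MRN_PRESET.items())
)


def redact_mrns(text):
    """Replace every preset match with a placeholder naming the matched entity."""
    return COMBINED.sub(lambda m: f"<{m.lastgroup}>", text)


note = (
    "Transferred from St. Mary's (PT-48213) to Memorial, MRN:0012345; "
    "clinic ref C90210, research id UHN-A1B2C3D4E5."
)
print(redact_mrns(note))
```

Keeping the patterns in one named mapping means a new acquisition adds one entry rather than a new processing pipeline.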
Validation for Safe Harbor Certification
HIPAA's Safe Harbor method requires that the covered entity "does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual."
For custom entity-based detection, validation demonstrates completeness:
Step 1: Sample extraction. Pull 100 discharge summaries from each facility type. Mix patient populations, departments, and time periods.
Step 2: Automated processing. Run all 400 documents through the custom entity detection.
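One way to make the Step 1 sample reproducible is to draw it with a fixed random seed, so the same 400 documents can be re-drawn later if the validation is questioned. The function name, the seed, and the synthetic document IDs below are assumptions. Uniform sampling within each facility only approximates the "mix" of populations, departments, and time periods; stratifying on those fields would need metadata this sketch does not model.

```python
import random


def draw_validation_sample(docs_by_facility, per_facility=100, seed=2024):
    """Draw the same number of documents from every facility, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated exactly
    sample = {}
    for facility, doc_ids in sorted(docs_by_facility.items()):
        if len(doc_ids) < per_facility:
            raise ValueError(f"{facility} has only {len(doc_ids)} documents")
        sample[facility] = rng.sample(doc_ids, per_facility)
    return sample


# Synthetic document IDs standing in for real discharge summaries.
corpus = {
    name: [f"{name}-{i:04d}" for i in range(500)]
    for name in ["memorial", "st_marys", "university", "clinic"]
}
sample = draw_validation_sample(corpus)
print(sum(len(ids) for ids in sample.values()))  # 4 facilities x 100 documents
```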
Step 3: Human validation sample. Manually review 20 processed documents (a 5% sample of the 400). Look for:
- Any strings that look like MRNs but weren't detected (false negatives)
- Any non-MRN strings that were incorrectly flagged (false positives)
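The two checks above can be tallied mechanically once the reviewer's findings are recorded as spans. The sketch below assumes both the detector output and the reviewer's labels are sets of `(doc_id, start, end)` tuples. That representation, the function name, and the toy spans are all invented for illustration.

```python
def score_detection(detected, labeled):
    """Compare detector spans with reviewer-labeled spans."""
    true_pos = detected & labeled
    return {
        # Real MRNs the reviewer found that the detector missed.
        "false_negatives": sorted(labeled - detected),
        # Spans the detector flagged that the reviewer says are not MRNs.
        "false_positives": sorted(detected - labeled),
        "recall": len(true_pos) / len(labeled) if labeled else 1.0,
        "precision": len(true_pos) / len(detected) if detected else 1.0,
    }


# Toy spans, not real validation results.
labeled = {("doc-01", 12, 23), ("doc-01", 80, 86), ("doc-02", 5, 13)}
detected = {("doc-01", 12, 23), ("doc-02", 5, 13), ("doc-02", 40, 46)}

report = score_detection(detected, labeled)
print(report["false_negatives"])
print(report["false_positives"])
```

For de-identification, the false negatives are the number to watch: each one is an MRN that would have shipped in the "de-identified" output.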
Step 4: Pattern refinement. If false negatives are found, refine the pattern or add context matching. If false positives are numerous, add word boundary constraints or context validation.
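A short illustration of the two refinements named in Step 4, using the affiliated clinic's format. The sample sentence is invented, and the context rule (the word "MRN" followed by at most three non-word characters) is one assumed way to express context validation in a plain regex.

```python
import re

text = "Lot ZC123456 stored in room C20001. Clinic MRN: C10482."

# Naive pattern: also fires inside the longer token "ZC123456".
naive = re.findall(r"C\d{5}", text)

# Word boundaries remove the partial match, but a room number still qualifies.
bounded = re.findall(r"\bC\d{5}\b", text)

# Context validation: only accept the identifier when "MRN" introduces it.
with_context = re.findall(r"\bMRN\b\W{0,3}(C\d{5})\b", text)

print(naive)         # ['C12345', 'C20001', 'C10482']
print(bounded)       # ['C20001', 'C10482']
print(with_context)  # ['C10482']
```

Requiring context trades false positives for possible false negatives: an MRN printed in a header without the "MRN" label would be missed, so a tightened pattern should be re-run against the validation sample.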
Step 5: Documentation. Record the custom entity definition, validation sample size, validation results, and the date of validation. This documentation supports Safe Harbor certification.
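The record from Step 5 can be kept as a small structured file alongside the entity definition. The layout below is a hypothetical sketch. The field names are illustrative, not a HIPAA-mandated schema, and the counts and date are placeholder values rather than real results.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class ValidationRecord:
    """Hypothetical record layout for one validated custom entity."""
    entity_name: str
    pattern: str
    documents_processed: int
    documents_reviewed: int
    false_negatives: int
    false_positives: int
    validated_on: str  # ISO 8601 date string

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only.
record = ValidationRecord(
    entity_name="CLINIC_MRN",
    pattern=r"\bC\d{5}\b",
    documents_processed=400,
    documents_reviewed=20,
    false_negatives=0,
    false_positives=2,
    validated_on=date(2025, 1, 15).isoformat(),
)
print(record.to_json())
```

Storing the exact pattern string with the results ties the evidence to the version of the entity that was actually tested. If the pattern changes later, the record shows that a new validation run is due.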
Beyond MRNs: Complete HIPAA Safe Harbor Coverage
After addressing the MRN detection gap, review all 18 Safe Harbor categories for completeness:
| Category | Standard Detection | Custom Needed? |
|---|---|---|
| 1. Names | ✓ NER model | No |
| 2. Geographic data | ✓ Location detection | No for state; Yes for facility-specific codes |
| 3. Dates | ✓ Date detection | No |
| 4. Phone numbers | ✓ Phone detection | No |
| 5. Fax numbers | ✓ Phone detection | No |
| 6. Email addresses | ✓ Email detection | No |
| 7. SSNs | ✓ SSN detection | No |
| 8. Medical record numbers | ✗ Not in default | Yes — institution-specific |
| 9. Health plan beneficiary numbers | Partial | Often yes — carrier-specific |
| 10. Account numbers | Partial | Often yes — billing account format |
| 11. Certificate/license numbers | Partial | Often yes — DEA + state-specific |
| 12. Vehicle identifiers | Partial | Rarely in clinical docs |
| 13. Device identifiers | Partial | Yes if medical devices documented |
| 14. Web URLs | ✓ URL detection | No |
| 15. IP addresses | ✓ IP detection | No |
| 16. Biometric identifiers | ✗ Text context | Rare in discharge summaries |
| 17. Full-face photographs | ✗ Image only | Out of scope for text processing |
| 18. Other unique identifiers | ✗ Not in default | Yes — institution-specific |
For clinical text processing, categories 8, 9, 10, and 18 most commonly require custom entity addition.
The Clinical Documentation Context
Discharge summaries, clinical notes, and operative reports are the primary documents requiring HIPAA de-identification for research sharing. These documents contain:
- MRNs in headers and footers
- Account numbers in billing sections
- Dates throughout (admission, procedures, labs, medications)
- Physician names and DEA numbers
- Referring physician information
- Insurance member IDs
Custom entity detection for institution-specific formats (MRNs, account numbers) combined with standard detection for universal formats (dates, names, phone numbers) provides the complete coverage that HIPAA Safe Harbor requires.
Conclusion
HIPAA de-identification without custom entity configuration is not HIPAA Safe Harbor de-identification. MRN formats are defined by each institution, so standard PII tools miss them. Compliance teams cannot wait for engineering queues to close this gap.
AI-assisted pattern generation collapses the compliance gap from 6-8 weeks of engineering time to one afternoon of compliance officer work. Describe the format, validate against samples, deploy to production.