HIPAA De-Identification Without a Regex PhD: AI-Assisted MRN Pattern Creation

Every hospital's MRN format is different. Memorial uses MRN:XXXXXXX, St. Mary's uses PT-XXXXX, University Hospital uses UHN-XXXXXXXXXX. Standard PII tools ship without recognizers for these facility-specific formats, so they miss them entirely. AI-assisted pattern generation adds detection in 5 minutes without regex expertise.

March 5, 2026 · 6 min read
HIPAA de-identification, MRN pattern, healthcare IT, AI pattern generation, PHI detection

Your hospital's Medical Record Number format doesn't exist in any standard PII tool. Here's how to add it in 5 minutes without writing a single line of regex.

Healthcare IT teams implementing HIPAA de-identification face a specific challenge that doesn't exist in other sectors: the identifier they most need to detect — the Medical Record Number — is defined by their own institution, not by any national standard.

The result: every implementation of HIPAA de-identification in a healthcare system requires custom configuration. Without custom configuration, MRNs pass through "de-identified" datasets undetected.

The Multi-Facility MRN Chaos

Healthcare networks built through years of acquisition contain facilities with legacy EHR systems — each with its own MRN format established decades ago:

  • Memorial Hospital (Epic since 2015): MRN:XXXXXXX (7-digit numeric with prefix)
  • St. Mary's (legacy Cerner system): PT-YYYYY (5-digit with patient prefix)
  • University Hospital (Meditech 6.0): UHN-XXXXXXXXXX (10-character alphanumeric)
  • Affiliated clinic (standalone EMR): C\d{5} (C followed by 5 digits)

HIPAA Safe Harbor requires removing all 18 identifier categories, including "medical record numbers" (category 8). A de-identification tool that doesn't know these formats misses them entirely, so the "de-identified" dataset still contains every MRN in all four facility formats.

ServiceNow's healthcare community specifically documents this pain point: healthcare IT teams attempting to identify PHI from HR work notes find that standard Presidio configurations detect SSNs and phone numbers while completely missing facility-specific MRNs.

The Regex Barrier

Building custom recognizers in Microsoft Presidio (the open-source foundation for many HIPAA tools) requires:

  • Understanding PatternRecognizer class
  • Writing regex patterns in Python syntax
  • Configuring YAML files for recognizer registration
  • Understanding confidence scores and context words
  • Testing with Python scripts
  • Debugging failed recognizers

For healthcare IT professionals without Python backgrounds, this creates a substantial technical barrier. A compliance officer who knows the MRN:XXXXXXX format by heart still cannot configure a Presidio recognizer without either learning Python or waiting on an engineering ticket.

The typical result: the compliance gap remains open while the engineering ticket sits in a 6-8 week queue.
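To make the concepts in the list above concrete (a pattern, a confidence score, and context words), here is a minimal standard-library sketch. It is a simplified stand-in and not Presidio's API: the `Detection` class, the `find_entity` function, and the score values are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    entity: str
    text: str
    start: int
    end: int
    score: float


def find_entity(text, entity, pattern, base_score=0.5,
                context_words=(), boost=0.25, window=40):
    """Regex match plus a score boost when a context word precedes the match."""
    detections = []
    for m in re.finditer(pattern, text):
        # Look only at the characters just before the match for context words.
        preceding = text[max(0, m.start() - window):m.start()].lower()
        has_context = any(word in preceding for word in context_words)
        score = min(1.0, base_score + (boost if has_context else 0.0))
        detections.append(Detection(entity, m.group(), m.start(), m.end(), score))
    return detections


hits = find_entity(
    "Patient record MRN:1234567 reviewed at discharge.",
    entity="MRN",
    pattern=r"\bMRN:\d{7}\b",  # word boundaries are an added assumption
    context_words=("patient", "medical record"),
)
print(hits)
```

Even this reduced version shows why the task lands in an engineering queue: the person who knows the format also has to reason about scoring, context windows, and regex escaping.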

AI-Assisted Pattern Generation

The alternative: describe the pattern in plain language, receive a working regex.

Process:

  1. Open the custom entity builder
  2. Provide examples: "These look like MRNs from our system: MRN:1234567, MRN:9876543, MRN:0001234"
  3. AI generates pattern: MRN:\d{7}
  4. Test against 10 sample discharge summaries
  5. All MRNs detected? Save and apply.
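The shape of this workflow (examples in, pattern out, then a check against sample documents) can be sketched in a few lines. The heuristic below is a deliberately naive stand-in, not the AI model the process describes: it only handles "literal prefix plus a fixed number of digits". The function names, the added word boundaries, and the sample documents are all hypothetical.

```python
import re


def infer_pattern(examples):
    """Naive stand-in for pattern generation: shared prefix + fixed digit count."""
    prefix = examples[0]
    for example in examples[1:]:
        while not example.startswith(prefix):
            prefix = prefix[:-1]
    prefix = prefix.rstrip("0123456789")  # shared leading digits belong to the number
    tails = [example[len(prefix):] for example in examples]
    lengths = {len(tail) for tail in tails}
    if len(lengths) != 1 or not all(tail.isdigit() for tail in tails):
        raise ValueError("examples are not 'literal prefix + N digits'")
    digit_count = lengths.pop()
    return rf"\b{re.escape(prefix)}\d{{{digit_count}}}\b"


def missed_ids(pattern, documents, expected_ids):
    """Return the expected identifiers that the pattern failed to find."""
    found = set()
    for doc in documents:
        found.update(re.findall(pattern, doc))
    return sorted(set(expected_ids) - found)


pattern = infer_pattern(["MRN:1234567", "MRN:9876543", "MRN:0001234"])
samples = [
    "Discharge summary. MRN:5550001. Follow up in 2 weeks.",
    "Header MRN:0042317 | Attending: (redacted)",
]
print(pattern)
print(missed_ids(pattern, samples, ["MRN:5550001", "MRN:0042317"]))
```

An empty list from `missed_ids` corresponds to step 5: every known MRN in the samples was detected, so the pattern can be saved.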

For the multi-facility network with four MRN formats:

  • Memorial Hospital: describe format → MRN:\d{7}
  • St. Mary's: describe format → PT-\d{5}
  • University Hospital: describe format → UHN-[A-Z0-9]{10}
  • Affiliated clinic: describe format → C\d{5}

Create four custom entities, group into a "Network MRN Detection" preset, apply to all document processing. Total time: one afternoon of compliance officer work.
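As a rough illustration of what such a preset amounts to, the four patterns can be held in one mapping and applied together. The entity names, the `detect` function, and the sample sentence are assumptions for this sketch; the patterns are the four listed above.

```python
import re

# One entry per facility; keys are hypothetical entity names.
NETWORK_MRN_PRESET = {
    "MEMORIAL_MRN": r"MRN:\d{7}",
    "ST_MARYS_MRN": r"PT-\d{5}",
    "UNIVERSITY_MRN": r"UHN-[A-Z0-9]{10}",
    "CLINIC_MRN": r"C\d{5}",
}


def detect(text, preset=NETWORK_MRN_PRESET):
    """Return (entity, matched_text, start, end) tuples in document order."""
    hits = []
    for entity, pattern in preset.items():
        for m in re.finditer(pattern, text):
            hits.append((entity, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda hit: hit[2])


text = ("Transferred from St. Mary's (PT-48213) to Memorial, MRN:1234567. "
        "Clinic ref C00412; University record UHN-A1B2C3D4E5.")
for hit in detect(text):
    print(hit)
```

Keeping each facility as a separate entity, instead of one combined regex, makes it possible to validate and refine the formats independently.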

Validation for Safe Harbor Certification

Beyond removing the 18 identifier categories, HIPAA's Safe Harbor method requires that the covered entity "does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual."

For custom entity-based detection, validation demonstrates completeness:

Step 1: Sample extraction. Pull 100 discharge summaries from each facility type. Mix patient populations, departments, and time periods.

Step 2: Automated processing. Run all 400 documents through the custom entity detection.

Step 3: Human validation sample. Manually review 20 processed documents (5% sample). Look for:

  • Any strings that look like MRNs but weren't detected (false negatives)
  • Any non-MRN strings that were incorrectly flagged (false positives)

Step 4: Pattern refinement. If false negatives are found: refine the pattern or add context matching. If false positives are numerous: add word boundary constraints or context validation.
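A small sketch of both kinds of refinement, using made-up text. The first pair shows word boundaries removing false positives for the short clinic format; the second shows a relaxed pattern recovering a false negative caused by casing and spacing. The sample strings and the relaxed pattern are illustrative assumptions, not observed data.

```python
import re

# False positives: the bare pattern also fires inside longer codes.
note = "Clinic MRN C00412. Lot ABC123456 expired. Ref C0041299."
loose = re.findall(r"C\d{5}", note)        # ['C00412', 'C12345', 'C00412']
strict = re.findall(r"\bC\d{5}\b", note)   # ['C00412']

# False negatives: a header typed in lowercase with a space after the colon.
header = "mrn: 1234567"
original = re.findall(r"MRN:\d{7}", header)  # []
relaxed = re.findall(r"\bMRN:\s?\d{7}\b", header, flags=re.IGNORECASE)

print(loose, strict)
print(original, relaxed)
```

Each refinement should be re-run against the full sample, since loosening a pattern to fix false negatives can reintroduce false positives.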

Step 5: Documentation. Record: the custom entity definition, validation sample size, validation results, and the date of validation. This documentation supports Safe Harbor certification.
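One possible shape for that record is sketched below. The field names and the `summarize` helper are hypothetical, and the counts are placeholders rather than real validation results; only the 400-document run and the 20-document review come from the steps above.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class ValidationRecord:
    entity: str
    pattern: str
    documents_processed: int
    documents_reviewed: int
    true_positives: int
    false_positives: int
    false_negatives: int
    validated_on: str


def summarize(record):
    """Add review fraction, recall, and precision to the stored fields."""
    tp, fp, fn = record.true_positives, record.false_positives, record.false_negatives
    summary = asdict(record)
    summary["review_fraction"] = record.documents_reviewed / record.documents_processed
    summary["recall"] = tp / (tp + fn) if (tp + fn) else None
    summary["precision"] = tp / (tp + fp) if (tp + fp) else None
    return summary


record = ValidationRecord(
    entity="MEMORIAL_MRN",
    pattern=r"\bMRN:\d{7}\b",
    documents_processed=400,
    documents_reviewed=20,
    true_positives=38,   # placeholder count
    false_positives=2,   # placeholder count
    false_negatives=0,   # placeholder count
    validated_on=date.today().isoformat(),
)
report = summarize(record)
print(report)
```

For de-identification, recall is the number to watch: a single false negative is an MRN left in the dataset, while a false positive only removes a harmless string.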

Beyond MRNs: Complete HIPAA Safe Harbor Coverage

After addressing the MRN detection gap, review all 18 Safe Harbor categories for completeness:

| Category | Standard Detection | Custom Needed? |
| --- | --- | --- |
| 1. Names | ✓ NER model | No |
| 2. Geographic data | ✓ Location detection | No for state; Yes for facility-specific codes |
| 3. Dates | ✓ Date detection | No |
| 4. Phone numbers | ✓ Phone detection | No |
| 5. Fax numbers | ✓ Phone detection | No |
| 6. Email addresses | ✓ Email detection | No |
| 7. SSNs | ✓ SSN detection | No |
| 8. Medical record numbers | ✗ Not in default | Yes (institution-specific) |
| 9. Health plan beneficiary numbers | Partial | Often yes (carrier-specific) |
| 10. Account numbers | Partial | Often yes (billing account format) |
| 11. Certificate/license numbers | Partial | Often yes (DEA + state-specific) |
| 12. Vehicle identifiers | Partial | Rarely in clinical docs |
| 13. Device identifiers | Partial | Yes if medical devices documented |
| 14. Web URLs | ✓ URL detection | No |
| 15. IP addresses | ✓ IP detection | No |
| 16. Biometric identifiers | ✗ Text context | Rare in discharge summaries |
| 17. Full-face photographs | ✗ Image only | Out of scope for text processing |
| 18. Other unique identifiers | ✗ Not in default | Yes (institution-specific) |

For clinical text processing, categories 8, 9, 10, and 18 most commonly require custom entity addition.

The Clinical Documentation Context

Discharge summaries, clinical notes, and operative reports are the primary documents requiring HIPAA de-identification for research sharing. These documents contain:

  • MRNs in headers and footers
  • Account numbers in billing sections
  • Dates throughout (admission, procedures, labs, medications)
  • Physician names and DEA numbers
  • Referring physician information
  • Insurance member IDs

Custom entity detection for institution-specific formats (MRNs, account numbers) combined with standard detection for universal formats (dates, names, phone numbers) provides the complete coverage that HIPAA Safe Harbor requires.
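A minimal sketch of that combination, assuming regex stand-ins for both sides: the `ACCT-` account format is invented for illustration, and the phone and date patterns are simplified placeholders for what a standard detector would provide. Names are omitted because they need an NER model, not a regex.

```python
import re

CUSTOM = {
    "MRN": r"\bMRN:\d{7}\b",
    "ACCOUNT": r"\bACCT-\d{8}\b",  # hypothetical billing account format
}
STANDARD = {
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",   # simplified US phone shape
    "DATE": r"\b\d{2}/\d{2}/\d{4}\b",    # simplified MM/DD/YYYY shape
}


def redact(text, *pattern_sets):
    """Replace every detected span with an <ENTITY> placeholder."""
    spans = []
    for patterns in pattern_sets:
        for entity, pattern in patterns.items():
            spans += [(m.start(), m.end(), entity) for m in re.finditer(pattern, text)]
    pieces, cursor = [], 0
    for start, end, entity in sorted(spans):
        if start < cursor:  # skip spans overlapping one already redacted
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<{entity}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


summary = "MRN:1234567 admitted 03/05/2026, acct ACCT-00451234, call 555-010-1234."
print(redact(summary, CUSTOM, STANDARD))
```

Passing the custom and standard sets separately keeps the institution-specific definitions reviewable on their own, which matters when they are the part that has to be documented and revalidated.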

Conclusion

HIPAA de-identification without custom entity configuration is not HIPAA Safe Harbor de-identification. Every healthcare institution's MRN format is unique. Standard PII tools miss them. Compliance teams cannot wait for engineering queues to close this gap.

AI-assisted pattern generation collapses the compliance gap from 6-8 weeks of engineering time to one afternoon of compliance officer work. Describe the format, validate against samples, deploy to production.
