By George Curta · Last updated 2026-06-04


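HIPAA Medical Record Number Detection: No Regex Expertise Required

Every hospital formats its medical record numbers (MRNs) differently. Memorial uses MRN:XXXXXXX, St. Mary's uses PT-YYYYY, and University Hospital uses UHN-XXXXXXXXXX.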


Tags: HIPAA de-identification, MRN pattern, healthcare IT, AI pattern generation, PHI detection


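Your hospital's MRN format is not built into any standard PII tool. Here is how to add it in five minutes without writing code.

Healthcare IT teams face a HIPAA compliance problem that other industries do not: the identifier they most need to detect, the Medical Record Number, is defined by each hospital on its own, with no national standard.

Every HIPAA de-identification project therefore needs custom configuration. Without it, MRNs slip through unnoticed in files labeled "de-identified".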

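The Multi-Facility MRN Problem

Hospital networks assembled through mergers and acquisitions keep their legacy EHR systems, and each system has its own MRN format:

Memorial Hospital (Epic): MRN:XXXXXXX, a prefix followed by 7 digits
St. Mary's (Cerner): PT-YYYYY, a patient prefix followed by 5 digits
University Hospital (Meditech): UHN-XXXXXXXXXX, a prefix followed by 10 alphanumeric characters
Clinic (standalone EMR): C\d{5}, the letter C followed by 5 digits

HIPAA Safe Harbor requires removing all 18 categories of identifiers, and category 8 is the medical record number. A tool that does not know your format will miss it: the file looks clean, but it is not.

The ServiceNow healthcare community has noted exactly this problem: standard tools recognize Social Security numbers and phone numbers, yet miss facility-specific MRNs every time.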

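The Regex Barrier

Adding a custom rule in Microsoft Presidio (the open-source engine underneath many HIPAA tools) takes real technical skill:

You need to understand the PatternRecognizer class
You must write the regular expression in Python syntax
You must configure YAML files
You must tune confidence scores
You must test and debug Python scripts

A compliance officer who knows the MRN format cannot do this alone. The work usually ends up in the engineering queue and waits 6 to 8 weeks, and the gap stays open the whole time.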

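AI-Assisted Pattern Generation

There is a faster way: describe the format in plain language and get a working regular expression.

Steps:

Open the custom entity builder
Provide examples: "Our MRNs look like this: MRN:1234567, MRN:9876543, MRN:0001234"
The AI generates the rule: MRN:\d{7}
Test it against 10 sample records
All MRNs detected? Save and deploy.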
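
To make the generate-then-test loop above concrete, here is a minimal Python sketch. It is not the product's AI generator: `infer_pattern` is a hypothetical, deterministic stand-in that only handles a literal prefix followed by a fixed-length run of digits or uppercase alphanumerics, and `records_without_match` and the sample records are likewise invented for illustration.

```python
import re
from os.path import commonprefix


def infer_pattern(examples):
    """Hypothetical stand-in for the AI step: literal prefix + fixed-length body."""
    # Shared leading text; trailing digits are treated as part of the variable body.
    prefix = commonprefix(examples).rstrip("0123456789")
    tails = [e[len(prefix):] for e in examples]
    lengths = {len(t) for t in tails}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("examples must share one fixed-length format")
    n = lengths.pop()
    if all(t.isdigit() for t in tails):
        body = rf"\d{{{n}}}"
    elif all(re.fullmatch(r"[A-Z0-9]+", t) for t in tails):
        body = rf"[A-Z0-9]{{{n}}}"
    else:
        raise ValueError("unsupported characters in the variable part")
    return re.escape(prefix) + body


def records_without_match(pattern, records):
    """Return the sample records in which the pattern finds nothing."""
    rx = re.compile(pattern)
    return [r for r in records if not rx.search(r)]


pattern = infer_pattern(["MRN:1234567", "MRN:9876543", "MRN:0001234"])
sample_records = [
    "Admit note, MRN:1234567, bed 4",
    "Discharge summary for MRN:0001234",
    "Lab result, MRN: 7654321",  # stray space after the colon
]
print(pattern)
print(records_without_match(pattern, sample_records))
```

The third record shows why the test step matters: a single stray space is enough for the rule to miss an MRN that a human reader would spot immediately.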

For a hospital network with four MRN formats:

Memorial Hospital → MRN:\d{7}
St. Mary's → PT-\d{5}
University Hospital → UHN-[A-Z0-9]{10}
Clinic → C\d{5}

Create four custom entities, combine them into a preset, and run it against every file. Time required: one afternoon.
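
As an illustration of what combining the four rules into one preset amounts to, the sketch below joins them into a single Python regular expression. The entity names (`MRN_MEMORIAL` and so on), the sample sentence, and the `find_mrns` helper are assumptions made for this example, not names from any product or library.

```python
import re

# Patterns copied from the list above; the entity names are illustrative.
MRN_PATTERNS = {
    "MRN_MEMORIAL": r"MRN:\d{7}",
    "MRN_ST_MARYS": r"PT-\d{5}",
    "MRN_UNIVERSITY": r"UHN-[A-Z0-9]{10}",
    "MRN_CLINIC": r"C\d{5}",  # unbounded as written; may need word boundaries
}

# One alternation of named groups, so a single pass labels each hit with its format.
PRESET = re.compile(
    "|".join(f"(?P<{name}>{rx})" for name, rx in MRN_PATTERNS.items())
)


def find_mrns(text):
    """Return (entity, matched_text, start, end) for every MRN-like span."""
    return [
        (m.lastgroup, m.group(), m.start(), m.end())
        for m in PRESET.finditer(text)
    ]


text = "Pt MRN:1234567 transferred from PT-54321, ref UHN-AB12345678 and C00042."
for entity, value, start, end in find_mrns(text):
    print(entity, value, start, end)
```

Keeping each format as its own named group means a later report can say which site's rule fired, which helps when one legacy system turns out to use a variant nobody documented.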

For the full walkthrough, see Custom MRN Detection in a No-Code HIPAA Pipeline.

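Safe Harbor Verification

HIPAA Safe Harbor requires that the 18 identifier categories be removed and that the covered entity have no "actual knowledge" that the remaining information could be used to identify an individual (45 CFR §164.514(b)(2)).

The verification process below demonstrates that your custom rules cover all 18 identifier categories.

Step 1: Draw a sample. Pull 100 records from each site, covering different time periods and departments.

Step 2: Run detection. Process all 400 files with the custom rules.

Step 3: Review manually. Hand-review 20 files (a 5% sample) for missed MRNs and false positives.

Step 4: Refine the rules. Missed MRNs? Loosen the pattern. Too many false positives? Add word boundaries.

Step 5: Document. Record the rules, sample size, results, and date. This record is your Safe Harbor evidence.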
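
Both directions of the refinement in step 4 can be sketched in a few lines of Python. The sample sentences, and the decision to allow one optional whitespace character after the colon, are invented for illustration; the base patterns are the ones used earlier in this article.

```python
import re

# Too strict: misses an MRN typed with a space after the colon.
mrn_strict = re.compile(r"MRN:\d{7}")
# Loosened: allow one optional whitespace character after the colon.
mrn_loose = re.compile(r"MRN:\s?\d{7}")

# Too loose: also fires inside longer codes.
clinic_loose = re.compile(r"C\d{5}")
# Tightened with word boundaries on both sides.
clinic_bounded = re.compile(r"\bC\d{5}\b")

header = "MRN: 1234567 / prior MRN:7654321"
note = "Clinic ID C12345 on file. Lot ABC1234567 expired. Order C123456 pending."

print(mrn_strict.findall(header))    # ['MRN:7654321']
print(mrn_loose.findall(header))     # ['MRN: 1234567', 'MRN:7654321']
print(clinic_loose.findall(note))    # ['C12345', 'C12345', 'C12345']
print(clinic_bounded.findall(note))  # ['C12345']
```

Each change trades one error type for the other, so it is worth re-running the full sample after every edit rather than checking only the record that prompted it.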

For what to record, see Explainable Redaction and HIPAA Audit Trails.
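
One way to produce a verification record from the manual review is sketched below in Python. The field names, the `evaluate` helper, and the three labelled snippets are assumptions for illustration only; adapt the fields to whatever your compliance team already keeps.

```python
import json
import re
from datetime import date


def evaluate(pattern, labelled):
    """Score a rule against hand-labelled documents.

    labelled is a list of (text, expected_mrns) pairs from the manual review.
    Duplicate MRNs within one document are counted once.
    """
    rx = re.compile(pattern)
    tp = fp = fn = 0
    for text, expected in labelled:
        found = {m.group(0) for m in rx.finditer(text)}
        truth = set(expected)
        tp += len(found & truth)   # detected and confirmed by the reviewer
        fp += len(found - truth)   # detected but not a real MRN
        fn += len(truth - found)   # real MRN the rule missed
    return {
        "pattern": pattern,
        "documents_reviewed": len(labelled),
        "true_positives": tp,
        "false_positives": fp,
        "missed": fn,
        "recall": tp / (tp + fn) if tp + fn else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        "date": date.today().isoformat(),
    }


labelled = [
    ("MRN:1234567 admitted via ED", ["MRN:1234567"]),
    ("Ref MRN: 7654321", ["MRN: 7654321"]),  # reviewer found one the rule misses
    ("No identifiers in this paragraph", []),
]
record = evaluate(r"MRN:\d{7}", labelled)
print(json.dumps(record, indent=2))
```

A recall below 1.0 in this record is the signal to return to step 4 before the rule set is treated as evidence.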

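Complete Safe Harbor Coverage

Once MRN detection is fixed, check all 18 identifier categories.

Category	Standard tool	Custom rule needed?
1. Names	NER model	No
2. Geographic data	Location detection	Not for state level; yes for site codes
3. Dates	Date detection	No
4. Phone numbers	Phone detection	No
5. Fax numbers	Phone detection	No
6. Email addresses	Email detection	No
7. Social Security numbers	SSN detection	No
8. Medical record numbers	Not built in	Yes, facility-specific
9. Health plan beneficiary numbers	Partial coverage	Usually, payer-specific formats
10. Account numbers	Partial coverage	Usually, billing formats
11. Certificate/license numbers	Partial coverage	Usually, state-level formats
12. Vehicle identifiers	Partial coverage	Rare in clinical documents
13. Device identifiers	Partial coverage	Yes, when records include devices
14. URLs	URL detection	No
15. IP addresses	IP detection	No
16. Biometric identifiers	Text context	Rare in discharge notes
17. Photographs	Images only	Out of scope for text
18. Other unique identifiers	Not built in	Yes, facility-specific

In clinical text, categories 8, 9, 10, and 18 most often need custom configuration.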

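Clinical Document Scenarios

Discharge notes, clinical notes, and operative reports are the main documents shared for research. They contain:

MRNs in page headers and footers
Account numbers in the billing section
Dates for every event: admission, surgery, labs, medication
Physician names and DEA numbers
Referring physician information
Insurance member IDs

Custom rules for facility-specific formats, combined with built-in rules for standard formats, provide complete Safe Harbor coverage.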
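
A minimal Python sketch of that combination follows. The SSN pattern stands in for a built-in rule and is deliberately simplistic, and the rule names, placeholder style, sample line, and `redact` helper are assumptions for this example rather than any tool's actual behavior.

```python
import re

RULES = {
    "MRN": r"\bMRN:\d{7}\b",          # custom, facility-specific
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",  # simplified stand-in for a built-in rule
}
COMBINED = re.compile("|".join(f"(?P<{k}>{v})" for k, v in RULES.items()))


def redact(text):
    """Replace each hit with a typed placeholder and return the list of hits."""
    hits = []

    def _swap(m):
        # Record which rule fired and where, then substitute the placeholder.
        hits.append((m.lastgroup, m.start(), m.end()))
        return f"<{m.lastgroup}>"

    return COMBINED.sub(_swap, text), hits


original = "MRN:1234567 | Acct 12-3456 | SSN 123-45-6789"
clean, hits = redact(original)
print(clean)  # <MRN> | Acct 12-3456 | SSN <SSN>
print(hits)
```

The account number survives because no rule covers it, which is the kind of gap a category-by-category check is meant to expose.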

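Conclusion

HIPAA de-identification without custom rules is not Safe Harbor de-identification. Every hospital's MRN format is unique, standard tools will miss it, and the compliance gap is real and stays open until you close it.

AI pattern generation shortens the fix from a 6-to-8-week engineering wait to an afternoon of compliance work: describe the format, test against real records, deploy, done.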

