Multilingual PII Detection for GDPR Compliance

Germany's Steuer-ID, France's NIR, and Sweden's Personnummer each require their own detection logic.

George Curta · March 3, 2026 · 10 min read

multilingual, GDPR, NLP, PII detection, European compliance, spaCy, XLM-RoBERTa

Updated 2026

The Hidden GDPR Compliance Gap

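GDPR has no language preference. Article 4(1) defines "personal data" without reference to the language in which it appears, so PII is equally protected in every language. Missing a German Steuer-ID carries the same legal risk as missing a US Social Security number.

Most PII detection tools, however, do have a language preference.

Mainstream commercial and open-source tools were built for English text, and their entity detectors show it: they perform well on US Social Security numbers, US driver's licenses, and North American (NANP) phone formats, while their detectors for non-English national IDs are less accurate, less actively maintained, and miss more instances.

For companies operating across EU member states, this creates a compliance gap: the tool reports that detection is complete, yet non-English identifiers remain in the data.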

Structural Differences Between EU Identifiers

Each EU member state has a national identifier with its own structure:

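Country	Identifier	Format
Germany	Steuer-ID	11 digits, with checksum
France	NIR (social security number)	15 digits, encoding sex and place of birth
Sweden	Personnummer	YYYYMMDD-XXXX
Poland	PESEL	11 digits, encoding date of birth
Netherlands	BSN	9 digits, mod-11 validation
Spain	DNI/NIE	Alphanumeric, with check letter
Italy	Codice Fiscale	16 alphanumeric characters
Portugal	NIF	9 digits, with checksum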

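These formats are not interchangeable. A regular expression built for the Spanish DNI cannot detect a German Steuer-ID. Organizations handling several EU markets need detection logic specific to each market they target.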
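To make the point concrete, here is a minimal sketch of per-country candidate patterns. The pattern names and the `candidate_types` helper are assumptions made for illustration, and the regular expressions only capture the surface shapes listed in the table; they are not a complete or production-grade rule set.

```python
import re

# Illustrative surface patterns only (hypothetical names). Real detectors
# would add checksum validation and context on top of these shapes.
CANDIDATE_PATTERNS = {
    "DE_STEUER_ID": re.compile(r"\b\d{11}\b"),          # 11 digits
    "NL_BSN": re.compile(r"\b\d{9}\b"),                 # 9 digits
    "ES_DNI": re.compile(r"\b\d{8}[A-Z]\b"),            # 8 digits + check letter
    "SE_PERSONNUMMER": re.compile(r"\b\d{8}-\d{4}\b"),  # YYYYMMDD-XXXX
}


def candidate_types(text: str) -> list[str]:
    """Return the identifier types whose pattern matches somewhere in text."""
    return [name for name, pattern in CANDIDATE_PATTERNS.items()
            if pattern.search(text)]


print(candidate_types("Steuer-ID: 12345678901"))  # only the German pattern fires
print(candidate_types("DNI 12345678Z"))           # only the Spanish pattern fires
```

The Spanish pattern never fires on the 11-digit German number, and vice versa, which is why each target market needs its own entry in such a registry.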

Why Checksum Validation Matters

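Checksums in EU national IDs are more than a format requirement: they are a way to verify that a number is a well-formed identifier rather than an arbitrary digit string.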

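The Dutch BSN is validated with a mod-11 algorithm. A 9-digit sequence may look like a BSN, but if it fails the mod-11 check it is not a valid BSN. It could be random digits, a typing error, or an ID from another country.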
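The mod-11 check can be sketched as follows. This assumes the commonly documented BSN variant of the Dutch "11-proef", in which the nine digits are weighted 9, 8, 7, 6, 5, 4, 3, 2 and -1 and the weighted sum must be divisible by 11. The function name `is_valid_bsn` is hypothetical, and the sketch only accepts the 9-digit form.

```python
def is_valid_bsn(candidate: str) -> bool:
    """Check a 9-digit string against the BSN mod-11 rule (sketch)."""
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    # Weights for positions 1..9; the last digit is weighted -1.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(digit) * weight for digit, weight in zip(candidate, weights))
    # An all-zero string sums to 0 and is rejected explicitly.
    return total != 0 and total % 11 == 0


print(is_valid_bsn("111222333"))  # weighted sum is 66, divisible by 11
print(is_valid_bsn("123456789"))  # weighted sum is 147, not divisible by 11
```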

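A PII detection tool without checksum validation faces a trade-off. If it matches every 9-digit sequence, it produces false positives by flagging numbers that are not BSNs. If it compensates with stricter context rules, such as requiring a nearby keyword, it produces false negatives by missing valid BSNs that appear without that context.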
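The effect on false positives can be illustrated with a small scan over free text. This is a sketch under the same mod-11 assumption as the BSN rule described above; `find_bsn`, `passes_mod11`, and the sample sentence are made up for the example.

```python
import re

BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def passes_mod11(digits: str) -> bool:
    """Return True if a 9-digit string satisfies the BSN mod-11 rule."""
    total = sum(int(d) * w for d, w in zip(digits, BSN_WEIGHTS))
    return total != 0 and total % 11 == 0


def find_bsn(text: str, validate: bool = True) -> list[str]:
    """Find standalone 9-digit runs, optionally keeping only valid BSNs."""
    candidates = re.findall(r"(?<!\d)\d{9}(?!\d)", text)
    return [c for c in candidates if not validate or passes_mod11(c)]


sample = "Order 123456789 was placed by the customer with BSN 111222333."
print(find_bsn(sample, validate=False))  # both 9-digit runs are flagged
print(find_bsn(sample, validate=True))   # only the checksum-valid one remains
```

With validation switched off, the order number would be redacted as if it were a BSN. With validation on, only the number that passes the check is kept, without any need for keyword context.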

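In a GDPR compliance setting, false positives cause over-redaction, which harms data usability, while false negatives leave PII unprotected, which creates compliance risk.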

Language Detection as the First Step

Effective multilingual PII detection starts with language identification. For EU business documents, a common approach is:

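Detect the language at the document or paragraph level
Load the detection rule set for the detected language
Apply the language-specific entity detectors
Handle mixed-language segments with cross-lingual neural NER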
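The first three steps above can be sketched as a small dispatch pipeline. Everything named here is an assumption for illustration: the stop-word lists stand in for a real language identifier, the `RULES` registry holds simplified surface patterns (the NIR pattern, for instance, ignores Corsican department codes), and the cross-lingual NER step is left out because it depends on a trained model.

```python
import re
from typing import Optional

# Toy stop-word lists standing in for a real language identifier (assumption).
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit"},
    "fr": {"le", "la", "les", "et", "est", "des", "avec"},
    "nl": {"de", "het", "een", "en", "is", "niet", "met"},
}

# Hypothetical per-language rule sets with simplified surface patterns.
RULES = {
    "de": {"DE_STEUER_ID": re.compile(r"(?<!\d)\d{11}(?!\d)")},
    "fr": {"FR_NIR": re.compile(r"(?<!\d)[12]\d{14}(?!\d)")},
    "nl": {"NL_BSN": re.compile(r"(?<!\d)\d{9}(?!\d)")},
}


def detect_language(text: str) -> Optional[str]:
    """Step 1: pick the language whose stop words occur most often."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def detect_pii(text: str) -> tuple[Optional[str], list[tuple[str, str]]]:
    """Steps 2 and 3: load the rule set for the language and apply it."""
    lang = detect_language(text)
    hits = []
    for name, pattern in RULES.get(lang, {}).items():
        hits.extend((name, match.group()) for match in pattern.finditer(text))
    return lang, hits


print(detect_pii("Die Steuer-ID ist 12345678901 und nicht gültig."))
print(detect_pii("Het BSN is 111222333 en niet geheim."))
```

When no language can be identified, this sketch returns no hits; a fuller pipeline would fall back to a language-independent model at that point.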

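Mixed-language documents, such as Swiss contracts in French and German or international transaction documents involving counterparties from several countries, need an additional processing layer that detects language switches and changes the detection context accordingly.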
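One way to picture that extra layer is to label each paragraph with a language and merge adjacent paragraphs that share a label, so each resulting segment can be handed to the matching rule set. This is a sketch only: `segment_by_language` and `toy_detect` are hypothetical names, and the keyword-based detector is a stand-in for a real language identifier.

```python
from itertools import groupby
from typing import Callable, Optional


def segment_by_language(
    text: str, detect: Callable[[str], Optional[str]]
) -> list[tuple[Optional[str], str]]:
    """Split text into paragraphs and merge neighbours with the same language."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    labelled = [(detect(p), p) for p in paragraphs]
    return [
        (lang, "\n\n".join(paragraph for _, paragraph in group))
        for lang, group in groupby(labelled, key=lambda item: item[0])
    ]


def toy_detect(paragraph: str) -> Optional[str]:
    """Stand-in detector keyed on a few function words (assumption)."""
    words = set(paragraph.lower().split())
    if words & {"der", "die", "und"}:
        return "de"
    if words & {"le", "la", "et"}:
        return "fr"
    return None


contract = (
    "Der Vertrag und die Anlage.\n\n"
    "Die Parteien und der Ort.\n\n"
    "Le contrat et la signature."
)
for lang, segment in segment_by_language(contract, toy_detect):
    print(lang, repr(segment))
```

Paragraph boundaries are a coarse unit. Language switches inside a sentence would need finer segmentation or a cross-lingual model.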

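anonym.legal covers 48 languages and more than 285 entity types, including checksum validation for EU national identifiers. See the entity detection overview for the full language and entity coverage.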

Enforcement Reality

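GDPR regulators have opened investigations into organizations that use inadequate PII tooling. The core question is not "was a tool deployed" but "was the tool adequate".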

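For companies operating in Germany, the BfDI (Federal Commissioner for Data Protection and Freedom of Information) examines detection of German-specific identifiers (Steuer-ID, Krankenversicherungsnummer). In France, the CNIL examines NIR and French phone formats. In Spain, the AEPD examines DNI/NIE formats.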

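A tool that covers only English-language identifiers does not meet these language-specific requirements, even if its vendor claims it is "GDPR compliant".

See the legal compliance overview for how anonym.legal meets the specific requirements of EU regulators.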

About this page

We update this page when our platform or the law changes.

Read our founder note for how we work.

Each change shows up in the timestamp at the top.

We follow these rules

GDPR (EU 2016/679).
ISO/IEC 27001:2022.
NIS2 (EU 2022/2555).
HIPAA safe harbor under 45 CFR § 164.514(b)(2).

Our promise

We do not sell your data.

We do not train models on your text.

We store your files in Germany.

You can delete your account at any time.

You own your work.

Where we run

Our company HQ is in Saarbrücken, Germany. Our servers run in Hetzner's Falkenstein datacenter.

Hetzner holds ISO 27001 certification.

All data stays in the EU.

Backups run every day.

Need help?

Email support@anonym.legal.

We reply within one business day.

How we test

We run a full check suite on every release.

Each surface gets its own sweep script and report.

Human reviewers spot-check the output each week.

We track recall and precision on a labelled set.

Bad runs block the deploy.

What we never do

We never sell your information to third parties.
We never train models on what you upload.
We never keep your work after you delete it.
We never share keys with any outside firm.
We never run ads inside the product.

Plans in plain words

We sell credits, not seats.

One credit covers one short job.

Long jobs use a few credits each.

You can top up at any time.

Unused credits roll over each month.

Read the plans page for current rates.

Who built this

A small team of engineers and lawyers built this.

We ship from Europe and work in the open.

Our founder note spells out why we started.

Where to start

How the parts fit

A browser add-on cleans text inside Chrome.

A Word plug-in handles drafts in Office.

A small desktop tool works on whole folders.

An agent protocol link feeds large models safely.

All four share one core engine and one rule set.

Words from our team

We started this work after a lunch about cookies.

One friend kept getting odd ads on her phone.

We asked why a court file leaked through a draft.

We sketched the first build on a napkin that week.

By month three we had a tiny demo for a friend.

She used it on her first case the next day.

Common questions we hear

Can the tool read scanned PDFs? Yes, with OCR.

Does it work on long files? Yes, in small chunks.

Can I roll my own rule set? Yes, save it as a preset.

Does it run offline? The desktop build runs offline.

Do you keep my files? No, the cloud build wipes after each run.

Will it learn from my work? No, we never train on inputs.

A short tour of the workflow

Upload a file or paste a snippet of prose.

Pick the entities you want gone from the draft.

Choose a method: replace, mask, hash, encrypt, or redact.

Press run and watch the side panel show each hit.

Skim the result and tweak any rule that misfired.

Save the cleaned file or send it to a teammate.
