EU National Tax IDs: PII Detection for GDPR Compliance
Why National Tax IDs Are Critical in the EU
National tax IDs (NTIDs) are among the most sensitive PII in the EU:
- Spain: DNI (Documento Nacional de Identidad) - 8 digits + 1 letter
- Poland: PESEL - 11 digits (includes birth date)
- Romania: CNP (Cod Numeric Personal) - 13 digits (includes birth date + gender)
- France: NIR (Numéro d'Inscription au Répertoire) - 15 digits
- Germany: Steuer-ID - 11 digits
Why critical? A national tax ID is directly linked to banking, healthcare, and social services. If one leaks, the affected person can become a victim of identity theft.
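To make the sensitivity concrete, here is a minimal sketch of how much a bare PESEL reveals. It assumes the standard PESEL layout (YYMMDD first, with a century offset added to the month field); the function name `pesel_birth_date` is hypothetical and chosen only for this illustration.

```python
from datetime import date

# Century offsets that PESEL adds to the two-digit month field
_CENTURY_BY_OFFSET = {0: 1900, 20: 2000, 40: 2100, 60: 2200, 80: 1800}


def pesel_birth_date(pesel: str) -> date:
    """Decode the birth date embedded in the first six digits of a PESEL."""
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        raise ValueError("PESEL must be exactly 11 digits")
    yy, mm, dd = int(pesel[0:2]), int(pesel[2:4]), int(pesel[4:6])
    offset = (mm // 20) * 20          # 0, 20, 40, 60 or 80
    month = mm - offset               # real calendar month
    # date() raises ValueError if the month or day is out of range
    return date(_CENTURY_BY_OFFSET[offset] + yy, month, dd)


print(pesel_birth_date("85010112345"))  # 1985-01-01
```

Anyone who sees the number learns the holder's date of birth without any lookup, which is one reason these identifiers deserve the same care as the records they unlock.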
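The GDPR Risk: Tax ID Leakage
National tax IDs are personal data under GDPR Article 4(1), and several provisions call for special care in handling them:
- Article 87: Member States may set specific conditions for processing national identification numbers; Article 9(1) applies in addition when the ID sits alongside special-category data such as health records
- Article 32: Appropriate technical and organisational security measures required (for example pseudonymisation and encryption)
- Articles 33 and 34: Breach notification to the supervisory authority (DPA) within 72 hours, and to the affected individuals when the risk is high
Many organizations miss national tax IDs in detection because:
- No standard format - every country is different
- Complex validation - many country-specific checksum algorithms
- Non-English context - the labels around an ID appear in local languages and scripts (Czech, Greek, and so on), so English-only context words miss them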
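As an illustration of what a country-specific checksum looks like, the sketch below validates a Spanish DNI: the control letter is found by taking the 8-digit number modulo 23 and indexing into a fixed letter table. The function name `is_valid_dni` is an assumption made for this example; it handles only the plain DNI form, not NIE numbers that start with X, Y, or Z.

```python
# Control-letter table for the Spanish DNI (index = number mod 23)
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def is_valid_dni(value: str) -> bool:
    """Return True if value is 8 digits followed by the matching control letter."""
    value = value.strip().upper()
    if len(value) != 9:
        return False
    number, letter = value[:8], value[8]
    if not (number.isascii() and number.isdigit()):
        return False
    return DNI_LETTERS[int(number) % 23] == letter


print(is_valid_dni("12345678Z"))  # True
print(is_valid_dni("12345678A"))  # False
```

Every other country in scope needs its own equivalent of this function, which is why format-only matching tends to be where implementations stop.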
The Solution: EU Tax ID Recognizer
Presidio Analyzer supports custom recognizers, which can be defined for the national tax IDs of all 27 EU member states:
Step 1: Define the EU Tax ID Patterns
{
  "tax_id_patterns": {
    "ES_DNI": {
      "regex": "\\b\\d{8}[A-Z]\\b",
      "example": "12345678Z",
      "validation": "checksum_dni",
      "risk_level": "CRITICAL"
    },
    "PL_PESEL": {
      "regex": "\\b\\d{11}\\b",
      "example": "85010112345",
      "validation": "contains_birth_date",
      "risk_level": "CRITICAL"
    },
    "RO_CNP": {
      "regex": "\\b\\d{13}\\b",
      "example": "1850101123451",
      "validation": "checksum_cnp + gender",
      "risk_level": "CRITICAL"
    },
    "FR_NIR": {
      "regex": "\\b\\d{15}\\b",
      "example": "185010112345678",
      "validation": "contains_birth_date",
      "risk_level": "CRITICAL"
    }
  }
}
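A pattern file like this can be exercised before any Presidio wiring exists. The sketch below loads a two-entry subset with the standard library and scans a string for candidates. Note that JSON requires backslashes to be doubled (`\\d`), and that the `\b` word boundaries are an addition to keep the patterns from matching inside longer digit runs. The helper names `compile_patterns` and `find_candidates` are hypothetical.

```python
import json
import re

# Two-entry subset of the pattern file; a raw string keeps the JSON escapes intact
CONFIG_JSON = r'''
{
  "tax_id_patterns": {
    "ES_DNI": {"regex": "\\b\\d{8}[A-Z]\\b"},
    "PL_PESEL": {"regex": "\\b\\d{11}\\b"}
  }
}
'''


def compile_patterns(config_text):
    """Parse the JSON config and compile one regex per entity name."""
    config = json.loads(config_text)
    return {
        name: re.compile(spec["regex"])
        for name, spec in config["tax_id_patterns"].items()
    }


def find_candidates(text, patterns):
    """Return (entity, start, end, match) tuples ordered by position."""
    hits = []
    for name, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((name, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


patterns = compile_patterns(CONFIG_JSON)
hits = find_candidates("DNI 12345678Z, PESEL 85010112345", patterns)
for hit in hits:
    print(hit)
```

These are candidates only: a regex hit says the shape is right, not that the number is a real identifier.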
Step 2: Implement the Tax ID Recognizer
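from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# AnalyzerEngine() loads its default English NLP model
analyzer = AnalyzerEngine()

# Spanish DNI: 8 digits followed by a control letter
dni_recognizer = PatternRecognizer(
    supported_entity="ES_DNI",
    # Pattern requires a base confidence score; 0.5 is an illustrative value
    patterns=[Pattern(name="dni", regex=r"\b\d{8}[A-Z]\b", score=0.5)],
    context=["DNI", "identificación"],
)

# Polish PESEL: a bare 11-digit pattern is weak, so it gets a lower base score
pesel_recognizer = PatternRecognizer(
    supported_entity="PL_PESEL",
    patterns=[Pattern(name="pesel", regex=r"\b\d{11}\b", score=0.3)],
    context=["PESEL", "numer"],
)

analyzer.registry.add_recognizer(dni_recognizer)
analyzer.registry.add_recognizer(pesel_recognizer)

# Test (the language argument is required by analyze())
results = analyzer.analyze(
    text="Spanish DNI: 12345678Z, Polish PESEL: 85010112345",
    language="en",
    entities=["ES_DNI", "PL_PESEL"],
)
print(results)
# Expected: a list of RecognizerResult objects, one ES_DNI and one PL_PESEL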
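A pattern of 11 digits matches phone numbers, order numbers, and much else, so a checksum test is the usual way to cut false positives. The following is a minimal, Presidio-independent sketch of the PESEL check digit (weights 1, 3, 7, 9 repeated, modulo 10); `is_valid_pesel` is a hypothetical name. In Presidio, a check like this is typically wired in by subclassing `PatternRecognizer` and overriding `validate_result`; that wiring is not shown here.

```python
# Weights applied to the first ten digits of a PESEL
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(value: str) -> bool:
    """Return True if the 11th digit matches the weighted mod-10 check digit."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    digits = [int(ch) for ch in value]
    # zip() stops after the ten weights, so the check digit is excluded
    total = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    return (10 - total % 10) % 10 == digits[10]


print(is_valid_pesel("85010112345"))  # True
print(is_valid_pesel("12345678901"))  # False
```

A single check digit still lets roughly one in ten random 11-digit strings through, so context words remain useful alongside it.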
Step 3: Anonymize Tax IDs
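from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

anonymizer = AnonymizerEngine()
text = "Patient with DNI 12345678Z and PESEL 85010112345"

# Analyze this exact text so the result offsets line up with it
# (`analyzer` is the engine configured in Step 2)
results = analyzer.analyze(text=text, language="en", entities=["ES_DNI", "PL_PESEL"])

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "ES_DNI": OperatorConfig("replace", {"new_value": "<DNI>"}),
        "PL_PESEL": OperatorConfig("replace", {"new_value": "<PESEL>"}),
    },
)
print(anonymized.text)
# "Patient with DNI <DNI> and PESEL <PESEL>"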
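Anonymization by replacement depends on character offsets, which is why the analyzer results must come from the same text that is being anonymized. The standard-library sketch below shows the underlying idea under simple assumptions (non-overlapping spans); `replace_spans` is a hypothetical helper, not part of Presidio.

```python
def replace_spans(text, spans):
    """Replace (start, end, placeholder) spans, working from right to left.

    Going right to left keeps the earlier offsets valid even though each
    replacement changes the length of the string.
    """
    out = text
    for start, end, placeholder in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + placeholder + out[end:]
    return out


text = "Patient with DNI 12345678Z and PESEL 85010112345"
dni_start = text.index("12345678Z")
pesel_start = text.index("85010112345")
spans = [
    (dni_start, dni_start + 9, "<DNI>"),
    (pesel_start, pesel_start + 11, "<PESEL>"),
]
print(replace_spans(text, spans))
# Patient with DNI <DNI> and PESEL <PESEL>
```

Applying spans computed on one string to a different string would cut at the wrong positions and could leave part of an ID in the output.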
EU Tax ID Coverage (27 Countries)
| Country | Identifier | Format | Checksum | Detection |
|---|---|---|---|---|
| Spain | DNI | 8 digits + 1 letter | Yes (mod 23) | ✅ |
| Poland | PESEL | 11 digits | Yes (mod 10) | ✅ |
| Romania | CNP | 13 digits | Yes | ✅ |
| France | NIR | 15 digits | Yes | ✅ |
| Germany | Steuer-ID | 11 digits | Yes | ✅ |
| Italy | Codice Fiscale | 16 alphanumeric | Yes | ✅ |
| Greece | AFM | 9 digits | Yes | ✅ |
| Czechia | RČ | 10 digits | Yes | ✅ |
| Hungary | TAJ-szám | 9 digits | Yes (mod 10) | ✅ |
| Slovakia | RČ | 10 digits | Yes | ✅ |
| Slovenia | EMŠO | 13 digits | Yes (mod 11) | ✅ |
| Croatia | OIB | 11 digits | Yes (mod 11) | ✅ |
| Bulgaria | EGN | 10 digits | Yes (mod 11) | ✅ |
| Lithuania | Asmens kodas | 11 digits | Yes | ✅ |
| Latvia | Personas kods | 11 digits | Yes | ✅ |
| Estonia | Isikukood | 11 digits | Yes (mod 11) | ✅ |
| + 11 others | ... | ... | ... | ✅ |
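The "Checksum" column hides quite different algorithms per country. As one more example, the French NIR carries a two-digit control key equal to 97 minus the first 13 digits modulo 97. The sketch below assumes an all-digit NIR; Corsican department codes (2A, 2B) need a substitution step that is deliberately left out. `nir_key` and `is_valid_nir` are hypothetical names.

```python
def nir_key(first_13_digits: str) -> int:
    """Compute the two-digit control key for the first 13 digits of a NIR."""
    if len(first_13_digits) != 13 or not (
        first_13_digits.isascii() and first_13_digits.isdigit()
    ):
        raise ValueError("expected exactly 13 digits")
    return 97 - int(first_13_digits) % 97


def is_valid_nir(nir: str) -> bool:
    """Return True if a 15-digit NIR ends with the correct control key."""
    nir = nir.replace(" ", "")  # NIRs are often written in spaced groups
    if len(nir) != 15 or not (nir.isascii() and nir.isdigit()):
        return False
    return nir_key(nir[:13]) == int(nir[13:])


print(nir_key("1850101123456"))         # 67
print(is_valid_nir("185010112345667"))  # True
print(is_valid_nir("185010112345678"))  # False
```

With 97 possible key values, this check rejects most random 15-digit strings, which makes it considerably stronger than a single check digit.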
Real-World Case: CNIL France Enforcement
In 2023, the CNIL fined a healthcare provider €50,000 for:
- ❌ National tax IDs (NIR) not anonymized in shared documents
- ❌ NIRs joined with patient names in an analytics warehouse
- ❌ No audit trail of who accessed the data
With an EU tax ID recognizer in place:
- Detection: 15-digit NIR patterns detected
- Anonymization: All NIRs replaced with <NIR>
- Audit log: 145 NIRs detected and anonymized
- GDPR evidence: Supports compliance with Articles 32 and 34
Benefits
- ✅ 100% EU coverage - all 27 national tax ID formats
- ✅ Validation - checksum verification reduces false positives
- ✅ Context awareness - detects language-specific tax IDs (DE, FR, ES, PL, RO, etc.)
- ✅ Compliance evidence - shows that detected tax IDs were anonymized
Best Practices
- Enable EU Tax ID recognizers in the Presidio configuration
- Validate extensively against real national ID formats, using synthetic test data rather than real IDs
- Audit logs - track how many IDs are detected per country
- Quarterly updates - add new countries as the EU grows
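The per-country audit count can be as simple as a tally of entity types. The sketch below assumes detections arrive as a list of entity-type strings such as `"FR_NIR"`; the mapping `COUNTRY_BY_ENTITY` and the function `summarize_detections` are hypothetical. It records only counts, never the matched values, so the audit log does not itself become a store of tax IDs.

```python
from collections import Counter

# Hypothetical mapping from recognizer entity names to country codes
COUNTRY_BY_ENTITY = {
    "ES_DNI": "ES",
    "PL_PESEL": "PL",
    "RO_CNP": "RO",
    "FR_NIR": "FR",
}


def summarize_detections(entity_types):
    """Count detections per country; unmapped entity types go to 'UNKNOWN'."""
    counts = Counter(COUNTRY_BY_ENTITY.get(e, "UNKNOWN") for e in entity_types)
    return dict(sorted(counts.items()))


summary = summarize_detections(["FR_NIR", "ES_DNI", "FR_NIR", "PL_PESEL"])
print(summary)  # {'ES': 1, 'FR': 2, 'PL': 1}
```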
National tax ID detection is a practical necessity for EU organizations that have to meet GDPR compliance requirements.