anonym.legal

What Presidio Misses: The 220+ Entity Types Essential for GDPR-Compliant PII Detection

Presidio ships with ~40 default entity recognizers focused on US identifiers. European organizations need IBAN, Codice Fiscale, Steueridentifikationsnummer, EU driving license formats, and national health identifiers — all missing from Presidio's defaults.

March 7, 2026 · 7 min read
Presidio entity coverage, EU GDPR PII, IBAN detection, European identifiers, Presidio vs managed

Microsoft Presidio ships with approximately 40 default entity recognizers. For US-based deployments handling US-centric documents, this covers the essential categories: SSNs, US passports, US driver's licenses, credit cards, email addresses, phone numbers, and person names.

For EU deployments, the coverage gap is significant. GDPR protects the personal data of individuals in the EU regardless of their nationality. EU organizations processing their own citizens' data need recognizers that Presidio doesn't provide out of the box.

The Default Presidio Entity Library

Presidio's default recognizers include:

US-centric identifiers:

  • US Social Security Number (SSN)
  • US Passport Number
  • US Driver's License Number (multiple state formats)
  • US Bank Account Number
  • US ITIN (Individual Taxpayer Identification Number)
  • US Medical License Number

Universal identifiers:

  • Email Address
  • Phone Number (US-centric format priority)
  • IP Address
  • Credit Card Number (Luhn algorithm)
  • Crypto Wallet Address
  • URL

Generic text entities:

  • PERSON (NER-based)
  • LOCATION (NER-based)
  • ORGANIZATION (NER-based)
  • DATE_TIME (NER-based)

Limited international coverage:

  • UK NHS Number
  • UK National Insurance Number (NINO)
  • Financial Entity identifiers (some)

Total: ~40 recognizers

What EU Organizations Actually Need

Financial identifiers: IBAN (International Bank Account Number) appears in virtually every EU business document involving payments, wire transfers, invoicing, and payroll. IBAN lengths and layouts vary by country but follow an international standard (ISO 13616). Presidio's predefined recognizers differ by release and by configured language, so verify IBAN coverage against the supported-entities list for the version you deploy.

Consider a German fintech whose customer payment records contain an IBAN in every transaction document. Without IBAN recognition, credit card detection still fires on card numbers, but the IBAN fields, which carry the primary EU payment identifier, pass through undetected.
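
The validation step an IBAN recognizer relies on can be sketched in plain Python. This is a minimal illustration of the standard mod-97 check, not code from Presidio or anonym.legal; the function name `is_valid_iban` and the small `IBAN_LENGTHS` table are assumptions made for the example, and the table covers only a handful of countries.

```python
import re

# Expected total IBAN length for a few countries.
# Illustrative subset only, not the full ISO 13616 registry.
IBAN_LENGTHS = {"DE": 22, "FR": 27, "IT": 27, "ES": 24, "NL": 18}


def is_valid_iban(candidate: str) -> bool:
    """Return True if the candidate passes the IBAN shape and mod-97 checks."""
    iban = re.sub(r"\s+", "", candidate).upper()
    # Country code, two check digits, then 11 to 30 BBAN characters.
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", iban):
        return False
    expected = IBAN_LENGTHS.get(iban[:2])
    if expected is not None and len(iban) != expected:
        return False
    # Move the first four characters to the end, map letters to 10..35,
    # and check that the resulting integer leaves remainder 1 modulo 97.
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_iban("DE89 3704 0044 0532 0130 00"))
print(is_valid_iban("DE89 3704 0044 0532 0130 01"))
```

A checksum like this is what lets a recognizer tell a real IBAN apart from an arbitrary alphanumeric string of the right length, which keeps false positives low in invoice and payroll text.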

National tax identifiers:

  • German Steueridentifikationsnummer: 11-digit numeric
  • French NIR (Numéro d'Inscription au Répertoire): 13 characters plus a 2-digit control key
  • Italian Codice Fiscale: 16-character alphanumeric with structural validation
  • Spanish NIF/NIE: 9-character with letter suffix/prefix
  • Dutch BSN: 9-digit with 11-test (elfproef) validation

None of these are in Presidio's default entity library. An EU payroll processor handling employee documents from multiple member states is effectively blind to their most sensitive financial identifiers.
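
As an illustration of the validation logic these identifiers call for, here is a minimal sketch of the Dutch BSN 11-test mentioned in the list. The function name `is_valid_bsn` is hypothetical, and the sketch checks only the arithmetic rule, not whether the number has actually been issued.

```python
import re


def is_valid_bsn(candidate: str) -> bool:
    """Apply the 11-test to a 9-digit Dutch BSN candidate."""
    digits = candidate.strip()
    if not re.fullmatch(r"[0-9]{9}", digits):
        return False
    # Weights 9 down to 2 for the first eight digits, -1 for the last.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(d) * w for d, w in zip(digits, weights))
    return total % 11 == 0


print(is_valid_bsn("111222333"))
print(is_valid_bsn("123456789"))
```

The other national identifiers follow the same pattern of a regex for the shape plus a country-specific checksum, which is why each one needs its own research and tests.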

National health identifiers:

  • UK NHS Number: 10-digit with modulus-11 check
  • French Numéro de Sécurité Sociale (NIR): Also serves as health ID
  • German Krankenkassennummer: Alphanumeric, insurer-specific
  • Italian Codice Fiscale: Also used as health ID
  • Netherlands BSN: Also used for health insurance

Healthcare organizations across the EU need to detect these identifiers to protect health data, which GDPR Article 9 treats as a special category of personal data. Presidio provides the UK NHS Number but misses the continental European health IDs.
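
The modulus-11 check listed for the NHS Number can be sketched as follows. This is an illustrative implementation of the published check-digit rule; the function name `is_valid_nhs_number` is an assumption for the example and is not Presidio's recognizer.

```python
import re


def is_valid_nhs_number(candidate: str) -> bool:
    """Validate a 10-digit NHS Number using the modulus-11 check digit."""
    digits = re.sub(r"[\s-]", "", candidate)
    if not re.fullmatch(r"[0-9]{10}", digits):
        return False
    # Multiply the first nine digits by weights 10 down to 2.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        # A computed check digit of 10 means the number is invalid.
        return False
    return check == int(digits[9])


print(is_valid_nhs_number("943 476 5919"))
print(is_valid_nhs_number("943 476 5918"))
```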

EU driving license formats: Presidio has US driver's license recognizers (state-specific). EU driving license formats are standardized under Directive 2006/126/EC but vary by member state in their alphanumeric structure. Presidio's defaults include no EU driving license recognizers.

VAT registration numbers: EU VAT numbers appear in every business-to-business transaction. Format: a 2-letter country code followed by a country-specific block of up to 12 alphanumeric characters. Presidio has no VAT number recognizer. For EU businesses sharing invoices, contracts, and commercial documents, VAT numbers are identifiers that link to registered business entities and their directors.
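
Because the block after the country code differs per member state, a VAT recognizer typically needs a per-country pattern table. The sketch below shows the idea for four countries only; the table `VAT_PATTERNS` and the function `looks_like_vat_number` are hypothetical names, the patterns are stated to the best of my knowledge, and the check covers shape only, not national checksums or registration status.

```python
import re

# Shape of the part after the country code, for a few member states.
# Illustrative subset; a production table would cover every member state.
VAT_PATTERNS = {
    "DE": r"[0-9]{9}",
    "IT": r"[0-9]{11}",
    "NL": r"[0-9]{9}B[0-9]{2}",
    "FR": r"[0-9A-Z]{2}[0-9]{9}",
}


def looks_like_vat_number(candidate: str) -> bool:
    """Return True if the candidate matches a known country-specific VAT shape."""
    vat = re.sub(r"[\s.-]", "", candidate).upper()
    pattern = VAT_PATTERNS.get(vat[:2])
    return pattern is not None and re.fullmatch(pattern, vat[2:]) is not None


print(looks_like_vat_number("DE123456789"))
print(looks_like_vat_number("NL123456789B01"))
print(looks_like_vat_number("DE12345678"))
```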

EU passport formats: Presidio recognizes US passport numbers, but EU passport formats (especially the Machine Readable Zone format) are not covered.
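
For the Machine Readable Zone, the building block is the ICAO 9303 check digit, which applies repeating weights 7, 3, 1 to each character of a field. The sketch below is a minimal illustration; the function name `mrz_check_digit` is an assumption, and a full MRZ recognizer would also need to parse the line layout.

```python
MRZ_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def mrz_check_digit(field: str) -> int:
    """Compute the check digit for one MRZ field using weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = 0
    for position, ch in enumerate(field):
        if ch == "<":
            value = 0  # the filler character counts as zero
        elif ch in MRZ_ALPHABET:
            value = MRZ_ALPHABET.index(ch)  # digits 0..9, letters 10..35
        else:
            raise ValueError(f"invalid MRZ character: {ch!r}")
        total += value * weights[position % 3]
    return total % 10


# Document number from the ICAO specimen passport.
print(mrz_check_digit("L898902C3"))  # 6
```

A recognizer can compare this computed digit with the digit that follows the field in the MRZ line to confirm that a match is a real passport field rather than a random string.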

The Engineering Cost of Custom Recognizer Development

When EU organizations deploy Presidio and discover the entity coverage gap, the response is typically custom recognizer development. The cost:

Per recognizer development time:

  • Research the identifier format: 1-2 hours
  • Write PatternRecognizer Python class: 2-4 hours
  • Implement regex with validation logic: 2-4 hours
  • Configure context words for precision improvement: 1-2 hours
  • Write tests: 2-3 hours
  • Integrate and test in deployment: 1-2 hours

Per recognizer: 9-17 hours.
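
To make these steps concrete, here is a self-contained sketch of what a pattern-based recognizer involves: a regex, a validation function, and context words that raise the score. It deliberately does not use Presidio's classes; `SimplePatternRecognizer`, its fields, the score values, and the entity label `DE_STEUER_ID` are all hypothetical stand-ins. The validator applies the Steuer-ID check digit only and skips the identifier's other structural rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


def steuer_id_checksum_ok(value: str) -> bool:
    """Check the final digit of an 11-digit Steuer-ID candidate."""
    product = 10
    for ch in value[:10]:
        total = (int(ch) + product) % 10
        if total == 0:
            total = 10
        product = (total * 2) % 11
    check = 11 - product
    if check == 10:
        check = 0
    return check == int(value[10])


@dataclass
class SimplePatternRecognizer:
    """Hypothetical stand-in for a pattern recognizer (not Presidio's API)."""
    entity_type: str
    regex: str
    context_words: Tuple[str, ...] = ()
    validator: Optional[Callable[[str], bool]] = None
    base_score: float = 0.4      # assumed score for a bare pattern match
    context_boost: float = 0.35  # assumed boost when a context word is near
    window: int = 40             # characters inspected around each match

    def analyze(self, text: str) -> List[Tuple[str, int, int, float]]:
        results = []
        for match in re.finditer(self.regex, text):
            if self.validator and not self.validator(match.group()):
                continue  # drop matches that fail the checksum
            start = max(0, match.start() - self.window)
            nearby = text[start:match.end() + self.window].lower()
            score = self.base_score
            if any(word in nearby for word in self.context_words):
                score = min(1.0, score + self.context_boost)
            results.append((self.entity_type, match.start(), match.end(), score))
        return results


steuer_id = SimplePatternRecognizer(
    entity_type="DE_STEUER_ID",
    regex=r"\b[1-9][0-9]{10}\b",
    context_words=("steuer-id", "steueridentifikationsnummer", "idnr"),
    validator=steuer_id_checksum_ok,
)
print(steuer_id.analyze("Steuer-ID: 65929970489"))
```

Even in this reduced form, each part maps to a line in the estimate above: the regex and checksum need format research, the context words need tuning against real documents, and every branch needs a test case.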

For a German fintech needing IBAN + Steuer-ID + EU driving license + German VAT:

  • 4 custom recognizers × 13 hours average = 52 engineering hours
  • At €100/hour: €5,200 in custom recognizer development

Plus ongoing maintenance as formats change, new test cases emerge, and Presidio API updates require recognizer modifications.

Total cost for EU GDPR coverage on top of Presidio: €5,200+ initial + ongoing maintenance
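
The arithmetic behind these figures can be reproduced in a few lines. The step names and variable names below are labels chosen for the example; the hour ranges, recognizer count, and hourly rate are the ones quoted above.

```python
# (low, high) hour estimates per development step, as listed above.
STEP_HOURS = {
    "research format": (1, 2),
    "write recognizer class": (2, 4),
    "regex and validation": (2, 4),
    "context words": (1, 2),
    "tests": (2, 3),
    "integration": (1, 2),
}

low_hours = sum(low for low, _ in STEP_HOURS.values())     # 9
high_hours = sum(high for _, high in STEP_HOURS.values())  # 17
average_hours = (low_hours + high_hours) / 2               # 13.0

RECOGNIZERS = 4
HOURLY_RATE_EUR = 100

total_hours = RECOGNIZERS * average_hours        # 52.0
total_cost_eur = total_hours * HOURLY_RATE_EUR   # 5200.0

print(f"{low_hours}-{high_hours} h per recognizer, average {average_hours:.0f} h")
print(f"{total_hours:.0f} h total, EUR {total_cost_eur:,.0f}")
```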

The Alternative: Managed Entity Libraries

anonym.legal extends the Presidio foundation with 285+ entity types maintained by the development team — including the EU-specific identifiers that Presidio's defaults miss:

Coverage highlights beyond Presidio defaults:

  • IBAN (all EU member state formats)
  • EU member state tax identifiers (including Steuer-ID, NIR, Codice Fiscale, NIF/NIE, BSN, PESEL, and others)
  • EU national health identifiers
  • VAT numbers (EU format)
  • EU driving license formats
  • European passport formats
  • All 48 supported language entity variations

Maintenance: Entity library updates are pushed as part of the managed service. When Germany introduces a new tax identifier format, users get the recognizer without filing a pull request.

Custom extension: For organization-specific identifiers not in the library, the custom entity builder allows adding patterns without Python code.

The German Fintech Example

A German fintech needs to detect IBANs, BICs, German tax IDs (Steuer-ID), and German commercial registration numbers (Handelsregisternummer) in customer documents.

Presidio default detection rate for these 4 entity types: 0%

Not low precision, not false positives — zero detections. None of the 4 entity types appear in Presidio's default entity library.

Writing custom recognizers: 4 recognizers × 13 hours = 52 hours, or €5,200 at €100/hour.

Using managed entity library with all 4 covered: €180/year (Professional plan).

Cost to achieve GDPR-compliant detection of these German financial identifiers:

  • Presidio route: €5,200 engineering + Presidio operational costs
  • Managed service route: €180/year, detecting all 4 out of the box

The gap is roughly 29x in year one (€5,200 against €180). For every year of operation, engineering time for custom recognizer maintenance adds to the Presidio cost while the managed service cost remains flat.
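
A small calculation shows where the year-one multiple comes from and how the comparison moves over time. The function `cumulative_costs` and its `maintenance_eur_per_year` parameter are hypothetical; the article does not quantify maintenance or Presidio operational costs, so maintenance defaults to zero here and operational costs are left out.

```python
def cumulative_costs(
    years: int,
    initial_build_eur: float = 5200.0,
    subscription_eur_per_year: float = 180.0,
    maintenance_eur_per_year: float = 0.0,  # assumption: not given in the article
) -> tuple:
    """Return (custom_route_eur, managed_route_eur) after the given number of years."""
    custom = initial_build_eur + maintenance_eur_per_year * years
    managed = subscription_eur_per_year * years
    return custom, managed


custom, managed = cumulative_costs(1)
ratio = custom / managed
print(f"Year one: EUR {custom:,.0f} vs EUR {managed:,.0f}, ratio {ratio:.1f}x")
```

With maintenance set to zero the multiple shrinks as subscription years accumulate, so the long-run comparison depends on how much recurring engineering time the custom recognizers actually need.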

Conclusion

Presidio's ~40 default recognizers serve US-centric use cases well. For EU deployments requiring GDPR compliance across member state-specific identifiers, the out-of-the-box coverage is insufficient. The gap is filled either through custom recognizer development (expensive, time-consuming) or a managed service that maintains EU entity coverage as part of the subscription.

For EU organizations where compliance is non-negotiable and engineering resources are constrained, the managed service's pre-built EU entity library eliminates a 50+ hour custom development project before first-document anonymization.
