The Global Identifier Fragmentation Problem
A marketplace platform with sellers in 45 countries processes onboarding documents that look completely different depending on the seller's country of origin. A Brazilian seller submits a CPF (Cadastro de Pessoas Físicas), an 11-digit tax ID whose last two digits are check digits computed with a weighted-sum, mod-11 algorithm. An Indian seller provides a PAN (Permanent Account Number), a 10-character alphanumeric code made up of five letters, four digits, and a final letter. A German seller provides a Steuer-ID (11 digits, the last being an ISO 7064 MOD 11,10 check digit). A Dutch seller provides a BSN (Burgerservicenummer, 9 digits with a mod-11 "elfproef" check).
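To make the CPF case concrete, here is a minimal sketch of its check-digit rule, assuming the commonly documented weights (10 down to 2 for the first check digit, 11 down to 2 for the second). The function name `is_valid_cpf` is illustrative and not taken from any particular tool.

```python
import re


def is_valid_cpf(value: str) -> bool:
    """Check the two CPF check digits (weighted sum, mod 11)."""
    digits = [int(c) for c in re.sub(r"\D", "", value)]
    # Must be 11 digits. Repeated-digit strings such as 111.111.111-11 pass
    # the arithmetic but are conventionally rejected as placeholders.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False

    def check_digit(prefix):
        # Weights count down to 2: 10..2 for the first check digit,
        # 11..2 for the second (which also covers the first check digit).
        weights = range(len(prefix) + 1, 1, -1)
        remainder = sum(d * w for d, w in zip(prefix, weights)) % 11
        return 0 if remainder < 2 else 11 - remainder

    return (check_digit(digits[:9]) == digits[9]
            and check_digit(digits[:10]) == digits[10])


print(is_valid_cpf("529.982.247-25"))  # sample number often used in examples
print(is_valid_cpf("529.982.247-26"))  # last check digit altered
```

None of this logic carries over to a PAN, a Steuer-ID, or a BSN; each needs its own rule.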
Each format has a different length, structure, and validation algorithm. A single regex designed for one format does not match the others. A generic "10-12 digit numeric string" pattern produces prohibitive false positive rates across financial documents containing prices, quantities, dates, and reference numbers.
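A small illustration of both failure modes, using a made-up invoice line: the generic digit pattern flags every long number, while a PAN-shaped pattern finds nothing because the only tax ID present is a CPF. The sample text and pattern names are assumptions for the sketch.

```python
import re

GENERIC_DIGITS = re.compile(r"\b\d{10,12}\b")           # the "10-12 digit" catch-all
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")  # Indian PAN structure only

# Hypothetical invoice text; only the last number is a tax ID (a CPF).
invoice = (
    "Order 20240315001 placed 2024-03-15, tracking 004512889730, "
    "amount 1200000000 IDR, seller CPF 52998224725"
)

generic_hits = GENERIC_DIGITS.findall(invoice)
pan_hits = PAN_PATTERN.findall(invoice)

print(len(generic_hits), "generic matches:", generic_hits)  # 4 matches, 1 real ID
print(len(pan_hits), "PAN matches")                         # 0 matches
```

Three of the four generic matches are an order number, a tracking number, and an amount, which is why format-specific patterns combined with checksum validation are needed.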
The compliance obligation applies regardless of country. GDPR covers the EU sellers' data. LGPD covers the Brazilian sellers' data. The DPDP Act covers the Indian sellers' data. Each regulatory framework requires appropriate protection of the personal data it covers, and "appropriate" means the identifier was actually detected and protected, not merely that a detection attempt was made.
The 40-Identifier Gap
Most enterprise PII detection tools ship with recognizers for approximately 40 common identifier types. These typically include:
- US Social Security Number
- US passport format
- US driver's license (state-specific)
- Generic credit card formats (Luhn validation)
- Email addresses
- Phone numbers (NANP format)
- IP addresses
Tools at this coverage level satisfy English-speaking North American compliance requirements reasonably well. They do not cover the identifier landscape of organizations operating globally.
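For reference, the Luhn validation mentioned in the list above is simple to state. The following is a minimal sketch of the standard algorithm, not any specific tool's implementation, and `luhn_valid` is an illustrative name.

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn check, as used for payment card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit and
    # subtracting 9 when the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # well-known test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit altered
```

A recognizer set built around this one checksum does not extend to national identifiers, most of which use their own weighted-sum or ISO 7064 schemes.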
The gap between 40 identifiers and global compliance is substantial:
South American identifiers: Brazilian CPF (individual) and CNPJ (corporate) require checksum validation specific to Brazil's fiscal authority format. Argentine CUIT follows a different weighted-sum algorithm. Colombian NIT uses yet another validation method.
Asian identifiers: Indian PAN, Aadhaar (12-digit ID number linked to biometric data), Indian GSTIN (15-character GST identification number), and Voter ID each have distinct formats. Japanese My Number (12-digit national ID), South Korean Resident Registration Number, and Chinese national ID (18-character with check digit) all require separate recognizers.
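As one example of how different these checks are, the 18-character Chinese national ID uses an ISO 7064 MOD 11-2 check character, which can be the letter X. The sketch below assumes the weight table and check-character mapping from the national standard (GB 11643-1999). It validates only the check character, not the embedded region code or birth date, and `is_valid_cn_id` is an illustrative name.

```python
import re

# Per-position weights for the first 17 digits and the check-character
# lookup indexed by (weighted sum mod 11).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def is_valid_cn_id(value: str) -> bool:
    """Validate the check character of an 18-character Chinese national ID."""
    value = value.strip().upper()
    if not re.fullmatch(r"[0-9]{17}[0-9X]", value):
        return False
    total = sum(int(d) * w for d, w in zip(value[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == value[17]


print(is_valid_cn_id("11010519491231002X"))  # commonly cited sample number
print(is_valid_cn_id("110105194912310021"))  # wrong check character
```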
EU identifiers: Beyond the commonly recognized formats, comprehensive EU coverage requires IBAN formats for all 27 EU member states (each with country-specific length and format), plus national ID formats for each member state (German Steuer-ID, French NIR, Dutch BSN, Polish PESEL, Swedish Personnummer, and more).
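The IBAN case shows why "country-specific length and format" matters. Below is a minimal sketch of IBAN validation that combines a per-country length table with the standard mod-97 check. The `IBAN_LENGTHS` table is deliberately a small illustrative subset, so countries outside it are rejected here; a full recognizer would list every country.

```python
import re

# Illustrative subset of country-specific IBAN lengths.
IBAN_LENGTHS = {"DE": 22, "NL": 18, "FR": 27, "ES": 24, "IT": 27}


def is_valid_iban(value: str) -> bool:
    """Check country length and the mod-97 checksum of an IBAN."""
    iban = re.sub(r"\s+", "", value).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]+", iban):
        return False
    # Unknown countries return None here and fail the comparison.
    if IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False
    # Move country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and test mod 97.
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_iban("DE89 3704 0044 0532 0130 00"))  # standard example IBAN
print(is_valid_iban("DE89 3704 0044 0532 0130 01"))  # one digit altered
```

The checksum step is shared across countries, but the length and the internal account structure are not, which is why each member state still needs its own entry.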
What 260+ Entity Types Actually Covers
A comprehensive entity library with 260+ types covers:
- All 27 EU member state national identifiers (including lesser-covered ones: Slovenian EMŠO, Croatian OIB, Bulgarian EGN, Romanian CNP)
- All EU IBAN formats (27 country-specific formats with validation)
- Major South American identifiers (Brazil CPF/CNPJ, Argentina CUIT, Colombia NIT)
- Major Asian identifiers (India PAN/Aadhaar/GSTIN, Japan My Number, Korea RRN)
- UK-specific identifiers, which sit outside the EU set post-Brexit (National Insurance number (NINO) and its formatting variants, NHS Number)
- Medical identifiers across jurisdictions (US NPI, DEA numbers, NHS numbers, hospital MRN formats)
- Financial identifiers (SWIFT/BIC codes, various account number patterns)
For a London-based marketplace serving sellers from 45 countries, 260+ entity coverage means a single deployment handles the identification and protection of seller personal data across all jurisdictions — without requiring separate regional tools, separate processing pipelines, or manual enrichment for the national identifier types that a 40-recognizer tool misses.
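One way to picture a single deployment covering many jurisdictions is a registry of recognizers, each pairing a pattern with an optional checksum validator. The sketch below is an assumed design, not a description of any specific product. The `Recognizer` class, the entity names `IN_PAN` and `NL_BSN`, and the sample text are all hypothetical, and only two of the 260+ types are shown.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Recognizer:
    entity_type: str
    pattern: re.Pattern
    validate: Optional[Callable[[str], bool]] = None  # checksum, if the format has one


def bsn_elfproef(candidate: str) -> bool:
    # Dutch BSN "11-proef": weights 9..2 for the first eight digits,
    # -1 for the last; the weighted sum must be divisible by 11.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(d) * w for d, w in zip(candidate, weights))
    return total % 11 == 0


REGISTRY = [
    Recognizer("IN_PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    Recognizer("NL_BSN", re.compile(r"\b[0-9]{9}\b"), bsn_elfproef),
]


def detect(text: str) -> List[Tuple[str, str]]:
    """Run every registered recognizer and keep candidates that validate."""
    findings = []
    for rec in REGISTRY:
        for match in rec.pattern.finditer(text):
            candidate = match.group()
            if rec.validate is None or rec.validate(candidate):
                findings.append((rec.entity_type, candidate))
    return findings


sample = "Seller PAN ABCDE1234F, BSN 111222333, order ref 123456789"
print(detect(sample))
```

In this sample the 9-digit order reference fails the BSN check and is dropped, while the PAN and the BSN are both reported. Adding a jurisdiction means adding a registry entry rather than a separate pipeline.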
The compliance posture changes from "we protect common identifiers" to "we protect the identifiers present in our actual data." For global operations, that distinction is the difference between partial compliance and genuine protection.