
LGPD and Brazilian Portuguese PII: What ANPD Requires for CPF, CNPJ, and Brazilian Data Protection

LGPD covers 215M Brazilians, and ANPD began major enforcement in 2024. English-trained tools detect CPF numbers with only 45% accuracy. Brazilian identifiers, from CPF to Título de Eleitor, require specialized detection.

March 7, 2026 · 8 min read
Brazil LGPD · CPF detection · Brazilian Portuguese PII · ANPD compliance · South America data protection

Brazil's Lei Geral de Proteção de Dados (LGPD) is one of the world's largest data protection frameworks by population covered: about 215 million Brazilians, close to the combined populations of Germany, France, and the UK. The Autoridade Nacional de Proteção de Dados (ANPD) issued its first major enforcement actions in 2024, signaling the end of the grace period that followed LGPD's entry into force in 2020.

The technical compliance challenge is distinctive: Brazilian Portuguese is the language of LGPD-covered documents, but Brazilian national identifiers are completely different from European Portuguese identifiers — and from any other national identification system in the world.

Why Brazilian PII Is Technically Distinct

Brazilian federal and state identification systems evolved separately from European digital identity frameworks. The result is a complex set of identifiers that generic NLP tools — most trained on English or European language data — fail to detect:

CPF (Cadastro de Pessoas Físicas): The 11-digit individual taxpayer registration is Brazil's universal citizen identifier. Format: XXX.XXX.XXX-XX with two check digits. The CPF check digit algorithm uses two separate modular arithmetic calculations — if both check digits match, the CPF is valid.

The technical problem: English-trained NLP tools detect CPF with only 45% accuracy (ANPD technical assessment 2024). There are two failure modes. First, tools that pattern-match 11-digit numbers without the two-step check digit validation cannot distinguish valid CPF numbers from random sequences. Second, in some contexts (OCR output, plain text forms) CPF appears without the standard XXX.XXX.XXX-XX formatting.
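The two-step validation can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the function names and the candidate regex are assumptions made for this example, and the weights (10 down to 2, then 11 down to 2) follow the commonly published CPF check digit scheme. The regex accepts both the formatted and the bare 11-digit form, which addresses the second failure mode.

```python
import re

# Candidate pattern (an assumption for this sketch): matches XXX.XXX.XXX-XX
# as well as bare 11-digit runs that are not part of a longer number.
CPF_CANDIDATE = re.compile(r"(?<!\d)(\d{3}\.?\d{3}\.?\d{3}-?\d{2})(?!\d)")


def cpf_check_digit(digits):
    """Compute one check digit from the digits that precede it."""
    start = len(digits) + 1  # weights run from len+1 down to 2
    total = sum(d * w for d, w in zip(digits, range(start, 1, -1)))
    remainder = (total * 10) % 11
    return 0 if remainder == 10 else remainder


def is_valid_cpf(text):
    """Return True if both check digits match."""
    digits = [int(c) for c in text if c.isdigit()]
    # Repeated-digit sequences pass the arithmetic but are rejected by convention.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    return (cpf_check_digit(digits[:9]) == digits[9]
            and cpf_check_digit(digits[:10]) == digits[10])


def find_cpfs(text):
    """Return candidates in the text that also pass check digit validation."""
    return [m.group(1) for m in CPF_CANDIDATE.finditer(text)
            if is_valid_cpf(m.group(1))]
```

For example, `find_cpfs("CPF 52998224725 e protocolo 12345678901")` keeps only the first number, because the second fails the second check digit.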

CNPJ (Cadastro Nacional da Pessoa Jurídica): The 14-digit company registration number. Format: XX.XXX.XXX/XXXX-XX with two check digits using similar (but not identical) algorithms to CPF.
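A comparable sketch for CNPJ, again with hypothetical function names. It assumes the commonly published numeric CNPJ scheme, in which the modulo-11 step resembles the CPF one but the weight sequence cycles from 2 to 9 instead of counting straight down.

```python
# Weights for the first check digit (over 12 digits) and the second (over 13).
CNPJ_FIRST_WEIGHTS = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
CNPJ_SECOND_WEIGHTS = [6] + CNPJ_FIRST_WEIGHTS


def cnpj_check_digit(digits, weights):
    """Weighted sum modulo 11; remainders 0 and 1 map to check digit 0."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cnpj(text):
    """Return True if both CNPJ check digits match."""
    digits = [int(c) for c in text if c.isdigit()]
    # Reject wrong lengths and repeated-digit sequences.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    return (cnpj_check_digit(digits[:12], CNPJ_FIRST_WEIGHTS) == digits[12]
            and cnpj_check_digit(digits[:13], CNPJ_SECOND_WEIGHTS) == digits[13])
```

The function strips punctuation first, so `11.222.333/0001-81` and `11222333000181` are treated identically.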

RG (Registro Geral): Brazil's state-issued civil identity document. Unlike the CPF (federal, uniform), RG format varies by state of issuance:

  • São Paulo: 7-9 digits
  • Rio de Janeiro: 7-8 digits with dash
  • Minas Gerais: 2 letters + 5-9 digits (e.g., MG-12.345.678)
  • Other states: various formats

A tool that recognizes only one state's RG format misses the majority of RG numbers in Brazilian documents.
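One way to avoid the single-format trap is to keep a table of format shapes and test every candidate against all of them. The sketch below is illustrative only: the three shapes are a rough regex translation of the format descriptions listed above, the shape names are invented for this example, and the trailing check character after the dash is an assumption. A real deployment would need verified patterns for each issuing state.

```python
import re

# Hypothetical shape table; each regex runs on the candidate with dots and
# spaces removed and letters upper-cased.
RG_SHAPES = {
    "letter_prefix": re.compile(r"^[A-Z]{2}-?\d{5,9}$"),
    "digits_with_dash": re.compile(r"^\d{7,8}-[\dX]$"),
    "plain_digits": re.compile(r"^\d{7,9}$"),
}


def classify_rg(text):
    """Return the names of every shape the candidate string matches."""
    compact = text.upper().replace(".", "").replace(" ", "")
    return [name for name, pattern in RG_SHAPES.items()
            if pattern.match(compact)]
```

An empty result means the string matches none of the known shapes. Because RG formats overlap with other numeric identifiers, a match is a candidate for review, not a confirmed RG.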

CNH (Carteira Nacional de Habilitação): 11-digit driver's license number with check digit. The CNH is issued federally but the format includes registration district coding.

Título de Eleitor (voter registration): 12-digit number with 3 components — identification code (8 digits), state code (2 digits), check digits (2 digits).
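The three-part layout can be made explicit with a small parser. This is a structural sketch only: `TituloParts` and `split_titulo` are hypothetical names, and the code splits the number into its components without validating the check digits.

```python
import re
from typing import NamedTuple, Optional


class TituloParts(NamedTuple):
    identification: str  # first 8 digits
    state_code: str      # next 2 digits
    check_digits: str    # last 2 digits


def split_titulo(text: str) -> Optional[TituloParts]:
    """Split a 12-digit voter number into its components, or return None."""
    digits = re.sub(r"\D", "", text)  # drop spaces and punctuation
    if len(digits) != 12:
        return None
    return TituloParts(digits[:8], digits[8:10], digits[10:])
```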

SUS number (Cartão SUS): 15-digit unified health system number assigned to every Brazilian for public healthcare access. Appears throughout public hospital and primary care records.

PIS/PASEP: 11-digit social integration program number used in all employment records.

LGPD's Anonymization Standard

LGPD Article 5(III) defines anonymized data as data "relating to a data subject who cannot be identified, considering the use of reasonable technical means available at the time of processing," and Article 12 excludes such data from the law's scope unless the anonymization can be reversed with reasonable effort. This is a technology-relative standard: what is anonymous today may not be anonymous when future re-identification techniques develop.

ANPD's guidance clarifies that anonymization requires more than removing explicit identifiers (CPF, name). Quasi-identifier combinations (age range, municipality, gender, profession) may enable re-identification and must be addressed through generalization or noise addition.
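A small sketch shows why generalization matters. The text above does not prescribe a metric; counting the size of each quasi-identifier group, in the style of k-anonymity, is used here as one common way to reason about the risk. The record fields, the 10-year age band, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def generalize(record, age_band=10):
    """Coarsen quasi-identifiers: exact age becomes a band, and the
    municipality is dropped in favour of the state."""
    low = (record["age"] // age_band) * age_band
    return (f"{low}-{low + age_band - 1}", record["state"], record["gender"])


def smallest_group(records, age_band=10):
    """Size of the rarest quasi-identifier combination after generalization."""
    counts = Counter(generalize(r, age_band) for r in records)
    return min(counts.values())


# Fabricated sample records, for illustration only.
records = [
    {"age": 34, "municipality": "Campinas", "state": "SP", "gender": "F"},
    {"age": 37, "municipality": "Santos", "state": "SP", "gender": "F"},
    {"age": 31, "municipality": "Sorocaba", "state": "SP", "gender": "F"},
    {"age": 52, "municipality": "Niterói", "state": "RJ", "gender": "M"},
    {"age": 58, "municipality": "Petrópolis", "state": "RJ", "gender": "M"},
]
```

With 10-year bands every combination is shared by at least two records. Narrowing to 5-year bands (`smallest_group(records, age_band=5)`) leaves some records in a group of one, which is the kind of uniqueness that enables re-identification.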

For AI training data, ANPD requires that data used for training LLMs or ML models either:

  • Is genuinely anonymized (meeting Article 12's technical standard), OR
  • Has explicit consent from each data subject for the specific training use, OR
  • Qualifies under a legitimate purpose with documented justification

Brazilian Portuguese Language Requirements

Brazilian Portuguese differs from European Portuguese in vocabulary, spelling, and document conventions. NLP models trained on European Portuguese (Portugal) perform at approximately 71% of the accuracy of models trained specifically on Brazilian Portuguese text (ANPD technical assessment).

Specific differences relevant to PII detection:

  • Name conventions: Brazil and Portugal share the most common surnames (Silva, Santos, Oliveira, Souza), but naming conventions such as double surnames and surname order differ.
  • Address formats: Brazilian addresses use "Rua," "Avenida," "Alameda," "Travessa" similarly to Portugal, but CEP postal codes (8-digit format: XXXXX-XXX) are Brazil-specific and require Brazilian postal code recognition.
  • Document terminology: Brazilian document types use different terminology from European Portuguese — "Carteira de Identidade" vs. "Bilhete de Identidade" for national ID, different government agency names throughout.
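Of the differences above, the CEP format is the easiest to capture with a pattern. The following is a minimal sketch with a hypothetical `find_ceps` helper. It matches only the hyphenated XXXXX-XXX form, because a bare 8-digit run is too ambiguous to treat as a postal code without surrounding context.

```python
import re

# Five digits, a hyphen, three digits, not embedded in a longer digit run.
CEP_PATTERN = re.compile(r"(?<!\d)\d{5}-\d{3}(?!\d)")


def find_ceps(text):
    """Return every hyphenated CEP-shaped substring in the text."""
    return CEP_PATTERN.findall(text)
```

The lookarounds keep the pattern from firing inside longer numbers such as phone numbers written as `12345-6789`.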

The technical baseline for LGPD compliance is therefore: CPF and CNPJ detection with two-step check digit validation, multi-state RG format recognition, SUS number and Título de Eleitor detection, and Brazilian Portuguese NLP model support.
