
Multilingual PII Detection: The Challenges of 48 Languages under GDPR

The EU has 24 official languages, but anonym.legal supports 48. The challenges include Arabic right-to-left text, Chinese word segmentation, and Thai tone marks. The answer is a hybrid NLP pipeline.

March 3, 2026 | 10 min read
Tags: multilingual, GDPR, NLP, PII detection, European compliance, spaCy, XLM-RoBERTa

What Is GDPR?

GDPR (General Data Protection Regulation) is the EU regulation that governs personal data, including PII. In language terms:

  • The regulation is published in the EU's 24 official languages
  • But large businesses serve customers in as many as 48 languages

Examples:

  • Microsoft: 49 languages
  • Amazon: 45 languages
  • anonym.legal: 48 languages

Challenges of Multilingual Text

1. Arabic (AR) — RTL + Diacritics

Challenges:

  • Text is written right to left (RTL)
  • Harakat (diacritics) add short-vowel marks to letters
  • Hamza (ء) takes different forms depending on its seat (أ, إ, ؤ, ئ)

Example:

القاهرة (Cairo): plain form
ٱلْقَاهِرَة (Cairo with diacritics): the same word, with harakat

→ Detection must **match both the plain and the diacritic forms**

Solution: the Presidio-based pipeline applies NFKD normalization and strips diacritics before detection.
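
The normalization step can be sketched with Python's standard `unicodedata` module. This is a minimal illustration, not Presidio's API: the helper name `strip_arabic_diacritics` is hypothetical, and folding alef wasla (ٱ) into plain alef is an extra assumption added so that the two spellings of Cairo above compare equal.

```python
import unicodedata

ALEF_WASLA = "\u0671"  # ٱ has no Unicode decomposition, so NFKD leaves it alone
ALEF = "\u0627"        # ا

def strip_arabic_diacritics(text: str) -> str:
    """Decompose with NFKD, drop combining marks, then fold alef wasla (assumption)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Harakat such as fatha, kasra and sukun are nonspacing marks (category "Mn").
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.replace(ALEF_WASLA, ALEF)

# القاهرة (plain) and ٱلْقَاهِرَة (with harakat), written as escapes to avoid copy errors
plain = "\u0627\u0644\u0642\u0627\u0647\u0631\u0629"
vocalized = "\u0671\u0644\u0652\u0642\u064e\u0627\u0647\u0650\u0631\u064e\u0629"

print(strip_arabic_diacritics(vocalized) == plain)
```

One trade-off to keep in mind: NFKD also decomposes hamza-carrying letters such as أ into alef plus a combining hamza, so this filter folds them to plain alef as well. That helps recall but can merge distinct words.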

2. Chinese (ZH) — Segmentation + Ambiguity

Challenges:

  • No spaces between words
  • 20,000+ characters (vs 26 letters in English)
  • Ambiguity: "北京" = Beijing (the city), or, read character by character, 北 (běi, north) + 京 (jīng, capital)

Example:

张三在北京工作
Zhang San (name) + zai (in) + Beijing (place) + gongzuo (work)

→ Detection must **segment the text first**

Solution: the Presidio-based pipeline uses the Jieba segmenter for Chinese.
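
To show why segmentation has to come first, here is a toy forward maximum matching segmenter in plain Python. It is a stand-in for illustration only and is not Jieba's algorithm (Jieba combines a large dictionary with a statistical model); the function name `forward_max_match` and the four-word vocabulary are hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy vocabulary covering only the example sentence
TOY_VOCAB = {"张三", "在", "北京", "工作"}

tokens = forward_max_match("张三在北京工作", TOY_VOCAB)
print(tokens)  # ['张三', '在', '北京', '工作']
```

Once the sentence is split this way, an entity recognizer can label 张三 as a person and 北京 as a location instead of scanning an undivided character stream.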

3. Thai (TH) — Tone Marks + No Spaces

Challenges:

  • No spaces between words
  • Tone marks (mai ek ่, mai tho ้, mai tri ๊, mai chattawa ๋) change the meaning
  • Vowels are written above or below the consonant

Example:

สำนักงาน (samnak ngan) = office
สำหนัก (sam nak) = residence

→ Detection must **handle tone marks and segmentation together**

Solution: the Presidio-based pipeline uses PyThaiNLP for Thai.
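
Thai tone marks are separate combining code points in the range U+0E48 to U+0E4B, so a detector can locate them explicitly. The sketch below uses only built-in Python; the helper `find_tone_marks` is hypothetical and does not reflect PyThaiNLP's API.

```python
# The four Thai tone marks and their conventional names
THAI_TONE_MARKS = {
    "\u0e48": "mai ek",
    "\u0e49": "mai tho",
    "\u0e4a": "mai tri",
    "\u0e4b": "mai chattawa",
}

def find_tone_marks(text):
    """Return (index, name) for every tone mark found in the text."""
    return [
        (i, THAI_TONE_MARKS[ch])
        for i, ch in enumerate(text)
        if ch in THAI_TONE_MARKS
    ]

water = "\u0e19\u0e49\u0e33"  # น้ำ: consonant, mai tho, sara am
office = "\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e07\u0e32\u0e19"  # สำนักงาน, no tone marks

print(find_tone_marks(water))   # [(1, 'mai tho')]
print(find_tone_marks(office))  # []
```

Because a tone mark sits between a consonant and the following vowel in the code point sequence, pattern matching that ignores these marks can either miss a name or match the wrong word.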

4. Hindi (HI) — Devanagari Script + Conjuncts

Challenges:

  • Devanagari script (an abugida: each consonant carries an inherent vowel)
  • Conjuncts (consonants fused into a single glyph)
  • Combining marks (vowel signs above or below the consonant)

Example:

मैं (main) = I
मेरा (mera) = my

क्ष (ksha) = consonant cluster (K + SH)

→ Detection must **handle conjuncts**

Solution: the Presidio-based pipeline uses Indic NLP for Hindi/Marathi/Bengali.
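
A conjunct looks like one glyph on screen but is stored as several code points joined by a virama. The following sketch, using only the standard `unicodedata` module, makes that visible; the helpers `codepoint_names` and `has_conjunct` are hypothetical and are not part of the Indic NLP library.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, suppresses the inherent vowel

def codepoint_names(text):
    """List the Unicode name of every code point in the text."""
    return [unicodedata.name(ch) for ch in text]

def has_conjunct(text):
    """True if a virama is directly followed by a letter (category 'Lo')."""
    return any(
        ch == VIRAMA and unicodedata.category(nxt) == "Lo"
        for ch, nxt in zip(text, text[1:])
    )

ksha = "\u0915\u094d\u0937"      # क्ष
mera = "\u092e\u0947\u0930\u093e"  # मेरा

print(codepoint_names(ksha))
# ['DEVANAGARI LETTER KA', 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER SSA']
print(has_conjunct(ksha), has_conjunct(mera))  # True False
```

Any offset-based redaction has to treat the whole sequence as one unit; cutting between the virama and the following consonant would leave a broken glyph behind.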

The Hybrid NLP Solution

anonym.legal uses a 3-layer hybrid approach across 48 languages:

Layer 1: Regex

  • Email, SSN, Credit Card (the same in every language)
  • Fast, 99% accurate

Layer 2: NLP (spaCy)

  • Named Entity Recognition (names, places, dates)
  • Language-specific models (48 models)

Layer 3: Fuzzy Matching

  • Typos: "Joh Smith" → "John Smith"
  • Language-specific fuzzy matchers (ar_fuzzy, zh_fuzzy)

Why 3 layers:

  1. Regex catches patterns (fast)
  2. spaCy catches entities (accurate)
  3. Fuzzy matching catches typos (comprehensive)
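
The three layers above can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not anonym.legal's implementation: every function name is hypothetical, the NER layer is a placeholder where a spaCy model would be called, and fuzzy matching uses the standard library's `difflib` against a hypothetical list of known names.

```python
import re
from difflib import get_close_matches

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def regex_layer(text):
    """Layer 1: language-independent patterns (only email shown here)."""
    return [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]

def ner_layer(text):
    """Layer 2: placeholder. A real system would run a language-specific NER model."""
    return []

def fuzzy_layer(text, known_names, cutoff=0.85):
    """Layer 3: compare sliding word windows against known names to catch typos."""
    words = text.split()
    hits = []
    for name in known_names:
        n = len(name.split())
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if get_close_matches(candidate, [name], n=1, cutoff=cutoff):
                hits.append(("PERSON", candidate))
    return hits

def detect(text, known_names):
    """Run the three layers in order and concatenate their findings."""
    return regex_layer(text) + ner_layer(text) + fuzzy_layer(text, known_names)

findings = detect("Contact Joh Smith at joh.smith@example.com", ["John Smith"])
print(findings)
# [('EMAIL', 'joh.smith@example.com'), ('PERSON', 'Joh Smith')]
```

Running the cheap regex layer first and the fuzzy layer last keeps the common case fast while still catching the misspelled name that neither exact patterns nor a dictionary lookup would find.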

GDPR Compliance

anonym.legal's 48-language Presidio deployment supports:

  • Article 32: Technical measures (encryption, anonymization)
  • Article 33: Breach notification (automatic PII detection)
  • Article 34: Communicating a breach to data subjects (supported by reversible or permanent redaction)
  • DPIA: Data Protection Impact Assessment (available)

For each language:

  • Presidio detects 267 entity types (global, not just EN)
  • spaCy provides 48-language NER (PERSON, ORG, GPE, DATE, etc.)
  • GDPR test cases (50+ per language)

Ready to protect your data?

Start anonymizing PII with 285+ entity types in 48 languages.