What Is GDPR?
GDPR (General Data Protection Regulation) is the EU regulation that governs the protection of personal data (PII). Language coverage is where detection gets difficult:
- The EU has 24 official languages
- But large businesses serve customers in as many as 48 languages
Examples:
- Microsoft: 49 languages
- Amazon: 45 languages
- anonym.legal: 48 languages
Multilingual Challenges
1. Arabic (AR): RTL + Diacritics
Challenges:
- Written right to left (RTL)
- Harakat (diacritics) add short-vowel marks to letters
- Hamza (ء) is written in several different forms
Example:
القاهرة (Cairo): plain
ٱلْقَاهِرَة (Cairo with diacritics): the same word, but with harakat
→ Detection must **match both the plain and the diacritized form**
Solution: Presidio uses NFKD normalization (stripping diacritics) before detection.
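The normalization step can be sketched with Python's standard library. This is a minimal illustration, not Presidio's internal code: the function name `strip_harakat` is hypothetical, and folding alef wasla (ٱ) to plain alef (ا) is an added assumption, needed because that letter has no Unicode decomposition and NFKD alone leaves it untouched.

```python
import unicodedata

ALEF_WASLA = "\u0671"  # ٱ, not decomposed by NFKD
ALEF = "\u0627"        # ا

def strip_harakat(text: str) -> str:
    """Decompose with NFKD, drop nonspacing marks, fold alef wasla to alef."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Harakat (fatha, kasra, sukun, ...) are in Unicode category Mn.
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return bare.replace(ALEF_WASLA, ALEF)

# القاهرة (plain) and ٱلْقَاهِرَة (with harakat), spelled out as code points
plain = "\u0627\u0644\u0642\u0627\u0647\u0631\u0629"
vocalized = "\u0671\u0644\u0652\u0642\u064E\u0627\u0647\u0650\u0631\u064E\u0629"

same_word = strip_harakat(plain) == strip_harakat(vocalized)
```

Running detection on the stripped text lets one pattern or dictionary entry cover both spellings. Character offsets shift after stripping, so a real pipeline would also need to map matches back to the original text.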
2. Chinese (ZH): Segmentation + Ambiguity
Challenges:
- No spaces between words
- 20,000+ characters (vs 26 letters in English)
- Ambiguity: "北京" = Beijing (the city) or, read literally, "běi jīng" (northern capital)
Example:
张三在北京工作
Zhang San (name) + zai (in) + Beijing (place) + gongzuo (work)
→ Detection must **segment the text first**
Solution: Presidio uses the Jieba segmenter for Chinese.
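To show why segmentation has to come before entity detection, here is a toy forward maximum matching segmenter. It is a stand-in for illustration only, not Jieba's algorithm or API, and the lexicon below is a hypothetical four-word dictionary.

```python
def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching: take the longest known word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical lexicon covering the example sentence
LEXICON = {"张三", "在", "北京", "工作"}
tokens = segment("张三在北京工作", LEXICON)
# tokens is ["张三", "在", "北京", "工作"]
```

Once the sentence is split into words, a name such as 张三 or a place such as 北京 becomes a discrete token that an NER model or dictionary can label.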
3. Thai (TH): Tone Marks + No Spaces
Challenges:
- No spaces between words
- Tone marks (mai ek ่, mai tho ้, mai tri ๊, mai chattawa ๋) change meaning
- Vowels are written above or below the consonant
Example:
สำนักงาน (samnak ngan) = office
สำหนัก (sam nak) = residence
→ Detection must **handle tone marks and segmentation**
Solution: Presidio uses PyThaiNLP for Thai.
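One practical consequence of stacked tone marks and vowels is that a redaction span must never end between a consonant and the marks attached to it. The sketch below assumes spans are character offsets; `snap_end` is a hypothetical helper, not part of PyThaiNLP or Presidio.

```python
import unicodedata

def snap_end(text: str, end: int) -> int:
    """Move a span's end offset forward past any combining marks."""
    # Thai tone marks and above/below vowels are in Unicode categories starting with "M".
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return end

# น้ำ ("water"): consonant no nu, tone mark mai tho, vowel sara am
word = "\u0E19\u0E49\u0E33"
end = snap_end(word, 1)  # a cut right after the consonant would orphan the tone mark
# end is 2
```

Word segmentation itself still needs a dictionary or model; this only keeps span boundaries on valid character clusters.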
4. Hindi (HI): Devanagari Script + Conjuncts
Challenges:
- Devanagari script (an abugida, where each consonant carries an inherent vowel)
- Conjuncts (consonants fused into a single glyph)
- Combining marks (vowel signs above or below the consonant)
Example:
मैं (main) = I
मेरा (mera) = my
क्ष (ksha) = consonant cluster (K + SH)
→ Detection must **handle conjuncts**
Solution: Presidio uses Indic NLP for Hindi/Marathi/Bengali.
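A conjunct such as क्ष is stored as three code points (consonant, virama, consonant), so character-level matching can split it. The following is a simplified sketch of grouping Devanagari text into visual clusters; `clusters` is a hypothetical helper, not an Indic NLP function, and it ignores edge cases such as zero-width joiners.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari virama (halant): fuses a consonant with the next one

def clusters(text: str) -> list[str]:
    """Group each base letter with its marks and with consonants joined by a virama."""
    out: list[str] = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        joins_previous = bool(out) and out[-1].endswith(VIRAMA) and category.startswith("L")
        if out and (is_mark or joins_previous):
            out[-1] += ch
        else:
            out.append(ch)
    return out

ksha = clusters("\u0915\u094D\u0937")        # क्ष: one conjunct cluster
mera = clusters("\u092E\u0947\u0930\u093E")  # मेरा: two clusters, मे and रा
```

Matching and masking on clusters rather than raw code points keeps a redacted span from cutting a conjunct in half.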
The Hybrid NLP Approach
anonym.legal uses a 3-layer hybrid across 48 languages:
Layer 1: Regex
- Email, SSN, credit card (the same patterns in every language)
- Fast, 99% accurate
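A regex layer of this kind might look like the sketch below. The patterns are deliberately simplified assumptions, not anonym.legal's or Presidio's actual recognizers, and the Luhn checksum is added here as a common way to cut false positives on card numbers.

```python
import re

# Simplified, illustrative patterns (assumptions, not production recognizers)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators

def luhn_ok(number: str) -> bool:
    """Validate a card number with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_structured_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs for emails and checksum-valid card numbers."""
    hits = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    hits += [("CREDIT_CARD", m.group()) for m in CARD_RE.finditer(text) if luhn_ok(m.group())]
    return hits

sample = "Contact: ali@example.com, card 4111 1111 1111 1111"
hits = find_structured_pii(sample)
```

Because these formats do not depend on the surrounding language, the same patterns can run unchanged on Arabic, Chinese, Thai, or Hindi text.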
Layer 2: NLP (spaCy)
- Named Entity Recognition (names, places, dates)
- Language-specific models (48 models)
Layer 3: Fuzzy Matching
- Typos: "Joh Smith" → "John Smith"
- Language-specific fuzzy matching (ar_fuzzy, zh_fuzzy)
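The typo case can be illustrated with Python's standard `difflib`. This is a sketch under assumptions: `resolve_name`, the list of known names, and the 0.85 cutoff are all hypothetical, and the source does not say which fuzzy algorithm anonym.legal uses.

```python
from difflib import get_close_matches

def resolve_name(candidate: str, known_names: list[str], cutoff: float = 0.85):
    """Return the closest known name above the similarity cutoff, or None."""
    matches = get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

KNOWN = ["John Smith", "Jane Doe"]  # hypothetical list of already-detected names
fixed = resolve_name("Joh Smith", KNOWN)
missing = resolve_name("Maria Lopez", KNOWN)
```

A high cutoff keeps the layer conservative: a misspelled name is linked to a known one only when the two strings are nearly identical.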
Why 3 layers:
- Regex catches patterns (fast)
- spaCy catches entities (accurate)
- Fuzzy matching catches typos (comprehensive)
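When three layers run over the same text, their findings can overlap, so something has to decide which one wins. The sketch below assumes a simple priority rule (regex over NER over fuzzy); the `Span` type, the `merge_layers` function, and the rule itself are illustrative assumptions, since the source does not describe how results are combined.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int
    end: int
    label: str
    layer: int  # 1 = regex, 2 = NER, 3 = fuzzy; assumed: lower number wins on overlap

def merge_layers(spans: list[Span]) -> list[Span]:
    """Keep non-overlapping spans, preferring higher-priority layers."""
    kept: list[Span] = []
    for span in sorted(spans, key=lambda s: (s.layer, s.start)):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)

found = [
    Span(0, 10, "PERSON", 2),
    Span(5, 10, "PERSON", 3),   # overlaps the NER hit, dropped
    Span(20, 35, "EMAIL", 1),
    Span(18, 35, "ORG", 2),     # overlaps the regex hit, dropped
]
merged = merge_layers(found)
```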
GDPR Compliance
anonym.legal's 48-language Presidio deployment supports:
- Article 32: Security of processing through technical measures (encryption, anonymization)
- Article 33: Breach notification to the supervisory authority (automatic PII detection)
- Article 34: Breach communication to data subjects, with GDPR-compliant redaction (reversible + permanent)
- DPIA: Data Protection Impact Assessment (available)
For every language:
- Presidio detects 267 entity types (global, not just EN)
- spaCy provides NER in 48 languages (PERSON, ORG, GPE, DATE, etc.)
- GDPR test cases (50+ per language)