The Multilingual NER Challenge
Named Entity Recognition (NER) models trained on English achieve impressive results: 85-92% F1 on standard benchmarks. Apply those same models to Arabic or Chinese, and F1 often drops to 50-70%.
For PII detection, this gap is critical. A 70% detection rate means 30% of sensitive data goes unprotected.
Why English Models Fail
1. Word Boundaries
English: Words are separated by spaces.
"John Smith lives in New York"
→ ["John", "Smith", "lives", "in", "New", "York"]
Chinese: No word boundaries at all.
"张伟住在北京"
→ Needs segmentation first: ["张伟", "住在", "北京"]
Arabic: Words connect, and short vowels aren't written.
"محمد يعيش في دبي"
→ Connected script, right-to-left, vowels omitted
English tokenization rules simply don't apply.
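A minimal Python illustration makes the point: naive whitespace splitting handles the English sentence above perfectly, but returns the Chinese sentence as a single unbroken "token" (no segmentation library is assumed here).

```python
# Whitespace tokenization: fine for English, useless for Chinese.
english = "John Smith lives in New York"
chinese = "张伟住在北京"

print(english.split())  # ['John', 'Smith', 'lives', 'in', 'New', 'York']
print(chinese.split())  # ['张伟住在北京'] -- one giant token; segmentation needed first
```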
2. Morphological Complexity
English morphology: Relatively simple
run → runs, running, ran
Arabic morphology: Extremely complex (root-pattern system)
كتب (k-t-b, "write" root)
→ كاتب (writer), كتاب (book), مكتبة (library), يكتب (he writes)
A single Arabic root generates dozens of related words. NER models must understand this derivation system.
3. Name Conventions
English names: First Last
John Smith, Mary Johnson
Arabic names: Multiple components
محمد بن عبد الله بن عبد المطلب
(Muhammad son-of Abdullah son-of Abdul-Muttalib)
Chinese names: Family name first, often 2-3 characters total
张伟 (Zhang Wei) - 2 characters
欧阳修 (Ouyang Xiu) - 3 characters
4. Script Direction
English: Left-to-right (LTR)
Arabic/Hebrew: Right-to-left (RTL)
Mixed text: Bidirectional (BiDi) - extremely complex
When an English name appears in Arabic text:
التقيت بـ John Smith في المؤتمر
(I met John Smith at the conference)
The logical (storage) order and the visual (display) order differ, so an entity's character positions depend on which order you measure in.
Accuracy by Language
Real-world NER performance varies dramatically:
| Language | Script | F1-Score Range | Challenge Level |
|---|---|---|---|
| English | Latin | 85-92% | Low |
| German | Latin | 82-88% | Low |
| French | Latin | 80-87% | Low |
| Spanish | Latin | 81-86% | Low |
| Russian | Cyrillic | 75-83% | Medium |
| Arabic | Arabic | 55-75% | High |
| Chinese | Hanzi | 60-78% | High |
| Japanese | Mixed | 65-80% | High |
| Thai | Thai | 50-70% | Very High |
| Hindi | Devanagari | 60-75% | High |
Languages with complex morphology, non-Latin scripts, or no word boundaries consistently underperform.
anonym.legal's Three-Tier Approach
We solve multilingual NER through three specialized tiers:
Tier 1: spaCy (25 languages)
For high-resource languages with good models:
- English, German, French, Spanish, Italian, Portuguese
- Dutch, Polish, Russian, Greek
- And 15 more with reliable accuracy
Tier 2: Stanza (7 languages)
For languages with complex morphology:
- Arabic (root-pattern morphology)
- Chinese (word segmentation required)
- Japanese (multiple scripts)
- Korean (agglutinative)
- And 3 more
Tier 3: XLM-RoBERTa (16 languages)
For low-resource languages without dedicated models:
- Thai, Vietnamese, Indonesian
- Hindi, Bengali, Tamil
- Hebrew, Turkish, Farsi
- And more
How It Works
Input text with language detection
↓
[Language Router]
↓
┌───────┴───────┐
↓ ↓
High-resource Complex/Low-resource
(spaCy) (Stanza/XLM-RoBERTa)
↓ ↓
└───────┬───────┘
↓
[Regex overlay for structured data]
↓
[Confidence merger]
↓
Final entities
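The routing step above can be sketched as a simple dispatch table. This is a simplified illustration only: the tier membership mirrors the lists above (the real router covers all 48 languages), and model loading is omitted.

```python
# Simplified language router: map a detected language code to a processing tier.
SPACY_LANGS = {"en", "de", "fr", "es", "it", "pt", "nl", "pl", "ru", "el"}
STANZA_LANGS = {"ar", "zh", "ja", "ko"}
XLMR_LANGS = {"th", "vi", "id", "hi", "bn", "ta", "he", "tr", "fa"}

def route(lang_code: str) -> str:
    """Return the NER tier responsible for a language code."""
    if lang_code in SPACY_LANGS:
        return "spacy"
    if lang_code in STANZA_LANGS:
        return "stanza"
    if lang_code in XLMR_LANGS:
        return "xlm-roberta"
    raise ValueError(f"Unsupported language: {lang_code}")

print(route("ar"))  # stanza
```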
Regex Overlay
Some patterns are language-independent:
- Email addresses: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
- Credit cards: \d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}
- Phone numbers: various patterns per country
We apply regex first for structured data, regardless of language.
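These patterns match regardless of the surrounding script. A minimal sketch with Python's re module, using the email and credit-card patterns above on a sentence with Arabic text around the structured data:

```python
import re

# Language-independent patterns, applied before any NER model runs.
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
CREDIT_CARD = re.compile(r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}")

# "Contact me at ... or ..." in Arabic, with embedded structured data.
text = "تواصل معي على john@example.com أو 4111-1111-1111-1111"
print(EMAIL.findall(text))        # ['john@example.com']
print(CREDIT_CARD.findall(text))  # ['4111-1111-1111-1111']
```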
RTL Script Handling
Right-to-left languages require special handling:
Bidirectional Text Algorithm
When Arabic contains English:
Visual: المؤتمر في John Smith بـ التقيت
Logical: التقيت بـ John Smith في المؤتمر
Our processing:
- Normalize to logical order
- Run NER on logical order
- Map entity positions back to visual order
- Return consistent positions for any rendering
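The first step, deciding which characters belong to an RTL run, can use each character's Unicode bidi class, which Python exposes via unicodedata. This sketch covers classification only; full reordering follows the Unicode Bidirectional Algorithm and is considerably more involved.

```python
import unicodedata

def is_rtl_char(ch: str) -> bool:
    """True if the character's Unicode bidi class is a right-to-left class."""
    # "R" = right-to-left, "AL" = Arabic letter, "AN" = Arabic-Indic number
    return unicodedata.bidirectional(ch) in ("R", "AL", "AN")

mixed = "التقيت بـ John Smith"
print([ch for ch in mixed if is_rtl_char(ch)])  # the Arabic letters only
```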
Entity Boundary Detection
Arabic entity boundaries are complex:
"محمد" - just the name
"لمحمد" - "to Muhammad" (attached preposition)
"ومحمد" - "and Muhammad" (attached conjunction)
We strip affixes before NER and reattach afterward.
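The strip-and-reattach step can be sketched as follows. This is a deliberately simplified illustration with a fixed list of one-letter proclitics; real clitic handling is morphology-aware, not a prefix lookup.

```python
# Common one-letter Arabic proclitics: and, so, with/by, to/for, like.
PROCLITICS = ("و", "ف", "ب", "ل", "ك")

def strip_proclitic(token: str):
    """Return (prefix, stem); prefix is "" if nothing was stripped."""
    if len(token) > 2 and token.startswith(PROCLITICS):
        return token[0], token[1:]
    return "", token

print(strip_proclitic("لمحمد"))  # ('ل', 'محمد') -- "to Muhammad"
print(strip_proclitic("محمد"))   # ('', 'محمد')  -- bare name, untouched
```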
Code-Switching
Real text often mixes languages:
"El meeting con John es at 3pm"
(Spanish-English mixing)
"我今天跟John去shopping"
(Chinese-English mixing)
Our approach:
- Segment text by language
- Process each segment with appropriate model
- Merge results with position mapping
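The segmentation step can be sketched with a coarse script classifier based on Unicode character names. This is an illustration only: it buckets characters into three scripts and ignores punctuation edge cases, whereas real segmentation must handle many more scripts.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Coarse script bucket based on the character's Unicode name."""
    if ch.isspace():
        return "space"
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "zh"
    if "ARABIC" in name:
        return "ar"
    return "latin"

def segment(text: str):
    """Split text into (script, chunk) runs; spaces attach to the current run."""
    segments = []
    for ch in text:
        s = script_of(ch)
        if segments and (s == "space" or segments[-1][0] == s):
            segments[-1] = (segments[-1][0], segments[-1][1] + ch)
        else:
            segments.append((s, ch))
    return segments

print(segment("我今天跟John去shopping"))
# [('zh', '我今天跟'), ('latin', 'John'), ('zh', '去'), ('latin', 'shopping')]
```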
Performance Benchmarks
Internal testing on mixed-language datasets:
| Scenario | F1-Score |
|---|---|
| English only | 91% |
| German only | 88% |
| Arabic only | 79% |
| Chinese only | 81% |
| English-Arabic mix | 83% |
| English-Chinese mix | 84% |
| English-German mix | 89% |
Our hybrid approach maintains high accuracy even on challenging languages.
Implementation Tips
For API Users
Specify language when known:
{
"text": "محمد بن عبد الله",
"language": "ar"
}
Let us auto-detect when unknown:
{
"text": "محمد بن عبد الله",
"language": "auto"
}
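A small helper makes the choice explicit. This sketch builds only the request body; the field names follow the JSON examples above, and the endpoint URL and transport layer are omitted since they depend on your integration.

```python
import json

def build_request(text: str, language: str = "auto") -> str:
    """Serialize a detection request body; pass a language code when known."""
    # ensure_ascii=False keeps Arabic (and other non-Latin) text readable.
    return json.dumps({"text": text, "language": language}, ensure_ascii=False)

print(build_request("محمد بن عبد الله", "ar"))
print(build_request("محمد بن عبد الله"))  # falls back to auto-detection
```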
For Desktop App Users
The app auto-detects language per-document. For mixed-language files, it processes each segment appropriately.
For Custom Entity Types
Custom patterns should account for scripts:
# English employee ID
EMP-[0-9]{6}
# Arabic employee ID (includes Arabic numerals)
موظف-[٠-٩0-9]{6}
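The Arabic pattern above can be verified directly with Python's re module; the character class [٠-٩0-9] covers both Arabic-Indic digits (U+0660 to U+0669) and ASCII digits.

```python
import re

# The Arabic employee-ID pattern from above.
ARABIC_EMP_ID = re.compile(r"موظف-[٠-٩0-9]{6}")

print(bool(ARABIC_EMP_ID.fullmatch("موظف-٠١٢٣٤٥")))  # True (Arabic-Indic digits)
print(bool(ARABIC_EMP_ID.fullmatch("موظف-012345")))   # True (ASCII digits)
print(bool(ARABIC_EMP_ID.fullmatch("موظف-12345")))    # False (only 5 digits)
```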
Conclusion
English-trained NER models fail on non-English text because languages differ fundamentally in:
- Word boundaries (or lack thereof)
- Morphological complexity
- Script direction
- Name conventions
Effective multilingual PII detection requires:
- Language-specific models for complex scripts
- Regex patterns for structured data
- Proper RTL/BiDi handling
- Code-switching support
anonym.legal supports 48 languages through this three-tier approach, achieving consistently strong accuracy across all of them.