# The Multilingual NER Challenge

Named entity recognition (NER) models trained on English achieve impressive results: F1 scores of 85-92% on standard benchmarks. Apply those same models to Arabic or Chinese? Accuracy often drops to 50-70%.

For PII detection, this gap is critical. A 70% detection rate means 30% of sensitive data goes unprotected.

## Why English Models Fail

### 1. Word Boundaries

English: Words are separated by spaces.
"John Smith lives in New York"
→ ["John", "Smith", "lives", "in", "New", "York"]
Chinese: No word boundaries at all.
"张伟住在北京"
→ Needs segmentation first: ["张伟", "住在", "北京"]
Arabic: Words connect, and short vowels aren't written.
"محمد يعيش في دبي"
→ Connected script, right-to-left, vowels omitted
English tokenization rules simply don't apply.
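A minimal sketch of where the assumption breaks: whitespace splitting recovers English tokens but returns a Chinese sentence as a single undivided "token".

```python
# Whitespace tokenization: adequate for English, useless for Chinese.
def whitespace_tokenize(text: str) -> list[str]:
    return text.split()

print(whitespace_tokenize("John Smith lives in New York"))
# ['John', 'Smith', 'lives', 'in', 'New', 'York']

# Chinese has no spaces, so the entire sentence comes back as one "token";
# proper segmentation needs a dedicated segmenter (e.g. Stanza's tokenizer).
print(whitespace_tokenize("张伟住在北京"))
# ['张伟住在北京']
```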
### 2. Morphological Complexity
English morphology: Relatively simple
run → runs, running, ran
Arabic morphology: Extremely complex (root-pattern system)
كتب (k-t-b, "write" root)
→ كاتب (writer), كتاب (book), مكتبة (library), يكتب (he writes)
### 3. Naming Conventions

English names: typically given name + family name
John Smith, Mary Johnson
Arabic names: Multiple components
محمد بن عبد الله بن عبد المطلب
(Muhammad son-of Abdullah son-of Abdul-Muttalib)
Chinese names: Family name first, often 2-3 characters total
张伟 (Zhang Wei) - 2 characters
欧阳修 (Ouyang Xiu) - 3 characters
### 4. Script Direction

English: Left-to-right (LTR)
Arabic/Hebrew: Right-to-left (RTL)
Mixed text: Bidirectional (BiDi), which is extremely complex
When an English name appears in Arabic text:
التقيت بـ John Smith في المؤتمر
(I met John Smith at the conference)
The logical (storage) order and the visual (display) order differ.
## Accuracy by Language
Real-world NER performance varies dramatically:
| Language | Script | F1-Score Range | Challenge Level |
|---|---|---|---|
| English | Latin | 85-92% | Low |
| German | Latin | 82-88% | Low |
| French | Latin | 80-87% | Low |
| Spanish | Latin | 81-86% | Low |
| Russian | Cyrillic | 75-83% | Medium |
| Arabic | Arabic | 55-75% | High |
| Chinese | Hanzi | 60-78% | High |
| Japanese | Mixed | 65-80% | High |
| Thai | Thai | 50-70% | Very High |
| Hindi | Devanagari | 60-75% | High |
Languages with complex morphology, non-Latin scripts, or no word boundaries consistently underperform.
## anonym.legal's Three-Tier Approach
We solve multilingual NER through three specialized tiers:
### Tier 1: spaCy (25 languages)
For high-resource languages with good models:
- English, German, French, Spanish, Italian, Portuguese
- Dutch, Polish, Russian, Greek
- And 15 more with reliable accuracy
### Tier 2: Stanza (7 languages)
For languages with complex morphology:
- Arabic (root-pattern morphology)
- Chinese (word segmentation required)
- Japanese (multiple scripts)
- Korean (agglutinative)
- And 3 more
### Tier 3: XLM-RoBERTa (16 languages)
For low-resource languages without dedicated models:
- Thai, Vietnamese, Indonesian
- Hindi, Bengali, Tamil
- Hebrew, Turkish, Farsi
- And more
### How It Works

```
Input text with language detection
                ↓
        [Language Router]
                ↓
        ┌───────┴───────┐
        ↓               ↓
  High-resource    Complex/Low-resource
     (spaCy)      (Stanza/XLM-RoBERTa)
        ↓               ↓
        └───────┬───────┘
                ↓
[Regex overlay for structured data]
                ↓
       [Confidence merger]
                ↓
         Final entities
```
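The routing step can be sketched as follows. The tier assignments mirror the language lists above; the set names and `route` function are illustrative, not anonym.legal's actual API.

```python
# Illustrative language router: map a detected ISO 639-1 code to an NER tier.
SPACY_LANGS = {"en", "de", "fr", "es", "it", "pt", "nl", "pl", "ru", "el"}
STANZA_LANGS = {"ar", "zh", "ja", "ko"}
XLMR_LANGS = {"th", "vi", "id", "hi", "bn", "ta", "he", "tr", "fa"}

def route(lang: str) -> str:
    if lang in SPACY_LANGS:
        return "spacy"
    if lang in STANZA_LANGS:
        return "stanza"
    # XLM-RoBERTa also serves as the multilingual fallback.
    return "xlm-roberta"

print(route("de"))  # spacy
print(route("ar"))  # stanza
print(route("th"))  # xlm-roberta
```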
### Regex Overlay

- Emails: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Credit cards: `\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}`
- Phone numbers: various patterns per country
We apply regex first for structured data, regardless of language.
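A small demonstration of why the overlay is language-independent: the email and card patterns above fire inside Arabic text unchanged, since structured identifiers use the same characters in every script.

```python
import re

# The email and card patterns from the list above.
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
CARD = re.compile(r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}")

# Arabic sentence: "email me at ... or the number ..."
text = "راسلني على ahmed@example.com أو الرقم 4111-1111-1111-1111"

print(EMAIL.findall(text))  # ['ahmed@example.com']
print(CARD.findall(text))   # ['4111-1111-1111-1111']
```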
## RTL Script Handling
Right-to-left languages require special handling:
### Bidirectional Text Algorithm
When Arabic contains English:
Visual: المؤتمر في John Smith بـ التقيت
Logical: التقيت بـ John Smith في المؤتمر
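Python's standard `unicodedata` module exposes each character's bidirectional class, which is what the Unicode BiDi algorithm uses to compute display order from logical order:

```python
import unicodedata

# Strings are stored in logical order; the BiDi algorithm reorders them for
# display based on each character's bidirectional class.
print(unicodedata.bidirectional("ا"))  # 'AL' (Arabic Letter, right-to-left)
print(unicodedata.bidirectional("J"))  # 'L'  (Latin letter, left-to-right)

# An NER model must always consume the logical order, never the visual order.
```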
### Affix Stripping

Arabic attaches prepositions and conjunctions directly to the following word, including names:
"محمد" - just the name
"لمحمد" - "to Muhammad" (attached preposition)
"ومحمد" - "and Muhammad" (attached conjunction)
We strip affixes before NER and reattach afterward.
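A minimal sketch of the stripping step, covering only the single-letter clitics shown above (the function name and prefix list are illustrative; robust clitic handling requires a morphological analyzer, since a name may legitimately begin with one of these letters):

```python
# One-letter Arabic clitics: و (and), ل (to), ب (with/by), ف (so/then).
PREFIXES = ("و", "ل", "ب", "ف")

def strip_prefix(token: str) -> tuple[str, str]:
    """Detach a leading clitic so NER sees the bare word; return (clitic, word)."""
    if len(token) > 2 and token[0] in PREFIXES:
        return token[0], token[1:]
    return "", token

print(strip_prefix("ومحمد"))  # ('و', 'محمد')  "and Muhammad"
print(strip_prefix("لمحمد"))  # ('ل', 'محمد')  "to Muhammad"
print(strip_prefix("محمد"))   # ('', 'محمد')   bare name, untouched
```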
## Code-Switching
Real text often mixes languages:
"El meeting con John es at 3pm" (Spanish-English mixing)
"我今天跟John去shopping" (Chinese-English mixing)
Our approach:
1. Segment text by language
2. Process each segment with appropriate model
3. Merge results with position mapping
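Step 1 can be sketched with a coarse script classifier that groups consecutive characters of the same script into runs (a real implementation would cover all 48 supported scripts; this toy version only distinguishes CJK from Latin):

```python
import unicodedata

def script_of(ch: str) -> str:
    """Coarse script tag: 'cjk' for CJK ideographs, 'latin' for everything else."""
    return "cjk" if "CJK" in unicodedata.name(ch, "") else "latin"

def segment_by_script(text: str) -> list[tuple[str, str]]:
    """Split mixed text into (script, run) chunks, preserving character order."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        tag = script_of(ch)
        if runs and runs[-1][0] == tag:
            runs[-1] = (tag, runs[-1][1] + ch)  # extend the current run
        else:
            runs.append((tag, ch))              # start a new run
    return runs

print(segment_by_script("我今天跟John去shopping"))
# [('cjk', '我今天跟'), ('latin', 'John'), ('cjk', '去'), ('latin', 'shopping')]
```

Each run is then handed to the model for its language, and entity offsets are mapped back to positions in the original string.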
## Performance Benchmarks
Internal testing on mixed-language datasets:
| Scenario | F1-Score |
|----------|----------|
| English only | 91% |
| German only | 88% |
| Arabic only | 79% |
| Chinese only | 81% |
| English-Arabic mix | 83% |
| English-Chinese mix | 84% |
| English-German mix | 89% |
Our hybrid approach maintains high accuracy even on challenging languages.
## Implementation Tips
### For API Users
Specify language when known:
```json
{
  "text": "محمد بن عبد الله",
  "language": "ar"
}
```

Let us auto-detect when unknown:

```json
{
  "text": "محمد بن عبد الله",
  "language": "auto"
}
```
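A minimal client sketch building those request payloads with only the standard library; the endpoint URL is a placeholder, not the real anonym.legal API host:

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the real API URL and authentication.
API_URL = "https://api.example.com/v1/detect"

def build_request(text: str, language: str = "auto") -> urllib.request.Request:
    """Assemble a POST request carrying the {"text", "language"} JSON payload."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )

req = build_request("محمد بن عبد الله", language="ar")
print(json.loads(req.data))  # {'text': 'محمد بن عبد الله', 'language': 'ar'}
```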
### For Desktop App Users
The app auto-detects language per-document. For mixed-language files, it processes each segment appropriately.
### For Custom Entity Types
Custom patterns should account for scripts:
```
# English employee ID
EMP-[0-9]{6}

# Arabic employee ID (includes Arabic numerals)
موظف-[٠-٩0-9]{6}
```
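The Arabic pattern above works because the character class lists both Arabic-Indic digits (٠-٩, U+0660 to U+0669) and ASCII digits explicitly:

```python
import re

# Employee-ID pattern from above: accepts Arabic-Indic and ASCII digits.
ARABIC_EMP = re.compile(r"موظف-[٠-٩0-9]{6}")

print(bool(ARABIC_EMP.fullmatch("موظف-١٢٣٤٥٦")))  # True  (Arabic-Indic digits)
print(bool(ARABIC_EMP.fullmatch("موظف-123456")))  # True  (ASCII digits)
print(bool(ARABIC_EMP.fullmatch("EMP-123456")))   # False (wrong label/script)
```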
## Conclusion

English-trained NER models fail on non-English text because languages differ fundamentally in:

- Word boundaries (or the lack thereof)
- Morphological complexity
- Script direction
- Naming conventions

Effective multilingual PII detection requires:

- Language-specific models for complex scripts
- Regex patterns for structured data
- Proper RTL/BiDi handling
- Code-switching support

anonym.legal supports 48 languages through our three-tier approach, delivering consistent accuracy across all of them.

Try it yourself:

Sources:
- arXiv - Cross-lingual NER challenges
- [Label Your Data - Multilingual NER](https://labelyourdata.com/articles/multilingual-ner)
- spaCy language models
- Stanza - Stanford NLP