anonym.legal

Multi-Language NER: Why Your English-Trained Model Fails on Arabic

English NER models achieve 85-92% accuracy. Arabic and Chinese? Often 50-70%. Learn about the technical challenges and how to build truly multilingual PII detection.

February 26, 2026 · 8 min read
NER · multilingual · Arabic NLP · Chinese NLP · PII detection

The Multilingual NER Challenge

Named Entity Recognition (NER) models trained on English achieve impressive results—85-92% F1 scores on standard benchmarks. Apply those same models to Arabic or Chinese? Accuracy often drops to 50-70%.

For PII detection, this gap is critical. A 70% detection rate means 30% of sensitive data goes unprotected.

Why English Models Fail

1. Word Boundaries

English: Words are separated by spaces.

"John Smith lives in New York"
→ ["John", "Smith", "lives", "in", "New", "York"]

Chinese: No word boundaries at all.

"张伟住在北京"
→ Needs segmentation first: ["张伟", "住在", "北京"]

Arabic: Words connect, and short vowels aren't written.

"محمد يعيش في دبي"
→ Connected script, right-to-left, vowels omitted

English tokenization rules simply don't apply.
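A quick Python illustration of the problem: whitespace splitting tokenizes English correctly but returns an entire Chinese sentence as one "token". The segmented output shown above has to come from a dedicated word segmenter; this snippet only demonstrates the failure mode.

```python
# Whitespace tokenization: works for English, fails for Chinese.
english = "John Smith lives in New York"
chinese = "张伟住在北京"

print(english.split())  # ['John', 'Smith', 'lives', 'in', 'New', 'York']
print(chinese.split())  # ['张伟住在北京'] — the whole sentence is one "token"
```

Any NER model that assumes space-delimited input silently degrades to sentence-level granularity on Chinese text.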

2. Morphological Complexity

English morphology: Relatively simple

run → runs, running, ran

Arabic morphology: Extremely complex (root-pattern system)

كتب (k-t-b, "write" root)
→ كاتب (writer), كتاب (book), مكتبة (library), يكتب (he writes)

A single Arabic root generates dozens of related words. NER models must understand this derivation system.

3. Name Conventions

English names: First Last

John Smith, Mary Johnson

Arabic names: Multiple components

محمد بن عبد الله بن عبد المطلب
(Muhammad son-of Abdullah son-of Abdul-Muttalib)

Chinese names: Family name first, often 2-3 characters total

张伟 (Zhang Wei) - 2 characters
欧阳修 (Ouyang Xiu) - 3 characters

4. Script Direction

English: Left-to-right (LTR)
Arabic/Hebrew: Right-to-left (RTL)
Mixed text: Bidirectional (BiDi), which is extremely complex

When an English name appears in Arabic text:

التقيت بـ John Smith في المؤتمر
(I met John Smith at the conference)

The logical order (how characters are stored) and the visual order (how they are displayed) differ, so entity offsets must be computed against a consistent representation.
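One helpful property: Python strings store characters in logical (typing) order even for RTL text, so offsets computed on the string are logical offsets; a renderer applies the Unicode BiDi algorithm only at display time. A minimal sketch:

```python
# Python strings hold RTL text in logical order, even though a terminal
# or browser may display the words reordered. NER offsets computed on
# the string are therefore logical offsets.
text = "التقيت بـ John Smith في المؤتمر"

start = text.index("John")
end = start + len("John Smith")
print(text[start:end])  # John Smith
```

This is why normalizing to logical order before NER, then mapping positions back for display, gives stable entity spans regardless of how the text is rendered.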

Accuracy by Language

Real-world NER performance varies dramatically:

Language     Script       F1-Score Range   Challenge Level
English      Latin        85-92%           Low
German       Latin        82-88%           Low
French       Latin        80-87%           Low
Spanish      Latin        81-86%           Low
Russian      Cyrillic     75-83%           Medium
Arabic       Arabic       55-75%           High
Chinese      Hanzi        60-78%           High
Japanese     Mixed        65-80%           High
Thai         Thai         50-70%           Very High
Hindi        Devanagari   60-75%           High

Languages with complex morphology, non-Latin scripts, or no word boundaries consistently underperform.

anonym.legal's Three-Tier Approach

We solve multilingual NER through three specialized tiers:

Tier 1: spaCy (25 languages)

For high-resource languages with good models:

  • English, German, French, Spanish, Italian, Portuguese
  • Dutch, Polish, Russian, Greek
  • And 15 more with reliable accuracy

Tier 2: Stanza (7 languages)

For languages with complex morphology:

  • Arabic (root-pattern morphology)
  • Chinese (word segmentation required)
  • Japanese (multiple scripts)
  • Korean (agglutinative)
  • And 3 more

Tier 3: XLM-RoBERTa (16 languages)

For low-resource languages without dedicated models:

  • Thai, Vietnamese, Indonesian
  • Hindi, Bengali, Tamil
  • Hebrew, Turkish, Farsi
  • And more

How It Works

Input text with language detection
        ↓
[Language Router]
        ↓
┌───────┴───────┐
↓               ↓
High-resource   Complex/Low-resource
(spaCy)         (Stanza/XLM-RoBERTa)
↓               ↓
└───────┬───────┘
        ↓
[Regex overlay for structured data]
        ↓
[Confidence merger]
        ↓
Final entities
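The routing step above can be sketched as a simple lookup. The tier membership below is illustrative, taken from the lists earlier in this post, and is not the production configuration:

```python
# Hypothetical sketch of the language router: map an ISO 639-1 code
# to an NER backend. Tier membership here is illustrative only.
SPACY_LANGS = {"en", "de", "fr", "es", "it", "pt", "nl", "pl", "ru", "el"}
STANZA_LANGS = {"ar", "zh", "ja", "ko"}
# Everything else falls through to the multilingual transformer.

def route(lang: str) -> str:
    """Pick the NER backend for a detected language code."""
    if lang in SPACY_LANGS:
        return "spacy"
    if lang in STANZA_LANGS:
        return "stanza"
    return "xlm-roberta"

print(route("de"))  # spacy
print(route("ar"))  # stanza
print(route("th"))  # xlm-roberta
```

The fallback-to-transformer default means an unrecognized language still gets reasonable coverage rather than an error.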

Regex Overlay

Some patterns are language-independent:

  • Email addresses: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  • Credit cards: \d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}
  • Phone numbers: Various patterns per country

We apply regex first for structured data, regardless of language.
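The overlay can be sketched directly with Python's `re` using the patterns listed above; the sample text (Arabic for "contact sara@example.com or card 4111 1111 1111 1111") is ours, chosen to show that the patterns fire regardless of the surrounding script:

```python
import re

# Language-independent patterns from the overlay. Phone patterns vary
# per country and are omitted here.
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
CARD = re.compile(r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}")

text = "تواصل مع sara@example.com أو البطاقة 4111 1111 1111 1111"

print(EMAIL.findall(text))  # ['sara@example.com']
print(CARD.findall(text))   # ['4111 1111 1111 1111']
```

Because these patterns match on ASCII structure rather than words, they need no segmentation, no morphology, and no language model.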

RTL Script Handling

Right-to-left languages require special handling:

Bidirectional Text Algorithm

When Arabic contains English:

Visual: المؤتمر في John Smith بـ التقيت
Logical: التقيت بـ John Smith في المؤتمر

Our processing:

  1. Normalize to logical order
  2. Run NER on logical order
  3. Map entity positions back to visual order
  4. Return consistent positions for any rendering

Entity Boundary Detection

Arabic entity boundaries are complex:

"محمد" - just the name
"لمحمد" - "to Muhammad" (attached preposition)
"ومحمد" - "and Muhammad" (attached conjunction)

We strip affixes before NER and reattach afterward.
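A minimal sketch of that stripping step, using a simplified list of one-letter clitics (wa- "and", li- "to", bi- "with", fa- "so", ka- "like"). Production systems need morphological analysis on top of this, because a bare prefix check will over-strip names that genuinely begin with one of these letters:

```python
# Simplified Arabic clitic stripping. Real systems use morphological
# analysis to avoid stripping letters that belong to the name itself
# (e.g. names that start with و).
PREFIXES = ("و", "ل", "ب", "ف", "ك")

def strip_prefix(token: str) -> tuple[str, str]:
    """Return (prefix, stem); prefix is '' when nothing was stripped."""
    if len(token) > 2 and token.startswith(PREFIXES):
        return token[0], token[1:]
    return "", token

print(strip_prefix("لمحمد"))  # ('ل', 'محمد')
print(strip_prefix("ومحمد"))  # ('و', 'محمد')
print(strip_prefix("محمد"))   # ('', 'محمد')
```

Keeping the stripped prefix around is what allows reattaching it after NER, so the anonymized output preserves the original grammar.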

Code-Switching

Real text often mixes languages:

"El meeting con John es at 3pm"
(Spanish-English mixing)

"我今天跟John去shopping"
(Chinese-English mixing)

Our approach:

  1. Segment text by language
  2. Process each segment with appropriate model
  3. Merge results with position mapping

Performance Benchmarks

Internal testing on mixed-language datasets:

Scenario              F1-Score
English only          91%
German only           88%
Arabic only           79%
Chinese only          81%
English-Arabic mix    83%
English-Chinese mix   84%
English-German mix    89%

Our hybrid approach maintains high accuracy even on challenging languages.

Implementation Tips

For API Users

Specify language when known:

{
  "text": "محمد بن عبد الله",
  "language": "ar"
}

Let us auto-detect when unknown:

{
  "text": "محمد بن عبد الله",
  "language": "auto"
}

For Desktop App Users

The app auto-detects language per-document. For mixed-language files, it processes each segment appropriately.

For Custom Entity Types

Custom patterns should account for scripts:

# English employee ID
EMP-[0-9]{6}

# Arabic employee ID (includes Arabic numerals)
موظف-[٠-٩0-9]{6}
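The Arabic pattern above works as-is in Python: Arabic-Indic digits ٠–٩ occupy a contiguous Unicode block (U+0660 to U+0669), so the character range is valid, and the class accepts either digit system:

```python
import re

# Custom employee-ID pattern accepting both ASCII and Arabic-Indic
# digits. موظف means "employee"; ٠-٩ span U+0660..U+0669.
pattern = re.compile(r"موظف-[٠-٩0-9]{6}")

print(bool(pattern.search("موظف-123456")))   # True (ASCII digits)
print(bool(pattern.search("موظف-١٢٣٤٥٦")))  # True (Arabic-Indic digits)
print(bool(pattern.search("موظف-12345")))    # False (only 5 digits)
```

When defining custom patterns for multilingual documents, always consider which digit systems and scripts real data will actually contain.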

Conclusion

English-trained NER models fail on non-English text because languages differ fundamentally in:

  • Word boundaries (or lack thereof)
  • Morphological complexity
  • Script direction
  • Name conventions

Effective multilingual PII detection requires:

  1. Language-specific models for complex scripts
  2. Regex patterns for structured data
  3. Proper RTL/BiDi handling
  4. Code-switching support

anonym.legal supports 48 languages through our three-tier approach, achieving consistent accuracy across all.


Ready to protect your data?

Start anonymizing PII with 285+ entity types across 48 languages.