anonym.legal

# Multilingual NER: Why Your English-Trained...

English NER models reach 85-92% accuracy. Arabic and Chinese? Often 50-70%.

February 26, 2026 · 8 min read

NER · multilingual · Arabic NLP · Chinese NLP · PII detection

## The Multilingual NER Challenge

Named entity recognition (NER) models trained on English achieve impressive results: 85-92% F1 scores on standard benchmarks. Apply those same models to Arabic or Chinese, and accuracy often drops to 50-70%.

For PII detection, this gap is critical. A 70% detection rate means 30% of sensitive data goes unprotected.

## Why English Models Fail

### 1. Word Boundaries

Английски: Думите са разделени с интервали.

"John Smith lives in New York"
→ ["John", "Smith", "lives", "in", "New", "York"]

Chinese: No word boundaries at all.

"张伟住在北京"
→ Needs segmentation first: ["张伟", "住在", "北京"]

Arabic: Words connect, and short vowels aren't written.

"محمد يعيش في دبي"
→ Connected script, right-to-left, vowels omitted

English tokenization rules simply don't apply.
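The contrast is easy to demonstrate in plain Python: whitespace tokenization, which works for English, yields a single unusable "token" for Chinese. (A minimal sketch; the sentences are the examples above.)

```python
english = "John Smith lives in New York"
chinese = "张伟住在北京"  # "Zhang Wei lives in Beijing"

# Whitespace splitting works for English...
print(english.split())  # ['John', 'Smith', 'lives', 'in', 'New', 'York']

# ...but Chinese has no spaces, so the whole sentence comes back as one "token"
print(chinese.split())  # ['张伟住在北京']
```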

### 2. Morphological Complexity

English morphology: Relatively simple

run → runs, running, ran

Arabic morphology: Extremely complex (root-pattern system)

كتب (k-t-b, "write" root)
→ كاتب (writer), كتاب (book), مكتبة (library), يكتب (he writes)
### 3. Name Conventions

English names: Typically given name + family name

John Smith, Mary Johnson

Arabic names: Multiple components

محمد بن عبد الله بن عبد المطلب
(Muhammad son-of Abdullah son-of Abdul-Muttalib)

Chinese names: Family name first, often 2-3 characters total

张伟 (Zhang Wei) - 2 characters
欧阳修 (Ouyang Xiu) - 3 characters

### 4. Script Direction

English: Left-to-right (LTR)
Arabic/Hebrew: Right-to-left (RTL)
Mixed text: Bidirectional (BiDi), which is extremely complex

When an English name appears in Arabic text:

التقيت بـ John Smith في المؤتمر
(I met John Smith at the conference)

The rendering order, logical order, and display order all differ.
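Python's standard `unicodedata` module exposes the Unicode bidirectional class of each character, which is exactly what a BiDi-aware pipeline has to reconcile. A minimal sketch using the sentence above:

```python
import unicodedata

text = "التقيت بـ John Smith في المؤتمر"

# Unicode bidirectional classes: 'AL' = Arabic letter (RTL), 'L' = Latin (LTR)
classes = {unicodedata.bidirectional(ch) for ch in text if ch.isalpha()}
print(classes)  # contains both 'AL' and 'L': the string mixes directions
```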

## Accuracy by Language

Real-world NER performance varies dramatically:

| Language | Script | F1-Score Range | Challenge Level |
|----------|--------|----------------|-----------------|
| English | Latin | 85-92% | Low |
| German | Latin | 82-88% | Low |
| French | Latin | 80-87% | Low |
| Spanish | Latin | 81-86% | Low |
| Russian | Cyrillic | 75-83% | Medium |
| Arabic | Arabic | 55-75% | High |
| Chinese | Hanzi | 60-78% | High |
| Japanese | Mixed | 65-80% | High |
| Thai | Thai | 50-70% | Very High |
| Hindi | Devanagari | 60-75% | High |

Languages with complex morphology, non-Latin scripts, or no word boundaries consistently underperform.

## anonym.legal's Three-Tier Approach

We solve multilingual NER through three specialized tiers:

### Tier 1: spaCy (25 languages)

For high-resource languages with good models:

  • English, German, French, Spanish, Italian, Portuguese
  • Dutch, Polish, Russian, Greek
  • And 15 more with reliable accuracy

### Tier 2: Stanza (7 languages)

For languages with complex morphology:

  • Arabic (root-pattern morphology)
  • Chinese (word segmentation required)
  • Japanese (multiple scripts)
  • Korean (agglutinative)
  • And 3 more

### Tier 3: XLM-RoBERTa (16 languages)

For low-resource languages without dedicated models:

  • Thai, Vietnamese, Indonesian
  • Hindi, Bengali, Tamil
  • Hebrew, Turkish, Farsi
  • And more

## How It Works

```
Input text with language detection
        ↓
[Language Router]
        ↓
┌───────┴───────┐
↓               ↓
High-resource   Complex/Low-resource
(spaCy)         (Stanza/XLM-RoBERTa)
↓               ↓
└───────┬───────┘
        ↓
[Regex overlay for structured data]
        ↓
[Confidence merger]
        ↓
Final entities
```
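The language-router step can be sketched as a simple lookup. The tier sets below are abbreviated subsets of the languages listed above, and the function and names are illustrative rather than anonym.legal's actual code:

```python
# Abbreviated tier membership (illustrative subsets of the 48 languages)
SPACY_LANGS = {"en", "de", "fr", "es", "it", "pt", "nl", "pl", "ru", "el"}
STANZA_LANGS = {"ar", "zh", "ja", "ko"}

def route(lang: str) -> str:
    """Pick an NER backend for a detected ISO 639-1 language code."""
    if lang in SPACY_LANGS:
        return "spacy"        # Tier 1: high-resource languages
    if lang in STANZA_LANGS:
        return "stanza"       # Tier 2: complex morphology / segmentation
    return "xlm-roberta"      # Tier 3: low-resource fallback

print(route("de"), route("ar"), route("th"))  # spacy stanza xlm-roberta
```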
## Regex Overlay for Structured Data

Some PII follows the same pattern in every language:

- Emails: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Credit cards: `\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}`
- Phone numbers: various patterns per country

We apply regex first for structured data, regardless of language.
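A minimal sketch of such an overlay using Python's `re` module (the pattern set and entity labels are an illustrative subset, not the production list):

```python
import re

# Language-independent patterns for structured PII (illustrative subset)
PATTERNS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "CREDIT_CARD": re.compile(r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}"),
}

def regex_overlay(text: str):
    """Find structured PII regardless of the text's language."""
    return [(label, m.group()) for label, pat in PATTERNS.items()
            for m in pat.finditer(text)]

# Works identically inside Arabic text:
print(regex_overlay("راسلني على jane@example.com أو 4111 1111 1111 1111"))
# [('EMAIL', 'jane@example.com'), ('CREDIT_CARD', '4111 1111 1111 1111')]
```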

## RTL Script Handling

Right-to-left languages require special handling:

### Bidirectional Text Algorithm

When Arabic contains English:

Visual:  المؤتمر في John Smith بـ التقيت
Logical: التقيت بـ John Smith في المؤتمر

### Arabic Affixes

Attached prepositions and conjunctions change a name's surface form:

"محمد" - just the name
"لمحمد" - "to Muhammad" (attached preposition)
"ومحمد" - "and Muhammad" (attached conjunction)


We strip affixes before NER and reattach afterward.
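A deliberately naive sketch of proclitic stripping (the proclitic list is abbreviated; a real system also consults a lexicon so that names which genuinely start with these letters, e.g. وليد "Walid", are not mangled):

```python
# Common single-letter Arabic proclitics (abbreviated list):
# wa- "and", fa- "so", bi- "with/in", li- "to/for", ka- "as/like"
PROCLITICS = ("و", "ف", "ب", "ل", "ك")

def strip_proclitic(token: str):
    """Split an attached proclitic off the front of a token, if plausible."""
    for p in PROCLITICS:
        # Require a reasonably long remainder so short names aren't gutted;
        # a lexicon check would still be needed to avoid false strips.
        if token.startswith(p) and len(token) - len(p) >= 3:
            return p, token[len(p):]
    return "", token

print(strip_proclitic("لمحمد"))  # ('ل', 'محمد') - "to Muhammad"
print(strip_proclitic("محمد"))   # ('', 'محمد') - nothing to strip
```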

## Code-Switching

Real text often mixes languages:

"El meeting con John es at 3pm" (Spanish-English mixing)

"我今天跟John去shopping" (Chinese-English mixing)


Our approach:
1. Segment text by language
2. Process each segment with appropriate model
3. Merge results with position mapping
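Step 1 can be approximated by splitting on script boundaries. A crude sketch that only separates CJK runs from everything else (real language identification is considerably more involved):

```python
import unicodedata

def segment_by_script(text: str):
    """Split text into alternating runs of CJK and non-CJK characters."""
    def is_cjk(ch: str) -> bool:
        # Unicode character names for Han ideographs contain "CJK"
        return "CJK" in unicodedata.name(ch, "")

    segments, current, current_kind = [], "", None
    for ch in text:
        kind = is_cjk(ch)
        if current and kind != current_kind:
            segments.append(current)
            current = ""
        current += ch
        current_kind = kind
    if current:
        segments.append(current)
    return segments

print(segment_by_script("我今天跟John去shopping"))
# ['我今天跟', 'John', '去', 'shopping']
```

Each segment can then be sent to the model for its language, and entity offsets mapped back into the original string.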

## Performance Benchmarks

Internal testing on mixed-language datasets:

| Scenario | F1-Score |
|----------|----------|
| English only | 91% |
| German only | 88% |
| Arabic only | 79% |
| Chinese only | 81% |
| English-Arabic mix | 83% |
| English-Chinese mix | 84% |
| English-German mix | 89% |

Our hybrid approach maintains high accuracy even on challenging languages.

## Implementation Tips

### For API Users

Specify language when known:
```json
{
  "text": "محمد بن عبد الله",
  "language": "ar"
}
```

Let us auto-detect when unknown:

```json
{
  "text": "محمد بن عبد الله",
  "language": "auto"
}
```

### For Desktop App Users

The app auto-detects language per-document. For mixed-language files, it processes each segment appropriately.

### For Custom Entity Types

Custom patterns should account for scripts:

```
# English employee ID
EMP-[0-9]{6}

# Arabic employee ID (accepts Arabic-Indic digits ٠-٩ as well as 0-9)
موظف-[٠-٩0-9]{6}
```
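Such patterns can be checked with Python's `re` module, whose character classes support ranges over Arabic-Indic digits. The sample IDs below are invented:

```python
import re

# Custom Arabic employee-ID pattern: matches either digit system
ARABIC_EMP_ID = re.compile(r"موظف-[٠-٩0-9]{6}")

print(bool(ARABIC_EMP_ID.search("رقم الموظف: موظف-٠١٢٣٤٥")))  # True (Arabic-Indic digits)
print(bool(ARABIC_EMP_ID.search("ID: موظف-012345")))            # True (ASCII digits)
```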

## Conclusion

English-trained NER models fail on non-English text because languages differ fundamentally in:

  • Word boundaries (or the lack thereof)
  • Morphological complexity
  • Script direction
  • Name conventions

Effective multilingual PII detection requires:

  1. Language-specific models for complex scripts
  2. Regex patterns for structured data
  3. Proper RTL/BiDi handling
  4. Code-switching support

anonym.legal supports 48 languages through our three-tier approach, achieving consistent accuracy across all of them.


Ready to protect your data?

Start anonymizing PII with 285+ entity types across 48 languages.