
Multi-Language NER: Your English-Trained Model, Arabic...

English NER models achieve 85-92% accuracy. Arabic and Chinese? Often 50-70%.

February 26, 2026 · 8 min read
NER, multilingual, Arabic NLP, Chinese NLP, PII detection

The Multilingual NER Challenge

Named Entity Recognition (NER) models trained on English achieve impressive results: 85-92% F1 on standard benchmarks. Apply those same models to Arabic or Chinese, and scores often drop to 50-70%.

For PII detection, this gap is critical. A 70% detection rate means 30% of sensitive data goes unprotected.

Why English Models Fail

1. Word Boundaries

English: Words are separated by spaces.

"John Smith lives in New York"
→ ["John", "Smith", "lives", "in", "New", "York"]

Chinese: Words are not separated by spaces.

"张伟住在北京"
→ Needs segmentation first: ["张伟", "住在", "北京"]
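
The segmentation step can be illustrated with a minimal sketch of greedy forward maximum matching, one of the simplest dictionary-based approaches. The `max_match` function and the three-word `VOCAB` set are hypothetical names introduced here for illustration only. Production segmenters rely on much larger dictionaries and statistical or neural models.

```python
def max_match(text: str, vocab: set[str]) -> list[str]:
    """Greedy forward maximum matching: at each position, take the
    longest vocabulary word, falling back to a single character."""
    longest = max(map(len, vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, just large enough for the example sentence.
VOCAB = {"张伟", "住在", "北京"}

# Whitespace splitting works for English...
print("John Smith lives in New York".split())
# ...but returns the whole Chinese sentence as a single token.
print("张伟住在北京".split())
# Dictionary-based segmentation recovers the three words.
print(max_match("张伟住在北京", VOCAB))
```

Greedy matching breaks down on ambiguous character sequences and on names missing from the dictionary, which is one reason person names are hard to detect in Chinese text.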

Arabic: Letters join within words, clitics (conjunctions, prepositions, pronouns) attach to them, and short vowels are usually not written.

"محمد يعيش في دبي"
→ Connected script, right-to-left, vowels omitted

English tokenization rules simply don't apply.
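
One practical consequence of the omitted vowels is that the same Arabic name can appear both with and without diacritics, depending on the source. As a minimal sketch of one common normalization step (not necessarily what any particular NER system does), combining marks can be stripped using Python's standard `unicodedata` module. The function name `strip_diacritics` is an illustrative choice.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks such as Arabic short-vowel signs,
    shadda, and sukun, leaving the base letters untouched."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


vocalized = "مُحَمَّد"   # "Muhammad" written with short-vowel marks
plain = "محمد"          # the same name as it usually appears in print

# Both spellings map to the same string after normalization.
print(strip_diacritics(vocalized) == strip_diacritics(plain))
```

Stripping marks makes matching easier but also discards information that could disambiguate words, so whether to apply it is a design decision rather than a given.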

2. Morphological Complexity

English morphology: Relatively simple

run → runs, running, ran

Arabic morphology: Extremely complex (root-pattern system)

كتب (k-t-b, "write" root)
→ كاتب (writer), كتاب (book), مكتبة (library), يكتب (he writes)

A single Arabic root generates dozens of related words, so an NER model must recognize that very different surface forms share one root.
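
To make the root-pattern idea concrete, here is a toy sketch that checks whether the root consonants appear, in order, inside a word. The helper `contains_root` is a hypothetical name, and the heuristic is deliberately naive: it over-matches unrelated words and ignores weak roots and assimilation, so it is no substitute for a real morphological analyzer.

```python
def contains_root(word: str, root: str) -> bool:
    """Return True if the root consonants occur in `word` in order,
    not necessarily adjacent (a simple subsequence test)."""
    letters = iter(word)
    # `in` advances the iterator, so each consonant must be found
    # after the position where the previous one was matched.
    return all(consonant in letters for consonant in root)


ROOT = "كتب"  # k-t-b, the "write" root from the example above

for word in ("كاتب", "كتاب", "مكتبة", "يكتب", "محمد"):
    print(word, contains_root(word, ROOT))
```

The four derived forms from the example all contain the root letters in order, while the name محمد does not.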

3. Name Conventions

English names: First Last

John Smith, Mary Johnson

Arabic names: Multiple components

محمد بن عبد الله بن عبد المطلب
(Muhammad son-of Abdullah son-of Abdul-Muttalib)
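
The multi-component structure can be sketched as a split on the patronymic connector بن ("son of"). The function `split_nasab` is a hypothetical helper for illustration. It assumes space-separated tokens and a single connector spelling, whereas real names also use variants such as ابن or بنت and often omit the connector altogether.

```python
def split_nasab(full_name: str, connector: str = "بن") -> list[str]:
    """Split a patronymic name chain into its individual names,
    keeping multi-word names such as 'عبد الله' together."""
    parts: list[str] = []
    current: list[str] = []
    for token in full_name.split():
        if token == connector:
            # A connector closes the name collected so far.
            if current:
                parts.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        parts.append(" ".join(current))
    return parts


chain = split_nasab("محمد بن عبد الله بن عبد المطلب")
print(len(chain))  # three names: the person, the father, the grandfather
for name in chain:
    print(name)
```

A model that expects a "First Last" pattern would tag at most part of such a chain, which matters for PII detection because every component can identify a person.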

Chinese names: Family name first, often 2-3 characters total

张伟 (Zh...
