Last updated 2026-02-26


Multilingual NER: English Models Fail on Arabic

English NER models reach 85-92% F1. Arabic and Chinese? Often only 50-70%. Learn about the technical challenges and how to build truly multilingual detection.

February 26, 2026 · 8 min read
Tags: NER, multilingual, Arabic NLP, Chinese NLP, PII detection

Multilingual NER: Challenges in PII Detection

Updated for 2026

The accuracy gap

NER models trained on English reach 85-92% F1 on standard benchmarks. Apply the same models to Arabic or Chinese text and F1 drops to 50-70%.

For PII work, that gap is a real problem. If recall sits at 70%, then 30% of the sensitive data goes undetected.
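
To make the link between a score and missed data concrete, here is a minimal sketch of how precision, recall, and F1 are computed from entity spans. The spans are made-up illustration data, not results from our benchmarks, and the function name is our own.

```python
def precision_recall_f1(gold: set, predicted: set) -> tuple:
    """Score predicted entity spans against gold spans (exact match)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: 10 gold PERSON spans, written as (start, end, label).
gold = {(i * 10, i * 10 + 5, "PERSON") for i in range(10)}
# The detector finds 7 of them and adds 3 spurious spans.
predicted = {span for span in gold if span[0] < 70}
predicted |= {(i * 10 + 6, i * 10 + 9, "PERSON") for i in range(3)}

precision, recall, f1 = precision_recall_f1(gold, predicted)
missed = gold - predicted
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} missed={len(missed)}")
```

F1 blends precision and recall, so the share of missed PII is really a statement about recall. In this toy case both are 0.70, and three of the ten names stay in the text.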

The causes are not bugs. They stem from how writing systems differ.

Four main causes

1. Word boundaries

English separates words with spaces, so tokenization is simple.

Chinese uses no spaces at all.

"张伟住在北京" (Zhang Wei lives in Beijing)
-> Segment first: ["张伟", "住在", "北京"]

A model cannot label what it cannot find. Segmentation has to run before NER.
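
As a rough illustration of what "segment first" means, the sketch below uses greedy forward maximum matching against a tiny hand-written lexicon. Both the lexicon and the function name are assumptions for this example. Production segmenters, including the one in Stanza, are statistical or neural and do not work this way.

```python
def max_match(text: str, lexicon: set) -> list:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(map(len, lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


LEXICON = {"张伟", "住在", "北京"}  # hypothetical mini-lexicon
print(max_match("张伟住在北京", LEXICON))  # ['张伟', '住在', '北京']
```

Once the text is split this way, a tagger can attach a PERSON label to 张伟 and a LOCATION label to 北京 as whole tokens.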

Arabic joins letters within a word, usually leaves short vowels unwritten, and runs right to left.

"محمد يعيش في دبي" (Mohamed lives in Dubai)
-> No short vowels, right to left, joined letters
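
Because short vowels are optional in written Arabic, the same name can appear with or without vowel marks. One common preprocessing step is to strip those marks so both spellings compare equal. The sketch below is a minimal version, assuming the marks of interest are the eight code points U+064B through U+0652. The function name is ours.

```python
import re

# Arabic short-vowel and related marks (harakat), U+064B through U+0652.
HARAKAT = re.compile("[\u064B-\u0652]")


def strip_harakat(text: str) -> str:
    """Remove vowel marks so vocalized and plain spellings match."""
    return HARAKAT.sub("", text)


vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"  # "Muhammad" with vowel marks
plain = strip_harakat(vocalized)
print(len(vocalized), len(plain))  # 8 4
```

The stripped form has four letters, the same as the everyday unvocalized spelling of the name.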

2. Morphology

English verbs change form in only a few ways. Arabic builds words from roots, and a single root yields dozens of words.

k-t-b ("to write")
-> writer (katib), book (kitab), library (maktaba)

NER has to analyze roots and affixes to find names inside derived word forms.
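
The root system can be pictured as slotting three consonants into a vowel template. The toy sketch below does this in simplified Latin transliteration (long vowels are not marked). The pattern notation with digits 1, 2, 3 is our own invention for the example, not a standard format.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill slots 1, 2, 3 of a pattern with the consonants of a root like 'k-t-b'."""
    consonants = root.split("-")
    return "".join(
        consonants[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )


PATTERNS = {"1a2i3": "writer", "1i2a3": "book", "ma12a3a": "library"}
for pattern, gloss in PATTERNS.items():
    print(apply_pattern("k-t-b", pattern), "=", gloss)
```

A morphological analyzer works in the opposite direction: given a surface form, it recovers the root and the pattern, which is what lets NER see through derived forms.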

3. Naming conventions

Western names usually put the given name first and the family name second. Names in Arabic and other right-to-left languages often chain family ties.

Mohamed bin Abdullah ("Mohamed, son of Abdullah")

Chinese names put the family name first. Most names are two or three characters long.

Zhang Wei (张伟): 2 characters
Ouyang Xiu (欧阳修): 3 characters

A model built on Western name patterns misses these structures.
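
To show why a "first name, last name" assumption breaks, here is a small sketch that splits names by their own conventions. The compound-surname set is a three-item sample and the function names are hypothetical. A real system would rely on much larger lists or a trained model.

```python
import re

# Small illustrative sample of two-character Chinese surnames.
COMPOUND_SURNAMES = {"欧阳", "司马", "诸葛"}


def split_chinese_name(name: str) -> tuple:
    """Return (family_name, given_name); the family name comes first."""
    if len(name) > 2 and name[:2] in COMPOUND_SURNAMES:
        return name[:2], name[2:]
    return name[:1], name[1:]


def split_patronymic(name: str) -> list:
    """Split a transliterated Arabic name on 'bin', 'ibn', or 'bint'."""
    return re.split(r"\s+(?:bin|ibn|bint)\s+", name)


print(split_chinese_name("张伟"))                # ('张', '伟')
print(split_chinese_name("欧阳修"))              # ('欧阳', '修')
print(split_patronymic("Mohamed bin Abdullah"))  # ['Mohamed', 'Abdullah']
```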

4. Text direction

Some languages run right to left (RTL). When RTL text contains an English name, the visual order and the logical (stored) order diverge. This is called bidirectional (BiDi) text, and it needs careful handling.
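
A first step in handling BiDi text is knowing where the direction changes. The sketch below groups characters into direction runs using the bidirectional class that Python's `unicodedata` module reports for each character. It is a deliberate simplification and not an implementation of the full Unicode Bidirectional Algorithm: neutral characters such as spaces and digits simply join the run before them.

```python
import unicodedata


def direction_runs(text: str) -> list:
    """Group a logically ordered string into ('RTL' | 'LTR', substring) runs."""
    runs = []
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):       # Hebrew-type and Arabic-type letters
            direction = "RTL"
        elif bidi == "L":             # Latin and other left-to-right letters
            direction = "LTR"
        else:                         # neutral: stay in the current run
            direction = runs[-1][0] if runs else "LTR"
        if runs and runs[-1][0] == direction:
            runs[-1] = (direction, runs[-1][1] + ch)
        else:
            runs.append((direction, ch))
    return runs


mixed = "\u0645\u0639 John"  # Arabic "with" followed by an English name
for direction, chunk in direction_runs(mixed):
    print(direction, repr(chunk))
```

The string is stored in logical order, so character offsets computed on it stay stable no matter how a renderer lays the runs out on screen.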

F1 score by writing system

| Language | Writing system | F1 range | Difficulty |
|----------|----------------|----------|------------|
| English  | Latin          | 85-92%   | Low        |
| German   | Latin          | 82-88%   | Low        |
| French   | Latin          | 80-87%   | Low        |
| Spanish  | Latin          | 81-86%   | Low        |
| Russian  | Cyrillic       | 75-83%   | Medium     |
| Arabic   | Abjad          | 55-75%   | High       |
| Chinese  | Hanzi          | 60-78%   | High       |
| Japanese | Mixed          | 65-80%   | High       |
| Thai     | Thai           | 50-70%   | Very high  |
| Hindi    | Devanagari     | 60-75%   | High       |

Non-Latin scripts and missing word spaces lower scores across the board.

The three-tier solution

We use three tiers to cover 48 languages and their writing systems.

Tier 1: spaCy (25 languages)

For languages with strong, proven models. This tier includes English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, and Greek.

Tier 2: Stanza (complex languages)

Stanford's Stanza handles Arabic, Chinese, Japanese, and Korean. It runs word segmentation and morphological analysis before NER.

Tier 3: XLM-RoBERTa (low-resource languages)

For languages without dedicated models. These include Thai, Vietnamese, Hindi, Bengali, Hebrew, Turkish, and Farsi. This tier also handles mixed-language text without explicit language tags.
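
The tier assignment can be sketched as a simple lookup from language code to backend. The sets below list only the languages named in this article, using ISO 639-1 codes. The function name, the return labels, and the choice to fall back to XLM-RoBERTa for any unlisted code are assumptions made for illustration, and no model is actually loaded.

```python
# Only the languages named in the article; the full tier lists are longer.
TIER_1_SPACY = {"en", "de", "fr", "es", "it", "pt", "nl", "pl", "ru", "el"}
TIER_2_STANZA = {"ar", "zh", "ja", "ko"}


def pick_tier(lang: str) -> str:
    """Map a language code to the name of the backend that would handle it."""
    code = lang.lower()
    if code in TIER_1_SPACY:
        return "spacy"
    if code in TIER_2_STANZA:
        return "stanza"
    # Assumed fallback: everything else goes to the multilingual transformer.
    return "xlm-roberta"


for code in ("en", "ar", "th"):
    print(code, "->", pick_tier(code))
```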

RTL and BiDi

Right-to-left text needs extra steps beyond segmentation.

Our pipeline:

  1. Normalizes the text to logical order.
  2. Runs NER in that order.
  3. Maps entity positions back to visual order.

We also strip attached prefixes before NER and add them back afterwards.

"محمد" (Mohamed): name only
"لمحمد" (for Mohamed): prefix attached

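The prefix step can be sketched as splitting a single-letter proclitic off a token and then shifting the entity span so it covers only the name. This is a toy version: it only strips a prefix when the remainder is in a small set of known names, which is an assumption we make here to avoid cutting letters off ordinary words. Real pipelines use a morphological analyzer for this decision. All names in the code are hypothetical.

```python
# Single-letter Arabic proclitics: wa- (and), li- (for/to), bi- (with/in).
PREFIXES = ("\u0648", "\u0644", "\u0628")


def split_prefix(token: str, known_names: set) -> tuple:
    """Return (prefix, stem); strip only if the stem is a known name."""
    for prefix in PREFIXES:
        stem = token[len(prefix):]
        if token.startswith(prefix) and stem in known_names:
            return prefix, stem
    return "", token


def name_span(token: str, token_start: int, known_names: set) -> tuple:
    """Character span of the stem inside the original text, prefix excluded."""
    prefix, _ = split_prefix(token, known_names)
    return token_start + len(prefix), token_start + len(token)


NAMES = {"\u0645\u062D\u0645\u062F"}        # Mohamed
token = "\u0644\u0645\u062D\u0645\u062F"    # li + Mohamed, "for Mohamed"
print(name_span(token, 10, NAMES))          # (11, 15)
```

With the span shifted by one character, anonymizing the name leaves the preposition in place, so the surrounding sentence stays grammatical.
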
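Code-switching

Real documents often mix languages within a single line.

"Meeting s Johnom je o 3" (Slovak with an English word: "The meeting with John is at 3")
"Dnes som s Johnom išiel nakupovať" (Slovak with an English name: "Today I went shopping with John")

Our pipeline splits the text by language, runs the right model on each part, and then merges the results using position mapping.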

Internal benchmarks

Results from our internal tests on mixed-language data:

| Scenario            | F1  |
|---------------------|-----|
| English only        | 91% |
| German only         | 88% |
| Arabic only         | 79% |
| Chinese only        | 81% |
| English-Arabic mix  | 83% |
| English-Chinese mix | 84% |
| English-German mix  | 89% |

Setup

The desktop app detects the language of each document automatically. For mixed-language files, it processes each segment with the matching model. No manual step is needed.

Set the language in the API when you know it:

{
  "text": "محمد بن عبدالله",
  "language": "ar"
}

Use auto-detection when you don't:

{
  "text": "محمد بن عبدالله",
  "language": "auto"
}
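
A client can build either request body with one small helper. The sketch below only serializes the JSON. The two field names come from the examples above, while the helper name is hypothetical and the endpoint, authentication, and any other fields are deliberately left out because they are not covered here.

```python
import json
from typing import Optional


def build_request(text: str, language: Optional[str] = None) -> str:
    """Serialize a request body; use "auto" when the language is not known."""
    payload = {"text": text, "language": language or "auto"}
    # ensure_ascii=False keeps Arabic or Chinese text readable in the body.
    return json.dumps(payload, ensure_ascii=False)


print(build_request("Mohamed bin Abdullah", "ar"))
print(build_request("Mohamed bin Abdullah"))
```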

Custom patterns should cover locale-specific digits:

# Latin-script employee ID
EMP-[0-9]{6}

# Arabic employee ID (uses Arabic-Indic digits, U+0660 to U+0669)
موظف-[٠-٩]{6}
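
The digit issue is easy to reproduce in Python. In the sketch below, a character class limited to ASCII digits misses an ID written with Arabic-Indic digits, while `\d` in a `str` pattern matches any Unicode decimal digit. The sample ID is invented for the example, and other regex engines may treat `\d` differently.

```python
import re

ASCII_ONLY = re.compile(r"EMP-[0-9]{6}")
ANY_DIGIT = re.compile(r"EMP-\d{6}")  # \d covers all Unicode decimal digits in str patterns

arabic_indic_id = "EMP-\u0661\u0662\u0663\u0664\u0665\u0666"  # digits 1 to 6 in Arabic-Indic form

print(bool(ASCII_ONLY.fullmatch(arabic_indic_id)))   # False
print(bool(ANY_DIGIT.fullmatch(arabic_indic_id)))    # True
print(int("\u0661\u0662\u0663\u0664\u0665\u0666"))   # 123456
```

The last line shows that `int()` already normalizes such digits, which is handy when IDs need to be compared after detection.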

See the full entity list. For API setup, visit the API features page. Our GDPR compliance guide covers how detection gaps affect data protection rights.

anonym.legal uses a three-tier NER stack (spaCy, Stanza, and XLM-RoBERTa) to cover 48 languages with consistent PII detection.

