Last updated 2026-03-26


PII in Multilingual Documents: Why Single-Language Tools Fail

72% of enterprises in the EU process documents in three or more languages at once. In single-language NER tools, multilingual documents lead to a 45% higher number of missed PII items.

March 26, 2026 · 7 min read
mixed-language PII detection · Swiss GDPR compliance · multilingual document processing · XLM-RoBERTa · DACH data protection

Updated for 2026.

Documents cross language boundaries

An employment contract at a Swiss pharmaceutical company is not written in a single language. Switzerland has four official languages. Swiss firms combine German in the main body, French in the legal clauses, and English in the global sections. All of this can happen within a single paragraph.

The minutes of a Belgian board meeting contain Dutch text, French formal sections, and English summaries. A global data agreement may have English technical specifications and German clauses on rights.

This is not an exception. It is the norm for firms in the DACH region and the EU. Single-language PII tools fail on these files.

The 45% miss-rate gap

Single-language NER tools miss 45% more PII in multilingual files than in purely single-language files.

The cause is design. A model trained on German text knows the local forms of names and the local address conventions. When it reaches a French section, it is outside the range of its training. Names and IDs in that section are detected poorly. The model is not weak; it was built for a different language.

EDPB 2024 found that 72% of firms in the EU process files in three or more languages at once. Gartner 2024 found that multilingual HR files contain 67% more PII per page than single-language ones. More PII combined with more misses compounds the gap.
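To see how the two figures compound, here is a small worked calculation. It is only a sketch: it assumes the 45% figure can be read as a relative increase in the miss rate, that the 67% figure is a relative increase in PII per page, and that the two effects are independent. The baseline values (20 PII items per page, a 10% miss rate) are purely illustrative and do not come from either report.

```python
def missed_per_page(pii_per_page: float, miss_rate: float) -> float:
    """Expected number of PII items a tool fails to flag on one page."""
    return pii_per_page * miss_rate


# Hypothetical baseline for a single-language file (illustrative only).
BASE_PII_PER_PAGE = 20.0
BASE_MISS_RATE = 0.10

# Figures quoted in the text, read as relative increases (an assumption).
DENSITY_FACTOR = 1.67  # 67% more PII per page
MISS_FACTOR = 1.45     # 45% higher miss rate

mono = missed_per_page(BASE_PII_PER_PAGE, BASE_MISS_RATE)
multi = missed_per_page(BASE_PII_PER_PAGE * DENSITY_FACTOR,
                        BASE_MISS_RATE * MISS_FACTOR)

print(f"single-language: {mono:.2f} missed items per page")
print(f"multilingual:    {multi:.2f} missed items per page")
print(f"ratio:           {multi / mono:.2f}x")
```

Under these assumptions the ratio is 1.67 × 1.45 ≈ 2.42, whatever baseline is chosen.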

See our GDPR guide for a list of the applicable rules.

Where the errors cluster

Failures are not spread evenly across a file. PII at the transitions between sections is most at risk.

Consider this clause: German sentence structure, a French employee name, and a French date of birth, all on one line. A German-trained NER model sees a French name where it expects a local one, and may not flag it. A model trained on French sees German context words and cannot parse the structure.
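One rough way to locate such risky spans is to look for points where the language switches inside a line. The sketch below does this with tiny hand-picked stopword lists; the lists, the sample clause, and the function name `find_switch_points` are all assumptions made for illustration, and a real pipeline would use a proper language-identification model instead.

```python
import re

# Tiny illustrative stopword lists (an assumption, kept disjoint on purpose).
STOPWORDS = {
    "de": {"der", "das", "und", "ist", "wird", "als", "mit", "bei"},
    "fr": {"le", "la", "les", "et", "est", "du", "dans", "pour"},
    "en": {"the", "and", "is", "of", "was", "with", "born", "by"},
}

# Reverse lookup: token -> language code.
TOKEN_LANG = {word: lang for lang, words in STOPWORDS.items() for word in words}


def find_switch_points(text: str) -> list[tuple[int, str, str]]:
    """Return (char_offset, from_lang, to_lang) wherever the language of
    consecutive recognised stopwords changes."""
    switches = []
    prev_lang = None
    for match in re.finditer(r"\w+", text):
        lang = TOKEN_LANG.get(match.group().lower())
        if lang is None:
            continue  # names, numbers and unknown words carry no signal here
        if prev_lang is not None and lang != prev_lang:
            switches.append((match.start(), prev_lang, lang))
        prev_lang = lang
    return switches


# Hypothetical clause: German frame, French name and date of birth.
SAMPLE = ("Der Arbeitnehmer Jean-Baptiste Lefèvre, né le 3 mai 1985 à Lyon, "
          "wird als Chemiker eingestellt.")

for offset, src, dst in find_switch_points(SAMPLE):
    print(f"offset {offset}: {src} -> {dst}")
```

Entities that fall near a reported switch point are the ones worth checking by hand first, since that is where a single-language model is most likely to lose its context.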

HR files make this costly. Gartner found 67% more PII per page in multilingual HR files. Errors at section transitions are therefore most expensive in the file type that holds the most personal data.

Multilingual models solve this

XLM-RoBERTa is trained on text from 100 languages at once. It does not use a separate model for each language. It learns that name detection works the same way across language contexts: a name and its context share the same structure in German, French, and English.

For multilingual files, the model does not switch at section transitions. It reads the whole text as one block and applies the same entity rules everywhere.
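As a rough sketch of that single-pass idea, the snippet below runs one multilingual tagger over the whole text instead of routing sections to per-language models. It uses the Hugging Face `transformers` token-classification pipeline; the checkpoint is left as a parameter because the choice of a fine-tuned XLM-RoBERTa NER model is up to the reader, and the helper names `load_xlmr_tagger` and `find_pii` are hypothetical. This is not a description of how anonym.legal implements it.

```python
from typing import Callable

# A tagger is any callable mapping text to a list of dicts that carry
# "entity_group", "start", "end" and "score" keys, which is the shape the
# transformers token-classification pipeline returns with aggregation on.
Tagger = Callable[[str], list[dict]]


def load_xlmr_tagger(checkpoint: str) -> Tagger:
    """Build a tagger from an XLM-RoBERTa NER checkpoint chosen by the reader."""
    # Imported lazily so the rest of the sketch runs without the library.
    from transformers import pipeline
    return pipeline("token-classification", model=checkpoint,
                    aggregation_strategy="simple")


def find_pii(text: str, tagger: Tagger, min_score: float = 0.5) -> list[dict]:
    """Tag the whole text in one pass, with no per-language routing."""
    hits = []
    for ent in tagger(text):
        if ent["score"] >= min_score:
            hits.append({
                "label": ent["entity_group"],
                "start": ent["start"],
                "end": ent["end"],
                "text": text[ent["start"]:ent["end"]],
            })
    return sorted(hits, key=lambda hit: hit["start"])
```

Keeping the tagger as a plain callable makes it easy to swap checkpoints or to plug in a stub for testing. Long documents would still have to be split into overlapping windows, since the model has a fixed maximum input length.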

Fine-tuning on German and French adds precision for each language individually. But it is the multilingual base that catches PII at the transitions, where single-language models fail.

For DACH firms whose files cross language sections, this is a real gain. Entities that single-language tools miss at transitions are found by multilingual models.

See our security guarantees page for how anonym.legal handles this.

Steps to take now

Check the scope of your tool. Ask your vendor for recall scores broken down by locale. "Supports multiple languages" may mean that the text is machine-translated first. That is not native scanning.

Map your files by locale. A DACH firm with 60% German, 30% French, and 10% English files has different gaps than a firm with another mix.

Test with section-transition samples. Build a test set of ten multilingual clause examples. Check recall across the whole file, not only in the parts written in the main language.
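The transition test described above could be scored along these lines. This is a minimal sketch that assumes each labelled entity in the test set is annotated with a zone ("main", "secondary", or "transition"); the zone labels, the `GoldEntity` type, and the overlap-based matching rule are assumptions for illustration, not a standard evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldEntity:
    start: int  # character offset where the labelled entity begins
    end: int    # character offset just past its last character
    zone: str   # hypothetical labels: "main", "secondary", "transition"


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the two half-open spans share at least one character."""
    return a_start < b_end and b_start < a_end


def recall_by_zone(gold: list[GoldEntity],
                   predicted: list[tuple[int, int]]) -> dict[str, float]:
    """Share of labelled entities in each zone hit by any predicted span."""
    found = defaultdict(int)
    total = defaultdict(int)
    for entity in gold:
        total[entity.zone] += 1
        if any(overlaps(entity.start, entity.end, s, e) for s, e in predicted):
            found[entity.zone] += 1
    return {zone: found[zone] / total[zone] for zone in total}


# Made-up example: two entities in the main language, two at a transition.
gold = [
    GoldEntity(0, 10, "main"),
    GoldEntity(20, 30, "main"),
    GoldEntity(40, 52, "transition"),
    GoldEntity(60, 70, "transition"),
]
predicted = [(0, 10), (21, 29), (61, 66)]

print(recall_by_zone(gold, predicted))
```

Reporting recall per zone, rather than a single figure for the whole file, keeps a strong score in the main language from hiding misses at the transitions.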

Review your DPIA. A DPIA built on single-language records may be incomplete. Fix it before an audit.

Details on the API and entity coverage are on the pricing page.

anonym.legal uses XLM-RoBERTa together with native spaCy and Stanza models. It finds PII across section transitions in German, French, English, and 45 other locales.
