Last updated 2026-06-05


Presidio: A 3-Week Setup vs. Managed PII

Microsoft Presidio has thousands of GitHub stars and hundreds of open issues. This post covers its setup complexity, PySpark integration overhead, and Python dependencies.

June 5, 2026 · 6 min read
Tags: Presidio setup, PySpark integration, managed Presidio, Python dependencies, PII setup complexity

Presidio: A Powerful Tool, a Long Setup

Updated for 2026.

Microsoft Presidio is a robust tool for detecting and de-identifying PII. But it is also a sizeable engineering project. Running it in production takes real effort, and the community agrees.

GitHub issue #237 is a good example. Even experienced developers run into environment conflicts, model loading failures, and API errors. Days of troubleshooting can pass before the first working run.

What the Community Data Shows

Presidio's GitHub repository has thousands of stars. That shows strong interest. But the list of open issues tells a different story.

Environment problems: Python version conflicts are common. So are spaCy model mismatches and ONNX runtime errors. These issues hit developers who follow the documentation exactly.
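Version conflicts are easier to diagnose when the environment is written down before debugging starts. The following is a minimal sketch that uses only the Python standard library. The package list is an assumption about a typical Presidio install, and the helper names are made up for illustration.

```python
import sys
from importlib import metadata

# Distribution names as published on PyPI (assumed to be the relevant ones).
PACKAGES = ("presidio-analyzer", "presidio-anonymizer", "spacy", "onnxruntime")


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=PACKAGES):
    """Collect the interpreter version and the version of each package."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name:22} {version or 'NOT INSTALLED'}")
```

Running the same script in development, in the container image, and in production makes version drift between the three visible at a glance.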

Model loading failures: spaCy models download fine but then fail to load in some setups. Containers and low-memory configurations are common trouble spots. Fixing them requires deep knowledge of spaCy internals.
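A small smoke test can separate "the model is not installed" from "the model is installed but will not load". The sketch below assumes the model was installed as a regular Python package, which is how pip-installed spaCy models are shipped. The function name is hypothetical and en_core_web_lg is only an example model name.

```python
import importlib.util


def check_spacy_model(model_name):
    """Return a short status string for one spaCy model package."""
    # A pip-installed spaCy model is an ordinary Python package, so a
    # missing model shows up as a module that cannot be found.
    if importlib.util.find_spec(model_name) is None:
        return "not installed"
    try:
        import spacy  # imported lazily so the check above works without spaCy

        nlp = spacy.load(model_name)
        nlp("Smoke test sentence.")  # force the pipeline to actually run
    except Exception as exc:  # includes MemoryError on small containers
        return f"load failed: {type(exc).__name__}: {exc}"
    return "ok"


if __name__ == "__main__":
    print(check_spacy_model("en_core_web_lg"))
```

Running this as a container start-up step surfaces a load failure before the first request arrives, rather than in the middle of one.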

Production API failures: The analyzer works well in development, then breaks under production load. Concurrency issues and memory pressure from the NLP models are the main causes.
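One common way to limit memory pressure is to load the analyzer once and cap how many requests use it at the same time. The sketch below shows that idea with a semaphore. It is an illustration, not Presidio's own mechanism: BoundedAnalyzer and fake_analyze are hypothetical names, and fake_analyze stands in for a real analyzer call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedAnalyzer:
    """Wrap one shared analyze callable and cap concurrent calls to it."""

    def __init__(self, analyze, max_concurrent=2):
        self._analyze = analyze
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def __call__(self, text):
        # Callers beyond the cap wait here instead of piling onto the model.
        with self._slots:
            return self._analyze(text)


def fake_analyze(text):
    """Stand-in for a real analyzer call; returns the text length."""
    return len(text)


if __name__ == "__main__":
    bounded = BoundedAnalyzer(fake_analyze, max_concurrent=2)
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(bounded, ["a", "bb", "ccc"])))
```

The right cap depends on model size and available memory, so it is best found by load testing rather than guessed.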

Integration overhead: The Ploomber blog post on this architecture covers the whole picture. It uses multiple services: the analyzer, the anonymizer, and an optional image redactor. Wiring them together adds work. Moving data between the services adds more.

The Microsoft Fabric Case

Microsoft Fabric's own documentation shows the gap between "available" and "working."

The Fabric blog post on PySpark says this directly: the setup requires managing external dependencies and custom logic. Fabric users chose a managed cloud platform to avoid that kind of work. But adding external tools brings the complexity back.

The PySpark setup steps are:

  1. Install presidio-analyzer and presidio-anonymizer in Fabric notebooks.
  2. Download spaCy models in the Fabric environment.
  3. Write PySpark UDF wrappers for the analyzer and the anonymizer.
  4. Handle spaCy model packaging so the models are available across Spark workers.
  5. Configure language detection for multilingual datasets.

Each step has known failure modes. Teams on this path often spend one to two weeks before processing their first document.
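Step 3 in the list above could look roughly like the following. This is a minimal sketch, not the code from the Fabric blog post. The helper names get_engines, redact_text, and build_redact_udf are made up for illustration, and it assumes presidio-analyzer, presidio-anonymizer, a spaCy model, and pyspark are already installed on every worker. The engines are created lazily so that the model load is meant to happen once per Python worker process instead of once per row.

```python
# Per-process cache; filled on first use inside each Python worker.
_ENGINES = {}


def get_engines():
    """Create the Presidio engines on first call and reuse them afterwards."""
    if not _ENGINES:
        # Imported lazily so this module can be defined without Presidio.
        from presidio_analyzer import AnalyzerEngine
        from presidio_anonymizer import AnonymizerEngine

        _ENGINES["analyzer"] = AnalyzerEngine()
        _ENGINES["anonymizer"] = AnonymizerEngine()
    return _ENGINES["analyzer"], _ENGINES["anonymizer"]


def redact_text(text, language="en"):
    """Return text with detected PII replaced; pass through empty input."""
    if text is None or not text.strip():
        return text
    analyzer, anonymizer = get_engines()
    results = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=results).text


def build_redact_udf():
    """Wrap redact_text as a Spark UDF that returns a string column."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(redact_text, StringType())
```

Usage would then be along the lines of df.withColumn("clean_text", build_redact_udf()(df["text"])), where df and its text column are placeholders. Steps 4 and 5 are not covered here: the spaCy model still has to be present on each worker, and the language is fixed to English.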

Two Paths: Self-Hosted vs. Managed

The managed approach changes the setup challenge.

The self-hosted path:

  1. Install Docker.
  2. Configure docker-compose.yml.
  3. Download spaCy models.
  4. Troubleshoot container networking.
  5. Configure API endpoints.
  6. Test entity detection.
  7. Tune false positives and false negatives.
  8. Build custom recognizers for non-standard entity types.
  9. Add audit logging.
  10. Tune for production load.

Time to first anonymized document: three to twenty-one days.
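Step 7 in the list above needs a way to measure false positives and false negatives before tuning anything. A minimal sketch, assuming detections are represented as (entity_type, start, end) tuples and that only exact matches count; the sample spans are invented for illustration.

```python
def precision_recall(expected, detected):
    """Compare two sets of (entity_type, start, end) spans by exact match."""
    expected, detected = set(expected), set(detected)
    true_pos = len(expected & detected)
    # By convention here, an empty side counts as a perfect score.
    precision = true_pos / len(detected) if detected else 1.0
    recall = true_pos / len(expected) if expected else 1.0
    return precision, recall


# Hand-labelled spans for one test document (illustrative values).
labelled = {("PERSON", 0, 8), ("EMAIL_ADDRESS", 20, 36)}
# What the analyzer returned for the same document (illustrative values).
found = {("PERSON", 0, 8), ("PHONE_NUMBER", 40, 52)}

print(precision_recall(labelled, found))  # (0.5, 0.5)
```

Low precision points to false positives and low recall points to false negatives, which tells you whether to raise score thresholds or add recognizers.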

The managed service path:

  1. Create an account.
  2. Upload a document or make an API call.

Time to first anonymized document: twelve minutes.

Both paths use the same detection approach. The managed path runs on hardware that someone else maintains.

When Self-Hosting Fits Better

A managed service does not fit every case.

Custom model training: Some situations require new NER models. Proprietary drug names and internal product codes are examples. Self-hosting gives you the training tools.

Spark-native processing: Some pipelines need PII detection inside the Spark executor. An external API call adds latency that breaks that design. Self-hosting is the only option that fits here.
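A rough calculation shows why per-row API calls can break a Spark job. The numbers below are assumptions chosen for illustration, not measurements: ten million rows, 50 ms of network round trip per call, and 100 calls in flight at once.

```python
def added_hours(rows, latency_ms, parallel_calls):
    """Wall-clock hours spent waiting on network round trips alone."""
    total_ms = rows * latency_ms / parallel_calls
    return total_ms / 1000 / 3600


# Assumed workload: 10 million rows, 50 ms per call, 100 calls in flight.
hours = added_hours(rows=10_000_000, latency_ms=50, parallel_calls=100)
print(f"{hours:.1f} hours of added wait")  # about 1.4 hours
```

Under these assumptions the job waits well over an hour on the network before any detection time is counted, while an in-executor UDF pays no round trip at all.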

Full control: Some security policies forbid all outbound API calls from data pipelines. The anonym.legal Desktop app works entirely offline. Self-hosting is the fully isolated option.

For most cases, such as document processing, API pipelines, and compliance tooling, a managed service removes the infrastructure project entirely.

Running Both Paths in Parallel

The free tier gives you 200 credits per month. That is enough to test real documents. No credit card. No commitment.

Here is a simple way to compare.

Week 1: Set up the self-hosted analyzer in development. Gauge how hard a production setup would be.

Day 1, in parallel: Create a managed service account. Run the same test documents through the managed API. Compare the results.
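The comparison itself can be as simple as a set difference per test document. A minimal sketch, assuming both outputs have been normalized to (entity_type, start, end) tuples; the function name and the sample spans are hypothetical, and real outputs may need entity-type names mapped to a common vocabulary first.

```python
def compare_detections(self_hosted, managed):
    """Split two sets of (entity_type, start, end) spans into three groups."""
    self_hosted, managed = set(self_hosted), set(managed)
    return {
        "both": sorted(self_hosted & managed),
        "self_hosted_only": sorted(self_hosted - managed),
        "managed_only": sorted(managed - self_hosted),
    }


# Illustrative spans for one test document.
local = {("PERSON", 0, 8), ("IBAN_CODE", 10, 32)}
remote = {("PERSON", 0, 8), ("EMAIL_ADDRESS", 40, 60)}

for group, spans in compare_detections(local, remote).items():
    print(group, spans)
```

Spans that only one side finds are the ones worth reviewing by hand, since they are either misses on the other side or false positives.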

Key questions:

  • Does the managed service detect the types you need? It covers 285+ entity types. The standalone open-source build covers about 40 by default.
  • Is the accuracy good enough?
  • Does the API fit your architecture?
  • Do the plans match your volume and budget?

Yes to all: the managed service removes the infrastructure project. No to any: the gaps you find are concrete reasons to stay self-hosted.

See how other teams made this decision in our case studies. Security and data protection details are on our security and compliance page. Answers to common questions are in the FAQ.

In Summary

A three-week setup is not a failure of the documentation or the framework. It reflects what production NLP infrastructure requires. The challenges are real, and they take time and expertise to solve.

For most teams, PII anonymization is a compliance requirement, not core engineering work. A managed service provides the same detection without the infrastructure project. Twelve minutes from sign-up to the first anonymized document keeps the cost of evaluation very low.


