KYC at Scale: The Cost of False Positives

A digital bank processing 5,000 KYC applications a day across 15 EU countries found that its PII detection stage had built up a two-day backlog.

George Curta | March 28, 2026 | 7 min read

Tags: KYC PII automation, fintech compliance, AML data protection, PII false positive cost, digital banking GDPR

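KYC's Competing Rules

Know Your Customer (KYC) rules create a real tension for fintech companies. Regulators want thorough identity verification, which means collecting and validating personal documents. Data protection law pushes the other way: it requires firms to minimize the data they collect.

A bank opening a new account collects many documents: national ID cards, passports, and driver's licenses, plus proof of address and financial paperwork. These files are dense with personal data. GDPR, AML rules, and banking supervisors all demand strict handling.

When that data moves on to fraud systems or analytics, further rules apply. Under GDPR's purpose limitation and data minimization principles, personal data should be masked or de-identified before this secondary use.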

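The Two-Day Backlog Problem

The digital bank processed 5,000 KYC applications a day across 15 EU countries. Its PII scanning stage caused a serious problem: the false positive rate was too high, and the review queue grew until it reached a two-day backlog.

The root cause was clear. The ML-based tool flagged about 8% of non-PII text as personal data. Each file ran to many pages, so the daily volume of false positives exceeded what the team could review in a day.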

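The false positives fell into three groups:

Company names flagged as personal names (the model confused proper nouns)
Reference codes flagged as ID numbers (no checksum validation was used)
Common names such as "Chase" inside bank names flagged as personal-name PII

Each false positive needed human review. With about 8% of non-PII text flagged across 5,000 multi-page files a day, that produced thousands of review tasks daily.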
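The arithmetic behind that workload can be sketched as follows. The article gives the file volume (5,000 a day) and the false positive rate (about 8% of non-PII text), but not how many non-PII text spans a file contains, so `non_pii_spans_per_file` is a hypothetical parameter and the value 10 is purely illustrative.

```python
def daily_false_positives(
    files_per_day: int,
    non_pii_spans_per_file: int,
    false_positive_rate: float,
) -> int:
    """Estimate how many review tasks false positives create per day.

    non_pii_spans_per_file is an assumed figure; the source article
    does not report it.
    """
    return round(files_per_day * non_pii_spans_per_file * false_positive_rate)


# Source figures: 5,000 files a day, about 8% of non-PII text flagged.
# Assumed figure: 10 non-PII candidate spans per multi-page file.
estimate = daily_false_positives(5_000, 10, 0.08)
print(f"Estimated review tasks per day: {estimate}")
```

Under that assumption the estimate lands in the thousands. The point is that the rate applies to text spans rather than to files, so longer files multiply the review load.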

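What the ACL Research Shows

An ACL 2024 study tested multilingual NLP models on PII detection. Its finding: only 5% of multilingual NLP models score above 85% F1 on non-English PII across the 24 EU languages.

The F1 score combines precision and recall as their harmonic mean. Low precision means many false positives; low recall means many missed items. That 95% of models fall short of 85% F1 shows how hard cross-lingual PII scanning is in practice.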
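As a quick illustration of how F1 behaves, here is a minimal sketch using the standard definition (the harmonic mean of precision and recall). The precision and recall values are made-up inputs chosen to show the effect, not figures from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only: a detector that rarely misses PII (high recall)
# but over-flags (mediocre precision) still falls short of the 85% bar.
over_flagging = f1_score(precision=0.75, recall=0.95)
print(f"F1 = {over_flagging:.3f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a tool cannot compensate for heavy over-flagging with high recall alone.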

By contrast, XLM-RoBERTa reaches 91.4% cross-lingual F1 on PII tasks, according to HuggingFace 2024 benchmarking.

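A Hybrid Design for High-Volume KYC

The false positive problem is solvable. Three design choices address it.

Regex with checksum validation: National ID numbers follow fixed rules. The German Steuer-ID, the Dutch BSN, and the Polish PESEL each use checksum arithmetic. A number that fails the checksum is not a valid national ID. Matching on format plus checksum yields very few false positives for these IDs.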
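A minimal sketch of the format-plus-checksum idea for two of the IDs named above, the Polish PESEL and the Dutch BSN. The checksum rules are the publicly documented ones (PESEL weights 1-3-7-9 and the BSN "11-test"); the function names and the `find_national_ids` helper are hypothetical and are not taken from any particular product.

```python
import re

PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(candidate: str) -> bool:
    """PESEL: 11 digits; the last digit is a weighted mod-10 check digit."""
    if not re.fullmatch(r"\d{11}", candidate):
        return False
    digits = [int(ch) for ch in candidate]
    weighted = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    return (10 - weighted % 10) % 10 == digits[10]


def is_valid_bsn(candidate: str) -> bool:
    """BSN: 9 digits passing the 11-test (weights 9..2, last digit times -1)."""
    if not re.fullmatch(r"\d{9}", candidate) or candidate == "000000000":
        return False
    digits = [int(ch) for ch in candidate]
    total = sum(w * d for w, d in zip(range(9, 1, -1), digits)) - digits[8]
    return total % 11 == 0


def find_national_ids(text: str) -> list[tuple[str, str]]:
    """Keep only digit runs that match a format AND pass its checksum."""
    hits = []
    for match in re.finditer(r"\b\d{9,11}\b", text):
        value = match.group()
        if is_valid_pesel(value):
            hits.append(("PESEL", value))
        elif is_valid_bsn(value):
            hits.append(("BSN", value))
    return hits


sample = "PESEL 44051401359, ref 123456789, BSN 111222333"
print(find_national_ids(sample))
```

In this sample the reference code `123456789` matches the nine-digit format but fails the 11-test, so it is dropped instead of landing in the review queue. A random digit run can still pass a checksum by chance, which is why the check reduces false positives rather than eliminating them.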

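Context-aware NLP for names: Personal names in KYC files appear in known places, such as after "Name:" or "Surname:" labels and in set form fields. Requiring a context word before flagging a name cuts false positives.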
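One way to sketch the context requirement: take the name spans proposed by whatever NER model is in use and keep only those that directly follow a known form label. The label list, the 30-character window, and the `keep_name_hits` function are assumptions made for illustration; a real deployment would need labels for each supported language and form layout.

```python
import re

# Assumed English form labels; real KYC forms need per-language lists.
CONTEXT_RE = re.compile(
    r"(?:first name|last name|surname|name)\s*:\s*\Z",
    re.IGNORECASE,
)


def keep_name_hits(
    text: str,
    candidates: list[tuple[int, int]],
    window: int = 30,
) -> list[str]:
    """Filter NER name candidates by the text just before each span.

    candidates holds (start, end) character offsets from any NER model.
    A span is kept only if a context label ends right where it starts.
    """
    kept = []
    for start, end in candidates:
        prefix = text[max(0, start - window):start]
        if CONTEXT_RE.search(prefix):
            kept.append(text[start:end])
    return kept


form = "Name: Anna Kowalska\nBank: Chase Manhattan\n"
spans = [
    (form.index("Anna"), form.index("Anna") + len("Anna Kowalska")),
    (form.index("Chase"), form.index("Chase") + len("Chase")),
]
print(keep_name_hits(form, spans))
```

Here "Chase" is preceded by "Bank:" rather than a name label, so the candidate is discarded even though a general-purpose model might tag it as a person.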

Per-file-type threshold tuning: KYC files differ from support emails or medical notes, and each type has a different PII mix. Setting thresholds per file type lets a team tune detection to its needs.
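Per-file-type tuning can be as simple as a lookup table of confidence cut-offs applied to the detector's scores. The sketch below assumes the detector emits a score between 0 and 1 for each hit; the file-type names, the threshold values, and the `Detection` type are all hypothetical and would need to be calibrated against a labelled set.

```python
from dataclasses import dataclass

# Hypothetical cut-offs: stricter for structured KYC forms, looser for
# free text where missing PII is the bigger risk.
THRESHOLDS = {
    "kyc_form": 0.85,
    "support_email": 0.60,
    "medical_note": 0.50,
}
DEFAULT_THRESHOLD = 0.70


@dataclass(frozen=True)
class Detection:
    entity_type: str
    text: str
    score: float  # detector confidence, assumed to lie in [0, 1]


def filter_by_file_type(
    detections: list[Detection], file_type: str
) -> list[Detection]:
    """Keep detections at or above the threshold for this file type."""
    threshold = THRESHOLDS.get(file_type, DEFAULT_THRESHOLD)
    return [d for d in detections if d.score >= threshold]


hits = [
    Detection("PERSON", "Anna Kowalska", 0.97),
    Detection("PERSON", "Chase", 0.64),
]
print([d.text for d in filter_by_file_type(hits, "kyc_form")])
print([d.text for d in filter_by_file_type(hits, "support_email")])
```

The same low-confidence hit is dropped for a KYC form and kept for a support email, which is the kind of trade-off a single global threshold cannot express.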

A two-day backlog is not an unavoidable cost of PII scanning. It is the cost of using a generic tool in a specific workflow. The fix is configuration, not a bigger team.


