Custom PII Identifiers: Anonymizing Organization-Specific Secrets
The Problem: Standard PII Detectors Do Not Detect Organizational Secrets
Every organization has its own sensitive identifiers:
- Tech Companies: Employee badges (EMP-12345), project codes (AURORA-47), security clearance (L3)
- Healthcare: Department codes (NEURO-01), surgery IDs (SURG-2024-0156), patient room assignments (312-ICU)
- Finance: Account codes (ACCT-ABC-789), transaction IDs (TXN-2024-56789), portfolio codes (HEDGE-3)
- Government: Case numbers (CASE-20250308-001), agent badge IDs, classified document markings
Presidio's standard recognizers know none of these. Nothing detected means nothing anonymized, and that is a privacy breach.
The Solution: Custom Entity Recognizers per Organization
anonym.legal lets organizations define their own PII types:
Step 1: Configure a Custom Entity
{
  "custom_entities": [
    {
      "type": "EMPLOYEE_ID",
      "patterns": [
        {"regex": "EMP-\\d{5}", "confidence": 0.95},
        {"regex": "E\\d{7}", "confidence": 0.90}
      ],
      "context_words": ["employee", "staff", "personnel"]
    },
    {
      "type": "PROJECT_CODE",
      "patterns": [
        {"regex": "[A-Z]{6}-\\d{2}", "confidence": 0.85},
        {"regex": "PROJ-[A-Z0-9]{4}", "confidence": 0.90}
      ],
      "context_words": ["project", "initiative", "codename"]
    }
  ]
}
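One way such a configuration could be consumed is sketched below using only the Python standard library. The `CONFIG_JSON` string follows the shape of the JSON above; `compile_entities` and `scan` are hypothetical helper names, not part of anonym.legal or Presidio. Note that inside a JSON string a regex backslash has to be doubled (`\\d`), otherwise the parser rejects the escape.

```python
import json
import re

# Hypothetical config in the same shape as the JSON above.
# Inside JSON strings, regex backslashes are doubled ("\\d").
CONFIG_JSON = r"""
{
  "custom_entities": [
    {
      "type": "EMPLOYEE_ID",
      "patterns": [
        {"regex": "EMP-\\d{5}", "confidence": 0.95},
        {"regex": "E\\d{7}", "confidence": 0.90}
      ],
      "context_words": ["employee", "staff", "personnel"]
    }
  ]
}
"""


def compile_entities(config_text):
    """Return a list of (entity_type, compiled_regex, confidence) tuples."""
    config = json.loads(config_text)
    compiled = []
    for entity in config["custom_entities"]:
        for pattern in entity["patterns"]:
            # \b keeps "EMP-12345" from matching inside a longer token.
            regex = re.compile(r"\b(?:" + pattern["regex"] + r")\b")
            compiled.append((entity["type"], regex, pattern["confidence"]))
    return compiled


def scan(text, compiled):
    """Return (entity_type, matched_text, start, end, confidence) per match."""
    hits = []
    for entity_type, regex, confidence in compiled:
        for m in regex.finditer(text):
            hits.append((entity_type, m.group(), m.start(), m.end(), confidence))
    # Sort by start offset so results read in document order.
    return sorted(hits, key=lambda h: h[2])


hits = scan("Employee EMP-12345 reports to E1234567.", compile_entities(CONFIG_JSON))
for hit in hits:
    print(hit)
```

The word boundaries are a design choice of this sketch: without them, a longer token such as `XEMP-123456` would produce a partial match.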
Step 2: Implement the Custom Recognizer
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# AnalyzerEngine() loads a spaCy model by default, so one must be installed.
analyzer = AnalyzerEngine()

# Add the employee ID recognizer. Pattern requires a score;
# the values mirror the confidences in the configuration above.
emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[
        Pattern(name="emp_badge", regex=r"EMP-\d{5}", score=0.95),
        Pattern(name="emp_short", regex=r"E\d{7}", score=0.90),
    ],
    context=["employee", "staff"],
)
analyzer.registry.add_recognizer(emp_recognizer)

# Test: analyze() requires a language argument.
results = analyzer.analyze(
    text="Employee EMP-12345 works on PROJECT-AURORA",
    language="en",
    entities=["EMPLOYEE_ID"],
)
print(results)  # a list with one EMPLOYEE_ID result covering "EMP-12345"
Common Organization-Specific PII Types
| Industry | Identifier | Format | Risk Level |
|---|---|---|---|
| Tech | Employee Badge | EMP-XXXXX | MEDIUM |
| Tech | Project Code | [A-Z]{6}-XX | HIGH |
| Healthcare | Department Code | [A-Z]{5}-XX | HIGH |
| Healthcare | Surgery ID | SURG-YYYY-XXXX | CRITICAL |
| Finance | Account Number | ACCT-ABC-XXX | CRITICAL |
| Finance | Portfolio Code | HEDGE-X | HIGH |
| Government | Case Number | CASE-YYYYMMDD-XXX | HIGH |
| Retail | Store ID | STR-XXX | MEDIUM |
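
The Format column can be read as a regex template. The sketch below assumes that `X` stands for one digit and that `YYYY` and `YYYYMMDD` are date digits (the table does not state this explicitly), and checks a few of the formats against the sample identifiers from the problem statement. `FORMATS` and `classify` are hypothetical names.

```python
import re

# Assumed reading of the table's format notation: X = one digit,
# YYYY = four-digit year, YYYYMMDD = eight-digit date.
FORMATS = {
    "EMPLOYEE_BADGE": r"EMP-\d{5}",
    "PROJECT_CODE": r"[A-Z]{6}-\d{2}",
    "SURGERY_ID": r"SURG-\d{4}-\d{4}",
    "PORTFOLIO_CODE": r"HEDGE-\d",
    "CASE_NUMBER": r"CASE-\d{8}-\d{3}",
}


def classify(identifier):
    """Return the names of all formats that the whole identifier matches."""
    return [name for name, regex in FORMATS.items() if re.fullmatch(regex, identifier)]


SAMPLES = ["EMP-12345", "AURORA-47", "SURG-2024-0156", "HEDGE-3", "CASE-20250308-001"]
for sample in SAMPLES:
    print(sample, "->", classify(sample))
```

A generic format such as six capitals plus two digits also matches unrelated strings like `ABCDEF-12`, which is one reason the configuration pairs each pattern with context words.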
Real-World Case: Tech Company Data Leak
A tech company shared an internal document in an AI chat:
Internal memo:
From: John Smith (EMP-45678)
RE: Project AURORA Status
The AURORA-47 team has achieved 95% accuracy on the neural network model.
Code repository: github.com/company/aurora-47-private
The AI model trained on this document, which leaked:
- ❌ Employee ID (EMP-45678) - could identify employee
- ❌ Project code (AURORA-47) - could identify classified project
- ❌ Private GitHub URL - could expose source code
With custom entity detection:
Anonymized memo:
From: <PERSON> (<EMPLOYEE_ID>)
RE: <PROJECT_CODE> Status
The <PROJECT_CODE> team has achieved 95% accuracy on the neural network model.
Code repository: <URL>
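
The replacement step can be sketched with the standard library alone. This is not how Presidio's anonymizer is implemented; it is a minimal illustration under stated assumptions: the URL regex is a simplification, the phrase `Project AURORA` is caught with a hypothetical deny-list entry because the `[A-Z]{6}-\d{2}` pattern only matches the full `AURORA-47` form, and person names are left out because they need an NER model rather than a regex.

```python
import re

# Illustrative rules only; the URL pattern and the deny list are assumptions.
RULES = [
    ("URL", re.compile(r"\b(?:https?://)?[\w.-]+\.(?:com|org|net|io)/\S*")),
    ("EMPLOYEE_ID", re.compile(r"\bEMP-\d{5}\b")),
    ("PROJECT_CODE", re.compile(r"\b[A-Z]{6}-\d{2}\b")),
    # Deny list: an exact phrase that the regex above cannot catch.
    ("PROJECT_CODE", re.compile(r"\bProject AURORA\b")),
]


def anonymize(text):
    """Replace every detected span with a <TYPE> placeholder."""
    spans = []
    for entity_type, regex in RULES:
        for m in regex.finditer(text):
            spans.append((m.start(), m.end(), entity_type))
    # Longest span first at each start position, then drop overlaps.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], -1
    for start, end, entity_type in spans:
        if start >= last_end:
            kept.append((start, end, entity_type))
            last_end = end
    # Replace right to left so earlier offsets stay valid.
    for start, end, entity_type in reversed(kept):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


memo = (
    "From: John Smith (EMP-45678)\n"
    "RE: Project AURORA Status\n"
    "The AURORA-47 team has achieved 95% accuracy on the neural network model.\n"
    "Code repository: github.com/company/aurora-47-private\n"
)
print(anonymize(memo))
```

In this sketch the name `John Smith` stays in the output. Producing the `<PERSON>` placeholder shown above requires a named-entity model, which Presidio supplies through its built-in NLP-based recognizer.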
Benefits of Custom Recognizers
- ✅ Coverage of organization secrets that default recognizers miss
- ✅ Audit trail - track what was anonymized
- ✅ Context awareness - reduce false positives using the surrounding words
- ✅ Compliance evidence - show that the cataloged secrets are protected
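
Context awareness can be pictured as a score boost applied when a context word appears near a match. The sketch below is a simplified stand-in for that idea, not Presidio's actual enhancer: the base score, the boost of 0.35, the five-token window, and the function name `score_with_context` are all assumptions chosen for illustration.

```python
import re

CONTEXT_WORDS = {"project", "initiative", "codename"}
PROJECT_CODE = re.compile(r"\b[A-Z]{6}-\d{2}\b")


def score_with_context(text, base_score=0.5, boost=0.35, window=5):
    """Score each PROJECT_CODE match, boosting it if a context word is nearby.

    `window` is the number of whitespace-separated tokens inspected on each side.
    """
    results = []
    for m in PROJECT_CODE.finditer(text):
        before = text[:m.start()].lower().split()[-window:]
        after = text[m.end():].lower().split()[:window]
        # Strip common punctuation so "project," still counts as "project".
        nearby = {token.strip(".,:;()") for token in before + after}
        score = base_score
        if nearby & CONTEXT_WORDS:
            score = min(1.0, base_score + boost)
        results.append((m.group(), round(score, 2)))
    return results


print(score_with_context("Status of project AURORA-47 is green."))
print(score_with_context("Order part number ABCDEF-12 today."))
```

With a detection threshold between the two scores, only the match that sits next to "project" would be anonymized, while the part number would be left alone.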
Best Practice Implementation
- Catalog the secrets - list all organization-specific identifiers
- Define patterns - create regex and context rules per identifier
- Test extensively - validate against real organizational data
- Monitor leakage - track how many identifiers are detected and anonymized
- Update quarterly - refresh patterns as the set of secrets grows
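
The testing and monitoring steps can be sketched as a small evaluation over labeled samples. The samples below are invented for illustration; in practice they would come from reviewed organizational data. `evaluate` is a hypothetical helper.

```python
import re

EMPLOYEE_ID = re.compile(r"\bEMP-\d{5}\b")

# (text, identifiers a reviewer marked as secrets) - invented examples.
LABELED = [
    ("Badge EMP-12345 was issued today.", {"EMP-12345"}),
    ("Contact EMP-45678 or EMP-00001.", {"EMP-45678", "EMP-00001"}),
    ("Legacy badge E1234567 is still active.", {"E1234567"}),
    ("No identifiers in this sentence.", set()),
]


def evaluate(regex, labeled):
    """Return (precision, recall) of a regex against labeled samples."""
    true_pos = false_pos = false_neg = 0
    for text, expected in labeled:
        found = set(regex.findall(text))
        true_pos += len(found & expected)
        false_pos += len(found - expected)
        false_neg += len(expected - found)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return precision, recall


precision, recall = evaluate(EMPLOYEE_ID, LABELED)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A recall below 1.0 points at an identifier format that the pattern set does not cover yet, here the legacy `E1234567` form. Tracking these numbers over time is one way to decide when patterns need a refresh.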
Custom PII recognizers are essential for organizations that use proprietary identifiers.