April 24, 2026 · 7 min read
Tags: custom PII detection, organizational identifiers, re-identification risk, GDPR pseudonymization, custom entity

Custom PII Identifiers: Anonymizing Organization-Specific Secrets

The Problem: Standard PII Detectors Miss Organizational Secrets

Every organization has its own classified identifiers:

  • Tech Companies: Employee badges (EMP-12345), project codes (AURORA-47), security clearance (L3)
  • Healthcare: Department codes (NEURO-01), surgery IDs (SURG-2024-0156), patient room assignments (312-ICU)
  • Finance: Account codes (ACCT-ABC-789), transaction IDs (TXN-2024-56789), portfolio codes (HEDGE-3)
  • Government: Case numbers (CASE-20250308-001), agent badge IDs, classified document markings

The stock Presidio detector knows none of these formats. Nothing detected means nothing anonymized, and nothing anonymized means a privacy breach.

The Solution: Custom Entity Recognizers per Organization

anonym.legal lets organizations define their own PII types:

Step 1: Configure the Custom Entity

{
  "custom_entities": [
    {
      "type": "EMPLOYEE_ID",
      "patterns": [
        {"regex": "EMP-\\d{5}", "confidence": 0.95},
        {"regex": "E\\d{7}", "confidence": 0.90}
      ],
      "context_words": ["employee", "staff", "personnel"]
    },
    {
      "type": "PROJECT_CODE",
      "patterns": [
        {"regex": "[A-Z]{6}-\\d{2}", "confidence": 0.85},
        {"regex": "PROJ-[A-Z0-9]{4}", "confidence": 0.90}
      ],
      "context_words": ["project", "initiative", "codename"]
    }
  ]
}
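
A config like this can be sanity-checked with the standard library before it is wired into a detector. The sketch below is an assumption about how such a config might be consumed, not anonym.legal's implementation: `CONFIG_JSON`, `compile_entities`, and `find_entities` are hypothetical names, the context words are left out for brevity, and word boundaries (`\b`) are added around each regex so it does not match inside a longer token.

```python
import json
import re

# Trimmed copy of the Step 1 config. Backslashes are doubled because a JSON
# string needs "\\d" to represent the regex token \d.
CONFIG_JSON = r"""
{
  "custom_entities": [
    {
      "type": "EMPLOYEE_ID",
      "patterns": [
        {"regex": "EMP-\\d{5}", "confidence": 0.95},
        {"regex": "E\\d{7}", "confidence": 0.90}
      ]
    },
    {
      "type": "PROJECT_CODE",
      "patterns": [
        {"regex": "[A-Z]{6}-\\d{2}", "confidence": 0.85},
        {"regex": "PROJ-[A-Z0-9]{4}", "confidence": 0.90}
      ]
    }
  ]
}
"""


def compile_entities(config_text):
    """Return a list of (entity_type, compiled_regex, confidence) tuples."""
    config = json.loads(config_text)
    compiled = []
    for entity in config["custom_entities"]:
        for pattern in entity["patterns"]:
            # \b keeps the pattern from matching inside a longer token
            # such as "EMP-123456".
            regex = re.compile(r"\b(?:" + pattern["regex"] + r")\b")
            compiled.append((entity["type"], regex, pattern["confidence"]))
    return compiled


def find_entities(text, compiled):
    """Return (entity_type, matched_text, confidence) for every match."""
    hits = []
    for entity_type, regex, confidence in compiled:
        for match in regex.finditer(text):
            hits.append((entity_type, match.group(), confidence))
    return hits


hits = find_entities("Employee EMP-12345 works on AURORA-47",
                     compile_entities(CONFIG_JSON))
print(hits)
```

Running the patterns this way catches escaping mistakes and over-broad regexes early, before any real document passes through the pipeline.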

Step 2: Implement the Custom Recognizer

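from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# AnalyzerEngine() loads a spaCy English model by default, so one must be installed.
analyzer = AnalyzerEngine()

# Add employee ID recognizer. Pattern requires a score; the values mirror Step 1.
emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[
        Pattern(name="emp_badge", regex=r"EMP-\d{5}", score=0.95),
        Pattern(name="emp_short", regex=r"E\d{7}", score=0.90)
    ],
    context=["employee", "staff"]
)

analyzer.registry.add_recognizer(emp_recognizer)

# Test (analyze() requires the language argument)
results = analyzer.analyze(
    text="Employee EMP-12345 works on PROJECT-AURORA",
    language="en",
    entities=["EMPLOYEE_ID"]
)
# Should contain one RecognizerResult of type EMPLOYEE_ID covering "EMP-12345"
# (start 9, end 18); the nearby context word "Employee" may raise its score.
print(results)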

Common Organization-Specific PII Types

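| Industry   | Identifier      | Format            | Risk Level |
|------------|-----------------|-------------------|------------|
| Tech       | Employee Badge  | EMP-XXXXX         | MEDIUM     |
| Tech       | Project Code    | [A-Z]{6}-XX       | HIGH       |
| Healthcare | Department Code | [A-Z]{4}-XX       | HIGH       |
| Healthcare | Surgery ID      | SURG-YYYY-XXXX    | CRITICAL   |
| Finance    | Account Number  | ACCT-ABC-XXX      | CRITICAL   |
| Finance    | Portfolio Code  | HEDGE-X           | HIGH       |
| Government | Case Number     | CASE-YYYYMMDD-XXX | HIGH       |
| Retail     | Store ID        | STR-XXX           | MEDIUM     |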
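
The plain templates in the Format column can be turned into draft regexes mechanically. The sketch below rests on a convention the table does not state outright: a hyphen-separated segment made up only of the letters X, Y, M, or D is read as a run of digits, and any other segment is literal text. `template_to_regex` is a hypothetical helper, and rows whose format already contains regex syntax (such as `[A-Z]{6}-XX`) are left out.

```python
import re

PLACEHOLDER_CHARS = set("XYMD")  # assumption: these letters stand for digits


def template_to_regex(template):
    """Convert a template such as 'SURG-YYYY-XXXX' into a compiled regex."""
    parts = []
    for segment in template.split("-"):
        if segment and set(segment) <= PLACEHOLDER_CHARS:
            parts.append(r"\d{%d}" % len(segment))  # 'XXXX' becomes \d{4}
        else:
            parts.append(re.escape(segment))        # 'SURG' stays literal
    return re.compile(r"\b" + "-".join(parts) + r"\b")


# Templates from the table, paired with the sample values quoted in the article.
TEMPLATES = {
    "EMPLOYEE_BADGE": ("EMP-XXXXX", "EMP-12345"),
    "SURGERY_ID": ("SURG-YYYY-XXXX", "SURG-2024-0156"),
    "ACCOUNT_NUMBER": ("ACCT-ABC-XXX", "ACCT-ABC-789"),
    "PORTFOLIO_CODE": ("HEDGE-X", "HEDGE-3"),
    "CASE_NUMBER": ("CASE-YYYYMMDD-XXX", "CASE-20250308-001"),
}

matches = {}
for name, (template, sample) in TEMPLATES.items():
    regex = template_to_regex(template)
    matches[name] = bool(regex.fullmatch(sample))
    print(name, regex.pattern, matches[name])
```

Generated patterns are only a starting point. Each one still needs review against real identifiers, since a template cannot express things like valid date ranges or checksum digits.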

Real-World Case: Tech Company Data Leak

A tech company shared an internal document with an AI chat tool:

Internal memo:
From: John Smith (EMP-45678)
RE: Project AURORA Status

The AURORA-47 team has achieved 95% accuracy on the neural network model.
Code repository: github.com/company/aurora-47-private

The AI model was trained on this document, which leaked:

  • ❌ Employee ID (EMP-45678) - could identify employee
  • ❌ Project code (AURORA-47) - could identify classified project
  • ❌ Private GitHub URL - could expose source code

With custom entity detection:

Anonymized memo:
From: <PERSON> (<EMPLOYEE_ID>)
RE: <PROJECT_CODE> Status

The <PROJECT_CODE> team has achieved 95% accuracy on the neural network model.
Code repository: <URL>
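
The replacement step can be approximated with plain regular expressions. This is a minimal sketch rather than the product's pipeline: `RULES` and `anonymize` are hypothetical names, the URL pattern is a rough assumption, and person names are not handled because detecting them takes an NER model rather than a regex.

```python
import re

# Order matters: URLs are replaced first so identifiers embedded in a URL
# are removed together with it.
RULES = [
    ("URL", re.compile(r"(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}/\S+")),
    ("EMPLOYEE_ID", re.compile(r"\bEMP-\d{5}\b")),
    ("PROJECT_CODE", re.compile(r"\b[A-Z]{6}-\d{2}\b")),
]


def anonymize(text):
    """Replace every match of each rule with a <TYPE> placeholder."""
    for entity_type, regex in RULES:
        text = regex.sub("<" + entity_type + ">", text)
    return text


memo = """From: John Smith (EMP-45678)
RE: Project AURORA Status

The AURORA-47 team has achieved 95% accuracy on the neural network model.
Code repository: github.com/company/aurora-47-private"""

result = anonymize(memo)
print(result)
```

Run against the memo, this sketch replaces the badge, the `AURORA-47` code, and the repository URL, but leaves `John Smith` and the subject line's `Project AURORA` untouched. The name needs an NER-based recognizer, and the bare codename has no numeric suffix, so it would need its own pattern or a deny-list entry to reach the fully anonymized memo shown above.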

Benefits of Custom Recognizers

  • ✅ Coverage of organization secrets - every identifier format you define becomes detectable
  • ✅ Audit trail - track what was anonymized
  • ✅ Context awareness - reduce false positives using surrounding words
  • ✅ Compliance evidence - show that the cataloged secrets are protected
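
Context awareness can be pictured as a score adjustment: a regex hit counts for more when a related word appears nearby. The sketch below is a simplified illustration of that idea, not Presidio's actual context enhancer. The function name, the base score of 0.5, the 0.35 boost, and the 5-token window are all assumptions chosen for the example.

```python
import re


def score_with_context(text, regex, base_score, context_words,
                       boost=0.35, window=5):
    """Return (matched_text, score) pairs, boosting hits near context words."""
    tokens = text.split()
    results = []
    for index, token in enumerate(tokens):
        match = re.search(regex, token)
        if not match:
            continue
        # Collect up to `window` tokens on each side of the hit.
        nearby = tokens[max(0, index - window):index]
        nearby += tokens[index + 1:index + 1 + window]
        # Strip punctuation and lowercase before comparing.
        nearby_words = {re.sub(r"\W", "", word).lower() for word in nearby}
        has_context = any(word in nearby_words for word in context_words)
        score = min(1.0, base_score + boost) if has_context else base_score
        results.append((match.group(), round(score, 2)))
    return results


CONTEXT = ["employee", "staff", "personnel"]
SHORT_ID = r"\bE\d{7}\b"

with_context = score_with_context(
    "Employee E1234567 joined the staff", SHORT_ID, 0.5, CONTEXT)
without_context = score_with_context(
    "Order E1234567 shipped today", SHORT_ID, 0.5, CONTEXT)
print(with_context, without_context)
```

With a score threshold set between the two values, the ambiguous `E1234567` would be anonymized in the HR sentence and left alone in the shipping one, which is how context cuts false positives for loosely shaped identifiers.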

Best-Practice Implementation

  1. Catalog the secrets - list every organization-specific identifier
  2. Define patterns - create regex and context rules for each identifier
  3. Test extensively - validate against real organizational data
  4. Monitor leakage - track how many identifiers are detected and anonymized
  5. Update quarterly - refresh patterns as the set of secrets grows
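
Steps 3 and 4 lend themselves to a small regression harness: run each pattern over labeled samples, then count hits, misses, and false alarms. The sketch below uses hypothetical sample data and helper names (`LABELED_SAMPLES`, `detect`, `evaluate`); real validation would draw its samples from the organization's own documents.

```python
import re
from collections import Counter

PATTERNS = {
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{5}\b"),
    "PROJECT_CODE": re.compile(r"\b[A-Z]{6}-\d{2}\b"),
}

# (text, expected) pairs; expected is the set of (type, value) that a
# reviewer says should be found. All samples here are made up.
LABELED_SAMPLES = [
    ("Badge EMP-12345 was reissued", {("EMPLOYEE_ID", "EMP-12345")}),
    ("AURORA-47 ships next week", {("PROJECT_CODE", "AURORA-47")}),
    ("Invoice number 2024-0156 is overdue", set()),
    # Lowercase variant that a reviewer still considers an employee ID.
    ("Contact emp-45678 about access", {("EMPLOYEE_ID", "emp-45678")}),
]


def detect(text):
    """Return the set of (entity_type, matched_text) found in text."""
    return {(entity_type, match.group())
            for entity_type, regex in PATTERNS.items()
            for match in regex.finditer(text)}


def evaluate(samples):
    """Count true positives, false negatives, and false positives."""
    counts = Counter()
    for text, expected in samples:
        found = detect(text)
        counts["true_positive"] += len(found & expected)
        counts["false_negative"] += len(expected - found)
        counts["false_positive"] += len(found - expected)
    return counts


counts = evaluate(LABELED_SAMPLES)
print(dict(counts))
```

The lowercase `emp-45678` sample is missed by the case-sensitive pattern and shows up as a false negative, which is the kind of gap this harness is meant to surface. Tracking the same counters over time gives a simple leakage metric for step 4.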

Custom PII recognizers are essential for organizations with proprietary identifiers.
