The CSV Free-Text PII Detection Failure
CSV is a structured format:
user_id,name,email,status,notes
12345,John Doe,john@example.com,active,"Contacted via phone: 555-1234, mentioned family medical history concerns"
Each column has a predictable structure, so detection can be matched to the column. Name → name-matching rules. Email → email regex. A phone number inside notes → unreliable, because free text varies too much.
Problem: Free-text columns (notes, comments, feedback, clinical notes) contain embedded PII in unpredictable formats.
Common CSV Free Text PII Scenarios
- Customer service notes: "Customer called regarding billing. SSN provided for verification: 123-45-6789. Explained fraud claim process."
- Research study notes: "Subject ID: S0012, DOB: 03/15/1985, diagnosed with diabetes, treatment center: Cleveland Clinic, physician: Dr. Robert Smith, contact: 216-555-0188"
- Healthcare provider notes: "Patient presented with chest pain. Address: 123 Main St, Cleveland OH 44101. Insurance ID: BC/BS-123456789. Medications: metoprolol."
- Financial advisor notes: "Client net worth estimated $5M. Concern: daughter's school loans. Bank account xxxx1234 discussed. Referred to Smith & Associates wealth management."
Why Standard PII Detection Fails
Challenge 1: Contextual Format Variation
The same information appears in different formats in free text:
"SSN: 123-45-6789"
"Social Security: 123456789"
"SSN (123-45-6789)"
"SS# 123456789"
"Social security was 1234567 (partial)"
A regex has to cover every one of these variations, and in practice it often misses edge cases.
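As a rough illustration of what covering these variations involves, here is a minimal sketch that anchors a 9-digit number to a nearby label. The pattern name `SSN_LABELED`, the label list, and the 15-character filler limit are assumptions chosen for this example, not a vetted rule set.

```python
import re

# Hypothetical pattern: an SSN-like number that follows an SSN label.
SSN_LABELED = re.compile(
    r"""
    \b(?:SSN|SS\#|Social\ Security)   # label variants seen in notes
    [^\d\n]{0,15}?                    # short filler between label and number
    (\d{3}-?\d{2}-?\d{4})\b           # nine digits, dashes optional
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_labeled_ssns(text):
    """Return SSN-like numbers that appear shortly after an SSN label."""
    return [m.group(1) for m in SSN_LABELED.finditer(text)]

samples = [
    "SSN: 123-45-6789",
    "Social Security: 123456789",
    "SSN (123-45-6789)",
    "SS# 123456789",
    "Social security was 1234567 (partial)",
]
results = {s: find_labeled_ssns(s) for s in samples}
```

The first four samples are matched; the last one, a partial number, returns an empty list. That is exactly the kind of edge case a fixed-length pattern lets through.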
Challenge 2: Obfuscation in Notes
Writers often partially obfuscate values as they type:
"SSN partially provided: xxx-45-6789"
"Patient ID: 12*** (redacted by author)"
"Address on file: 123 M*** St"
Partially redacted data can still be identifying, yet it slips past a regex that expects the full number.
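One possible way to surface such values for review is to look for tokens that mix a run of mask characters with surviving characters. The sketch below is an assumption-heavy heuristic: the thresholds (two or more `*`, three or more `x`) and the function name `find_partial_redactions` are chosen for illustration only.

```python
import re

# Assumed heuristic: "**" or "xxx" (or longer) counts as a mask run.
MASK_RUN = re.compile(r"\*{2,}|[xX]{3,}")
# A surviving character is any digit or any letter other than x/X.
SURVIVOR = re.compile(r"[0-9A-WYZa-wyz]")

def find_partial_redactions(text):
    """Return tokens that look partially masked but still leak characters."""
    hits = []
    for token in re.findall(r"[\w*#-]+", text):
        if MASK_RUN.search(token) and SURVIVOR.search(token):
            hits.append(token)
    return hits

examples = [
    "SSN partially provided: xxx-45-6789",
    "Patient ID: 12*** (redacted by author)",
    "Address on file: 123 M*** St",
]
flagged = [find_partial_redactions(e) for e in examples]
```

A hit here is a candidate for human review or full masking rather than proof of PII; ordinary words that happen to contain long runs of `x` would also be flagged.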
Challenge 3: Implicit References
PII can be implicit rather than stated directly:
"Referred to Cleveland Clinic psychiatry (indication of mental health treatment)"
"Failed kidney transplant (implies ESRD)"
"Admitted to oncology ward (implies cancer diagnosis)"
These statements disclose health information about a person, but none of them matches a formal PII pattern.
Challenge 4: Domain-Specific Acronyms
Different industries have their own domain-specific identifiers:
- Healthcare: ICD-10 codes, NDC codes, MRN (medical record number)
- Finance: ISIN, CUSIP, account last-4
- Legal: Docket numbers, client matter IDs
These are contextually sensitive, but generic PII regex sets rarely include patterns for them.
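Domain identifiers can be added as a separate pattern table. The sketch below is illustrative only: the MRN and account patterns are assumptions (real formats differ by institution), and the ISIN pattern checks only the general shape of two letters, nine alphanumerics, and a final digit, without validating the check digit.

```python
import re

# Assumed, illustrative patterns; confirm real formats before relying on them.
DOMAIN_PATTERNS = {
    "MRN": re.compile(r"\bMRN[:#\s]*([A-Z0-9-]{5,12})\b", re.IGNORECASE),
    "ISIN": re.compile(r"\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b"),
    "ACCOUNT_LAST4": re.compile(r"\baccount\s+[xX*]{2,}(\d{4})\b", re.IGNORECASE),
}

def tag_domain_identifiers(text):
    """Return (label, matched_text) pairs for each domain pattern that hits."""
    findings = []
    for label, pattern in DOMAIN_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group(0)))
    return findings

findings = tag_domain_identifiers(
    "MRN: 00482913, holds US0378331005; bank account xxxx1234 discussed."
)
```

Keeping these patterns in a per-domain table makes it easier to enable only the identifiers relevant to a given export.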
Strategy 1: Structured CSV (No Free Text)
Best practice: Separate PII from the detailed notes.
user_id,name,email,encrypted_notes_key
12345,John Doe,john@example.com,notes_key_abc123
Notes stored separately:
notes_key_abc123: {encrypted content, decryptable only in the secure system}
Benefits:
- The CSV is clean, with no free text
- Notes are encrypted and stored in a separate location
- Sharing the CSV is lower-risk (no details)
- Access to notes is audit-logged
Challenges:
- Requires application redesign
- Users may resist (want all data in one file)
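The separation described in this strategy could be sketched as follows. The function name `split_notes` and the key format are hypothetical, and the cipher is deliberately left to the caller: the `placeholder_encrypt` function below only reverses bytes so the example runs, and it is NOT encryption.

```python
import csv
import io
import secrets

def split_notes(rows, encrypt, notes_field="notes"):
    """Move the free-text field out of each row into a separate store.

    `encrypt` is a caller-supplied callable (bytes -> bytes); this sketch
    does not choose a cipher.
    """
    clean_rows, notes_store = [], {}
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        key = "notes_key_" + secrets.token_hex(8)
        notes_store[key] = encrypt(row.pop(notes_field, "").encode("utf-8"))
        row["encrypted_notes_key"] = key
        clean_rows.append(row)
    return clean_rows, notes_store

def placeholder_encrypt(data):
    # Placeholder only: swap in a vetted cipher in any real system.
    return data[::-1]

source = io.StringIO(
    "user_id,name,email,status,notes\n"
    '12345,John Doe,john@example.com,active,"Contacted via phone: 555-1234"\n'
)
clean_rows, notes_store = split_notes(csv.DictReader(source), placeholder_encrypt)
```

The exported CSV is then written from `clean_rows` only, while `notes_store` goes to the access-controlled system.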
Strategy 2: Aggressive Masking Before Export
Mask all potential PII in free text before the CSV export:
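import re

def mask_pii_in_notes(text):
    """Replace common PII patterns in free text with placeholder tokens."""
    # Credit card first, so its digit groups are not mistaken for an SSN or phone
    text = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{3,4}\b', '[CC]', text)
    # SSN: 123-45-6789 or 123456789 (word boundaries avoid matching inside longer numbers)
    text = re.sub(r'\b\d{3}-?\d{2}-?\d{4}\b', '[SSN]', text)
    # Phone: (555) 123-4567, 555-123-4567, 555.123.4567
    text = re.sub(r'(?<!\w)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', '[PHONE]', text)
    # Email (requires a dot-separated domain)
    text = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL]', text)
    return text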
Benefits:
- Simple implementation
- All free text is cleaned
Challenges:
- Masks legitimate mentions (e.g., "contact us at support@example.com")
- False positives (numbers that are not PII)
- Loss of functional data
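One way to soften the first challenge is an allowlist for known non-personal addresses. This is a minimal sketch: `ALLOWED_EMAILS` and `mask_emails` are hypothetical names, and the allowlist contents are an assumption that each organization would have to define for itself.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Assumed allowlist of organizational, non-personal addresses.
ALLOWED_EMAILS = {"support@example.com"}

def mask_emails(text, allowed=ALLOWED_EMAILS):
    """Mask email addresses except those on the allowlist."""
    def replace(match):
        address = match.group(0)
        return address if address.lower() in allowed else "[EMAIL]"
    return EMAIL.sub(replace, text)

masked = mask_emails("Contact us at support@example.com or john@example.com.")
```

The same callback approach works for phone numbers, for example to keep a published support line while masking everything else.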
Strategy 3: Multi-Model Detection Pipeline
Combine regex + NER (Named Entity Recognition) + contextual rules:
Free text → [Regex patterns] → [spaCy NER] → [Domain-specific rules] → Confidence score
Example:
Input: "Patient presented with chest pain. Address: 123 Main St, Cleveland OH 44101."
Regex: matches "123 Main St" (street address pattern)
spaCy: tags "Cleveland" (GPE) and possibly "OH" (GPE); "chest pain" needs a domain-specific medical model, since stock spaCy English models have no medical-condition label
Rules: "Address" + street pattern + city + state = high confidence address PII
Output: [ADDRESS detected with 95% confidence]
Benefits:
- Higher detection accuracy (multi-signal)
- Context-aware (fewer false positives)
Challenges:
- Complexity (requires NLP expertise)
- Latency (slower than regex alone)
- Misclassifications (models still produce false positives and misses on edge cases)
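The multi-signal idea can be sketched without any NLP dependency by treating the NER step as a pluggable function. Everything here is an assumption made for illustration: `toy_ner` is a stand-in for a real model such as spaCy, the regexes cover only simple US-style addresses, and the score weights are arbitrary values picked so that independent signals add up.

```python
import re

STREET = re.compile(r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+)+(?:St|Ave|Rd|Blvd|Dr|Ln)\b")
STATE_ZIP = re.compile(r"\b[A-Z]{2}\s+\d{5}\b")
ADDRESS_CUE = re.compile(r"\baddress\b", re.IGNORECASE)

def toy_ner(text):
    """Stand-in for a real NER model; returns (entity_text, label) pairs."""
    known_places = {"Cleveland": "GPE"}
    words = re.findall(r"[A-Za-z]+", text)
    return [(w, known_places[w]) for w in words if w in known_places]

def detect_address(text, ner=toy_ner):
    """Combine regex, NER, and keyword signals into one scored finding."""
    street = STREET.search(text)
    if not street:
        return None
    score = 0.5  # base score for the street pattern alone (illustrative)
    if ADDRESS_CUE.search(text):
        score += 0.15  # the word "address" appears in the note
    if any(label == "GPE" for _, label in ner(text)):
        score += 0.15  # NER found a place name
    if STATE_ZIP.search(text):
        score += 0.15  # state abbreviation followed by a ZIP code
    return {"label": "ADDRESS", "text": street.group(0), "confidence": round(score, 2)}

finding = detect_address(
    "Patient presented with chest pain. Address: 123 Main St, Cleveland OH 44101."
)
```

In a real pipeline the weights would be calibrated against labeled notes, and `ner` would wrap the chosen model's output.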
Strategy 4: GDPR-Compliant Research Data Sharing
For research data exports, apply strict anonymization protocol:
Step 1: Identify sensitive columns
- Direct identifiers: name, email, phone, SSN, address
- Quasi-identifiers: age, occupation, zip code (can be used to re-identify)
- Sensitive attributes: medical data, financial data, criminal history
Step 2: Apply anonymization techniques
# BEFORE
name,age,occupation,diagnosis,notes
John Doe,45,Software Engineer,Diabetes Type 2,"Patient started on metformin. Referred to endocrinologist Dr. Smith."
# AFTER
subject_id,age_range,occupation_group,diagnosis_code,notes_mask
S001,40-50,IT,[MASKED],[NOTES_ENCRYPTED]
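The BEFORE/AFTER transformation can be sketched as a row-level function. The occupation mapping, the function names, and the 10-year bucket width are assumptions for illustration; the range label follows the "40-50" style used in the example, with the lower bound inclusive.

```python
# Hypothetical mapping from specific job titles to broad groups.
OCCUPATION_GROUPS = {
    "Software Engineer": "IT",
    "Data Analyst": "IT",
    "Nurse": "Healthcare",
}

def age_range(age, width=10):
    """Generalize an exact age into a bucket label such as '40-50'."""
    low = (int(age) // width) * width
    return f"{low}-{low + width}"

def anonymize_row(row, subject_number):
    """Drop direct identifiers and generalize quasi-identifiers."""
    return {
        "subject_id": f"S{subject_number:03d}",
        "age_range": age_range(row["age"]),
        "occupation_group": OCCUPATION_GROUPS.get(row["occupation"], "Other"),
        "diagnosis_code": "[MASKED]",
        "notes_mask": "[NOTES_ENCRYPTED]",
    }

before = {
    "name": "John Doe",
    "age": "45",
    "occupation": "Software Engineer",
    "diagnosis": "Diabetes Type 2",
    "notes": "Patient started on metformin. Referred to endocrinologist Dr. Smith.",
}
after = anonymize_row(before, subject_number=1)
```

The name never reaches the output, and the subject ID comes from a counter rather than from any field in the source row.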
Step 3: Encryption + Access Control
Anonymized CSV (public sharing) + Encrypted notes (access-logged) + Data dictionary (separate document)
Step 4: Governance
- Data use agreement required before access
- Usage logs tracked
- Re-identification prohibited
GDPR Compliance Template
csv_free_text_handling:
  before_collection:
    - Assess: "Is free text necessary, or is a structured field sufficient?"
    - Data minimization: Collect only essential data
  collection:
    - Flag high-risk text fields (notes, comments, feedback)
    - Train data collectors in PII minimization
    - Use templates for a consistent format
  processing:
    - Anonymize free text before storage (regex + NER)
    - Or: Encrypt + separate access control
    - Or: Structure conversion (notes → form fields)
  sharing:
    - CSV exports: Remove or mask free text
    - Research sharing: Strict anonymization
    - Audit trail: Who accessed what
  retention:
    - Auto-purge after retention window (e.g., 30 days)
    - Archive: Encrypted + separate location
    - Deletion: Cryptographic erasure (keys deleted)
Testing: Validate Anonymization
Before deployment:
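test_cases = [
    # The label text stays; only the PII value is replaced
    ("SSN: 123-45-6789", "SSN: [SSN]"),
    ("Phone (555) 123-4567", "Phone [PHONE]"),
    ("Email john@example.com", "Email [EMAIL]"),
    # No PII pattern: must stay unchanged (guards against false positives)
    ("Referred to Cleveland Clinic", "Referred to Cleveland Clinic"),
]
for input_text, expected in test_cases:
    result = mask_pii_in_notes(input_text)
    assert result == expected, f"{input_text!r} -> {result!r}, expected {expected!r}"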
Conclusion
Free-text columns in CSV files are an underestimated risk vector. They are where PII hides in plain sight, often missed by standard tools. GDPR regulators increasingly expect explicit handling policies for unstructured data.
Companies that export CSV files for research sharing should implement a multi-layered approach: avoid free text where possible, anonymize aggressively, encrypt what remains, and track access. The goal is GDPR compliance plus research value, and both are achievable with a proper process.