The Current Situation: Format-Specific Detection
Most PII detection tools are optimized for a single format:
- Presidio → plain text / JSON logs
- OCR-based → scanned images + PDFs
- Regex scanners → CSV / database dumps
- Keyword matching → document corpuses
In production environments, data arrives in six or more formats at once:
- PDFs — contracts, forms, reports, scanned documents
- Word documents (.docx) — memos, drafts, templates
- Spreadsheets (.xlsx) — datasets, reporting, budgets
- Screenshots (.png, .jpg) — chat transcripts, internal tool captures
- Databases — patient records, customer data, logs
- Structured data — JSON, XML, CSV, SQL dumps
Each format has its own extraction mechanics and its own pitfalls for PII visibility.
Format 1: PDFs
Problem: PDFs fall into two categories:
- Text-layer PDFs: contain underlying extractable text
- Image-based PDFs: scanned documents stored as page images
A generic text scanner sees the text layer but not the scanned images.
Detection failure: A legal document is 80% text-layer and 20% scanned. The generic scanner flags the text-layer content but misses an SSN in the scanned section.
Solution: Multi-stage pipeline:
- Extract text layer
- Detect scanned images
- Run OCR on the scanned images
- Merge results
- Deduplicate
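The five stages above can be sketched as follows. This is a minimal illustration, not a production pipeline: the `Page` type, the `scan_pdf` function, and the injected `ocr` callable are hypothetical names introduced here, and the fake OCR in the demo stands in for a real engine.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

@dataclass
class Page:
    number: int
    text_layer: str          # "" when the page has no extractable text
    image: Optional[bytes]   # page image when the page is a scan, else None

def scan_pdf(pages: list[Page], ocr: Callable[[bytes], str]) -> dict[str, set[int]]:
    """Return {ssn: page numbers}, merging text-layer and OCR hits."""
    findings: dict[str, set[int]] = {}
    for page in pages:
        texts = [page.text_layer]              # stage 1: text layer
        if page.image is not None:             # stage 2: detect scanned content
            texts.append(ocr(page.image))      # stage 3: OCR the scan
        for text in texts:                     # stage 4: merge both sources
            for ssn in SSN_RE.findall(text):
                # stage 5: keying by value deduplicates repeated hits
                findings.setdefault(ssn, set()).add(page.number)
    return findings

# Demo with a fake OCR function (assumption: real OCR returns plain text).
fake_ocr = lambda image: image.decode("ascii")
pages = [
    Page(1, "Employee SSN 123-45-6789", None),
    Page(2, "", b"Spouse SSN 987-65-4321, employee 123-45-6789"),
]
result = scan_pdf(pages, fake_ocr)
```

Passing the OCR engine in as a callable keeps the pipeline testable without an OCR dependency and makes it easy to swap engines later.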
Format 2: Word Documents
Problem: DOCX is a ZIP package of XML parts. Data can live in:
- Plain text in the body
- Embedded textboxes
- Headers/footers
- Comments and tracked changes
- Embedded images
- Embedded OLE objects
Many tools extract only the body text.
Detection failure: A Word document has tracked changes that contain sensitive salary data. The tool scans the visible text, but the tracked changes are hidden from it.
Solution:
- Parse DOCX XML structure
- Extract all text streams
- Preserve metadata
- Report PII found in non-body streams
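As a rough sketch of parsing the DOCX structure directly, the example below uses only the Python standard library to read the body, comments, headers, footers, and notes, and tags deleted tracked-change text (`w:delText`). The function name `docx_text_streams` and the demo document are illustrative assumptions; embedded images and OLE objects are out of scope here.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
# Package parts that commonly carry text beyond the main body.
PART_RE = re.compile(
    r"word/(document|comments|footnotes|endnotes|header\d*|footer\d*)\.xml"
)

def docx_text_streams(data: bytes) -> dict[str, list[str]]:
    """Map each text-bearing part to its text runs, including deleted text."""
    streams: dict[str, list[str]] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if not PART_RE.fullmatch(name):
                continue
            runs = []
            for el in ET.fromstring(zf.read(name)).iter():
                if el.tag == W + "t" and el.text:
                    runs.append(el.text)                  # visible or inserted text
                elif el.tag == W + "delText" and el.text:
                    runs.append("[deleted] " + el.text)   # tracked deletion
            streams[name] = runs
    return streams

def _demo_docx() -> bytes:
    """Build a tiny, hypothetical DOCX-like package for the demo."""
    ns = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
    body = (
        f"<w:document {ns}><w:body><w:p>"
        "<w:r><w:t>Offer letter</w:t></w:r>"
        "<w:del><w:r><w:delText>Salary: 95000</w:delText></w:r></w:del>"
        "</w:p></w:body></w:document>"
    )
    comments = (
        f"<w:comments {ns}><w:comment><w:p>"
        "<w:r><w:t>SSN 123-45-6789</w:t></w:r>"
        "</w:p></w:comment></w:comments>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
        zf.writestr("word/comments.xml", comments)
    return buf.getvalue()

streams = docx_text_streams(_demo_docx())
```

Keeping results keyed by part name makes it straightforward to report which findings came from outside the body.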
Format 3: Spreadsheets
Problem: XLSX workbooks hold multiple sheets plus hidden content:
- Visible data in sheets
- Hidden rows/columns
- Hidden sheets
- Embedded formulas
- Metadata
Detection failure: An HR spreadsheet has employee data in Sheet1. Sheet2 is hidden and holds salaries and SSNs. The generic tool scans only Sheet1.
Solution:
- Enumerate all sheets
- Expand hidden rows/columns
- Evaluate formulas
- Check metadata
- Report sheet coverage
Format 4: Screenshots
Problem: Screenshots contain:
- Rendered UI text (inconsistent quality)
- Form fields
- Blurred/pixelated content
- Multiple languages
- Variable resolution
Detection failure: A screenshot contains an email address and a phone number. OCR extracts the text, but with only 60% accuracy.
Solution:
- Preprocessing — enhance contrast, resize
- Layout analysis — identify form fields
- Multi-language OCR
- Fuzzy matching
- Confidence scoring
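The fuzzy-matching and confidence-scoring steps might look like the sketch below. It assumes the OCR engine yields `(word, confidence)` pairs, which is a simplification, and the look-alike table and the `find_phones` helper are illustrative rather than exhaustive.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

def find_phones(words: list[tuple[str, float]]) -> list[dict]:
    """Find phone numbers in OCR words, tagging how each one was matched."""
    findings = []
    for text, conf in words:
        if PHONE_RE.fullmatch(text):
            findings.append({"value": text, "match": "exact", "ocr_conf": conf})
            continue
        # Fuzzy pass: repair digit look-alikes, then retry the pattern.
        repaired = text.translate(DIGIT_LOOKALIKES)
        if PHONE_RE.fullmatch(repaired):
            findings.append({"value": repaired, "match": "fuzzy", "ocr_conf": conf})
    return findings

# Hypothetical OCR output: (word, confidence in [0, 1]).
ocr_words = [("Call", 0.97), ("555-O1l-2368", 0.61), ("555-014-9921", 0.93)]
phones = find_phones(ocr_words)
```

Recording both the match type and the OCR confidence lets a reviewer treat a fuzzy, low-confidence hit differently from an exact, high-confidence one instead of dropping it.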
Format 5: Databases
Problem: Database PII is not static:
- Data is queried at runtime
- PII is derived in stored procedures
- Data is decrypted on SELECT
Detection failure: A database stores an encrypted SSN column. The tool scans the schema and the stored values, sees only ciphertext, and skips the column. A runtime query decrypts the values, but the query result is never scanned.
Solution:
- Query logging — capture queries + results
- Runtime result scanning
- Stored procedure analysis
- Connection-time detection
- Compliance note
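Runtime result scanning can be illustrated with an in-memory SQLite database. This is a sketch under loud assumptions: the `decrypt` SQL function merely reverses a string as a stand-in for real column encryption, and `scanned_query` is a hypothetical wrapper, not part of any database driver.

```python
import re
import sqlite3

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scanned_query(conn: sqlite3.Connection, sql: str, params=()):
    """Run a query and scan every returned value; returns (rows, findings)."""
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    findings = [
        {"column": col, "row": i}
        for i, row in enumerate(rows)
        for col, value in zip(columns, row)
        if isinstance(value, str) and SSN_RE.search(value)
    ]
    return rows, findings

conn = sqlite3.connect(":memory:")
# Toy "decryption": reverse the stored string (stand-in for real encryption).
conn.create_function("decrypt", 1, lambda s: s[::-1])
conn.execute("CREATE TABLE patients (name TEXT, ssn_enc TEXT)")
conn.execute("INSERT INTO patients VALUES (?, ?)", ("Ana", "123-45-6789"[::-1]))

# Scanning the stored values finds nothing; scanning the runtime result does.
_, at_rest = scanned_query(conn, "SELECT name, ssn_enc FROM patients")
_, at_runtime = scanned_query(
    conn, "SELECT name, decrypt(ssn_enc) AS ssn FROM patients"
)
```

The same table yields no findings at rest and one finding at query time, which is the gap a schema-only scan leaves open.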
Format 6: Structured Data
Problem: CSV, JSON, and XML vary in how they encode and escape values:
- CSV: quoted fields, escaped quotes
- JSON: unicode escapes, emoji, null bytes
- XML: CDATA sections, entities
Detection failure: A CSV has a newline embedded in a quoted field. The tool reads the file as 3 rows instead of 2.
Solution:
- Format-aware parsing — use proper libraries
- Unescape before scanning
- Multi-line field handling
- Encoding detection
- Verify parse integrity
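The embedded-newline failure is easy to reproduce. The sketch below contrasts naive line splitting with Python's `csv` module and adds a simple integrity check on field counts; the sample data and the `parse_csv_checked` helper are made up for illustration.

```python
import csv
import io

# One header row plus one record whose quoted field contains a newline.
raw = 'name,notes\r\n"Ana Cruz","Phone: 555-014-9921\nSSN: 123-45-6789"\r\n'

# Naive approach: split on line breaks, which cuts the quoted field in half.
naive_rows = raw.splitlines()

def parse_csv_checked(text: str) -> list[list[str]]:
    """Parse with the csv module and verify every row matches the header width."""
    # newline="" leaves line endings untouched so csv can handle quoted newlines.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    width = len(rows[0]) if rows else 0
    bad = [i for i, row in enumerate(rows) if len(row) != width]
    if bad:
        raise ValueError(f"rows with unexpected field count: {bad}")
    return rows

parsed_rows = parse_csv_checked(raw)
```

With the format-aware parser, the phone number and SSN stay in a single field of a single record, so a scanner sees them in context rather than as fragments of two broken lines.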
Building Format-Resilient Detection
Best practices:
- Modular pipeline — separate extractors per format
- Deduplication — same PII = single alert
- Coverage tracking — report coverage %
- Confidence tiers — regex vs. OCR matches
- Fallback chains — primary then secondary extractor
- Audit logging
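Two of these practices, fallback chains and deduplication with confidence tiers, can be sketched together. All names here (`Finding`, `dedupe`, `extract_with_fallback`) and the confidence values are hypothetical; a real system would pick its own schema and scoring.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Finding:
    kind: str          # e.g. "ssn", "email"
    value: str
    source: str        # which extractor produced it
    confidence: float  # e.g. higher for regex on a text layer than for OCR

def dedupe(findings: list[Finding]) -> list[Finding]:
    """Keep one finding per (kind, value), preferring the highest confidence."""
    best: dict[tuple[str, str], Finding] = {}
    for f in findings:
        key = (f.kind, f.value)
        if key not in best or f.confidence > best[key].confidence:
            best[key] = f
    return list(best.values())

def extract_with_fallback(
    data: bytes, chain: list[tuple[str, Callable[[bytes], str]]]
) -> tuple[str, Optional[str]]:
    """Try extractors in order; return (text, name of the one that succeeded)."""
    for name, extractor in chain:
        try:
            text = extractor(data)
        except Exception:
            continue  # fall through to the next extractor in the chain
        if text.strip():
            return text, name
    return "", None

def broken_primary(data: bytes) -> str:
    raise ValueError("unsupported encoding")  # simulated extractor failure

text, used = extract_with_fallback(
    b"SSN 123-45-6789",
    [("primary", broken_primary), ("utf8", lambda d: d.decode("utf-8"))],
)
unique = dedupe([
    Finding("ssn", "123-45-6789", "text_layer", 0.99),
    Finding("ssn", "123-45-6789", "ocr", 0.70),
    Finding("email", "a@example.com", "ocr", 0.80),
])
```

Returning the name of the extractor that succeeded gives audit logging and coverage tracking something concrete to record.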
Conclusion
Format fragmentation is not an edge case. It is the default state of production data. Organizations that rely on single-format detection assume they are compliant, but in reality they scan only a portion of their data.