
The False Positive Tax: Why Your PII Tool's Precision Problem Costs More Than You Think

Presidio GitHub issue #1071 documents systematic false positives. A 2024 study found 22.7% precision in mixed-language enterprise datasets. Every false positive is a manual review burden — at scale, that's an invisible compliance tax that erodes automation ROI.

March 5, 2026 · 8 min read
false positive rate · Presidio precision · PII detection accuracy · score threshold configuration · hybrid detection

The Invisible Compliance Tax

PII detection tools are typically evaluated on recall — what percentage of actual PII did the tool catch? But precision — what percentage of the tool's detections are actual PII — determines the operational cost of using the tool.

A system with 95% recall and 22.7% precision catches 95% of real PII, but for every real PII entity detected it flags roughly 3.4 false positives ((1 − 0.227) / 0.227 ≈ 3.4). In a dataset containing 10,000 real PII entities, this system finds 9,500 of them and generates 9,500 / 0.227 ≈ 41,850 total detections, of which roughly 32,350 are false positives requiring manual review or causing over-redaction.
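The arithmetic above takes only a few lines to sketch (the helper name is ours, not from any library):

```python
def false_positive_tax(n_real_pii: int, recall: float, precision: float):
    """Estimate total detections and false positives for a detector."""
    true_positives = n_real_pii * recall           # real PII actually caught
    total_detections = true_positives / precision  # precision = TP / (TP + FP)
    false_positives = total_detections - true_positives
    return round(total_detections), round(false_positives)

# 10,000 real entities, 95% recall, 22.7% precision
total, fp = false_positive_tax(10_000, 0.95, 0.227)  # ≈ 41,850 and ≈ 32,350
```

Plugging in different precision values makes the tax visible: at 80% precision the same dataset produces only about 2,375 false positives instead of 32,000+.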

This is the "false positive tax": the operational overhead imposed on any organization that tries to use a high-recall, low-precision PII detection system at production scale. The false positive tax has direct costs — manual reviewer time — and indirect costs: over-redacted documents obscure relevant information, slow workflows, and reduce trust in the automated system.

What Presidio Issue #1071 Documents

The Microsoft Presidio GitHub issue #1071 (2024) documents a specific and systematic false positive pattern: TFN (Tax File Number) and PCI recognizers with checksum validation produce confidence scores of 1.0, maximum confidence, for non-PII numbers that happen to pass the checksum algorithm.

The design issue: context word checking (verifying that words like "tax file number" or "TFN" appear near the detected entity) is applied after the checksum step rather than before. Numbers that pass the checksum get a score of 1.0 regardless of context. In documents containing numerical data — financial spreadsheets, scientific datasets, log files — this produces a flood of false positives that cannot be filtered by score threshold alone.
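To see why checksum-only scoring floods numeric documents, consider the TFN check itself: a 9-digit number is valid when its weighted digit sum is divisible by 11, which roughly 1 in 11 arbitrary 9-digit numbers satisfies. A minimal sketch (not Presidio's actual recognizer code):

```python
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)

def tfn_checksum_ok(digits: str) -> bool:
    """9-digit TFN check: weighted digit sum divisible by 11."""
    if len(digits) != 9 or not digits.isdigit():
        return False
    return sum(w * int(d) for w, d in zip(TFN_WEIGHTS, digits)) % 11 == 0

# Roughly 1 in 11 arbitrary 9-digit numbers passes the checksum, so a
# spreadsheet full of IDs or sensor readings yields a steady stream of
# "TFNs" at score 1.0 when context is only checked after the checksum
hits = sum(tfn_checksum_ok(str(n)) for n in range(100_000_000, 100_010_000))
```

Across 10,000 consecutive 9-digit numbers, around 900 pass the checksum and would be flagged at maximum confidence without any context filtering.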

A separate pattern from the Presidio community (GitHub issue #999): German word segmentation creates false positives for name and location entities. German compounds like "Bundesbehörde" (federal authority) or common German terms can be incorrectly segmented and detected as personal names.

The 22.7% Precision Problem

Alvaro et al. (2024) evaluated Presidio default settings on mixed-language enterprise datasets and found 22.7% precision — meaning that in real enterprise documents, fewer than 1 in 4 Presidio detections corresponds to actual PII. This figure is consistent with practitioners' field experience: Presidio tuned for recall produces unusable noise in production.

A 2024 study examining DICOM medical imaging metadata found that even with score_threshold=0.7, 38 out of 39 DICOM images still had false positive entities. The threshold that eliminates false positives for one document type creates false negatives for another.
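The threshold dilemma is easy to reproduce with a toy detection set (all entities and scores below are illustrative, not taken from the studies cited):

```python
# Hypothetical detections: (entity_text, score, is_real_pii)
detections = [
    ("john.smith@example.com",   0.95, True),
    ("PatientName: J. Smith",    0.60, True),   # weak-signal real PII
    ("123456782",                1.00, False),  # checksum-only false positive
    ("Series UID 1.2.840.10008", 0.75, False),  # DICOM metadata value
]

def filter_by_threshold(dets, threshold):
    """Count surviving false positives and newly created false negatives."""
    kept = [d for d in dets if d[1] >= threshold]
    fp = sum(1 for d in kept if not d[2])                   # noise that remains
    fn = sum(1 for d in dets if d[2] and d[1] < threshold)  # real PII dropped
    return fp, fn

filter_by_threshold(detections, 0.7)
```

At threshold 0.7 this toy set still keeps two false positives (one at score 1.0, which no threshold can remove) while dropping one piece of real PII, which is exactly the trade-off described above.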

The precision problem is not unique to Presidio — it reflects the inherent difficulty of building a high-recall PII detector that also achieves high precision across diverse document types, languages, and data formats. The challenge is that any fixed threshold represents a trade-off: high threshold reduces false positives but increases false negatives; low threshold increases recall but inflates false positives.

The Context-Aware Solution

The alternative to threshold tuning is context-aware confidence scoring. Rather than assigning confidence based solely on the entity pattern match, context-aware scoring boosts confidence when context words appear near the match and suppresses false positives when context is absent.

For TFN detection: a score is boosted when "tax file number," "TFN," or "Australian tax" appears within a configurable window. A number passing the TFN checksum without nearby context words receives a reduced confidence score that falls below the review threshold.
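A minimal sketch of this gating, assuming a pattern/checksum stage has already produced a base score (the keyword list, floor value, and function name are illustrative):

```python
TFN_CONTEXT = ("tax file number", "tfn", "australian tax")

def context_adjusted_score(base_score: float, window_text: str,
                           keywords=TFN_CONTEXT, floor: float = 0.3) -> float:
    """Keep the full score when context keywords appear in the window;
    otherwise cap it below a typical 0.7 review threshold."""
    if any(kw in window_text.lower() for kw in keywords):
        return base_score
    return min(base_score, floor)

context_adjusted_score(1.0, "Please quote your tax file number 123456782")
context_adjusted_score(1.0, "row 42: sensor reading 123456782")
```

The first call keeps the checksum's full confidence of 1.0; the second caps it at 0.3, so the contextless match falls below the review threshold instead of triggering a manual review.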

For cross-lingual false positives: entity types that are specific to certain languages (German fiscal ID, French NIR, Australian TFN) can be scoped to documents detected as that language. A TFN detector applied only to English and Australian-English documents eliminates the systematic false positives that occur when the same detector runs on German documents.
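Language scoping can be expressed as a simple registry lookup. A hedged sketch, with hypothetical recognizer names:

```python
# Map language-specific recognizers to the document languages they
# should run on; None marks a language-agnostic recognizer
RECOGNIZER_LANGUAGES = {
    "AU_TFN":       {"en", "en-AU"},
    "DE_STEUER_ID": {"de"},
    "FR_NIR":       {"fr"},
    "EMAIL":        None,  # always runs, regardless of language
}

def recognizers_for(doc_language: str):
    """Select only the recognizers scoped to the detected document language."""
    return sorted(
        name for name, langs in RECOGNIZER_LANGUAGES.items()
        if langs is None or doc_language in langs
    )

recognizers_for("de")  # the TFN recognizer never runs on German documents
```

A German document is analyzed only by the German-scoped and language-agnostic recognizers, so the TFN checksum never fires on German numeric data in the first place.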

The third tier of hybrid detection — transformer-based contextual models — adds another layer: the model evaluates the full surrounding context to distinguish a genuine personal name ("John Smith, Patient ID 12345") from a false positive (a product identifier that happens to match a name pattern).


Ready to protect your data?

Start anonymizing PII with 285+ entity types across 48 languages.