
Presidio's 22.7% Precision Problem: Why False Positives Are Destroying Your Anonymization Results

A 2024 benchmark found Presidio's person name recognizer achieves 22.7% precision in business documents — meaning 77.3% of detections are false positives. Product names, company names, and city names get redacted alongside actual PII. Here's how hybrid detection fixes this.

March 7, 2026 · 7 min read
Presidio precision · false positives · NER accuracy · PII detection quality · hybrid recognizer


False positives in PII detection are not a minor nuisance. When 77.3% of what your tool flags as "person names" aren't person names, you're not protecting privacy — you're destroying data.

A 2024 benchmark study of Microsoft Presidio's default NER (Named Entity Recognition) model evaluated precision in business document contexts: financial reports, customer correspondence, product documentation, and support tickets. The result: 22.7% precision for person name detection.

That means for every 100 detections flagged as person names:

  • 23 are actual person names (correctly detected)
  • 77 are false positives (product names, company names, place names, brand mentions)
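The arithmetic behind that headline number is just the standard precision formula, shown here with the benchmark's (rounded) counts:

```python
# Precision = true positives / all flagged detections.
# 23 real person names out of 100 flags (rounded; the benchmark's
# exact figure is 22.7%).
def precision(true_positives: int, false_positives: int) -> float:
    return true_positives / (true_positives + false_positives)

print(f"{precision(23, 77):.1%}")  # → 23.0%
```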

Why This Happens

Presidio's default person name recognizer uses spaCy's en_core_web_lg model for NER. This model was trained primarily on news text — where most proper nouns are in fact people, organizations, or places that news articles discuss.

Business documents are different:

Product names that look like person names:

  • "Apple iPhone 15 Pro shipment records..." → flagged as PERSON
  • "Samsung Galaxy Tab" → flagged as PERSON
  • "Cisco Meraki deployment" → flagged as PERSON

Company names with person name structure:

  • "Johnson Controls quarterly results" → "Johnson" flagged as PERSON
  • "Goldman Sachs portfolio" → "Goldman" flagged as PERSON
  • "BlackRock investment thesis" → flagged as PERSON

Place names that trigger person NER:

  • "Victoria Harbour development" → "Victoria" flagged as PERSON
  • "Santiago distribution hub" → "Santiago" flagged as PERSON

Faced with a business document full of capitalized proper nouns, spaCy's default model lacks the contextual understanding to reliably distinguish "Apple" (the company) from "Apple Smith" (a person).
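A toy illustration of the failure mode: a context-free detector that flags capitalized multi-word spans over-flags exactly the way described above. This is emphatically not Presidio's implementation, just a minimal sketch of why token-level matching without context produces false positives:

```python
import re

# Naive context-free "name detector": flags any span of two or more
# capitalized words. Illustrative only -- NOT how Presidio/spaCy work.
CAPITALIZED = re.compile(r"\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)\b")

text = ("Johnson Controls shipped the Samsung Galaxy Tab order "
        "to Victoria Harbour; contact Alice Brown for details.")

for match in CAPITALIZED.finditer(text):
    print(match.group(1))
# Only "Alice Brown" is a person; the other flags are a company,
# a product, and a place -- three false positives out of four.
```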

The Downstream Effect

A data analytics firm processing customer feedback surveys implemented Presidio for anonymization before sharing results with client analysis teams. Post-deployment audit:

  • 40% of survey responses had product names incorrectly redacted
  • City names mentioned in responses were systematically removed
  • Brand references — part of the analysis context — were anonymized out
  • Customer sentiment about specific products became unanalyzable

The analysis team was receiving data where "I love the [REDACTED] Pro but the [REDACTED] charger broke" replaced "I love the iPhone Pro but the Apple charger broke." The anonymization destroyed the analytical value the survey was collected to provide.

The firm wasn't overprotecting privacy — they were destroying utility without achieving compliance. After the audit finding, Presidio was replaced.

The Hybrid Detection Approach

The precision problem isn't unique to Presidio's base model — it's an inherent limitation of token-level NER without context. The fix requires context-aware detection.

Transformer-based models (XLM-RoBERTa): Transformer language models trained on diverse text capture contextual relationships. "Apple announced its earnings" → Apple is a company (contextual clue: "announced earnings"). "Apple Smith joined the team" → Apple is a person name (contextual clue: "joined the team").

Context-aware detection dramatically improves precision while maintaining recall:

| Approach | Precision | Recall |
| --- | --- | --- |
| Presidio default NER | 22.7% | ~85% |
| Regex-only | ~95% | ~40% |
| Hybrid (Regex + NLP + Transformer) | ~85% | ~80% |

The hybrid approach doesn't achieve perfect precision — that would require human review. But 85% precision means 15% false positive rate rather than 77.3%. For business document processing, this is the difference between usable output and corrupted data.

How the hybrid stack works:

  1. Regex layer: High-precision detection for structured identifiers (SSNs, email addresses, phone numbers, IBANs). These formats are machine-readable, so false positives are rare. Runs first, eliminates structured PII with near-100% precision.

  2. NLP layer (spaCy): Standard NER for person names, organizations, locations. Provides the initial detection set. High recall, lower precision.

  3. Transformer layer (XLM-RoBERTa): Contextual re-scoring of NLP detections. Entities that were flagged by NLP are re-evaluated with full sentence context. "Apple" in a product context loses person entity score. "John" as a customer complaint subject name gains person entity score.

  4. Confidence thresholding: Only detections above a calibrated confidence threshold pass to anonymization. Threshold is tunable — higher threshold for precision-critical use cases (business analytics), lower threshold for compliance-critical use cases (HIPAA de-identification).
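The four layers above can be sketched in miniature. Everything here is a mock under stated assumptions: the regex layer is reduced to a single email pattern, the NLP layer's detections are passed in pre-made, and `COMPANY_CUES`/`PERSON_CUES` are hypothetical stand-ins for what an XLM-RoBERTa re-scorer would learn from full sentence context:

```python
import re
from dataclasses import dataclass

@dataclass
class Detection:
    text: str
    label: str
    score: float

# Layer 1: regex for structured identifiers (near-100% precision).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

# Hypothetical context cues standing in for a transformer re-scorer;
# a real stack would run an XLM-RoBERTa model here instead.
COMPANY_CUES = ("announced", "quarterly", "results", "earnings")
PERSON_CUES = ("joined", "said", "complained", "contacted")

def rescore(detection: Detection, sentence: str) -> Detection:
    """Layer 3: adjust the NLP score using sentence-level context."""
    words = sentence.lower().split()
    if any(cue in words for cue in COMPANY_CUES):
        detection.score -= 0.4   # product/company context: demote
    if any(cue in words for cue in PERSON_CUES):
        detection.score += 0.2   # person-subject context: promote
    return detection

def detect(sentence: str, nlp_hits: list[Detection],
           threshold: float = 0.6) -> list[Detection]:
    # Layer 1: structured PII skips contextual scoring entirely.
    results = [Detection(m.group(), "EMAIL", 1.0)
               for m in EMAIL.finditer(sentence)]
    # Layers 2+3: re-score the (mocked) NLP detections, then
    # Layer 4: keep only detections above the calibrated threshold.
    results += [d for d in (rescore(h, sentence) for h in nlp_hits)
                if d.score >= threshold]
    return results

hits = [Detection("Apple", "PERSON", 0.85)]
out = detect("Apple announced its quarterly earnings.", hits)
print(out)  # "Apple" drops to 0.45 and is filtered out
```

The key design point the sketch preserves: structured identifiers bypass the probabilistic layers entirely, while every NLP detection must survive contextual re-scoring and the threshold before it reaches the anonymizer.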

Practical Impact: Survey Analysis Recovery

After switching to hybrid detection:

  • Product name false positives: reduced from 40% to 3%
  • City name false positives: reduced from 100% of city mentions to near 0%
  • Actual person name detection: maintained at ~82% recall (slight reduction from 85% in exchange for precision gains)

The surveys are now usable. "iPhone," "Apple," "Samsung," and "Chicago" are preserved. Customer names in complaint-specific contexts are correctly anonymized.

The trade-off: hybrid detection is computationally more intensive. For large-scale processing, this translates to slightly longer processing time. For most business use cases, the precision improvement is worth the cost.

When to Accept Higher False Positive Rates

Some compliance contexts favor recall over precision:

HIPAA Safe Harbor de-identification: Missing a true positive (failing to remove a person name) is a HIPAA violation. A 10% false positive rate is acceptable if it ensures near-100% recall of actual PHI. Over-anonymization is preferable to under-anonymization.

High-stakes legal document review: Missing a privileged attorney-client name could waive privilege. False positives require attorney review but don't create legal liability.

General business analytics: Over-anonymization corrupts data without achieving compliance benefit. Precision matters more. Use hybrid detection with conservative thresholds.

The appropriate precision-recall tradeoff depends on the use case. Tools that allow threshold configuration provide the flexibility to optimize for the right outcome per context.
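Per-context threshold configuration can be as simple as a lookup table. The values below are illustrative only; real thresholds must be calibrated against labeled data from your own documents:

```python
# Illustrative per-context thresholds -- calibrate against your own
# labeled data; do not take these numbers as recommendations.
THRESHOLDS = {
    "business_analytics": 0.75,  # precision-critical: fewer false positives
    "legal_review":       0.50,
    "hipaa_deid":         0.35,  # recall-critical: catch nearly all PHI
}

def passes(score: float, context: str) -> bool:
    """Keep a detection only if it clears the context's threshold."""
    return score >= THRESHOLDS[context]

print(passes(0.6, "business_analytics"))  # False: too uncertain for analytics
print(passes(0.6, "hipaa_deid"))          # True: better to over-redact
```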

Conclusion

A 22.7% precision rate means that roughly 3 out of every 4 things your PII tool calls a "person name" are not person names. For business documents, this precision level renders anonymization output unusable for analytical purposes while providing false assurance of compliance.

Hybrid detection combining regex, NLP, and transformer-based contextual scoring improves precision to the point where anonymized data remains analytically useful. For organizations that abandoned Presidio due to false positive problems, this architecture is the solution — not a different configuration of the same model.


Ready to protect your data?

Start PII anonymization with 285+ entity types in 48 languages.