The 22.7% Precision Problem in Production
A 2024 benchmark study of Microsoft Presidio — the open-source PII detection engine used in legal technology, healthcare, and enterprise data protection applications — found a 22.7% precision rate for person name detection in business document contexts.
Precision measures the accuracy of positive identifications: what percentage of the items the tool flagged as "person names" are actually person names. At 22.7%, approximately 77 out of every 100 items flagged as person names are false positives.
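The arithmetic behind these figures is easy to check. A short sketch (the implied total-flag count is derived here from the reported numbers, not stated in the study):

```python
# Precision = true positives / (true positives + false positives).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

# At 22.7% precision, roughly 77 of every 100 flagged items are false positives.
false_positives = round(100 * (1 - 0.227))
print(false_positives)  # 77

# The benchmark reports 13,536 false positive name detections. At 22.7%
# precision, that implies roughly 17,511 total name flags (derived figure).
implied_total = round(13_536 / (1 - 0.227))
print(implied_total)  # 17511
```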
The benchmark documented 13,536 false positive name detections across 4,434 document samples. The false positives included:
- Pronouns flagged as person names ("I" appearing at the start of sentences)
- Vessel names flagged as person names ("ASL Scorpio")
- Organization names flagged as person names ("Deloitte & Touche")
- Country names flagged as person names ("Argentina," "Singapore")
These are not edge cases. They are systematic patterns that emerge when a general-purpose NLP model trained on mixed corpora is applied to domain-specific document types where proper nouns appear in contexts the model was not trained to disambiguate.
The Cost Structure of False Positives at Scale
In legal and healthcare environments, false positives are not free. Every item flagged requires a disposition: either human review to confirm or reject the flag, or automatic processing that leaves the false positive uncorrected.
Option 1: Human review of every flagged item. At $200 to $800 per hour for attorney or specialist time, reviewing the output of a 22.7% precision system is economically prohibitive at scale. A 10,000-document production with 10 flagged items per document yields 100,000 flagged items; at 22.7% precision, approximately 77,300 of those are false positives that still require human review. At 5 minutes per item and $300 per hour, that is roughly 6,442 hours of review time — approximately $1.9 million.
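The review-cost figures work out as follows. A back-of-the-envelope sketch, assuming ten flagged items per document (the assumption that reproduces the quoted 77,300-item and $1.9 million figures):

```python
# Back-of-the-envelope cost of reviewing false positives at 22.7% precision.
documents = 10_000
flags_per_doc = 10          # assumed flagged items per document
detection_precision = 0.227
minutes_per_item = 5
hourly_rate = 300           # USD per hour of reviewer time

flagged = documents * flags_per_doc                           # 100,000 items
false_positives = round(flagged * (1 - detection_precision))  # ~77,300
review_hours = false_positives * minutes_per_item / 60
review_cost = review_hours * hourly_rate

print(false_positives)       # 77300
print(round(review_hours))   # 6442
print(round(review_cost))    # 1932500
```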
Option 2: Skip manual review and accept automatic processing. The result is a production in which roughly 77% of "redacted" items were not actually sensitive — creating over-redaction liability (discoverable content withheld without grounds), destroying document utility, and potentially triggering sanctions.
Option 3: Score thresholds. Presidio allows score_threshold configuration to reduce false positives by only flagging items above a confidence threshold. A 2024 benchmark study of DICOM medical imaging documents found that even with score_threshold=0.7 — a relatively aggressive precision filter — 38 out of 39 DICOM images still had false positive entities. Score thresholds reduce but do not eliminate the false positive problem for pure ML detection.
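The limitation of threshold filtering can be sketched independently of any particular engine. A minimal illustration (the detections and scores below are invented examples, not taken from the benchmark):

```python
# Keep only detections whose confidence score meets the threshold.
def apply_threshold(detections, score_threshold):
    return [d for d in detections if d[2] >= score_threshold]

# Illustrative detections: (text, entity_type, confidence score).
detections = [
    ("I", "PERSON", 0.85),            # pronoun misread as a name
    ("ASL Scorpio", "PERSON", 0.65),  # vessel name misread as a name
    ("Jane Smith", "PERSON", 0.92),   # genuine person name
]

# A 0.7 threshold drops the low-scoring vessel name but keeps the
# high-scoring pronoun: thresholds reduce, not eliminate, false positives.
kept = apply_threshold(detections, score_threshold=0.7)
print([d[0] for d in kept])  # ['I', 'Jane Smith']
```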
Why Pure ML Fails Domain-Specific Documents
The Presidio false positive pattern reflects a fundamental limitation of general-purpose NLP models in domain-specific contexts:
Legal documents contain specialized proper nouns — case names, statute names, exhibit designations — that share surface-level patterns with person names. A model trained on general text learns that capitalized proper nouns are often person names. A legal document contains hundreds of capitalized proper nouns that are not person names.
Healthcare documents contain medication names, device names, and procedural codes that include letter sequences resembling name abbreviations. Clinical text also contains abbreviations ("Pt." for Patient, "Dr." for Doctor) that interact unpredictably with name detection.
Financial documents contain product names, entity names, and identifier codes that share patterns with personal identifiers.
Domain-specific tuning addresses these patterns, but requires significant investment in fine-tuning datasets and continuous maintenance as document types evolve.
The Hybrid Architecture Solution
The false positive problem is structurally solvable through hybrid detection that separates structured data (where regex provides 100% precision) from contextual data (where ML provides pattern recognition with calibrated confidence).
Regex for structured identifiers: SSNs, phone numbers, email addresses, credit card numbers, national ID formats, bank account numbers. These formats are deterministic — a string either matches the pattern (and, where the format defines one, passes checksum validation, such as the Luhn check for credit card numbers) or it does not. A correctly implemented pattern produces effectively zero false positives.
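A minimal sketch of deterministic detection for one structured type — credit card numbers via pattern match plus Luhn checksum. The regex here is deliberately simplified for illustration, not production-grade:

```python
import re

# Simplified pattern: 13-16 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from any double over 9, and check the sum mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    # A hit must both fit the pattern and pass the checksum --
    # the combination is what makes detection deterministic.
    return [m.group() for m in CARD_PATTERN.finditer(text)
            if luhn_valid(m.group())]

text = "Card 4111 1111 1111 1111 is valid; 4111 1111 1111 1112 is not."
print(find_card_numbers(text))  # ['4111 1111 1111 1111']
```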
NLP for contextual entities: Person names, organization names, locations in unstructured text. NLP models provide recall for entities that lack structural patterns. Confidence scoring and context word requirements reduce false positives.
Threshold configuration per entity type: Setting a 90% confidence threshold for person names while using regex-certainty (effectively 100%) for SSNs allows calibration to domain-specific false positive tolerances. Legal teams that cannot tolerate over-redaction risk set higher thresholds; clinical research teams maximizing de-identification recall set lower ones.
The result: dramatically lower false positive rates than Presidio defaults while maintaining the recall that pure pattern matching cannot achieve. For legal and healthcare organizations evaluating automated redaction tools, the precision-recall tradeoff is manageable — but only with a tool that exposes it as a configurable parameter rather than a fixed system behavior.
Sources: