Not All De-Identification Tools Are Equal
When evaluating PHI de-identification tools, accuracy is everything. A 4% difference in detection rate might seem small—until you realize that 4% of a million-record dataset is 40,000 exposed records.
Recent benchmarks from ECIR 2025 reveal dramatic differences in PHI detection accuracy across leading tools.
The ECIR 2025 Benchmark Results
| Tool | F1-Score | Precision | Recall |
|---|---|---|---|
| John Snow Labs | 96% | 95% | 97% |
| Azure AI | 91% | 90% | 92% |
| AWS Comprehend Medical | 83% | 81% | 85% |
| GPT-4o | 79% | 82% | 76% |
The F1-score is the harmonic mean of precision (the share of detected entities that were correct) and recall (the share of actual entities that were detected). Both matter:
- Low precision = false positives (over-redaction)
- Low recall = false negatives (missed PHI = potential breaches)
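A minimal sketch of how these metrics are computed for span-level entity detection. The gold and predicted spans below are hypothetical, invented for illustration; only the F1 formula and the benchmark precision/recall figures come from the tables above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical document: gold-standard vs. detected PHI spans,
# each as (entity_type, start_offset, end_offset).
gold = {("NAME", 0, 8), ("DATE", 20, 30), ("MRN", 45, 52)}
pred = {("NAME", 0, 8), ("DATE", 20, 30), ("NAME", 60, 65)}

true_positives = len(gold & pred)
precision = true_positives / len(pred)  # 2/3: one false positive (over-redaction)
recall = true_positives / len(gold)     # 2/3: one false negative (missed PHI)

print(f"P={precision:.2f} R={recall:.2f} F1={f1_score(precision, recall):.2f}")
# → P=0.67 R=0.67 F1=0.67
```

Applying the same formula to the benchmark's reported precision/recall reproduces its F1 column, e.g. `f1_score(0.95, 0.97)` rounds to 96% for John Snow Labs and `f1_score(0.82, 0.76)` rounds to 79% for GPT-4o.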
Why the Gap Exists
Training Data Differences
| Tool | Training Focus |
|---|---|
| John Snow Labs | Healthcare-specific, clinical notes |
| Azure AI | General medical + clinical |
| AWS Comprehend | General medical entities |
| GPT-4o | Broad training, not healthcare-specific |
John Snow Labs' models are trained specifically on clinical documentation—the messy, abbreviated, context-dependent text that healthcare actually produces.
Entity Type Coverage
Not all tools detect the same entities:
| Entity | John Snow | Azure | AWS | GPT-4o |
|---|---|---|---|---|