
Batch Processing 50,000 Clinical Notes Locally: A Practical Guide to High-Volume PHI De-Identification

A February 2026 SDNY ruling found AI-processed documents lose attorney-client privilege if not anonymized before processing. Healthcare research organizations need to de-identify hundreds of thousands of notes. Cloud upload raises both practical and regulatory concerns.

March 5, 2026 · 8 min read
Tags: batch PHI de-identification, clinical notes processing, HIPAA local processing, research dataset compliance, IRB requirements

The Volume Problem in Clinical Research

A clinical research organization building a de-identified dataset from 500,000 patient consultation notes faces a gap that cloud-based de-identification tools cannot close: the volume is too large for cloud upload, the regulatory environment requires on-premises processing, and the manual alternative is not feasible.

The HIPAA Privacy Rule's Expert Determination method requires that the risk of re-identification in a de-identified dataset be "very small", a statistical standard that must be determined and documented by a person with appropriate knowledge of and experience with generally accepted statistical and scientific methods. An IRB (Institutional Review Board) approving research that uses de-identified patient data requires documentation of the de-identification method, the entity types removed, and the quality controls applied. Because of this documentation requirement, de-identification cannot be a black-box process: the research organization must be able to explain exactly what was detected, what was removed, and how the process was validated.

Cloud processing of 500,000 clinical notes raises two separate concerns. The first is practical: uploading 500,000 files through any API runs into rate limits, bandwidth constraints, and per-request costs that make batch cloud processing impractical for large research datasets. The second is regulatory: under HIPAA, transmitting protected health information to a Business Associate (even a de-identification service provider) requires a Business Associate Agreement (BAA). For research data under IRB protocols, the BAA requirements may intersect with IRB data use agreements in ways that require legal review. Local processing eliminates the transmission concern entirely.

The Privilege Implications

A February 2026 SDNY ruling found that AI-processed documents lose attorney-client privilege if the documents were not appropriately anonymized before processing. The ruling applied to a law firm that had submitted client documents to an AI document review tool without anonymizing client information first. The court held that submitting privileged documents to an external AI provider constituted a disclosure that waived privilege for the analyzed content.

While this ruling is in the legal context rather than healthcare, the principle extends to other professional privilege situations: physician-patient communications submitted to AI analysis services, therapist session notes processed by cloud-based NLP tools, and similar scenarios where professional privilege attaches to the content. Local processing — where the documents never leave the professional's controlled environment — avoids the transmission that triggers the privilege waiver analysis.

The Practical Batch Architecture

For a clinical research organization processing 50,000 notes:

Batch configuration: The Desktop App processes files in batches of 1 to 5,000 files, depending on the subscription tier. A single overnight run of ten batches of 5,000 files each handles the full 50,000-note dataset without manual intervention. Files within a batch are processed sequentially unless parallel execution (1 to 5 concurrent files) is enabled, which increases throughput.
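The batching arithmetic above can be sketched in a few lines of Python. This is only an illustration of the pattern (split the corpus into batches of 5,000, run the batches one after another, and process up to five files at a time within each batch), not the Desktop App's actual implementation. The names `chunked`, `run_batches`, and `fake_deidentify` are hypothetical, and `fake_deidentify` is a stand-in for whatever local de-identification call is really used.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List

BATCH_SIZE = 5_000    # largest batch size mentioned in the text
MAX_CONCURRENT = 5    # largest concurrency mentioned in the text


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_batches(
    paths: Iterable[str],
    process_file: Callable[[str], str],
    batch_size: int = BATCH_SIZE,
    workers: int = MAX_CONCURRENT,
) -> List[List[str]]:
    """Run batches one after another; within a batch, use up to `workers` threads."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in chunked(paths, batch_size):
            # pool.map returns results in input order, so output order is stable.
            results.append(list(pool.map(process_file, batch)))
    return results


def fake_deidentify(path: str) -> str:
    """Hypothetical placeholder for the real local de-identification step."""
    return path + ".deid"


paths = [f"note_{i:05d}.txt" for i in range(50_000)]
batches = run_batches(paths, fake_deidentify)
print(len(batches), len(batches[0]))  # prints "10 5000"
```

Setting `workers=1` reproduces the purely sequential mode; raising it trades memory and CPU for throughput, and the right value depends on the machine.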

Entity type configuration: Healthcare-specific entity types — MRN formats, NPI, DEA numbers, health plan beneficiary IDs, HIPAA-specified date formats — are configured once in a named preset. The same preset applies consistently across all batches in the research dataset, ensuring that de-identification standards are uniform across the full corpus.
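To make the idea of a reusable preset concrete, here is a minimal sketch of what a rule set for three of these identifiers might look like. It is an assumption-laden illustration, not the Desktop App's preset format: `EntityRule`, `PRESET`, and `find_entities` are hypothetical names, and the MRN pattern is invented, since MRN formats are institution-specific. The NPI check uses the Luhn algorithm over the identifier prefixed with 80840, and the DEA check uses the standard check-digit formula on a simplified two-letter, seven-digit format.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """NPI: 10 digits, Luhn-checked after prefixing '80840'."""
    return len(npi) == 10 and npi.isdigit() and luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """DEA (simplified format): two letters, seven digits, last digit is a checksum."""
    if not re.fullmatch(r"[A-Z]{2}\d{7}", dea):
        return False
    d = [int(c) for c in dea[2:]]
    total = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]


@dataclass(frozen=True)
class EntityRule:
    name: str
    pattern: re.Pattern
    validator: Optional[Callable[[str], bool]] = None


# Hypothetical preset; the MRN pattern is an assumed, site-specific format.
PRESET: List[EntityRule] = [
    EntityRule("MRN", re.compile(r"\bMRN[:#]?\s*(\d{6,10})\b")),
    EntityRule("NPI", re.compile(r"\b(\d{10})\b"), npi_valid),
    EntityRule("DEA", re.compile(r"\b([A-Z]{2}\d{7})\b"), dea_valid),
]


def find_entities(text: str, preset: List[EntityRule] = PRESET) -> List[Tuple[str, str]]:
    """Return (entity_type, value) pairs for every match that passes its validator."""
    hits = []
    for rule in preset:
        for m in rule.pattern.finditer(text):
            value = m.group(1)
            if rule.validator is None or rule.validator(value):
                hits.append((rule.name, value))
    return hits


note = "Pt MRN: 00482913 seen by NPI 1234567893, DEA AB1234563."
print(find_entities(note))
```

Keeping the rules in one immutable list is what makes the "configure once, apply everywhere" property easy to audit: every batch is run against the same object, and the checksum validators cut down on false positives from arbitrary digit strings.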

Processing metadata: Each batch run produces a CSV/JSON export with processing metadata: file name, entities detected, entity types, confidence scores, and processing timestamp. This metadata supports the IRB documentation requirement for Expert Determination de-identification, because the research organization can demonstrate exactly what was detected and removed in each document.
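As a rough illustration of what such an export could contain, the sketch below writes one summary row per file to CSV and the full per-detection detail to JSON. The field names follow the list above, but the exact schema (`FileRecord`, `Detection`, the `min_confidence` column, the semicolon-joined type list) is an assumption made for this example, not the Desktop App's actual export format.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, TextIO

CSV_FIELDS = ["file_name", "entities_detected", "entity_types",
              "min_confidence", "processed_at"]


@dataclass
class Detection:
    entity_type: str
    confidence: float


@dataclass
class FileRecord:
    file_name: str
    detections: List[Detection]
    processed_at: str  # ISO 8601 timestamp, UTC


def summarize(record: FileRecord) -> Dict[str, object]:
    """Collapse one file's detections into a single audit row."""
    confidences = [d.confidence for d in record.detections]
    return {
        "file_name": record.file_name,
        "entities_detected": len(record.detections),
        "entity_types": ";".join(sorted({d.entity_type for d in record.detections})),
        "min_confidence": min(confidences) if confidences else "",
        "processed_at": record.processed_at,
    }


def write_csv(records: List[FileRecord], out: TextIO) -> None:
    """Write a header plus one summary row per processed file."""
    writer = csv.DictWriter(out, fieldnames=CSV_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(summarize(record))


def to_json(records: List[FileRecord]) -> str:
    """Serialize the full per-detection detail for archival."""
    return json.dumps([asdict(r) for r in records], indent=2)


records = [
    FileRecord(
        "note_00001.txt",
        [Detection("MRN", 0.99), Detection("DATE", 0.87)],
        "2026-03-05T02:14:07+00:00",
    )
]
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
print(to_json(records))
```

The lowest confidence per file is a convenient column for quality control: sorting the CSV by it surfaces the documents most worth a manual spot check.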
