GDPR & Compliance

Why 'Delete the Email Column' Isn't Enough: Detecting PII in CSV Free-Text Fields for Research Data Sharing

Survey CSVs contain PII not just in structured columns but in free-text responses. Standard column deletion misses the PII that violates GDPR's anonymization standard.

March 7, 2026 · 7 min read
research data, CSV anonymization, GDPR Article 89, survey data, data sharing

The Structural vs. Free-Text PII Problem

Research data shared between academic institutions travels most commonly in CSV format. When researchers prepare CSVs for sharing, the standard anonymization checklist is column-based: identify columns containing personal data, delete or pseudonymize those columns.

This approach handles structured PII reliably. A column named "email" contains email addresses: delete it. A column named "phone" contains phone numbers: delete it. A column named "participant_name" contains names: pseudonymize it.

What the column-deletion approach misses: PII embedded in free-text response columns.

A survey dataset with 5,000 rows and 20 columns might have:

  • 5 structured PII columns (name, email, phone, ID, birth year)
  • 15 free-text response columns ("additional_comments", "describe_experience", "what_would_improve", "other_details")

The structured columns are cleaned by column deletion. The free-text columns are left as-is. But survey respondents write things like:

  • "My doctor at Boston Medical Center, Dr. Maria Santos, said the treatment was experimental"
  • "I've been dealing with this since my accident in 2019 when John Henderson's car hit mine"
  • "You can reach my caregiver at margaret.wells@gmail.com if you need more information"

These entries contain named individuals, institutional affiliations, health information, and contact details — none of which appear in the column headers, and none of which are caught by column-deletion anonymization.
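The gap can be made concrete with a short audit sketch. Assuming hypothetical rows where the structured email column has already been dropped (the free-text strings reuse this article's own examples), a simple regex pass still surfaces contact details left behind:

```python
import re

# Hypothetical rows after column-deletion "anonymization": the structured
# email column is gone, but free-text responses were kept as-is.
rows = [
    {"respondent_id": "R001",
     "additional_comments": "You can reach my caregiver at margaret.wells@gmail.com"},
    {"respondent_id": "R002",
     "additional_comments": "Dr. Maria Santos at Boston Medical Center said it was experimental"},
]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Audit pass: any email surviving in "anonymized" free text is a leak.
leaks = [(row["respondent_id"], m.group())
         for row in rows
         for m in EMAIL_RE.finditer(row["additional_comments"])]
print(leaks)
```

Note what the regex does and does not catch: it flags the email in R001 but says nothing about the named doctor and hospital in R002. That remaining gap is exactly the free-text detection problem discussed below.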

Why This Fails GDPR's Anonymization Standard

GDPR Recital 26 defines anonymous data as information that "does not relate to an identified or identifiable natural person." The bar is high: data counts as anonymous only where the data subject can no longer be identified by any means reasonably likely to be used.

A partially-anonymized research CSV — structured columns cleaned, free-text columns containing named individuals — does not meet this standard. The named individuals in free-text responses are identifiable, and the dataset therefore remains personal data subject to GDPR Article 89 safeguard requirements.

This matters for several research contexts:

Article 89 research exemption: GDPR Article 89 allows processing of personal data for scientific research purposes with reduced obligations, but only where "appropriate safeguards" are in place. Sharing a dataset that is partially anonymized (but still contains PII in free-text) while claiming it satisfies the Article 89 safeguards is a compliance failure.

Research ethics board approval: Most academic IRBs and ethics review boards require that shared datasets be genuinely anonymized. Partial anonymization that leaves free-text PII intact typically does not satisfy ethics approval conditions.

Data sharing agreements between institutions: DSAs for research data typically specify that shared data must be anonymized to a defined standard. Partial anonymization that fails GDPR Recital 26 may breach the DSA.

The Technical Challenge of Free-Text PII Detection

Free-text survey responses are among the most challenging PII detection targets because:

Contextual naming: "Dr. Maria Santos at Boston Medical Center" requires named-entity recognition (NER) to detect "Maria Santos" as a person and "Boston Medical Center" as an organization; keyword matching cannot anticipate these patterns.

Incidental identification: "John Henderson's car hit mine" requires NER to identify "John Henderson" as a named individual in a narrative context — not a data field but a person referenced in a story.

Contact information in unexpected formats: Email addresses and phone numbers appearing in free-text may have non-standard formatting ("reach me at margaret dot wells at gmail") that regex-only detection misses.

Research-specific entity types: Academic and clinical research data often contains institutional identifiers (hospital IDs, research site codes), clinical terminology, and location references that are PII in context even if not obviously so.

This is why NLP-based detection — rather than pattern matching alone — is necessary for genuine free-text survey anonymization.
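To illustrate the formatting problem in isolation (named individuals still require a real NER model), a small normalization pass can rewrite spelled-out separators before a standard email regex runs. This is a deliberately naive heuristic sketch with a hypothetical input string, not a production technique:

```python
import re

def normalize_obfuscated(text: str) -> str:
    # Rewrite spelled-out email separators ("dot", "at") so a standard
    # regex can match. Deliberately naive: real text contains legitimate
    # uses of "at" and "dot" that this would mangle without context.
    text = re.sub(r"\s+at\s+", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+dot\s+", ".", text, flags=re.IGNORECASE)
    return text

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

raw = "margaret dot wells at gmail dot com"
found = EMAIL_RE.search(normalize_obfuscated(raw))
print(found.group() if found else None)
```

Even this tiny example shows why pattern matching alone runs out of road quickly: every obfuscation variant needs its own rewrite rule, whereas an NLP model reasons over the surrounding context.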

Use Case: Multi-Institution Research Consortium

A research consortium at three European universities conducted a patient experience survey: 5,000 respondents, 3 structured PII columns, and 8 free-text response columns. The data was to be shared between institutions for collaborative analysis under a Data Sharing Agreement and GDPR Article 89 exemption.

Standard approach (column deletion only):

  • 3 structured PII columns removed
  • 8 free-text columns retained as-is
  • Compliance claim: "PII columns deleted"
  • Actual PII remaining: 47 named individuals mentioned in free-text responses, 23 email addresses volunteered in comments, 18 location references that could identify respondents in context

With free-text NLP detection:

  • 3 structured PII columns pseudonymized (consistent tokens, not deleted — preserving row count integrity)
  • 8 free-text columns processed: 47 person names detected and replaced, 23 email addresses detected and masked, 18 location references detected and generalized ("Boston Medical Center" → "[Healthcare Institution]")
  • Output: genuinely anonymized dataset meeting GDPR Recital 26 standard
  • Research ethics committee accepted the anonymization methodology
  • DSA compliance confirmed by DPO review

The difference: the second approach produces a dataset that actually satisfies the anonymization standard. The first approach produces a dataset that appears anonymized but contains identifiable information in the columns that weren't reviewed.

Building a Research Data Anonymization Protocol

For research teams working with survey and interview data, a structured pre-sharing protocol looks like this:

Step 1: Column classification

  • Categorize all columns: structured PII, structured non-PII, free-text response
  • Document the classification
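Step 1 can be partially automated. The sketch below classifies columns from header names plus a sample of values; the header keywords and thresholds are illustrative assumptions, and in practice the output is a starting point for manual review, not a substitute for it:

```python
# Heuristic column classifier (a sketch; pair with manual review).
# The PII header list and free-text hints below are hypothetical.
STRUCTURED_PII = {"name", "email", "phone", "participant_id", "birth_year"}
FREE_TEXT_HINTS = ("comment", "describe", "experience", "improve", "details")

def classify_column(header: str, sample_values: list[str]) -> str:
    h = header.lower()
    if h in STRUCTURED_PII:
        return "structured_pii"
    if any(hint in h for hint in FREE_TEXT_HINTS):
        return "free_text"
    # Long, multi-word values suggest free text even without a header hint.
    if sample_values:
        avg_words = sum(len(v.split()) for v in sample_values) / len(sample_values)
        if avg_words > 5:
            return "free_text"
    return "structured_non_pii"

print(classify_column("email", []))
print(classify_column("additional_comments", []))
print(classify_column("age_group", ["18-24", "25-34"]))
```

The returned labels map directly onto the three categories above, and printing them per column produces the classification record Step 1 asks you to document.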

Step 2: Structured PII handling

  • Delete (if not needed for research) or pseudonymize (if needed for record linkage)
  • Document replacement tokens used
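When record linkage is needed, pseudonyms must be consistent: the same input value maps to the same token every time. One common way to get this is a keyed hash (HMAC), sketched below; the key name and token format are illustrative, and the key must be stored separately from the shared dataset:

```python
import hashlib
import hmac

# Kept off the shared dataset; whoever holds it could re-link records.
SECRET_KEY = b"per-project-secret-stored-separately"

def pseudonymize(value: str, prefix: str = "P") -> str:
    # Keyed hash: deterministic for a given key, so the same participant
    # gets the same token across files, but unguessable without the key.
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:10]}"

t1 = pseudonymize("maria.santos@example.org")
t2 = pseudonymize("maria.santos@example.org")
print(t1, t1 == t2)  # consistent token for the same input
```

A plain unkeyed hash would not be enough here: email addresses and names are low-entropy, so an attacker could hash candidate values and compare. The secret key is what blocks that.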

Step 3: Free-text content analysis

  • Run NLP detection on all free-text columns
  • Review detected entities: confirm which represent genuine PII
  • Apply replacements for confirmed PII entities
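The replacement step might look like the sketch below: pattern-detectable entities (emails) are masked by regex, while the person and organization names, which in a real pipeline come out of NER plus the human review pass above, are listed here as an already-confirmed dictionary using this article's own examples:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Entities confirmed as genuine PII during the Step 3 review pass.
# In practice this mapping is produced by NER + reviewer confirmation.
confirmed = {
    "Maria Santos": "[PERSON]",
    "Boston Medical Center": "[Healthcare Institution]",
}

def redact(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    for entity, token in confirmed.items():
        text = text.replace(entity, token)
    return text

out = redact("Dr. Maria Santos at Boston Medical Center, margaret.wells@gmail.com")
print(out)
```

Replacing with typed tokens ("[PERSON]", "[Healthcare Institution]") rather than deleting the sentence preserves the analytic value of the response, which matters for the generalization approach described in the consortium example above.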

Step 4: Verification

  • Sample 50-100 rows from the output dataset
  • Manual review of any free-text entries containing detected entities
  • Confirm that the detection rate is appropriate for the column type
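A fixed-seed sample makes the verification step reproducible for the ethics file. The sketch below draws a sample and flags rows where residual emails survive; a real audit would also check names, phone numbers, and locations, and the data here is synthetic:

```python
import random
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sample_audit(rows, text_cols, n=100, seed=0):
    # Fixed seed so the same sample can be re-drawn for the audit record.
    picked = random.Random(seed).sample(rows, min(n, len(rows)))
    flagged = [r for r in picked
               if any(EMAIL_RE.search(r.get(c, "")) for c in text_cols)]
    return picked, flagged

# Synthetic output dataset: 200 cleanly masked rows plus one planted leak.
rows = [{"id": i, "comments": "[EMAIL] was masked"} for i in range(200)]
rows.append({"id": 200, "comments": "contact bob@leaked.net"})

picked, flagged = sample_audit(rows, ["comments"])
print(len(picked), len(flagged))
```

Any flagged row goes back to manual review; a nonzero flag count on a supposedly finished dataset means Step 3 needs another pass before sharing.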

Step 5: Documentation

  • Anonymization methodology document: tools used, entity types detected, columns processed
  • Share methodology document alongside anonymized dataset for ethics review
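The methodology document can also be emitted as a machine-readable record alongside the human-readable write-up. The field names below are illustrative, not a standard schema, and the dataset name is hypothetical:

```python
import json
from datetime import date

# Minimal machine-readable methodology record to ship with the dataset.
methodology = {
    "dataset": "patient_experience_survey.csv",   # hypothetical filename
    "processed_on": date.today().isoformat(),
    "structured_pii_columns": ["name", "email", "phone"],
    "free_text_columns_scanned": 8,
    "entity_types_detected": ["PERSON", "EMAIL", "LOCATION", "ORG"],
    "replacement_strategy": "consistent pseudonyms; typed tokens in free text",
    "verification": "fixed-seed sample audit of 100 rows",
}
print(json.dumps(methodology, indent=2))
```

Keeping this record under version control next to the anonymization scripts gives the ethics board and the DPO a single artifact to review.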

This protocol transforms "we deleted the name column" into a defensible, documented anonymization process that satisfies GDPR Article 89 and institutional research ethics requirements.

