Research Publication PII: Why Your Data Analysis Screenshots Might Be Violating GDPR Without You Knowing

Academic papers regularly include pandas DataFrames and R output showing real patient records as methodology examples. Here's why this is a GDPR violation and how to screen manuscripts before submission.

March 7, 2026 · 7 min read
research data, academic GDPR, publication privacy, OCR image detection, Article 89

The Methodology Screenshot Problem

Academic and research publications have developed a documentation pattern that creates an underappreciated GDPR risk: screenshots of data analysis environments showing real data as part of demonstrating methodology.

The scenarios are common:

  • A machine learning paper includes a screenshot of a pandas DataFrame showing the first 10 rows of the training dataset — which contains real patient records from the data source
  • A clinical data analysis paper shows R output with individual patient values in a summary table, with patient IDs partially visible
  • A computational social science paper includes SPSS output tables that show individual survey respondent values as part of explaining the analysis procedure
  • A data engineering tutorial published in a research journal includes Jupyter notebook screenshots with real user records used as "sample data" for the illustration

In each case, the author did not intend to publish personal data. The screenshot was included to document methodology. The personal data in the screenshot was incidental — there to make the example concrete.

But "incidental" does not make it compliant. GDPR Article 4(1) defines personal data as any information relating to an identified or identifiable natural person. A patient record in a published paper, even as a screenshot, is personal data, and because it concerns health it also falls within the special categories of Article 9. Publishing it without a lawful basis under Article 6 and an applicable condition under Article 9(2), such as the patient's explicit consent, is a GDPR violation.

Research institutions increasingly face GDPR enforcement for data publication failures. Key developments:

Journal retraction requests: The GDPR right to erasure (Article 17) can apply to personal data that has been published. If a data subject discovers their personal data in a published paper, they can request erasure, which for a journal article typically means a retraction or a correction notice. Either outcome is a significant professional consequence.

Research ethics board findings: Research ethics committees reviewing published research for GDPR compliance have begun issuing findings for papers that include individual-level data in screenshots without appropriate safeguards. These findings affect researchers' standing with ethics boards for future research.

Data Access Agreement violations: Most research datasets are shared under Data Access Agreements that specify how data may be used and what may be published. Including individual-level data in publication screenshots, even as thumbnails, may violate the DAA — with consequences including loss of data access privileges.

GDPR Article 89 research exemption limitations: GDPR Article 89 allows processing of personal data for scientific research with reduced obligations — but only where "appropriate safeguards" are implemented. Publishing individual-level data in methodology screenshots without anonymization is not an appropriate safeguard; it is a disclosure.

The Scale of the Problem

The problem is probably not rare: a systematic review of data science papers published in high-impact journals between 2022 and 2024 would likely find that a significant proportion contain images with individual-level data visible.

The contributing factors:

Reproducibility norms: Modern scientific publishing increasingly requires that methods be documented with sufficient detail to reproduce results. Screenshots of analysis environments are seen as meeting this norm.

Speed of publication: Under deadline pressure, researchers generate screenshots quickly without reviewing each image for data content.

Low visibility of data in images: A screenshot of a DataFrame with 20 columns and 5 rows may have names and IDs in peripheral columns that the researcher doesn't focus on when documenting the analysis procedure.

No automated check in submission workflows: Standard journal submission portals perform completeness checks, format checks, and plagiarism screening. None perform image PII detection.
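
The low-visibility factor above suggests a cheap check a researcher could run before capturing a screenshot: list the columns whose names look like direct identifiers. The following is a minimal sketch, not a complete detector. The hint list and the function name `flag_identifier_columns` are assumptions for illustration, and name matching will miss identifiers stored under neutral column names.

```python
import re

# Hypothetical name fragments that often indicate direct identifiers.
# This list is an assumption for illustration and should be adapted to the dataset.
IDENTIFIER_HINTS = re.compile(
    r"(name|email|phone|address|birth|dob|ssn|patient_?id|user_?id)",
    re.IGNORECASE,
)


def flag_identifier_columns(columns):
    """Return the column names that look like direct identifiers.

    Accepts any iterable of column labels, for example a pandas df.columns.
    """
    return [c for c in columns if IDENTIFIER_HINTS.search(str(c))]


if __name__ == "__main__":
    cols = ["age", "Patient_ID", "glucose", "last_name", "Email"]
    print(flag_identifier_columns(cols))
```

Dropping or replacing the flagged columns before the screenshot is taken avoids the problem at the source, so later image screening acts as a second line of defence rather than the only one.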

Screening Implementation for Research Groups

A practical workflow for a research group implementing manuscript PII screening:

Pre-submission protocol:

  1. Researcher completes manuscript draft with all figures
  2. Draft submitted to internal screening (PI or designated reviewer)
  3. Image PII detection runs on all image files attached to manuscript
  4. Detection report identifies which images contain readable text and which of that text matches PII entity patterns
  5. Researcher reviews flagged images
  6. For each flagged image: replace with properly anonymized screenshot (substitute patient ID 12847 with ID 00001, replace real name with "Patient A")
  7. Final manuscript submitted to journal with anonymized screenshots
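
Steps 3, 4, and 6 above can be sketched in a few lines, assuming an OCR pass has already turned each figure into plain text. This is not the method of any particular screening tool: `PII_PATTERNS`, `screen_ocr_text`, `pseudonymize_ids`, and the five-digit patient ID format are all assumptions for illustration, and a real screener would cover far more entity types.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "patient_id": re.compile(r"\b\d{5}\b"),  # assumed 5-digit ID format
}


@dataclass(frozen=True)
class Finding:
    image: str        # figure file the text was extracted from
    entity_type: str  # key of the pattern that matched
    text: str         # the matched text itself


def screen_ocr_text(ocr_by_image: dict[str, str]) -> list[Finding]:
    """Build the detection report: one Finding per pattern match per image."""
    findings = []
    for image, text in ocr_by_image.items():
        for entity_type, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append(Finding(image, entity_type, match.group()))
    return findings


def pseudonymize_ids(real_ids: list[str]) -> dict[str, str]:
    """Map each distinct real ID to a sequential placeholder, in first-seen order."""
    mapping: dict[str, str] = {}
    for real_id in real_ids:
        if real_id not in mapping:
            mapping[real_id] = f"{len(mapping) + 1:05d}"
    return mapping


if __name__ == "__main__":
    # Text assumed to come from an upstream OCR step.
    ocr = {
        "fig1.png": "id 12847 glucose 5.4 contact j.doe@example.org",
        "fig2.png": "accuracy 0.93",
    }
    for finding in screen_ocr_text(ocr):
        print(finding)
    print(pseudonymize_ids(["12847", "33021", "12847"]))
```

The ID map exists only to regenerate the screenshot consistently. It links placeholders back to real records, so it should stay out of the manuscript and its supplementary files.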

Technical integration options:

  • Manual: export all manuscript images, run batch image PII detection, review report
  • Semi-automated: dedicated folder where draft manuscripts are deposited; weekly batch processing runs on new files
  • Workflow-integrated: institutional submission portal with pre-submission screening step
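
The semi-automated option could be as simple as a scheduled script that collects figure files added to the drop folder since the last run and hands them to the detection step. A minimal sketch follows, assuming files are identified by extension and modification time. The folder layout and the name `find_unscreened_figures` are hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Figure formats to screen; extend as needed for the group's workflow.
FIGURE_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf"}


def find_unscreened_figures(drop_folder: Path, last_run: float) -> list[Path]:
    """Return figure files under drop_folder modified after last_run.

    last_run is a Unix timestamp, for example the start time of the
    previous weekly batch.
    """
    return sorted(
        path
        for path in drop_folder.rglob("*")
        if path.is_file()
        and path.suffix.lower() in FIGURE_SUFFIXES
        and path.stat().st_mtime > last_run
    )
```

Modification time is a coarse signal: a file copied in with a preserved old timestamp would be skipped, so a group relying on this might instead keep a record of already-screened file hashes.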

The time cost of screening is low: for a typical 15-figure manuscript, image PII detection takes under 2 minutes. The time cost of a retraction or ethics board finding is measured in months.

Use Case: European University Research Ethics Requirement

A data science research group at a European university implemented image PII screening as part of its manuscript submission workflow following a near-miss: during review of a submitted paper, individual patient names were spotted in a DataFrame screenshot that had been included as a methodology illustration.

Implementation:

  • All draft papers processed for image PII before submission to journals
  • Screening covers all PNG, JPG, and PDF figures in the draft
  • Results reviewed by the group's designated data privacy contact

Results over 6 months:

  • 23 manuscripts screened before submission
  • 7 manuscripts (30%) had at least one image with detectable PII entities
  • Entity types found: patient names in DataFrames (4 papers), user IDs matching patient registration formats (2 papers), email addresses in screenshot margins (1 paper)
  • All 7 corrected before submission
  • Zero post-submission retraction requests or ethics findings during the period

The institution's research ethics committee now uses this workflow as a documented example of "appropriate safeguards" in GDPR Article 89 research exemption applications.
