Reproducible Privacy: Why ML Teams Need Configuration Presets, Not Just Documentation
The DPO approved the anonymization procedure document. It specifies: remove names, emails, phone numbers, and dates of birth from training datasets using the Replace method. The document is 4 pages and lives in the compliance wiki.
Twelve data scientists consult it at project kickoff. They configure their own versions of the anonymization tool. Some add national IDs. Some include IP addresses. Some use Redact instead of Replace. Three months later, the training datasets are inconsistent.
The CNIL (France's DPA) investigated multiple AI companies in 2024 for improperly using personal data in training datasets. The investigations examined not just whether anonymization occurred but how consistently it was applied.
Documentation is necessary. It's not sufficient. The technical solution is the preset.
Why ML Training Data Requires Specific Configuration
ML training data anonymization has requirements that general document anonymization doesn't:
Replace, not Redact: Neural language models trained on text where names are replaced with [REDACTED] tokens learn that [REDACTED] is a special identifier appearing in name positions. This creates undesirable model behavior. The Replace method (substituting "John Smith" with "David Chen") preserves the statistical distribution of names in text while removing the identifying information. The model learns from realistic name-position distributions, not from a mask token.
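The distinction can be sketched in a few lines. This is an illustrative toy, not any particular anonymization tool's API; the function names, the name pool, and the hard-coded character span are all assumptions for the example.

```python
import random

# Pool of realistic substitute names (illustrative, not from any real tool).
NAME_POOL = ["David Chen", "Maria Rossi", "Anna Novak"]

def redact(text: str, span: tuple[int, int]) -> str:
    """Redact: every detected name becomes the same mask token."""
    start, end = span
    return text[:start] + "[REDACTED]" + text[end:]

def replace(text: str, span: tuple[int, int], rng: random.Random) -> str:
    """Replace: a detected name becomes a different, realistic name."""
    start, end = span
    return text[:start] + rng.choice(NAME_POOL) + text[end:]

sentence = "John Smith opened an account."
print(redact(sentence, (0, 10)))                    # [REDACTED] opened an account.
print(replace(sentence, (0, 10), random.Random(0)))  # a realistic name in place of "John Smith"
```

A model trained on the second output still sees a plausible name in the name position; a model trained on the first learns that `[REDACTED]` is a token that behaves like a name.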
Consistency across the dataset: A training dataset where 70% of names are replaced and 30% are [REDACTED] produces inconsistent training signal. All records should be processed identically.
Consistent entity selection: If the training dataset contains health data, removing names but not dates-of-birth in some records creates inconsistency. All 12 data scientists must remove the same set of entity types.
No over-anonymization: Over-applying the Replace method — removing dates that are merely timestamps, not dates of birth — degrades dataset utility without improving compliance. The approved preset defines exactly which date entities to remove (date-of-birth, not general timestamps).
Reproducibility across runs: If the same dataset needs to be reprocessed (e.g., after detecting a missed entity type), reprocessing with the same preset produces consistent output. Ad-hoc configurations are not reproducible.
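One way to make replacement reproducible is to derive the substitute deterministically from the original value and a per-dataset secret, so a rerun (or a reprocessing after a missed entity type is found) maps each name to the same substitute. A minimal sketch, assuming keyed hashing over a fixed name pool; this is not a specific tool's implementation:

```python
import hashlib
import hmac

NAME_POOL = ["David Chen", "Maria Rossi", "Anna Novak", "Liam Murphy"]

def deterministic_replacement(original: str, secret: bytes) -> str:
    """Map the same original value to the same substitute on every run.

    HMAC (keyed hashing) rather than a plain hash, so the mapping cannot
    be rebuilt by anyone who lacks the per-dataset secret.
    """
    digest = hmac.new(secret, original.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(NAME_POOL)
    return NAME_POOL[index]

secret = b"per-dataset-secret"
# Same input, same run or a later rerun: identical output.
assert deterministic_replacement("John Smith", secret) == \
       deterministic_replacement("John Smith", secret)
```

An ad-hoc configuration seeded differently on each machine cannot give this guarantee; a preset that fixes the secret handling and the pool can.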
The 12-Data-Scientist Problem
A European fintech company's ML team uses a training dataset derived from customer interaction logs. The DPO approved the processing purpose (model training for fraud detection) with conditions: all customer names, emails, phone numbers, and payment identifiers must be replaced using the Replace method before any model training.
Without presets:
- Data scientist 1 removes names, emails, phone numbers (doesn't include payment identifiers)
- Data scientist 2 includes payment identifiers but uses Redact not Replace
- Data scientist 3 follows the procedure document exactly
- Data scientists 4-12 vary
Result: 12 differently processed versions of the training data. The merged dataset is partially non-compliant, partially over-anonymized, and statistically inconsistent.
With DPO-approved preset:
- DPO creates "ML Training — Fraud Detection" preset with exact entity types and Replace method
- Preset shared with all 12 data scientists with instructions: "Use this preset for all training data preparation"
- Preset cannot be modified without DPO review (configuration access control)
Result: All 12 data scientists produce identical anonymization output. The merged dataset is consistent. Annual AI compliance audit passes without findings.
Prior year: 3 findings related to inconsistent ML training data anonymization. Post-preset: 0 findings.
The GDPR and AI Act Intersection
The EU AI Act (in force since August 2024, with obligations applying in stages) adds compliance requirements for AI systems using personal data for training. High-risk AI systems must document their training data, including anonymization measures applied.
GDPR's purpose limitation principle (Article 5(1)(b)) limits use of personal data for ML training without specific legal basis. The CNIL's 2024 enforcement actions against AI companies focused on this intersection: personal data collected for service delivery being used for training without adequate legal basis or anonymization.
The documentation requirements of both GDPR and the AI Act are easier to satisfy when the training data anonymization process is technically enforced through presets:
- Preset name and configuration: the documented anonymization methodology
- Processing logs: evidence that the methodology was applied to specific datasets
- DPO approval: recorded decision authorizing the preset configuration
This creates the audit trail that both regulations require.
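A processing-log record that ties a dataset to the exact preset applied can be as simple as the sketch below. The field names and the dataset identifier are illustrative assumptions; the one design point worth keeping is the content hash, which proves which configuration ran, not merely which name it carried.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_entry(preset: dict, dataset_id: str) -> dict:
    """One audit-trail record: which preset (by content hash) was applied
    to which dataset, and when. Field names are illustrative."""
    preset_bytes = json.dumps(preset, sort_keys=True).encode("utf-8")
    return {
        "dataset_id": dataset_id,
        "preset_name": preset["name"],
        "preset_version": preset["version"],
        # Hash of the full configuration, so a silently edited preset
        # with the same name is detectable in the log.
        "preset_sha256": hashlib.sha256(preset_bytes).hexdigest(),
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }

preset = {
    "name": "ML Training - Fraud Detection",
    "version": "2.1",
    "entities": {"PERSON": "replace", "EMAIL_ADDRESS": "replace"},
}
print(json.dumps(log_entry(preset, "interactions-2025-q1"), indent=2))
```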
Preset Configuration for ML Training Data
Entity types for most NLP training data:
- PERSON (names — Replace with similar names)
- EMAIL_ADDRESS (Replace with synthetic emails)
- PHONE_NUMBER (Replace with synthetic phone numbers)
- CREDIT_CARD / IBAN (Replace or Redact — payment data)
- LOCATION (Replace with similar locations if geo is needed for model; Redact if not)
- DATE_OF_BIRTH (Redact — age generalization often needed)
Entity types typically NOT included for NLP training data:
- General dates (not date-of-birth) — timestamps and dates in text are often needed for temporal modeling
- Organization names — often needed for entity recognition training
- URLs — often needed for linking and reference extraction
The ML lead and DPO define these distinctions in the approved preset. Individual data scientists don't make these decisions — they apply the preset.
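Encoded as configuration, the approved preset might look like the following sketch. The schema, entity labels, and preset name are assumptions for illustration, not a specific tool's format; the immutability (`frozen=True`, read-only mapping) mirrors the access-control point above: data scientists apply the preset, they don't edit it.

```python
from dataclasses import dataclass
from enum import Enum
from types import MappingProxyType

class Method(Enum):
    REPLACE = "replace"
    REDACT = "redact"

@dataclass(frozen=True)  # frozen: modification requires a new DPO-reviewed version
class Preset:
    name: str
    version: str
    operators: MappingProxyType  # read-only entity-type -> method mapping

# Illustrative DPO-approved configuration mirroring the entity list above.
ML_TRAINING = Preset(
    name="ML Training - Customer Data",
    version="2.1",
    operators=MappingProxyType({
        "PERSON": Method.REPLACE,
        "EMAIL_ADDRESS": Method.REPLACE,
        "PHONE_NUMBER": Method.REPLACE,
        "IBAN": Method.REPLACE,
        "DATE_OF_BIRTH": Method.REDACT,
        # General dates, organization names, and URLs are deliberately absent.
    }),
)
```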
Institutional Knowledge and Preset Versioning
Presets serve an institutional memory function:
Before presets: The correct entity configuration for ML training data lived in the minds of the three data scientists who had worked through the compliance review process. When two of them left in Q3, the institutional knowledge was lost.
After presets: The configuration is encoded in "ML Training — Customer Data v2.1". The version history shows when it was created, who approved it, and what changed between v2.0 and v2.1. New data scientists use the preset and inherit the institutional knowledge embedded in it.
Version 2.1 added IBAN detection after a compliance review found it was missing. Version 2.0 records show it was approved in February 2025. The audit trail is complete.
Conclusion
Documentation tells team members what to do. Presets make it technically easy — and technically enforceable — to do it consistently.
For ML training data specifically, consistency is both a compliance requirement (GDPR, AI Act) and a technical requirement (model training requires consistent preprocessing). The preset satisfies both simultaneously.
CNIL and other DPAs investigating AI training data practices will look for evidence of systematic, consistent anonymization. A preset applied uniformly across all training data preparation is the strongest evidence available.