
EU AI Act August 2026: Anonymizing Training Data to Meet Article 10

EU AI Act full enforcement begins August 2, 2026. Penalties up to €35M or 7% of global turnover. Article 10 requires training data governance — anonymization is the key compliance measure.

March 16, 2026 · 9 min read

EU AI Act · training data · Article 10 · GDPR compliance · AI regulation · 2026 deadline · data governance

The Countdown Has Started

The EU AI Act's enforcement timeline is no longer abstract. High-risk AI system requirements — including Article 10's training data governance mandate — apply from August 2, 2026. Organizations that train, fine-tune, or deploy high-risk AI systems and have not established compliant training data practices have approximately five months to remediate.

The maximum penalties are larger than GDPR's: up to €35 million or 7% of global annual turnover, whichever is higher, versus GDPR's cap of €20 million or 4%. The EU AI Act is the highest-stakes AI regulation in force anywhere in the world, and its penalties are calibrated so that even large technology companies cannot absorb non-compliance as a cost of doing business.

What Makes an AI System "High-Risk"?

The AI Act's risk classification determines which obligations apply. High-risk systems (Annex III) include AI used in:

  • Education and vocational training — systems that determine access to educational institutions or assess students
  • Employment — CV screening, interview scoring, workforce monitoring
  • Essential services — creditworthiness assessment, insurance pricing, emergency dispatch
  • Law enforcement — predictive policing, crime analytics, biometric identification
  • Healthcare — medical device software, clinical decision support, patient triage
  • Critical infrastructure — systems managing energy, water, transport networks
  • Administration of justice — legal research tools, sentence recommendation systems

If your organization trains or deploys AI in any of these categories, Article 10 applies to you.

Article 10: What It Actually Requires

Article 10 of the EU AI Act establishes requirements for training, validation, and testing datasets used by high-risk AI systems. The key requirements:

1. Data Governance Practices

Training datasets must be subject to "appropriate data governance and management practices." This includes documented procedures for data collection, data quality assessment, and ongoing monitoring. Practices must cover the purpose for which the data is used and the categories of data collected.

2. Examination for Biases

Training data must be examined for "possible biases" that could lead to discriminatory outputs. This requirement is operationally significant: it mandates active bias testing, not merely the absence of intentionally discriminatory design.

3. Relevance, Representativeness, and Accuracy

Datasets must be "relevant, sufficiently representative, and to the best extent possible, free of errors." This creates a quality obligation that extends to data collection methodology — convenience samples or crawled web data that systematically underrepresents certain populations may not satisfy this requirement for high-risk applications.

4. Special Categories of Personal Data

Article 10(5) provides the most directly actionable obligation for organizations with existing datasets: when high-risk AI systems involve the processing of special categories of personal data (health data, racial or ethnic origin, political opinions, religious beliefs, biometric data), these categories may only be processed when "strictly necessary for the purposes of ensuring bias monitoring, detection, and correction" and "subject to appropriate safeguards for the fundamental rights and interests of natural persons."

The practical consequence: Most training datasets used for high-risk AI systems contain personal data, and many contain special categories. Article 10 requires that this data be processed only to the minimum extent necessary and subject to appropriate technical safeguards — of which anonymization is the most robust.

The Penalty Math: Why This Exceeds GDPR

The EU AI Act's penalty structure exceeds GDPR for intentional or negligent violations:

| Regulation | Maximum Penalty | Turnover Cap |
| --- | --- | --- |
| GDPR | €20 million | 4% of global turnover |
| EU AI Act (high-risk systems) | €15 million | 3% of global turnover |
| EU AI Act (prohibited practices) | €35 million | 7% of global turnover |

For training data violations, the applicable tier is the high-risk system tier (€15M / 3%). However, if a DPA determines that training on personal data without adequate safeguards constitutes a prohibited practice — a determination that becomes more plausible as the Act's enforcement practice develops — the prohibited practice penalties apply.

For a company with €500M annual turnover: 3% = €15M. For a company with €5B turnover: 3% = €150M. These are not theoretical maximums — they are the actual calculation the regulators will apply.

Why Anonymization Is the Compliance Answer

Anonymization creates a fundamental legal simplification: anonymized data is outside the scope of GDPR, and by extension, reduces the AI Act risk surface for training data governance.

Article 10's most burdensome requirements — special category handling, bias monitoring with personal data, data subject rights in training sets — apply because the training data contains personal data. If the training data is genuinely anonymized before training begins, these requirements are either eliminated or substantially reduced.

The CNIL (French data protection authority) published AI training recommendations in early 2026 explicitly stating: "Data minimization before training — including anonymization of personal data not strictly required for model performance — is the primary technical measure for compliance with Article 10."

This is not a fringe interpretation. It is the mainstream enforcement posture of the EU's most technically sophisticated DPA.

What Anonymization Means for Training Data — Practically

Anonymization of training data is not the same as anonymization of production data. Training data typically consists of:

  • Documents with embedded PII — contracts, emails, reports, support tickets used as fine-tuning examples
  • Structured records — tables of customer data used to train predictive models
  • Labeled datasets — images or text with annotations that may contain personal identifiers
  • Synthetic data based on real records — where the synthetic generation process may preserve identifying patterns

Effective anonymization for training data requires detecting PII across all of these formats and replacing or masking it before the training job runs. The entity detection must be comprehensive — a model trained on data where "John Smith" has been replaced but where "the patient at 42 Oak Street, Springfield" remains will learn to associate location patterns with demographic predictions.
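A toy illustration of the failure mode described above, using only simple string and regex replacement (not the anonym.legal API): redacting the name alone leaves quasi-identifiers such as the street address and city, which a model can still learn to exploit.

```python
import re

record = "John Smith, the patient at 42 Oak Street, Springfield, reported symptoms."

# Naive approach: replace only the known person name
name_only = record.replace("John Smith", "[PERSON]")

# Broader approach: also mask the street address and city (toy patterns for illustration)
masked = re.sub(r"\d+\s+\w+\s+Street", "[ADDRESS]", name_only)
masked = masked.replace("Springfield", "[CITY]")

print(name_only)  # the address and city survive — the record is still re-identifiable
print(masked)     # all three identifiers are now masked
```

Real entity detection must cover names, addresses, locations, dates, and domain-specific identifiers together; masking any one category in isolation leaves the rest as a linkage surface.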

The anonym.legal API processes training data in batch mode, detecting 285+ entity types across 48 languages. For organizations with multilingual training datasets — a common scenario for European AI companies serving multiple language markets — this cross-language coverage is essential. A compliance failure in one language of a multilingual training set creates AI Act exposure across the entire system.

Practical Guide: Anonymizing Your Training Pipeline

Step 1: Audit your training datasets

Before anonymization, you need to know what you have. Run a detection pass across all training data sources:

# Scan a single training document; jq builds the JSON body so that quotes
# and newlines in the file are escaped safely
jq -Rs '{text: ., language: "en"}' training_document.txt | \
  curl -X POST https://anonym.legal/api/presidio/analyze \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d @-

The response lists all detected entities with their types, positions, and confidence scores. Aggregate across your dataset to understand PII exposure before you begin remediation.
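The aggregation step can be sketched as follows. This is a minimal helper, assuming a Presidio-style response shape where each document's analysis is a list of entity dicts with an "entity_type" field; the sample responses below are illustrative.

```python
from collections import Counter

def aggregate_entities(analysis_results: list[list[dict]]) -> Counter:
    """Tally detected entity types across per-document analyze responses."""
    counts = Counter()
    for entities in analysis_results:
        counts.update(e["entity_type"] for e in entities)
    return counts

# Example: analyze responses for two documents (shape assumed from Presidio-style output)
responses = [
    [{"entity_type": "PERSON", "start": 0, "end": 10, "score": 0.95},
     {"entity_type": "LOCATION", "start": 25, "end": 36, "score": 0.88}],
    [{"entity_type": "PERSON", "start": 4, "end": 12, "score": 0.97}],
]

print(aggregate_entities(responses))  # Counter({'PERSON': 2, 'LOCATION': 1})
```

The resulting tally tells you which entity types dominate each dataset and therefore which replacement strategies to prioritize in the next step.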

Step 2: Batch anonymize

For large training datasets, use the batch endpoint to process multiple documents in parallel:

import os
from pathlib import Path

import requests

def anonymize_training_batch(documents: list[dict]) -> list[dict]:
    """Send one batch of documents to the anonymization endpoint."""
    response = requests.post(
        "https://anonym.legal/api/presidio/anonymize-batch",
        json={"items": documents, "language": "en"},
        headers={"Authorization": f"Bearer {os.environ['ANONYM_API_KEY']}"},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["results"]

# Load training documents
training_dir = Path("./training_data")
output_dir = training_dir / "anonymized"
output_dir.mkdir(parents=True, exist_ok=True)

docs = [
    {"id": f.name, "text": f.read_text()}
    for f in training_dir.glob("*.txt")
]

# Anonymize in batches of 50
batch_size = 50
for i in range(0, len(docs), batch_size):
    batch = docs[i:i + batch_size]
    for result in anonymize_training_batch(batch):
        (output_dir / result["id"]).write_text(result["text"])
        print(f"Processed {result['id']}: {len(result['items'])} entities removed")
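Before promoting anonymized output to the training pipeline, it is worth re-running detection on it and failing any document where entities survive. A minimal check of that kind, assuming the same Presidio-style analyze response shape (a list of entity dicts with "entity_type" and "score"); the sample scan below is illustrative:

```python
def residual_entities(entities: list[dict], threshold: float = 0.5) -> list[str]:
    """Entity types still present above a confidence threshold after anonymization."""
    return sorted({e["entity_type"] for e in entities if e["score"] >= threshold})

# Example analyze response for one anonymized document (shape assumed)
scan = [{"entity_type": "DATE_TIME", "score": 0.30},
        {"entity_type": "PERSON", "score": 0.85}]

leftover = residual_entities(scan)
print(leftover)  # ['PERSON'] — fail this document and send it back for re-processing
```

The threshold is a tuning decision: too high and residual PII slips through; too low and benign tokens (dates, generic numbers) block the pipeline.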

Step 3: Document the process

Article 10 requires documented data governance practices. Your anonymization process documentation should include:

  • The detection model and version used
  • Entity types detected and replacement strategy for each
  • A record of entity counts removed per dataset
  • The date of anonymization and the version of training data used

This documentation constitutes the "data governance and management practices" required by Article 10(2)(a).
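One lightweight way to capture the four items above is a machine-readable governance record written alongside each anonymized dataset. The field names and values below are illustrative, not a prescribed schema:

```python
import json
from datetime import date

# A minimal governance record per anonymized dataset (all values are examples)
record = {
    "dataset": "support_tickets_v3",
    "detection_model": "anonym.legal presidio",  # detection model used
    "model_version": "2026-01",                  # assumed versioning scheme
    "replacement_strategy": {
        "PERSON": "placeholder",
        "LOCATION": "placeholder",
        "EMAIL_ADDRESS": "mask",
    },
    "entities_removed": {"PERSON": 1423, "LOCATION": 310, "EMAIL_ADDRESS": 882},
    "anonymized_on": date.today().isoformat(),
    "training_data_version": "v3.2",
}

with open("governance_record.json", "w") as f:
    json.dump(record, f, indent=2)
```

Writing one record per dataset version gives auditors a direct artifact to inspect, rather than a narrative description reconstructed after the fact.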

The Colorado AI Act: Parallel US Obligation

Colorado's AI Act takes effect on June 30, 2026 — five weeks before the EU AI Act's high-risk enforcement date. Colorado's law imposes similar training data obligations for "high-risk AI systems" under Colorado law, with a focus on algorithmic discrimination.

Organizations operating in both the EU and Colorado face simultaneous compliance deadlines. The anonymization approach satisfies both: training data governance under Article 10 (EU) and algorithmic discrimination prevention measures under Colorado law. The technical implementation is identical.

Starting Now

Five months is sufficient time to implement training data anonymization if work begins immediately. It is not sufficient time if work begins in June.

The compliance sequence:

  1. Weeks 1-2: Dataset audit — understand what PII is present
  2. Weeks 3-6: Anonymization pipeline implementation and testing
  3. Weeks 7-10: Process documentation and legal review
  4. Weeks 11-16: Validation — verify anonymized datasets meet Article 10 quality requirements
  5. August 2: Enforcement date — compliant training data governance in place

The anonym.legal API integrates into existing training pipelines without requiring infrastructure changes. The GDPR compliance checklist covers the data governance documentation requirements that overlap between GDPR and Article 10.

The EU AI Act is enforcement-ready. The question for organizations building high-risk AI systems is not whether compliance is required — it is whether they will be ready by August 2.

Start with the GDPR compliance checklist →


Sources:

  • EU AI Act, Regulation (EU) 2024/1689, Articles 9-17 (high-risk AI obligations), OJ L 2024/1689
  • EU AI Act, Article 10 — Data and data governance
  • CNIL AI training data recommendations, January 2026
  • Colorado AI Act, SB 205, effective June 30, 2026
  • EU AI Act enforcement timeline: prohibited practices February 2, 2025; high-risk systems August 2, 2026

Ready to protect your data?

Start anonymizing PII with 285+ entity types across 48 languages.