
ML Training Data: Reproducible Privacy Presets

Machine learning models require consistent training data. Reproducible anonymization presets ensure that the same dataset...

April 20, 2026 · 6 min read
ML training data, reproducible privacy, GDPR AI Act, CNIL enforcement, data science compliance

The Problem: Non-Reproducible Anonymization

Most ML teams anonymize training data using non-deterministic methods:

import os
import hashlib

def anonymize_for_ml(data):
    """Non-reproducible anonymization"""
    result = []
    for record in data:
        # Random salt -- different on every run
        salt = os.urandom(16)
        anonymized_name = hashlib.sha256(record['name'].encode() + salt).hexdigest()

        result.append({
            'name_hash': anonymized_name,  # Different output each time!
            'email': record['email'],      # Not anonymized at all!
        })
    return result

# Run 1
dataset_v1 = anonymize_for_ml(raw_data)

# Run 2 (same input, different output)
dataset_v2 = anonymize_for_ml(raw_data)

# dataset_v1 != dataset_v2 (due to random salt)
# Model trained on v1 and v2 will have different hashes

Problems:

  • ❌ Cannot reproduce same training dataset
  • ❌ Cannot link records across runs (for validation)
  • ❌ Cannot audit which anonymization was used
  • ❌ Not GDPR-compliant (no deterministic trail)
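The non-determinism above is easy to demonstrate: hashing the same value twice with a fresh random salt yields two different digests. A minimal sketch:

```python
import os
import hashlib

def salted_hash(value: str) -> str:
    """Hash a value with a fresh random salt -- not reproducible."""
    salt = os.urandom(16)
    return hashlib.sha256(value.encode() + salt).hexdigest()

# Two runs over the same input produce different outputs.
run_1 = salted_hash("John Smith")
run_2 = salted_hash("John Smith")
print(run_1 == run_2)  # False -- the salt changes on every call
```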

The Solution: Deterministic Reproducible Presets

Reproducible anonymization uses deterministic hashing without a random salt:

Step 1: Define a Reproducible Preset

{
  "preset_name": "ML Training Data - GDPR Safe",
  "version": "1.0",
  "reproducibility": "DETERMINISTIC",
  "seed": "organization_secret_key_2025",
  "rules": [
    {
      "entity_type": "PERSON",
      "operator": "hash",
      "algorithm": "HMAC-SHA256",
      "secret_key": "organization_secret_key_2025",
      "deterministic": true
    },
    {
      "entity_type": "EMAIL_ADDRESS",
      "operator": "hash",
      "algorithm": "SHA-256",
      "no_salt": true,
      "deterministic": true
    },
    {
      "entity_type": "PHONE_NUMBER",
      "operator": "mask",
      "retain_format": true,
      "last_digits": 4,
      "deterministic": true
    }
  ]
}
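The `load_preset` call used in the next step is not defined in this article; a minimal sketch, assuming presets are stored as JSON files in a directory and looked up by their `preset_name` field:

```python
import json
from pathlib import Path

def load_preset(preset_name: str, preset_dir: str = "presets") -> dict:
    """Load a preset by name from a directory of JSON preset files."""
    for path in Path(preset_dir).glob("*.json"):
        preset = json.loads(path.read_text())
        if preset.get("preset_name") == preset_name:
            return preset
    raise KeyError(f"No preset named {preset_name!r} in {preset_dir}/")
```

In practice the secret key should come from a secrets manager rather than the preset file itself; it is kept inline here only to match the preset shown above.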

Step 2: Implement Reproducible Anonymization

import hashlib
import hmac

def anonymize_reproducible(data, preset):
    """Deterministic, reproducible anonymization"""
    result = []

    secret_key = preset['seed']

    for record in data:
        # HMAC-SHA256: deterministic, but not reversible without the secret key
        anonymized_name = hmac.new(
            key=secret_key.encode(),
            msg=record['name'].encode(),
            digestmod=hashlib.sha256
        ).hexdigest()

        # Plain SHA-256, no salt: same input always yields the same digest
        anonymized_email = hashlib.sha256(
            record['email'].encode()
        ).hexdigest()

        result.append({
            'person_hash': anonymized_name,
            'email_hash': anonymized_email,
            'phone_masked': record['phone'][-4:]  # keep last 4 digits
        })

    return result

# Run 1
preset = load_preset("ML Training Data - GDPR Safe")
dataset_v1 = anonymize_reproducible(raw_data, preset)

# Run 2 (same input, same output)
dataset_v2 = anonymize_reproducible(raw_data, preset)

# dataset_v1 == dataset_v2 ✅ (deterministic)
# Hash of "John Smith" is always the same
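The preset's PHONE_NUMBER rule asks for a format-retaining mask (`retain_format: true`, `last_digits: 4`), while the sketch above simply slices the last four characters. A hedged implementation that masks the digits but keeps separators might look like:

```python
def mask_phone(phone: str, last_digits: int = 4) -> str:
    """Mask all digits except the last `last_digits`, keeping the format.

    Non-digit characters (+, -, spaces, parentheses) are preserved so the
    masked value still looks like a phone number. Deterministic by design.
    """
    digit_positions = [i for i, ch in enumerate(phone) if ch.isdigit()]
    keep = set(digit_positions[-last_digits:]) if last_digits else set()
    return "".join(
        ch if (not ch.isdigit() or i in keep) else "X"
        for i, ch in enumerate(phone)
    )

print(mask_phone("+1 (415) 555-0173"))  # +X (XXX) XXX-0173
```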

Step 3: Version-Control the Dataset

# Generate training dataset v1
python anonymize.py --preset "ML Training Data - GDPR Safe" --output dataset-v1.csv

# Compute checksum
sha256sum dataset-v1.csv > dataset-v1.sha256

# Store in version control
git add dataset-v1.csv dataset-v1.sha256
git commit -m "ML training data v1 - GDPR safe anonymization"

# Later, regenerate same dataset
python anonymize.py --preset "ML Training Data - GDPR Safe" --output dataset-v1-verify.csv
sha256sum dataset-v1-verify.csv

# Verify checksums match
# ✅ Same dataset reproduced exactly
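The shell workflow above can also be run from Python, e.g. in a CI step; a minimal sketch (file names are illustrative):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dataset(original: str, regenerated: str) -> bool:
    """True if the regenerated dataset is byte-identical to the original."""
    return file_sha256(original) == file_sha256(regenerated)

# verify_dataset("dataset-v1.csv", "dataset-v1-verify.csv")
```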

The Benefits of Reproducible Presets

| Without Reproducibility | With Reproducibility |
| --- | --- |
| Different hash each time | Same hash always |
| Cannot audit anonymization | Complete audit trail |
| Cannot validate models | Can compare model versions |
| DPA cannot verify | DPA can independently verify |

The GDPR Compliance Use Case

GDPR Article 32 requires documented security measures. With reproducible presets:

DPA Audit Question: "Can you prove that this training dataset was anonymized according to your policy?"

Without Reproducibility:

"We anonymized it... but we can't reproduce it." Result: ❌ FAIL

With Reproducibility:

"Here's the preset, the seed, and the version checksum. We can regenerate the exact same dataset." Result: ✅ PASS

Real-World Case: Healthcare ML Company

A healthcare ML company was developing a disease-prediction model:

Problem:

- Trained model A on anonymized dataset v1
- Tried to retrain model B on the same dataset
- Output differed because of the random salt in the anonymization
- Could not reproduce exactly what data was used
- DPA audit failed: "Cannot verify dataset anonymization"

Solution: Reproducible Preset

- Used a deterministic HMAC-SHA256 preset
- Trained model A on anonymized dataset v1
- Retraining model B produced identical anonymization
- Checksum verification: ✅ exact match
- DPA audit passed: "Here's the preset, seed, and verification"

Benefits:
- ✅ Reproducible training data
- ✅ GDPR compliant with audit trail
- ✅ Can validate model versions
- ✅ Can detect dataset drift

Best Practices

  1. Use deterministic hashing - no random salts
  2. Store secret key securely - in secrets manager
  3. Version control presets - document changes
  4. Compute dataset checksums - for verification
  5. Audit trail - log each dataset generation
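Practice 5 can be a one-liner per run: append a structured record with the preset name, version, and output checksum each time a dataset is generated. A minimal sketch (the JSON-lines log format and field names are assumptions):

```python
import json
import hashlib
from datetime import datetime, timezone

def log_generation(log_path: str, preset: dict, dataset_path: str) -> dict:
    """Append an audit record for one dataset generation run."""
    with open(dataset_path, "rb") as f:
        checksum = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "preset_name": preset["preset_name"],
        "preset_version": preset["version"],
        "dataset_sha256": checksum,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

An append-only log like this is what lets a DPA trace any dataset version back to the exact preset that produced it.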

Reproducible anonymization presets are essential for ML teams with GDPR compliance requirements.

Ready to protect your data?

Start anonymizing PII using 285+ entity types across 48 languages.