ML Training Data: Reproducible Privacy Presets
The Problem: Non-Reproducible Anonymization
Most ML teams anonymize training data using non-deterministic methods:
```python
import hashlib
import os

def anonymize_for_ml(data):
    """Non-reproducible anonymization: a fresh random salt per record."""
    result = []
    for record in data:
        # Random salt - different on every run
        salt = os.urandom(16)
        anonymized_name = hashlib.sha256(record['name'].encode() + salt).hexdigest()
        result.append({
            'name_hash': anonymized_name,  # Different output each time!
            'email': record['email'],      # Not anonymized at all!
        })
    return result

# Run 1
dataset_v1 = anonymize_for_ml(raw_data)

# Run 2 (same input, different output)
dataset_v2 = anonymize_for_ml(raw_data)

# dataset_v1 != dataset_v2 (because of the random salt)
# Models trained on v1 and v2 see different hashes for the same person
```
Problem:
- ❌ Cannot reproduce same training dataset
- ❌ Cannot link records across runs (for validation)
- ❌ Cannot audit which anonymization was used
- ❌ Not GDPR-compliant (no deterministic trail)
The Solution: Deterministic, Reproducible Presets
Reproducible anonymization uses deterministic hashing without random salts:
Step 1: Define a Reproducible Preset
```json
{
  "preset_name": "ML Training Data - GDPR Safe",
  "version": "1.0",
  "reproducibility": "DETERMINISTIC",
  "seed": "organization_secret_key_2025",
  "rules": [
    {
      "entity_type": "PERSON",
      "operator": "hash",
      "algorithm": "HMAC-SHA256",
      "secret_key": "organization_secret_key_2025",
      "deterministic": true
    },
    {
      "entity_type": "EMAIL_ADDRESS",
      "operator": "hash",
      "algorithm": "SHA-256",
      "no_salt": true,
      "deterministic": true
    },
    {
      "entity_type": "PHONE_NUMBER",
      "operator": "mask",
      "retain_format": true,
      "last_digits": 4,
      "deterministic": true
    }
  ]
}
```
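The article does not show how presets are loaded, so here is a hypothetical sketch of the `load_preset` helper used below. It assumes presets are stored as JSON files named after the preset, with the schema keys from the example above; both the file layout and the validation rules are assumptions, not a fixed API.

```python
import json
import os

# Schema keys taken from the example preset above
REQUIRED_KEYS = {"preset_name", "version", "reproducibility", "seed", "rules"}

def load_preset(name, preset_dir="."):
    """Load '<name>.json' and refuse anything not marked deterministic."""
    path = os.path.join(preset_dir, f"{name}.json")
    with open(path) as f:
        preset = json.load(f)
    missing = REQUIRED_KEYS - preset.keys()
    if missing:
        raise ValueError(f"preset missing keys: {sorted(missing)}")
    if preset["reproducibility"] != "DETERMINISTIC":
        raise ValueError("preset is not deterministic; refusing to run")
    for rule in preset["rules"]:
        if not rule.get("deterministic", False):
            raise ValueError(f"non-deterministic rule for {rule['entity_type']}")
    return preset
```

Failing fast on any non-deterministic rule keeps an accidentally edited preset from silently producing an unreproducible dataset.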
Step 2: Implement Reproducible Anonymization
```python
import hashlib
import hmac

def anonymize_reproducible(data, preset):
    """Deterministic, reproducible anonymization."""
    result = []
    secret_key = preset['seed']
    for record in data:
        # Keyed hash for names: stable for as long as the key is stable
        anonymized_name = hmac.new(
            key=secret_key.encode(),
            msg=record['name'].encode(),
            digestmod=hashlib.sha256
        ).hexdigest()
        # Unsalted hash for emails: identical input, identical output
        anonymized_email = hashlib.sha256(
            record['email'].encode()
        ).hexdigest()
        # Mask the phone number, retaining only the last 4 digits
        phone = record['phone']
        masked_phone = '*' * (len(phone) - 4) + phone[-4:]
        result.append({
            'person_hash': anonymized_name,
            'email_hash': anonymized_email,
            'phone_masked': masked_phone
        })
    return result

# Run 1
preset = load_preset("ML Training Data - GDPR Safe")
dataset_v1 = anonymize_reproducible(raw_data, preset)

# Run 2 (same input, same output)
dataset_v2 = anonymize_reproducible(raw_data, preset)

# dataset_v1 == dataset_v2 ✅ (deterministic)
# The hash of "John Smith" is always the same
```
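As a quick sanity check (not part of the original pipeline), the determinism claim can be verified directly: HMAC-SHA256 over the same key and message always yields the same digest.

```python
import hashlib
import hmac

def person_hash(name, secret_key):
    """HMAC-SHA256 of a name under a fixed organization key."""
    return hmac.new(secret_key.encode(), name.encode(), hashlib.sha256).hexdigest()

# Same (key, message) pair: same digest across runs, processes, and machines
h1 = person_hash("John Smith", "organization_secret_key_2025")
h2 = person_hash("John Smith", "organization_secret_key_2025")
assert h1 == h2

# A different key yields a different, unlinkable hash - which is why
# rotating the key breaks linkage with previously generated datasets
assert h1 != person_hash("John Smith", "rotated_key_2026")
```

This is also why the secret key must be kept stable and secure: anyone who changes it breaks reproducibility, and anyone who holds it can recompute hashes for known names.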
Step 3: Version-Control the Dataset
```shell
# Generate training dataset v1
python anonymize.py --preset "ML Training Data - GDPR Safe" --output dataset-v1.csv

# Compute checksum
sha256sum dataset-v1.csv > dataset-v1.sha256

# Store in version control
git add dataset-v1.csv dataset-v1.sha256
git commit -m "ML training data v1 - GDPR safe anonymization"

# Later, regenerate the same dataset and verify the checksums match
python anonymize.py --preset "ML Training Data - GDPR Safe" --output dataset-v1-verify.csv
sha256sum dataset-v1-verify.csv  # compare against dataset-v1.sha256
# ✅ Same dataset reproduced exactly
```
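The same verification can be scripted in Python, which is convenient inside a CI pipeline. A minimal sketch, with illustrative function names:

```python
import hashlib

def file_sha256(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dataset(original_path, regenerated_path):
    """True if the regenerated dataset is byte-for-byte identical."""
    return file_sha256(original_path) == file_sha256(regenerated_path)
```

Reading in chunks keeps memory flat even for multi-gigabyte training files.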
Benefits of Reproducible Presets
| Without Reproducibility | With Reproducibility |
|---|---|
| Different hash each time | Same hash always |
| Cannot audit anonymization | Complete audit trail |
| Cannot validate models | Can compare model versions |
| DPA cannot verify | DPA can independently verify |
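The record-linkage benefit in the table follows directly from determinism: the same person gets the same `person_hash` in every generated dataset, so two runs can be joined on that column. A minimal sketch, using the field names from the earlier example:

```python
def link_records(run_a, run_b, key="person_hash"):
    """Join two anonymized datasets on a deterministic hash column."""
    # Index the second run by hash for O(1) lookups
    index_b = {rec[key]: rec for rec in run_b}
    return [
        (rec, index_b[rec[key]])
        for rec in run_a
        if rec[key] in index_b
    ]
```

With salted hashing this join would match nothing, which is exactly the validation gap described in the problem section.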
GDPR Compliance Use Case
GDPR Article 32 requires documented security measures. With reproducible presets:
DPA Audit Question: "Can you prove that this training dataset was anonymized according to your policy?"
Without Reproducibility:
"We anonymized it... but we can't reproduce it." Result: ❌ FAIL
With Reproducibility:
"Here's the preset, the seed, and the version checksum. We can regenerate the exact same dataset." Result: ✅ PASS
Real-World Case: Healthcare ML Company
A healthcare ML company developed a disease-prediction model:
Problem:
- Trained model A on anonymized dataset v1
- Tried to retrain model B on the same dataset
- Output differed because of the random salt used in anonymization
- Could not reproduce exactly what data was used
- DPA audit failed: "Cannot verify dataset anonymization"
Solution: Reproducible Preset
- Used a deterministic HMAC-SHA256 preset
- Trained model A on anonymized dataset v1
- Retraining model B produced identical anonymization
- Checksum verification: ✅ exact match
- DPA audit passed: "Here's the preset, seed, and verification"
Benefits:
- ✅ Reproducible training data
- ✅ GDPR compliant with audit trail
- ✅ Can validate model versions
- ✅ Can detect dataset drift
Best Practices
- Use deterministic hashing - no random salts
- Store secret key securely - in secrets manager
- Version control presets - document changes
- Compute dataset checksums - for verification
- Audit trail - log each dataset generation
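The audit-trail practice above could look like the following sketch: append one entry per dataset generation to a JSONL log, tying together the preset name, preset version, and dataset checksum. The log filename and entry fields are illustrative, not a fixed schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_generation(preset, dataset_path, log_path="anonymization_audit.jsonl"):
    """Append one audit entry per generated dataset to a JSONL log."""
    with open(dataset_path, "rb") as f:
        checksum = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "preset_name": preset["preset_name"],
        "preset_version": preset["version"],
        "dataset_checksum": checksum,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

During a DPA audit, each log line points to exactly which preset version produced which dataset, and the checksum lets the regenerated file be verified independently.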
Reproducible anonymization presets are essential for ML teams with GDPR compliance requirements.