
Building a GDPR-Safe Data Pipeline: Anonymizing PII Before It Reaches Your Data Warehouse

dbt column tags are not GDPR compliance. Raw customer data hits your Snowflake warehouse unmasked before tag-based policies apply. This guide covers how to anonymize PII in the pipeline, before data lands in analytics infrastructure.

March 5, 2026 · 8 min read

data pipeline · dbt · Snowflake · data warehouse · ELT anonymization · GDPR engineering


You've tagged your PII columns in dbt. Your dynamic data masking policy is configured in Snowflake. You feel GDPR-compliant.

Your raw data still hits the warehouse unmasked. The masking policy applies at query time — but the raw, unmasked data exists in your raw layer, available to anyone with raw schema access. Your dbt models ran before your masking policies were in place, and the historical raw data was never masked.

The gap between "we have masking policies" and "our data is actually protected" is where GDPR violations happen.

How ELT Pipelines Create PII Exposure

The Extract-Load-Transform (ELT) pattern — dominant in modern data engineering — loads raw data into the warehouse first, then transforms it:

  1. Extract: Source system data (Salesforce CRM, Stripe payments, Intercom support) is extracted with all fields
  2. Load: Raw data loaded into warehouse raw schema — Snowflake, BigQuery, Redshift — including all PII fields
  3. Transform: dbt models run to clean, join, and aggregate data for analytics use

The raw layer contains unmasked, complete personal data: customer names, email addresses, phone numbers, payment information, support ticket content. Anyone with access to the raw schema — and in many organizations, that's a broad set of data engineers and analysts — can query it directly.

Tag-based dynamic masking in Snowflake helps at query time for properly configured downstream models. But it doesn't retroactively mask raw data. It doesn't protect against direct raw schema queries. It requires every downstream model and dashboard to be properly tagged — a maintenance burden that grows with schema complexity.

The Pipeline-Level Anonymization Approach

Anonymizing PII at the pipeline level — before data lands in the warehouse — eliminates raw-layer exposure:

ETL approach (pre-load anonymization):

  1. Extract data from source systems
  2. Route through anonymization step
  3. Load anonymized data into warehouse

The warehouse never receives raw PII. The raw schema contains anonymized data. Downstream models, dashboards, and direct queries all work with anonymized data.

This requires either:

  • Anonymization integrated into the extract step (API-level)
  • Anonymization as a pipeline stage between extract and load

Implementation option — API integration: For systems with outbound webhooks or streaming exports, route data through the anonym.legal API before landing in the warehouse. Customer support tickets leaving Intercom → anonymization API → warehouse. Stripe payment records → anonymization API → warehouse.

POST /api/anonymize
{
  "text": "Customer John Smith (john@example.com) reported...",
  "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
  "method": "replace"
}
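A pipeline stage wrapping this call could look like the following Python sketch. The endpoint URL is taken from the request above; the auth header and the response field name are assumptions, not confirmed API details — check the service's API reference before relying on them.

```python
import json
import urllib.request

ANONYMIZE_URL = "https://anonym.legal/api/anonymize"  # path from the example above

def build_anonymize_request(text):
    # Payload mirrors the request body shown above.
    return {
        "text": text,
        "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
        "method": "replace",
    }

def anonymize(text, api_key):
    """Send one text field through the anonymization API and return the result.
    Bearer auth and the "text" response field are assumptions."""
    req = urllib.request.Request(
        ANONYMIZE_URL,
        data=json.dumps(build_anonymize_request(text)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["text"]  # assumed response shape
```

In a streaming setup, this function sits between the webhook consumer and the warehouse loader, so only the returned (anonymized) text is ever written downstream.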

Implementation option — Batch preprocessing: For batch-loaded data (daily/weekly exports from source systems), run the exported CSV/JSON files through batch processing before loading to the warehouse.

Airflow DAG structure:

extract_task >> anonymize_batch_task >> load_to_warehouse_task

The anonymize_batch_task uploads extracted files to batch processing and retrieves anonymized versions. The load task loads the anonymized files.
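The anonymize_batch_task itself can be a plain Python callable. The sketch below illustrates the shape of such a stage with a single regex that pseudonymizes email addresses in a CSV export; a real anonymization service detects far more entity types, and the column names here are hypothetical.

```python
import csv
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonymize_email(match):
    # Deterministic replacement: the same input email always maps to the
    # same token, so downstream joins on the column still resolve.
    digest = hashlib.sha256(match.group(0).lower().encode()).hexdigest()[:12]
    return f"user_{digest}@redacted.invalid"

def anonymize_file(src_path, dst_path, pii_columns=("email", "ticket_body")):
    """Read a CSV export, replace email addresses in the named PII columns,
    and write the anonymized file for the load task to pick up."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in pii_columns:
                if row.get(col):
                    row[col] = EMAIL_RE.sub(pseudonymize_email, row[col])
            writer.writerow(row)
```

Because the replacement is deterministic, the anonymized raw layer still supports entity resolution across tables — the trade-off discussed in Step 4 below.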

dbt Column Tags: What They Do and Don't Do

dbt supports tagging PII columns:

models:
  - name: stg_customers
    columns:
      - name: email
        tags: ['pii', 'email']
      - name: full_name
        tags: ['pii', 'personal_data']

This enables:

  • Documentation of PII locations
  • Triggering downstream masking policies (requires warehouse-level configuration)
  • Lineage tracking (Secoda and similar tools can trace tagged columns through downstream models)

This does not enable:

  • Masking of raw data in the raw schema
  • Protection against direct queries of raw tables
  • Automatic anonymization at load time
  • Retroactive masking of historical data

dbt column tags are a documentation and governance tool. They're valuable for understanding where PII exists in your data model. They don't implement the "appropriate technical measures" that GDPR Article 32 requires for data protection.

The Snowflake Dynamic Data Masking Gap

Snowflake's dynamic data masking applies masking policies to columns, hiding data from users without the unmasking privilege at query time. This is a powerful control for production use cases.

The limitations:

  • Masking applies to the columns it's configured on — any column added after initial configuration requires explicit policy application
  • Schema evolution (new columns, renamed columns) can create unmasked PII columns until policies are updated
  • Users with the SYSADMIN role or ACCOUNTADMIN typically can bypass masking
  • Raw data import processes often run with elevated privileges that bypass masking
  • Historical data loaded before policies were implemented is stored unmasked (policies apply at read time, not storage time)

For true protection, masking at query time is insufficient. The data should be anonymized before storage.

Compliance Documentation for Analytics Pipelines

GDPR's accountability principle requires demonstrating compliance, not just claiming it. For data engineering teams, this means:

Records of Processing Activities (ROPA): Document that customer data is anonymized before loading to the analytics warehouse. The anonymization step in your pipeline is a processing activity under GDPR.

Technical safeguard documentation: The anonymization configuration (which entity types, which method) used in your pipeline. Processing metadata from batch runs provides this automatically.

Data lineage: Tools like Secoda or dbt's built-in lineage can show that source system data flows through an anonymization step before reaching analytics models. This lineage is your compliance audit trail.

Sub-processor documentation: The anonymization service is a sub-processor. Their DPA and privacy policy must be documented in your vendor register.

Practical Implementation Guide

For a dbt-based pipeline with Snowflake:

Step 1: Audit raw layer exposure. Identify which tables in your raw schema contain personal data. Query your dbt column tags or your data catalog for PII-tagged tables.
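One way to run this audit is to scan your parsed dbt schema.yml files for pii tags. A minimal sketch, assuming the schema has already been loaded into a dict (e.g. via yaml.safe_load) and follows the tagging layout shown earlier in this post:

```python
def pii_columns(schema):
    """Given a parsed dbt schema.yml (as a dict), yield (model, column)
    pairs for every column tagged 'pii'."""
    for model in schema.get("models", []):
        for col in model.get("columns", []):
            if "pii" in col.get("tags", []):
                yield model["name"], col["name"]
```

Joining this list against your raw schema's table inventory gives you the exposure map for Step 2.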

Step 2: Identify anonymization scope. For each raw table: which columns contain PII? Which should be anonymized vs. preserved? (Customer support ticket body: anonymize. Order ID: pseudonymize with consistent replacement for entity resolution. Timestamp: preserve for time-series analysis.)

Step 3: Choose an implementation approach.

  • Small team, batch-loaded data: batch file processing before load
  • Data engineering resources: API integration in an Airflow/Prefect pipeline

Step 4: Test and validate. Run anonymization on a sample of raw data before production implementation. Validate that downstream dbt models still function correctly with anonymized input. Some models may use email addresses for joining — these need to use consistent replacement values (pseudonymization preserves join keys, redaction breaks them).
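Where a model joins on an identifier like an email address, redaction breaks the join but keyed hashing preserves it. A sketch of deterministic pseudonymization with a secret key (in production the key would come from your secrets manager, not from code):

```python
import hashlib
import hmac

def pseudonymize(value, key):
    """Map a PII value to a stable token: the same (value, key) pair always
    yields the same token, so joins on the column still resolve, but the
    original value cannot be recovered without the key."""
    normalized = value.strip().lower().encode()
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:16]
```

Unlike a plain unsalted hash, the keyed HMAC prevents anyone who sees the tokens from brute-forcing common email addresses without also holding the key.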

Step 5: Handle historical data. Existing raw data (loaded before anonymization was implemented) requires retroactive processing. Export, anonymize, reload. This is a one-time operation per historical table.

Conclusion

Tag-based masking is a governance tool, not a security control. It tells you where your PII is; it doesn't prevent your PII from being exposed to users with raw schema access. For true GDPR compliance in data pipelines, PII should be anonymized before it lands in the warehouse — making the raw layer as safe as the production layer.

This is a more complex implementation than column tagging, but it's what "appropriate technical measures" actually requires.

