anonym.legal

LangChain CVE-2025-68664: How PII Leaks Through Your RAG Pipeline

CVSS 9.3. LangChain's serialization functions expose environment variables and secrets to attacker-controlled LLMs. How to detect and fix PII leakage in RAG pipelines.

March 16, 2026 · 8 min read
LangChain · RAG pipeline · CVE · PII leakage · developer security · API keys · LLM security

CVE-2025-68664: What Happened

In late 2025, security researchers disclosed CVE-2025-68664, a critical vulnerability in LangChain's serialization functions — specifically dumps() and dumpd(). The CVSS score is 9.3 (Critical).

The vulnerability works as follows: dumps() and dumpd() serialize Python objects, including callable functions, by capturing their closure context. When an attacker controls the LLM response in a LangChain chain — through prompt injection in a retrieved document, a malicious tool result, or a poisoned vector store entry — they can craft responses that cause dumps() to serialize environment variables accessible to the Python process.

The result: API keys, database connection strings, JWT secrets, and AWS credentials embedded in the LangChain chain's environment can be exfiltrated through the model's output. An attacker who can inject text into your RAG pipeline's source documents can, under certain chain configurations, read your production secrets.
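The severity follows from how much lives in a typical process environment. A hedged illustration (the name-matching regex is an example heuristic, not taken from the advisory): any code running inside the chain's process can enumerate os.environ, so a serialization gadget that captures closure context gets the same visibility.

```python
import os
import re

# Heuristic: env var NAMES that commonly hold secrets. Illustrative only.
SENSITIVE = re.compile(r"(KEY|SECRET|TOKEN|PASSWORD|CREDENTIAL)", re.IGNORECASE)

def sensitive_env_vars() -> list[str]:
    """List environment variable names that look like secrets (values omitted)."""
    return sorted(k for k in os.environ if SENSITIVE.search(k))
```

Running this in a typical production container usually surfaces a dozen or more entries — each one reachable by anything that can read the process environment.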

Affected versions: LangChain < 0.3.22 (Python). The fix was released in 0.3.22, but adoption has been slow — PyPI download data shows significant use of vulnerable versions through March 2026.

How PII Leaks in RAG Pipelines — The General Problem

CVE-2025-68664 is a dramatic example of a broader, quieter problem: PII leaks through RAG pipelines routinely, through mechanisms that require no CVE and no attacker.

Consider a typical enterprise RAG setup:

  1. Ingestion: You index your company's documents — support tickets, customer emails, legal contracts, HR records — into a vector database (Pinecone, Weaviate, pgvector).
  2. Retrieval: When a user asks a question, the system retrieves the 5 most semantically similar document chunks.
  3. Generation: Those chunks are passed as context to an LLM (GPT-4o, Claude, Gemini), which generates a response.

The problem is step 2. The retrieved chunks contain whatever was in the original documents, including:

  • Customer names, email addresses, phone numbers
  • Contract values, account numbers, tax identifiers
  • Employee salary data, performance review content
  • Patient names in clinical notes (for healthcare RAG)
  • National ID numbers in immigration document pipelines

That PII is passed verbatim to the LLM in the context window. It appears in the model's output if the query elicits it. It is logged by the LLM provider. It is stored in your LangChain conversation history. It flows into your observability platform.

None of this requires a vulnerability. It is the intended behavior of a RAG system — and it creates systematic PII exposure.
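The three steps above can be sketched in a few lines (search and complete stand in for your retriever and LLM client; both names are illustrative):

```python
def answer(query: str, search, complete, k: int = 5) -> str:
    """Minimal RAG loop: whatever the retriever returns goes to the LLM verbatim."""
    chunks = search(query, k)                      # step 2: retrieval
    context = "\n\n".join(chunks)                  # PII in chunks survives intact
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return complete(prompt)                        # step 3: generation
```

Nothing in this loop inspects the chunks. If a retrieved support ticket contains a customer's email address, that address is in the prompt, the provider's logs, and potentially the answer.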

The 68 Secret Patterns

Security tooling that monitors RAG pipelines tracks 68 known secret patterns that commonly appear in enterprise document stores:

  • AWS Access Key IDs (AKIA...)
  • OpenAI API keys (sk-...)
  • Anthropic API keys (sk-ant-...)
  • Database connection URIs (postgresql://user:password@host/db)
  • JWT tokens (base64url-encoded segments, typically starting with eyJ)
  • GitHub Personal Access Tokens
  • Stripe secret keys (sk_live_...)
  • SendGrid API keys
  • Twilio account SIDs and auth tokens
  • Private key PEM blocks

These patterns appear in enterprise documents more often than developers expect. A support ticket might contain a customer's API key they pasted while debugging. A contract might include database credentials shared during a technical integration. A configuration file indexed accidentally exposes an entire secrets store.

When these documents are indexed into a vector database without sanitization, every query that retrieves them passes the secrets to the LLM — and potentially to the user who submitted the query.
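A pre-indexing scan for these patterns is cheap to add. The sketch below covers a subset of the patterns listed above; the regexes are approximations for illustration, not the exact rules a production scanner uses.

```python
import re

# Illustrative subset of the secret patterns listed above.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "openai_api_key":    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b"),
    "stripe_secret_key": re.compile(r"\bsk_live_[A-Za-z0-9]{16,}\b"),
    "database_uri":      re.compile(r"\b(?:postgresql|mysql|mongodb)://[^\s:]+:[^\s@]+@[^\s/]+"),
    "private_key_pem":   re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_for_secrets(text: str) -> list[str]:
    """Return the names of secret patterns found in a document chunk."""
    return [name for name, pattern in SECRET_PATTERNS.items()
            if pattern.search(text)]
```

Run this against every document at ingestion and quarantine any chunk that matches before it reaches the embedding step.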

The Fix: Anonymize Before Embedding

The correct architecture for a PII-safe RAG pipeline anonymizes documents before they are chunked and embedded. This is not optional for production systems handling customer data.

Here is a Python implementation using the anonym.legal API:

import requests
import os

ANONYM_API_KEY = os.environ["ANONYM_API_KEY"]
ANONYM_BASE_URL = "https://anonym.legal/api"

def anonymize_before_embedding(text: str) -> tuple[str, list]:
    """
    Anonymize PII in document text before embedding.
    Returns (anonymized_text, entities), where entities lists the
    detected PII items for auditing or optional deanonymization.
    """
    response = requests.post(
        f"{ANONYM_BASE_URL}/presidio/anonymize",
        json={
            "text": text,
            "language": "en",
            "anonymizers": {
                "DEFAULT": {"type": "replace", "new_value": "[REDACTED]"},
                "PERSON": {"type": "mask", "masking_char": "*", "chars_to_mask": 4, "from_end": False},
                "EMAIL_ADDRESS": {"type": "replace", "new_value": "[EMAIL]"},
                "PHONE_NUMBER": {"type": "replace", "new_value": "[PHONE]"},
                "CRYPTO": {"type": "replace", "new_value": "[SECRET]"},
                "URL": {"type": "keep"},  # Keep URLs — needed for citations
            }
        },
        headers={"Authorization": f"Bearer {ANONYM_API_KEY}"}
    )
    response.raise_for_status()  # fail loudly rather than index unsanitized text
    result = response.json()
    return result["text"], result.get("items", [])


def build_rag_index(documents: list[str], vectorstore):
    """
    Build a RAG index with PII anonymized before embedding.
    """
    anonymized_docs = []
    for doc in documents:
        clean_text, entities = anonymize_before_embedding(doc)
        anonymized_docs.append(clean_text)
        # Optionally log entity count for audit trail
        print(f"Removed {len(entities)} PII entities from document")

    # Now embed the clean documents — no PII reaches the vector store
    vectorstore.add_texts(anonymized_docs)

The anonym.legal API supports 285+ entity types. For enterprise document pipelines, this means names, emails, phone numbers, national IDs, financial identifiers, API keys (via CRYPTO entity type), database URIs, and 270+ additional patterns are detected and stripped before any document reaches your vector store.

Fixing CVE-2025-68664 Specifically

If you are running LangChain < 0.3.22, update immediately:

pip install "langchain>=0.3.22" "langchain-core>=0.3.22"
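After upgrading, confirm the installed versions are actually at or past the fixed release. A small check like the following (a sketch; it assumes plain numeric version strings) can run in CI:

```python
from importlib.metadata import version, PackageNotFoundError

def is_patched(package: str, fixed: tuple[int, int, int] = (0, 3, 22)) -> bool:
    """Return True if the installed package version is >= the fixed release."""
    try:
        installed = tuple(int(part) for part in version(package).split(".")[:3])
    except (PackageNotFoundError, ValueError):
        return False
    return installed >= fixed

for pkg in ("langchain", "langchain-core"):
    print(pkg, "patched:", is_patched(pkg))
```

A missing package reports False as well, which is the safe default for a CI gate.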

After patching, audit your chain configurations for prompt injection risk:

  1. Validate retrieved chunks before passing to the LLM — strip any content that matches known injection patterns (ignore previous instructions, system:, <INST>)
  2. Use anonymize_before_embedding in your ingestion pipeline — reduces the attack surface even if injection occurs, because the sensitive data is not present in retrieved chunks
  3. Restrict chain permissions — LangChain chains should not have access to environment variables beyond what they need. Use a dedicated service account with minimal permissions.
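Point 1 can be sketched as a coarse chunk filter. This is a minimal example using the injection markers named above — regex filtering is a mitigation, not a complete defense against prompt injection.

```python
import re

# Known injection markers from the audit checklist above. Illustrative list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"^\s*system\s*:", re.IGNORECASE | re.MULTILINE),
    re.compile(r"<\s*/?\s*INST\s*>", re.IGNORECASE),
]

def filter_retrieved_chunks(chunks: list[str]) -> list[str]:
    """Drop retrieved chunks that match known injection patterns."""
    return [chunk for chunk in chunks
            if not any(p.search(chunk) for p in INJECTION_PATTERNS)]
```

Apply the filter between retrieval and generation, and log what gets dropped — a spike in filtered chunks is itself a signal that someone is probing your pipeline.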

Secure Your Pipeline

For developers building production RAG systems, the combination of CVE-2025-68664 and general PII-in-context risk represents a significant liability. The fix is architectural: anonymize at ingestion, not at query time.

The CVSS score is 9.3. The remediation is one API call per document. The math is straightforward.


Sources:

  • NVD CVE-2025-68664, CVSS 9.3, LangChain serialization vulnerability
  • LangChain security advisory, langchain-ai/langchain GitHub, 2025
  • OWASP LLM Top 10: LLM01 Prompt Injection, LLM06 Sensitive Information Disclosure
  • anonym.legal entity type documentation — 285+ supported entity types

Ready to protect your data?

Start anonymizing PII with 285+ entity types in 48 languages.