title: "Legal PII: Privilege Detection"
description: "Case reference numbers, bar admission numbers, court docket numbers, and client matter IDs are legally sensitive identifiers that standard PII tools miss."
category: legal-tech
publishedAt: 2026-06-03
tags:
- attorney-client privilege
- legal document review
- case numbers
- law firm privacy
- legal tech
readingTime: 7

Attorney-Client Privilege in the AI Era: Legal PII Your Anonymization Tool Must Detect

Standard PII tools catch names, emails, and SSNs. They miss case reference IDs, bar admission numbers, and client matter tags. These carry serious privilege risks. Generic tools leave that gap open.

Law firms route files through AI assistants every day. Alongside standard PII, those files contain legal identifiers that generic tools do not catch:

Client matter tags: Link to the full matter file and name the client
Case reference IDs: Court-assigned codes that tie to public records with private detail
Bar admission numbers: Attorney IDs searchable in public state directories
Court docket codes: Connect to public filing systems with full case history
Judicial assignment codes: Identify the presiding judge in sensitive situations

Any of these, sent to an external AI vendor, creates a potential privilege problem.

Why These IDs Need Custom Detection

Court docket formats follow district-level patterns. No single pattern covers all federal and state courts.

Federal civil cases use a two-digit year, then "cv," then a case number. Criminal cases use "cr" in the same spot. State courts vary by region with no shared standard.

Bar admission numbers are state-specific. California uses a numeric format. New York uses a registry format. Texas uses its own bar ID format. No national format exists.

Client matter tags are firm-specific. Each firm builds its own format; common schemes include year-client-matter, practice group codes, and sequential IDs.

Standard PII tools cannot know any of these without custom setup.

The gap is real. Consider a document tool that receives a file with full matter context, including docket codes that link to public records and client matter tags. The tool reports that PII was removed, and the names and emails were. The privilege-sensitive IDs were not.

The Legal AI Startup Case

A legal AI startup builds a document tool for law firms. The product scans discovery files, spots relevant clauses, and flags potentially privileged content. Enterprise clients require redaction of client matter tags alongside standard PII before processing.

The compliance blocker: the AI tool processes file data containing client matter tags. Combined with public court filings, those tags could allow matter identification. Enterprise legal ops teams flag this as unacceptable.

Before custom entity detection:

Deal review finds the compliance gap
3+ month engineering queue for a custom NLP model
Enterprise contract on hold

With a custom entity API:

Compliance officer defines the matter tag format at onboarding
Pattern tested against sample files: 2 days
Custom entity added to the pipeline: 1 more day
Enterprise contract proceeds

The gap is 3 days versus 3+ months. The work is pattern setup and API integration. No NLP model training required.

Common Formats by Category

Federal court dockets:

Federal civil cases use: two-digit year + "cv" + a 4–6 digit case number. Example: 24-cv-12345. Criminal cases use "cr" in the same spot. Bankruptcy cases use "bk." Appeals use a two-digit year and a 4–5 digit number that varies by circuit.
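
The district-court formats above can be sketched as one regular expression. This is a minimal illustration that assumes the simplified year-type-number shape described here; the names `FEDERAL_DOCKET` and `find_federal_dockets` are hypothetical, and the appellate format is left out because a bare year-and-number pattern would match many unrelated strings without extra context.

```python
import re

# Assumed shape from the text: YY-(cv|cr|bk)-NNNN to NNNNNN.
# Real dockets may carry extra prefixes or suffixes that this sketch ignores.
FEDERAL_DOCKET = re.compile(
    r"\b(?P<year>\d{2})-(?P<kind>cv|cr|bk)-(?P<number>\d{4,6})\b",
    re.IGNORECASE,
)


def find_federal_dockets(text: str) -> list[str]:
    """Return every substring that matches the assumed federal docket shape."""
    return [m.group(0) for m in FEDERAL_DOCKET.finditer(text)]


if __name__ == "__main__":
    print(find_federal_dockets("See 24-cv-12345 and 23-CR-0042."))
```

The word boundaries keep the pattern from firing inside longer digit runs, so a four-digit year such as 2024-cv-12345 is not matched by this sketch and would need its own pattern if a court uses it.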

State court formats (examples):

California Superior Court uses a six-digit prefix system. New York uses an index format with year and sequence. Texas uses a cause format with year, sequence, and court code.

Client matter tags (typical firm formats):

Three common patterns appear across most firms:

Two-digit year, client ID, matter sequence (e.g., 24-ACME-001)
Practice group initials, year, then a four-digit sequence (e.g., LIT240042)
Client prefix with a six-digit ID (e.g., SMITHCO-000123)
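
As a rough sketch, the three example formats above could be expressed as regular expressions and applied in one redaction pass. The letter-count ranges, the `MATTER_TAG_PATTERNS` name, and the `[MATTER_TAG]` placeholder are assumptions for illustration; a real firm's format would be confirmed at onboarding.

```python
import re

# One pattern per example format in the list above. Length limits are guesses.
MATTER_TAG_PATTERNS = {
    # e.g. 24-ACME-001
    "year_client_sequence": re.compile(r"\b\d{2}-[A-Z][A-Z0-9]{1,9}-\d{3}\b"),
    # e.g. LIT240042 (group initials + two-digit year + four-digit sequence)
    "group_year_sequence": re.compile(r"\b[A-Z]{2,4}\d{6}\b"),
    # e.g. SMITHCO-000123
    "client_prefix_id": re.compile(r"\b[A-Z]{3,12}-\d{6}\b"),
}


def redact_matter_tags(text: str, placeholder: str = "[MATTER_TAG]") -> str:
    """Replace every match of any configured matter tag pattern."""
    for pattern in MATTER_TAG_PATTERNS.values():
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    print(redact_matter_tags("Re: 24-ACME-001, LIT240042, SMITHCO-000123"))
```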

US bar admission IDs:

Most states use 4–8 digit numbers, sometimes with a state-level prefix. USDC admission IDs vary by district and do not follow a shared format.
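
A bare 4–8 digit pattern would also match ZIP codes, amounts, and page numbers, so one possible approach is to require a nearby label. The sketch below assumes labels such as "Bar No." or "State Bar Number"; the label list and the optional two-letter prefix are illustrative guesses, not a survey of actual state conventions.

```python
import re

# Hypothetical label set; extend per jurisdiction during setup.
BAR_ID = re.compile(
    r"\b(?:State\s+)?Bar\s+(?:No\.?|Number|ID|#)\s*:?\s*"
    r"(?P<id>(?:[A-Z]{2}[- ]?)?\d{4,8})\b",
    re.IGNORECASE,
)


def find_bar_ids(text: str) -> list[str]:
    """Return bar IDs that appear directly after an assumed label."""
    return [m.group("id") for m in BAR_ID.finditer(text)]


if __name__ == "__main__":
    print(find_bar_ids("Jane Roe (State Bar No. 123456), fee $250000"))
```

Anchoring on a label trades recall for precision: an unlabeled bar number in a signature block would be missed, which is one reason the human review layers stay in place.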

Privilege-Aware Processing Pipeline

For document review AI, a layered pipeline handles the full scope.

Layer 1 — Standard PII detection

Names, emails, phone numbers, addresses, SSNs. High accuracy. Well-established tooling handles this layer well.

Layer 2 — Custom code detection

Matter codes, docket IDs, bar IDs. Firm-specific patterns set at onboarding. This layer fills the gap that standard tools miss.

Layer 3 — Privilege review (human)

After automated detection, an attorney reviews flagged markers such as ATTORNEY-CLIENT headers, WORK PRODUCT labels, and CONFIDENTIAL markings. Human review at this layer is not optional.

Layer 4 — Context exception review

Some identifiers, such as public record docket numbers, pose no privilege risk, while client matter tags do. Telling the two apart needs attorney judgment. It cannot be automated.

Layers 1 and 2 handle high-volume work. Layers 3 and 4 keep attorney judgment where privilege decisions belong. For what happens when privilege is already waived by AI tool use, see attorney-client privilege and AI.
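
The split between automated and human layers can be sketched as follows. This is an illustration under stated assumptions, not a production design: the two regexes stand in for real layer 1 and layer 2 detectors, and `ReviewItem` and `process` are hypothetical names. Layers 1 and 2 redact; layer 3 markers are only flagged, and every document is queued for attorney review.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ReviewItem:
    doc_id: str
    redacted_text: str
    findings: list = field(default_factory=list)  # (layer, label, matched text)
    needs_attorney_review: bool = True  # layers 3 and 4 are never skipped


# Stand-in detectors; a real system would plug in full detection here.
STANDARD_PII = {"EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")}
CUSTOM_CODES = {"DOCKET": re.compile(r"\b\d{2}-(?:cv|cr|bk)-\d{4,6}\b", re.IGNORECASE)}
PRIVILEGE_MARKERS = re.compile(r"ATTORNEY-CLIENT|WORK PRODUCT|CONFIDENTIAL")


def process(doc_id: str, text: str) -> ReviewItem:
    findings = []
    # Layers 1 and 2: detect, record, and redact automatically.
    for layer, detectors in ((1, STANDARD_PII), (2, CUSTOM_CODES)):
        for label, pattern in detectors.items():
            findings.extend((layer, label, m.group(0)) for m in pattern.finditer(text))
            text = pattern.sub(f"[{label}]", text)
    # Layer 3: flag markers for the attorney; the code makes no privilege decision.
    findings.extend(
        (3, "PRIVILEGE_MARKER", m.group(0)) for m in PRIVILEGE_MARKERS.finditer(text)
    )
    return ReviewItem(doc_id, text, findings)


if __name__ == "__main__":
    item = process("d1", "CONFIDENTIAL: email jane@firm.com re 24-cv-12345")
    print(item.redacted_text)
    print(item.findings)
```

Recording the original matched text in `findings` gives the reviewing attorney what they need for the layer 4 call, such as deciding that a public docket number can be restored.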

Setup for Developers

Onboarding configuration

Collect client matter tag formats during enterprise onboarding. Each firm uses a different format. Store them as firm-specific custom entities. Apply to all processing for that account.
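
One way to store firm-specific formats is a small per-account registry. The sketch below is hypothetical (`FirmEntityRegistry`, `register`, and `redact` are illustrative names, not an existing API) and keeps patterns in memory; a real service would persist them with the account.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomEntity:
    name: str
    pattern: re.Pattern


class FirmEntityRegistry:
    """Maps an account ID to the custom entities defined at onboarding."""

    def __init__(self) -> None:
        self._by_account: dict[str, list[CustomEntity]] = {}

    def register(self, account_id: str, name: str, regex: str) -> None:
        # re.compile raises re.error on an invalid pattern, so bad input
        # fails at onboarding rather than during processing.
        entity = CustomEntity(name, re.compile(regex))
        self._by_account.setdefault(account_id, []).append(entity)

    def redact(self, account_id: str, text: str) -> str:
        # Only the patterns registered for this account are applied.
        for entity in self._by_account.get(account_id, []):
            text = entity.pattern.sub(f"[{entity.name}]", text)
        return text


if __name__ == "__main__":
    registry = FirmEntityRegistry()
    registry.register("firm-a", "MATTER_TAG", r"\b\d{2}-[A-Z]+-\d{3}\b")
    print(registry.redact("firm-a", "Matter 24-ACME-001"))
```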

Default presets

Pre-built presets cover common contexts without custom work:

"Federal Court Documents" — federal docket patterns for civil, criminal, and bankruptcy
"State Court Documents (CA/NY/TX)" — state-specific formats for three major jurisdictions
"Internal Operations" — matter tag plus standard PII
"Outside Counsel Portal" — bill reference, matter tag, and standard PII

Audit documentation

Processing records should show that custom codes were included in each detection pass. This supports work product protection for the analysis method.
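
A processing record along these lines might look like the sketch below. The field names and the `build_audit_record` helper are assumptions for illustration. The record lists every custom entity that was applied, including those with zero matches, so it shows which patterns were part of the pass rather than only which ones fired.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(doc_id, text, applied_entities, match_counts):
    """Build a JSON-serializable record of one detection pass."""
    return {
        "doc_id": doc_id,
        # A hash identifies the input without storing its content.
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "custom_entities_applied": sorted(applied_entities),
        "match_counts": dict(match_counts),
    }


if __name__ == "__main__":
    record = build_audit_record(
        "d1", "Matter 24-ACME-001", ["MATTER_TAG", "DOCKET"], {"MATTER_TAG": 1}
    )
    print(json.dumps(record, indent=2))
```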

For a broader look at how redaction costs scale in litigation, see e-discovery PII automation and legal review cost reduction.

Conclusion

Privilege-sensitive IDs are as risky as standard PII — often more so. Tools that miss docket codes and matter tags leave a real gap in document workflows.

The fix is not an NLP model. It is pattern setup. For developers building law firm tools, that is the difference between a 3-day fix and a 3-month project. For law firms, it is the difference between defensible AI-assisted review and a privilege waiver risk.

When This Approach Has Limits

The layered pipeline above is sound, and keeping attorney judgment at layers 3 and 4 is the right design, but three limits apply.

Privilege is a legal standard, not a tool output. Layers 1 and 2 detect identifiers; they do not decide what is privileged. Privilege turns on the nature of the communication, the parties, waiver, and the crime-fraud exception — questions only a lawyer can answer. A document can be stripped of every docket code and matter tag and still disclose privileged substance in its narrative, or carry no identifier yet still be protected. Treat detection as input to attorney review, never as a substitute for it, and keep the human layers mandatory rather than optional under time pressure.

Firm-specific and court-specific formats need configuration and held-out testing. Matter tags, state docket conventions, and bar ID schemes vary by firm and jurisdiction with no shared standard, so a pattern set onboarded for one client will miss another's formats. Worse, the same string shape recurs across years of files, so an unconfigured legacy matter format becomes a systematic miss across an entire production. Validate each firm's patterns against a held-back sample before processing, and re-test when a new practice group or jurisdiction enters scope.
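
Held-out validation can be as simple as measuring recall against a labeled sample that was not used to write the pattern. The sketch below assumes the sample is a list of (text, expected identifiers) pairs; `recall_on_holdout` is a hypothetical helper name.

```python
import re


def recall_on_holdout(pattern, labeled_samples):
    """Return (recall, missed) for a pattern over (text, expected_ids) pairs."""
    expected_total = 0
    missed = []
    for text, expected_ids in labeled_samples:
        found = {m.group(0) for m in pattern.finditer(text)}
        for identifier in expected_ids:
            expected_total += 1
            if identifier not in found:
                missed.append(identifier)
    if expected_total == 0:
        return 1.0, missed
    return (expected_total - len(missed)) / expected_total, missed


if __name__ == "__main__":
    pattern = re.compile(r"\b\d{2}-[A-Z]+-\d{3}\b")
    samples = [
        ("Re 24-ACME-001", ["24-ACME-001"]),
        ("Legacy M/1998/0042", ["M/1998/0042"]),  # older format, not configured
    ]
    print(recall_on_holdout(pattern, samples))
```

The `missed` list matters more than the score: each entry is a format, such as the legacy tag in the example, that would otherwise go undetected across a whole production.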

Detected identifiers can still re-identify in combination. Removing a client matter tag does not anonymize a document when the parties, dates, jurisdiction, and case facts remain and map to a single public filing. That makes the output pseudonymized rather than anonymous, which carries the confidentiality and privilege consequences this article is built to avoid. Where the underlying matter is identifiable from context, the remedy is attorney review and possibly withholding, not a broader regex.
