The DACH Problem: Multilingual Documents
Scenario: Swiss Invoice
Document language: Mixed (EN + DE + FR + IT)
Invoice from:
English: Name: John Smith
Deutsch: Name: John Smith
Français: Nom: John Smith
Italiano: Nome: John Smith
Address:
EN: Helvetiastrasse 123, 8000 Zürich
DE: Helvetiastrasse 123, 8000 Zürich (same)
FR: Rue Helvetia 123, 8000 Zurich
IT: Via Helvetia 123, 8000 Zurigo
Tax IDs:
DE (Steuer-ID): 12 345 678 901
FR (SIREN): 123 456 789
IT (Codice Fiscale): ABCDEF12G34H567I
Challenge #1: Language Detection
Wrong approach:
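from langdetect import detect_langs
# Two field labels from different languages in one short string
text = "Name: John Smith\nAdresse: Helvetiastrasse 123"
detect_langs(text)
# Illustrative result: [de:0.63, en:0.37] ← Ambiguous!
# Scores also vary between runs unless langdetect's DetectorFactory.seed is set.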
anonym.legal approach:
- Per-sentence language detection
- Context awareness
- Named entity language hints
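The three ideas above can be illustrated with a minimal sketch. It is not anonym.legal's implementation: the `HINTS` vocabulary and the `detect_line` and `detect_lines` helpers are hypothetical, and the word lists are deliberately tiny. Each line is scored on its own, field labels act as language hints, and a line with no clear winner inherits the language of the line before it.

```python
import re

# Hypothetical, deliberately tiny hint lists. A real system would use much
# larger vocabularies or a trained per-sentence classifier.
HINTS = {
    "en": {"name", "address", "invoice", "from"},
    "de": {"name", "adresse", "rechnung", "von"},
    "fr": {"nom", "adresse", "facture", "rue"},
    "it": {"nome", "indirizzo", "fattura", "via"},
}


def detect_line(line, previous=None):
    """Return the best-scoring language for one line.

    Falls back to `previous` when no hint matches or several languages tie.
    """
    tokens = set(re.findall(r"[a-zà-ÿ]+", line.lower()))
    scores = {lang: len(tokens & words) for lang, words in HINTS.items()}
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return previous  # context awareness: inherit from the preceding line
    return winners[0]


def detect_lines(lines):
    """Tag every line with a language, carrying context forward."""
    result, previous = [], None
    for line in lines:
        previous = detect_line(line, previous)
        result.append(previous)
    return result


sample = ["Rechnung von:", "Name: John Smith", "Nom: John Smith", "Nome: John Smith"]
tags = detect_lines(sample)
```

Here "Name: John Smith" is ambiguous between English and German on its own, so it takes the language of the preceding German line, while "Nom" and "Nome" are enough to switch to French and Italian.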
Challenge #2: Entity Type Varies by Language
| Country | ID Type | Format | Language |
|---|---|---|---|
| Germany | Steuer-ID | 11 digits | DE |
| Austria | SV-Nummer | 10 digits | DE |
| Switzerland | AHV | 13 digits | DE/FR/IT |
| France | SIREN | 9 digits | FR |
| Italy | Codice Fiscale | 16 chars | IT |
anonym.legal Solution
Step 1: Segment by Language
Segment documents into language blocks, each processed independently
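As a rough sketch of this step, assume each line has already been tagged with a language code. Consecutive lines that share a tag can then be merged into one block. The `segment_by_language` helper and the `(lang, text)` input shape are assumptions made for illustration, not a documented anonym.legal interface.

```python
from itertools import groupby


def segment_by_language(tagged_lines):
    """Merge consecutive (lang, text) pairs into (lang, block_text) blocks."""
    blocks = []
    for lang, group in groupby(tagged_lines, key=lambda pair: pair[0]):
        blocks.append((lang, "\n".join(text for _, text in group)))
    return blocks


tagged = [
    ("de", "Rechnung von:"),
    ("de", "Name: John Smith"),
    ("fr", "Nom: John Smith"),
    ("it", "Via Helvetia 123, 8000 Zurigo"),
]
blocks = segment_by_language(tagged)
```

Because `groupby` only merges adjacent items, a language that reappears later in the document produces a new block instead of being folded into the earlier one, which preserves document order.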
Step 2: Apply Language-Specific Detection
Use language-specific entity types and regex patterns
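One way to picture this step is a pattern registry keyed by language code, so that only the patterns belonging to a block's language are run against it. The `PATTERNS` table and `find_entities` function below are hypothetical, and the regexes only encode the digit and character counts from the table in Challenge #2, not full validation rules.

```python
import re

# Hypothetical registry: language code -> entity label -> compiled pattern.
# The patterns check shape only (lengths and character classes).
PATTERNS = {
    "de": {"STEUER_ID": re.compile(r"\b\d{2} ?\d{3} ?\d{3} ?\d{3}\b")},
    "fr": {"SIREN": re.compile(r"\b\d{3} ?\d{3} ?\d{3}\b")},
    "it": {
        "CODICE_FISCALE": re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b")
    },
}


def find_entities(lang, text):
    """Return (label, matched_text) pairs using only the patterns for `lang`."""
    found = []
    for label, pattern in PATTERNS.get(lang, {}).items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


german = find_entities("de", "Steuer-ID: 12 345 678 901")
french = find_entities("fr", "SIREN: 123 456 789")
italian = find_entities("it", "Codice Fiscale: ABCDEF12G34H567I")
```

Scoping patterns by language matters because the shapes overlap: the 9-digit SIREN pattern would also match the tail of an 11-digit Steuer-ID if it were applied to a German block.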
Coverage: 45 Countries, 48 Languages
DACH Region Full Coverage
Germany
- Steuer-ID (11 digits, ISO 7064 MOD 11,10 check digit)
- Krankenversicherung
- Personalausweis
Austria
- SV-Nummer (10 digits, weighted modulo 11 check digit)
- Personalausweis
Switzerland
- AHV-Nummer (13 digits, prefix 756, EAN-13 check digit)
- Business Identification Numbers
- Trilingual support (DE/FR/IT)
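Check digits are what separate a real identifier from any digit run of the right length. The sketch below assumes the publicly documented schemes: ISO 7064 MOD 11,10 for the German Steuer-ID, a weighted modulo 11 sum for the Austrian SV-Nummer (the fourth digit is the check digit), and EAN-13 for the Swiss AHV-Nummer. The function names are hypothetical, only the checksum is verified, and other structural rules of each identifier are ignored.

```python
import re


def _digits(value):
    """Strip separators (spaces, dots) and return the digits as integers."""
    return [int(c) for c in re.sub(r"\D", "", value)]


def valid_steuer_id(value):
    """German Steuer-ID: 11 digits, last digit per ISO 7064 MOD 11,10."""
    d = _digits(value)
    if len(d) != 11:
        return False
    product = 10
    for digit in d[:10]:
        total = (digit + product) % 10 or 10  # a sum of 0 counts as 10
        product = (total * 2) % 11
    return (11 - product) % 10 == d[10]


def valid_sv_nummer(value):
    """Austrian SV-Nummer: 10 digits, 4th digit is a weighted mod 11 check."""
    d = _digits(value)
    if len(d) != 10:
        return False
    weights = [3, 7, 9, 0, 5, 8, 4, 2, 1, 6]  # weight 0 skips the check digit
    remainder = sum(n * w for n, w in zip(d, weights)) % 11
    return remainder == d[3]  # a remainder of 10 can never match a digit


def valid_ahv(value):
    """Swiss AHV-Nummer: 13 digits, prefix 756, EAN-13 check digit."""
    d = _digits(value)
    if len(d) != 13 or d[:3] != [7, 5, 6]:
        return False
    total = sum(n * (3 if i % 2 else 1) for i, n in enumerate(d[:12]))
    return (10 - total % 10) % 10 == d[12]
```

With these functions, `valid_steuer_id("12 345 678 901")` returns `False` because that placeholder value has the right length but the wrong check digit, while `valid_ahv("756.9217.0769.85")` returns `True`.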