Detect PII in 48 Languages
3-layer NLP hybrid (spaCy + Stanza + XLM-RoBERTa). 285+ entity types across 47 countries. Validates government IDs, tax numbers, phone formats, and more. All 48 languages, same accuracy.
Hybrid NLP Architecture
Three complementary models for maximum coverage:
spaCy (24 Languages)
Production-grade NER (Named Entity Recognition). Fast, memory-efficient, highly accurate for standard entities.
Languages: English, German, Dutch, Catalan, Danish, Finnish, French, Greek, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Ukrainian, Chinese, Bulgarian*, Croatian.
Strengths: Balanced speed vs accuracy. Standard entity types (PER, ORG, LOC, DATE, etc.). Pretrained models available.
Use case: Primary engine for high-resource languages with large training datasets.
* Some languages are supported by multiple NLP engines for optimal accuracy
Stanza (6 Languages)
Stanford's deep NLP. Slower but more accurate for morphologically rich languages. Provides tokenization, part-of-speech tagging, and dependency parsing.
Languages: Bulgarian, Hungarian, Hebrew, Vietnamese, Afrikaans, Armenian.
Strengths: Handles agglutinative languages (e.g., Hungarian). Complex morphology (Hebrew). High accuracy despite smaller training data.
Use case: Languages with complex word structure or limited public datasets.
XLM-RoBERTa (18 Languages)
Facebook's cross-lingual model. Single model covers 100+ languages. Fine-tuned for PII detection.
Languages: Arabic, Hindi, Turkish, Czech, Slovak, Indonesian, Thai, Persian, Serbian, Latvian, Estonian, Marathi, Bengali, Urdu, Swahili, Tagalog, Icelandic, Basque.
Strengths: Zero-shot transfer learning. Works for low-resource languages. Single model, not language-specific.
Use case: Low-resource or emerging languages without dedicated spaCy models.
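The split across the three engines can be pictured as a simple lookup from language code to engine. The following is a minimal sketch, not the product's actual routing logic: the names `SPACY_LANGS`, `STANZA_LANGS`, `XLMR_LANGS`, and `engines_for` are hypothetical, and the ISO 639-1 codes are an assumed encoding of the language lists above.

```python
# Hypothetical routing table built from the language lists above.
# Codes are ISO 639-1 ("nb" is assumed for Norwegian).
SPACY_LANGS = {
    "en", "de", "nl", "ca", "da", "fi", "fr", "el", "it", "ja", "ko", "lt",
    "nb", "pl", "pt", "ro", "ru", "sl", "es", "sv", "uk", "zh", "bg", "hr",
}
STANZA_LANGS = {"bg", "hu", "he", "vi", "af", "hy"}
XLMR_LANGS = {
    "ar", "hi", "tr", "cs", "sk", "id", "th", "fa", "sr", "lv", "et", "mr",
    "bn", "ur", "sw", "tl", "is", "eu",
}

def engines_for(lang: str) -> list[str]:
    """Return every engine listed for a language code, in a fixed order."""
    code = lang.strip().lower()
    engines = []
    if code in SPACY_LANGS:
        engines.append("spacy")
    if code in STANZA_LANGS:
        engines.append("stanza")
    if code in XLMR_LANGS:
        engines.append("xlm-roberta")
    return engines  # empty list means the language is not in any list

if __name__ == "__main__":
    for code in ("de", "bg", "tr", "xx"):
        print(code, engines_for(code))
```

Returning a list rather than a single engine reflects the footnote that some languages (Bulgarian here) are handled by more than one engine; how the results are merged is not specified in the text.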
All 48 Languages Supported
All entities available in all languages.
47 Countries: ID Validation
Each country's ID format is validated using its official checksum algorithm:
European Union (27 Member States)
Beyond EU (20 Countries)
Validation Algorithms Supported
Luhn Algorithm
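Used for: Canada SIN, Sweden Personnummer, credit card PAN (last digit). US SSNs carry no check digit and can only be validated by format rules.
Luhn is one of the most widely used check digit algorithms. It catches every single-digit error and every adjacent transposition except 09 ↔ 90.
Variants: Standard Luhn (ISO/IEC 7812-1), Luhn-like weighted mod-10 schemes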
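As an illustration of the Luhn check described above, here is a minimal sketch. The function name `luhn_valid` is hypothetical and this is not presented as the product's implementation; it only shows the standard doubling rule.

```python
def luhn_valid(number: str) -> bool:
    """Check a digit string against the Luhn algorithm.

    Spaces and hyphens are ignored; anything else non-numeric fails.
    """
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

if __name__ == "__main__":
    print(luhn_valid("046 454 286"))   # sample number in Canada SIN format
    print(luhn_valid("79927398713"))   # classic Luhn test number
    print(luhn_valid("79927398710"))   # last digit altered
```

A passing Luhn check only shows the number is internally consistent; it does not prove the number was ever issued.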
Modulus-11
Used for: EU tax IDs (DE, FR, ES, IT, NL, PL, PT, RO, GR), Denmark CPR, Norway FNR, Thailand ID.
Verifies an ID against a weighted checksum: each digit is multiplied by a position weight and the sum is reduced modulo 11.
Variants: ISO 7064 Mod 11-2, weighted 10-2, ISO 7064 Mod 37
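A weighted modulus-11 check can be sketched in a few lines. The example below uses the publicly documented Danish CPR weights (4, 3, 2, 7, 6, 5, 4, 3, 2, 1) as one instance; `weighted_mod11_valid` and `CPR_WEIGHTS` are hypothetical names, and other countries use different weights and remainder rules.

```python
from typing import Sequence

# Position weights of the classic Danish CPR modulus-11 check.
CPR_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)

def weighted_mod11_valid(number: str, weights: Sequence[int]) -> bool:
    """Return True if the weighted digit sum is divisible by 11."""
    if len(number) != len(weights):
        return False
    if not (number.isascii() and number.isdigit()):
        return False
    total = sum(int(char) * weight for char, weight in zip(number, weights))
    return total % 11 == 0

if __name__ == "__main__":
    # 0*4 + 7*3 + 0*2 + 7*7 + 6*6 + 1*5 + 4*4 + 2*3 + 8*2 + 5*1 = 154 = 14 * 11
    print(weighted_mod11_valid("0707614285", CPR_WEIGHTS))
    print(weighted_mod11_valid("0707614286", CPR_WEIGHTS))
```

Keeping the weights as a parameter lets the same function serve several national formats, as long as each one uses the "sum divisible by 11" rule; formats with separate check digit formulas need their own logic.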
Modulus-97
Used for: IBAN (International Bank Account Number). The 13-digit Swiss AHV number uses an EAN-13 (mod 10) check digit instead.
Strong error detection. Catches all single-digit and transposition errors for financial accounts.
ISO 7064 Mod 97-10, used in 135+ countries
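The IBAN check works by moving the first four characters to the end, replacing letters with numbers (A=10 through Z=35), and testing that the result leaves remainder 1 when divided by 97. A minimal sketch follows; `iban_valid` is a hypothetical name, and country-specific length and structure rules are deliberately left out.

```python
def iban_valid(iban: str) -> bool:
    """Check the ISO 7064 Mod 97-10 checksum of an IBAN.

    Only the checksum is verified; per-country lengths are not.
    """
    cleaned = iban.replace(" ", "").upper()
    if not (15 <= len(cleaned) <= 34):
        return False
    if not (cleaned.isascii() and cleaned.isalnum()):
        return False
    # Move country code and check digits to the end.
    rearranged = cleaned[4:] + cleaned[:4]
    # int(ch, 36) maps 0-9 to themselves and A-Z to 10-35.
    numeric = "".join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1

if __name__ == "__main__":
    print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # widely published sample
    print(iban_valid("GB82 WEST 1234 5698 7654 23"))  # last two digits swapped
```

Python's arbitrary-precision integers make the single `% 97` possible; in languages with fixed-width integers the remainder is usually computed piecewise.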
Verhoeff Algorithm
Used for: India Aadhaar (check digit), specialized financial/security IDs.
One of the strongest single-digit error detection algorithms. Catches 100% of single-digit and adjacent transposition errors.
Rarely used; very strong error detection
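The Verhoeff check is table-driven: a multiplication table for the dihedral group D5 and a position-dependent permutation table. The sketch below uses the standard published tables; `verhoeff_valid` is a hypothetical name and this is an illustration, not the product's code.

```python
# Multiplication table of the dihedral group D5.
D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6),
    (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8),
    (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2),
    (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4),
    (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
# Permutation table, indexed by (position mod 8, digit).
P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
    (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
    (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)

def verhoeff_valid(number: str) -> bool:
    """Return True if the digit string (check digit last) passes Verhoeff."""
    if not number or not (number.isascii() and number.isdigit()):
        return False
    checksum = 0
    # Process digits right to left; position 0 is the check digit.
    for position, char in enumerate(reversed(number)):
        checksum = D[checksum][P[position % 8][int(char)]]
    return checksum == 0

if __name__ == "__main__":
    print(verhoeff_valid("2363"))  # textbook example: 236 with check digit 3
    print(verhoeff_valid("2633"))  # adjacent digits swapped
```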
Modulus-10
Used for: South Africa national ID, some payment card verification.
Simple but effective. Brazil CPF, by contrast, derives its two check digits from weighted sums reduced modulo 11.
Weighted mod-10 variants
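The two-check-digit scheme of the Brazil CPF can be sketched as follows. Note that the publicly documented CPF rule reduces each weighted sum modulo 11. The names `cpf_check_digit` and `cpf_valid` are hypothetical, and rejecting all-identical digits is a common convention assumed here rather than something stated in the text.

```python
def cpf_check_digit(digits: list[int]) -> int:
    """Compute one CPF check digit from the digits that precede it.

    Weights run from len(digits) + 1 down to 2.
    """
    weights = range(len(digits) + 1, 1, -1)
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder

def cpf_valid(cpf: str) -> bool:
    """Validate an 11-digit CPF, with or without dots and hyphen."""
    cleaned = cpf.replace(".", "").replace("-", "")
    if len(cleaned) != 11 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    # Assumption: strings like 111.111.111-11 pass the arithmetic
    # but are conventionally treated as invalid.
    if cleaned == cleaned[0] * 11:
        return False
    digits = [int(char) for char in cleaned]
    first = cpf_check_digit(digits[:9])
    second = cpf_check_digit(digits[:9] + [first])
    return digits[9] == first and digits[10] == second

if __name__ == "__main__":
    print(cpf_valid("529.982.247-25"))  # commonly used sample number
    print(cpf_valid("529.982.247-26"))  # second check digit altered
```

Because the second check digit covers the first, an error in either one is caught by the comparison at the end.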
Custom Formats
Used for: China Resident ID (GB 11643), Spain NIF, Mexico RFC, Malaysia IC.
Country-specific rules. Some use alphanumeric checksums or position-based validation.
Regex + checksum combinations
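A regex-plus-checksum combination can be illustrated with the personal form of the Spanish NIF (the DNI): eight digits followed by a control letter chosen by the number modulo 23. This is a minimal sketch; `dni_valid` is a hypothetical name, and NIE numbers and company NIFs follow other rules that are not covered.

```python
import re

# Eight ASCII digits followed by one control letter.
DNI_PATTERN = re.compile(r"([0-9]{8})([A-Z])")
# Control letters indexed by (number mod 23).
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

def dni_valid(value: str) -> bool:
    """Step 1: regex checks the shape. Step 2: checksum checks the letter."""
    match = DNI_PATTERN.fullmatch(value.strip().upper())
    if match is None:
        return False
    number, letter = match.groups()
    return DNI_LETTERS[int(number) % 23] == letter

if __name__ == "__main__":
    print(dni_valid("12345678Z"))  # 12345678 mod 23 = 14, letter Z
    print(dni_valid("12345678A"))  # right shape, wrong control letter
```

The regex alone would accept both sample values; only the checksum step separates the consistent one from the other, which is the point of combining the two.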
Why Multilingual PII Detection Matters
40-60% Miss Rate
Single-engine tools trained on English miss most non-English PII. A healthcare system processing German patient records may redact only 30-40% of actual PII types.
anonym.legal: 3-layer hybrid detects 98%+ across all languages.
Compliance by Country
GDPR, HIPAA, PIPL, LGPD each define entity types for their jurisdictions. A tech company processing EU + India data needs different entity sets per region.
anonym.legal: All 47 countries, all regulations, one platform.
ID Format Validation
Not every 11- or 12-digit number is a valid ID. German tax IDs (11 digits) end in a check digit; Indian Aadhaar numbers (12 digits) use the Verhoeff algorithm. Validating the checksum reduces false positives.
anonym.legal: Validates all 47 country ID formats with correct algorithms.
Team Complexity
Supporting 48 languages manually requires multilingual staff, custom NLP expertise, and ongoing model maintenance. That is cost-prohibitive for most organizations.
anonym.legal: All 48 languages included. Same accuracy everywhere.