Global

Detect PII in 48 Languages

3-layer NLP hybrid (spaCy + Stanza + XLM-RoBERTa). 285+ entity types across 47 countries. Validates government IDs, tax numbers, phone formats, and more. All 48 languages, same accuracy.

Hybrid NLP Architecture

Three complementary models for maximum coverage:

spaCy (24 Languages)

Production-grade NER (Named Entity Recognition). Fast, memory-efficient, highly accurate for standard entities.

Languages: English, German, Dutch, Catalan, Danish, Finnish, French, Greek, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Ukrainian, Chinese, Bulgarian*, Croatian.

Strengths: Balanced speed vs accuracy. Standard entity types (PER, ORG, LOC, DATE, etc.). Pretrained models available.

Use case: Primary engine for high-resource languages with large training datasets.

* Some languages are supported by multiple NLP engines for optimal accuracy

Stanza (6 Languages)

Stanford's deep NLP. Slower but more accurate for morphologically rich languages. Provides tokenization, part-of-speech, dependency parsing.

Languages: Bulgarian, Hungarian, Hebrew, Vietnamese, Afrikaans, Armenian.

Strengths: Handles agglutinative languages (e.g., Hungarian) and complex morphology (e.g., Hebrew). High accuracy despite smaller training data.

Use case: Languages with complex word structure or limited public datasets.

XLM-RoBERTa (18 Languages)

Meta AI's (formerly Facebook AI) cross-lingual model. A single model pretrained on 100 languages. Fine-tuned for PII detection.

Languages: Arabic, Hindi, Turkish, Czech, Slovak, Indonesian, Thai, Persian, Serbian, Latvian, Estonian, Marathi, Bengali, Urdu, Swahili, Tagalog, Icelandic, Basque.

Strengths: Zero-shot transfer learning. Works for low-resource languages. Single model, not language-specific.

Use case: Low-resource or emerging languages without dedicated spaCy models.
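The division of labour between the three engines can be pictured as a lookup from language code to engine. The following is a minimal sketch, not anonym.legal's implementation: the `ENGINE_BY_LANGUAGE` table, the `pick_engine` helper, the ISO 639-1 codes, and the fallback to XLM-RoBERTa for unlisted languages are all assumptions made for illustration.

```python
# Hypothetical routing table (assumption): a few languages from each engine's list.
ENGINE_BY_LANGUAGE = {
    "en": "spacy", "de": "spacy", "fr": "spacy", "ja": "spacy",
    "hu": "stanza", "he": "stanza", "vi": "stanza",
    "ar": "xlm-roberta", "hi": "xlm-roberta", "tr": "xlm-roberta",
}


def pick_engine(lang_code: str) -> str:
    """Return the engine name for a language code.

    Unlisted languages fall back to the multilingual model (an assumption
    of this sketch, based on its "single model" description).
    """
    return ENGINE_BY_LANGUAGE.get(lang_code.lower(), "xlm-roberta")


print(pick_engine("de"))  # engine for a spaCy language
print(pick_engine("hu"))  # engine for a Stanza language
```

A table-driven design like this keeps the per-language decision in data rather than code, so adding a language is a one-line change.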

Yes. The anonym.legal Chrome Extension automatically detects and anonymizes PII before it reaches ChatGPT, Claude, Gemini, Copilot, or any AI tool. Your original data never leaves your browser.

Data masking replaces sensitive values with realistic-looking substitutes (e.g., replacing a real name with a fake name). Data anonymization is broader: it includes masking, redaction, hashing, encryption, and custom methods. anonym.legal supports all 6 methods.
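To make the difference between the methods concrete, here is a hedged sketch of three of them (masking, redaction, hashing) using only the Python standard library. The function names, the `FAKE_NAMES` list, and the salt handling are hypothetical and do not describe anonym.legal's API.

```python
import hashlib

# Illustrative substitutes (assumption); a real system would use a larger pool.
FAKE_NAMES = ["Alex Morgan", "Sam Taylor", "Jordan Lee"]


def mask_name(name: str) -> str:
    """Masking: swap a real name for a realistic-looking substitute.

    The substitute is chosen deterministically from a hash of the input,
    so the same input always maps to the same fake name.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return FAKE_NAMES[digest[0] % len(FAKE_NAMES)]


def redact(value: str, label: str = "PERSON") -> str:
    """Redaction: drop the value entirely and leave only a type label."""
    return f"[{label}]"


def hash_value(value: str, salt: str) -> str:
    """Hashing: replace the value with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


print(mask_name("Maria Schmidt"))
print(redact("Maria Schmidt"))
print(hash_value("Maria Schmidt", salt="example-salt"))
```

Masking keeps the text readable, redaction removes the most information, and hashing lets equal values be matched across records without revealing them.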

Yes. Upload PDF, Word, Excel, or text files for batch anonymization. The platform processes 5,000+ documents per batch with full audit trail. OCR support detects PII in scanned documents.

48 languages with native NLP models: 24 spaCy models, 6 Stanza models, and 18 Transformer models. Each language has region-specific entity detection (e.g., Steuer-ID for German, NIR for French, Codice Fiscale for Italian).

Yes. The REST API and MCP Server allow you to anonymize PII in your data pipeline with 3 lines of code. Sub-200ms latency. Python and Node.js SDKs available. Bearer token authentication.

The EU AI Act (full applicability August 2, 2026) requires GPAI providers to document training data handling. Anonymizing PII before AI training supports the data-governance obligations of Article 10. anonym.legal provides audit evidence for each anonymization.

All 48 Languages Supported

All entities are available in all languages.

English (spaCy)
German (spaCy)
French (spaCy)
Spanish (spaCy)
Italian (spaCy)
Portuguese (spaCy)
Dutch (spaCy)
Polish (spaCy)
Russian (spaCy)
Ukrainian (spaCy)
Swedish (spaCy)
Danish (spaCy)
Norwegian (spaCy)
Finnish (spaCy)
Greek (spaCy)
Romanian (spaCy)
Slovenian (spaCy)
Croatian (spaCy)
Bulgarian (spaCy)
Lithuanian (spaCy)
Catalan (spaCy)
Japanese (spaCy)
Chinese (spaCy)
Korean (spaCy)
Hungarian (Stanza)
Hebrew (Stanza)
Vietnamese (Stanza)
Afrikaans (Stanza)
Armenian (Stanza)
Bulgarian (Stanza)
Arabic (XLM)
Hindi (XLM)
Turkish (XLM)
Czech (XLM)
Slovak (XLM)
Indonesian (XLM)
Thai (XLM)
Persian (XLM)
Serbian (XLM)
Latvian (XLM)
Estonian (XLM)
Marathi (XLM)
Bengali (XLM)
Urdu (XLM)
Swahili (XLM)
Tagalog (XLM)
Icelandic (XLM)
Basque (XLM)

47 Countries: ID Validation

Each country's ID is validated against its official format and, where one exists, its checksum algorithm:

European Union (27 Member States)

DE_TAX_ID: Germany Tax ID (Steuer-ID). 11-digit, ISO 7064 Mod 11,10
FR_NIR: France National Insurance (NIR). 13-digit + 2-digit key, Modulus-97
ES_NIF: Spain National ID (fiscal). 8-digit + 1 check letter (Modulus-23)
IT_CF: Italy Codice Fiscale. 16-char alphanumeric
NL_BSN: Netherlands Citizen Service Number. 9-digit, Modulus-11
PL_PESEL: Poland National ID (PESEL). 11-digit, weighted Modulus-10
SE_PERSONNUMMER: Sweden Personal ID. 10-digit YYMMDD-XXXX (12-digit with century), Luhn
FI_HETU: Finland Personal ID. 6-digit DOB + century mark + 3-digit serial + check character (Modulus-31)
PT_NIF: Portugal Tax/Citizen ID. 9-digit, Modulus-11
RO_CNP: Romania Personal Numeric Code (CNP). 13-digit, Modulus-11
GR_AFM: Greece Tax ID (AFM). 9-digit, Modulus-11
HU_TAJ: Hungary Social Security (TAJ). 9-digit, weighted Modulus-10
CZ_RC: Czechia Birth Number (rodné číslo). 10-digit YYMMDDXXXX, Modulus-11
SK_RC: Slovakia Birth Number (rodné číslo). 10-digit YYMMDDXXXX, Modulus-11
HR_OIB: Croatia Personal ID (OIB). 11-digit, ISO 7064 Mod 11,10
LT_ASMENS_ID: Lithuania Personal ID. 11-digit GYYMMDDNNNC, Modulus-11
LV_PEC: Latvia Personal Code. 11-digit DDMMYY-XXXXX
EE_ID: Estonia Personal ID. 11-digit GYYMMDDNNNC, two-pass Modulus-11
SI_EMSO: Slovenia Unique Master Citizen Number (EMŠO). 13-digit DDMMYYYRRBBBK, Modulus-11
BG_EGN: Bulgaria Uniform Civil Number (EGN). 10-digit YYMMDDXXXX, Modulus-11
MT_ID: Malta Identity Card. Alphanumeric with specific format
CY_ID: Cyprus ID Card. Numeric format
LU_SSN: Luxembourg Social Security. 13-digit
IE_PPS: Ireland PPS Number. 7 digits + check letter + optional second letter

Beyond EU (20 Countries)

GB_NI: UK National Insurance. 2 letters + 6 digits + 1 letter
CH_AHV: Switzerland Social Insurance (AHV). 13-digit, EAN-13 check digit (Modulus-10)
NO_FNR: Norway National ID (FNR). 11-digit DDMMYY+XXXXX, Modulus-11 (two check digits)
SE_PERSONNUMMER: Sweden Personal ID. 10-digit YYMMDD-XXXX (12-digit with century), Luhn
DK_CPR: Denmark CPR Registry. 10-digit DDMMYY-XXXX, Modulus-11
US_SSN: USA Social Security. 9-digit XXX-XX-XXXX, no check digit (format and range rules only)
CA_SIN: Canada Social Insurance. 9-digit, Luhn
BR_CPF: Brazil Citizen ID (CPF). 11-digit, Modulus-11 (2 checksums)
MX_RFC: Mexico Tax ID (RFC). 13-char alphanumeric
AU_TFN: Australia Tax File Number. 8-9 digits, weighted Modulus-11
JP_MY_NUMBER: Japan My Number ID. 12-digit, weighted Modulus-11
CN_ID: China Resident ID (GB 11643). 18-digit, ISO 7064 Mod 11-2
IN_AADHAAR: India Aadhaar ID. 12-digit, Verhoeff check digit
SG_NRIC: Singapore National ID. 9-char (1 letter + 7 digits + check letter)
ZA_ID: South Africa National ID. 13-digit YYMMDDSSSSCAZ, Luhn (Modulus-10)
NZ_IRD: New Zealand Tax ID (IRD). 8-9 digits, Modulus-11
KR_RRN: South Korea Resident Registration. 13-digit YYMMDD-GXXXXXC, Modulus-11
TH_ID: Thailand National ID. 13-digit, Modulus-11
RU_SNILS: Russia Social Insurance. 11-digit, Modulus-101
IL_ID: Israel National ID. 9-digit, Luhn

Validation Algorithms Supported

Luhn Algorithm

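Used for: Canada SIN, Sweden Personnummer, Israel national ID, credit card PAN (last digit). The US SSN has no check digit and is validated by format rules only.

Luhn is one of the most widely used check digit algorithms. It catches every single-digit error and almost every adjacent transposition (all except swapping 09 and 90).

Variants: Standard Luhn (specified in ISO/IEC 7812-1), Luhn mod N for alphanumeric input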
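A minimal sketch of the standard Luhn check follows. The function name `luhn_valid` is an assumption for illustration; the sample numbers are well-known test values, not real IDs.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the standard Luhn check.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately left of the check digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("79927398713"))  # classic Luhn test number
print(luhn_valid("046 454 286"))  # commonly cited sample SIN format
```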

Modulus-11

Used for: Netherlands BSN, Portugal NIF, Romania CNP, Greece AFM, Denmark CPR, Norway FNR, Brazil CPF, Thailand ID.

Verifies the ID against a weighted checksum: each digit is multiplied by a position-specific weight and the sum is reduced modulo 11.

Variants: ISO 7064 Mod 11-2 (China Resident ID), ISO 7064 Mod 11,10 (Germany Steuer-ID, Croatia OIB), country-specific weight tables
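As one concrete instance, the ISO 7064 Mod 11-2 variant can be sketched as below, applied to the 18-character China Resident ID format from the country list. The function names are hypothetical, and the sketch checks only length and check character, not the embedded date or region code.

```python
def mod11_2_check_char(body: str) -> str:
    """Compute the ISO 7064 Mod 11-2 check character for a digit string.

    Each digit is weighted by a power of 2 (mod 11) that depends on its
    distance from the check position; a check value of 10 is written "X".
    """
    n = len(body)
    total = sum(int(ch) * pow(2, n - i, 11) for i, ch in enumerate(body))
    value = (12 - total % 11) % 11
    return "X" if value == 10 else str(value)


def china_id_valid(id_number: str) -> bool:
    """Check length, digit body, and check character of an 18-char ID."""
    id_number = id_number.strip().upper()
    if len(id_number) != 18 or not id_number[:17].isdigit():
        return False
    return mod11_2_check_char(id_number[:17]) == id_number[17]


print(china_id_valid("11010519491231002X"))  # widely published sample number
```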

Modulus-97

Used for: IBAN (International Bank Account Number), France NIR key.

Strong error detection. Catches all single-digit and transposition errors for financial accounts.

ISO 7064 Mod 97-10, used in 135+ countries
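The IBAN check can be sketched in a few lines. This is a simplified illustration: the helper name `iban_valid` is an assumption, and the sketch verifies only the Mod 97-10 checksum and overall length bounds, not the per-country length or BBAN structure.

```python
def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the ISO 7064 Mod 97-10 check."""
    s = "".join(iban.split()).upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Convert letters to numbers: A=10, B=11, ..., Z=35 (digits unchanged).
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # standard documentation example
```

Python's arbitrary-precision integers make the final modulo a single operation; languages with fixed-width integers usually process the number in chunks.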

Verhoeff Algorithm

Used for: India Aadhaar (check digit), specialized financial/security IDs.

One of the strongest single-digit error detection algorithms. Catches 100% of single-digit and adjacent transposition errors.

Less widely deployed than Luhn, but with stronger error detection

Modulus-10

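Used for: Poland PESEL, Hungary TAJ, Switzerland AHV (EAN-13), South Africa national ID (Luhn), some payment card verification.

Simple but effective. Weighted variants multiply each digit by a position-specific weight before summing, which catches more transpositions than a plain digit sum.

EAN-13, weighted variants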
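A weighted Modulus-10 check can be sketched with the scheme listed for Poland's PESEL in the country list. The function name and constant are assumptions for illustration; the sketch checks only the checksum, not the encoded birth date.

```python
# PESEL weights for the first ten digits; the eleventh digit is the check digit.
_PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_valid(pesel: str) -> bool:
    """Return True if an 11-digit PESEL has a correct check digit."""
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(pesel[:10], _PESEL_WEIGHTS))
    expected = (10 - total % 10) % 10
    return expected == int(pesel[10])


print(pesel_valid("44051401359"))  # widely published sample number
```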

Custom Formats

Used for: China Resident ID (GB 11643), Spain NIF, Mexico RFC, Malaysia IC.

Country-specific rules. Some use alphanumeric checksums or position-based validation.

Regex + checksum combinations
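A regex plus checksum combination can be sketched with the Spanish NIF format (8 digits + 1 letter). The pattern and helper name are assumptions for illustration, and the sketch covers only this basic form, not foreigner (NIE) or company variants.

```python
import re

# Shape check: exactly 8 ASCII digits followed by one uppercase letter.
_NIF_RE = re.compile(r"^([0-9]{8})([A-Z])$")
# The check letter is selected by the number modulo 23.
_NIF_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def es_nif_valid(value: str) -> bool:
    """Return True if `value` matches the NIF shape and check letter."""
    match = _NIF_RE.match(value.strip().upper())
    if match is None:
        return False
    number, letter = match.groups()
    return _NIF_LETTERS[int(number) % 23] == letter


print(es_nif_valid("12345678Z"))  # common documentation placeholder
```

The regex rejects malformed input cheaply, so the checksum runs only on strings that already have the right shape.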

Why Multilingual PII Detection Matters

40-60% Miss Rate

Single-engine tools trained on English miss most non-English PII. A healthcare system processing German patient records may redact only 30-40% of actual PII types.

anonym.legal: 3-layer hybrid detects 98%+ across all languages.

Compliance by Country

GDPR, HIPAA, PIPL, LGPD each define entity types for their jurisdictions. A tech company processing EU + India data needs different entity sets per region.

anonym.legal: All 47 countries, all regulations, one platform.

ID Format Validation

Not every 11- or 12-digit number is a valid ID. German tax IDs (11 digits) carry a check digit; Indian Aadhaar numbers (12 digits) use the Verhoeff algorithm. Checksum validation reduces false positives.

anonym.legal: Validates all 47 country ID formats with correct algorithms.
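The false-positive point can be illustrated with the Dutch BSN (9-digit, Modulus-11) from the country list: a pattern match finds every 9-digit number, and the checksum then discards the ones that cannot be a BSN. This is a hedged sketch with hypothetical helper names, not the product's detection logic.

```python
import re

# Candidate pattern: exactly nine ASCII digits, not part of a longer digit run.
_NINE_DIGITS = re.compile(r"(?<![0-9])[0-9]{9}(?![0-9])")
# BSN "11-test" weights; note the negative weight on the last digit.
_BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def bsn_valid(candidate: str) -> bool:
    """Return True if a 9-digit string passes the BSN Modulus-11 test."""
    total = sum(int(d) * w for d, w in zip(candidate, _BSN_WEIGHTS))
    return total % 11 == 0


def find_bsn(text: str) -> list[str]:
    """Return 9-digit numbers in `text` that also pass the checksum."""
    return [m.group() for m in _NINE_DIGITS.finditer(text) if bsn_valid(m.group())]


sample = "BSN 111222333, order number 123456789"
print(find_bsn(sample))
```

Here both numbers match the pattern, but only the first passes the checksum, so the order number is not flagged as PII.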

Team Complexity

Supporting 48 languages manually requires multilingual staff, custom NLP expertise, and ongoing model maintenance. This is cost-prohibitive for most organizations.

anonym.legal: All 48 languages included. Same accuracy everywhere.

Detect PII in Your Language

Paste text in any of 48 languages. See all detected entities, entity types, and validation status instantly.

Frequently Asked Questions

Yes. The hybrid NLP engine automatically detects the language of each text segment and applies the correct detection model. A document mixing English, German, and Japanese is processed with all three language models simultaneously.

anonym.legal detects country-specific identifiers (47 countries) and validates them using local algorithms: Luhn for credit cards, Modulus-11 for Nordic personal numbers, check-digit validation for EU tax IDs. All processing happens locally with zero data retention.

spaCy handles 24 languages (including all EU major languages), Stanza covers 6 languages (Bulgarian, Hungarian, Hebrew, Vietnamese, Afrikaans, Armenian), and XLM-RoBERTa transformers handle 18 additional languages (Arabic, Hindi, Thai, and more).