PII Detection in 48 Languages — Global Compliance

Hybrid NLP Architecture

Three complementary models for maximum coverage:

spaCy (24 Languages)

Production-grade NER (Named Entity Recognition). Fast, memory-efficient, highly accurate for standard entities.

Languages: English, German, Dutch, Catalan, Danish, Finnish, French, Greek, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Ukrainian, Chinese, Bulgarian*, Croatian.

Strengths: Balanced speed vs accuracy. Standard entity types (PER, ORG, LOC, DATE, etc.). Pretrained models available.

Use case: Primary engine for high-resource languages with large training datasets.

* Bulgarian is also covered by Stanza. Some languages are supported by multiple NLP engines for optimal accuracy.

Stanza (6 Languages)

Stanford's neural NLP toolkit. Slower than spaCy but more accurate for morphologically rich languages. Provides tokenization, part-of-speech tagging, and dependency parsing.

Languages: Bulgarian, Hungarian, Hebrew, Vietnamese, Afrikaans, Armenian.

Strengths: Handles agglutinative languages such as Hungarian. Complex morphology (Hebrew). High accuracy despite smaller training data.

Use case: Languages with complex word structure or limited public datasets.

XLM-RoBERTa (18 Languages)

Meta AI's (formerly Facebook AI) cross-lingual model. A single model pretrained on 100 languages and fine-tuned here for PII detection.

Languages: Arabic, Hindi, Turkish, Czech, Slovak, Indonesian, Thai, Persian, Serbian, Latvian, Estonian, Marathi, Bengali, Urdu, Swahili, Tagalog, Icelandic, Basque.

Strengths: Zero-shot transfer learning. Works for low-resource languages. Single model, not language-specific.

Use case: Low-resource or emerging languages without dedicated spaCy models.
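The division of labor between the three engines amounts to a lookup from language to engine. The following is a minimal sketch of that routing, not anonym.legal's implementation: the ISO 639-1 codes, the `ENGINE_BY_LANG` table, and the `pick_engine` helper are assumptions made for illustration. Bulgarian is listed under both spaCy and Stanza, so the sketch arbitrarily lets Stanza win, and it assumes unknown languages fall back to the multilingual model.

```python
# Hypothetical routing table built from the three language lists above.
SPACY = ["en", "de", "nl", "ca", "da", "fi", "fr", "el", "it", "ja", "ko", "lt",
         "no", "pl", "pt", "ro", "ru", "sl", "es", "sv", "uk", "zh", "bg", "hr"]
STANZA = ["bg", "hu", "he", "vi", "af", "hy"]
XLM = ["ar", "hi", "tr", "cs", "sk", "id", "th", "fa", "sr", "lv", "et", "mr",
       "bn", "ur", "sw", "tl", "is", "eu"]

ENGINE_BY_LANG = {}
for engine, langs in (("spacy", SPACY), ("stanza", STANZA), ("xlm-roberta", XLM)):
    for lang in langs:
        # Later engines override earlier ones, so "bg" ends up on Stanza.
        ENGINE_BY_LANG[lang] = engine


def pick_engine(lang_code: str) -> str:
    """Return the engine name for a language code (assumed fallback: XLM-RoBERTa)."""
    return ENGINE_BY_LANG.get(lang_code.lower(), "xlm-roberta")
```

Because Bulgarian appears in two lists, the table holds 47 distinct codes for the 48 listed entries.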

Yes. The anonym.legal Chrome Extension automatically detects and anonymizes PII before it reaches ChatGPT, Claude, Gemini, Copilot, or any AI tool. Your original data never leaves your browser.

Data masking replaces sensitive values with realistic-looking substitutes (e.g., replacing a real name with a fake name). Data anonymization is broader — it includes masking, redaction, hashing, encryption, and custom methods. anonym.legal supports all 6 methods.
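The difference between these methods is easiest to see side by side. The sketch below illustrates three of them (redaction, masking, hashing) with the Python standard library only. The function names, the placeholder strings, and the salt handling are hypothetical and do not reflect anonym.legal's API.

```python
import hashlib


def redact(value: str) -> str:
    """Redaction: drop the value entirely and leave a fixed placeholder."""
    return "[REDACTED]"


def mask(value: str, substitutes: dict[str, str]) -> str:
    """Masking: swap the value for a realistic-looking substitute."""
    return substitutes.get(value, "Alex Example")


def hash_value(value: str, salt: str) -> str:
    """Hashing: replace the value with a salted SHA-256 digest (truncated)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]
```

Hashing keeps equal inputs linkable across records (same salt, same output), while redaction and masking do not.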

Yes. Upload PDF, Word, Excel, or text files for batch anonymization. The platform processes 5,000+ documents per batch with full audit trail. OCR support detects PII in scanned documents.

48 languages with native NLP models: 24 spaCy models, 6 Stanza models, and 18 Transformer models. Each language has region-specific entity detection (e.g., Steuer-ID for German, NIR for French, Codice Fiscale for Italian).

Yes. The REST API and MCP Server allow you to anonymize PII in your data pipeline with 3 lines of code. Sub-200ms latency. Python and Node.js SDKs available. Bearer token authentication.

The EU AI Act (most provisions apply from August 2, 2026) requires GPAI providers to document training data handling. Anonymizing PII before AI training supports the data governance requirements of Article 10. anonym.legal provides audit evidence for each anonymization.

All 48 Languages Supported

All entity types are available in all languages.

Language | Engine
English | spaCy
German | spaCy
French | spaCy
Spanish | spaCy
Italian | spaCy
Portuguese | spaCy
Dutch | spaCy
Polish | spaCy
Russian | spaCy
Ukrainian | spaCy
Swedish | spaCy
Danish | spaCy
Norwegian | spaCy
Finnish | spaCy
Greek | spaCy
Romanian | spaCy
Slovenian | spaCy
Croatian | spaCy
Bulgarian | spaCy
Lithuanian | spaCy
Catalan | spaCy
Japanese | spaCy
Chinese | spaCy
Korean | spaCy
Hungarian | Stanza
Hebrew | Stanza
Vietnamese | Stanza
Afrikaans | Stanza
Armenian | Stanza
Bulgarian | Stanza
Arabic | XLM
Hindi | XLM
Turkish | XLM
Czech | XLM
Slovak | XLM
Indonesian | XLM
Thai | XLM
Persian | XLM
Serbian | XLM
Latvian | XLM
Estonian | XLM
Marathi | XLM
Bengali | XLM
Urdu | XLM
Swahili | XLM
Tagalog | XLM
Icelandic | XLM
Basque | XLM

47 Countries — ID Validation

Each country's ID is validated against its format rules and, where one exists, its official checksum algorithm:

European Union (27 Member States)

Code | Name | Format and check
DE_TAX_ID | Germany Tax ID (Steuer-ID) | 11-digit, ISO 7064 Mod 11,10
FR_NIR | France National Insurance | 13-digit + 2-digit key, Modulus-97
ES_NIF | Spain National ID (fiscal) | 8-digit + 1 check letter (Modulus-23)
IT_CF | Italy Codice Fiscale | 16-char alphanumeric
NL_BSN | Netherlands Citizen Service Number | 9-digit, Modulus-11
PL_PESEL | Poland National ID (PESEL) | 11-digit, Modulus-10
SE_PERSONNUMMER | Sweden Personal ID | 10-digit YYMMDD-XXXX or 12-digit YYYYMMDD-XXXX, Luhn
FI_HETU | Finland Personal ID | 6-digit DOB + century mark + 3-digit serial + check character
PT_NIF | Portugal Tax/Citizen ID | 9-digit, Modulus-11
RO_CNP | Romania Personal Numeric Code (CNP) | 13-digit, Modulus-11
GR_AFM | Greece Tax ID (AFM) | 9-digit, Modulus-11
HU_TAJ | Hungary Social Security (TAJ) | 9-digit, specific algorithm
CZ_RC | Czechia Birth Number (RČ) | 10-digit YYMMDDXXXX, Modulus-11
SK_RC | Slovakia Birth Number (RČ) | 10-digit YYMMDDXXXX, Modulus-11
HR_OIB | Croatia Personal ID (OIB) | 11-digit, ISO 7064 Mod 11,10
LT_ASMENS_ID | Lithuania Personal ID | 11-digit GYYMMDDNNNC
LV_PEC | Latvia Personal Code | 11-digit DDMMYY-XXXXX
EE_ID | Estonia Personal ID | 11-digit, two-stage Modulus-11 checksum
SI_EMSO | Slovenia Unique Master ID (EMŠO) | 13-digit DDMMYYYRRBBBK, Modulus-11
BG_EGN | Bulgaria Uniform ID (EGN) | 10-digit YYMMDDXXXX, Modulus-11
MT_ID | Malta Identity Card | Alphanumeric with specific format
CY_ID | Cyprus ID Card | Numeric format
LU_SSN | Luxembourg Social Security | 13-digit
IE_PPS | Ireland PPS Number | 7 digits + check letter + optional second letter

Beyond EU (20 Countries)

Code | Name | Format and check
GB_NI | UK National Insurance | 2 letters + 6 digits + 1 letter
CH_AHV | Switzerland Social Insurance | 13-digit, EAN-13 check digit (Modulus-10)
NO_FNR | Norway National ID (FNR) | 11-digit DDMMYY+XXXXX, Modulus-11
SE_PERSONNUMMER | Sweden Personal ID | 10-digit or 12-digit, Luhn
DK_CPR | Denmark CPR Registry | 10-digit DDMMYY-XXXX, Modulus-11
US_SSN | USA Social Security | 9-digit XXX-XX-XXXX, no check digit (structural rules only)
CA_SIN | Canada Social Insurance | 9-digit, Luhn
BR_CPF | Brazil Citizen ID (CPF) | 11-digit, Modulus-11 (2 check digits)
MX_RFC | Mexico Tax ID (RFC) | 13-char alphanumeric
AU_TFN | Australia Tax File Number | 8-9 digits, weighted Modulus-11
JP_MY_NUMBER | Japan My Number ID | 12-digit, weighted Modulus-11
CN_ID | China Resident ID (GB 11643) | 18-digit, ISO 7064 Mod 11-2
IN_AADHAAR | India Aadhaar ID | 12-digit, Verhoeff (check digit)
SG_NRIC | Singapore National ID | 9-char (1 letter + 7 digits + check letter)
ZA_ID | South Africa National ID | 13-digit YYMMDDSSSSCAZ, Luhn (Modulus-10)
NZ_IRD | New Zealand Tax ID (IRD) | 8-9 digits, Modulus-11
KR_RRN | South Korea Resident Registration | 13-digit YYMMDD-XXXXXXX, Modulus-11
TH_ID | Thailand National ID | 13-digit, Modulus-11
RU_SNILS | Russia Social Insurance | 11-digit, Modulus-101
IL_ID | Israel National ID | 9-digit, Luhn

See All 285+ Entity Types

Validation Algorithms Supported

Luhn Algorithm

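Used for: Canada SIN, Sweden Personnummer, Israel ID, credit card PAN (last digit). The US SSN has no check digit and is validated by structural rules only.

Luhn is among the most widely used check digit algorithms. It catches every single-digit error and most adjacent transpositions (all except 09 and 90 swapped).

Variants: Standard Luhn (specified for payment cards in ISO/IEC 7812-1), Luhn with country-specific digit selection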
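A minimal sketch of standard Luhn validation follows. The function name `luhn_valid` is hypothetical and the sketch checks only the checksum, not the length or prefix rules of any particular ID type.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the standard Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

For example, `luhn_valid("79927398713")` passes, while changing the last digit to 4 fails.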

Modulus-11

Used for: Netherlands BSN, Portugal NIF, Romania CNP, Greece AFM, Czech and Slovak birth numbers, Denmark CPR, Norway FNR, Brazil CPF, Thailand ID.

Verifies the ID against a weighted checksum taken modulo 11. Weights and check-digit mapping differ by country.

Variants: ISO 7064 Mod 11-2 (China Resident ID), ISO 7064 Mod 11,10 (Germany Steuer-ID, Croatia OIB), country-specific weights
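Two Modulus-11 flavors are sketched below: the Dutch BSN "11-test" (a plain weighted sum) and ISO 7064 Mod 11-2 as used by the 18-character China Resident ID. The function names are hypothetical, and both functions check only the checksum, not the date or region fields.

```python
def bsn_valid(bsn: str) -> bool:
    """Dutch BSN: weights 9..2 for the first eight digits, -1 for the last."""
    if len(bsn) != 9 or not bsn.isdigit():
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(int(d) * w for d, w in zip(bsn, weights))
    return total % 11 == 0


def cn_id_valid(id_number: str) -> bool:
    """China Resident ID: ISO 7064 Mod 11-2 over the first 17 digits."""
    id_number = id_number.upper()
    if len(id_number) != 18 or not id_number[:17].isdigit():
        return False
    # Weight for position i is 2^(17 - i) mod 11, giving 7, 9, 10, 5, ...
    total = sum(int(d) * pow(2, 17 - i, 11) for i, d in enumerate(id_number[:17]))
    expected = "10X98765432"[total % 11]
    return id_number[17] == expected
```

Note that Mod 11-2 can yield the check character `X`, which is why the China ID is not purely numeric.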

Modulus-97

Used for: IBAN (International Bank Account Number), France NIR control key.

Strong error detection. Catches all single-digit and transposition errors for financial accounts.

ISO 7064 Mod 97-10, used in 135+ countries
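The IBAN check can be sketched in a few lines: move the first four characters to the end, map letters to numbers (A=10 through Z=35), and test that the resulting integer leaves remainder 1 when divided by 97. The function name `iban_valid` is hypothetical, and the sketch deliberately omits the per-country length table that a production validator would also apply.

```python
def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the ISO 7064 Mod 97-10 check."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Country code and check digits go to the end before conversion.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

Python's arbitrary-precision integers make the single `% 97` possible; languages with fixed-width integers usually process the digit string in chunks instead.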

Verhoeff Algorithm

Used for: India Aadhaar (check digit), specialized financial/security IDs.

One of the strongest single-digit error detection algorithms. Catches 100% of single-digit and adjacent transposition errors.

Rarely used; very strong error detection
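Verhoeff works with three lookup tables instead of arithmetic: a multiplication table for the dihedral group D5, a position-dependent permutation table, and an inverse table. The sketch below uses the standard published tables. The function names are hypothetical, and no Aadhaar-specific format rules are applied.

```python
# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Position-dependent permutations (period 8).
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_valid(number: str) -> bool:
    """Return True if `number` (check digit last) passes the Verhoeff check."""
    if not number.isdigit():
        return False
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to `payload`."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]
```

With the payload `236` the check digit is 3, so `2363` validates while the transposed `2633` does not.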

Modulus-10

Used for: Poland PESEL, South Africa national ID (Luhn), Swiss AHV (EAN-13 check digit), some payment card verification.

Simple but effective. A weighted digit sum is reduced modulo 10 to produce a single check digit.

Weighted variants (PESEL, EAN-13); Luhn is itself a Modulus-10 scheme
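As an illustration of a weighted Modulus-10 scheme, here is a sketch of the Polish PESEL check with the repeating weights 1, 3, 7, 9. The function name `pesel_valid` is hypothetical, and the embedded birth date is not validated.

```python
def pesel_valid(pesel: str) -> bool:
    """Return True if the 11-digit PESEL has a correct check digit."""
    if len(pesel) != 11 or not pesel.isdigit():
        return False
    weights = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]
    total = sum(int(d) * w for d, w in zip(pesel[:10], weights))
    # The check digit brings the weighted sum up to a multiple of 10.
    check = (10 - total % 10) % 10
    return check == int(pesel[10])
```

A production validator would additionally decode the YYMMDD prefix (including the century offset in the month field) and reject impossible dates.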

Custom Formats

Used for: China Resident ID (GB 11643), Spain NIF, Mexico RFC, Malaysia IC.

Country-specific rules. Some use alphanumeric checksums or position-based validation.

Regex + checksum combinations
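The Spanish NIF (DNI form) is a compact example of a regex plus checksum combination: the pattern fixes the shape and the check letter is looked up by the number modulo 23. The function name `es_nif_valid` is hypothetical, and the sketch covers only the 8-digit + letter form, not NIE or company identifiers.

```python
import re

NIF_PATTERN = re.compile(r"^(\d{8})([A-Z])$")
NIF_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"  # index = number % 23


def es_nif_valid(nif: str) -> bool:
    """Return True if `nif` has the DNI shape and the right check letter."""
    match = NIF_PATTERN.match(nif.strip().upper())
    if match is None:
        return False
    number, letter = match.groups()
    return NIF_LETTERS[int(number) % 23] == letter
```

Running the regex first keeps the checksum step from ever seeing malformed input.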

Why Multilingual PII Detection Matters

40-60% Miss Rate

Single-engine tools trained on English miss most non-English PII. A healthcare system processing German patient records may redact only 30-40% of actual PII types.

anonym.legal: 3-layer hybrid detects 98%+ across all languages.

Compliance by Country

GDPR, HIPAA, PIPL, LGPD each define entity types for their jurisdictions. A tech company processing EU + India data needs different entity sets per region.

anonym.legal: All 47 countries, all regulations, one platform.

ID Format Validation

Not all 12-digit numbers are valid tax IDs. German tax IDs have specific checksums; Indian Aadhaar uses Verhoeff algorithm. Validation prevents false positives.

anonym.legal: Validates all 47 country ID formats with correct algorithms.
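To make the false-positive point concrete, here is a sketch of the check digit used by the 11-digit German tax ID, ISO 7064 Mod 11,10. The function name `de_tax_id_checksum_valid` is hypothetical, and the sketch verifies the checksum only, leaving out the additional digit-repetition rules of the real format.

```python
def de_tax_id_checksum_valid(tax_id: str) -> bool:
    """Return True if the 11-digit string passes ISO 7064 Mod 11,10."""
    if len(tax_id) != 11 or not tax_id.isdigit():
        return False
    product = 10
    for ch in tax_id[:10]:
        total = (int(ch) + product) % 10
        if total == 0:
            total = 10
        product = (total * 2) % 11
    check = 11 - product
    if check == 10:
        check = 0
    return check == int(tax_id[10])
```

An arbitrary 11-digit number has only a one-in-ten chance of passing this step, which is what lets a detector discard most look-alike numbers before flagging them.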

Team Complexity

Supporting 48 languages manually requires multilingual staff, custom NLP expertise, ongoing model maintenance. Cost prohibitive for most organizations.

anonym.legal: All 48 languages included. Same accuracy everywhere.


Frequently Asked Questions

Yes. The hybrid NLP engine automatically detects the language of each text segment and applies the correct detection model. A document mixing English, German, and Japanese is processed with all three language models simultaneously.

anonym.legal detects country-specific identifiers (47 countries) and validates them using local algorithms: Luhn for credit cards and Swedish personal numbers, Modulus-11 for Danish and Norwegian personal numbers, and check-digit validation for EU tax IDs. All processing happens locally with zero data retention.

spaCy handles 24 languages (including all EU major languages), Stanza covers 6 languages (Bulgarian, Hungarian, Hebrew, Vietnamese, Afrikaans, Armenian), and XLM-RoBERTa transformers handle 18 additional languages (Arabic, Hindi, Thai, and more).
