PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
3 Pith papers cite this work.
abstract
We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial-domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near-absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general-purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single-source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench.
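The span-level F1 reported in the abstract can be illustrated with a minimal sketch. It assumes exact-match scoring (a predicted span counts only if its boundaries and entity type both match a gold span) and micro-averaging over all sequences; the abstract does not state the matching criterion, so both are assumptions. The function names `bio_to_spans` and `span_f1` and the tags `EMAIL` and `NAME` are hypothetical and are not taken from the PIIBench code.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # Assumption: an I- tag with no matching open span is ignored.
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]):
    """Micro-averaged exact-match precision, recall, and F1 over spans."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)   # spans found with correct boundaries and type
        fp += len(p - g)   # predicted spans with no exact gold match
        fn += len(g - p)   # gold spans the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the system finds the EMAIL span but misses the NAME span.
gold_example = [["B-EMAIL", "I-EMAIL", "O", "B-NAME"]]
pred_example = [["B-EMAIL", "I-EMAIL", "O", "O"]]
precision, recall, f1 = span_f1(gold_example, pred_example)
```

In the toy example, precision is perfect while recall is halved by the missed entity. This is the same pattern the abstract describes at scale: a detector that produces zero recall on most entity types ends up with a low overall F1 even if the spans it does emit are correct.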
fields
cs.CL 3
years
2026 3
verdicts
UNVERDICTED 3
citing papers explorer
- REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
  REDACT is a new systematically controlled multilingual PII detection benchmark with 51 entity types, sensitivity-tier metadata, and stratified evaluation revealing that rule-based detectors fail on high-stakes data while LLM detectors are more robust.
- ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agent
  ProfileFoundry supplies a fixed synthetic dataset of 100,000 structured person objects with relational links, events, and consistency checks for LLM agent evaluations in privacy, memory, and tool use.
- Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa
  Direct DeBERTa fine-tuning outperforms source-conditioned hierarchical and curriculum models for broad PII detection on PIIBench, winning on 54 of 82 entity types with F1 0.6455 on the 100k held-out set.