PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

2026 · cs.CL · arXiv 2604.15776

3 Pith papers cite this work.

abstract

We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near-absent entity types, and produces stratified 80/10/10 train/validation/test splits that preserve the source distribution. To establish baseline difficulty, we evaluate eight published systems spanning a rule-based engine (Microsoft Presidio), general-purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, and the best system (Presidio, F1=0.1385) still produces zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single-source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench.
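
To make the pipeline steps in the abstract concrete, the following is a minimal Python sketch of label normalization to a canonical BIO scheme, frequency-based suppression, and span-level F1. It is not the authors' code. The mapping `LABEL_MAP`, the function names, the `min_count` threshold, and the choice of exact-match spans (same boundaries and same type) are all assumptions made for illustration; the actual mapping and scoring rules are in the linked repository.

```python
from collections import Counter

# Hypothetical label map: the real pipeline maps 80+ source variants,
# which are not reproduced here.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "LOC": "LOCATION",
}


def normalize_tags(tags, label_map=LABEL_MAP, keep=None):
    """Map source BIO tags to canonical BIO tags.

    Tags whose type is unmapped, or not in `keep` when it is given,
    are replaced by 'O'.
    """
    out = []
    for tag in tags:
        if "-" not in tag:  # 'O' or anything without a B-/I- prefix
            out.append("O")
            continue
        prefix, label = tag.split("-", 1)
        canon = label_map.get(label)
        if canon is None or (keep is not None and canon not in keep):
            out.append("O")
        else:
            out.append(f"{prefix}-{canon}")
    return out


def frequent_types(sequences, min_count):
    """Return entity types with at least `min_count` mentions (B- tags)."""
    counts = Counter(
        tag[2:] for seq in sequences for tag in seq if tag.startswith("B-")
    )
    return {label for label, n in counts.items() if n >= min_count}


def extract_spans(tags):
    """Return a set of (start, end, label) spans; `end` is exclusive.

    An I- tag that does not continue a span of the same type is
    treated like 'O' (an assumption; other conventions exist).
    """
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def span_f1(gold_seqs, pred_seqs):
    """Micro-averaged exact-match span F1 over paired tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under exact-match scoring, a system that predicts only a subset of the canonical types gets no credit for the others. That is consistent with the low baseline scores reported above, although this sketch does not reproduce them.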

representative citing papers

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

REDACT is a new systematically controlled multilingual PII detection benchmark with 51 entity types, sensitivity-tier metadata, and stratified evaluation revealing that rule-based detectors fail on high-stakes data while LLM detectors are more robust.

ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agent

cs.CL · 2026-06-24 · unverdicted · novelty 5.0

ProfileFoundry supplies a fixed synthetic dataset of 100,000 structured person objects with relational links, events, and consistency checks for LLM agent evaluations in privacy, memory, and tool use.

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

cs.CL · 2026-05-25 · unverdicted · novelty 4.0

Direct DeBERTa fine-tuning outperforms source-conditioned hierarchical and curriculum models for broad PII detection on PIIBench, winning on 54 of 82 entity types with F1 0.6455 on the 100k held-out set.
