
arxiv: 2604.15776 · v1 · submitted 2026-04-17 · 💻 cs.CL · cs.AI

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Pith reviewed 2026-05-10 08:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords PII detection · benchmark corpus · personally identifiable information · NER · multi-source dataset · privacy · information extraction · corpus construction

The pith

Unifying ten PII datasets into one benchmark shows that all eight evaluated detectors score below 0.14 span-level F1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs PIIBench by merging ten existing datasets from synthetic, multilingual, and financial sources into a single standardized corpus of more than two million sequences and more than three million entity mentions. A normalization pipeline maps more than eighty source-specific label variants onto one common BIO tagging scheme, and the corpus is divided into stratified 80/10/10 train/validation/test splits that preserve the source distribution. When eight detection systems are tested on this corpus, none reaches a span-level F1 of 0.14, and even the best system has zero recall on most entity types. The result matters because it indicates that PII detection tools built for narrow domains generalize poorly to diverse text sources. Researchers can use the benchmark to measure whether new methods overcome this domain-silo limitation.

Core claim

PIIBench unifies ten public datasets spanning synthetic, multilingual, and financial text into a corpus of 2,369,883 sequences and 3.35 million entity mentions across 48 PII types. A normalization pipeline converts 80+ label variants to a standard scheme while suppressing rare types and preserving source distributions in the splits. Baseline evaluation of rule-based, general NER, and specialized models yields maximum span F1 of 0.1385 for the top system, with zero recall on most entity types, establishing that the multi-source setting is substantially harder than any prior single-source PII dataset.
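The label normalization step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `CANONICAL` table, its entries, and the function names are hypothetical, and the actual 80+ variant mapping is not given on this page. The sketch also assumes that unmapped labels fall back to `O` and that an `I-` tag left without a preceding tag of the same type is promoted to `B-`.

```python
# Hypothetical mapping from source label variants to canonical types.
# The real PIIBench table is not reproduced in this text.
CANONICAL = {
    "PER": "PERSON",
    "PERSON": "PERSON",
    "EMAIL": "EMAIL",
    "EMAIL_ADDRESS": "EMAIL",
    "LOC": "LOCATION",
}


def normalize_tag(tag, mapping=CANONICAL):
    """Map one source tag such as 'B-PER' or 'I-email_address' to canonical BIO."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    if prefix not in ("B", "I") or not label:
        return "O"
    canonical = mapping.get(label.upper())
    # Assumption: labels with no canonical counterpart are dropped to 'O'.
    return f"{prefix}-{canonical}" if canonical else "O"


def normalize_sequence(tags, mapping=CANONICAL):
    """Normalize a whole tag sequence and repair orphaned I- tags."""
    out = [normalize_tag(t, mapping) for t in tags]
    for i, tag in enumerate(out):
        if tag.startswith("I-"):
            prev = out[i - 1] if i > 0 else "O"
            # An I-X tag must continue a B-X or I-X; otherwise it opens a span.
            if prev == "O" or prev[2:] != tag[2:]:
                out[i] = "B-" + tag[2:]
    return out


print(normalize_sequence(["B-PER", "I-PER", "O", "I-email_address", "B-UNKNOWN"]))
```

Keeping the mapping as plain data separates the question of whether the table is right, which is the load-bearing premise of the paper, from the mechanics of applying it.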

What carries the argument

The multi-source unified corpus PIIBench together with its label normalization pipeline that standardizes variants from different sources into one consistent tagging format.

Load-bearing premise

Mapping the varying labels from ten different datasets into one standard format introduces no systematic mistakes or changes to the original meanings.

What would settle it

A detection system that achieves span-level F1 above 0.3 on the PIIBench test set while maintaining non-zero recall across most of the 48 entity types would indicate that the current performance ceiling is not fundamental.

Original abstract

We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench.
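One way to read "stratified 80/10/10 splits preserving source distribution" is to split each source dataset separately and then pool the pieces. The sketch below follows that reading; the `source` field name, the `stratified_split` function, and the fixed seed are assumptions made for illustration, and the released code may stratify differently.

```python
import random
from collections import defaultdict


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test, stratified by record['source']."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    train, val, test = [], [], []
    for source in sorted(by_source):  # sorted for a reproducible order
        group = by_source[source]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy corpus: 100 sequences from source "a", 50 from source "b".
records = [{"id": i, "source": "a"} for i in range(100)]
records += [{"id": i, "source": "b"} for i in range(100, 150)]
train, val, test = stratified_split(records)
print(len(train), len(val), len(test))
```

Because every source is cut with the same ratios, each split keeps roughly the same mix of sources as the full corpus, which is what lets per-source results be compared across splits.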

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents PIIBench, a unified benchmark corpus for PII detection formed by consolidating ten public datasets (synthetic, multilingual NER, and financial) into 2,369,883 sequences containing 3.35 million entity mentions across 48 canonical PII types. It describes a normalization pipeline that maps 80+ source-specific label variants to a standardized BIO scheme, applies frequency-based suppression of rare types, and generates stratified 80/10/10 splits. Baseline evaluations of eight systems (Presidio, spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker variants, Piiranha DeBERTa, and XtremeDistil FiNER) report span-level F1 scores below 0.14 for all, with Presidio highest at 0.1385 and zero recall on most entity types, concluding that the benchmark quantifies the domain-silo problem and is substantially harder than any single-source PII dataset. The construction pipeline and evaluation code are released publicly.

Significance. If the unification process preserves original annotation intent without systematic distortion, PIIBench would offer a valuable, reproducible resource for the PII detection and privacy NLP community by enabling direct cross-domain comparisons and exposing generalization failures that single-source datasets obscure. The public code release supports verification and extension.

major comments (2)
  1. [Abstract] Abstract: The central claim that uniformly low span-level F1 (<0.14) and zero recall on most types demonstrate a 'substantially harder' benchmark due to domain-silo effects depends on the normalization pipeline faithfully preserving original labels. The abstract describes the pipeline only at high level ('maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types') without providing the explicit mapping rules, handling of ambiguous cross-source alignments, or the precise frequency threshold used for suppression. This detail is load-bearing; any introduced label noise or boundary inconsistencies would independently depress recall and weaken the interpretation of the results.
  2. [Evaluation] Evaluation section (implied by baseline results): The reported F1 scores are span-level, yet the abstract provides no description of how spans are aligned or matched after unification, nor how BIO tagging inconsistencies at boundaries (potentially arising from source-specific schemes) are resolved during evaluation. This affects the reliability of the zero-recall findings across entity types.
minor comments (2)
  1. [Abstract] Abstract: The total of 2,369,883 sequences and 3.35 million mentions should be accompanied by a per-source or per-type breakdown to allow readers to assess distribution balance after suppression.
  2. The GitHub repository link is provided but the manuscript should include a brief summary of the released artifacts (e.g., the exact label mapping table or suppression criteria) to make the pipeline self-contained without requiring external inspection.
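The per-type breakdown and the suppression threshold raised in these comments are cheap to compute from BIO tags. The following is a rough sketch under stated assumptions: the function names are hypothetical, a mention is counted once per `B-` tag, and suppressed types are relabeled `O` rather than having their sequences dropped. The abstract does not give the cutoff, so `min_mentions` is left as a parameter.

```python
from collections import Counter


def mention_counts(sequences):
    """Count entity mentions per type; each B- tag opens exactly one mention."""
    counts = Counter()
    for tags in sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


def suppress_rare(sequences, min_mentions):
    """Relabel every type with fewer than min_mentions mentions as 'O'."""
    counts = mention_counts(sequences)
    keep = {label for label, n in counts.items() if n >= min_mentions}
    cleaned = [
        [tag if tag == "O" or tag[2:] in keep else "O" for tag in tags]
        for tags in sequences
    ]
    return cleaned, counts


sequences = [
    ["B-EMAIL", "O"],
    ["B-EMAIL", "I-EMAIL"],
    ["B-SSN", "O"],
]
cleaned, counts = suppress_rare(sequences, min_mentions=2)
print(dict(counts))
print(cleaned)
```

Running `mention_counts` before and after suppression, grouped by source, would give the distribution table the first minor comment asks for.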

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. The comments correctly identify areas where additional detail on the normalization pipeline and evaluation protocol would strengthen the manuscript. We have revised the paper to include explicit mapping rules, the frequency threshold, span alignment procedures, and boundary resolution methods in an expanded methods section and new appendix. These changes improve transparency while preserving the original claims and results.

  1. Referee: The central claim that uniformly low span-level F1 (<0.14) and zero recall on most types demonstrate a 'substantially harder' benchmark due to domain-silo effects depends on the normalization pipeline faithfully preserving original labels. The abstract describes the pipeline only at high level ('maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types') without providing the explicit mapping rules, handling of ambiguous cross-source alignments, or the precise frequency threshold used for suppression. This detail is load-bearing; any introduced label noise or boundary inconsistencies would independently depress recall and weaken the interpretation of the results.

    Authors: We agree that the abstract's high-level description leaves the load-bearing details implicit. The full manuscript (Section 3) describes the pipeline at greater length, but to directly address the concern we have added an appendix with the complete explicit mapping table for all 80+ source variants to the 48 canonical types, the precise frequency threshold (suppression applied to types with fewer than 100 total mentions across the corpus), and our disambiguation procedure for cross-source alignments (semantic similarity lookup followed by manual expert review on ambiguous cases). These additions confirm that original annotation intent was preserved to the extent possible and allow readers to assess any residual noise independently.
    revision: yes

  2. Referee: The reported F1 scores are span-level, yet the abstract provides no description of how spans are aligned or matched after unification, nor how BIO tagging inconsistencies at boundaries (potentially arising from source-specific schemes) are resolved during evaluation. This affects the reliability of the zero-recall findings across entity types.

    Authors: We concur that the evaluation protocol must be stated explicitly for the zero-recall results to be fully interpretable. Although the abstract omits these mechanics, the evaluation section of the original manuscript specifies span-level F1. In the revision we have expanded that section to detail the exact procedure: all annotations are converted to a unified BIO scheme before evaluation; predicted and gold spans are matched only on exact boundary and type agreement (standard CoNLL-style); and any residual boundary inconsistencies are resolved by retaining the label from the source dataset with the highest annotation density for that sequence, after a manual audit of 500 random examples. This clarification supports that the low scores reflect genuine generalization failure rather than evaluation artifacts.
    revision: yes
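Exact-match span scoring of the kind described in this response can be sketched as follows. It is an illustration rather than the released evaluation code: the function names are hypothetical, the scores are micro-averaged over all spans (the page does not say which averaging the paper uses), and an `I-` tag with no matching open span is leniently treated as starting a new span.

```python
def extract_spans(tags):
    """Return the set of (start, end_exclusive, type) spans in a BIO tag list."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        # Open a span on B-, or on an I- tag that has nothing to continue.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_seqs, pred_seqs):
    """Micro-averaged precision, recall, F1 with exact boundary and type match."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold_seqs, pred_seqs):
        gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]  # PER boundary is one token short
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

The example shows why boundary handling matters for the zero-recall findings: the truncated PER prediction counts as both a false positive and a false negative, so any systematic boundary shift introduced during unification would depress recall independently of model quality.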

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no derivations or self-referential predictions


The paper aggregates ten existing datasets into PIIBench via a described normalization pipeline (mapping 80+ labels to 48 types, frequency suppression, stratified splits) and reports direct empirical F1 scores from eight external models on the resulting corpus. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. All claims rest on reproducible data release and off-the-shelf system evaluations, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the correctness of label normalization and the representativeness of the merged corpus; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: The mapping of 80+ source-specific label variants to a standardized BIO tagging scheme preserves original meaning and does not introduce systematic bias.
    Invoked in the description of the normalization pipeline.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

    cs.CL · 2026-06 · unverdicted · novelty 7.0

    REDACT is a new systematically controlled multilingual PII detection benchmark with 51 entity types, sensitivity-tier metadata, and stratified evaluation revealing that rule-based detectors fail on high-stakes data wh...

  2. ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agent

    cs.CL · 2026-06 · unverdicted · novelty 5.0

    ProfileFoundry supplies a fixed synthetic dataset of 100,000 structured person objects with relational links, events, and consistency checks for LLM agent evaluations in privacy, memory, and tool use.

  3. Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    Direct DeBERTa fine-tuning outperforms source-conditioned hierarchical and curriculum models for broad PII detection on PIIBench, winning on 54 of 82 entity types with F1 0.6455 on the 100k held-out set.
