Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

Pritesh Jha

arxiv: 2605.25816 · v1 · pith:SX4XB23Inew · submitted 2026-05-25 · 💻 cs.CL · cs.AI

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

Pritesh Jha This is my paper

Pith reviewed 2026-06-29 21:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords PII detectionDeBERTatoken classificationmulti-source datasetsnamed entity recognitioncurriculum learninghierarchical modelsinformation extraction

0 comments

The pith

Direct DeBERTa fine-tuning on multi-source PII data outperforms source-conditioned and curriculum models for 82 entity types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether added architectural features such as source conditioning or phased curriculum learning improve personally identifiable information detection when models are trained on a corrected collection of ten datasets covering 82 entity types. Experiments show that plain token-classification fine-tuning of DeBERTa with a weighted cross-entropy loss produces the highest F1 on both a 5,000-record and a 100,000-record held-out split, beating the more elaborate variants and eight earlier published systems. Direct training also leads on 54 of the 82 fine entity types and all ten coarse groups by support-weighted entity F1. A reader would care because most deployed PII detectors are still trained inside single domains and therefore miss sensitive spans in mixed real-world documents; the result points to data breadth as the dominant factor rather than model elaboration.

Core claim

On the corrected multi-source PIIBench, direct token-classification fine-tuning of DeBERTa reaches F1 0.6455 on the full 100,002-record held-out split, exceeding 0.5894 for the source-conditioned hierarchical model and lower scores for the curriculum extension; the same direct model also surpasses all eight published comparators on the reproducible 5,000-record subset and wins 54 of 82 fine entity types plus every coarse group.

What carries the argument

Direct token classification fine-tuning of DeBERTa with weighted cross-entropy loss on the corrected multi-source PIIBench dataset.

If this is right

Direct fine-tuning wins on 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1.
The source-conditioned hierarchical model retains localized advantages on only 28 types.
The three-phase curriculum extension produces markedly lower F1 than either direct or hierarchical training.
All three DeBERTa variants exceed the strongest published comparator (F1 0.1723) on the 5k test split.
Diverse task-specific training data plus a simple weighted objective drive the observed gains more than added architectural or curriculum complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

For other sequence-labeling tasks that cross heterogeneous sources, allocating effort to data collection and cleaning may yield larger returns than further architectural elaboration.
The pattern suggests that dataset preparation steps can dominate reported performance differences in information-extraction benchmarks.
Repeating the comparison at larger model scales or on additional languages would test whether the same data-over-complexity ordering persists.

Load-bearing premise

The corrections applied to the ten source datasets produce accurate unbiased labels for all 82 entity types and the held-out splits represent performance on unseen heterogeneous text.

What would settle it

A new held-out collection drawn from additional heterogeneous sources on which the source-conditioned hierarchical model records a higher overall F1 than direct fine-tuning would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2605.25816 by Pritesh Jha.

**Figure 1.** Figure 1: Span-level F1 on test_5k (5,000 records) for all eleven evaluated systems. Light grey = published comparators; dark bars = fine-tuned DeBERTa models. Dashed line marks the best published system (SpanMarker BERT, F1 = 0.1723). Direct DeBERTa achieves F1 0.6476 — a 3.76x improvement and +0.475 absolute F1 gain over the best published comparator. 6.3 Validation–Test Disagreement and Final Full-Test Evaluation… view at source ↗

read the original abstract

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Direct DeBERTa fine-tuning beats the tested hierarchical and curriculum variants on this corrected multi-source PII benchmark, but the lack of correction details leaves the main claim weakly supported.

read the letter

The paper's core result is that plain token-classification fine-tuning of DeBERTa reaches F1 0.6476 on the 5k held-out set and 0.6455 on the 100k set, beating both the source-conditioned hierarchical model (0.59) and the curriculum extension (0.28), plus all eight published comparators. It also wins on 54 of 82 fine entity types and all ten coarse groups by support-weighted F1.

What it does well is deliver a direct head-to-head on a broad, multi-source benchmark with entity-level breakdowns and two different held-out sizes. The numbers are reported clearly and the conclusion that data diversity plus a simple weighted loss matters more than the added architectural pieces is stated plainly.

The soft spot is exactly what the stress-test flags: the abstract never describes the correction process for the ten source datasets, offers no validation metrics, no checks for label bias or train-test leakage, and no error analysis. Without those, the performance gap could trace to how the labels were cleaned rather than to the modeling choices. Hyperparameter details and significance tests are also absent.

This is useful for researchers already working on multi-domain PII or privacy NER who need comparative numbers on 82 entity types. It is not likely to change wider practice. The work shows clear empirical engagement with the literature and deserves referee time so the missing data-handling steps can be supplied or challenged.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that direct fine-tuning of DeBERTa on a corrected multi-source PIIBench dataset spanning 82 entity types from ten sources achieves higher F1 for broad-coverage PII detection (0.6476 on the reproducible test_5k held-out subset and 0.6455 on the full 100k held-out split) than a source-conditioned hierarchical model (SC+H: 0.5899/0.5894) or its curriculum extension (0.2772), outperforming eight published comparators; entity-level analysis shows direct fine-tuning wins on 54/82 fine types and all 10 coarse groups by support-weighted F1, leading to the conclusion that diverse task-specific data and weighted cross-entropy contribute more than the tested architectural and curriculum complexity.

Significance. If the data corrections prove accurate and the held-out splits representative, the result would indicate that for heterogeneous PII detection, simple fine-tuning on diverse data can outperform added model complexity, with potential implications for efficient deployment of privacy tools in NLP. The reproducible 5k subset and streamed 100k evaluation provide a concrete empirical basis for the comparison.

major comments (3)

[Abstract] Abstract: the central claim attributes performance superiority to 'diverse task-specific training data' after corrections to PIIBench, yet supplies no description of the correction rules, validation against original annotations, inter-annotator metrics, or checks for systematic bias/leakage between train and held-out sets; without this, the entity-level wins (54/82 fine types) and the data-vs-complexity conclusion cannot be confidently separated from possible label artifacts favoring direct token classification.
[Abstract] Abstract and Results: the reported F1 gaps (e.g., 0.6476 vs. 0.5899 on test_5k) are presented without statistical significance testing, confidence intervals, or error analysis, which is load-bearing because the initial validation favored SC+H while the final streamed evaluation favors direct fine-tuning; this leaves open whether the observed differences are robust across the heterogeneous sources.
[Abstract] Abstract: hyperparameter selection, class-weight computation for the weighted cross-entropy, and training protocol details for the three DeBERTa variants are omitted, undermining assessment of whether the simple objective's contribution is truly isolated from tuning choices that could have been applied equally to SC+H.

minor comments (2)

[Abstract] The phrase 'streamed evaluation' on the 100k set is introduced without definition or reference to the streaming procedure.
[Abstract] Support-weighted entity F1 is mentioned for the 54/82 wins but the exact weighting formula and per-entity support counts are not tabulated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes planned for the revised manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim attributes performance superiority to 'diverse task-specific training data' after corrections to PIIBench, yet supplies no description of the correction rules, validation against original annotations, inter-annotator metrics, or checks for systematic bias/leakage between train and held-out sets; without this, the entity-level wins (54/82 fine types) and the data-vs-complexity conclusion cannot be confidently separated from possible label artifacts favoring direct token classification.

Authors: We agree that the abstract does not contain these details and that they are needed to support the central claim. The full manuscript (Section 3) describes the multi-source preparation of PIIBench and the corrections performed (label standardization across the ten sources and removal of inconsistent annotations). To address the concern, we will expand Section 3 with an explicit list of correction rules, any available validation steps against original annotations, checks for train/held-out leakage, and a discussion of possible label artifacts. This revision will allow readers to evaluate whether the entity-level results can be attributed to data rather than artifacts. revision: yes
Referee: [Abstract] Abstract and Results: the reported F1 gaps (e.g., 0.6476 vs. 0.5899 on test_5k) are presented without statistical significance testing, confidence intervals, or error analysis, which is load-bearing because the initial validation favored SC+H while the final streamed evaluation favors direct fine-tuning; this leaves open whether the observed differences are robust across the heterogeneous sources.

Authors: We acknowledge the absence of statistical testing and error analysis. The manuscript already notes that validation initially favored SC+H and therefore reports the streamed 100k evaluation to test robustness. In revision we will add bootstrap confidence intervals around all reported F1 scores, paired significance tests between models, and a short error analysis section that breaks down performance by source and coarse entity group. These additions will directly address whether the observed gaps are robust. revision: yes
Referee: [Abstract] Abstract: hyperparameter selection, class-weight computation for the weighted cross-entropy, and training protocol details for the three DeBERTa variants are omitted, undermining assessment of whether the simple objective's contribution is truly isolated from tuning choices that could have been applied equally to SC+H.

Authors: We agree that these implementation details are required for reproducibility and to isolate the effect of the objective. The experimental section of the manuscript provides high-level training information, but we will add a dedicated appendix that reports the hyperparameter search grid, the exact procedure used to compute class weights (inverse frequency with additive smoothing), optimizer settings, learning-rate schedule, epoch count, and early-stopping criteria for direct fine-tuning, SC+H, and SC+H+Curr. This will enable readers to judge whether the same tuning effort could have been applied to the more complex models. revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical comparison on held-out data

full rationale

The paper reports direct fine-tuning results for DeBERTa models on a corrected multi-source PIIBench dataset, with F1 scores on reproducible held-out splits (test_5k and 100k) and entity-level comparisons against published baselines. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; claims rest on standard train/eval metrics without any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides minimal visibility into hyperparameters or data processing; the listed items are inferred from the stated methods and conclusion.

free parameters (1)

class weights for weighted cross-entropy
The paper invokes a weighted cross-entropy objective whose weights must be chosen or derived from class frequencies to address imbalance across 82 entity types.

axioms (1)

domain assumption The corrected multi-source PIIBench supplies accurate ground-truth labels for 82 entity types from ten datasets.
All reported F1 scores and entity-level comparisons rest on the assumption that the dataset preparation and corrections are reliable.

pith-pipeline@v0.9.1-grok · 5783 in / 1348 out tokens · 44990 ms · 2026-06-29T21:24:42.571017+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 1 canonical work pages · 1 internal anchor

[1]

ai4privacy. (2023). PII Masking 400K / 300K. HuggingFace Hub. Custom academic licence. Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. Proceedings of ICML, 41–48. Ding, N., et al. (2021). Few-NERD: A few-shot named entity recognition dataset. ACL-IJCNLP

2023
[2]

Gretel.ai. (2023). Synthetic PII Finance Multilingual Dataset. HuggingFace Hub: gretelai/synthetic_pii_finance_multilingual. He, P., Gao, J., and Chen, W. (2023). DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. ICLR

2023
[3]

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Isotonic. (2023). PII Masking 200K. HuggingFace Hub: Isotonic/pii-masking-200k. Apache 2.0 Licence. Jha, P. (2026). PIIBench: A unified multi-source benchmark corpus for personally identifiable informat...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

and Cohen, N

McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165. Microsoft. (2023). Presidio — Data Protection SDK. https://github.com/microsoft/presidio. NVIDIA. (2023). Nemotron-PII: A synthetic dataset for PII detection. HuggingFace Hub: nvidi...

1989
[5]

Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language- independent named entity recognition. CoNLL

2003
[6]

and Navigli, R

Tedeschi, S. and Navigli, R. (2022). MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition. NAACL Findings

2022
[7]

Weischedel, R. et al. (2013). OntoNotes Release 5.0. LDC Catalog No.: LDC2013T19. Wolf, T., et al. (2020). Transformers: State-of-the-art natural language processing. EMNLP 2020 System Demonstrations. Fine-Tuning Over Architectural Complexity: PII Detection on PIIBench 13 Appendix A: Complete Fine-Entity Full-Test Comparison All 82 retained fine entity ty...

2013

[1] [1]

ai4privacy. (2023). PII Masking 400K / 300K. HuggingFace Hub. Custom academic licence. Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. Proceedings of ICML, 41–48. Ding, N., et al. (2021). Few-NERD: A few-shot named entity recognition dataset. ACL-IJCNLP

2023

[2] [2]

Gretel.ai. (2023). Synthetic PII Finance Multilingual Dataset. HuggingFace Hub: gretelai/synthetic_pii_finance_multilingual. He, P., Gao, J., and Chen, W. (2023). DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. ICLR

2023

[3] [3]

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Isotonic. (2023). PII Masking 200K. HuggingFace Hub: Isotonic/pii-masking-200k. Apache 2.0 Licence. Jha, P. (2026). PIIBench: A unified multi-source benchmark corpus for personally identifiable informat...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

and Cohen, N

McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165. Microsoft. (2023). Presidio — Data Protection SDK. https://github.com/microsoft/presidio. NVIDIA. (2023). Nemotron-PII: A synthetic dataset for PII detection. HuggingFace Hub: nvidi...

1989

[5] [5]

Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language- independent named entity recognition. CoNLL

2003

[6] [6]

and Navigli, R

Tedeschi, S. and Navigli, R. (2022). MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition. NAACL Findings

2022

[7] [7]

Weischedel, R. et al. (2013). OntoNotes Release 5.0. LDC Catalog No.: LDC2013T19. Wolf, T., et al. (2020). Transformers: State-of-the-art natural language processing. EMNLP 2020 System Demonstrations. Fine-Tuning Over Architectural Complexity: PII Detection on PIIBench 13 Appendix A: Complete Fine-Entity Full-Test Comparison All 82 retained fine entity ty...

2013