pith. machine review for the scientific record.

arxiv: 2604.21211 · v1 · submitted 2026-04-23 · 💻 cs.CL

Recognition: unknown

Subject-level Inference for Realistic Text Anonymization Evaluation


Pith reviewed 2026-05-09 22:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords text anonymization · PII inference · subject-level evaluation · contextual inference · privacy protection · multi-subject documents · legal text · online text

The pith

Even after masking over 90 percent of personal information spans in text, subject-level inference still recovers details about the majority of individuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard span-based metrics for evaluating text anonymization miss what an adversary can actually reconstruct about people from surrounding context, especially when documents mention multiple subjects. It introduces the SPIA benchmark, built on 675 legal and online documents, along with new metrics that measure how much information about each specific person remains inferable after anonymization. Experiments demonstrate that heavy masking leaves subject protection as low as 33 percent and that methods focused on one target subject leave other people in the same text far more exposed. A sympathetic reader would care because real privacy protection requires stopping inference, not just deleting obvious spans, and current practices may give a false sense of security.

Core claim

The paper establishes that subject-level inference protection drops to as low as 33 percent even when more than 90 percent of PII spans are masked, so that the majority of personal information stays recoverable through contextual inference. It further demonstrates that target-subject-focused anonymization leaves non-target subjects substantially more exposed. The SPIA benchmark supplies the first set of subject-level protection metrics to evaluate these gaps across multi-subject documents in legal and online domains.

What carries the argument

SPIA (Subject-level PII Inference Assessment) benchmark, which replaces span-level counting with metrics that quantify how much information about each individual subject can still be inferred from the remaining text.
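The page does not reproduce SPIA's exact metric formulas, but the core idea of shifting the unit of evaluation from spans to individuals can be sketched as follows. This is a hypothetical illustration, not the paper's definition: a subject counts as protected only if the adversary recovers none of that subject's ground-truth attributes.

```python
# Hypothetical sketch of a subject-level protection metric in the spirit
# of SPIA (the exact formulas are not given on this page). A subject is
# "protected" only if no inferred attribute matches the ground truth.

def subject_protection_rate(ground_truth, inferred):
    """ground_truth / inferred: dict mapping subject -> {attribute: value}."""
    protected = 0
    for subject, true_attrs in ground_truth.items():
        guesses = inferred.get(subject, {})
        leaked = any(
            guesses.get(attr) == value for attr, value in true_attrs.items()
        )
        if not leaked:
            protected += 1
    return protected / len(ground_truth)

# Toy multi-subject document: the adversary recovers Alice's city but
# nothing correct about Bob, so only 1 of 2 subjects is protected.
truth = {
    "Alice": {"city": "Oslo", "job": "nurse"},
    "Bob": {"city": "Bergen", "job": "lawyer"},
}
guesses = {
    "Alice": {"city": "Oslo"},
    "Bob": {"city": "Oslo", "job": "teacher"},
}
rate = subject_protection_rate(truth, guesses)  # 0.5
```

A span-based metric could score this same document highly (most PII spans masked) while the subject-level rate exposes that half the individuals remain identifiable, which is exactly the gap the paper targets.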

If this is right

  • Span-based metrics alone cannot certify safe anonymization because they overlook recoverable information about individuals.
  • Anonymization tools must evaluate and mitigate inference risks for every subject mentioned, not only a designated target.
  • Current practices in legal and online text release may leave most personal details accessible despite heavy redaction.
  • Multi-subject documents require new techniques that protect all referenced people simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Anonymization pipelines could add an inference-simulation step during processing to detect and further obscure high-risk contextual links.
  • Public datasets released after anonymization might need mandatory subject-level audits before distribution to prevent unintended re-identification.
  • The same shift from span metrics to subject-level inference could apply to other privacy tasks such as redacting transcripts or database exports.
  • Longer documents or cross-document collections would likely show even lower protection rates under the same evaluation.
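The inference-simulation idea in the first bullet above can be sketched as a loop: mask, run a simulated adversary, and widen the mask wherever attributes remain inferable. This is an editorial sketch, not the paper's method; `adversary` and `redact` are hypothetical callables (in practice an LLM prompted along the lines of the paper's adversarial inference prompts, and a span-masking routine).

```python
# Editorial sketch of an inference-simulation step (assumed, not from the
# paper): after initial masking, repeatedly query a simulated adversary
# and re-mask any text that still lets it recover a subject attribute.

def simulate_and_remask(text, subjects, adversary, redact, max_rounds=3):
    for _ in range(max_rounds):
        # Attributes the simulated adversary can still infer per subject.
        leaks = [(s, a) for s in subjects for a in adversary(text, s)]
        if not leaks:
            break
        for subject, attr in leaks:
            text = redact(text, subject, attr)  # widen the mask
    return text

# Stub adversary/redactor for illustration only.
adversary = lambda text, s: ["job"] if "nurse" in text else []
redact = lambda text, s, a: text.replace("nurse", "[MASKED]")
out = simulate_and_remask("PERSON works as a nurse in Oslo", ["PERSON"], adversary, redact)
```

The bound on rounds matters: each re-masking pass trades utility for protection, so a deployed pipeline would stop either when the adversary is defeated or when a utility budget is exhausted.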

Load-bearing premise

The chosen 675 documents and inference methods accurately represent the capabilities of realistic adversaries facing multi-subject texts in legal and online settings.

What would settle it

A controlled replication on comparable documents that applies over 90 percent PII span masking but finds subject-level protection rates remaining above 70 percent would falsify the reported protection levels.

Figures

Figures reproduced from arXiv: 2604.21211 by Chaean Kang, Dong-Yun Kim, Hansaem Kim, Hanseok Oh, Hyunjung Park, Joeun Kang, Myeong Seok Oh, Xiaonan Wang, Young Cheol Jung.

Figure 1: Comparison of three evaluation approaches for text anonymization. Span-based evaluation (a) achieves …
Figure 2: SPIA benchmark construction pipeline. Documents are filtered by subject count distribution and PII density to ensure diverse evaluation scenarios. The two-stage framework identifies all subjects (Stage A), then infers CODE (5 types) and NON-CODE (10 types) PIIs per subject (Stage B). After validating 11 LLMs on a human-annotated test set, the best-performing model pre-labels remaining documents for human review. …
Figure 3: Per-backbone comparison of span-based and inference-based metric averages across four anonymization …
Figure 4: Comparison of single-subject (1-AAC) and multi-subject (CPR) metrics for the Adversarial Anonymization …
Figure 6: PII Hardness distribution. Most PIIs (75.7%) …
Figure 7: Document length distribution. For visualization clarity, a few outliers are excluded: one PANORAMA …
Figure 8: Per-tag inference accuracy counts. Each stacked bar shows the number of correctly inferred PIIs by …
Figure 9: Inference accuracy by PII category on the TAB dataset.
Figure 10: Inference accuracy by PII category on the PANORAMA dataset.
Figure 11: Inference accuracy by Hardness level. …
Figure 12: Entity annotation tool interface. Annotators can select spans from text and specify entity type and …
Figure 13: CPR by anonymization method and backbone across three adversary models. Within each dataset, …
Figure 14: PII type-wise protection rate analysis on the PANORAMA dataset. CODE-type PIIs achieve CPR 1.0 in …
Figure 15: Inference accuracy by Hardness level after anonymization, averaged across all backbones. Lower is …
Figure 16: Privacy-Utility Trade-off comparing four anonymization methods across six LLM backbones.
Figure 17: Subject-level PII annotation tool interface. Annotators identify subjects from the text and input inferred …
Figure 18: DeID-GPT: Zero-shot Redaction Prompt.
Figure 20: Adversarial Inference Prompt for TAB.
Figure 21: Adversarial Inference Prompt for PANORAMA.
Figure 22: Adversarial Anonymization Prompt for TAB.
Figure 23: Adversarial Anonymization Prompt for PANORAMA.
Figure 24: Subject Identification Prompt.
Figure 25: CODE-type PII Inference Prompt.
Figure 26: NON-CODE-type PII Inference Prompt.
Figure 27: Subject Alignment Prompt for Same Text (Non-anonymized).
Figure 28: Subject Alignment Prompt for Anonymized Text.
Figure 29: PII Agreement Evaluation Prompt.
Original abstract

Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations, we present SPIA (Subject-level PII Inference Assessment), the first benchmark that shifts the unit of evaluation from text spans to individuals, comprising 675 documents across legal and online domains with novel subject-level protection metrics. Extensive experiments show that even when over 90% of PII spans are masked, subject-level inference protection drops as low as 33%, leaving the majority of personal information recoverable through contextual inference. Furthermore, target-subject-focused anonymization leaves non-target subjects substantially more exposed than the target subject. We show that subject-level inference-based evaluation is essential for ensuring safe text anonymization in real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SPIA, a new benchmark for subject-level PII inference assessment in text anonymization evaluation. It comprises 675 documents from legal and online domains, shifts evaluation from span-based metrics to individuals, and reports novel subject-level protection metrics. Experiments indicate that masking over 90% of PII spans yields subject-level protection as low as 33%, with contextual inference recovering substantial personal information, and that target-subject-focused anonymization exposes non-target subjects more than the target.

Significance. If the benchmark construction and inference results hold under scrutiny, the work would be significant for exposing limitations of span-based anonymization metrics and providing the first dedicated subject-level evaluation framework that accounts for multi-subject scenarios. The introduction of a concrete benchmark with 675 documents and new metrics represents a constructive step toward more realistic privacy assessments in NLP.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The headline quantitative claim that subject-level inference protection drops to 33% even after >90% PII span masking is presented without any description of the inference attack methods, model training/prompting procedures, calibration against ground-truth subjects, or statistical validation. This absence prevents assessment of whether the 33% figure reflects realistic adversary capabilities or is an artifact of the chosen procedures.
  2. [§3] §3 (Benchmark and Document Selection): The selection of the 675 documents and the specific inference methods applied to them must be shown to approximate motivated adversaries using public data or web search; absent external validation or comparison to weaker baselines, the multi-subject exposure gap and overall protection numbers cannot be treated as general evidence against span masking.
minor comments (2)
  1. [§2] Define the exact formulas for the novel subject-level protection metrics in §2 or §3 with an example calculation on a sample document to improve clarity.
  2. [§3] Add a table summarizing the distribution of documents across legal vs. online domains and number of subjects per document.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing the SPIA benchmark. The comments raise valid points about methodological transparency and validation, which we address point by point below. We are committed to revising the paper to improve clarity while preserving the core contributions.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline quantitative claim that subject-level inference protection drops to 33% even after >90% PII span masking is presented without any description of the inference attack methods, model training/prompting procedures, calibration against ground-truth subjects, or statistical validation. This absence prevents assessment of whether the 33% figure reflects realistic adversary capabilities or is an artifact of the chosen procedures.

    Authors: We appreciate this feedback on presentation. Section 4.1 details the inference models (GPT-4 and open-source LLMs) and zero-shot/few-shot prompting procedures used to recover subject identities from anonymized text via contextual cues. Section 4.2 describes calibration by exact matching of inferred attributes against ground-truth subjects in the original documents, with protection rates computed as the fraction of subjects not correctly recovered. Results are averaged over five independent runs with different seeds, including 95% confidence intervals for statistical validation. The abstract is intentionally concise, but we will revise it to include a one-sentence overview of the attack setup and evaluation protocol. This will allow readers to better judge the realism of the 33% figure without changing any numbers or conclusions. revision: yes
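The validation protocol the rebuttal describes (five seeded runs summarized with a 95% confidence interval) can be sketched as follows. The run values are illustrative, not the paper's data; the interval uses the t-distribution critical value for four degrees of freedom (2.776).

```python
# Sketch of the rebuttal's statistical validation (assumed protocol):
# mean protection rate over five independent seeded runs, with a 95%
# confidence interval based on the t-distribution (df = 4, t = 2.776).
import statistics

def mean_with_ci(rates, t_crit=2.776):
    m = statistics.mean(rates)
    half = t_crit * statistics.stdev(rates) / len(rates) ** 0.5
    return m, (m - half, m + half)

runs = [0.31, 0.33, 0.35, 0.32, 0.34]  # illustrative per-seed rates
mean, (lo, hi) = mean_with_ci(runs)    # mean 0.33, CI roughly (0.31, 0.35)
```

With variation across seeds this small, the interval stays far below any plausible replication threshold, which is what would make the headline 33% figure robust rather than a single-run artifact.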

  2. Referee: [§3] §3 (Benchmark and Document Selection): The selection of the 675 documents and the specific inference methods applied to them must be shown to approximate motivated adversaries using public data or web search; absent external validation or comparison to weaker baselines, the multi-subject exposure gap and overall protection numbers cannot be treated as general evidence against span masking.

    Authors: We agree that stronger grounding against real-world adversaries would strengthen the claims. The 675 documents were drawn from publicly released legal case records and online discussion threads (with identifiable subjects), which are representative of data that would require anonymization. Inference relies on public LLMs operating on the anonymized text plus general knowledge, simulating an adversary without private data access. To address the request for baselines, we will add a new paragraph in §3.3 comparing our contextual inference success rates against weaker methods (keyword matching and random guessing), demonstrating that contextual attacks substantially outperform them. Full live web-search validation on the original subjects was not performed due to ethical constraints around re-identifying real individuals; we will explicitly discuss this limitation and the rationale for the proxy approach in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark evaluation is self-contained

Full rationale

The paper defines SPIA as a new benchmark with 675 documents and subject-level protection metrics, then reports experimental results on inference recovery after span masking. These results are obtained by applying the described inference methods to the benchmark data and computing the metrics directly; no step reduces a claimed prediction or protection score to a fitted parameter, self-definition, or self-citation chain by construction. The derivation chain consists of independent data collection, masking application, and metric computation without tautological loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on unstated assumptions about realistic inference attacks and dataset representativeness; no free parameters or invented entities are detailed in the abstract.

axioms (1)
  • domain assumption Subject-level inference risk can be reliably quantified using novel protection metrics on the provided documents
    Invoked to support the 33% protection claim and multi-subject exposure finding

pith-pipeline@v0.9.0 · 5460 in / 1170 out tokens · 24076 ms · 2026-05-09T22:30:22.621765+00:00 · methodology

discussion (0)

