Subject-level Inference for Realistic Text Anonymization Evaluation
Pith reviewed 2026-05-09 22:30 UTC · model grok-4.3
The pith
Even after masking over 90 percent of personal information spans in text, subject-level inference still recovers details about the majority of individuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that subject-level inference protection drops to as low as 33 percent even when more than 90 percent of PII spans are masked, so that the majority of personal information stays recoverable through contextual inference. It further demonstrates that target-subject-focused anonymization leaves non-target subjects substantially more exposed. The SPIA benchmark supplies the first set of subject-level protection metrics to evaluate these gaps across multi-subject documents in legal and online domains.
What carries the argument
The SPIA (Subject-level PII Inference Assessment) benchmark, which replaces span-level counting with metrics that quantify how much information about each individual subject can still be inferred from the remaining text.
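The contrast between the two units of evaluation can be sketched as two simple rates over a document. This is a minimal illustration only: the function names, the toy data, and the exact form of the protection metric are assumptions, not the paper's definitions.

```python
def span_masking_rate(pii_spans, masked_spans):
    """Span-level view: fraction of annotated PII spans that were masked."""
    return len(masked_spans & pii_spans) / len(pii_spans)

def subject_protection_rate(subjects, recovered_subjects):
    """Subject-level view (illustrative): fraction of subjects about whom
    the adversary could NOT infer identifying details from the redacted text."""
    return 1 - len(recovered_subjects & subjects) / len(subjects)

# Toy document: 10 PII spans, 3 subjects. Masking 9 of 10 spans (90%)
# can still leave 2 of 3 subjects inferable from the surrounding context,
# so span-level and subject-level scores diverge sharply.
pii = {f"span{i}" for i in range(10)}
masked = {f"span{i}" for i in range(9)}
subjects = {"A", "B", "C"}
recovered = {"A", "B"}  # inferred despite the masking

print(span_masking_rate(pii, masked))             # 0.9
print(subject_protection_rate(subjects, recovered))  # ~0.33
```

The point the sketch makes is the paper's headline pattern: a 90% span-masking score coexisting with roughly 33% subject-level protection.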
If this is right
- Span-based metrics alone cannot certify safe anonymization because they overlook recoverable information about individuals.
- Anonymization tools must evaluate and mitigate inference risks for every subject mentioned, not only a designated target.
- Current practices in legal and online text release may leave most personal details accessible despite heavy redaction.
- Multi-subject documents require new techniques that protect all referenced people simultaneously.
Where Pith is reading between the lines
- Anonymization pipelines could add an inference-simulation step during processing to detect and further obscure high-risk contextual links.
- Public datasets released after anonymization might need mandatory subject-level audits before distribution to prevent unintended re-identification.
- The same shift from span metrics to subject-level inference could apply to other privacy tasks such as redacting transcripts or database exports.
- Longer documents or cross-document collections would likely show even lower protection rates under the same evaluation.
Load-bearing premise
The chosen 675 documents and inference methods accurately represent the capabilities of realistic adversaries facing multi-subject texts in legal and online settings.
What would settle it
A controlled replication on comparable documents that applies over 90 percent PII span masking and finds subject-level protection remaining above 70 percent would falsify the reported protection levels.
Original abstract
Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations, we present SPIA (Subject-level PII Inference Assessment), the first benchmark that shifts the unit of evaluation from text spans to individuals, comprising 675 documents across legal and online domains with novel subject-level protection metrics. Extensive experiments show that even when over 90% of PII spans are masked, subject-level inference protection drops as low as 33%, leaving the majority of personal information recoverable through contextual inference. Furthermore, target-subject-focused anonymization leaves non-target subjects substantially more exposed than the target subject. We show that subject-level inference-based evaluation is essential for ensuring safe text anonymization in real-world settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SPIA, a new benchmark for subject-level PII inference assessment in text anonymization evaluation. It comprises 675 documents from legal and online domains, shifts evaluation from span-based metrics to individuals, and reports novel subject-level protection metrics. Experiments indicate that masking over 90% of PII spans yields subject-level protection as low as 33%, with contextual inference recovering substantial personal information, and that target-subject-focused anonymization exposes non-target subjects more than the target.
Significance. If the benchmark construction and inference results hold under scrutiny, the work would be significant for exposing limitations of span-based anonymization metrics and providing the first dedicated subject-level evaluation framework that accounts for multi-subject scenarios. The introduction of a concrete benchmark with 675 documents and new metrics represents a constructive step toward more realistic privacy assessments in NLP.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The headline quantitative claim that subject-level inference protection drops to 33% even after >90% PII span masking is presented without any description of the inference attack methods, model training/prompting procedures, calibration against ground-truth subjects, or statistical validation. This absence prevents assessment of whether the 33% figure reflects realistic adversary capabilities or is an artifact of the chosen procedures.
- [§3] §3 (Benchmark and Document Selection): The selection of the 675 documents and the specific inference methods applied to them must be shown to approximate motivated adversaries using public data or web search; absent external validation or comparison to weaker baselines, the multi-subject exposure gap and overall protection numbers cannot be treated as general evidence against span masking.
minor comments (2)
- [§2] Define the exact formulas for the novel subject-level protection metrics in §2 or §3 with an example calculation on a sample document to improve clarity.
- [§3] Add a table summarizing the distribution of documents across legal vs. online domains and number of subjects per document.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing the SPIA benchmark. The comments raise valid points about methodological transparency and validation, which we address point by point below. We are committed to revising the paper to improve clarity while preserving the core contributions.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline quantitative claim that subject-level inference protection drops to 33% even after >90% PII span masking is presented without any description of the inference attack methods, model training/prompting procedures, calibration against ground-truth subjects, or statistical validation. This absence prevents assessment of whether the 33% figure reflects realistic adversary capabilities or is an artifact of the chosen procedures.
Authors: We appreciate this feedback on presentation. Section 4.1 details the inference models (GPT-4 and open-source LLMs) and the zero-shot/few-shot prompting procedures used to recover subject identities from anonymized text via contextual cues. Section 4.2 describes calibration by exact matching of inferred attributes against ground-truth subjects in the original documents, with protection rates computed as the fraction of subjects not correctly recovered. Results are averaged over five independent runs with different seeds, and 95% confidence intervals are reported for statistical validation. The abstract is intentionally concise, but we will revise it to include a one-sentence overview of the attack setup and evaluation protocol. This will allow readers to judge the realism of the 33% figure without changing any numbers or conclusions. revision: yes
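The aggregation the authors describe, a per-seed protection rate averaged over runs with a 95% confidence interval, might look like the following. This is a generic normal-approximation sketch with made-up per-run numbers; the paper's actual statistical procedure is not reproduced here.

```python
import statistics

def mean_with_ci(rates, z=1.96):
    """Mean of per-run rates with a normal-approximation 95% CI
    (z=1.96) based on the sample standard deviation."""
    m = statistics.mean(rates)
    if len(rates) > 1:
        half = z * statistics.stdev(rates) / len(rates) ** 0.5
    else:
        half = 0.0
    return m, (m - half, m + half)

# Hypothetical subject-level protection rate from five seeded runs.
runs = [0.33, 0.35, 0.31, 0.34, 0.32]
m, (lo, hi) = mean_with_ci(runs)
print(f"protection = {m:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only five runs, a t-based interval would be wider than this z-based one; the normal approximation is used here purely to keep the sketch short.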
-
Referee: [§3] §3 (Benchmark and Document Selection): The selection of the 675 documents and the specific inference methods applied to them must be shown to approximate motivated adversaries using public data or web search; absent external validation or comparison to weaker baselines, the multi-subject exposure gap and overall protection numbers cannot be treated as general evidence against span masking.
Authors: We agree that stronger grounding against real-world adversaries would strengthen the claims. The 675 documents were drawn from publicly released legal case records and online discussion threads (with identifiable subjects), which are representative of data that would require anonymization. Inference relies on public LLMs operating on the anonymized text plus general knowledge, simulating an adversary without private data access. To address the request for baselines, we will add a new paragraph in §3.3 comparing our contextual inference success rates against weaker methods (keyword matching and random guessing), demonstrating that contextual attacks substantially outperform them. Full live web-search validation on the original subjects was not performed due to ethical constraints around re-identifying real individuals; we will explicitly discuss this limitation and the rationale for the proxy approach in the revised manuscript. revision: partial
Circularity Check
No significant circularity; empirical benchmark evaluation is self-contained
full rationale
The paper defines SPIA as a new benchmark with 675 documents and subject-level protection metrics, then reports experimental results on inference recovery after span masking. These results are obtained by applying the described inference methods to the benchmark data and computing the metrics directly; by construction, no step reduces a claimed prediction or protection score to a fitted parameter, a self-definition, or a self-citation chain. The derivation chain consists of independent data collection, masking application, and metric computation, with no tautological loops.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: subject-level inference risk can be reliably quantified using the novel protection metrics on the provided documents.