Recognition: 2 Lean theorem links
Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs
Pith reviewed 2026-05-16 05:19 UTC · model grok-4.3
The pith
TRACE-RPS reduces LLM attribute inference accuracy from around 50% to below 5% by anonymizing specific leaking words and inducing rejection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRACE identifies and anonymizes fine-grained privacy-leaking textual elements through attention mechanisms and inference-chain generation, while RPS uses lightweight two-stage optimization to induce model rejection of attribute inference queries. Together they lower inference accuracy from roughly 50 percent to below 5 percent across the tested LLMs, with cross-model generalization and prompt robustness.
What carries the argument
The TRACE-RPS framework, in which TRACE performs attention-based identification plus anonymization of leaking elements and RPS performs two-stage optimization to train rejection behavior.
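As a concrete reading of this machinery, here is a minimal sketch of what an attention-based identification step could look like with an open-source HuggingFace causal LM. The model choice, the mean-over-layers-and-heads aggregation, and the leak_threshold cutoff are illustrative assumptions; the paper's actual scoring and its inference-chain generation step are not reproduced here.

```python
# Illustrative sketch only: score user-text tokens by the attention an
# attribute-inference probe places on them, then flag high scorers as
# candidates for anonymization. Model name and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # any open-source causal LM with attentions
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_attentions=True)

def leaking_tokens(user_text: str, probe: str, leak_threshold: float = 0.02):
    inputs = tok(f"{user_text}\n{probe}", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Average attention over layers and heads, then read the row for the
    # final (probe) position: how much it attends to each earlier token.
    att = torch.stack(out.attentions)        # (layers, batch, heads, seq, seq)
    scores = att.mean(dim=(0, 2))[0, -1, :]  # (seq,)
    n_user = len(tok(user_text)["input_ids"])
    return [
        (tok.decode([int(inputs["input_ids"][0, i])]), float(scores[i]))
        for i in range(n_user)
        if float(scores[i]) > leak_threshold
    ]

# e.g. leaking_tokens("I walk to my office near the harbour every morning.",
#                     "What city does this user live in?")
```

In the full pipeline, flagged tokens would be replaced with neutral placeholders before the text is shared, with inference-chain generation catching elements that raw attention alone misses.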
If this is right
- Attribute inference accuracy falls below 5 percent on open-source models.
- The defense maintains utility while providing prompt-variation robustness.
- It generalizes across diverse LLMs without model-specific retraining.
Where Pith is reading between the lines
- The method could be extended to block other forms of sensitive inference such as intent or preference extraction.
- An attacker might focus on bypassing the attention-based detection step rather than the rejection step.
Load-bearing premise
The attention-based detection of leaking elements and the induced rejection will continue to work against stronger future attacks and across prompt variations not tested in the experiments.
What would settle it
A new attribute inference attack that recovers high accuracy on text processed by TRACE-RPS anonymization and rejection training.
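A minimal harness for that settling experiment is sketched below; attack_infer and trace_rps_defend are hypothetical stand-ins for a new attacker and the released defense pipeline, not functions from the paper's repository.

```python
# Sketch of the falsification test: compare an attacker's attribute-inference
# accuracy on raw text versus text processed by the defense. The claim fails
# if a new attack recovers high accuracy on the defended column.
def inference_accuracy(records, infer, defend=None):
    """records: iterable of (text, true_attribute) pairs."""
    hits, total = 0, 0
    for text, truth in records:
        processed = defend(text) if defend else text
        guess = infer(processed)  # attacker's prediction, or None on refusal
        hits += int(guess == truth)
        total += 1
    return hits / total

# baseline = inference_accuracy(data, attack_infer)                    # ~50% reported
# defended = inference_accuracy(data, attack_infer, trace_rps_defend)  # <5% claimed
```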
read the original abstract
Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50% to below 5% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs. Our code is available at https://github.com/Jasper-Yan/TRACE-RPS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TRACE-RPS, a unified defense combining TRACE (attention-based identification and anonymization of privacy-leaking textual elements via inference chain generation) with RPS (a lightweight two-stage optimization to induce LLM rejection behaviors). The central empirical claim is that this framework reduces attribute inference accuracy from around 50% to below 5% on open-source LLMs, while providing cross-model generalization, prompt-variation robustness, and favorable utility-privacy tradeoffs. Code is released at the cited GitHub repository.
Significance. If the reported accuracy reductions and robustness hold under expanded testing, the work would offer a meaningful advance over coarse-grained anonymization defenses by enabling fine-grained, proactive prevention of attribute inference through both text alteration and model behavior induction. The public code release supports reproducibility and is a clear strength.
major comments (2)
- [Evaluation] Evaluation section: The headline result (reduction from ~50% to <5% accuracy) is presented for the specific prompts and attack implementations used, but no experiments test prompt rephrasings that alter attention distributions (directly affecting TRACE's identification step) or stronger inference chains that could bypass RPS-induced rejections. This is load-bearing for the generalization claims; a sketch of one such probe follows this list.
- [§3.2] §3.2 (RPS description): The two-stage optimization is claimed to induce robust rejection, yet the manuscript provides no analysis or ablation showing whether these behaviors persist against future attack variants beyond those in the current evaluation suite; the load-bearing premise identified above flags exactly this gap.
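To make the first comment concrete, a robustness probe could measure how far a rephrased attack prompt shifts the attention distribution TRACE reads, for example via Jensen-Shannon divergence between the two attention maps. This is a hedged sketch, not part of the paper's evaluation: get_attention is a hypothetical helper assumed to return attention mass over the user-text tokens (as in a TRACE-style identification step), and any divergence cutoff would need calibration.

```python
# Hypothetical probe: does rephrasing the attack prompt shift the attention
# distribution that the defense's identification step depends on?
import numpy as np
from scipy.spatial.distance import jensenshannon

def attention_shift(user_text, probe, rephrased_probe, get_attention):
    """Squared Jensen-Shannon distance (= JS divergence) between the
    attention maps induced by two phrasings of the same attack probe."""
    p = np.asarray(get_attention(user_text, probe), dtype=float)
    q = np.asarray(get_attention(user_text, rephrased_probe), dtype=float)
    p, q = p / p.sum(), q / q.sum()  # renormalize to proper distributions
    return jensenshannon(p, q) ** 2

# A large shift on a rephrasing that still succeeds as an attack would
# indicate the detection step is prompt-sensitive.
```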
minor comments (2)
- [Abstract] The abstract states 'strong cross-model generalization' and 'prompt-variation robustness' without referencing the specific tables or figures that quantify these (e.g., per-model accuracy breakdowns or variation ablations); adding explicit pointers would improve clarity.
- [§3.1] Notation for attention scores and inference chains in §3.1 could be made more consistent with standard LLM literature to aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, outlining planned revisions where appropriate.
read point-by-point responses
- Referee: [Evaluation] Evaluation section: The headline result (reduction from ~50% to <5% accuracy) is presented for the specific prompts and attack implementations used, but no experiments test prompt rephrasings that alter attention distributions (directly affecting TRACE's identification step) or stronger inference chains that could bypass RPS-induced rejections. This is load-bearing for the generalization claims.
Authors: We acknowledge that the headline results rely on the evaluated prompt set and attack implementations. While the manuscript reports prompt-variation robustness through multiple tested phrasings, we agree that targeted experiments on rephrasings explicitly designed to shift attention distributions (impacting TRACE) and on stronger inference chains (potentially bypassing RPS) were not included. In the revision we will add these experiments, including attention-map comparisons and enhanced chain variants, to strengthen the generalization claims. Revision: yes.
- Referee: [§3.2] §3.2 (RPS description): The two-stage optimization is claimed to induce robust rejection, yet the manuscript provides no analysis or ablation showing whether these behaviors persist against future attack variants beyond those in the current evaluation suite; the load-bearing premise identified above flags exactly this gap.
Authors: We agree that the manuscript lacks explicit ablations against future attack variants. Our current §3.2 includes ablations demonstrating rejection on the tested suite, and the abstract's robustness claim is scoped to the evaluated conditions. We will revise §3.2 to clarify the scope of the robustness claims, add further ablations on variant attack strengths within the current suite, and explicitly discuss the inherent limitation regarding unseen future variants. Revision: partial.
- Left unresolved: empirical demonstration that RPS-induced rejection behaviors will persist against arbitrary future attack variants not yet devised (an illustrative sketch of the rejection objective follows below).
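For orientation, here is an assumption-laden reconstruction of what a Stage-2 rejection-shaping term could look like, using symbols the paper's appendix exposes (weight β, set to 5 in Stage 2, and a rejection set R). The actual RPS losses, thresholds τ1/τ2, and suffix-update rule are not reproduced; this is a sketch of the mechanism under discussion, not the authors' implementation.

```python
# Hedged sketch: push the model's first generated token toward a rejection
# set R (e.g. token ids for "Sorry", "I"), weighted by beta.
import torch
import torch.nn.functional as F

def rejection_loss(logits, rejection_token_ids, beta=5.0):
    """logits: (batch, seq, vocab) from the model on text + optimized suffix.
    Returns a loss that is minimized when the next-token distribution
    concentrates probability mass on the rejection set."""
    log_probs = F.log_softmax(logits[:, -1, :], dim=-1)          # next token
    reject_mass = torch.logsumexp(log_probs[:, rejection_token_ids], dim=-1)
    return -beta * reject_mass.mean()

# Stage 1 would search replacement spans of suffix tokens until the rejection
# probability clears a first-token threshold; Stage 2 refines against a
# second-token threshold with this weighted term.
```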
Circularity Check
No circularity: empirical performance claims rest on held-out evaluations
full rationale
The paper proposes TRACE (attention-based identification of privacy-leaking elements) and RPS (two-stage optimization for rejection behavior) as a defense, then reports empirical accuracy drops (50% to <5%) on diverse LLMs using held-out evaluations. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the central result to its own inputs by construction. The methodology is described procedurally and the results are falsifiable via replication on the released code, satisfying the criteria for a self-contained empirical contribution.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean (J-cost uniqueness) · washburn_uniqueness_aczel · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50% to below 5%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] William Agnew, Harry H Jiang, Cella Sum, Maarten Sap, and Sauvik Das. Data defenses against large language models. arXiv preprint arXiv:2410.13138, 2024.
- [3] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
- [4] Ahmed Frikha, Nassim Walha, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, and Xuebing Zhou. IncogniText: Privacy-enhancing conditional text anonymization via LLM-based private attribute randomization. arXiv preprint arXiv:2407.02956, 2024.
- [5] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [7] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [8] Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! Universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023.
- [9] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023.
- [10] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. In Proc. NeurIPS.
- [11] Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
- [12] Chawin Sitawarin, Norman Mu, David Wagner, and Alexandre Araujo. PAL: Proxy-guided black-box attack on large language models. arXiv preprint arXiv:2402.09674, 2024.
- [13] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [14] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [15] Yanbo Wang, Jiyang Guan, Jian Liang, and Ran He. Do we really need curated malicious data for safety alignment in multi-modal large language models? In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19879–19889, 2025. Yanbo Wang, Yongcan Yu, Jian Liang, and Ran He. A comprehensive survey on trustworthiness in reasoning with large language models.
- [16] Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Mingchuan Yang, Bo Tang, Feiyu Xiong, and Zhiyu Li. Attention heads of large language models: A survey. arXiv preprint arXiv:2409.03752, 2024.
- [17] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable gradient-based adversarial attacks on large language models. In Proc. COLM.
- [18] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.