Look Twice before You Leap: A Rational Framework for Localized Adversarial Anonymization

Chong Mu; Donghang Duan; Leyi Cai; Lizong Zhang; Xu Zheng; Yuefeng He

arxiv: 2512.06713 · v3 · submitted 2025-12-07 · 💻 cs.CR · cs.CL

Donghang Duan, Xu Zheng, Yuefeng He, Chong Mu, Leyi Cai, Lizong Zhang

Pith reviewed 2026-05-17 01:13 UTC · model grok-4.3

keywords adversarial anonymization · local privacy preservation · text anonymization · privacy-utility tradeoff · small-scale language models · rational decision making · ghost leak filtering

The pith

An arbitrator validates inferences to make local adversarial anonymization rational and avoid utility collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current text anonymization fails on local small-scale models not only because those models are limited, but because greedy adversarial methods over-remove information in an irrational way. It introduces RLAA, a training-free system with three roles: an attacker that proposes potential leaks, an arbitrator that checks whether those leaks are genuine, and an anonymizer that acts only on validated ones. By framing the process as a running comparison of marginal privacy gain against marginal utility cost, the arbitrator supplies a rational stopping rule that keeps meaning intact while still protecting privacy. A reader who wants to anonymize documents on their own device, without sending raw text to external services, would find this relevant because it offers a concrete way to get both privacy and usefulness from modest local models.

Core claim

We model the anonymization process as a trade-off between Marginal Privacy Gain (MPG) and Marginal Utility Cost (MUC), demonstrating that greedy strategies tend to drift into an irrational state. Instead, RLAA introduces an arbitrator that acts as a rationality gatekeeper, validating the attacker's inference to filter out ghost leaks. This mechanism promotes a rational early-stopping criterion, and structurally prevents utility collapse.
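
The marginal framing can be made concrete with a small sketch. This summary does not give the paper's exact definitions of MPG and MUC, so the code below assumes they are plain per-iteration differences of a privacy score and a utility score; the function name `rational_stop_index` and the toy numbers are hypothetical and only illustrate the stopping logic.

```python
from typing import Sequence


def rational_stop_index(privacy: Sequence[float], utility: Sequence[float]) -> int:
    """Return the last iteration worth keeping under a marginal-gain test.

    privacy[t] and utility[t] are scores after t anonymization rounds
    (index 0 is the original text). ASSUMPTION: MPG and MUC are taken as
    plain first differences; the paper may define them differently.
    """
    if len(privacy) != len(utility) or not privacy:
        raise ValueError("score sequences must be non-empty and equal length")
    for t in range(1, len(privacy)):
        mpg = privacy[t] - privacy[t - 1]   # marginal privacy gain of round t
        muc = utility[t - 1] - utility[t]   # marginal utility cost of round t
        if mpg <= muc:                      # round t costs at least what it buys
            return t - 1                    # stop before applying it
    return len(privacy) - 1                 # every round paid for itself


# Toy trajectory (made-up numbers): gains shrink while costs grow.
privacy_scores = [0.20, 0.55, 0.70, 0.72, 0.73]
utility_scores = [1.00, 0.95, 0.90, 0.75, 0.55]
stop = rational_stop_index(privacy_scores, utility_scores)
```

On this toy trajectory a greedy strategy would run all four rounds, while the marginal test keeps only the first two, since the third round buys little privacy for a large utility loss.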

What carries the argument

The Attacker-Arbitrator-Anonymizer architecture, in which the arbitrator validates proposed leaks against a marginal-gain versus marginal-cost test to enforce rational early stopping.
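
A minimal sketch of the three-role control flow may help fix ideas. It is not the authors' implementation: the `Leak` type, the callable signatures, `max_rounds`, and the rule-based toy roles are all assumptions made for illustration, and in a real system each role would presumably wrap a call to a local language model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Leak:
    attribute: str   # e.g. "location"
    evidence: str    # text span the attacker points to


# Hypothetical role signatures; in practice each would wrap a local LLM call.
Attacker = Callable[[str], List[Leak]]
Arbitrator = Callable[[str, Leak], bool]
Anonymizer = Callable[[str, List[Leak]], str]


def anonymize(text: str, attack: Attacker, validate: Arbitrator,
              rewrite: Anonymizer, max_rounds: int = 5) -> str:
    """Rewrite only on arbitrator-validated leaks; stop when none remain."""
    for _ in range(max_rounds):
        proposed = attack(text)
        validated = [leak for leak in proposed if validate(text, leak)]
        if not validated:          # early stop: nothing genuine left to fix
            break
        text = rewrite(text, validated)
    return text


# Toy stand-ins for the three roles (rule-based, for illustration only).
def toy_attack(text: str) -> List[Leak]:
    leaks = []
    if "Paris" in text:
        leaks.append(Leak("location", "Paris"))
    leaks.append(Leak("sex", "polite tone"))  # a weak guess with no textual basis
    return leaks


def toy_validate(text: str, leak: Leak) -> bool:
    return leak.evidence in text   # accept only evidence literally present


def toy_rewrite(text: str, leaks: List[Leak]) -> str:
    for leak in leaks:
        text = text.replace(leak.evidence, "a major city")
    return text


result = anonymize("I live in Paris and love cycling.",
                   toy_attack, toy_validate, toy_rewrite)
```

In this toy run the location leak is validated and generalized, the unsupported guess is filtered out, and the loop stops in the second round because no validated leak remains.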

If this is right

RLAA runs entirely on local small-scale models without calling remote APIs, removing the need to disclose raw data.
The method produces a better privacy-utility trade-off than strong baselines across multiple text benchmarks.
Because it is training-free, the framework can be applied immediately to existing local models without additional data or fine-tuning.
Rational early stopping structurally limits over-anonymization, preserving downstream task performance that greedy methods destroy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same arbitrator pattern could be added to other greedy privacy or compression pipelines where over-removal is a known risk.
If the arbitrator proves reliable across domains, developers might reduce reliance on large remote models for any sensitive local processing step.
Extending the MPG-MUC framing to non-text modalities such as images or structured data would test whether the rationality gate generalizes.

Load-bearing premise

Utility collapse on small local models comes mainly from the irrationality of greedy strategies rather than from the models' inherent limits, and the arbitrator can reliably separate real leaks from false ones without introducing new errors.

What would settle it

Run RLAA on the same benchmarks but disable or randomize the arbitrator so it accepts or rejects inferences at chance level, then measure whether the privacy-utility curve falls back to the level of the original greedy baselines.
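
One way such an ablation could be wired up is sketched below, assuming the arbitrator is exposed as a function taking the current text and a proposed leak and returning accept or reject. That interface, the name `chance_arbitrator`, and the parameters `accept_prob` and `seed` are hypothetical.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def chance_arbitrator(accept_prob: float = 0.5, seed: int = 0) -> Callable[[str, T], bool]:
    """Build a validator that ignores its inputs and accepts at a fixed rate."""
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    rng = random.Random(seed)   # seeded so ablation runs are repeatable

    def validate(text: str, leak: T) -> bool:
        return rng.random() < accept_prob

    return validate


always = chance_arbitrator(accept_prob=1.0)   # accepts every proposal, like a greedy loop
never = chance_arbitrator(accept_prob=0.0)    # rejects every proposal, so no edits happen
coin = chance_arbitrator(accept_prob=0.5, seed=42)
decisions = [coin("some text", "some leak") for _ in range(1000)]
accept_rate = sum(decisions) / len(decisions)
```

Sweeping `accept_prob` between 0 and 1 would give a family of control conditions against which the real arbitrator's privacy-utility curve could be compared.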

Figures

Figures reproduced from arXiv: 2512.06713 by Chong Mu, Donghang Duan, Leyi Cai, Lizong Zhang, Xu Zheng, Yuefeng He.

**Figure 2.** The RLAA Framework. Utilizing an Attacker-Arbitrator-Anonymizer architecture, the arbitrator acts as a rationality gatekeeper. It validates attacker inferences to filter out ghost leaks with negligible privacy benefits, structurally preventing utility collapse caused by irrational greedy strategies.

**Figure 3.** Privacy-Utility Trade-off. RLAA achieves superior trade-offs compared to FgAA across iterations on two datasets. The trade-off dynamics for structural metrics (ROUGE-L/BLEU) are detailed in Appendix D.

**Figure 4.** Cumulative MRS Analysis of Llama3-8B. The figure displays Llama3-8B’s cumulative MRS during the anonymization process on two datasets. FgAA (Red) shows a sustained increase, whereas RLAA (Blue) maintains a stable low MRS. The remaining results for DeepSeek-V3.2-Exp and Qwen2.5-7B are provided in Appendix D.

**Figure 5.** Privacy-utility trade-offs via structural metrics (ROUGE-L and BLEU). Results are shown for PersonalReddit (Left) and reddit-self-disclosure (Right), demonstrating RLAA’s resistance to structural collapse.

**Figure 6.** Cumulative MRS Profiles across Different Model Scales. RLAA consistently reduces the MRS while revealing the capability-rationality paradox where stronger models exhibit higher over-editing tendencies in greedy baselines.

**Figure 7.** Extended MRS Dynamics. These results across Utility, ROUGE-L, and BLEU metrics confirm the generalization of RLAA’s rational decision-making.

**Figure 8.** SFT Training Dynamics. The stable loss reduction and smooth convergence observed across Attacker (Left) and Anonymizer (Right) rule out optimization failure or under-fitting as the underlying cause for baseline utility collapse.

**Figure 9.** Human pairwise evaluation results. RLAA achieves a dominant win rate against FgAA-Naive while reflecting high inter-annotator consistency.

Original abstract

Current LLM-based frameworks for text anonymization usually rely on remote API services from powerful LLMs, which creates an inherent privacy paradox: users must disclose the raw data to untrusted third parties for guaranteed privacy preservation. Moreover, directly migrating current solutions to local small-scale models (LSMs) offers a suboptimal solution with severe utility collapse. Our work argues that this failure stems not merely from the capability deficits of LSMs, but significantly from the inherent irrationality of the greedy adversarial strategies employed by current state-of-the-art (SOTA) methods. To address this drawback, we propose Rational Localized Adversarial Anonymization (RLAA), a fully localized and training-free framework featuring an Attacker-Arbitrator-Anonymizer architecture. We model the anonymization process as a trade-off between Marginal Privacy Gain (MPG) and Marginal Utility Cost (MUC), demonstrating that greedy strategies tend to drift into an irrational state. Instead, RLAA introduces an arbitrator that acts as a rationality gatekeeper, validating the attacker's inference to filter out ghost leaks. This mechanism promotes a rational early-stopping criterion, and structurally prevents utility collapse. Extensive experiments on different benchmarks demonstrate that RLAA achieves a superior privacy-utility trade-off compared to strong baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces an Attacker-Arbitrator-Anonymizer setup for local text anonymization that models MPG/MUC trade-offs and adds a rationality gatekeeper, but the claim that this fixes utility collapse better than model limits alone still needs stronger backing.

The paper introduces an Attacker-Arbitrator-Anonymizer architecture for fully local text anonymization. It models the process as an MPG/MUC trade-off and uses the arbitrator to enforce rational early stopping in place of greedy attacks.

The target problem is the utility collapse that appears when anonymization moves from large remote LLMs to small local ones. The authors argue that much of the collapse comes from irrational greedy strategies rather than model size alone, and they add a validation step that filters ghost leaks so that privacy improves without sacrificing much utility. The training-free, localized design is a plus for practical deployment in sensitive settings. The experiments reportedly show superior trade-offs, which would be valuable if the details hold up.

The main concern is whether the arbitrator can perform its validation role on the same limited models. If its inference checks are unreliable, it could either miss real leaks or introduce new utility problems, which would undermine the claim that rationality fixes the core issue. The abstract gives no numbers on how ghost-leak filtering was evaluated and no ablation comparisons, so the strength of that part is hard to assess from what is visible.

The paper should interest people working on on-device privacy for text data, for example in regulated industries that avoid cloud dependencies. The framing is new enough to warrant a referee's time, particularly to verify the experimental claims and the arbitrator's real impact. I recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Rational Localized Adversarial Anonymization (RLAA), a training-free framework for text anonymization on local small-scale models (LSMs). It argues that utility collapse under current adversarial methods stems significantly from the irrationality of greedy strategies (drifting via the MPG/MUC trade-off) rather than solely from LSM capability deficits. The proposed Attacker-Arbitrator-Anonymizer architecture uses an arbitrator to validate attacker inferences, filter ghost leaks, and enforce rational early-stopping, with experiments claimed to demonstrate superior privacy-utility trade-offs over baselines.

Significance. If the empirical superiority holds and the arbitrator reliably mitigates irrational drift without new errors, the work could enable practical, fully localized anonymization that avoids the privacy paradox of remote APIs. The explicit MPG/MUC modeling and rationality gatekeeper provide a conceptual contribution to adversarial anonymization; the training-free and localized design is a clear practical strength.

major comments (2)

[§3 (Arbitrator component)] The central claim that the arbitrator filters ghost leaks to enable rational early-stopping and structurally prevents utility collapse assumes reliable inference validation; however, because the arbitrator operates on the same class of LSMs, it inherits the same capability constraints, which risks either missing real leaks or introducing new utility errors. This assumption is load-bearing for both the irrationality diagnosis and the superiority claim over greedy baselines.
[Experiments section] The assertion of superior privacy-utility trade-offs and that irrationality is the main failure mode lacks reported details on concrete metrics (e.g., exact privacy and utility measures), baseline implementations, statistical significance tests, or quantitative evaluation of ghost-leak filtering accuracy. Without these, the causal attribution to irrationality versus inherent LSM limits cannot be fully assessed.

minor comments (2)

Define 'ghost leaks' more precisely and explain the arbitrator's validation procedure without additional training or data, including any fallback mechanisms if validation is uncertain.
Specify the exact benchmarks, model sizes, and quantitative results (including tables or figures) that support the MPG/MUC trade-off analysis and early-stopping criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications on the arbitrator's design and committing to expanded experimental details in revision.

Referee: [§3 (Arbitrator component)] The central claim that the arbitrator filters ghost leaks to enable rational early-stopping and structurally prevents utility collapse assumes reliable inference validation; however, because the arbitrator operates on the same class of LSMs, it inherits the same capability constraints, which risks either missing real leaks or introducing new utility errors. This assumption is load-bearing for both the irrationality diagnosis and the superiority claim over greedy baselines.

Authors: We appreciate this observation on the shared model class. Our framework does not claim the arbitrator possesses superior capabilities; instead, it exploits a separation of concerns. The attacker pursues maximal MPG while the arbitrator applies a distinct validation objective to detect ghost leaks via consistency checks against the MPG/MUC trade-off. This role differentiation enables filtering of irrational drift even under identical LSM constraints, as the arbitrator only needs to assess inference plausibility rather than generate new anonymizations. We will revise §3 to include an explicit discussion of this role separation and potential edge cases where arbitrator errors could occur. revision: partial
Referee: [Experiments section] The assertion of superior privacy-utility trade-offs and that irrationality is the main failure mode lacks reported details on concrete metrics (e.g., exact privacy and utility measures), baseline implementations, statistical significance tests, or quantitative evaluation of ghost-leak filtering accuracy. Without these, the causal attribution to irrationality versus inherent LSM limits cannot be fully assessed.

Authors: We agree that greater transparency on the experimental setup is warranted. The manuscript reports comparative results across benchmarks demonstrating improved trade-offs, but we will expand the Experiments section to define the precise privacy (e.g., inference success rate) and utility (e.g., semantic similarity or task performance) metrics with formulas, provide implementation details or pseudocode for all baselines, report statistical significance (e.g., via paired t-tests with p-values), and add quantitative metrics for ghost-leak filtering such as precision/recall of the arbitrator. These additions will better isolate the contribution of rational early-stopping from baseline LSM limitations. revision: yes
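
The two quantitative checks named in this response can be sketched with the standard library alone. The labels and scores below are made up, and the function names are hypothetical; the sketch only shows what arbitrator precision/recall against human judgments and a paired t statistic would compute, not how the authors intend to run them.

```python
import math
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of 'genuine leak' decisions against gold labels."""
    if len(predicted) != len(actual):
        raise ValueError("inputs must have equal length")
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples a[i], b[i] (degrees of freedom: n - 1)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    if var == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n)


# Made-up labels: arbitrator verdicts vs. human judgments on six proposed leaks.
verdicts = [True, True, False, True, False, False]
gold = [True, False, False, True, True, False]
p, r = precision_recall(verdicts, gold)
```

Turning the t statistic into a p-value needs the Student t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_rel`) would be the usual choice.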

Circularity Check

0 steps flagged

No circularity: derivation introduces independent arbitrator mechanism and MPG-MUC modeling without reducing to inputs by construction

The paper derives its central result by first arguing that LSM utility collapse arises significantly from greedy irrationality (via modeled MPG/MUC drift) rather than solely capability limits, then introducing the Attacker-Arbitrator-Anonymizer architecture with an arbitrator that validates inferences to enable rational early-stopping. This chain does not exhibit self-definitional reduction, fitted inputs renamed as predictions, or load-bearing self-citations; the trade-off model and arbitrator gatekeeper are presented as novel constructs whose effectiveness is evaluated externally on benchmarks. No equations equate the final privacy-utility superiority directly to the initial assumptions by construction, and the framework remains self-contained against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on modeling anonymization as a marginal privacy-utility trade-off and on the existence of detectable ghost leaks that an arbitrator can filter; these are introduced without external benchmarks or formal proofs visible in the abstract.

axioms (2)

domain assumption: Greedy adversarial strategies on LSMs tend to drift into an irrational state that causes utility collapse.
Stated directly in the abstract as the root cause of current failures.
ad hoc to paper: An arbitrator can validate attacker inferences to filter ghost leaks without additional training or data.
Core mechanism of RLAA; no independent evidence or formal justification provided in abstract.

invented entity (1)

Arbitrator component (no independent evidence)
purpose: Acts as rationality gatekeeper to validate inferences and enable rational early-stopping.
New architectural element introduced to address irrationality in greedy methods.

