Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel; Cornelius Emde; Martin Gubri; Sangdoo Yun; Seong Joon Oh

arxiv: 2601.15220 · v2 · submitted 2026-01-21 · 💻 cs.CL

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel , Cornelius Emde , Sangdoo Yun , Seong Joon Oh , Martin Gubri This is my paper

Pith reviewed 2026-05-16 12:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords privacy collapsefine-tuningcontextual privacylanguage modelssafety evaluationsmechanistic analysisagentic tasksmemory boundaries

0 comments

The pith

Benign fine-tuning of language models leads to privacy collapse, breaking contextual privacy while benchmarks stay intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fine-tuning frontier language models on ordinary, helpful data can trigger privacy collapse. Models lose the capacity to respect contextual privacy norms, inappropriately sharing user information or crossing memory boundaries between separate contexts. This degradation appears across closed and open models, real-world and controlled datasets, and both agentic and memory-based tasks. The failure stays silent because standard safety and utility benchmarks continue to show high performance. Mechanistic probes indicate that internal privacy representations degrade far more readily than task-specific features during the same fine-tuning process.

Core claim

Benign fine-tuning of frontier models can lead to privacy collapse. Diverse, subtle patterns in training data degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional dialogue, and debugging code that prints internal variables. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a silent failure because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Experiments demonstrate the effect across six models, five fine-tuning datasets

What carries the argument

privacy representations, which are uniquely fragile to fine-tuning compared to task-relevant features that remain preserved

If this is right

Specialised agents produced by routine fine-tuning carry hidden privacy risks even when they pass existing safety benchmarks.
Current safety evaluations miss privacy collapse because they do not test contextual reasoning about information boundaries.
Fine-tuned models can leak private details across contexts while retaining high scores on utility and general safety tests.
Both agentic tool-use and memory-based tasks exhibit the same privacy degradation after benign fine-tuning.
Privacy representations degrade faster than task features, explaining why the failure remains undetected by standard metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fragility may affect other forms of contextual reasoning beyond privacy, such as consistency or truthfulness across sessions.
Including explicit privacy-preserving examples during fine-tuning could counteract the collapse.
Open-weight models enable direct inspection of how privacy features shift in activation space during fine-tuning.
Deployment pipelines for custom agents should add targeted privacy boundary tests before release.

Load-bearing premise

The observed privacy degradation is caused by subtle patterns in the fine-tuning data rather than other uncontrolled factors in the training process or model architecture.

What would settle it

A controlled fine-tuning run on data stripped of all identified subtle patterns (helpfulness optimisation, user information exposure, emotional dialogue, and internal-variable printing) that nevertheless shows no measurable drop in contextual privacy performance.

read the original abstract

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Benign fine-tuning quietly breaks contextual privacy in LLMs while standard benchmarks stay high, but the paper needs tighter controls to show the effect comes from the claimed data patterns rather than fine-tuning itself.

read the letter

The central observation is that routine fine-tuning on helpfulness, emotional dialogue, or code with internal prints can make models share private details across contexts or with tools, even as they keep strong scores on safety and utility tests. This holds across six models and five datasets for both agentic and memory tasks, which gives the claim some breadth. The mechanistic section argues that privacy-related representations are more fragile than task features, which is the part that could matter most if it replicates. That framing explains why current evaluations miss it and why it feels like a silent failure. The multi-model, multi-dataset setup is the clearest strength here, since it moves beyond single-model anecdotes. The main weakness is the lack of isolating controls. The experiments do not appear to hold dataset size, token distribution, and training setup fixed while varying only the listed subtle patterns, so it is hard to rule out that any fine-tuning produces similar side effects that standard benchmarks simply do not catch. Without those ablations or clear quantitative drops with error bars, the causal link stays suggestive rather than tight. This is worth a serious referee for groups working on agentic systems or post-training safety. A reader who needs to decide whether to fine-tune frontier models for tools or memory would get practical value from the results, even if they have to treat the exact triggers as provisional until the controls are shown. I would send it to review and ask for the ablations and full metrics in the first round.

Referee Report

2 major / 1 minor

Summary. The paper claims that benign fine-tuning of frontier language models on diverse datasets can induce 'privacy collapse,' where models lose the ability to reason about contextual privacy norms, inappropriately share information with tools, and violate memory boundaries, while retaining high performance on standard safety and utility benchmarks. This 'silent failure' is demonstrated empirically across six models (closed and open-weight), five fine-tuning datasets (real-world and controlled), and two task categories (agentic and memory-based), with mechanistic analysis indicating that privacy representations are uniquely fragile to fine-tuning compared to task-relevant features.

Significance. If the central empirical findings hold after addressing controls, the work identifies a previously under-appreciated risk in deploying specialized fine-tuned agents: privacy degradation can occur without triggering existing safety benchmarks. This would motivate new evaluation protocols focused on contextual privacy and could influence fine-tuning practices for models handling user data or tool use.

major comments (2)

[Experiments] Experiments section: The central causal claim—that specific subtle patterns (helpfulness optimization, emotional dialogue, code printing internals, etc.) trigger privacy collapse—lacks isolating ablations. No results are reported that hold dataset size, domain, token distribution, and training hyperparameters fixed while removing only the identified patterns, leaving open the possibility that the observed degradation is a generic side-effect of fine-tuning rather than a unique fragility of privacy representations.
[Mechanistic Analysis] Mechanistic Analysis: The assertion that privacy representations are 'uniquely fragile' relative to task-relevant features requires more precise quantification. The section should report specific metrics (e.g., representation similarity scores, probing accuracies, or layer-wise activation differences) with controls for model capacity and optimization trajectory to substantiate the uniqueness claim.

minor comments (1)

[Introduction] The abstract and introduction would benefit from explicit definitions of 'contextual privacy' and 'privacy collapse' early on, including concrete examples of the failure modes observed in the agentic and memory-based tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important areas for strengthening the causal claims and mechanistic evidence. We address each point below and will incorporate revisions to improve the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: The central causal claim—that specific subtle patterns (helpfulness optimization, emotional dialogue, code printing internals, etc.) trigger privacy collapse—lacks isolating ablations. No results are reported that hold dataset size, domain, token distribution, and training hyperparameters fixed while removing only the identified patterns, leaving open the possibility that the observed degradation is a generic side-effect of fine-tuning rather than a unique fragility of privacy representations.

Authors: We agree that isolating ablations would strengthen the causal attribution to specific patterns rather than generic fine-tuning effects. Our current results show consistent privacy collapse across five diverse datasets (real-world and controlled), which provides some evidence against a purely generic effect, but we acknowledge the lack of matched controls. In the revised manuscript, we will add new experiments creating paired datasets that differ only in the presence/absence of the identified patterns (e.g., helpfulness-optimized dialogues vs. neutral equivalents) while holding size, domain, token distribution, and hyperparameters fixed. These will be reported in an expanded Experiments section. revision: yes
Referee: [Mechanistic Analysis] Mechanistic Analysis: The assertion that privacy representations are 'uniquely fragile' relative to task-relevant features requires more precise quantification. The section should report specific metrics (e.g., representation similarity scores, probing accuracies, or layer-wise activation differences) with controls for model capacity and optimization trajectory to substantiate the uniqueness claim.

Authors: We concur that more precise quantification is needed to support the uniqueness claim. The current analysis shows differential degradation but lacks the requested metrics. In revision, we will expand the Mechanistic Analysis section to include cosine similarity scores between pre- and post-fine-tuning representations for privacy-related activations versus task-relevant features, linear probing accuracies for privacy concepts, and layer-wise activation difference analyses. We will incorporate controls by repeating across model sizes (to address capacity) and by tracking metrics at multiple training checkpoints (to address optimization trajectory). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations without derivations or self-referential fitting

full rationale

The paper presents an empirical study of privacy degradation after benign fine-tuning, supported by experiments across six models, five datasets, and two task categories. No equations, derivations, or first-principles predictions appear in the provided text. Claims rest on direct measurements of privacy violations versus maintained benchmark performance, with mechanistic analysis described as observational rather than tautological. No self-citation chains, fitted parameters renamed as predictions, or ansatzes imported via prior work are present. The central finding (privacy collapse as a silent failure) is not equivalent to its inputs by construction and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical experiments across models and datasets; no free parameters, axioms, or invented entities are explicitly stated in the abstract.

pith-pipeline@v0.9.0 · 5480 in / 1066 out tokens · 31042 ms · 2026-05-16T12:06:15.483732+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify a novel phenomenon... benign fine-tuning of frontier models can lead to privacy collapse... mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

privacy representations are located in late layers... cosine similarity of steering vectors

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.