pith. sign in

arxiv: 2601.15220 · v2 · submitted 2026-01-21 · 💻 cs.CL

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Pith reviewed 2026-05-16 12:06 UTC · model grok-4.3

classification 💻 cs.CL
keywords privacy collapsefine-tuningcontextual privacylanguage modelssafety evaluationsmechanistic analysisagentic tasksmemory boundaries
0
0 comments X

The pith

Benign fine-tuning of language models leads to privacy collapse, breaking contextual privacy while benchmarks stay intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fine-tuning frontier language models on ordinary, helpful data can trigger privacy collapse. Models lose the capacity to respect contextual privacy norms, inappropriately sharing user information or crossing memory boundaries between separate contexts. This degradation appears across closed and open models, real-world and controlled datasets, and both agentic and memory-based tasks. The failure stays silent because standard safety and utility benchmarks continue to show high performance. Mechanistic probes indicate that internal privacy representations degrade far more readily than task-specific features during the same fine-tuning process.

Core claim

Benign fine-tuning of frontier models can lead to privacy collapse. Diverse, subtle patterns in training data degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional dialogue, and debugging code that prints internal variables. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a silent failure because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Experiments demonstrate the effect across six models, five fine-tuning datasets

What carries the argument

privacy representations, which are uniquely fragile to fine-tuning compared to task-relevant features that remain preserved

If this is right

  • Specialised agents produced by routine fine-tuning carry hidden privacy risks even when they pass existing safety benchmarks.
  • Current safety evaluations miss privacy collapse because they do not test contextual reasoning about information boundaries.
  • Fine-tuned models can leak private details across contexts while retaining high scores on utility and general safety tests.
  • Both agentic tool-use and memory-based tasks exhibit the same privacy degradation after benign fine-tuning.
  • Privacy representations degrade faster than task features, explaining why the failure remains undetected by standard metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fragility may affect other forms of contextual reasoning beyond privacy, such as consistency or truthfulness across sessions.
  • Including explicit privacy-preserving examples during fine-tuning could counteract the collapse.
  • Open-weight models enable direct inspection of how privacy features shift in activation space during fine-tuning.
  • Deployment pipelines for custom agents should add targeted privacy boundary tests before release.

Load-bearing premise

The observed privacy degradation is caused by subtle patterns in the fine-tuning data rather than other uncontrolled factors in the training process or model architecture.

What would settle it

A controlled fine-tuning run on data stripped of all identified subtle patterns (helpfulness optimisation, user information exposure, emotional dialogue, and internal-variable printing) that nevertheless shows no measurable drop in contextual privacy performance.

read the original abstract

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that benign fine-tuning of frontier language models on diverse datasets can induce 'privacy collapse,' where models lose the ability to reason about contextual privacy norms, inappropriately share information with tools, and violate memory boundaries, while retaining high performance on standard safety and utility benchmarks. This 'silent failure' is demonstrated empirically across six models (closed and open-weight), five fine-tuning datasets (real-world and controlled), and two task categories (agentic and memory-based), with mechanistic analysis indicating that privacy representations are uniquely fragile to fine-tuning compared to task-relevant features.

Significance. If the central empirical findings hold after addressing controls, the work identifies a previously under-appreciated risk in deploying specialized fine-tuned agents: privacy degradation can occur without triggering existing safety benchmarks. This would motivate new evaluation protocols focused on contextual privacy and could influence fine-tuning practices for models handling user data or tool use.

major comments (2)
  1. [Experiments] Experiments section: The central causal claim—that specific subtle patterns (helpfulness optimization, emotional dialogue, code printing internals, etc.) trigger privacy collapse—lacks isolating ablations. No results are reported that hold dataset size, domain, token distribution, and training hyperparameters fixed while removing only the identified patterns, leaving open the possibility that the observed degradation is a generic side-effect of fine-tuning rather than a unique fragility of privacy representations.
  2. [Mechanistic Analysis] Mechanistic Analysis: The assertion that privacy representations are 'uniquely fragile' relative to task-relevant features requires more precise quantification. The section should report specific metrics (e.g., representation similarity scores, probing accuracies, or layer-wise activation differences) with controls for model capacity and optimization trajectory to substantiate the uniqueness claim.
minor comments (1)
  1. [Introduction] The abstract and introduction would benefit from explicit definitions of 'contextual privacy' and 'privacy collapse' early on, including concrete examples of the failure modes observed in the agentic and memory-based tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important areas for strengthening the causal claims and mechanistic evidence. We address each point below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The central causal claim—that specific subtle patterns (helpfulness optimization, emotional dialogue, code printing internals, etc.) trigger privacy collapse—lacks isolating ablations. No results are reported that hold dataset size, domain, token distribution, and training hyperparameters fixed while removing only the identified patterns, leaving open the possibility that the observed degradation is a generic side-effect of fine-tuning rather than a unique fragility of privacy representations.

    Authors: We agree that isolating ablations would strengthen the causal attribution to specific patterns rather than generic fine-tuning effects. Our current results show consistent privacy collapse across five diverse datasets (real-world and controlled), which provides some evidence against a purely generic effect, but we acknowledge the lack of matched controls. In the revised manuscript, we will add new experiments creating paired datasets that differ only in the presence/absence of the identified patterns (e.g., helpfulness-optimized dialogues vs. neutral equivalents) while holding size, domain, token distribution, and hyperparameters fixed. These will be reported in an expanded Experiments section. revision: yes

  2. Referee: [Mechanistic Analysis] Mechanistic Analysis: The assertion that privacy representations are 'uniquely fragile' relative to task-relevant features requires more precise quantification. The section should report specific metrics (e.g., representation similarity scores, probing accuracies, or layer-wise activation differences) with controls for model capacity and optimization trajectory to substantiate the uniqueness claim.

    Authors: We concur that more precise quantification is needed to support the uniqueness claim. The current analysis shows differential degradation but lacks the requested metrics. In revision, we will expand the Mechanistic Analysis section to include cosine similarity scores between pre- and post-fine-tuning representations for privacy-related activations versus task-relevant features, linear probing accuracies for privacy concepts, and layer-wise activation difference analyses. We will incorporate controls by repeating across model sizes (to address capacity) and by tracking metrics at multiple training checkpoints (to address optimization trajectory). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations without derivations or self-referential fitting

full rationale

The paper presents an empirical study of privacy degradation after benign fine-tuning, supported by experiments across six models, five datasets, and two task categories. No equations, derivations, or first-principles predictions appear in the provided text. Claims rest on direct measurements of privacy violations versus maintained benchmark performance, with mechanistic analysis described as observational rather than tautological. No self-citation chains, fitted parameters renamed as predictions, or ansatzes imported via prior work are present. The central finding (privacy collapse as a silent failure) is not equivalent to its inputs by construction and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical experiments across models and datasets; no free parameters, axioms, or invented entities are explicitly stated in the abstract.

pith-pipeline@v0.9.0 · 5480 in / 1066 out tokens · 31042 ms · 2026-05-16T12:06:15.483732+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.