Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Alexander Panchenko; Andrey Savchenko; Anna Borisiuk; Elena Tutubalina

arxiv: 2602.19612 · v5 · pith:5TZU32T3new · submitted 2026-02-23 · 💻 cs.CL

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Anna Borisiuk , Andrey Savchenko , Alexander Panchenko , Elena Tutubalina This is my paper

Pith reviewed 2026-05-15 20:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords machine unlearninglarge language modelssupervised fine-tuningfact salienceknowledge retentionDUET benchmarkWikidata factsforgetting stability

0 comments

The pith

An SFT step on data to forget produces smoother unlearning and 10-50% higher retention than direct unlearning on pretrained models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that unlearning performance in large language models varies sharply depending on whether the target facts were acquired during pretraining or during supervised fine-tuning. It introduces the DUET benchmark of 28.6k Wikidata facts labeled by popularity and salience to compare these regimes. Experiments show that first applying an SFT step to the forget set produces more stable forgetting curves, fewer side effects on other knowledge, and substantially better retention, while direct unlearning on pretrained models tends to relearn the removed facts or suffer broad capability loss. The distinction matters because practical unlearning must work reliably after models have already been fine-tuned for specific tasks.

Core claim

Pretrained models and SFT models respond differently to unlearning. Performing an SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention of non-target knowledge, whereas direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

What carries the argument

The DUET benchmark, which annotates 28.6k Wikidata triplets with Wikipedia link counts for popularity and LLM-based salience scores to measure unlearning outcomes across pretraining and SFT stages.

If this is right

Unlearning methods should incorporate an initial SFT pass on the target forget set to achieve more reliable removal.
Direct application of unlearning to base pretrained models requires extra safeguards against relearning and instability.
Fact popularity and salience influence forgetting success differently depending on the training stage at which unlearning occurs.
Retention of unrelated capabilities improves when unlearning follows rather than precedes supervised fine-tuning on the forget data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Unlearning pipelines could be redesigned to treat SFT on forget data as a standard preparatory stage rather than an optional extra.
Evaluation protocols should routinely test both pretrained and post-SFT versions of the same model to avoid overestimating instability.
Salience scores might be used to decide the order in which facts are processed through staged unlearning.
The findings suggest that continual-learning systems could schedule fine-tuning and unlearning in alternating phases to maintain stability.

Load-bearing premise

Differences in unlearning behavior are caused primarily by the presence or absence of an SFT stage rather than by model scale, exact unlearning algorithm, or how salience scores align with actual memorization.

What would settle it

If the same unlearning method applied to models of different scales shows identical stability and retention patterns regardless of whether an SFT step on the forget set was performed first, the claim that the SFT stage drives the observed advantages would be refuted.

read the original abstract

Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUET (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DUET benchmark and stage-specific unlearning differences are worth a look, but the SFT attribution rests on controls that may not be tight enough.

read the letter

The main thing to know is that this paper introduces the DUET benchmark and reports that unlearning after an SFT step on the forget set produces smoother results and 10-50% better retention than unlearning directly on pretrained models. The pretrained case looks more prone to instability or relearning. That observation is the core empirical point. The new benchmark itself is a 28.6k Wikidata triplet collection scored by Wikipedia link counts for popularity and LLM salience scores. That resource and the explicit pretrained-versus-SFT comparison are the clearest additions relative to prior unlearning work. The experiments give concrete quantitative gaps that could matter for safety applications where you need reliable forgetting. The paper does a reasonable job laying out the practical distinction and showing that generic unlearning assumptions do not hold across training stages. On the soft side, the causal claim that the SFT stage itself drives the stability difference is not fully isolated. Model scale, exact unlearning algorithm details, and how salience correlates with actual memorization could vary between the two arms and explain part of the gap. The abstract gives the headline numbers but leaves error bars, full ablations, and initialization controls unclear, so the 10-50% retention figure needs the methods section to land cleanly. If those controls are present and reported, the result strengthens; if not, the interpretation stays provisional. This is useful for people building or evaluating unlearning pipelines for deployed LLMs, especially anyone who cares about stage-aware methods rather than one-size-fits-all approaches. It is not reshaping theory but it supplies a new testbed and a practical signal. I would send it to peer review. The benchmark and the empirical contrast are substantive enough to justify referee time, even if the paper needs tighter controls and clearer reporting on the causal side.

Referee Report

2 major / 2 minor

Summary. The paper introduces the DUET benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity (Wikipedia link counts) and LLM-based salience scores. It claims that pretrained and SFT models respond differently to unlearning: an SFT step on the forget data produces smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models is unstable and prone to relearning or catastrophic forgetting.

Significance. If the central empirical claim holds after controlling for confounders, the work is significant because it challenges the uniform-forgettability assumption in machine unlearning and supplies a new benchmark that distinguishes pretraining versus SFT origins of knowledge. The quantitative retention gaps and the DUET construction could guide more stage-aware unlearning algorithms for LLMs.

major comments (2)

[Experimental Setup] Experimental Setup: The central claim attributes the 10-50% retention advantage and smoother forgetting to the presence of an SFT stage on forget data. However, the manuscript does not demonstrate that model scale, initialization, or the precise unlearning algorithm are held fixed across the pretrained and SFT conditions. Without these controls, the causal attribution to SFT cannot be isolated from potential confounders.
[Results] Results: The reported retention differences lack error bars, statistical significance tests, or ablation tables showing how salience scores correlate with actual memorization rates. These omissions make it impossible to assess whether the 10-50% figures are robust or sensitive to post-hoc choices in baseline selection.

minor comments (2)

[Abstract] The abstract states a 28.6k-example benchmark but provides no details on train/test splits, annotation validation, or how Wikipedia link counts were normalized.
[Methods] Reproducibility would benefit from explicit pseudocode or hyperparameter tables for the unlearning algorithms applied to each model stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to address the concerns about experimental controls and statistical reporting, as detailed in the point-by-point responses below.

read point-by-point responses

Referee: [Experimental Setup] Experimental Setup: The central claim attributes the 10-50% retention advantage and smoother forgetting to the presence of an SFT stage on forget data. However, the manuscript does not demonstrate that model scale, initialization, or the precise unlearning algorithm are held fixed across the pretrained and SFT conditions. Without these controls, the causal attribution to SFT cannot be isolated from potential confounders.

Authors: We appreciate this observation. In our experiments, model scale (7B parameters), architecture (Llama-2 base), initialization, and the unlearning algorithm (gradient ascent with the same hyperparameters) were held fixed; the sole difference was the additional SFT step on forget data for the SFT condition. To make this explicit and eliminate ambiguity, we have added a dedicated paragraph and Table 2 in Section 3 detailing the fixed parameters across conditions, along with pseudocode confirming identical unlearning procedures. revision: yes
Referee: [Results] Results: The reported retention differences lack error bars, statistical significance tests, or ablation tables showing how salience scores correlate with actual memorization rates. These omissions make it impossible to assess whether the 10-50% figures are robust or sensitive to post-hoc choices in baseline selection.

Authors: We agree that these elements are necessary for assessing robustness. The revised manuscript now includes error bars (standard deviation over 5 random seeds) on all retention plots in Figures 3 and 4. We added paired t-tests confirming statistical significance (p < 0.01) for the reported retention gaps. A new ablation subsection (4.3) and Table 3 present the correlation between salience scores and memorization rates, showing that the 10-50% advantage holds across salience quartiles and is not sensitive to baseline selection. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark study with external measurements

full rationale

The paper introduces the DUET benchmark of Wikidata triplets annotated with Wikipedia link counts and LLM salience scores, then reports measured differences in unlearning behavior between pretrained and SFT models. All central claims rest on direct experimental comparisons against these external data sources rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the work is self-contained against the provided benchmarks and does not reduce its results to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the assumption that the chosen salience metrics and unlearning algorithms are representative, plus standard supervised fine-tuning and gradient-based unlearning procedures. No new physical or mathematical entities are postulated.

axioms (2)

standard math Standard assumptions of gradient descent optimization and supervised fine-tuning on next-token prediction loss
Invoked implicitly when describing SFT and unlearning steps
domain assumption Wikidata triplets and Wikipedia link counts serve as faithful proxies for real-world fact salience and memorization
Used to annotate the 28.6k examples in DUET

invented entities (1)

DUET benchmark no independent evidence
purpose: To evaluate unlearning across training stages with popularity annotations
Newly constructed dataset of 28.6k Wikidata triplets

pith-pipeline@v0.9.0 · 5436 in / 1421 out tokens · 31165 ms · 2026-05-15T20:50:04.687150+00:00 · methodology

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)