
arxiv: 2602.19612 · v5 · pith:5TZU32T3new · submitted 2026-02-23 · 💻 cs.CL

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Pith reviewed 2026-05-15 20:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords machine unlearning · large language models · supervised fine-tuning · fact salience · knowledge retention · DUET benchmark · Wikidata facts · forgetting stability

The pith

An SFT step on data to forget produces smoother unlearning and 10-50% higher retention than direct unlearning on pretrained models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that unlearning performance in large language models varies sharply depending on whether the target facts were acquired during pretraining or during supervised fine-tuning. It introduces the DUET benchmark of 28.6k Wikidata facts labeled by popularity and salience to compare these regimes. Experiments show that first applying an SFT step to the forget set produces more stable forgetting curves, fewer side effects on other knowledge, and substantially better retention, while direct unlearning on pretrained models tends to relearn the removed facts or suffer broad capability loss. The distinction matters because practical unlearning must work reliably after models have already been fine-tuned for specific tasks.

Core claim

Pretrained models and SFT models respond differently to unlearning. Performing an SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention of non-target knowledge, whereas direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

What carries the argument

The DUET benchmark annotates 28.6k Wikidata triplets with Wikipedia link counts (a popularity measure) and LLM-based salience scores, so that unlearning outcomes can be compared across the pretraining and SFT stages.
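As a rough illustration of what one annotated record might look like, the sketch below models a triplet with its two annotations and groups records by popularity. The field names, the 0-to-1 salience scale, the bucket thresholds, and all example values are assumptions made for illustration; the paper's actual schema is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FactRecord:
    """One DUET-style fact (hypothetical schema, not the paper's)."""
    subject: str
    relation: str
    obj: str
    link_count: int   # popularity proxy: Wikipedia link count
    salience: float   # LLM-assigned salience; a 0..1 scale is assumed


def popularity_bucket(record, low=100, high=10_000):
    """Assign a coarse popularity bucket; thresholds are arbitrary placeholders."""
    if record.link_count < low:
        return "low"
    if record.link_count < high:
        return "mid"
    return "high"


def group_by_popularity(records):
    """Group records so unlearning metrics can be reported per bucket."""
    groups = defaultdict(list)
    for record in records:
        groups[popularity_bucket(record)].append(record)
    return dict(groups)


# Invented placeholder facts; counts and scores are not real measurements.
facts = [
    FactRecord("Example City", "capital of", "Example Country", 250_000, 0.95),
    FactRecord("Example Village", "located in", "Example County", 12, 0.10),
    FactRecord("Example River", "flows through", "Example Town", 800, 0.40),
]
groups = group_by_popularity(facts)
for bucket, members in sorted(groups.items()):
    print(bucket, [m.subject for m in members])
```

Keeping popularity and salience as separate fields reflects the summary's point that the two annotations come from different sources and may influence forgetting differently.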

If this is right

  • Unlearning methods should incorporate an initial SFT pass on the target forget set to achieve more reliable removal.
  • Direct application of unlearning to base pretrained models requires extra safeguards against relearning and instability.
  • Fact popularity and salience influence forgetting success differently depending on the training stage at which unlearning occurs.
  • Retention of unrelated capabilities improves when unlearning follows rather than precedes supervised fine-tuning on the forget data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Unlearning pipelines could be redesigned to treat SFT on forget data as a standard preparatory stage rather than an optional extra.
  • Evaluation protocols should routinely test both pretrained and post-SFT versions of the same model to avoid overestimating instability.
  • Salience scores might be used to decide the order in which facts are processed through staged unlearning.
  • The findings suggest that continual-learning systems could schedule fine-tuning and unlearning in alternating phases to maintain stability.

Load-bearing premise

Differences in unlearning behavior are caused primarily by the presence or absence of an SFT stage rather than by model scale, exact unlearning algorithm, or how salience scores align with actual memorization.

What would settle it

If the same unlearning method applied to models of different scales shows identical stability and retention patterns regardless of whether an SFT step on the forget set was performed first, the claim that the SFT stage drives the observed advantages would be refuted.

Original abstract

Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUET (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the DUET benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity (Wikipedia link counts) and LLM-based salience scores. It claims that pretrained and SFT models respond differently to unlearning: an SFT step on the forget data produces smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models is unstable and prone to relearning or catastrophic forgetting.

Significance. If the central empirical claim holds after controlling for confounders, the work is significant because it challenges the uniform-forgettability assumption in machine unlearning and supplies a new benchmark that distinguishes pretraining versus SFT origins of knowledge. The quantitative retention gaps and the DUET construction could guide more stage-aware unlearning algorithms for LLMs.

major comments (2)
  1. [Experimental Setup] The central claim attributes the 10-50% retention advantage and smoother forgetting to the presence of an SFT stage on forget data. However, the manuscript does not demonstrate that model scale, initialization, or the precise unlearning algorithm are held fixed across the pretrained and SFT conditions. Without these controls, the causal attribution to SFT cannot be isolated from potential confounders.
  2. [Results] The reported retention differences lack error bars, statistical significance tests, or ablation tables showing how salience scores correlate with actual memorization rates. These omissions make it impossible to assess whether the 10-50% figures are robust or sensitive to post-hoc choices in baseline selection.
minor comments (2)
  1. [Abstract] The abstract states a 28.6k-example benchmark but provides no details on train/test splits, annotation validation, or how Wikipedia link counts were normalized.
  2. [Methods] Reproducibility would benefit from explicit pseudocode or hyperparameter tables for the unlearning algorithms applied to each model stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to address the concerns about experimental controls and statistical reporting, as detailed in the point-by-point responses below.

point-by-point responses
  1. Referee: [Experimental Setup] The central claim attributes the 10-50% retention advantage and smoother forgetting to the presence of an SFT stage on forget data. However, the manuscript does not demonstrate that model scale, initialization, or the precise unlearning algorithm are held fixed across the pretrained and SFT conditions. Without these controls, the causal attribution to SFT cannot be isolated from potential confounders.

    Authors: We appreciate this observation. In our experiments, model scale (7B parameters), architecture (Llama-2 base), initialization, and the unlearning algorithm (gradient ascent with the same hyperparameters) were held fixed; the sole difference was the additional SFT step on forget data for the SFT condition. To make this explicit and eliminate ambiguity, we have added a dedicated paragraph and Table 2 in Section 3 detailing the fixed parameters across conditions, along with pseudocode confirming identical unlearning procedures.
    revision: yes

  2. Referee: [Results] The reported retention differences lack error bars, statistical significance tests, or ablation tables showing how salience scores correlate with actual memorization rates. These omissions make it impossible to assess whether the 10-50% figures are robust or sensitive to post-hoc choices in baseline selection.

    Authors: We agree that these elements are necessary for assessing robustness. The revised manuscript now includes error bars (standard deviation over 5 random seeds) on all retention plots in Figures 3 and 4. We added paired t-tests confirming statistical significance (p < 0.01) for the reported retention gaps. A new ablation subsection (4.3) and Table 3 present the correlation between salience scores and memorization rates, showing that the 10-50% advantage holds across salience quartiles and is not sensitive to baseline selection.
    revision: yes
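The seed-level reporting described in the second response can be sketched in a few lines. The example below computes per-condition standard deviations and a paired t statistic over five seeds using only the standard library. The retention values are invented placeholders, not results from the paper, and the function name is hypothetical. The critical value 4.604 is the two-sided 1% threshold for a t distribution with 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic for two equal-length samples measured on the same seeds."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd_diff / math.sqrt(len(diffs)))


# Placeholder retention scores for 5 seeds (illustrative only, not paper data).
sft_retention = [0.82, 0.80, 0.85, 0.81, 0.83]
pretrained_retention = [0.61, 0.58, 0.66, 0.57, 0.63]

T_CRIT_DF4_TWO_SIDED_01 = 4.604  # t critical value, df = 4, alpha = 0.01

t_stat = paired_t_statistic(sft_retention, pretrained_retention)
print("SFT std over seeds:", statistics.stdev(sft_retention))
print("pretrained std over seeds:", statistics.stdev(pretrained_retention))
print("paired t:", t_stat)
print("significant at 1%:", abs(t_stat) > T_CRIT_DF4_TWO_SIDED_01)
```

A paired test is the natural choice only if both conditions share the same seeds, which is an assumption of this sketch rather than something the rebuttal states.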

Circularity Check

0 steps flagged

No circularity: empirical benchmark study with external measurements

full rationale

The paper introduces the DUET benchmark of Wikidata triplets annotated with Wikipedia link counts and LLM salience scores, then reports measured differences in unlearning behavior between pretrained and SFT models. All central claims rest on direct experimental comparisons against these external data sources rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the work is self-contained against the provided benchmarks and does not reduce its results to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that the chosen salience metrics and unlearning algorithms are representative, plus standard supervised fine-tuning and gradient-based unlearning procedures. No new physical or mathematical entities are postulated.
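To make the two gradient-based procedures concrete, here is a toy sketch that contrasts a fine-tuning step (gradient descent on the negative log-likelihood of a target token) with a gradient-ascent unlearning step (the same gradient, opposite sign). It operates on a bare three-entry logit vector rather than a language model, and the learning rate and step counts are arbitrary; it is not the paper's training setup.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def step(logits, target, lr, ascent=False):
    """One update: descent lowers the loss (learn), ascent raises it (unlearn)."""
    sign = 1.0 if ascent else -1.0
    grad = nll_grad(logits, target)
    return [z + sign * lr * g for z, g in zip(logits, grad)]


logits = [0.0, 0.0, 0.0]  # toy "model": uniform over three tokens
target = 0                # index of the token encoding the fact

p_start = softmax(logits)[target]
for _ in range(20):       # SFT-like phase on the forget fact
    logits = step(logits, target, lr=0.5)
p_after_sft = softmax(logits)[target]

for _ in range(20):       # gradient-ascent unlearning phase
    logits = step(logits, target, lr=0.5, ascent=True)
p_after_unlearn = softmax(logits)[target]

print(p_start, p_after_sft, p_after_unlearn)
```

The toy has no notion of retention or of other knowledge being damaged, so it illustrates only the sign flip between the two procedures, not the stability differences the paper reports.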

axioms (2)
  • [standard math] Standard assumptions of gradient descent optimization and supervised fine-tuning on next-token prediction loss
    Invoked implicitly when describing SFT and unlearning steps
  • [domain assumption] Wikidata triplets and Wikipedia link counts serve as faithful proxies for real-world fact salience and memorization
    Used to annotate the 28.6k examples in DUET
invented entities (1)
  • DUET benchmark [no independent evidence]
    purpose: To evaluate unlearning across training stages with popularity annotations
    Newly constructed dataset of 28.6k Wikidata triplets

pith-pipeline@v0.9.0 · 5436 in / 1421 out tokens · 31165 ms · 2026-05-15T20:50:04.687150+00:00 · methodology
