
arxiv: 2506.20941 · v3 · submitted 2025-06-26 · 💻 cs.LG

Revisiting the Past: Data Unlearning with Model State History

Pith reviewed 2026-05-19 08:08 UTC · model grok-4.3

classification 💻 cs.LG
keywords machine unlearning · data erasure · large language models · model checkpoints · machine learning · privacy · training history

The pith

Arithmetic on past model checkpoints removes targeted data influences from large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MSA, an algorithm that draws on saved model states from earlier training stages to cancel out the effects of chosen datapoints. This avoids the prohibitive cost of retraining the entire model on cleaned data. A sympathetic reader would care because large models routinely absorb private, copyrighted, or erroneous material that later needs removal. Experiments indicate the approach performs at least as well as prior unlearning techniques and often better across standard benchmarks and metrics.

Core claim

MSA utilizes prior model checkpoints to estimate and counteract the effect of targeted datapoints through arithmetic operations on those states, achieving competitive performance and often outperforming existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics.

What carries the argument

Model State Arithmetic (MSA), which performs linear combinations of historical model checkpoints to isolate and negate the contribution of specific training examples.
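A minimal sketch of what checkpoint arithmetic of this kind could look like. It assumes a task-arithmetic-style rule: a forget vector is taken as the parameter difference between an intermediate checkpoint and a copy of it that has been exposed to the forget set, and that vector is subtracted from the final model with a scaling coefficient. The function names, the toy parameter dictionaries, and the coefficient `alpha` are illustrative assumptions, not the paper's notation or code.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def forget_vector(before: Params, after: Params) -> Params:
    """Parameter delta attributed to the forget set (assumed here: after - before)."""
    return {k: [a - b for a, b in zip(after[k], before[k])] for k in before}


def apply_unlearning(final: Params, vec: Params, alpha: float = 1.0) -> Params:
    """Subtract the scaled forget vector from the final model's parameters."""
    return {k: [w - alpha * v for w, v in zip(final[k], vec[k])] for k in final}


# Hypothetical values, chosen only to make the arithmetic easy to follow.
theta_b = {"w": [1.0, 2.0]}       # intermediate checkpoint
theta_b_ft = {"w": [1.5, 1.0]}    # same checkpoint after exposure to the forget set
theta_d = {"w": [3.0, 4.0]}       # final model

vec = forget_vector(theta_b, theta_b_ft)                # {"w": [0.5, -1.0]}
unlearned = apply_unlearning(theta_d, vec, alpha=1.0)   # {"w": [2.5, 5.0]}
```

Setting `alpha=0.0` returns the final model unchanged, which makes the coefficient a natural knob for trading forgetting strength against side effects in a sketch like this.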

Load-bearing premise

Simple arithmetic on checkpoints recorded at different training stages can isolate the precise influence of individual datapoints without unintended changes to unrelated model behavior.

What would settle it

A direct test would check whether a model after MSA still produces outputs traceable to the unlearned datapoints or exhibits measurable drops on unrelated tasks or general capabilities.
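The two failure modes named above can be phrased as a simple pass/fail check. The sketch below assumes hypothetical scores measured before and after unlearning; the function name, the tolerances, and the score dictionaries are illustrative placeholders, not metrics or thresholds from the paper.

```python
from typing import Dict


def unlearning_check(
    forget_score_after: float,
    forget_score_reference: float,
    utility_before: Dict[str, float],
    utility_after: Dict[str, float],
    forget_tol: float = 0.05,
    utility_tol: float = 0.02,
) -> Dict[str, bool]:
    """Flag residual forget-set influence and drops on unrelated tasks.

    forget_score_reference is the forget-set score of a model assumed never
    to have seen the forget set; higher scores mean more retained knowledge.
    """
    residual = forget_score_after > forget_score_reference + forget_tol
    degraded = any(
        utility_before[task] - utility_after[task] > utility_tol
        for task in utility_before
    )
    return {"residual_influence": residual, "utility_drop": degraded}


# Hypothetical numbers: forget-set score close to the reference, utility stable.
clean = unlearning_check(
    0.12, 0.10, {"qa": 0.70, "math": 0.40}, {"qa": 0.69, "math": 0.40}
)
# Hypothetical numbers: forget-set score still high, and a large drop on "qa".
failed = unlearning_check(0.50, 0.10, {"qa": 0.70}, {"qa": 0.60})
```

A result with both flags false is what the claim needs; either flag being true corresponds to one of the two outcomes that would count against it.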

Figures

Figures reproduced from arXiv: 2506.20941 by Abhilasha Ravichander, Keivan Rezaei, Mehrdad Saberi, Soheil Feizi.

Figure 1
Figure 1: Our proposed framework MSA. Training occurs over several steps, starting from an initial checkpoint (a). At checkpoint (c), the unlearning documents D_f are unintentionally introduced during training. At an intermediate checkpoint (b), we extract a forget vector θ_f that captures how D_f influences the model. This vector is then merged into the final model (d) to produce an unlearned model that approximates …
Figure 2
Figure 2: Performance of MSA when early checkpoints are used to obtain unlearning vectors. The OLMo-2-1B model, trained on 3985B tokens, is finetuned on TOFU, after which 10% of authors are unlearned (task forget10). The x-axis indicates the number of tokens the checkpoint used for unlearning vector extraction had been trained on. While performance slightly drops with earlier checkpoints, vectors derived from checkp…
Figure 3
Figure 3: Evaluation of different MSA settings (denoted by (α, β)), using Llama-3.2-1B-Instruct, Llama-3.2-1B, or the final model trained on TOFU to derive unlearning directions.
D Increasing Model Scale: In this section, we report experimental results of unlearning with MSA compared to NPO on a larger model, Llama-3.1-8B-Instruct. We evaluate both methods on the forget01 and forget10 tasks from TOFU. As shown in [P…
Figure 4
Figure 4: Evaluation of different settings of MSA (denoted by (α, β)), using Llama-3.2-1B-Instruct and Llama-3.2-1B to obtain unlearning directions, compared against NPO with varying numbers of unlearning epochs.
E OLMo-2-1B Checkpoints: We follow the same procedure used for Llama-3.2-1B to run the TOFU benchmark on OLMo-2-1B models. Specifically, we finetune one of the final Stage 1 checkpoints [footnote 4] on TOFU authors …
Original abstract

Large language models are trained on massive corpora of web data, which may include private data, copyrighted material, factually inaccurate data, or data that degrades model performance. Eliminating the influence of such problematic datapoints on a model through complete retraining -- by repeatedly pretraining the model on datasets that exclude these specific instances -- is computationally prohibitive. To address this, unlearning algorithms have been proposed, that aim to eliminate the influence of particular datapoints at a low computational cost, while leaving the rest of the model intact. However, precisely unlearning the influence of data on a large language model has proven to be a major challenge. In this work, we propose a new algorithm, MSA (Model State Arithmetic), for unlearning datapoints in large language models. MSA utilizes prior model checkpoints -- artifacts that record model states at different stages of pretraining -- to estimate and counteract the effect of targeted datapoints. Our experimental results show that MSA achieves competitive performance and often outperforms existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics, suggesting that MSA could be an effective approach towards more flexible large language models that are capable of data erasure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Model State Arithmetic (MSA), a new unlearning algorithm for large language models that performs arithmetic operations on historical model checkpoints saved at different stages of pretraining to estimate and subtract the influence of targeted datapoints. It claims that MSA achieves competitive performance and often outperforms existing machine unlearning methods across multiple benchmarks, models, and evaluation metrics while remaining computationally efficient.

Significance. If the central claims hold after addressing the noted issues, the work could be significant for offering a practical unlearning approach that exploits readily available training artifacts rather than requiring gradients or full retraining, potentially advancing data erasure techniques for privacy, copyright, and performance concerns in LLMs.

major comments (2)
  1. [Method (MSA definition and derivation)] The core assumption underlying MSA—that linear or simple arithmetic combinations of prior checkpoints can isolate the contribution of individual datapoints—is load-bearing for the central claim but rests on an unverified separability premise. Given the path-dependent and non-linear nature of SGD trajectories, differences between checkpoints at stages t and t+k do not cleanly disentangle the effect of any single example from entangled batch, layer, and optimization influences; explicit controls demonstrating unchanged performance on unrelated capabilities and data distributions are needed to support the outperformance claims.
  2. [Abstract and Experiments section] Abstract and experimental results: The claim that MSA 'achieves competitive performance and often outperforms existing machine unlearning algorithms' is presented without details on exact metrics, chosen baselines, statistical significance testing, or safeguards against post-hoc baseline or hyperparameter selection. This directly affects verifiability of the performance advantage and must be addressed with full tables, held-out evaluation protocols, and ablation studies.
minor comments (2)
  1. [Method] Notation for the arithmetic operations in MSA should be defined more explicitly (e.g., the precise form of the linear combination or subtraction) to allow reproducibility.
  2. [Abstract] The abstract would benefit from a one-sentence description of the specific benchmarks and models used to ground the performance claims.
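Major comment 2 asks for statistical significance testing of the claimed advantage. One common way to report it is a paired bootstrap over per-example scores; the sketch below assumes both methods are scored on the same evaluation items, with higher scores being better. The function name, the resample count, and the example scores are hypothetical and not taken from the paper.

```python
import random
from typing import Sequence


def paired_bootstrap_win_rate(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples in which method A's mean beats method B's."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and paired item by item")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample paired differences with replacement.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) > 0:
            wins += 1
    return wins / n_resamples


# Hypothetical per-item scores: A is better on every item, so every resample wins.
rate = paired_bootstrap_win_rate([0.9, 0.8, 0.7], [0.5, 0.6, 0.4])  # 1.0
```

A win rate near 1.0 suggests the advantage is stable under resampling, while a rate near 0.5 suggests the two methods are hard to distinguish on that evaluation set.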

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper to improve its rigor, clarity, and verifiability.

Point-by-point responses
  1. Referee: [Method (MSA definition and derivation)] The core assumption underlying MSA—that linear or simple arithmetic combinations of prior checkpoints can isolate the contribution of individual datapoints—is load-bearing for the central claim but rests on an unverified separability premise. Given the path-dependent and non-linear nature of SGD trajectories, differences between checkpoints at stages t and t+k do not cleanly disentangle the effect of any single example from entangled batch, layer, and optimization influences; explicit controls demonstrating unchanged performance on unrelated capabilities and data distributions are needed to support the outperformance claims.

    Authors: We acknowledge that SGD trajectories are path-dependent and non-linear, which complicates clean isolation of individual datapoint effects. Our approach is an empirical approximation that leverages readily available checkpoints rather than claiming exact separability. In the revised manuscript, we have added explicit controls evaluating performance on unrelated tasks and data distributions after applying MSA, demonstrating that non-targeted capabilities remain largely unaffected. We have also expanded the discussion of the method's assumptions and limitations to better contextualize the approximation.
    Revision: yes

  2. Referee: [Abstract and Experiments section] Abstract and experimental results: The claim that MSA 'achieves competitive performance and often outperforms existing machine unlearning algorithms' is presented without details on exact metrics, chosen baselines, statistical significance testing, or safeguards against post-hoc baseline or hyperparameter selection. This directly affects verifiability of the performance advantage and must be addressed with full tables, held-out evaluation protocols, and ablation studies.

    Authors: We agree that additional transparency is essential for verifying the performance claims. The revised Experiments section now includes full tables with all metrics, a complete list of baselines with references, results from statistical significance testing, and explicit descriptions of held-out evaluation protocols. Ablation studies on checkpoint frequency and arithmetic variants have been added. We clarify that baselines and hyperparameters were selected based on prior literature and fixed in advance, with no post-hoc adjustments. The abstract has been lightly updated to direct readers to these detailed results in the main text.
    Revision: yes

Circularity Check

0 steps flagged

MSA unlearning method is a heuristic arithmetic on checkpoints with empirical validation; no derivation reduces to its inputs by construction.

full rationale

The paper proposes MSA as a practical algorithm that performs arithmetic on prior model checkpoints to counteract targeted datapoint effects. Performance claims rest on experimental comparisons across benchmarks, models, and metrics rather than any closed-form derivation or theorem. No load-bearing step equates a prediction to a fitted parameter or self-referential definition, and the method does not invoke self-citations to establish uniqueness or force its form. The approach remains self-contained as an empirical proposal without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on abstract only; the method appears to rest on the assumption that checkpoint differences encode removable per-datapoint effects. No explicit free parameters or axioms are described, and the only invented entity is the MSA procedure itself.

invented entities (1)
  • Model State Arithmetic (MSA) · no independent evidence
    purpose: Estimate and counteract influence of targeted datapoints using prior checkpoints
    Newly proposed procedure described in the abstract

pith-pipeline@v0.9.0 · 5742 in / 1177 out tokens · 32783 ms · 2026-05-19T08:08:59.016964+00:00 · methodology



Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    To each (textual sequence) its own: Improving memorized-data unlearning in large language models

    George-Octavian Barbulescu and Peter Triantafillou. To each (textual sequence) its own: Improving memorized-data unlearning in large language models. arXiv preprint arXiv:2405.03097, 2024

  3. [3]

    Leace: Perfect linear concept erasure in closed form

    Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. Leace: Perfect linear concept erasure in closed form. Advances in Neural Information Processing Systems, 36:66044–66063, 2023

  4. [4]

    Digital forgetting in large language models: A survey of unlearning methods

    Alberto Blanco-Justicia, Najeeb Jebreel, Benet Manzanares, David Sánchez, Josep Domingo-Ferrer, Guillem Collell, and Kuan Eeik Tan. Digital forgetting in large language models: A survey of unlearning methods. arXiv preprint arXiv:2404.02062, 2024

  5. [5]

    Machine unlearning

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE symposium on security and privacy (SP), pages 141–159. IEEE, 2021

  6. [6]

    The right to be forgotten and the informational autonomy in the digital environment

    De Terwangne C. The right to be forgotten and the informational autonomy in the digital environment. Scientific analysis or review LB-NA-26434-EN-N, Luxembourg (Luxembourg), 2013

  7. [7]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021

  8. [8]

    Unlearn what you want to forget: Efficient unlearning for llms

    Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for llms. arXiv preprint arXiv:2310.20150, 2023

  9. [9]

    Towards scalable exact machine unlearning using parameter-efficient fine-tuning

    Somnath Basu Roy Chowdhury, Krzysztof Choromanski, Arijit Sehanobish, Avinava Dubey, and Snigdha Chaturvedi. Towards scalable exact machine unlearning using parameter-efficient fine-tuning. arXiv preprint arXiv:2406.16257, 2024

  10. [10]

    OpenUnlearning: A unified framework for llm unlearning benchmarks

    Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, J Zico Kolter, and Pratyush Maini. OpenUnlearning: A unified framework for llm unlearning benchmarks. https://github.com/locuslab/open-unlearning, 2025. Accessed: February 27, 2025

  11. [11]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  12. [12]

    Who’s Harry Potter? Approximate Unlearning in LLMs, October 2023

    Ronen Eldan and Mark Russinovich. Who’s Harry Potter? Approximate Unlearning in LLMs, October 2023. URL http://arxiv.org/abs/2310.02238. arXiv:2310.02238 [cs]

  13. [13]

    Simplicity prevails: Rethinking negative preference optimization for llm unlearning

    Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for llm unlearning. arXiv preprint arXiv:2410.07163, 2024

  14. [14]

    Erasing concepts from diffusion models

    Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2426–2436, 2023

  15. [15]

    Eternal sunshine of the spotless net: Selective forgetting in deep networks

    Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9304–9312, 2020

  16. [16]

    Time travel in llms: Tracing data contamination in large language models, 2024

    Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models, 2024. URL https://arxiv.org/abs/2308.08493

  17. [17]

    OLMo: Accelerating the Science of Language Models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024

  18. [18]

    Intrinsic evaluation of unlearning using parametric knowledge traces

    Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, and Mor Geva. Intrinsic evaluation of unlearning using parametric knowledge traces. arXiv preprint arXiv:2406.11614, 2024

  19. [19]

    Are large pre-trained language models leaking your personal information?

    Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained language models leaking your personal information? arXiv preprint arXiv:2205.12628, 2022

  20. [20]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022

  21. [21]

    Knowledge unlearning for mitigating privacy risks in language models

    Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022

  22. [22]

    Soul: Unlocking the power of second-order optimization for llm unlearning

    Jinghan Jia, Yihua Zhang, Yimeng Zhang, Jiancheng Liu, Bharat Runwal, James Diffenderfer, Bhavya Kailkhura, and Sijia Liu. Soul: Unlocking the power of second-order optimization for llm unlearning. arXiv preprint arXiv:2404.18239, 2024

  23. [23]

    Rwku: Benchmarking real-world knowledge unlearning for large language models

    Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, and Jun Zhao. Rwku: Benchmarking real-world knowledge unlearning for large language models. arXiv preprint arXiv:2406.10890, 2024

  24. [24]

    Split, unlearn, merge: Leveraging data attributes for more effective unlearning in llms

    S. Kadhe, Farhan Ahmed, Dennis Wei, Nathalie Baracaldo, and Inkit Padhi. Split, unlearn, merge: Leveraging data attributes for more effective unlearning in llms. ArXiv, abs/2406.11780. URL https://api.semanticscholar.org/CorpusId:270559985

  26. [26]

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024

  27. [27]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms. arXiv preprint arXiv:2401.06121, 2024

  28. [28]

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, Willia...

  29. [29]

    Privacy risks of general-purpose language models

    Xudong Pan, Mi Zhang, Shouling Ji, and Min Yang. Privacy risks of general-purpose language models. 2020 IEEE Symposium on Security and Privacy (SP), pages 1314–1331, 2020. URL https://api.semanticscholar.org/CorpusID:220938739

  30. [30]

    The frontier of data erasure: Machine unlearning for large language models

    Youyang Qu, Ming Ding, Nan Sun, Kanchana Thilakarathna, Tianqing Zhu, and Dusit Niyato. The frontier of data erasure: Machine unlearning for large language models. arXiv preprint arXiv:2403.15779, 2024

  31. [31]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  32. [32]

    Restor: Knowledge recovery through machine unlearning

    Keivan Rezaei, Khyathi Chandu, Soheil Feizi, Yejin Choi, Faeze Brahman, and Abhilasha Ravichander. Restor: Knowledge recovery through machine unlearning. arXiv preprint arXiv:2411.00204, 2024

  33. [33]

    Muse: Machine unlearning six-way evaluation for language models

    Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A Smith, and Chiyuan Zhang. Muse: Machine unlearning six-way evaluation for language models. arXiv preprint arXiv:2407.06460, 2024

  34. [34]

    Erasing without remembering: Safeguarding knowledge forgetting in large language models, 2025

    Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, and Dacheng Tao. Erasing without remembering: Safeguarding knowledge forgetting in large language models, 2025. URL https://arxiv.org/abs/2502.19982

  35. [35]

    Jailbroken: How does llm safety training fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024

  36. [36]

    Large language model unlearning

    Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. arXiv preprint arXiv:2310.10683, 2023

  37. [37]

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

    Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868, 2024.

A GPT-4o for TOFU Metrics: We utilize GPT-4o to evaluate model outputs by identifying which candidate response is most semantically similar to the model’s output. Candidates include t...