Pith · machine review for the scientific record

arXiv:2604.05669 · v1 · submitted 2026-04-07 · 📊 stat.ML · cs.LG

Recognition: 2 Lean theorem links

Efficient machine unlearning with minimax optimality

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:18 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords machine unlearning · minimax optimality · squared loss · Unlearning Least Squares · data removal · statistical estimation · inference procedures · efficient algorithms

The pith

Unlearning Least Squares achieves minimax optimality for estimating the remaining-data model parameter under squared loss, using only the pre-trained estimator, the forget samples, and a small subsample of the retained data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a statistical framework for machine unlearning that applies to generic loss functions and supplies theoretical guarantees for removing the influence of designated data subsets. For squared loss it introduces Unlearning Least Squares, which is proven minimax optimal for recovering the model parameter that would have been obtained on the remaining data alone. This matters because regulations and bias-correction needs often require deleting specific records without repeating full training from scratch. The method needs only the original trained model, the forget set, and a modest subsample of the retained data. Its estimation error splits into an ordinary oracle term plus an unlearning cost controlled by the forget-set size and any bias in the model fitted to the forgotten points.

Core claim

For squared loss, Unlearning Least Squares (ULS) is minimax optimal for estimating the model parameter of the remaining data when only the pre-trained estimator, forget samples, and a small subsample of the remaining data are available. The estimation error decomposes into an oracle term and an unlearning cost determined by the forget proportion and the forget model bias. Asymptotically valid inference procedures are established without requiring full retraining.
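Schematically, and with symbols that are illustrative rather than the paper's exact statement (n total samples, m forget samples, dimension d, noise level σ), the decomposition has the shape:

```latex
\underbrace{\bigl\|\hat\beta_{\mathrm{ULS}} - \beta^{\star}_{\mathrm{rem}}\bigr\|_2}_{\text{estimation error}}
\;\lesssim\;
\underbrace{\sigma\sqrt{\tfrac{d}{n-m}}}_{\text{oracle term}}
\;+\;
\underbrace{\tfrac{m}{n}\cdot \mathrm{bias}\bigl(\hat\beta_{\mathrm{forget}}\bigr)}_{\text{unlearning cost}}
```

The oracle term is the error that retraining on the remaining n − m points would incur; the unlearning cost vanishes as the forget proportion m/n or the forget-model bias shrinks.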

What carries the argument

Unlearning Least Squares (ULS), a procedure that adjusts the pre-trained estimator with forget-set information and a subsample of retained data to recover the optimal parameter estimate for the remaining data.
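The paper's exact ULS construction is not reproduced in this review, but the flavor of such a correction can be sketched for linear regression with a Newton-style (influence-function) update. Everything below, including the data-generating process, the subsample size, and the one-step form, is an illustrative assumption rather than the authors' algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear model y = X @ beta + noise; all sizes are illustrative.
n, d, m, s = 2000, 5, 100, 200        # full size, dim, forget size, subsample size
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Pre-trained least-squares estimator on the full data.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Data access: the forget set plus a small subsample of the remaining data.
Xf, yf = X[:m], y[:m]
sub = rng.choice(np.arange(m, n), size=s, replace=False)
H_hat = X[sub].T @ X[sub] / s          # Hessian estimate from the subsample only

# One Newton-style step: add back the forget set's aggregate gradient
# influence.  At beta_full the full-data gradient is zero, so the
# remaining-data gradient equals minus the forget-set gradient.
grad_f = Xf.T @ (Xf @ beta_full - yf)
beta_unlearn = beta_full + np.linalg.solve((n - m) * H_hat, grad_f)

# Oracle comparison: retrain from scratch on the remaining data.
beta_retrain, *_ = np.linalg.lstsq(X[m:], y[m:], rcond=None)
print(np.linalg.norm(beta_unlearn - beta_retrain))   # small vs. parameter scale
```

The only statistical price of the limited data access in this sketch is the subsampled Hessian estimate, which mirrors the review's point that the extra error should be controlled by the forget proportion and the quality of the correction.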

Load-bearing premise

The minimax optimality result requires squared loss together with access to a small subsample of the remaining data in addition to the pre-trained model and forget set.

What would settle it

If the estimation error of ULS exceeds the derived minimax lower bound by more than a constant factor in experiments with squared loss, or if it fails to approach full-retraining performance when the forget proportion and bias are varied, the optimality claim would be refuted.
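One concrete way to run that check: for squared loss, a one-step correction built with the exact remaining-data Hessian reproduces full retraining exactly, so any gap to retraining in the limited-access regime must come from estimating that Hessian from the small subsample. The simulation below (all settings illustrative, not the paper's) verifies the exact-Hessian baseline that such experiments would use as a yardstick:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4000, 5
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=0.5, size=n)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

for m in (100, 400, 1600):                       # growing forget-set sizes
    Xf, yf = X[:m], y[:m]
    Xr, yr = X[m:], y[m:]
    # One-step correction with the exact remaining-data Hessian; for
    # quadratic loss this Newton step lands exactly on the retrained fit.
    grad_f = Xf.T @ (Xf @ beta_full - yf)
    beta_u = beta_full + np.linalg.solve(Xr.T @ Xr, grad_f)
    beta_r, *_ = np.linalg.lstsq(Xr, yr, rcond=None)
    print(m, np.linalg.norm(beta_u - beta_r))    # near machine precision
```

Against this exact baseline, an ULS-style estimator with subsample-only access should show an excess error that tracks the derived unlearning cost; an excess growing faster than the bound as m/n varies would be evidence against optimality.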

Figures

Figures reproduced from arXiv:2604.05669 by Jingyi Xie, Linjun Zhang, Sai Li. Images are not reproduced here; captions (truncated in the source) follow.

Figure 1. Boxplots of the estimation errors for five unlearning…
Figure 2. Boxplots of the estimation errors for five unlearning…
Figure 3. Boxplots of mean prediction errors for the Yelp review…
Figure 4. Distribution of the response variable y within the full dataset D. The left panel shows the distribution of the raw response, which is severely right-skewed. The right panel displays the transformed response log10(y), illustrating a significantly reduced scale.
Figure 5. Boxplots of mean prediction errors for the UK Biobank…
Figure 6. Boxplots of the estimation errors for three unlearning…
Figure 7. Boxplots of the estimation errors for three unlearning…
Figure 8. Boxplots of mean prediction errors for the Yelp review…
Figure 9. Boxplots of mean prediction errors for the Yelp review…
Figure 10. Boxplots of mean prediction errors for the UK Biobank…
Figure 11. Boxplots of mean prediction errors for the UK Biobank…
Original abstract

There is a growing demand for efficient data removal to comply with regulations like the GDPR and to mitigate the influence of biased or corrupted data. This has motivated the field of machine unlearning, which aims to eliminate the influence of specific data subsets without the cost of full retraining. In this work, we propose a statistical framework for machine unlearning with generic loss functions and establish theoretical guarantees. For squared loss, especially, we develop Unlearning Least Squares (ULS) and establish its minimax optimality for estimating the model parameter of remaining data when only the pre-trained estimator, forget samples, and a small subsample of the remaining data are available. Our results reveal that the estimation error decomposes into an oracle term and an unlearning cost determined by the forget proportion and the forget model bias. We further establish asymptotically valid inference procedures without requiring full retraining. Numerical experiments and real-data applications demonstrate that the proposed method achieves performance close to retraining while requiring substantially less data access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript develops a statistical framework for machine unlearning applicable to generic loss functions. For squared loss it introduces Unlearning Least Squares (ULS), which uses only the pre-trained estimator, the forget set, and a small subsample of the remaining data to estimate the parameter on the remaining data. The paper claims minimax optimality for this estimator, an error decomposition into an oracle term plus an unlearning cost governed by forget proportion and forget-model bias, asymptotically valid inference without retraining, and empirical performance close to full retraining.

Significance. If the minimax optimality holds after correctly treating the dependence structure, the work would be a notable contribution to efficient unlearning: it supplies both a practical algorithm with limited data access and strong theoretical rates, together with inference procedures. The explicit error decomposition and the generic-loss extension are useful organizing ideas. The manuscript provides theoretical guarantees and reproducible-style numerical experiments.

Major comments (1)
  1. The central minimax-optimality claim for ULS (stated in the abstract and developed in the theoretical sections) requires that the joint distribution of the pre-trained estimator and the subsample of remaining data be properly characterized. Because the pre-trained estimator was fit on the entire training set, it is statistically dependent on every point in the subsample. The error decomposition into oracle term plus unlearning cost, as well as the matching upper and lower bounds, must therefore condition on or recenter for this dependence (e.g., via leave-one-out adjustments or influence-function corrections inside the subsample). If the analysis proceeds under an implicit independence assumption, both the claimed rate and the minimax lower bound are at risk of being invalid under the stated data-access model. Please supply the precise conditioning argument or adjustment used in the main text.
Minor comments (2)
  1. Clarify the precise subsample size (as a fraction of remaining data) required for the optimality result to hold; the abstract only says “small subsample.”
  2. In the generic-loss section, list all regularity conditions (e.g., strong convexity, smoothness, bounded moments) in one place so that the scope of the non-squared-loss guarantees is immediately visible.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comments, which identify an important technical point about dependence in the data-access model. We address the major comment below and will revise the manuscript accordingly to strengthen the theoretical presentation.

Point-by-point responses
  1. Referee: The central minimax-optimality claim for ULS (stated in the abstract and developed in the theoretical sections) requires that the joint distribution of the pre-trained estimator and the subsample of remaining data be properly characterized. Because the pre-trained estimator was fit on the entire training set, it is statistically dependent on every point in the subsample. The error decomposition into oracle term plus unlearning cost, as well as the matching upper and lower bounds, must therefore condition on or recenter for this dependence (e.g., via leave-one-out adjustments or influence-function corrections inside the subsample). If the analysis proceeds under an implicit independence assumption, both the claimed rate and the minimax lower bound are at risk of being invalid under the stated data-access model. Please supply the precise conditioning argument or adjustment used in the main text.

    Authors: We agree that the dependence between the pre-trained estimator and the subsample must be handled explicitly and thank the referee for this observation. In the current analysis the error decomposition and rates are derived conditionally on the pre-trained estimator, with the ULS correction term formed from the forget set and the small subsample of remaining data. The oracle term corresponds to the estimation error that would be achieved by retraining on the remaining data, while the unlearning cost isolates the additional error arising from limited access. To make the joint distribution fully rigorous, we will revise the theoretical sections to incorporate an explicit influence-function recentering step inside the subsample (or an equivalent leave-one-out adjustment). This adjustment removes the first-order dependence on each subsample point and yields the same asymptotic rates. The minimax lower bound will be restated under the same conditional information structure. These changes clarify the argument without altering the main claims or rates.
    Revision: yes
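For squared loss, the leave-one-out adjustment mentioned in the rebuttal has a classical closed form from regression diagnostics (here h_{ii} is the leverage of observation i); a sketch of the identity such a recentering would build on:

```latex
\hat\beta^{(-i)} \;=\; \hat\beta \;+\; \frac{(X^{\top}X)^{-1}x_i\,\bigl(x_i^{\top}\hat\beta - y_i\bigr)}{1 - h_{ii}},
\qquad
h_{ii} \;=\; x_i^{\top}(X^{\top}X)^{-1}x_i .
```

Applying this per subsample point removes the first-order dependence of the pre-trained estimator on that point, which is the recentering the revised analysis is said to make explicit.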

Circularity Check

0 steps flagged

No significant circularity; derivation rests on external statistical benchmarks

Full rationale

The paper develops a generic-loss framework and, for squared loss, the ULS estimator whose error is decomposed into an oracle term plus an unlearning cost governed by forget proportion and bias. The minimax optimality claim is asserted via this decomposition and standard lower-bound arguments rather than any self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chain. No equations in the provided text equate the target optimality rate to a quantity defined from the same fitted objects by construction. The dependence between pre-trained estimator and subsample is a potential correctness issue but does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of concrete free parameters, axioms, or invented entities; the method presumably relies on standard least-squares assumptions and minimax theory from prior statistics literature.

pith-pipeline@v0.9.0 · 5462 in / 1133 out tokens · 44427 ms · 2026-05-10T19:18:17.634257+00:00 · methodology

Discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 2 internal anchors

  1. [1] A. Anjarlekar and S. Pombra. LLM unlearning using gradient ratio-based influence estimation and noise injection. arXiv preprint arXiv:2508.06467, 2025.
  2. [2] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
  3. [3] J. Brophy and D. Lowd. Machine unlearning for random forests. In International Conference on Machine Learning, pages 1092–1104. PMLR, 2021.
  4. [4] T. T. Cai and H. Wei. Transfer learning for nonparametric classification. The Annals of Statistics, 49(1):100–128, 2021.
  5. [5] Y. Cao and J. Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pages 463–480. IEEE, 2015.
  6. [6] R. D. Cook and S. Weisberg. Residuals and Influence in Regression. 1982.
  7. [7] F. M. Dekking. A Modern Introduction to Probability and Statistics: Understanding Why and How. Springer Science & Business Media, 2005.
  8. [8] C. Fan, J. Liu, L. Lin, J. Jia, R. Zhang, S. Mei, and S. Liu. Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. arXiv preprint arXiv:2410.07163, 2024.
  9. [9] A. Ginart, M. Guan, G. Valiant, and J. Y. Zou. Making AI forget you: Data deletion in machine learning. Advances in Neural Information Processing Systems, 32, 2019.
  10. [10] L. Graves, V. Nagisetty, and V. Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11516–11524, 2021.
  11. [11] C. Guo, T. Goldstein, A. Hannun, and L. Van Der Maaten. Certified data removal from machine learning models. In International Conference on Machine Learning, pages 3832–3842. PMLR, 2020.
  12. [12] Z. He, T. Li, X. Cheng, Z. Huang, and X. Huang. Towards natural machine unlearning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  13. [13] Z. Izzo, M. A. Smart, K. Chaudhuri, and J. Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pages 2008–2016. PMLR, 2021.
  14. [14] R. Jin, M. Chen, Q. Zhang, and X. Li. Forgettable federated linear learning with certified data unlearning. arXiv preprint arXiv:2306.02216, 2023.
  15. [15] A. K. Kuchibhotla and A. Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456, 2022.
  16. [16] N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024a.
  17. [17] S. Li, T. T. Cai, and H. Li. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):149–173, 2022.
  18. [18] S. Li, L. Zhang, T. T. Cai, and H. Li. Estimation and inference for high-dimensional generalized linear models with knowledge transfer. Journal of the American Statistical Association, 119(546):1274–1285, 2024b.
  19. [19] H. Lin, J. W. Chung, Y. Lao, and W. Zhao. Machine unlearning in gradient boosting decision trees. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1374–1383, 2023.
  20. [20] B. Liu, Q. Liu, and P. Stone. Continual learning and private unlearning. In Conference on Lifelong Learning Agents, pages 243–254. PMLR, 2022.
  21. [21] S. Liu, Y. Yao, J. Jia, S. Casper, N. Baracaldo, P. Hase, Y. Yao, C. Y. Liu, X. Xu, H. Li, et al. Rethinking machine unlearning for large language models. Nature Machine Intelligence, pages 1–14, 2025a.
  22. [22] Y. Liu, H. Chen, W. Huang, Y. Ni, and M. Imani. Recover-to-forget: Gradient reconstruction from LoRA for efficient LLM unlearning. In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025, 2025b. URL https://openreview.net/forum?id=n7peBaPUmk
  23. [24] P. Maini, Z. Feng, A. Schwarzschild, Z. C. Lipton, and J. Z. Kolter. TOFU: A task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121, 2024.
  24. [25] S. Neel, A. Roth, and S. Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. In Algorithmic Learning Theory, pages 931–962. PMLR, 2021.
  25. [26] Y. Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018.
  26. [27] T.-H. Nguyen, H.-P. Vu, D. T. Nguyen, T. M. Nguyen, K. D. Doan, and K.-S. Wong. Empirical study of federated unlearning: Efficiency and effectiveness. In Asian Conference on Machine Learning, pages 959–974. PMLR, 2024.
  27. [28] H. W. Reeve, T. I. Cannings, and R. J. Samworth. Adaptive transfer learning. The Annals of Statistics, 49(6):3618–3649, 2021.
  28. [29] A. Sekhari, J. Acharya, G. Kamath, and A. T. Suresh. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075–18086, 2021.
  29. [30] T. Shaik, X. Tao, H. Xie, L. Li, X. Zhu, and Q. Li. Exploring the landscape of machine unlearning: A comprehensive survey and taxonomy. IEEE Transactions on Neural Networks and Learning Systems, 2024.
  30. [31] C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey, P. Elliott, J. Green, M. Landray, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.
  31. [32] Y. Tian and Y. Feng. Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, 118(544):2684–2697, 2023.
  32. [33] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
  33. [34] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  34. [35] Q. Wang, J. P. Zhou, Z. Zhou, S. Shin, B. Han, and K. Q. Weinberger. Rethinking LLM unlearning objectives: A gradient perspective and go beyond. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=huo8MqVH6t
  35. [36] Y. Wu, E. Dobriban, and S. Davidson. DeltaGrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, pages 10355–10366. PMLR, 2020.
  36. [37] P. Yang, Q. Wang, Z. Huang, T. Liu, C. Zhang, and B. Han. Exploring criteria of loss reweighting to enhance LLM unlearning. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=mGOugCZlAq
  37. [38] Y. Yao, X. Xu, and Y. Liu. Large language model unlearning. Advances in Neural Information Processing Systems, 37:105425–105475, 2024.
  38. [39] R. Zhang, L. Lin, Y. Bai, and S. Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. CoRR, 2024.
  39. [40] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28, 2015.
  40. [41] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
  41. [42] Yurii Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018.
  42. [43] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  43. [44] Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456, 2022.
  44. [45] Tianyi Ma, Kabir A. Verchand, and Richard J. Samworth. High-probability minimax lower bounds. arXiv preprint arXiv:2406.13447, 2024.
  45. [46] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.