arxiv: 2510.00761 · v5 · submitted 2025-10-01 · 💻 cs.LG

Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu

Pith reviewed 2026-05-18 10:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM unlearning · optimizer robustness · zeroth-order optimization · loss landscape · machine unlearning · gradient compression · model resilience


The pith

Downgrading the optimizer during LLM unlearning produces forgetting that resists later fine-tuning and quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the optimizer's role in making unlearning durable in large language models, without altering the unlearning objective itself. It tests optimizers of different grades, from precise gradient-based methods down to simpler zeroth-order or sign-based variants that supply noisier updates. These downgraded optimizers lead to solutions that hold up better against post-unlearning manipulations because they steer convergence toward more stable regions of the loss landscape. A hybrid optimizer is introduced to retain strong unlearning performance while adding the robustness advantage. The pattern is confirmed across several unlearning methods on the MUSE and WMDP benchmarks.

Core claim

The grade of the optimizer, measured by the amount of information it uses from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), controls how resilient the unlearned model becomes. Lower-grade optimizers generate less precise updates yet drive the model to loss basins that are harder to escape, giving natural resistance to perturbations such as quantization or fine-tuning. This advantage is further linked to the properties of randomized smoothing, and a hybrid optimizer that mixes first-order and zeroth-order steps is shown to deliver both effective forgetting and improved stability.
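
To make the notion of optimizer grade concrete, the following is a minimal sketch on a toy quadratic loss. It is not the paper's implementation: the toy `loss`, the helper names (`grad`, `zo_grad`, `sign_step`, `sgd_step`), the smoothing radius `mu`, and the number of random directions `q` are all assumptions chosen for illustration.

```python
import random

def loss(w):
    # Toy quadratic stand-in for an unlearning objective (assumption).
    return sum((wi - 1.0) ** 2 for wi in w)

def grad(w):
    # Exact first-order gradient of the toy loss.
    return [2.0 * (wi - 1.0) for wi in w]

def zo_grad(f, w, mu=1e-3, q=20, rng=None):
    # Two-point zeroth-order estimate: average finite differences of f
    # along q random Gaussian directions. Only loss values are used.
    rng = rng if rng is not None else random
    base = f(w)
    est = [0.0] * len(w)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in w]
        shifted = [wi + mu * ui for wi, ui in zip(w, u)]
        diff = (f(shifted) - base) / mu
        est = [e + diff * ui / q for e, ui in zip(est, u)]
    return est

def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)

def sign_step(w, g, lr):
    # Compressed-gradient update: keep only the sign of each coordinate.
    return [wi - lr * sign(gi) for wi, gi in zip(w, g)]

def sgd_step(w, g, lr):
    # Plain descent step using whatever gradient (exact or estimated) is given.
    return [wi - lr * gi for wi, gi in zip(w, g)]
```

Passing `zo_grad(loss, w)` to `sgd_step` gives a gradient-free descent loop, while `sign_step(w, grad(w), lr)` gives a sign-based one. Both discard information relative to the exact gradient, which is the sense of "downgrading" used here.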

What carries the argument

Optimizer grade, defined by the level of information exploited (zeroth-order gradient-free through first-order gradient-based to second-order Hessian-based), which shapes update precision and steers convergence to harder-to-disturb basins in the loss landscape.
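
As a reading aid, the grades can be written as update rules. The notation is assumed for illustration and may not match the paper's: $\theta_t$ for the parameters, $\eta$ for the learning rate, $\ell$ for the unlearning loss, $\mu$ for the smoothing radius, and $H_t$ for a curvature (Hessian) estimate. The last line is the standard Gaussian-smoothing identity that connects the zeroth-order estimate to randomized smoothing.

```latex
\begin{align*}
\text{zeroth-order:}\quad
  & \hat{g}_t = \frac{\ell(\theta_t + \mu u) - \ell(\theta_t)}{\mu}\, u,
    \quad u \sim \mathcal{N}(0, I),
    \qquad \theta_{t+1} = \theta_t - \eta\, \hat{g}_t \\
\text{sign-based:}\quad
  & \theta_{t+1} = \theta_t - \eta\, \operatorname{sign}\!\big(\nabla \ell(\theta_t)\big) \\
\text{first-order:}\quad
  & \theta_{t+1} = \theta_t - \eta\, \nabla \ell(\theta_t) \\
\text{second-order:}\quad
  & \theta_{t+1} = \theta_t - \eta\, H_t^{-1} \nabla \ell(\theta_t) \\
\text{smoothing identity:}\quad
  & \mathbb{E}_u\big[\hat{g}_t\big] = \nabla \ell_\mu(\theta_t),
    \qquad \ell_\mu(\theta) = \mathbb{E}_{u \sim \mathcal{N}(0, I)}\big[\ell(\theta + \mu u)\big]
\end{align*}
```

In words, the zeroth-order step follows the exact gradient of a Gaussian-smoothed version of the loss in expectation, so it optimizes a blurred objective rather than the original one.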

If this is right

Zeroth-order and sign-based optimizers produce unlearning that survives weight quantization and additional fine-tuning steps.
The hybrid first-order plus zeroth-order optimizer maintains high unlearning quality while increasing resistance to later perturbations.
The robustness improvement holds across multiple unlearning algorithms when evaluated on MUSE and WMDP benchmarks.
The link between zeroth-order methods and randomized smoothing supplies an inherent defense against small post-unlearning changes.
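
The hybrid optimizer mentioned in the list above is described here only as a mix of first-order and zeroth-order updates. The sketch below shows one hypothetical way such a mix could be scheduled, alternating between the two kinds of step on a toy quadratic. The alternation rule, the `zo_every` parameter, the toy `loss`, and all function names are assumptions and should not be read as the authors' design.

```python
import random

def loss(w):
    # Toy quadratic stand-in for an unlearning objective (assumption).
    return sum((wi - 1.0) ** 2 for wi in w)

def fo_grad(w):
    # Exact gradient of the toy loss.
    return [2.0 * (wi - 1.0) for wi in w]

def zo_grad(w, rng, mu=1e-3, q=20):
    # Two-point zeroth-order estimate built from loss evaluations only.
    base = loss(w)
    est = [0.0] * len(w)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in w]
        diff = (loss([wi + mu * ui for wi, ui in zip(w, u)]) - base) / mu
        est = [e + diff * ui / q for e, ui in zip(est, u)]
    return est

def hybrid_descent(w, steps=100, lr=0.05, zo_every=2, seed=0):
    # Hypothetical schedule: a zeroth-order step every `zo_every` steps,
    # first-order steps otherwise.
    rng = random.Random(seed)
    for t in range(steps):
        g = zo_grad(w, rng) if t % zo_every == 0 else fo_grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w
```

Other mixing rules are equally plausible under the same description, for example applying zeroth-order updates to some parameter groups and first-order updates to the rest.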

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same optimizer-downgrade approach could be tested in other settings where models must selectively forget information without full retraining.
Examining the curvature or sharpness of the loss basins reached by different optimizer grades might reveal measurable predictors of robustness.
Downgrading optimizers may offer a practical lever for improving stability in continual or incremental learning tasks.
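
One simple way to probe the sharpness idea in the list above is to measure how much the loss rises under small random parameter perturbations. The sketch below is an illustrative proxy, not a metric taken from the paper: the function name `perturbation_sharpness`, the radius `rho`, the sample count `n`, and the two toy basins are assumptions.

```python
import math
import random

def perturbation_sharpness(f, w, rho=0.05, n=200, rng=None):
    # Mean increase in f over n random perturbations of L2 norm rho around w.
    # Larger values suggest a sharper, easier-to-disturb basin.
    rng = rng if rng is not None else random
    base = f(w)
    total = 0.0
    for _ in range(n):
        d = [rng.gauss(0.0, 1.0) for _ in w]
        norm = math.sqrt(sum(di * di for di in d)) or 1.0
        probe = [wi + rho * di / norm for wi, di in zip(w, d)]
        total += f(probe) - base
    return total / n

def flat_basin(w):
    # Toy loss with low curvature (assumption).
    return 1.0 * sum(wi * wi for wi in w)

def sharp_basin(w):
    # Toy loss with high curvature (assumption).
    return 100.0 * sum(wi * wi for wi in w)
```

Evaluated at the shared minimum `[0.0, 0.0]`, the sharp basin yields a larger value than the flat one. For a real model, `f` would be the unlearning loss and `w` the converged parameters.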

Load-bearing premise

That noisier updates from lower-grade optimizers specifically cause convergence to more stable loss basins and that this robustness effect does not depend on the particular unlearning objective chosen.

What would settle it

Apply the same unlearning procedure to an LLM once with a standard first-order optimizer and once with a zeroth-order optimizer, then perform fine-tuning on data related to the forgotten content and measure whether the forgotten knowledge recovers substantially less in the zeroth-order case.
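
The comparison described above amounts to a small amount of bookkeeping around three operations: unlearn, relearn, and score. The harness below is a hypothetical sketch of that bookkeeping only. The function name `relearning_recovery` and its callable arguments are assumptions, and the actual unlearning, fine-tuning, and evaluation routines must be supplied by the experimenter.

```python
def relearning_recovery(base_model, unlearners, relearn, forget_score):
    """Compare how much forgotten knowledge returns after relearning.

    unlearners:   dict mapping an optimizer label to callable(model) -> unlearned model
    relearn:      callable(model) -> model fine-tuned on forget-related data
    forget_score: callable(model) -> float, higher means more of the
                  forget-set knowledge is still present
    """
    report = {}
    for name, unlearn in unlearners.items():
        unlearned = unlearn(base_model)
        before = forget_score(unlearned)
        after = forget_score(relearn(unlearned))
        # "recovered" is the quantity of interest: a smaller value means
        # the unlearning was more robust to the relearning attack.
        report[name] = {"before": before, "after": after, "recovered": after - before}
    return report
```

Under the paper's claim, the entry for a zeroth-order unlearner would show a smaller `recovered` value than the entry for a first-order one, given matched `before` scores.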

Figures

Figures reproduced from arXiv: 2510.00761 by Changsheng Wang, Chongyu Fan, Jinghan Jia, Sijia Liu, Yicheng Lang, Yihua Zhang.

**Figure 1.** Unlearning performance under 4-bit weight quantization using NPO on MUSE with different optimizers (Sophia, Adam, …

**Figure 2.** On MUSE-Books, (a-b): Unlearning performance under 4-bit weight quantization using GradDiff and NPO with different …

**Figure 3.** Linear mode connectivity (LMC) between downgraded optimizers (signSGD, signAdam, RS, and ZO) and Adam on …

**Figure 4.** (a–b): Unlearning performance before and after 4-bit quantization on MUSE-Books using GradDiff and NPO with …

**Figure 5.** Unlearning performance and relearning robustness of RMU and NPO on WMDP-Bio using different optimizers (Adam, …

**Figure 6.** Unlearning performance and robustness of NPO using Adam, ZO, and Hybrid optimizer on TOFU under the forget10 sce…

Original abstract

Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the 'grade' of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
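
The abstract names weight quantization as one manipulation that can undo forgetting. The sketch below uses plain symmetric round-to-nearest 4-bit quantization with a single shared scale, which is an assumption; the quantization scheme used in the paper may differ. It only illustrates the arithmetic point that weight changes smaller than half a quantization step can be rounded away. The weights shown are hypothetical.

```python
def quantize_4bit(weights):
    # Symmetric round-to-nearest onto the 16 signed integer codes -8..7,
    # with one shared scale per tensor (a simplifying assumption).
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return [c * scale for c in codes]

# Hypothetical weights before and after a small unlearning update.
before = [0.875, -0.50, 0.25]
after = [0.875, -0.47, 0.29]

# True when quantization maps both weight sets to identical values,
# meaning the update did not survive rounding.
update_erased = quantize_4bit(before) == quantize_4bit(after)
```

Here the quantization step is 0.125, so the updates of 0.03 and 0.04 fall within the same rounding cell and `update_erased` is `True`.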

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows empirically that simpler, noisier optimizers can strengthen LLM unlearning robustness across benchmarks, but the loss-landscape mechanism remains conjectural without direct checks.


The main thing to know is that this work finds that downgrading the optimizer, that is, moving from full gradients to zeroth-order or sign-based updates, often produces unlearning that holds up better against later fine-tuning or quantization. The authors test this across several unlearning algorithms on MUSE and WMDP, then propose a hybrid first-order plus zeroth-order scheme that keeps forgetting quality while adding some resilience. That empirical pattern is the core observation.

Referee Report

2 major / 2 minor

Summary. The paper claims that in LLM unlearning, downgrading the optimizer from first-order gradient methods to zeroth-order or compressed-gradient variants (e.g., sign-based) produces more robust forgetting that resists post-training perturbations such as quantization or fine-tuning. The proposed mechanism is that noisier updates drive convergence to harder-to-disturb basins in the loss landscape, which the paper links to randomized smoothing. A hybrid first-order/zeroth-order optimizer is proposed, and the approach is validated through experiments on the MUSE and WMDP benchmarks across multiple unlearning algorithms.

Significance. If the empirical robustness gains hold under tighter controls, the work offers a simple, objective-independent lever for improving unlearning reliability—an important practical advance given the fragility of current unlearning methods. The randomized-smoothing connection supplies a plausible theoretical framing, and the hybrid optimizer is a constructive proposal. However, the absence of direct landscape analysis leaves the causal account conjectural.

major comments (2)

Abstract: the central claim that 'noisier and less precise updates... encourage convergence to harder-to-disturb basins' is asserted without any supporting analysis (Hessian spectra, sharpness metrics, basin-width measurements, or parameter-space perturbation tests). This makes the proposed mechanism load-bearing yet unverified, even if robustness gains are observed empirically.
Experiments (as described in the abstract): the reported 'extensive experiments across multiple algorithms and benchmarks' lack explicit controls for optimizer hyperparameters, baseline optimizer choices, and statistical significance testing. Without these, it is difficult to isolate the optimizer-grade effect from confounding factors such as convergence speed or implicit regularization.

minor comments (2)

Clarify the precise definition and quantification of 'optimizer grade' (zeroth-order vs. first-order vs. second-order) in the methods section, including any implementation details for the sign-based and zeroth-order variants.
Add a short discussion of alternative explanations for the observed robustness (e.g., slower convergence or reduced sensitivity to the unlearning loss) and how they were ruled out or could be tested.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed review and insightful comments on our paper. Below we provide point-by-point responses to the major comments and indicate the changes we plan to incorporate in the revised manuscript.


Referee: Abstract: the central claim that 'noisier and less precise updates... encourage convergence to harder-to-disturb basins' is asserted without any supporting analysis (Hessian spectra, sharpness metrics, basin-width measurements, or parameter-space perturbation tests). This makes the proposed mechanism load-bearing yet unverified, even if robustness gains are observed empirically.

Authors: We agree that direct landscape analysis would provide stronger causal support for the basin-convergence hypothesis. The randomized-smoothing connection supplies theoretical motivation, and the observed robustness gains are empirical, but we acknowledge the mechanistic account remains partly conjectural without explicit verification. In revision we will add parameter-space perturbation experiments (injecting controlled noise into converged parameters and measuring retention of unlearning) together with sharpness estimates on a subset of models to directly test basin stability. (Revision: partial.)
Referee: Experiments (as described in the abstract): the reported 'extensive experiments across multiple algorithms and benchmarks' lack explicit controls for optimizer hyperparameters, baseline optimizer choices, and statistical significance testing. Without these, it is difficult to isolate the optimizer-grade effect from confounding factors such as convergence speed or implicit regularization.

Authors: We conducted hyperparameter sweeps for each optimizer variant and chose baselines following prior unlearning literature, yet we accept that fuller documentation is needed to isolate the optimizer-grade effect. In the revised manuscript we will add an appendix detailing the full hyperparameter search ranges, report all results with mean and standard deviation over three random seeds, and include paired statistical significance tests (t-tests with Bonferroni correction) comparing robustness metrics across optimizer grades. (Revision: yes.)
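
The statistical procedure promised in the second response can be sketched with the standard library alone. The code below is a minimal illustration, not the authors' analysis script: the function names are assumptions. It handles exactly three paired runs, where the t distribution has two degrees of freedom and its two-sided p-value has the closed form $1 - |t| / \sqrt{2 + t^2}$.

```python
import math
import statistics

def paired_t_three_seeds(a, b):
    # Paired t-test for exactly three paired runs (2 degrees of freedom).
    # Returns the t statistic and the two-sided p-value.
    if len(a) != 3 or len(b) != 3:
        raise ValueError("this closed form only covers three paired runs")
    diffs = [x - y for x, y in zip(a, b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean / (sd / math.sqrt(3))
    p = 1.0 - abs(t) / math.sqrt(2.0 + t * t)
    return t, p

def bonferroni(p_values):
    # Multiply each p-value by the number of comparisons, capped at 1.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With only three seeds the test has little power, so even consistent differences between optimizers may not reach significance after correction.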

Circularity Check

0 steps flagged

No significant circularity: empirical observations on optimizer downgrading for LLM unlearning robustness


The paper advances its claims through experimental results on the MUSE and WMDP benchmarks across multiple unlearning algorithms, rather than any mathematical derivation chain. The observation that downgrading optimizers (zeroth-order or sign-based) yields stronger robustness is presented as a finding from direct comparisons of unlearning efficacy and post-perturbation resilience, not as a quantity predicted by or defined in terms of fitted parameters within the paper's own equations. The interpretive link to 'harder-to-disturb basins' and randomized smoothing is offered as a post-hoc explanation of the empirical pattern, without reducing to self-definition, self-citation load-bearing, or renaming of known results. No load-bearing step collapses to its inputs by construction; the work remains self-contained as an empirical investigation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning study relying on standard assumptions about loss landscapes and optimizer behavior; no explicit free parameters, axioms, or invented entities are introduced beyond the proposed hybrid optimizer.

