Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

Binghang Lu; Bing Hu; Changhong Mou; Guang Lin; Runyu Zhang; Xiaomin Li; Yuan Tian; Yunhan Zhao; Zheyuan Deng

arxiv: 2605.08949 · v2 · pith:R73BLH45new · submitted 2026-05-09 · 💻 cs.LG

Pith reviewed 2026-05-19 15:05 UTC · model grok-4.3

keywords: continual learning · large language models · catastrophic forgetting · orthogonal gradient projection · spectral norm · Muon optimizer · stability-plasticity trade-off

The pith

Spectral-norm geometry with orthogonal projections reduces catastrophic forgetting in sequential LLM fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Muon-OGD to address catastrophic forgetting when large language models adapt to new tasks one after another. It replaces the Frobenius norm used in existing projection methods with a spectral-norm approach drawn from the Muon optimizer. This change lets the method enforce non-interference constraints on matrix-valued parameters while solving the resulting optimization problem through dual iterations and matrix-sign approximations. The approach produces orthogonalized momentum updates that avoid directions tied to earlier tasks. If the claim holds, it indicates that the choice of norm in update geometry can meaningfully improve how LLMs balance retaining old knowledge and acquiring new skills.

Core claim

Muon-OGD formulates each update as a spectral-norm-constrained optimization problem subject to linear non-interference constraints from prior tasks. It solves the problem efficiently with dual iterations and Newton-Schulz matrix-sign approximations, then applies the resulting orthogonalized momentum updates. This combines Muon-style operator-norm geometry with orthogonal gradient projection to improve the stability-plasticity trade-off across encoder-decoder and decoder-only models on benchmarks including TRACE and domain-specific Coding-Math-Medical sequences.
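The constrained problem described above can be written out explicitly. The following is a sketch under assumed notation, not the paper's own derivation: $G$ stands for the gradient or momentum matrix, $C_i$ for the protected directions, $\eta$ for the step radius, $\lambda_i$ for dual variables, $\|\cdot\|_*$ for the nuclear norm, and $\operatorname{msign}(M) = UV^\top$ for the polar factor of $M = U\Sigma V^\top$.

```latex
\begin{aligned}
&\text{Primal:}\quad \min_{\Delta}\ \langle G, \Delta\rangle
\quad\text{s.t.}\quad \|\Delta\|_{2}\le \eta,\qquad \langle C_i,\Delta\rangle = 0,\quad i=1,\dots,k,\\[4pt]
&\text{Lagrangian:}\quad \mathcal{L}(\Delta,\lambda)
= \Big\langle G-\sum_{i=1}^{k}\lambda_i C_i,\ \Delta\Big\rangle,\\[4pt]
&\text{Inner minimum:}\quad \min_{\|\Delta\|_2\le\eta}\mathcal{L}(\Delta,\lambda)
= -\eta\,\Big\|G-\sum_{i=1}^{k}\lambda_i C_i\Big\|_{*},\\[4pt]
&\text{Dual:}\quad \min_{\lambda\in\mathbb{R}^{k}}\ \Big\|G-\sum_{i=1}^{k}\lambda_i C_i\Big\|_{*},
\qquad
\Delta^{\star}(\lambda) = -\eta\,\operatorname{msign}\!\Big(G-\sum_{i=1}^{k}\lambda_i C_i\Big).
\end{aligned}
```

Under this reading, the dual variables shift the matrix that gets orthogonalized, and the constraint $\langle C_i, \Delta\rangle = 0$ holds exactly only at a dual optimum with an exact matrix sign.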

What carries the argument

Spectral-norm-constrained optimization with linear non-interference constraints, solved via dual iterations and Newton-Schulz matrix-sign approximations to produce orthogonalized momentum updates.
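To make the matrix-sign step concrete, here is a minimal NumPy sketch of the textbook cubic Newton-Schulz iteration. This is an assumption for illustration: Muon implementations typically use a tuned polynomial variant, and the function name `newton_schulz_msign` and the iteration count are hypothetical, not taken from the paper.

```python
import numpy as np

def newton_schulz_msign(M, num_iters=30):
    """Approximate msign(M) = U V^T (the polar factor of M) without an SVD."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(num_iters):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which moves it toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
X = newton_schulz_msign(M)

# Compare against the exact polar factor from a reduced SVD.
U, _, Vt = np.linalg.svd(M, full_matrices=False)
err = np.linalg.norm(X - U @ Vt)
print(f"distance to exact polar factor: {err:.2e}")
```

The iteration only uses matrix products, which is why it is attractive on accelerators; the price is that the result is approximate for any finite iteration count.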

If this is right

Muon-OGD consistently improves performance over sequential fine-tuning and other orthogonal-gradient baselines on standard benchmarks.
The method remains computationally scalable for both encoder-decoder and decoder-only LLM architectures.
Orthogonalized momentum updates successfully avoid protected directions associated with prior tasks.
The framework delivers better stability-plasticity balance on domain-specific curricula such as Coding-Math-Medical sequences.
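One way to see why avoiding protected directions needs more than a projection followed by orthogonalization: the two steps do not commute. The sketch below is a hypothetical naive composition, not the paper's method, with random stand-ins `G` and `C` for a momentum matrix and one protected direction. It projects first and orthogonalizes second, then checks the inner product with the protected direction.

```python
import numpy as np

def msign(M):
    """Exact matrix sign / polar factor U V^T via a reduced SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def project_out(G, C):
    """Remove the Frobenius component of G along a single protected direction C."""
    return G - (np.sum(C * G) / np.sum(C * C)) * C

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # stand-in for a momentum matrix
C = rng.standard_normal((6, 4))   # stand-in for one protected direction

G_proj = project_out(G, C)
before = float(np.sum(C * G_proj))        # zero up to round-off
after = float(np.sum(C * msign(G_proj)))  # generally nonzero
print(f"<C, P(G)> = {before:.2e}, <C, msign(P(G))> = {after:.2e}")
```

Orthogonalization replaces the singular values of the projected matrix with ones, which generally reintroduces a component along `C`. That is consistent with the summary's description of solving a single constrained problem rather than chaining the two operations.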

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spectral-norm projection idea could be tested with other matrix-aware optimizers beyond Muon.
Attention layers and other structured parameter blocks may benefit most from this geometry because they naturally admit matrix interpretations.
Varying the accuracy of the Newton-Schulz approximation would provide a direct test of how tightly the solver must match the exact dual solution to preserve non-interference guarantees.
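The last suggestion above can be prototyped in a few lines. The sketch below is an assumption-laden illustration: it uses the textbook cubic Newton-Schulz iteration, a random test matrix, and hypothetical helper names. It sweeps the iteration count and records how far the result is from having orthonormal columns.

```python
import numpy as np

def newton_schulz(M, num_iters):
    """Cubic Newton-Schulz iteration toward the polar factor of M."""
    X = M / (np.linalg.norm(M) + 1e-12)  # singular values now in (0, 1]
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def orthogonality_error(X):
    """Frobenius distance of X^T X from the identity (zero for an exact polar factor
    of a matrix with full column rank)."""
    return float(np.linalg.norm(X.T @ X - np.eye(X.shape[1])))

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 5))

errors = {k: orthogonality_error(newton_schulz(M, k)) for k in (2, 5, 10, 20, 40)}
for k, e in errors.items():
    print(f"iterations={k:2d}  ||X^T X - I||_F = {e:.2e}")
```

In an actual study one would replace the orthogonality error with the inner product against the protected directions, since that is the quantity the non-interference claim depends on.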

Load-bearing premise

Spectral-norm geometry is more suitable than Frobenius norm for matrix-valued LLM parameters during continual learning.
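For reference, the two geometries differ already in the unconstrained steepest-descent step. The following standard identities are stated under assumed notation: $G = U\Sigma V^\top$ is a reduced SVD of a full-rank gradient or momentum matrix, $\eta$ is the step radius, and $\|\cdot\|_*$ is the nuclear norm.

```latex
\begin{aligned}
\text{Frobenius ball:}\quad
& \arg\min_{\|\Delta\|_F \le \eta}\ \langle G, \Delta\rangle = -\eta\,\frac{G}{\|G\|_F},
&& \text{optimal value } -\eta\,\|G\|_F,\\[4pt]
\text{Spectral ball:}\quad
& \arg\min_{\|\Delta\|_2 \le \eta}\ \langle G, \Delta\rangle = -\eta\,U V^{\top},
&& \text{optimal value } -\eta\,\|G\|_{*}.
\end{aligned}
```

The Frobenius step keeps the singular values of $G$ up to scaling, while the spectral step sets them all to $\eta$; whether the latter is the better choice for continual learning is the empirical question the premise rests on.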

What would settle it

A head-to-head experiment on the same continual learning benchmarks where Muon-OGD shows no consistent gain or produces worse retention than Frobenius-based orthogonal projection methods would falsify the advantage of the spectral approach.
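For such a comparison, the Frobenius-side baseline is simple to state. Below is a minimal sketch of a Frobenius-norm orthogonal projection of a gradient matrix against a set of protected directions. The function name and the random test data are hypothetical, and linearly independent protected directions are assumed.

```python
import numpy as np

def frobenius_ogd_project(G, protected):
    """Remove from G its components along span{C_i} under the Frobenius inner product."""
    # Flatten each protected matrix into a column and orthonormalize the span with QR.
    A = np.stack([C.ravel() for C in protected], axis=1)
    Q, _ = np.linalg.qr(A)
    g = G.ravel()
    g_proj = g - Q @ (Q.T @ g)
    return g_proj.reshape(G.shape)

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 4))
protected = [rng.standard_normal((5, 4)) for _ in range(3)]

P = frobenius_ogd_project(G, protected)
residuals = [float(np.sum(C * P)) for C in protected]
print("inner products with protected directions:", residuals)
```

Because this projection is exact up to round-off, any residual interference measured for the spectral method in a head-to-head test would come from its approximate solver rather than from the constraint set itself.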

Figures

Figures reproduced from arXiv: 2605.08949 by Binghang Lu, Bing Hu, Changhong Mou, Guang Lin, Runyu Zhang, Xiaomin Li, Yuan Tian, Yunhan Zhao, Zheyuan Deng.

**Figure 2.** Prompt template used for zero-shot mathematical evaluation on the GSM8K dataset.

**Figure 3.** Prompt template used for zero-shot coding evaluation on the BigCodeBench dataset.

**Figure 4.** Prompt template used for LLM-as-a-judge evaluation on the HuatuoGPT-o1 medical dataset.

**Figure 5.** Ablation study of Muon-OGD hyperparameters on the Qwen2.5-1.5B architecture. The figure ...

Original abstract

A central challenge in continual learning for large language models (LLMs) is catastrophic forgetting, where adapting to new tasks can substantially degrade performance on previously learned ones. Existing projection-based methods mitigate such interference by restricting parameter updates to subspaces that are orthogonal to directions associated with past tasks. However, these methods are typically formulated under Euclidean parameter geometry, with update magnitudes and projections governed by the Frobenius norm. The recent empirical success of the Muon optimizer, which applies orthogonalized matrix updates and admits a spectral-norm interpretation, suggests that Frobenius geometry may not be the most effective choice for matrix-valued LLM parameters. Motivated by this observation, we propose Muon-OGD, a spectral-norm-aware continual learning framework that integrates Muon-style operator-norm geometry with orthogonal projection constraints. Our method formulates each update as a spectral-norm-constrained optimization problem with linear non-interference constraints, and solves it efficiently through dual iterations and Newton-Schulz matrix-sign approximations. By applying orthogonalized momentum updates that avoid protected directions associated with prior tasks, Muon-OGD aims to improve the stability-plasticity trade-off in sequential LLM adaptation. We evaluate the proposed method on standard continual learning benchmarks, TRACE, and domain-specific Coding-Math-Medical curricula using both encoder-decoder and decoder-only architectures. Empirically, Muon-OGD consistently improves over sequential fine-tuning and competitive orthogonal-gradient baselines, while remaining computationally scalable. These results suggest that spectral-norm-aware update geometry provides a practical and effective alternative to Frobenius-norm projection for continual learning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Muon-OGD swaps Frobenius projections for spectral-norm updates in orthogonal gradient descent and reports gains on LLM continual learning benchmarks, but the Newton-Schulz solver leaves the exact non-interference guarantee unproven.

The main point is that this paper takes the spectral-norm geometry behind the Muon optimizer and folds it into orthogonal gradient projection to reduce catastrophic forgetting in sequential LLM adaptation. It formulates each update as a constrained problem with linear non-interference conditions and solves it with dual iterations plus Newton-Schulz approximations of the matrix sign function. Experiments on TRACE and Coding-Math-Medical sequences, across encoder-decoder and decoder-only models, report that it beats plain sequential fine-tuning and other OGD baselines on the stability-plasticity trade-off.

Referee Report

1 major / 2 minor

Summary. The paper proposes Muon-OGD, a continual learning method for LLMs that replaces Frobenius-norm orthogonal gradient descent with a spectral-norm constrained formulation motivated by the Muon optimizer. Each update is cast as a spectral-norm-bounded optimization problem subject to linear non-interference constraints from prior tasks; the problem is solved via dual iterations combined with Newton-Schulz matrix-sign approximations to produce orthogonalized momentum updates that avoid protected directions. The authors report consistent empirical gains over sequential fine-tuning and existing orthogonal-gradient baselines on TRACE, Coding-Math-Medical curricula, and both encoder-decoder and decoder-only models.

Significance. If the non-interference property survives the numerical approximations, the work supplies a concrete alternative geometry for matrix-valued LLM parameters that appears better aligned with observed optimizer behavior than Euclidean projections. The empirical results on standard and domain-specific benchmarks would then constitute a practical advance in the stability-plasticity trade-off for sequential LLM adaptation.

major comments (1)

[Method (spectral-norm constrained optimization and Newton-Schulz solver)] The central non-interference guarantee rests on the claim that updates are strictly orthogonal to protected subspaces. The solution procedure (dual iterations plus Newton-Schulz iterations for the matrix sign function) is described without explicit error bounds on the residual component in protected directions or a final exact projection step. Because Newton-Schulz converges only asymptotically, the delivered update can retain a non-zero inner product with the protected subspace, directly threatening both the stability claim and the comparison to Frobenius-based OGD baselines.
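The concern can be illustrated numerically. The sketch below rests on several assumptions that are not from the paper: a single protected direction `C`, random test matrices, the textbook cubic Newton-Schulz iteration, and bisection on the one-dimensional dual derivative as a stand-in for whatever dual update rule the authors use. It compares the residual inner product obtained with an exact matrix sign against the one obtained with a truncated iteration.

```python
import numpy as np

def msign(M):
    """Exact polar factor U V^T via a reduced SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_schulz(M, num_iters):
    """Cubic Newton-Schulz approximation of msign(M)."""
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def dual_derivative(lam, G, C):
    # Derivative of the dual objective ||G - lam*C||_* with respect to lam,
    # valid where G - lam*C has full rank.
    return -float(np.sum(C * msign(G - lam * C)))

def solve_dual_bisection(G, C, iters=80):
    # The dual objective is convex in lam, so its derivative is nondecreasing.
    # Outside [-R, R] the derivative has a fixed sign, so the root is bracketed.
    R = 2.0 * np.linalg.norm(G, "nuc") / np.linalg.norm(C, "nuc") + 1.0
    lo, hi = -R, R
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dual_derivative(mid, G, C) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # stand-in for a momentum matrix
C = rng.standard_normal((6, 4))   # stand-in for one protected direction
eta = 0.1

lam = solve_dual_bisection(G, C)
M = G - lam * C
residual_exact = abs(float(np.sum(C * (-eta * msign(M)))))
residual_ns = {k: abs(float(np.sum(C * (-eta * newton_schulz(M, k))))) for k in (3, 50)}
print(f"exact msign residual: {residual_exact:.2e}")
print(f"Newton-Schulz residuals: {residual_ns}")
```

Even with the dual variable solved to high accuracy, a small iteration budget leaves a visible component along the protected direction, which is the kind of residual the referee asks the authors to bound or remove.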

minor comments (2)

[Introduction] The abstract and introduction repeatedly contrast 'spectral-norm-aware' geometry with 'Frobenius geometry' without a brief derivation showing why the operator norm is the natural choice for the constrained problem; adding one sentence or a short paragraph would clarify the motivation.
[Experiments] No table or figure caption states the precise number of Newton-Schulz iterations used per update step during training or the convergence tolerance; this detail is needed to assess reproducibility of the reported wall-clock overhead.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address the major comment regarding the spectral-norm constrained optimization and Newton-Schulz solver below, outlining both our response and the planned revisions.

Referee: [Method (spectral-norm constrained optimization and Newton-Schulz solver)] The central non-interference guarantee rests on the claim that updates are strictly orthogonal to protected subspaces. The solution procedure (dual iterations plus Newton-Schulz iterations for the matrix sign function) is described without explicit error bounds on the residual component in protected directions or a final exact projection step. Because Newton-Schulz converges only asymptotically, the delivered update can retain a non-zero inner product with the protected subspace, directly threatening both the stability claim and the comparison to Frobenius-based OGD baselines.

Authors: We agree that the current manuscript does not supply explicit error bounds on the residual or include a final exact projection step, which is a legitimate concern for rigorously establishing the non-interference property. The dual iterations are intended to enforce the linear constraints in the exact-arithmetic limit, while Newton-Schulz approximates the matrix sign function that realizes the spectral-norm orthogonalization; its quadratic convergence typically drives residuals to near machine precision after a modest number of iterations. To address the referee's point directly, we will revise the method section to (i) cite the known quadratic convergence guarantees for Newton-Schulz and derive a practical bound on the inner-product residual as a function of iteration count, (ii) add an optional post-iteration exact orthogonalization step (e.g., via a single Gram-Schmidt pass on the update direction) that can be used when strict theoretical guarantees are required, and (iii) report new experiments quantifying the observed residual norms on protected subspaces across the evaluated curricula. These changes will be incorporated in the revised manuscript while preserving the computational efficiency of the core procedure.

Revision: yes
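A post-iteration exact orthogonalization step of the kind mentioned in the response could look like the sketch below. It is a hypothetical illustration, not the authors' implementation: the function name, the modified Gram-Schmidt loop, and the final rescaling are all assumptions. The rescaling is included because a Frobenius projection can push the spectral norm above the radius `eta`.

```python
import numpy as np

def exact_projection_step(delta, protected, eta):
    """Project an approximate update off the protected directions, then restore ||.||_2 <= eta."""
    # Orthonormalize the protected directions under the Frobenius inner product
    # (modified Gram-Schmidt), skipping numerically dependent ones.
    basis = []
    for C in protected:
        R = C.astype(float).copy()
        for Q in basis:
            R -= np.sum(Q * R) * Q
        nrm = np.linalg.norm(R)
        if nrm > 1e-12:
            basis.append(R / nrm)
    # Remove the component of the update along each basis direction.
    out = delta.astype(float).copy()
    for Q in basis:
        out -= np.sum(Q * out) * Q
    # Scaling by a positive constant keeps the inner products at zero.
    spec = np.linalg.norm(out, 2)
    if spec > eta:
        out *= eta / spec
    return out

rng = np.random.default_rng(0)
eta = 0.05
delta = 0.05 * rng.standard_normal((5, 4))          # stand-in for an approximate update
protected = [rng.standard_normal((5, 4)) for _ in range(3)]

fixed = exact_projection_step(delta, protected, eta)
residuals = [float(np.sum(C * fixed)) for C in protected]
print("residuals after projection:", residuals)
print("spectral norm:", np.linalg.norm(fixed, 2))
```

The trade-off is that the corrected update is no longer the exact minimizer of the spectral-norm problem; it is feasible, but possibly less aligned with the gradient than the true constrained solution.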

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper motivates the spectral-norm geometry from the external empirical success of the Muon optimizer and formulates the update as a constrained optimization problem solved by dual iterations plus Newton-Schulz approximations. No load-bearing equation or claim reduces the central non-interference guarantee or performance claim to a quantity defined by the authors' own prior results or fitted inputs. The method is presented as a new integration of existing ideas with independent empirical validation on TRACE and domain curricula, keeping the derivation chain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach rests on standard optimization primitives and the external empirical record of the Muon optimizer.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: min_Δ ⟨G,Δ⟩ s.t. ‖Δ‖₂ ≤ η, ⟨C_i,Δ⟩ = 0 … solved via dual iterations and Newton–Schulz matrix-sign approximations

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: Muon update interpreted as steepest descent under spectral-norm constraint

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

fPINN-DeepONet: A Physics-Informed Operator Learning Framework for Multi-term Time-fractional Mixed Diffusion-wave Equations
math.NA · 2026-05 · unverdicted · novelty 5.0

fPINN-DeepONet integrates an L2 approximation for the Caputo derivative with DeepONet to solve multi-term time-fractional PDEs, including cases with space-time varying orders and noisy data.

[1] [1]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017

[2] [2]

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks.arXiv preprint arXiv:1312.6211, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[3] [3]

Continual learning with deep generative replay.Advances in neural information processing systems, 30, 2017

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay.Advances in neural information processing systems, 30, 2017

work page 2017

[4] [4]

Lifelong learning with dynamically expandable networks

Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In6th International Conference on Learning Representations, ICLR 2018, 2018

work page 2018

[5] [5]

Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

work page 2017

[6] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.

[7] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.

[8] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International conference on artificial intelligence and statistics, pages 3762–3773. PMLR, 2020.

[9] Yifeng Xiong and Xiaohui Xie. Oplora: Orthogonal projection lora prevents catastrophic forgetting during parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34088–34096, 2026.

[10] Nikhil Shivakumar Nayak, Krishnateja Killamsetty, Ligong Han, Abhishek Bhandwaldar, Prateek Chanda, Kai Xu, Hao Wang, Aldo Pareja, Oleg Silkin, Mustafa Eyceoz, et al. Sculpting subspaces: Constrained full fine-tuning in llms for continual learning. arXiv preprint arXiv:2504.07097, 2025.

[11] Keller Jordan. Muon: An optimizer for hidden layers in neural networks. Blog post. URL https://kellerjordan.github.io/posts/muon/

[12] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025.

[13] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[14] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025.

[15] Weijie Su. Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal? arXiv preprint arXiv:2511.00674, 2025.

[16] Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299, 2025.

[17] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural networks, 113:54–71, 2019.

[18] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021.

[19] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE transactions on pattern analysis and machine intelligence, 46(8):5362–5383, 2024.

[20] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International conference on machine learning, pages 3987–3995. PMLR, 2017.

[21] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.

[22] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 139–149, 2022.

[23] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European conference on computer vision, pages 631–648. Springer, 2022.

[24] Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314, 2023.

[25] Chengwei Qin and Shafiq Joty. Lfpt5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. arXiv preprint arXiv:2110.07298, 2021.

[26] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[27] Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, et al. Trace: A comprehensive benchmark for continual learning in large language models. arXiv preprint arXiv:2310.06762, 2023.

[28] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, 2023.

[29] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. arXiv preprint arXiv:2410.21265, 2024.

[30] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025.

[31] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3762–3773. PMLR, 26–28 Aug 2020. URL h...

[32] Zhixin Zhang, Zeming Wei, and Meng Sun. Dynamic orthogonal continual fine-tuning for mitigating catastrophic forgettings, 2025. URL https://arxiv.org/abs/2509.23893

[33] Brian Tekmen, Jason Yin, and Qianqian Tong. Gem-style constraints for peft with dual gradient projection in lora. In 2025 IEEE International Conference on Data Mining Workshops (ICDMW), pages 2736–2743. IEEE, 2025.

[34] Nick Higham and Pythagoras Papadimitriou. Matrix procrustes problems. Rapport technique, University of Manchester, 1995.

[35] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. In Forty-second International Conference on Machine Learning.

[36] Jeremy Bernstein. Deriving muon, 2025. URL https://jeremybernste.in/writing/deriving-muon

[37] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015.

[38] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pages 353–355, 2018.

[39] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.

[40] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.

[41] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[42] Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. Huatuogpt-o1, towards medical complex reasoning with llms, 2024. URL https://arxiv.org/abs/2412.18925

[43] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning, 2023. URL https://arxiv.org/abs/2310.14152

[44] Fuli Qiao and Mehrdad Mahdavi. Learn more, but bother less: parameter efficient continual learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. URL https://openreview.net/forum?id=ZxtaNh5UYB

[46] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. arXiv preprint arXiv:2103.09762, 2021.

Declaration of LLM Usage. We used large language models as writing and coding assistants during the preparation of this manuscript. Specifically, LLMs were used to improve language clarity and presentation, assist with coding...

