Can Muon Fine-tune Adam-Pretrained Models?
Pith reviewed 2026-05-12 03:47 UTC · model grok-4.3
The pith
LoRA mitigates the optimizer mismatch that degrades Muon performance when fine-tuning Adam-pretrained models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon and Adam possess different implicit biases; switching to Muon for fine-tuning disrupts the knowledge stored in an Adam-pretrained model, and the degree of disruption scales with update strength. Constraining updates through LoRA reduces this disruption and thereby narrows the performance difference between the two optimizers that appears under full fine-tuning.
What carries the argument
The optimizer mismatch driven by distinct implicit biases of Muon versus Adam, which scales with update strength and is mitigated by constraining updates via LoRA.
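To make the contrast behind this claim concrete, below is a minimal sketch, in PyTorch, of the two update rules for a single weight matrix. It is an illustration, not the paper's code: the Newton-Schulz coefficients follow the public Muon implementation, while function names, learning rates, and momentum values are placeholder assumptions. Adam rescales each coordinate by its own second-moment estimate; Muon orthogonalizes the momentum matrix so that all singular directions of the update receive comparable step sizes.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5):
    # Quintic Newton-Schulz iteration (coefficients from the public Muon
    # implementation) pushing the singular values of M toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    # Muon (simplified): momentum on the raw gradient, then a matrix-level,
    # orthogonalized update. Shape-dependent scaling is omitted here.
    momentum_buf.mul_(beta).add_(grad)
    W.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)

def adam_step(W, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: per-coordinate normalization by a second-moment estimate.
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    W.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```

The structural difference is the point: Adam's step is shaped by per-coordinate gradient statistics, whereas Muon's step equalizes the singular values of a whole matrix, which is one way the two optimizers can imprint different implicit biases on the weights they train.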
If this is right
- The size of the performance gap between Muon and Adam in full fine-tuning increases as update strength grows.
- Lower LoRA ranks, which more tightly constrain updates, further reduce the observed mismatch (see the rank-constraint sketch after this list).
- Greater catastrophic forgetting occurs under stronger updates when the optimizers are switched.
- Other low-rank or constrained-update methods produce similar reductions in mismatch severity.
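On the LoRA-rank point above, the constraint is easy to state: a LoRA update takes the form ΔW = (α/r)·B·A with B of shape d×r and A of shape r×k, so whatever values the factors learn, the change to the pretrained weight has rank at most r. The sketch below illustrates this with random placeholder factors and assumed shapes; it is not the paper's configuration.

```python
import torch

d, k, alpha = 512, 512, 16                  # assumed shapes and LoRA scaling
full_ft_update = 0.01 * torch.randn(d, k)   # unconstrained full fine-tuning step
print("full FT update rank:", torch.linalg.matrix_rank(full_ft_update).item())

for r in (4, 16, 64):
    B = torch.randn(d, r)                   # hypothetical learned LoRA factors
    A = torch.randn(r, k)
    delta_W = (alpha / r) * B @ A           # effective change applied to the pretrained weight
    print(f"LoRA r={r:3d}: update rank = "
          f"{torch.linalg.matrix_rank(delta_W).item()} (<= r)")
```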
Where Pith is reading between the lines
- Fine-tuning pipelines may need to prioritize optimizer compatibility with pretraining to best preserve learned features.
- Update-constraining techniques could serve as a general tool for making different optimizers interchangeable in transfer settings.
- Design of future optimizers might incorporate explicit control over implicit bias to improve compatibility with existing pretrained checkpoints.
Load-bearing premise
That the performance degradation arises specifically from Muon's implicit bias disrupting pretrained knowledge, rather than from hyperparameter choices or unrelated factors, and that LoRA addresses the root cause by limiting update strength.
What would settle it
A controlled experiment in which Muon and Adam are given identical effective update magnitudes during full fine-tuning yet still produce a performance gap would falsify the claim that update strength is the mediating factor.
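One hedged way such a control could be implemented, not something the paper reports: at every step, rescale the Muon update so its Frobenius norm matches the Adam update computed at the same point, so that any residual gap cannot be blamed on update magnitude. The tensors below are placeholders standing in for real per-step optimizer updates.

```python
import torch

def match_update_magnitude(update, reference):
    # Rescale `update` so its Frobenius norm equals that of `reference`.
    return update * (reference.norm() / (update.norm() + 1e-12))

# Placeholder tensors standing in for one step's per-matrix optimizer updates.
adam_update = 1e-3 * torch.randn(256, 256)
muon_update = 2e-2 * torch.randn(256, 256)

controlled = match_update_magnitude(muon_update, adam_update)
assert torch.isclose(controlled.norm(), adam_update.norm())
# Applying `controlled` in place of the raw Muon update equalizes magnitudes;
# any gap that survives cannot be explained by update strength alone.
```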
Original abstract
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that naively switching from Adam to Muon for fine-tuning Adam-pretrained models causes performance degradation due to an optimizer mismatch arising from their distinct implicit biases. This mismatch is said to disrupt pretrained knowledge in a manner that scales with update strength. The authors hypothesize that constraining updates mitigates the issue and validate this via LoRA, which reduces the Adam-Muon performance gap relative to full fine-tuning across language and vision tasks. Additional studies on LoRA rank, catastrophic forgetting, and LoRA variants are presented to confirm the correlation with update strength. Reproducible code is released.
Significance. If the results hold, the work provides practical guidance on applying Muon to Adam-pretrained models and illuminates how optimizer implicit biases interact with fine-tuning. The open code is a clear strength, enabling verification of the controlled experiments and extension to new tasks.
major comments (2)
- §3 (Experimental Setup): The description of 'naively switching' to Muon does not specify whether Muon hyperparameters (learning rate, momentum coefficients, weight decay) received independent optimization equivalent to Adam on the same tasks and data. Without this, the full fine-tuning gap cannot be confidently attributed to implicit-bias mismatch rather than suboptimal Muon tuning, which directly undermines the causal claim and the interpretation that LoRA mitigates mismatch by constraining updates.
- §4.2 and §5 (LoRA and Update Strength Analysis): The scaling of degradation with update strength is central to the hypothesis, yet the manuscript provides no explicit definition or measurement of update strength (e.g., update norm, effective step size, or gradient statistics) that is compared quantitatively between full fine-tuning and LoRA settings. This leaves the mediator role of update strength correlational rather than demonstrated.
minor comments (2)
- Abstract: The mention of 'studies on LoRA rank, catastrophic forgetting, and LoRA variants' would benefit from naming the specific tasks, datasets, and metrics used in those studies for immediate clarity.
- Tables/Figures: Include statistical details (e.g., standard deviations over multiple runs or significance tests) when reporting performance gaps to strengthen the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive suggestions. We address the major comments point by point below, and we will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: §3 (Experimental Setup): The description of 'naively switching' to Muon does not specify whether Muon hyperparameters (learning rate, momentum coefficients, weight decay) received independent optimization equivalent to Adam on the same tasks and data. Without this, the full fine-tuning gap cannot be confidently attributed to implicit-bias mismatch rather than suboptimal Muon tuning, which directly undermines the causal claim and the interpretation that LoRA mitigates mismatch by constraining updates.
  Authors: We appreciate this observation. Upon review, the original manuscript did not provide sufficient detail on the hyperparameter tuning procedure for Muon. In the revised version, we will expand §3 to explicitly describe the independent hyperparameter optimization performed for Muon, including the grid search over learning rates, momentum coefficients, and weight decay values, conducted equivalently to Adam on the same tasks and datasets (a schematic of such a sweep appears after these responses). This clarification will support the attribution of the performance gap to the optimizer mismatch arising from implicit biases. [Revision: yes]
- Referee: §4.2 and §5 (LoRA and Update Strength Analysis): The scaling of degradation with update strength is central to the hypothesis, yet the manuscript provides no explicit definition or measurement of update strength (e.g., update norm, effective step size, or gradient statistics) that is compared quantitatively between full fine-tuning and LoRA settings. This leaves the mediator role of update strength correlational rather than demonstrated.
  Authors: We agree that a more rigorous quantification of update strength would better substantiate the hypothesis. In the revised manuscript, we will introduce an explicit definition of update strength, measured as the L2 norm of the parameter updates averaged over training steps (sketched after these responses). We will include quantitative comparisons of these update norms between full fine-tuning and various LoRA configurations, along with additional plots correlating update strength with performance degradation. This will provide stronger evidence for the mediating role of update strength. [Revision: yes]
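For the first response, a schematic of what 'independent optimization equivalent to Adam' could look like in practice; the grids, names, and the train_and_eval callable are placeholder assumptions rather than the authors' actual protocol.

```python
import itertools

# Placeholder grids: each optimizer is tuned independently over a grid of the
# same size, on the same task and data (values are illustrative, not the paper's).
grids = {
    "adam": {"lr": [1e-5, 3e-5, 1e-4], "beta1": [0.9, 0.95], "weight_decay": [0.0, 0.01, 0.1]},
    "muon": {"lr": [3e-4, 1e-3, 3e-3], "momentum": [0.9, 0.95], "weight_decay": [0.0, 0.01, 0.1]},
}

def sweep(optimizer_name, grid, train_and_eval):
    # `train_and_eval(optimizer_name, config) -> validation score` is a
    # user-supplied callable; higher scores are assumed to be better.
    best = None
    for values in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = train_and_eval(optimizer_name, config)
        if best is None or score > best[0]:
            best = (score, config)
    return best  # (best score, best config) for this optimizer
```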
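For the second response, a minimal sketch of the proposed measurement under assumed names: average the L2 norm of the per-step parameter change over training, computed identically for full fine-tuning and for LoRA (where the relevant quantity is the effective ΔW = (α/r)·B·A reconstructed at each step).

```python
import torch

def average_update_strength(weight_snapshots):
    # Mean L2 norm of per-step parameter changes, given tensors captured once per
    # optimizer step: the weight itself for full fine-tuning, or the effective
    # delta_W = (alpha / r) * B @ A reconstructed at each step for LoRA.
    norms = [
        (later - earlier).norm().item()
        for earlier, later in zip(weight_snapshots[:-1], weight_snapshots[1:])
    ]
    return sum(norms) / len(norms)

# Toy usage with synthetic snapshots standing in for a real training trajectory.
snapshots = [torch.zeros(64, 64)]
for _ in range(10):
    snapshots.append(snapshots[-1] + 0.01 * torch.randn(64, 64))
print("average update strength:", average_update_strength(snapshots))
```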
Circularity Check
Empirical study with no derivation chain or self-referential reductions
Full rationale
The paper is a controlled empirical study comparing Adam and Muon fine-tuning, with claims supported by experiments on performance gaps, LoRA mitigation, and correlations with update strength. No equations, fitted parameters, or mathematical derivations are present that could reduce predictions or hypotheses to inputs by construction. Self-citations are absent from the provided text, and the central claims rely on observable experimental outcomes rather than any load-bearing self-referential logic. Hyperparameter concerns raised by the skeptic are valid experimental-design questions but do not constitute circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Adam and Muon have distinct implicit biases that affect how they update parameters and preserve pretrained knowledge.