arxiv: 2605.17806 · v1 · pith:UNKE24UInew · submitted 2026-05-18 · 💻 cs.LG

AMO: Adaptive Muon Orthogonalization

Pith reviewed 2026-05-20 13:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords Muon optimizer · Newton-Schulz iteration · adaptive orthogonalization · matrix geometry · large language model pre-training · optimizer schedule · continual pre-training

The pith

AMO improves Muon by adapting Newton-Schulz iteration counts to each matrix's measured geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Muon applies Newton-Schulz iterations to orthogonalize the update of every weight matrix during pre-training, yet current versions use the same iteration count for all matrices. Measurements show that the geometric properties of these matrices differ by operator type, depth, and training stage, so a uniform count leaves some matrices under- or over-orthogonalized. AMO therefore runs a short early observation phase that records geometry signals for each operator type and then locks in a tailored iteration budget for the rest of training. The resulting models reach higher average scores on downstream tasks than any fixed-schedule Muon baseline. Gains appear in standard, long, and continual pre-training runs on both Llama and Qwen families.
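To make the link between conditioning and iteration count concrete, here is a minimal sketch of how a Newton-Schulz iteration looks from the point of view of individual singular values. It assumes the classical cubic iteration X ← 1.5X − 0.5XXᵀX, not the tuned polynomial coefficients used in Muon implementations, and the function names `ns_scalar_step` and `steps_to_orthogonalize` are hypothetical. Under that iteration each singular value s is mapped to 1.5s − 0.5s³, so small singular values grow by roughly 1.5× per step and ill-conditioned inputs need more steps.

```python
import math


def ns_scalar_step(s):
    """One cubic Newton-Schulz step as seen by a single singular value."""
    return 1.5 * s - 0.5 * s ** 3


def steps_to_orthogonalize(singular_values, tol=1e-3, max_steps=100):
    """Count cubic Newton-Schulz steps until every singular value is within tol of 1.

    Illustrative assumption: the input is first divided by its Frobenius norm,
    which places all singular values in (0, 1], where the map converges to 1.
    """
    norm = math.sqrt(sum(s * s for s in singular_values))
    svals = [s / norm for s in singular_values]
    for step in range(max_steps + 1):
        if all(abs(s - 1.0) <= tol for s in svals):
            return step
        svals = [ns_scalar_step(s) for s in svals]
    # A zero (or extremely small) singular value never reaches 1 within the cap.
    raise RuntimeError("did not converge within max_steps")
```

Calling `steps_to_orthogonalize` on a tightly clustered spectrum and on a widely spread one shows the second needing noticeably more steps, which is the kind of heterogeneity a uniform schedule ignores.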

Core claim

An observe-then-commit procedure that measures weight geometry by operator type early in training and then allocates a fixed Newton-Schulz budget per operator for the remainder of training produces higher-quality models than any uniform Newton-Schulz schedule.

What carries the argument

The observe-then-commit method that converts early measurements of matrix geometry into per-operator Newton-Schulz iteration budgets.
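The observe-then-commit step can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the geometry signal is taken to be the condition number κ of each observed matrix, the per-operator summary is the median, and the thresholds and step counts in `thresholds` and `max_steps` are hypothetical placeholders (the summary above does not give the actual mapping).

```python
from collections import defaultdict
from statistics import median


def commit_budgets(observations, thresholds=((1e2, 3), (1e4, 5)), max_steps=7):
    """Turn early geometry observations into a fixed per-operator NS budget.

    observations: iterable of (operator_type, condition_number) pairs collected
                  during the observation window.
    thresholds:   hypothetical (upper_bound_on_kappa, steps) pairs, ascending.
    max_steps:    hypothetical budget for operators above every threshold.
    """
    by_operator = defaultdict(list)
    for operator_type, kappa in observations:
        by_operator[operator_type].append(kappa)

    budgets = {}
    for operator_type, kappas in by_operator.items():
        typical = median(kappas)
        # Pick the first bucket whose bound exceeds the typical kappa.
        budgets[operator_type] = next(
            (steps for bound, steps in thresholds if typical < bound),
            max_steps,
        )
    return budgets
```

Once `commit_budgets` returns, the dictionary is treated as frozen for the remainder of training, which is what makes the stability premise discussed later load-bearing.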

If this is right

  • Consistent gains hold across standard, prolonged, and continual pre-training.
  • Average downstream performance over 12 tasks rises by 0.76 points on Llama3.1-1.4B and 0.51 points on Qwen3-1.7B relative to the strongest baseline.
  • Uneven orthogonalization quality across operator types is reduced.
  • Only a short initial measurement phase is needed before the schedule is fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar early-signal commitment could be tested on other iterative steps inside optimizers, such as momentum or second-moment estimates.
  • If geometry drifts after the early window, a small number of refresh points might restore most of the benefit without full re-optimization.
  • Architectures that differ sharply from the tested Llama and Qwen families may need their own short geometry survey before budgets are set.

Load-bearing premise

Matrix geometry signals measured early in training remain stable enough to justify committing to a fixed per-operator Newton-Schulz budget for the rest of training.

What would settle it

Launch two otherwise identical pre-training runs: one that locks Newton-Schulz budgets after the first 1 percent of steps and one that re-measures geometry and adjusts budgets at several later checkpoints. If the locked early-commit version loses its reported gains, the stability premise is false.
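A cheaper first check, before launching the second run, is to ask whether the ordering of operator types by difficulty at a later checkpoint still matches the ordering seen in the early window. The sketch below computes a Spearman rank correlation between two per-operator signal dictionaries. It is a hypothetical helper, it assumes no tied values, and what counts as "stable enough" is left to the reader.

```python
def _ranks(values):
    """Rank positions (0 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def ordering_stability(early, later):
    """Spearman rank correlation between two {operator_type: signal} dicts.

    Returns 1.0 when the difficulty ordering is unchanged and -1.0 when it is
    fully reversed. Only operators present in both dicts are compared.
    """
    operators = sorted(set(early) & set(later))
    n = len(operators)
    if n < 2:
        raise ValueError("need at least two shared operator types")
    early_ranks = _ranks([early[op] for op in operators])
    later_ranks = _ranks([later[op] for op in operators])
    d_squared = sum((a - b) ** 2 for a, b in zip(early_ranks, later_ranks))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

A correlation that stays near 1.0 across checkpoints would be indirect support for committing early; a drop would point to the checkpoints where a refresh is worth testing.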

Figures

Figures reproduced from arXiv: 2605.17806 by Haibo Zhang, Imran Razzak, Jiangming Shi, Panyi Ouyang, Shuman Liu, Weiyang Liu, Xinlin Zhuang, Yichen Li, Ying Qian, Yizhang Chen.

Figure 1: Results of AMO and baseline optimizers on …
Figure 2: Average downstream accuracy under three data orderings, across 16 configurations. The …
Figure 3: Four geometry findings on Qwen3-0.6B. κ is displayed at log10 scale. (a) NS-input conditioning κ(M_t) vs shape-normalized residual ε̃_NS (Pearson's r = 0.729). Colors indicate layer depth, so each point reflects the geometry of one parameter matrix at one training step grouped by layer. (b) Median κ trajectories per operator type: attn_v dominates the hardest regime, mlp_down the easiest. (c) Layer-wise medi…
Figure 4: Downstream results of four ablations on pre-training Qwen3-0.6B. Each subplot isolates …
Figure 5: Downstream results of Llama / Qwen models pre-trained …
Figure 6: Training loss curve of AMO and Muon in pre-training Qwen3-1.7B.
Figure 7: Condition number vs NS residual results for Llama3.1-family models. Colors indicate …
Figure 8: Evolution of median condition number by operator type across training for Llama3.1-family …
Figure 9: Layer-wise profiles of median condition number at representative training snapshots for …
Figure 10: Shallow vs deep layer training trajectories of median condition number for two repre…
Figure 11: Condition number vs NS residual results for Qwen3-family models. Colors indicate layer …
Figure 12: Evolution of median condition number by operator type across training for Qwen3-family …
Figure 13: Layer-wise profiles of median condition number at representative training snapshots for …
Figure 14: Shallow vs deep layer training trajectories of median condition number for two representa…
Original abstract

Muon has recently emerged as a competitive alternative to AdamW for large-scale pre-training, with orthogonalization via Newton-Schulz (NS) iterations as its core operation. Existing Muon variants apply a uniform NS schedule to all parameter matrices, overlooking possible differences in orthogonalization difficulty and its impact on performance. Through a systematic empirical study, we show that this per-matrix heterogeneity is pervasive and largely determined by matrix geometry, which evolves dynamically across operator types, training stages, and network depths. As a result, uniform NS schedules can lead to uneven orthogonalization quality across the model. Motivated by these findings, we propose Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for the remainder of training. AMO delivers consistent improvements over uniform-schedule Muon across standard, prolonged, and continual pre-training, surpassing the strongest baseline by +0.76 on Llama3.1-1.4B and +0.51 on Qwen3-1.7B in average downstream performance of 12 evaluation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that uniform Newton-Schulz (NS) iteration schedules in Muon overlook per-matrix heterogeneity in orthogonalization difficulty, which is driven by matrix geometry that varies across operator types, network depths, and training stages. Motivated by empirical observations of this heterogeneity, the authors propose AMO, an observe-then-commit method that measures geometry signals early in training and then fixes per-operator NS budgets for the remainder of training. They report that AMO yields consistent gains over uniform-schedule Muon baselines across standard, prolonged, and continual pre-training, surpassing the strongest baseline by +0.76 on Llama3.1-1.4B and +0.51 on Qwen3-1.7B in average performance over 12 downstream tasks.

Significance. If the gains prove robust, AMO offers a low-overhead way to improve Muon by tailoring orthogonalization effort to observed matrix conditioning differences, which could influence practical large-scale pre-training. The systematic empirical documentation of geometry evolution across stages and depths is a clear strength and provides actionable insight for optimizer design.

major comments (2)
  1. [Section 3] Section 3 (AMO design and observe-then-commit): The method commits to fixed per-operator NS budgets after a single early observation window, yet the manuscript itself states that geometry evolves dynamically across training stages. No experiments or monitoring are reported to verify that the early ordering of operators by difficulty (e.g., singular-value spread or conditioning) remains stable or that the chosen budgets stay near-optimal later in training. This stability assumption is load-bearing for the claim that the reported gains are generally reliable rather than specific to the chosen early measurement point.
  2. [Experimental results] Experimental results (gains on Llama3.1-1.4B and Qwen3-1.7B): The reported average improvements (+0.76 and +0.51) lack accompanying details on the number of independent runs, standard deviations, or statistical significance tests. Without these, it is difficult to determine whether the gains are robust across the three pre-training regimes or could be influenced by run-to-run variance.
minor comments (2)
  1. [Section 3] The geometry metric used to map early signals to NS-budget thresholds should be defined more explicitly (e.g., exact formula or pseudocode) to support reproducibility.
  2. [Figures] Figure captions and axis labels in the geometry-evolution plots could more clearly indicate the training stage at which measurements were taken.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our observe-then-commit design and strengthen the reporting of experimental robustness. We address each major comment below.

  1. Referee: [Section 3] Section 3 (AMO design and observe-then-commit): The method commits to fixed per-operator NS budgets after a single early observation window, yet the manuscript itself states that geometry evolves dynamically across training stages. No experiments or monitoring are reported to verify that the early ordering of operators by difficulty (e.g., singular-value spread or conditioning) remains stable or that the chosen budgets stay near-optimal later in training. This stability assumption is load-bearing for the claim that the reported gains are generally reliable rather than specific to the chosen early measurement point.

    Authors: We agree that the manuscript explicitly notes dynamic evolution of geometry and that no dedicated monitoring of ordering stability across later stages is currently reported. The observe-then-commit design is motivated by the fact that the dominant heterogeneity is across operator types and appears largely fixed after the initial window in our measurements; the consistent gains across standard, prolonged, and continual pre-training provide indirect empirical support. To directly address the concern, we will add experiments in the revision that track singular-value spreads and relative difficulty orderings at multiple subsequent training checkpoints. revision: yes

  2. Referee: [Experimental results] Experimental results (gains on Llama3.1-1.4B and Qwen3-1.7B): The reported average improvements (+0.76 and +0.51) lack accompanying details on the number of independent runs, standard deviations, or statistical significance tests. Without these, it is difficult to determine whether the gains are robust across the three pre-training regimes or could be influenced by run-to-run variance.

    Authors: We agree that variance and significance information should be reported to allow readers to assess robustness. The reported averages are computed over three independent runs per configuration (different random seeds), with gains observed consistently across the three pre-training regimes. In the revised manuscript we will include standard deviations alongside the averages and add paired statistical significance tests for the key improvements. revision: yes
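The paired test promised in the second response could take the following shape. This is a minimal sketch using only the standard library: it assumes scores are paired by random seed, computes the paired t statistic and its degrees of freedom, and leaves the p-value lookup to a statistics table or package. The function name `paired_t_statistic` is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(treatment_scores, baseline_scores):
    """Paired t statistic for per-seed score differences.

    Both sequences must be aligned so that index i comes from the same seed.
    Returns (t, degrees_of_freedom).
    """
    if len(treatment_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(treatment_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1
```

With three seeds per configuration the test has only two degrees of freedom, so a gain must be large relative to its seed-to-seed spread before it clears a conventional significance threshold.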

Circularity Check

0 steps flagged

No circularity: AMO allocation derives directly from early empirical measurements

The paper defines AMO as an observe-then-commit procedure that records per-operator matrix geometry signals during an initial training window and then fixes the Newton-Schulz iteration counts for the remaining steps. This rule is constructed from direct measurements rather than from any equation, fitted parameter, or self-referential definition that would make the final performance claim equivalent to its own inputs. No self-citation load-bearing steps, uniqueness theorems imported from prior author work, or ansatz smuggling appear in the described chain. The reported gains are presented as empirical results on downstream tasks, not as mathematical necessities forced by the method's own formulation. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about geometry stability and on two free parameters that control when and how the early measurements are turned into iteration budgets; no new physical or mathematical entities are introduced.

free parameters (2)
  • Early observation window length
    Determines how many initial steps are used to measure matrix geometry before committing to the schedule.
  • Geometry-to-NS-budget mapping thresholds
    Decide how many Newton-Schulz iterations each operator type receives based on the measured signals.
axioms (2)
  • domain assumption: Per-matrix orthogonalization difficulty is largely determined by matrix geometry
    Invoked to justify measuring geometry as a sufficient signal for adaptive allocation.
  • domain assumption: Matrix geometry remains stable enough after the early phase to support a fixed schedule
    Underpins the observe-then-commit design described in the abstract.

pith-pipeline@v0.9.0 · 5758 in / 1480 out tokens · 56159 ms · 2026-05-20T13:06:30.742117+00:00 · methodology



Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages

  1. [1]

    Dion: Distributed orthonormalized updates, 2025

    Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates, 2025. 20

  2. [2]

    Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chalu- varaju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, and Ashi...

  3. [3]

    Llama 3 model card, 2024

    AI@Meta. Llama 3 model card, 2024. 6, 31

  4. [4]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapor...

  5. [5]

    Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. InThe Fourteenth International Conference on Learning Representations, 2026. 1, 2, 4, 6, 9, 16, 18, 20, 33, 37

  6. [6]

    ASGO: Adaptive structured gradient optimization

    Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 20

  7. [7]

    Perplexed by perplexity: Perplexity-based data pruning with small reference models

    Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L Leavitt, and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

  8. [8]

    Efficient pretraining data selection for language models via multi-actor collaboration

    Tianyi Bai, Ling Yang, Zhen Hao Wong, Fupeng Sun, Xinlin Zhuang, Jiahui Peng, Chi Zhang, Lijun Wu, Qiu Jiantao, Wentao Zhang, Binhang Yuan, and Conghui He. Efficient pretraining data selection for language models via multi-actor collaboration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

  9. [9]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. 6, 33

  10. [10]

    Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning, 2025

    Thibaut Boissin, Thomas Massena, Franck Mamalet, and Mathieu Serrurier. Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning, 2025. 2, 6, 20, 33

  11. [11]

    Calian, Gregory Farquhar, Iurii Kemaev, Luisa Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeff Dean, Hado van Hasselt, and David Silver

    Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeff Dean, Hado van Hasselt, and David Silver. Datarater: Meta-learned dataset curation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 21

  12. [12]

    Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. 6, 33

  13. [13]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InThe Twelfth International Conference on Learning Representations, 2024. 6, 32

  14. [14]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. 1, 9

  15. [15]

    DsDm: Model-aware dataset selection with datamodels

    Logan Engstrom, Axel Feldmann, and Aleksander Madry. DsDm: Model-aware dataset selection with datamodels. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Re...

  16. [16]

    What really matters in matrix-whitening optimizers?, 2025

    Kevin Frans, Pieter Abbeel, and Sergey Levine. What really matters in matrix-whitening optimizers?, 2025. 1, 20

  17. [17]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  18. [18]

    Glm-5: from vibe coding to agentic engineering, 2026

    GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. 20

  19. [19]

    Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2026

    Ekaterina Grishina, Matvey Smirnov, and Maxim Rakhuba. Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2026. 1, 2, 4, 6, 9, 18, 20, 33, 37

  20. [20]

    Drop-muon: Update less, converge faster, 2025

    Kaja Gruntkowska, Yassine Maziane, Zheng Qu, and Peter Richtárik. Drop-muon: Update less, converge faster, 2025. 20

  21. [21]

    Data selection via optimal control for language models

    Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data selection via optimal control for language models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

  22. [22]

    Root: Robust orthogonalized optimizer for neural network training, 2025

    Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. Root: Robust orthogonalized optimizer for neural network training, 2025. 1, 2, 4, 6, 20, 33

  23. [23]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. 6, 33

  24. [24]

    Rae, Oriol Vinyals, and Laurent Sifre

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  25. [25]

    Datamodels: Predicting predictions from training data, 2022

    Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Predicting predictions from training data, 2022. 21

  26. [26]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. 1, 4, 6, 20, 33

  27. [27]

    Muonbp: Faster muon via block-periodic orthogonalization, 2025

    Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, and Youngsuk Park. Muonbp: Faster muon via block-periodic orthogonalization, 2025. 20

  28. [28]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. 19

  29. [29]

    RACE: Large-scale ReAding comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 6, 33

  30. [30]

    Normuon: Making muon more efficient and scalable, 2025

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025. 1, 20

  31. [31]

    Sophia: A scalable stochastic second-order optimizer for language model pre-training

    Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. InThe Twelfth International Conference on Learning Representations, 2024. 20

  32. [32]

    Muon is scalable for llm training, 2025

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

  33. [33]

    Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms,

    Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms,

  34. [34]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InThe Seventh International Conference on Learning Representations, 2019. 19

  35. [35]

    How learning rate decay wastes your best data in curriculum-based LLM pretraining

    Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, and Wenguang Chen. How learning rate decay wastes your best data in curriculum-based LLM pretraining. InThe Fourteenth International Conference on Learning Representations, 2026. 2

  36. [36]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics. 6, 33

  37. [37]

    Prodigy: An expeditiously adaptive parameter-free learner

    Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 35779–35804. PMLR, 2024. 20

  38. [38]

    Antonio Orvieto and Robert M. Gower. In search of adam’s secret sauce. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 20

  39. [39]

    The fineweb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. InAdvances in Neural Information Processing Systems, volume 37, pages 30811–30849, 2024. 3, 6, 31

  40. [40]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. 6, 33

  41. [41]

    Glu variants improve transformer, 2020

    Noam Shazeer. Glu variants improve transformer, 2020. 6, 31

  42. [42]

    Predictive data selection: The data that predicts is the data that teaches

    KaShun SHUM, Yuzhen Huang, Hongjian Zou, dingqi, YiXuan Liao, Xiaoxin Chen, Qian Liu, and Junxian He. Predictive data selection: The data that predicts is the data that teaches. In Forty-second International Conference on Machine Learning, 2025. 21

  43. [43]

    Adamuon: Adaptive muon optimizer, 2025

    Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer, 2025. 1, 6, 20, 33

  44. [44]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis...

  45. [45]

    ADOPT: Modified adam can converge with any $\beta_2$ with the optimal rate

    Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seongcheol Jeong, Go Na- gahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. ADOPT: Modified adam can converge with any $\beta_2$ with the optimal rate. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 20

  46. [46]

    Kimi k2: Open agentic intelligence, 2026

    Kimi Team. Kimi k2: Open agentic intelligence, 2026. 1, 20

  47. [47]

    Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. D4: Improving LLM pretraining via document de-duplication and diversification. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 21

  48. [48]

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 20 12

  49. [49]

    Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y . F. Tan. Muon outperforms adam in tail-end associative memory learning. InThe Fourteenth International Conference on Learning Representations,

  50. [50]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J...

  51. [51]

    Liu, and Matt Gardner

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. InProceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 6, 33

  52. [52]

    Fantastic pretraining optimizers and where to find them, 2025

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them, 2025. 20

  53. [53]

    Qurating: Selecting high- quality data for training language models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. Qurating: Selecting high- quality data for training language models. InForty-first International Conference on Machine Learning, 2024. 21

  54. [54]

    Organize the web: Constructing domains enhances pre-training data curation

    Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, and Luca Soldaini. Organize the web: Constructing domains enhances pre-training data curation. InForty-second International Conference on Machine Learning, 2025. 21

  55. [55]

    Data selection for language models via importance resampling

    Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

  56. [56]

    Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models

    Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024.

  57. [57]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  58. [58]

    Prism: Distribution-free adaptive computation of matrix functions for accelerating neural network training

    Shenghao Yang, Zhichao Wang, Oleg Balabanov, N. Benjamin Erichson, and Michael W. Mahoney. Prism: Distribution-free adaptive computation of matrix functions for accelerating neural network training, 2026.

  59. [59]

    MATES: Model-aware data selection for efficient pretraining with data influence models

    Zichun Yu, Spandan Das, and Chenyan Xiong. MATES: Model-aware data selection for efficient pretraining with data influence models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  60. [60]

    MARS: Unleashing the power of variance reduction for training large models

    Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. In Forty-second International Conference on Machine Learning, 2025.

  61. [61]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguis...

  62. [62]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, 2019.

  63. [63]

    Harnessing diversity for important data selection in pretraining large language models

    Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Qiu Jiantao, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, and Conghui He. Harnessing diversity for important data selection in pretraining large language models. In The Thirteenth International Conference on Learning Representations, 2025.

  64. [64]

    Gram newton-schulz

    Jack Zhang, Noah Amsel, Berlin Chen, and Tri Dao. Gram newton-schulz, 2026.

  65. [65]

    Adagrad meets muon: Adaptive stepsizes for orthogonal updates

    Minxin Zhang, Yuxuan Liu, and Hayden Schaeffer. Adagrad meets muon: Adaptive stepsizes for orthogonal updates, 2025.

  66. [66]

    Why transformers need adam: A hessian perspective

    Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. Why transformers need adam: A hessian perspective. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  67. [67]

    Adam-mini: Use fewer learning rates to gain more

    Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. In The Thirteenth International Conference on Learning Representations, 2025.

  68. [68]

    Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. Deconstructing what makes a good optimizer for autoregressive language models. In The Thirteenth International Conference on Learning Representations, 2025.

  69. [69]

    AGIEval: A human-centric benchmark for evaluating foundation models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, June 2024. Association for Computational Linguistics.

  71. [71]

    How to set the learning rate for large-scale pre-training?

    Yunhua Zhou, Shuhao Xing, Junhao Huang, Xipeng Qiu, and Qipeng Guo. How to set the learning rate for large-scale pre-training?, 2026.

  72. [72]

    Meta-rater: A multi-dimensional data selection method for pre-training language models

    Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Qiu Jiantao, Chi Zhang, Ying Qian, and Conghui He. Meta-rater: A multi-dimensional data selection method for pre-training language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computationa...

  73. [73]

    Appendix Table of Contents

    A Details of AMO; A.1 Algorithm; A.2 Optimality Analysis; B Related Work; C Pilot Experiments; C.1 Data Curriculum; ...

  74. [74]

    Muon optimizer. Muon [26] is a recently proposed optimizer designed for the hidden-layer parameters of neural networks

    takes a complementary approach by eliminating the need for learning rate tuning altogether through a D-adaptation mechanism that automatically estimates the optimal step size during training. Muon optimizer. Muon [26] is a recently proposed optimizer designed for the hidden-layer parameters of neural networks. It orthogonalizes the momentum buffer via N...

  75. [75]

    Complementing instance-level methods, several approaches address corpus-level diversity

    estimates each example’s marginal contribution via datamodels [25], MATES [59] periodically updates lightweight influence models during pre-training, [21] dynamically adjusts the training distribution as a sequential decision problem, and [42] identifies the most instructive examples by their ability to predict learning outcomes. Complementing instance-l...