AMO: Adaptive Muon Orthogonalization

Haibo Zhang; Imran Razzak; Jiangming Shi; Panyi Ouyang; Shuman Liu; Weiyang Liu; Xinlin Zhuang; Yichen Li; Ying Qian; Yizhang Chen

arxiv: 2605.17806 · v1 · pith:UNKE24UInew · submitted 2026-05-18 · 💻 cs.LG

AMO: Adaptive Muon Orthogonalization

Xinlin Zhuang , Panyi Ouyang , Yichen Li , Jiangming Shi , Yizhang Chen , Shuman Liu , Ying Qian , Weiyang Liu

show 2 more authors

Haibo Zhang Imran Razzak

This is my paper

Pith reviewed 2026-05-20 13:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords Muon optimizerNewton-Schulz iterationadaptive orthogonalizationmatrix geometrylarge language model pre-trainingoptimizer schedulecontinual pre-training

0 comments

The pith

AMO improves Muon by adapting Newton-Schulz iteration counts to each matrix's measured geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Muon applies Newton-Schulz iterations to orthogonalize every weight matrix during pre-training, yet current versions use the same iteration count for all matrices. Measurements show that the geometric properties of these matrices differ by operator type, depth, and training stage, so a uniform count leaves some matrices under- or over-orthogonalized. AMO therefore runs a short early observation phase that records geometry signals for each operator type and then locks in a tailored iteration budget for the rest of training. The resulting models reach higher average scores on downstream tasks than any fixed-schedule Muon baseline. Gains appear in standard, long, and continual pre-training runs on both Llama and Qwen families.

Core claim

An observe-then-commit procedure that measures weight geometry by operator type early in training and then allocates a fixed Newton-Schulz budget per operator for the remainder of training produces higher-quality models than any uniform Newton-Schulz schedule.

What carries the argument

The observe-then-commit method that converts early measurements of matrix geometry into per-operator Newton-Schulz iteration budgets.

If this is right

Consistent gains hold across standard, prolonged, and continual pre-training.
Average downstream performance rises by 0.76 points on Llama3.1-1.4B and 0.51 points on Qwen3-1.7B across 12 tasks.
Uneven orthogonalization quality across operator types is reduced.
Only a short initial measurement phase is needed before the schedule is fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar early-signal commitment could be tested on other iterative steps inside optimizers, such as momentum or second-moment estimates.
If geometry drifts after the early window, a small number of refresh points might restore most of the benefit without full re-optimization.
Architectures that differ sharply from the tested Llama and Qwen families may need their own short geometry survey before budgets are set.

Load-bearing premise

Matrix geometry signals measured early in training remain stable enough to justify committing to a fixed per-operator Newton-Schulz budget for the rest of training.

What would settle it

Run two otherwise identical pre-training runs: one that locks Newton-Schulz budgets after the first 1 percent of steps and one that re-measures geometry and adjusts budgets at several later checkpoints; if the locked early-commit version loses its reported gains, the stability premise is false.

Figures

Figures reproduced from arXiv: 2605.17806 by Haibo Zhang, Imran Razzak, Jiangming Shi, Panyi Ouyang, Shuman Liu, Weiyang Liu, Xinlin Zhuang, Yichen Li, Ying Qian, Yizhang Chen.

**Figure 2.** Figure 2: Average downstream accuracy under three data orderings, across 16 configurations. The [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Four geometry findings on Qwen3-0.6B. κ is displayed at log10 scale. (a) NS-input conditioning κ(Mt) vs shape-normalized residual ϵ˜NS (Pearson’s r = 0.729). Colors indicate layer depth, so each point reflects the geometry of one parameter matrix at one training step grouped by layer. (b) Median κ trajectories per operator type: attn_v dominates the hardest regime, mlp_down the easiest. (c) Layer-wise medi… view at source ↗

**Figure 4.** Figure 4: Downstream results of four ablations on pre-training Qwen3-0.6B. Each subplot isolates [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Downstream results of Llama / Qwen models pre-trained [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Training loss curve of AMO and Muon in pre-training Qwen3-1.7B. [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Condition number vs NS residual results for Llama3.1-family models. Colors indicate [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Evolution of median condition number by operator type across training for Llama3.1-family [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

**Figure 9.** Figure 9: Layer-wise profiles of median condition number at representative training snapshots for [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Shallow vs deep layer training trajectories of median condition number for two repre [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: Condition number vs NS residual results for Qwen3-family models. Colors indicate layer [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: Evolution of median condition number by operator type across training for Qwen3-family [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Layer-wise profiles of median condition number at representative training snapshots for [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Shallow vs deep layer training trajectories of median condition number for two representa [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

read the original abstract

Muon has recently emerged as a competitive alternative to AdamW for large-scale pre-training, with orthogonalization via Newton-Schulz (NS) iterations as its core operation. Existing Muon variants apply a uniform NS schedule to all parameter matrices, overlooking possible differences in orthogonalization difficulty and its impact on performance. Through a systematic empirical study, we show that this per-matrix heterogeneity is pervasive and largely determined by matrix geometry, which evolves dynamically across operator types, training stages, and network depths. As a result, uniform NS schedules can lead to uneven orthogonalization quality across the model. Motivated by these findings, we propose Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for the remainder of training. AMO delivers consistent improvements over uniform-schedule Muon across standard, prolonged, and continual pre-training, surpassing the strongest baseline by +0.76 on Llama3.1-1.4B and +0.51 on Qwen3-1.7B in average downstream performance of 12 evaluation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that uniform Newton-Schulz (NS) iteration schedules in Muon overlook per-matrix heterogeneity in orthogonalization difficulty, which is driven by matrix geometry that varies across operator types, network depths, and training stages. Motivated by empirical observations of this heterogeneity, the authors propose AMO, an observe-then-commit method that measures geometry signals early in training and then fixes per-operator NS budgets for the remainder of training. They report that AMO yields consistent gains over uniform-schedule Muon baselines across standard, prolonged, and continual pre-training, surpassing the strongest baseline by +0.76 on Llama3.1-1.4B and +0.51 on Qwen3-1.7B in average performance over 12 downstream tasks.

Significance. If the gains prove robust, AMO offers a low-overhead way to improve Muon by tailoring orthogonalization effort to observed matrix conditioning differences, which could influence practical large-scale pre-training. The systematic empirical documentation of geometry evolution across stages and depths is a clear strength and provides actionable insight for optimizer design.

major comments (2)

[Section 3] Section 3 (AMO design and observe-then-commit): The method commits to fixed per-operator NS budgets after a single early observation window, yet the manuscript itself states that geometry evolves dynamically across training stages. No experiments or monitoring are reported to verify that the early ordering of operators by difficulty (e.g., singular-value spread or conditioning) remains stable or that the chosen budgets stay near-optimal later in training. This stability assumption is load-bearing for the claim that the reported gains are generally reliable rather than specific to the chosen early measurement point.
[Experimental results] Experimental results (gains on Llama3.1-1.4B and Qwen3-1.7B): The reported average improvements (+0.76 and +0.51) lack accompanying details on the number of independent runs, standard deviations, or statistical significance tests. Without these, it is difficult to determine whether the gains are robust across the three pre-training regimes or could be influenced by run-to-run variance.

minor comments (2)

[Section 3] The geometry metric used to map early signals to NS-budget thresholds should be defined more explicitly (e.g., exact formula or pseudocode) to support reproducibility.
[Figures] Figure captions and axis labels in the geometry-evolution plots could more clearly indicate the training stage at which measurements were taken.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our observe-then-commit design and strengthen the reporting of experimental robustness. We address each major comment below.

read point-by-point responses

Referee: [Section 3] Section 3 (AMO design and observe-then-commit): The method commits to fixed per-operator NS budgets after a single early observation window, yet the manuscript itself states that geometry evolves dynamically across training stages. No experiments or monitoring are reported to verify that the early ordering of operators by difficulty (e.g., singular-value spread or conditioning) remains stable or that the chosen budgets stay near-optimal later in training. This stability assumption is load-bearing for the claim that the reported gains are generally reliable rather than specific to the chosen early measurement point.

Authors: We agree that the manuscript explicitly notes dynamic evolution of geometry and that no dedicated monitoring of ordering stability across later stages is currently reported. The observe-then-commit design is motivated by the fact that the dominant heterogeneity is across operator types and appears largely fixed after the initial window in our measurements; the consistent gains across standard, prolonged, and continual pre-training provide indirect empirical support. To directly address the concern, we will add experiments in the revision that track singular-value spreads and relative difficulty orderings at multiple subsequent training checkpoints. revision: yes
Referee: [Experimental results] Experimental results (gains on Llama3.1-1.4B and Qwen3-1.7B): The reported average improvements (+0.76 and +0.51) lack accompanying details on the number of independent runs, standard deviations, or statistical significance tests. Without these, it is difficult to determine whether the gains are robust across the three pre-training regimes or could be influenced by run-to-run variance.

Authors: We agree that variance and significance information should be reported to allow readers to assess robustness. The reported averages are computed over three independent runs per configuration (different random seeds), with gains observed consistently across the three pre-training regimes. In the revised manuscript we will include standard deviations alongside the averages and add paired statistical significance tests for the key improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: AMO allocation derives directly from early empirical measurements

full rationale

The paper defines AMO as an observe-then-commit procedure that records per-operator matrix geometry signals during an initial training window and then fixes the Newton-Schulz iteration counts for the remaining steps. This rule is constructed from direct measurements rather than from any equation, fitted parameter, or self-referential definition that would make the final performance claim equivalent to its own inputs. No self-citation load-bearing steps, uniqueness theorems imported from prior author work, or ansatz smuggling appear in the described chain. The reported gains are presented as empirical results on downstream tasks, not as mathematical necessities forced by the method's own formulation. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about geometry stability and on two free parameters that control when and how the early measurements are turned into iteration budgets; no new physical or mathematical entities are introduced.

free parameters (2)

Early observation window length
Determines how many initial steps are used to measure matrix geometry before committing to the schedule.
Geometry-to-NS-budget mapping thresholds
Decide how many Newton-Schulz iterations each operator type receives based on the measured signals.

axioms (2)

domain assumption Per-matrix orthogonalization difficulty is largely determined by matrix geometry
Invoked to justify measuring geometry as a sufficient signal for adaptive allocation.
domain assumption Matrix geometry remains stable enough after the early phase to support a fixed schedule
Underpins the observe-then-commit design described in the abstract.

pith-pipeline@v0.9.0 · 5758 in / 1480 out tokens · 56159 ms · 2026-05-20T13:06:30.742117+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

AMO... measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for the remainder of training.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages

[1]

Dion: Distributed orthonormalized updates, 2025

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates, 2025. 20

work page 2025
[2]

Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chalu- varaju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, and Ashi...

work page 2025
[3]

Llama 3 model card, 2024

AI@Meta. Llama 3 model card, 2024. 6, 31

work page 2024
[4]

GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapor...

work page 2023
[5]

Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. InThe Fourteenth International Conference on Learning Representations, 2026. 1, 2, 4, 6, 9, 16, 18, 20, 33, 37

work page 2026
[6]

ASGO: Adaptive structured gradient optimization

Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 20

work page 2025
[7]

Perplexed by perplexity: Perplexity-based data pruning with small reference models

Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L Leavitt, and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

work page 2025
[8]

Efficient pretraining data selection for language models via multi-actor collaboration

Tianyi Bai, Ling Yang, Zhen Hao Wong, Fupeng Sun, Xinlin Zhuang, Jiahui Peng, Chi Zhang, Lijun Wu, Qiu Jiantao, Wentao Zhang, Binhang Yuan, and Conghui He. Efficient pretraining data selection for language models via multi-actor collaboration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

work page 2025
[9]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. 6, 33

work page 2020
[10]

Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning, 2025

Thibaut Boissin, Thomas Massena, Franck Mamalet, and Mathieu Serrurier. Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning, 2025. 2, 6, 20, 33

work page 2025
[11]

Calian, Gregory Farquhar, Iurii Kemaev, Luisa Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeff Dean, Hado van Hasselt, and David Silver

Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeff Dean, Hado van Hasselt, and David Silver. Datarater: Meta-learned dataset curation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 21

work page 2025
[12]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. 6, 33

work page 2018
[13]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InThe Twelfth International Conference on Learning Representations, 2024. 6, 32

work page 2024
[14]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. 1, 9

work page 2026
[15]

DsDm: Model-aware dataset selection with datamodels

Logan Engstrom, Axel Feldmann, and Aleksander Madry. DsDm: Model-aware dataset selection with datamodels. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Re...

work page 2024
[16]

What really matters in matrix-whitening optimizers?, 2025

Kevin Frans, Pieter Abbeel, and Sergey Levine. What really matters in matrix-whitening optimizers?, 2025. 1, 20

work page 2025
[17]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page 2024
[18]

Glm-5: from vibe coding to agentic engineering, 2026

GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. 20

work page 2026
[19]

Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2026

Ekaterina Grishina, Matvey Smirnov, and Maxim Rakhuba. Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2026. 1, 2, 4, 6, 9, 18, 20, 33, 37

work page 2026
[20]

Drop-muon: Update less, converge faster, 2025

Kaja Gruntkowska, Yassine Maziane, Zheng Qu, and Peter Richtárik. Drop-muon: Update less, converge faster, 2025. 20

work page 2025
[21]

Data selection via optimal control for language models

Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data selection via optimal control for language models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

work page 2025
[22]

Root: Robust orthogonalized optimizer for neural network training, 2025

Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. Root: Robust orthogonalized optimizer for neural network training, 2025. 1, 2, 4, 6, 20, 33

work page 2025
[23]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. 6, 33

work page 2021
[24]

Rae, Oriol Vinyals, and Laurent Sifre

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

work page 2022
[25]

Datamodels: Predicting predictions from training data, 2022

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Predicting predictions from training data, 2022. 21

work page 2022
[26]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. 1, 4, 6, 20, 33

work page 2024
[27]

Muonbp: Faster muon via block-periodic orthogonalization, 2025

Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, and Youngsuk Park. Muonbp: Faster muon via block-periodic orthogonalization, 2025. 20

work page 2025
[28]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. 19

work page 2017
[29]

RACE: Large-scale ReAding comprehension dataset from examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 6, 33

work page 2017
[30]

Normuon: Making muon more efficient and scalable, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025. 1, 20

work page 2025
[31]

Sophia: A scalable stochastic second-order optimizer for language model pre-training

Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. InThe Twelfth International Conference on Learning Representations, 2024. 20

work page 2024
[32]

Muon is scalable for llm training, 2025

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

work page 2025
[33]

Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms,

Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms,

work page
[34]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InThe Seventh International Conference on Learning Representations, 2019. 19

work page 2019
[35]

How learning rate decay wastes your best data in curriculum-based LLM pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, and Wenguang Chen. How learning rate decay wastes your best data in curriculum-based LLM pretraining. InThe Fourteenth International Conference on Learning Representations, 2026. 2

work page 2026
[36]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics. 6, 33

work page 2018
[37]

Prodigy: An expeditiously adaptive parameter-free learner

Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 35779–35804. PMLR, 2024. 20

work page 2024
[38]

Antonio Orvieto and Robert M. Gower. In search of adam’s secret sauce. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 20

work page 2025
[39]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. InAdvances in Neural Information Processing Systems, volume 37, pages 30811–30849, 2024. 3, 6, 31

work page 2024
[40]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. 6, 33

work page 2020
[41]

Glu variants improve transformer, 2020

Noam Shazeer. Glu variants improve transformer, 2020. 6, 31

work page 2020
[42]

Predictive data selection: The data that predicts is the data that teaches

KaShun SHUM, Yuzhen Huang, Hongjian Zou, dingqi, YiXuan Liao, Xiaoxin Chen, Qian Liu, and Junxian He. Predictive data selection: The data that predicts is the data that teaches. In Forty-second International Conference on Machine Learning, 2025. 21

work page 2025
[43]

Adamuon: Adaptive muon optimizer, 2025

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer, 2025. 1, 6, 20, 33

work page 2025
[44]

CommonsenseQA: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis...

work page 2019
[45]

ADOPT: Modified adam can converge with any $\beta_2$ with the optimal rate

Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seongcheol Jeong, Go Na- gahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. ADOPT: Modified adam can converge with any $\beta_2$ with the optimal rate. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 20

work page 2024
[46]

Kimi k2: Open agentic intelligence, 2026

Kimi Team. Kimi k2: Open agentic intelligence, 2026. 1, 20

work page 2026
[47]

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. D4: Improving LLM pretraining via document de-duplication and diversification. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 21

work page 2023
[48]

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 20 12

work page 2025
[49]

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y . F. Tan. Muon outperforms adam in tail-end associative memory learning. InThe Fourteenth International Conference on Learning Representations,

work page
[50]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J...

work page 2024
[51]

Liu, and Matt Gardner

Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. InProceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 6, 33

work page 2017
[52]

Fantastic pretraining optimizers and where to find them, 2025

Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them, 2025. 20

work page 2025
[53]

Qurating: Selecting high- quality data for training language models

Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. Qurating: Selecting high- quality data for training language models. InForty-first International Conference on Machine Learning, 2024. 21

work page 2024
[54]

Organize the web: Constructing domains enhances pre-training data curation

Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, and Luca Soldaini. Organize the web: Constructing domains enhances pre-training data curation. InForty-second International Conference on Machine Learning, 2025. 21

work page 2025
[55]

Data selection for language models via importance resampling

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. InThirty-seventh Conference on Neural Information Processing Systems, 2023. 21

work page 2023
[56]

Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024

Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024. 20

work page 2024
[57]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page 2025
[58]

Benjamin Erichson, and Michael W

Shenghao Yang, Zhichao Wang, Oleg Balabanov, N. Benjamin Erichson, and Michael W. Mahoney. Prism: Distribution-free adaptive computation of matrix functions for accelerating neural network training, 2026. 20

work page 2026
[59]

MATES: Model-aware data selection for efficient pretraining with data influence models

Zichun Yu, Spandan Das, and Chenyan Xiong. MATES: Model-aware data selection for efficient pretraining with data influence models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 21

work page 2024
[60]

MARS: Unleashing the power of variance reduction for training large models

Huizhuo Yuan, Yifeng Liu, Shuang Wu, zhou Xun, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. InForty-second International Conference on Machine Learning, 2025. 20

work page 2025
[61]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, edi- tors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguis...

work page 2019
[62]

Root mean square layer normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32, 2019. 6, 31

work page 2019
[63]

Harnessing diversity for important data selection in pretraining large language models

Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Qiu Jiantao, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, and Conghui He. Harnessing diversity for important data selection in pretraining large language models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

work page 2025
[64]

Gram newton-schulz, 2026

Jack Zhang, Noah Amsel, Berlin Chen, and Tri Dao. Gram newton-schulz, 2026. 1, 20

work page 2026
[65]

Adagrad meets muon: Adaptive stepsizes for orthogonal updates, 2025

Minxin Zhang, Yuxuan Liu, and Hayden Schaeffer. Adagrad meets muon: Adaptive stepsizes for orthogonal updates, 2025. 20

work page 2025
[66]

Why transformers need adam: A hessian perspective

Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. Why transformers need adam: A hessian perspective. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 19

work page 2024
[67]

Adam-mini: Use fewer learning rates to gain more

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. InThe Thirteenth International Conference on Learning Representations, 2025. 20

work page 2025
[68]

Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. Deconstructing what makes a good optimizer for autoregressive language models. InThe Thirteenth International Conference on Learning Representations, 2025. 20

work page 2025
[69]

AGIEval: A human-centric benchmark for evaluating foundation models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, June

work page 2024
[70]

Association for Computational Linguistics. 6, 33

work page
[71]

How to set the learning rate for large-scale pre-training?, 2026

Yunhua Zhou, Shuhao Xing, Junhao Huang, Xipeng Qiu, and Qipeng Guo. How to set the learning rate for large-scale pre-training?, 2026. 6, 21, 32

work page 2026
[72]

Meta-rater: A multi-dimensional data selection method for pre-training language models

Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Qiu Jiantao, Chi Zhang, Ying Qian, and Conghui He. Meta-rater: A multi-dimensional data selection method for pre-training language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computationa...

work page
[73]

16 A.2 Optimality Analysis

7, 21 14 Appendix Table of Contents A Details of AMO 16 A.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.2 Optimality Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B Related Work 19 C Pilot Experiments 21 C.1 Data Curriculum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page 1930
[74]

Muon optimizer.Muon [ 26] is a recently proposed optimizer designed for the hidden-layer param- eters of neural networks

takes a complementary approach by eliminating the need for learning rate tuning altogether through a D-adaptation mechanism that automatically estimates the optimal step size during training. Muon optimizer.Muon [ 26] is a recently proposed optimizer designed for the hidden-layer param- eters of neural networks. It orthogonalizes the momentum buffer via N...

work page
[75]

Complementing instance-level methods, several approaches addresscorpus-level diversity

estimates each example’s marginal contribution via datamodels [25], MATES [59] periodically updates lightweight influence models during pre-training, [ 21] dynamically adjusts the training distribution as a sequential decision problem, and [42] identifies the most instructive examples by their ability to predict learning outcomes. Complementing instance-l...

work page 2048

[1] [1]

Dion: Distributed orthonormalized updates, 2025

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates, 2025. 20

work page 2025

[2] [2]

Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chalu- varaju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, and Ashi...

work page 2025

[3] [3]

Llama 3 model card, 2024

AI@Meta. Llama 3 model card, 2024. 6, 31

work page 2024

[4] [4]

GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapor...

work page 2023

[5] [5]

Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. InThe Fourteenth International Conference on Learning Representations, 2026. 1, 2, 4, 6, 9, 16, 18, 20, 33, 37

work page 2026

[6] [6]

ASGO: Adaptive structured gradient optimization

Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 20

work page 2025

[7] [7]

Perplexed by perplexity: Perplexity-based data pruning with small reference models

Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L Leavitt, and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

work page 2025

[8] [8]

Efficient pretraining data selection for language models via multi-actor collaboration

Tianyi Bai, Ling Yang, Zhen Hao Wong, Fupeng Sun, Xinlin Zhuang, Jiahui Peng, Chi Zhang, Lijun Wu, Qiu Jiantao, Wentao Zhang, Binhang Yuan, and Conghui He. Efficient pretraining data selection for language models via multi-actor collaboration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

work page 2025

[9] [9]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. 6, 33

work page 2020

[10] [10]

Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning, 2025

Thibaut Boissin, Thomas Massena, Franck Mamalet, and Mathieu Serrurier. Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning, 2025. 2, 6, 20, 33

work page 2025

[11] [11]

Calian, Gregory Farquhar, Iurii Kemaev, Luisa Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeff Dean, Hado van Hasselt, and David Silver

Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeff Dean, Hado van Hasselt, and David Silver. Datarater: Meta-learned dataset curation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 21

work page 2025

[12] [12]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. 6, 33

work page 2018

[13] [13]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InThe Twelfth International Conference on Learning Representations, 2024. 6, 32

work page 2024

[14] [14]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. 1, 9

work page 2026

[15] [15]

DsDm: Model-aware dataset selection with datamodels

Logan Engstrom, Axel Feldmann, and Aleksander Madry. DsDm: Model-aware dataset selection with datamodels. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Re...

work page 2024

[16] [16]

What really matters in matrix-whitening optimizers?, 2025

Kevin Frans, Pieter Abbeel, and Sergey Levine. What really matters in matrix-whitening optimizers?, 2025. 1, 20

work page 2025

[17] [17]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page 2024

[18] [18]

Glm-5: from vibe coding to agentic engineering, 2026

GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. 20

work page 2026

[19] [19]

Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2026

Ekaterina Grishina, Matvey Smirnov, and Maxim Rakhuba. Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2026. 1, 2, 4, 6, 9, 18, 20, 33, 37

work page 2026

[20] [20]

Drop-muon: Update less, converge faster, 2025

Kaja Gruntkowska, Yassine Maziane, Zheng Qu, and Peter Richtárik. Drop-muon: Update less, converge faster, 2025. 20

work page 2025

[21] [21]

Data selection via optimal control for language models

Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data selection via optimal control for language models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

work page 2025

[22] [22]

Root: Robust orthogonalized optimizer for neural network training, 2025

Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. Root: Robust orthogonalized optimizer for neural network training, 2025. 1, 2, 4, 6, 20, 33

work page 2025

[23] [23]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. 6, 33

work page 2021

[24] [24]

Rae, Oriol Vinyals, and Laurent Sifre

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

work page 2022

[25] [25]

Datamodels: Predicting predictions from training data, 2022

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Predicting predictions from training data, 2022. 21

work page 2022

[26] [26]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. 1, 4, 6, 20, 33

work page 2024

[27] [27]

Muonbp: Faster muon via block-periodic orthogonalization, 2025

Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, and Youngsuk Park. Muonbp: Faster muon via block-periodic orthogonalization, 2025. 20

work page 2025

[28] [28]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. 19

work page 2017

[29] [29]

RACE: Large-scale ReAding comprehension dataset from examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 6, 33

work page 2017

[30] [30]

Normuon: Making muon more efficient and scalable, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025. 1, 20

work page 2025

[31] [31]

Sophia: A scalable stochastic second-order optimizer for language model pre-training

Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. InThe Twelfth International Conference on Learning Representations, 2024. 20

work page 2024

[32] [32]

Muon is scalable for llm training, 2025

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

work page 2025

[33] [33]

Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms,

Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms,

work page

[34] [34]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InThe Seventh International Conference on Learning Representations, 2019. 19

work page 2019

[35] [35]

How learning rate decay wastes your best data in curriculum-based LLM pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, and Wenguang Chen. How learning rate decay wastes your best data in curriculum-based LLM pretraining. InThe Fourteenth International Conference on Learning Representations, 2026. 2

work page 2026

[36] [36]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics. 6, 33

work page 2018

[37] [37]

Prodigy: An expeditiously adaptive parameter-free learner

Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 35779–35804. PMLR, 2024. 20

work page 2024

[38] [38]

Antonio Orvieto and Robert M. Gower. In search of adam’s secret sauce. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 20

work page 2025

[39] [39]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. InAdvances in Neural Information Processing Systems, volume 37, pages 30811–30849, 2024. 3, 6, 31

work page 2024

[40] [40]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. 6, 33

work page 2020

[41] [41]

Glu variants improve transformer, 2020

Noam Shazeer. Glu variants improve transformer, 2020. 6, 31

work page 2020

[42] [42]

Predictive data selection: The data that predicts is the data that teaches

KaShun SHUM, Yuzhen Huang, Hongjian Zou, dingqi, YiXuan Liao, Xiaoxin Chen, Qian Liu, and Junxian He. Predictive data selection: The data that predicts is the data that teaches. In Forty-second International Conference on Machine Learning, 2025. 21

work page 2025

[43] [43]

Adamuon: Adaptive muon optimizer, 2025

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer, 2025. 1, 6, 20, 33

work page 2025

[44] [44]

CommonsenseQA: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis...

work page 2019

[45] [45]

ADOPT: Modified adam can converge with any $\beta_2$ with the optimal rate

Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seongcheol Jeong, Go Na- gahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. ADOPT: Modified adam can converge with any $\beta_2$ with the optimal rate. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 20

work page 2024

[46] [46]

Kimi k2: Open agentic intelligence, 2026

Kimi Team. Kimi k2: Open agentic intelligence, 2026. 1, 20

work page 2026

[47] [47]

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. D4: Improving LLM pretraining via document de-duplication and diversification. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 21

work page 2023

[48] [48]

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 20 12

work page 2025

[49] [49]

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y . F. Tan. Muon outperforms adam in tail-end associative memory learning. InThe Fourteenth International Conference on Learning Representations,

work page

[50] [50]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J...

work page 2024

[51] [51]

Liu, and Matt Gardner

Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. InProceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 6, 33

work page 2017

[52] [52]

Fantastic pretraining optimizers and where to find them, 2025

Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them, 2025. 20

work page 2025

[53] [53]

Qurating: Selecting high- quality data for training language models

Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. Qurating: Selecting high- quality data for training language models. InForty-first International Conference on Machine Learning, 2024. 21

work page 2024

[54] [54]

Organize the web: Constructing domains enhances pre-training data curation

Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, and Luca Soldaini. Organize the web: Constructing domains enhances pre-training data curation. InForty-second International Conference on Machine Learning, 2025. 21

work page 2025

[55] [55]

Data selection for language models via importance resampling

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. InThirty-seventh Conference on Neural Information Processing Systems, 2023. 21

work page 2023

[56] [56]

Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024

Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024. 20

work page 2024

[57] [57]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page 2025

[58] [58]

Benjamin Erichson, and Michael W

Shenghao Yang, Zhichao Wang, Oleg Balabanov, N. Benjamin Erichson, and Michael W. Mahoney. Prism: Distribution-free adaptive computation of matrix functions for accelerating neural network training, 2026. 20

work page 2026

[59] [59]

MATES: Model-aware data selection for efficient pretraining with data influence models

Zichun Yu, Spandan Das, and Chenyan Xiong. MATES: Model-aware data selection for efficient pretraining with data influence models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 21

work page 2024

[60] [60]

MARS: Unleashing the power of variance reduction for training large models

Huizhuo Yuan, Yifeng Liu, Shuang Wu, zhou Xun, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. InForty-second International Conference on Machine Learning, 2025. 20

work page 2025

[61] [61]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, edi- tors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguis...

work page 2019

[62] [62]

Root mean square layer normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32, 2019. 6, 31

work page 2019

[63] [63]

Harnessing diversity for important data selection in pretraining large language models

Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Qiu Jiantao, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, and Conghui He. Harnessing diversity for important data selection in pretraining large language models. InThe Thirteenth International Conference on Learning Representations, 2025. 21

work page 2025

[64] [64]

Gram newton-schulz, 2026

Jack Zhang, Noah Amsel, Berlin Chen, and Tri Dao. Gram newton-schulz, 2026. 1, 20

work page 2026

[65] [65]

Adagrad meets muon: Adaptive stepsizes for orthogonal updates, 2025

Minxin Zhang, Yuxuan Liu, and Hayden Schaeffer. Adagrad meets muon: Adaptive stepsizes for orthogonal updates, 2025. 20

work page 2025

[66] [66]

Why transformers need adam: A hessian perspective

Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. Why transformers need adam: A hessian perspective. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 19

work page 2024

[67] [67]

Adam-mini: Use fewer learning rates to gain more

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. InThe Thirteenth International Conference on Learning Representations, 2025. 20

work page 2025

[68] [68]

Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. Deconstructing what makes a good optimizer for autoregressive language models. InThe Thirteenth International Conference on Learning Representations, 2025. 20

work page 2025

[69] [69]

AGIEval: A human-centric benchmark for evaluating foundation models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, June

work page 2024

[70] [70]

Association for Computational Linguistics. 6, 33

work page

[71] [71]

How to set the learning rate for large-scale pre-training?, 2026

Yunhua Zhou, Shuhao Xing, Junhao Huang, Xipeng Qiu, and Qipeng Guo. How to set the learning rate for large-scale pre-training?, 2026. 6, 21, 32

work page 2026

[72] [72]

Meta-rater: A multi-dimensional data selection method for pre-training language models

Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Qiu Jiantao, Chi Zhang, Ying Qian, and Conghui He. Meta-rater: A multi-dimensional data selection method for pre-training language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computationa...

work page

[73] [73]

16 A.2 Optimality Analysis

7, 21 14 Appendix Table of Contents A Details of AMO 16 A.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.2 Optimality Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B Related Work 19 C Pilot Experiments 21 C.1 Data Curriculum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page 1930

[74] [74]

Muon optimizer.Muon [ 26] is a recently proposed optimizer designed for the hidden-layer param- eters of neural networks

takes a complementary approach by eliminating the need for learning rate tuning altogether through a D-adaptation mechanism that automatically estimates the optimal step size during training. Muon optimizer.Muon [ 26] is a recently proposed optimizer designed for the hidden-layer param- eters of neural networks. It orthogonalizes the momentum buffer via N...

work page

[75] [75]

Complementing instance-level methods, several approaches addresscorpus-level diversity

estimates each example’s marginal contribution via datamodels [25], MATES [59] periodically updates lightweight influence models during pre-training, [ 21] dynamically adjusts the training distribution as a sequential decision problem, and [42] identifies the most instructive examples by their ability to predict learning outcomes. Complementing instance-l...

work page 2048