arxiv: 2605.22432 · v1 · pith:EVV574XGnew · submitted 2026-05-21 · 💻 cs.LG

AMUSE: Anytime Muon with Stable Gradient Evaluation

Pith reviewed 2026-05-22 07:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords AMUSE · Muon optimizer · Schedule-Free optimization · river-valley loss landscape · anytime training · optimizer design · deep learning training · LLM pretraining

The pith

AMUSE uses a time-varying interpolation between Muon iterates and schedule-free averaging to retain fast bulk progress while suppressing oscillations, eliminating the need for learning rate schedules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the loss landscape as a river-valley structure in which useful progress occurs along a flat low-curvature bulk subspace while high-curvature directions create steep walls that induce oscillations. Muon's orthogonalization of momentum increases the bulk component and speeds river progress but also amplifies dominant-direction noise, producing oscillatory trajectories. AMUSE counters this by introducing a time-varying interpolation coefficient that begins by evaluating gradients near the fast Muon sequence for rapid early adaptation and gradually shifts toward the stable averaged sequence to dampen wall oscillations. The resulting method requires no learning rate schedules, supports anytime training, and improves the performance-iteration Pareto frontier over Schedule-Free AdamW and Muon on vision tasks and large language model pretraining.

Core claim

Muon orthogonalization accelerates progress along the low-curvature bulk subspace but amplifies noise in dominant directions, causing oscillations within the river-valley loss landscape. AMUSE integrates Muon's rapid bulk progress with the stabilizing effect of schedule-free averaging through a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence and later shifts toward the averaged sequence, thereby suppressing oscillations while preserving the bulk benefit and removing the requirement for explicit learning rate schedules.
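
To make the orthogonalization step in the claim above concrete, the following is a minimal sketch of what "orthogonalizing" a momentum matrix means: every singular value is pushed to 1 while the singular directions are kept. It uses the classical cubic Newton-Schulz iteration as a stand-in; the exact iteration used by Muon or AMUSE is not stated in this text, and the matrix values and helper names (`matmul`, `transpose`, `orthogonalize`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def orthogonalize(m, steps=20):
    """Approximate the orthogonal polar factor of a full-rank matrix m.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    This is an illustrative choice, not necessarily what Muon or AMUSE use.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1].
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

momentum = [[3.0, 1.0], [1.0, 2.0]]       # toy momentum matrix (hypothetical values)
update = orthogonalize(momentum)
gram = matmul(update, transpose(update))  # close to the identity if update is orthogonal
```

Because the output has all singular values near 1, small-singular-value directions of the momentum receive the same step size as large ones, which is one way to read the claim that orthogonalization raises the bulk component.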

What carries the argument

A time-varying interpolation coefficient between the Muon sequence and the schedule-free averaged sequence that starts Muon-like for rapid adaptation and shifts to the averaged sequence for oscillation suppression.
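
The interpolation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' algorithm: the base update is a plain gradient step (a Muon-style method would use an orthogonalized momentum instead), the averaging weight is the simple uniform one, and `toy_beta_schedule` is a hypothetical linear ramp that only mimics the qualitative "start near the fast sequence, drift toward the averaged sequence" behavior. The paper's actual schedule, governed by ρ, is not reproduced here.

```python
def toy_beta_schedule(t, beta_start=0.6, beta_end=0.99, ramp_steps=1000):
    """Hypothetical linear ramp for the interpolation coefficient.

    NOT the schedule used in the paper; it only increases beta over time.
    """
    frac = min(t / ramp_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)

def interpolate(z, x, beta):
    """Gradient-evaluation point y = (1 - beta) * z + beta * x."""
    return [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]

def schedule_free_step(z, x, grad_fn, lr, beta, t):
    """One schedule-free style step with a plain gradient base update.

    z: fast sequence, x: averaged sequence, t: 1-based step index.
    Assumption: a Muon-style method would replace g by an orthogonalized
    momentum; that part is omitted to keep the sketch short.
    """
    y = interpolate(z, x, beta)
    g = grad_fn(y)
    z_new = [zi - lr * gi for zi, gi in zip(z, g)]
    c = 1.0 / (t + 1)                      # uniform-averaging weight
    x_new = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z_new)]
    return z_new, x_new, y

grad = lambda w: list(w)                   # gradient of 0.5 * ||w||^2
z, x = [2.0, 0.0], [0.0, 0.0]
z, x, y = schedule_free_step(z, x, grad, lr=0.1, beta=0.5, t=1)
```

With a small `beta` the gradient is evaluated close to the fast sequence `z`; as `beta` approaches 1 the evaluation point moves to the averaged sequence `x`, which is the direction of travel the text describes.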

If this is right

  • AMUSE delivers competitive or superior performance without any prescribed learning rate schedule.
  • The method supports anytime training because the averaging component allows interruption at any iteration while retaining good results.
  • The performance-iteration tradeoff improves consistently over both Schedule-Free AdamW and Muon across vision and LLM pretraining tasks.
  • Rapid bulk progress is retained early while later stability reduces wasteful oscillation.
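
The anytime property in the list above rests on a running average of the fast iterates that is always available. The paper text excerpted under Figure 16 gives c_t = 1/t as the simplified averaging weight (the implementation reportedly uses a learning-rate-weighted rule, which is not reproduced here). The sketch below, with hypothetical scalar iterates, shows that this weight makes the averaged sequence equal to the plain mean of all fast iterates so far.

```python
def running_average(points):
    """Fold points into x_t = (1 - c_t) * x_{t-1} + c_t * z_t with c_t = 1/t."""
    x = 0.0
    history = []
    for t, z in enumerate(points, start=1):
        c = 1.0 / t
        x = (1.0 - c) * x + c * z
        history.append(x)
    return history

fast_iterates = [4.0, 2.0, 6.0, 0.0]      # toy scalar iterates (hypothetical)
averages = running_average(fast_iterates)
# averages[k] is the arithmetic mean of the first k + 1 iterates
```

Stopping after any step leaves a usable averaged iterate in `history[-1]`, with no end-of-training decay phase required.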

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic interpolation between fast and stable sequences may generalize to other optimizers that exhibit similar noise amplification in non-convex landscapes.
  • An adaptive rule for choosing the interpolation schedule based on real-time noise estimates could further reduce manual tuning.
  • The same river-valley perspective might suggest extensions that monitor curvature changes to adjust the interpolation rate automatically.

Load-bearing premise

The loss landscape behaves as a river-valley structure in which Muon's orthogonalization specifically boosts bulk-subspace progress while increasing dominant-direction noise that produces oscillations.
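
The premise is stated in terms of how much of an update lies in the dominant versus the bulk subspace. A minimal sketch of such a measurement follows, assuming the ratios are norms of orthogonal projections divided by the norm of the vector, which is consistent with the Figure 1 caption's statement that the two squared ratios sum to 1. The basis here is hand-picked; in practice it would presumably come from estimated top eigenvectors, and the function name `subspace_ratios` is hypothetical.

```python
import math

def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def subspace_ratios(v, dominant_basis):
    """Return (r_dom, r_bulk) for a nonzero vector v.

    dominant_basis: orthonormal vectors spanning the dominant subspace
    (assumption: hand-picked here, estimated from curvature in practice).
    """
    coeffs = [dot(v, e) for e in dominant_basis]
    # Projection onto the dominant subspace, then onto its complement.
    p_dom = [sum(c * e[i] for c, e in zip(coeffs, dominant_basis))
             for i in range(len(v))]
    p_bulk = [vi - pi for vi, pi in zip(v, p_dom)]
    norm_v = math.sqrt(dot(v, v))
    r_dom = math.sqrt(dot(p_dom, p_dom)) / norm_v
    r_bulk = math.sqrt(dot(p_bulk, p_bulk)) / norm_v
    return r_dom, r_bulk

update = [3.0, 0.0, 4.0]                  # toy update vector (hypothetical)
basis = [[1.0, 0.0, 0.0]]                 # one dominant direction
r_dom, r_bulk = subspace_ratios(update, basis)
```

Since the two ratios satisfy r_dom² + r_bulk² = 1, reporting only one of them loses no information.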

What would settle it

If direct measurements during training show that the time-varying interpolation does not reduce observed oscillations relative to plain Muon, or if AMUSE fails to improve the performance-iteration curve on standard vision or language pretraining benchmarks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.22432 by Baekrok Shin, Beomhan Baek, Chulhee Yun, Jihun Yun, Jueun Kim, Minhak Song.

Figure 1. Illustration of a river-valley landscape. SGD oscillates across the valley walls and progresses slowly along the river, while Muon advances faster but remains oscillatory.
    These quantities measure how strongly $v$ is aligned with the dominant and bulk subspaces, respectively. Since $P_k(\theta)$ and $P_k^{\perp}(\theta)$ are orthogonal projections, the ratios satisfy $(r^{\mathrm{dom}})^2 + (r^{\mathrm{bulk}})^2 = 1$; hence, we report $r^{\mathrm{dom}}$ in most m…
Figure 2. Comparison of dominant component ratios. Evaluated on a 5k MNIST subset using a 3-layer MLP. (a) Muon consistently produces smaller dominant updates than SGD/AdamW, and AMUSE further suppresses the dominant component. (b) Orthogonalization reduces Muon's dominant ratio compared to momentum $m_t$; in contrast, AMUSE maintains low dominant ratios throughout, reflecting more stable gradient dynamics. Averaged ov…
Figure 3. Comparison of dominant component ratios. Settings as in …
Figure 4. Schedule-free iterates in a river-valley landscape. Orange arrows indicate negative gradients computed at each $y_t^{(\alpha)}$, exhibiting reduced dominant components when evaluated closer to the river.
    To test this hypothesis, we use $x_t$ as a proxy for a point on the river and evaluate gradients at virtual interpolation points $y_t^{(\alpha)} = (1 - \alpha) z_t + \alpha x_t$ for varying $\alpha \in [0, 1]$, while keeping the actual training tra…
Figure 5. Comparison of fixed-β SF-Muon with different β values and AMUSE in the 124M Llama pretraining on FineWeb. Solid lines show fixed-β SF-Muon, and dashed lines show AMUSE with $\beta_1 = 0.6$ and $\rho = 0.8$. We report validation perplexity (left) and the update norm $\|\Delta x_t\|$ (right).
    the 124M Llama setting with varying β (setup in Section 4), revealing that a large fixed β can be detrimental early in training (…
Figure 6. Comparison between constant learning rate Muon and AMUSE in the 124M Llama setting. The left panel shows the effect of EWA, where solid lines represent the original training trajectories and dotted lines represent their corresponding EWA trajectories. The middle panel shows the effect of learning rate decay, where after warmup we linearly decay the learning rate from $10^{-4}$ to 0 at selected iterations using …
Figure 7. Test accuracy across image domain experiments. Averaged over five random seeds.
    5.2 Large Language Model Pretraining. Experimental Setup. We follow the Llama-style Transformer setup of Semenov et al. (2026), using tied input/output embeddings, SwiGLU, RMSNorm, and RoPE. We train 124M, 720M, and 1B models on FineWeb-100B (Penedo et al., 2024) with sequence length 512 and batch sizes 256, 1984, and 2048, res…
Figure 8. Validation perplexity on FineWeb pretraining across Llama model scales.
    5.3 Hyperparameter Sensitivity. In all experiments, we fix the Muon momentum μ and do not treat it as a tunable parameter. Therefore, compared to standard SF optimizers, AMUSE introduces only one additional hyperparameter, ρ, which controls how quickly the gradient evaluation point moves toward the averaged iterate. We evaluate the hype…
Figure 9. Scaling Dominant (α) and Bulk (γ) components of Muon update. We apply subspace-wise scaling to Muon updates starting at step 500. For CIFAR-10-5k and MNIST-5k, Muon is combined with SGD for non-Muon parameters, using momentum 0.9 and learning rate $5 \times 10^{-4}$. For SST-2-1k, Muon is combined with AdamW for non-Muon parameters, using Muon momentum 0.9, AdamW coefficients (0.9, 0.99), and learning rate $1 \times 10^{-4}$…
Figure 10. Eigenspectra of the loss Hessian across optimizers, datasets, and architectures.
Figure 11. Comparison of bulk component ratios on CIFAR-10 with CNN.
Figure 12. Comparison of bulk component ratios on SST-2 with Transformer.
Figure 13. Comparison of dominant/bulk component ratios on CIFAR-10 with CNN.
Figure 14. Comparison of dominant/bulk component ratios on SST-2 with Transformer.
Figure 15. Effect of ρ on effective weights $\alpha_t$. Histograms of $\alpha_t$ across varying sequence lengths $T$. For $T > T_0 = 2000$, lower ρ values allow the effective averaging window to continuously grow, whereas ρ = 1 strictly fixes the window to its size at $T_0$.
    Visualization of Averaging Window Size and $\beta_t$. To better understand the impact of our proposed schedule on the averaging dynamics, we visualize both the effective aver…
Figure 16. Evolution of $\beta_t$ controlled by ρ. After the initial warmup phase ($T_0 = 2000$, dotted vertical line), ρ smoothly interpolates between a constant β baseline (ρ = 0) and the strict constant-window upper bound (ρ = 1).
    C.4 Implementation Details. Averaging coefficient. In the main text, we use the simplified intuition $c_t = 1/t$. In the implementation, we follow the learning-rate-weighted averaging rule used in De…
Figure 17. Comparison with Other Baselines on Llama 124M model.
Figure 18. Comparison with Muon+EWA and Muon+WSD on 124M Llama pretraining.
Figure 19. Comparison with Muon with EWA on ImageNet.
Figure 20. Longer-horizon training results on 124M Llama pretraining.
Figure 21. Hyperparameter sensitivity varying …
Figure 22. Comparison between AMUSE and fixed-β SF-Muon on 124M Llama pretraining. We compare AMUSE with SF-Muon using the best fixed interpolation value, β = 0.95. Fixed-β SF-Muon makes progress early in training but its improvement becomes limited in the later stage. In contrast, AMUSE continues to improve by gradually moving the gradient-evaluation point toward the averaged trajectory.
Figure 23. Comparison between AMUSE and fixed-β SF-Muon on Image domain experiments. We compare AMUSE with SF-Muon in image classification benchmarks. Runs are averaged over five random seeds, except for ImageNet trained with SF-Muon due to computation constraints.
    AMUSE without Muon Momentum. To examine the role of Muon momentum in AMUSE and SF-Muon, we remove the momentum buffer used before the orthogonalization s…
Figure 24. Momentum ablation on 124M Llama pretraining.
Figure 25. Effect of stopping the increase of $\beta_t$ in AMUSE. (Left) Validation perplexity on 124M Llama pretraining. The clipped variant underperforms AMUSE, with the gap widening later in training. (Right) Evolution of $\beta_t$. While both methods reach similar large values, AMUSE continues to increase $\beta_t$, whereas the clipped variant saturates to 0.95.
    As shown in …
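
The probe described in the Figure 4 excerpt, evaluating gradients at virtual interpolation points between the fast iterate and the averaged iterate, can be illustrated on a toy quadratic valley. Everything numeric below is a hypothetical choice made for illustration (the curvatures, the two iterates, and the α grid); only the interpolation formula y = (1 − α) z + α x comes from the text.

```python
import math

H_WALL, H_RIVER = 100.0, 1.0    # toy curvatures: steep wall axis, flat river axis

def grad(point):
    """Gradient of f(u, v) = 0.5 * H_WALL * u**2 + 0.5 * H_RIVER * v**2."""
    u, v = point
    return (H_WALL * u, H_RIVER * v)

def dominant_ratio(g):
    """Share of the gradient norm along the steep u-axis."""
    return abs(g[0]) / math.hypot(g[0], g[1])

z = (0.1, 10.0)   # fast iterate, displaced up the valley wall
x = (0.0, 10.0)   # averaged iterate, used as a proxy for a point on the river

ratios = []
for alpha in (0.0, 0.5, 0.9, 1.0):
    # Virtual interpolation point; the training trajectory itself is untouched.
    y = tuple((1.0 - alpha) * zi + alpha * xi for zi, xi in zip(z, x))
    ratios.append(dominant_ratio(grad(y)))
```

In this toy setting the dominant share of the gradient shrinks monotonically as α moves the evaluation point toward the river, which mirrors the qualitative observation in the Figure 4 caption without reproducing the paper's measurements.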

Original abstract

Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AMUSE, which combines Muon's orthogonalized momentum updates with Schedule-Free iterate averaging via a time-varying interpolation coefficient between the fast Muon sequence and the stable averaged sequence. Motivated by a river-valley loss landscape in which Muon boosts bulk low-curvature progress but amplifies high-curvature oscillations, AMUSE is claimed to improve the performance-iteration Pareto frontier over Schedule-Free AdamW and Muon on vision tasks and LLM pretraining while requiring no learning-rate schedules and supporting anytime training.

Significance. If the empirical results and mechanistic account are substantiated with detailed diagnostics and ablations, the work would offer a practical advance in schedule-free optimization that builds directly on Muon and Schedule-Free methods, potentially simplifying training pipelines for both vision and language models.

major comments (2)
  1. [Mechanism and design rationale] The mechanistic explanation in the abstract and introduction relies on the river-valley model, yet no quantitative diagnostics are reported to validate that Muon's orthogonalization increases the bulk component while amplifying dominant-direction noise, or that the specific time-varying interpolation selectively suppresses oscillations without eroding bulk progress. Direct measurements such as projections of updates onto estimated top eigenvectors of the gradient covariance or metrics of trajectory oscillation amplitude, comparing Muon, Schedule-Free AdamW, and AMUSE, are absent; without them the design rationale remains an untested hypothesis rather than an empirically grounded claim.
  2. [Empirical evaluation] The central empirical claim of consistent Pareto-frontier improvement is stated in the abstract but the provided text contains no experimental details, datasets, model scales, number of runs, error bars, statistical significance tests, or ablation studies on the interpolation schedule. This absence makes it impossible to assess robustness or rule out incidental effects of the schedule.
minor comments (1)
  1. [Methods] Notation for the time-varying interpolation coefficient and its schedule should be defined explicitly with an equation or pseudocode in the methods section to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate additional diagnostics and experimental details.

Point-by-point responses
  1. Referee: [Mechanism and design rationale] The mechanistic explanation in the abstract and introduction relies on the river-valley model, yet no quantitative diagnostics are reported to validate that Muon's orthogonalization increases the bulk component while amplifying dominant-direction noise, or that the specific time-varying interpolation selectively suppresses oscillations without eroding bulk progress. Direct measurements such as projections of updates onto estimated top eigenvectors of the gradient covariance or metrics of trajectory oscillation amplitude, comparing Muon, Schedule-Free AdamW, and AMUSE, are absent; without them the design rationale remains an untested hypothesis rather than an empirically grounded claim.

    Authors: We agree that the river-valley account would be strengthened by the specific quantitative diagnostics noted. The manuscript already presents empirical support via performance gains and qualitative trajectory observations consistent with increased bulk progress under Muon, but we acknowledge the absence of the requested eigenvector projections and oscillation amplitude metrics. In the revised version we have added these direct measurements, including update projections onto estimated top eigenvectors of the gradient covariance and oscillation-amplitude comparisons across Muon, Schedule-Free AdamW, and AMUSE. The new results corroborate that orthogonalization boosts the bulk component while amplifying dominant-direction noise, and that the time-varying interpolation in AMUSE reduces oscillations without sacrificing bulk progress. revision: yes

  2. Referee: [Empirical evaluation] The central empirical claim of consistent Pareto-frontier improvement is stated in the abstract but the provided text contains no experimental details, datasets, model scales, number of runs, error bars, statistical significance tests, or ablation studies on the interpolation schedule. This absence makes it impossible to assess robustness or rule out incidental effects of the schedule.

    Authors: We apologize for any impression that experimental details were missing from the reviewed version; the full manuscript contains an experimental section describing the vision and LLM pretraining setups. To fully address the concern we have expanded this section in the revision with explicit reporting of datasets, model scales, number of independent runs with different random seeds, error bars, statistical significance tests, and dedicated ablation studies on the interpolation schedule. The added ablations demonstrate that the reported Pareto-frontier gains are robust across schedule variations and not attributable to incidental effects. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical motivation and experimental claims remain independent of inputs

Full rationale

The paper motivates AMUSE from an empirical observation that Muon's orthogonalization boosts the bulk (river) component while amplifying dominant-direction oscillations in a posited river-valley landscape, then introduces a time-varying interpolation to stabilize it. No equations, fitted parameters, or predictions are presented that reduce to their own inputs by construction; the design choice is described as building directly on stated observations rather than self-referential definitions or renamed fits. Claims of improved Pareto frontiers are supported by reported experiments on vision and LLM tasks, which are externally falsifiable and do not rely on load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; full paper details on parameters and assumptions are unavailable.

free parameters (1)
  • time-varying interpolation coefficient schedule
    Controls the shift from Muon sequence to averaged sequence; functional form and any fitted values are not specified in the abstract.
axioms (1)
  • domain assumption: River-valley loss landscape model in which useful progress occurs along a flat low-curvature bulk subspace while high-curvature directions induce oscillations.
    Invoked in the abstract to explain Muon's dual effects and to motivate the stabilizing interpolation in AMUSE.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 8 internal anchors

  1. Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond. arXiv preprint arXiv:1611.07476.
  2. Gradient Descent Happens in a Tiny Subspace. arXiv preprint arXiv:1812.04754.
  3. Minhak Song, Kwangjun Ahn, and Chulhee Yun. Does. 2025.
  4. BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  5. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv preprint arXiv:1706.04454.
  6. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. International Conference on Machine Learning, 2019.
  7. Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View. The Thirteenth International Conference on Learning Representations.
  8. Annalisa Belloni, Lorenzo Noci, and Antonio Orvieto. Universal Dynamics of Warmup Stable Decay: understanding. 2025.
  9. Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. International Conference on Learning Representations.
  10. Understanding Optimization in Deep Learning with Central Flows. The Thirteenth International Conference on Learning Representations.
  11. Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability. arXiv preprint arXiv:2209.15594.
  12. The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training. Forty-second International Conference on Machine Learning.
  13. Improving Generalization and Convergence by Enhancing Implicit Regularization. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  14. Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, and Christos Thrampoulidis. How Muon. 2026.
  15. The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size. arXiv preprint arXiv:1811.07062.
  16. Vardan Papyan. Journal of Machine Learning Research.
  17. Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians. International Conference on Machine Learning, 2019.
  18. PyHessian: Neural Networks Through the Lens of the Hessian. 2020 IEEE International Conference on Big Data (Big Data), 2020.
  19. Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, and Wenguang Chen. How Learning Rate Decay Wastes Your Best Data in Curriculum-Based. 2026.
  20. Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler. Transactions on Machine Learning Research, 2025.
  21. Understanding Gradient Descent on the Edge of Stability in Deep Learning. International Conference on Machine Learning, 2022.
  22. Old Optimizer, New Norm: An Anthology. OPT 2024: Optimization for Machine Learning.
  23. David Ruppert. 1988.
  24. Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement. arXiv preprint arXiv:2602.22681.
  25. Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. 2024.
  26. The Road Less Scheduled. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  27. Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  28. Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 1992.
  29. Primal Averaging: A New Gradient Evaluation Step to Attain the Optimal Individual Convergence. IEEE Transactions on Cybernetics, 2018.
  30. Primal-Dual Subgradient Methods for Convex Problems. Mathematical Programming, 2009.
  31. Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants. arXiv preprint arXiv:2502.02431.
  32. Diederik P. Kingma and Jimmy Ba. 3rd International Conference on Learning Representations.
  33. Decoupled Weight Decay Regularization. International Conference on Learning Representations.
  34. FOCUS: First Order Concentrated Updating Scheme. arXiv preprint arXiv:2501.12243.
  35. Preconditioning Benefits of Spectral Orthogonalization in Muon. arXiv preprint arXiv:2601.13474.
  36. Delving into Muon and Beyond: Deep Analysis and Extensions. arXiv preprint arXiv:2602.04669.
  37. Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning. arXiv preprint arXiv:2603.09697.
  38. AdaMuon: Adaptive Muon Optimizer. arXiv preprint arXiv:2507.11005.
  39. On the Convergence Analysis of Muon. arXiv preprint arXiv:2505.23737.
  40. Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization. arXiv preprint arXiv:2602.06385.
  41. Muon Outperforms Adam in Tail-End Associative Memory Learning. The Fourteenth International Conference on Learning Representations.
  42. Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, and others.
  43. Benchmarking Optimizers for Large Language Model Pretraining. 2026.
  44. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  45. Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. CoRR, 1605.07146, 2016.
  46. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  47. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  48. U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
  49. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  50. Learning Multiple Layers of Features from Tiny Images. 2009.
  51. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.
  52. Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368.
  53. Masked Autoencoders Are Scalable Vision Learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  54. Benchmarking Neural Network Training Algorithms. arXiv preprint arXiv:2306.07179.
  55. ROOT: Robust Orthogonalized Optimizer for Neural Network Training. arXiv preprint arXiv:2511.20626.
  56. Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum. arXiv preprint arXiv:2602.17080.
  57. Ilya Loshchilov and Frank Hutter. 2017.
  58. Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition. The Thirteenth International Conference on Learning Representations.
  59. arXiv preprint arXiv:2602.15763.
  60. Rethinking "Batch" in BatchNorm. arXiv preprint arXiv:2105.07576.
  61. Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, and Yaoqing Yang. Suspicious Alignment of. 2026.