Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

Abhishek Chakraborty; Angelia Nedić; Grigory Malinovsky; Peter Richtárik; Yury Demidovich

arxiv: 2605.18999 · v1 · pith:GTDKH3D3new · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:20 UTC · model grok-4.3

keywords normalized optimization · adaptive trust region · Muon · non-convex convergence · star-convex functions · last-iterate bounds · step-size adaptation

The pith

Distance-adaptive Muon sets trust-region radius from the trajectory explored so far to obtain stationarity guarantees on smooth non-convex objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops three adaptive scaling rules for Muon, a normalized optimizer that already separates update direction from step length. Distance-Adaptive Muon chooses its trust-region radius directly from the distance the iterates have already traveled, and the analysis shows this yields stationarity for smooth non-convex problems once the trajectory is assumed bounded. Scale-Calibrated Muon keeps the momentum buffer but replaces fixed scaling with a local descent certificate computed from the current gradient and momentum; under a bounded initial sublevel set it delivers a last-iterate O(1/T) objective-gap bound on star-convex functions. Distance-Free Muon further removes the need for a global distance parameter by using a scalar certificate and a majorized one-dimensional line search. Experiments on GPT-124M language modeling and ViT-Tiny image classification confirm that these rules reduce sensitivity to manual scale choices while matching or exceeding the performance of well-tuned fixed-scale Muon.
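To make the idea of a radius "chosen from the distance the iterates have already traveled" concrete, here is a minimal sketch. It is not the paper's algorithm: the rule below is a DoG-style heuristic assumed purely for illustration (step length equals the largest distance from the starting point seen so far, divided by the square root of the step count, then capped), it uses plain Euclidean normalization instead of Muon's matrix geometry, and it has no momentum. The names `distance_adaptive_normalized_descent`, `r_init`, and `eta_max` are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def distance_adaptive_normalized_descent(grad, x0, steps, r_init=1e-3, eta_max=1.0):
    """Normalized gradient steps whose length follows the radius explored so far.

    Assumed rule (illustrative only, not taken from the paper):
        radius_t = max(r_init, max_{k <= t} ||x_k - x0||)
        eta_t    = min(eta_max, radius_t / sqrt(t + 1))
    """
    x = list(x0)
    radius = r_init
    history = []  # (eta_t, radius after the step) for later inspection
    for t in range(steps):
        g = grad(x)
        g_norm = l2_norm(g)
        if g_norm == 0.0:
            break  # exactly stationary, nothing left to do
        eta = min(eta_max, radius / math.sqrt(t + 1))
        # The direction is the normalized gradient; only eta carries the scale.
        x = [xi - eta * gi / g_norm for xi, gi in zip(x, g)]
        radius = max(radius, l2_norm([xi - x0i for xi, x0i in zip(x, x0)]))
        history.append((eta, radius))
    return x, history


def quad_grad(x):
    """Gradient of the toy objective f(x) = x[0]**2 + 10 * x[1]**2."""
    return [2.0 * x[0], 20.0 * x[1]]


x_final, history = distance_adaptive_normalized_descent(quad_grad, [3.0, -2.0], steps=200)
print("last step length:", history[-1][0], "explored radius:", history[-1][1])
```

The point of the sketch is only that no hand-tuned learning rate appears: the step length starts tiny, grows as the iterates move away from the initialization, and is bounded by a cap, which mirrors the role the cap plays in the DA-Muon sweeps listed under Figures.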

Core claim

By letting the step scale be chosen from the radius already explored by the trajectory or from a local descent certificate, normalized Muon updates obtain stationarity guarantees for smooth non-convex problems and last-iterate O(1/T) objective-gap bounds for star-convex problems, with the radius parameters appearing only in the analysis and not inside the algorithms themselves.
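For readers unfamiliar with the terms, the block below writes out one standard definition of star-convexity and the shape of a last-iterate O(1/T) bound. The paper's exact definition and constants are not reproduced on this page, so the symbols $x^\star$, $f^\star$, and $C$ are assumptions of this sketch.

```latex
% Star-convexity with respect to a global minimizer x^\star (one common definition):
\[
  f\bigl(\alpha x^\star + (1-\alpha)\,x\bigr) \;\le\; \alpha f(x^\star) + (1-\alpha) f(x)
  \qquad \text{for all } x \text{ and all } \alpha \in [0,1].
\]
% For differentiable f, letting \alpha \to 0 gives the gradient form
\[
  f(x) - f^\star \;\le\; \langle \nabla f(x),\, x - x^\star \rangle .
\]
% A last-iterate O(1/T) objective-gap bound has the shape
\[
  f(x_T) - f^\star \;\le\; \frac{C}{T},
\]
% where C collects smoothness and radius constants that appear only in the analysis.
```

Every convex function satisfies the first inequality, but a star-convex function only needs convexity along segments that end at the minimizer, which is why it is used as a weaker model of favorable global geometry.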

What carries the argument

The trust-region radius adaptation, which is set either from the distance traveled along the optimization path, from a local descent certificate derived from gradient and momentum, or from a scalar distance certificate together with a majorized one-dimensional search.
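As a rough illustration of what a "majorized one-dimensional search" can look like, the derivation below works through the simplest case: a gradient-only normalized step under $L$-smoothness with respect to a norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$. This is a textbook calculation, not the paper's construction; the actual method also involves momentum and a scalar distance certificate, which are omitted here.

```latex
% Assume f is L-smooth with respect to \|\cdot\|, and let d be a unit-norm direction
% with \langle \nabla f(x), d \rangle = -\|\nabla f(x)\|_* (the normalized steepest-descent direction).
\[
  f(x + r\,d) \;\le\; f(x) + r\,\langle \nabla f(x), d \rangle + \frac{L}{2}\,r^2 \|d\|^2
  \;=\; \underbrace{f(x) - r\,\|\nabla f(x)\|_* + \frac{L}{2}\,r^2}_{\varphi(r)} .
\]
% The majorizer \varphi is a one-dimensional quadratic in the radius r; minimizing it gives
\[
  r^\star = \frac{\|\nabla f(x)\|_*}{L},
  \qquad
  f(x + r^\star d) \;\le\; f(x) - \frac{\|\nabla f(x)\|_*^2}{2L} .
\]
```

The useful feature is that the search is over a single scalar, the radius $r$, regardless of the dimension of $x$.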

Load-bearing premise

The optimization trajectory or the initial sublevel set remains bounded.

What would settle it

A smooth non-convex problem on which the Distance-Adaptive Muon iterates remain bounded yet never approach an approximate stationary point would refute the stationarity claim.

Figures

Figures reproduced from arXiv: 2605.18999 by Abhishek Chakraborty, Angelia Nedić, Grigory Malinovsky, Peter Richtárik, Yury Demidovich.

**Figure 2.** Regularized ViT-Tiny/CIFAR-100 training dynamics in a representative seed. The left…

**Figure 3.** Fixed-Muon learning-rate sweep on GPT-124M/WikiText-103. The smaller learning rate…

**Figure 4.** DA-Muon cap sweep on GPT-124M/WikiText-103. Smaller caps substantially improve…

**Figure 5.** Observed base scales selected by the Muon variants on GPT-124M/WikiText-103 in a…

**Figure 6.** Representative NanoGPT/WikiText-2 curves for seed…

**Figure 7.** Observed base scales selected by the Muon variants on NanoGPT/WikiText-2 in representa…

**Figure 8.** DF-Muon cap diagnostic on regularized ViT-Tiny/CIFAR-100 in a representative seed. The…

**Figure 9.** Effective step scales on regularized ViT-Tiny/CIFAR-100 for a representative seed, ex…

**Figure 10.** Effective step scales on regularized ViT-Tiny/CIFAR-100 for a representative seed,…

**Figure 11.** Fixed-Muon learning-rate sweep on CIFAR-100/ResNet-32. The largest tested scale,…

**Figure 12.** CIFAR-100/ResNet-32 training in a representative seed. We compare AdamW, the tuned…

**Figure 13.** Observed base scales selected by the adaptive Muon variants on CIFAR-100/ResNet-32 in…

Original abstract

Muon and related normalized optimizers decouple the choice of update direction from the choice of step scale, but their practical performance remains sensitive to the scale of the normalized step. We study adaptive scaling rules for Muon in general norm geometries and develop three complementary algorithms. For smooth non-convex objectives, we introduce Distance-Adaptive Muon, whose trust-region radius is set from the radius explored by the trajectory, and prove a stationarity guarantee under a bounded-trajectory assumption. We then turn to star-convex objectives, a tractable model of the favorable global geometry often used to reason about the empirical loss landscapes of deep neural networks, where objective-gap guarantees are possible. In this setting, we first introduce Scale-Calibrated Muon, which keeps Muon's exponential moving average but sets the step length from a local descent certificate computed from the current gradient and momentum. For this method, we prove a last-iterate O(1/T) objective-gap bound under a bounded initial sublevel-set assumption, where the corresponding radius parameter appears only in the analysis and not in the algorithm. Finally, we develop Distance-Free Muon, a recentered trust-region method that uses a scalar distance certificate and a majorized one-dimensional search to select the trust-region radius without requiring the unknown distance from the initialization to a global minimizer. Experiments on Transformer language modeling (GPT-124M/WikiText-103) and image classification (ViT-Tiny/CIFAR-100) show that the proposed adaptive scaling rules reduce sensitivity to manual scale tuning and match or improve tuned fixed-scale Muon baselines under the tested budgets.
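The phrase "decouple the choice of update direction from the choice of step scale" can be illustrated with a linear minimization oracle (LMO) over a unit norm ball: the oracle fixes the direction, and a separate scalar sets the length. The sketch below is an assumption-laden toy, not the paper's method. It uses vector ℓ2 and ℓ∞ balls to stay dependency-free, whereas Muon is usually described with a spectral-norm ball on weight matrices, and the function names `lmo_l2`, `lmo_linf`, and `normalized_step` are hypothetical.

```python
import math


def lmo_l2(g):
    """Direction in the unit l2 ball maximizing <g, d>: g / ||g||_2."""
    n = math.sqrt(sum(v * v for v in g))
    return [v / n for v in g] if n > 0 else [0.0] * len(g)


def lmo_linf(g):
    """Direction in the unit l_inf ball maximizing <g, d>: sign(g) per coordinate."""
    return [math.copysign(1.0, v) if v != 0 else 0.0 for v in g]


def normalized_step(x, g, eta, lmo):
    """Move a length controlled only by eta along the direction returned by the LMO."""
    d = lmo(g)
    return [xi - eta * di for xi, di in zip(x, d)]


g = [3.0, -4.0]
x = [1.0, 1.0]
print(normalized_step(x, g, 0.5, lmo_l2))    # same eta, l2 geometry
print(normalized_step(x, g, 0.5, lmo_linf))  # same eta, l_inf geometry
```

Rescaling `g` by any positive constant leaves both outputs unchanged, which is why the step scale `eta` has to come from somewhere else. The three adaptive rules in the abstract are different ways of supplying it.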

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives three concrete adaptive scaling rules for Muon with matching experiments, but the convergence claims rest on bounded-trajectory assumptions that look shaky for real deep learning.

Two things stand out right away. The paper develops three adaptive scaling rules for Muon that aim to cut down on manual step-size tuning, and it backs two of them with convergence guarantees under specific assumptions. The new parts are Distance-Adaptive Muon for smooth non-convex problems, which picks the trust-region radius from the path the optimizer has already taken; Scale-Calibrated Muon for star-convex objectives, which uses a local descent certificate to set the step; and Distance-Free Muon, which avoids needing the distance to the minimizer by using a one-dimensional search. The experiments on language modeling with a 124M-parameter GPT and image classification with ViT-Tiny on CIFAR-100 show these versions match or beat carefully tuned fixed-scale Muon while being less sensitive to the scale choice.

The main weakness is in the assumptions. Both the stationarity result and the O(1/T) bound lean on bounded-trajectory or bounded-sublevel-set conditions that are used only in the proofs and are not enforced by the algorithms. In typical deep learning training, loss landscapes are non-convex and trajectories can grow without clear bounds, so those guarantees might not apply. The experiments are on modest-sized models, which leaves open how well this holds for larger training runs.

This paper is aimed at people working on normalized gradient methods and adaptive optimizers for neural nets. Someone looking for practical tweaks to Muon with a bit of theory would find it useful. I think it should go to peer review. The constructions are clear enough and the empirical side is solid for what it is, even with the assumption issues that referees will likely probe.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes three adaptive scaling algorithms for Muon and related normalized optimizers. Distance-Adaptive Muon sets its trust-region radius from the radius explored by the trajectory and proves a stationarity guarantee for smooth non-convex objectives under a bounded-trajectory assumption. Scale-Calibrated Muon retains Muon's EMA but selects step length from a local descent certificate, proving a last-iterate O(1/T) objective-gap bound for star-convex objectives under a bounded initial sublevel-set assumption (with the radius parameter appearing only in analysis). Distance-Free Muon uses a scalar distance certificate and majorized one-dimensional search to choose the radius without requiring the unknown distance to a global minimizer. Experiments on GPT-124M/WikiText-103 and ViT-Tiny/CIFAR-100 indicate that the adaptive rules reduce sensitivity to manual scale tuning while matching or exceeding tuned fixed-scale Muon baselines.

Significance. If the stated assumptions hold, the work supplies concrete convergence guarantees for adaptive normalized optimization together with practical algorithms that demonstrably lessen hyperparameter sensitivity on representative deep-learning tasks. The separation of algorithmic parameters from analysis-only quantities and the use of star-convex geometry as a model for favorable DNN loss landscapes are constructive contributions.

major comments (2)

[Abstract] Abstract and the section presenting Distance-Adaptive Muon: the stationarity guarantee is conditioned on a bounded-trajectory assumption that the algorithm itself neither enforces nor monitors. In typical non-convex deep-learning landscapes the trajectory radius can grow without bound, which would render the radius-setting rule and the associated guarantee inapplicable. The manuscript should either relax the assumption, provide a practical detection mechanism, or supply empirical evidence that the explored radii remain bounded under the reported training budgets.
[Abstract] Abstract and the section on Scale-Calibrated Muon: the O(1/T) objective-gap bound likewise rests on a bounded initial sublevel-set assumption that appears only in the analysis. While the paper correctly notes that the radius parameter does not enter the algorithm, the practical plausibility of the assumption for the star-convex model of DNN landscapes should be discussed or tested to substantiate the strength of the theoretical claim.

minor comments (1)

[Experiments] The experimental section would benefit from reporting the number of independent runs and any statistical significance tests for the observed improvements over tuned Muon baselines.
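A minimal sketch of the kind of reporting this comment asks for: aggregate a metric over independent seeds and state the count alongside mean and standard deviation. The seed labels match the set {42, 1337, 2024} quoted in the page's excerpt of the NanoGPT table, but the numeric values below are placeholders invented for illustration and are NOT results from the paper. Whether the paper uses the sample or the population standard deviation is not stated on this page; the sketch assumes the sample version.

```python
import statistics


def summarize_over_seeds(values_by_seed):
    """Return (mean, sample standard deviation, number of runs) for one metric."""
    values = list(values_by_seed.values())
    mean = statistics.mean(values)
    # stdev needs at least two runs; report 0.0 for a single-seed diagnostic.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std, len(values)


# Placeholder validation losses keyed by seed (illustrative values only).
val_loss = {42: 5.80, 1337: 5.75, 2024: 5.85}
mean, std, n = summarize_over_seeds(val_loss)
print(f"val loss: {mean:.4f} ± {std:.4f} (n = {n} seeds)")
```

With only three seeds a formal significance test has little power, so reporting `n` next to the spread is arguably the more informative habit.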

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address each major comment below with clarifications and planned revisions to the manuscript.

Referee: [Abstract] Abstract and the section presenting Distance-Adaptive Muon: the stationarity guarantee is conditioned on a bounded-trajectory assumption that the algorithm itself neither enforces nor monitors. In typical non-convex deep-learning landscapes the trajectory radius can grow without bound, which would render the radius-setting rule and the associated guarantee inapplicable. The manuscript should either relax the assumption, provide a practical detection mechanism, or supply empirical evidence that the explored radii remain bounded under the reported training budgets.

Authors: We agree that the bounded-trajectory assumption limits the applicability of the stationarity guarantee for Distance-Adaptive Muon in arbitrary non-convex settings where trajectories might diverge. Relaxing the assumption entirely would require a substantially different proof technique that is outside the scope of the current work. Instead, we will add empirical evidence in the revised manuscript by including plots of the maximum trajectory radius (distance from initialization) versus training step for the GPT-124M and ViT-Tiny experiments. These plots show that, under the reported budgets and standard regularization, the explored radii stabilize and remain bounded, supporting practical relevance of the guarantee. We will also add a short discussion acknowledging the assumption's theoretical nature while noting that normalization and early stopping in deep learning often prevent unbounded growth in practice.

revision: yes

Referee: [Abstract] Abstract and the section on Scale-Calibrated Muon: the O(1/T) objective-gap bound likewise rests on a bounded initial sublevel-set assumption that appears only in the analysis. While the paper correctly notes that the radius parameter does not enter the algorithm, the practical plausibility of the assumption for the star-convex model of DNN landscapes should be discussed or tested to substantiate the strength of the theoretical claim.

Authors: We concur that further discussion of the bounded initial sublevel-set assumption would strengthen the presentation of the O(1/T) guarantee for Scale-Calibrated Muon. Although the radius parameter is analysis-only and does not affect the algorithm, we will expand the relevant section in the revision to address its plausibility under the star-convex model. The added discussion will reference existing empirical studies on DNN loss landscapes showing that sublevel sets near good minima are typically connected and locally bounded. We will also note that the star-convex geometry serves as a tractable proxy for favorable regions encountered in practice, and suggest that monitoring objective values during training could serve as a heuristic check for the assumption in future applications.

revision: yes
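Both responses describe diagnostics that are easy to log during training: the maximum distance from initialization, and whether the objective ever rises above its initial value (that is, whether the iterate leaves the initial sublevel set). The sketch below shows one possible way to record them. The class name `TrajectoryMonitor` and its fields are hypothetical, parameters are treated as a flat list of floats, and a finite log can only suggest, never prove, that either assumption holds.

```python
import math


class TrajectoryMonitor:
    """Logs the running max distance from x0 and any exit from the initial sublevel set."""

    def __init__(self, x0, f0):
        self.x0 = list(x0)             # parameters at initialization
        self.f0 = f0                   # objective value at initialization
        self.max_radius = 0.0          # max_k ||x_k - x0|| seen so far
        self.left_sublevel_set = False
        self.radius_history = []       # running max per step, ready to plot

    def update(self, x, f_value):
        r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, self.x0)))
        self.max_radius = max(self.max_radius, r)
        if f_value > self.f0:
            # The iterate is outside {x : f(x) <= f(x0)} at this step.
            self.left_sublevel_set = True
        self.radius_history.append(self.max_radius)
        return self.max_radius


monitor = TrajectoryMonitor(x0=[0.0, 0.0], f0=1.0)
monitor.update([3.0, 4.0], f_value=0.5)
monitor.update([1.0, 0.0], f_value=2.0)
print(monitor.max_radius, monitor.left_sublevel_set)
```

With stochastic mini-batch losses the sublevel check is noisy, so in practice it would more plausibly be applied to a smoothed or full-batch objective.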

Circularity Check

0 steps flagged

No circularity: adaptive rules and conditional guarantees are independent of target results

The paper defines Distance-Adaptive Muon by setting trust-region radius directly from observed trajectory radius and proves stationarity only under an explicit bounded-trajectory assumption stated to appear solely in the analysis. Scale-Calibrated Muon derives step length from a local descent certificate computed from gradient and momentum, with its radius parameter likewise confined to the proof and absent from the algorithm. Distance-Free Muon uses a scalar distance certificate and one-dimensional search. None of these steps reduce the claimed guarantees to the inputs by construction, nor do they rename fitted quantities as predictions or rely on self-citation chains for uniqueness. The derivation remains self-contained against the stated external assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on bounded-trajectory and bounded-sublevel-set assumptions that are not independently verified in the abstract; no free parameters are introduced into the algorithms themselves.

axioms (2)

domain assumption: Bounded-trajectory assumption for the non-convex stationarity guarantee. Invoked for the Distance-Adaptive Muon stationarity result.
domain assumption: Bounded initial sublevel-set assumption for the star-convex objective-gap bound. Invoked for the Scale-Calibrated Muon O(1/T) result.

pith-pipeline@v0.9.0 · 5840 in / 1372 out tokens · 27772 ms · 2026-05-20T12:20:31.176790+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Distance-Adaptive Muon ... trust-region radius is set from the radius explored by the trajectory ... under a bounded-trajectory assumption"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "Scale-Calibrated Muon ... sets the step length from a local descent certificate ... under a bounded initial sublevel-set assumption"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 4 internal anchors

[1] Amsel, N., Persson, D., Musco, C., and Gower, R. M. (2025). The polar express: Optimal matrix sign methods and their application to the muon algorithm. arXiv preprint arXiv:2505.16932.
Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–

[2] Bernstein, J. and Newhouse, L. (2024). Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325.
Carlson, D., Cevher, V., and Carin, L. (2015a). Stochastic spectral descent for restricted Boltzmann machines. In Artificial Intelligence and Statistics, pages 111–119. PMLR.
Carlson, D. E., Collins, E., Hsieh, Y.-P., Carin, L., and Cevher, V. (...

[3] Carmon, Y. and Hinder, O. (2022). Making SGD parameter-free. In Conference on Learning Theory, pages 2360–2389. PMLR.
Cartis, C., Gould, N. I., and Toint, P. L. (2011a). Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295.
Cartis, C., Gould,...

[4] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
Large, T., Liu, Y., Huh, M., Bahng, H., Isola, P., and Bernstein, J. (2024). Scalable optimization in the modular norm. Advances in Neural Information Processing Systems, 37:73501–73548.
Li, J. and Hong, M. (2025). A note on the convergence of muon and furthe...

[5] Orabona, F. and Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems,

[6] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems,

[7] Pethick, T., Xie, W., Antonakopoulos, K., Zhu, Z., Silveti-Falls, A., and Cevher, V. (2025a). Training deep learning models with norm-constrained LMOs. arXiv preprint arXiv:2502.07529.
Pethick, T., Xie, W., Erdogan, M., Antonakopoulos, K., Silveti-Falls, T., and Cevher, V. (2025b). Generalized gradient norm clipping & non-Euclidean (L0, L1)-smoothness...

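[8] Applying Lemma A.3 to $\{a_j\}_{j=0}^{M+1}$ with horizon $M$ gives

$$\min_{1 \le j \le M} \frac{a_{j+1}}{\sum_{\ell=1}^{j} a_\ell} \;\le\; \frac{1}{M}\left(\frac{a_{M+1}}{a_0}\right)^{1/M} \log\frac{e\,a_{M+1}}{a_0}.$$

For $k = s + j$, we have

$$\sum_{i=1}^{k} \bar r_i \;\ge\; \sum_{i=s+1}^{k} \bar r_i \;=\; \sum_{\ell=1}^{j} a_\ell, \qquad \bar r_{k+1} = a_{j+1}.$$

Therefore,

$$\min_{s+1 \le k \le T} \frac{\bar r_{k+1}}{\sum_{i=1}^{k} \bar r_i} \;\le\; \frac{1}{M}\left(\frac{\bar r_{T+1}}{\bar r_s}\right)^{1/M} \log\frac{e\,\bar r_{T+1}}{\bar r_s}.$$

Since $M \ge T/2$, $\bar r_s \ge \bar r_0$, and $\bar r_{T+1} \ge \bar r_0$, the right-hand side is at most $\frac{2}{T}\left(\frac{\bar r_{T+1}}{\bar r_0}\right)^{2/T} \log e\,\bar r_{T+1}\cdots$
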
[9] Thus $f(x_T) - f^\star = \tilde{O}\!\left(\frac{L D^2}{T}\right)$, $D = \|x_0 - x^\star\|$, with constants depending only on $\alpha$, $\rho$, $\lambda$, and $M$, but not on $D$. Q.E.D.

Remark C.5 (What is distance-free?). The algorithm never receives $D = \|x_0 - x^\star\|$ as an input. The proof uses $D$ only as a hidden comparator radius in the scalar search. The scalar proxy $d_k$ is a D-adaptation-style lower certificate and is used only t...

[10] DF-Muon cap diagnostic. We then ran a matched regularized 100-epoch one-seed diagnostic for DF-Muon, varying only the cap η_max. Table 10 reports the result. The cap η_max = 0.01 gives the best top-1 accuracy among the DF-Muon variants, while remaining close to the tuned fixed-Muon baseline in best validation cross-entropy. We therefore use η_max = 0.01 in the...

[11] Table 7: NanoGPT/WikiText-2 results over three seeds for the updated majorized DF-Muon implementation. We report mean ± standard deviation over seeds {42, 1337, 2024}. The fixed-Muon baseline uses η = 0.015; adaptive Muon variants use η_max = 0.03.

| Method | Train loss | Val. loss | Mean base η |
| --- | --- | --- | --- |
| AdamW | 5.5576 ± 0.0163 | 5.7926 ± 0.0339 | 0.0010 |
| Best fixed Muon | 5.4489 ± 0.0196 | 5.... | |

[12] The Muon-based methods reduce training cross-entropy much faster than AdamW. Among the Muon variants, the adaptive methods remain close to the tuned fixed-Muon baseline throughout training.

[Figure panel (a) Training loss: smoothed train loss versus optimizer step (0 to 1000) for AdamW, Muon, DF-Muon, DA-Muon, and SC-Muon ...]

[13] The fixed-Muon baseline uses η = 0.015, DA-Muon saturates the cap η_max = 0.03, SC-Muon stabilizes around 0.02, and DF-Muon follows a conservative-then-increasing scale trajectory.

Table 9: Fixed-Muon learning-rate diagnostic on ViT-Tiny/CIFAR-100 for one representative seed. CIFAR-100 images are resized to 224×224. Runtime is reported relative to the re...
