Muown Implicitly Performs Angular Step-size Decay

Antonio Orvieto; Florian H\"ubler; Kai Lion; Niao He

arxiv: 2606.23637 · v1 · pith:MEDFFWVOnew · submitted 2026-06-22 · 💻 cs.LG · math.OC

Muown Implicitly Performs Angular Step-size Decay

Florian H\"ubler , Kai Lion , Antonio Orvieto , Niao He This is my paper

Pith reviewed 2026-06-26 08:30 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords MuonMuownAngularMuownRiemannian optimizationangular step sizetransformer pretrainingoptimizerstep size

0 comments

The pith

Muown's directional updates are equivalent to Riemannian steps on normalized directions, with the un-normalized magnitude only modulating the angular step size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Muown's directional update is equivalent to a Riemannian step on the normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. This equivalence explains the observed step-size stability of Muown. By making the angular step size explicit, the authors propose AngularMuown, which optimizes directly over normalized directions using a schedulable angular multiplier decoupled from the radial magnitude update. AngularMuown shows improved performance over Muown and scales to larger models like Qwen2.

Core claim

The directional update of Muown is equivalent to a Riemannian step on the normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. This explains the step-size stability of Muown and suggests making the angular step size explicit. The resulting method, AngularMuown, optimizes directly over the normalized directions and uses a schedulable angular multiplier decoupled from the radial magnitude update.

What carries the argument

Equivalence of Muown's directional update to a Riemannian step on normalized directions, where the un-normalized magnitude modulates the angular step size.

If this is right

AngularMuown improves over Muown in pre-training Transformers.
AngularMuown leads the per-optimizer category of the modded nanoGPT speedrunning competition.
The method scales to Qwen2-0.5B and 1.1B parameter mixture-of-experts models.
The angular step size becomes schedulable independently from the magnitude update.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Riemannian interpretation could apply to other matrix-aware optimizers that use normalization.
Explicit angular scheduling might benefit optimization in other non-standard geometries.
Testing AngularMuown with various angular decay schedules could yield further gains.
This view may link to broader work on manifold-constrained optimization in neural networks.

Load-bearing premise

The equivalence between Muown's directional update and the Riemannian step holds exactly under the specific normalization and update rules used in Muown.

What would settle it

Direct computation of the update for a small weight matrix under Muown rules, checking whether the resulting direction exactly matches the Riemannian gradient step projected onto the unit sphere.

Figures

Figures reproduced from arXiv: 2606.23637 by Antonio Orvieto, Florian H\"ubler, Kai Lion, Niao He.

**Figure 1.** Figure 1: Validation loss of a 124M Transformer checkpoint-branching pre-trained on FineWebEdu. AngularMuown substantially improves over tuned Muown, NorMuon, and AdamW across the token budget. For over a decade, Adam (Kingma and Ba, 2015) has been the default optimizer for deep learning. Recently, a family of matrix-aware optimizers that exploit the two-dimensional structure of weight matrices has gained tractio… view at source ↗

**Figure 2.** Figure 2: Impact of weight norm on angular update size. For normalized or scale-invariant weights, the direction alone determines the feature the row (or neuron) encodes. Given this observation, the meaningful effective step size is the per-step angular change of that row. While the row-norm does not impact the feature representation, it does impact optimization dynamics by modulating how strongly a fixed Euclide… view at source ↗

**Figure 3.** Figure 3: Analysis on the 124M model on 2.5B tokens. Weight decay is denoted by [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Validation loss across learning rates. (a) Qwen2-0.5B trained by AngularMuown with polynomial angular decay (κt ∝ 1/t), without a decay (κt ≡ 1), and NorMuon. (b) Width transfer for AngularMuown at model widths 512, 768, and 1280. Surprisingly and contrary to previous observations, we observe that AngularMuown actually requires the polynomially decaying stepsize suggested by theory. The proof also works fo… view at source ↗

read the original abstract

Matrix-aware optimizers such as Muon and Muown have recently shown strong empirical performance for pre-training Transformers. In particular, Muown separates each weight matrix into row magnitudes and an un-normalized direction variable, updating the former with Adam and the latter with Muon. We show that the directional update of Muown is equivalent to a Riemannian step on the normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. This explains the step-size stability of Muown and suggests making the angular step size explicit. The resulting method, AngularMuown, optimizes directly over the normalized directions and uses a schedulable angular multiplier decoupled from the radial magnitude update. AngularMuown improves over Muown and, at the time of writing, a preliminary version is leading the per-optimizer category of the modded nanoGPT speedrunning competition. Further experiments on Qwen2-0.5B, and 1.1B parameter mixture-of-experts models confirm the algorithm scales beyond small models. An implementation of the algorithm is available at https://github.com/fhueb/angular-muown

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Muown's directional update matches a Riemannian step on normalized rows exactly, with magnitude only scaling the angular step size, which lets AngularMuown decouple and schedule the angular part directly.

read the letter

The core claim is that Muown's row-magnitude separation plus Muon on the direction is identical to a Riemannian gradient step on the normalized directions, where the un-normalized magnitude only modulates the effective angular step size. This is derived from the update equations rather than observed empirically.

The paper does two things cleanly. It makes the equivalence explicit and then builds AngularMuown, which optimizes the normalized directions outright and exposes an angular multiplier that can be scheduled independently of the radial Adam update. They release code, report that a preliminary version leads the per-optimizer category in the modded nanoGPT competition, and show scaling results on Qwen2-0.5B and 1.1B MoE models.

The load-bearing step is the exactness of the equivalence under Muon's orthogonalization and row-wise normalization. If the derivation holds without hidden approximations, the stability explanation and the motivation for AngularMuown are on solid ground. The experiments are promising but would be stronger with error bars and more ablations on schedule choices; those are standard expectations rather than fatal gaps.

This is for people working on matrix-aware optimizers or large-scale pretraining who want a controllable variant of Muown. A reader who already follows the Muon line of work will get the most out of the geometric reinterpretation and the new algorithm. The paper shows clear engagement with the prior literature and ships reproducible code, so it deserves a serious referee even if the central claim needs careful checking in review.

Referee Report

2 major / 2 minor

Summary. The paper claims that Muown's separation of row magnitudes (updated via Adam) and un-normalized directions (updated via Muon) yields a directional update exactly equivalent to a Riemannian gradient step on the normalized directions, with the un-normalized magnitude serving only to modulate the effective angular step size. This equivalence is used to explain Muown's step-size stability and to motivate AngularMuown, which optimizes directly over normalized directions with an explicit, schedulable angular multiplier decoupled from the radial update. Empirical results show AngularMuown outperforming Muown, leading a nanoGPT speedrunning category, and scaling to Qwen2-0.5B and 1.1B MoE models.

Significance. If the claimed exact equivalence holds, the work supplies a clean theoretical account of why matrix-aware optimizers succeed by separating radial and angular dynamics, plus a practical algorithm (AngularMuown) whose angular step-size schedule is directly controllable. Strengths include the open-source implementation, scaling experiments beyond toy models, and leaderboard results; these elements would make the contribution useful for optimizer research even if the derivation requires minor tightening.

major comments (2)

[main derivation (likely §3)] The central equivalence (directional Muown update ≡ Riemannian step on normalized directions, magnitude only modulating angular step size) is load-bearing for both the stability explanation and the motivation for AngularMuown. The derivation must be shown to be exact rather than asymptotic or conditional on the Newton-Schulz iteration or row-wise normalization; any deviation would weaken the angular-step-size account.
[equivalence proof] The weakest assumption isolated by the reader—that the mapping holds exactly under Muown's specific normalization and update rules—requires an explicit statement of all conditions (including orthogonality enforcement) and a check that no extra assumptions are introduced by the Muon step.

minor comments (2)

[Abstract] The abstract's phrase 'at the time of writing' for the leaderboard position should be replaced by a specific date or version number for reproducibility.
[Experiments] Figure captions and experiment tables would benefit from explicit mention of whether error bars or multiple seeds are reported, given the reader's note on missing error bars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments both concern the rigor and explicitness of the central equivalence derivation in §3. We agree that this derivation is load-bearing and will revise the manuscript to present an exact statement with all conditions enumerated, including the role of orthogonality enforcement. No standing objections remain after these revisions.

read point-by-point responses

Referee: [main derivation (likely §3)] The central equivalence (directional Muown update ≡ Riemannian step on normalized directions, magnitude only modulating angular step size) is load-bearing for both the stability explanation and the motivation for AngularMuown. The derivation must be shown to be exact rather than asymptotic or conditional on the Newton-Schulz iteration or row-wise normalization; any deviation would weaken the angular-step-size account.

Authors: We agree that the equivalence must be presented as exact. In the revision we will rewrite the derivation in §3 to hold exactly under Muown's update rules (row-wise normalization of the direction variable after the Muon step, separate Adam update on magnitudes). The derivation will be stated for the idealized orthogonal update produced by Muon; we will explicitly note that the practical Newton-Schulz iteration approximates this orthogonalization but that any residual error lies in the tangent space and does not alter the angular component. This removes any asymptotic or conditional character from the main claim. revision: yes
Referee: [equivalence proof] The weakest assumption isolated by the reader—that the mapping holds exactly under Muown's specific normalization and update rules—requires an explicit statement of all conditions (including orthogonality enforcement) and a check that no extra assumptions are introduced by the Muon step.

Authors: We will add a short subsection (or a clearly labeled paragraph) immediately following the derivation that enumerates every assumption: (i) each row direction is normalized after the update, (ii) the Muon step produces an update orthogonal to the current direction (orthogonality enforcement), (iii) the magnitude is updated independently by Adam and does not enter the angular dynamics. We will verify that the Muon step, as defined in the paper, introduces no further assumptions beyond the standard Riemannian gradient step on the unit sphere (or Stiefel manifold). This directly addresses the request for an explicit list of conditions. revision: yes

Circularity Check

0 steps flagged

Derivation of Riemannian equivalence is self-contained from update equations; no load-bearing circularity

full rationale

The central claim is a direct algebraic equivalence between Muown's row-magnitude separation + Muon directional update and a Riemannian gradient step on the sphere, with magnitude acting only as an implicit angular multiplier. This identity is obtained by inspecting the given update rules rather than by fitting or by self-citation. AngularMuown then makes the angular multiplier explicit and schedulable, an extension that is not forced by the original definitions. No self-citation chain, fitted-input prediction, or ansatz smuggling is required for the equivalence itself. The provided abstract and reader summary contain no quoted reduction that collapses the claimed result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard Riemannian geometry on the sphere and the existing Muown/Muon update rules; no new free parameters or invented entities are introduced in the abstract. The angular multiplier is a design choice, not a fitted constant.

axioms (1)

domain assumption The directional component of the Muown update can be rewritten exactly as a Riemannian gradient step on the unit sphere.
This is the load-bearing mathematical claim; its validity determines whether the angular-step-size interpretation holds.

pith-pipeline@v0.9.1-grok · 5731 in / 1161 out tokens · 19065 ms · 2026-06-26T08:30:43.566135+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.

Reference graph

Works this paper leans on

12 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Adam: A Method for Stochastic Optimization

Boyd, Stephen and Lieven Vandenberghe (2004).Convex Optimization. Cambridge: Cambridge Uni- versity Press. Absil, P.-A., R. Mahony, and R. Sepulchre (2008).Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press. Kingma, Diederik P. and Jimmy Ba (2015). “Adam: A Method for Stochastic Optimization”. In: Proc. Int. Conf. on Le...

Pith/arXiv arXiv 2004
[2]

Norm matters: efficient and accuratenormalizationschemesindeepnetworks

Proceedings of Machine Learning Research. PMLR, pp. 1842–1850.url:https://proceedings.mlr.press/v80/gupta18a.html. Hoffer, Elad, Ron Banner, Itay Golan, and Daniel Soudry (2018). “Norm matters: efficient and accuratenormalizationschemesindeepnetworks”.In:Advances in Neural Information Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Gra...

2018
[3]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Curran Associates, Inc.url:https : / / proceedings . neurips . cc / paper _ files/paper/2018/file/a0160709701140704575d499c997b6ca-Paper.pdf. Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro (2019). “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. In:arXiv preprint ...

Pith/arXiv arXiv 2018
[4]

GLU Variants Improve Transformer

Proceedings of Machine Learning Research. PMLR, pp. 2260–2268. Shazeer, Noam (2020). “GLU Variants Improve Transformer”. In:arXiv preprint arXiv:2002.05202. Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hen- nigan, Eric Noland,...

Pith/arXiv arXiv 2020
[5]

DeepSeek-V3 Technical Report

Pro- ceedings of Machine Learning Research. PMLR, pp. 25333–25369.url:https://proceedings. mlr.press/v235/kosson24a.html. Liu, Aixin, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. (2024). “DeepSeek-V3 Technical Report”. In:arXiv preprint arXiv:2412.19437. Penedo, Guilherme, Hynek K...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/j.neucom.2023.127063.url:https://www.sciencedirect 2024
[6]

SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling

Proceedings of Machine Learning Research. PMLR, pp. 49069–49104. url:https://proceedings.mlr.press/v267/pethick25a.html. Vyas, Nikhil, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade (2025). “SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling”. In:International Conference on Learning Repre...

Pith/arXiv arXiv 2025
[7]

All models are pre-trained on FineWeb-Edu (Penedo et al.,

implementation augmented with recent architectural improvements such as RoPE (Su et al., 2024), RMSNorm normalization (Zhang and Sennrich, 2019), and SwiGLU activations (Shazeer, 2020). All models are pre-trained on FineWeb-Edu (Penedo et al.,

2024
[8]

with a warmup- stable-decay learning-rate schedule (Hu et al., 2024), and we report perplexity (equivalently vali- dation loss) at a fixed token budget. FollowingMuown, the matrix-aware optimizer acts on all 2D hidden weight matrices (every 2D weight except the token embeddings and the LM head); the remaining parameters are handled by AdamW. ForAngularMuo...

2024
[9]

AngularMuown’s angular schedule is the inverse-polynomialκpoly t = (1+c(t−t w)+)−p, using the default valuesc= 10 −3 andp=

124M learning-rate sweep (Figure 3).We train a124M-parameter model for2.5B tokens and sweep a log-spaced learning-rate grid (Figure 3a), comparingMuon, NorMuon (Li et al., 2025), Muown, andAngularMuownboth with and without its angular schedule.Muonuses weight decayλ= 0.1and all other optimizers useλ= 0, the settings found optimal in theMuownstudy. Angular...

2025
[10]

14 Algorithm 2Idealized DirectionalAngularMuownUpdate Require:U 1 ∈ M,M 0 ←gradL(U 1, ξ1), stepsizeη, momentumβ Mt ←βM t−1 + (1−β)gradL(U t, ξt) Ot ←argmin D∈TUt M,∥D∥S∞ ≤1 ⟨Mt,D⟩▷Tangent-space aware orthogonalization Ut+1 ←Retr Ut(ηOt)▷Retraction at a global batch of 256 (≈1.05M tokens/step) for 30B tokens in total. For the optimizer settings, we mostly ...

2025
[11]

Lemma B.1.For allU∈ MandA∈ T UMwe have ∥A∥F ≤ ∥A∥ U,∗ ≤ ∥A∥ S1 ≤ p min{m, n} ∥A∥ F

Additionally we require the following standard inequalities. Lemma B.1.For allU∈ MandA∈ T UMwe have ∥A∥F ≤ ∥A∥ U,∗ ≤ ∥A∥ S1 ≤ p min{m, n} ∥A∥ F . 15 Now we are ready to provide the descent lemma. Lemma B.2(Descent Lemma).Let Assumptions 1 and 2 hold. Then the iterates generated by Algorithm 2 satisfy TX t=1 ∥gradL(U t)∥Ut,∗ ≤ ∆1 η + ηLT 2 + 2 TX t=1 ∥Mt −...

2020
[12]

16 Combining the above with our initializationM1 =G 1 yields TX t=1 E ∥Et∥S1 ≤ √rσ 1−β + √rσT p 1−β+ ηLT 1−β and thus the claim

implies E   tX τ=2 (1−β)β t−τ(Gτ −gradL(U τ)) S1   ≤ √rE " tX τ=2 (1−β)β t−τ(Gτ −gradL(U τ)) F # ≤ √r(1−β) vuut tX τ=1 β2(t−τ) σ2 ≤ √rσ p 1−β. 16 Combining the above with our initializationM1 =G 1 yields TX t=1 E ∥Et∥S1 ≤ √rσ 1−β + √rσT p 1−β+ ηLT 1−β and thus the claim. Finally we provide the proof for Theorem 3.2. Proof of Theorem 3.2.The proof foll...

2026

[1] [1]

Adam: A Method for Stochastic Optimization

Boyd, Stephen and Lieven Vandenberghe (2004).Convex Optimization. Cambridge: Cambridge Uni- versity Press. Absil, P.-A., R. Mahony, and R. Sepulchre (2008).Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press. Kingma, Diederik P. and Jimmy Ba (2015). “Adam: A Method for Stochastic Optimization”. In: Proc. Int. Conf. on Le...

Pith/arXiv arXiv 2004

[2] [2]

Norm matters: efficient and accuratenormalizationschemesindeepnetworks

Proceedings of Machine Learning Research. PMLR, pp. 1842–1850.url:https://proceedings.mlr.press/v80/gupta18a.html. Hoffer, Elad, Ron Banner, Itay Golan, and Daniel Soudry (2018). “Norm matters: efficient and accuratenormalizationschemesindeepnetworks”.In:Advances in Neural Information Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Gra...

2018

[3] [3]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Curran Associates, Inc.url:https : / / proceedings . neurips . cc / paper _ files/paper/2018/file/a0160709701140704575d499c997b6ca-Paper.pdf. Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro (2019). “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. In:arXiv preprint ...

Pith/arXiv arXiv 2018

[4] [4]

GLU Variants Improve Transformer

Proceedings of Machine Learning Research. PMLR, pp. 2260–2268. Shazeer, Noam (2020). “GLU Variants Improve Transformer”. In:arXiv preprint arXiv:2002.05202. Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hen- nigan, Eric Noland,...

Pith/arXiv arXiv 2020

[5] [5]

DeepSeek-V3 Technical Report

Pro- ceedings of Machine Learning Research. PMLR, pp. 25333–25369.url:https://proceedings. mlr.press/v235/kosson24a.html. Liu, Aixin, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. (2024). “DeepSeek-V3 Technical Report”. In:arXiv preprint arXiv:2412.19437. Penedo, Guilherme, Hynek K...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/j.neucom.2023.127063.url:https://www.sciencedirect 2024

[6] [6]

SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling

Proceedings of Machine Learning Research. PMLR, pp. 49069–49104. url:https://proceedings.mlr.press/v267/pethick25a.html. Vyas, Nikhil, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade (2025). “SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling”. In:International Conference on Learning Repre...

Pith/arXiv arXiv 2025

[7] [7]

All models are pre-trained on FineWeb-Edu (Penedo et al.,

implementation augmented with recent architectural improvements such as RoPE (Su et al., 2024), RMSNorm normalization (Zhang and Sennrich, 2019), and SwiGLU activations (Shazeer, 2020). All models are pre-trained on FineWeb-Edu (Penedo et al.,

2024

[8] [8]

with a warmup- stable-decay learning-rate schedule (Hu et al., 2024), and we report perplexity (equivalently vali- dation loss) at a fixed token budget. FollowingMuown, the matrix-aware optimizer acts on all 2D hidden weight matrices (every 2D weight except the token embeddings and the LM head); the remaining parameters are handled by AdamW. ForAngularMuo...

2024

[9] [9]

AngularMuown’s angular schedule is the inverse-polynomialκpoly t = (1+c(t−t w)+)−p, using the default valuesc= 10 −3 andp=

124M learning-rate sweep (Figure 3).We train a124M-parameter model for2.5B tokens and sweep a log-spaced learning-rate grid (Figure 3a), comparingMuon, NorMuon (Li et al., 2025), Muown, andAngularMuownboth with and without its angular schedule.Muonuses weight decayλ= 0.1and all other optimizers useλ= 0, the settings found optimal in theMuownstudy. Angular...

2025

[10] [10]

14 Algorithm 2Idealized DirectionalAngularMuownUpdate Require:U 1 ∈ M,M 0 ←gradL(U 1, ξ1), stepsizeη, momentumβ Mt ←βM t−1 + (1−β)gradL(U t, ξt) Ot ←argmin D∈TUt M,∥D∥S∞ ≤1 ⟨Mt,D⟩▷Tangent-space aware orthogonalization Ut+1 ←Retr Ut(ηOt)▷Retraction at a global batch of 256 (≈1.05M tokens/step) for 30B tokens in total. For the optimizer settings, we mostly ...

2025

[11] [11]

Lemma B.1.For allU∈ MandA∈ T UMwe have ∥A∥F ≤ ∥A∥ U,∗ ≤ ∥A∥ S1 ≤ p min{m, n} ∥A∥ F

Additionally we require the following standard inequalities. Lemma B.1.For allU∈ MandA∈ T UMwe have ∥A∥F ≤ ∥A∥ U,∗ ≤ ∥A∥ S1 ≤ p min{m, n} ∥A∥ F . 15 Now we are ready to provide the descent lemma. Lemma B.2(Descent Lemma).Let Assumptions 1 and 2 hold. Then the iterates generated by Algorithm 2 satisfy TX t=1 ∥gradL(U t)∥Ut,∗ ≤ ∆1 η + ηLT 2 + 2 TX t=1 ∥Mt −...

2020

[12] [12]

16 Combining the above with our initializationM1 =G 1 yields TX t=1 E ∥Et∥S1 ≤ √rσ 1−β + √rσT p 1−β+ ηLT 1−β and thus the claim

implies E   tX τ=2 (1−β)β t−τ(Gτ −gradL(U τ)) S1   ≤ √rE " tX τ=2 (1−β)β t−τ(Gτ −gradL(U τ)) F # ≤ √r(1−β) vuut tX τ=1 β2(t−τ) σ2 ≤ √rσ p 1−β. 16 Combining the above with our initializationM1 =G 1 yields TX t=1 E ∥Et∥S1 ≤ √rσ 1−β + √rσT p 1−β+ ηLT 1−β and thus the claim. Finally we provide the proof for Theorem 3.2. Proof of Theorem 3.2.The proof foll...

2026