On the Convergence Analysis of Muon
Pith reviewed 2026-05-19 13:15 UTC · model grok-4.3
The pith
Muon can outperform gradient descent by benefiting from the low-rank structure of Hessian matrices during neural network training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon achieves improved convergence rates over gradient descent precisely when the Hessian matrices of the loss exhibit low-rank structure, a property that holds under the modeling conditions examined and that matches what is widely observed when training neural networks with matrix parameters.
What carries the argument
Convergence-rate comparison between Muon and gradient descent that isolates the benefit from low-rank Hessian structure.
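To make the two update rules in this comparison concrete, here is a minimal sketch; it is not the paper's implementation. It assumes the commonly described momentum-free form of Muon, in which the gradient matrix G = U S V^T is replaced by the orthogonal factor U V^T. Practical Muon applies this to a momentum buffer and approximates the factor with Newton-Schulz iterations rather than an exact SVD. The names `gd_step`, `orthogonalize`, and `muon_like_step`, and the step size, are illustrative assumptions.

```python
import numpy as np


def gd_step(W, G, lr):
    """Plain gradient descent: move along the raw gradient matrix."""
    return W - lr * G


def orthogonalize(G):
    """Return U @ Vt from the thin SVD G = U S Vt.

    For a full-rank G this is its polar factor: the same singular
    vectors, with every singular value set to 1.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def muon_like_step(W, G, lr):
    """Momentum-free Muon-style step (assumed form): move along U @ Vt."""
    return W - lr * orthogonalize(G)


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))  # stand-in for a gradient matrix

O = orthogonalize(G)
# Singular values of the orthogonalized direction; all equal to 1 here,
# so the step length no longer depends on the scale of G.
sv = np.linalg.svd(O, compute_uv=False)

W_gd = gd_step(W, G, lr=0.1)
W_muon = muon_like_step(W, G, lr=0.1)
```

The design point the sketch isolates is that the orthogonalized step treats all singular directions of the gradient equally, whereas the gradient descent step weights them by their singular values.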
If this is right
- Under the paper's assumptions, Muon's convergence guarantee improves on that of gradient descent when the Hessian is low-rank.
- The advantage scales with the degree of rank deficiency in the Hessian.
- Standard vector-based optimizers ignore this structural property and therefore cannot realize the same improvement.
- The result supplies a concrete criterion for choosing Muon over gradient descent in matrix-parameterized models.
Where Pith is reading between the lines
- Similar low-rank exploitation might be engineered into other matrix-aware optimizers.
- The analysis suggests testing Muon on tasks where Hessian rank can be explicitly controlled, such as low-rank matrix completion or factored models.
- If low-rank Hessians are the main source of Muon's edge, then hybrid methods that detect rank deficiency on the fly could further improve performance.
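One way to build a test problem with an explicitly controlled Hessian rank is sketched below, under assumptions that are not taken from the paper. It uses the quadratic f(W) = 0.5 * tr(W^T A W B) with symmetric positive semidefinite A and B. Its gradient is A W B, and its Hessian with respect to the row-major flattening of W is the Kronecker product of A and B, so the Hessian rank is rank(A) * rank(B). The helper names `random_psd` and `make_problem` are hypothetical.

```python
import numpy as np


def random_psd(dim, rank, rng):
    """Symmetric positive semidefinite matrix of the given size and rank."""
    F = rng.standard_normal((dim, rank))
    return F @ F.T


def make_problem(m, n, rank_a, rank_b, seed=0):
    """Quadratic f(W) = 0.5 * tr(W^T A W B) with a rank-controlled Hessian."""
    rng = np.random.default_rng(seed)
    A = random_psd(m, rank_a, rng)
    B = random_psd(n, rank_b, rng)

    def loss(W):
        return 0.5 * np.trace(W.T @ A @ W @ B)

    def grad(W):
        return A @ W @ B  # valid because A and B are symmetric

    # Hessian acting on W.ravel() (NumPy's default row-major flattening).
    H = np.kron(A, B)
    return loss, grad, H


loss, grad, H = make_problem(m=6, n=5, rank_a=2, rank_b=3)
hessian_rank = np.linalg.matrix_rank(H)  # rank_a * rank_b for generic draws
```

Varying `rank_a` and `rank_b` while keeping `m` and `n` fixed would give a family of problems on which the two optimizers could be compared as the Hessian moves from low rank to full rank.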
Load-bearing premise
Hessian matrices arising in neural network training possess low-rank structure under the conditions studied.
What would settle it
A direct measurement showing that Muon and gradient descent converge at the same rate (or Muon is slower) on a problem where the Hessian has full rank and satisfies the other modeling assumptions.
Original abstract
The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes the convergence of the Muon optimizer, which operates directly on matrix-structured parameters rather than vectorized versions, and compares its rates to those of gradient descent (GD). The central theoretical claim is that Muon can exploit low-rank structure in the Hessian (a phenomenon described as widely observed in neural network training) to achieve faster convergence under certain conditions, while GD does not; this is supported by derived rates and corroborated by experiments.
Significance. If the derivations are correct, the work supplies a concrete theoretical explanation for Muon's observed empirical gains by linking them to low-rank Hessians rather than generic matrix-aware updates. This is a useful contribution to the growing literature on structure-exploiting optimizers, especially since the paper ships explicit rate comparisons and identifies the rank-dependent regime. The result would be more impactful if the low-rank benefit were shown to arise from the dynamics rather than inserted as a modeling assumption.
major comments (2)
- [§4] §4 (Convergence Analysis): The claimed improvement in Muon's contraction factor when the Hessian has rank r ≪ d is obtained by restricting the update to the dominant subspace; however, the analysis does not derive that the trajectory preserves or exploits this low-rank structure from the optimization dynamics. It is introduced as an external modeling choice (see Assumption 3.2), so the link between the low-rank premise and the outperformance condition remains an assumption rather than a derived property.
- [Theorem 4.3] Theorem 4.3 and Corollary 4.4: The comparison to GD shows a rank-dependent gap only after substituting the low-rank Hessian model into Muon's step-size choice; without a separate bound showing that GD cannot similarly benefit from the same low-rank information (or that Muon's matrix update is the only mechanism that captures it), the claim that Muon 'benefits from the low-rank structure' while GD does not is not fully secured.
minor comments (2)
- [§3] Notation for the matrix norm and the projection onto the rank-r subspace is introduced in §3 but used without re-statement in the proofs of §4; adding a short reminder would improve readability.
- [§5] The experimental section reports wall-clock speedups but does not include the condition number or effective rank of the Hessians on the tested models; adding these diagnostics would directly connect the plots to the low-rank regime analyzed in the theory.
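The diagnostics requested in the last minor comment could be computed along the following lines. This is a hedged sketch, not anything from the manuscript. It assumes an explicit symmetric Hessian small enough to decompose, uses the stable rank (squared Frobenius norm divided by squared spectral norm) as one common notion of effective rank, and reports the condition number over the numerically nonzero part of the spectrum. The function name `hessian_diagnostics` and the tolerance are illustrative.

```python
import numpy as np


def hessian_diagnostics(H, rel_tol=1e-10):
    """Spectrum-based summaries of a symmetric matrix H."""
    eigvals = np.linalg.eigvalsh(H)          # real eigenvalues, ascending
    mags = np.sort(np.abs(eigvals))[::-1]    # magnitudes, descending
    top = mags[0]
    nonzero = mags[mags > rel_tol * top]     # drop numerically zero modes
    return {
        "numerical_rank": int(nonzero.size),
        # Stable rank: squared Frobenius norm over squared spectral norm.
        "stable_rank": float(np.sum(mags**2) / top**2),
        # Condition number restricted to the nonzero spectrum.
        "condition_number": float(nonzero[0] / nonzero[-1]),
    }


# Toy check: a 50x50 matrix with three nonzero eigenvalues 10, 1, 0.1.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
spectrum = np.zeros(50)
spectrum[:3] = [10.0, 1.0, 0.1]
H = Q @ np.diag(spectrum) @ Q.T
H = 0.5 * (H + H.T)  # symmetrize against round-off
stats = hessian_diagnostics(H)
```

For real networks the full Hessian is rarely available, so these quantities would in practice have to be estimated from Hessian-vector products; the dense version above only fixes what would be reported.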
Simulated Author's Rebuttal
We are grateful to the referee for the careful reading of our manuscript and the insightful comments. We have prepared point-by-point responses to the major comments and will incorporate revisions as detailed below to improve the clarity and strength of our theoretical claims.
-
Referee: [§4] §4 (Convergence Analysis): The claimed improvement in Muon's contraction factor when the Hessian has rank r ≪ d is obtained by restricting the update to the dominant subspace; however, the analysis does not derive that the trajectory preserves or exploits this low-rank structure from the optimization dynamics. It is introduced as an external modeling choice (see Assumption 3.2), so the link between the low-rank premise and the outperformance condition remains an assumption rather than a derived property.
Authors: We agree that Assumption 3.2 introduces the low-rank Hessian structure as a modeling assumption rather than deriving its preservation from the optimization dynamics. This assumption is motivated by the extensive empirical literature on low-rank Hessians in neural network training, which we reference in the introduction. The contribution of Section 4 is to show the resulting convergence benefit for Muon under this condition. In the revision we will add explicit language in Section 3 and the concluding discussion to clarify the role of the assumption and to identify the derivation of low-rank structure from the dynamics as an open direction for future work. revision: yes
-
Referee: [Theorem 4.3] Theorem 4.3 and Corollary 4.4: The comparison to GD shows a rank-dependent gap only after substituting the low-rank Hessian model into Muon's step-size choice; without a separate bound showing that GD cannot similarly benefit from the same low-rank information (or that Muon's matrix update is the only mechanism that captures it), the claim that Muon 'benefits from the low-rank structure' while GD does not is not fully secured.
Authors: The rate comparison in Theorem 4.3 and Corollary 4.4 is obtained by inserting the low-rank model into the respective convergence bounds, with Muon's matrix-parameter update permitting a step-size choice that operates directly on the dominant subspace. Standard GD is formulated on the vectorized parameters and therefore does not exploit the matrix structure in the same manner. We acknowledge that the manuscript does not supply an auxiliary bound ruling out all possible adaptations of GD. In the revision we will insert a clarifying paragraph after Theorem 4.3 that emphasizes the distinction arising from Muon's matrix-aware mechanism and notes that any comparable benefit for GD would require additional structural assumptions not present in the standard algorithm. revision: yes
Circularity Check
No circularity: analysis derives rates from explicit low-rank assumption without self-reduction
The paper states its central result as a convergence comparison between Muon and GD that holds under the modeling premise of low-rank Hessian structure (an external empirical observation, not derived inside the paper). No quoted step equates a claimed prediction or rate to a fitted quantity, self-citation chain, or definitional tautology; the low-rank condition is inserted as an assumption into the contraction analysis rather than being smuggled in or forced by the optimizer definition itself. The derivation chain therefore remains self-contained against the stated assumptions and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hessian matrices arising in neural network training possess low-rank structure
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training.
-
IndisputableMonolith/Foundation/DimensionForcing.lean, alexander_duality_circle_linking (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
when $H_t$ can be represented by $P_t \otimes Q_t$ and $Q_t$, $P_t$ are relatively low-rank such that $\sum_i \sigma_{p,i}\,\sigma_{q,i} \ll r\,\sigma_{p,1}\,\sigma_{q,1}$, then $J \ll rL$
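A small numerical illustration of the quoted condition follows. It is a sketch under stated assumptions, not the paper's definition of J or L. It takes σ_p,i and σ_q,i to be the descending singular values of P_t and Q_t, takes r to be the number of paired singular values, and only evaluates the two sides of the inequality. It also relies on the standard fact that the singular values of a Kronecker product are the pairwise products of the factors' singular values, so the largest one is σ_p,1 σ_q,1. The function name `paired_spectrum_ratio` is hypothetical.

```python
import numpy as np


def paired_spectrum_ratio(P, Q):
    """Return sum_i s_p[i] * s_q[i] divided by r * s_p[0] * s_q[0],
    where r is the number of paired singular values (an assumption)."""
    s_p = np.linalg.svd(P, compute_uv=False)  # descending order
    s_q = np.linalg.svd(Q, compute_uv=False)
    r = min(s_p.size, s_q.size)
    lhs = np.sum(s_p[:r] * s_q[:r])
    rhs = r * s_p[0] * s_q[0]
    return lhs / rhs


r = 8
fast_decay = np.diag(2.0 ** -np.arange(r))  # singular values 1, 1/2, 1/4, ...
flat = np.eye(r)                            # all singular values equal to 1

ratio_low_rank = paired_spectrum_ratio(fast_decay, fast_decay)
ratio_flat = paired_spectrum_ratio(flat, flat)

# Top singular value of the Kronecker product equals the product of the tops.
top_kron = np.linalg.svd(np.kron(fast_decay, fast_decay), compute_uv=False)[0]
```

With a flat spectrum the ratio is 1, while a rapidly decaying spectrum pushes it well below 1, which is the regime the quoted passage describes.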
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
AMUSE: Anytime Muon with Stable Gradient Evaluation
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
-
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.
-
DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum
DP-Muon adapts matrix-orthogonalized momentum optimization to differential privacy via per-matrix clipping and noise addition, with proofs of inherited privacy and optimization guarantees plus a bias-corrected version...
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
Phases of Muon: When Muon Eclipses SignSGD
On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.
-
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
-
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.
-
Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices
SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.
-
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...
-
On the Convergence of Muon and Beyond
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
-
LionMuon: Alternating Spectral and Sign Descent for Efficient Training
LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...
-
Muon Does Not Converge on Convex Lipschitz Functions
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
-
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
SignMuon: Communication-Efficient Distributed Muon Optimization
SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong em...
-
SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...
-
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
-
Anytime Training with Schedule-Free Spectral Optimization
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
-
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
-
Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered
Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.
-
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
-
Communication-Efficient Gluon in Federated Learning
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
-
RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
-
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
-
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.