hub

Dion: Distributed orthonormal- ized updates.arXiv preprint: 2504.05295

Kwangjun Ahn et al · 2025 · arXiv 2504.05295

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

read on arXiv browse 17 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 method 1

citation-polarity summary

background 2 unclear 1 use method 1

representative citing papers

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

math.OC · 2026-05-18 · unverdicted · novelty 6.0

Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

cs.LG · 2026-05-09 · conditional · novelty 6.0

ZO-MOPI accelerates zeroth-order LLM fine-tuning by applying partial spectral orthogonalization from power iteration inside a momentum-projected subspace to reduce variance and exploit dominant directions.

Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization

cs.LG · 2026-05-07 · conditional · novelty 6.0

Orth-Dion uses QR factorization on the right factor instead of column normalization to eliminate the geometric mismatch in low-rank approximations of spectral optimizers like Muon, achieving O(sqrt(L_r/T)) rate under non-Euclidean smoothness.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Anytime Training with Schedule-Free Spectral Optimization

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · novelty 5.0

Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.

MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accuracy on GPT and LLaMA models with up to 7.3x memory savings.

Communication-Efficient Gluon in Federated Learning

cs.LG · 2026-04-12 · unverdicted · novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

cs.LG · 2025-09-15 · unverdicted · novelty 5.0

Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.

On the Convergence Analysis of Muon

stat.ML · 2025-05-29 · unverdicted · novelty 5.0

Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.

Can Muon Fine-tune Adam-Pretrained Models?

cs.LG · 2026-05-11 · unverdicted · novelty 4.0

Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18

citing papers explorer

Showing 17 of 17 citing papers.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 1
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Muon is Not That Special: Random or Inverted Spectra Work Just as Well cs.LG · 2026-05-11 · unverdicted · none · ref 15
Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.
Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws cs.LG · 2026-05-20 · unverdicted · none · ref 13
The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.
Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise math.OC · 2026-05-18 · unverdicted · none · ref 58
Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.
Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 164
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration cs.LG · 2026-05-09 · conditional · none · ref 1
ZO-MOPI accelerates zeroth-order LLM fine-tuning by applying partial spectral orthogonalization from power iteration inside a momentum-projected subspace to reduce variance and exploit dominant directions.
Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization cs.LG · 2026-05-07 · conditional · none · ref 3
Orth-Dion uses QR factorization on the right factor instead of column normalization to eliminate the geometric mismatch in low-rank approximations of spectral optimizers like Muon, achieving O(sqrt(L_r/T)) rate under non-Euclidean smoothness.
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration cs.LG · 2026-03-30 · unverdicted · none · ref 43
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 39
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered cs.LG · 2026-05-15 · unverdicted · none · ref 30
Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.
Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives math.OC · 2026-05-12 · unverdicted · none · ref 1
Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.
MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 1
MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accuracy on GPT and LLaMA models with up to 7.3x memory savings.
Communication-Efficient Gluon in Federated Learning cs.LG · 2026-04-12 · unverdicted · none · ref 1
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training cs.LG · 2025-09-15 · unverdicted · none · ref 2
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
On the Convergence Analysis of Muon stat.ML · 2025-05-29 · unverdicted · none · ref 2
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
Can Muon Fine-tune Adam-Pretrained Models? cs.LG · 2026-05-11 · unverdicted · none · ref 29
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unreviewed · ref 3

Dion: Distributed orthonormal- ized updates.arXiv preprint: 2504.05295

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer