HiMuon partitions momentum-gradient matrices into T x T tiles, runs independent Newton-Schulz iterations on each tile, and reassembles the results, reducing leading cost to O(H W T K) while defining a local rather than global matrix map.
Variance-Adaptive
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.
citing papers explorer
-
Muon Learns More Robust and Transferable Features than Adam
Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.