Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Brandon Reagen; Nandan Kumar Jha

arxiv: 2605.21803 · v1 · pith:W6JTPEFJnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-22 08:45 UTC · model grok-4.3

classification 💻 cs.LG

keywords: scaling laws, optimizers, spectral ranks, transformers, feed-forward networks, representation capacity, AdamW, Muon

The pith

The same Transformer architecture realizes different spectral scaling laws under different optimizers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard scaling laws overlook the optimizer as a variable that controls how added model width converts into usable spectral capacity in representations. By measuring eigenspectra through soft and hard spectral ranks in feed-forward layers, it finds that AdamW produces only weak scaling of hard rank on rare-token data while Muon produces near-linear scaling, a more than twofold difference in the exponent. This gap in representation structure remains even when the optimizers reach similar perplexity after extended training. Optimizer-driven shifts in spectral geometry also exceed the effects of architectural changes such as attention rank or positional encoding. The work therefore treats optimization strategy as a primary determinant of how capacity is actually realized during scaling.

Core claim

Holding architecture and width schedule fixed, the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. On rare-token (TAIL) representations, AdamW exhibits weak hard-rank scaling with exponent 0.44, while Muon achieves linear scaling with exponent 1.02. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity yet display sharply different spectral geometry. Hard-soft rank asymmetry further shows that optimizers differ both in the amount of capacity realized and in how that capacity is distributed across eigenmodes. Optimizer-induced spectral shifts often exceed the effects of architectural interventions such as attention rank and positional encoding.
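
As a worked illustration of what the two exponents imply, assume the power-law form hard-rank ∝ width^β used in the fits, written here as $r(w) = c\,w^{\beta}$ (the symbols $r$, $w$, and $c$ are notation introduced for this sketch). The ratio of exponents and the effect of doubling the width are then:

$$
\frac{\beta_{\text{Muon}}}{\beta_{\text{AdamW}}} = \frac{1.02}{0.44} \approx 2.3,
\qquad
\frac{r(2w)}{r(w)} = 2^{\beta} \approx
\begin{cases}
1.36 & \text{for } \beta = 0.44 \ (\text{AdamW}) \\
2.03 & \text{for } \beta = 1.02 \ (\text{Muon})
\end{cases}
$$

Under this form, doubling FFN width would raise the TAIL hard rank by roughly a third with AdamW and roughly double it with Muon.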

What carries the argument

Eigenspectra of feed-forward network representations measured by soft and hard spectral ranks, which quantify the effective dimensionality the optimizer makes available to the model.
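
A minimal sketch of how such ranks could be computed from an activation matrix. This page does not define the two ranks, so the definitions below are assumptions: soft rank is taken as the exponential of the Shannon entropy of the normalized covariance eigenvalues, and hard rank as the participation ratio. The function name `spectral_ranks` and the synthetic data are illustrative only.

```python
import numpy as np


def spectral_ranks(activations: np.ndarray) -> tuple[float, float]:
    """Return (soft_rank, hard_rank) for an (n_tokens, width) activation matrix.

    Assumed definitions (not confirmed by the source page):
      soft rank = exp(Shannon entropy of the normalized eigenvalues)
      hard rank = participation ratio, (sum of eigenvalues)^2 / sum of squares
    """
    # Center the features and form the sample covariance matrix.
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(centered) - 1, 1)
    # eigvalsh handles symmetric matrices; clip tiny negative round-off values.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()  # normalized spectrum, sums to 1
    nonzero = p[p > 0]
    soft_rank = float(np.exp(-(nonzero * np.log(nonzero)).sum()))
    hard_rank = float(1.0 / (p ** 2).sum())
    return soft_rank, hard_rank


# Synthetic data only: isotropic activations versus activations whose
# variance is concentrated in 4 of 64 directions.
rng = np.random.default_rng(0)
iso = rng.normal(size=(4096, 64))
scales = np.full(64, 0.01)
scales[:4] = 1.0
aniso = iso * scales

soft_iso, hard_iso = spectral_ranks(iso)
soft_an, hard_an = spectral_ranks(aniso)
print(f"isotropic:   soft={soft_iso:.1f} hard={hard_iso:.1f}")
print(f"anisotropic: soft={soft_an:.1f} hard={hard_an:.1f}")
```

Because both quantities depend only on the normalized eigenvalues, rescaling the activations leaves them unchanged, and under these assumed definitions the hard rank never exceeds the soft rank.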

If this is right

Matched perplexity does not guarantee matched representation structure or spectral geometry.
Optimizer effects on spectral scaling can be larger than those produced by changes to attention rank or positional encoding.
Representation capacity depends on how an optimizer distributes utilization across eigenmodes, revealed by hard-soft rank differences.
Scaling behavior should be studied with optimizer as an explicit variable rather than a fixed detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Scaling-law predictions could become more accurate by incorporating optimizer-specific exponents for spectral utilization.
Training runs might benefit from selecting optimizers that maximize hard-rank growth on tail tokens to improve generalization at fixed width.
The interaction between optimizer and layer type could be tested to see whether spectral advantages appear outside feed-forward blocks.

Load-bearing premise

That soft and hard spectral ranks computed from feed-forward network representations supply a faithful measure of utilized spectral capacity that remains comparable across different optimizers.

What would settle it

Training identical models with AdamW and Muon across the same width schedule and finding that, at matched perplexity, both show the same hard-rank scaling on rare tokens would contradict the claim of optimizer-specific scaling laws.

Figures

Figures reproduced from arXiv: 2605.21803 by Brandon Reagen, Nandan Kumar Jha.

**Figure 2.** Optimizer-dependent spectral scaling across token-frequency regimes. Soft spectral rank ...

**Figure 3.** Extended AdamW training weakens hard-rank scaling in GPT-2 160M. Hard-rank scaling ...

**Figure 4.** Soft spectral rank (left) and hard spectral rank (right) scaling is shown for TAIL tokens in ...

**Figure 5.** Optimizer-dependent TAIL spectral scaling persists at 350M scale. Soft spectral rank (left) ...

**Figure 6.** Optimizer-induced shifts in spectral-scaling exceed attention-rank shifts in GPT-2 160M. We ...

**Figure 7.** Rényi-family view of optimizer-shaped spectral capacity in GPT-2 350M, with FFN ...

**Figure 8.** Distribution of layer-wise scaling exponents for GPT-2 160M. For each layer ...

**Figure 9.** Depth profiles of layer-wise scaling exponents for GPT-2 160M. Each curve shows ...

**Figure 10.** Hard-rank dynamics for AdamW extended training with GPT-2 160M. Post-activation ...

Original abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral-ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regimes, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: AdamW configurations can match low-rank Dion variants in perplexity, under extended training, while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard–soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding), and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer–architecture co-design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that the same Transformer architecture, with fixed width schedule, realizes different spectral scaling laws under different optimizers. Using soft and hard spectral ranks derived from FFN representations, it reports that AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations while Muon achieves linear scaling (β=1.02), a 2.3× difference; this dissociation persists even when perplexity is matched via extended training, and optimizer-induced shifts exceed those from architectural interventions such as attention rank or positional encodings.

Significance. If the central measurements are robust, the work establishes optimization as an independent axis of representation scaling that is not captured by loss alone. The explicit comparison of optimizer effects against architectural controls, together with the hard–soft rank asymmetry, supplies a concrete empirical basis for optimizer–architecture co-design in large language models.

major comments (3)

[Methods] Methods section: The procedure for extracting FFN activations, computing eigenspectra, and defining the hard-rank threshold (including any normalization or token-sampling controls) is not described in sufficient detail. Without explicit controls ensuring that activation statistics and token distributions are matched across AdamW and Muon runs, it remains possible that observed rank differences arise from optimizer-dependent sparsity or scale rather than from genuine differences in utilized spectral capacity.
[§4] Scaling-law fits (abstract and §4): The reported exponents β=0.44 and β=1.02 are obtained by fitting power laws to the same rank-versus-width observations used to demonstrate the optimizer difference. This makes the scaling laws descriptive summaries rather than independent predictions; the manuscript should clarify whether any out-of-sample validation or theoretical motivation for the functional form was performed.
[Experimental Setup] Experimental controls: While the abstract notes that some AdamW runs were extended to match perplexity, the text does not report whether total training steps, data order, or batch composition were equalized for the TAIL-token spectral measurements. This control is load-bearing for the claim that geometry differences are optimizer-induced rather than artifacts of unequal optimization trajectories.
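
One way the out-of-sample check raised in the second major comment could look, as a rough sketch rather than the authors' protocol: fit the power law on all but the largest width and measure the relative error on the held-out point. The helper name `holdout_relative_error` is hypothetical and both data series are synthetic.

```python
import numpy as np


def holdout_relative_error(widths, ranks) -> tuple[float, float]:
    """Fit rank ~ c * width**beta on all but the largest width.

    Returns (beta, relative error of the prediction at the held-out width).
    """
    w = np.asarray(widths, dtype=float)
    r = np.asarray(ranks, dtype=float)
    order = np.argsort(w)
    w, r = w[order], r[order]
    # Straight-line fit in log-log space on the training widths only.
    beta, log_c = np.polyfit(np.log(w[:-1]), np.log(r[:-1]), deg=1)
    predicted = np.exp(log_c) * w[-1] ** beta
    return float(beta), float(abs(predicted - r[-1]) / r[-1])


# Synthetic illustration only (not measurements from the paper).
widths = np.array([512, 1024, 2048, 4096, 8192])
exact_power_law = 2.0 * widths ** 0.8          # extrapolates cleanly
saturating = widths / (1.0 + widths / 4096.0)  # bends away from a power law

beta_pow, err_pow = holdout_relative_error(widths, exact_power_law)
beta_sat, err_sat = holdout_relative_error(widths, saturating)
print(f"power law:  beta={beta_pow:.2f} held-out error={err_pow:.2e}")
print(f"saturating: beta={beta_sat:.2f} held-out error={err_sat:.2e}")
```

A small held-out error would support the power-law form as predictive over the tested range; a large one would suggest the fitted exponent only summarizes the widths it was fitted on.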

minor comments (2)

[Abstract] Abstract: The numerical values β=0.44 and β=1.02 should be accompanied by standard errors or confidence intervals when first stated.
[Figures] Figures: Plots showing eigenvalue decay or rank-versus-width should include the exact hard-rank threshold used and clearly distinguish optimizer curves with consistent line styles across panels.
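
The standard error requested for β could be obtained from an ordinary least-squares fit in log-log space. A minimal sketch under that assumption, with synthetic data and a hypothetical helper name; it needs at least three widths and treats the log-residuals as independent with equal variance.

```python
import numpy as np


def slope_with_stderr(widths, ranks) -> tuple[float, float]:
    """OLS slope of log(rank) on log(width) and its standard error."""
    x = np.log(np.asarray(widths, dtype=float))
    y = np.log(np.asarray(ranks, dtype=float))
    n = len(x)
    if n < 3:
        raise ValueError("need at least three points to estimate a standard error")
    x_centered = x - x.mean()
    sxx = (x_centered ** 2).sum()
    beta = (x_centered * (y - y.mean())).sum() / sxx
    intercept = y.mean() - beta * x.mean()
    residuals = y - (intercept + beta * x)
    # Unbiased residual variance with n - 2 degrees of freedom.
    residual_var = (residuals ** 2).sum() / (n - 2)
    return float(beta), float(np.sqrt(residual_var / sxx))


# Synthetic illustration only: true exponent 1.0 with multiplicative noise.
rng = np.random.default_rng(0)
widths = 256 * 2 ** np.arange(6)
ranks = widths ** 1.0 * np.exp(rng.normal(scale=0.05, size=widths.size))

beta_hat, beta_se = slope_with_stderr(widths, ranks)
print(f"beta = {beta_hat:.3f} +/- {beta_se:.3f} (one standard error)")
```

With only a handful of widths the standard error is itself poorly estimated, so an interval built from it should use a t-distribution with n - 2 degrees of freedom rather than a normal approximation.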

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will make to improve clarity, reproducibility, and the robustness of our claims.

Referee: [Methods] Methods section: The procedure for extracting FFN activations, computing eigenspectra, and defining the hard-rank threshold (including any normalization or token-sampling controls) is not described in sufficient detail. Without explicit controls ensuring that activation statistics and token distributions are matched across AdamW and Muon runs, it remains possible that observed rank differences arise from optimizer-dependent sparsity or scale rather than from genuine differences in utilized spectral capacity.

Authors: We agree that greater methodological detail is required for reproducibility and to rule out potential confounds. In the revised manuscript we will expand the Methods section with a precise, step-by-step account of FFN activation extraction, eigenspectra computation, the exact definition of the hard-rank threshold (including the eigenvalue cutoff, normalization procedure, and any scaling), and the token-sampling protocol. We will also add explicit verification that mean activation statistics and token-frequency distributions are matched across the AdamW and Muon runs used for the reported comparisons, thereby supporting that the observed spectral differences reflect genuine differences in utilized capacity rather than optimizer-induced sparsity or scale artifacts.
revision: yes
Referee: [§4] Scaling-law fits (abstract and §4): The reported exponents β=0.44 and β=1.02 are obtained by fitting power laws to the same rank-versus-width observations used to demonstrate the optimizer difference. This makes the scaling laws descriptive summaries rather than independent predictions; the manuscript should clarify whether any out-of-sample validation or theoretical motivation for the functional form was performed.

Authors: We acknowledge that the reported exponents are obtained by fitting power laws directly to the rank-versus-width observations presented in the paper and therefore function as descriptive summaries of the empirical trends rather than independent, out-of-sample predictions. The power-law functional form is motivated by prior literature on spectral scaling in neural representations. No out-of-sample validation was performed in the current experiments. In the revision we will explicitly state this descriptive character in §4 and the abstract, supply the relevant theoretical motivation from the spectral-scaling literature, and note the limitation regarding predictive validation.
revision: partial
Referee: [Experimental Setup] Experimental controls: While the abstract notes that some AdamW runs were extended to match perplexity, the text does not report whether total training steps, data order, or batch composition were equalized for the TAIL-token spectral measurements. This control is load-bearing for the claim that geometry differences are optimizer-induced rather than artifacts of unequal optimization trajectories.

Authors: We appreciate the importance of this control. For the perplexity-matched AdamW runs, total training steps were extended while preserving identical data order and batch composition with the corresponding Muon runs; only the number of steps was adjusted to reach perplexity parity. TAIL-token spectral measurements were performed on token samples drawn from the same data distribution and ordering. In the revised manuscript we will add an explicit paragraph in the Experimental Setup section documenting these controls and confirming that the TAIL measurements used matched token subsets.
revision: yes

Circularity Check

1 step flagged

Spectral scaling exponents reduce to power-law fits on measured rank-vs-width data

specific steps

fitted input called prediction [Abstract]
"Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regimes, a 2.3× increase in the scaling exponent."

The β exponents are computed by fitting a power-law form (hard-rank ∝ width^β) to the empirically measured hard spectral ranks collected at multiple widths. The scaling law is therefore a curve fit to the same observations it purports to describe, not an independent prediction from optimizer dynamics or first principles.
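
The fitting step described above can be sketched as a least-squares line in log-log space. This is an assumed procedure (the page does not state which fitting method the authors used), the helper name `fit_power_law` is hypothetical, and the widths and ranks are synthetic.

```python
import numpy as np


def fit_power_law(widths, ranks) -> tuple[float, float]:
    """Least-squares fit of rank ~ c * width**beta in log-log space.

    Returns (beta, c).
    """
    log_w = np.log(np.asarray(widths, dtype=float))
    log_r = np.log(np.asarray(ranks, dtype=float))
    # log(rank) = beta * log(width) + log(c), so the slope is the exponent.
    beta, log_c = np.polyfit(log_w, log_r, deg=1)
    return float(beta), float(np.exp(log_c))


# Synthetic, noise-free illustration (not data from the paper):
# ranks generated from a known exponent so the fit can be checked.
widths = np.array([768, 1536, 3072, 6144])
ranks = 0.5 * widths ** 0.44

beta, c = fit_power_law(widths, ranks)
print(f"recovered beta={beta:.2f}, c={c:.2f}")
```

The exponent recovered this way is a summary of the points it was fitted to, which is the sense in which the audit calls the scaling law descriptive.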

full rationale

The paper's core claim is that AdamW and Muon induce different spectral scaling laws (quantified by hard-rank exponent β) for the same architecture. These β values are obtained by fitting power laws directly to the observed hard spectral rank versus width measurements on TAIL tokens. This makes the reported scaling laws descriptive summaries of the input data rather than independent derivations or predictions. The comparison to architectural interventions adds some external content, but the optimizer-induced scaling result itself is a post-hoc fit. No self-citation chains, self-definitions, or ansatz smuggling were found in the derivation of the scaling exponents.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical spectral measurements and fitted scaling exponents rather than a first-principles derivation; no new entities are postulated.

free parameters (1)

hard-rank scaling exponent β
Fitted separately for each optimizer to the observed rank-versus-width curves on tail tokens.

axioms (1)

domain assumption: Spectral ranks computed from FFN activations measure utilized capacity in a manner comparable across optimizers
Invoked when interpreting differences in β as differences in how effectively added width is utilized.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Using eigenspectra of feed-forward network representations, measured through soft and hard spectral-ranks... AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations... Muon achieves linear scaling (β=1.02)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[2] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), 2022.

[3] Razvan Pascanu, Clare Lyle, Ionut-Vlad Modoranu, Naima Elosegui Borras, Dan Alistarh, Petar Velickovic, Sarath Chandar, Soham De, and James Martens. Optimizers qualitatively alter solutions and we should leverage this. arXiv preprint arXiv:2507.12224, 2025.

[4] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.

[5] Alexandra Volkova, Mher Safaryan, Christoph H Lampert, and Dan Alistarh. Towards robust scaling laws for optimizers. arXiv preprint arXiv:2602.07712, 2026.

[6] Nandan Kumar Jha and Brandon Reagen. Nerve: Nonlinear eigenspectrum dynamics in LLM feed-forward networks. In The Fourteenth International Conference on Learning Representations (ICLR), 2026.

[7] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Empirical Methods in Natural Language Processing (EMNLP), 2021.

[8] Nandan Kumar Jha and Brandon Reagen. Spectral scaling laws in language models: How effectively do feed-forward networks use their latent space? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.

[9] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

[10] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon, 2024.

[11] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.

[12] Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable. arXiv preprint arXiv:2510.05491, 2025.

[13] Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025.

[14] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning (ICML), 2023.

[15] Noah Amsel, Gilad Yehudai, and Joan Bruna. Quality over quantity in attention layers: When adding more heads hurts. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[16] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.

[17] Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander Rudnicky, and Peter Ramadge. Latent positional information is in the self-attention variance of transformer language models without positional embeddings. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[19] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning (ICML), 2018.

[20] Khang Ngo and Siamak Ravanbakhsh. Scaling laws and symmetry, evidence from neural force fields. In The Fourteenth International Conference on Learning Representations (ICLR), 2026.

[21] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning (ICML), 2018.

[22] Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[23] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. In International Conference on Machine Learning (ICML), 2025.

[24] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 15th European Signal Processing Conference, 2007.

[25] Quentin Garrido, Randall Balestriero, Laurent Najman, and Yann Lecun. RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. In International Conference on Machine Learning (ICML), 2023.

[26] Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, and Weiran Huang. Diff-erank: A novel rank-based metric for evaluating large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[27] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In Forty-second International Conference on Machine Learning (ICML), 2025.

[28] Nikolaos Nakis, Panagiotis Promponas, Konstantinos Tsirkas, Katerina Mamali, Eftychia Makri, Leandros Tassiulas, and Nicholas A Christakis. Rank is not capacity: Spectral occupancy for latent graph models. arXiv preprint arXiv:2605.11142, 2026.

[29] Gyu Yeol Kim and Min hwan Oh. Convergence of muon with newton-schulz. In The Fourteenth International Conference on Learning Representations (ICLR), 2026.

[30] Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. In The Fourteenth International Conference on Learning Representations (ICLR), 2026.

[31] Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. In The Fourteenth International Conference on Learning Representations (ICLR), 2026.

[32] Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Kaiqi Huang, Jungong Han, and Guiguang Ding. Re-parameterizing your optimizers rather than architectures. In International Conference on Learning Representations (ICLR), 2023.

[33] Kai Lion, Liang Zhang, Bingcong Li, and Niao He. PoLAR: Polar-decomposed low-rank adapter representation. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

[34] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 1961.

[35] Tim Van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 2014.

[36] Peiran Gao, Eric Trautmann, Byron Yu, Gopal Santhanam, Stephen Ryu, Krishna Shenoy, and Surya Ganguli. A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv, 2017.

[37] Yu Hu and Haim Sompolinsky. The spectrum of covariance matrices of randomly connected recurrent neuronal networks with linear dynamics. PLoS Computational Biology, 2022.

[38] Yi Xie, Stefan Mihalas, and Łukasz Kuśmierz. Slow transition to low-dimensional chaos in heavy-tailed recurrent neural networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

[39] Valeria Ruscio, Umberto Nanni, and Fabrizio Silvestri. What are you sinking? A geometric approach on attention sink. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

[40] Frederik Kunstner and Francis Bach. Scaling laws for gradient descent and sign descent for linear bigram models under Zipf's law. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

[41] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[42] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024.

[43] David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Searching for efficient transformers for language modeling. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[44] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning (ICML), 2023.

[45] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[46] Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[47] Yoav Gelberg, Koshi Eguchi, Takuya Akiba, and Edoardo Cetin. Extending the context of pretrained LLMs by dropping their positional embedding. In The Fourteenth International Conference on Learning Representations (ICLR), 2026.

[48] Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023.

[49] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V Le. Symbolic discovery of optimization algorithms. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.

[50] Jose C Principe. Information theoretic learning: Renyi's entropy and kernel perspectives. Springer Science & Business Media, 2010.

[51] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning (ICML), 2020.

[52] Itay Lavie, Guy Gur-Ari, and Zohar Ringel. Towards understanding inductive bias in transformers: A view from infinity. In Forty-first International Conference on Machine Learning (ICML), 2024.

[53] Lowering the AdamW learning rate to $10^{-4}$ avoids divergence up to PostLN-75, but it reaches only PPL = 106.7, compared with PPL = 40.9 for Muon and PPL = 32.8 for NorMuon. Thus, AdamW can be made stable only by moving to a substantially worse optimization regime, whereas Muon-family optimizers train these partial PostLN configurations at useful perplexity. ...