pith. sign in

arxiv: 2605.21803 · v1 · pith:W6JTPEFJnew · submitted 2026-05-20 · 💻 cs.LG

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Pith reviewed 2026-05-22 08:45 UTC · model grok-4.3

classification 💻 cs.LG
keywords scaling lawsoptimizersspectral rankstransformersfeed-forward networksrepresentation capacityAdamWMuon
0
0 comments X

The pith

The same Transformer architecture realizes different spectral scaling laws under different optimizers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard scaling laws overlook the optimizer as a variable that controls how added model width converts into usable spectral capacity in representations. By measuring eigenspectra through soft and hard spectral ranks in feed-forward layers, it finds that AdamW produces only weak scaling of hard rank on rare-token data while Muon produces near-linear scaling, a more than twofold difference in the exponent. This gap in representation structure remains even when the optimizers reach similar perplexity after extended training. Optimizer-driven shifts in spectral geometry also exceed the effects of architectural changes such as attention rank or positional encoding. The work therefore treats optimization strategy as a primary determinant of how capacity is actually realized during scaling.

Core claim

Holding architecture and width schedule fixed, the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. On rare-token representations, AdamW exhibits weak hard-rank scaling with exponent 0.44 while Muon achieves linear scaling with exponent 1.02. This difference is not reducible to validation loss, since AdamW runs can match the perplexity of lower-rank configurations yet display sharply different spectral geometry. Hard-soft rank asymmetry further shows that optimizers differ both in the amount of capacity realized and in how that capacity is distributed across eigenmodes. Optimizer-induced spectral shifts frequently exceed,

What carries the argument

Eigenspectra of feed-forward network representations measured by soft and hard spectral ranks, which quantify the effective dimensionality the optimizer makes available to the model.

If this is right

  • Matched perplexity does not guarantee matched representation structure or spectral geometry.
  • Optimizer effects on spectral scaling can be larger than those produced by changes to attention rank or positional encoding.
  • Representation capacity depends on how an optimizer distributes utilization across eigenmodes, revealed by hard-soft rank differences.
  • Scaling behavior should be studied with optimizer as an explicit variable rather than a fixed detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling-law predictions could become more accurate by incorporating optimizer-specific exponents for spectral utilization.
  • Training runs might benefit from selecting optimizers that maximize hard-rank growth on tail tokens to improve generalization at fixed width.
  • The interaction between optimizer and layer type could be tested to see whether spectral advantages appear outside feed-forward blocks.

Load-bearing premise

That soft and hard spectral ranks computed from feed-forward network representations supply a faithful measure of utilized spectral capacity that remains comparable across different optimizers.

What would settle it

Training identical models with AdamW and Muon until both reach the same perplexity and the same hard spectral rank on rare tokens would contradict the claim of optimizer-specific scaling laws.

Figures

Figures reproduced from arXiv: 2605.21803 by Brandon Reagen, Nandan Kumar Jha.

Figure 1
Figure 1. Figure 1: Spectral scaling exponents depend on optimizer choice: Soft (left) and hard spectral rank [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Optimizer-dependent spectral scaling across token-frequency regimes. Soft spectral rank [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Extended AdamW training weakens hard-rank scaling in GPT-2 160M. Hard-rank scaling [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Soft spectral rank (left) and hard spectral rank (right) scaling is shown for TAIL tokens in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Optimizer-dependent TAIL spectral scaling persists at 350M scale. Soft spectral rank (left) [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Optimizer-induced shifts in spectral-scaling exceed attention-rank shifts in GPT-2 160M. We [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Rényi-family view of optimizer-shaped spectral capacity in GPT-2 350M, with FFN [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of layer-wise scaling exponents for GPT-2 160M. For each layer [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Depth profiles of layer-wise scaling exponents for GPT-2 160M. Each curve shows [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Hard-rank dynamics for AdamW extended training with GPT-2 160M. Post-activation [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
read the original abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral-ranks, we find that \emph{the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers}. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling ($\beta$=0.44) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling ($\beta$=1.02) in the same regimes, a $2.3\times$ increase in the scaling exponent. This difference is not reducible to validation loss: AdamW configurations can match low-rank Dion variants in perplexity, under extended training, while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard--soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding), and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer--architecture co-design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that the same Transformer architecture, with fixed width schedule, realizes different spectral scaling laws under different optimizers. Using soft and hard spectral ranks derived from FFN representations, it reports that AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations while Muon achieves linear scaling (β=1.02), a 2.3× difference; this dissociation persists even when perplexity is matched via extended training, and optimizer-induced shifts exceed those from architectural interventions such as attention rank or positional encodings.

Significance. If the central measurements are robust, the work establishes optimization as an independent axis of representation scaling that is not captured by loss alone. The explicit comparison of optimizer effects against architectural controls, together with the hard–soft rank asymmetry, supplies a concrete empirical basis for optimizer–architecture co-design in large language models.

major comments (3)
  1. [Methods] Methods section: The procedure for extracting FFN activations, computing eigenspectra, and defining the hard-rank threshold (including any normalization or token-sampling controls) is not described in sufficient detail. Without explicit controls ensuring that activation statistics and token distributions are matched across AdamW and Muon runs, it remains possible that observed rank differences arise from optimizer-dependent sparsity or scale rather than from genuine differences in utilized spectral capacity.
  2. [§4] Scaling-law fits (abstract and §4): The reported exponents β=0.44 and β=1.02 are obtained by fitting power laws to the same rank-versus-width observations used to demonstrate the optimizer difference. This makes the scaling laws descriptive summaries rather than independent predictions; the manuscript should clarify whether any out-of-sample validation or theoretical motivation for the functional form was performed.
  3. [Experimental Setup] Experimental controls: While the abstract notes that some AdamW runs were extended to match perplexity, the text does not report whether total training steps, data order, or batch composition were equalized for the TAIL-token spectral measurements. This control is load-bearing for the claim that geometry differences are optimizer-induced rather than artifacts of unequal optimization trajectories.
minor comments (2)
  1. [Abstract] Abstract: The numerical values β=0.44 and β=1.02 should be accompanied by standard errors or confidence intervals when first stated.
  2. [Figures] Figures: Plots showing eigenvalue decay or rank-versus-width should include the exact hard-rank threshold used and clearly distinguish optimizer curves with consistent line styles across panels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will make to improve clarity, reproducibility, and the robustness of our claims.

read point-by-point responses
  1. Referee: [Methods] Methods section: The procedure for extracting FFN activations, computing eigenspectra, and defining the hard-rank threshold (including any normalization or token-sampling controls) is not described in sufficient detail. Without explicit controls ensuring that activation statistics and token distributions are matched across AdamW and Muon runs, it remains possible that observed rank differences arise from optimizer-dependent sparsity or scale rather than from genuine differences in utilized spectral capacity.

    Authors: We agree that greater methodological detail is required for reproducibility and to rule out potential confounds. In the revised manuscript we will expand the Methods section with a precise, step-by-step account of FFN activation extraction, eigenspectra computation, the exact definition of the hard-rank threshold (including the eigenvalue cutoff, normalization procedure, and any scaling), and the token-sampling protocol. We will also add explicit verification that mean activation statistics and token-frequency distributions are matched across the AdamW and Muon runs used for the reported comparisons, thereby supporting that the observed spectral differences reflect genuine differences in utilized capacity rather than optimizer-induced sparsity or scale artifacts. revision: yes

  2. Referee: [§4] Scaling-law fits (abstract and §4): The reported exponents β=0.44 and β=1.02 are obtained by fitting power laws to the same rank-versus-width observations used to demonstrate the optimizer difference. This makes the scaling laws descriptive summaries rather than independent predictions; the manuscript should clarify whether any out-of-sample validation or theoretical motivation for the functional form was performed.

    Authors: We acknowledge that the reported exponents are obtained by fitting power laws directly to the rank-versus-width observations presented in the paper and therefore function as descriptive summaries of the empirical trends rather than independent, out-of-sample predictions. The power-law functional form is motivated by prior literature on spectral scaling in neural representations. No out-of-sample validation was performed in the current experiments. In the revision we will explicitly state this descriptive character in §4 and the abstract, supply the relevant theoretical motivation from the spectral-scaling literature, and note the limitation regarding predictive validation. revision: partial

  3. Referee: [Experimental Setup] Experimental controls: While the abstract notes that some AdamW runs were extended to match perplexity, the text does not report whether total training steps, data order, or batch composition were equalized for the TAIL-token spectral measurements. This control is load-bearing for the claim that geometry differences are optimizer-induced rather than artifacts of unequal optimization trajectories.

    Authors: We appreciate the importance of this control. For the perplexity-matched AdamW runs, total training steps were extended while preserving identical data order and batch composition with the corresponding Muon runs; only the number of steps was adjusted to reach perplexity parity. TAIL-token spectral measurements were performed on token samples drawn from the same data distribution and ordering. In the revised manuscript we will add an explicit paragraph in the Experimental Setup section documenting these controls and confirming that the TAIL measurements used matched token subsets. revision: yes

Circularity Check

1 steps flagged

Spectral scaling exponents reduce to power-law fits on measured rank-vs-width data

specific steps
  1. fitted input called prediction [Abstract]
    "Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regimes, a 2.3× increase in the scaling exponent."

    The β exponents are computed by fitting a power-law form (hard-rank ∝ width^β) to the empirically measured hard spectral ranks collected at multiple widths. The scaling law is therefore a curve fit to the same observations it purports to describe, not an independent prediction from optimizer dynamics or first principles.

full rationale

The paper's core claim is that AdamW and Muon induce different spectral scaling laws (quantified by hard-rank exponent β) for the same architecture. These β values are obtained by fitting power laws directly to the observed hard spectral rank versus width measurements on TAIL tokens. This makes the reported scaling laws descriptive summaries of the input data rather than independent derivations or predictions. The comparison to architectural interventions adds some external content, but the optimizer-induced scaling result itself is a post-hoc fit. No self-citation chains, self-definitions, or ansatz smuggling were found in the derivation of the scaling exponents.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical spectral measurements and fitted scaling exponents rather than a first-principles derivation; no new entities are postulated.

free parameters (1)
  • hard-rank scaling exponent β
    Fitted separately for each optimizer to the observed rank-versus-width curves on tail tokens.
axioms (1)
  • domain assumption Spectral ranks computed from FFN activations measure utilized capacity in a manner comparable across optimizers
    Invoked when interpreting differences in β as differences in how effectively added width is utilized.

pith-pipeline@v0.9.0 · 5799 in / 1237 out tokens · 37783 ms · 2026-05-22T08:45:34.931833+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Using eigenspectra of feed-forward network representations, measured through soft and hard spectral-ranks... AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations... Muon achieves linear scaling (β=1.02)

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 4 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  2. [2]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), 2022

  3. [3]

    Optimizers qualitatively alter solutions and we should leverage this.arXiv preprint arXiv:2507.12224, 2025

    Razvan Pascanu, Clare Lyle, Ionut-Vlad Modoranu, Naima Elosegui Borras, Dan Alistarh, Petar Velickovic, Sarath Chandar, Soham De, and James Martens. Optimizers qualitatively alter solutions and we should leverage this.arXiv preprint arXiv:2507.12224, 2025

  4. [4]

    Old Optimizer, New Norm: An Anthology

    Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

  5. [5]

    Towards robust scaling laws for optimizers.arXiv preprint arXiv:2602.07712, 2026

    Alexandra V olkova, Mher Safaryan, Christoph H Lampert, and Dan Alistarh. Towards robust scaling laws for optimizers.arXiv preprint arXiv:2602.07712, 2026

  6. [6]

    Nerve: Nonlinear eigenspectrum dynamics in llm feed- forward networks

    Nandan Kumar Jha and Brandon Reagen. Nerve: Nonlinear eigenspectrum dynamics in llm feed- forward networks. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026

  7. [7]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. InEmpirical Methods in Natural Language Processing (EMNLP), 2021

  8. [8]

    Nandan Kumar Jha and Brandon Reagen. Spectral scaling laws in language models: How effectively do feed-forward networks use their latent space? InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  9. [9]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019

  10. [10]

    Muon: An optimizer for hidden layers in neural networks.URL https://kellerjordan

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks.URL https://kellerjordan. github. io/posts/muon, 2024

  11. [11]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

  12. [12]

    Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491,

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

  13. [13]

    Dion: Distributed Orthonormalized Updates

    Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates.arXiv preprint arXiv:2504.05295, 2025

  14. [14]

    Large language models struggle to learn long-tail knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. InInternational conference on machine learning (ICML), 2023

  15. [15]

    Quality over quantity in attention layers: When adding more heads hurts

    Noah Amsel, Gilad Yehudai, and Joan Bruna. Quality over quantity in attention layers: When adding more heads hurts. InThe Thirteenth International Conference on Learning Representa- tions (ICLR), 2025

  16. [16]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. InNeurocomputing, 2024. 13

  17. [17]

    Latent positional information is in the self-attention variance of transformer language models without positional embeddings

    Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander Rudnicky, and Peter Ramadge. Latent positional information is in the self-attention variance of transformer language models without positional embeddings. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023

  18. [18]

    Adam: A method for stochastic optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR), 2015

  19. [19]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. InInternational Conference on Machine Learning (ICML), 2018

  20. [20]

    Scaling laws and symmetry, evidence from neural force fields

    Khang Ngo and Siamak Ravanbakhsh. Scaling laws and symmetry, evidence from neural force fields. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026

  21. [21]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning (ICML), 2018

  22. [22]

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

  23. [23]

    Training deep learning models with norm-constrained lmos

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and V olkan Cevher. Training deep learning models with norm-constrained lmos. InInternational Conference on Machine Learning (ICML), 2025

  24. [24]

    The effective rank: A measure of effective dimensionality

    Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 15th European signal processing conference, 2007

  25. [25]

    RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank

    Quentin Garrido, Randall Balestriero, Laurent Najman, and Yann Lecun. RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. In International conference on machine learning (ICML), 2023

  26. [26]

    Diff-erank: A novel rank-based metric for evaluating large language models

    Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, and Weiran Huang. Diff-erank: A novel rank-based metric for evaluating large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  27. [27]

    Layer by layer: Uncovering hidden representations in language models

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In Forty-second International Conference on Machine Learning (ICML), 2025

  28. [28]

    Rank Is Not Capacity: Spectral Occupancy for Latent Graph Models

    Nikolaos Nakis, Panagiotis Promponas, Konstantinos Tsirkas, Katerina Mamali, Eftychia Makri, Leandros Tassiulas, and Nicholas A Christakis. Rank is not capacity: Spectral occupancy for latent graph models.arXiv preprint arXiv:2605.11142, 2026

  29. [29]

    Convergence of muon with newton-schulz

    Gyu Yeol Kim and Min hwan Oh. Convergence of muon with newton-schulz. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026

  30. [30]

    Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026

  31. [31]

    Fantastic pretraining optimizers and where to find them

    Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026

  32. [32]

    Re-parameterizing your optimizers rather than architectures

    Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Kaiqi Huang, Jungong Han, and Guiguang Ding. Re-parameterizing your optimizers rather than architectures. InInternational Conference on Learning Representations (ICLR), 2023

  33. [33]

    PoLAR: Polar-decomposed low-rank adapter representation

    Kai Lion, Liang Zhang, Bingcong Li, and Niao He. PoLAR: Polar-decomposed low-rank adapter representation. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 14

  34. [34]

    On measures of entropy and information

    Alfréd Rényi. On measures of entropy and information. InProceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, 1961

  35. [35]

    Rényi divergence and kullback-leibler divergence

    Tim Van Erven and Peter Harremos. Rényi divergence and kullback-leibler divergence. InIEEE Transactions on Information Theory, 2014

  36. [36]

    A theory of multineuronal dimensionality, dynamics and measurement.BioRxiv, 2017

    Peiran Gao, Eric Trautmann, Byron Yu, Gopal Santhanam, Stephen Ryu, Krishna Shenoy, and Surya Ganguli. A theory of multineuronal dimensionality, dynamics and measurement.BioRxiv, 2017

  37. [37]

    The spectrum of covariance matrices of randomly connected recurrent neuronal networks with linear dynamics.PLoS computational biology, 2022

    Yu Hu and Haim Sompolinsky. The spectrum of covariance matrices of randomly connected recurrent neuronal networks with linear dynamics.PLoS computational biology, 2022

  38. [38]

    Slow transition to low-dimensional chaos in heavy-tailed recurrent neural networks

    Yi Xie, Stefan Mihalas, and Łukasz Ku ´smierz. Slow transition to low-dimensional chaos in heavy-tailed recurrent neural networks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  39. [39]

    What are you sinking? a geometric approach on attention sink

    Valeria Ruscio, Umberto Nanni, and Fabrizio Silvestri. What are you sinking? a geometric approach on attention sink. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  40. [40]

    Scaling laws for gradient descent and sign descent for linear bigram models under zipf’s law

    Frederik Kunstner and Francis Bach. Scaling laws for gradient descent and sign descent for linear bigram models under zipf’s law. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  41. [41]

    The fineweb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlí ˇcek, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro V on Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  42. [42]

    modded-nanogpt: Speedrunning the nanogpt baseline, 2024

    Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024

  43. [43]

    Searching for efficient transformers for language modeling

    David So, Wojciech Ma´nke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Searching for efficient transformers for language modeling. InAdvances in neural information processing systems (NeurIPS), 2021

  44. [44]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational Conference on Machine Learning (ICML), 2023

  45. [45]

    What can transformers learn in-context? a case study of simple function classes

    Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. InAdvances in neural information processing systems (NeurIPS), 2022

  46. [46]

    The impact of positional encoding on length generalization in transformers

    Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  47. [47]

    Extending the context of pretrained LLMs by dropping their positional embedding

    Yoav Gelberg, Koshi Eguchi, Takuya Akiba, and Edoardo Cetin. Extending the context of pretrained LLMs by dropping their positional embedding. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026

  48. [48]

    A spectral condition for feature learning

    Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023

  49. [49]

    Symbolic discovery of optimization algorithms

    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V Le. Symbolic discovery of optimization algorithms. InThirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. 15

  50. [50]

    Springer Science & Business Media, 2010

    Jose C Principe.Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010

  51. [51]

    On layer normalization in the transformer architecture

    Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. InInternational Conference on Machine Learning (ICML), 2020

  52. [52]

    Towards understanding inductive bias in transform- ers: A view from infinity

    Itay Lavie, Guy Gur-Ari, and Zohar Ringel. Towards understanding inductive bias in transform- ers: A view from infinity. InForty-first International Conference on Machine Learning (ICML), 2024. 16 Appendix A Experimental Setup 18 A.1 Model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.2 Training protocol . . . . . ....

  53. [53]

    Lowering the AdamW learning rate to 10−4 avoids divergence up to PostLN-75, but it reaches to PPL = 106.7 , compared with PPL = 40.9 for Muon and PPL = 32.8 for NorMuon. Thus, AdamW can be made stable only by moving to a substantially worse optimization regime, whereas Muon-family optimizers train these partial PostLN configurations at useful perplexity. ...