
arxiv: 2605.31244 · v1 · pith:UKZKARNSnew · submitted 2026-05-29 · 💻 cs.LG · physics.comp-ph

Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail

Pith reviewed 2026-06-28 23:01 UTC · model grok-4.3

classification 💻 cs.LG physics.comp-ph
keywords: neural scaling laws · empirical neural tangent kernel · spectral position · spectral reach · feature learning · training dynamics · loss reduction

The pith

Larger models achieve lower losses by sustaining learning on weaker spectral signals inaccessible to smaller ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural scaling laws show bigger models reach better performance, yet the underlying mechanism has been unclear. The paper introduces spectral position, a measure tracking which eigenvalues of the empirical neural tangent kernel drive ongoing loss reduction. Training shifts from strong eigenmodes into the weaker tail over time. Larger models progress farther into this tail than smaller models, a capacity termed spectral reach. Feature learning supports this by adaptively increasing gradient magnitudes to keep progress alive where fixed representations would stall.

Core claim

Larger models reach further into the spectral tail of the empirical neural tangent kernel than smaller models. This size-dependent spectral reach allows them to keep reducing loss on progressively weaker eigenmodes that remain inaccessible to smaller networks, providing a concrete account of why scaling improves performance.

What carries the argument

Spectral position is a scalable scalar measure that identifies which eigenmodes of the empirical neural tangent kernel currently drive loss reduction. Spectral reach is the model-size-dependent depth into weaker modes that a model attains, as tracked by spectral position.
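As a rough illustration of how such a scalar could be computed on a small problem, the sketch below builds an explicit eNTK from a toy Jacobian and evaluates a spectral-position-style quantity. It assumes the normalisation suggested by the Figure 12 caption (an eigenvalue distribution p_k, squared projection coefficients γ_k^2, and their product). The function name `lnp_components`, the toy Jacobian, and the exact 1/n factors are assumptions made for illustration and are not taken from the paper's implementation.

```python
import numpy as np

def lnp_components(K, r):
    """Split r^T K r into loss, network, and position factors (illustrative normalisation)."""
    n = len(r)
    lam, U = np.linalg.eigh(K)            # eigenvalues ascending, columns are eigenvectors
    lam = np.clip(lam, 0.0, None)         # guard against tiny negative round-off
    p = lam / lam.sum()                   # assumed: p_k = lambda_k / Tr(K)
    gamma2 = (U.T @ r) ** 2 / (r @ r)     # assumed: gamma_k^2 = (u_k . r)^2 / ||r||^2
    chi_loss = (r @ r) / n                # mean squared residual
    chi_net = np.trace(K) / n             # mean kernel diagonal
    chi_pos = float(gamma2 @ p)           # residual-weighted average of p_k
    return chi_loss, chi_net, chi_pos

rng = np.random.default_rng(0)
n, n_params = 16, 64
J = rng.normal(size=(n, n_params)) / np.sqrt(n_params)   # toy Jacobian df/dtheta
K = J @ J.T                                              # explicit eNTK on n samples
r = rng.normal(size=n)                                   # toy residual f(x) - y

chi_loss, chi_net, chi_pos = lnp_components(K, r)
quadratic_form = r @ K @ r / n**2
print(chi_loss * chi_net * chi_pos, quadratic_form)      # the two numbers agree
```

Under these assumptions `chi_pos` becomes small when the residual lies mostly in weak eigenmodes, which is consistent with reading a lower spectral position as deeper progress into the tail.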

If this is right

  • Training moves from dominant eigenmodes into weaker ones as loss decreases.
  • Feature learning sustains gradient flow into the tail while frozen representations stall earlier.
  • Architecture choices that preserve adaptive amplification of gradients should increase spectral reach.
  • Optimizer modifications targeting tail modes could replicate scaling gains without increasing model size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Spectral reach may serve as a diagnostic for whether a given architecture will continue scaling at a fixed compute budget.
  • The same measurement could be applied to kernel methods outside neural networks to test whether tail access is a general requirement for continued improvement.
  • If spectral position can be estimated cheaply, it might enable early stopping rules that predict final performance without full training.

Load-bearing premise

The ordering and magnitudes of eigenvalues of the empirical neural tangent kernel at successive training points correctly identify which directions are currently driving loss reduction.

What would settle it

An experiment that freezes updates along low-eigenvalue directions at the spectral position predicted for each model size and checks whether loss reduction halts at the same final loss value across different model scales.

Figures

Figures reproduced from arXiv: 2605.31244 by Christian Holm, Jonas Scheunemann, Konstantin Nikolaou, Samuel Tovey, Sven Krippendorf.

Figure 1
Scaling laws and spectral position for Llama 2 language models on SimpleStories (left) and next-pixel prediction on CIFAR-5M (right). Following Hoffmann et al. (2022), the compute is approximated as FLOPs = 6ND, where N is the number of model parameters and D is the number of processed tokens/images. (top) Scaling laws emerge as power-law relations between model size and train loss. The dashed lines indica…
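The compute approximation in this caption is simple enough to state as a one-line helper. The sketch below only restates FLOPs = 6ND; the function name `approx_flops` and the example token count are hypothetical and are not values reported for the paper's experiments.

```python
def approx_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute as 6 * N * D."""
    return 6 * n_params * n_tokens

# Hypothetical example values, chosen only to show the arithmetic.
print(f"{approx_flops(7_200_000, 100_000_000):.2e}")
```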
Figure 2
Intuition behind Loss-Network-Position (LNP) decomposition for neural scaling laws. (left) Schematic of training curves of different model sizes: larger models or more data lead to lower final losses. (middle) Decomposition of the loss evolution into linear contribution and corrections. The linear component captures the immediate effect of parameter updates on the loss, while corrections account for stochas…
Figure 3
Validation of LNP-components on random feature models with power-law data spectra. (left) Covariance spectra of Gaussian data for varying spectral decay exponents β ∈ [1.0, 3.0]. Larger β yields more concentrated (lower complexity) spectra. (right) Empirical and theoretical predictions of LNP-components χnet, χloss, and χpos at model initialization as a function of β. All precisely follow the predicted s…
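To make the role of β concrete, the following sketch generates power-law covariance spectra and Gaussian data drawn from them. It assumes eigenvalues proportional to k^(-β), a diagonal covariance, and the normalisation Tr(Σ) = d; the helper names and the use of a participation ratio as a concentration measure are illustrative choices, not the paper's setup.

```python
import numpy as np

def power_law_spectrum(d, beta):
    """Eigenvalues lambda_k proportional to k^(-beta), rescaled so that Tr(Sigma) = d."""
    k = np.arange(1, d + 1, dtype=float)
    lam = k ** (-beta)
    return lam * d / lam.sum()

def sample_gaussian_data(n, lam, rng):
    """Draw n samples x ~ N(0, diag(lam)); a diagonal covariance is assumed for simplicity."""
    return rng.normal(size=(n, len(lam))) * np.sqrt(lam)

def participation_ratio(lam):
    """Effective number of modes, (sum lam)^2 / sum lam^2; smaller means more concentrated."""
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
d = 128
for beta in (1.0, 2.0, 3.0):
    lam = power_law_spectrum(d, beta)
    X = sample_gaussian_data(2000, lam, rng)
    # Columns: beta, effective number of modes, empirical total variance (near d).
    print(beta, round(participation_ratio(lam), 2), round(X.var(axis=0).sum(), 1))
```

The participation ratio shrinks as β grows, which is one way to see the "more concentrated" spectra described in the caption.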
Figure 4
Effect of feature learning on spectral reach: spectral position ratio χpos(random) / χpos(pre-trained) of linear probing (only last layer trained, rest frozen) on pre-trained vs. random backbones for Llama 2 models of varying sizes. Shown are results on SimpleStories and CIFAR-5M datasets. We show averaged values over the last 10% of training to capture the steady-state behavior. Further details on the expe…
Figure 5
Training dynamics of gradient magnitudes χnet = ∥∇θ f∥_F^2 for Llama models of varying sizes on SimpleStories and CIFAR-5M. We compare full training (all layers trained) to linear probing (only last layer trained, rest frozen) on a random backbone.
… learned representations: neither backbone adapts during linear probing. This demonstrates that feature learning shapes the internal structure of representations…
Figure 6
Convergence of the Hutchinson estimator for the squared gradient norm. (Top left) Estimated total squared gradient norm as a function of the number of projections M, with the converged reference at M = 2000 (dashed). Error bars show one standard deviation over 50 runs. (Top right) Coefficient of variation as a function of M with fitted 1/√M decay, on a trained 7.2M-parameter model using 32 training sampl…
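A Hutchinson-style estimate of a squared Frobenius norm can be sketched in a few lines. The example below estimates ∥J∥_F^2 = Tr(J Jᵀ) from vector-Jacobian products with random probes and reports the coefficient of variation over 50 runs. The explicit toy matrix `J`, the choice of Rademacher probes, and the function name `hutchinson_sq_frobenius` are assumptions; the caption does not say which probe distribution or implementation the authors use.

```python
import numpy as np

def hutchinson_sq_frobenius(vjp, n_outputs, n_probes, rng):
    """Estimate ||J||_F^2 = Tr(J J^T) from vector-Jacobian products with Rademacher probes."""
    estimates = np.empty(n_probes)
    for m in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=n_outputs)   # probe with E[v v^T] = I
        g = vjp(v)                                    # J^T v, one backward pass in practice
        estimates[m] = g @ g                          # unbiased sample of Tr(J J^T)
    return estimates.mean()

rng = np.random.default_rng(0)
J = rng.normal(size=(32, 500))            # toy stand-in for a network Jacobian
exact = np.sum(J ** 2)

for M in (10, 100, 1000):
    runs = [hutchinson_sq_frobenius(lambda v: J.T @ v, 32, M, rng) for _ in range(50)]
    cv = np.std(runs) / np.mean(runs)     # coefficient of variation over 50 runs
    # Columns: M, mean estimate relative to the exact value, coefficient of variation.
    print(M, round(np.mean(runs) / exact, 3), round(cv, 4))
```

Because each probe gives an independent unbiased sample, the spread of the averaged estimate is expected to shrink roughly as 1/√M, the decay the caption refers to.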
Figure 7
Test loss scaling laws for Llama language models on SimpleStories (left) and next-pixel prediction on CIFAR-5M (right). Test loss follows power-law relations with model size, similar to the training loss shown in …
Figure 8
Isocompute curves on SimpleStories (top) and CIFAR-5M (bottom). Test loss (left) and spectral position χpos (right) as a function of model size at fixed compute budgets (color-coded FLOP totals). Each curve traces models of different sizes trained with the same total compute. Spectral position qualitatively follows the same isocompute trend as test loss across both datasets, suggesting that spectral reach …
Figure 9
Side-by-side comparison of training loss and spectral position χpos vs. compute for ViT models trained on CIFAR-5M under standard parameterization (left) and muP (right). The left panel shows that the spectral-reach phenomenon extends to image classification: larger ViTs reach lower χpos values throughout training, mirroring the behavior observed in the language and next-pixel-prediction settings. The muP …
Figure 10
Dynamics of all LNP-components for Llama models trained on SimpleStories. Shown are χloss (top left), χnet (top right), χpos (bottom left), and their product χnet · χpos (bottom right) vs. compute for all model sizes. χloss mirrors the loss dynamics; the product χnet · χpos is dominated by the χpos decrease.
… compare the spectra at initialization and after training …
Figure 11
Dynamics of all LNP-components for Llama models trained on CIFAR-5M next-pixel prediction. Shown are χloss (top left), χnet (top right), χpos (bottom left), and their product χnet · χpos (bottom right) vs. compute for all model sizes. χloss mirrors the loss dynamics; the product χnet · χpos is dominated by the χpos decrease.
Figure 12
Explicit eNTK spectral distributions for a 3.8M-parameter Llama model on SimpleStories at initialization (dashed) and after training (solid). (a) Eigenvalue distribution p_k. (b) Squared projection coefficients γ_k^2. (c) Product γ_k^2 p_k. The eNTK is computed in a reduced output space (K = 40 vocabulary indices, S = 256 token positions) on a single sample due to the cost of full materialization. After tra…
Original abstract

Neural scaling laws describe predictable power-law relationships between model size, dataset size, compute, and performance. While these laws guide the development of modern foundation models, the mechanisms underpinning them remain poorly understood, in part due to the absence of scalable analysis tools. To close this gap, we introduce "spectral position": a scalable measure of which eigenvalues of the empirical neural tangent kernel (eNTK) currently drive loss reduction. Applying this measure to scaling experiments, we find that spectral position decreases throughout training: learning shifts from dominant eigenmodes into the spectral tail. Larger models reach further into the tail than smaller models, revealing a size-dependent capacity we call "spectral reach". This suggests why larger models achieve lower losses: they sustain learning on weak spectral signals inaccessible to smaller models. We further identify feature learning as a key enabler of spectral reach. It adaptively amplifies gradient magnitudes as learning advances, sustaining progress where frozen representations stall. This points to concrete interventions through architecture and optimizer design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces 'spectral position' as a scalable measure of which eigenvalues of the empirical neural tangent kernel (eNTK) drive loss reduction during training. It reports that spectral position decreases over the course of training as learning moves into the spectral tail, that larger models exhibit greater 'spectral reach' into weaker modes than smaller models, and that feature learning enables this reach by adaptively amplifying gradients on weak signals. These observations are offered as a mechanistic account of neural scaling laws.

Significance. If the central mapping from eNTK eigenvalue ordering to instantaneous loss-reducing directions can be substantiated, the work would supply a concrete spectral explanation for why larger models achieve lower loss and would suggest targeted interventions in architecture and optimization. The emphasis on feature learning as an enabler is a positive step toward connecting scaling phenomena to representation dynamics.

major comments (2)
  1. [Abstract] Abstract and introduction: no explicit equation or algorithmic definition is given for 'spectral position' (the quantity whose decrease is the central empirical finding). Without the formula relating eNTK eigenvalues at successive training points to the directions of loss reduction, it is impossible to assess whether the metric is independently grounded or tautological with respect to the observed loss curves.
  2. [Abstract] Abstract: the claim that larger models reach further into the tail because they sustain learning on weak spectral signals rests on the unverified assumption that the ordering and magnitudes of eNTK eigenvalues correctly identify the active loss-reducing subspace. No verification (e.g., explicit projection of per-step loss change onto the reported eigen-directions) is described, and this assumption is load-bearing for the 'spectral reach' interpretation, especially once feature learning alters the kernel.
minor comments (2)
  1. [Abstract] Abstract: dataset details, model architectures, number of runs, and error bars are not mentioned, making it difficult to evaluate the robustness of the reported scaling trends.
  2. [Abstract] Abstract: the phrase 'spectral position decreases throughout training' is stated without reference to a figure or table that would allow the reader to see the quantitative trend.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: no explicit equation or algorithmic definition is given for 'spectral position' (the quantity whose decrease is the central empirical finding). Without the formula relating eNTK eigenvalues at successive training points to the directions of loss reduction, it is impossible to assess whether the metric is independently grounded or tautological with respect to the observed loss curves.

    Authors: We agree that the abstract and introduction would benefit from an explicit definition. In Section 3 of the manuscript, spectral position is defined as the expectation of the eigenvalue index under the distribution induced by the normalized loss gradient projected onto the current eNTK eigenbasis. We will revise both the abstract and introduction to include this formula and a brief description of its computation from the eNTK and gradient at each training step. revision: yes

  2. Referee: [Abstract] Abstract: the claim that larger models reach further into the tail because they sustain learning on weak spectral signals rests on the unverified assumption that the ordering and magnitudes of eNTK eigenvalues correctly identify the active loss-reducing subspace. No verification (e.g., explicit projection of per-step loss change onto the reported eigen-directions) is described, and this assumption is load-bearing for the 'spectral reach' interpretation, especially once feature learning alters the kernel.

    Authors: Spectral position is constructed directly from the projection of the instantaneous gradient onto the eNTK eigenbasis, which supplies the grounding that the reported directions are those along which loss is reduced at each step. We nevertheless agree that an explicit additional check—such as measuring the correlation between per-step loss reduction and the change in spectral position—would strengthen the interpretation. We will add this verification analysis to the revised manuscript. revision: partial
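The kind of check the referee asks for can be illustrated on a model where the eNTK is exact. The sketch below takes one small gradient-descent step on a linear model with mean-squared-error loss and compares the measured loss change with the first-order prediction split over eNTK eigenmodes, −η λ_k (u_k · r)^2 / n^2. The toy features, the learning rate, and the loss normalisation are assumptions; this is not the verification analysis the authors say they will add.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_params, lr = 20, 50, 1e-3
Phi = rng.normal(size=(n, n_params)) / np.sqrt(n_params)  # fixed features, so J = Phi
y = rng.normal(size=n)
theta = rng.normal(size=n_params)

def loss(theta):
    """Mean-squared-error loss, L = ||Phi theta - y||^2 / (2n)."""
    r = Phi @ theta - y
    return 0.5 * np.mean(r ** 2)

r = Phi @ theta - y
K = Phi @ Phi.T                                # eNTK of the linear model
lam, U = np.linalg.eigh(K)
per_mode = -lr * lam * (U.T @ r) ** 2 / n**2   # predicted first-order loss change per eigenmode

grad = Phi.T @ r / n                           # gradient of the loss w.r.t. theta
actual = loss(theta - lr * grad) - loss(theta)

print(actual, per_mode.sum())                  # agree up to O(lr^2) corrections
```

For a network with feature learning the kernel changes between steps, so the same comparison would have to be repeated with the eNTK recomputed at each checkpoint, and the size of the gap between the two numbers would itself be informative.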

Circularity Check

0 steps flagged

No circularity: metric definition and empirical application remain independent

full rationale

The paper defines spectral position directly from eNTK eigenvalues as a new scalable measure and reports its observed behavior across model sizes and training. No equations or steps reduce the reported findings (decreasing spectral position, size-dependent reach) to the inputs by construction, no fitted quantities are relabeled as predictions, and no self-citation chains carry the central claims. The derivation is self-contained; the metric functions as an external analysis tool whose application yields non-tautological observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unstated premise that eNTK eigenvalues track loss-reducing directions and on the new definitions of spectral position and spectral reach; no free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: Eigenvalues of the empirical neural tangent kernel identify the directions currently responsible for loss reduction.
    Invoked when defining spectral position from the eNTK.
invented entities (1)
  • spectral reach (no independent evidence)
    Purpose: quantify how far into weak eigenmodes a model can progress as a function of size.
    Newly introduced capacity measure without external validation or independent prediction stated in the abstract.

pith-pipeline@v0.9.1-grok · 5712 in / 1278 out tokens · 24080 ms · 2026-06-28T23:01:18.391705+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Tomasini, Alessandro Favero, and Matthieu Wyart

    doi: 10.1103/PhysRevX.14.031001. URL https://link.aps.org/doi/10.1103/PhysRevX.14.031001. Cagnetta, F., Kang, H., and Wyart, M. Learning curves theory for hierarchically compositional data with power-law distributed features. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=Lw0kC75dY0&referrer...

  2. [2]

    URL https://www.nature.com/articles/s41467-021-23103-1

    doi: 10.1038/s41467-021-23103-1. URL https://www.nature.com/articles/s41467-021-23103-1. Cao, Y., Fang, Z., Wu, Y., Zhou, D.-X., and Gu, Q. Towards Understanding the Spectral Bias of Deep Learning, October 2020. URL http://arxiv.org/abs/1912.01198. Chizat, L., Oyallon, E., and Bach, F. On Lazy Training in Differentiable Programming. In Advances in N...

  3. [3]

    Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J

    URL https://dl.acm.org/doi/10.5555/2503308.2188413. Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. On Kernel-Target Alignment. In Advances in Neural Information Processing Systems, volume 14. MIT Press,

  4. [4]

    cc/paper_files/paper/2001/hash/1f71e393b3809197ed66df836fe833e5-Abstract

    URL https://proceedings.neurips.cc/paper_files/paper/2001/hash/1f71e393b3809197ed66df836fe833e5-Abstract.html. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Sca...

  5. [5]

    doi: 10.1088/2632-2153/ac87e9

    ISSN 2632-2153. doi: 10.1088/2632-2153/ac87e9. URL https://dx.doi.org/10.1088/2632-2153/ac87e9. Lee, J. and Kifer, D. Scaling up Differentially Private Deep Learning with Fast Per-Example Gradient Clipping, September 2020. URL http://arxiv.org/abs/2009.03106. Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington...

  6. [6]

    Maloney, A., Roberts, D

    URL https://openreview.net/forum?id=PH7sdEanXP&referrer=%5Bthe%20profile%20of%20Jingfeng%20Wu%5D(%2Fprofile%3Fid%3D~Jingfeng_Wu1). Maloney, A., Roberts, D. A., and Sully, J. A Solvable Model of Neural Scaling Laws, October 2022. URL http://arxiv.org/abs/2210.16859. McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An Empirical Model of Large-Batc...

  7. [7]

    McKinney, W

    URL http://arxiv.org/abs/1812.06162. McKinney, W. Data Structures for Statistical Computing in Python. In Python in Science Conference, pp. 56–61, Austin, Texas, 2010. doi: 10.25080/Majora-92bf1922-00a. URL https://doi.curvenote.com/10.25080/Majora-92bf1922-00a. Nakkiran, P., Neyshabur, B., and Sedghi, H. The Deep Bootstrap Framework: Good Online Learne...

  8. [8]

    Novak, R., Sohl-Dickstein, J., and Schoenholz, S

    URL http://arxiv.org/abs/1912.02803. Novak, R., Sohl-Dickstein, J., and Schoenholz, S. S. Fast Finite Width Neural Tangent Kernel. In International Conference on Machine Learning, 2022. URL https://github.com/google/neural-tangents. Ortiz-Jimenez, G., Moosavi-Dezfooli, S.-M., and Frossard, P. What can linearized neural networks actually say about general...

  9. [9]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    URL http://arxiv.org/abs/2105.14301. Sharma, U. and Kaplan, J. Scaling Laws from the Data Manifold Dimension. Journal of Machine Learning Research, 23(9):1–34, 2022. ISSN 1533-7928. URL http://jmlr.org/papers/v23/20-1111.html. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S....

  10. [10]

    doi: 10.1088/2632-2153/adee76

    ISSN 2632-2153. doi: 10.1088/2632-2153/adee76. URL https://dx.doi.org/10.1088/2632-2153/adee76. Worschech, R. and Rosenow, B. Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=wFD16gwpze. Yang, G. and ...

  11. [11]

    Zandieh, A., Han, I., Avron, H., Shoham, N., Kim, C., and Shin, J

    URL http://arxiv.org/abs/2109.12298. Zandieh, A., Han, I., Avron, H., Shoham, N., Kim, C., and Shin, J. Scaling Neural Tangent Kernels via Sketching and Random Features. Advances in Neural Information Processing Systems, 34:1062–1073, December 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/hash/08ae6a26b7cb089ea588e94aed36bd15-Abstr...

  12. [12]

    Implementation details can be found in Section B.2.5

    Loss Component (χloss). We construct targets to be standardized such that var(y) = 1. Implementation details can be found in Section B.2.5. Consequently, by the Law of Large Numbers: $\chi_{\text{loss}} = \frac{1}{n}\sum_{i=1}^{n} y_i^2 \xrightarrow{n\to\infty} \mathbb{E}[y^2] = 1. \quad (43)$

  13. [13]

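    Network Component (χnet). The trace term averages the kernel diagonal: $\chi_{\text{net}} = \frac{1}{n}\sum_{i} K(x_i, x_i)$. For ReLU features with normalized weights $w \sim \mathcal{N}(0, d^{-1} I)$ and data covariance $\mathrm{Tr}(\Sigma) = d$: $\mathbb{E}[\chi_{\text{net}}] = \mathbb{E}_{x,w}[\sigma(w^\top x)^2] = \tfrac{1}{2}\,\mathbb{E}[(w^\top x)^2] \cdot 2$ (symmetry of distribution around 0) $= \mathrm{Tr}\big(\mathbb{E}[x x^\top]\,\mathbb{E}[w w^\top]\big) = \mathrm{Tr}\big(\Sigma \cdot \tfrac{1}{d} I\big) = 1. \quad (44)$ Thus, $\chi_{\text{net}} \to 1$.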

  14. [14]

    We adopt a Gaussian Process Teacher setting, where f* ∼ GP(0, K)

    Spectral Position Component (χpos). This component captures how the target energy is distributed across the kernel’s eigenmodes. We adopt a Gaussian Process Teacher setting, where f* ∼ GP(0, K). This implies that the vector of targets on the training set follows y ∼ N(0, K). In the eigenbasis of K, the projection coefficients α_k are independent Gaussian var...
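The Gaussian Process Teacher setting can be checked numerically on a small kernel. The sketch below draws targets y ∼ N(0, K), projects them onto the eigenvectors of K, and compares the empirical covariance of the coefficients with the eigenvalues. It relies on the standard fact that these coefficients are uncorrelated with variance equal to the corresponding eigenvalue; the random-feature kernel, the sizes, and the jitter term are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_features, n_draws = 12, 64, 50_000

Phi = rng.normal(size=(n, n_features)) / np.sqrt(n_features)
K = Phi @ Phi.T                                  # toy kernel matrix on n training points
lam, U = np.linalg.eigh(K)

L = np.linalg.cholesky(K + 1e-12 * np.eye(n))    # small jitter for numerical stability
Y = rng.normal(size=(n_draws, n)) @ L.T          # each row is one draw y ~ N(0, K)
alpha = Y @ U                                    # projection coefficients alpha_k = u_k . y

emp_cov = np.cov(alpha, rowvar=False)
print(np.round(np.diag(emp_cov) / lam, 2))       # variance ratios, expected near 1
off_diag = emp_cov - np.diag(np.diag(emp_cov))
print(np.abs(off_diag).max())                    # cross-covariances, expected near 0
```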