pith. machine review for the scientific record.

arxiv: 2605.13612 · v1 · submitted 2026-05-13 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Recognition: no theorem link

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:35 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords deep learning · feature learning · spectral theory · hierarchical representations · low-degree filtering · gradient descent · neural networks · kernel methods

The pith

Neural Low-Degree Filtering models deep learning as an explicit iterative spectral process in which each layer selects features by maximal low-degree correlation to the label.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Neural LoFi as a stylized limit of gradient-based training that converts hierarchical feature learning into a tractable iterative spectral procedure. In this limit the layers decouple, so each successive layer independently picks the directions offering the strongest accessible low-degree correlation with the label. The resulting mechanism explains how depth progressively assembles complex concepts from simpler ones through low-degree compositionality, and it predicts the sample complexities at which those concepts emerge. Experiments on fully connected and convolutional networks show that the predicted representations improve on random-feature baselines and align with the features discovered early in actual gradient descent.

Core claim

In the stylized limit of gradient-based training, the dynamics at each layer decouple, allowing the next layer to select directions with maximal accessible low-degree correlation to the label and thereby yielding an explicit iterative spectral procedure for building hierarchical representations.

What carries the argument

Neural Low-Degree Filtering (Neural LoFi): an iterative spectral procedure in which each layer, given the current representation, selects directions of maximal low-degree polynomial correlation with the label in a decoupled kernel-space step.
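As a concreteness aid, here is a minimal sketch of what such an iterative spectral procedure could look like in code. This is not the paper's algorithm: the ReLU random-feature lift, the label-signed covariance operator, and the |eigenvalue|-based selection rule are assumptions loosely patterned on the figure descriptions (signed-covariance eigendecomposition, retained features (k1, k2)), and every name and hyperparameter below is hypothetical.

```python
import numpy as np

def random_features(h, p, rng):
    """Lift the current representation h (n x d) to p random ReLU features.
    The paper's actual feature map may differ; this is an illustrative stand-in."""
    d = h.shape[1]
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    return np.maximum(h @ W, 0.0)

def lofi_layer(phi, y, k):
    """One decoupled spectral step: keep the k feature-space directions with the
    largest |eigenvalue| of a label-signed covariance operator.
    Assumption: C = (1/n) Phi^T diag(y) Phi mirrors, but need not equal, the
    operator the paper eigendecomposes."""
    n = phi.shape[0]
    C = (phi.T * y) @ phi / n                  # symmetric, label-weighted
    evals, evecs = np.linalg.eigh(C)
    top = np.argsort(np.abs(evals))[::-1][:k]  # rank by |eigenvalue|
    return evecs[:, top]

def neural_lofi(X, y, widths=(2048, 1024), keep=(64, 32), seed=0):
    """Build the representation layer by layer; each step sees only the
    current representation h and the labels y (the claimed decoupling)."""
    rng = np.random.default_rng(seed)
    h = X
    for p, k in zip(widths, keep):
        phi = random_features(h, p, rng)
        V = lofi_layer(phi, y, k)
        h = phi @ V                            # next-layer representation
    return h                                   # feed to a linear/ridge readout
```

The control flow is the point: no step looks at downstream weights, which is exactly the decoupling the core claim asserts.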

If this is right

  • Representations are built layer by layer through selection of maximal low-degree correlations.
  • Concept emergence occurs at sample complexities governed by the degree of the selected polynomials.
  • Depth enables new features to be constructed from previous ones via low-degree compositionality.
  • The model recovers structured filters and outperforms lazy random-feature baselines on standard architectures.
  • Early gradient-descent features on real datasets align with the layer-wise spectral predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-degree filtering lens could be used to predict how depth requirements scale with the complexity of target functions.
  • Explicitly implementing the spectral selection step might yield new training algorithms that accelerate hierarchical feature discovery.
  • The framework suggests direct comparisons between learned representations and low-degree polynomial kernels at each layer depth (a kernel-alignment sketch follows this list).
  • It offers a route to study why certain data distributions allow shallow networks to suffice while others require many layers.
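One way to act on the third extension above is a layer-by-layer kernel-alignment probe. The sketch below uses centered kernel alignment (CKA), a standard representation-similarity measure; the choice of CKA, the unit-offset polynomial kernel, and the probe activations in layer_reps are our assumptions, not constructions taken from the paper.

```python
import numpy as np

def _center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(K, L):
    """Centered kernel alignment between two Gram matrices."""
    Kc, Lc = _center(K), _center(L)
    return float(np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc)))

def poly_gram(X, degree):
    """Gram matrix of a degree-`degree` polynomial kernel with unit offset."""
    return (X @ X.T + 1.0) ** degree

def layerwise_alignment(layer_reps, X, max_degree=3):
    """Compare each layer's representation (entries of layer_reps, each n x d_l,
    e.g. activations on a probe set) with low-degree polynomial kernels on the
    raw inputs X. Returns a (layers x degrees) alignment table."""
    rows = []
    for H in layer_reps:
        K = H @ H.T
        rows.append([cka(K, poly_gram(X, r)) for r in range(1, max_degree + 1)])
    return np.array(rows)
```

If the low-degree filtering picture holds, one would expect the degree at which alignment peaks to grow with depth; whether it does is an empirical question the paper does not settle here.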

Load-bearing premise

That the gradient dynamics at each layer can be decoupled into independent selections of directions with maximal low-degree correlation to the label.

What would settle it

Training a multi-layer network on data where the learned intermediate representations fail to match the maximal low-degree correlations predicted by the spectral procedure at successive layers.
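A minimal version of this test only needs an overlap statistic between the subspace a trained network actually learns at a given layer and the subspace the spectral procedure predicts. The sketch below uses principal angles; it is one reasonable choice and need not coincide with the normalized overlap the paper defines in App. G.2, and the variable names in the usage comment are hypothetical.

```python
import numpy as np

def subspace_overlap(A, B):
    """Normalized overlap in [0, 1] between the column spans of A and B (p x k).
    1 means identical spans; random k-dimensional subspaces of R^p give ~k/p.
    Computed from principal angles via the SVD of Q_A^T Q_B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)   # cosines of principal angles
    k = min(Qa.shape[1], Qb.shape[1])
    return float(np.sum(s ** 2) / k)

# Usage sketch (names hypothetical): compare first-layer SGD weights at step t
# with the directions the spectral procedure predicts for that layer.
# overlap_t = subspace_overlap(W_sgd_layer1_step_t.T, V_lofi_layer1)
```

A persistent low overlap at some layer, on data where the theory's assumptions nominally hold, would be the falsifying outcome described above.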

Figures

Figures reproduced from arXiv: 2605.13612 by Florent Krzakala, Hugo Tabanelli, Luca Arnaboldi, Matteo Vilucchio, Yatin Dandi.

Figure 1
Figure 1. Neural LoFi versus gradient descent/backpropagation (GD). Test error on binary CIFAR-10 [27] (animals vs. vehicles) for fully connected networks (FCN) and convolutional networks (CNN). We compare Neural LoFi with networks trained by gradient descent/backpropagation, shown for different numbers of training steps. In the low-data regime, and at early training times even with more data, Neural LoFi matches or… view at source ↗
Figure 2
Figure 2. Neural LoFi in a mathematically solvable model: We used data generated by the two-level target Eq. (21), with k = 2, latent dimension d1 = ⌊d^ϵ⌋, ϵ = 1/2, and final readout g⋆(t) = tanh(t), learned by a Neural LoFi approach. For d ∈ {80, 100, 120, 140}, we use first-layer random-feature widths p1 ∈ {20000, 30000, 40000, 50000} and second-layer widths p2 ∈ {512, 768, 1024, 1280}. The final one-dimensio… view at source ↗
Figure 3
Figure 3. Fully connected Neural LoFi on the CIFAR-10 animal-vs.-vehicle task. Left: Test error vs. number of training samples for ridge regression, three-layer random features, and Neural LoFi, with projection dimensions p = 1k and p = 5k. Right: Test error over the number of retained features (k1, k2) in the first two LoFi layers, for different training-set sizes and fixed projection dimension p = 5k. Stars indicat… view at source ↗
Figure 5
Figure 5. Predicting when individual features emerge on CIFAR-10. For a three-layer fully connected Neural LoFi model on the CIFAR-10 animal-vs.-vehicle task, we track the squared overlap |⟨v_i^(n), v_i^(N)⟩|² between eigenvectors estimated from n samples and large-sample reference eigenvectors computed with N = 60,000 samples. Curves show mean ± SEM over 100 random subsamples at fixed random features. Dashed vert… view at source ↗
Figure 4
Figure 4. Layer-wise normalized overlap (see App. G.2) between features learned by SGD at different steps with the Neural LoFi representation, for a four-layer FCN on CIFAR-10. We also compare Neural LoFi with a standard back-propagation approach: The overlap in… view at source ↗
Figure 6
Figure 6. Neural LoFi with convolutional layers on the CIFAR-10 animal-vs.-vehicle task. Neural LoFi recovers the… view at source ↗
Figure 7
Figure 7. Spectral emergence in the Neural LoFi estimator. Spectrum of the first random-feature spectral operator Ĉ1 for the hierarchical solvable model of section 4.1, shown at increasing sample exponents α = log(n)/log(d). Blue histograms display the bulk eigenvalue density, while red triangles indicate the leading d1 eigenvalues in absolute value. As α increases, the leading eigenvalues progressively separate f… view at source ↗
Figure 8
Figure 8. Test error (%) as a function of the kept features for Kernel LoFi for different training dataset sizes… view at source ↗
Figure 9
Figure 9. Test error (%) as a function of the number of retained features… view at source ↗
Figure 10
Figure 10. Convergence of finite-width NLoFi to the NLoFi-NNGP limit as the hidden width… view at source ↗
Figure 11
Figure 11. Predicting when individual features emerge on CIFAR-10 with convolutional networks. Convolutional analog of… view at source ↗
Figure 12
Figure 12. Spectral Distribution: Histograms of eigenvalues across network layers (columns) and dataset sizes (rows). Red markers indicate the five most dominant eigenvalues. The symlog scale reveals the emergence of spectral structure and the separation of lead features from the bulk distribution as n grows. The random feature dimension in this experiment is p = 512 for all layers. view at source ↗
Figure 13
Figure 13. Layer-wise feature importance Ī^(ℓ) (Equation (121)) on CelebA [117] for two binary attributes. (Top, High Cheekbones) Importance concentrates progressively on the cheekbone region across layers, while the chin area, salient at the input, is progressively suppressed in deeper layers. (Bottom, Smiling) Importance remains focused on the mouth and jaw region throughout all layers, reflecting that the discri… view at source ↗
Figure 14
Figure 14. Filters and activations for CNNs. We train a 6 convolutional + 1 fully connected layer neural network on CelebA [117] for binary classification of the "Gender" attribute, using Neural LoFi with signed-covariance eigendecomposition and eigenvalue-based feature selection. (Left) The 5 × 5 first-layer filters learned by Neural LoFi, visualized as RGB images. (Right) The activations of the top-ranked (1st) a… view at source ↗
read the original abstract

Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how emergence of concepts arises with given sample complexity, and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery with real datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Neural Low-Degree Filtering (Neural LoFi) as a stylized limit of gradient-based training in which hierarchical feature learning reduces to an explicit iterative spectral procedure. In this limit the dynamics at each layer decouple, so that the next layer independently selects directions maximizing accessible low-degree correlation to the label given the current representation; the resulting surrogate yields predictions on layer-wise representation selection, sample complexity for concept emergence, and progressive construction of new features from old ones via low-degree compositionality. The theory is supported by mechanistic experiments on fully connected and convolutional architectures showing improvement over lazy random-feature baselines and alignment with early gradient-descent features on real data.

Significance. If the reduction to the stylized limit is valid, Neural LoFi would supply a mathematically explicit, tractable framework for multi-layer feature learning beyond the lazy/NTK regime, with concrete, falsifiable predictions on how depth builds representations through low-degree compositionality and on the sample complexity of concept emergence. Such a surrogate could serve as a useful analytical tool for studying hierarchical learning in a manner that is directly comparable to gradient descent trajectories.

major comments (1)
  1. [Stylized limit and decoupling argument (abstract and main derivation section)] The decoupling of layer-wise dynamics is load-bearing for the central claim that Neural LoFi is a direct reduction of gradient flow rather than an additional modeling assumption. The manuscript states that 'the dynamics at each layer decouple' in the stylized limit, yet provides no explicit derivation showing how the back-propagated gradient or feature-map Jacobian becomes block-diagonal or timescale-separated; without this step the iterative spectral procedure remains conjectural.
minor comments (1)
  1. [Experiments section] Quantitative details of the mechanistic experiments (exact metrics for alignment with early GD features, data-exclusion rules, and baseline hyper-parameter choices) are only summarized; including these in the main text or appendix would allow readers to assess the strength of the empirical support.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We appreciate the recognition of Neural LoFi's potential as a tractable surrogate for hierarchical feature learning. We address the major comment on the decoupling argument below and will revise the manuscript to strengthen this aspect.

read point-by-point responses
  1. Referee: [Stylized limit and decoupling argument (abstract and main derivation section)] The decoupling of layer-wise dynamics is load-bearing for the central claim that Neural LoFi is a direct reduction of gradient flow rather than an additional modeling assumption. The manuscript states that 'the dynamics at each layer decouple' in the stylized limit, yet provides no explicit derivation showing how the back-propagated gradient or feature-map Jacobian becomes block-diagonal or timescale-separated; without this step the iterative spectral procedure remains conjectural.

    Authors: We agree that an explicit derivation is necessary to substantiate the claim that decoupling emerges directly from the stylized limit. In the revised manuscript we will add a dedicated subsection to the main derivation that derives the block-diagonal structure of the effective dynamics. Under the stylized-limit assumptions (infinite width, layer-wise learning-rate scaling, and separation of timescales), the back-propagated gradient through the feature-map Jacobian becomes block-diagonal because cross-layer feature correlations vanish in the limit and the low-degree filtering property enforces orthogonality between successive representations. This step-by-step derivation will show that each layer's update depends only on the current representation and the label, confirming that the iterative spectral procedure is a reduction of gradient flow rather than an extra modeling assumption. revision: yes
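To make the structure of that argument visible, here is a schematic of the claimed reduction in notation that is ours rather than the manuscript's; the vanishing of the cross-layer coupling is precisely the stylized-limit assumption under discussion, not a derived fact.

```latex
% Gradient flow on layer \ell of a depth-L network f = f_L \circ \cdots \circ f_1,
% with h_\ell = \phi(W_\ell h_{\ell-1}) and backpropagated error \delta_\ell:
\delta_\ell \;=\; \phi'\!\big(W_\ell h_{\ell-1}\big) \odot
      \Big(\frac{\partial f}{\partial h_\ell}\Big)^{\!\top} \partial_{\hat y}\,\mathcal{L}(\hat y, y),
\qquad
\dot W_\ell \;=\; -\,\eta_\ell\; \mathbb{E}_{(x,y)}\!\big[\,\delta_\ell\, h_{\ell-1}^{\top}\big].

% Stylized-limit assumption (the decoupling at issue): the downstream Jacobian
% \partial f / \partial h_\ell stops coupling the layers, so the effective update
% depends only on the current representation and the label,
\dot W_\ell \;\approx\; -\,\eta_\ell\; \mathbb{E}_{(x,y)}\!\big[\, y\, \psi(h_{\ell-1})\, h_{\ell-1}^{\top}\big],

% that is, a label-correlation operator built from (low-degree functions of)
% h_{\ell-1} alone: one spectral selection step per layer.
```

Whether the promised derivation actually delivers this block-diagonal structure is the open point the referee flags.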

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained within stylized limit definition

full rationale

The paper defines Neural LoFi explicitly as a stylized limit of gradient-based training in which layer dynamics are stated to decouple, yielding an iterative spectral procedure by construction of that limit. No equations or steps are shown reducing to fitted inputs, self-citations, or prior ansatzes from the same authors; the decoupling and selection rule are presented as consequences of the limit rather than independently verified reductions. The framework remains an assumption-based surrogate whose predictions are compared to experiments, without the central claim collapsing to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that layer dynamics decouple in the stylized training limit; the abstract quantifies no free parameters, and the one invented entity is the Neural LoFi framework itself.

axioms (1)
  • domain assumption: Dynamics at each layer decouple in the stylized limit of gradient-based training
    Stated directly in the abstract as the basis for the iterative spectral selection procedure.
invented entities (1)
  • Neural Low-Degree Filtering (Neural LoFi): no independent evidence
    purpose: Stylized surrogate model for hierarchical feature learning
    Newly introduced framework whose independent evidence is limited to the described experiments.

pith-pipeline@v0.9.0 · 5509 in / 1308 out tokens · 55252 ms · 2026-05-14T19:35:40.881672+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

123 extracted references · 16 canonical work pages · 1 internal anchor

  1. [1]

    Deep learning

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015

  2. [2]

    The unreasonable effectiveness of deep learning in artificial intelligence

    Terrence J Sejnowski. The unreasonable effectiveness of deep learning in artificial intelligence. Proceedings of the National Academy of Sciences, 117(48):30033–30038, 2020

  3. [3]

    Visualizing and understanding convolutional networks

    Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014

  4. [4]

    How transferable are features in deep neural networks?

    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

  5. [5]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024

  6. [6]

    On lazy training in differentiable programming

    Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019

  7. [7]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018

  8. [8]

    Wide neural networks of any depth evolve as linear models under gradient descent

    Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019

  9. [9]

    A mean field view of the landscape of two-layer neural networks

    Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018

  10. [10]

    On the global convergence of gradient descent for over-parameterized models using optimal transport

    Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018

  11. [11]

    Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error

    Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. stat, 1050:22, 2018

  12. [12]

    Mean field analysis of neural networks: A law of large numbers

    Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020

  13. [13]

    Tuning large neural networks via zero-shot hyperparameter transfer

    Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34:17084–17097, 2021

  14. [14]

    Training integrable parameterizations of deep neural networks in the infinite-width limit

    Karl Hajjar, Lénaïc Chizat, and Christophe Giraud. Training integrable parameterizations of deep neural networks in the infinite-width limit. Journal of Machine Learning Research, 25(196):1–130, 2024

  15. [15]

    Self-consistent dynamical field theory of kernel evolution in wide neural networks

    Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems, 35:32240–32256, 2022

  16. [16]

    A statistical mechanics framework for bayesian deep neural networks beyond the infinite-width limit

    Rosalba Pacelli, Sebastiano Ariosto, Mauro Pastore, Francesco Ginelli, Marco Gherardi, and Pietro Rotondo. A statistical mechanics framework for bayesian deep neural networks beyond the infinite-width limit. Nature Machine Intelligence, 5(12):1497–1507, 2023

  17. [17]

    When do neural networks outperform kernel methods?

    Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020

  18. [18]

    Learning single-index models with shallow neural networks

    Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. Advances in neural information processing systems, 35:9768–9783, 2022

  19. [19]

    Computational-statistical gaps in gaussian single-index models

    Alex Damian, Loucas Pillaud-Vivien, Jason Lee, and Joan Bruna. Computational-statistical gaps in gaussian single-index models. In The Thirty Seventh Annual Conference on Learning Theory, pages 1262–1262. PMLR, 2024

  20. [20]

    Fundamental computational limits of weak learnability in high-dimensional multi-index models

    Emanuele Troiani, Yatin Dandi, Leonardo Defilippis, Lenka Zdeborová, Bruno Loureiro, and Florent Krzakala. Fundamental computational limits of weak learnability in high-dimensional multi-index models. In The 28th International Conference on Artificial Intelligence and Statistics, 2025

  21. [21]

    How transformers learn structured data: Insights from hierarchical filtering

    Jerome Garnier-Brun, Marc Mezard, Emanuele Moscato, and Luca Saglietti. How transformers learn structured data: Insights from hierarchical filtering. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 18831–18847. PMLR, 13–19 Jul 2025

  22. [22]

    How deep neural networks learn compositional data: The random hierarchy model

    Francesco Cagnetta, Leonardo Petrini, Umberto M Tomasini, Alessandro Favero, and Matthieu Wyart. How deep neural networks learn compositional data: The random hierarchy model. Physical Review X, 14(3):031001, 2024

  23. [23]

    Locality defeats the curse of dimensionality in convolutional teacher-student scenarios

    Alessandro Favero, Francesco Cagnetta, and Matthieu Wyart. Locality defeats the curse of dimensionality in convolutional teacher-student scenarios. Advances in Neural Information Processing Systems, 34:9456–9467, 2021

  24. [24]

    How compositional generalization and creativity improve as diffusion models are trained

    Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, and Matthieu Wyart. How compositional generalization and creativity improve as diffusion models are trained. arXiv preprint arXiv:2502.12089, 2025

  25. [25]

    The computational advantage of depth in learning high-dimensional hierarchical targets

    Yatin Dandi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The computational advantage of depth in learning high-dimensional hierarchical targets. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  26. [26]

    Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining

    Yunwei Ren, Yatin Dandi, Florent Krzakala, and Jason D. Lee. Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining. arXiv preprint arXiv:2601.19756, 2026

  27. [27]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  28. [28]

    Support Vector Machines

    Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008

  29. [29]

    Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond

    Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002

  30. [30]

    Emergent abilities of large language models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022, 2022

  31. [31]

    Pretraining task diversity and the emergence of non-bayesian in-context learning for regression

    Allan Raventós, Mansheej Paul, Feng Chen, and Surya Ganguli. Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. Advances in neural information processing systems, 36:14228–14246, 2023

  32. [32]

    A theory for emergence of complex skills in language models

    Sanjeev Arora and Anirudh Goyal. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023

  33. [33]

    Are emergent abilities of large language models a mirage?

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? Advances in neural information processing systems, 36:55565–55581, 2023

  34. [34]

    Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices

    Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697, 2005

  35. [35]

    The eigenvalue spectrum of a large symmetric random matrix

    Samuel F Edwards and Raymund C Jones. The eigenvalue spectrum of a large symmetric random matrix. Journal of Physics A: Mathematical and General, 9(10):1595–1603, 1976

  36. [36]

    On the performance of kernel classes

    Shahar Mendelson. On the performance of kernel classes. Journal of Machine Learning Research, 4:759–771, 2003

  37. [37]

    Local rademacher complexities

    Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005

  38. [38]

    Optimal rates for the regularized least-squares algorithm

    Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007

  39. [39]

    Generalization properties of learning with random features

    Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, volume 30, 2017

  40. [40]

    Optimal scaling laws in learning hierarchical multi-index models

    Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, and Antoine Maillard. Optimal scaling laws in learning hierarchical multi-index models. arXiv preprint arXiv:2602.05846, 2026

  41. [41]

    Deep learning of compositional targets with hierarchical spectral methods

    Hugo Tabanelli, Yatin Dandi, Luca Pesce, and Florent Krzakala. Deep learning of compositional targets with hierarchical spectral methods. arXiv preprint arXiv:2602.10867, 2026

  42. [42]

    Provable guarantees for nonlinear feature learning in three-layer neural networks

    Eshaan Nichani, Alex Damian, and Jason D. Lee. Provable guarantees for nonlinear feature learning in three-layer neural networks. In Advances in Neural Information Processing Systems, volume 36, pages 10828–10875, 2023

  43. [43]

    Learning hierarchical polynomials with three-layer neural networks

    Zihao Wang, Eshaan Nichani, and Jason D. Lee. Learning hierarchical polynomials with three-layer neural networks. In The Twelfth International Conference on Learning Representations, 2024

  44. [44]

    Learning hierarchical polynomials of multiple nonlinear features

    Hengyu Fu, Zihao Wang, Eshaan Nichani, and Jason D. Lee. Learning hierarchical polynomials of multiple nonlinear features. In The Thirteenth International Conference on Learning Representations, 2025

  45. [45]

    When and why are deep networks better than shallow ones?

    Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. When and why are deep networks better than shallow ones? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, February 2017

  46. [46]

    Benefits of depth in neural networks

    Matus Telgarsky. Benefits of depth in neural networks. In Proceedings of the 29th Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517–1539. PMLR, June 2016

  47. [47]

    Deriving neural scaling laws from the statistics of natural language

    Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart. Deriving neural scaling laws from the statistics of natural language. arXiv preprint arXiv:2602.07488, 2026

  48. [48]

    The renormalization group and critical phenomena

    Kenneth G Wilson. The renormalization group and critical phenomena. Reviews of Modern Physics, 55(3):583, 1983

  49. [49]

    Spectral clustering of graphs with the bethe hessian

    Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral clustering of graphs with the bethe hessian. Advances in neural information processing systems, 27, 2014

  50. [50]

    Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions

    Valentina Ros, Gérard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019

  51. [51]

    Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models.Advances in neural information processing systems, 32, 2019

    Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models. Advances in neural information processing systems, 32, 2019

  52. [52]

    Marvels and pitfalls of the langevin algorithm in noisy high-dimensional inference

    Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Marvels and pitfalls of the langevin algorithm in noisy high-dimensional inference. Physical Review X, 10(1):011057, 2020

  53. [53]

    Phase transitions of spectral initialization for high-dimensional non-convex estimation

    Yue M Lu and Gen Li. Phase transitions of spectral initialization for high-dimensional non-convex estimation. Information and Inference: A Journal of the IMA, 9(3):507–541, 2020

  54. [54]

    Fundamental limits of weak recovery with applications to phase retrieval

    Marco Mondelli and Andrea Montanari. Fundamental limits of weak recovery with applications to phase retrieval. In Conference On Learning Theory, pages 1445–1450. PMLR, 2018

  55. [55]

    Phase retrieval in high dimensions: Statistical and computational phase transitions

    Antoine Maillard, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Phase retrieval in high dimensions: Statistical and computational phase transitions. Advances in Neural Information Processing Systems, 33:11071–11082, 2020

  56. [56]

    Asymptotics of non-convex generalized linear models in high-dimensions: A proof of the replica formula

    Matteo Vilucchio, Yatin Dandi, Matéo Pirio Rossignol, Cédric Gerbelot, and Florent Krzakala. Asymptotics of non-convex generalized linear models in high-dimensions: A proof of the replica formula. arXiv preprint arXiv:2502.20003, 2025

  57. [57]

    The role of the time-dependent hessian in high-dimensional optimization

    Tony Bonnaire, Giulio Biroli, and Chiara Cammarota. The role of the time-dependent hessian in high-dimensional optimization. Journal of Statistical Mechanics: Theory and Experiment, 2025(8):083401, 2025

  58. [58]

    Neural networks learn generic multi-index models near information-theoretic limit

    Bohan Zhang, Zihao Wang, Hengyu Fu, and Jason D. Lee. Neural networks learn generic multi-index models near information-theoretic limit. arXiv preprint arXiv:2511.15120, 2025

  59. [59]

    Phase transitions for feature learning in neural networks

    Andrea Montanari and Zihao Wang. Phase transitions for feature learning in neural networks. arXiv preprint arXiv:2602.01434, 2026

  60. [60]

    Online stochastic gradient descent on non-convex losses from high-dimensional inference

    Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. Journal of Machine Learning Research, 22(106):1–51, 2021

  61. [61]

    Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics

    Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics. In The Thirty Sixth Annual Conference on Learning Theory, pages 2552–2623. PMLR, 2023

  62. [62]

    The generative leap: Tight sample complexity for efficiently learning gaussian multi-index models

    Alex Damian, Jason D. Lee, and Joan Bruna. The generative leap: Tight sample complexity for efficiently learning gaussian multi-index models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  63. [63]

    Tensor principal component analysis via sum-of-square proofs

    Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006. PMLR, 2015

  64. [64]

    Statistical and computational phase transitions in spiked tensor estimation

    Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Statistical and computational phase transitions in spiked tensor estimation. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 511–515. IEEE, 2017

  65. [65]

    The kikuchi hierarchy and tensor pca

    Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The kikuchi hierarchy and tensor pca. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1446–1468. IEEE, 2019

  66. [66]

    Algorithmic thresholds for tensor pca

    Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor pca. The Annals of Probability, 48(4):2052–2087, 2020

  67. [67]

    The benefits of reusing batches for gradient descent in two-layer networks: breaking the curse of information and leap exponents

    Yatin Dandi, Emanuele Troiani, Luca Arnaboldi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The benefits of reusing batches for gradient descent in two-layer networks: breaking the curse of information and leap exponents. In Proceedings of the 41st International Conference on Machine Learning, pages 9991–10016, 2024

  68. [68]

    The staircase property: How hierarchical structure can guide deep learning

    Emmanuel Abbe, Enric Boix-Adsera, Matthew S Brennan, Guy Bresler, and Dheeraj Nagaraj. The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989–27002, 2021

  69. [69]

    Stochastic gradient descent in high dimensions for multi-spiked tensor pca

    Gérard Ben Arous, Cédric Gerbelot, and Vanessa Piccolo. Stochastic gradient descent in high dimensions for multi-spiked tensor pca. arXiv preprint arXiv:2410.18162, 2024

  70. [70]

    Sliding down the stairs: how correlated latent variables accelerate learning with neural networks

    Lorenzo Bardone, Sebastian Goldt, et al. Sliding down the stairs: how correlated latent variables accelerate learning with neural networks. In International Conference on Machine Learning, volume 235, pages 3024–3045, 2024

  71. [71]

    Computational thresholds in multi-modal learning via the spiked matrix-tensor model

    Hugo Tabanelli, Pierre Mergny, Lenka Zdeborová, and Florent Krzakala. Computational thresholds in multi-modal learning via the spiked matrix-tensor model. arXiv preprint arXiv:2506.02664, 2025

  72. [72]

    Matrix completion has no spurious local minimum

    Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. Advances in neural information processing systems, 29, 2016

  73. [73]

    Spurious valleys in one-hidden-layer neural network optimization landscapes

    Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1–34, 2019

  74. [74]

    Optimization and generalization of shallow neural networks with quadratic activation functions

    Stefano Sarao Mannelli, Eric Vanden-Eijnden, and Lenka Zdeborová. Optimization and generalization of shallow neural networks with quadratic activation functions. Advances in Neural Information Processing Systems, 33:13445–13455, 2020

  75. [75]

    Geometry and optimization of shallow polynomial networks

    Yossi Arjevani, Joan Bruna, Joe Kileel, Elzbieta Polak, and Matthew Trager. Geometry and optimization of shallow polynomial networks. SIAM Journal on Applied Algebra and Geometry, 10(2):174–209, 2026

  76. [76]

    The nuclear route: Sharp asymptotics of erm in overparameterized quadratic networks

    Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, and Florent Krzakala. The nuclear route: Sharp asymptotics of erm in overparameterized quadratic networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  77. [77]

    Scaling laws and spectra of shallow neural networks in the feature learning regime

    Leonardo Defilippis, Yizhou Xu, Julius Girardin, Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Bruno Loureiro, and Florent Krzakala. Scaling laws and spectra of shallow neural networks in the feature learning regime. In The Fourteenth International Conference on Learning Representations, 2026

  78. [78]

    Inductive bias and spectral properties of single-head attention in high dimensions

    Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Florent Krzakala, and Lenka Zdeborová. Inductive bias and spectral properties of single-head attention in high dimensions. arXiv preprint arXiv:2509.24914, 2025

  79. [79]

    Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning

    Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021

  80. [80]

    Kymatio: Scattering transforms in python

    Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim Andén, Eugene Belilovsky, et al. Kymatio: Scattering transforms in python. Journal of Machine Learning Research, 21(60):1–6, 2020

Showing first 80 references.