pith. machine review for the scientific record.

arxiv: 2605.13612 · v1 · submitted 2026-05-13 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Recognition: no theorem link

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:35 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords deep learning · feature learning · spectral theory · hierarchical representations · low-degree filtering · gradient descent · neural networks · kernel methods

The pith

Neural Low-Degree Filtering models deep learning as an explicit iterative spectral process in which each layer selects features by maximal low-degree correlation to the label.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Neural LoFi as a stylized limit of gradient-based training that converts hierarchical feature learning into a tractable iterative spectral procedure. In this limit the layers decouple, so each successive layer independently picks the directions offering the strongest accessible low-degree correlation with the label. The resulting mechanism explains how depth progressively assembles complex concepts from simpler ones through low-degree compositionality, and it predicts the sample complexities at which those concepts emerge. Experiments on fully connected and convolutional networks show that the predicted representations improve on random-feature baselines and align with the features discovered early in actual gradient descent.

Core claim

In the stylized limit of gradient-based training, the dynamics at each layer decouple, allowing the next layer to select directions with maximal accessible low-degree correlation to the label and thereby yielding an explicit iterative spectral procedure for building hierarchical representations.

What carries the argument

Neural Low-Degree Filtering (Neural LoFi): an iterative spectral procedure in which each layer, given the current representation, selects directions of maximal low-degree polynomial correlation with the label in a decoupled kernel-space step.
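As a concreteness aid, here is a minimal sketch of what such an iterative spectral procedure could look like in code. This is not the paper's algorithm: the ReLU random-feature lift, the label-signed covariance operator, and the |eigenvalue|-based selection rule are assumptions loosely patterned on the figure descriptions (signed-covariance eigendecomposition, retained features (k1, k2)), and every name and hyperparameter below is hypothetical.

```python
import numpy as np

def random_features(h, p, rng):
    """Lift the current representation h (n x d) to p random ReLU features.
    The paper's actual feature map may differ; this is an illustrative stand-in."""
    d = h.shape[1]
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    return np.maximum(h @ W, 0.0)

def lofi_layer(phi, y, k):
    """One decoupled spectral step: keep the k feature-space directions with the
    largest |eigenvalue| of a label-signed covariance operator.
    Assumption: C = (1/n) Phi^T diag(y) Phi mirrors, but need not equal, the
    operator the paper eigendecomposes."""
    n = phi.shape[0]
    C = (phi.T * y) @ phi / n                  # symmetric, label-weighted
    evals, evecs = np.linalg.eigh(C)
    top = np.argsort(np.abs(evals))[::-1][:k]  # rank by |eigenvalue|
    return evecs[:, top]

def neural_lofi(X, y, widths=(2048, 1024), keep=(64, 32), seed=0):
    """Build the representation layer by layer; each step sees only the
    current representation h and the labels y (the claimed decoupling)."""
    rng = np.random.default_rng(seed)
    h = X
    for p, k in zip(widths, keep):
        phi = random_features(h, p, rng)
        V = lofi_layer(phi, y, k)
        h = phi @ V                            # next-layer representation
    return h                                   # feed to a linear/ridge readout
```

The control flow is the point: no step looks at downstream weights, which is exactly the decoupling the core claim asserts.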

If this is right

  • Representations are built layer by layer through selection of maximal low-degree correlations.
  • Concept emergence occurs at sample complexities governed by the degree of the selected polynomials.
  • Depth enables new features to be constructed from previous ones via low-degree compositionality.
  • The model recovers structured filters and outperforms lazy random-feature baselines on standard architectures.
  • Early gradient-descent features on real datasets align with the layer-wise spectral predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-degree filtering lens could be used to predict how depth requirements scale with the complexity of target functions.
  • Explicitly implementing the spectral selection step might yield new training algorithms that accelerate hierarchical feature discovery.
  • The framework suggests direct comparisons between learned representations and low-degree polynomial kernels at each layer depth (a kernel-alignment sketch follows this list).
  • It offers a route to study why certain data distributions allow shallow networks to suffice while others require many layers.
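One way to act on the third extension above is a layer-by-layer kernel-alignment probe. The sketch below uses centered kernel alignment (CKA), a standard representation-similarity measure; the choice of CKA, the unit-offset polynomial kernel, and the probe activations in layer_reps are our assumptions, not constructions taken from the paper.

```python
import numpy as np

def _center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(K, L):
    """Centered kernel alignment between two Gram matrices."""
    Kc, Lc = _center(K), _center(L)
    return float(np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc)))

def poly_gram(X, degree):
    """Gram matrix of a degree-`degree` polynomial kernel with unit offset."""
    return (X @ X.T + 1.0) ** degree

def layerwise_alignment(layer_reps, X, max_degree=3):
    """Compare each layer's representation (entries of layer_reps, each n x d_l,
    e.g. activations on a probe set) with low-degree polynomial kernels on the
    raw inputs X. Returns a (layers x degrees) alignment table."""
    rows = []
    for H in layer_reps:
        K = H @ H.T
        rows.append([cka(K, poly_gram(X, r)) for r in range(1, max_degree + 1)])
    return np.array(rows)
```

If the low-degree filtering picture holds, one would expect the degree at which alignment peaks to grow with depth; whether it does is an empirical question the paper does not settle here.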

Load-bearing premise

That the gradient dynamics at each layer can be decoupled into independent selections of directions with maximal low-degree correlation to the label.

What would settle it

Training a multi-layer network on data where the learned intermediate representations fail to match the maximal low-degree correlations predicted by the spectral procedure at successive layers.
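A minimal version of this test only needs an overlap statistic between the subspace a trained network actually learns at a given layer and the subspace the spectral procedure predicts. The sketch below uses principal angles; it is one reasonable choice and need not coincide with the normalized overlap the paper defines in App. G.2, and the variable names in the usage comment are hypothetical.

```python
import numpy as np

def subspace_overlap(A, B):
    """Normalized overlap in [0, 1] between the column spans of A and B (p x k).
    1 means identical spans; random k-dimensional subspaces of R^p give ~k/p.
    Computed from principal angles via the SVD of Q_A^T Q_B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)   # cosines of principal angles
    k = min(Qa.shape[1], Qb.shape[1])
    return float(np.sum(s ** 2) / k)

# Usage sketch (names hypothetical): compare first-layer SGD weights at step t
# with the directions the spectral procedure predicts for that layer.
# overlap_t = subspace_overlap(W_sgd_layer1_step_t.T, V_lofi_layer1)
```

A persistent low overlap at some layer, on data where the theory's assumptions nominally hold, would be the falsifying outcome described above.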

Figures

Figures reproduced from arXiv: 2605.13612 by Florent Krzakala, Hugo Tabanelli, Luca Arnaboldi, Matteo Vilucchio, Yatin Dandi.

Figure 1
Figure 1. Neural LoFi versus gradient descent/backpropagation (GD). Test error on binary CIFAR-10 [27] (animals vs. vehicles) for fully connected networks (FCN) and convolutional networks (CNN). We compare Neural LoFi with networks trained by gradient descent/backpropagation, shown for different numbers of training steps. In the low-data regime, and at early training times even with more data, Neural LoFi matches or… view at source ↗
Figure 2
Figure 2. Neural LoFi in a mathematically solvable model: We used data generated by the two-level target Eq. (21), with k = 2, latent dimension d1 = ⌊d^ϵ⌋, ϵ = 1/2, and final readout g⋆(t) = tanh(t), learned by a Neural LoFi approach. For d ∈ {80, 100, 120, 140}, we use first-layer random-feature widths p1 ∈ {20000, 30000, 40000, 50000} and second-layer widths p2 ∈ {512, 768, 1024, 1280}. The final one-dimensio… view at source ↗
Figure 3
Figure 3. Fully connected Neural LoFi on the CIFAR-10 animal-vs.-vehicle task. Left: Test error vs. number of training samples for ridge regression, three-layer random features, and Neural LoFi, with projection dimensions p = 1k and p = 5k. Right: Test error over the number of retained features (k1, k2) in the first two LoFi layers, for different training-set sizes and fixed projection dimension p = 5k. Stars indicat… view at source ↗
Figure 5
Figure 5. Predicting when individual features emerge on CIFAR-10. For a three-layer fully connected Neural LoFi model on the CIFAR-10 animal-vs.-vehicle task, we track the squared overlap |⟨v_i^(n), v_i^(N)⟩|² between eigenvectors estimated from n samples and large-sample reference eigenvectors computed with N = 60,000 samples. Curves show mean ± SEM over 100 random subsamples at fixed random features. Dashed vert… view at source ↗
Figure 4
Figure 4. Layer-wise normalized overlap (see App. G.2) between features learned by SGD at different steps with the Neural LoFi representation, for a four-layer FCN on CIFAR-10. We also compare Neural LoFi with a standard back-propagation approach: The overlap in… view at source ↗
Figure 6
Figure 6. Neural LoFi with convolutional layers on the CIFAR-10 animal-vs.-vehicle task. Neural LoFi recovers the… view at source ↗
Figure 7
Figure 7. Spectral emergence in the Neural LoFi estimator. Spectrum of the first random-feature spectral operator Ĉ1 for the hierarchical solvable model of section 4.1, shown at increasing sample exponents α = log(n)/log(d). Blue histograms display the bulk eigenvalue density, while red triangles indicate the leading d1 eigenvalues in absolute value. As α increases, the leading eigenvalues progressively separate f… view at source ↗
Figure 8
Figure 8. Test error (%) as a function of the kept features for Kernel LoFi for different training dataset sizes… view at source ↗
Figure 9
Figure 9. Test error (%) as a function of the number of retained features… view at source ↗
Figure 10
Figure 10. Convergence of finite-width NLoFi to the NLoFi-NNGP limit as the hidden width… view at source ↗
Figure 11
Figure 11. Predicting when individual features emerge on CIFAR-10 with convolutional networks. Convolutional analog of… view at source ↗
Figure 12
Figure 12. Spectral Distribution: Histograms of eigenvalues across network layers (columns) and dataset sizes (rows). Red markers indicate the five most dominant eigenvalues. The symlog scale reveals the emergence of spectral structure and the separation of lead features from the bulk distribution as n grows. The random feature dimension in this experiment is p = 512 for all layers. view at source ↗
Figure 13
Figure 13. Layer-wise feature importance Ī^(ℓ) (Equation (121)) on CelebA [117] for two binary attributes. (Top, High Cheekbones) Importance concentrates progressively on the cheekbone region across layers, while the chin area, salient at the input, is progressively suppressed in deeper layers. (Bottom, Smiling) Importance remains focused on the mouth and jaw region throughout all layers, reflecting that the discri… view at source ↗
Figure 14
Figure 14. Filters and activations for CNNs. We train a 6 convolutional + 1 fully connected layer neural network on CelebA [117] for binary classification of the "Gender" attribute, using Neural LoFi with signed-covariance eigendecomposition and eigenvalue-based feature selection. (Left) The 5 × 5 first-layer filters learned by Neural LoFi, visualized as RGB images. (Right) The activations of the top-ranked (1st) a… view at source ↗
read the original abstract

Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how emergence of concepts arises with given sample complexity, and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery with real datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Neural Low-Degree Filtering (Neural LoFi) as a stylized limit of gradient-based training in which hierarchical feature learning reduces to an explicit iterative spectral procedure. In this limit the dynamics at each layer decouple, so that the next layer independently selects directions maximizing accessible low-degree correlation to the label given the current representation; the resulting surrogate yields predictions on layer-wise representation selection, sample complexity for concept emergence, and progressive construction of new features from old ones via low-degree compositionality. The theory is supported by mechanistic experiments on fully connected and convolutional architectures showing improvement over lazy random-feature baselines and alignment with early gradient-descent features on real data.

Significance. If the reduction to the stylized limit is valid, Neural LoFi would supply a mathematically explicit, tractable framework for multi-layer feature learning beyond the lazy/NTK regime, with concrete, falsifiable predictions on how depth builds representations through low-degree compositionality and on the sample complexity of concept emergence. Such a surrogate could serve as a useful analytical tool for studying hierarchical learning in a manner that is directly comparable to gradient descent trajectories.

major comments (1)
  1. [Stylized limit and decoupling argument (abstract and main derivation section)] The decoupling of layer-wise dynamics is load-bearing for the central claim that Neural LoFi is a direct reduction of gradient flow rather than an additional modeling assumption. The manuscript states that 'the dynamics at each layer decouple' in the stylized limit, yet provides no explicit derivation showing how the back-propagated gradient or feature-map Jacobian becomes block-diagonal or timescale-separated; without this step the iterative spectral procedure remains conjectural.
minor comments (1)
  1. [Experiments section] Quantitative details of the mechanistic experiments (exact metrics for alignment with early GD features, data-exclusion rules, and baseline hyper-parameter choices) are only summarized; including these in the main text or appendix would allow readers to assess the strength of the empirical support.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We appreciate the recognition of Neural LoFi's potential as a tractable surrogate for hierarchical feature learning. We address the major comment on the decoupling argument below and will revise the manuscript to strengthen this aspect.

read point-by-point responses
  1. Referee: [Stylized limit and decoupling argument (abstract and main derivation section)] The decoupling of layer-wise dynamics is load-bearing for the central claim that Neural LoFi is a direct reduction of gradient flow rather than an additional modeling assumption. The manuscript states that 'the dynamics at each layer decouple' in the stylized limit, yet provides no explicit derivation showing how the back-propagated gradient or feature-map Jacobian becomes block-diagonal or timescale-separated; without this step the iterative spectral procedure remains conjectural.

    Authors: We agree that an explicit derivation is necessary to substantiate the claim that decoupling emerges directly from the stylized limit. In the revised manuscript we will add a dedicated subsection to the main derivation that derives the block-diagonal structure of the effective dynamics. Under the stylized-limit assumptions (infinite width, layer-wise learning-rate scaling, and separation of timescales), the back-propagated gradient through the feature-map Jacobian becomes block-diagonal because cross-layer feature correlations vanish in the limit and the low-degree filtering property enforces orthogonality between successive representations. This step-by-step derivation will show that each layer's update depends only on the current representation and the label, confirming that the iterative spectral procedure is a reduction of gradient flow rather than an extra modeling assumption. revision: yes
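To make the structure of that argument visible, here is a schematic of the claimed reduction in notation that is ours rather than the manuscript's; the vanishing of the cross-layer coupling is precisely the stylized-limit assumption under discussion, not a derived fact.

```latex
% Gradient flow on layer \ell of a depth-L network f = f_L \circ \cdots \circ f_1,
% with h_\ell = \phi(W_\ell h_{\ell-1}) and backpropagated error \delta_\ell:
\delta_\ell \;=\; \phi'\!\big(W_\ell h_{\ell-1}\big) \odot
      \Big(\frac{\partial f}{\partial h_\ell}\Big)^{\!\top} \partial_{\hat y}\,\mathcal{L}(\hat y, y),
\qquad
\dot W_\ell \;=\; -\,\eta_\ell\; \mathbb{E}_{(x,y)}\!\big[\,\delta_\ell\, h_{\ell-1}^{\top}\big].

% Stylized-limit assumption (the decoupling at issue): the downstream Jacobian
% \partial f / \partial h_\ell stops coupling the layers, so the effective update
% depends only on the current representation and the label,
\dot W_\ell \;\approx\; -\,\eta_\ell\; \mathbb{E}_{(x,y)}\!\big[\, y\, \psi(h_{\ell-1})\, h_{\ell-1}^{\top}\big],

% that is, a label-correlation operator built from (low-degree functions of)
% h_{\ell-1} alone: one spectral selection step per layer.
```

Whether the promised derivation actually delivers this block-diagonal structure is the open point the referee flags.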

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained within stylized limit definition

full rationale

The paper defines Neural LoFi explicitly as a stylized limit of gradient-based training in which layer dynamics are stated to decouple, yielding an iterative spectral procedure by construction of that limit. No equations or steps are shown reducing to fitted inputs, self-citations, or prior ansatzes from the same authors; the decoupling and selection rule are presented as consequences of the limit rather than independently verified reductions. The framework remains an assumption-based surrogate whose predictions are compared to experiments, without the central claim collapsing to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that layer dynamics decouple in the stylized training limit; the abstract quantifies no free parameters, and the one invented entity is the Neural LoFi framework itself.

axioms (1)
  • domain assumption: Dynamics at each layer decouple in the stylized limit of gradient-based training
    Stated directly in the abstract as the basis for the iterative spectral selection procedure.
invented entities (1)
  • Neural Low-Degree Filtering (Neural LoFi): no independent evidence
    purpose: Stylized surrogate model for hierarchical feature learning
    Newly introduced framework whose independent evidence is limited to the described experiments.

pith-pipeline@v0.9.0 · 5509 in / 1308 out tokens · 55252 ms · 2026-05-14T19:35:40.881672+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

123 extracted references · 16 canonical work pages · 1 internal anchor

  1. [1]

    Deep learning

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015

  2. [2]

    The unreasonable effectiveness of deep learning in artificial intelligence

    Terrence J Sejnowski. The unreasonable effectiveness of deep learning in artificial intelligence. Proceedings of the National Academy of Sciences, 117(48):30033–30038, 2020

  3. [3]

    Visualizing and understanding convolutional networks

    Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014

  4. [4]

    How transferable are features in deep neural networks?

    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

  5. [5]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024

  6. [6]

    On lazy training in differentiable programming

    Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019

  7. [7]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018

  8. [8]

    Wide neural networks of any depth evolve as linear models under gradient descent

    Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019

  9. [9]

    A mean field view of the landscape of two-layer neural networks

    Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018

  10. [10]

    On the global convergence of gradient descent for over-parameterized models using optimal transport

    Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018

  11. [11]

    Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error

    Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. stat, 1050:22, 2018

  12. [12]

    Mean field analysis of neural networks: A law of large numbers

    Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020

  13. [13]

    Tuning large neural networks via zero-shot hyperparameter transfer

    Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34:17084–17097, 2021

  14. [14]

    Training integrable parameterizations of deep neural networks in the infinite-width limit

    Karl Hajjar, Lénaïc Chizat, and Christophe Giraud. Training integrable parameterizations of deep neural networks in the infinite-width limit. Journal of Machine Learning Research, 25(196):1–130, 2024

  15. [15]

    Self-consistent dynamical field theory of kernel evolution in wide neural networks

    Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems, 35:32240–32256, 2022

  16. [16]

    A statistical mechanics framework for bayesian deep neural networks beyond the infinite-width limit

    Rosalba Pacelli, Sebastiano Ariosto, Mauro Pastore, Francesco Ginelli, Marco Gherardi, and Pietro Rotondo. A statistical mechanics framework for bayesian deep neural networks beyond the infinite-width limit. Nature Machine Intelligence, 5(12):1497–1507, 2023

  17. [17]

    When do neural networks outperform kernel methods?

    Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020

  18. [18]

    Learning single-index models with shallow neural networks

    Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. Advances in neural information processing systems, 35:9768–9783, 2022

  19. [19]

    Computational-statistical gaps in gaussian single-index models

    Alex Damian, Loucas Pillaud-Vivien, Jason Lee, and Joan Bruna. Computational-statistical gaps in gaussian single-index models. In The Thirty Seventh Annual Conference on Learning Theory, pages 1262–1262. PMLR, 2024

  20. [20]

    Fundamental computational limits of weak learnability in high-dimensional multi-index models

    Emanuele Troiani, Yatin Dandi, Leonardo Defilippis, Lenka Zdeborová, Bruno Loureiro, and Florent Krzakala. Fundamental computational limits of weak learnability in high-dimensional multi-index models. In The 28th International Conference on Artificial Intelligence and Statistics, 2025

  21. [21]

    How transformers learn structured data: Insights from hierarchical filtering

    Jerome Garnier-Brun, Marc Mezard, Emanuele Moscato, and Luca Saglietti. How transformers learn structured data: Insights from hierarchical filtering. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 18831–18847. PMLR, 13–19 Jul 2025

  22. [22]

    How deep neural networks learn compositional data: The random hierarchy model

    Francesco Cagnetta, Leonardo Petrini, Umberto M Tomasini, Alessandro Favero, and Matthieu Wyart. How deep neural networks learn compositional data: The random hierarchy model. Physical Review X, 14(3):031001, 2024

  23. [23]

    Locality defeats the curse of dimensionality in convolutional teacher-student scenarios

    Alessandro Favero, Francesco Cagnetta, and Matthieu Wyart. Locality defeats the curse of dimensionality in convolutional teacher-student scenarios. Advances in Neural Information Processing Systems, 34:9456–9467, 2021

  24. [24]

    How compositional generalization and creativity improve as diffusion models are trained

    Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, and Matthieu Wyart. How compositional generalization and creativity improve as diffusion models are trained. arXiv preprint arXiv:2502.12089, 2025

  25. [25]

    The computational advantage of depth in learning high-dimensional hierarchical targets

    Yatin Dandi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The computational advantage of depth in learning high-dimensional hierarchical targets. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  26. [26]

    Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining

    Yunwei Ren, Yatin Dandi, Florent Krzakala, and Jason D. Lee. Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining. arXiv preprint arXiv:2601.19756, 2026

  27. [27]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  28. [28]

    Support Vector Machines

    Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008

  29. [29]

    Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond

    Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002

  30. [30]

    Emergent abilities of large language models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022, 2022

  31. [31]

    Pretraining task diversity and the emergence of non-bayesian in-context learning for regression

    Allan Raventós, Mansheej Paul, Feng Chen, and Surya Ganguli. Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. Advances in neural information processing systems, 36:14228–14246, 2023

  32. [32]

    A theory for emergence of complex skills in language models

    Sanjeev Arora and Anirudh Goyal. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023

  33. [33]

    Are emergent abilities of large language models a mirage?

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? Advances in neural information processing systems, 36:55565–55581, 2023

  34. [34]

    Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices

    Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697, 2005

  35. [35]

    The eigenvalue spectrum of a large symmetric random matrix

    Samuel F Edwards and Raymund C Jones. The eigenvalue spectrum of a large symmetric random matrix. Journal of Physics A: Mathematical and General, 9(10):1595–1603, 1976

  36. [36]

    On the performance of kernel classes

    Shahar Mendelson. On the performance of kernel classes. Journal of Machine Learning Research, 4:759–771, 2003

  37. [37]

    Local rademacher complexities

    Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005

  38. [38]

    Optimal rates for the regularized least-squares algorithm

    Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007

  39. [39]

    Generalization properties of learning with random features

    Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, volume 30, 2017

  40. [40]

    Optimal scaling laws in learning hierarchical multi-index models

    Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, and Antoine Maillard. Optimal scaling laws in learning hierarchical multi-index models. arXiv preprint arXiv:2602.05846, 2026

  41. [41]

    Deep learning of compositional targets with hierarchical spectral methods

    Hugo Tabanelli, Yatin Dandi, Luca Pesce, and Florent Krzakala. Deep learning of compositional targets with hierarchical spectral methods. arXiv preprint arXiv:2602.10867, 2026

  42. [42]

    Provable guarantees for nonlinear feature learning in three-layer neural networks

    Eshaan Nichani, Alex Damian, and Jason D. Lee. Provable guarantees for nonlinear feature learning in three-layer neural networks. In Advances in Neural Information Processing Systems, volume 36, pages 10828–10875, 2023

  43. [43]

    Learning hierarchical polynomials with three-layer neural networks

    Zihao Wang, Eshaan Nichani, and Jason D. Lee. Learning hierarchical polynomials with three-layer neural networks. In The Twelfth International Conference on Learning Representations, 2024

  44. [44]

    Learning hierarchical polynomials of multiple nonlinear features

    Hengyu Fu, Zihao Wang, Eshaan Nichani, and Jason D. Lee. Learning hierarchical polynomials of multiple nonlinear features. In The Thirteenth International Conference on Learning Representations, 2025

  45. [45]

    When and why are deep networks better than shallow ones?

    Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. When and why are deep networks better than shallow ones? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, February 2017

  46. [46]

    Benefits of depth in neural networks

    Matus Telgarsky. Benefits of depth in neural networks. In Proceedings of the 29th Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517–1539. PMLR, June 2016

  47. [47]

    Deriving neural scaling laws from the statistics of natural language

    Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart. Deriving neural scaling laws from the statistics of natural language. arXiv preprint arXiv:2602.07488, 2026

  48. [48]

    The renormalization group and critical phenomena

    Kenneth G Wilson. The renormalization group and critical phenomena. Reviews of Modern Physics, 55(3):583, 1983

  49. [49]

    Spectral clustering of graphs with the bethe hessian

    Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral clustering of graphs with the bethe hessian. Advances in neural information processing systems, 27, 2014

  50. [50]

    Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions

    Valentina Ros, Gérard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019

  51. [51]

    Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models.Advances in neural information processing systems, 32, 2019

    Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models. Advances in neural information processing systems, 32, 2019

  52. [52]

    Marvels and pitfalls of the langevin algorithm in noisy high-dimensional inference

    Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Marvels and pitfalls of the langevin algorithm in noisy high-dimensional inference. Physical Review X, 10(1):011057, 2020

  53. [53]

    Phase transitions of spectral initialization for high-dimensional non-convex estimation

    Yue M Lu and Gen Li. Phase transitions of spectral initialization for high-dimensional non-convex estimation. Information and Inference: A Journal of the IMA, 9(3):507–541, 2020

  54. [54]

    Fundamental limits of weak recovery with applications to phase retrieval

    Marco Mondelli and Andrea Montanari. Fundamental limits of weak recovery with applications to phase retrieval. In Conference On Learning Theory, pages 1445–1450. PMLR, 2018

  55. [55]

    Phase retrieval in high dimensions: Statistical and computational phase transitions

    Antoine Maillard, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Phase retrieval in high dimensions: Statistical and computational phase transitions. Advances in Neural Information Processing Systems, 33:11071–11082, 2020

  56. [56]

    Asymptotics of non-convex generalized linear models in high-dimensions: A proof of the replica formula

    Matteo Vilucchio, Yatin Dandi, Matéo Pirio Rossignol, Cédric Gerbelot, and Florent Krzakala. Asymptotics of non-convex generalized linear models in high-dimensions: A proof of the replica formula. arXiv preprint arXiv:2502.20003, 2025

  57. [57]

    The role of the time-dependent hessian in high-dimensional optimization

    Tony Bonnaire, Giulio Biroli, and Chiara Cammarota. The role of the time-dependent hessian in high-dimensional optimization. Journal of Statistical Mechanics: Theory and Experiment, 2025(8):083401, 2025

  58. [58]

    Neural networks learn generic multi-index models near information-theoretic limit

    Bohan Zhang, Zihao Wang, Hengyu Fu, and Jason D. Lee. Neural networks learn generic multi-index models near information-theoretic limit. arXiv preprint arXiv:2511.15120, 2025

  59. [59]

    Phase transitions for feature learning in neural networks

    Andrea Montanari and Zihao Wang. Phase transitions for feature learning in neural networks. arXiv preprint arXiv:2602.01434, 2026

  60. [60]

    Online stochastic gradient descent on non-convex losses from high-dimensional inference

    Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. Journal of Machine Learning Research, 22(106):1–51, 2021

  61. [61]

    Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics

    Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics. In The Thirty Sixth Annual Conference on Learning Theory, pages 2552–2623. PMLR, 2023

  62. [62]

    The generative leap: Tight sample complexity for efficiently learning gaussian multi-index models

    Alex Damian, Jason D. Lee, and Joan Bruna. The generative leap: Tight sample complexity for efficiently learning gaussian multi-index models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  63. [63]

    Tensor principal component analysis via sum-of-square proofs

    Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006. PMLR, 2015

  64. [64]

    Statistical and computational phase transitions in spiked tensor estimation

    Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Statistical and computational phase transitions in spiked tensor estimation. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 511–515. IEEE, 2017

  65. [65]

    The kikuchi hierarchy and tensor pca

    Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The kikuchi hierarchy and tensor pca. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1446–1468. IEEE, 2019

  66. [66]

    Algorithmic thresholds for tensor pca

    Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor pca. The Annals of Probability, 48(4):2052–2087, 2020

  67. [67]

    The benefits of reusing batches for gradient descent in two-layer networks: breaking the curse of information and leap exponents

    Yatin Dandi, Emanuele Troiani, Luca Arnaboldi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The benefits of reusing batches for gradient descent in two-layer networks: breaking the curse of information and leap exponents. In Proceedings of the 41st International Conference on Machine Learning, pages 9991–10016, 2024

  68. [68]

    The staircase property: How hierarchical structure can guide deep learning

    Emmanuel Abbe, Enric Boix-Adsera, Matthew S Brennan, Guy Bresler, and Dheeraj Nagaraj. The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989–27002, 2021

  69. [69]

    Stochastic gradient descent in high dimensions for multi-spiked tensor pca

    Gérard Ben Arous, Cédric Gerbelot, and Vanessa Piccolo. Stochastic gradient descent in high dimensions for multi-spiked tensor pca. arXiv preprint arXiv:2410.18162, 2024

  70. [70]

    Sliding down the stairs: how correlated latent variables accelerate learning with neural networks

    Lorenzo Bardone, Sebastian Goldt, et al. Sliding down the stairs: how correlated latent variables accelerate learning with neural networks. In International Conference on Machine Learning, volume 235, pages 3024–3045, 2024

  71. [71]

    Computational thresholds in multi-modal learning via the spiked matrix-tensor model

    Hugo Tabanelli, Pierre Mergny, Lenka Zdeborová, and Florent Krzakala. Computational thresholds in multi-modal learning via the spiked matrix-tensor model. arXiv preprint arXiv:2506.02664, 2025

  72. [72]

    Matrix completion has no spurious local minimum

    Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. Advances in neural information processing systems, 29, 2016

  73. [73]

    Spurious valleys in one-hidden-layer neural network optimization landscapes

    Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1–34, 2019

  74. [74]

    Optimization and generalization of shallow neural networks with quadratic activation functions

    Stefano Sarao Mannelli, Eric Vanden-Eijnden, and Lenka Zdeborová. Optimization and generalization of shallow neural networks with quadratic activation functions. Advances in Neural Information Processing Systems, 33:13445–13455, 2020

  75. [75]

    Geometry and optimization of shallow polynomial networks

    Yossi Arjevani, Joan Bruna, Joe Kileel, Elzbieta Polak, and Matthew Trager. Geometry and optimization of shallow polynomial networks. SIAM Journal on Applied Algebra and Geometry, 10(2):174–209, 2026

  76. [76]

    The nuclear route: Sharp asymptotics of erm in overparameterized quadratic networks

    Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, and Florent Krzakala. The nuclear route: Sharp asymptotics of erm in overparameterized quadratic networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  77. [77]

    Scaling laws and spectra of shallow neural networks in the feature learning regime

    Leonardo Defilippis, Yizhou Xu, Julius Girardin, Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Bruno Loureiro, and Florent Krzakala. Scaling laws and spectra of shallow neural networks in the feature learning regime. In The Fourteenth International Conference on Learning Representations, 2026

  78. [78]

    Inductive bias and spectral properties of single-head attention in high dimensions

    Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Florent Krzakala, and Lenka Zdeborová. Inductive bias and spectral properties of single-head attention in high dimensions. arXiv preprint arXiv:2509.24914, 2025

  79. [79]

    Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning

    Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021

  80. [80]

    Kymatio: Scattering transforms in python

    Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim Andén, Eugene Belilovsky, et al. Kymatio: Scattering transforms in python. Journal of Machine Learning Research, 21(60):1–6, 2020

Showing first 80 references.