Recognition: no theorem link
Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning
Pith reviewed 2026-05-14 19:35 UTC · model grok-4.3
The pith
Neural Low-Degree Filtering models deep learning as an explicit iterative spectral process in which each layer selects features by maximal low-degree correlation to the label.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the stylized limit of gradient-based training, the dynamics at each layer decouple, allowing the next layer to select directions with maximal accessible low-degree correlation to the label; the result is an explicit iterative spectral procedure for building hierarchical representations.
What carries the argument
Neural Low-Degree Filtering (Neural LoFi): an iterative spectral procedure in which each layer, given the current representation, selects directions of maximal low-degree polynomial correlation with the label in a decoupled kernel-space step.
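A minimal sketch of what one such decoupled spectral step could look like, assuming degree-2 monomial features of the current representation, a label-weighted second-moment matrix, and top-eigenvector selection; the function names and these specific choices are illustrative, not the paper's algorithm.

```python
import numpy as np

def low_degree_features(H, degree=2):
    """Polynomial features of the current representation up to `degree`
    (an illustrative basis; the paper's notion of 'accessible low-degree
    correlation' may rest on a different basis, e.g. Hermite polynomials)."""
    feats = [H]
    if degree >= 2:
        i, j = np.triu_indices(H.shape[1])
        feats.append(H[:, i] * H[:, j])          # all pairwise products h_i h_j
    Phi = np.concatenate(feats, axis=1)
    return (Phi - Phi.mean(0)) / (Phi.std(0) + 1e-8)

def spectral_step(H, y, k=16, degree=2):
    """One decoupled layer update: pick k directions in low-degree feature space
    that carry the most label correlation and use them as the new representation."""
    Phi = low_degree_features(H, degree)
    n = len(y)
    # label-weighted second moment of the low-degree features (a "spiked" matrix)
    M = (Phi * (y - y.mean())[:, None]).T @ Phi / n
    M = (M + M.T) / 2
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(-np.abs(vals))[:k]        # directions with largest |eigenvalue|
    return Phi @ vecs[:, order]

def neural_lofi_surrogate(X, y, depth=3, k=16):
    """Iterate the spectral step layer by layer (a surrogate procedure, not GD training)."""
    H = X
    for _ in range(depth):
        H = spectral_step(H, y, k=k)
    return H
```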
If this is right
- Representations are built layer by layer through selection of maximal low-degree correlations.
- Concept emergence occurs at sample complexities governed by the degree of the selected polynomials (a rough counting sketch follows this list).
- Depth enables new features to be constructed from previous ones via low-degree compositionality.
- The model recovers structured filters and outperforms lazy random-feature baselines on standard architectures.
- Early gradient-descent features on real datasets align with the layer-wise spectral predictions.
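On the sample-complexity point above, a rough counting argument (a hedged reading, not the paper's derivation): a generic degree-k polynomial in d variables has on the order of d^k coefficients, so a layer confined to degree-k correlations cannot resolve such a component from far fewer samples.

```latex
% Heuristic parameter count, assuming inputs of dimension d and a generic
% degree-k polynomial component with \binom{d+k}{k} coefficients:
n \;\gtrsim\; \binom{d+k}{k} \;\sim\; \frac{d^{k}}{k!}
\quad \text{for fixed } k \text{ and large } d,
% i.e. roughly d^k samples before a degree-k concept can emerge at that layer.
```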
Where Pith is reading between the lines
- The same low-degree filtering lens could be used to predict how depth requirements scale with the complexity of target functions.
- Explicitly implementing the spectral selection step might yield new training algorithms that accelerate hierarchical feature discovery.
- The framework suggests direct comparisons between learned representations and low-degree polynomial kernels at each layer depth (a minimal alignment sketch follows this list).
- It offers a route to study why certain data distributions allow shallow networks to suffice while others require many layers.
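One way such a per-layer comparison could be set up, as a sketch under assumed choices: linear CKA as the similarity measure and an inhomogeneous polynomial kernel as the low-degree reference; the paper may use different metrics.

```python
import numpy as np

def center(K):
    """Double-center a Gram matrix."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

def cka(K, L):
    """Linear CKA between two Gram matrices (one common alignment measure)."""
    Kc, Lc = center(K), center(L)
    return float(np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc)))

def poly_kernel(X, degree):
    """Low-degree polynomial kernel (1 + <x, x'>)^degree on unit-normalized inputs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return (1.0 + Xn @ Xn.T) ** degree

def layerwise_alignment(activations, X, max_degree=4):
    """Compare each layer's representation kernel H H^T with polynomial kernels of
    increasing degree; `activations` is a list of (n, width) arrays, one per layer."""
    return {
        ell: [cka(H @ H.T, poly_kernel(X, deg)) for deg in range(1, max_degree + 1)]
        for ell, H in enumerate(activations, start=1)
    }
```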
Load-bearing premise
That the gradient dynamics at each layer can be decoupled into independent selections of directions with maximal low-degree correlation to the label.
What would settle it
Training a multi-layer network on data where the learned intermediate representations fail to match the maximal low-degree correlations predicted by the spectral procedure at successive layers.
Original abstract
Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how emergence of concepts arises with given sample complexity, and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery on real datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Neural Low-Degree Filtering (Neural LoFi) as a stylized limit of gradient-based training in which hierarchical feature learning reduces to an explicit iterative spectral procedure. In this limit the dynamics at each layer decouple, so that the next layer independently selects directions maximizing accessible low-degree correlation to the label given the current representation; the resulting surrogate yields predictions on layer-wise representation selection, sample complexity for concept emergence, and progressive construction of new features from old ones via low-degree compositionality. The theory is supported by mechanistic experiments on fully connected and convolutional architectures showing improvement over lazy random-feature baselines and alignment with early gradient-descent features on real data.
Significance. If the reduction to the stylized limit is valid, Neural LoFi would supply a mathematically explicit, tractable framework for multi-layer feature learning beyond the lazy/NTK regime, with concrete, falsifiable predictions on how depth builds representations through low-degree compositionality and on the sample complexity of concept emergence. Such a surrogate could serve as a useful analytical tool for studying hierarchical learning in a manner that is directly comparable to gradient descent trajectories.
major comments (1)
- [Stylized limit and decoupling argument (abstract and main derivation section)] The decoupling of layer-wise dynamics is load-bearing for the central claim that Neural LoFi is a direct reduction of gradient flow rather than an additional modeling assumption. The manuscript states that 'the dynamics at each layer decouple' in the stylized limit, yet provides no explicit derivation showing how the back-propagated gradient or feature-map Jacobian becomes block-diagonal or timescale-separated; without this step the iterative spectral procedure remains conjectural.
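For concreteness, a schematic of the object the referee is asking about, written in notation assumed here rather than taken from the manuscript.

```latex
% Schematic gradient-flow update for the weights of layer \ell, with
% h_\ell = \sigma(W_\ell h_{\ell-1}) and loss \mathcal{L}(h_L, y):
\dot{W}_\ell
  \;=\;
  -\,\eta_\ell \,
  \mathbb{E}_{(x,y)}\!\left[
    \frac{\partial \mathcal{L}}{\partial h_L}\,
    \frac{\partial h_L}{\partial h_\ell}\,
    \frac{\partial h_\ell}{\partial W_\ell}
  \right].
% The decoupling claim would require showing that, in the stylized limit, the
% cross-layer Jacobian chain \partial h_L / \partial h_\ell simplifies so that
% \dot{W}_\ell depends only on (h_{\ell-1}, y); the effective dynamics are then
% block-diagonal across layers and each layer reduces to a label-correlation step.
```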
minor comments (1)
- [Experiments section] Quantitative details of the mechanistic experiments (exact metrics for alignment with early GD features, data-exclusion rules, and baseline hyper-parameter choices) are only summarized; including these in the main text or appendix would allow readers to assess the strength of the empirical support.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. We appreciate the recognition of Neural LoFi's potential as a tractable surrogate for hierarchical feature learning. We address the major comment on the decoupling argument below and will revise the manuscript to strengthen this aspect.
Point-by-point responses
Referee: [Stylized limit and decoupling argument (abstract and main derivation section)] The decoupling of layer-wise dynamics is load-bearing for the central claim that Neural LoFi is a direct reduction of gradient flow rather than an additional modeling assumption. The manuscript states that 'the dynamics at each layer decouple' in the stylized limit, yet provides no explicit derivation showing how the back-propagated gradient or feature-map Jacobian becomes block-diagonal or timescale-separated; without this step the iterative spectral procedure remains conjectural.
Authors: We agree that an explicit derivation is necessary to substantiate the claim that decoupling emerges directly from the stylized limit. In the revised manuscript we will add a dedicated subsection to the main derivation that derives the block-diagonal structure of the effective dynamics. Under the stylized-limit assumptions (infinite width, layer-wise learning-rate scaling, and separation of timescales), the back-propagated gradient through the feature-map Jacobian becomes block-diagonal because cross-layer feature correlations vanish in the limit and the low-degree filtering property enforces orthogonality between successive representations. This step-by-step derivation will show that each layer's update depends only on the current representation and the label, confirming that the iterative spectral procedure is a reduction of gradient flow rather than an extra modeling assumption.
Revision: yes
Circularity Check
No significant circularity; derivation self-contained within stylized limit definition
Full rationale
The paper defines Neural LoFi explicitly as a stylized limit of gradient-based training in which layer dynamics are stated to decouple, yielding an iterative spectral procedure by construction of that limit. No equation or derivation step reduces the claim to its own fitted inputs, self-citations, or prior ansatzes by the same authors; the decoupling and selection rule are presented as consequences of the limit rather than as independently verified reductions. The framework remains an assumption-based surrogate whose predictions are compared to experiments, and the central claim does not collapse to its own inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: dynamics at each layer decouple in the stylized limit of gradient-based training.
invented entities (1)
- Neural Low-Degree Filtering (Neural LoFi): no independent evidence