Recognition: unknown
There Will Be a Scientific Theory of Deep Learning
Pith reviewed 2026-05-09 20:06 UTC · model grok-4.3
The pith
A scientific theory of deep learning called learning mechanics is emerging from five complementary lines of research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. Five growing bodies of work point toward such a theory: solvable idealized settings, tractable limits, simple mathematical laws, theories of hyperparameters, and universal behaviors. These bodies share a focus on the dynamics of the training process, coarse aggregate statistics, and falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process.
What carries the argument
Learning mechanics, the proposed perspective that treats deep learning as governed by emergent laws arising from training dynamics, together with the five identified research strands that support it.
If this is right
- The theory will characterize training dynamics, representations, weights, and performance through coarse statistics and testable predictions.
- A symbiotic relationship will develop between learning mechanics and mechanistic interpretability.
- Common arguments that a fundamental theory of deep learning is impossible or unimportant can be directly addressed.
- Open directions include further development of the five strands and exploration of their unification.
- The mechanics perspective will complement rather than replace statistical and information-theoretic approaches.
Where Pith is reading between the lines
- If the mechanics view holds, it could reduce the need for exhaustive hyperparameter search by supplying predictive laws for how changes in one setting affect others.
- Universal behaviors identified across architectures might extend to non-neural models, offering a broader organizing principle for machine learning systems.
- Testable predictions from idealized settings could be checked systematically in controlled scaling experiments to measure how far the unification extends (a minimal sketch of such a check follows this list).
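A minimal sketch of such a check, under stated assumptions: fit an assumed power-law form L(N) = a * N^(-b) + c to final losses from a few small runs, then see whether the fit extrapolates to a larger, held-out run. The model sizes, loss values, functional form, and error interpretation below are illustrative placeholders, not quantities from the paper.

```python
# Minimal sketch: fit an assumed power-law scaling form to small-scale runs
# and check whether it extrapolates to a larger, held-out run.
# All numbers below are synthetic placeholders, not measurements from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, b, c):
    """Assumed form L(N) = a * N^(-b) + c for final loss vs. model size."""
    return a * n_params ** (-b) + c

# Hypothetical (model size, final loss) pairs from small controlled runs.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([3.10, 2.74, 2.45, 2.21, 2.02])  # placeholder values

# Fit the three free parameters of the assumed law.
(a, b, c), _ = curve_fit(scaling_law, sizes, losses, p0=[10.0, 0.1, 1.0], maxfev=10000)

# Extrapolate to a larger, held-out model and compare with its observed loss.
held_out_size, held_out_loss = 1e9, 1.80  # placeholder observation
predicted = scaling_law(held_out_size, a, b, c)
rel_error = abs(predicted - held_out_loss) / held_out_loss

print(f"fitted exponent b = {b:.3f}, predicted loss = {predicted:.3f}, "
      f"relative error = {rel_error:.1%}")
# A large relative error on held-out scales would count as evidence against
# the fitted law; a small one supports its use for planning larger runs.
```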
Load-bearing premise
The five strands of research will converge into one coherent mechanics of learning rather than remaining disconnected lines of inquiry.
What would settle it
A concrete counter-example in which predictions derived from solvable idealized settings or simple mathematical laws fail to match observed training dynamics or performance in multiple realistic neural networks would falsify the claim that these strands are coalescing into a unified theory.
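To make that criterion concrete under illustrative assumptions, a check of this kind could take law-predicted and observed losses for several realistic networks and flag the unification claim as challenged when the mismatch is large for most of them. The network list, loss values, and the 10% tolerance below are hypothetical choices, not the paper's protocol.

```python
# Hypothetical falsification check: compare law-predicted losses against
# observed losses for several realistic networks. All values are placeholders.
import numpy as np

def law_is_challenged(predicted, observed, tolerance=0.10):
    """Return (challenged, per-network relative errors); challenged is True
    when more than half the networks deviate by more than `tolerance`."""
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    rel_errors = np.abs(predicted - observed) / observed
    return np.mean(rel_errors > tolerance) > 0.5, rel_errors

# Placeholder final-loss values for, say, a ResNet, a small transformer,
# and an MLP trained under matched budgets.
predicted = [2.31, 3.05, 2.80]
observed  = [2.35, 3.10, 3.60]   # the third network deviates sharply

challenged, errors = law_is_challenged(predicted, observed)
print("per-network relative errors:", np.round(errors, 3))
print("claim challenged:", challenged)  # False here: only 1 of 3 exceeds 10%
```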
Original abstract
In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a scientific theory of deep learning is emerging, which it terms 'learning mechanics': a framework characterizing training dynamics, hidden representations, final weights, and performance via coarse aggregate statistics and falsifiable predictions. It identifies five growing bodies of research—(a) solvable idealized settings, (b) tractable limits, (c) simple mathematical laws, (d) theories of hyperparameters, and (e) universal behaviors—that share traits of focusing on training-process dynamics and quantitative observables. The manuscript discusses the relation of this perspective to statistical and information-theoretic approaches, anticipates a symbiotic link with mechanistic interpretability, reviews common arguments against the possibility of fundamental theory, and outlines open directions, while hosting supplementary materials at learningmechanics.pub.
Significance. If the argument holds, the paper offers a useful organizing lens for deep-learning theory by highlighting converging trends toward a mechanics-style focus on aggregate training behavior rather than microscopic details. Its concrete contributions are the synthesis of existing strands and the accessible introductory resources hosted at learningmechanics.pub, which could aid newcomers. The significance remains modest because the contribution is qualitative synthesis without new theorems, experiments, or proofs.
major comments (2)
- [Section introducing the five bodies of work] The section introducing the five bodies of work (following the abstract) asserts that these strands 'point toward' a coherent theory of learning mechanics but supplies no explicit interconnections, shared mathematical language, or concrete integration examples showing how, e.g., insights from tractable limits constrain or extend universal behaviors. Without such links the claim of coalescence reduces to parallel enumeration rather than evidence of unification, which is load-bearing for the central thesis.
- [Discussion of common arguments against theory] In the discussion of common arguments against fundamental theory (near the end), the rebuttals rely on the same curated examples of positive trends without addressing the risk of selection bias or providing a falsifiable criterion for when the five strands would fail to form a unified mechanics; this weakens the defense of the emergence claim.
minor comments (2)
- [Abstract] The abstract states that materials are hosted at learningmechanics.pub but does not briefly describe their content (e.g., open questions or tutorials), which would help readers decide whether to consult them.
- [Introduction] The term 'learning mechanics' is introduced as a metaphor; a short clarification distinguishing it from prior uses of 'mechanics' in optimization or physics-inspired ML would reduce potential ambiguity.
Simulated Author's Rebuttal
Thank you for the constructive comments. We have carefully considered the points raised and provide point-by-point responses below, indicating where revisions will be made to the manuscript.
Point-by-point responses
Referee: [Section introducing the five bodies of work] The section introducing the five bodies of work (following the abstract) asserts that these strands 'point toward' a coherent theory of learning mechanics but supplies no explicit interconnections, shared mathematical language, or concrete integration examples showing how, e.g., insights from tractable limits constrain or extend universal behaviors. Without such links the claim of coalescence reduces to parallel enumeration rather than evidence of unification, which is load-bearing for the central thesis.
Authors: We thank the referee for highlighting this gap. The manuscript emphasizes shared traits across the five bodies as the basis for coalescence, but we agree that concrete examples would strengthen the argument. In the revised version, we will expand the introduction to include specific interconnections, for instance illustrating how tractable limits such as the infinite-width regime (where the neural tangent kernel arises) sharpen intuitions from solvable idealized settings, and how both in turn relate to universal behaviors observed in scaling laws (a toy numerical sketch of the kernel quantity in question follows these responses). This addition will demonstrate the emerging shared mathematical language without overstating the current level of unification. revision: yes
Referee: [Discussion of common arguments against theory] In the discussion of common arguments against fundamental theory (near the end), the rebuttals rely on the same curated examples of positive trends without addressing the risk of selection bias or providing a falsifiable criterion for when the five strands would fail to form a unified mechanics; this weakens the defense of the emergence claim.
Authors: We acknowledge the validity of this critique. To mitigate concerns of selection bias, we will revise the relevant section to explicitly state the criteria used for selecting the strands (e.g., their focus on dynamics and quantitative predictions) and note that they represent prominent directions in the literature. On the falsifiable criterion, the paper positions the emergence as an ongoing process; we will add a sentence indicating that the claim would be challenged if future research in these areas fails to produce consistent, cross-setting predictions or if the strands remain siloed without integration. However, as this is a perspective piece rather than a formal theory, we do not provide a complete falsification protocol here; we believe these changes address the core concern while remaining honest about the current state of the field. revision: partial
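The first response above leans on the neural tangent kernel and the infinite-width limit as a bridge between strands. As a rough, self-contained illustration of the object involved (not the paper's own analysis), the sketch below computes the empirical NTK of a tiny two-layer tanh network and measures how much it moves over a short training run; under the lazy/rich picture, the wide network's kernel is expected to stay nearly frozen while the narrow one's moves more. Widths, data, learning rate, and step count are toy assumptions.

```python
# Toy sketch (not from the paper): empirical neural tangent kernel of a
# two-layer tanh network, and its relative movement over a short run of
# full-batch gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8                        # input dimension, number of training points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)         # synthetic regression targets

def init(width):
    # NTK-style scaling: O(1) weights, output divided by sqrt(width).
    return rng.standard_normal((width, d)), rng.standard_normal(width)

def value_and_grad(W1, w2, x):
    """Return f(x) and the gradient of f with respect to (W1, w2), flattened."""
    m = w2.shape[0]
    phi = np.tanh(W1 @ x)
    f = w2 @ phi / np.sqrt(m)
    d_w2 = phi / np.sqrt(m)
    d_W1 = np.outer(w2 * (1.0 - phi**2), x) / np.sqrt(m)
    return f, np.concatenate([d_W1.ravel(), d_w2])

def empirical_ntk(W1, w2):
    """K[i, j] = grad_theta f(x_i) . grad_theta f(x_j)."""
    J = np.stack([value_and_grad(W1, w2, x)[1] for x in X])
    return J @ J.T

def relative_kernel_movement(width, steps=50, lr=0.1):
    W1, w2 = init(width)
    K0 = empirical_ntk(W1, w2)
    for _ in range(steps):
        outs = [value_and_grad(W1, w2, x) for x in X]
        residual = np.array([f for f, _ in outs]) - y      # dL/df for 0.5 * MSE
        g = sum(r * grad for r, (_, grad) in zip(residual, outs)) / n
        W1 -= lr * g[: width * d].reshape(width, d)
        w2 -= lr * g[width * d:]
    K = empirical_ntk(W1, w2)
    return np.linalg.norm(K - K0) / np.linalg.norm(K0)

for width in (16, 4096):
    print(f"width {width:5d}: relative NTK movement = "
          f"{relative_kernel_movement(width):.4f}")
# Expected pattern under the lazy/rich picture: the wide network's kernel
# barely moves, while the narrow network's kernel changes noticeably more.
```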
Circularity Check
No circularity in perspective survey of research strands
full rationale
The paper is a non-technical perspective piece that surveys five existing bodies of deep learning theory research and argues they indicate an emerging 'mechanics' framework. No derivation chain, equations, or predictions are presented that reduce to the paper's own inputs by construction. The selection of strands draws from external literature (including but not limited to author-adjacent work), the shared traits are observational, and the 'learning mechanics' label is a proposed framing rather than a self-definitional or fitted result. Self-citations, if present, are not load-bearing for any forced conclusion. This is a standard, self-contained opinion article without the circular patterns enumerated.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A scientific theory of deep learning can be characterized by a focus on training dynamics, coarse aggregate statistics, and falsifiable quantitative predictions.
Forward citations
Cited by 3 Pith papers
- BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization. BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
- Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory. PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
- Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs. Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...