Recognition: unknown
There Will Be a Scientific Theory of Deep Learning
Pith reviewed 2026-05-09 20:06 UTC · model grok-4.3
The pith
A scientific theory of deep learning called learning mechanics is emerging from five complementary lines of research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. Five growing bodies of work point toward such a theory: solvable idealized settings, tractable limits, simple mathematical laws, theories of hyperparameters, and universal behaviors. These bodies share a focus on the dynamics of the training process, coarse aggregate statistics, and falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process.
What carries the argument
Learning mechanics, the proposed perspective that treats deep learning as governed by emergent laws arising from training dynamics, together with the five identified research strands that support it.
If this is right
- The theory will characterize training dynamics, representations, weights, and performance through coarse statistics and testable predictions.
- A symbiotic relationship will develop between learning mechanics and mechanistic interpretability.
- Common arguments that a fundamental theory of deep learning is impossible or unimportant can be directly addressed.
- Open directions include further development of the five strands and exploration of their unification.
- The mechanics perspective will complement rather than replace statistical and information-theoretic approaches.
Where Pith is reading between the lines
- If the mechanics view holds, it could reduce the need for exhaustive hyperparameter search by supplying predictive laws for how changes in one setting affect others.
- Universal behaviors identified across architectures might extend to non-neural models, offering a broader organizing principle for machine learning systems.
- Testable predictions from idealized settings could be checked systematically in controlled scaling experiments to measure how far the unification extends (a minimal sketch of such a check follows this list).
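A minimal sketch of such a check, under stated assumptions: fit an assumed power-law form L(N) = a * N^(-b) + c to final losses from a few small runs, then see whether the fit extrapolates to a larger, held-out run. The model sizes, loss values, functional form, and error interpretation below are illustrative placeholders, not quantities from the paper.

```python
# Minimal sketch: fit an assumed power-law scaling form to small-scale runs
# and check whether it extrapolates to a larger, held-out run.
# All numbers below are synthetic placeholders, not measurements from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, b, c):
    """Assumed form L(N) = a * N^(-b) + c for final loss vs. model size."""
    return a * n_params ** (-b) + c

# Hypothetical (model size, final loss) pairs from small controlled runs.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([3.10, 2.74, 2.45, 2.21, 2.02])  # placeholder values

# Fit the three free parameters of the assumed law.
(a, b, c), _ = curve_fit(scaling_law, sizes, losses, p0=[10.0, 0.1, 1.0], maxfev=10000)

# Extrapolate to a larger, held-out model and compare with its observed loss.
held_out_size, held_out_loss = 1e9, 1.80  # placeholder observation
predicted = scaling_law(held_out_size, a, b, c)
rel_error = abs(predicted - held_out_loss) / held_out_loss

print(f"fitted exponent b = {b:.3f}, predicted loss = {predicted:.3f}, "
      f"relative error = {rel_error:.1%}")
# A large relative error on held-out scales would count as evidence against
# the fitted law; a small one supports its use for planning larger runs.
```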
Load-bearing premise
The five strands of research will converge into one coherent mechanics of learning rather than remaining disconnected lines of inquiry.
What would settle it
A concrete counter-example in which predictions derived from solvable idealized settings or simple mathematical laws fail to match observed training dynamics or performance in multiple realistic neural networks would falsify the claim that these strands are coalescing into a unified theory.
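To make that criterion concrete under illustrative assumptions, a check of this kind could take law-predicted and observed losses for several realistic networks and flag the unification claim as challenged when the mismatch is large for most of them. The network list, loss values, and the 10% tolerance below are hypothetical choices, not the paper's protocol.

```python
# Hypothetical falsification check: compare law-predicted losses against
# observed losses for several realistic networks. All values are placeholders.
import numpy as np

def law_is_challenged(predicted, observed, tolerance=0.10):
    """Return (challenged, per-network relative errors); challenged is True
    when more than half the networks deviate by more than `tolerance`."""
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    rel_errors = np.abs(predicted - observed) / observed
    return np.mean(rel_errors > tolerance) > 0.5, rel_errors

# Placeholder final-loss values for, say, a ResNet, a small transformer,
# and an MLP trained under matched budgets.
predicted = [2.31, 3.05, 2.80]
observed  = [2.35, 3.10, 3.60]   # the third network deviates sharply

challenged, errors = law_is_challenged(predicted, observed)
print("per-network relative errors:", np.round(errors, 3))
print("claim challenged:", challenged)  # False here: only 1 of 3 exceeds 10%
```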
Original abstract
In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a scientific theory of deep learning is emerging, which it terms 'learning mechanics': a framework characterizing training dynamics, hidden representations, final weights, and performance via coarse aggregate statistics and falsifiable predictions. It identifies five growing bodies of research—(a) solvable idealized settings, (b) tractable limits, (c) simple mathematical laws, (d) theories of hyperparameters, and (e) universal behaviors—that share traits of focusing on training-process dynamics and quantitative observables. The manuscript discusses the relation of this perspective to statistical and information-theoretic approaches, anticipates a symbiotic link with mechanistic interpretability, reviews common arguments against the possibility of fundamental theory, and outlines open directions, while hosting supplementary materials at learningmechanics.pub.
Significance. If the argument holds, the paper offers a useful organizing lens for deep-learning theory by highlighting converging trends toward a mechanics-style focus on aggregate training behavior rather than microscopic details. Its concrete contributions are the synthesis of existing strands and the accessible introductory resources hosted at learningmechanics.pub, which could aid newcomers. The significance remains modest because the contribution is qualitative synthesis without new theorems, experiments, or proofs.
major comments (2)
- [Section introducing the five bodies of work] The section introducing the five bodies of work (following the abstract) asserts that these strands 'point toward' a coherent theory of learning mechanics but supplies no explicit interconnections, shared mathematical language, or concrete integration examples showing how, e.g., insights from tractable limits constrain or extend universal behaviors. Without such links the claim of coalescence reduces to parallel enumeration rather than evidence of unification, which is load-bearing for the central thesis.
- [Discussion of common arguments against theory] In the discussion of common arguments against fundamental theory (near the end), the rebuttals rely on the same curated examples of positive trends without addressing the risk of selection bias or providing a falsifiable criterion for when the five strands would fail to form a unified mechanics; this weakens the defense of the emergence claim.
minor comments (2)
- [Abstract] The abstract states that materials are hosted at learningmechanics.pub but does not briefly describe their content (e.g., open questions or tutorials), which would help readers decide whether to consult them.
- [Introduction] The term 'learning mechanics' is introduced as a metaphor; a short clarification distinguishing it from prior uses of 'mechanics' in optimization or physics-inspired ML would reduce potential ambiguity.
Simulated Author's Rebuttal
Thank you for the constructive comments. We have carefully considered the points raised and provide point-by-point responses below, indicating where revisions will be made to the manuscript.
Point-by-point responses
Referee: [Section introducing the five bodies of work] The section introducing the five bodies of work (following the abstract) asserts that these strands 'point toward' a coherent theory of learning mechanics but supplies no explicit interconnections, shared mathematical language, or concrete integration examples showing how, e.g., insights from tractable limits constrain or extend universal behaviors. Without such links the claim of coalescence reduces to parallel enumeration rather than evidence of unification, which is load-bearing for the central thesis.
Authors: We thank the referee for highlighting this gap. The manuscript emphasizes shared traits across the five bodies as the basis for coalescence, but we agree that concrete examples would strengthen the argument. In the revised version, we will expand the introduction to include specific interconnections, for instance illustrating how tractable limits such as the infinite-width regime (where the neural tangent kernel arises) sharpen intuitions from solvable idealized settings, and how both in turn relate to universal behaviors observed in scaling laws (a toy numerical sketch of the kernel quantity in question follows these responses). This addition will demonstrate the emerging shared mathematical language without overstating the current level of unification. revision: yes
Referee: [Discussion of common arguments against theory] In the discussion of common arguments against fundamental theory (near the end), the rebuttals rely on the same curated examples of positive trends without addressing the risk of selection bias or providing a falsifiable criterion for when the five strands would fail to form a unified mechanics; this weakens the defense of the emergence claim.
Authors: We acknowledge the validity of this critique. To mitigate concerns of selection bias, we will revise the relevant section to explicitly state the criteria used for selecting the strands (e.g., their focus on dynamics and quantitative predictions) and note that they represent prominent directions in the literature. On the falsifiable criterion, the paper positions the emergence as an ongoing process; we will add a sentence indicating that the claim would be challenged if future research in these areas fails to produce consistent, cross-setting predictions or if the strands remain siloed without integration. However, as this is a perspective piece rather than a formal theory, we do not provide a complete falsification protocol here; we believe these changes address the core concern while remaining honest about the current state of the field. revision: partial
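The first response above leans on the neural tangent kernel and the infinite-width limit as a bridge between strands. As a rough, self-contained illustration of the object involved (not the paper's own analysis), the sketch below computes the empirical NTK of a tiny two-layer tanh network and measures how much it moves over a short training run; under the lazy/rich picture, the wide network's kernel is expected to stay nearly frozen while the narrow one's moves more. Widths, data, learning rate, and step count are toy assumptions.

```python
# Toy sketch (not from the paper): empirical neural tangent kernel of a
# two-layer tanh network, and its relative movement over a short run of
# full-batch gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8                        # input dimension, number of training points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)         # synthetic regression targets

def init(width):
    # NTK-style scaling: O(1) weights, output divided by sqrt(width).
    return rng.standard_normal((width, d)), rng.standard_normal(width)

def value_and_grad(W1, w2, x):
    """Return f(x) and the gradient of f with respect to (W1, w2), flattened."""
    m = w2.shape[0]
    phi = np.tanh(W1 @ x)
    f = w2 @ phi / np.sqrt(m)
    d_w2 = phi / np.sqrt(m)
    d_W1 = np.outer(w2 * (1.0 - phi**2), x) / np.sqrt(m)
    return f, np.concatenate([d_W1.ravel(), d_w2])

def empirical_ntk(W1, w2):
    """K[i, j] = grad_theta f(x_i) . grad_theta f(x_j)."""
    J = np.stack([value_and_grad(W1, w2, x)[1] for x in X])
    return J @ J.T

def relative_kernel_movement(width, steps=50, lr=0.1):
    W1, w2 = init(width)
    K0 = empirical_ntk(W1, w2)
    for _ in range(steps):
        outs = [value_and_grad(W1, w2, x) for x in X]
        residual = np.array([f for f, _ in outs]) - y      # dL/df for 0.5 * MSE
        g = sum(r * grad for r, (_, grad) in zip(residual, outs)) / n
        W1 -= lr * g[: width * d].reshape(width, d)
        w2 -= lr * g[width * d:]
    K = empirical_ntk(W1, w2)
    return np.linalg.norm(K - K0) / np.linalg.norm(K0)

for width in (16, 4096):
    print(f"width {width:5d}: relative NTK movement = "
          f"{relative_kernel_movement(width):.4f}")
# Expected pattern under the lazy/rich picture: the wide network's kernel
# barely moves, while the narrow network's kernel changes noticeably more.
```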
Circularity Check
No circularity in perspective survey of research strands
full rationale
The paper is a non-technical perspective piece that surveys five existing bodies of deep learning theory research and argues they indicate an emerging 'mechanics' framework. No derivation chain, equations, or predictions are presented that reduce to the paper's own inputs by construction. The selection of strands draws from external literature (including but not limited to author-adjacent work), the shared traits are observational, and the 'learning mechanics' label is a proposed framing rather than a self-definitional or fitted result. Self-citations, if present, are not load-bearing for any forced conclusion. This is a standard, self-contained opinion article without the circular patterns enumerated.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A scientific theory of deep learning can be characterized by a focus on training dynamics, coarse aggregate statistics, and falsifiable quantitative predictions.
Forward citations
Cited by 3 Pith papers
- BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization. BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
- Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory. PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
- Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs. Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...