The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

David Pacheco Aznar; Miquel Noguer i Alonso

arxiv: 2606.12610 · v1 · pith:BTAUCJAFnew · submitted 2026-06-10 · 💻 cs.LG

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

Miquel Noguer i Alonso , David Pacheco Aznar This is my paper

Pith reviewed 2026-06-27 09:58 UTC · model grok-4.3

classification 💻 cs.LG

keywords AI wintersparadigm fragilitymathematical barriersperceptronneural network complexityvanishing gradientsstatistical learning theoryhigh-dimensional approximation

0 comments

The pith

AI winters were aligned with formal mathematical barriers in representation, optimization, and learnability

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances a complementary thesis that the first and second AI winters reflected genuine formal barriers in the dominant paradigms, alongside the usual accounts of engineering failure and commercial disappointment. These barriers encompass limits of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The argument proceeds by aligning specific historical setbacks with results such as the perceptron impossibility theorems, hardness results for neural-network training, minimax rates in high dimensions, vanishing-gradient analyses, and classical statistical learning theory. It then connects these barriers to subsequent breakthroughs that mitigated rather than removed them. A reader would care because the account supplies a mathematical taxonomy for why certain paradigms stalled when they did.

Core claim

The dominant paradigms of the AI winters met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. Several central disappointments of early AI aligned with mathematically precise bottlenecks from the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training, minimax rates for nonparametric estimation in high dimension, vanishing-gradient analyses, and classical statistical learning theory in the tradition of Vapnik-Chervonenkis and Valiant. These barriers are related to later breakthroughs that mitigated rather than

What carries the argument

The taxonomy of paradigm fragility, which maps historical AI disappointments onto mathematically precise bottlenecks drawn from representation theory, complexity, and statistical learning.

If this is right

The technical disappointments of early AI were aligned with precise mathematical limits in addition to practical difficulties.
Later advances mitigated these barriers through architectural and algorithmic changes.
AI progress involves identifying and working around formal constraints identified by learning theory and complexity results.
Paradigm shifts occur in part when new methods circumvent or relax the bottlenecks that stalled prior approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same taxonomy could be applied to later periods of AI development to test whether current approaches are approaching new formal limits.
Funding cycles in AI may reflect the point at which paradigms encounter these mathematical ceilings.
Modern scaling efforts may still be constrained by residual versions of the high-dimensional and optimization barriers discussed.

Load-bearing premise

The cited mathematical results constitute the relevant bottlenecks and their alignment with historical disappointments is meaningful and non-trivial rather than post-hoc selection.

What would settle it

A historical or technical record showing that the actual obstacles faced by researchers during the AI winters were unrelated to the mathematical limits analyzed here, or that the alignment collapses under systematic comparison.

Figures

Figures reproduced from arXiv: 2606.12610 by David Pacheco Aznar, Miquel Noguer i Alonso.

read the original abstract

Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A retrospective synthesis that lines up known theorems with AI winter periods but offers no new evidence that those barriers were decisive or contemporaneously recognized.

read the letter

The paper's core move is to collect several established results—Minsky-Papert on perceptrons, Blum-Rivest on training hardness, Stone on high-dimensional rates, Hochreiter/Bengio on vanishing gradients, and VC theory—and note that each lines up with a type of limitation that appeared during the first or second AI winter. It is explicit that this is alignment, not causation, and it stops short of claiming the theorems drove funding drops.

What it does cleanly is bring those references together in one short narrative and map them onto the usual categories of representation, optimization, complexity, and statistical limits. For someone who already knows the theorems, the exercise is mostly a reminder of their historical timing.

The soft spot is that the alignment remains interpretive. The paper does not show that researchers at the time treated these particular results as the binding constraints, nor does it compare them against other contemporaneous explanations such as hardware limits, data scarcity, or simple over-promising. Without that, the mapping can be read as one plausible retrospective story among several. The disclaimer about not claiming mechanical causation is welcome, but it also means the central claim rests on the reader's judgment of whether the selected theorems are the right ones.

This is the sort of piece that might interest people working on the history of machine learning or on how formal results interact with research cycles. It does not contain new theorems or fresh empirical work, so it is unlikely to change technical practice. A serious editor could send it out as a perspective or survey if the journal wants that kind of piece; the citations are standard and the writing stays within its stated bounds.

Referee Report

2 major / 2 minor

Summary. The paper develops a complementary thesis to socio-economic explanations of the first and second AI winters: that dominant paradigms encountered genuine formal barriers in representation (Minsky-Papert perceptron limits), computational complexity (Blum-Rivest NP-completeness of exact NN training), high-dimensional approximation (Stone minimax rates), optimization (Hochreiter/Bengio vanishing gradients), and statistical learnability (VC theory). The contribution is explicitly synthetic and interpretive; it disclaims mechanical causation and instead maps specific historical disappointments to these established mathematical results, then relates the barriers to later mitigating breakthroughs.

Significance. If the interpretive alignments are accepted as substantive rather than post-hoc, the paper supplies a mathematical taxonomy of paradigm fragility that could usefully complement existing accounts of AI history and inform assessments of current paradigms. Its strengths are the explicit disclaimer of causation, reliance on well-known external theorems, and focus on mitigation rather than elimination of barriers. However, because the work offers no new derivations, systematic citation analysis, or falsifiable tests, its significance hinges entirely on whether readers find the selected mappings non-trivial and representative.

major comments (2)

[sections analyzing Minsky-Papert, Blum-Rivest, Stone, Hochreiter/Bengio, and VC results] The central claim that the listed results constitute the relevant bottlenecks (rather than one possible retrospective mapping) is load-bearing for the thesis. The manuscript would need to supply an explicit criterion for 'alignment' or a comparison against alternative mathematical and non-mathematical factors to demonstrate that the selection is non-arbitrary; absent this, the interpretive step risks appearing post-hoc.
[relation of barriers to historical periods] The paper correctly notes that the barriers were mitigated rather than removed, but does not address the temporal gap: many of the cited results (e.g., Blum-Rivest 1992, Hochreiter 1991) post-date the onset of the second winter, weakening the claim that they explain contemporaneous disappointments.

minor comments (2)

Notation for the various theorems is introduced without a consolidated table or consistent cross-referencing, making it difficult to track which result addresses which barrier.
[abstract and introduction] The abstract and introduction repeat the disclaimer against mechanical causation; a single, prominent statement would suffice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments. We respond to each major point below, indicating revisions where appropriate to strengthen the interpretive framework while preserving the paper's synthetic and non-causal character.

read point-by-point responses

Referee: The central claim that the listed results constitute the relevant bottlenecks (rather than one possible retrospective mapping) is load-bearing for the thesis. The manuscript would need to supply an explicit criterion for 'alignment' or a comparison against alternative mathematical and non-mathematical factors to demonstrate that the selection is non-arbitrary; absent this, the interpretive step risks appearing post-hoc.

Authors: We agree that an explicit selection criterion would improve transparency. The manuscript already disclaims mechanical causation and presents the work as a mapping of central disappointments to established results. We will add a short subsection in the introduction that states the criterion: results are included when they (i) formalize limitations repeatedly identified in the historical literature as blocking the dominant paradigm of the period and (ii) correspond to technical challenges whose mitigation enabled subsequent progress. This makes the interpretive choices explicit without asserting exclusivity or post-hoc fitting. A full comparison against every alternative factor lies outside the paper's synthetic scope, but the added criterion addresses the arbitrariness concern directly. revision: yes
Referee: The paper correctly notes that the barriers were mitigated rather than removed, but does not address the temporal gap: many of the cited results (e.g., Blum-Rivest 1992, Hochreiter 1991) post-date the onset of the second winter, weakening the claim that they explain contemporaneous disappointments.

Authors: We acknowledge the publication dates. Blum and Rivest (1992) and Hochreiter (1991) appeared at or after the conventional start of the second winter. However, the underlying phenomena—training intractability and gradient instability—were already documented in empirical work of the mid-to-late 1980s. The theorems supplied rigorous accounts of difficulties that researchers had encountered in practice. We will revise the relevant passages to note the chronology explicitly and to reiterate that the analysis concerns alignment between the formal statements and the character of the reported disappointments, consistent with the paper's disclaimer against mechanical causation. revision: yes

Circularity Check

0 steps flagged

No circularity: synthetic alignment of external theorems with no self-reduction or fitted predictions

full rationale

The paper is explicitly synthetic and disclaims mechanical causation. Its central thesis aligns historical AI winter disappointments with well-known external results (Minsky-Papert perceptron limits, Blum-Rivest NP-completeness, Stone minimax rates, Hochreiter/Bengio vanishing gradients, VC theory) without deriving new theorems, fitting parameters to data, or reducing claims to self-citations. No equations or load-bearing steps reduce by construction to the paper's own inputs; the argument is retrospective mapping rather than a closed derivation. This matches the default non-circular case for synthetic historical analyses relying on independent prior theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper introduces no new mathematical objects, parameters, or axioms; it cites established results from the literature.

pith-pipeline@v0.9.1-grok · 5727 in / 1047 out tokens · 14557 ms · 2026-06-27T09:58:02.032824+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 24 canonical work pages

[1]

doi: 10.1017/S0962492921000027. Eric B. Baum and David Haussler. What size net gives valid generalization?Neural Computation, 1(1):151–160,

work page doi:10.1017/s0962492921000027
[2]

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal

doi: 10.1162/neco.1989.1.1.151. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854,

work page doi:10.1162/neco.1989.1.1.151 1989
[3]

Neural networks and physical systems with emergent collective com- putational abilities

doi: 10.1073/pnas. 1903070116. Richard E. Bellman.Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ,

work page doi:10.1073/pnas
[4]

Learning long-term dependencies with gradient descent is diﬀicult

doi: 10.1109/72.279181. Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? InProceedings of the 7th International Conference on Database Theory (ICDT), pages 217–235. Springer,

work page doi:10.1109/72.279181
[5]

doi: 10.1007/3-540-49257-7

work page doi:10.1007/3-540-49257-7
[6]

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K

doi: 10.1016/S0893-6080(05)80010-3. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik–Chervonenkis dimension.Journal of the ACM, 36(4): 929–965,

work page doi:10.1016/s0893-6080(05)80010-3
[7]

Bernhard E

doi: 10.1145/76359.76371. Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. InProceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), pages 144–152. ACM,

work page doi:10.1145/76359.76371
[8]

doi: 10.1145/130385.130401. Bruce G. Buchanan and Edward H. Shortliffe.Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison- Wesley, Reading, MA,

work page doi:10.1145/130385.130401
[9]

URL https://doi

doi: 10.1007/BF00994018. George Cybenko. Approximation by superpositions of a sigmoidal function.Math- ematics of Control, Signals and Systems, 2(4):303–314,

work page doi:10.1007/bf00994018
[10]

, TITLE =

doi: 10.1214/aos/1024691081. Hubert L. Dreyfus.What Computers Can’t Do: A Critique of Artificial Reason. Harper & Row, New York,

work page doi:10.1214/aos/1024691081
[11]

Ronen Eldan and Ohad Shamir

doi: 10.1016/0890-5401(89)90002-3. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. InProceedings of the 29th Conference on Learning Theory (COLT), pages 907–940. PMLR,

work page doi:10.1016/0890-5401(89)90002-3
[12]

Xavier Glorot and Yoshua Bengio

arXiv:1803.03635. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256. PMLR,

Pith/arXiv arXiv
[13]

URL https://doi

doi: 10.1214/21-AOS2133. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034,

work page doi:10.1214/21-aos2133
[14]

2015, in 2015 IEEE International Conference on Computer Vision (ICCV), 1026–1034, doi: 10.1109/ICCV.2015.123

doi: 10.1109/ICCV.2015.123. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778,

work page doi:10.1109/iccv.2015.123 2015
[15]

Deep Residual Learning for Image Recognition

doi: 10.1109/CVPR.2016.90. Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Master’s thesis, Technische Universit¨ at M¨ unchen,

work page doi:10.1109/cvpr.2016.90 2016
[16]

Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler

doi: 10.1016/0893-6080(91)90009-T. Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. Neural tangent kernel: Conver- gence and generalization in neural networks. InAdvances in Neural Information Processing Systems, volume 31, pages 8571–8580. Curran Associates,

work page doi:10.1016/0893-6080(91)90009-t
[17]

Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

Pith/arXiv arXiv 2001
[18]

10 Evaluating Bivariate Causal Statements Based on Mutual Compatibility Richardson, T

doi: 10.1214/aos/ 1015957395. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.Nature, 521 (7553):436–444,

work page doi:10.1214/aos/
[19]

LeCun, Y

doi: 10.1038/nature14539. 31 The Mathematics of AI Winters Noguer i Alonso and Pacheco Aznar Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function.Neural Networks, 6(6):861–867,

work page doi:10.1038/nature14539
[20]

James Lighthill

doi: 10.1016/S0893-6080(05) 80131-5. James Lighthill. Artificial intelligence: A general survey. Technical report, Science Research Council, London, UK,

work page doi:10.1016/s0893-6080(05
[21]

, title =

doi: 10.1037/h0042519. 32 The Mathematics of AI Winters Noguer i Alonso and Pacheco Aznar David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning rep- resentations by back-propagating errors.Nature, 323(6088):533–536,

work page doi:10.1037/h0042519
[22]

Rumelhart, Geoffrey E

doi: 10.1038/323533a0. Bernhard Sch¨ olkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. InProceedings of the 14th Annual Conference on Computational Learning Theory (COLT), pages 416–426. Springer,

work page doi:10.1038/323533a0
[23]

doi: 10.1007/3-540-44581-1

work page doi:10.1007/3-540-44581-1
[24]

The Annals of Statistics , author =

doi: 10.1214/aos/1176345969. Matus Telgarsky. Benefits of depth in neural networks. InProceedings of the 29th Conference on Learning Theory (COLT), pages 1517–1539. PMLR,

work page doi:10.1214/aos/1176345969
[25]

Deductive Learning

doi: 10.1145/1968.1972. Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities.Theory of Probability and Its Applications, 16(2):264–280,

work page doi:10.1145/1968.1972 1968
[26]

Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science

doi: 10.1137/1116025. Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge,

work page doi:10.1137/1116025
[27]

arXiv:1611.03530. 33

Pith/arXiv arXiv

[1] [1]

doi: 10.1017/S0962492921000027. Eric B. Baum and David Haussler. What size net gives valid generalization?Neural Computation, 1(1):151–160,

work page doi:10.1017/s0962492921000027

[2] [2]

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal

doi: 10.1162/neco.1989.1.1.151. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854,

work page doi:10.1162/neco.1989.1.1.151 1989

[3] [3]

Neural networks and physical systems with emergent collective com- putational abilities

doi: 10.1073/pnas. 1903070116. Richard E. Bellman.Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ,

work page doi:10.1073/pnas

[4] [4]

Learning long-term dependencies with gradient descent is diﬀicult

doi: 10.1109/72.279181. Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? InProceedings of the 7th International Conference on Database Theory (ICDT), pages 217–235. Springer,

work page doi:10.1109/72.279181

[5] [5]

doi: 10.1007/3-540-49257-7

work page doi:10.1007/3-540-49257-7

[6] [6]

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K

doi: 10.1016/S0893-6080(05)80010-3. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik–Chervonenkis dimension.Journal of the ACM, 36(4): 929–965,

work page doi:10.1016/s0893-6080(05)80010-3

[7] [7]

Bernhard E

doi: 10.1145/76359.76371. Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. InProceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), pages 144–152. ACM,

work page doi:10.1145/76359.76371

[8] [8]

doi: 10.1145/130385.130401. Bruce G. Buchanan and Edward H. Shortliffe.Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison- Wesley, Reading, MA,

work page doi:10.1145/130385.130401

[9] [9]

URL https://doi

doi: 10.1007/BF00994018. George Cybenko. Approximation by superpositions of a sigmoidal function.Math- ematics of Control, Signals and Systems, 2(4):303–314,

work page doi:10.1007/bf00994018

[10] [10]

, TITLE =

doi: 10.1214/aos/1024691081. Hubert L. Dreyfus.What Computers Can’t Do: A Critique of Artificial Reason. Harper & Row, New York,

work page doi:10.1214/aos/1024691081

[11] [11]

Ronen Eldan and Ohad Shamir

doi: 10.1016/0890-5401(89)90002-3. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. InProceedings of the 29th Conference on Learning Theory (COLT), pages 907–940. PMLR,

work page doi:10.1016/0890-5401(89)90002-3

[12] [12]

Xavier Glorot and Yoshua Bengio

arXiv:1803.03635. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256. PMLR,

Pith/arXiv arXiv

[13] [13]

URL https://doi

doi: 10.1214/21-AOS2133. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034,

work page doi:10.1214/21-aos2133

[14] [14]

2015, in 2015 IEEE International Conference on Computer Vision (ICCV), 1026–1034, doi: 10.1109/ICCV.2015.123

doi: 10.1109/ICCV.2015.123. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778,

work page doi:10.1109/iccv.2015.123 2015

[15] [15]

Deep Residual Learning for Image Recognition

doi: 10.1109/CVPR.2016.90. Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Master’s thesis, Technische Universit¨ at M¨ unchen,

work page doi:10.1109/cvpr.2016.90 2016

[16] [16]

Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler

doi: 10.1016/0893-6080(91)90009-T. Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. Neural tangent kernel: Conver- gence and generalization in neural networks. InAdvances in Neural Information Processing Systems, volume 31, pages 8571–8580. Curran Associates,

work page doi:10.1016/0893-6080(91)90009-t

[17] [17]

Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

Pith/arXiv arXiv 2001

[18] [18]

10 Evaluating Bivariate Causal Statements Based on Mutual Compatibility Richardson, T

doi: 10.1214/aos/ 1015957395. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.Nature, 521 (7553):436–444,

work page doi:10.1214/aos/

[19] [19]

LeCun, Y

doi: 10.1038/nature14539. 31 The Mathematics of AI Winters Noguer i Alonso and Pacheco Aznar Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function.Neural Networks, 6(6):861–867,

work page doi:10.1038/nature14539

[20] [20]

James Lighthill

doi: 10.1016/S0893-6080(05) 80131-5. James Lighthill. Artificial intelligence: A general survey. Technical report, Science Research Council, London, UK,

work page doi:10.1016/s0893-6080(05

[21] [21]

, title =

doi: 10.1037/h0042519. 32 The Mathematics of AI Winters Noguer i Alonso and Pacheco Aznar David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning rep- resentations by back-propagating errors.Nature, 323(6088):533–536,

work page doi:10.1037/h0042519

[22] [22]

Rumelhart, Geoffrey E

doi: 10.1038/323533a0. Bernhard Sch¨ olkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. InProceedings of the 14th Annual Conference on Computational Learning Theory (COLT), pages 416–426. Springer,

work page doi:10.1038/323533a0

[23] [23]

doi: 10.1007/3-540-44581-1

work page doi:10.1007/3-540-44581-1

[24] [24]

The Annals of Statistics , author =

doi: 10.1214/aos/1176345969. Matus Telgarsky. Benefits of depth in neural networks. InProceedings of the 29th Conference on Learning Theory (COLT), pages 1517–1539. PMLR,

work page doi:10.1214/aos/1176345969

[25] [25]

Deductive Learning

doi: 10.1145/1968.1972. Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities.Theory of Probability and Its Applications, 16(2):264–280,

work page doi:10.1145/1968.1972 1968

[26] [26]

Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science

doi: 10.1137/1116025. Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge,

work page doi:10.1137/1116025

[27] [27]

arXiv:1611.03530. 33

Pith/arXiv arXiv