pith. machine review for the scientific record.

arxiv: 2603.22347 · v2 · submitted 2026-03-22 · 💻 cs.AI · cond-mat.stat-mech · cs.LG

Recognition: 2 theorem links · Lean Theorem

Intelligence Inertia: Physical Isomorphism and Applications

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:30 UTC · model grok-4.3

classification 💻 cs.AI · cond-mat.stat-mech · cs.LG
keywords intelligence inertia · Minkowski spacetime · Lorentz factor · deep learning dynamics · computational cost · neural adaptation · J-shaped curve · catastrophic forgetting

The pith

A heuristic spacetime isomorphism for deep learning yields a Lorentz-like cost formula predicting a J-shaped computational wall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Intelligence Inertia to capture the cost of neural adaptation through a mathematical analogy to Minkowski spacetime. It derives a non-linear cost formula similar to the Lorentz factor from the non-commutativity of network states and rules, which produces a sharp J-shaped rise in overhead during major structural shifts. A reader would care because this identifies the point where low-density approximations like Fisher Information cease to work, potentially guiding more stable training of complex models. The authors test the idea through noise experiments, geodesic mapping for architectures, and a scheduler that accounts for this inertia to limit forgetting.
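
Of the three tests, the scheduler is the most mechanically concrete. As a rough illustration of what an "inertia-aware" wrapper could look like, here is a minimal sketch that damps the learning rate by a Lorentz-like factor as the update velocity approaches a capacity ceiling. Every name below is hypothetical; the velocity proxy and the ceiling c are our assumptions, not the authors' implementation.

  import torch

  # Hypothetical sketch, not the authors' code: damp updates by 1/gamma(v),
  # where v is a proxy for the rate of parameter change and c is a tunable
  # capacity ceiling below which classical behavior is recovered.
  class InertiaAwareLR:
      def __init__(self, optimizer, c=1.0):
          self.opt = optimizer
          self.c = c
          self.base_lrs = [g["lr"] for g in optimizer.param_groups]
          self.prev = [p.detach().clone()
                       for g in optimizer.param_groups for p in g["params"]]

      def step(self):
          params = [p for g in self.opt.param_groups for p in g["params"]]
          # Velocity proxy: norm of the last displacement in parameter space.
          v = sum(float((p.detach() - q).norm()) ** 2
                  for p, q in zip(params, self.prev)) ** 0.5
          v = min(v, 0.999 * self.c)                 # stay below the "wall"
          gamma = (1.0 - (v / self.c) ** 2) ** -0.5  # Lorentz-like factor
          for g, lr0 in zip(self.opt.param_groups, self.base_lrs):
              g["lr"] = lr0 / gamma                  # brake near the ceiling
          self.prev = [p.detach().clone() for p in params]
          self.opt.step()

Whether such braking actually limits forgetting is exactly what the paper's third experiment has to demonstrate quantitatively.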

Core claim

Rather than claiming a new physical law, the paper establishes a heuristic mathematical isomorphism between deep learning dynamics and Minkowski spacetime. From the non-commutativity [Ŝ, R̂] = i𝒟 between states and rules, it derives a non-linear cost formula that mirrors the Lorentz factor. This predicts a relativistic J-shaped inflation curve marking the computational wall where classical approximations fail for high-dimensional tensor evolution.
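
The paper's constants and variable definitions are its own; if the mirrored form is the standard Lorentz factor, the template being claimed is

  $C(v) = C_0 / \sqrt{1 - v^2/c^2}$,

with $v$ playing the role of the rate of structural change and $c$ an effective capacity ceiling. The curve is near-linear for $v \ll c$, since $C \approx C_0(1 + v^2/2c^2)$ there (the regime where Fisher-style approximations hold), and diverges as $v \to c$: that divergence is the J-shaped wall.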

What carries the argument

Intelligence Inertia: the effective resistance generated by the commutator [Ŝ, R̂] = i𝒟 which, under a heuristic isomorphism to Minkowski spacetime, produces the relativistic cost formula.
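
To make the load-bearing object concrete: non-commutativity just means that applying the rule update and the state update in opposite orders gives different results. A toy numpy check, with illustrative stand-in matrices (not the paper's Ŝ and R̂):

  import numpy as np

  # Illustrative stand-ins; any generically chosen pair fails to commute.
  S = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy "state" operator
  R = np.array([[1.0, 0.0], [0.0, -1.0]])  # toy "rule" operator

  comm = S @ R - R @ S      # [S, R]
  D = comm / 1j             # read off D from [S, R] = iD
  print(comm)               # nonzero: order of state vs. rule updates matters

Everything in the paper's framework hangs on what 𝒟 is for real networks; the sketch only shows the algebraic shape of the postulate.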

Load-bearing premise

The heuristic mathematical isomorphism between deep learning dynamics and Minkowski spacetime is sufficiently accurate to yield quantitative predictions for high-dimensional tensor evolution.

What would settle it

Measure computational overhead while forcing deep structural reconfigurations in neural networks under controlled noise and check whether costs follow the predicted non-linear J-shaped curve or remain closer to linear approximations.
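
A minimal version of that adjudication, sketched in Python under stated assumptions: the data here are synthetic stand-ins (in the real test, overhead would come from profiling networks forced through reconfigurations under controlled noise), and the functional forms are the generic linear baseline and Lorentz template, not the paper's fitted models.

  import numpy as np
  from scipy.optimize import curve_fit

  def linear(v, a, b):                      # classical, Fisher-style baseline
      return a + b * v

  def lorentz(v, c0, c):                    # J-shaped, Lorentz-like prediction
      return c0 / np.sqrt(1.0 - np.clip(v / c, 0.0, 0.999) ** 2)

  # Synthetic stand-in measurements of overhead vs. reconfiguration rate.
  rng = np.random.default_rng(0)
  v = np.linspace(0.05, 0.95, 12)
  overhead = lorentz(v, 1.0, 1.0) + rng.normal(0.0, 0.05, v.size)

  p_lin, _ = curve_fit(linear, v, overhead)
  p_rel, _ = curve_fit(lorentz, v, overhead, p0=[1.0, 1.0])
  sse = lambda f, p: float(np.sum((f(v, *p) - overhead) ** 2))
  print(sse(linear, p_lin), sse(lorentz, p_rel))
  # A decisive win for the Lorentz fit at high v would support the J-curve;
  # comparable residuals would favor the classical approximation.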

Figures

Figures reproduced from arXiv: 2603.22347 by Jipeng Han.

Figure 1: Geometric Partition of Logical Action. A particle collision with total action l is decomposed by a microscopic slant. The component l_R = l sin θ is absorbed by the adiabatic rule-manifold, while the normal component l_S governs state-expression. Heat emission is only registered when the cumulative normal action matches a full vertical collision relative to the system’s local energy level, naturally inducing …

Figure 2: Comparative Adjudication of Reference Frame Sensitivity and Model …

Figure 3: Ablation Analysis of Relativistic Velocity Addition vs. Mass Expansion. Arena 3 (left) contrasts the Galilean-shifted FIM against the relativistic mass model, showing the clear failure of the quadratic assumption at high speeds; Arena 4 (right) introduces a “Hybrid FIM” model, which applies the relativistic Lorentz velocity transformation but retains the classical quadratic cost formula, highlighting that …

Figure 4: 3D Reachability Topography and the Zig-Zag Evolutionary Geodesic.

Figure 5: Velocity Deviation Topography and the Dynamical Riverbed. This 3D …

Figure 6: Universal Enhancement of Learning Dynamics via the Inertia-Aware …

Figure 7: Logical Resilience and Relativistic Braking under Noise Shock.

Figure 8: Impact of Pulsed Noise on Velocity. This figure illustrates the velocity …

Figure 9: Inertial Barrier during Abrupt Task Transitions. This figure contrasts …
Original abstract

Classical frameworks like Fisher Information approximate the cost of neural adaptation only in low-density regimes, failing to explain the explosive computational overhead incurred during deep structural reconfiguration. To address this, we introduce \textbf{Intelligence Inertia}, a property derived from the fundamental non-commutativity between rules and states ($[\hat{S}, \hat{R}] = i\mathcal{D}$). Rather than claiming a new fundamental physical law, we establish a \textbf{heuristic mathematical isomorphism} between deep learning dynamics and Minkowski spacetime. Acting as an \textit{effective theory} for high-dimensional tensor evolution, we derive a non-linear cost formula mirroring the Lorentz factor, predicting a relativistic $J$-shaped inflation curve -- a computational wall where classical approximations fail. We validate this framework via three experiments: (1) adjudicating the $J$-curve divergence under high-entropy noise, (2) mapping the optimal geodesic for architecture evolution, and (3) deploying an \textbf{inertia-aware scheduler wrapper} that prevents catastrophic forgetting. Adopting this isomorphism yields an exact quantitative metric for structural resistance, advancing the stability and efficiency of intelligent agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces 'Intelligence Inertia' as a property arising from the postulated non-commutativity [Ŝ, R̂] = i𝒟 between rules and states in deep learning dynamics. It posits a heuristic mathematical isomorphism between these dynamics and Minkowski spacetime as an effective theory for high-dimensional tensor evolution, from which a non-linear cost formula analogous to the Lorentz factor is derived. This yields a predicted relativistic J-shaped inflation curve representing a computational wall where classical approximations fail. The framework is validated through three experiments: testing J-curve divergence under high-entropy noise, mapping optimal geodesics for architecture evolution, and implementing an inertia-aware scheduler wrapper to mitigate catastrophic forgetting.

Significance. If the central derivation and quantitative predictions hold under rigorous scrutiny, the work could offer a novel effective-theory perspective on scaling limits and structural resistance in neural networks, potentially informing more stable training regimes and architecture search. The explicit framing as a heuristic isomorphism (rather than a fundamental law) and the provision of an inertia-aware scheduler are constructive elements that could be built upon if the mapping is made precise.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (derivation section): No intermediate algebraic steps are shown that map the commutator postulate [Ŝ, R̂] = i𝒟 to the specific Lorentz-like form (e.g., a factor 1/√(1−v²/c²) or an equivalent inertia term) rather than to an arbitrary non-linear function. Without this explicit mapping, the claimed quantitative J-curve prediction rests on an unverified analogy whose accuracy for tensor evolution remains unestablished.
  2. [§4] §4 (experiments): The three validation experiments are described only at a high level; no quantitative outcomes, error analysis, baseline comparisons, or statistical significance tests are reported. This leaves the central claim that the framework 'prevents catastrophic forgetting' or 'adjudicates J-curve divergence' unsupported by visible evidence.
  3. [§2] §2 (isomorphism construction): The non-linear cost formula is obtained directly from the chosen commutator and the imposed Minkowski isomorphism; the resulting J-curve therefore reduces by construction to a quantity defined within the introduced framework rather than constituting an independent, falsifiable benchmark against classical Fisher-information approximations.
minor comments (2)
  1. [Introduction] The notation Ŝ and R̂ for state and rule operators is introduced without an explicit definition or Hilbert-space context in the opening sections, which hinders readability for readers outside the immediate subfield.
  2. [Introduction] The term 'Intelligence Inertia' is presented as novel without a brief literature contrast to related concepts such as information geometry or effective-field theories in machine learning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major point below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (derivation section): No intermediate algebraic steps are shown that map the commutator postulate [Ŝ, R̂] = i𝒟 to the specific Lorentz-like form (e.g., a factor 1/√(1−v²/c²) or an equivalent inertia term) rather than to an arbitrary non-linear function. Without this explicit mapping, the claimed quantitative J-curve prediction rests on an unverified analogy whose accuracy for tensor evolution remains unestablished.

    Authors: We agree that the derivation would be strengthened by explicit intermediate steps. In the revised manuscript we will expand §3 to show the algebraic mapping from the commutator [Ŝ, R̂] = i𝒟 through the imposed Minkowski isomorphism to the Lorentz-like inertia term. This will make clear that the J-curve follows from the effective-theory construction rather than being chosen arbitrarily. We will also add a short paragraph reiterating the heuristic status of the isomorphism and its intended domain of applicability to high-dimensional tensor evolution. revision: yes

  2. Referee: [§4] §4 (experiments): The three validation experiments are described only at a high level; no quantitative outcomes, error analysis, baseline comparisons, or statistical significance tests are reported. This leaves the central claim that the framework 'prevents catastrophic forgetting' or 'adjudicates J-curve divergence' unsupported by visible evidence.

    Authors: The referee correctly identifies that §4 currently provides only high-level descriptions. We will revise this section to report the quantitative results of all three experiments, including performance metrics with error bars, direct comparisons against standard schedulers and Fisher-information baselines, and the results of appropriate statistical tests. These additions will supply the concrete evidence needed to support the claims regarding J-curve divergence and mitigation of catastrophic forgetting. revision: yes

  3. Referee: [§2] §2 (isomorphism construction): The non-linear cost formula is obtained directly from the chosen commutator and the imposed Minkowski isomorphism; the resulting J-curve therefore reduces by construction to a quantity defined within the introduced framework rather than constituting an independent, falsifiable benchmark against classical Fisher-information approximations.

    Authors: We accept that the J-curve is obtained inside the framework by construction. The manuscript already frames the work as a heuristic effective theory rather than a fundamental law; the intended falsifiability therefore resides in the empirical predictions (divergence from classical approximations under high-entropy noise, geodesic optimality, and forgetting mitigation). In the revision we will strengthen the discussion in §2 to articulate these testable predictions more explicitly and will ensure the expanded experimental results in §4 include direct quantitative comparisons with Fisher-information baselines. revision: partial

Circularity Check

1 step flagged

Non-linear cost formula obtained by construction via the chosen heuristic isomorphism to the Lorentz factor

specific steps
  1. self-definitional [Abstract]
    "we establish a heuristic mathematical isomorphism between deep learning dynamics and Minkowski spacetime. Acting as an effective theory for high-dimensional tensor evolution, we derive a non-linear cost formula mirroring the Lorentz factor, predicting a relativistic J-shaped inflation curve"

    The cost formula is introduced as derived from the commutator via the isomorphism, yet the isomorphism is selected precisely so that the cost mirrors the Lorentz factor. The J-curve therefore follows tautologically from the framework definition rather than from an independent derivation or external benchmark.

full rationale

The paper postulates the commutator [Ŝ, R̂] = i𝒟 and then adopts a heuristic isomorphism to Minkowski spacetime as an effective theory. From this it directly states a derived non-linear cost mirroring the Lorentz factor, producing the J-curve. No intermediate mapping is exhibited showing why the commutator yields precisely the relativistic form rather than another non-linear function; the quantitative prediction therefore reduces to the inputs of the chosen analogy by construction. This matches the self-definitional pattern with load-bearing impact on the central claim.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on a single postulated commutator relation and a heuristic isomorphism chosen to produce the desired Lorentz-like form; no independent empirical or formal grounding is supplied for either.

free parameters (1)
  • 𝒟
    The non-commutativity scale in the relation [Ŝ, R̂] = i𝒟 that generates the inertia effect.
axioms (1)
  • ad hoc to paper Non-commutativity between rules and states: [Ŝ, R̂] = i𝒟
    Introduced in the abstract as the starting point for the isomorphism.
invented entities (1)
  • Intelligence Inertia no independent evidence
    purpose: To quantify structural resistance to reconfiguration in neural networks
    New quantity defined via the heuristic spacetime mapping.

pith-pipeline@v0.9.0 · 5490 in / 1340 out tokens · 49671 ms · 2026-05-15T07:30:18.522860+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 11 internal anchors

  1. [1] S. Legg and M. Hutter, “Universal intelligence: A definition of machine intelligence,” Minds and Machines, vol. 17, no. 4, pp. 391–444, 2007.
  2. [2] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.
  3. [3] F. Chollet, “On the measure of intelligence,” arXiv preprint arXiv:1911.01547, 2019.
  4. [4] J. Hernández-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, 2017.
  5. [5] M. Mernik, J. Heering, and A. M. Sloane, “When and how to develop domain-specific languages,” ACM Computing Surveys (CSUR), vol. 37, no. 4, pp. 316–344, 2005.
  6. [6] L. Valkov, S. Chaudhuri, B. Lake, A. Gaunt, and C. Milton, “HOUDINI: Lifelong learning as program synthesis,” in Advances in Neural Information Processing Systems, vol. 31, 2018.
  7. [7] R. Landauer, “Irreversibility and heat generation in the computing process,” IBM Journal of Research and Development, vol. 5, no. 3, pp. 183–191, 1961.
  8. [8] A. Bérut, A. Arakelyan, A. Petrosyan, S. Ciliberto, R. Dillenschneider, and E. Lutz, “Experimental verification of Landauer’s principle linking information and thermodynamics,” Nature, vol. 483, no. 7388, pp. 187–189, 2012.
  9. [10] S.-i. Amari, Information Geometry and Its Applications. Springer, 2016.
  10. [11] J. Martens, “New insights and perspectives on the natural gradient method,” Journal of Machine Learning Research, vol. 21, no. 146, pp. 1–76, 2020.
  11. [12] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  12. [13] A. N. Kolmogorov, “Three approaches to the quantitative definition of information,” Problems of Information Transmission, vol. 1, no. 1, pp. 1–7, 1965.
  13. [14] M. Hutter, “Algorithmic information theory: a brief non-technical guide to the field,” arXiv preprint cs/0703024, 2007.
  14. [15] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712–3722, 2018.
  15. [16] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in Neural Information Processing Systems, 2014.
  16. [17] J. Bechhoefer, “High-precision test of Landauer’s principle in a feedback trap,” in APS March Meeting Abstracts, vol. 2015, pp. Z3–002, 2015.
  17. [18] C. H. Bennett, “The thermodynamics of computation—a review,” International Journal of Theoretical Physics, vol. 21, no. 12, pp. 905–940, 1982.
  18. [19] J. M. Parrondo, J. M. Horowitz, and T. Sagawa, “Thermodynamics of information,” Nature Physics, vol. 11, no. 2, pp. 131–139, 2015.
  19. [20] R. Pascanu and Y. Bengio, “Revisiting natural gradient for deep networks,” arXiv preprint arXiv:1301.3584, 2013.
  20. [21] C. H. Bennett, “Logical depth and physical complexity,” in The Universal Turing Machine: A Half-Century Survey (R. Herken, ed.), pp. 227–257, Oxford University Press, 1988.
  21. [22] M. Li, P. Vitányi, et al., An Introduction to Kolmogorov Complexity and Its Applications, vol. 3. Springer, 2008.
  22. [23] H. Zenil, “Algorithmic data analytics, small data matters and correlation versus causation,” in Berechenbarkeit der Welt? Philosophie und Wissenschaft im Zeitalter von Big Data, pp. 453–475, Springer, 2017.
  23. [24] R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
  24. [25] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” International Conference on Machine Learning (ICML), 2017.
  25. [26] R. Hadsell, D. Rao, A. A. Rusu, and R. Pascanu, “Embracing change: Continual learning in deep neural networks,” Trends in Cognitive Sciences, vol. 24, no. 12, pp. 1028–1040, 2020.
  26. [27] G. M. van de Ven, J. T. Vogelstein, and A. S. Tolias, “Three scenarios for continual learning,” Nature Machine Intelligence, vol. 4, no. 11, pp. 955–967, 2022.
  27. [28] A. Achille, G. Paolini, G. Mbeng, and S. Soatto, “The information complexity of learning tasks, their structure and their distance,” Information and Inference: A Journal of the IMA, vol. 10, no. 1, pp. 51–72, 2021.
  28. [29] A. Einstein, “On the electrodynamics of moving bodies,” Annalen der Physik, vol. 17, pp. 891–921, 1905.
  29. [30] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  30. [31] H. A. Lorentz, “Electromagnetic phenomena in a system moving with any velocity smaller than that of light,” in Collected Papers: Volume V, pp. 172–197, Springer, 1937.
  31. [32] E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, no. 4, pp. 620–630, 1957.
  32. [33] S. Lloyd, “Ultimate physical limits to computation,” Nature, vol. 406, no. 6799, pp. 1047–1054, 2000.
  33. [34] J. D. Bekenstein, “Energy cost of information transfer,” Physical Review Letters, vol. 46, no. 10, pp. 623–626, 1981.
  34. [35] P. A. M. Dirac, “The fundamental equations of quantum mechanics,” Proceedings of the Royal Society of London, Series A, vol. 109, no. 752, pp. 642–653, 1925.
  35. [36] J. A. Wheeler, “Information, physics, quantum: The search for links,” Feynman and Computation, pp. 309–336, 2018.
  36. [37] W. K. Wootters, “Statistical distance and Hilbert space,” Physical Review D, vol. 23, no. 2, pp. 357–362, 1981.
  37. [38] M. Gell-Mann, The Quark and the Jaguar: Adventures in the Simple and the Complex. Macmillan, 1995.
  38. [39] S. Awodey, Category Theory. Oxford University Press, 2010.
  39. [40] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
  40. [41] C. Adami, “What is information?,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2063, p. 20150230, 2016.
  41. [42] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  42. [43] K. Huang, Statistical Mechanics. John Wiley & Sons, 2008.
  43. [44] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  44. [45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  45. [46] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
  46. [47] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
  47. [48] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in Neural Information Processing Systems, vol. 27, 2014.
  48. [49] G. Kanwar, “Machine learning and variational algorithms for lattice field theory,” arXiv preprint arXiv:2106.01975, 2021.
  49. [50] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual lifelong learning with neural networks: A review,” Neural Networks, vol. 113, pp. 54–71, 2019.
  50. [51] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  51. [52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  52. [53] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  53. [54] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” arXiv preprint arXiv:1609.04836, 2016.
  54. [55] I. J. Myung, “The importance of complexity in model selection,” Journal of Mathematical Psychology, vol. 44, no. 1, pp. 190–204, 2000.
  55. [56] S. Weinberg, Gravitation and Cosmology: Principles and Applications of the General Theory of Relativity. New York: John Wiley & Sons, 1972.
  56. [57] K. Falconer, Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons, 2013.
  57. [58] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
  58. [59] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), pp. 448–456, 2015.
  59. [60] T. M. Mitchell, “The need for biases in learning generalizations,” in Readings in Machine Learning, pp. 184–191, Morgan Kaufmann, 1980.
  60. [61] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  61. [62] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.
  62. [63] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, p. 386, 1958.
  63. [64] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018.
  64. [65] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.
  65. [66] Q. Li, C. Tai, et al., “Stochastic modified equations and adaptive stochastic gradient algorithms,” in International Conference on Machine Learning, pp. 2101–2110, PMLR, 2017.
  66. [67] L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay,” arXiv preprint arXiv:1803.09820, 2018.
  67. [68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  68. [69] L. N. Smith and N. Topin, “Super-convergence: Very fast training of neural networks using large learning rates,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, vol. 11006, pp. 369–386, SPIE, 2019.
  69. [70] L. N. Smith, “Cyclical learning rates for training neural networks,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472, IEEE, 2017.
  70. [71] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning (ICML), pp. 1139–1147, 2013.
  71. [72] B. Han, Q. Yao, T. Liu, G. Niu, I. W. Tsang, J. T. Kwok, and M. Sugiyama, “A survey of label-noise representation learning: Past, present and future,” arXiv preprint arXiv:2011.04406, 2020.
  72. [73] S. McCandlish, J. Kaplan, A. Vitvitkiy, et al., “An empirical model of large-batch training,” arXiv preprint arXiv:1812.06162, 2018.
  73. [74] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations (ICLR), 2017.
  74. [75] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022.
  75. [76] G. E. Volovik, The Universe in a Helium Droplet. Oxford University Press, 2003.
  76. [77] B. Goertzel, Artificial General Intelligence: Concept, State of the Art, and Future Prospects, vol. 5. Artificial General Intelligence Society, 2014.