HiPreNets: High-Precision Neural Networks through Progressive Training

Ethan Mulle; Qi Gong; Wei Kang

arXiv: 2506.15064 · v3 · submitted 2025-06-18 · cs.LG · cs.NA · cs.NE · math.NA

Pith reviewed 2026-05-19 09:36 UTC · model grok-4.3

classification: cs.LG, cs.NA, cs.NE, math.NA

keywords: high-precision neural networks, progressive training, residual refinement, Feynman dataset, power system ODE, L-infinity error, surrogate modeling, adaptive sampling

The pith

HiPreNets progressively trains refinement networks on normalized residuals to reduce both average and worst-case errors toward machine precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

HiPreNets starts with a base neural network and adds successive refinement networks, each trained specifically on the normalized residuals left by the current ensemble. This staged process, paired with loss function design, adaptive sampling, localized patching, and boundary-aware training, directs effort toward high-error regions of the input space. On Feynman dataset regression benchmarks the approach beats standard fully connected networks and reported Kolmogorov-Arnold Network results, reaching accuracy near machine precision on select problems. The same framework applied to learning the flow map of a 20-dimensional power-system ODE yields large drops in both RMSE and L^∞ error while producing a surrogate that runs 238 times faster than direct numerical simulation.

Core claim

Sequential residual refinement reduces both RMSE and L^∞ norm error more effectively than conventional training by training each new network on the normalized residuals of the current ensemble and by concentrating updates on high-error regions through complementary techniques including loss design, adaptive data sampling, localized patching, and boundary-aware training.
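One way to picture the adaptive data sampling named in the core claim is error-proportional resampling: candidate points are redrawn with probability proportional to the current absolute residual. The sketch below is illustrative only. The function name `resample_by_error`, the `floor` parameter, and the weighting rule itself are assumptions made for this example, not the authors' procedure.

```python
import random

def resample_by_error(points, abs_errors, n_new, rng, floor=1e-12):
    """Draw n_new points with probability proportional to |residual|.

    `floor` (hypothetical) keeps every weight positive so that regions
    with zero measured error can still be sampled occasionally.
    """
    weights = [e + floor for e in abs_errors]
    return rng.choices(points, weights=weights, k=n_new)

# Toy setup: the current ensemble is accurate everywhere except x > 0.9.
rng = random.Random(0)
points = [i / 100 for i in range(100)]
abs_errors = [1.0 if x > 0.9 else 1e-3 for x in points]

new_points = resample_by_error(points, abs_errors, n_new=1000, rng=rng)
share_high = sum(x > 0.9 for x in new_points) / len(new_points)
print(f"share of new samples in the high-error region: {share_high:.2f}")
```

With these weights roughly 99% of the probability mass sits on the nine high-error grid points, so the next stage would see that region far more often than uniform sampling would allow.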

What carries the argument

Progressive residual refinement ensemble, in which each stage trains a new network on the normalized difference between the present ensemble output and the target values.
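The staged loop described here can be sketched in a few lines. This is a minimal, standard-library illustration under loud assumptions: each "network" is replaced by a random-feature ridge regressor (a stand-in, not a trained neural network), residuals are normalized by their maximum absolute value (the paper's normalization rule is not stated on this page), and all names such as `train_progressive` and `fit_stage` are hypothetical.

```python
import math
import random

def featurize(x, feats):
    """tanh random features of a scalar input, plus a constant feature."""
    return [math.tanh(w * x + b) for w, b in feats] + [1.0]

def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(A[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (A[r][n] - tail) / A[r][r]
    return z

def fit_stage(xs, targets, feats, ridge=1e-6):
    """Ridge least-squares fit of output weights (stand-in for NN training)."""
    Phi = [featurize(x, feats) for x in xs]
    m = len(Phi[0])
    M = [[sum(row[i] * row[j] for row in Phi) + (ridge if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    v = [sum(row[i] * t for row, t in zip(Phi, targets)) for i in range(m)]
    return solve(M, v)

def predict(x, stages):
    """Ensemble output: sum over stages of scale_k * model_k(x)."""
    total = 0.0
    for scale, feats, coef in stages:
        total += scale * sum(c * p for c, p in zip(coef, featurize(x, feats)))
    return total

def train_progressive(xs, ys, n_stages=3, n_features=12, seed=0):
    rng = random.Random(seed)
    stages, history = [], []
    for _ in range(n_stages):
        # Residual of the current ensemble (equals ys before the first stage).
        residual = [y - predict(x, stages) for x, y in zip(xs, ys)]
        # Assumed normalization: divide by the largest absolute residual.
        scale = max(abs(r) for r in residual) or 1.0
        feats = [(rng.uniform(-4, 4), rng.uniform(-4, 4))
                 for _ in range(n_features)]
        coef = fit_stage(xs, [r / scale for r in residual], feats)
        stages.append((scale, feats, coef))
        errs = [y - predict(x, stages) for x, y in zip(xs, ys)]
        history.append((math.sqrt(sum(e * e for e in errs) / len(errs)),
                        max(abs(e) for e in errs)))
    return stages, history

xs = [-1.0 + 2.0 * i / 99 for i in range(100)]
ys = [math.sin(3.0 * x) for x in xs]
stages, history = train_progressive(xs, ys)
for k, (rmse, linf) in enumerate(history):
    print(f"stage {k}: RMSE={rmse:.3e}  Linf={linf:.3e}")
```

Because each stage solves a ridge problem whose objective at zero weights equals the current squared residual, the training RMSE in this toy cannot increase from stage to stage, up to rounding. No such guarantee holds for the L^∞ error or for held-out data, which is why the per-stage diagnostics requested in the referee report matter.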

If this is right

Higher final accuracy is obtained on nonlinear regression tasks without a proportional increase in total model capacity.
Lower maximum errors make the models more suitable for safety-critical engineering applications.
Fast, high-fidelity surrogate models become feasible for high-dimensional dynamical systems such as power-grid ODEs.
Consistent gains appear across both low-dimensional physics benchmarks and higher-dimensional simulation problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method might be combined with other base architectures to further improve results on the same benchmarks.
Testing on problems with dimensions substantially above 20 could expose whether error reduction remains stable or saturates.
The explicit focus on L^∞ reduction could be paired with physics-informed loss terms for additional accuracy gains in scientific modeling.
The progressive structure suggests a natural way to allocate compute adaptively across different regions of high-dimensional input spaces.

Load-bearing premise

Successive refinement networks trained on normalized residuals will keep lowering both average and maximum errors over the whole input domain without instability, overfitting, or prohibitive growth in training cost as dimension or complexity rises.

What would settle it

A clear test case in which additional refinement stages cease to decrease, or begin to increase, the L^∞ error on any region of the input domain for a Feynman benchmark problem.

Figures

Figures reproduced from arXiv: 2506.15064 by Ethan Mulle, Qi Gong, Wei Kang.

**Figure 2.** Illustration of making inferences with the trained model. The same input …

**Figure 3.** (a) RMSE progression across HiPreNet stages. (b) …

**Figure 4.** Visual comparison of the true function for Function I.6.2, the final model approximation, and the …

**Figure 5.** Surface plots of the residuals from each stage of the HiPreNet training process for Function I.6.2.

**Figure 6.** Impact of varying neuron counts in successive refinement networks on model validation RMSE and …

**Figure 7.** Residuals at each training stage for Function I.6.2 using a [5-5-5-5] and [5-10-15-20] network …

**Figure 8.** Comparison of residual predictions at each stage for Function I.6.2 using MSE loss (left column) …

**Figure 9.** Bar chart showing the best validation RMSE and …

**Figure 10.** Validation data residuals for Function I.13.12: the left plot shows final residuals after standard …

**Figure 11.** Test data residuals for Function I.13.12: the left plot shows final residuals after standard HiPreNet …

**Figure 12.** Validation results on data generated in [1 …

**Figure 13.** Validation results on data generated in [2 …

Original abstract

Deep neural networks are powerful tools for solving nonlinear problems in science and engineering, but training highly accurate models becomes challenging as problem complexity increases. Non-convex optimization and sensitivity to hyperparameters make consistent performance improvement difficult, and traditional approaches prioritize minimizing mean squared error while overlooking the $L^{\infty}$ norm error that is critical in safety-sensitive applications. To address these challenges, we present HiPreNets, a progressive framework for training high-precision neural networks through sequential residual refinements. Starting from an initial network, each stage trains a refinement network on the normalized residuals of the ensemble so far, systematically reducing both average and worst-case error. A key theme throughout the framework is concentrating training effort on high-error regions of the input domain, which we pursue through complementary techniques including loss function design, adaptive data sampling, localized patching, and boundary-aware training. We validate the framework on benchmark regression problems from the Feynman dataset, where it consistently outperforms standard fully connected networks and reported Kolmogorov-Arnold Networks results, with accuracy approaching machine precision depending on select problems. We further apply the framework to learning the flow map of a 20-dimensional power system ODE, which appears to be the highest dimensional problem studied using this class of multistage methods, achieving substantial reductions in both RMSE and $L^{\infty}$ norm error while enabling a surrogate that predicts system state $238\times$ faster than direct numerical simulation.
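The abstract's distinction between mean squared error and the $L^{\infty}$ norm error is easy to see numerically. The toy below uses made-up error values (an assumption for illustration, not data from the paper): 999 small errors and a single large one.

```python
import math

def rmse(errors):
    """Root-mean-square error: an average-case measure."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def linf(errors):
    """L-infinity error: the single worst absolute error."""
    return max(abs(e) for e in errors)

# Hypothetical pointwise errors: accurate almost everywhere, one bad point.
errors = [1e-3] * 999 + [0.5]

print(f"RMSE = {rmse(errors):.4f}")   # roughly 0.016
print(f"Linf = {linf(errors):.4f}")   # 0.5000
```

The RMSE is about thirty times smaller than the worst-case error here, so a model selected on RMSE alone could look accurate while still failing badly at one input.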

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HiPreNets adds targeted high-error sampling and patching to standard residual refinement, which helps on the 20D power system example but leaves the stability of the L^∞ gains unclear without more data.

The main thing to know is that this paper describes a multistage training loop where each new network learns the normalized residual of the current ensemble, with extra work on high-error regions through adaptive sampling, localized patches, loss weighting, and boundary terms. They show it beats plain fully connected nets and some reported KAN numbers on Feynman regression tasks and delivers clear error drops plus a 238x speedup on a 20-dimensional power system flow map. That 20D case is the strongest part of the work because multistage refinement rarely gets tested at that scale for dynamical systems. The combination of residual steps with explicit L^∞ focus is a reasonable engineering extension of boosting-style ideas, and the authors are clear that they are after worst-case error rather than just mean squared error. The method is straightforward to describe and the high-dimensional result gives it some practical weight.

The soft spots are exactly where the stress-test note points. The abstract states the performance claims but does not include per-stage error curves, ablations on the individual focusing techniques, or details on how the normalization scale is chosen and updated. Without those, it is hard to know whether the refinements keep driving both average and maximum error down once residuals approach floating-point noise, or whether training cost grows too fast with added stages. The 20D result could be a single successful run rather than evidence of reliable behavior.

This paper is for people who build neural surrogates for nonlinear scientific problems and care about tight maximum error bounds. A reader working on high-dimensional ODE approximation or safety-critical modeling would pick up usable implementation ideas from the method section. It deserves a serious referee because the core procedure is coherent and the dimensional reach is worth checking, even though the current evidence is thin.

I would send it to peer review with a request for the missing ablations and stability checks.

Referee Report

3 major / 2 minor

Summary. The paper introduces HiPreNets, a progressive framework for high-precision neural networks. It begins with an initial network and trains successive refinement networks on the normalized residuals of the current ensemble, employing adaptive sampling, localized patching, and boundary-aware training to concentrate effort on high-error regions. Validation is reported on Feynman dataset regression tasks, where the method outperforms standard fully connected networks and published Kolmogorov-Arnold Network results with accuracy approaching machine precision on select problems, and on learning the flow map of a 20-dimensional power-system ODE, where it achieves substantial RMSE and L^∞ error reductions while delivering a surrogate 238 times faster than direct numerical simulation.

Significance. If the central performance claims hold under scrutiny, the work would be significant for scientific machine learning, offering a practical route to high-precision surrogates in safety-critical and high-dimensional settings where L^∞ error control matters. The emphasis on progressive residual refinement with focused sampling addresses a recognized limitation of standard MSE-trained networks. The 20D power-system example is presented as the highest-dimensional multistage case studied, which, if supported by detailed diagnostics, would strengthen the case for scalability.

major comments (3)

[Abstract and §4] The headline claims of approaching machine precision on Feynman subsets and substantial L^∞ reductions on the 20D problem are stated without accompanying quantitative tables, error bars, ablation results, or explicit numerical values for RMSE and L^∞ before/after each stage. This absence makes it impossible to verify the magnitude and consistency of the reported improvements.
[§3] (Framework description) The procedure treats the number of refinement stages and the residual normalization scale as free parameters. The manuscript does not specify selection criteria or demonstrate robustness to these choices; without such analysis the claim that refinements 'systematically' reduce both average and worst-case error rests on an incompletely characterized procedure.
[§4.2] (20D power-system experiment) The reported 238× speedup and error reductions are presented as a single-point outcome. No per-stage error curves, ablation removing adaptive sampling or boundary-aware terms, or analysis of behavior once residuals approach floating-point noise are supplied. This directly bears on whether successive refinements continue to drive L^∞ error downward without plateau or instability in 20D.
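The concern about residuals approaching floating-point noise has a concrete meaning in IEEE double precision. The short check below is a generic illustration (the helper `is_resolvable` is a hypothetical name, and nothing here is taken from the paper's experiments): a correction far enough below the machine epsilon, relative to the value it is added to, simply disappears.

```python
import sys

# Spacing between 1.0 and the next representable float64 (about 2.2e-16).
EPS = sys.float_info.epsilon

def is_resolvable(reference, correction):
    """True if adding `correction` to `reference` changes the stored float."""
    return reference + correction != reference

print(is_resolvable(1.0, EPS))      # True: the sum moves to the next float
print(is_resolvable(1.0, EPS / 4))  # False: the correction is rounded away
```

For an ensemble whose output is of order one, a refinement stage whose scaled contribution falls below this level cannot change the summed prediction, which is one reason error curves are expected to flatten near machine precision.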

minor comments (2)

[§3] Notation: The distinction between the ensemble prediction and the residual target at each stage should be made explicit with consistent symbols across equations and text.
[Figure 1] Figures: The schematic of the progressive training loop would benefit from explicit annotation of the normalization step and the adaptive sampling region.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions planned for the next version of the manuscript.

Referee: [Abstract and §4] The headline claims of approaching machine precision on Feynman subsets and substantial L^∞ reductions on the 20D problem are stated without accompanying quantitative tables, error bars, ablation results, or explicit numerical values for RMSE and L^∞ before/after each stage. This absence makes it impossible to verify the magnitude and consistency of the reported improvements.

Authors: We agree that the current manuscript would benefit from more explicit quantitative support. In the revised version we will add tables in §4 that report RMSE and L^∞ values at each refinement stage for the Feynman benchmarks, together with error bars obtained from multiple independent runs. For the 20D power-system example we will likewise tabulate the per-stage error reductions and the final speedup factor.
revision: yes

Referee: [§3] (Framework description) The procedure treats the number of refinement stages and the residual normalization scale as free parameters. The manuscript does not specify selection criteria or demonstrate robustness to these choices; without such analysis the claim that refinements 'systematically' reduce both average and worst-case error rests on an incompletely characterized procedure.

Authors: The referee is correct that these quantities are hyperparameters. We will expand §3 to state explicit stopping criteria (e.g., continue while the validation residual exceeds a threshold near machine precision or until error plateaus) and will add a short robustness study that varies the number of stages and normalization scale on representative problems, confirming that the observed error reductions remain consistent.
revision: yes

Referee: [§4.2] (20D power-system experiment) The reported 238× speedup and error reductions are presented as a single-point outcome. No per-stage error curves, ablation removing adaptive sampling or boundary-aware terms, or analysis of behavior once residuals approach floating-point noise are supplied. This directly bears on whether successive refinements continue to drive L^∞ error downward without plateau or instability in 20D.

Authors: We acknowledge the value of these additional diagnostics. The revised §4.2 will include per-stage RMSE and L^∞ curves, ablations that isolate the contribution of adaptive sampling and boundary-aware training, and a brief analysis of error behavior as residuals approach floating-point limits, showing that further stages do not introduce instability.
revision: yes
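The stopping rule floated in the second response (stop once the validation residual is near machine precision, or once the error plateaus) could look something like the sketch below. Everything in it is an assumption made for illustration: the function name `should_stop`, the threshold `tol`, and the plateau parameter `min_rel_gain` do not come from the manuscript.

```python
def should_stop(linf_history, tol=1e-14, min_rel_gain=0.05):
    """Decide whether to stop adding refinement stages.

    linf_history: validation L-infinity error after each completed stage.
    tol:          hypothetical target near double-precision round-off.
    min_rel_gain: hypothetical plateau test; stop if the last stage
                  improved the error by less than this fraction.
    """
    if not linf_history:
        return False                      # nothing trained yet
    latest = linf_history[-1]
    if latest <= tol:
        return True                       # target precision reached
    if len(linf_history) >= 2:
        prev = linf_history[-2]
        if prev - latest < min_rel_gain * prev:
            return True                   # plateau, or error went up
    return False

print(should_stop([1e-2, 1e-4]))    # False: the last stage gave a large gain
print(should_stop([1e-4, 9.9e-5]))  # True: about 1% gain counts as a plateau
print(should_stop([1e-4, 2e-4]))    # True: the error increased
```

Tracking the L^∞ error rather than RMSE in this rule matches the paper's stated focus on worst-case error, although either metric (or both) could be used.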

Circularity Check

0 steps flagged

No circularity: HiPreNets is a standard multi-stage residual refinement procedure relying on conventional NN optimization.

full rationale

The paper presents HiPreNets as a sequential training process that starts with an initial network and adds refinement networks trained on normalized residuals of the current ensemble, using adaptive sampling and localized patching to target high-error regions. This is an empirical engineering framework built on standard neural-network training loops and loss design rather than any first-principles derivation or mathematical claim that reduces to its own inputs by construction. No equations define a target quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. Performance results on Feynman benchmarks and the 20D power-system ODE are presented as empirical outcomes, not as tautological consequences of the method's own definitions. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard neural-network approximation power plus the domain assumption that iterative residual correction will monotonically improve both norms; no new physical entities are introduced and the only free parameters are conventional training hyperparameters such as stage count and sampling schedules.

free parameters (2)

number of refinement stages: Hyperparameter controlling how many sequential correction networks are trained; chosen to reach target precision.
residual normalization scale: Scaling factor applied to residuals before each refinement stage; tuned as part of training.

axioms (2)

[standard math] Neural networks are universal approximators for continuous functions on compact sets. Implicit foundation for using fully connected networks to model scientific regression targets.
[domain assumption] Normalized residuals from an ensemble can be learned by an additional network without destabilizing prior stages. Core premise that enables the progressive refinement loop described in the abstract.

