pith. sign in

arxiv: 2606.22053 · v1 · pith:5Y6T42RBnew · submitted 2026-06-20 · 💻 cs.LG · cs.AI· cs.NE

Gradient-Descent Steps to Success over Mean Accuracy: A Paradigm Shift for ML

Pith reviewed 2026-06-26 12:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NE
keywords computational effortgradient descent stepslarge learning ratesAutoMLgeneralizationphase transitionsmodel selectionhyperparameter optimization
0
0 comments X

The pith

Shifting ML evaluation from peak accuracy to the number of gradient descent steps needed to reach target accuracy favors large learning rates and reveals phase transitions in restart strategy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing accuracy maximization with a new evaluation criterion: the total number of gradient descent steps required to reach an acceptable accuracy level with high probability. This metric, extended from genetic programming, functions as a form of AutoML by guiding hyperparameter choice. Across 11 models and five classification datasets, the experiments show that optimal settings consistently select unusually large learning rates. These rates enable rapid landscape traversal that both improves generalization and statistically lowers the expected step count. The work also identifies phase transitions: a single training run works for modest accuracy targets, but reaching performance limits requires many independent short restarts instead.

Core claim

Minimising the expected number of gradient descent steps to reach a target accuracy with high probability leads to hyperparameter choices, in particular unusually large learning rates, that promote generalisation while statistically minimising training effort, accompanied by distinct phase transitions in optimal search strategy between single runs for lower targets and numerous independent short restarts for maximum performance.

What carries the argument

The computational effort metric: total gradient descent steps to reach acceptable accuracy with high probability, used as a hardware-independent proxy that extends Koza's genetic programming concept to gradient-descent models and serves as the objective for hyperparameter search.

If this is right

  • Optimal hyperparameters consistently include unusually large learning rates.
  • Rapid aggressive landscape traversal with large rates promotes generalisation and minimises expected computational effort.
  • A single training run suffices for lower accuracy targets while numerous independent short restarts become necessary to reach a model's performance limit.
  • The effort metric supplies a framework for choosing among algorithms according to how efficiently each perceives the difficulty of a given problem at a chosen accuracy level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AutoML systems could be redesigned to search directly over effort rather than accuracy or validation loss.
  • Training protocols might routinely begin with large rates followed by adaptive restarts once a target accuracy band is specified.
  • Model selection in practice could shift from accuracy tables to effort curves that indicate the cheapest route to each accuracy level.
  • The same effort accounting could be applied to non-gradient optimizers to compare training cost across fundamentally different algorithms.

Load-bearing premise

The number of gradient descent steps to reach an acceptable accuracy level with high probability can be measured consistently and serves as a valid hardware-independent proxy for computational effort.

What would settle it

An experiment on the same models and datasets that finds small learning rates achieve target accuracy in fewer total gradient steps than large rates, or shows no observable switch from single-run to multi-restart strategy as accuracy targets increase.

Figures

Figures reproduced from arXiv: 2606.22053 by Ahmet Yilmaz, Riccardo Poli.

Figure 1
Figure 1. Figure 1: Paradigm shift proposed in this paper (see Section [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracies for LR model without regularisation. [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracies for LR model without regularisation (continued). [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracies for NN-ReLU model. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracies for NN-ReLU model (continued). [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Success probability ( [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗
Figure 4
Figure 4. Figure 4: Success probability (a = 0.9) for LR-Inf model (continued). 30 and 70 for η = 0.02). This is due to short periods of monotonic ascent15 associated with the fact that the success probability P(i, a, η) is increasing with every additional epoch, and so the magnitude of the denominator within the ceiling operation in Equations (7) and (8) increases. These ascent periods continue until P(i, a, η) has grown suf… view at source ↗
Figure 5
Figure 5. Figure 5: Success probability ( [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 5
Figure 5. Figure 5: Success probability (a = 0.9) for NN-ReLU model (continued). This happens when success probabilities (see [PITH_FULL_IMAGE:figures/full_fig_p027_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Computational effort (a = 0.9) for LR-Inf model. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗
Figure 6
Figure 6. Figure 6: Computational effort (a = 0.9) for LR-Inf model (continued). computational effort. So, validation computational efforts, too, present the infinite, multi￾run and single-run phases discussed in the previous section. The reason is that for low enough values of a the training and validation accuracy distributions are essentially identical, resulting in almost identical success probabilities. For a = 0.85, exa… view at source ↗
Figure 7
Figure 7. Figure 7: Computational effort (a = 0.9) for NN-ReLU model. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗
Figure 7
Figure 7. Figure 7: Computational effort (a = 0.9) for NN-ReLU model (continued). the required value of a requires more epochs (hence higher E) for validation accuracy than for training accuracy. Essentially, the bigger the overfitting, the bigger the differences. For instance, for a = 0.9, examples of divergence between training and validation computational efforts include the DNA problem solved via LR without regularisation… view at source ↗
Figure 8
Figure 8. Figure 8: Optimal computational effort and hyper-parameters for different values of a for LR-Inf model. 34 [PITH_FULL_IMAGE:figures/full_fig_p034_8.png] view at source ↗
Figure 8
Figure 8. Figure 8: Optimal computational effort and hyper-parameters for different values of a for LR-Inf model (continued). grows exponentially from about 500 gradient descent steps for a = 0.96 to almost 250,000 gradient descent steps for a = 0.99. This phase transition from the regime R∗ = 1 to the exponential growth in R∗ and E∗ corresponds to the situation where irrespective of the learning rate, a single gradient desce… view at source ↗
Figure 9
Figure 9. Figure 9: Optimal computational effort and hyper-parameters for different values of a for NN-ReLU model. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_9.png] view at source ↗
Figure 9
Figure 9. Figure 9: Optimal computational effort and hyper-parameters for different values of a for NN-ReLU model (continued). at the training hyper-parameters across all models in Section S6 of SM, we find that it occurs in over 50% of the 55 model–problem pairs, particularly with Ecoli and Iris, but also, to a lesser extent, BCW and Wine (while it is virtually absent in the case of the DNA dataset). That is, multiple runs a… view at source ↗
Figure 10
Figure 10. Figure 10: Optimal computational effort across problems for all 11 models and different values of a. 44 [PITH_FULL_IMAGE:figures/full_fig_p044_10.png] view at source ↗
Figure 10
Figure 10. Figure 10: Optimal computational effort across problems for all 11 models and different values of a (continued). 45 [PITH_FULL_IMAGE:figures/full_fig_p045_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Models providing the minimum effort (and thus are non-dominated) for different values of a across problems. 46 [PITH_FULL_IMAGE:figures/full_fig_p046_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Models achieving the best training and validation accuracy for each budget of gradient descent steps for each problem. 47 [PITH_FULL_IMAGE:figures/full_fig_p047_12.png] view at source ↗
read the original abstract

Traditional evaluation of machine learning (ML) models typically focuses on achieving the maximum possible accuracy irrespective of the computational cost. In this article, we propose a paradigm shift towards evaluating performance based on computational effort-explicitly defined here as the total number of gradient descent steps required to reach an acceptable level of accuracy with high probability. Building upon the concept of computational effort originally introduced by Koza for Genetic Programming, we extend this metric to any ML model trained via gradient descent. Furthermore, we demonstrate that minimising this effort acts as a novel form of Automatic Machine Learning (AutoML). By evaluating it across 11 diverse ML models and five standard classification datasets, we uncover significant insights into the dynamics of gradient-based learning. Our findings reveal that optimal hyper-parameters consistently favour unusually large learning rates. Crucially, we demonstrate that the rapid, aggressive landscape traversal enabled by these large rates not only promotes generalisation-as seen in phenomena like superconvergence-but also statistically minimises the expected computational effort for training. Furthermore, we identify distinct phase transitions in the optimal search strategy: while a single training run suffices for lower accuracy targets, reaching a model's performance limit requires a dramatic shift towards conducting numerous independent, short restarts. Finally, we illustrate how this effort-based paradigm provides a robust framework for model selection, allowing practitioners to choose optimal algorithms based on the difficulty of a problem as perceived by different models for a given target accuracy, or to maximise the achievable accuracy for a fixed budget of gradient descent steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a paradigm shift in ML evaluation from maximizing accuracy to minimizing computational effort, defined as the expected number of gradient-descent steps required to reach an acceptable accuracy level with high probability. It extends Koza's metric from genetic programming to gradient-descent models, claims that minimizing this effort constitutes a novel AutoML procedure, and reports experiments across 11 models and 5 datasets showing that optimal hyperparameters favor unusually large learning rates (linked to superconvergence), distinct phase transitions in restart strategies (single run vs. many short restarts), and a framework for model selection based on problem difficulty.

Significance. If the effort metric can be shown to be robustly defined and the experimental protocol made reproducible, the work could supply a useful alternative lens on training efficiency and hyperparameter choice that complements accuracy-centric evaluation. The reported preference for large learning rates and the identified phase transitions would then constitute concrete, testable observations about optimization dynamics. At present the significance is constrained by the absence of a verifiable protocol for the metric itself.

major comments (2)
  1. [Abstract / experimental design] Abstract and experimental protocol: the central claims (large-LR optimality, superconvergence link, single-run vs. many-restart phase transitions) rest on treating the expected number of GD steps to a target accuracy (with high probability) as a well-defined, hardware-independent proxy. No explicit protocol is supplied for choosing the per-model/dataset accuracy threshold, for estimating the success probability (number of trials, success criterion), or for controlling confounding factors such as batch-size changes or restart overhead that alter per-step cost.
  2. [Metric definition] The assertion that the metric extends Koza's GP effort measure without modification is load-bearing for the novelty claim, yet the manuscript does not demonstrate that the step count remains comparable when per-step floating-point operations vary with learning-rate magnitude or when divergence forces smaller batches.
minor comments (1)
  1. [Abstract] The abstract refers to "11 diverse ML models and five standard classification datasets" without naming them or providing data-split details; this should be stated explicitly in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater clarity on the experimental protocol and metric robustness. We address each major comment below and will incorporate revisions to strengthen reproducibility while preserving the core contributions on effort-based evaluation.

read point-by-point responses
  1. Referee: [Abstract / experimental design] Abstract and experimental protocol: the central claims (large-LR optimality, superconvergence link, single-run vs. many-restart phase transitions) rest on treating the expected number of GD steps to a target accuracy (with high probability) as a well-defined, hardware-independent proxy. No explicit protocol is supplied for choosing the per-model/dataset accuracy threshold, for estimating the success probability (number of trials, success criterion), or for controlling confounding factors such as batch-size changes or restart overhead that alter per-step cost.

    Authors: We agree an explicit protocol is required for verifiability. In revision we will add a Methods subsection specifying: accuracy thresholds are chosen as 90% of the highest accuracy attained by any configuration for that model-dataset pair; success probability is estimated from 50 independent random-seed trials per hyperparameter setting, with success defined as reaching the threshold before a hard step limit; batch size is fixed across all runs and restart overhead is excluded because the metric counts only gradient-descent steps. These choices keep the proxy hardware-independent by design. revision: yes

  2. Referee: [Metric definition] The assertion that the metric extends Koza's GP effort measure without modification is load-bearing for the novelty claim, yet the manuscript does not demonstrate that the step count remains comparable when per-step floating-point operations vary with learning-rate magnitude or when divergence forces smaller batches.

    Authors: The definitional extension is direct: Koza's effort is the expected number of evaluations to a high-probability solution; we substitute gradient-descent steps for evaluations. We do not claim the count is invariant to per-step cost. In the revised manuscript we will add a short analysis confirming that, in the reported experiments, batch size remained constant and large learning rates did not trigger divergence requiring batch reduction, so step counts are directly comparable. We will also note that any future extension to variable-batch regimes could weight steps by relative FLOPs. revision: partial

Circularity Check

1 steps flagged

Minimizing explicitly defined effort metric claimed as novel AutoML by construction

specific steps
  1. self definitional [Abstract]
    "Furthermore, we demonstrate that minimising this effort acts as a novel form of Automatic Machine Learning (AutoML)."

    Once computational effort is defined as the quantity whose minimization constitutes good performance, the claim that minimizing it is a novel AutoML follows tautologically from the definition itself; the 'demonstration' requires no independent argument or evidence beyond the introduction of the metric.

full rationale

The paper's central proposal is the redefinition of performance as the expected number of GD steps to a target accuracy (with high probability), explicitly extending Koza's GP metric. The assertion that 'minimising this effort acts as a novel form of Automatic Machine Learning (AutoML)' follows directly from that definition without further derivation or external validation. All reported empirical findings (large-LR preference, phase transitions, model selection) are downstream applications of the new metric to experiments on 11 models and 5 datasets and do not reduce to the definition. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. This yields one minor self-definitional step that is not load-bearing for the experimental results, producing a low but non-zero circularity score.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on redefining success via a step-count metric whose target accuracy and probability threshold act as free parameters, plus the domain assumption that step count is a comparable effort measure across models.

free parameters (2)
  • acceptable accuracy level
    Defines the success threshold for counting steps; chosen per dataset and model but not derived from first principles.
  • high probability threshold
    Determines when a run counts as successful; value not specified in abstract.
axioms (1)
  • domain assumption Number of gradient descent steps is a valid and comparable measure of computational effort across different ML models
    Direct extension of Koza's GP metric without additional justification or hardware normalization provided in the abstract.

pith-pipeline@v0.9.1-grok · 5802 in / 1367 out tokens · 39966 ms · 2026-06-26T12:34:49.707473+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 3 canonical work pages

  1. [1]

    Robert J. N. Baldock, Hartmut Maennel, and Behnam Neyshabur. Deep learning through the lens of example difficulty.arXiv preprint arXiv:2106.09647,

  2. [2]

    David F Barrero, Mar´ ıa D R-Moreno, Bonifacio Casta˜ no, and David Camacho

    URLhttps: //doi.org/10.48550/arxiv.2106.09647. David F Barrero, Mar´ ıa D R-Moreno, Bonifacio Casta˜ no, and David Camacho. An empirical study on the accuracy of computational effort in genetic programming. In2011 IEEE Congress on Evolutionary Computation (CEC), pages 1164–1171. IEEE,

  3. [3]

    We need to talk about random seeds.arXiv preprint arXiv:2210.13393,

    Steven Bethard. We need to talk about random seeds.arXiv preprint arXiv:2210.13393,

  4. [4]

    Calle, A

    P. Calle, A. Bates, J. C. Reynolds, Y. Liu, H. Cui, S. Ly, C. Wang, Q. Zhang, A. J. de Armendi, S. S. Shettar, K. M. Fung, Q. Tang, and C. Pan. Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models.arXiv preprint a...

  5. [5]

    ISPRS Journal of Photogrammetry and Remote Sensing (P&RS)118, 83–100 (2016).https://doi.org/10.1016/j

    URLhttps://doi.org/10.1016/j. cmpb.2025.109063. Tom Castle and Colin G Johnson. Evolving high-level imperative program trees with strongly formed genetic programming. InEuropean Conference on Genetic Programming, pages 1–12. Springer,

  6. [6]

    An analysis of koza’s computational effort statistic for genetic programming

    Steffen Christensen and Franz Oppacher. An analysis of koza’s computational effort statistic for genetic programming. InGenetic Programming: 5th European Conference, EuroGP 2002 Kinsale, Ireland, April 3–5, 2002 Proceedings 5, pages 182–191. Springer,

  7. [7]

    Leakage and the reproducibility crisis in ml-based science.arXiv preprint arXiv:2207.07048,

    Sayash Kapoor and Arvind Narayanan. Leakage and the reproducibility crisis in ml-based science.arXiv preprint arXiv:2207.07048,

  8. [8]

    Andrey N

    Ac- cessed: 2026-04-21. Andrey N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7,

  9. [9]

    The large learning rate phase of deep learning: the catapult mechanism.arXiv preprint arXiv:2003.02218,

    Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism.arXiv preprint arXiv:2003.02218,

  10. [10]

    A statistical analysis for per-instance evaluation of stochastic optimizers: Avoiding unreliable conclusions.arXiv preprint arXiv:2503.16589v2,

    Moslem Noori, Elisabetta Valiante, Ignacio Rozada, Thomas Van Vaerenbergh, and Ma- soud Mohseni. A statistical analysis for per-instance evaluation of stochastic optimizers: Avoiding unreliable conclusions.arXiv preprint arXiv:2503.16589v2,

  11. [12]

    Liah Rosenfeld and Leonardo Vanneschi

    URL https://arxiv.org/abs/2601.17655. Liah Rosenfeld and Leonardo Vanneschi. A survey on batch training in genetic program- ming.Genetic Programming and Evolvable Machines, 26(1):2,

  12. [13]

    An overview of gradient descent optimization algorithms.arXiv preprint arXiv:1609.04747,

    Sebastian Ruder. An overview of gradient descent optimization algorithms.arXiv preprint arXiv:1609.04747,

  13. [14]

    A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay.arXiv preprint arXiv:1803.09820,

    Leslie N Smith. A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay.arXiv preprint arXiv:1803.09820,

  14. [15]

    On the origin of implicit regularization in stochastic gradient descent.arXiv preprint arXiv:2101.12176,

    Samuel L Smith, Benoit Dherin, David GT Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent.arXiv preprint arXiv:2101.12176,

  15. [16]

    doi: https://doi.org/10.1016/j.neunet.2022.05.030

    ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2022.05.030. URLhttps://www.sciencedirect.com/ science/article/pii/S0893608022002040. 54 Supplementary Material Appendix S1. Test Problems and Degree of Overparametrisation TheIris,Wine, andBCWdatasets were sourced from the Scikit-learn library (Pedregosa et al., 2011), whileE-coliandDNAwere obtained f...