Mise en abyme with artificial intelligence: how to predict the accuracy of NN, applied to hyper-parameter tuning

Giorgia Franchini; Mathilde Galinier; Micaela Verucchi

arxiv: 1907.00924 · v1 · pith:RXBUXCTSnew · submitted 2019-06-28 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 13:26 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords: neural network accuracy prediction, hyperparameter optimization, support vector machines, early training behavior, MNIST classification, CIFAR-10 classification, convolutional neural networks, curve fitting

The pith

A support vector machine trained on early training runs can predict final neural network accuracy, enabling low-cost hyperparameter searches that recover known optimal results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multiple training runs with varied network characteristics can create a database pairing initial accuracy behavior with final accuracy. Curve fitting combined with a support vector machine trained on that database then predicts final accuracy from only the first few iterations of a new run. This matters because full training is the dominant computational cost when designing networks or tuning their hyperparameters. The method is demonstrated on convolutional networks for MNIST and CIFAR-10 classification, where it locates hyperparameter settings that achieve the best accuracies reported in the literature.
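As a rough illustration of the curve-fitting step, the sketch below fits a saturating curve acc(t) = a - b/t to a few early epochs by ordinary least squares and reads off a as the extrapolated plateau. The curve family, the function name `fit_plateau`, and the sample accuracies are assumptions made for illustration; this summary does not say which curve family the authors actually use.

```python
def fit_plateau(accuracies):
    """Least-squares fit of acc(t) = a - b / t for epochs t = 1..n.

    Returns (a, b), where a is the extrapolated accuracy as t grows.
    The model is linear in x = 1/t, so a closed-form fit is enough.
    """
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two epochs to fit two coefficients")
    xs = [1.0 / t for t in range(1, n + 1)]
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx              # slope of acc against 1/t, equal to -b
    a = mean_y - slope * mean_x    # intercept, i.e. the plateau
    return a, -slope


# Hypothetical accuracies after epochs 1 and 2 (not taken from the paper).
early = [0.90, 0.94]
plateau, decay = fit_plateau(early)
print(f"extrapolated plateau: {plateau:.3f}")
```

With only two epochs the two coefficients are determined exactly, so the extrapolation is very sensitive to noise; that sensitivity is one reason a learned predictor is paired with the fit.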

Core claim

By collecting initial and final accuracies across multiple trainings that differ in network characteristics, fitting curves to early performance, and training a support vector machine on the resulting database, the final accuracy of a network can be predicted from its primary iterations alone; applying this predictor inside a probabilistic hyperparameter search recovers the optimal accuracies known for MNIST and CIFAR-10 at substantially reduced computational cost.
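The search loop implied by this claim can be sketched as follows, under loud assumptions: the paper's probabilistic exploration algorithm is not described here, so plain random sampling is swapped in, and `sample_config`, `short_train`, and `predict_final` are hypothetical stand-ins for the real sampler, a few real training epochs, and the trained predictor.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration; the names and ranges are illustrative only."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform draw
        "batch_size": rng.choice([32, 64, 128, 256]),
    }


def short_train(config):
    """Stand-in for a few real epochs: returns fake early accuracies."""
    quality = math.exp(-(math.log10(config["learning_rate"]) + 2.5) ** 2)
    return [0.5 * quality, 0.7 * quality]


def predict_final(early):
    """Stand-in for the trained predictor: a fixed linear extrapolation."""
    return min(1.0, early[-1] + (early[-1] - early[0]))


def search(n_candidates, seed=0):
    """Score each sampled configuration by its predicted final accuracy."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_candidates):
        config = sample_config(rng)
        score = predict_final(short_train(config))
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = search(50)
print(best_config, round(best_score, 3))
```

Only the cheap `short_train` call runs per candidate; a full training would be reserved for the configuration the loop returns.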

What carries the argument

Support vector machine trained on a database of early accuracy observations paired with final accuracies from prior runs, together with curve fitting to model training trajectories.
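A minimal sketch of such a predictor, assuming scikit-learn's `SVR` as the support vector regressor and a purely synthetic database: the saturating-curve generator, the four-epoch feature vector, and the `C` and `epsilon` values are all assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the database: each row holds the accuracies of the
# first four epochs of one run; the target is that run's final accuracy.
n_runs = 200
final_acc = rng.uniform(0.6, 0.99, size=n_runs)   # hidden final accuracy
rate = rng.uniform(0.5, 2.0, size=n_runs)         # hidden learning speed
epochs = np.arange(1, 5)
early = final_acc[:, None] * (1.0 - np.exp(-rate[:, None] * epochs))
early += rng.normal(0.0, 0.005, size=early.shape)  # measurement noise

# Fit on the first 150 runs, evaluate on the 50 runs the model never saw.
train, test = slice(0, 150), slice(150, None)
model = SVR(kernel="rbf", C=10.0, epsilon=0.01)
model.fit(early[train], final_acc[train])

pred = model.predict(early[test])
mae = float(np.mean(np.abs(pred - final_acc[test])))
print(f"held-out mean absolute error: {mae:.3f}")
```

On real training logs the features could just as well be fitted curve coefficients rather than raw early accuracies; which of the two the paper feeds to the SVM is not stated in this summary.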

Load-bearing premise

The mapping from early training behavior to final accuracy remains sufficiently stable across changes in network characteristics for an SVM trained on a modest database of prior runs to generalize to new configurations.

What would settle it

For a previously unseen hyperparameter configuration, measure the actual accuracy after full training and compare it to the accuracy predicted by the SVM from the first few iterations; a large discrepancy falsifies the claim.
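The test described above can be sketched in a few lines. The tolerance that separates an acceptable gap from a "large discrepancy" is not specified anywhere in this summary, so the `tolerance` argument and the sample numbers below are assumptions.

```python
def check_predictions(pairs, tolerance):
    """Compare predicted and fully trained accuracies on unseen configurations.

    pairs: list of (predicted_accuracy, actual_accuracy) tuples.
    tolerance: largest absolute gap still counted as agreement (assumed).
    """
    errors = [abs(predicted - actual) for predicted, actual in pairs]
    worst = max(errors)
    return {
        "mean_error": sum(errors) / len(errors),
        "worst_error": worst,
        "falsified": worst > tolerance,
    }


# Hypothetical (predicted, actual) pairs, not results from the paper.
report = check_predictions([(0.991, 0.989), (0.72, 0.75), (0.88, 0.87)],
                           tolerance=0.02)
print(report)
```

Judging by the worst case rather than the mean is a deliberate choice here: a single badly mispredicted configuration is enough to mislead a search.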

Figures

Figures reproduced from arXiv: 1907.00924 by Giorgia Franchini, Mathilde Galinier, Micaela Verucchi.

**Figure 2.** Comparison of the final accuracy predicted with SVM and curve fitting against the ground truth (i.e. the network is fully trained up to convergence with the same hyper-parameters), feeding the method with only two training epochs (left) and four (right). As can be seen, the method can effectively provide a satisfactory prediction of the final behaviour of the network after o…

Original abstract

In the context of deep learning, the costliest phase from a computational point of view is the full training of the learning algorithm. However, this process is to be used a significant number of times during the design of a new artificial neural network, leading therefore to extremely expensive operations. Here, we propose a low-cost strategy to predict the accuracy of the algorithm, based only on its initial behaviour. To do so, we train the network of interest up to convergence several times, modifying its characteristics at each training. The initial and final accuracies observed during this beforehand process are stored in a database. We then make use of both curve fitting and Support Vector Machines techniques, the latter being trained on the created database, to predict the accuracy of the network, given its accuracy on the primary iterations of its learning. This approach can be of particular interest when the space of the characteristics of the network is notably large or when its full training is highly time-consuming. The results we obtained are promising and encouraged us to apply this strategy to a topical issue: hyper-parameter optimisation (HO). In particular, we focused on the HO of a convolutional neural network for the classification of the databases MNIST and CIFAR-10. By using our method of prediction, and an algorithm implemented by us for a probabilistic exploration of the hyper-parameter space, we were able to find the hyper-parameter settings corresponding to the optimal accuracies already known in literature, at a quite low-cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes training neural networks multiple times with varying characteristics to build a database of initial and final accuracies, then using curve fitting and an SVM trained on this database to predict final accuracy from early training behavior. This predictor is integrated into a custom probabilistic hyperparameter optimization procedure to search for optimal settings of a CNN on MNIST and CIFAR-10, with the claim that known literature optima are recovered at low computational cost.

Significance. If the early-to-final accuracy mapping generalizes reliably to unseen hyperparameter configurations, the approach could reduce the number of full trainings needed during hyperparameter search, offering a practical low-cost alternative for expensive models or large search spaces. The reported recovery of known optima on standard benchmarks would indicate potential utility, though the absence of quantitative validation leaves the practical impact unevaluated.
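To make the cost argument concrete, the sketch below counts training epochs with and without the predictor. The accounting model is an assumption: it charges the full trainings needed to build the database, ignores the cost of fitting the SVM and of the final full training of the chosen configuration, and uses made-up run counts.

```python
import math


def epochs_spent(n_database_runs, n_candidates, full_epochs, early_epochs):
    """Total epochs for a predictor-guided search and for a plain search."""
    with_predictor = n_database_runs * full_epochs + n_candidates * early_epochs
    without_predictor = n_candidates * full_epochs
    return with_predictor, without_predictor


def break_even_candidates(n_database_runs, full_epochs, early_epochs):
    """Smallest number of candidates for which the predictor saves epochs."""
    if early_epochs >= full_epochs:
        raise ValueError("early_epochs must be smaller than full_epochs")
    threshold = n_database_runs * full_epochs / (full_epochs - early_epochs)
    return math.floor(threshold) + 1


# Hypothetical budget: 50 database runs, 500 candidates, 100 vs 4 epochs.
with_p, without_p = epochs_spent(50, 500, full_epochs=100, early_epochs=4)
needed = break_even_candidates(50, full_epochs=100, early_epochs=4)
print(with_p, without_p, needed)
```

The database is a fixed up-front cost, so the saving only materialises once the search evaluates more candidates than the break-even count.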

major comments (3)

[Abstract] Abstract: The assertion of 'promising results' and recovery of 'the hyper-parameter settings corresponding to the optimal accuracies already known in literature' is unsupported by any reported metrics, achieved accuracies, number of full trainings avoided, baseline comparisons, or error analysis of the predictions.
[Method (database and SVM)] Database construction and SVM training: The SVM is fitted directly to accuracy pairs generated by the same class of network trainings it is later asked to forecast, conditioning the predictor on data drawn from the target distribution rather than independent external benchmarks; this makes generalization to new hyper-parameter combinations an unverified assumption.
[Experiments (HO results)] Hyper-parameter optimisation experiments: No information is supplied on database size, diversity of network characteristics, held-out validation error of the SVM, or the probabilistic exploration algorithm; without these, the success on MNIST/CIFAR-10 provides no evidence that prediction errors would not cause the search to miss superior configurations.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. We address each major comment below, indicating revisions that will be incorporated into the next version of the manuscript to provide the requested quantitative support and methodological details.

Referee: [Abstract] Abstract: The assertion of 'promising results' and recovery of 'the hyper-parameter settings corresponding to the optimal accuracies already known in literature' is unsupported by any reported metrics, achieved accuracies, number of full trainings avoided, baseline comparisons, or error analysis of the predictions.

Authors: We agree that the abstract would be strengthened by quantitative support. In the revision we will add specific metrics including the SVM prediction error on held-out data, the number of full trainings performed versus avoided, the final accuracies recovered on MNIST and CIFAR-10, and a brief comparison to a standard grid-search baseline. These additions will be drawn from the experimental results already obtained and will be stated concisely in the abstract.
Revision: yes

Referee: [Method (database and SVM)] Database construction and SVM training: The SVM is fitted directly to accuracy pairs generated by the same class of network trainings it is later asked to forecast, conditioning the predictor on data drawn from the target distribution rather than independent external benchmarks; this makes generalization to new hyper-parameter combinations an unverified assumption.

Authors: The database is generated by systematically varying hyper-parameters and architectural choices across repeated trainings, and the subsequent hyper-parameter search is allowed to propose combinations outside the exact training set. Nevertheless, the referee correctly notes that explicit verification of generalization is missing. We will add a held-out validation split of the accuracy-pair database and report the SVM test error on configurations not seen during SVM training, thereby quantifying the generalization assumption.
Revision: partial

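One way the promised held-out split could look is sketched below. It splits by configuration rather than by run, so repeated trainings of one configuration never end up on both sides. The record layout (`config_id`, `early`, `final`), the helper names, and the last-epoch stand-in predictor are assumptions for illustration, not the authors' protocol.

```python
import random


def split_by_config(records, test_fraction=0.25, seed=0):
    """Split runs so that no configuration appears in both train and test."""
    ids = sorted({r["config_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [r for r in records if r["config_id"] not in test_ids]
    test = [r for r in records if r["config_id"] in test_ids]
    return train, test


def mean_absolute_error(predict, records):
    """Average gap between predicted and actual final accuracy."""
    return sum(abs(predict(r["early"]) - r["final"]) for r in records) / len(records)


# Hypothetical database: 8 configurations, 3 repeated runs each.
records = [
    {"config_id": cid,
     "early": [0.60 + 0.01 * cid, 0.70 + 0.01 * cid],
     "final": 0.80 + 0.01 * cid}
    for cid in range(8) for _ in range(3)
]
train, test = split_by_config(records)

# Naive stand-in predictor: assume accuracy stays at its last early value.
mae = mean_absolute_error(lambda early: early[-1], test)
print(len(train), len(test), round(mae, 3))
```

A naive baseline like the last-epoch predictor is worth reporting next to the SVM error, since it shows how much of the accuracy is attributable to the learned model.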
Referee: [Experiments (HO results)] Hyper-parameter optimisation experiments: No information is supplied on database size, diversity of network characteristics, held-out validation error of the SVM, or the probabilistic exploration algorithm; without these, the success on MNIST/CIFAR-10 provides no evidence that prediction errors would not cause the search to miss superior configurations.

Authors: We acknowledge that the current manuscript omits these essential details. The revised version will report: (i) the exact number of trainings used to build the database and the ranges of characteristics varied, (ii) the held-out validation error of the SVM, and (iii) a concise description of the probabilistic exploration procedure (including how prediction uncertainty is propagated). These additions will allow readers to evaluate the risk that prediction errors could cause the search to overlook better configurations.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical surrogate model is self-contained

The paper builds an empirical database from multiple full trainings with varied network characteristics, fits an SVM (plus curve fitting) on that database to map early behavior to final accuracy, and deploys the fitted model as a cheap surrogate inside a probabilistic hyper-parameter search. This is a standard learned predictor whose validity rests on generalization from the database rather than any derivation that reduces by construction to its own inputs. No equations, self-citations, or uniqueness claims are present that would create a self-definitional or fitted-input loop; the reported recovery of known MNIST/CIFAR-10 optima is offered as external empirical evidence, not a mathematical identity.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The prediction pipeline depends on empirical fitting of an SVM and curve models to a database of training runs; the only explicit free parameters are therefore the SVM hyperparameters and the curve coefficients. The central domain assumption is that early behavior is stably predictive of final accuracy.

free parameters (2)

SVM hyperparameters (kernel, regularization): chosen or fitted when training the support vector machine on the collected accuracy database.
Curve-fitting coefficients: determined from the observed accuracy trajectories during the preliminary multi-run collection phase.

axioms (1)

Domain assumption: early training accuracy behavior correlates reliably with final accuracy across variations in network characteristics. This premise is required for the SVM trained on the database to produce useful predictions for unseen configurations.
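A quick first check of this premise on any collected database is the correlation between an early accuracy and the final one. The sketch below computes a Pearson coefficient by hand; the choice of the fourth epoch as the early measurement and the five sample runs are assumptions, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical runs: accuracy after epoch 4 and after full training.
early_epoch4 = [0.52, 0.61, 0.70, 0.83, 0.90]
final = [0.60, 0.72, 0.78, 0.91, 0.97]
r = pearson(early_epoch4, final)
print(f"correlation between early and final accuracy: {r:.3f}")
```

A high coefficient on the database is necessary but not sufficient: it says nothing about configurations that lie outside the ranges the database covers.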

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

[1] F. Hutter, H. Hoos, and K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration. In LION-5, 2011. Extended version as UBC Tech report TR-2010-10.

[2] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, Algorithms for hyper-parameter optimization. In NIPS, 2011.

[3] D. R. Jones, A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–383, 2001.

[4] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams and N. de Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proc. IEEE 104(1) (2016) 148–175.

[5] J. Mockus, V. Tiesis, and A. Zilinskas, The application of Bayesian methods for seeking the extremum. In L.C.W. Dixon and G.P. Szego, editors, Towards Global Optimization, volume 2, pages 117–129. North Holland, New York, 1978.

[6] J. Bergstra and Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13(1) (2012) 281–305.

[7] J. Snoek, H. Larochelle and R. P. Adams, Practical Bayesian optimization of machine learning algorithms, Adv. Neural Inf. Process. Syst. 25 (2012) 2951–2959.

[8] B. Zoph and Q. V. Le, Neural architecture search with reinforcement learning, Int. Conf. Learning Representations, Toulon, France, 2017, pp. 1–16.

[9] B. Baker, O. Gupta, N. Naik and R. Raskar, Designing neural network architectures using reinforcement learning, Int. Conf. Learning Representations, Toulon, France, 2017, pp. 1–18.

[10] Z. Zhong, J. Yan, W. Wei, J. Shao and C.-L. Liu, Practical block-wise neural network architecture generation, Conf. Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, 2018, arXiv preprint: 1708.05552.

[11] H. Cai, T. Chen, W. Zhang, Y. Yu and J. Wang, Efficient architecture search by network transformation, AAAI Conf. Artificial Intelligence, New Orleans, Louisiana, USA, 2018, pp. 2787–2794.

[12] O. Chapelle and V. Vapnik, Model Selection for Support Vector Machines. In Advances in Neural Information Processing Systems, Vol. 12, 1999.

[13] Sandra Lach Arlinghaus, PHB Practical Handbook of Curve Fitting. CRC Press, 1994.
