Hyp-RL : Hyperparameter Optimization by Reinforcement Learning
Pith reviewed 2026-05-25 14:55 UTC · model grok-4.3
The pith
Reinforcement learning can select the next hyperparameter to test by learning from future validation loss reduction instead of fixed heuristics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors model hyperparameter optimization as a sequential decision problem addressed with reinforcement learning. The policy learns to select the next hyperparameter to test based on the subsequent reduction in validation loss it will lead to, either because it yields good models itself or because it allows the policy to build a better surrogate model that chooses better hyperparameters later. Experiments on a large battery of 50 data sets demonstrate that this method outperforms the state-of-the-art approaches for hyperparameter learning.
What carries the argument
A reinforcement learning policy that selects the next hyperparameter setting to evaluate according to the expected future reduction in validation loss.
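The sequential-decision framing can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: the per-step reward is assumed to be the improvement in the best validation loss seen so far, the configurations form a fixed grid with made-up losses, and `first_untested`, `run_episode`, and `initial_loss` are hypothetical names. A learned policy would take the place of the placeholder.

```python
from typing import Callable, List, Sequence, Tuple

# One history entry: (configuration index, observed validation loss).
History = List[Tuple[int, float]]


def first_untested(history: History, n_configs: int) -> int:
    """Hypothetical placeholder policy: pick the lowest-index untested configuration."""
    tested = {index for index, _ in history}
    for index in range(n_configs):
        if index not in tested:
            return index
    return 0  # everything was tested already; fall back to the first configuration


def run_episode(
    losses: Sequence[float],
    policy: Callable[[History, int], int],
    budget: int,
    initial_loss: float = 1.0,
) -> Tuple[History, float, float]:
    """Roll out one tuning episode over a fixed grid of configurations.

    losses[i] is the (made-up) validation loss of configuration i.
    Assumed reward: how much the best loss so far improves at each step.
    """
    history: History = []
    best = initial_loss
    total_reward = 0.0
    for _ in range(budget):
        action = policy(history, len(losses))  # the decision: what to test next
        loss = losses[action]                  # stands in for training and validating
        total_reward += max(0.0, best - loss)  # reward is the loss reduction, if any
        best = min(best, loss)
        history.append((action, loss))
    return history, best, total_reward
```

Under this assumed reward, the episode return telescopes to `initial_loss - best`, so maximizing the return is the same as minimizing the final best validation loss. That is one way to read "future reduction in validation loss" as a long-horizon objective.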
Load-bearing premise
A single reinforcement learning policy trained on the described setup generalizes across the 50 datasets and model families without dataset-specific retraining or adjustments.
What would settle it
Applying the method to a fresh collection of datasets and observing that it fails to reach lower validation losses than Bayesian optimization baselines after the same number of evaluations.
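One way to run such a check is a paired comparison across datasets at an equal evaluation budget. The sketch below uses an exact two-sided sign test. The choice of test is an assumption made for illustration, since nothing in this review states which test, if any, the paper uses. `sign_test` is a hypothetical helper, and its inputs are the best validation losses each method reached on the same datasets.

```python
import math
from typing import Sequence, Tuple


def sign_test(losses_a: Sequence[float], losses_b: Sequence[float]) -> Tuple[int, int, float]:
    """Exact two-sided sign test on paired per-dataset losses (lower is better).

    Returns (datasets where A wins, datasets where B wins, p-value).
    Ties are dropped, as is conventional for the sign test.
    """
    if len(losses_a) != len(losses_b):
        raise ValueError("both methods must be evaluated on the same datasets")
    wins_a = sum(a < b for a, b in zip(losses_a, losses_b))
    wins_b = sum(b < a for a, b in zip(losses_a, losses_b))
    n = wins_a + wins_b
    if n == 0:
        return 0, 0, 1.0  # only ties: no evidence either way
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin, one tail.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2.0 * tail)
```

The sign test uses only which method wins on each dataset, so it makes no assumption about the scale of the losses. It does ignore the size of each win, so a rank-based or bootstrap comparison could be reported alongside it.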
Original abstract
Hyperparameter tuning is an omnipresent problem in machine learning as it is an integral aspect of obtaining the state-of-the-art performance for any model. Most often, hyperparameters are optimized just by training a model on a grid of possible hyperparameter values and taking the one that performs best on a validation sample (grid search). More recently, methods have been introduced that build a so-called surrogate model that predicts the validation loss for a specific hyperparameter setting, model and dataset and then sequentially select the next hyperparameter to test, based on a heuristic function of the expected value and the uncertainty of the surrogate model called acquisition function (sequential model-based Bayesian optimization, SMBO). In this paper we model the hyperparameter optimization problem as a sequential decision problem, which hyperparameter to test next, and address it with reinforcement learning. This way our model does not have to rely on a heuristic acquisition function like SMBO, but can learn which hyperparameters to test next based on the subsequent reduction in validation loss they will eventually lead to, either because they yield good models themselves or because they allow the hyperparameter selection policy to build a better surrogate model that is able to choose better hyperparameters later on. Experiments on a large battery of 50 data sets demonstrate that our method outperforms the state-of-the-art approaches for hyperparameter learning.
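For contrast with the learned policy, the sketch below shows expected improvement, a common acquisition function of the kind the abstract describes. It is used here as a representative example and is not claimed to be the specific baseline in the paper. The surrogate's predictions are assumed to arrive as plain (mean, standard deviation) pairs, and `expected_improvement` and `select_next` are hypothetical names.

```python
import math
from typing import Sequence, Tuple


def expected_improvement(mean: float, std: float, best_loss: float) -> float:
    """Expected improvement for minimization under a Gaussian prediction.

    mean and std are the surrogate's predicted validation loss and its
    uncertainty for one candidate; best_loss is the lowest loss observed so far.
    """
    if std <= 0.0:
        return max(best_loss - mean, 0.0)  # no uncertainty: improvement is deterministic
    z = (best_loss - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF at z
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF at z
    return (best_loss - mean) * cdf + std * pdf


def select_next(predictions: Sequence[Tuple[float, float]], best_loss: float) -> int:
    """Return the index of the candidate with the highest expected improvement."""
    scores = [expected_improvement(mean, std, best_loss) for mean, std in predictions]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `best_loss = 0.5`, a candidate predicted at 0.55 with standard deviation 0.20 scores higher than one predicted at 0.50 with standard deviation 0.01. The heuristic trades predicted value against uncertainty, but only one step ahead, which is the myopia the reinforcement learning formulation aims to avoid.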
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hyp-RL, which frames hyperparameter optimization as a sequential decision-making task solved via reinforcement learning. The RL policy is trained to select the next hyperparameter configuration based on its expected contribution to reducing validation loss, either directly or by improving the surrogate model for future selections. This is positioned as an alternative to SMBO methods that rely on heuristic acquisition functions. The central empirical claim is that the approach outperforms state-of-the-art hyperparameter optimization methods across experiments on 50 datasets.
Significance. If the reported generalization of a single learned policy holds without dataset-specific retraining or data leakage, the work would provide a data-driven alternative to hand-designed acquisition functions in Bayesian optimization for hyperparameter tuning. The direct optimization of long-term validation loss via RL is a conceptually distinct approach from myopic heuristics.
major comments (3)
- [Abstract] Abstract: The claim that experiments on 50 datasets demonstrate outperformance requires explicit description of the train/eval split used for RL policy training and whether a single policy was applied across all datasets and model families. Without this, it is impossible to determine whether gains reflect a transferable acquisition strategy or dataset-specific adaptation.
- [Experiments] Experiments section: No information is supplied on the specific SMBO baselines, number of independent runs, statistical significance tests, or variance in performance; these omissions make it impossible to assess whether the reported superiority is robust or could be explained by differences in tuning effort between the RL agent and the baselines.
- [Method] Method section: The state representation fed to the RL policy is not described in sufficient detail to evaluate whether it incorporates dataset- or model-family-specific features that would require retraining for each new evaluation dataset, which directly affects the load-bearing generalization assumption.
minor comments (1)
- [Abstract] Abstract: The phrase 'a large battery of 50 data sets' should be accompanied by a brief characterization of dataset diversity and model families to allow readers to gauge the scope of the generalization claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, clarifying our experimental design and committing to revisions that enhance transparency without altering the core claims.
- Referee: [Abstract] Abstract: The claim that experiments on 50 datasets demonstrate outperformance requires explicit description of the train/eval split used for RL policy training and whether a single policy was applied across all datasets and model families. Without this, it is impossible to determine whether gains reflect a transferable acquisition strategy or dataset-specific adaptation.
  Authors: We agree that the abstract should make the train/eval protocol explicit. The RL policy is trained once on a separate collection of datasets and then applied without retraining or dataset-specific adaptation to the 50 evaluation datasets spanning multiple model families. This setup is intended to demonstrate a transferable policy. We will revise the abstract to state this split and the use of a single policy.
  Revision: yes
- Referee: [Experiments] Experiments section: No information is supplied on the specific SMBO baselines, number of independent runs, statistical significance tests, or variance in performance; these omissions make it impossible to assess whether the reported superiority is robust or could be explained by differences in tuning effort between the RL agent and the baselines.
  Authors: We acknowledge that these experimental details are missing from the current version. We will expand the experiments section to name the exact SMBO baselines, report the number of independent runs, include statistical significance tests, and provide variance measures so that the robustness of the comparisons can be properly evaluated.
  Revision: yes
- Referee: [Method] Method section: The state representation fed to the RL policy is not described in sufficient detail to evaluate whether it incorporates dataset- or model-family-specific features that would require retraining for each new evaluation dataset, which directly affects the load-bearing generalization assumption.
  Authors: We agree that the state representation requires a more precise description. The features are deliberately general (current hyperparameter values, historical validation losses, and surrogate-model statistics) and contain no dataset- or model-family identifiers. We will revise the method section to list the exact state components and confirm that the same policy is used across all evaluation datasets without retraining.
  Revision: yes
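A dataset-agnostic state of the kind described in the last response could look like the sketch below. It is a guess at the general shape, not the paper's actual encoding: `build_state`, `max_steps`, and `n_hparams` are hypothetical, and two summary statistics of the observed losses stand in for the "surrogate-model statistics", because no surrogate model is defined here.

```python
from typing import List, Sequence, Tuple

# One history entry: (hyperparameter values, observed validation loss).
Step = Tuple[Sequence[float], float]


def build_state(history: Sequence[Step], max_steps: int, n_hparams: int) -> List[float]:
    """Flatten the tuning history into a fixed-length vector with no dataset identifier.

    Layout: max_steps slots of (hyperparameter values, loss), zero-padded, then the
    best and the mean observed loss as stand-in summary statistics.
    """
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    recent = list(history)[-max_steps:]  # keep only the most recent evaluations
    state: List[float] = []
    for hparams, loss in recent:
        if len(hparams) != n_hparams:
            raise ValueError("every configuration needs n_hparams values")
        state.extend(float(value) for value in hparams)
        state.append(float(loss))
    # Zero-pad the unused slots so the vector length never depends on the step count.
    state.extend([0.0] * ((max_steps - len(recent)) * (n_hparams + 1)))
    losses = [loss for _, loss in recent]
    state.append(min(losses) if losses else 0.0)
    state.append(sum(losses) / len(losses) if losses else 0.0)
    return state
```

Because the vector holds only hyperparameter values and losses, the same policy network could in principle be applied to any dataset that shares the hyperparameter space. Whether that is enough for transfer across model families is exactly what the referee's comment asks the authors to document.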
Circularity Check
No circularity: empirical outperformance claim rests on external benchmarks, not self-referential definitions or fits.
full rationale
The paper frames HPO as an RL sequential decision task and reports that the learned policy outperforms SMBO baselines on a held-out collection of 50 datasets. No equations, surrogate-model parameters, or acquisition functions are defined in terms of the target performance metric; the RL objective is standard policy optimization on validation-loss reduction. The central result is an empirical comparison against independent baselines rather than a derived quantity that equals its own training inputs by construction. No self-citation chain is invoked to establish uniqueness or to smuggle an ansatz. The generalization assumption is a standard empirical claim, not a definitional reduction.