Understanding High-Dimensional Bayesian Optimization
Pith reviewed 2026-05-23 03:25 UTC · model grok-4.3
The pith
Vanishing gradients from Gaussian process initialization schemes cause most high-dimensional Bayesian optimization failures, while maximum likelihood estimation of length scales suffices for state-of-the-art performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our empirical analysis shows that vanishing gradients caused by Gaussian process (GP) initialization schemes play a major role in the failures of high-dimensional Bayesian optimization (HDBO) and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation (MLE) of GP length scales suffices for state-of-the-art performance. Based on this, we propose a simple variant of MLE called MSR that leverages these findings to achieve state-of-the-art performance on a comprehensive set of real-world applications.
What carries the argument
Vanishing gradients induced by common Gaussian process initialization schemes, countered by maximum likelihood estimation of length scales that favors local search behavior.
If this is right
- Maximum likelihood estimation of GP length scales alone reaches state-of-the-art results in high-dimensional settings.
- Methods that promote local search outperform those that emphasize global exploration.
- The MSR variant of MLE attains state-of-the-art performance on diverse real-world applications.
- Targeted experiments can isolate and confirm the contribution of vanishing gradients to HDBO failures.
Where Pith is reading between the lines
- High-dimensional problems may reward focused local exploitation more than broad exploration strategies.
- Similar initialization and length-scale adjustments could simplify other Gaussian-process-based optimizers.
- MSR might be tested on synthetic high-dimensional functions with known global optima to separate local-search benefits from benchmark-specific effects.
- The results suggest that many reported failures of high-dimensional BO may be fixable with standard tools rather than requiring entirely new algorithms.
Load-bearing premise
The performance gaps observed across the tested real-world applications arise primarily from the vanishing-gradient mechanism rather than from other unexamined elements of the experimental design or benchmark selection.
What would settle it
A controlled trial that alters only the GP initialization to eliminate vanishing gradients while holding all other factors fixed, after which the performance advantage of MLE-based local-search methods disappears.
Figures
read the original abstract
Recent work reported that simple Bayesian optimization (BO) methods perform well for high-dimensional real-world tasks, seemingly contradicting prior work and tribal knowledge. This paper investigates why. We identify underlying challenges that arise in high-dimensional BO and explain why recent methods succeed. Our empirical analysis shows that vanishing gradients caused by Gaussian process (GP) initialization schemes play a major role in the failures of high-dimensional Bayesian optimization (HDBO) and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation (MLE) of GP length scales suffices for state-of-the-art performance. Based on this, we propose a simple variant of MLE called MSR that leverages these findings to achieve state-of-the-art performance on a comprehensive set of real-world applications. We present targeted experiments to illustrate and confirm our findings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates why simple Bayesian optimization methods succeed on high-dimensional real-world tasks despite prior expectations. It identifies vanishing gradients from Gaussian process initialization schemes as a primary cause of high-dimensional BO failures, argues that methods promoting local search are better suited, shows that MLE of GP length scales suffices for strong performance, and proposes a simple MLE variant called MSR that achieves state-of-the-art results on real-world applications, supported by targeted experiments.
Significance. If the empirical findings hold after improved controls, the work offers a clear mechanistic explanation for recent HDBO observations and a practical, low-complexity method (MSR) that matches or exceeds more elaborate approaches. The emphasis on initialization effects and local-search promotion provides a useful lens for diagnosing and designing future high-dimensional optimizers. The targeted experiments are a positive step toward reproducibility in this empirical domain.
major comments (1)
- [section on targeted experiments and empirical analysis] The central attribution of performance gaps to vanishing gradients from GP initialization requires explicit isolation experiments that toggle only the initialization scheme (or length-scale handling) while fixing acquisition-function optimization, length-scale constraints, random seeds, and benchmark selection. The described targeted experiments do not appear to include such controls, leaving open the possibility that observed differences arise from other unablated factors (see skeptic note on weakest assumption).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify our experimental design. We respond to the single major comment below.
read point-by-point responses
-
Referee: [section on targeted experiments and empirical analysis] The central attribution of performance gaps to vanishing gradients from GP initialization requires explicit isolation experiments that toggle only the initialization scheme (or length-scale handling) while fixing acquisition-function optimization, length-scale constraints, random seeds, and benchmark selection. The described targeted experiments do not appear to include such controls, leaving open the possibility that observed differences arise from other unablated factors (see skeptic note on weakest assumption).
Authors: We appreciate the referee's emphasis on rigorous isolation. Our targeted experiments (detailed in the section on empirical analysis) were designed to vary only the GP initialization scheme and length-scale handling: we compared standard initialization (which induces vanishing gradients) against MLE-based length-scale estimation while holding fixed the acquisition-function optimizer, length-scale constraints (e.g., bounds and positivity), random seeds, and the exact set of benchmark tasks. All other algorithmic components remained identical across runs. This isolates the contribution of initialization-induced gradient issues. We will revise the manuscript to add an explicit paragraph and table footnote enumerating these fixed factors, thereby making the isolation protocol unambiguous. revision: partial
Circularity Check
No circularity: claims rest on targeted experiments, not self-referential definitions or fitted inputs
full rationale
The paper presents an empirical investigation into HDBO failures, attributing them to vanishing gradients from GP initialization via targeted experiments, and proposes MSR as a simple MLE variant. No equations define quantities in terms of themselves, no predictions are fitted inputs renamed, and no load-bearing self-citations or uniqueness theorems reduce the central claims to prior author work by construction. The analysis is driven by experimental comparisons on real-world tasks, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Gaussian processes provide a suitable surrogate model for the unknown objective in Bayesian optimization
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our empirical analysis shows that vanishing gradients caused by Gaussian process (GP) initialization schemes play a major role in the failures of high-dimensional Bayesian optimization (HDBO) and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation (MLE) of GP length scales suffices for state-of-the-art performance.
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a simple variant of MLE called MSR that leverages these findings to achieve state-of-the-art performance on a comprehensive set of real-world applications.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights
AB-SID-iVAR enables Gaussian process active learning for self-induced Boltzmann distributions by closed-form approximation of the target, with high-probability error vanishing guarantees and empirical gains on PES and...
-
Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?
Proposes SPMO framework with ESPI acquisition function to find one high-quality single solution in many-objective BO under limited budgets instead of approximating the entire Pareto front.
Reference graph
Works this paper leans on
-
[1]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Unexpected improvements to expected improvement for bayesian optimization
Ament, S., Daulton, S., Eriksson, D., Balandat, M., and Bakshy, E. Unexpected improvements to expected improvement for bayesian optimization . Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[3]
Balandat, M., Karrer, B., Jiang, D., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization . Advances in neural information processing systems, 33: 0 21524--21538, 2020. URL https://github.com/pytorch/botorch/tree/v0.12.0. Last access: Jan 16, 2025
work page 2020
-
[4]
Bardou, A., Thiran, P., and Begin, T. Relaxing the additivity constraints in decentralized no-regret high-dimensional bayesian optimization. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[5]
Baudi s , P. and Po s \'i k, P. Online Black-Box Algorithm Portfolios for Continuous Optimization . In Parallel Problem Solving from Nature -- PPSN XIII, pp.\ 40--49, Cham, 2014. Springer International Publishing
work page 2014
-
[6]
Binois, M. and Wycoff, N. A Survey on High-dimensional Gaussian Process Modeling with Application to Bayesian Optimization . ACM Trans. Evol. Learn. Optim., 2 0 (2), aug 2022. doi:10.1145/3545611
-
[7]
Bouhlel, M. A., Bartoli, N., Regis, R. G., Otsmane, A., and Morlier, J. Efficient global optimization for high-dimensional constrained problems by using the K riging models combined with the partial least squares method. Engineering Optimization, 50 0 (12): 0 2038--2053, 2018
work page 2038
-
[8]
Calandra, R., Seyfarth, A., Peters, J., and Deisenroth, M. P. Bayesian optimization for learning gaits under uncertainty . Annals of Mathematics and Artificial Intelligence, 76 0 (1): 0 5--23, 2016
work page 2016
-
[9]
Semi-supervised E mbedding L earning for H igh-dimensional B ayesian O ptimization
Chen, J., Zhu, G., Yuan, C., and Huang, Y. Semi-supervised E mbedding L earning for H igh-dimensional B ayesian O ptimization. arXiv preprint arXiv:2005.14601, 2020
-
[10]
Deshwal, A., Ament, S., Balandat, M., Bakshy, E., Doppa, J. R., and Eriksson, D. Bayesian optimization over high-dimensional combinatorial spaces via dictionary-based embeddings . In International Conference on Artificial Intelligence and Statistics, pp.\ 7021--7039. PMLR, 2023
work page 2023
-
[11]
K., Nickisch, H., and Rasmussen, C
Duvenaud, D. K., Nickisch, H., and Rasmussen, C. Additive gaussian processes . Advances in neural information processing systems, 24, 2011
work page 2011
-
[12]
Eriksson, D. and Jankowiak, M. High-dimensional Bayesian optimization with sparse axis-aligned subspaces . In Uncertainty in Artificial Intelligence, pp.\ 493--503. PMLR, 2021
work page 2021
-
[13]
Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local Bayesian optimization . Advances in neural information processing systems, 32, 2019
work page 2019
-
[14]
Frazier, P. I. A tutorial on Bayesian optimization . arXiv preprint arXiv:1807.02811, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[15]
Discovering and exploiting additive structure for B ayesian optimization
Gardner, J., Guo, C., Weinberger, K., Garnett, R., and Grosse, R. Discovering and exploiting additive structure for B ayesian optimization . In International Conference on Artificial Intelligence and Statistics, pp.\ 1311--1319, 2017
work page 2017
-
[16]
High-dimensional Bayesian optimization via tree-structured additive models
Han, E., Arora, I., and Scarlett, J. High-dimensional Bayesian optimization via tree-structured additive models . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.\ 7630--7638, 2021
work page 2021
-
[17]
O., Hvarfner, C., Papenmeier, L., and Nardi, L
Hellsten, E. O., Hvarfner, C., Papenmeier, L., and Nardi, L. High-dimensional Bayesian Optimization with Group Testing . arXiv preprint arXiv:2310.03515, 2023
-
[18]
Herrmann, M., Lange, F. J. D., Eggensperger, K., Casalicchio, G., Wever, M., Feurer, M., R \"u gamer, D., H \"u llermeier, E., Boulesteix, A.-L., and Bischl, B. Position: Why We Must Rethink Empirical Research in Machine Learning . In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[19]
Hoang, T. N., Hoang, Q. M., Ouyang, R., and Low, K. H. Decentralized high-dimensional bayesian optimization with factor graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018
work page 2018
-
[20]
Hvarfner, C., Hellsten, E. O., and Nardi, L. Vanilla B ayesian optimization performs great in high dimensions. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 2079...
work page 2024
-
[21]
Jones, D. R. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21: 0 345--383, 2001
work page 2001
-
[22]
Jones, D. R. Large-Scale Multi-Disciplinary Mass Optimization in the Auto Industry . In MOPTA 2008 Conference (20 August 2008), 2008
work page 2008
-
[23]
R., Schonlau, M., and Welch, W
Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13: 0 455--492, 1998
work page 1998
-
[24]
High dimensional Bayesian optimisation and bandits via additive models
Kandasamy, K., Schneider, J., and P \'o czos, B. High dimensional Bayesian optimisation and bandits via additive models . In International conference on machine learning, pp.\ 295--304. PMLR, 2015
work page 2015
-
[25]
Karvonen, T. and Oates, C. J. Maximum likelihood estimation in Gaussian process regression is ill-posed . Journal of Machine Learning Research, 24 0 (120): 0 1--47, 2023
work page 2023
-
[26]
K \"o ppen, M. The curse of dimensionality . In 5th online world conference on soft computing in industrial applications (WSC5), volume 1, pp.\ 4--8, 2000
work page 2000
-
[27]
Lam, R., Poloczek, M., Frazier, P., and Willcox, K. E. Advances in Bayesian optimization with applications in aerospace engineering . In 2018 AIAA Non-Deterministic Approaches Conference, pp.\ 1656, 2018
work page 2018
-
[28]
Re-examining linear embeddings for high-dimensional Bayesian optimization
Letham, B., Calandra, R., Rai, A., and Bakshy, E. Re-examining linear embeddings for high-dimensional Bayesian optimization . Advances in neural information processing systems, 33: 0 1546--1558, 2020
work page 2020
-
[29]
Lizotte, D. J., Wang, T., Bowling, M. H., Schuurmans, D., et al. Automatic Gait Optimization With G aussian Process Regression. In IJCAI, volume 7, pp.\ 944--949, 2007
work page 2007
-
[30]
W., Constantine, P., Palacios, F., and Alonso, J
Lukaczyk, T. W., Constantine, P., Palacios, F., and Alonso, J. J. Active subspaces for shape optimization . In 10th AIAA multidisciplinary design optimization conference, pp.\ 1171, 2014
work page 2014
-
[31]
T., Moore, J., Kusner, M., Bradshaw, J., and Gardner, J
Maus, N., Jones, H. T., Moore, J., Kusner, M., Bradshaw, J., and Gardner, J. R. Local Latent Space Bayesian Optimization over Structured Inputs . In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[32]
Mayr, M., Ahmad, F., Chatzilygeroudis, K. I., Nardi, L., and Krüger, V. Skill-based Multi-objective Reinforcement Learning of Industrial Robot Tasks with Planning and Knowledge Integration . CoRR, abs/2203.10033, 2022
-
[33]
The Bayesian approach to global optimization
Mockus, J. The Bayesian approach to global optimization . In System Modeling and Optimization: Proceedings of the 10th IFIP Conference New York City, USA, August 31--September 4, 1981, pp.\ 473--481. Springer, 2005
work page 1981
-
[34]
Moriconi, R., Deisenroth, M. P., and Sesh Kumar, K. High-dimensional Bayesian optimization using low-dimensional feature spaces . Machine Learning, 109: 0 1925--1943, 2020
work page 1925
-
[35]
Mutny, M. and Krause, A. Efficient high dimensional B ayesian optimization with additivity and quadrature F ourier features . Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[36]
A framework for Bayesian optimization in embedded subspaces
Nayebi, A., Munteanu, A., and Poloczek, M. A framework for Bayesian optimization in embedded subspaces . In International Conference on Machine Learning, pp.\ 4752--4761. PMLR, 2019
work page 2019
-
[37]
Negoescu, D. M., Frazier, P. I., and Powell, W. B. The Knowledge-Gradient Algorithm for Sequencing Experiments in Drug Discovery . INFORMS Journal on Computing, 23 0 (3): 0 346--363, 2011
work page 2011
-
[38]
Combinatorial bayesian optimization using the graph cartesian product
Oh, C., Tomczak, J., Gavves, E., and Welling, M. Combinatorial bayesian optimization using the graph cartesian product . Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[39]
Increasing the scope as you learn: Adaptive bayesian optimization in nested subspaces
Papenmeier, L., Nardi, L., and Poloczek, M. Increasing the scope as you learn: Adaptive bayesian optimization in nested subspaces . Advances in Neural Information Processing Systems, 35: 0 11586--11601, 2022
work page 2022
-
[40]
Bounce: Reliable high-dimensional bayesian optimization for combinatorial and mixed spaces
Papenmeier, L., Nardi, L., and Poloczek, M. Bounce: Reliable high-dimensional bayesian optimization for combinatorial and mixed spaces. In Thirty-seventh Conference on Neural Information Processing Systems, 2023
work page 2023
-
[41]
Exploring Exploration in Bayesian Optimization
Papenmeier, L., Cheng, N., Becker, S., and Nardi, L. Exploring exploration in bayesian optimization. arXiv preprint arXiv:2502.08208, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [42]
-
[43]
Bayesian optimization using domain knowledge on the ATRIAS biped
Rai, A., Antonova, R., Song, S., Martin, W., Geyer, H., and Atkeson, C. Bayesian optimization using domain knowledge on the ATRIAS biped . In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 1771--1778. IEEE, 2018
work page 2018
-
[44]
Cylindrical Thompson Sampling for High-Dimensional Bayesian Optimization
Rashidi, B., Johnstonbaugh, K., and Gao, C. Cylindrical Thompson Sampling for High-Dimensional Bayesian Optimization . In International Conference on Artificial Intelligence and Statistics, pp.\ 3502--3510. PMLR, 2024
work page 2024
-
[45]
Regis, R. G. Trust regions in Kriging-based optimization with expected improvement . Engineering optimization, 48 0 (6): 0 1037--1059, 2016
work page 2016
-
[46]
Regis, R. G. and Shoemaker, C. A. Combining radial basis function surrogates and dynamic coordinate search in high-dimensional expensive black-box optimization. Engineering Optimization, 45 0 (5): 0 529--555, 2013
work page 2013
-
[47]
Lassobench: A high-dimensional hyperparameter optimization benchmark suite for lasso
S ehi \'c , K., Gramfort, A., Salmon, J., and Nardi, L. Lassobench: A high-dimensional hyperparameter optimization benchmark suite for lasso . In International Conference on Automated Machine Learning, pp.\ 2--1. PMLR, 2022
work page 2022
-
[48]
Monte carlo tree search based variable selection for high dimensional bayesian optimization
Song, L., Xue, K., Huang, X., and Qian, C. Monte carlo tree search based variable selection for high dimensional bayesian optimization . Advances in Neural Information Processing Systems, 35: 0 28488--28501, 2022
work page 2022
-
[49]
Gaussian process optimization in the bandit setting: no regret and experimental design
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design . In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pp.\ 1015–1022, Madison, WI, USA, 2010. Omnipress. ISBN 9781605589077
work page 2010
-
[50]
Tripp, A., Daxberger, E., and Hern\' a ndez-Lobato, J. M. Sample- E fficient O ptimization in the L atent S pace of D eep G enerative M odels via W eighted R etraining. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp.\ 11259--11272. Curran Associates, I...
work page 2020
-
[51]
Learning Search Space Partition for Black-box Optimization using Monte Carlo Tree Search
Wang, L., Fonseca, R., and Tian, Y. Learning Search Space Partition for Black-box Optimization using Monte Carlo Tree Search . Advances in Neural Information Processing Systems, 33: 0 19511--19522, 2020
work page 2020
-
[52]
Bayesian optimization in a billion dimensions via random embeddings
Wang, Z., Hutter, F., Zoghi, M., Matheson, D., and De Feitas, N. Bayesian optimization in a billion dimensions via random embeddings . Journal of Artificial Intelligence Research, 55: 0 361--387, 2016
work page 2016
-
[53]
Batched large-scale Bayesian optimization in high-dimensional spaces
Wang, Z., Gehring, C., Kohli, P., and Jegelka, S. Batched large-scale Bayesian optimization in high-dimensional spaces . In International Conference on Artificial Intelligence and Statistics, pp.\ 745--754. PMLR, 2018
work page 2018
-
[54]
Williams, C. K. and Rasmussen, C. E. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006
work page 2006
-
[55]
Wolpert, D. H. and Macready, W. G. No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1 0 (1): 0 67--82, 1997
work page 1997
-
[56]
Xu, Z. and Zhe, S. Standard Gaussian Process is All You Need for High-Dimensional Bayesian Optimization . arXiv preprint arXiv:2402.02746v3, 2024
-
[57]
Ziomek, J. K. and Ammar, H. B. Are random decompositions all we need in high dimensional Bayesian optimisation? In International Conference on Machine Learning, pp.\ 43347--43368. PMLR, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.