pith. machine review for the scientific record. sign in

arxiv: 2605.06454 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

ORTHOBO: Orthogonal Bayesian Hyperparameter Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-08 12:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Bayesian optimizationhyperparameter optimizationvariance reductioncontrol variatesMonte Carlo methodsacquisition functionsensemble surrogates
0
0 comments X

The pith

An orthogonal acquisition estimator reduces Monte Carlo variance in Bayesian optimization by subtracting a score-function control variate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Bayesian optimization relies on accurate acquisition function estimates to choose the next hyperparameter candidate, but Monte Carlo sampling introduces noise that can flip rankings and lead to poor decisions. The paper identifies this overlooked failure mode and proposes an orthogonal acquisition estimator that subtracts an optimally weighted score-function control variate. This makes the residual orthogonal to posterior score directions, cutting variance while keeping the expected acquisition value unchanged. When built into the OrthoBO framework with ensemble models and a log transform, it stabilizes rankings and improves performance on neural network tuning tasks.

Core claim

We propose an orthogonal acquisition estimator that subtracts an optimally weighted score-function control variate, which yields an acquisition residual orthogonal to posterior score directions and which thus reduces Monte Carlo variance. We further introduce OrthoBO: a Bayesian optimization framework that combines our orthogonal acquisition estimator with ensemble surrogates and an outer log transformation. We show theoretically that our estimator preserves the target, leads to variance reduction, and improves pairwise ranking stability.

What carries the argument

The orthogonal acquisition estimator, which subtracts an optimally weighted score-function control variate to yield a residual orthogonal to posterior score directions, reducing Monte Carlo variance in acquisition estimates.

If this is right

  • The estimator preserves the expected value of the original acquisition function.
  • Monte Carlo variance in acquisition estimates is reduced.
  • Pairwise ranking stability of candidates improves.
  • OrthoBO achieves strong performance in hyperparameter optimization for neural network training and fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This variance reduction technique could extend to other sampling-based decision processes in machine learning.
  • Integrating orthogonality into control variates might apply to policy gradient methods in reinforcement learning.
  • Existing Bayesian optimization implementations could adopt this estimator by modifying only the acquisition evaluation step.

Load-bearing premise

An optimally weighted score-function control variate can be computed such that the acquisition residual is orthogonal to the posterior score directions for typical acquisition functions and surrogate models.

What would settle it

Running the orthogonal estimator on standard benchmark acquisition functions and observing no reduction in empirical variance or no improvement in ranking stability compared to standard Monte Carlo estimation would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.06454 by Maresa Schr\"oder, Michael Klar, Pascal Janetzky, Stefan Feuerriegel.

Figure 1
Figure 1. Figure 1: Failure mode in Bayesian HPO: estimation noise in the acquisition value can flip rankings and lead to suboptimal configu￾rations. Example. Consider tuning a classification model on a large-scale image dataset, where each evalua￾tion requires training on many images and assessing performance under an imbalanced class distribution. Two hyperparameter configurations may differ only slightly (e.g., a deeper ne… view at source ↗
Figure 2
Figure 2. Figure 2: Our proposed ORTHOBO. It improves acquisition estimation through variance reduction. For this, we orthogonalize the acquisition estimation without changing the acquisition target. Below, we present ORTHOBO, our orthogonalized framework for BO that addresses the failure modes discussed above (an overview is shown in view at source ↗
Figure 3
Figure 3. Figure 3: ; further results are in view at source ↗
Figure 4
Figure 4. Figure 4: Training (5D CNN). Outliers (20% prob.) are injected to make MNIST training more challenging. ⇒ ORTHOBO achieves the strongest final performance. (i) Training (neural network). We employ the 5D CNN benchmark by Ament et al. [3] and include their RPR method as comparison. Details are in Supple￾ment F.3. To make the training more challenging, we introduce outliers with 20% probability at different it￾eration… view at source ↗
Figure 5
Figure 5. Figure 5: Fine-tuning (ViT). ⇒ ORTHOBO improves faster than the baselines and achieves the highest best-so-far validation score. (ii) Fine-tuning (vision transformer). We optimize five fine-tuning hyperparameters on the industrial WM811K wafer-map dataset [33, 66] for a vision transformer (ViT) based on the F1 score. Details are provided in Appendix F.4. We present results in view at source ↗
Figure 6
Figure 6. Figure 6: Examples of wafer-map manufacturing failures. view at source ↗
Figure 7
Figure 7. Figure 7: Best-so-far regret as a function of iterations on four widely used regression problems, view at source ↗
Figure 8
Figure 8. Figure 8: Best-so-far regret as a function of iterations on four widely used regression problems, using view at source ↗
Figure 9
Figure 9. Figure 9: Best-so-far regret on Levy16 with a Matérn-5/2 ARD kernel and only n0 = 4 ini￾tial Sobol points. ORTHOBO remains among the strongest methods and improves steadily despite weakly identified surrogate hyperparameters. To isolate this effect, we use a GP with a Matérn￾5/2 kernel and ARD. Unlike the isotropic ker￾nels studied above, ARD assigns a separate lengthscale to each input dimension. This makes the sur… view at source ↗
Figure 10
Figure 10. Figure 10: Results for ensembling view at source ↗
Figure 11
Figure 11. Figure 11: Evolution of the entropy across ensembling weights. For all methods, the used tempered view at source ↗
Figure 12
Figure 12. Figure 12: Different mc samples Results Reducing the MC budget makes the performance of acquisition-based methods more sensitive to estimation noise. Across the tested budgets, ORTHOBO remains competitive on Ackley8 and consistently achieves low regret on Michalewicz10. The effect of orthogonalization are most visible 31 view at source ↗
Figure 13
Figure 13. Figure 13: Results for HPO optimization for training on CIFAR10 from scratch. view at source ↗
read the original abstract

Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mode that was previously overlooked: even when the surrogate model and acquisition target are correctly specified, finite-sample Monte Carlo error can perturb acquisition values. This can, in turn, flip candidate rankings and lead to suboptimal BO decisions. As a remedy, we aim at variance reduction and propose an orthogonal acquisition estimator that subtracts an optimally weighted score-function control variate, which yields an acquisition residual orthogonal to posterior score directions and which thus reduces Monte Carlo variance. We further introduce OrthoBO: a Bayesian optimization framework that combines our orthogonal acquisition estimator with ensemble surrogates and an outer log transformation. We show theoretically that our estimator preserves the target, leads to variance reduction, and improves pairwise ranking stability. We further verify the theoretical properties of OrthoBO through numerical experiments where our framework reduces acquisition estimation variance, stabilizes candidate rankings, and achieves strong performance. We also demonstrate the downstream utility of OrthoBO in hyperparameter optimization for neural network training and fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes OrthoBO, a Bayesian optimization framework for hyperparameter tuning that addresses acquisition estimation noise via an orthogonal acquisition estimator. This estimator subtracts an optimally weighted score-function control variate to yield an acquisition residual orthogonal to posterior score directions, with the goal of reducing Monte Carlo variance. The approach is combined with ensemble surrogates and an outer log transformation. Theoretical claims include preservation of the target acquisition value, variance reduction, and improved pairwise ranking stability; these are supported by numerical experiments and demonstrated on neural network training and fine-tuning tasks.

Significance. If the variance reduction and ranking stability hold in finite-sample regimes without being offset by covariance estimation costs, the orthogonal control variate construction could improve decision stability in Bayesian optimization for expensive black-box functions. The integration with ensemble surrogates offers a practical extension, and the focus on an overlooked noise source in acquisition estimates provides a targeted contribution to the BO literature.

major comments (2)
  1. [Abstract and Theoretical Claims] Abstract and theoretical claims: the manuscript states that the estimator 'preserves the target, leads to variance reduction, and improves pairwise ranking stability' and that this is shown theoretically. However, the optimal weight β = Cov(â, s)/Var(s) must be known exactly to guarantee orthogonality and net variance reduction. For the ensemble surrogates used in OrthoBO (non-analytic posteriors), both covariance and variance must be estimated from the same finite MC samples as the acquisition itself; the resulting second-order estimation error is not bounded in the provided theoretical statements, so the residual need not remain orthogonal and variance reduction can vanish or reverse in typical BO sample regimes.
  2. [Numerical Experiments] Numerical experiments section: the abstract and manuscript mention verification of variance reduction, ranking stability, and strong performance, but provide no details on experimental setup, baselines, number of independent runs, or statistical significance testing. This prevents assessment of whether the reported improvements are robust or attributable to the orthogonal estimator rather than other design choices such as the ensemble or log transform.
minor comments (1)
  1. [Abstract] The abstract could more explicitly note that the theoretical guarantees assume exact knowledge of the control variate weight while the practical implementation relies on sample estimates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where the theoretical claims require clarification regarding finite-sample estimation and where the experimental section needs expanded details for reproducibility. We address both points below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Theoretical Claims] Abstract and theoretical claims: the manuscript states that the estimator 'preserves the target, leads to variance reduction, and improves pairwise ranking stability' and that this is shown theoretically. However, the optimal weight β = Cov(â, s)/Var(s) must be known exactly to guarantee orthogonality and net variance reduction. For the ensemble surrogates used in OrthoBO (non-analytic posteriors), both covariance and variance must be estimated from the same finite MC samples as the acquisition itself; the resulting second-order estimation error is not bounded in the provided theoretical statements, so the residual need not remain orthogonal and variance reduction can vanish or reverse in typical BO sample regimes.

    Authors: We agree that the theoretical statements establish orthogonality and variance reduction under the assumption that the optimal weight β is known exactly. In the practical setting with ensemble surrogates, β is estimated from the same finite Monte Carlo samples, and the manuscript does not provide a bound on the resulting estimation error. We will revise the abstract, introduction, and theoretical section to explicitly distinguish the ideal (known-β) case from the estimated-β implementation, clarify that the unbiasedness of the estimator is preserved regardless of β, and add a short discussion of the finite-sample regime supported by the numerical results. We will also include a brief remark that empirical evidence indicates net variance reduction is retained in the sample sizes used for acquisition estimation. revision: partial

  2. Referee: [Numerical Experiments] Numerical experiments section: the abstract and manuscript mention verification of variance reduction, ranking stability, and strong performance, but provide no details on experimental setup, baselines, number of independent runs, or statistical significance testing. This prevents assessment of whether the reported improvements are robust or attributable to the orthogonal estimator rather than other design choices such as the ensemble or log transform.

    Authors: We acknowledge that the current experimental section lacks sufficient detail for full assessment. We will expand this section to include: (i) complete description of the experimental setup and hyperparameter ranges, (ii) explicit list of baselines with implementation references, (iii) the number of independent runs performed (currently 20 per task), and (iv) the statistical significance tests applied (paired Wilcoxon signed-rank tests with reported p-values). We will also add an ablation study isolating the orthogonal estimator from the ensemble surrogate and log transform to demonstrate its specific contribution to variance reduction and ranking stability. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper introduces an orthogonal acquisition estimator via subtraction of an optimally weighted score-function control variate, yielding a residual orthogonal to posterior score directions by direct construction from standard control-variate theory. Theoretical guarantees (target preservation, variance reduction, ranking stability) follow from the orthogonality property under the stated assumptions without reducing to fitted inputs or self-referential definitions. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided derivation steps; the estimator is presented as an independent variance-reduction technique applied to BO acquisition functions. Practical estimation of the weight is treated as a separate implementation detail rather than a definitional step that forces the claimed result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits visibility into parameters and assumptions; the method relies on the existence of an optimal weighting for the control variate and orthogonality properties.

axioms (1)
  • domain assumption Finite-sample Monte Carlo estimates of acquisition functions can be improved via score-function control variates that are orthogonal to posterior score directions.
    Core premise of the proposed estimator stated in the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1150 out tokens · 30472 ms · 2026-05-08T12:47:07.359552+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    AlBahar, I

    A. AlBahar, I. Kim, and X. Yue. A robust asymmetric kernel function for Bayesian optimization, with application to image defect detection in manufacturing systems.IEEE Transactions on Automation Science and Engineering, 19(4):3222–3233, 2021

  2. [2]

    Ament, S

    S. Ament, S. Daulton, D. Eriksson, M. Balandat, and E. Bakshy. Unexpected improvements to expected improvement for Bayesian optimization. InNeural Information Processing Systems (NeurIPS), 2023

  3. [3]

    Ament, E

    S. Ament, E. Santorella, D. Eriksson, B. Letham, M. Balandat, and E. Bakshy. Robust Gaussian processes via relevance pursuit. InNeural Information Processing Systems (NeurIPS), 2024

  4. [4]

    Balandat, B

    M. Balandat, B. Karrer, D. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. Botorch: A framework for efficient Monte-Carlo Bayesian optimization. InNeural Information Processing Systems (NeurIPS), 2020

  5. [5]

    T. Ban, M. Ohue, and Y . Akiyama. Efficient hyperparameter optimization by using Bayesian op- timization for drug-target interaction prediction. InInternational Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2017

  6. [6]

    Bergstra, R

    J. Bergstra, R. Bardenet, Y . Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. InNeural Information Processing Systems (NeurIPS), 2011

  7. [7]

    Bergstra, D

    J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. InInternational Conference on Machine Learning (ICML), 2013

  8. [8]

    Berkenkamp, A

    F. Berkenkamp, A. P. Schoellig, and A. Krause. No-regret Bayesian optimization with unknown hyperparameters.Journal of Machine Learning Research, 20(50):1–24, 2019

  9. [9]

    P. J. Bickel, C. A. Klaassen, P. J. Bickel, Y . Ritov, J. Klaassen, J. A. Wellner, and Y . Ritov. Efficient and adaptive estimation for semiparametric models, volume 4. Springer, 1993

  10. [10]

    Bodin, M

    E. Bodin, M. Kaiser, I. Kazlauskaite, Z. Dai, N. Campbell, and C. H. Ek. Modulating surrogates for Bayesian optimization. InInternational Conference on Machine Learning (ICML), 2020

  11. [11]

    Bogunovic and A

    I. Bogunovic and A. Krause. Misspecified Gaussian process bandit optimization. InNeural Information Processing Systems (NeurIPS), 2021

  12. [12]

    Chernozhukov, D

    V . Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal, 21:C1–C68, 2018

  13. [13]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  14. [14]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  15. [15]

    Eriksson and M

    D. Eriksson and M. Poloczek. Scalable constrained Bayesian optimization. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2021

  16. [16]

    Eriksson, M

    D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local Bayesian optimization. InNeural Information Processing Systems (NeurIPS), 2019

  17. [17]

    Falkner, A

    S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. InInternational Conference on Machine Learning (ICML), 2018

  18. [18]

    Fedus, B

    W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  19. [19]

    Feurer, A

    M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. InNeural Information Processing Systems (NeurIPS), 2015

  20. [20]

    D. J. Foster and V . Syrgkanis. Orthogonal statistical learning.The Annals of Statistics, 51(3): 879–908, 2023. 11

  21. [21]

    Frauen, M

    D. Frauen, M. Schröder, K. Hess, and S. Feuerriegel. Orthogonal survival learners for estimating heterogeneous treatment effects from time-to-event data. InNeural Information Processing Systems (NeurIPS), 2025

  22. [22]

    Frauen, A

    D. Frauen, A. Deviyani, M. van der Schaar, and S. Feuerriegel. Nonparametric LLM evaluation from preference data.arXiv preprint, arXiv:2601.21816, 2026

  23. [23]

    P. I. Frazier. Bayesian optimization. InRecent advances in optimization and modeling of contemporary problems, pages 255–278. 2018

  24. [24]

    P. I. Frazier. A tutorial on Bayesian optimization.arXiv preprint, arXiv:1807.02811, 2018

  25. [25]

    Freund and R

    Y . Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55(1):119–139, 1997

  26. [26]

    González, Z

    J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch Bayesian optimization via local penalization. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2016

  27. [27]

    H. Ha, V . Nguyen, H. Tran-The, H. Zhang, X. Zhang, and A. v. d. Hengel. Provably efficient Bayesian optimization with unknown Gaussian process hyperparameter estimation.arXiv preprint, arXiv:2306.06844, 2023

  28. [28]

    Herbster and M

    M. Herbster and M. K. Warmuth. Tracking the best expert.Machine learning, 32(2):151–178, 1998

  29. [29]

    J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. InNeural Information Processing Systems (NeurIPS), 2014

  30. [30]

    J. M. Hernández-Lobato, M. Gelbart, M. Hoffman, R. Adams, and Z. Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. InInternational Conference on Machine Learning (ICML), 2015

  31. [31]

    K. Hess, D. Frauen, N. Kilbertus, and S. Feuerriegel. Debiased neural operators for estimating functionals.arXiv preprint, arXiv:2604.19296, 2026

  32. [32]

    R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991

  33. [33]

    J.-S. R. Jang. Mir-wm811k: Dataset for wafer map failure pattern recognition. http:// mirlab.org/dataset/public/, 2015

  34. [34]

    D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions.Journal of Global Optimization, 13(4):455–492, 1998

  35. [35]

    E. H. Kennedy. Semiparametric doubly robust targeted double machine learning: a review. Handbook of Statistical Methods for Precision Medicine, pages 207–236, 2024

  36. [36]

    Krizhevsky

    A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009

  37. [37]

    LeCun, C

    Y . LeCun, C. Cortes, and C. Burges. Mnist handwritten digit database.ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010

  38. [38]

    Q. Lu, K. D. Polyzos, B. Li, and G. B. Giannakis. Surrogate modeling for Bayesian optimization beyond a single Gaussian process.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):11283–11296, 2023

  39. [39]

    Mackey, V

    L. Mackey, V . Syrgkanis, and I. Zadik. Orthogonal machine learning: Power and limitations. In International Conference on Machine Learning (ICML), 2018

  40. [40]

    Martinez-Cantin, K

    R. Martinez-Cantin, K. Tee, and M. McCourt. Practical Bayesian optimization in the presence of outliers. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2018

  41. [41]

    Melnychuk and S

    V . Melnychuk and S. Feuerriegel. GDR-learners: Orthogonal learning of generative models for potential outcomes.arXiv preprint, arXiv:2509.22953, 2025

  42. [42]

    H. B. Moss, D. S. Leslie, J. Gonzalez, and P. Rayson. GIBBON: General-purpose information- based Bayesian optimisation.Journal of Machine Learning Research, 22(235):1–49, 2021

  43. [43]

    H. B. Moss, S. W. Ober, and V . Picheny. Inducing point allocation for sparse Gaussian processes in high-throughput Bayesian optimisation. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2023. 12

  44. [44]

    V . Nath, D. Yang, A. Hatamizadeh, A. A. Abidin, A. Myronenko, H. R. Roth, and D. Xu. The power of proxy data and proxy networks for hyper-parameter optimization in medical image segmentation. InInternational Conference on Medical Image Computing and Computer- Assisted Intervention, 2021

  45. [45]

    Neiswanger and A

    W. Neiswanger and A. Ramdas. Uncertainty quantification using martingales for misspecified Gaussian processes. InInternational Conference on Algorithmic Learning Theory (ALT), 2021

  46. [46]

    Nie and S

    X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects.Biometrika, 108(2):299–319, 2021

  47. [47]

    Oprescu, V

    M. Oprescu, V . Syrgkanis, and Z. S. Wu. Orthogonal random forest for causal inference. In International Conference on Machine Learning (ICML), 2019

  48. [48]

    K. D. Polyzos, Q. Lu, and G. B. Giannakis. Bayesian optimization with ensemble learning models and adaptive expected improvement. InInternational Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

  49. [49]

    Quitadadmo, J

    A. Quitadadmo, J. Johnson, and X. Shi. Bayesian hyperparameter optimization for machine learning based eqtl analysis. InInternational Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017

  50. [50]

    J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed.Journal of the American Statistical Association, 89(427): 846–866, 1994

  51. [51]

    J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology.Epidemiology, 11(5):550–560, 2000

  52. [52]

    Schröder, V

    M. Schröder, V . Melnychuk, and S. Feuerriegel. Differentially private learners for heterogeneous treatment effects.International Conference on Learning Representations (ICLR), 2025

  53. [53]

    Shahriari, K

    B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of Bayesian optimization.Proceedings of the IEEE, 104(1):148–175, 2016

  54. [54]

    Shazeer, A

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations (ICLR), 2017

  55. [55]

    Snoek, H

    J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. InNeural Information Processing Systems (NeurIPS), 2012

  56. [56]

    Srinivas, A

    N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. InInternational Conference on Machine Learning (ICML), 2010

  57. [57]

    Tani and C

    L. Tani and C. Veelken. Comparison of Bayesian and particle swarm algorithms for hyper- parameter optimisation in machine learning applications in high energy physics.Computer Physics Communications, 294:108955, 2024

  58. [58]

    Törn and A

    A. Törn and A. Žilinskas.Global optimization, volume 350. Springer, 1989

  59. [59]

    Turner, D

    R. Turner, D. Eriksson, M. McCourt, J. Kiili, E. Laaksonen, Z. Xu, and I. Guyon. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. InNeurIPS Competition and Demonstration Track, 2020

  60. [60]

    M. J. Van Der Laan and D. Rubin. Targeted maximum likelihood learning.The International Journal of Biostatistics, 2, 2006

  61. [61]

    Wang and S

    Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In International conference on machine learning, pages 3627–3635. PMLR, 2017

  62. [62]

    arXiv preprint arXiv:2304.11127 , year=

    S. Watanabe. Tree-structured parzen estimator: Understanding its algorithm components and their roles for better empirical performance.arXiv preprint, arXiv:2304.11127, 2023

  63. [63]

    Wilson, F

    J. Wilson, F. Hutter, and M. Deisenroth. Maximizing acquisition functions for Bayesian optimization. InNeural Information Processing Systems (NeurIPS), 2018

  64. [64]

    J. T. Wilson, R. Moriconi, F. Hutter, and M. P. Deisenroth. The reparameterization trick for acquisition functions.arXiv preprint, arXiv:1712.00424, 2017. 13

  65. [65]

    J. Wu, S. Toscano-Palmerin, P. I. Frazier, and A. G. Wilson. Practical multi-fidelity Bayesian optimization for hyperparameter tuning. InUncertainty in Artificial Intelligence (UAI), 2020

  66. [66]

    Wu, J.-S

    M.-J. Wu, J.-S. R. Jang, and J.-L. Chen. Wafer map failure pattern recognition and similarity ranking for large-scale data sets.IEEE Transactions on Semiconductor Manufacturing, 28(1): 1–12, 2015

  67. [67]

    Y . Yuan, W. Wang, and W. Pang. A systematic comparison study on hyperparameter optimisation of graph neural networks for molecular property prediction. InProceedings of the genetic and evolutionary computation conference, 2021. 14 A Extended related work Below, we discuss more distant related literature, completing the discussion in Section 2. We first c...

  68. [68]

    Var bAorth m (z) = 1 S Var(hm)−Cov(h m, gm)⊤ Cov(gm, gm)−1 Cov(gm, hm) ,(28) and therefore Var bAorth m (z) ≤Var bAMC m (z) ,(29) where bAMC m (z) := 1 S PS s=1 hm(θ(s) m ;z)

    bAorth m (z)is unbiased forA m(z) 2. Var bAorth m (z) = 1 S Var(hm)−Cov(h m, gm)⊤ Cov(gm, gm)−1 Cov(gm, hm) ,(28) and therefore Var bAorth m (z) ≤Var bAMC m (z) ,(29) where bAMC m (z) := 1 S PS s=1 hm(θ(s) m ;z)

  69. [69]

    The residual rm(θm;z) :=a m(θm;z)− Cov(gm, gm)−1 Cov(gm, am) ⊤ gm(θm) is or- thogonal to the score directions. Proof. The proof is identical to that of the EI case in Supplement E after replacing EIm(λ;θ m) by the generic acquisition valuea m(z;θ m). Remark.Proposition C.1 shows that the theoretical properties of our orthogonalized approach are acquisitio...

  70. [70]

    Redistribution and use in any form must be accompanied by the following two citations:

  71. [71]

    Wafer Map Failure Pattern Recognition and Similarity Ranking for Large-Scale Data Sets,

    Ming-Ju Wu, Jyh-Shing Roger Jang, and Jui-Long Chen, "Wafer Map Failure Pattern Recognition and Similarity Ranking for Large-Scale Data Sets," in IEEE Transactions on Semiconductor Manufacturing, vol. 28, no. 1, pp. 1-12, Feb. 2015, doi: 10.1109/TSM.2014.2364237

  72. [72]

    Weight 3

    MIR-WM811K: Dataset for wafer map failure pattern recognition, 2015 http://mirlab.org/dataset/public/ We included these citations at the appropriate places. Further, we received written permission by the authors to use the dataset for our research. F.1 TPE implementation TPE implementation.For the TPE-based experiments, we approximate the parameter distri...