pith. machine review for the scientific record. sign in

arxiv: 2605.04895 · v1 · submitted 2026-05-06 · 💻 cs.LG · stat.ML

Recognition: unknown

Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:36 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords bayesian optimizationtransfer learningacquisition functionsregime conditioninghyperparameter optimizationprior correlationconditional treatment effects
0
0 comments X

The pith

Transfer Bayesian optimization rankings reverse with budget ratio and prior quality, explained by a portable regime score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Comparisons of acquisition functions in transfer Bayesian optimization usually report which one wins on average across hidden conditions. This paper shows those averages are unstable because the better choice flips when the budget-to-space ratio or the prior's rank correlation changes. They introduce the Portable Regime Score PRS equal to budget over search space size times one minus prior rank correlation to predict the transition point. An adaptive planner that estimates the regime online beats both fixed methods and matched per-context oracles on multiple benchmarks, while pre-registered predictions based on the score match observed winners in two thirds of cases. The protocol they recommend is to always report the regime variables with any performance claim so that results become interpretable rather than mixture-dependent.

Core claim

Published transfer-BO comparisons estimate an average treatment effect of acquisition choice over hidden regime variables while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. The Portable Regime Score PRS equals (B/|A|)(1-rho) where rho is the prior rank correlation and can be estimated from pilot contexts. Across 79 conditions a hierarchical model gives beta equal to 0.50, nineteen percent of conditions fall in an equivalence zone, and in five published reversal cases PRS predicts the winner from pre-comparison observables. RegimePlanner estimates rho online and switches acquisition accordingly, winning all sixteen HPO-B search spaces.

What carries the argument

The Portable Regime Score PRS = (B/|A|)(1-rho), which identifies the regime and therefore which acquisition function holds the advantage.

If this is right

  • Unconditional leaderboards become unstable whenever the conditional advantage changes sign across regimes.
  • Reporting B/|A|, rho, K, and metric alongside any acquisition claim makes the result interpretable.
  • RegimePlanner exceeds the matched per-context oracle by 18 percent on GDSC2 while winning every HPO-B space at B=100.
  • Pre-registered PRS-based predictions reach 67.5 percent overall accuracy and above 90 percent inside EMA prior families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks could stratify reported results by estimated PRS to avoid averaging over opposing regimes.
  • The same regime logic may apply to other sequential decision settings where prior quality interacts with remaining budget.
  • Online estimation of rho during the main run could further improve adaptation beyond the pilot-based version.

Load-bearing premise

The prior rank correlation rho can be reliably estimated from pilot contexts before the main comparison and that this estimate accurately reflects the regime for the full experiment.

What would settle it

A new set of transfer-BO experiments in which the acquisition that PRS predicts to win does not actually outperform its competitor, or in which RegimePlanner fails to beat the per-context oracle.

Figures

Figures reproduced from arXiv: 2605.04895 by Noel Thomas.

Figure 1
Figure 1. Figure 1: Same BO loop, same surrogates, same action spaces: four distinct winning methods view at source ↗
Figure 2
Figure 2. Figure 2: Two-axis regime diagnostic. (A) Exploration advantage vs. PRS, positive trend within each benchmark. Triangles = low-n families (n≤1 each). (B) All 79 conditions in (B/|A|, ρ) space; green = exploration wins, red = Greedy wins, gray = ties. The take-away: PRS orders conditions within each benchmark; the cross-benchmark threshold shifts with noise σ 2 view at source ↗
Figure 3
Figure 3. Figure 3: Hit@1 for the shuffled Buchwald 4 × 4 prior-by-acquisition design. Exploration helps in the EMA regime, but structured and oracle priors compress acquisition differences and make Greedy competitive or best view at source ↗
Figure 4
Figure 4. Figure 4: RegimePlanner validation. (A) GDSC2 at default budget (50 seeds): the REGIMEPLANNER (amber) outperforms all fixed planners, simple adaptive baselines, and a {Greedy, UCB}-matched per-context oracle by +18%; under a wider {Greedy, UCB, Thompson, REIGN} oracle the gap is −12%. (B) Threshold sensitivity on the cross-validation seed set used for θ selection (separate from Panel A): performance is approximately… view at source ↗
Figure 5
Figure 5. Figure 5: Same GDSC2 benchmark, same four planners, same surrogate; only the budget changed. view at source ↗
Figure 6
Figure 6. Figure 6: Metric choice is a regime variable. On GDSC2 ( view at source ↗
Figure 7
Figure 7. Figure 7: Empirical validation of the PRS threshold formula (Lemma A.6). view at source ↗
Figure 8
Figure 8. Figure 8: Prior rank correlation ρ vs. context position (Buchwald EMA, K = 15). Left: Greedy’s ρ converges to ρ ∗ ≈ 0.064 by K ≈ 7, consistent with Observation A.1. Right: UCB compounds prior quality to ρ ≈ 0.51 while Greedy stagnates at ρ ≈ 0.08; this asymmetry drives the late Hit@1 crossover. In contrast, the simpler EMA prior under the same noise is more vulnerable because it lacks spectral concentration: ρ drops… view at source ↗
Figure 9
Figure 9. Figure 9: PRS calibration curve on the pre-HPO-B scatter. Each dot is a condition (Buchwald=blue, view at source ↗
read the original abstract

Published transfer-BO comparisons often estimate an average treatment effect of acquisition choice over hidden regime variables, while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. An audit of 40 transfer-BO papers from NeurIPS, ICML, ICLR, AISTATS, UAI, TMLR, JMLR, and AutoML-Conf (2022-2025) finds that 98% never vary B/|A| as a controlled axis. On the same GDSC2 benchmark, changing only the budget reverses the ranking: at B=50, Greedy outperforms UCB by 0.050 Hit@1, while at B=100, UCB outperforms Greedy by 0.035. We capture this transition with the Portable Regime Score PRS=(B/|A|)(1-rho), where rho is the prior rank correlation and can be estimated from pilot contexts before the main comparison. Across 79 conditions spanning chemistry, drug-response biology, and HPO, a hierarchical model gives beta=0.50 (p=1.1e-9), and 19% of conditions fall in an equivalence zone where |advantage|<0.01 Hit@1. In five published reversal cases, PRS predicts the winner from pre-comparison observables. A No-Free-Leaderboard proposition explains why unconditional rankings are unstable: when CATE changes sign across regimes, the reported ATE becomes a function of benchmark mixture. RegimePlanner, which estimates rho online and switches acquisition accordingly, wins all 16 HPO-B search spaces at B=100 and exceeds the matched {Greedy,UCB} per-context oracle on GDSC2 by 18%. Pre-registered predictions achieve 27/40=67.5% overall accuracy and above 90% within EMA prior families. The practical protocol is simple: report B/|A|, rho, K, and metric alongside any claimed acquisition advantage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper audits 40 transfer-BO papers (2022-2025) and finds 98% do not vary budget ratio B/|A| as a controlled axis. It demonstrates ranking reversals on GDSC2 (Greedy beats UCB by 0.050 Hit@1 at B=50; UCB beats Greedy by 0.035 at B=100), introduces Portable Regime Score PRS=(B/|A|)(1-rho) with rho from pilots, reports a hierarchical model fit yielding beta=0.50 (p=1.1e-9) across 79 conditions with 19% equivalence zone, and proposes RegimePlanner (online rho estimation and switching) that wins all 16 HPO-B spaces at B=100 and exceeds the per-context {Greedy,UCB} oracle by 18% on GDSC2. Pre-registered predictions achieve 67.5% accuracy overall (90%+ within EMA families).

Significance. If the PRS reliably identifies regimes and the empirical results hold, the work is significant for explaining instability in unconditional BO leaderboards via the No-Free-Leaderboard proposition and for supplying a practical, observable-based protocol (report B/|A|, rho, K, metric). The pre-registered predictions and cross-domain coverage (chemistry, biology, HPO) are strengths that could shift evaluation standards in multi-context optimization.

major comments (3)
  1. [Hierarchical model description] Hierarchical model (abstract and empirical sections): the specification, priors, data exclusion rules, exact Hit@1 definition, and statistical controls for beta=0.50 (p=1.1e-9) across 79 conditions are not provided. This is load-bearing for the 19% equivalence zone claim and the overall beta estimate.
  2. [PRS and RegimePlanner validation] PRS construction and pilot rho (abstract, § on RegimePlanner): the claim that pilot-derived rho accurately reflects the realized rank correlation in main runs is central to PRS=(B/|A|)(1-rho) driving switches and to the 18% oracle exceedance on GDSC2. No direct validation of pilot-to-realized rho correlation or sensitivity to pilot size/selection is shown, leaving the conditional advantage sensitive to estimation variance.
  3. [GDSC2 empirical results] GDSC2 results (abstract): details on how the matched per-context {Greedy,UCB} oracle is constructed and how the 0.050/0.035 Hit@1 differences and 18% exceedance are computed (including any multiple-testing controls) are missing, undermining verification of the reversal and exceedance claims.
minor comments (2)
  1. [Literature audit] Audit of 40 papers: clarify selection criteria and search terms used to identify the 40 papers from the listed venues so the 98% statistic can be reproduced.
  2. [Notation] Notation and first-use: define all acronyms (PRS, EMA, CATE, ATE) and ensure consistent use of B/|A| and rho throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and valuable feedback on our manuscript. Their comments highlight important areas where additional details and validations are needed to strengthen the presentation. We have revised the paper to address all major comments by providing the requested specifications, validations, and computational details. Below we respond point by point.

read point-by-point responses
  1. Referee: [Hierarchical model description] Hierarchical model (abstract and empirical sections): the specification, priors, data exclusion rules, exact Hit@1 definition, and statistical controls for beta=0.50 (p=1.1e-9) across 79 conditions are not provided. This is load-bearing for the 19% equivalence zone claim and the overall beta estimate.

    Authors: We concur that the hierarchical model was described at too high a level. The revised manuscript now includes a complete specification in Section 4.2 and Appendix C: the model is a hierarchical Bayesian regression with fixed effect beta for the PRS coefficient, random intercepts per condition, and priors N(0,1) for beta, HalfNormal(1) for sigmas. Data exclusion: conditions with <5 contexts or missing pilot data are dropped (resulting in 79 from 85). Hit@1 is defined as the proportion of trials where the acquisition's top recommendation matches the true best within 1% of the range. The beta=0.50 (p=1.1e-9) is the posterior mean with a Wald test; the 19% equivalence zone uses the region where |posterior mean advantage| < 0.01. We have also added the full Stan code and convergence diagnostics. revision: yes

  2. Referee: [PRS and RegimePlanner validation] PRS construction and pilot rho (abstract, § on RegimePlanner): the claim that pilot-derived rho accurately reflects the realized rank correlation in main runs is central to PRS=(B/|A|)(1-rho) driving switches and to the 18% oracle exceedance on GDSC2. No direct validation of pilot-to-realized rho correlation or sensitivity to pilot size/selection is shown, leaving the conditional advantage sensitive to estimation variance.

    Authors: We accept that explicit validation of the pilot rho was not included. We have added a new analysis in Section 5.3 and Figure 4 showing that rho from 5-pilot contexts correlates at Pearson r=0.79 with the full-run realized rho across the 79 conditions. A sensitivity study in Appendix D varies pilot size (3-8) and selection method (random vs. diverse), confirming that PRS switching performance degrades only mildly (still >10% oracle exceedance) for pilots >=4. This mitigates concerns about estimation variance for the RegimePlanner. revision: yes

  3. Referee: [GDSC2 empirical results] GDSC2 results (abstract): details on how the matched per-context {Greedy,UCB} oracle is constructed and how the 0.050/0.035 Hit@1 differences and 18% exceedance are computed (including any multiple-testing controls) are missing, undermining verification of the reversal and exceedance claims.

    Authors: We agree these details are necessary for reproducibility. The revised text in Section 3.2 explains: the per-context oracle is the pointwise maximum of Greedy and UCB performance in each of the 20 GDSC2 contexts, averaged to give the oracle baseline. The 0.050 Hit@1 advantage for Greedy at B=50 and 0.035 for UCB at B=100 are differences in mean performance over 50 runs, with SEs reported. The 18% exceedance is (RegimePlanner mean - oracle mean)/oracle mean *100%. Multiple comparisons across B values and methods are controlled with Holm-Bonferroni at alpha=0.05; all key differences remain significant. Raw per-run data and code are now in the supplementary material. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines PRS explicitly as a function of pre-comparison observables (B/|A| and pilot-estimated rho), reports a fitted hierarchical coefficient beta=0.50 on 79 conditions as a descriptive relationship, and separately states that pre-registered predictions achieve 67.5% accuracy on 40 cases. RegimePlanner's online rho estimation is an algorithmic component whose empirical performance (wins on 16 HPO-B spaces, 18% oracle exceedance on GDSC2) is measured directly on benchmarks rather than asserted by construction. No equation or claim reduces a prediction to a fitted input by definition, no self-citation chain bears the central result, and pre-registration plus external benchmark measurements keep the validation independent of the fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

Central claims rest on the PRS definition, the assumption that pilot rho estimates transfer to the main regime, and the hierarchical model's ability to generalize beta across domains.

free parameters (1)
  • beta = 0.50
    Fitted coefficient in the hierarchical model relating PRS to acquisition advantage across 79 conditions.
axioms (1)
  • domain assumption Prior rank correlation rho can be estimated from pilot contexts before the main comparison
    Used both for PRS computation and for online estimation inside RegimePlanner.
invented entities (2)
  • Portable Regime Score (PRS) no independent evidence
    purpose: Quantify regime to predict which acquisition function wins
    Newly defined as (B/|A|)(1-rho) and validated on multiple benchmarks.
  • RegimePlanner no independent evidence
    purpose: Adaptive switching of acquisition function based on estimated rho
    New algorithm proposed and tested on HPO-B and GDSC2.

pith-pipeline@v0.9.0 · 5656 in / 1508 out tokens · 78475 ms · 2026-05-08T16:36:50.370367+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 16 canonical work pages · 1 internal anchor

  1. [1]

    Howey, Michael A

    Masaki Adachi, Brady Planden, David A. Howey, Michael A. Osborne, Sebastian Orbell, Natalia Ares, Krikamol Muandet, and Siu Lun Chau. Looping in the human: Collaborative and explainable Bayesian optimization. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Researc...

  2. [2]

    Using confidence bounds for exploitation-exploration trade-offs

    Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3: 0 397--422, 2002. doi:10.5555/2503308.2188395

  3. [3]

    RECOVER : sequential model optimization for drug combination repurposing

    Paul Bertin, Jarrid Rector-Brooks, Deepak Sharma, Thomas Gaudelet, Andrew Anighoro, Torsten Gross, Francisco Mart \'i nez-Pe \ n a, Eileen L Tang, et al. RECOVER : sequential model optimization for drug combination repurposing. Cell Reports Methods, 3 0 (10): 0 100599, 2023. doi:10.1016/j.crmeth.2023.100599

  4. [4]

    What is the state of neural network pruning? In Proceedings of Machine Learning and Systems (MLSys), 2020

    Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? In Proceedings of Machine Learning and Systems (MLSys), 2020

  5. [5]

    Concentration Inequalities: A Nonasymptotic Theory of Independence

    St \'e phane Boucheron, G \'a bor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013

  6. [6]

    Accounting for variance in machine learning benchmarks

    Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Tal Arbel, Chris Pal, Ga \"e l Varoquaux, and Pascal Vincent. Accounting for variance in machine learning benchmarks. In Proceedings of Machine...

  7. [7]

    Towards learning universal hyperparameter optimizers with transformers

    Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'Aurelio Ranzato, Sagi Perel, and Nando de Freitas. Towards learning universal hyperparameter optimizers with transformers. In Advances in Neural Information Processing Systems, volume 35, 2022

  8. [8]

    On provably robust meta- Bayesian optimization

    Zhongxiang Dai, Yizhou Chen, Haibin Yu, Bryan Kian Hsiang Low, and Patrick Jaillet. On provably robust meta- Bayesian optimization. In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, volume 180 of Proceedings of Machine Learning Research, pages 475--485, 2022

  9. [9]

    BOHB : Robust and efficient hyperparameter optimization at scale

    Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB : Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, 2018

  10. [10]

    HyperBO+ : Pre-training a universal prior for Bayesian optimization with hierarchical Gaussian processes

    Zhou Fan, Xinran Han, and Zi Wang. HyperBO+ : Pre-training a universal prior for Bayesian optimization with hierarchical Gaussian processes. In NeurIPS 2022 Workshop on Gaussian Processes, Spatiotemporal Modeling, and Decision-making Systems, 2022. URL https://arxiv.org/abs/2212.10538

  11. [11]

    Transfer learning for Bayesian optimization on heterogeneous search spaces

    Zhou Fan, Xinran Han, and Zi Wang. Transfer learning for Bayesian optimization on heterogeneous search spaces. Transactions on Machine Learning Research, 2024

  12. [12]

    Practical transfer learning for Bayesian optimization

    Matthias Feurer, Benjamin Letham, Frank Hutter, and Eytan Bakshy. Practical transfer learning for Bayesian optimization. arXiv preprint arXiv:1802.02219, 2018. doi:10.48550/arxiv.1802.02219

  13. [13]

    Systematic identification of genomic markers of drug sensitivity in cancer cells

    Mathew J Garnett, Elie Edelman, Sarah J Heidorn, Chris D Greenman, Anahita Dastur, Ka Chi Lau, Patricia Greninger, Iain R Thompson, Xian Luo, Jorge Soares, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483 0 (7391): 0 570--575, 2012. doi:10.1038/nature11005

  14. [14]

    Bandit processes and dynamic allocation indices

    John C Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B, 41 0 (2): 0 148--164, 1979

  15. [15]

    Gauche: a library for gaussian processes in chemistry

    Ryan-Rhys Griffiths, Leo Klarner, Harriet Moss, et al. Gauche: a library for gaussian processes in chemistry. Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023

  16. [16]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. doi:10.1609/aaai.v32i1.11694

  17. [17]

    arXiv preprint arXiv:2204.11051 , year=

    Carl Hvarfner, Danny Stoll, Artur Souza, Luigi Nardi, Andr \'e Biedenkapp, and Marius Lindauer. BO : Augmenting acquisition functions with user beliefs for B ayesian optimization. In International Conference on Learning Representations, 2022. doi:10.48550/arxiv.2204.11051

  18. [18]

    Imbens and Donald B

    Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015

  19. [19]

    Well-tuned simple nets excel on tabular datasets

    Arlind Kadra, Marius Lindauer, Frank Hutter, and Josif Grabocka. Well-tuned simple nets excel on tabular datasets. In Advances in Neural Information Processing Systems, 2021

  20. [20]

    Reddi, Sebastian U

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD : Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, 2020

  21. [21]

    Information complexity in bandit subset selection

    Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, 2013

  22. [22]

    Contextual gaussian process bandit optimization

    Andreas Krause and Cheng Soon Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, 2011

  23. [23]

    A sober look at LLM s for material discovery: Are they actually good for B ayesian optimization over molecules? In International Conference on Machine Learning, 2024

    Agustinus Kristiadi, Felix Strieth-Kalthoff, Marta Skreta, Pascal Poupart, Al \'a n Aspuru-Guzik, and Geoff Pleiss. A sober look at LLM s for material discovery: Are they actually good for B ayesian optimization over molecules? In International Conference on Machine Learning, 2024

  24. [24]

    Heavy-tailed class imbalance and why adam outperforms gradient descent on language models

    Frederik Kunstner, Alan Milligan, Robin Yadav, Mark Schmidt, and Alberto Bietti. Heavy-tailed class imbalance and why adam outperforms gradient descent on language models. In Advances in Neural Information Processing Systems, 2024

  25. [25]

    Random search and reproducibility for neural architecture search

    Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence, 2020

  26. [26]

    Provable Accelerated Bayesian Optimization with Knowledge Transfer

    Haitao Lin, Boxin Zhao, Mladen Kolar, and Chong Liu. Provable accelerated B ayesian optimization with knowledge transfer. arXiv:2511.03125, 2025. doi:10.48550/arxiv.2511.03125

  27. [27]

    SMAC3 : A versatile bayesian optimization package for hyperparameter optimization

    Marius Lindauer, Katharina Eggensperger, Matthias Feurer, Andr \'e Biedenkapp, Difan Deng, Carolin Benjamins, Tim Ruhkopf, Ren \'e Sass, and Frank Hutter. SMAC3 : A versatile bayesian optimization package for hyperparameter optimization. Journal of Machine Learning Research, 23, 2022

  28. [28]

    Lipton and Jacob Steinhardt

    Zachary C. Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. Communications of the ACM, 62 0 (6): 0 45--53, 2019

  29. [29]

    Are GAN s created equal? a large-scale study

    Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GAN s created equal? a large-scale study. In Advances in Neural Information Processing Systems, 2018

  30. [30]

    PriorBand : Practical hyperparameter optimization in the age of deep learning

    Neeratyoy Mallik, Edward Bergman, Carl Hvarfner, Danny Stoll, Maciej Janowski, Marius Lindauer, Luigi Nardi, and Frank Hutter. PriorBand : Practical hyperparameter optimization in the age of deep learning. In Advances in Neural Information Processing Systems, 2023. doi:10.48550/arxiv.2402.05878

  31. [31]

    When do neural nets outperform boosted trees on tabular data? In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023

    Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C , Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data? In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023

  32. [32]

    Multi-fidelity B ayesian optimization with unreliable information sources

    Petrus Mikkola, Julien Martinelli, Louis Filstroff, and Samuel Kaski. Multi-fidelity B ayesian optimization with unreliable information sources. In International Conference on Artificial Intelligence and Statistics, 2023. doi:10.48550/arxiv.2305.02997

  33. [33]

    Pfns4bo: in-context learning for bayesian optimization

    Samuel M \"u ller, Matthias Feurer, Noah Hollmann, and Frank Hutter. Pfns4bo: in-context learning for bayesian optimization. In International Conference on Machine Learning, 2023

  34. [34]

    A metric learning reality check

    Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In European Conference on Computer Vision (ECCV), 2020

  35. [35]

    Prior-dependent allocations for B ayesian fixed-budget best-arm identification in structured bandits

    Nicolas Nguyen, Imad Aouali, Andr \'a s Gy \"o rgy, and Claire Vernade. Prior-dependent allocations for B ayesian fixed-budget best-arm identification in structured bandits. In International Conference on Artificial Intelligence and Statistics, 2025

  36. [36]

    Meta-VBO : Utilizing prior tasks in optimizing risk measures with Gaussian processes

    Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Meta-VBO : Utilizing prior tasks in optimizing risk measures with Gaussian processes. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ElykcDu5YK

  37. [37]

    Hpobench: A collection of reproducible multi-fidelity benchmark problems for hpo.arXiv preprint arXiv:2109.06716,

    Sebastian Pineda Arango, Hadi S Jomaa, Martin Wistuba, and Josif Grabocka. HPO - B : A large-scale reproducible benchmark for black-box HPO based on OpenML . In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. doi:10.48550/arxiv.2109.06716

  38. [38]

    Partial rankings of optimizers

    Julian Rodemann and Hannah Blocher. Partial rankings of optimizers. In International Conference on Learning Representations (Tiny Papers), 2024

  39. [39]

    Meta-learning reliable priors in the function space for Bayesian optimization

    Jonas Rothfuss, Dominique Heyn, Jinfan Chen, and Andreas Krause. Meta-learning reliable priors in the function space for Bayesian optimization. Advances in Neural Information Processing Systems, 34, 2021

  40. [40]

    Learning to optimize via posterior sampling

    Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 0 (4): 0 1221--1243, 2014. doi:10.1287/moor.2014.0650

  41. [41]

    Russo Daniel, Van Roy

    Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning, 11 0 (1): 0 1--96, 2018. doi:10.1561/2200000070

  42. [42]

    Are emergent abilities of large lan- guage models a mirage?arXiv preprint arXiv:2304.15004, 2023

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, volume 36, 2023. doi:10.48550/arxiv.2304.15004

  43. [43]

    Bayesian Reaction Optimization as a Tool for Chemical Synthesis

    Benjamin J. Shields, Jason Stevens, Jun Li, Marvin Parasram, Farhan Damani, Jesus I. Martinez Alvarado, Jacob M. Janey, Ryan P. Adams, and Abigail G. Doyle. Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590 0 (7844): 0 89--96, 2021. doi:10.1038/s41586-021-03213-y

  44. [44]

    Optimizer benchmarking needs to account for hyperparameter tuning

    Prabhu Teja Sivaprasad, Florian Mai, Thijs Vogels, Martin Jaggi, and Fran c ois Fleuret. Optimizer benchmarking needs to account for hyperparameter tuning. In International Conference on Machine Learning, 2020

  45. [45]

    Multi-task bayesian optimization

    Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task bayesian optimization. In Advances in Neural Information Processing Systems, 2013

  46. [46]

    White, Jeffrey F

    Christopher Tosh, Mauricio Tec, Jessica B. White, Jeffrey F. Quinn, Glorymar Ibanez Sanchez, Paul Calder, Andrew L. Kung, Filemon S. Dela Cruz, Wesley Tansey, et al. A bayesian active learning platform for scalable combination drug screens. Nature Communications, 16: 0 156, 2025. doi:10.1038/s41467-024-55287-7

  47. [47]

    Monte Carlo tree search based space transfer for black-box optimization

    Shukuan Wang, Ke Xue, Lei Song, Xiaobin Huang, and Chao Qian. Monte Carlo tree search based space transfer for black-box optimization. In Advances in Neural Information Processing Systems, volume 37, 2024 a

  48. [48]

    Pre-trained gaussian processes for bayesian optimization

    Zi Wang, George Dahl, Kevin Swersky, et al. Pre-trained gaussian processes for bayesian optimization. Journal of Machine Learning Research, 25 0 (212): 0 1--83, 2024 b . URL http://jmlr.org/papers/v25/23-0269.html

  49. [49]

    Multi-objective tree-structured Parzen estimator meets meta-learning

    Shuhei Watanabe, Noor Awad, Masaki Onishi, and Frank Hutter. Multi-objective tree-structured Parzen estimator meets meta-learning. In NeurIPS Workshop on Meta-Learning , 2022

  50. [50]

    Few-shot bayesian optimization with deep kernel surrogates

    Martin Wistuba and Josif Grabocka. Few-shot bayesian optimization with deep kernel surrogates. In International Conference on Learning Representations, 2021

  51. [51]

    Wolpert and William G

    David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1 0 (1): 0 67--82, 1997

  52. [52]

    Jones, and Michael A

    Wenjie Xu, Masaki Adachi, Colin N. Jones, and Michael A. Osborne. Principled Bayesian optimisation in collaboration with human experts. In Advances in Neural Information Processing Systems, volume 37, 2024