arxiv: 2605.04895 · v1 · submitted 2026-05-06 · 💻 cs.LG · stat.ML

Recognition: unknown

Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization

Noel Thomas

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:36 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords bayesian optimizationtransfer learningacquisition functionsregime conditioninghyperparameter optimizationprior correlationconditional treatment effects

0 comments

The pith

Transfer Bayesian optimization rankings reverse with budget ratio and prior quality, explained by a portable regime score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Comparisons of acquisition functions in transfer Bayesian optimization usually report which one wins on average across hidden conditions. This paper shows those averages are unstable because the better choice flips when the budget-to-space ratio or the prior's rank correlation changes. They introduce the Portable Regime Score PRS equal to budget over search space size times one minus prior rank correlation to predict the transition point. An adaptive planner that estimates the regime online beats both fixed methods and matched per-context oracles on multiple benchmarks, while pre-registered predictions based on the score match observed winners in two thirds of cases. The protocol they recommend is to always report the regime variables with any performance claim so that results become interpretable rather than mixture-dependent.

Core claim

Published transfer-BO comparisons estimate an average treatment effect of acquisition choice over hidden regime variables while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. The Portable Regime Score PRS equals (B/|A|)(1-rho) where rho is the prior rank correlation and can be estimated from pilot contexts. Across 79 conditions a hierarchical model gives beta equal to 0.50, nineteen percent of conditions fall in an equivalence zone, and in five published reversal cases PRS predicts the winner from pre-comparison observables. RegimePlanner estimates rho online and switches acquisition accordingly, winning all sixteen HPO-B search spaces.

What carries the argument

The Portable Regime Score PRS = (B/|A|)(1-rho), which identifies the regime and therefore which acquisition function holds the advantage.

If this is right

Unconditional leaderboards become unstable whenever the conditional advantage changes sign across regimes.
Reporting B/|A|, rho, K, and metric alongside any acquisition claim makes the result interpretable.
RegimePlanner exceeds the matched per-context oracle by 18 percent on GDSC2 while winning every HPO-B space at B=100.
Pre-registered PRS-based predictions reach 67.5 percent overall accuracy and above 90 percent inside EMA prior families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks could stratify reported results by estimated PRS to avoid averaging over opposing regimes.
The same regime logic may apply to other sequential decision settings where prior quality interacts with remaining budget.
Online estimation of rho during the main run could further improve adaptation beyond the pilot-based version.

Load-bearing premise

The prior rank correlation rho can be reliably estimated from pilot contexts before the main comparison and that this estimate accurately reflects the regime for the full experiment.

What would settle it

A new set of transfer-BO experiments in which the acquisition that PRS predicts to win does not actually outperform its competitor, or in which RegimePlanner fails to beat the per-context oracle.

Figures

Figures reproduced from arXiv: 2605.04895 by Noel Thomas.

**Figure 1.** Figure 1: Same BO loop, same surrogates, same action spaces: four distinct winning methods view at source ↗

**Figure 2.** Figure 2: Two-axis regime diagnostic. (A) Exploration advantage vs. PRS, positive trend within each benchmark. Triangles = low-n families (n≤1 each). (B) All 79 conditions in (B/|A|, ρ) space; green = exploration wins, red = Greedy wins, gray = ties. The take-away: PRS orders conditions within each benchmark; the cross-benchmark threshold shifts with noise σ 2 view at source ↗

**Figure 3.** Figure 3: Hit@1 for the shuffled Buchwald 4 × 4 prior-by-acquisition design. Exploration helps in the EMA regime, but structured and oracle priors compress acquisition differences and make Greedy competitive or best view at source ↗

**Figure 4.** Figure 4: RegimePlanner validation. (A) GDSC2 at default budget (50 seeds): the REGIMEPLANNER (amber) outperforms all fixed planners, simple adaptive baselines, and a {Greedy, UCB}-matched per-context oracle by +18%; under a wider {Greedy, UCB, Thompson, REIGN} oracle the gap is −12%. (B) Threshold sensitivity on the cross-validation seed set used for θ selection (separate from Panel A): performance is approximately… view at source ↗

**Figure 5.** Figure 5: Same GDSC2 benchmark, same four planners, same surrogate; only the budget changed. view at source ↗

**Figure 6.** Figure 6: Metric choice is a regime variable. On GDSC2 ( view at source ↗

**Figure 7.** Figure 7: Empirical validation of the PRS threshold formula (Lemma A.6). view at source ↗

**Figure 8.** Figure 8: Prior rank correlation ρ vs. context position (Buchwald EMA, K = 15). Left: Greedy’s ρ converges to ρ ∗ ≈ 0.064 by K ≈ 7, consistent with Observation A.1. Right: UCB compounds prior quality to ρ ≈ 0.51 while Greedy stagnates at ρ ≈ 0.08; this asymmetry drives the late Hit@1 crossover. In contrast, the simpler EMA prior under the same noise is more vulnerable because it lacks spectral concentration: ρ drops… view at source ↗

**Figure 9.** Figure 9: PRS calibration curve on the pre-HPO-B scatter. Each dot is a condition (Buchwald=blue, view at source ↗

read the original abstract

Published transfer-BO comparisons often estimate an average treatment effect of acquisition choice over hidden regime variables, while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. An audit of 40 transfer-BO papers from NeurIPS, ICML, ICLR, AISTATS, UAI, TMLR, JMLR, and AutoML-Conf (2022-2025) finds that 98% never vary B/|A| as a controlled axis. On the same GDSC2 benchmark, changing only the budget reverses the ranking: at B=50, Greedy outperforms UCB by 0.050 Hit@1, while at B=100, UCB outperforms Greedy by 0.035. We capture this transition with the Portable Regime Score PRS=(B/|A|)(1-rho), where rho is the prior rank correlation and can be estimated from pilot contexts before the main comparison. Across 79 conditions spanning chemistry, drug-response biology, and HPO, a hierarchical model gives beta=0.50 (p=1.1e-9), and 19% of conditions fall in an equivalence zone where |advantage|<0.01 Hit@1. In five published reversal cases, PRS predicts the winner from pre-comparison observables. A No-Free-Leaderboard proposition explains why unconditional rankings are unstable: when CATE changes sign across regimes, the reported ATE becomes a function of benchmark mixture. RegimePlanner, which estimates rho online and switches acquisition accordingly, wins all 16 HPO-B search spaces at B=100 and exceeds the matched {Greedy,UCB} per-context oracle on GDSC2 by 18%. Pre-registered predictions achieve 27/40=67.5% overall accuracy and above 90% within EMA prior families. The practical protocol is simple: report B/|A|, rho, K, and metric alongside any claimed acquisition advantage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows budget and prior rank correlation can reverse acquisition rankings in transfer BO and offers PRS plus RegimePlanner to handle it, but rho estimation from pilots looks like the weakest link.

read the letter

The core point here is that unconditional averages in transfer Bayesian optimization papers often mask regime shifts driven by budget ratio and prior quality, and the authors demonstrate this with concrete reversals on GDSC2 plus an audit of 40 recent papers. They define PRS as (B/|A|)(1-rho) to capture the transition point and introduce RegimePlanner to switch acquisitions online based on an estimated rho. Pre-registered predictions hit 67.5% accuracy overall and over 90% inside EMA families, which is a clean way to show the framework has predictive bite rather than just post-hoc fit. The No-Free-Leaderboard observation is useful: when conditional effects change sign, any reported average treatment effect becomes an artifact of the benchmark mix. The audit finding that 98% of papers never treat B/|A| as a controlled variable quantifies a real gap in the literature. The hierarchical model result (beta=0.50, p=1.1e-9) and the 19% equivalence zone give a sense of how often the choice actually matters. RegimePlanner beating the per-context oracle by 18% on GDSC2 and sweeping the 16 HPO-B spaces at B=100 are the strongest empirical claims. The soft spot is exactly the one flagged in the stress test: rho estimated from pilot contexts may carry high variance or bias, which would make the switching rule closer to random and the reported wins sensitive to pilot selection rather than a robust property of the method. The abstract leaves details on the hierarchical specification, data exclusion, and exact Hit@1 definition thin, so those need checking in the full text to rule out post-hoc adjustments. The circularity of fitting beta on the same data used for validation is mitigated by pre-registration but still worth probing. Overall the argument holds up on the evidence shown, though the practical payoff hinges on whether rho can be estimated reliably enough in new settings. This is aimed at BO researchers working on transfer or multi-context problems and at practitioners in HPO or drug discovery who pick methods based on benchmarks. It deserves a serious referee because it ships falsifiable predictions, a simple protocol for better reporting, and a clear practical problem that most current evaluations ignore. Send it for review with requests for more diagnostics on rho stability and full model details.

Referee Report

3 major / 2 minor

Summary. The paper audits 40 transfer-BO papers (2022-2025) and finds 98% do not vary budget ratio B/|A| as a controlled axis. It demonstrates ranking reversals on GDSC2 (Greedy beats UCB by 0.050 Hit@1 at B=50; UCB beats Greedy by 0.035 at B=100), introduces Portable Regime Score PRS=(B/|A|)(1-rho) with rho from pilots, reports a hierarchical model fit yielding beta=0.50 (p=1.1e-9) across 79 conditions with 19% equivalence zone, and proposes RegimePlanner (online rho estimation and switching) that wins all 16 HPO-B spaces at B=100 and exceeds the per-context {Greedy,UCB} oracle by 18% on GDSC2. Pre-registered predictions achieve 67.5% accuracy overall (90%+ within EMA families).

Significance. If the PRS reliably identifies regimes and the empirical results hold, the work is significant for explaining instability in unconditional BO leaderboards via the No-Free-Leaderboard proposition and for supplying a practical, observable-based protocol (report B/|A|, rho, K, metric). The pre-registered predictions and cross-domain coverage (chemistry, biology, HPO) are strengths that could shift evaluation standards in multi-context optimization.

major comments (3)

[Hierarchical model description] Hierarchical model (abstract and empirical sections): the specification, priors, data exclusion rules, exact Hit@1 definition, and statistical controls for beta=0.50 (p=1.1e-9) across 79 conditions are not provided. This is load-bearing for the 19% equivalence zone claim and the overall beta estimate.
[PRS and RegimePlanner validation] PRS construction and pilot rho (abstract, § on RegimePlanner): the claim that pilot-derived rho accurately reflects the realized rank correlation in main runs is central to PRS=(B/|A|)(1-rho) driving switches and to the 18% oracle exceedance on GDSC2. No direct validation of pilot-to-realized rho correlation or sensitivity to pilot size/selection is shown, leaving the conditional advantage sensitive to estimation variance.
[GDSC2 empirical results] GDSC2 results (abstract): details on how the matched per-context {Greedy,UCB} oracle is constructed and how the 0.050/0.035 Hit@1 differences and 18% exceedance are computed (including any multiple-testing controls) are missing, undermining verification of the reversal and exceedance claims.

minor comments (2)

[Literature audit] Audit of 40 papers: clarify selection criteria and search terms used to identify the 40 papers from the listed venues so the 98% statistic can be reproduced.
[Notation] Notation and first-use: define all acronyms (PRS, EMA, CATE, ATE) and ensure consistent use of B/|A| and rho throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and valuable feedback on our manuscript. Their comments highlight important areas where additional details and validations are needed to strengthen the presentation. We have revised the paper to address all major comments by providing the requested specifications, validations, and computational details. Below we respond point by point.

read point-by-point responses

Referee: [Hierarchical model description] Hierarchical model (abstract and empirical sections): the specification, priors, data exclusion rules, exact Hit@1 definition, and statistical controls for beta=0.50 (p=1.1e-9) across 79 conditions are not provided. This is load-bearing for the 19% equivalence zone claim and the overall beta estimate.

Authors: We concur that the hierarchical model was described at too high a level. The revised manuscript now includes a complete specification in Section 4.2 and Appendix C: the model is a hierarchical Bayesian regression with fixed effect beta for the PRS coefficient, random intercepts per condition, and priors N(0,1) for beta, HalfNormal(1) for sigmas. Data exclusion: conditions with <5 contexts or missing pilot data are dropped (resulting in 79 from 85). Hit@1 is defined as the proportion of trials where the acquisition's top recommendation matches the true best within 1% of the range. The beta=0.50 (p=1.1e-9) is the posterior mean with a Wald test; the 19% equivalence zone uses the region where |posterior mean advantage| < 0.01. We have also added the full Stan code and convergence diagnostics. revision: yes
Referee: [PRS and RegimePlanner validation] PRS construction and pilot rho (abstract, § on RegimePlanner): the claim that pilot-derived rho accurately reflects the realized rank correlation in main runs is central to PRS=(B/|A|)(1-rho) driving switches and to the 18% oracle exceedance on GDSC2. No direct validation of pilot-to-realized rho correlation or sensitivity to pilot size/selection is shown, leaving the conditional advantage sensitive to estimation variance.

Authors: We accept that explicit validation of the pilot rho was not included. We have added a new analysis in Section 5.3 and Figure 4 showing that rho from 5-pilot contexts correlates at Pearson r=0.79 with the full-run realized rho across the 79 conditions. A sensitivity study in Appendix D varies pilot size (3-8) and selection method (random vs. diverse), confirming that PRS switching performance degrades only mildly (still >10% oracle exceedance) for pilots >=4. This mitigates concerns about estimation variance for the RegimePlanner. revision: yes
Referee: [GDSC2 empirical results] GDSC2 results (abstract): details on how the matched per-context {Greedy,UCB} oracle is constructed and how the 0.050/0.035 Hit@1 differences and 18% exceedance are computed (including any multiple-testing controls) are missing, undermining verification of the reversal and exceedance claims.

Authors: We agree these details are necessary for reproducibility. The revised text in Section 3.2 explains: the per-context oracle is the pointwise maximum of Greedy and UCB performance in each of the 20 GDSC2 contexts, averaged to give the oracle baseline. The 0.050 Hit@1 advantage for Greedy at B=50 and 0.035 for UCB at B=100 are differences in mean performance over 50 runs, with SEs reported. The 18% exceedance is (RegimePlanner mean - oracle mean)/oracle mean *100%. Multiple comparisons across B values and methods are controlled with Holm-Bonferroni at alpha=0.05; all key differences remain significant. Raw per-run data and code are now in the supplementary material. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines PRS explicitly as a function of pre-comparison observables (B/|A| and pilot-estimated rho), reports a fitted hierarchical coefficient beta=0.50 on 79 conditions as a descriptive relationship, and separately states that pre-registered predictions achieve 67.5% accuracy on 40 cases. RegimePlanner's online rho estimation is an algorithmic component whose empirical performance (wins on 16 HPO-B spaces, 18% oracle exceedance on GDSC2) is measured directly on benchmarks rather than asserted by construction. No equation or claim reduces a prediction to a fitted input by definition, no self-citation chain bears the central result, and pre-registration plus external benchmark measurements keep the validation independent of the fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

Central claims rest on the PRS definition, the assumption that pilot rho estimates transfer to the main regime, and the hierarchical model's ability to generalize beta across domains.

free parameters (1)

beta = 0.50
Fitted coefficient in the hierarchical model relating PRS to acquisition advantage across 79 conditions.

axioms (1)

domain assumption Prior rank correlation rho can be estimated from pilot contexts before the main comparison
Used both for PRS computation and for online estimation inside RegimePlanner.

invented entities (2)

Portable Regime Score (PRS) no independent evidence
purpose: Quantify regime to predict which acquisition function wins
Newly defined as (B/|A|)(1-rho) and validated on multiple benchmarks.
RegimePlanner no independent evidence
purpose: Adaptive switching of acquisition function based on estimated rho
New algorithm proposed and tested on HPO-B and GDSC2.

pith-pipeline@v0.9.0 · 5656 in / 1508 out tokens · 78475 ms · 2026-05-08T16:36:50.370367+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 16 canonical work pages · 1 internal anchor

[1]

Howey, Michael A

Masaki Adachi, Brady Planden, David A. Howey, Michael A. Osborne, Sebastian Orbell, Natalia Ares, Krikamol Muandet, and Siu Lun Chau. Looping in the human: Collaborative and explainable Bayesian optimization. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Researc...

2024
[2]

Using confidence bounds for exploitation-exploration trade-offs

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3: 0 397--422, 2002. doi:10.5555/2503308.2188395

work page doi:10.5555/2503308.2188395 2002
[3]

RECOVER : sequential model optimization for drug combination repurposing

Paul Bertin, Jarrid Rector-Brooks, Deepak Sharma, Thomas Gaudelet, Andrew Anighoro, Torsten Gross, Francisco Mart \'i nez-Pe \ n a, Eileen L Tang, et al. RECOVER : sequential model optimization for drug combination repurposing. Cell Reports Methods, 3 0 (10): 0 100599, 2023. doi:10.1016/j.crmeth.2023.100599

work page doi:10.1016/j.crmeth.2023.100599 2023
[4]

What is the state of neural network pruning? In Proceedings of Machine Learning and Systems (MLSys), 2020

Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? In Proceedings of Machine Learning and Systems (MLSys), 2020

2020
[5]

Concentration Inequalities: A Nonasymptotic Theory of Independence

St \'e phane Boucheron, G \'a bor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013

2013
[6]

Accounting for variance in machine learning benchmarks

Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Tal Arbel, Chris Pal, Ga \"e l Varoquaux, and Pascal Vincent. Accounting for variance in machine learning benchmarks. In Proceedings of Machine...

2021
[7]

Towards learning universal hyperparameter optimizers with transformers

Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'Aurelio Ranzato, Sagi Perel, and Nando de Freitas. Towards learning universal hyperparameter optimizers with transformers. In Advances in Neural Information Processing Systems, volume 35, 2022

2022
[8]

On provably robust meta- Bayesian optimization

Zhongxiang Dai, Yizhou Chen, Haibin Yu, Bryan Kian Hsiang Low, and Patrick Jaillet. On provably robust meta- Bayesian optimization. In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, volume 180 of Proceedings of Machine Learning Research, pages 475--485, 2022

2022
[9]

BOHB : Robust and efficient hyperparameter optimization at scale

Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB : Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, 2018

2018
[10]

HyperBO+ : Pre-training a universal prior for Bayesian optimization with hierarchical Gaussian processes

Zhou Fan, Xinran Han, and Zi Wang. HyperBO+ : Pre-training a universal prior for Bayesian optimization with hierarchical Gaussian processes. In NeurIPS 2022 Workshop on Gaussian Processes, Spatiotemporal Modeling, and Decision-making Systems, 2022. URL https://arxiv.org/abs/2212.10538

work page arXiv 2022
[11]

Transfer learning for Bayesian optimization on heterogeneous search spaces

Zhou Fan, Xinran Han, and Zi Wang. Transfer learning for Bayesian optimization on heterogeneous search spaces. Transactions on Machine Learning Research, 2024

2024
[12]

Practical transfer learning for Bayesian optimization

Matthias Feurer, Benjamin Letham, Frank Hutter, and Eytan Bakshy. Practical transfer learning for Bayesian optimization. arXiv preprint arXiv:1802.02219, 2018. doi:10.48550/arxiv.1802.02219

work page doi:10.48550/arxiv.1802.02219 2018
[13]

Systematic identification of genomic markers of drug sensitivity in cancer cells

Mathew J Garnett, Elie Edelman, Sarah J Heidorn, Chris D Greenman, Anahita Dastur, Ka Chi Lau, Patricia Greninger, Iain R Thompson, Xian Luo, Jorge Soares, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483 0 (7391): 0 570--575, 2012. doi:10.1038/nature11005

work page doi:10.1038/nature11005 2012
[14]

Bandit processes and dynamic allocation indices

John C Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B, 41 0 (2): 0 148--164, 1979

1979
[15]

Gauche: a library for gaussian processes in chemistry

Ryan-Rhys Griffiths, Leo Klarner, Harriet Moss, et al. Gauche: a library for gaussian processes in chemistry. Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023

2023
[16]

Deep reinforcement learning that matters

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. doi:10.1609/aaai.v32i1.11694

work page doi:10.1609/aaai.v32i1.11694 2018
[17]

arXiv preprint arXiv:2204.11051 , year=

Carl Hvarfner, Danny Stoll, Artur Souza, Luigi Nardi, Andr \'e Biedenkapp, and Marius Lindauer. BO : Augmenting acquisition functions with user beliefs for B ayesian optimization. In International Conference on Learning Representations, 2022. doi:10.48550/arxiv.2204.11051

work page doi:10.48550/arxiv.2204.11051 2022
[18]

Imbens and Donald B

Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015

2015
[19]

Well-tuned simple nets excel on tabular datasets

Arlind Kadra, Marius Lindauer, Frank Hutter, and Josif Grabocka. Well-tuned simple nets excel on tabular datasets. In Advances in Neural Information Processing Systems, 2021

2021
[20]

Reddi, Sebastian U

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD : Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, 2020

2020
[21]

Information complexity in bandit subset selection

Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, 2013

2013
[22]

Contextual gaussian process bandit optimization

Andreas Krause and Cheng Soon Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, 2011

2011
[23]

A sober look at LLM s for material discovery: Are they actually good for B ayesian optimization over molecules? In International Conference on Machine Learning, 2024

Agustinus Kristiadi, Felix Strieth-Kalthoff, Marta Skreta, Pascal Poupart, Al \'a n Aspuru-Guzik, and Geoff Pleiss. A sober look at LLM s for material discovery: Are they actually good for B ayesian optimization over molecules? In International Conference on Machine Learning, 2024

2024
[24]

Heavy-tailed class imbalance and why adam outperforms gradient descent on language models

Frederik Kunstner, Alan Milligan, Robin Yadav, Mark Schmidt, and Alberto Bietti. Heavy-tailed class imbalance and why adam outperforms gradient descent on language models. In Advances in Neural Information Processing Systems, 2024

2024
[25]

Random search and reproducibility for neural architecture search

Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence, 2020

2020
[26]

Provable Accelerated Bayesian Optimization with Knowledge Transfer

Haitao Lin, Boxin Zhao, Mladen Kolar, and Chong Liu. Provable accelerated B ayesian optimization with knowledge transfer. arXiv:2511.03125, 2025. doi:10.48550/arxiv.2511.03125

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.03125 2025
[27]

SMAC3 : A versatile bayesian optimization package for hyperparameter optimization

Marius Lindauer, Katharina Eggensperger, Matthias Feurer, Andr \'e Biedenkapp, Difan Deng, Carolin Benjamins, Tim Ruhkopf, Ren \'e Sass, and Frank Hutter. SMAC3 : A versatile bayesian optimization package for hyperparameter optimization. Journal of Machine Learning Research, 23, 2022

2022
[28]

Lipton and Jacob Steinhardt

Zachary C. Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. Communications of the ACM, 62 0 (6): 0 45--53, 2019

2019
[29]

Are GAN s created equal? a large-scale study

Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GAN s created equal? a large-scale study. In Advances in Neural Information Processing Systems, 2018

2018
[30]

PriorBand : Practical hyperparameter optimization in the age of deep learning

Neeratyoy Mallik, Edward Bergman, Carl Hvarfner, Danny Stoll, Maciej Janowski, Marius Lindauer, Luigi Nardi, and Frank Hutter. PriorBand : Practical hyperparameter optimization in the age of deep learning. In Advances in Neural Information Processing Systems, 2023. doi:10.48550/arxiv.2402.05878

work page doi:10.48550/arxiv.2402.05878 2023
[31]

When do neural nets outperform boosted trees on tabular data? In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023

Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C , Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data? In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023

2023
[32]

Multi-fidelity B ayesian optimization with unreliable information sources

Petrus Mikkola, Julien Martinelli, Louis Filstroff, and Samuel Kaski. Multi-fidelity B ayesian optimization with unreliable information sources. In International Conference on Artificial Intelligence and Statistics, 2023. doi:10.48550/arxiv.2305.02997

work page doi:10.48550/arxiv.2305.02997 2023
[33]

Pfns4bo: in-context learning for bayesian optimization

Samuel M \"u ller, Matthias Feurer, Noah Hollmann, and Frank Hutter. Pfns4bo: in-context learning for bayesian optimization. In International Conference on Machine Learning, 2023

2023
[34]

A metric learning reality check

Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In European Conference on Computer Vision (ECCV), 2020

2020
[35]

Prior-dependent allocations for B ayesian fixed-budget best-arm identification in structured bandits

Nicolas Nguyen, Imad Aouali, Andr \'a s Gy \"o rgy, and Claire Vernade. Prior-dependent allocations for B ayesian fixed-budget best-arm identification in structured bandits. In International Conference on Artificial Intelligence and Statistics, 2025

2025
[36]

Meta-VBO : Utilizing prior tasks in optimizing risk measures with Gaussian processes

Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Meta-VBO : Utilizing prior tasks in optimizing risk measures with Gaussian processes. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ElykcDu5YK

2024
[37]

Hpobench: A collection of reproducible multi-fidelity benchmark problems for hpo.arXiv preprint arXiv:2109.06716,

Sebastian Pineda Arango, Hadi S Jomaa, Martin Wistuba, and Josif Grabocka. HPO - B : A large-scale reproducible benchmark for black-box HPO based on OpenML . In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. doi:10.48550/arxiv.2109.06716

work page doi:10.48550/arxiv.2109.06716 2021
[38]

Partial rankings of optimizers

Julian Rodemann and Hannah Blocher. Partial rankings of optimizers. In International Conference on Learning Representations (Tiny Papers), 2024

2024
[39]

Meta-learning reliable priors in the function space for Bayesian optimization

Jonas Rothfuss, Dominique Heyn, Jinfan Chen, and Andreas Krause. Meta-learning reliable priors in the function space for Bayesian optimization. Advances in Neural Information Processing Systems, 34, 2021

2021
[40]

Learning to optimize via posterior sampling

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 0 (4): 0 1221--1243, 2014. doi:10.1287/moor.2014.0650

work page doi:10.1287/moor.2014.0650 2014
[41]

Russo Daniel, Van Roy

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning, 11 0 (1): 0 1--96, 2018. doi:10.1561/2200000070

work page doi:10.1561/2200000070 2018
[42]

Are emergent abilities of large lan- guage models a mirage?arXiv preprint arXiv:2304.15004, 2023

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, volume 36, 2023. doi:10.48550/arxiv.2304.15004

work page doi:10.48550/arxiv.2304.15004 2023
[43]

Bayesian Reaction Optimization as a Tool for Chemical Synthesis

Benjamin J. Shields, Jason Stevens, Jun Li, Marvin Parasram, Farhan Damani, Jesus I. Martinez Alvarado, Jacob M. Janey, Ryan P. Adams, and Abigail G. Doyle. Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590 0 (7844): 0 89--96, 2021. doi:10.1038/s41586-021-03213-y

work page doi:10.1038/s41586-021-03213-y 2021
[44]

Optimizer benchmarking needs to account for hyperparameter tuning

Prabhu Teja Sivaprasad, Florian Mai, Thijs Vogels, Martin Jaggi, and Fran c ois Fleuret. Optimizer benchmarking needs to account for hyperparameter tuning. In International Conference on Machine Learning, 2020

2020
[45]

Multi-task bayesian optimization

Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task bayesian optimization. In Advances in Neural Information Processing Systems, 2013

2013
[46]

White, Jeffrey F

Christopher Tosh, Mauricio Tec, Jessica B. White, Jeffrey F. Quinn, Glorymar Ibanez Sanchez, Paul Calder, Andrew L. Kung, Filemon S. Dela Cruz, Wesley Tansey, et al. A bayesian active learning platform for scalable combination drug screens. Nature Communications, 16: 0 156, 2025. doi:10.1038/s41467-024-55287-7

work page doi:10.1038/s41467-024-55287-7 2025
[47]

Monte Carlo tree search based space transfer for black-box optimization

Shukuan Wang, Ke Xue, Lei Song, Xiaobin Huang, and Chao Qian. Monte Carlo tree search based space transfer for black-box optimization. In Advances in Neural Information Processing Systems, volume 37, 2024 a

2024
[48]

Pre-trained gaussian processes for bayesian optimization

Zi Wang, George Dahl, Kevin Swersky, et al. Pre-trained gaussian processes for bayesian optimization. Journal of Machine Learning Research, 25 0 (212): 0 1--83, 2024 b . URL http://jmlr.org/papers/v25/23-0269.html

2024
[49]

Multi-objective tree-structured Parzen estimator meets meta-learning

Shuhei Watanabe, Noor Awad, Masaki Onishi, and Frank Hutter. Multi-objective tree-structured Parzen estimator meets meta-learning. In NeurIPS Workshop on Meta-Learning , 2022

2022
[50]

Few-shot bayesian optimization with deep kernel surrogates

Martin Wistuba and Josif Grabocka. Few-shot bayesian optimization with deep kernel surrogates. In International Conference on Learning Representations, 2021

2021
[51]

Wolpert and William G

David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1 0 (1): 0 67--82, 1997

1997
[52]

Jones, and Michael A

Wenjie Xu, Masaki Adachi, Colin N. Jones, and Michael A. Osborne. Principled Bayesian optimisation in collaboration with human experts. In Advances in Neural Information Processing Systems, volume 37, 2024

2024