BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

Apivich Hemachandra; Bryan Kian Hsiang Low; Ruth Wan Theng Chew; Zhiliang Chen

arxiv: 2605.17000 · v1 · pith:QEP33ICZnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Ruth Wan Theng Chew, Zhiliang Chen, Apivich Hemachandra, Bryan Kian Hsiang Low

Pith reviewed 2026-05-19 20:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords black-box optimization · Bayesian optimization · LLM tuning · surrogate models · benchmark · multi-fidelity · noisy optimization

The pith

BoLT supplies lightweight surrogate models from thousands of real LLM runs so black-box optimization researchers can test methods on realistic expensive tasks without prohibitive costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM training and inference configurations are noisy, high-dimensional, and costly to evaluate, so practitioners often tune them with heuristics rather than principled optimization. Bayesian optimization and other black-box methods could improve sample efficiency, yet most researchers cannot afford the compute to experiment with new algorithms on actual models. BoLT addresses this by releasing surrogate models fitted to data from thousands of real LLM experiments. The surrogates reproduce multi-fidelity, multi-objective, heteroscedastic-noise, and high-dimensional features typical of prompt, hyperparameter, and data-mixture tuning. When a wide range of BO and BBO algorithms are run on these surrogates, selected Bayesian optimization methods show consistent advantages over alternatives.

Core claim

BoLT is the first LLM-centric benchmark that democratizes access to realistic optimization problems by providing lightweight surrogate models fitted to results from thousands of actual LLM experiments. These surrogates embed the practical challenges of multi-fidelity evaluation, multi-objective trade-offs, heteroscedastic noise, and high-dimensional search spaces that arise when tuning LLM training and inference configurations. Benchmarking a broad collection of Bayesian optimization and black-box optimization methods on BoLT shows that particular BO approaches maintain an edge in performance across the tasks.
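To make the idea concrete, here is a minimal sketch of how an optimizer might query a surrogate of this kind. Everything in it is hypothetical: `toy_surrogate`, its quadratic landscape, the noise model, and the rule that cost equals fidelity are stand-ins invented for illustration, not BoLT's API or its fitted models. It only shows the shape of the loop: pick a configuration and a fidelity, pay a fidelity-dependent cost, and receive a noisy observation whose variance depends on the input.

```python
import random


def toy_surrogate(lr, fidelity, rng):
    """Hypothetical stand-in for a fitted surrogate (not BoLT's model).

    Returns a noisy score for an input lr in [0, 1]. Lower fidelity adds a
    pessimistic bias, and the noise level grows with lr (heteroscedastic).
    """
    mean = -(lr - 0.3) ** 2 - 0.1 * (1.0 - fidelity)
    noise_std = 0.01 + 0.05 * lr
    return mean + rng.gauss(0.0, noise_std)


def random_search(budget, seed=0):
    """Random-search baseline that spends a cumulative cost budget."""
    rng = random.Random(seed)
    spent = 0.0
    history = []  # list of (lr, fidelity, observed score)
    while True:
        lr = rng.random()
        fidelity = rng.choice([0.1, 0.5, 1.0])
        cost = fidelity  # assumed cost rule, chosen only for this sketch
        if spent + cost > budget:
            break
        history.append((lr, fidelity, toy_surrogate(lr, fidelity, rng)))
        spent += cost
    if not history:
        raise ValueError("budget too small for a single evaluation")
    best = max(history, key=lambda h: h[2])
    return {"best_lr": best[0], "best_score": best[2],
            "spent": spent, "history": history}


result = random_search(budget=10.0, seed=1)
print(result["best_lr"], result["best_score"], result["spent"])
```

Because each call costs microseconds rather than GPU-hours, a loop like this can be repeated over many seeds, which is the practical benefit the core claim describes.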

What carries the argument

Lightweight surrogate models fitted to real LLM experimental data that approximate the objective landscapes for configuration tuning.

If this is right

Researchers can now iterate on new black-box optimization algorithms using realistic LLM-like problems at low cost.
Performance gaps in existing methods for handling noise and multiple objectives become measurable on LLM-relevant tasks.
Algorithm designers gain concrete targets for improving sample efficiency on high-dimensional noisy surfaces.
The benchmark supports reproducible comparisons that were previously blocked by compute barriers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Methods that excel on the surrogates could transfer to real LLM tuning workflows if the landscapes match closely enough.
The surrogate-fitting strategy might be reused to create accessible benchmarks for other domains where full experiments are prohibitively expensive.
Adding newer model families to the benchmark would test whether the observed method rankings remain stable as architectures evolve.

Load-bearing premise

The fitted surrogate models must accurately reproduce the optimization landscapes, noise patterns, and relative performance ordering of methods that would appear on the true expensive LLM tasks.

What would settle it

Execute the top-ranked BO methods identified on BoLT directly on the original LLM tasks and check whether their sample-efficiency and final-performance advantages over other methods still hold.

Figures

Figures reproduced from arXiv: 2605.17000 by Apivich Hemachandra, Bryan Kian Hsiang Low, Ruth Wan Theng Chew, Zhiliang Chen.

**Figure 2.** Results on BoLT’s HPO problems. MF variants are plotted against the cumulative budget, where each observation costs between 0.1 and 1 depending on the fidelity chosen. […] noiseless Pareto frontier across all observations at iteration t. These are common evaluation metrics found in optimization literature [15, 67]. The optimal values f∗ and HV∗ are found empirically via exhaustive evaluation for tabular data…

**Figure 3.** Results on BoLT’s DMO problems. (q = 5) indicates a batch size of 5, and NSGA2 has 50 observations in each iteration. All other methods use only 1 observation per iteration. […] Batched candidates are selected via sequential greedy conditioning [6]. Batch sizes and population (for genetic algorithms) are noted where they differ from a default of one. We compare against a range of acquisition f…

**Figure 4.** Results on BoLT’s PO problems. (q = 5) indicates a batch size of 5. Note that dTuRBO and dBAxUS are our adapted methods for a discretized search space. […] PO-768, which descends to approximately −7, suggesting it benefits from the richer expressivity of the full embedding dimensionality, though it remains far behind trust-region methods. Batch evaluation partially compensates for the absence of trust-region s…

**Figure 5.** PCA explained variance for PO. […] Subspace reduction may be unnecessary for LLM embedding spaces. At PO-128, dBAxUS performs notably worse than dTuRBO, as PO-128 embeddings require 74.2% of principal components to explain 95% of variance (…

**Figure 6.** Real and predicted ranks on test set. [Plot residue: panel HPO-MF-Cont, relative error vs. fidelity (normalized training tokens); panel HPO-MF-Disc: 4B vs 8B, count vs. normalized difference.]

**Figure 7.** Differences in emulated objective at different fidelities.

**Figure 8.** Estimated optimal Pareto front, as 2D slices of a 3D space.

**Figure 9.** Noise emulator predictions for MATH-500 standard deviation. if_prop1 indicates mix…

**Figure 10.** Comparison of β for UCB on HPO and DMO base problems. "UCB" refers to using scheduled β_t from Srinivas et al. [67]. […] D.3 Cost scale sensitivity. In cost-scaled acquisition functions, acquisition values are divided by an affine cost model c(x) = α · fid(x) + 1, where fid(x) is the fidelity parameter and α controls the cost sensitivity. Larger α penalizes high-fidelity queries more heavily in their acquisition…

**Figure 11.** Performance on MF HPO problems at different cost scales.

**Figure 12.** Fidelity proportions of observations on MF HPO problems.

**Figure 13.** Results on BoLT’s HPO problems using inference regret. Random is not included as random sampling does not maintain a posterior mean.

**Figure 14.** Results on BoLT’s HPO problems using per-step simple regret.
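The cost model quoted with Figure 10, c(x) = α · fid(x) + 1, can be illustrated with a short sketch. The function names, the candidate tuples, and the acquisition values below are assumptions made up for the example; only the affine cost formula and the division of acquisition value by cost come from the text. The sketch assumes non-negative acquisition values, since dividing a negative value by a larger cost would make it look better rather than worse.

```python
def affine_cost(fidelity, alpha):
    """Affine cost model c(x) = alpha * fid(x) + 1."""
    return alpha * fidelity + 1.0


def cost_scaled(acq_value, fidelity, alpha):
    """Acquisition value divided by the affine cost (assumes acq_value >= 0)."""
    return acq_value / affine_cost(fidelity, alpha)


def pick_candidate(candidates, alpha):
    """Return the name of the candidate with the largest cost-scaled value.

    candidates: list of (name, acquisition value, fidelity) tuples.
    """
    best = max(candidates, key=lambda c: cost_scaled(c[1], c[2], alpha))
    return best[0]


# Hypothetical candidates: a cheap low-fidelity query and a costly one.
candidates = [("low_fid", 0.6, 0.1), ("high_fid", 1.0, 1.0)]

choice_insensitive = pick_candidate(candidates, alpha=0.0)  # cost ignored
choice_sensitive = pick_candidate(candidates, alpha=2.0)    # cost matters
print(choice_insensitive, choice_sensitive)  # prints "high_fid low_fid"
```

With α = 0 every query has cost 1, so the raw acquisition value decides. With α = 2 the high-fidelity query is divided by 3 and the low-fidelity one by 1.2, which flips the choice and matches the statement that larger α penalizes high-fidelity queries more heavily.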

Original abstract

Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative-free optimization problems, Bayesian optimization (BO) and other black-box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample-efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small-scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM-centric benchmark that democratizes LLM research for the BBO community. BoLT is released at https://github.com/chewwt/bolt. BoLT covers broad and well-motivated LLM optimization problems, involving multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BoLT gives BBO researchers a set of surrogates from real LLM runs that could replace synthetic test functions, but the value turns on how well those surrogates match actual landscapes and method orderings.

BoLT is worth knowing about if you work on black-box optimization and want test problems that actually resemble tuning large language models. The new part is the collection of surrogate models built from thousands of real runs on LLM tasks. These cover multi-fidelity, multi-objective, noisy, and high-dimensional cases that synthetic functions usually miss. The authors make the surrogates and the benchmark code public, so anyone can run experiments without the compute cost.

They also compare a range of BO and BBO methods and note that some do better consistently while others have gaps on these tasks. That comparison is useful because it shows where existing methods fall short for this domain. The grounding in real experimental data is a step up from the small-scale or synthetic setups that the paper criticizes.

The main thing to watch is the quality of the surrogates. Everything depends on them capturing the landscapes, the noise structure, and the ordering of methods correctly. The paper claims they are fitted to real data, but I would look for details on how they validated that, such as prediction accuracy on held-out points or correlation between surrogate and real rankings. If those checks are solid, the benchmark is a real advance. If not, it could give misleading signals about which methods to develop further.

A reader interested in applying optimization to LLMs or in developing new BBO algorithms for expensive problems would find this paper and the released materials helpful. It is not a theoretical advance but a practical resource that could change what counts as a standard test in the field. I think it should go to peer review. The idea fills a gap, the execution seems careful, and the community would benefit from having these tasks available even if some revision is needed on the validation side.
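The two checks named in the letter, prediction accuracy on held-out points and correlation between surrogate and real rankings, could look like the sketch below. It is a standard-library illustration, not the paper's validation code: the helper names and the four example scores are invented, and Spearman's ρ is computed here as the Pearson correlation of average ranks.

```python
def mse(y_true, y_pred):
    """Mean squared error between real and surrogate-predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical held-out scores from real runs and from a surrogate.
real = [0.61, 0.72, 0.55, 0.80]
predicted = [0.60, 0.70, 0.58, 0.75]

held_out_mse = mse(real, predicted)
rank_corr = spearman(real, predicted)
print(held_out_mse, rank_corr)
```

In this toy case the surrogate gets every value slightly wrong but preserves the ordering, so the rank correlation is 1 while the MSE is nonzero. For an optimization benchmark the ordering is arguably the more relevant property, which is why a rank-based metric is a natural companion to MSE.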

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BoLT, the first LLM-centric benchmark for black-box optimization research. It releases lightweight surrogate models fitted to thousands of real LLM experiments to make expensive tasks (hyperparameter tuning, data mixtures, prompt optimization) accessible to the BBO community. The benchmark covers multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional problems. The authors evaluate a broad range of BO and BBO methods on BoLT and report that selected BO methods consistently outperform others across tasks, while highlighting gaps in existing methods.

Significance. If the surrogates accurately reproduce real LLM optimization landscapes, noise characteristics, and relative method orderings, BoLT would be a substantial contribution: it lowers the barrier for BBO researchers to test methods on practically relevant, expensive problems rather than synthetic functions. The public release of code and reproducible surrogates supports this utility. The work directly addresses the mismatch between current BBO benchmarks and modern LLM-scale tasks.

major comments (2)

[§3] §3 (Surrogate Model Construction): The manuscript states that surrogates are fitted to results from thousands of real LLM experiments and are intended to reproduce optimization landscapes, heteroscedastic noise, and multi-fidelity behavior, but reports no quantitative fidelity metrics (e.g., held-out predictive MSE, Spearman rank correlation of method performances, or accuracy across fidelity levels). This validation is load-bearing for the central claim that rankings and behaviors observed on BoLT will transfer to actual expensive LLM tasks.
[§5] §5 (Benchmarking Experiments): The claim that 'selected BO methods consistently outperform others across tasks' is presented without statistical significance testing or confidence intervals on performance differences. Given the heteroscedastic noise explicitly modeled in the surrogates, this weakens the strength of the comparative conclusions.

minor comments (2)

[Abstract and §2] The abstract and §2 could more explicitly state the regression or interpolation technique used to build the surrogates (e.g., Gaussian process, random forest, or neural network) rather than referring only to 'lightweight surrogate models'.
[Table 2] Table 2 (task characteristics) would benefit from an additional column reporting the number of real experiments used to fit each surrogate, to allow readers to assess data density per task.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important aspects of surrogate validation and statistical rigor that we have addressed through revisions to strengthen the manuscript.

Referee: [§3] §3 (Surrogate Model Construction): The manuscript states that surrogates are fitted to results from thousands of real LLM experiments and are intended to reproduce optimization landscapes, heteroscedastic noise, and multi-fidelity behavior, but reports no quantitative fidelity metrics (e.g., held-out predictive MSE, Spearman rank correlation of method performances, or accuracy across fidelity levels). This validation is load-bearing for the central claim that rankings and behaviors observed on BoLT will transfer to actual expensive LLM tasks.

Authors: We agree that quantitative fidelity metrics are essential to support the claim that BoLT surrogates accurately reproduce real LLM optimization landscapes, noise characteristics, and relative method orderings. In the revised manuscript, we have added a new validation subsection to §3. This includes held-out predictive MSE and R² scores computed on a disjoint set of real LLM experimental results. We also report Spearman rank correlations between method performance rankings obtained on the surrogates and those from a small collection of held-out real LLM runs. Finally, we provide per-fidelity-level error metrics to verify multi-fidelity behavior. These additions directly bolster the transferability argument.
Revision: yes
Referee: [§5] §5 (Benchmarking Experiments): The claim that 'selected BO methods consistently outperform others across tasks' is presented without statistical significance testing or confidence intervals on performance differences. Given the heteroscedastic noise explicitly modeled in the surrogates, this weakens the strength of the comparative conclusions.

Authors: We acknowledge that the lack of statistical significance testing and confidence intervals weakens the comparative claims, especially in the presence of heteroscedastic noise. In the revision, we have updated §5 and the appendix to include bootstrap confidence intervals around all performance metrics. We have also added results from Wilcoxon signed-rank tests (with p-values) to assess whether observed differences between methods are statistically significant. These changes provide the requested rigor while preserving the original experimental setup.
Revision: yes
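As an illustration of the bootstrap confidence intervals mentioned in this response, the sketch below computes a percentile bootstrap interval for the mean paired difference between two methods across seeds. The function name, the resampling settings, and the eight difference values are assumptions for the example and do not come from the paper. The Wilcoxon signed-rank test is not implemented here; in practice it would typically come from a statistics library.

```python
import random


def bootstrap_ci(diffs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of paired differences.

    diffs: per-seed differences (method A minus method B) on one task.
    Returns (lower, upper) bounds at the requested confidence level.
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the seeds with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int(round((1.0 - level) / 2.0 * n_resamples))
    hi_idx = int(round((1.0 + level) / 2.0 * n_resamples)) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical final-score differences over eight seeds.
diffs = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.20, 0.10]
lower, upper = bootstrap_ci(diffs)
print(lower, upper)
```

If the interval excludes zero, the observed advantage is unlikely to be an artifact of seed-to-seed noise. Resampling whole seeds, rather than individual observations within a run, keeps the pairing between methods intact.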

Circularity Check

0 steps flagged

No significant circularity; surrogates and evaluations remain independent

The paper constructs lightweight surrogate models by fitting to results from thousands of real LLM experiments (external data) and then evaluates BO/BBO methods on those surrogates. No step reduces the reported method outperformance or benchmark utility to the authors' fitting procedure by construction, self-definition, or self-citation chain. The performance ordering is obtained by running the methods on the surrogates rather than being statistically forced by the surrogate parameters themselves, and the surrogates are presented as reproducible approximations grounded outside the present evaluation loop. This is a standard benchmark construction with no load-bearing circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The benchmark construction implicitly relies on the assumption that the chosen surrogate fitting procedure preserves the essential properties of the original expensive black-box functions; no explicit free parameters or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5824 in / 1249 out tokens · 21936 ms · 2026-05-19T20:44:11.678175+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We fit 2-layer MLPs for the optimization objective of HPO and DMO tasks... Emulators are validated on a Sobol-sampled held-out test set... Spearman’s rank correlation ρ as the validation metric"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery theorem · unclear

Paper passage: "BOLT covers broad and well-motivated LLM optimization problems, involving multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional search spaces."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 7 internal anchors

[1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623–2631, 2019.
[2] A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.
[3] S. Ament, S. Daulton, D. Eriksson, M. Balandat, and E. Bakshy. Unexpected improvements to expected improvement for Bayesian optimization. Advances in Neural Information Processing Systems, 36:20577–20612, 2023.
[4] S. P. Arango, H. S. Jomaa, M. Wistuba, and J. Grabocka. Hpo-b: A large-scale reproducible benchmark for black-box hpo based on openml. In Thirty-fifth Conference on Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
[5] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[6] M. Balandat, B. Karrer, D. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A framework for efficient monte-carlo bayesian optimization. Advances in Neural Information Processing Systems, 33:21524–21538, 2020.
[7] S. Belakaria, A. Deshwal, and J. R. Doppa. Max-value entropy search for multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019.
[8] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24, 2011.
[9] D. Bingham and S. Surjanovic. Virtual library of simulation experiments: Test functions and datasets,
[10] URL https://www.sfu.ca/~ssurjano/optimization.html
[11] L. Bliek, A. Guijt, R. Karlsson, S. Verwer, and M. De Weerdt. Benchmarking surrogate-based optimisation algorithms on expensive black-box functions. Applied Soft Computing, 147:110744, 2023.
[12] E. Bradford, A. M. Schweidtmann, and A. Lapkin. Efficient multiobjective optimization employing Gaussian processes, spectral sampling and a genetic algorithm. Journal of global optimization, 71(2):407–438, 2018.
[13] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24, 2011.
[14] L. Chen, J. Chen, T. Goldstein, H. Huang, and T. Zhou. Instructzero: Efficient instruction optimization for black-box large language models. In International Conference on Machine Learning, 2024.
[15] Z. Chen, G. K. R. Lau, C.-S. Foo, and B. K. H. Low. DUET: Optimizing training data mixtures via feedback from unseen evaluation tasks. arXiv:2502.00270, 2025.
[16] S. Daulton, M. Balandat, and E. Bakshy. Parallel bayesian optimization of multiple noisy objectives with expected hypervolume improvement. Advances in Neural Information Processing Systems, 34:2187–2200, 2021.
[17] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2):182–197, 2002.
[18] K. Dreczkowski, A. Grosnit, and H. Bou Ammar. Framework and benchmarks for combinatorial and mixed-variable Bayesian optimization. Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 36:69464–69489, 2023.
[19] K. Eggensperger, P. Müller, N. Mallik, M. Feurer, R. Sass, A. Klein, N. Awad, M. Lindauer, and F. Hutter. Hpobench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
[20] D. Eriksson and M. Jankowiak. High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Uncertainty in artificial intelligence, pages 493–503. PMLR, 2021.
[21] D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019.
[22] S. Falkner, A. Klein, and F. Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pages 1437–1446. PMLR, 2018.
[23] C. Fernando, D. S. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In International Conference on Machine Learning, pages 13481–13544. PMLR, 2024.
[24] P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
[25] P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[26] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness, 07 2024. URL https://zenodo.org/records/12608602
[27] R. Garnett. Bayesian optimization. Cambridge University Press, 2023.
[28] N. Hansen. The CMA evolution strategy: a comparing review. Towards a new evolutionary computation: Advances in the estimation of distribution algorithms, pages 75–102, 2006.
[29] N. Hansen, S. Finck, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. PhD thesis, INRIA, 2009.
[30] N. Hansen, A. Auger, R. Ros, O. Mersmann, T. Tušar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software, 36(1):114–144, 2021.
[31] F. Häse, M. Aldeghi, R. J. Hickman, L. M. Roch, M. Christensen, E. Liles, J. E. Hein, and A. Aspuru-Guzik. Olympus: a benchmarking framework for noisy optimization and experiment planning. Machine Learning: Science and Technology, 2(3):035021, 2021.
[32] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
[33] D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah, and R. Adams. Predictive entropy search for multi-objective Bayesian optimization. In International Conference on Machine Learning, pages 1492–1501. PMLR, 2016.
[34] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. Advances in Neural Information Processing Systems, 27, 2014.
[35] W. Hu, Y. Shu, Z. Yu, Z. Wu, X. Lin, Z. Dai, S.-K. Ng, and B. K. H. Low. Localized zeroth-order prompt optimization. Advances in Neural Information Processing Systems, 37:86309–86345, 2024.
[36] C. Hvarfner, F. Hutter, and L. Nardi. Joint entropy search for maximally-informed Bayesian optimization. Advances in Neural Information Processing Systems, 35:11494–11506, 2022.
[37] C. Hvarfner, E. O. Hellsten, and L. Nardi. Vanilla Bayesian optimization performs great in high dimensions. In International Conference on Machine Learning, pages 20793–20817. PMLR, 2024.
[38] C. Jang, H. Lee, J. Kim, and J. Lee. Model fusion through Bayesian optimization in language model fine-tuning. Advances in Neural Information Processing Systems, 37:29878–29912, 2024.
[39] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
[40] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely heteroscedastic gaussian process regression. In International Conference on Machine learning, pages 393–400, 2007.
[41] J. Knowles. Parego: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE transactions on evolutionary computation, 10(1):50–66, 2006.
[42] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
[43] A. Kumar, G. Wu, M. Z. Ali, R. Mallipeddi, P. N. Suganthan, and S. Das. A test-suite of non-convex constrained optimization problems from the real-world and some baseline results. Swarm and evolutionary computation, 56:100693, 2020.
[44] A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022.
[45] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, X. Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. In Second Conference on Language Modeling, 2025.
[46] M. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regression. In International Conference on Machine Learning, pages 841–848, 2011.
[47] H. Li, W. Zheng, Q. Wang, H. Zhang, Z. Wang, S. Xuyang, Y. Fan, Z. Ding, H. Wang, N. Ding, S. Zhou, X. Zhang, and D. Jiang. Predictable scale: Part i, step law – optimal hyperparameter scaling law in large language model pretraining. arXiv:2503.04715, 2025.
[48] L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, J. Ben-Tzur, M. Hardt, B. Recht, and A. Talwalkar. A system for massively parallel hyperparameter tuning. Proceedings of machine learning and systems, 2:230–246, 2020.
[49] Y. Li, Z. Liu, and E. Xing. Data mixing optimization for supervised fine-tuning of large language models. In International Conference on Machine Learning, pages 35419–35437. PMLR, 2025.
[50] Q. Liang, A. E. Gongora, Z. Ren, A. Tiihonen, Z. Liu, S. Sun, J. R. Deneault, D. Bash, F. Mekki-Berrada, S. A. Khan, et al. Benchmarking the performance of Bayesian optimization across multiple experimental materials science domains. npj Computational Materials, 7(1):188, 2021.
[51]

Lightman, V

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[52]

X. Lin, Z. Wu, Z. Dai, W. Hu, Y . Shu, S.-K. Ng, P. Jaillet, and B. K. H. Low. Use your INSTINCT: Instruc- tion optimization for llms using neural bandits coupled with transformers. InInternational Conference on Machine Learning, pages 30317–30345. PMLR, 2024

work page 2024
[53]

Lindauer, K

M. Lindauer, K. Eggensperger, M. Feurer, A. Biedenkapp, D. Deng, C. Benjamins, T. Ruhkopf, R. Sass, and F. Hutter. Smac3: A versatile Bayesian optimization package for hyperparameter optimization.Journal of Machine Learning Research, 23(54):1–9, 2022

work page 2022
[54]

J. Liu, C. S. Xia, Y . Wang, and L. Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in Neural Information Processing Systems, 36:21558–21572, 2023

work page 2023
[55]

Q. Liu, X. Zheng, N. Muennighoff, G. Zeng, L. Dou, T. Pang, J. Jiang, and M. Lin. Regmix: Data mixture as regression for language model pre-training. In The Thirteenth International Conference on Learning Representations, 2025

[56]

H. B. Moss, D. S. Leslie, J. Gonzalez, and P. Rayson. Gibbon: General-purpose information-based Bayesian optimisation. Journal of Machine Learning Research, 22(235):1–49, 2021

[57]

M. Nomura, S. Watanabe, Y. Akimoto, Y. Ozaki, and M. Onishi. Warm starting CMA-ES for hyperparameter optimization. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 9188–9196, 2021

[58]

T. Olmo, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025

[59]

P. S. Palar, R. P. Liem, L. R. Zuhal, and K. Shimoyama. On the use of surrogate models in engineering design optimization and exploration: The key issues. In Proceedings of the genetic and evolutionary computation conference companion, pages 1592–1602, 2019

[60]

L. Papenmeier, L. Nardi, and M. Poloczek. Increasing the scope as you learn: Adaptive Bayesian optimization in nested subspaces. Advances in Neural Information Processing Systems, 35:11586–11601, 2022

[61]

L. Papenmeier, M. Poloczek, and L. Nardi. Understanding high-dimensional Bayesian optimization. In International Conference on Machine Learning, pages 47902–47923. PMLR, 2025

[62]

F. Pfisterer, L. Schneider, J. Moosbauer, M. Binder, and B. Bischl. Yahpo gym-an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. In International Conference on Automated Machine Learning, pages 3–1. PMLR, 2022

[63]

H. Schechter Vera, S. Dua, B. Zhang, D. Salz, R. Mullins, S. Raghuram Panyam, S. Smoot, I. Naim, J. Zou, F. Chen, D. Cer, A. Lisak, M. Choi, L. Gonzalez, O. Sanseviero, G. Cameron, I. Ballantyne, K. Black, K. Chen, W. Wang, Z. Li, G. Martins, J. Lee, M. Sherwood, J. Ji, R. Wu, J. Zheng, J. Singh, A. Sharma, D. Sreepat, A. Jain, A. Elarabawy, A. Co, A. Dou... EmbeddingGemma: Powerful and Lightweight Text Representations. arXiv, 2025

[64]

B. Seong-Eun, L. Jung-Mok, K. Sung-Bin, and T.-H. Oh. Efficient hyper-parameter search for LoRA via language-aided Bayesian optimization. arXiv preprint arXiv:2602.11171, 2026

[65]

B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015

[66]

C. Shi, K. Yang, Z. Chen, J. Li, J. Yang, and C. Shen. Efficient prompt optimization through the lens of best arm identification. Advances in Neural Information Processing Systems, 37:99646–99685, 2024

[67]

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012

[68]

N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010

[69]

S. Takeno, H. Fukuoka, Y. Tsukada, T. Koyama, M. Shiga, I. Takeuchi, and M. Karasuyama. Multi-fidelity Bayesian optimization with max-value entropy search and its parallelization. In International Conference on Machine Learning, pages 9334–9345. PMLR, 2020

[70]

S. Toshniwal, W. Du, I. Moshkov, B. Kisacanin, A. Ayrapetyan, and I. Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. In The Thirteenth International Conference on Learning Representations, 2023

[71]

C. Tribes, S. Benarroch-Lelong, P. Lu, and I. Kobyzev. Hyperparameter optimization for large language model instruction-tuning. In AAAI Conference on Artificial Intelligence, 2024

[72]

B. Tu, A. Gandy, N. Kantas, and B. Shafei. Joint entropy search for multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 35:9922–9938, 2022

[73]

L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 2609–2634, 2023

[74]

Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning, pages 3627–3635. PMLR, 2017

[75]

Z. Wu, X. Lin, Z. Dai, W. Hu, Y. Shu, S.-K. Ng, P. Jaillet, and B. K. H. Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars. Advances in Neural Information Processing Systems, 37:122706–122740, 2024

[76]

S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. S. Liang, Q. V. Le, T. Ma, and A. W. Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36:69798–69818, 2023

[77]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[78]

C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

[79]

T. Yen, A. W. T. Siah, H. Chen, C. D. Guetta, T. Peng, and H. Namkoong. Data mixture optimization: A multi-fidelity multi-scale bayesian framework. In Advances in Neural Information Processing Systems, 2025

[80]

Y. Zhang, A. Mohamed, H. Abdine, G. Shang, and M. Vazirgiannis. Beyond random sampling: Efficient language model pretraining via curriculum learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5776–5794, 2026

[1]

T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623–2631, 2019

[2]

A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024

[3]

S. Ament, S. Daulton, D. Eriksson, M. Balandat, and E. Bakshy. Unexpected improvements to expected improvement for Bayesian optimization. Advances in Neural Information Processing Systems, 36:20577–20612, 2023

[4]

S. P. Arango, H. S. Jomaa, M. Wistuba, and J. Grabocka. Hpo-b: A large-scale reproducible benchmark for black-box hpo based on openml. In Thirty-fifth Conference on Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

[5]

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

[6]

M. Balandat, B. Karrer, D. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A framework for efficient monte-carlo bayesian optimization. Advances in Neural Information Processing Systems, 33:21524–21538, 2020

[7]

S. Belakaria, A. Deshwal, and J. R. Doppa. Max-value entropy search for multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019

[8]

J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24, 2011

[9]

D. Bingham and S. Surjanovic. Virtual library of simulation experiments: Test functions and datasets,

[10]

URL https://www.sfu.ca/~ssurjano/optimization.html

[11]

L. Bliek, A. Guijt, R. Karlsson, S. Verwer, and M. De Weerdt. Benchmarking surrogate-based optimisation algorithms on expensive black-box functions. Applied Soft Computing, 147:110744, 2023

[12]

E. Bradford, A. M. Schweidtmann, and A. Lapkin. Efficient multiobjective optimization employing Gaussian processes, spectral sampling and a genetic algorithm. Journal of global optimization, 71(2):407–438, 2018

[13]

O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24, 2011

[14]

L. Chen, J. Chen, T. Goldstein, H. Huang, and T. Zhou. Instructzero: Efficient instruction optimization for black-box large language models. In International Conference on Machine Learning, 2024

[15]

Z. Chen, G. K. R. Lau, C.-S. Foo, and B. K. H. Low. DUET: Optimizing training data mixtures via feedback from unseen evaluation tasks. arXiv:2502.00270, 2025

[16]

S. Daulton, M. Balandat, and E. Bakshy. Parallel bayesian optimization of multiple noisy objectives with expected hypervolume improvement. Advances in Neural Information Processing Systems, 34:2187–2200, 2021

[17]

K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2):182–197, 2002

[18]

K. Dreczkowski, A. Grosnit, and H. Bou Ammar. Framework and benchmarks for combinatorial and mixed-variable Bayesian optimization. Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 36:69464–69489, 2023

[19]

K. Eggensperger, P. Müller, N. Mallik, M. Feurer, R. Sass, A. Klein, N. Awad, M. Lindauer, and F. Hutter. Hpobench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

[20]

D. Eriksson and M. Jankowiak. High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Uncertainty in artificial intelligence, pages 493–503. PMLR, 2021

[21]

D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019

[22]

S. Falkner, A. Klein, and F. Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pages 1437–1446. PMLR, 2018

[23]

C. Fernando, D. S. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In International Conference on Machine Learning, pages 13481–13544. PMLR, 2024

[24]

P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009

[25]

P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018

[26]

L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness, 07 2024. URL https://zenodo.org/records/12608602

[27]

R. Garnett. Bayesian optimization. Cambridge University Press, 2023

[28]

N. Hansen. The CMA evolution strategy: a comparing review. Towards a new evolutionary computation: Advances in the estimation of distribution algorithms, pages 75–102, 2006

[29]

N. Hansen, S. Finck, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. PhD thesis, INRIA, 2009

[30]

N. Hansen, A. Auger, R. Ros, O. Mersmann, T. Tušar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software, 36(1):114–144, 2021

[31]

F. Häse, M. Aldeghi, R. J. Hickman, L. M. Roch, M. Christensen, E. Liles, J. E. Hein, and A. Aspuru-Guzik. Olympus: a benchmarking framework for noisy optimization and experiment planning. Machine Learning: Science and Technology, 2(3):035021, 2021

[32]

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

[33]

D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah, and R. Adams. Predictive entropy search for multi-objective Bayesian optimization. In International Conference on Machine Learning, pages 1492–1501. PMLR, 2016

[34]

J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. Advances in Neural Information Processing Systems, 27, 2014

[35]

W. Hu, Y. Shu, Z. Yu, Z. Wu, X. Lin, Z. Dai, S.-K. Ng, and B. K. H. Low. Localized zeroth-order prompt optimization. Advances in Neural Information Processing Systems, 37:86309–86345, 2024

[36]

C. Hvarfner, F. Hutter, and L. Nardi. Joint entropy search for maximally-informed Bayesian optimization. Advances in Neural Information Processing Systems, 35:11494–11506, 2022

[37]

C. Hvarfner, E. O. Hellsten, and L. Nardi. Vanilla Bayesian optimization performs great in high dimensions. In International Conference on Machine Learning, pages 20793–20817. PMLR, 2024

[38]

C. Jang, H. Lee, J. Kim, and J. Lee. Model fusion through Bayesian optimization in language model fine-tuning. Advances in Neural Information Processing Systems, 37:29878–29912, 2024

[39]

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998

[40]

K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely heteroscedastic gaussian process regression. In International Conference on Machine learning, pages 393–400, 2007

[41]

J. Knowles. Parego: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE transactions on evolutionary computation, 10(1):50–66, 2006

[42]

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022

[43]

A. Kumar, G. Wu, M. Z. Ali, R. Mallipeddi, P. N. Suganthan, and S. Das. A test-suite of non-convex constrained optimization problems from the real-world and some baseline results. Swarm and evolutionary computation, 56:100693, 2020

[44]

A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022

[45]

N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, X. Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. In Second Conference on Language Modeling, 2025
