
arxiv: 2605.17000 · v1 · pith:QEP33ICZnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

Pith reviewed 2026-05-19 20:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords black-box optimization · Bayesian optimization · LLM tuning · surrogate models · benchmark · multi-fidelity · noisy optimization

The pith

BoLT supplies lightweight surrogate models from thousands of real LLM runs so black-box optimization researchers can test methods on realistic expensive tasks without prohibitive costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM training and inference configurations are noisy, high-dimensional, and costly to evaluate, so practitioners often tune them with heuristics rather than principled optimization. Bayesian optimization and other black-box methods could improve sample efficiency, yet most researchers cannot afford the compute to experiment with new algorithms on actual models. BoLT addresses this by releasing surrogate models fitted to data from thousands of real LLM experiments. The surrogates reproduce multi-fidelity, multi-objective, heteroscedastic-noise, and high-dimensional features typical of prompt, hyperparameter, and data-mixture tuning. When a wide range of BO and BBO algorithms are run on these surrogates, selected Bayesian optimization methods show consistent advantages over alternatives.

Core claim

BoLT is the first LLM-centric benchmark that democratizes access to realistic optimization problems by providing lightweight surrogate models fitted to results from thousands of actual LLM experiments. These surrogates embed the practical challenges of multi-fidelity evaluation, multi-objective trade-offs, heteroscedastic noise, and high-dimensional search spaces that arise when tuning LLM training and inference configurations. Benchmarking a broad collection of Bayesian optimization and black-box optimization methods on BoLT shows that particular BO approaches maintain an edge in performance across the tasks.

What carries the argument

Lightweight surrogate models fitted to real LLM experimental data that approximate the objective landscapes for configuration tuning.
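To make the idea concrete, here is a toy sketch of how such a surrogate could look from the optimizer's side: a cheap mean function, a configuration-dependent noise level (heteroscedastic noise), and a fidelity knob that also sets the query cost. Every name and formula below is hypothetical and chosen only for illustration; none of it is BoLT's released code or API, and the real surrogates are fitted to LLM experiment data rather than written by hand.

```python
import random


def emulator_mean(lr_log10: float, fidelity: float) -> float:
    """Hypothetical noiseless objective (lower is better), e.g. a validation loss.

    A quadratic bowl in log10(learning rate) plus a term that shrinks as
    fidelity (fraction of the full training budget) grows.
    """
    return (lr_log10 + 3.0) ** 2 + 1.0 / (0.1 + fidelity)


def emulator_noise_std(lr_log10: float) -> float:
    """Hypothetical heteroscedastic noise: larger learning rates are noisier."""
    return 0.05 + 0.1 * max(0.0, lr_log10 + 3.0)


def evaluate(lr_log10: float, fidelity: float, rng: random.Random):
    """Return (noisy observation, cost) for one query of the toy emulator."""
    if not 0.0 < fidelity <= 1.0:
        raise ValueError("fidelity must lie in (0, 1]")
    mean = emulator_mean(lr_log10, fidelity)
    std = emulator_noise_std(lr_log10)
    # Assumption: the query cost simply equals the chosen fidelity.
    return rng.gauss(mean, std), fidelity


rng = random.Random(0)
y_low, cost_low = evaluate(-3.0, 0.1, rng)    # cheap, low-fidelity query
y_high, cost_high = evaluate(-3.0, 1.0, rng)  # expensive, full-fidelity query
```

An optimizer only ever calls `evaluate`, so it faces the same kind of interface as with a real expensive run while each call costs microseconds.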

If this is right

  • Researchers can now iterate on new black-box optimization algorithms using realistic LLM-like problems at low cost.
  • Performance gaps in existing methods for handling noise and multiple objectives become measurable on LLM-relevant tasks.
  • Algorithm designers gain concrete targets for improving sample efficiency on high-dimensional noisy surfaces.
  • The benchmark supports reproducible comparisons that were previously blocked by compute barriers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods that excel on the surrogates could transfer to real LLM tuning workflows if the landscapes match closely enough.
  • The surrogate-fitting strategy might be reused to create accessible benchmarks for other domains where full experiments are prohibitively expensive.
  • Adding newer model families to the benchmark would test whether the observed method rankings remain stable as architectures evolve.

Load-bearing premise

The fitted surrogate models must accurately reproduce the optimization landscapes, noise patterns, and relative performance ordering of methods that would appear on the true expensive LLM tasks.

What would settle it

Execute the top-ranked BO methods identified on BoLT directly on the original LLM tasks and check whether their sample-efficiency and final-performance advantages over other methods still hold.

Figures

Figures reproduced from arXiv: 2605.17000 by Apivich Hemachandra, Bryan Kian Hsiang Low, Ruth Wan Theng Chew, Zhiliang Chen.

Figure 1: 2D slices of emulator landscapes.
Figure 2: Results on BoLT's HPO problems. MF variants are plotted against the cumulative budget, where each observation's cost ranges from 0.1 to 1, depending on the fidelity chosen.
  … noiseless Pareto frontier across all observations at iteration t. These are common evaluation metrics found in optimization literature [15, 67]. The optimal values f∗ and HV∗ are found empirically via exhaustive evaluation for tabular data…
Figure 3: Results on BoLT's DMO problems. (q = 5) indicates a batch size of 5, and NSGA2 has 50 observations in each iteration. All other methods use only 1 observation per iteration.
  …tion iterations. Batched candidates are selected via sequential greedy conditioning [6]. Batch sizes and population (for genetic algorithms) are noted where they differ from a default of one. We compare against a range of acquisition f…
Figure 4: Results on BoLT's PO problems. (q = 5) indicates a batch size of 5. Note that dTuRBO and dBAxUS are our adapted methods for a discretized search space.
  … PO-768, which descends to approximately −7, suggesting it benefits from the richer expressivity of the full embedding dimensionality, though it remains far behind trust-region methods. Batch evaluation partially compensates for the absence of trust-region s…
Figure 5: PCA explained variance for PO.
  Subspace reduction may be unnecessary for LLM embedding spaces. At PO-128, dBAxUS performs notably worse than dTuRBO, as PO-128 embeddings require 74.2% of principal components to explain 95% of variance (…
Figure 6: Real and predicted ranks on test set.
  [Plot text: panel "HPO-MF-Cont", axes "Fidelity (normalized training tokens)" and "Relative error"; panel "HPO-MF-Disc: 4B vs 8B", axes "Normalized difference" and "Count".]
Figure 7: Differences in emulated objective at different fidelities.
Figure 8: Estimated optimal Pareto front, as 2D slices of a 3D space.
Figure 9: Noise emulator predictions for MATH-500 standard deviation. if_prop1 indicates mix…
Figure 10: Comparison of β for UCB on HPO and DMO base problems. "UCB" refers to using scheduled βt from Srinivas et al. [67].
  D.3 Cost scale sensitivity. In cost-scaled acquisition functions, acquisition values are divided by an affine cost model c(x) = α · fid(x) + 1, where fid(x) is the fidelity parameter and α controls the cost sensitivity. Larger α penalizes high-fidelity queries more heavily in their acquisition…
Figure 11: Performance on MF HPO problems at different cost scales.
Figure 12: Fidelity proportions of observations on MF HPO problems.
Figure 13: Results on BoLT's HPO problems using inference regret. Random is not included as random sampling does not maintain a posterior mean.
Figure 14: Results on BoLT's HPO problems using per-step simple regret.
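The cost-scaling rule quoted in the Figure 10 excerpt can be sketched in a few lines. The affine cost model c(x) = α · fid(x) + 1 comes from that excerpt; the function names, the candidate tuples, and the acquisition values are hypothetical placeholders, and the sketch assumes non-negative acquisition values so that dividing by cost preserves their ordering sensibly.

```python
def affine_cost(fidelity: float, alpha: float) -> float:
    """Affine cost model c(x) = alpha * fid(x) + 1, as quoted in the excerpt."""
    return alpha * fidelity + 1.0


def cost_scaled(acq_value: float, fidelity: float, alpha: float) -> float:
    """Acquisition value divided by the modeled cost of querying at this fidelity."""
    return acq_value / affine_cost(fidelity, alpha)


def pick(candidates, alpha: float) -> str:
    """Return the name of the candidate with the highest cost-scaled acquisition."""
    best = max(candidates, key=lambda c: cost_scaled(c[1], c[2], alpha))
    return best[0]


# Hypothetical candidates: (name, raw acquisition value, fidelity in [0, 1]).
candidates = [
    ("low-fidelity", 0.6, 0.1),
    ("high-fidelity", 1.0, 1.0),
]

choice_cost_blind = pick(candidates, alpha=0.0)   # cost is 1 everywhere
choice_cost_averse = pick(candidates, alpha=5.0)  # high fidelity is penalized
```

With α = 0 every query has unit cost and the raw acquisition decides; as α grows, the denominator for high-fidelity queries grows faster, so cheaper queries win more often.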
Original abstract

Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative-free optimization problems, Bayesian optimization (BO) and other black-box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample-efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small-scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM-centric benchmark that democratizes LLM research for the BBO community. BoLT is released at https://github.com/chewwt/bolt. BoLT covers broad and well-motivated LLM optimization problems, involving multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BoLT, the first LLM-centric benchmark for black-box optimization research. It releases lightweight surrogate models fitted to thousands of real LLM experiments to make expensive tasks (hyperparameter tuning, data mixtures, prompt optimization) accessible to the BBO community. The benchmark covers multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional problems. The authors evaluate a broad range of BO and BBO methods on BoLT and report that selected BO methods consistently outperform others across tasks, while highlighting gaps in existing methods.

Significance. If the surrogates accurately reproduce real LLM optimization landscapes, noise characteristics, and relative method orderings, BoLT would be a substantial contribution: it lowers the barrier for BBO researchers to test methods on practically relevant, expensive problems rather than synthetic functions. The public release of code and reproducible surrogates supports this utility. The work directly addresses the mismatch between current BBO benchmarks and modern LLM-scale tasks.

major comments (2)
  1. [§3] §3 (Surrogate Model Construction): The manuscript states that surrogates are fitted to results from thousands of real LLM experiments and are intended to reproduce optimization landscapes, heteroscedastic noise, and multi-fidelity behavior, but reports no quantitative fidelity metrics (e.g., held-out predictive MSE, Spearman rank correlation of method performances, or fidelity across fidelity levels). This validation is load-bearing for the central claim that rankings and behaviors observed on BoLT will transfer to actual expensive LLM tasks.
  2. [§5] §5 (Benchmarking Experiments): The claim that 'selected BO methods consistently outperform others across tasks' is presented without statistical significance testing or confidence intervals on performance differences. Given the heteroscedastic noise explicitly modeled in the surrogates, this weakens the strength of the comparative conclusions.
minor comments (2)
  1. [Abstract and §2] The abstract and §2 could more explicitly state the regression or interpolation technique used to build the surrogates (e.g., Gaussian process, random forest, or neural network) rather than referring only to 'lightweight surrogate models'.
  2. [Table 2] Table 2 (task characteristics) would benefit from an additional column reporting the number of real experiments used to fit each surrogate, to allow readers to assess data density per task.
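As a rough illustration of the fidelity metrics named in major comment 1, the sketch below computes a held-out mean squared error and a Spearman rank correlation using only the standard library. The score lists are made-up placeholders, not results from the paper, and the helper names are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` would likely be used instead.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between real and surrogate-predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


# Placeholder final scores of five methods on real runs vs. on a surrogate.
real_scores = [0.42, 0.35, 0.51, 0.30, 0.47]
surrogate_scores = [0.40, 0.37, 0.49, 0.33, 0.44]

held_out_mse = mse(real_scores, surrogate_scores)
rank_agreement = spearman(real_scores, surrogate_scores)
```

A rank correlation near 1 would indicate that the surrogate preserves the ordering of methods even when its absolute predictions are slightly off, which is the property the transfer claim depends on.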

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important aspects of surrogate validation and statistical rigor that we have addressed through revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Surrogate Model Construction): The manuscript states that surrogates are fitted to results from thousands of real LLM experiments and are intended to reproduce optimization landscapes, heteroscedastic noise, and multi-fidelity behavior, but reports no quantitative fidelity metrics (e.g., held-out predictive MSE, Spearman rank correlation of method performances, or fidelity across fidelity levels). This validation is load-bearing for the central claim that rankings and behaviors observed on BoLT will transfer to actual expensive LLM tasks.

    Authors: We agree that quantitative fidelity metrics are essential to support the claim that BoLT surrogates accurately reproduce real LLM optimization landscapes, noise characteristics, and relative method orderings. In the revised manuscript, we have added a new validation subsection to §3. This includes held-out predictive MSE and R² scores computed on a disjoint set of real LLM experimental results. We also report Spearman rank correlations between method performance rankings obtained on the surrogates and those from a small collection of held-out real LLM runs. Finally, we provide per-fidelity-level error metrics to verify multi-fidelity behavior. These additions directly bolster the transferability argument. revision: yes

  2. Referee: [§5] §5 (Benchmarking Experiments): The claim that 'selected BO methods consistently outperform others across tasks' is presented without statistical significance testing or confidence intervals on performance differences. Given the heteroscedastic noise explicitly modeled in the surrogates, this weakens the strength of the comparative conclusions.

    Authors: We acknowledge that the lack of statistical significance testing and confidence intervals weakens the comparative claims, especially in the presence of heteroscedastic noise. In the revision, we have updated §5 and the appendix to include bootstrap confidence intervals around all performance metrics. We have also added results from Wilcoxon signed-rank tests (with p-values) to assess whether observed differences between methods are statistically significant. These changes provide the requested rigor while preserving the original experimental setup. revision: yes
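A minimal sketch of the kind of interval the second response describes: a percentile bootstrap confidence interval for the mean paired difference between two methods across seeds. The regret values are made-up placeholders, the function name and defaults are hypothetical, and this is a plain percentile bootstrap chosen for illustration, not necessarily the procedure the authors use.

```python
import random


def bootstrap_ci_mean_diff(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of paired differences a[i] - b[i]."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = max(lo_idx, int((1.0 + level) / 2.0 * n_boot) - 1)
    return means[lo_idx], means[hi_idx]


# Placeholder final regrets of two methods over the same eight seeds.
regret_a = [0.31, 0.28, 0.35, 0.30, 0.27, 0.33, 0.29, 0.32]
regret_b = [0.25, 0.27, 0.30, 0.24, 0.26, 0.28, 0.25, 0.27]

ci_low, ci_high = bootstrap_ci_mean_diff(regret_a, regret_b)
```

If the interval excludes zero, the paired difference is unlikely to be explained by seed noise alone. For the signed-rank test mentioned in the response, `scipy.stats.wilcoxon(regret_a, regret_b)` would be one option when SciPy is available.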

Circularity Check

0 steps flagged

No significant circularity; surrogates and evaluations remain independent

Full rationale

The paper constructs lightweight surrogate models by fitting to results from thousands of real LLM experiments (external data) and then evaluates BO/BBO methods on those surrogates. No step reduces the reported method outperformance or benchmark utility to the authors' fitting procedure by construction, self-definition, or self-citation chain. The performance ordering is obtained by running the methods on the surrogates rather than being statistically forced by the surrogate parameters themselves, and the surrogates are presented as reproducible approximations grounded outside the present evaluation loop. This is a standard benchmark construction with no load-bearing circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The benchmark construction implicitly relies on the assumption that the chosen surrogate fitting procedure preserves the essential properties of the original expensive black-box functions; no explicit free parameters or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5824 in / 1249 out tokens · 21936 ms · 2026-05-19T20:44:11.678175+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 7 internal anchors

  1. [1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623–2631, 2019.
  2. [2] A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.
  3. [3] S. Ament, S. Daulton, D. Eriksson, M. Balandat, and E. Bakshy. Unexpected improvements to expected improvement for Bayesian optimization. Advances in Neural Information Processing Systems, 36:20577–20612, 2023.
  4. [4] S. P. Arango, H. S. Jomaa, M. Wistuba, and J. Grabocka. Hpo-b: A large-scale reproducible benchmark for black-box hpo based on openml. In Thirty-fifth Conference on Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  5. [5] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  6. [6] M. Balandat, B. Karrer, D. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A framework for efficient monte-carlo bayesian optimization. Advances in Neural Information Processing Systems, 33:21524–21538, 2020.
  7. [7] S. Belakaria, A. Deshwal, and J. R. Doppa. Max-value entropy search for multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019.
  8. [8] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24, 2011.
  9. [9] D. Bingham and S. Surjanovic. Virtual library of simulation experiments: Test functions and datasets, URL https://www.sfu.ca/~ssurjano/optimization.html
  10. [10] L. Bliek, A. Guijt, R. Karlsson, S. Verwer, and M. De Weerdt. Benchmarking surrogate-based optimisation algorithms on expensive black-box functions. Applied Soft Computing, 147:110744, 2023.
  11. [11] E. Bradford, A. M. Schweidtmann, and A. Lapkin. Efficient multiobjective optimization employing Gaussian processes, spectral sampling and a genetic algorithm. Journal of global optimization, 71(2):407–438, 2018.
  12. [12] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24, 2011.
  13. [13] L. Chen, J. Chen, T. Goldstein, H. Huang, and T. Zhou. Instructzero: Efficient instruction optimization for black-box large language models. In International Conference on Machine Learning, 2024.
  14. [14] Z. Chen, G. K. R. Lau, C.-S. Foo, and B. K. H. Low. DUET: Optimizing training data mixtures via feedback from unseen evaluation tasks. arXiv:2502.00270, 2025.
  15. [15] S. Daulton, M. Balandat, and E. Bakshy. Parallel bayesian optimization of multiple noisy objectives with expected hypervolume improvement. Advances in Neural Information Processing Systems, 34:2187–2200, 2021.
  16. [16] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2):182–197, 2002.
  17. [17] K. Dreczkowski, A. Grosnit, and H. Bou Ammar. Framework and benchmarks for combinatorial and mixed-variable Bayesian optimization. Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 36:69464–69489, 2023.
  18. [18] K. Eggensperger, P. Müller, N. Mallik, M. Feurer, R. Sass, A. Klein, N. Awad, M. Lindauer, and F. Hutter. Hpobench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  19. [19] D. Eriksson and M. Jankowiak. High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Uncertainty in artificial intelligence, pages 493–503. PMLR, 2021.
  20. [20] D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek. Scalable global optimization via local Bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019.
  21. [21] S. Falkner, A. Klein, and F. Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pages 1437–1446. PMLR, 2018.
  22. [22] C. Fernando, D. S. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In International Conference on Machine Learning, pages 13481–13544. PMLR, 2024.
  23. [23] P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
  24. [24] P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  25. [25] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness, 07 2024. URL https://zenodo.org/records/12608602
  26. [26] R. Garnett. Bayesian optimization. Cambridge University Press, 2023.
  27. [27] N. Hansen. The CMA evolution strategy: a comparing review. Towards a new evolutionary computation: Advances in the estimation of distribution algorithms, pages 75–102, 2006.
  28. [28] N. Hansen, S. Finck, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. PhD thesis, INRIA, 2009.
  29. [29] N. Hansen, A. Auger, R. Ros, O. Mersmann, T. Tušar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software, 36(1):114–144, 2021.
  30. [30] F. Häse, M. Aldeghi, R. J. Hickman, L. M. Roch, M. Christensen, E. Liles, J. E. Hein, and A. Aspuru-Guzik. Olympus: a benchmarking framework for noisy optimization and experiment planning. Machine Learning: Science and Technology, 2(3):035021, 2021.
  31. [31] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  32. [32] D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah, and R. Adams. Predictive entropy search for multi-objective Bayesian optimization. In International Conference on Machine Learning, pages 1492–1501. PMLR, 2016.
  33. [33] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. Advances in Neural Information Processing Systems, 27, 2014.
  34. [34] W. Hu, Y. Shu, Z. Yu, Z. Wu, X. Lin, Z. Dai, S.-K. Ng, and B. K. H. Low. Localized zeroth-order prompt optimization. Advances in Neural Information Processing Systems, 37:86309–86345, 2024.
  35. [35] C. Hvarfner, F. Hutter, and L. Nardi. Joint entropy search for maximally-informed Bayesian optimization. Advances in Neural Information Processing Systems, 35:11494–11506, 2022.
  36. [36] C. Hvarfner, E. O. Hellsten, and L. Nardi. Vanilla Bayesian optimization performs great in high dimensions. In International Conference on Machine Learning, pages 20793–20817. PMLR, 2024.
  37. [37] C. Jang, H. Lee, J. Kim, and J. Lee. Model fusion through Bayesian optimization in language model fine-tuning. Advances in Neural Information Processing Systems, 37:29878–29912, 2024.
  38. [38] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
  39. [39] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely heteroscedastic gaussian process regression. In International Conference on Machine learning, pages 393–400, 2007.
  40. [40] J. Knowles. Parego: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE transactions on evolutionary computation, 10(1):50–66, 2006.
  41. [41] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
  42. [42] A. Kumar, G. Wu, M. Z. Ali, R. Mallipeddi, P. N. Suganthan, and S. Das. A test-suite of non-convex constrained optimization problems from the real-world and some baseline results. Swarm and evolutionary computation, 56:100693, 2020.
  43. [43] A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022.
  44. [44] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, X. Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. In Second Conference on Language Modeling, 2025.
  45. [45] M. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regression. In International Conference on Machine Learning, pages 841–848, 2011.
  46. [46] H. Li, W. Zheng, Q. Wang, H. Zhang, Z. Wang, S. Xuyang, Y. Fan, Z. Ding, H. Wang, N. Ding, S. Zhou, X. Zhang, and D. Jiang. Predictable scale: Part i, step law – optimal hyperparameter scaling law in large language model pretraining. arXiv:2503.04715, 2025.
  47. [47] L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, J. Ben-Tzur, M. Hardt, B. Recht, and A. Talwalkar. A system for massively parallel hyperparameter tuning. Proceedings of machine learning and systems, 2:230–246, 2020.
  48. [48] Y. Li, Z. Liu, and E. Xing. Data mixing optimization for supervised fine-tuning of large language models. In International Conference on Machine Learning, pages 35419–35437. PMLR, 2025.
  49. [49] Q. Liang, A. E. Gongora, Z. Ren, A. Tiihonen, Z. Liu, S. Sun, J. R. Deneault, D. Bash, F. Mekki-Berrada, S. A. Khan, et al. Benchmarking the performance of Bayesian optimization across multiple experimental materials science domains. npj Computational Materials, 7(1):188, 2021.
  50. [50] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
  51. [51] X. Lin, Z. Wu, Z. Dai, W. Hu, Y. Shu, S.-K. Ng, P. Jaillet, and B. K. H. Low. Use your INSTINCT: Instruction optimization for llms using neural bandits coupled with transformers. In International Conference on Machine Learning, pages 30317–30345. PMLR, 2024.
  52. [52] M. Lindauer, K. Eggensperger, M. Feurer, A. Biedenkapp, D. Deng, C. Benjamins, T. Ruhkopf, R. Sass, and F. Hutter. Smac3: A versatile Bayesian optimization package for hyperparameter optimization. Journal of Machine Learning Research, 23(54):1–9, 2022.
  53. [53] J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023.
  54. [54] Q. Liu, X. Zheng, N. Muennighoff, G. Zeng, L. Dou, T. Pang, J. Jiang, and M. Lin. Regmix: Data mixture as regression for language model pre-training. In The Thirteenth International Conference on Learning Representations, 2025.
  55. [55] H. B. Moss, D. S. Leslie, J. Gonzalez, and P. Rayson. Gibbon: General-purpose information-based Bayesian optimisation. Journal of Machine Learning Research, 22(235):1–49, 2021.
  56. [56] M. Nomura, S. Watanabe, Y. Akimoto, Y. Ozaki, and M. Onishi. Warm starting CMA-ES for hyperparameter optimization. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 9188–9196, 2021.
  57. [57] T. Olmo, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025.
  58. [58] P. S. Palar, R. P. Liem, L. R. Zuhal, and K. Shimoyama. On the use of surrogate models in engineering design optimization and exploration: The key issues. In Proceedings of the genetic and evolutionary computation conference companion, pages 1592–1602, 2019.
  59. [59] L. Papenmeier, L. Nardi, and M. Poloczek. Increasing the scope as you learn: Adaptive Bayesian optimization in nested subspaces. Advances in Neural Information Processing Systems, 35:11586–11601, 2022.
  60. [60] L. Papenmeier, M. Poloczek, and L. Nardi. Understanding high-dimensional Bayesian optimization. In International Conference on Machine Learning, pages 47902–47923. PMLR, 2025.
  61. [61] F. Pfisterer, L. Schneider, J. Moosbauer, M. Binder, and B. Bischl. Yahpo gym-an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. In International Conference on Automated Machine Learning, pages 3–1. PMLR, 2022.
  62. [62] H. Schechter Vera, S. Dua, B. Zhang, D. Salz, R. Mullins, S. Raghuram Panyam, S. Smoot, I. Naim, J. Zou, F. Chen, D. Cer, A. Lisak, M. Choi, L. Gonzalez, O. Sanseviero, G. Cameron, I. Ballantyne, K. Black, K. Chen, W. Wang, Z. Li, G. Martins, J. Lee, M. Sherwood, J. Ji, R. Wu, J. Zheng, J. Singh, A. Sharma, D. Sreepat, A. Jain, A. Elarabawy, A. Co, A. Dou... EmbeddingGemma: Powerful and Lightweight Text Representations.
  63. [63] B. Seong-Eun, L. Jung-Mok, K. Sung-Bin, and T.-H. Oh. Efficient hyper-parameter search for LoRA via language-aided Bayesian optimization. arXiv preprint arXiv:2602.11171, 2026.
  64. [64] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  65. [65] C. Shi, K. Yang, Z. Chen, J. Li, J. Yang, and C. Shen. Efficient prompt optimization through the lens of best arm identification. Advances in Neural Information Processing Systems, 37:99646–99685, 2024.
  66. [66] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.
  67. [67] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010.
  68. [68] S. Takeno, H. Fukuoka, Y. Tsukada, T. Koyama, M. Shiga, I. Takeuchi, and M. Karasuyama. Multi-fidelity Bayesian optimization with max-value entropy search and its parallelization. In International Conference on Machine Learning, pages 9334–9345. PMLR, 2020.
  69. [69] S. Toshniwal, W. Du, I. Moshkov, B. Kisacanin, A. Ayrapetyan, and I. Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. In The Thirteenth International Conference on Learning Representations, 2023.
  70. [70] C. Tribes, S. Benarroch-Lelong, P. Lu, and I. Kobyzev. Hyperparameter optimization for large language model instruction-tuning. In AAAI Conference on Artificial Intelligence, 2024.
  71. [71] B. Tu, A. Gandy, N. Kantas, and B. Shafei. Joint entropy search for multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 35:9922–9938, 2022.
  72. [72] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 2609–2634, 2023.
  73. [73] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning, pages 3627–3635. PMLR, 2017.
  74. [74] Z. Wu, X. Lin, Z. Dai, W. Hu, Y. Shu, S.-K. Ng, P. Jaillet, and B. K. H. Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars. Advances in Neural Information Processing Systems, 37:122706–122740, 2024.
  75. [75] S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. S. Liang, Q. V. Le, T. Ma, and A. W. Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36:69798–69818, 2023.
  76. [76] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
  77. [77] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023.
  78. [78] T. Yen, A. W. T. Siah, H. Chen, C. D. Guetta, T. Peng, and H. Namkoong. Data mixture optimization: A multi-fidelity multi-scale bayesian framework. In Advances in Neural Information Processing Systems, 2025.
  79. [79] Y. Zhang, A. Mohamed, H. Abdine, G. Shang, and M. Vazirgiannis. Beyond random sampling: Efficient language model pretraining via curriculum learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5776–5794, 2026.

Showing first 79 references.