pith. sign in

arxiv: 2605.23417 · v1 · pith:PEN2QC3Jnew · submitted 2026-05-22 · 💻 cs.LG

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

Pith reviewed 2026-05-25 04:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords black-box optimizationfoundation modelspre-training datasetoptimization trajectoriesBBO-Pilescaling behaviorimitation learning
0
0 comments X

The pith

Foundation models trained on a new open dataset of 500K optimization trajectories can imitate black-box optimization methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BBO-Pile as the first public dataset with more than 500,000 optimization trajectories collected from 3,095 different black-box problems using multiple optimizers. It then trains foundation models of sizes from 2M to 80M parameters on subsets of this data ranging from 200M to 2B tokens and analyzes their scaling with compute. Because most existing black-box optimization techniques need extensive tuning and do not transfer well across domains, a model that learns general principles from large trajectory data could provide a more adaptable solution. The results indicate that this pre-training approach successfully reproduces optimization behavior, which supports further development of such models.

Core claim

We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories evaluated across 3095 different black-boxes for different optimizers, which represents by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.

What carries the argument

BBO-Pile, the dataset of optimization trajectories that serves as training data for scaling foundation models to imitate optimizers.

If this is right

  • Large-scale pre-training on optimization trajectories produces models that imitate black-box methods.
  • Scaling behavior can be studied as model size and token count increase up to 80M parameters and 2B tokens.
  • The public release of the dataset enables reproducible research on foundation models for optimization.
  • Models trained this way have the potential to generalize across different problem classes without manual hyperparameter tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These models could be applied to new optimization domains where traditional methods require significant tuning.
  • Future work might combine this dataset with synthetic data to further improve generalization.
  • Performance on real-world black-box problems could be tested to validate transfer from the training distribution.

Load-bearing premise

The 500K trajectories from the 3095 black-boxes and chosen optimizers provide a representative sample allowing models to learn generalizable optimization principles that transfer to unseen problems.

What would settle it

Train the models on BBO-Pile and test them on a collection of black-box problems completely outside the dataset; if the models do not perform competitively with tuned traditional optimizers, the viability of the pre-training approach would be questioned.

Figures

Figures reproduced from arXiv: 2605.23417 by Aaron Klein, David Salinas, Herilalaina Rakotoarison, Luca Thale-Bombien.

Figure 1
Figure 1. Figure 1: Composition of our open-source dataset for pre-training foundation models for black [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the encoding of a trial for a search space [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the encoding and decoding of hyperparameter and objective values. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Left: Validation loss curves of each parameter count N / token budget D pair across FLOPS C ≈ 6 × N × D. We select the model with the best learning rate and batch size according to our grid search. Color indicates parameter count and red dots mark Pareto optimality after initial convergence phase. Right: Shows our scaling-law fit on the Pareto optimal point from the left [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of the original CQR / RS method with CQR / RS simulated by our models at [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of the original CQR / RS method with CQR / RS simulated by our models at [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Our 80M model (dashed lines) vs. original optimizers (solid lines) on tasks with search [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of optimizers of a completely unseen benchmark family (DeepAR). [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of estimated per-hyperparameter densities between our model and each [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ranks (left) and normalized regret (right) of all optimization methods averaged across all [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Hyperparameter grid of each model and token budget. Color indicates the validation loss [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Runtime Comparison on the FC-Net Protein Task. We report the wall-clock time (log-seconds) across 100 trials for our proposed methods, including native LitGPT and a vLLM￾accelerated Hugging Face implementation, as well as Random Search and CQR baselines. Results are aggregated over 30 independent seeds using consistent model checkpoints to ensure comparability. trained using the same random seed to contro… view at source ↗
Figure 13
Figure 13. Figure 13: Generalization to unseen tasks of known search spaces. Each panel shows the objective [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Generalization to unseen search spaces. Results span three HPO-B regression tasks, [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Generalization to held-out test tasks from the DeepAR benchmark. Each panel plots the [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Comparison of the sampling distributions on FC-Net search space. [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Comparison of the sampling distributions on LC-Bench (Fashion-MNIST) search space. [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Comparison of the sampling distributions on NAS-Bench-201 (ImageNet) search space. [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Comparison of the sampling distributions on TabRepo (CatBoost) search space. [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗
read the original abstract

Most black-box optimization methods require extensive hyperparameter tuning, often limiting their ability to generalize across different optimization domains. Foundation models for black-box optimization that learn optimization principles from a large collection of optimization trajectories offer a promising alternative, with the potential to outperform manually designed methods across diverse problem classes. However, prior work has either relied on non-public datasets or on purely synthetic data, limiting reproducibility and generalization to real-world problems. As a result, progress in this area has been constrained by the lack of large-scale, real-world, publicly available pre-training data. We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories evaluated across 3095 different black-boxes for different optimizers, which represents by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BBO-Pile, an open-source dataset containing over 500K optimization trajectories collected from 3095 black-box functions using multiple optimizers. It trains a family of foundation models (2M–80M parameters, 200M–2B tokens) on this data, analyzes scaling behavior with respect to compute, and concludes that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods.

Significance. If the dataset is shown to be sufficiently diverse and the trained models demonstrate transfer to unseen problems and optimizers, the work would provide a valuable public resource that could accelerate research on learned optimizers, similar to the role of large pre-training corpora in other domains. The open release directly addresses reproducibility barriers noted in prior work.

major comments (2)
  1. [§3 and §4] §3 (Dataset Construction) and §4 (Data Collection): No quantitative diversity metrics are reported for the 3095 black-boxes (e.g., histograms or coverage statistics over dimensionality, modality, noise level, or degree of multimodality). Without these, it is impossible to verify that the 500K trajectories support extraction of generalizable optimization principles rather than dataset-specific heuristics, which is load-bearing for the central claim of viability for imitating BBO methods.
  2. [§5] §5 (Experiments and Scaling): The evaluation does not include explicit held-out splits on unseen problem classes or optimizers with reported transfer metrics (e.g., regret or success rate on out-of-distribution instances). Scaling curves alone cannot distinguish memorization from genuine imitation of general BBO strategies.
minor comments (2)
  1. [Abstract] The abstract states trajectories were collected 'for different optimizers' but does not list the specific optimizers or their hyperparameter settings used in data generation.
  2. [Figures/Tables in §5] Table or figure captions for scaling results should explicitly state the number of held-out black-boxes and whether they were drawn from the same distribution as the training set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments on dataset diversity and held-out evaluation are well-taken and will be addressed through additional analysis in the revision.

read point-by-point responses
  1. Referee: [§3 and §4] §3 (Dataset Construction) and §4 (Data Collection): No quantitative diversity metrics are reported for the 3095 black-boxes (e.g., histograms or coverage statistics over dimensionality, modality, noise level, or degree of multimodality). Without these, it is impossible to verify that the 500K trajectories support extraction of generalizable optimization principles rather than dataset-specific heuristics, which is load-bearing for the central claim of viability for imitating BBO methods.

    Authors: We agree that quantitative diversity metrics would strengthen the manuscript and help substantiate the generalizability of the trajectories. In the revised version we will add histograms and coverage statistics over dimensionality, modality, noise level, and degree of multimodality for the 3095 black-box functions. These additions will directly support the claim that the dataset enables extraction of general optimization principles. revision: yes

  2. Referee: [§5] §5 (Experiments and Scaling): The evaluation does not include explicit held-out splits on unseen problem classes or optimizers with reported transfer metrics (e.g., regret or success rate on out-of-distribution instances). Scaling curves alone cannot distinguish memorization from genuine imitation of general BBO strategies.

    Authors: We acknowledge that explicit held-out splits and transfer metrics are necessary to distinguish memorization from genuine imitation. While the present experiments focus on scaling behavior, the revised manuscript will include held-out splits on unseen problem classes and optimizers together with reported transfer metrics (regret and success rate) on out-of-distribution instances. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical dataset release and scaling study is self-contained

full rationale

The paper introduces BBO-Pile, a new public dataset of 500K trajectories across 3095 black-boxes, then trains models at multiple scales and reports scaling behavior. No equations, fitted parameters, or derivations appear in the provided text. The central claim—that large-scale pre-training imitates black-box optimization—is supported by new empirical data collection and training runs rather than any reduction to prior fitted values, self-definitions, or self-citation chains. The work contains no load-bearing mathematical steps that could be circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical data-release and scaling study with no mathematical derivations, fitted constants, or postulated entities described in the abstract.

pith-pipeline@v0.9.0 · 5746 in / 1299 out tokens · 40082 ms · 2026-05-25T04:52:20.703124+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 3 internal anchors

  1. [1]

    Aglietti, I

    V . Aglietti, I. Ktena, J. Schrouff, E. Sgouritsa, F. J. R. Ruiz, A. Malek, A. Bellot, and S. Chi- appa. FunBO: Discovering acquisition functions for Bayesian optimization with funsearch. arXiv:2406.04824 [cs.LG], 2025

  2. [2]

    Andrychowicz, M

    M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. InProceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NeurIPS’16), 2016

  3. [3]

    A. F. Ansari, L. Stella, A. C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, H. Wang, M. W. Mahoney, K. Torkkola, A. G. Wilson, M. Bohlke-Schneider, and B. Wang. Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

  4. [4]

    S. P. Arango, H. Jomaa, M. Wistuba, and J. Grabocka. HPO-B: A large-scale reproducible benchmark for black-box hpo based on openml. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS’21), 2021

  5. [5]

    Bergstra, R

    J. Bergstra, R. Bardenet, Y . Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. InProceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NeurIPS’11), 2011

  6. [6]

    Bergstra and Y

    J. Bergstra and Y . Bengio. Random search for hyper-parameter optimization.Journal of Machine Learning Research (JMLR-12), 2012

  7. [7]

    Binder, F

    M. Binder, F. Pfisterer, and B. Bischl. Collecting empirical data about hyperparameters for data driven automl.Democratizing Machine Learning Contributions in AutoML and Fairness, 2020

  8. [8]

    Calandra, N

    R. Calandra, N. Gopalan, A. Seyfarth, J. Peters, and M. Deisenroth. Bayesian gait optimization for bipedal locomotion. InProceedings of the Eighth International Conference on Learning and Intelligent Optimization (LION’14), 2014

  9. [9]

    Y . Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. InProceedings of the 34th International Conference on Machine Learning (ICML’17), 2017. 10

  10. [10]

    Y . Chen, X. Song, C. Lee, Z. Wang, Q. Zhang, D. Dohan, K. Kawakami, G. Kochanski, A. Doucet, M. A. Ranzato, S. Perel, and N. de Freitas. Towards learning universal hyperpa- rameter optimizers with transformers. InProceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS’22), 2022

  11. [11]

    Cherti, R

    M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for contrastive language-image learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023

  12. [12]

    A. I. Cowen-Rivers, W. Lyu, R. Tutunov, Z. Wang, A. Grosnit, R. R. Griffiths, H. Jianye, J. Wang, and H. B. Ammar. An empirical study of assumptions in Bayesian optimisation.arXiv preprint arXiv:2012.03826, 445, 2020

  13. [13]

    Dong and Y

    X. Dong and Y . Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. InInternational Conference on Learning Representations (ICLR’20), 2020

  14. [14]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR’21), 2021

  15. [15]

    Eggensperger, F

    K. Eggensperger, F. Hutter, H. Hoos, and K. Leyton-Brown. Efficient benchmarking of hyperpa- rameter optimizers via surrogates. InProceedings of the 29th National Conference on Artificial Intelligence (AAAI’15), 2015

  16. [16]

    Eggensperger, P

    K. Eggensperger, P. Müller, N. Mallik, M. Feurer, R. Sass, A. Klein, N. Awad, M. Lindauer, and F. Hutter. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS’21), 2021

  17. [17]

    Garnett.Bayesian Optimization

    R. Garnett.Bayesian Optimization. Cambridge University Press, 2023

  18. [18]

    Gómez-Bombarelli, J

    R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Auto- matic chemical design using a data-driven continuous representation of molecules.ACS Central Science, 2018

  19. [19]

    N. Hansen. The CMA evolution strategy: a comparing review. InTowards a new evolutionary computation. Advances on estimation of distribution algorithms. Springer Berlin Heidelberg, 2006

  20. [20]

    Hansen, A

    N. Hansen, A. Auger, O. Mersmann, T. Tušar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting.arXiv:1603.08785 [cs.AI], 2016

  21. [21]

    Hansen, S

    N. Hansen, S. Finck, R. Ros, and A. Auger.Real-parameter black-box optimization benchmark- ing 2009: Noiseless functions definitions. PhD thesis, INRIA, 2009

  22. [22]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models.arXiv:2203.15556 [cs.CL], 2022

  23. [23]

    Hollmann, S

    N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InInternational Conference on Learning Representations (ICLR’23), 20223

  24. [24]

    Hollmann, S

    N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeis- ter, and F. Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 2025

  25. [25]

    Hutter, M

    F. Hutter, M. López-Ibáñez, C. Fawcett, M. Lindauer, H. Hoos, K. Leyton-Brown, and T. Stützle. Aclib: a benchmark library for algorithm configuration. InProceedings of the Learning and Intelligent OptimizatioN Conference (LION 8), 2014. 11

  26. [26]

    D. R. Jones. A taxonomy of global optimization methods based on response surfaces.Journal of Global Optimization, 2001

  27. [27]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv:2001.08361 [cs.LG], 2020

  28. [28]

    Klein, Z

    A. Klein, Z. Dai, F. Hutter, N. Lawrence, and J. Gonzalez. Meta-surrogate benchmarking for hyperparameter optimization. InProceedings of the 32th International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019

  29. [29]

    Tabular Benchmarks for Joint Architecture and Hyperparameter Optimization

    A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimiza- tion.arXiv:1905.04970 [cs.LG], 2019

  30. [30]

    Krishnamurthy, K

    A. Krishnamurthy, K. Harris, D. J. Foster, C. Zhang, and A. Slivkins. Can large language models explore in-context? InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS’24), 2024

  31. [31]

    W. Li, N. van Stein, T. Back, and E. Raponi. LLaMEA-BO: a large language model evolutionary algorithm for automatically generating bayesian optimization algorithms.arXiv:2505.21034 [cs.LG], 2025

  32. [32]

    Litgpt.https://github.com/Lightning-AI/litgpt, 2023

    Lightning-AI. Litgpt.https://github.com/Lightning-AI/litgpt, 2023

  33. [33]

    T. Liu, N. Astorga, N. Seedat, and M. van der Schaar. Large language models to enhance Bayesian optimization. InThe Twelfth International Conference on Learning Representations (ICLR’24), 2024

  34. [34]

    Loshchilov and F

    I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR’17), 2017

  35. [35]

    Müller, M

    S. Müller, M. Feurer, N. Hollmann, and F. Hutter. Pfns4bo: In-context learning for bayesian optimization. InProceedings of the 40th International Conference on Machine Learning (ICML’23), 2023

  36. [36]

    Perrone, H

    V . Perrone, H. Shen, M. Seeger, C. Archambeau, and R. Jenatton. Learning search spaces for bayesian optimization: Another view of hyperparameter transfer learning. InProceedings of the 32th International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019

  37. [37]

    Pfisterer, L

    F. Pfisterer, L. Schneider, J. Moosbauer, M. Binder, and B. Bischl. Yahpo gym - an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. InFirst Conference on Automated Machine Learning (Main Track), 2022

  38. [38]

    E. Real, A. Aggarwal, Y . Huang, and Q. V . Le. Regularized Evolution for Image Classifier Architecture Search. InProceedings of the Conference on Artificial Intelligence (AAAI’19), 2019

  39. [39]

    Salinas and N

    D. Salinas and N. Erickson. Tabrepo: A large scale repository of tabular model evaluations and its automl applications.arXiv preprint arXiv:2311.02971, 2023

  40. [40]

    Salinas, V

    D. Salinas, V . Flunkert, J. Gasthaus, and T. Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks.International journal of forecasting, 2020

  41. [41]

    Salinas, J

    D. Salinas, J. Golebiowsk, A. Klein, M. Seeger, and C. Archambeau. Optimizing hyperparame- ters with conformal quantile regression. InProceedings of the 40th International Conference on Machine Learning (ICML’23), 2023

  42. [42]

    Salinas, M

    D. Salinas, M. Seeger, A. Klein, V . Perrone, M. Wistuba, and C. Archambeau. Syne tune: A library for large scale hyperparameter tuning and reproducible research. InFirst Conference on Automated Machine Learning (Main Track), 2022

  43. [43]

    Salinas, H

    D. Salinas, H. Shen, and V . Perrone. A quantile-based approach for hyperparameter transfer learning. InProceedings of the 37th International Conference on Machine Learning (ICML’20), 2020. 12

  44. [44]

    Schmidhuber

    J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook

  45. [45]

    Schwanke, L

    A. Schwanke, L. Ivanov, D. Salinas, F. Ferreira, A. Klein, F. Hutter, and A. Zela. Improving llm-based global optimization with search space partitioning. InInternational Conference on Learning Representations (ICLR’26), 2026

  46. [46]

    Sennrich, B

    R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), 2016

  47. [47]

    X. Song, Y . Tian, R. T. Lange, C. Lee, Y . Tang, and Y . Chen. Position: Leverage foundational models for black-box optimization. InProceedings of the Forty-first International Conference on Machine Learning (ICML’24), 2024

  48. [48]

    J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. InProceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NeurIPS’16), 2016

  49. [49]

    Storn and K

    R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces.Journal of Global Optimization, 1997

  50. [50]

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 568(C), Feb. 2024

  51. [51]

    Talbi.Metaheuristics: from design to implementation

    E.-G. Talbi.Metaheuristics: from design to implementation. John Wiley & Sons, 2009

  52. [52]

    L. C. Tiao, A. Klein, M. W. Seeger, E. V . Bonilla, C. Archambeau, and F. Ramos. Bore: Bayesian optimization by density-ratio estimation. InProceedings of the 38th International Conference on Machine Learning (ICML’21), 2021

  53. [53]

    Todorov, T

    E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’12), 2012

  54. [54]

    van Stein and T

    N. van Stein and T. Bäck. Llamea: Automatically generating metaheuristics with large language models. InProceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO’25), 2025

  55. [55]

    Veliˇckovi´c, A

    P. Veliˇckovi´c, A. Vitvitskyi, L. Markeeva, B. Ibarz, L. Buesing, M. Balog, and A. Novikov. Amplifying human performance in combinatorial competitive programming.arXiv:2411.19744 [cs.LG], 2024

  56. [56]

    Z. Wang, G. E. Dahl, K. Swersky, C. Lee, Z. Nado, J. Gilmer, J. Snoek, and Z. Ghahramani. Pre- trained Gaussian processes for Bayesian optimization.Journal of Machine Learning Research (JMLR’24), 2024

  57. [57]

    Wistuba, N

    M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Learning hyperparameter optimization initializations. InIEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015

  58. [58]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

  59. [59]

    Limitations

    L. Zimmer, M. Lindauer, and F. Hutter. Auto-pytorch tabular: Multi-fidelity metalearning for efficient and robust autodl.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 13 A Benchmark Families We consider the following benchmark families from the literature: • FC-Net[ 29]: Considers the optimization of the hyperparameters and archite...

  60. [60]

    Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...