An Open-Source Training Dataset for Foundation Models for Black-box Optimization

Aaron Klein; David Salinas; Herilalaina Rakotoarison; Luca Thale-Bombien

arxiv: 2605.23417 · v1 · pith:PEN2QC3Jnew · submitted 2026-05-22 · 💻 cs.LG

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

Aaron Klein , Herilalaina Rakotoarison , Luca Thale-Bombien , David Salinas This is my paper

Pith reviewed 2026-05-25 04:52 UTC · model grok-4.3

classification 💻 cs.LG

keywords black-box optimizationfoundation modelspre-training datasetoptimization trajectoriesBBO-Pilescaling behaviorimitation learning

0 comments

The pith

Foundation models trained on a new open dataset of 500K optimization trajectories can imitate black-box optimization methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BBO-Pile as the first public dataset with more than 500,000 optimization trajectories collected from 3,095 different black-box problems using multiple optimizers. It then trains foundation models of sizes from 2M to 80M parameters on subsets of this data ranging from 200M to 2B tokens and analyzes their scaling with compute. Because most existing black-box optimization techniques need extensive tuning and do not transfer well across domains, a model that learns general principles from large trajectory data could provide a more adaptable solution. The results indicate that this pre-training approach successfully reproduces optimization behavior, which supports further development of such models.

Core claim

We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories evaluated across 3095 different black-boxes for different optimizers, which represents by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.

What carries the argument

BBO-Pile, the dataset of optimization trajectories that serves as training data for scaling foundation models to imitate optimizers.

If this is right

Large-scale pre-training on optimization trajectories produces models that imitate black-box methods.
Scaling behavior can be studied as model size and token count increase up to 80M parameters and 2B tokens.
The public release of the dataset enables reproducible research on foundation models for optimization.
Models trained this way have the potential to generalize across different problem classes without manual hyperparameter tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These models could be applied to new optimization domains where traditional methods require significant tuning.
Future work might combine this dataset with synthetic data to further improve generalization.
Performance on real-world black-box problems could be tested to validate transfer from the training distribution.

Load-bearing premise

The 500K trajectories from the 3095 black-boxes and chosen optimizers provide a representative sample allowing models to learn generalizable optimization principles that transfer to unseen problems.

What would settle it

Train the models on BBO-Pile and test them on a collection of black-box problems completely outside the dataset; if the models do not perform competitively with tuned traditional optimizers, the viability of the pre-training approach would be questioned.

Figures

Figures reproduced from arXiv: 2605.23417 by Aaron Klein, David Salinas, Herilalaina Rakotoarison, Luca Thale-Bombien.

**Figure 2.** Figure 2: Illustration of the encoding of a trial for a search space [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the encoding and decoding of hyperparameter and objective values. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Left: Validation loss curves of each parameter count N / token budget D pair across FLOPS C ≈ 6 × N × D. We select the model with the best learning rate and batch size according to our grid search. Color indicates parameter count and red dots mark Pareto optimality after initial convergence phase. Right: Shows our scaling-law fit on the Pareto optimal point from the left [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗

**Figure 5.** Figure 5: Comparison of the original CQR / RS method with CQR / RS simulated by our models at [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of the original CQR / RS method with CQR / RS simulated by our models at [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Our 80M model (dashed lines) vs. original optimizers (solid lines) on tasks with search [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of optimizers of a completely unseen benchmark family (DeepAR). [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of estimated per-hyperparameter densities between our model and each [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Ranks (left) and normalized regret (right) of all optimization methods averaged across all [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Hyperparameter grid of each model and token budget. Color indicates the validation loss [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗

**Figure 12.** Figure 12: Runtime Comparison on the FC-Net Protein Task. We report the wall-clock time (log-seconds) across 100 trials for our proposed methods, including native LitGPT and a vLLMaccelerated Hugging Face implementation, as well as Random Search and CQR baselines. Results are aggregated over 30 independent seeds using consistent model checkpoints to ensure comparability. trained using the same random seed to contro… view at source ↗

**Figure 13.** Figure 13: Generalization to unseen tasks of known search spaces. Each panel shows the objective [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 14.** Figure 14: Generalization to unseen search spaces. Results span three HPO-B regression tasks, [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 15.** Figure 15: Generalization to held-out test tasks from the DeepAR benchmark. Each panel plots the [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗

**Figure 16.** Figure 16: Comparison of the sampling distributions on FC-Net search space. [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗

**Figure 17.** Figure 17: Comparison of the sampling distributions on LC-Bench (Fashion-MNIST) search space. [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗

**Figure 18.** Figure 18: Comparison of the sampling distributions on NAS-Bench-201 (ImageNet) search space. [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗

**Figure 19.** Figure 19: Comparison of the sampling distributions on TabRepo (CatBoost) search space. [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗

read the original abstract

Most black-box optimization methods require extensive hyperparameter tuning, often limiting their ability to generalize across different optimization domains. Foundation models for black-box optimization that learn optimization principles from a large collection of optimization trajectories offer a promising alternative, with the potential to outperform manually designed methods across diverse problem classes. However, prior work has either relied on non-public datasets or on purely synthetic data, limiting reproducibility and generalization to real-world problems. As a result, progress in this area has been constrained by the lack of large-scale, real-world, publicly available pre-training data. We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories evaluated across 3095 different black-boxes for different optimizers, which represents by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BBO-Pile, an open-source dataset containing over 500K optimization trajectories collected from 3095 black-box functions using multiple optimizers. It trains a family of foundation models (2M–80M parameters, 200M–2B tokens) on this data, analyzes scaling behavior with respect to compute, and concludes that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods.

Significance. If the dataset is shown to be sufficiently diverse and the trained models demonstrate transfer to unseen problems and optimizers, the work would provide a valuable public resource that could accelerate research on learned optimizers, similar to the role of large pre-training corpora in other domains. The open release directly addresses reproducibility barriers noted in prior work.

major comments (2)

[§3 and §4] §3 (Dataset Construction) and §4 (Data Collection): No quantitative diversity metrics are reported for the 3095 black-boxes (e.g., histograms or coverage statistics over dimensionality, modality, noise level, or degree of multimodality). Without these, it is impossible to verify that the 500K trajectories support extraction of generalizable optimization principles rather than dataset-specific heuristics, which is load-bearing for the central claim of viability for imitating BBO methods.
[§5] §5 (Experiments and Scaling): The evaluation does not include explicit held-out splits on unseen problem classes or optimizers with reported transfer metrics (e.g., regret or success rate on out-of-distribution instances). Scaling curves alone cannot distinguish memorization from genuine imitation of general BBO strategies.

minor comments (2)

[Abstract] The abstract states trajectories were collected 'for different optimizers' but does not list the specific optimizers or their hyperparameter settings used in data generation.
[Figures/Tables in §5] Table or figure captions for scaling results should explicitly state the number of held-out black-boxes and whether they were drawn from the same distribution as the training set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments on dataset diversity and held-out evaluation are well-taken and will be addressed through additional analysis in the revision.

read point-by-point responses

Referee: [§3 and §4] §3 (Dataset Construction) and §4 (Data Collection): No quantitative diversity metrics are reported for the 3095 black-boxes (e.g., histograms or coverage statistics over dimensionality, modality, noise level, or degree of multimodality). Without these, it is impossible to verify that the 500K trajectories support extraction of generalizable optimization principles rather than dataset-specific heuristics, which is load-bearing for the central claim of viability for imitating BBO methods.

Authors: We agree that quantitative diversity metrics would strengthen the manuscript and help substantiate the generalizability of the trajectories. In the revised version we will add histograms and coverage statistics over dimensionality, modality, noise level, and degree of multimodality for the 3095 black-box functions. These additions will directly support the claim that the dataset enables extraction of general optimization principles. revision: yes
Referee: [§5] §5 (Experiments and Scaling): The evaluation does not include explicit held-out splits on unseen problem classes or optimizers with reported transfer metrics (e.g., regret or success rate on out-of-distribution instances). Scaling curves alone cannot distinguish memorization from genuine imitation of general BBO strategies.

Authors: We acknowledge that explicit held-out splits and transfer metrics are necessary to distinguish memorization from genuine imitation. While the present experiments focus on scaling behavior, the revised manuscript will include held-out splits on unseen problem classes and optimizers together with reported transfer metrics (regret and success rate) on out-of-distribution instances. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical dataset release and scaling study is self-contained

full rationale

The paper introduces BBO-Pile, a new public dataset of 500K trajectories across 3095 black-boxes, then trains models at multiple scales and reports scaling behavior. No equations, fitted parameters, or derivations appear in the provided text. The central claim—that large-scale pre-training imitates black-box optimization—is supported by new empirical data collection and training runs rather than any reduction to prior fitted values, self-definitions, or self-citation chains. The work contains no load-bearing mathematical steps that could be circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical data-release and scaling study with no mathematical derivations, fitted constants, or postulated entities described in the abstract.

pith-pipeline@v0.9.0 · 5746 in / 1299 out tokens · 40082 ms · 2026-05-25T04:52:20.703124+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 3 internal anchors

[1]

Aglietti, I

V . Aglietti, I. Ktena, J. Schrouff, E. Sgouritsa, F. J. R. Ruiz, A. Malek, A. Bellot, and S. Chi- appa. FunBO: Discovering acquisition functions for Bayesian optimization with funsearch. arXiv:2406.04824 [cs.LG], 2025

work page arXiv 2025
[2]

Andrychowicz, M

M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. InProceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NeurIPS’16), 2016

work page 2016
[3]

A. F. Ansari, L. Stella, A. C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, H. Wang, M. W. Mahoney, K. Torkkola, A. G. Wilson, M. Bohlke-Schneider, and B. Wang. Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

work page 2024
[4]

S. P. Arango, H. Jomaa, M. Wistuba, and J. Grabocka. HPO-B: A large-scale reproducible benchmark for black-box hpo based on openml. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS’21), 2021

work page 2021
[5]

Bergstra, R

J. Bergstra, R. Bardenet, Y . Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. InProceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NeurIPS’11), 2011

work page 2011
[6]

Bergstra and Y

J. Bergstra and Y . Bengio. Random search for hyper-parameter optimization.Journal of Machine Learning Research (JMLR-12), 2012

work page 2012
[7]

Binder, F

M. Binder, F. Pfisterer, and B. Bischl. Collecting empirical data about hyperparameters for data driven automl.Democratizing Machine Learning Contributions in AutoML and Fairness, 2020

work page 2020
[8]

Calandra, N

R. Calandra, N. Gopalan, A. Seyfarth, J. Peters, and M. Deisenroth. Bayesian gait optimization for bipedal locomotion. InProceedings of the Eighth International Conference on Learning and Intelligent Optimization (LION’14), 2014

work page 2014
[9]

Y . Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. InProceedings of the 34th International Conference on Machine Learning (ICML’17), 2017. 10

work page 2017
[10]

Y . Chen, X. Song, C. Lee, Z. Wang, Q. Zhang, D. Dohan, K. Kawakami, G. Kochanski, A. Doucet, M. A. Ranzato, S. Perel, and N. de Freitas. Towards learning universal hyperpa- rameter optimizers with transformers. InProceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS’22), 2022

work page 2022
[11]

Cherti, R

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for contrastive language-image learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023

work page 2023
[12]

A. I. Cowen-Rivers, W. Lyu, R. Tutunov, Z. Wang, A. Grosnit, R. R. Griffiths, H. Jianye, J. Wang, and H. B. Ammar. An empirical study of assumptions in Bayesian optimisation.arXiv preprint arXiv:2012.03826, 445, 2020

work page arXiv 2012
[13]

Dong and Y

X. Dong and Y . Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. InInternational Conference on Learning Representations (ICLR’20), 2020

work page 2020
[14]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR’21), 2021

work page 2021
[15]

Eggensperger, F

K. Eggensperger, F. Hutter, H. Hoos, and K. Leyton-Brown. Efficient benchmarking of hyperpa- rameter optimizers via surrogates. InProceedings of the 29th National Conference on Artificial Intelligence (AAAI’15), 2015

work page 2015
[16]

Eggensperger, P

K. Eggensperger, P. Müller, N. Mallik, M. Feurer, R. Sass, A. Klein, N. Awad, M. Lindauer, and F. Hutter. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS’21), 2021

work page 2021
[17]

Garnett.Bayesian Optimization

R. Garnett.Bayesian Optimization. Cambridge University Press, 2023

work page 2023
[18]

Gómez-Bombarelli, J

R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Auto- matic chemical design using a data-driven continuous representation of molecules.ACS Central Science, 2018

work page 2018
[19]

N. Hansen. The CMA evolution strategy: a comparing review. InTowards a new evolutionary computation. Advances on estimation of distribution algorithms. Springer Berlin Heidelberg, 2006

work page 2006
[20]

Hansen, A

N. Hansen, A. Auger, O. Mersmann, T. Tušar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting.arXiv:1603.08785 [cs.AI], 2016

work page arXiv 2016
[21]

Hansen, S

N. Hansen, S. Finck, R. Ros, and A. Auger.Real-parameter black-box optimization benchmark- ing 2009: Noiseless functions definitions. PhD thesis, INRIA, 2009

work page 2009
[22]

Training Compute-Optimal Large Language Models

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models.arXiv:2203.15556 [cs.CL], 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Hollmann, S

N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InInternational Conference on Learning Representations (ICLR’23), 20223

work page
[24]

Hollmann, S

N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeis- ter, and F. Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 2025

work page 2025
[25]

Hutter, M

F. Hutter, M. López-Ibáñez, C. Fawcett, M. Lindauer, H. Hoos, K. Leyton-Brown, and T. Stützle. Aclib: a benchmark library for algorithm configuration. InProceedings of the Learning and Intelligent OptimizatioN Conference (LION 8), 2014. 11

work page 2014
[26]

D. R. Jones. A taxonomy of global optimization methods based on response surfaces.Journal of Global Optimization, 2001

work page 2001
[27]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv:2001.08361 [cs.LG], 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[28]

Klein, Z

A. Klein, Z. Dai, F. Hutter, N. Lawrence, and J. Gonzalez. Meta-surrogate benchmarking for hyperparameter optimization. InProceedings of the 32th International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019

work page 2019
[29]

Tabular Benchmarks for Joint Architecture and Hyperparameter Optimization

A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimiza- tion.arXiv:1905.04970 [cs.LG], 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[30]

Krishnamurthy, K

A. Krishnamurthy, K. Harris, D. J. Foster, C. Zhang, and A. Slivkins. Can large language models explore in-context? InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS’24), 2024

work page 2024
[31]

W. Li, N. van Stein, T. Back, and E. Raponi. LLaMEA-BO: a large language model evolutionary algorithm for automatically generating bayesian optimization algorithms.arXiv:2505.21034 [cs.LG], 2025

work page arXiv 2025
[32]

Litgpt.https://github.com/Lightning-AI/litgpt, 2023

Lightning-AI. Litgpt.https://github.com/Lightning-AI/litgpt, 2023

work page 2023
[33]

T. Liu, N. Astorga, N. Seedat, and M. van der Schaar. Large language models to enhance Bayesian optimization. InThe Twelfth International Conference on Learning Representations (ICLR’24), 2024

work page 2024
[34]

Loshchilov and F

I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR’17), 2017

work page 2017
[35]

Müller, M

S. Müller, M. Feurer, N. Hollmann, and F. Hutter. Pfns4bo: In-context learning for bayesian optimization. InProceedings of the 40th International Conference on Machine Learning (ICML’23), 2023

work page 2023
[36]

Perrone, H

V . Perrone, H. Shen, M. Seeger, C. Archambeau, and R. Jenatton. Learning search spaces for bayesian optimization: Another view of hyperparameter transfer learning. InProceedings of the 32th International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019

work page 2019
[37]

Pfisterer, L

F. Pfisterer, L. Schneider, J. Moosbauer, M. Binder, and B. Bischl. Yahpo gym - an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. InFirst Conference on Automated Machine Learning (Main Track), 2022

work page 2022
[38]

E. Real, A. Aggarwal, Y . Huang, and Q. V . Le. Regularized Evolution for Image Classifier Architecture Search. InProceedings of the Conference on Artificial Intelligence (AAAI’19), 2019

work page 2019
[39]

Salinas and N

D. Salinas and N. Erickson. Tabrepo: A large scale repository of tabular model evaluations and its automl applications.arXiv preprint arXiv:2311.02971, 2023

work page arXiv 2023
[40]

Salinas, V

D. Salinas, V . Flunkert, J. Gasthaus, and T. Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks.International journal of forecasting, 2020

work page 2020
[41]

Salinas, J

D. Salinas, J. Golebiowsk, A. Klein, M. Seeger, and C. Archambeau. Optimizing hyperparame- ters with conformal quantile regression. InProceedings of the 40th International Conference on Machine Learning (ICML’23), 2023

work page 2023
[42]

Salinas, M

D. Salinas, M. Seeger, A. Klein, V . Perrone, M. Wistuba, and C. Archambeau. Syne tune: A library for large scale hyperparameter tuning and reproducible research. InFirst Conference on Automated Machine Learning (Main Track), 2022

work page 2022
[43]

Salinas, H

D. Salinas, H. Shen, and V . Perrone. A quantile-based approach for hyperparameter transfer learning. InProceedings of the 37th International Conference on Machine Learning (ICML’20), 2020. 12

work page 2020
[44]

Schmidhuber

J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook

work page
[45]

Schwanke, L

A. Schwanke, L. Ivanov, D. Salinas, F. Ferreira, A. Klein, F. Hutter, and A. Zela. Improving llm-based global optimization with search space partitioning. InInternational Conference on Learning Representations (ICLR’26), 2026

work page 2026
[46]

Sennrich, B

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), 2016

work page 2016
[47]

X. Song, Y . Tian, R. T. Lange, C. Lee, Y . Tang, and Y . Chen. Position: Leverage foundational models for black-box optimization. InProceedings of the Forty-first International Conference on Machine Learning (ICML’24), 2024

work page 2024
[48]

J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. InProceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NeurIPS’16), 2016

work page 2016
[49]

Storn and K

R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces.Journal of Global Optimization, 1997

work page 1997
[50]

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 568(C), Feb. 2024

work page 2024
[51]

Talbi.Metaheuristics: from design to implementation

E.-G. Talbi.Metaheuristics: from design to implementation. John Wiley & Sons, 2009

work page 2009
[52]

L. C. Tiao, A. Klein, M. W. Seeger, E. V . Bonilla, C. Archambeau, and F. Ramos. Bore: Bayesian optimization by density-ratio estimation. InProceedings of the 38th International Conference on Machine Learning (ICML’21), 2021

work page 2021
[53]

Todorov, T

E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’12), 2012

work page 2012
[54]

van Stein and T

N. van Stein and T. Bäck. Llamea: Automatically generating metaheuristics with large language models. InProceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO’25), 2025

work page 2025
[55]

Veliˇckovi´c, A

P. Veliˇckovi´c, A. Vitvitskyi, L. Markeeva, B. Ibarz, L. Buesing, M. Balog, and A. Novikov. Amplifying human performance in combinatorial competitive programming.arXiv:2411.19744 [cs.LG], 2024

work page arXiv 2024
[56]

Z. Wang, G. E. Dahl, K. Swersky, C. Lee, Z. Nado, J. Gilmer, J. Snoek, and Z. Ghahramani. Pre- trained Gaussian processes for Bayesian optimization.Journal of Machine Learning Research (JMLR’24), 2024

work page 2024
[57]

Wistuba, N

M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Learning hyperparameter optimization initializations. InIEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015

work page 2015
[58]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

work page 2025
[59]

Limitations

L. Zimmer, M. Lindauer, and F. Hutter. Auto-pytorch tabular: Multi-fidelity metalearning for efficient and robust autodl.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 13 A Benchmark Families We consider the following benchmark families from the literature: • FC-Net[ 29]: Considers the optimization of the hyperparameters and archite...

work page 2021
[60]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[1] [1]

Aglietti, I

V . Aglietti, I. Ktena, J. Schrouff, E. Sgouritsa, F. J. R. Ruiz, A. Malek, A. Bellot, and S. Chi- appa. FunBO: Discovering acquisition functions for Bayesian optimization with funsearch. arXiv:2406.04824 [cs.LG], 2025

work page arXiv 2025

[2] [2]

Andrychowicz, M

M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. InProceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NeurIPS’16), 2016

work page 2016

[3] [3]

A. F. Ansari, L. Stella, A. C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, H. Wang, M. W. Mahoney, K. Torkkola, A. G. Wilson, M. Bohlke-Schneider, and B. Wang. Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

work page 2024

[4] [4]

S. P. Arango, H. Jomaa, M. Wistuba, and J. Grabocka. HPO-B: A large-scale reproducible benchmark for black-box hpo based on openml. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS’21), 2021

work page 2021

[5] [5]

Bergstra, R

J. Bergstra, R. Bardenet, Y . Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. InProceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NeurIPS’11), 2011

work page 2011

[6] [6]

Bergstra and Y

J. Bergstra and Y . Bengio. Random search for hyper-parameter optimization.Journal of Machine Learning Research (JMLR-12), 2012

work page 2012

[7] [7]

Binder, F

M. Binder, F. Pfisterer, and B. Bischl. Collecting empirical data about hyperparameters for data driven automl.Democratizing Machine Learning Contributions in AutoML and Fairness, 2020

work page 2020

[8] [8]

Calandra, N

R. Calandra, N. Gopalan, A. Seyfarth, J. Peters, and M. Deisenroth. Bayesian gait optimization for bipedal locomotion. InProceedings of the Eighth International Conference on Learning and Intelligent Optimization (LION’14), 2014

work page 2014

[9] [9]

Y . Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. InProceedings of the 34th International Conference on Machine Learning (ICML’17), 2017. 10

work page 2017

[10] [10]

Y . Chen, X. Song, C. Lee, Z. Wang, Q. Zhang, D. Dohan, K. Kawakami, G. Kochanski, A. Doucet, M. A. Ranzato, S. Perel, and N. de Freitas. Towards learning universal hyperpa- rameter optimizers with transformers. InProceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS’22), 2022

work page 2022

[11] [11]

Cherti, R

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for contrastive language-image learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023

work page 2023

[12] [12]

A. I. Cowen-Rivers, W. Lyu, R. Tutunov, Z. Wang, A. Grosnit, R. R. Griffiths, H. Jianye, J. Wang, and H. B. Ammar. An empirical study of assumptions in Bayesian optimisation.arXiv preprint arXiv:2012.03826, 445, 2020

work page arXiv 2012

[13] [13]

Dong and Y

X. Dong and Y . Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. InInternational Conference on Learning Representations (ICLR’20), 2020

work page 2020

[14] [14]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR’21), 2021

work page 2021

[15] [15]

Eggensperger, F

K. Eggensperger, F. Hutter, H. Hoos, and K. Leyton-Brown. Efficient benchmarking of hyperpa- rameter optimizers via surrogates. InProceedings of the 29th National Conference on Artificial Intelligence (AAAI’15), 2015

work page 2015

[16] [16]

Eggensperger, P

K. Eggensperger, P. Müller, N. Mallik, M. Feurer, R. Sass, A. Klein, N. Awad, M. Lindauer, and F. Hutter. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS’21), 2021

work page 2021

[17] [17]

Garnett.Bayesian Optimization

R. Garnett.Bayesian Optimization. Cambridge University Press, 2023

work page 2023

[18] [18]

Gómez-Bombarelli, J

R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Auto- matic chemical design using a data-driven continuous representation of molecules.ACS Central Science, 2018

work page 2018

[19] [19]

N. Hansen. The CMA evolution strategy: a comparing review. InTowards a new evolutionary computation. Advances on estimation of distribution algorithms. Springer Berlin Heidelberg, 2006

work page 2006

[20] [20]

Hansen, A

N. Hansen, A. Auger, O. Mersmann, T. Tušar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting.arXiv:1603.08785 [cs.AI], 2016

work page arXiv 2016

[21] [21]

Hansen, S

N. Hansen, S. Finck, R. Ros, and A. Auger.Real-parameter black-box optimization benchmark- ing 2009: Noiseless functions definitions. PhD thesis, INRIA, 2009

work page 2009

[22] [22]

Training Compute-Optimal Large Language Models

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models.arXiv:2203.15556 [cs.CL], 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[23] [23]

Hollmann, S

N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InInternational Conference on Learning Representations (ICLR’23), 20223

work page

[24] [24]

Hollmann, S

N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeis- ter, and F. Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 2025

work page 2025

[25] [25]

Hutter, M

F. Hutter, M. López-Ibáñez, C. Fawcett, M. Lindauer, H. Hoos, K. Leyton-Brown, and T. Stützle. Aclib: a benchmark library for algorithm configuration. InProceedings of the Learning and Intelligent OptimizatioN Conference (LION 8), 2014. 11

work page 2014

[26] [26]

D. R. Jones. A taxonomy of global optimization methods based on response surfaces.Journal of Global Optimization, 2001

work page 2001

[27] [27]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv:2001.08361 [cs.LG], 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[28] [28]

Klein, Z

A. Klein, Z. Dai, F. Hutter, N. Lawrence, and J. Gonzalez. Meta-surrogate benchmarking for hyperparameter optimization. InProceedings of the 32th International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019

work page 2019

[29] [29]

Tabular Benchmarks for Joint Architecture and Hyperparameter Optimization

A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimiza- tion.arXiv:1905.04970 [cs.LG], 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[30] [30]

Krishnamurthy, K

A. Krishnamurthy, K. Harris, D. J. Foster, C. Zhang, and A. Slivkins. Can large language models explore in-context? InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS’24), 2024

work page 2024

[31] [31]

W. Li, N. van Stein, T. Back, and E. Raponi. LLaMEA-BO: a large language model evolutionary algorithm for automatically generating bayesian optimization algorithms.arXiv:2505.21034 [cs.LG], 2025

work page arXiv 2025

[32] [32]

Litgpt.https://github.com/Lightning-AI/litgpt, 2023

Lightning-AI. Litgpt.https://github.com/Lightning-AI/litgpt, 2023

work page 2023

[33] [33]

T. Liu, N. Astorga, N. Seedat, and M. van der Schaar. Large language models to enhance Bayesian optimization. InThe Twelfth International Conference on Learning Representations (ICLR’24), 2024

work page 2024

[34] [34]

Loshchilov and F

I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR’17), 2017

work page 2017

[35] [35]

Müller, M

S. Müller, M. Feurer, N. Hollmann, and F. Hutter. Pfns4bo: In-context learning for bayesian optimization. InProceedings of the 40th International Conference on Machine Learning (ICML’23), 2023

work page 2023

[36] [36]

Perrone, H

V . Perrone, H. Shen, M. Seeger, C. Archambeau, and R. Jenatton. Learning search spaces for bayesian optimization: Another view of hyperparameter transfer learning. InProceedings of the 32th International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019

work page 2019

[37] [37]

Pfisterer, L

F. Pfisterer, L. Schneider, J. Moosbauer, M. Binder, and B. Bischl. Yahpo gym - an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. InFirst Conference on Automated Machine Learning (Main Track), 2022

work page 2022

[38] [38]

E. Real, A. Aggarwal, Y . Huang, and Q. V . Le. Regularized Evolution for Image Classifier Architecture Search. InProceedings of the Conference on Artificial Intelligence (AAAI’19), 2019

work page 2019

[39] [39]

Salinas and N

D. Salinas and N. Erickson. Tabrepo: A large scale repository of tabular model evaluations and its automl applications.arXiv preprint arXiv:2311.02971, 2023

work page arXiv 2023

[40] [40]

Salinas, V

D. Salinas, V . Flunkert, J. Gasthaus, and T. Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks.International journal of forecasting, 2020

work page 2020

[41] [41]

Salinas, J

D. Salinas, J. Golebiowsk, A. Klein, M. Seeger, and C. Archambeau. Optimizing hyperparame- ters with conformal quantile regression. InProceedings of the 40th International Conference on Machine Learning (ICML’23), 2023

work page 2023

[42] [42]

Salinas, M

D. Salinas, M. Seeger, A. Klein, V . Perrone, M. Wistuba, and C. Archambeau. Syne tune: A library for large scale hyperparameter tuning and reproducible research. InFirst Conference on Automated Machine Learning (Main Track), 2022

work page 2022

[43] [43]

Salinas, H

D. Salinas, H. Shen, and V . Perrone. A quantile-based approach for hyperparameter transfer learning. InProceedings of the 37th International Conference on Machine Learning (ICML’20), 2020. 12

work page 2020

[44] [44]

Schmidhuber

J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook

work page

[45] [45]

Schwanke, L

A. Schwanke, L. Ivanov, D. Salinas, F. Ferreira, A. Klein, F. Hutter, and A. Zela. Improving llm-based global optimization with search space partitioning. InInternational Conference on Learning Representations (ICLR’26), 2026

work page 2026

[46] [46]

Sennrich, B

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), 2016

work page 2016

[47] [47]

X. Song, Y . Tian, R. T. Lange, C. Lee, Y . Tang, and Y . Chen. Position: Leverage foundational models for black-box optimization. InProceedings of the Forty-first International Conference on Machine Learning (ICML’24), 2024

work page 2024

[48] [48]

J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. InProceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NeurIPS’16), 2016

work page 2016

[49] [49]

Storn and K

R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces.Journal of Global Optimization, 1997

work page 1997

[50] [50]

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 568(C), Feb. 2024

work page 2024

[51] [51]

Talbi.Metaheuristics: from design to implementation

E.-G. Talbi.Metaheuristics: from design to implementation. John Wiley & Sons, 2009

work page 2009

[52] [52]

L. C. Tiao, A. Klein, M. W. Seeger, E. V . Bonilla, C. Archambeau, and F. Ramos. Bore: Bayesian optimization by density-ratio estimation. InProceedings of the 38th International Conference on Machine Learning (ICML’21), 2021

work page 2021

[53] [53]

Todorov, T

E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’12), 2012

work page 2012

[54] [54]

van Stein and T

N. van Stein and T. Bäck. Llamea: Automatically generating metaheuristics with large language models. InProceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO’25), 2025

work page 2025

[55] [55]

Veliˇckovi´c, A

P. Veliˇckovi´c, A. Vitvitskyi, L. Markeeva, B. Ibarz, L. Buesing, M. Balog, and A. Novikov. Amplifying human performance in combinatorial competitive programming.arXiv:2411.19744 [cs.LG], 2024

work page arXiv 2024

[56] [56]

Z. Wang, G. E. Dahl, K. Swersky, C. Lee, Z. Nado, J. Gilmer, J. Snoek, and Z. Ghahramani. Pre- trained Gaussian processes for Bayesian optimization.Journal of Machine Learning Research (JMLR’24), 2024

work page 2024

[57] [57]

Wistuba, N

M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Learning hyperparameter optimization initializations. InIEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015

work page 2015

[58] [58]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

work page 2025

[59] [59]

Limitations

L. Zimmer, M. Lindauer, and F. Hutter. Auto-pytorch tabular: Multi-fidelity metalearning for efficient and robust autodl.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 13 A Benchmark Families We consider the following benchmark families from the literature: • FC-Net[ 29]: Considers the optimization of the hyperparameters and archite...

work page 2021

[60] [60]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page