SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization
Pith reviewed 2026-05-21 20:10 UTC · model grok-4.3
The pith
SemanticOpt fine-tunes LLMs on Bayesian optimization trajectories with natural-language context to propose experiments that combine numerical data and semantic knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SemanticOpt equips LLMs with optimization capabilities by fine-tuning them on structured Bayesian optimization trajectories augmented with natural-language context. The resulting models jointly use numerical and semantic evidence when proposing new experiments while producing interpretable predictions aligned with Bayesian surrogate models. Across a constructed benchmark of real-world optimization problems paired with semantic information, SemanticOpt outperforms both classical optimizers and existing LLM-based approaches on average when given relevant semantic information.
What carries the argument
Fine-tuning of LLMs on structured Bayesian optimization trajectories augmented with natural-language context, which enables joint numerical-semantic proposal generation aligned with surrogate models.
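To make the idea concrete, the following is a minimal Python sketch of how one step of an optimization trajectory and its natural-language context might be serialized into a prompt/completion pair for fine-tuning. The Observation class, the format_training_example function, the field layout, and the example context are all hypothetical; the paper's actual serialization format is not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Observation:
    """One evaluated experiment: input parameters and the measured objective."""
    x: Sequence[float]
    y: float


def format_training_example(context: str,
                            history: List[Observation],
                            next_point: Sequence[float]) -> Dict[str, str]:
    """Serialize one trajectory step as a prompt/completion pair.

    The layout (field names, rounding, separators) is an assumption made
    for illustration only.
    """
    lines = [f"Context: {context}", "History:"]
    for i, obs in enumerate(history, start=1):
        xs = ", ".join(f"{v:.3f}" for v in obs.x)
        lines.append(f"  {i}. x=[{xs}] -> y={obs.y:.3f}")
    lines.append("Propose the next experiment:")
    completion = "x=[" + ", ".join(f"{v:.3f}" for v in next_point) + "]"
    return {"prompt": "\n".join(lines), "completion": completion}


# Made-up context and data; in training, next_point would be the point
# chosen by the Bayesian optimizer that generated the trajectory.
example = format_training_example(
    context="Maximize reaction yield; inputs are normalized to [0, 1].",
    history=[Observation([0.20, 0.50], 0.41), Observation([0.60, 0.30], 0.67)],
    next_point=[0.70, 0.35],
)
print(example["prompt"])
print(example["completion"])
```

Keeping the semantic context and the numerical history in the same prompt is what would let a single model condition on both kinds of evidence when it generates the next proposal.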
Load-bearing premise
Fine-tuning LLMs on structured Bayesian optimization trajectories augmented with natural-language context produces reliable proposals that jointly use numerical and semantic evidence and remain aligned with Bayesian surrogate models.
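One way to probe the alignment part of this premise is to compare the model's predicted mean and uncertainty at candidate points with a reference surrogate's posterior at the same points. The sketch below uses the closed-form KL divergence between univariate Gaussians as the gap measure. That choice, the function names, and the numbers are assumptions for illustration; this summary does not state which alignment measure the paper uses.

```python
import math
from typing import Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, standard deviation)


def gaussian_kl(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) between two univariate Gaussian distributions."""
    (mean_p, std_p), (mean_q, std_q) = p, q
    if std_p <= 0 or std_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2 * std_q ** 2)
            - 0.5)


def mean_alignment_gap(surrogate: Sequence[Gaussian],
                       model: Sequence[Gaussian]) -> float:
    """Average KL(surrogate || model) over candidate points; 0 means identical."""
    if not surrogate or len(surrogate) != len(model):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(gaussian_kl(s, m) for s, m in zip(surrogate, model))
    return total / len(surrogate)


# Made-up predictions at three candidate experiments.
surrogate_preds = [(0.60, 0.10), (0.72, 0.08), (0.55, 0.15)]
model_preds = [(0.62, 0.10), (0.70, 0.10), (0.50, 0.15)]
print(round(mean_alignment_gap(surrogate_preds, model_preds), 4))
```

A gap near zero would indicate that the model's predictions track the surrogate closely; a large gap would suggest the proposals are driven by something other than the surrogate-like reasoning the premise assumes.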
What would settle it
Direct evaluation on the paper's benchmark of real-world problems where SemanticOpt fails to outperform classical optimizers and existing LLM approaches on average even when relevant semantic information is supplied.
Original abstract
Optimizing an experimental system can be extremely challenging when each experiment is expensive, time-consuming, or difficult to perform. Existing optimizers for expensive black-box problems, such as Bayesian optimization, are typically limited to numerical or categorical observations. They do not make use of broader domain knowledge, such as expert heuristics, relevant scientific papers, or similar previous experiments. Large language models (LLMs) can interpret this semantic information; however, even state-of-the-art LLMs struggle to reliably solve black-box optimization problems. We introduce SemanticOpt, a framework for semantic black-box optimization that equips LLMs with optimization capabilities by fine-tuning them on structured Bayesian optimization trajectories augmented with natural-language context. SemanticOpt jointly uses numerical and semantic evidence when proposing new experiments, while producing interpretable predictions aligned with Bayesian surrogate models. We construct a range of real-world optimization problems paired with semantic information to create a diverse benchmark for evaluating semantic black-box optimization. Across these domains, SemanticOpt outperforms both classical optimizers and existing LLM-based approaches on average when given relevant semantic information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SemanticOpt, a framework that fine-tunes LLMs on structured Bayesian optimization trajectories augmented with natural-language context to enable semantic black-box optimization. It constructs a benchmark of real-world optimization problems paired with semantic information and reports that SemanticOpt outperforms both classical optimizers and existing LLM-based approaches on average when given relevant semantic information.
Significance. If the empirical claims hold under rigorous validation, this work could meaningfully advance expensive black-box optimization by incorporating domain knowledge and expert heuristics that numerical methods cannot access. The construction of a diverse benchmark and the alignment of LLM proposals with Bayesian surrogates are constructive contributions that address a recognized limitation in the field.
major comments (2)
- [§4] §4 (Benchmark Construction): The central claim of average outperformance rests on a newly constructed benchmark of real-world problems paired with semantic information. The manuscript must explicitly state the provenance of the semantic context (pre-existing expert sources versus post-hoc generation or selection) and whether benchmark construction and semantic pairing were pre-specified before any experiments, as this directly affects whether the comparison to classical optimizers is fair and non-circular.
- [Results] Results section and abstract: Average outperformance is reported without error bars, statistical significance tests, exact data splits, or confirmation that benchmark construction was pre-specified. These omissions are load-bearing for the empirical claim and must be addressed with concrete details on variance across runs and domains.
minor comments (2)
- [Abstract] Abstract: The phrase 'outperforms ... on average' should specify the primary metric (e.g., cumulative regret, final objective value) and the number of domains or problems over which the average is taken.
- [§3.2] §3.2: The claim that proposals are 'aligned with Bayesian surrogate models' would be strengthened by a brief illustrative example or equation showing how numerical and semantic evidence are jointly used in the proposal step.
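For reference, the two example metrics named in the first minor comment can be computed from a single optimization run as in the sketch below. It assumes a maximization problem whose optimum value is known, which is often not the case for real-world problems; the function names and the data are illustrative and not taken from the paper.

```python
from typing import List, Sequence


def best_so_far(values: Sequence[float]) -> List[float]:
    """Running maximum of the observed objective values."""
    best, out = float("-inf"), []
    for v in values:
        best = max(best, v)
        out.append(best)
    return out


def simple_regret(values: Sequence[float], optimum: float) -> float:
    """Gap between the known optimum and the best value found."""
    if not values:
        raise ValueError("values must be non-empty")
    return optimum - max(values)


def cumulative_regret(values: Sequence[float], optimum: float) -> float:
    """Sum of per-step gaps to the known optimum."""
    return sum(optimum - v for v in values)


# Made-up objective values from one run of four experiments.
run = [0.41, 0.67, 0.59, 0.80]
print(best_so_far(run))
print(simple_regret(run, optimum=1.0))
print(cumulative_regret(run, optimum=1.0))
```

Simple regret rewards only the best point found, while cumulative regret also penalizes poor intermediate experiments, so the two can rank optimizers differently. This is why the choice of primary metric matters for an "on average" claim.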
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. These have prompted us to strengthen the transparency and statistical rigor of the manuscript. We address each major comment below and have revised the paper accordingly.
Point-by-point responses
-
Referee: [§4] §4 (Benchmark Construction): The central claim of average outperformance rests on a newly constructed benchmark of real-world problems paired with semantic information. The manuscript must explicitly state the provenance of the semantic context (pre-existing expert sources versus post-hoc generation or selection) and whether benchmark construction and semantic pairing were pre-specified before any experiments, as this directly affects whether the comparison to classical optimizers is fair and non-circular.
Authors: We agree that explicit documentation of provenance and pre-specification is essential for interpretability. The benchmark problems were drawn from established real-world optimization tasks (e.g., hyperparameter tuning, chemical reaction yield maximization, and materials design) that predate this work. Semantic contexts were extracted from pre-existing peer-reviewed literature and documented expert heuristics for each domain; no post-hoc generation or result-dependent selection occurred. In the revised §4 we now include a dedicated subsection and supplementary table that lists, for every problem, the exact source references and the date of benchmark finalization (prior to any model training or evaluation). This confirms the construction and pairing were pre-specified, preserving the fairness of comparisons to classical optimizers.
Revision: yes
-
Referee: [Results] Results section and abstract: Average outperformance is reported without error bars, statistical significance tests, exact data splits, or confirmation that benchmark construction was pre-specified. These omissions are load-bearing for the empirical claim and must be addressed with concrete details on variance across runs and domains.
Authors: We acknowledge that the original reporting lacked sufficient statistical detail. The revised Results section and abstract now report (i) mean performance with standard-error bars computed over five independent random seeds, (ii) p-values from Wilcoxon signed-rank tests comparing SemanticOpt against each baseline on a per-domain and aggregate basis, (iii) the precise train/validation/test splits used for fine-tuning and trajectory evaluation, and (iv) a per-domain breakdown table that reveals variance across problem types. The pre-specification confirmation is cross-referenced to the new §4 subsection described above. These additions directly address the load-bearing concerns while preserving the original empirical findings.
Revision: yes
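The statistics named in this response can be sketched in plain Python as follows: a mean with standard error over seeds, and an exact two-sided Wilcoxon signed-rank test on paired scores. This is an illustrative standard-library implementation under stated assumptions (zero differences are dropped, and all sign patterns are enumerated, so it only suits small samples); the function names and data are hypothetical, and it is not the authors' analysis code.

```python
import math
from itertools import product
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def mean_and_standard_error(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error (sample std / sqrt(n)) over independent seeds."""
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(a: Sequence[float],
                         b: Sequence[float]) -> Tuple[float, float]:
    """Exact two-sided paired test; returns (W_plus, p_value)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2  # expected W_plus under the null hypothesis
    observed = abs(w_plus - centre)
    extreme = 0
    # Under the null, every assignment of signs to the ranks is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Made-up scores: five seeds for one method, then five paired comparisons.
seed_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
print(mean_and_standard_error(seed_scores))
print(wilcoxon_signed_rank([2, 4, 6, 8, 10], [1, 2, 3, 4, 5]))
```

With only five paired values, the smallest p-value an exact two-sided signed-rank test can return is 2/32 = 0.0625, so a per-domain comparison over five seeds cannot reach the 0.05 level on its own. Aggregating across domains, or running more seeds, is what gives the test room to detect a difference.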
Circularity Check
No significant circularity; derivation remains self-contained
Full rationale
The paper defines SemanticOpt as an LLM fine-tuned on structured Bayesian optimization trajectories augmented with natural-language context, then evaluates its proposals against external classical optimizers and prior LLM baselines on a constructed benchmark of real-world problems paired with semantic information. No quoted step equates a claimed prediction or outperformance result to its own inputs by construction, nor does any load-bearing premise reduce to a self-citation chain or fitted parameter renamed as output. The fine-tuning uses trajectories generated by classical methods as training data, but the subsequent performance comparison is independent and falsifiable against those same baselines. Benchmark construction introduces potential selection concerns but does not create definitional equivalence within the method itself.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Fine-tuning on structured Bayesian optimization trajectories with natural-language context enables LLMs to produce proposals aligned with Bayesian surrogate models.
Reference graph
Works this paper leans on
-
[1]
URL https://proceedings.neurips.cc/paper/2020/hash/f5b1b89d98b7286673128a5fb112cb9a-Abstract.html. Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al. An introduction to vision-language modeling. arXiv preprint arXiv:2405.17247,
-
[2]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901,
-
[3]
Sofian Chaybouti, Ludovic Dos Santos, Cedric Malherbe, and Aladin Virmaux. Meta-learning of black-box solvers using deep reinforcement learning. In NeurIPS 2022, MetaLearn Workshop,
-
[4]
Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Richard Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'Aurelio Ranzato, et al. Towards learning universal hyperparameter optimizers with transformers. Advances in Neural Information Processing Systems, 35:32053–32068, 2022b. Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice P...
-
[5]
Ouassim Elhara, Konstantinos Varelas, Duc Nguyen, Tea Tusar, Dimo Brockhoff, Nikolaus Hansen, and Anne Auger. COCO: The large scale black-box optimization benchmarking (bbob-largescale) test suite. arXiv preprint arXiv:1903.06396,
-
[6]
Beichen Huang, Xingyu Wu, Yu Zhou, Jibin Wu, Liang Feng, Ran Cheng, and Kay Chen Tan. Exploring the true potential: Evaluating the black-box optimization capability of large language models. arXiv preprint arXiv:2404.06290,
-
[7]
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246,
-
[8]
Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Generative pretraining for black-box optimization. arXiv preprint arXiv:2206.10786,
-
[9]
Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885,
-
[10]
URL http://jmlr.org/papers/v23/21-0888.html. Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance Bayesian optimization. arXiv preprint arXiv:2402.03921,
-
[11]
Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Johannes Dürholt, Jie Chen, Wojciech Matusik, and Mina Konaković Luković. Zeroshotopt: Towards zero-shot pretrained models for efficient black-box optimization. arXiv preprint arXiv:2510.03051,
-
[12]
Lei Song, Chenxiao Gao, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, and Chao Qian. Reinforced in-context black-box optimization. arXiv preprint arXiv:2402.17423, 2024a. Xingyou Song, Oscar Li, Chansoo Lee, Daiyi Peng, Sagi Perel, Yutian Chen, et al. Omnipred: Language models as universal regressors. arXiv preprint arXiv:2402.14547, 2024b. Xin...
-
[13]
Yunsheng Tian, Mina Konaković Luković, Timothy Erps, Michael Foshey, and Wojciech Matusik. Autooed: Automated optimal experiment design platform. arXiv preprint arXiv:2104.05959,
-
[14]
In NeurIPS 2020 Competition and Demonstration Track, pp. 3–26. PMLR,
-
[15]
Michael Volpp, Lukas P Fröhlich, Kirsten Fischer, Andreas Doerr, Stefan Falkner, Frank Hutter, and Christian Daniel. Meta-learning acquisition functions for transfer learning in Bayesian optimization. arXiv preprint arXiv:1904.02642,
-
[16]
URL https://arxiv.org/abs/2309.03409. A IMPLEMENTATION DETAILS A.1 DATA GENERATION To construct a diverse training dataset, we implement multiple classes of continuous synthetic black-box functions within a unified environment interface. Each environment is initialized with a dimensionality and a random seed for reproducibility. Inputs are normalized t...
-
[17]
We see that GPTOpt outperforms all baselines in win-rate. This is another important metric for an optimizer and highlights the robustness of GPTOpt. C EVALUATIONS We provide further detail into the evaluation strategy used for both benchmarks and baselines in our experiments. C.1 BENCHMARKS GP: We use the function generator used for generating training data ...