Tree-Structured Synergy of Large Language Models and Bayesian Optimization for Efficient CASH

Beicheng Xu; Bin Cui; Lingching Tung; Weitong Qian; Yupeng Lu

arxiv: 2601.12355 · v2 · submitted 2026-01-18 · 💻 cs.LG

Tree-Structured Synergy of Large Language Models and Bayesian Optimization for Efficient CASH

Beicheng Xu , Weitong Qian , Lingching Tung , Yupeng Lu , Bin Cui This is my paper

Pith reviewed 2026-05-16 13:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords AutoMLCASHBayesian optimizationlarge language modelsMonte Carlo tree searchhyperparameter tuningalgorithm selection

0 comments

The pith

A Monte Carlo tree search integrates large language models and Bayesian optimization to solve the CASH problem in AutoML.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes LB-MCTS to tackle the combined algorithm selection and hyperparameter optimization problem by placing both Bayesian optimization and large language models inside a single Monte Carlo tree search structure. The tree acts as shared memory so that LLM semantic proposals can address cold-start problems while BO surrogates provide quantitative refinement later in the search. A reliability-aware policy decides the balance between the two based on how reliable the surrogate has become. This matters because it offers a way to automate machine learning model selection and tuning without requiring users to have deep knowledge of either optimization techniques or machine learning algorithms.

Core claim

The central discovery is that structuring the optimization as a Monte Carlo tree search allows for synergistic use of Bayesian optimization for algorithm-specific surrogate modeling and large language models for path-aware semantic proposals and reflections, with an adaptive shift from LLM-driven to BO-driven proposals as the surrogate improves.

What carries the argument

Monte Carlo Tree Search tree as shared state for algorithm selection, hyperparameter refinement, and adaptive BO-LLM proposer synergy with a reliability-aware policy.

Load-bearing premise

The assumption that an LLM can reliably generate path-aware semantic proposals and reflections inside the shared MCTS state without introducing errors that the reliability-aware policy cannot correct.

What would settle it

If experiments on additional datasets show that LB-MCTS does not outperform the strongest individual baseline of pure BO or pure LLM, that would falsify the claim of effective synergy.

read the original abstract

To lower the expertise barrier in machine learning, the AutoML community has focused on the CASH problem, which jointly automates algorithm selection and hyperparameter tuning. While traditional methods like Bayesian Optimization (BO) struggle with cold-start issues, Large Language Models (LLMs) can mitigate these through semantic priors. However, existing LLM-based optimizers generalize poorly to high-dimensional, structured CASH spaces. In this paper, we propose LB-MCTS, a trajectory-structured optimization framework that uses a Monte Carlo Tree Search tree as a shared state for algorithm selection, hyperparameter refinement, and BO-LLM proposer synergy. Within this shared state, BO provides algorithm-specific surrogate modeling for quantitative search, while the LLM exploits path-aware selective memory to generate semantic proposals and reflections. As the surrogate model improves, a reliability-aware proposer policy adaptively shifts from LLM-driven to BO-driven proposals within a unified search trajectory. Experiments on 104 AMLB datasets demonstrate that LB-MCTS consistently outperforms BO-based, LLM-based, and hybrid baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LB-MCTS uses one MCTS tree to share state between BO surrogates and path-aware LLM proposals with an adaptive reliability switch, which is a clean way to handle cold-start CASH but the 104-dataset wins need full implementation details to evaluate.

read the letter

The core idea is to run algorithm selection and hyperparameter search inside a single Monte Carlo Tree Search tree. BO supplies the quantitative surrogate for each node while the LLM generates semantic proposals and reflections that respect the path taken so far. A reliability-aware policy then hands more control to BO once the surrogate improves. That adaptive hand-off inside one trajectory is the piece that feels new compared with earlier LLM-BO hybrids that kept the two components more separate.

Referee Report

2 major / 2 minor

Summary. The paper introduces LB-MCTS, a trajectory-structured optimization framework that employs a Monte Carlo Tree Search (MCTS) tree as a shared state to integrate Bayesian Optimization (BO) for quantitative surrogate modeling with Large Language Models (LLMs) for path-aware semantic proposals and reflections. A reliability-aware policy adaptively shifts proposal generation from LLM-driven to BO-driven as the surrogate improves. The central claim is that this synergy yields consistent outperformance over BO-based, LLM-based, and hybrid baselines across 104 AMLB datasets for the CASH problem.

Significance. If the reported outperformance holds under rigorous controls, the work offers a concrete mechanism for addressing cold-start and high-dimensional issues in AutoML by mediating LLM semantic priors and BO quantitative search within a unified tree state; this could inform future hybrid optimizers that leverage both modalities without post-hoc switching.

major comments (2)

[§4] §4 (Experiments): The claim that LB-MCTS 'consistently outperforms' all baselines on 104 AMLB datasets is load-bearing, yet the section supplies no information on baseline re-implementations, shared hyperparameter budgets, number of function evaluations per run, random seeds, or statistical tests (e.g., Wilcoxon signed-rank or confidence intervals). These omissions prevent verification that the performance gap is not attributable to unequal computational resources or post-hoc tuning.
[§3.3] §3.3 (Reliability-aware Proposer Policy): The adaptive shift from LLM to BO proposals is described only at a high level; the manuscript must specify the exact reliability metric (e.g., surrogate variance threshold or prediction interval width), how it is computed from the shared MCTS state, and the decision rule for switching, because any heuristic threshold directly affects the claimed synergy and could introduce bias if not validated.

minor comments (2)

[Abstract] The acronym AMLB is used without expansion or citation in the abstract and early sections; a parenthetical definition or reference to the AutoML Benchmark should be added on first use.
[§4] Figure captions and axis labels in the experimental plots should explicitly state the performance metric (e.g., normalized regret or accuracy) and the number of independent runs averaged.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of reproducibility and clarity. We address each major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [§4] §4 (Experiments): The claim that LB-MCTS 'consistently outperforms' all baselines on 104 AMLB datasets is load-bearing, yet the section supplies no information on baseline re-implementations, shared hyperparameter budgets, number of function evaluations per run, random seeds, or statistical tests (e.g., Wilcoxon signed-rank or confidence intervals). These omissions prevent verification that the performance gap is not attributable to unequal computational resources or post-hoc tuning.

Authors: We agree that these experimental details are necessary to substantiate the performance claims and enable verification. In the revised manuscript, Section 4 will be expanded to include full descriptions of baseline re-implementations (including code references and any adaptations made), confirmation of identical hyperparameter budgets and function evaluation limits across all methods, the number of random seeds employed with averaging procedure, and results of statistical tests such as Wilcoxon signed-rank tests with associated p-values and confidence intervals. These additions will directly address concerns about resource equality and post-hoc tuning. revision: yes
Referee: [§3.3] §3.3 (Reliability-aware Proposer Policy): The adaptive shift from LLM to BO proposals is described only at a high level; the manuscript must specify the exact reliability metric (e.g., surrogate variance threshold or prediction interval width), how it is computed from the shared MCTS state, and the decision rule for switching, because any heuristic threshold directly affects the claimed synergy and could introduce bias if not validated.

Authors: We acknowledge that the current description in §3.3 remains high-level. The revised manuscript will provide the precise reliability metric (derived from the BO surrogate's uncertainty estimates), the exact computation procedure using the shared MCTS tree state (e.g., aggregation over node statistics), and the explicit decision rule for the adaptive shift, including any threshold or validation steps. This will allow readers to fully reproduce and assess the policy's contribution to the claimed synergy. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents LB-MCTS as a framework combining MCTS as shared state with BO surrogates and LLM path-aware proposals, plus an adaptive reliability-aware policy. No equations, derivations, or quantitative predictions appear in the abstract or described mechanism. The central claim is empirical outperformance on 104 AMLB datasets, which rests on external benchmarks rather than any internal reduction of outputs to fitted inputs or self-citations. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that LLM proposals remain useful within the MCTS structure.

pith-pipeline@v0.9.0 · 5484 in / 1144 out tokens · 54793 ms · 2026-05-16T13:33:47.754063+00:00 · methodology

Tree-Structured Synergy of Large Language Models and Bayesian Optimization for Efficient CASH

Core claim

What carries the argument

Load-bearing premise

What would settle it

discussion (0)