pith. machine review for the scientific record.

arXiv:2603.24647 · v5 · submitted 2026-03-25 · 💻 cs.LG · stat.ML

Recognition: no theorem link

Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML
keywords hyperparameter optimization · LLM agents · CMA-ES · hybrid optimization · autoresearch · language model tuning · optimization algorithms

The pith

Centaur hybrid optimizer beats both classical HPO algorithms and pure LLM agents, with a 0.8B model already sufficient.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLM agents can replace classical hyperparameter optimization methods by using the autoresearch setup to tune a small language model under fixed compute. Classical algorithms such as CMA-ES and TPE win on a fixed search space because they reliably avoid failures like out-of-memory errors, while pure LLM agents struggle to maintain consistent state across trials. Letting the LLM edit source code directly narrows the gap but does not overtake classical performance. The authors introduce Centaur, a hybrid that feeds CMA-ES internal state to the LLM for trial proposals, and this combination produces the strongest results observed.
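
A minimal sketch of the failure handling this comparison hinges on: wrapping each trial so an out-of-memory error becomes a large penalty the optimizer can learn from. The paper's exact protocol is not specified here; `train_and_eval`, the batch-size threshold, and the penalty value are illustrative assumptions.

```python
# Minimal sketch: surfacing out-of-memory failures to a classical
# optimizer as a large penalty. `train_and_eval`, the batch-size
# threshold, and the penalty value are illustrative assumptions.
import torch

OOM_PENALTY = 1e9  # assumed sentinel "loss" for failed trials

def train_and_eval(config: dict) -> float:
    # Hypothetical stand-in for the paper's fixed-compute training run.
    if config["batch_size"] > 256:  # pretend very large batches OOM
        raise torch.cuda.OutOfMemoryError("simulated OOM")
    return 4.0 / config["batch_size"]  # fake validation loss

def safe_objective(config: dict) -> float:
    """Return validation loss, or a large penalty on CUDA OOM."""
    try:
        return train_and_eval(config)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # recover before the next trial
        return OOM_PENALTY

print(safe_objective({"batch_size": 128}))  # 0.03125
print(safe_objective({"batch_size": 512}))  # 1000000000.0
```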

Core claim

Centaur integrates the interpretable internal state of CMA-ES (mean vector, step-size, covariance matrix) with LLM-generated trial suggestions. This hybrid achieves the best performance across the experiments: a language model of only 0.8 billion parameters already surpasses every classical baseline and every pure LLM method tested, while unconstrained code editing by LLMs still requires frontier-scale models to reach competitive levels.

What carries the argument

Centaur, the hybrid optimizer that shares CMA-ES's mean vector, step-size, and covariance matrix with an LLM to guide trial proposals.
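
A minimal sketch of that state-sharing pattern, using the `cma` package. This is an illustration of the mechanism as described, not the authors' implementation: `llm_propose` is a hypothetical LLM call, and the mixing rule and toy objective are assumptions.

```python
# Minimal sketch of the state-sharing pattern described above, using the
# `cma` package. Not the authors' implementation: `llm_propose` is a
# hypothetical LLM call and the toy objective is illustrative.
import numpy as np
import cma

def llm_propose(prompt: str, dim: int) -> np.ndarray:
    # Hypothetical: send `prompt` to an LLM and parse a candidate vector.
    return np.random.uniform(-1, 1, dim)  # placeholder proposal

def centaur_step(es: cma.CMAEvolutionStrategy, llm_fraction: float = 0.25):
    candidates = es.ask()  # classical CMA-ES proposals
    # Serialize the interpretable internal state for the LLM prompt.
    prompt = (f"mean={np.round(es.mean, 3).tolist()}, "
              f"step_size={es.sigma:.4f}, "
              f"covariance={np.round(es.C, 3).tolist()}")
    n_llm = int(llm_fraction * len(candidates))
    for i in range(n_llm):  # replace a fraction with LLM trials
        candidates[i] = llm_propose(prompt, len(es.mean))
    losses = [float(np.sum(np.square(x))) for x in candidates]  # toy objective
    es.tell(candidates, losses)  # CMA-ES updates its state as usual

es = cma.CMAEvolutionStrategy(np.zeros(4), 0.5)
for _ in range(10):
    centaur_step(es)
print("best toy loss:", es.result.fbest)
```

The design point the pith emphasizes is visible here: CMA-ES still performs its usual update in `tell`, so classical state tracking survives even when some trials come from the LLM.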

If this is right

  • Classical methods remain strong at avoiding common execution failures such as out-of-memory errors during search.
  • LLMs contribute most when their domain knowledge supplements rather than replaces classical state tracking.
  • Unconstrained source-code editing by LLMs demands larger models to match classical reliability.
  • Sharing explicit optimization state helps LLMs overcome their difficulty maintaining progress across trials.
  • Even modest-scale LLMs become competitive once paired with classical internal representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrids of this form could transfer to hyperparameter tuning for vision or reinforcement-learning models.
  • Dynamically varying the fraction of LLM-proposed trials might improve results on problems with different failure profiles (a toy schedule is sketched after this list).
  • Extending the state-sharing mechanism to multi-objective or constrained optimization could let LLMs propose feasible regions more effectively.
  • Testing whether the 0.8B-model advantage persists when the underlying model being tuned is itself much larger would clarify scaling behavior.
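
For the second bullet, a toy schedule illustrating what "dynamically varying the fraction" could mean; the rule, its parameters, and the failure-rate trigger are editorial assumptions, not anything from the paper.

```python
# Illustrative only: one way the editorial suggestion above could be
# realized. The schedule and its parameters are assumptions.
def adaptive_llm_fraction(recent_failures: int, window: int = 10,
                          lo: float = 0.1, hi: float = 0.5) -> float:
    """Shrink the LLM-proposed share when recent trials fail often
    (e.g., OOM), leaning on the classical optimizer's reliability."""
    failure_rate = recent_failures / window
    return max(lo, hi * (1.0 - failure_rate))

print(adaptive_llm_fraction(0))  # 0.5 -> trust the LLM more
print(adaptive_llm_fraction(8))  # 0.1 -> fall back to CMA-ES
```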

Load-bearing premise

That performance differences observed on one small language model under fixed compute and specific failure modes will generalize to other models, tasks, and search spaces.

What would settle it

Repeating the comparison on a vision model or a substantially larger language model and finding that Centaur no longer outperforms standalone CMA-ES or TPE.

Original abstract

The autoresearch repository enables an LLM agent to optimize hyperparameters by editing training code directly. We use it as a testbed to compare classical HPO algorithms against LLM-based methods on tuning the hyperparameters of a small language model under a fixed compute budget. When defining a fixed search space over autoresearch, classical methods such as CMA-ES and TPE consistently outperform LLM-based agents, where avoiding out-of-memory failures matters more than search diversity. Allowing the LLM to directly edit source code narrows the gap to the classical methods but does not close it, even with frontier models available at the time of writing such as Claude Opus 4.6 and Gemini 3.1 Pro Preview. We observe that LLMs struggle to track optimization state across trials. In contrast, classical methods lack the domain knowledge of LLMs. To combine the strengths of both, we introduce Centaur, a hybrid that shares CMA-ES's interpretable internal state, including mean vector, step-size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, and a 0.8B LLM already suffices to outperform all classical and pure LLM methods. Unconstrained code editing requires larger models to be competitive with classical methods. We further analyze search diversity, model scaling from 0.8B to frontier models, and ablate the fraction of LLM-proposed trials in Centaur. All in all, our results suggest that LLMs are most effective as a complement to classical optimizers, not as a replacement. Code is available at https://github.com/ferreirafabio/autoresearch-automl & interactive demo at https://ferreirafabio.github.io/autoresearch-automl.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the autoresearch repository, which allows an LLM agent to perform hyperparameter optimization by directly editing training code. Through experiments on tuning hyperparameters of a small language model under fixed compute, it shows that classical algorithms like CMA-ES and TPE outperform pure LLM agents in fixed search spaces primarily due to better avoidance of out-of-memory failures. The authors propose Centaur, a hybrid optimizer that shares CMA-ES internal state (mean, step-size, covariance) with an LLM for proposals. They report that Centaur achieves the best results, with a 0.8B LLM already outperforming classical and pure LLM methods, and provide analyses on search diversity, model scaling, and ablations on LLM trial fractions.

Significance. If the empirical findings hold under more rigorous controls, this work provides evidence that LLMs are most effective as complements to classical HPO methods rather than replacements, particularly through state sharing. The release of code and an interactive demo supports reproducibility, which is a strength. The observation that small models suffice in the hybrid setting could influence efficient AutoML designs, though the specific autoresearch testbed limits broader claims.

major comments (3)
  1. [Abstract] The central claim that 'a 0.8B LLM already suffices to outperform all classical and pure LLM methods' is presented without reference to the number of experimental runs, error bars, or statistical significance tests, undermining verifiability of the outperformance.
  2. [Centaur and Ablations] The ablation study on the fraction of LLM-proposed trials is conducted exclusively within the autoresearch framework that permits code editing and OOM avoidance; no separate evaluation is provided on tasks relying purely on numerical loss signals without syntactic code validity.
  3. [Experiments] The comparison between fixed search space (where classical dominate) and code editing (where gap narrows) lacks details on the exact search space definition, failure modes controlled for, and how state tracking failures in LLMs were quantified across trials.
minor comments (2)
  1. [Abstract] Clarify the exact model versions used (e.g., 'Claude Opus 4.6' appears non-standard; confirm if this is a placeholder or specific release).
  2. [Introduction] The manuscript would benefit from a clearer definition of the 'autoresearch' testbed early on to help readers understand the experimental setup before the results.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened for clarity and rigor. We address each major comment point by point below and outline the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'a 0.8B LLM already suffices to outperform all classical and pure LLM methods' is presented without reference to the number of experimental runs, error bars, or statistical significance tests, undermining verifiability of the outperformance.

    Authors: We agree that these details are necessary to support the claim. In the revised manuscript we will explicitly state that results are averaged over 5 independent runs, include standard error bars in all performance plots and tables, and report p-values from paired Wilcoxon signed-rank tests (p < 0.05) confirming that the 0.8B Centaur variant significantly outperforms both classical baselines and pure LLM agents (a minimal sketch of such a test appears after these responses). revision: yes

  2. Referee: [Centaur and Ablations] The ablation study on the fraction of LLM-proposed trials is conducted exclusively within the autoresearch framework that permits code editing and OOM avoidance; no separate evaluation is provided on tasks relying purely on numerical loss signals without syntactic code validity.

    Authors: We acknowledge that the ablation is tied to the code-editing testbed. We will add a new paragraph in the discussion section clarifying that the observed benefits of LLM proposals partly rely on syntactic validity checking and OOM avoidance, which are not available in standard numerical HPO. We will also note this as a scope limitation and suggest that future work should evaluate the hybrid on purely numerical benchmarks; no additional experiments are feasible within the current study. revision: partial

  3. Referee: [Experiments] The comparison between fixed search space (where classical dominate) and code editing (where gap narrows) lacks details on the exact search space definition, failure modes controlled for, and how state tracking failures in LLMs were quantified across trials.

    Authors: We agree that these methodological details should be expanded. The revised version will include: (i) the complete fixed search space definition with all hyperparameter names, ranges, and types; (ii) an enumerated list of failure modes (OOM, syntax errors, runtime exceptions) and the exact detection logic used; and (iii) a quantitative measure of state-tracking failures, defined as the fraction of LLM proposals that omitted or incorrectly referenced prior best values or covariance information, computed via automated logging and manual verification on a random sample of 200 trajectories (a sketch of this counting logic appears after these responses). revision: yes
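
To make response 1 concrete, a minimal sketch of the paired Wilcoxon signed-rank test the rebuttal proposes, using SciPy; all loss values are invented placeholders for five paired runs.

```python
# Sketch of the paired Wilcoxon signed-rank test proposed in response 1.
# All loss values are invented placeholders for five paired runs.
from scipy.stats import wilcoxon

centaur_losses = [2.91, 2.88, 2.95, 2.90, 2.87]  # hypothetical
cmaes_losses   = [3.02, 2.99, 3.05, 3.01, 2.98]  # hypothetical

stat, p = wilcoxon(centaur_losses, cmaes_losses)  # paired, two-sided
print(f"W={stat}, p={p:.4f}")  # compare p against the 0.05 threshold
```

One caveat worth noting: with only five paired runs, the smallest attainable two-sided p-value for the exact test is 2/2^5 = 0.0625, so clearing p < 0.05 with this test would require more runs or a one-sided formulation.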
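
And for response 3, a sketch of how the proposed state-tracking failure rate could be computed from logged trajectories; the log schema (`claimed_prior_best`, `observed_loss`) is a hypothetical format, not the authors' logging.

```python
# Sketch of the state-tracking failure rate defined in response 3. The
# log schema (`claimed_prior_best`, `observed_loss`) is hypothetical.
def state_tracking_failure_rate(trajectories: list[list[dict]]) -> float:
    """Fraction of LLM proposals that omit or misstate the prior best."""
    total = failures = 0
    for traj in trajectories:
        best_so_far = float("inf")
        for step in traj:
            if best_so_far < float("inf"):  # a prior best exists to cite
                total += 1
                claimed = step.get("claimed_prior_best")
                if claimed is None or abs(claimed - best_so_far) > 1e-6:
                    failures += 1  # omitted or stale/incorrect reference
            best_so_far = min(best_so_far, step["observed_loss"])
    return failures / max(total, 1)

traj = [{"observed_loss": 3.1},
        {"claimed_prior_best": 3.1, "observed_loss": 2.9},
        {"claimed_prior_best": 3.1, "observed_loss": 2.8}]  # stale cite
print(state_tracking_failure_rate([traj]))  # 0.5
```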

Circularity Check

0 steps flagged

No significant circularity: purely empirical comparison

Full rationale

The paper conducts an empirical evaluation of classical HPO methods (CMA-ES, TPE) versus LLM agents and the hybrid Centaur on the autoresearch code-editing benchmark. All performance claims rest on observed metrics from fixed-compute experiments against external, independently implemented baselines. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation of results; the hybrid is defined by explicit state sharing rather than by construction from its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the autoresearch testbed experiments and the standard behavior of CMA-ES/TPE; no major free parameters are introduced beyond the experimental setup itself.

axioms (1)
  • standard math — Standard implementations of CMA-ES and TPE behave as documented in the optimization literature.
    The paper relies on these algorithms as established baselines without re-deriving them.
invented entities (1)
  • Centaur hybrid optimizer — no independent evidence
    purpose: To combine interpretable CMA-ES state with LLM proposal generation.
    New method introduced in this work to address limitations of pure LLM and pure classical approaches.

pith-pipeline@v0.9.0 · 5629 in / 1238 out tokens · 55525 ms · 2026-05-15T00:22:34.196387+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-Ended Task Discovery via Bayesian Optimization

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.

  2. AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

    cs.IR · 2026-04 · unverdicted · novelty 5.0

    AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.

  3. AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

    cs.IR · 2026-04 · unverdicted · novelty 4.0

    AgenticRecTune deploys Actor, Critic, Insight, Skill, and Online agents plus a self-evolving Skillhub to propose, filter, test, and learn from recommendation system configurations using Gemini LLMs.