pith. sign in

arxiv: 2606.25207 · v1 · pith:IZAEQOW2new · submitted 2026-06-23 · 💻 cs.LG · cs.AI· cs.CL

ASAP: Agent-System Co-Design for Wall-Clock-Centered Auto HPO Research for ML Experiments

Pith reviewed 2026-06-25 23:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords hyperparameter optimizationLLM agentsagent-system co-designwall-clock optimizationspeculation parallelisminductive biasHPO toolsKV-cache reuse
0
0 comments X

The pith

An LLM integrates proposals from multiple hyperparameter optimizers while system redesigns hide their latency under model evaluations to improve end-to-end wall-clock performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that every HPO tool carries a fixed inductive bias that limits it on diverse problems, and that prior LLM-based HPO approaches merely substitute one such bias while measuring only iteration count rather than real runtime. ASAP instead routes the LLM to combine and select among a pool of existing optimizers each round. On the system side it adds a prefix-stable prompt for KV-cache reuse, a speculation mechanism that runs LLM and tool calls in parallel with model evaluation under a relative-error accept test, and an off-critical-path Self-Tuner that adjusts the speculation threshold from logs. Experiments across modern HPO tasks show consistent gains over baselines, indicating that the combination of tool diversity and wall-clock co-design translates iteration improvements into practical speedups.

Core claim

ASAP shows that an LLM can serve as an integrator that draws proposals from a diverse pool of inductive-biased optimizers and selects among them each round, while a re-architected execution loop—prefix-stable prompting for KV-cache reuse, speculation parallelism hidden behind model evaluations via a relative-error accept test, and an off-path Self-Tuner—reduces total wall-clock time without increasing regret, producing better configurations than either single-tool or iteration-centric LLM baselines on varied HPO tasks.

What carries the argument

The LLM integrator that selects among proposals from multiple optimizers, paired with speculation parallelism and relative-error accept test to hide latency.

If this is right

  • HPO performance can improve by drawing on multiple priors simultaneously rather than committing to one tool or one LLM replacement.
  • Practical HPO evaluation must measure total elapsed time, not iteration count, once LLM or tool calls add serial cost.
  • Prefix-stable prompts and relative-error speculation can be reused in other iterative loops that mix LLM calls with expensive evaluations.
  • An off-path Self-Tuner can adapt runtime parameters from execution history while keeping the main optimization path unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same integrator-plus-speculation pattern may apply to other agentic workflows that currently treat LLMs as drop-in replacements for single specialized modules.
  • Benchmarks for agent systems could usefully report both regret curves and cumulative wall-clock cost under realistic latency models.
  • If the relative-error accept test proves robust, similar approximate checks might safely accelerate other LLM-augmented search or planning loops.

Load-bearing premise

The LLM can reliably combine and choose among proposals from different optimizers without its own pretraining bias erasing the diversity those optimizers supply.

What would settle it

A collection of HPO problems on which, after a fixed wall-clock budget, ASAP reaches final model performance no better than the strongest single baseline optimizer, or on which the speculation accept test rejects enough good proposals to raise final regret.

Figures

Figures reproduced from arXiv: 2606.25207 by Haomin Zhuang, Kehan Guo, Nitesh V. Chawla, Olaf Wiest, Taicheng Guo, Xiangliang Zhang, Yujun Zhou.

Figure 1
Figure 1. Figure 1: Motivation of ASAP. Prior LLM-HPO replaces a [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: ASAP execution loop. After a serial warm-up, the Speculator predicts round [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Normalized regret vs. (a) end-to-end wall-clock and (b) iteration, aggregated over the 28 combinations from HPOBench [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance by optimization-landscapes. ASAP [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Fraction of Judge selections per tool. Rugged, where their smooth-kernel/network prior mismatches the jagged landscape). Integration turns the diversity that defeats single tools into uniform robustness [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Where ASAP applies and where speculation helps, [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Speculation does not change which hyperparame [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Self-Tuner behavior on the speculate-effective [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Case studies (RQ3) on three (benchmark, model) combinations (PD1/Transformer, PD1/Wide-ResNet, HPOBench/XG [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Trivial-eval (sub-second sklearn HPO) on the full Bayesmark grid. ASAP (red, ours) retains its per-iteration advantage [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
read the original abstract

Hyperparameter Optimization (HPO) is essential for maximizing machine learning model performance, and its core challenge is sample efficiency: finding strong configurations within a limited budget. Because every HPO tool relies on a surrogate prior that imparts its own inductive bias, individual tools struggle once problems become sufficiently diverse and drift from these priors. Motivated by the reasoning and generalization capabilities of LLMs, recent work has explored using LLMs for HPO and reports improved per-iteration performance. Yet these methods share two limitations with a common origin: they use the LLM as a single-tool replacement evaluated by iteration count. (i) Deployed in place of prior tools, the LLM is itself constrained by its pretraining objective to one family of inductive-biased proposals; this single-source setup still fails to handle the full diversity of problems. (ii) Per-iteration evaluation ignores that, in real runs, LLM inference or tool execution is paid serially on top of model evaluation every round, so iteration-count gains do not translate into end-to-end wall-clock gains. We present ASAP, an agent-system co-design that addresses both limitations. On the agent side, ASAP uses the LLM to integrate a diverse pool of inductive-biased optimizers and to select among their proposals each round. On the system side, ASAP re-architects the loop to reduce end-to-end wall-clock while preserving regret quality: a prefix-stable prompt maximizes KV-cache reuse across rounds; speculation parallelism hides the remaining LLM and tool latency under model evaluation via a relative-error accept test; and a Self-Tuner adapts the speculation threshold from execution logs off the critical path. Extensive experiments on diverse modern HPO tasks show that ASAP consistently outperforms baselines, underscoring the value of tool integration and agent-system co-design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ASAP, an agent-system co-design for hyperparameter optimization (HPO). The agent side uses an LLM to integrate proposals from a diverse pool of inductive-biased optimizers and select among them each round. The system side adds a prefix-stable prompt for KV-cache reuse, speculation parallelism that hides LLM/tool latency under model evaluation via a relative-error accept test, and a Self-Tuner that adapts the speculation threshold from logs. The central empirical claim is that this combination yields both lower wall-clock time and better or equal regret than baselines on diverse modern HPO tasks.

Significance. If the reported regret curves, wall-clock breakdowns, and ablations hold, the work demonstrates that LLM-based integration of existing tools plus targeted system mechanisms can overcome single-source inductive bias while delivering practical end-to-end speedups. The explicit measurement of proposal-source diversity and confirmation that the accept test preserves final regret are concrete strengths that move the field beyond iteration-count metrics.

major comments (2)
  1. [§4] §4 (system mechanisms), the relative-error accept test: the manuscript shows empirically that the test does not measurably degrade final regret on the reported benchmarks, but provides no analytic bound on the bias introduced when the speculation threshold is adapted by the Self-Tuner; a short worst-case argument or additional synthetic experiment would strengthen the claim that regret quality is preserved by construction rather than only on the chosen tasks.
  2. [§5.3] §5.3 (ablations), proposal-source diversity: while the paper reports that the LLM selects from multiple sources, the diversity metric (e.g., fraction of proposals accepted per source) is only shown aggregated; per-task breakdowns would clarify whether the claimed advantage collapses on problems where one optimizer dominates.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 repeatedly contrast “per-iteration” vs. “wall-clock” evaluation; adding a one-sentence definition of the wall-clock metric (including whether it counts only model evaluations or also LLM inference) would remove ambiguity for readers.
  2. [Figures in §5] Figure captions for the regret curves do not state the number of independent runs or whether shaded regions are standard deviation or standard error; this is a common omission that affects interpretability of the “consistent outperformance” claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address each of the major comments below.

read point-by-point responses
  1. Referee: [§4] §4 (system mechanisms), the relative-error accept test: the manuscript shows empirically that the test does not measurably degrade final regret on the reported benchmarks, but provides no analytic bound on the bias introduced when the speculation threshold is adapted by the Self-Tuner; a short worst-case argument or additional synthetic experiment would strengthen the claim that regret quality is preserved by construction rather than only on the chosen tasks.

    Authors: We acknowledge that an analytic bound on the bias would provide stronger theoretical guarantees. Deriving such a bound is non-trivial because the Self-Tuner adapts the threshold based on execution logs in a data-dependent manner. However, we will add a synthetic experiment in the revised version that simulates worst-case adaptation scenarios to empirically bound the potential bias, thereby addressing the concern that regret preservation holds only on the chosen tasks. revision: yes

  2. Referee: [§5.3] §5.3 (ablations), proposal-source diversity: while the paper reports that the LLM selects from multiple sources, the diversity metric (e.g., fraction of proposals accepted per source) is only shown aggregated; per-task breakdowns would clarify whether the claimed advantage collapses on problems where one optimizer dominates.

    Authors: The aggregated diversity metric was presented to demonstrate the overall utilization of multiple sources across the benchmark suite. We agree that per-task breakdowns can provide additional clarity. In the revision, we will include per-task diversity metrics (e.g., fraction of proposals accepted per source for each task) to confirm that the multi-source advantage does not collapse on any individual problem. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is an empirical systems contribution describing an agent-system co-design (LLM integration of optimizer proposals plus speculation accept test and Self-Tuner) whose central claim is that extensive experiments on modern HPO tasks show consistent outperformance. No equations, parameter fits, predictions derived from inputs, or load-bearing self-citations appear in the provided text; the result is presented as a measured benchmark outcome rather than a derivation that reduces to its own definitions or prior author work by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach relies on standard LLM prompting and existing HPO tools whose internal assumptions are not detailed here.

pith-pipeline@v0.9.1-grok · 5887 in / 1138 out tokens · 16579 ms · 2026-06-25T23:27:47.419351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 26 canonical work pages · 11 internal anchors

  1. [1]

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Frame- work. arXiv:1907.10902 [cs.LG] https://arxiv.org/abs/1907.10902

  2. [2]

    Huan ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, and Mengdi Wang. 2026. A Survey...

  3. [3]

    James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for Hyper-Parameter Optimization. InAdvances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger (Eds.), Vol. 24. Curran Associates, Inc. https://proceedings.neurips.cc/paper_ files/paper/2011/file/86e8f7ab32cfd12577b...

  4. [4]

    James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-Parameter Optimization.Journal of Machine Learning Research13, 10 (2012), 281–305. http://jmlr.org/papers/v13/bergstra12a.html

  5. [5]

    Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. 2025. Multi-Agent Evolve: LLM Self-Improve through Co-evolution. arXiv:2510.23595 [cs.AI] https://arxiv.org/abs/2510.23595

  6. [6]

    Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, and Chenglin Wu. 2024. SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning. arXiv:2410.17238 [cs.AI] https://arxiv.org/abs/2410.17238

  7. [7]

    Skeel, and Hartmut Neven

    Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. 2014. Bayesian Sampling Using Stochastic Gradient Ther- mostats. InAdvances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger (Eds.), Vol. 27. Cur- ran Associates, Inc. https://proceedings.neurips.cc/paper_fi...

  8. [8]

    Katharina Eggensperger, Philipp Müller, Neeratyoy Mallik, Matthias Feurer, René Sass, Aaron Klein, Noor Awad, Marius Lindauer, and Frank Hutter. 2022. HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO. arXiv:2109.06716 [cs.LG] https://arxiv.org/abs/2109.06716

  9. [9]

    David Eriksson, Michael Pearce, Jacob R Gardner, Ryan Turner, and Matthias Poloczek. 2020. Scalable Global Optimization via Local Bayesian Optimization. arXiv:1910.01739 [cs.LG] https://arxiv.org/abs/1910.01739

  10. [10]

    Yilin Guan, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang, and Wenyue Hua. 2025. Dynamic Speculative Agent Plan- ning. arXiv:2509.01920 [cs.AI] https://arxiv.org/abs/2509.01920

  11. [11]

    Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang

  12. [12]

    Ds-agent: Automated data science by empowering large language models with case-based reasoning

    DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning. arXiv:2402.17453 [cs.LG] https://arxiv.org/ abs/2402.17453

  13. [13]

    Taicheng Guo, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2026. Au- toLLMResearch: Training Research Agents for Automating LLM Experiment Configuration–Learning from Cheap, Optimizing Expensive.arXiv preprint arXiv:2605.11518(2026)

  14. [14]

    Taicheng Guo, Hai Wang, ChaoChun Liu, Mohsen Golalikhani, Xin Chen, Xian- gliang Zhang, and Chandan K Reddy. 2025. MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training.arXiv preprint arXiv:2510.12831 (2025)

  15. [15]

    Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Chenxing Wei, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Xiangru Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Zhibin Gou, Zongze Xu, and Chenglin Wu. 2024. Data Inte...

  16. [16]

    Pascal Kerschke. 2017. Comprehensive Feature-Based Landscape Analysis of Continuous and Constrained Optimization Problems Using the R-Package flacco. arXiv:1708.05258 [stat.ML] https://arxiv.org/abs/1708.05258 ASAP: Agent-System Co-Design for Wall-Clock-Centered A uto HPO Research for ML Experiments

  17. [17]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192 [cs.LG] https://arxiv. org/abs/2211.17192

  18. [18]

    Yucen Lily Li, Tim G. J. Rudner, and Andrew Gordon Wilson. 2024. A Study of Bayesian Neural Network Surrogates for Bayesian Optimization. arXiv:2305.20028 [cs.LG] https://arxiv.org/abs/2305.20028

  19. [19]

    Marius Lindauer, Katharina Eggensperger, Matthias Feurer, André Biedenkapp, Difan Deng, Carolin Benjamins, Tim Ruhopf, René Sass, and Frank Hutter. 2022. SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Opti- mization. arXiv:2109.09831 [cs.LG] https://arxiv.org/abs/2109.09831

  20. [20]

    Siyi Liu, Chen Gao, and Yong Li. 2025. Large Language Model Agent for Hyper- Parameter Optimization. arXiv:2402.01881 [cs.LG] https://arxiv.org/abs/2402. 01881

  21. [21]

    Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. 2024. Large Language Models to Enhance Bayesian Optimization. arXiv:2402.03921 [cs.LG] https://arxiv.org/abs/2402.03921

  22. [22]

    Yeonju Ro, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Ricardo Bianchini, Aditya Akella, Zhangyang Wang, Mattan Erez, and Esha Choukse. 2025. Sherlock: Reliable and Efficient Agentic Workflow Execution. arXiv:2511.00330 [cs.MA] https://arxiv.org/abs/2511.00330

  23. [23]

    scikit optimize. [n. d.]. https://scikit-optimize.github.io/stable/

  24. [24]

    Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv:1206.2944 [stat.ML] https: //arxiv.org/abs/1206.2944

  25. [25]

    Scalable Bayesian Optimization Using Deep Neural Networks

    Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. arXiv:1502.05700 [stat.ML] https://arxiv.org/abs/1502.05700

  26. [26]

    Alexander Tornede, Difan Deng, Theresa Eimer, Joseph Giovanelli, Aditya Mo- han, Tim Ruhkopf, Sarah Segel, Daphne Theodorakopoulos, Tanja Tornede, Henning Wachsmuth, and Marius Lindauer. 2024. AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks. arXiv:2306.08107 [cs.LG] https://arxiv.org/abs/2306.08107

  27. [27]

    Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani

    Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani. 2024. Pre-trained Gaussian Processes for Bayesian Optimization. arXiv:2109.08215 [cs.LG] https://arxiv.org/ abs/2109.08215

  28. [28]

    Shuhei Watanabe. 2026. Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance. arXiv:2304.11127 [cs.LG] https://arxiv.org/abs/2304.11127

  29. [29]

    Christopher Williams and Carl Rasmussen. 1995. Gaussian Processes for Regres- sion. InAdvances in Neural Information Processing Systems, D. Touretzky, M.C. Mozer, and M. Hasselmo (Eds.), Vol. 8. MIT Press. https://proceedings.neurips. cc/paper_files/paper/1995/file/7cce53cf90577442771720a370c3c723-Paper.pdf

  30. [30]

    David H Wolpert and William G Macready. 2002. No free lunch theorems for optimization.IEEE transactions on evolutionary computation1, 1 (2002), 67–82

  31. [31]

    Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. 2026. EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle. arXiv:2510.16079 [cs.CL] https://arxiv.org/abs/2510.16079

  32. [32]

    Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. 2026. Speculative Actions: A Lossless Framework for Faster Agentic Systems. arXiv:2510.04371 [cs.AI] https://arxiv.org/abs/2510.04371

  33. [33]

    ML Eval Time

    Michael R. Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. 2024. Using Large Language Models for Hyperparameter Optimization. arXiv:2312.04528 [cs.LG] https://arxiv.org/abs/2312.04528 Guo et al. Appendix A Implementation Details This section collects everything needed to reproduce our exper- iments: the benchmark tasks we evaluate on (S...