
arxiv: 2605.11518 · v2 · pith:NPY6YP34new · submitted 2026-05-12 · 💻 cs.AI · cs.CL · cs.LG

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG
keywords LLM experiment configuration · multi-fidelity learning · reinforcement learning agents · hyperparameter optimization · experiment automation · Markov Decision Process · AutoML for LLMs

The pith

Agents can learn principles from cheap low-fidelity LLM experiments to configure expensive high-fidelity ones more effectively than baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoLLMResearch as a way to train agents that automate the configuration of large language model experiments. It builds LLMConfig-Gym, a multi-fidelity environment drawn from over a million GPU hours of verifiable results across four key tasks, and trains agents inside a long-horizon Markov Decision Process that rewards reasoning across fidelity levels. Agents thereby extract generalizable rules from inexpensive runs and apply them to costly ones. A sympathetic reader would care because poor manual configurations waste large amounts of compute, and this method aims to reduce that waste while cutting dependence on expert intuition.

Core claim

The AutoLLMResearch framework with LLMConfig-Gym and the structured MDP training pipeline enables agents to learn generalizable principles from low-fidelity experiments and efficiently identify promising configurations in expensive LLM settings, as shown by superior performance against baselines on held-out experiments.

What carries the argument

LLMConfig-Gym, a multi-fidelity experimental environment encompassing four LLM tasks and backed by over one million GPU hours of data, combined with a structured training pipeline that formulates configuration search as a long-horizon Markov Decision Process and rewards cross-fidelity extrapolation.
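
The pairing of a budgeted environment with a sequential decision process can be made concrete with a toy sketch. Everything below is hypothetical: the class `ToyConfigEnv`, the quadratic score function, the one-step shift between the low- and high-fidelity optimum, and the `rank_then_query` policy are illustrative assumptions, not LLMConfig-Gym's interface or the paper's trained agent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConfigEnv:
    """Hypothetical stand-in for a multi-fidelity configuration environment.

    Scores come from a made-up analytic function, not from real LLM runs.
    """
    budget: int = 3                      # expensive (high-fidelity) queries left
    history: list = field(default_factory=list)

    def score(self, log2_lr, fidelity):
        # Assumed landscape: a quadratic bowl in log2(learning rate) whose
        # optimum shifts by one step between fidelities (illustrative only).
        optimum = -8.0 if fidelity == "low" else -9.0
        return -(log2_lr - optimum) ** 2

    def low_fidelity_table(self, candidates):
        # Cheap results are free to read; they form part of the initial state.
        return {c: self.score(c, "low") for c in candidates}

    def step(self, log2_lr):
        # One MDP transition: spend one unit of budget on a high-fidelity run.
        if self.budget <= 0:
            raise RuntimeError("budget exhausted")
        self.budget -= 1
        reward = self.score(log2_lr, "high")
        self.history.append((log2_lr, reward))
        state = (tuple(self.history), self.budget)
        return state, reward, self.budget == 0


def rank_then_query(env, candidates):
    """Simple baseline policy: try candidates in order of low-fidelity score."""
    low = env.low_fidelity_table(candidates)
    best = None
    for c in sorted(candidates, key=low.get, reverse=True):
        if env.budget == 0:
            break
        _, reward, _ = env.step(c)
        if best is None or reward > best[1]:
            best = (c, reward)
    return best


env = ToyConfigEnv(budget=3)
best_config, best_score = rank_then_query(
    env, [-11.0, -10.0, -9.0, -8.0, -7.0, -6.0]
)
```

In this toy landscape the low-fidelity winner is one step away from the high-fidelity optimum, so a policy that only copied the cheap winner would miss it; the budgeted follow-up queries are what recover it. A learned agent would replace `rank_then_query` with a policy that reasons about how the optimum shifts across fidelities.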

If this is right

  • Agents trained this way reduce the number of expensive trials needed to reach good LLM configurations.
  • The same trained agents generalize across multiple distinct LLM experiment tasks without retraining from scratch.
  • The MDP formulation produces interpretable reasoning traces that show how agents extrapolate across fidelity levels.
  • The overall approach supplies a practical route to automating configuration decisions that currently rely on manual expert effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-fidelity training idea could be tested on other high-cost optimization problems such as neural architecture search at scale.
  • Patterns discovered by the agents might point to previously unnoticed regularities in how model scale interacts with hyperparameter choices.
  • Integrating the trained agent directly into an automated pipeline that launches real high-fidelity runs would test end-to-end utility.

Load-bearing premise

The structure of outcomes in low-fidelity experiments must closely mirror the structure of outcomes in high-fidelity experiments so that principles learned in the former transfer usefully to the latter.
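
One way to probe this premise is to compare how the same configurations rank at each fidelity. The sketch below computes Kendall's tau (tau-a) and Spearman's rho in plain Python, assuming no tied scores. The loss values are made up for illustration and do not come from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical validation losses for the same five configurations.
low_fidelity = [3.10, 2.95, 3.40, 2.80, 3.25]
high_fidelity = [2.40, 2.35, 2.70, 2.38, 2.55]

tau = kendall_tau(low_fidelity, high_fidelity)
rho = spearman_rho(low_fidelity, high_fidelity)
```

Values near 1 would indicate that cheap runs rank configurations much like expensive ones; values near 0 would undercut the premise. Real data with tied scores would need a tie-aware variant such as tau-b.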

What would settle it

If agents trained in the multi-fidelity environment show no performance advantage over strong baselines or random search when tested on fresh held-out high-fidelity LLM configuration tasks, the central claim would be false.

Original abstract

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AutoLLMResearch, an agentic framework for automating configuration of high-cost LLM experiments. It consists of LLMConfig-Gym, a multi-fidelity environment spanning four tasks and backed by over one million GPU hours of verifiable outcomes, together with an MDP-based training pipeline that incentivizes agents to extract generalizable principles from low-fidelity runs for extrapolation to expensive settings. The authors report that the resulting agents outperform diverse baselines on held-out experiments and exhibit generalization and interpretability.

Significance. If the multi-fidelity transfer is substantiated, the work would offer a practical route to reducing the compute burden of LLM architecture and hyperparameter search, a recognized bottleneck in the field. The scale of the underlying dataset and the explicit MDP formulation for long-horizon cross-fidelity reasoning are notable strengths that could support reproducible follow-on research.

major comments (2)
  1. [LLMConfig-Gym and evaluation sections] LLMConfig-Gym description and evaluation: The central claim requires that low-fidelity tasks preserve relative rankings and optimization structure of high-fidelity LLM runs. No quantitative measure (Kendall tau, Spearman rank correlation, or landscape similarity) between low- and high-fidelity performance on identical configurations is reported, despite the environment being supported by over one million GPU hours of data. Without this, it remains possible that policies optimize for gym-specific artifacts rather than transferable LLM behavior.
  2. [Evaluation section] Evaluation and baselines: The abstract states superior performance on held-out experiments, yet the provided description supplies neither the exact baseline implementations, number of independent runs, statistical significance tests, nor ablation results isolating the contribution of the cross-fidelity MDP incentive. These omissions prevent assessment of whether the reported gains are robust or attributable to the claimed extrapolation mechanism.
minor comments (2)
  1. [MDP formulation] Notation for the MDP state and reward components could be clarified with an explicit table mapping symbols to their definitions.
  2. [Abstract] The abstract refers to 'diverse strong baselines' without naming them; a brief enumeration in the introduction would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the dataset scale and MDP formulation, and constructive major comments. We address each point below and will incorporate revisions to strengthen the claims.

Point-by-point responses
  1. Referee: [LLMConfig-Gym and evaluation sections] LLMConfig-Gym description and evaluation: The central claim requires that low-fidelity tasks preserve relative rankings and optimization structure of high-fidelity LLM runs. No quantitative measure (Kendall tau, Spearman rank correlation, or landscape similarity) between low- and high-fidelity performance on identical configurations is reported, despite the environment being supported by over one million GPU hours of data. Without this, it remains possible that policies optimize for gym-specific artifacts rather than transferable LLM behavior.

    Authors: We agree that explicit quantitative validation of rank preservation and landscape similarity between fidelities is necessary to support the central multi-fidelity transfer claim. The current manuscript describes the environment and its data backing but does not report these metrics. In the revision we will add a dedicated analysis subsection to LLMConfig-Gym that computes and reports Kendall's tau and Spearman's rank correlation on performance rankings for identical configurations across low- and high-fidelity settings, along with a simple landscape similarity measure. This will directly demonstrate that relative optimization structure is preserved rather than being gym-specific. revision: yes

  2. Referee: [Evaluation section] Evaluation and baselines: The abstract states superior performance on held-out experiments, yet the provided description supplies neither the exact baseline implementations, number of independent runs, statistical significance tests, nor ablation results isolating the contribution of the cross-fidelity MDP incentive. These omissions prevent assessment of whether the reported gains are robust or attributable to the claimed extrapolation mechanism.

    Authors: We acknowledge that the evaluation section currently lacks the level of detail required for full reproducibility and attribution assessment. In the revised manuscript we will expand this section to include: precise implementation details and code references for all baselines; the exact number of independent runs per method; results of statistical significance tests (e.g., paired t-tests with p-values); and targeted ablation experiments that compare the full cross-fidelity MDP incentive against variants without it. These additions will allow readers to evaluate robustness and confirm that performance gains stem from the extrapolation mechanism. revision: yes
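
The paired comparison mentioned in the second response can be sketched as follows. This is a minimal illustration under stated assumptions: the per-task scores are invented, the function name `paired_t_statistic` is hypothetical, and only the t statistic and degrees of freedom are computed. Turning them into a p-value requires the Student t distribution, for example through a statistics library.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(method_scores, baseline_scores):
    """Return (t, degrees_of_freedom) for paired samples.

    Each position must refer to the same held-out task or seed in both lists.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical scores on five held-out tasks (higher is better).
agent = [0.82, 0.79, 0.85, 0.81, 0.84]
baseline = [0.78, 0.77, 0.80, 0.80, 0.79]

t_value, dof = paired_t_statistic(agent, baseline)
```

Pairing by task removes between-task variance from the comparison, which matters when tasks differ widely in difficulty. The same function could be applied to an ablation by passing the scores of the variant without the cross-fidelity incentive as the baseline.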

Circularity Check

0 steps flagged

No significant circularity; empirical framework with held-out evaluation

Full rationale

The paper describes an empirical agent-training setup: LLMConfig-Gym is populated from over one million GPU hours of actual runs, the MDP formulation is a modeling choice, and performance is measured against baselines on held-out experiments. No equation or claim reduces by construction to a fitted parameter on the evaluation data, no self-citation supplies a uniqueness theorem that forces the result, and the central claim (transfer from low- to high-fidelity) is presented as an empirical finding rather than a definitional identity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the existence and fidelity of the newly introduced LLMConfig-Gym environment and on the assumption that cross-fidelity extrapolation is learnable via the MDP formulation. No explicit free parameters or invented physical entities are described; the main additions are the environment and the training pipeline.

axioms (2)
  • domain assumption: The multi-fidelity environment captures the structure of the LLM configuration landscape.
    Stated in the abstract as the core challenge the framework must solve.
  • domain assumption: Long-horizon MDP formulation is appropriate for configuration research.
    Used to structure the training pipeline.
invented entities (1)
  • LLMConfig-Gym (no independent evidence)
    purpose: Multi-fidelity environment encompassing four LLM experiment tasks with verifiable outcomes.
    Newly proposed component that supplies the interaction data for agent training.

