AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3
The pith
Agents can learn principles from cheap low-fidelity LLM experiments to configure expensive high-fidelity ones more effectively than baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The AutoLLMResearch framework with LLMConfig-Gym and the structured MDP training pipeline enables agents to learn generalizable principles from low-fidelity experiments and efficiently identify promising configurations in expensive LLM settings, as shown by superior performance against baselines on held-out experiments.
What carries the argument
LLMConfig-Gym, a multi-fidelity experimental environment encompassing four LLM tasks and backed by over one million GPU hours of data, combined with a structured training pipeline that formulates configuration search as a long-horizon Markov Decision Process and rewards cross-fidelity extrapolation.
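To make the MDP framing concrete, here is a minimal sketch of a budgeted multi-fidelity search episode. It is not the paper's LLMConfig-Gym API: the class `ToyMultiFidelityEnv`, the policy `low_fidelity_prior_policy`, the reward definition, and all loss values are hypothetical and chosen only for illustration. The state holds the cheap low-fidelity results plus the history of expensive queries, an action is one configuration to evaluate at high fidelity, and the episode ends when the budget is spent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMultiFidelityEnv:
    """Hypothetical toy environment; not the paper's LLMConfig-Gym API."""
    low_scores: dict   # config -> loss from cheap low-fidelity runs (visible up front)
    high_scores: dict  # config -> loss from expensive runs (revealed one query at a time)
    budget: int        # number of high-fidelity queries allowed per episode
    history: list = field(default_factory=list)

    def reset(self):
        self.history = []
        return self._state()

    def _state(self):
        return {
            "low": dict(self.low_scores),
            "history": list(self.history),
            "remaining": self.budget - len(self.history),
        }

    def step(self, config):
        if len(self.history) >= self.budget:
            raise RuntimeError("high-fidelity budget exhausted")
        self.history.append((config, self.high_scores[config]))
        done = len(self.history) >= self.budget
        reward = 0.0
        if done:
            # Assumed terminal reward: optimal loss divided by best loss found
            # (equals 1.0 only if the optimum was queried; assumes positive losses).
            best_found = min(loss for _, loss in self.history)
            reward = min(self.high_scores.values()) / best_found
        return self._state(), reward, done


def low_fidelity_prior_policy(state):
    """Pick the untried config with the lowest low-fidelity loss."""
    tried = {config for config, _ in state["history"]}
    untried = [c for c in state["low"] if c not in tried]
    return min(untried, key=lambda c: state["low"][c])


def run_episode(env, policy):
    state, reward, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
    return state, reward


# Made-up losses keyed by learning rate, for illustration only.
env = ToyMultiFidelityEnv(
    low_scores={1e-4: 3.9, 3e-4: 3.5, 1e-3: 3.2, 3e-3: 3.6},
    high_scores={1e-4: 3.1, 3e-4: 2.7, 1e-3: 2.5, 3e-3: 3.0},
    budget=2,
)
final_state, final_reward = run_episode(env, low_fidelity_prior_policy)
```

The policy here simply trusts the low-fidelity ranking, which succeeds only because the toy data keeps the same optimum at both fidelities. A trained agent would replace `low_fidelity_prior_policy` with a learned policy that conditions on the full state.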
If this is right
- Agents trained this way reduce the number of expensive trials needed to reach good LLM configurations.
- The same trained agents generalize across multiple distinct LLM experiment tasks without retraining from scratch.
- The MDP formulation produces interpretable reasoning traces that show how agents extrapolate across fidelity levels.
- The overall approach supplies a practical route to automating configuration decisions that currently rely on manual expert effort.
Where Pith is reading between the lines
- The same multi-fidelity training idea could be tested on other high-cost optimization problems such as neural architecture search at scale.
- Patterns discovered by the agents might point to previously unnoticed regularities in how model scale interacts with hyperparameter choices.
- Integrating the trained agent directly into an automated pipeline that launches real high-fidelity runs would test end-to-end utility.
Load-bearing premise
The structure of outcomes in low-fidelity experiments must closely mirror the structure of outcomes in high-fidelity experiments so that principles learned in the former transfer usefully to the latter.
What would settle it
If agents trained in the multi-fidelity environment show no performance advantage over strong baselines or random search when tested on fresh held-out high-fidelity LLM configuration tasks, the central claim would be false.
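One way to picture the random-search comparison is to compute the exact expected best loss that random search reaches under the same query budget, then check whether the agent beats it. The sketch below is an illustration under stated assumptions: the function name `random_search_expected_best`, the loss values, and the agent's result are all hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import fmean


def random_search_expected_best(losses, k):
    """Exact expected minimum loss when k distinct configs are drawn uniformly at random."""
    if not 1 <= k <= len(losses):
        raise ValueError("k must be between 1 and the number of configurations")
    return fmean([min(subset) for subset in combinations(losses, k)])


# Hypothetical high-fidelity losses for one held-out task (illustration only).
high_losses = [3.1, 2.7, 2.5, 3.0]
baseline = random_search_expected_best(high_losses, k=2)

# Hypothetical best loss found by an agent with the same budget of 2 queries.
agent_best = 2.5
advantage = baseline - agent_best  # positive means the agent beat random search
```

Enumerating all subsets is only feasible for small configuration spaces. For larger spaces, a Monte Carlo estimate over sampled subsets would serve the same purpose.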
Original abstract
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AutoLLMResearch, an agentic framework for automating configuration of high-cost LLM experiments. It consists of LLMConfig-Gym, a multi-fidelity environment spanning four tasks and backed by over one million GPU hours of verifiable outcomes, together with an MDP-based training pipeline that incentivizes agents to extract generalizable principles from low-fidelity runs for extrapolation to expensive settings. The authors report that the resulting agents outperform diverse baselines on held-out experiments and exhibit generalization and interpretability.
Significance. If the multi-fidelity transfer is substantiated, the work would offer a practical route to reducing the compute burden of LLM architecture and hyperparameter search, a recognized bottleneck in the field. The scale of the underlying dataset and the explicit MDP formulation for long-horizon cross-fidelity reasoning are notable strengths that could support reproducible follow-on research.
major comments (2)
- [LLMConfig-Gym and evaluation sections] LLMConfig-Gym description and evaluation: The central claim requires that low-fidelity tasks preserve relative rankings and optimization structure of high-fidelity LLM runs. No quantitative measure (Kendall tau, Spearman rank correlation, or landscape similarity) between low- and high-fidelity performance on identical configurations is reported, despite the environment being supported by over one million GPU hours of data. Without this, it remains possible that policies optimize for gym-specific artifacts rather than transferable LLM behavior.
- [Evaluation section] Evaluation and baselines: The abstract states superior performance on held-out experiments, yet the provided description omits the exact baseline implementations, the number of independent runs, statistical significance tests, and ablation results isolating the contribution of the cross-fidelity MDP incentive. These omissions prevent assessment of whether the reported gains are robust or attributable to the claimed extrapolation mechanism.
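The rank-preservation check requested in the first major comment can be sketched in a few lines. This is a minimal standard-library illustration, not the authors' analysis: the function names are hypothetical, the loss values are made up, and both formulas assume there are no tied scores (Kendall's tau-a and the simple Spearman formula).

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)). Valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Made-up losses for five identical configurations at two fidelities.
low_fidelity = [3.9, 3.5, 3.2, 3.6, 3.4]
high_fidelity = [3.1, 2.6, 2.5, 3.0, 2.7]

tau = kendall_tau(low_fidelity, high_fidelity)
rho = spearman_rho(low_fidelity, high_fidelity)
```

In this toy data one pair of configurations swaps order between fidelities, so both coefficients fall slightly below 1. Values near 1 on real data would support the transfer premise, while values near 0 would suggest the low-fidelity ranking carries little information about the high-fidelity one.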
minor comments (2)
- [MDP formulation] Notation for the MDP state and reward components could be clarified with an explicit table mapping symbols to their definitions.
- [Abstract] The abstract refers to 'diverse strong baselines' without naming them; a brief enumeration in the introduction would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the dataset scale and MDP formulation, and constructive major comments. We address each point below and will incorporate revisions to strengthen the claims.
Point-by-point responses
- Referee: [LLMConfig-Gym and evaluation sections] LLMConfig-Gym description and evaluation: The central claim requires that low-fidelity tasks preserve relative rankings and optimization structure of high-fidelity LLM runs. No quantitative measure (Kendall tau, Spearman rank correlation, or landscape similarity) between low- and high-fidelity performance on identical configurations is reported, despite the environment being supported by over one million GPU hours of data. Without this, it remains possible that policies optimize for gym-specific artifacts rather than transferable LLM behavior.
  Authors: We agree that explicit quantitative validation of rank preservation and landscape similarity between fidelities is necessary to support the central multi-fidelity transfer claim. The current manuscript describes the environment and its data backing but does not report these metrics. In the revision we will add a dedicated analysis subsection to LLMConfig-Gym that computes and reports Kendall's tau and Spearman's rank correlation on performance rankings for identical configurations across low- and high-fidelity settings, along with a simple landscape similarity measure. This will directly demonstrate that relative optimization structure is preserved rather than being gym-specific.
  Revision: yes
- Referee: [Evaluation section] Evaluation and baselines: The abstract states superior performance on held-out experiments, yet the provided description omits the exact baseline implementations, the number of independent runs, statistical significance tests, and ablation results isolating the contribution of the cross-fidelity MDP incentive. These omissions prevent assessment of whether the reported gains are robust or attributable to the claimed extrapolation mechanism.
  Authors: We acknowledge that the evaluation section currently lacks the level of detail required for full reproducibility and attribution assessment. In the revised manuscript we will expand this section to include: precise implementation details and code references for all baselines; the exact number of independent runs per method; results of statistical significance tests (e.g., paired t-tests with p-values); and targeted ablation experiments that compare the full cross-fidelity MDP incentive against variants without it. These additions will allow readers to evaluate robustness and confirm that performance gains stem from the extrapolation mechanism.
  Revision: yes
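The paired t-test mentioned in the response compares two methods on the same held-out tasks by testing whether the mean per-task difference is zero. Below is a minimal standard-library sketch of the test statistic. The function name and the scores are hypothetical, and the p-value is deliberately left out because it requires a Student's t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up scores on four held-out tasks (higher is better), for illustration only.
agent_scores = [0.82, 0.79, 0.85, 0.80]
baseline_scores = [0.78, 0.77, 0.80, 0.79]

t_stat, dof = paired_t_statistic(agent_scores, baseline_scores)
```

A large positive `t_stat` relative to the critical value for `dof` degrees of freedom would indicate a consistent per-task advantage. With only a handful of tasks or runs, the test has little power, which is one reason the number of independent runs matters.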
Circularity Check
No significant circularity; empirical framework with held-out evaluation
Full rationale
The paper describes an empirical agent-training setup: LLMConfig-Gym is populated from over one million GPU hours of actual runs, the MDP formulation is a modeling choice, and performance is measured against baselines on held-out experiments. No equation or claim reduces by construction to a fitted parameter on the evaluation data, no self-citation supplies a uniqueness theorem that forces the result, and the central claim (transfer from low- to high-fidelity) is presented as an empirical finding rather than a definitional identity. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The multi-fidelity environment captures the structure of the LLM configuration landscape
- domain assumption: Long-horizon MDP formulation is appropriate for configuration research
invented entities (1)
- LLMConfig-Gym: no independent evidence