AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

Nitesh V. Chawla; Olaf Wiest; Taicheng Guo; Xiangliang Zhang

arxiv: 2605.11518 · v2 · pith:NPY6YP34new · submitted 2026-05-12 · 💻 cs.AI · cs.CL · cs.LG

Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords LLM experiment configuration · multi-fidelity learning · reinforcement learning agents · hyperparameter optimization · experiment automation · Markov Decision Process · AutoML for LLMs

The pith

Agents can learn principles from cheap low-fidelity LLM experiments to configure expensive high-fidelity ones more effectively than baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoLLMResearch as a way to train agents that automate the configuration of large language model experiments. It builds LLMConfig-Gym, a multi-fidelity environment drawn from over a million GPU hours of verifiable results across four key tasks, and trains agents inside a long-horizon Markov Decision Process that rewards reasoning across fidelity levels. Agents thereby extract generalizable rules from inexpensive runs and apply them to costly ones. A sympathetic reader would care because poor manual configurations waste large amounts of compute, and this method aims to reduce that waste while cutting dependence on expert intuition.

Core claim

The AutoLLMResearch framework with LLMConfig-Gym and the structured MDP training pipeline enables agents to learn generalizable principles from low-fidelity experiments and efficiently identify promising configurations in expensive LLM settings, as shown by superior performance against baselines on held-out experiments.

What carries the argument

LLMConfig-Gym, a multi-fidelity experimental environment encompassing four LLM tasks and backed by over one million GPU hours of data, combined with a structured training pipeline that formulates configuration search as a long-horizon Markov Decision Process and rewards cross-fidelity extrapolation.
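
As a rough illustration of the formulation described above, the Python sketch below models configuration search as a budgeted episode over two fidelity levels. Everything in it is an assumption made for illustration: the class name `ToyConfigEnv`, the synthetic score surface, the per-query costs, the reward, and the "halve the learning rate" extrapolation rule are hypothetical and are not taken from LLMConfig-Gym or from the paper's training pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical toy environment: NOT the paper's LLMConfig-Gym.
# State: history of (fidelity, learning_rate, score) plus remaining budget.
# Action: choose a fidelity level and a learning rate to evaluate.

COST = {"low": 1, "high": 10}             # assumed relative cost per query
BEST_LR = {"low": 0.004, "high": 0.002}   # assumed optimum at each fidelity


def score(fidelity: str, lr: float) -> float:
    """Assumed synthetic score: higher is better, peaked at BEST_LR."""
    return 1.0 - ((lr - BEST_LR[fidelity]) / BEST_LR[fidelity]) ** 2


@dataclass
class ToyConfigEnv:
    budget: int
    history: list = field(default_factory=list)

    def step(self, fidelity: str, lr: float):
        """Run one experiment and return (observation, reward, done)."""
        cost = COST[fidelity]
        if cost > self.budget:
            raise ValueError("not enough budget for this query")
        self.budget -= cost
        obs = (fidelity, lr, score(fidelity, lr))
        self.history.append(obs)
        # Assumed reward: only high-fidelity outcomes count toward the return.
        reward = obs[2] if fidelity == "high" else 0.0
        done = self.budget < min(COST.values())
        return obs, reward, done


def run_episode(budget: int = 14) -> float:
    """Hand-written policy: probe cheaply, then extrapolate to high fidelity."""
    env = ToyConfigEnv(budget)
    for lr in [0.001, 0.002, 0.004, 0.008]:   # four cheap probes
        env.step("low", lr)
    best_low = max(env.history, key=lambda o: o[2])[1]
    # Assumed extrapolation rule: halve the best low-fidelity learning rate.
    _, reward, _ = env.step("high", best_low / 2)
    return reward
```

In this toy setting the hand-written policy spends four budget units on cheap probes and the remaining ten on a single expensive run. A trained agent would have to learn such an extrapolation rule from interaction rather than have it hard-coded.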

If this is right

Agents trained this way reduce the number of expensive trials needed to reach good LLM configurations.
The same trained agents generalize across multiple distinct LLM experiment tasks without retraining from scratch.
The MDP formulation produces interpretable reasoning traces that show how agents extrapolate across fidelity levels.
The overall approach supplies a practical route to automating configuration decisions that currently rely on manual expert effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-fidelity training idea could be tested on other high-cost optimization problems such as neural architecture search at scale.
Patterns discovered by the agents might point to previously unnoticed regularities in how model scale interacts with hyperparameter choices.
Integrating the trained agent directly into an automated pipeline that launches real high-fidelity runs would test end-to-end utility.

Load-bearing premise

The structure of outcomes in low-fidelity experiments must closely mirror the structure of outcomes in high-fidelity experiments so that principles learned in the former transfer usefully to the latter.

What would settle it

If agents trained in the multi-fidelity environment show no performance advantage over strong baselines or random search when tested on fresh held-out high-fidelity LLM configuration tasks, the central claim would be false.

Original abstract

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper introduces a new multi-fidelity gym and MDP agent for high-cost LLM config search backed by real large-scale data, but the transfer from cheap to expensive runs rests on an assumption that still lacks direct verification.

The main thing to know is that this work builds LLMConfig-Gym, a multi-fidelity environment covering four LLM experiment tasks and supported by over a million GPU hours of actual outcomes. They train agents via a long-horizon MDP to pick up patterns from low-fidelity runs and use them on expensive high-fidelity ones. That framing targets a genuine bottleneck where expert intuition currently drives most decisions because brute-force search is too costly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AutoLLMResearch, an agentic framework for automating configuration of high-cost LLM experiments. It consists of LLMConfig-Gym, a multi-fidelity environment spanning four tasks and backed by over one million GPU hours of verifiable outcomes, together with an MDP-based training pipeline that incentivizes agents to extract generalizable principles from low-fidelity runs for extrapolation to expensive settings. The authors report that the resulting agents outperform diverse baselines on held-out experiments and exhibit generalization and interpretability.

Significance. If the multi-fidelity transfer is substantiated, the work would offer a practical route to reducing the compute burden of LLM architecture and hyperparameter search, a recognized bottleneck in the field. The scale of the underlying dataset and the explicit MDP formulation for long-horizon cross-fidelity reasoning are notable strengths that could support reproducible follow-on research.

major comments (2)

[LLMConfig-Gym and evaluation sections] LLMConfig-Gym description and evaluation: The central claim requires that low-fidelity tasks preserve relative rankings and optimization structure of high-fidelity LLM runs. No quantitative measure (Kendall tau, Spearman rank correlation, or landscape similarity) between low- and high-fidelity performance on identical configurations is reported, despite the environment being supported by over one million GPU hours of data. Without this, it remains possible that policies optimize for gym-specific artifacts rather than transferable LLM behavior.
[Evaluation section] Evaluation and baselines: The abstract states superior performance on held-out experiments, yet the provided description supplies neither the exact baseline implementations, number of independent runs, statistical significance tests, nor ablation results isolating the contribution of the cross-fidelity MDP incentive. These omissions prevent assessment of whether the reported gains are robust or attributable to the claimed extrapolation mechanism.
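
The rank-preservation check requested in the first major comment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the score lists are made up, the function names are hypothetical, and both statistics use the simple no-ties formulas. A real analysis would more likely rely on a statistics library that handles ties and reports p-values.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes no ties in x or y, which keeps the sketch short.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def spearman_rho(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for the same five configurations at two fidelities;
# these numbers are invented and are not results from the paper.
low_fidelity = [0.61, 0.70, 0.66, 0.58, 0.73]
high_fidelity = [0.72, 0.80, 0.81, 0.69, 0.85]

tau = kendall_tau(low_fidelity, high_fidelity)
rho = spearman_rho(low_fidelity, high_fidelity)
```

Values near 1 would indicate that cheap runs rank configurations almost the same way expensive runs do. Values near 0 would suggest that principles learned at low fidelity have little to transfer.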

minor comments (2)

[MDP formulation] Notation for the MDP state and reward components could be clarified with an explicit table mapping symbols to their definitions.
[Abstract] The abstract refers to 'diverse strong baselines' without naming them; a brief enumeration in the introduction would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the dataset scale and MDP formulation, and constructive major comments. We address each point below and will incorporate revisions to strengthen the claims.

Referee: [LLMConfig-Gym and evaluation sections] LLMConfig-Gym description and evaluation: The central claim requires that low-fidelity tasks preserve relative rankings and optimization structure of high-fidelity LLM runs. No quantitative measure (Kendall tau, Spearman rank correlation, or landscape similarity) between low- and high-fidelity performance on identical configurations is reported, despite the environment being supported by over one million GPU hours of data. Without this, it remains possible that policies optimize for gym-specific artifacts rather than transferable LLM behavior.

Authors: We agree that explicit quantitative validation of rank preservation and landscape similarity between fidelities is necessary to support the central multi-fidelity transfer claim. The current manuscript describes the environment and its data backing but does not report these metrics. In the revision we will add a dedicated analysis subsection to LLMConfig-Gym that computes and reports Kendall's tau and Spearman's rank correlation on performance rankings for identical configurations across low- and high-fidelity settings, along with a simple landscape similarity measure. This will directly demonstrate that relative optimization structure is preserved rather than being gym-specific. revision: yes
Referee: [Evaluation section] Evaluation and baselines: The abstract states superior performance on held-out experiments, yet the provided description supplies neither the exact baseline implementations, number of independent runs, statistical significance tests, nor ablation results isolating the contribution of the cross-fidelity MDP incentive. These omissions prevent assessment of whether the reported gains are robust or attributable to the claimed extrapolation mechanism.

Authors: We acknowledge that the evaluation section currently lacks the level of detail required for full reproducibility and attribution assessment. In the revised manuscript we will expand this section to include: precise implementation details and code references for all baselines; the exact number of independent runs per method; results of statistical significance tests (e.g., paired t-tests with p-values); and targeted ablation experiments that compare the full cross-fidelity MDP incentive against variants without it. These additions will allow readers to evaluate robustness and confirm that performance gains stem from the extrapolation mechanism. revision: yes
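
The paired significance test mentioned in the second response could look roughly like the Python sketch below. It is an assumption-laden illustration: the per-seed scores are invented, the variable names are hypothetical, and only the t statistic and degrees of freedom are computed. Turning them into a p-value needs the Student t distribution, which a statistics library would normally supply.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical final scores over five seeds; not results from the paper.
# Each position pairs the two variants on the same seed and held-out task.
full_agent = [0.82, 0.79, 0.85, 0.81, 0.84]
no_cross_fidelity = [0.78, 0.77, 0.80, 0.79, 0.79]

t_stat, dof = paired_t_statistic(full_agent, no_cross_fidelity)
```

Pairing by seed matters here: it removes seed-to-seed variance from the comparison, so the test isolates the effect of the ablated component.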

Circularity Check

0 steps flagged

No significant circularity; empirical framework with held-out evaluation

The paper describes an empirical agent-training setup: LLMConfig-Gym is populated from over one million GPU hours of actual runs, the MDP formulation is a modeling choice, and performance is measured against baselines on held-out experiments. No equation or claim reduces by construction to a fitted parameter on the evaluation data, no self-citation supplies a uniqueness theorem that forces the result, and the central claim (transfer from low- to high-fidelity) is presented as an empirical finding rather than a definitional identity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the existence and fidelity of the newly introduced LLMConfig-Gym environment and on the assumption that cross-fidelity extrapolation is learnable via the MDP formulation. No explicit free parameters or invented physical entities are described; the main additions are the environment and the training pipeline.

axioms (2)

domain assumption: The multi-fidelity environment captures the structure of the LLM configuration landscape.
Stated in the abstract as the core challenge the framework must solve.
domain assumption: The long-horizon MDP formulation is appropriate for configuration research.
Used to structure the training pipeline.

invented entities (1)

LLMConfig-Gym (no independent evidence)
purpose: Multi-fidelity environment encompassing four LLM experiment tasks with verifiable outcomes.
Newly proposed component that supplies the interaction data for agent training.

pith-pipeline@v0.9.0 · 5821 in / 1532 out tokens · 39704 ms · 2026-05-20T22:34:06.238113+00:00 · methodology


work page
[59]

exec_config

kl loss weight: 0.0, learning rate: 5e-06, batch size: 32.0 ###### ...... ###### Experiment Environment information: the dataset is mmlu_history, the model is Qwen2.5-1.5B-Instruct, Note the Training epoch is 15 In this environment, the Top-3 configurations are: 1. kl loss weight: 0.0, learning rate: 1e-06, batch size: 64.0 2. kl loss weight: 0.0, learnin...

work page

[1] Droste Effect. 2026. URL https://en.wikipedia.org/wiki/Droste_effect

[2] Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Richard Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'Aurelio Ranzato, Sagi Perel, and Nando de Freitas. Towards learning universal hyperparameter optimizers with transformers. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Informa... 2022.

[3] Michael Volpp, Lukas P. Fröhlich, Kirsten Fischer, Andreas Doerr, Stefan Falkner, Frank Hutter, and Christian Daniel. Meta-learning acquisition functions for transfer learning in bayesian optimization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryeYpJSKwr

[4] Alexandre Maraval, Matthieu Zimmer, Antoine Grosnit, and Haitham Bou Ammar. End-to-end meta-bayesian optimisation with transformer neural processes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 11246–11260. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/2561721d0ca69bab22b749cfc4f48f6c-Paper-Conference.pdf

[6] Martin Wistuba and Josif Grabocka. Few-shot bayesian optimization with deep kernel surrogates. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=bJxgv5C3sYc

[7] Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance bayesian optimization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=OOxotBmGol

[8] Michael Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. Using large language models for hyperparameter optimization. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=FUdZ6HEOre

[9] Kanan Mahammadli and Seyda Ertekin. Sequential large language model-based hyper-parameter optimization, 2025. URL https://arxiv.org/abs/2410.20302

[10] Siyi Liu, Chen Gao, and Yong Li. Agenthpo: Large language model agent for hyper-parameter optimization. In Beidi Chen, Shijia Liu, Mert Pilanci, Weijie Su, Jeremias Sulam, Yuxiang Wang, and Zhihui Zhu, editors, Conference on Parsimony and Learning, volume 280 of Proceedings of Machine Learning Research, pages 1146–1169. PMLR, 24–27 Mar 2025. URL https://pro...

[11] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361

[12] Rhea Sanjay Sukthanker, Arber Zela, Benedikt Staffler, Aaron Klein, Lennart Purucker, Jörg K. H. Franke, and Frank Hutter. Hw-gpt-bench: Hardware-aware architecture benchmark for language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pag... 2024. doi: 10.52202/079017-1944

[13] Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pag... 2021.

[14] Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=jjCB27TMK3

[15] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent S... An empirical analysis of compute-optimal large language model training. 2022.

[16] Alon Halfon, Shai Gretz, Ofir Arviv, Artem Spector, Orith Toledo-Ronen, Yoav Katz, Liat Ein-Dor, Michal Shmueli-Scheuer, and Noam Slonim. Stay tuned: An empirical study of the impact of hyperparameters on llm tuning in real-world applications, 2024. URL https://arxiv.org/abs/2407.18990

[17] Andrej Karpathy. autoresearch. https://github.com/karpathy/autoresearch, 2026

[18] OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, 2025

[19] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code, 2025. URL https://arxiv.org/abs/2502.13138

[20] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, page 2623–2631, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016. doi: 10.1145/3292500.3330701...

[21] Scikit-Optimize. Scikit-optimize: Sequential model-based optimization in python. URL https://scikit-optimize.github.io/

[22] Irving John Good. Speculations concerning the first ultraintelligent machine. Volume 6 of Advances in Computers, pages 31–88. Elsevier, 1966. doi: https://doi.org/10.1016/S0065-2458(08)60418-0. URL https://www.sciencedirect.com/science/article/pii/S0065245808604180

[23] Juergen Schmidhuber. Goedel machines: Self-referential universal problem solvers making provably optimal self-improvements, 2006. URL https://arxiv.org/abs/cs/0309048

[24] Mingchen Zhuge, Ailing Zeng, Deyao Zhu, Sherry Yang, Vikas Chandra, and Jürgen Schmidhuber. AI with recursive self-improvement. In ICLR 2026 Workshop Proposals, 2026. URL https://openreview.net/forum?id=OsPQ6zTQXV

[25] Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. Posttrainbench: Can llm agents automate llm post-training? URL https://arxiv.org/abs/2603.08640

[27] Jenny Zhang, Shengran Hu, Cong Lu, Robert Tjarko Lange, and Jeff Clune. Darwin gödel machine: Open-ended evolution of self-improving agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=pUpzQZTvGY

[28] Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and Jürgen Schmidhuber. Huxley-gödel machine: Human-level coding agent development by an approximation of the optimal self-improving machine. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum...

[29] OpenAI. Learning to reason with LLMs. 2024. URL https://openai.com/index/learning-to-reason-with-llms

[30] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z

[32] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, et al. Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599

[33] Taicheng Guo, Hai Wang, ChaoChun Liu, Mohsen Golalikhani, Xin Chen, Xiangliang Zhang, and Chandan K Reddy. Mtsql-r1: Towards long-horizon multi-turn text-to-sql via agentic training. arXiv preprint arXiv:2510.12831, 2025

[34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

[35] Python Software Foundation. difflib — Helpers for computing deltas. 2026. URL https://docs.python.org/3/library/difflib.html

[36] OpenAI. Openai o3 and o4-mini system card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf, 2025

[37] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

[38] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, et al. Openai gpt-5 system card, 2025. URL https://arxiv.org/abs/2601.03267

[39] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

[40] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Yixin Cao, Yang Feng, and Deyi Xiong, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400–410, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.38. URL https://aclanthology.org/2024.acl-demos.38/

[42] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, Stor... 2022.

[43] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961. doi: 10.1145/3689031.3696075...

[44] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural I... 2024. doi: 10.52202/079017-2000

[45] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

[46] Gary Tom, Stefan P. Schmid, Sterling G. Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M. Rajaonson, Marta Skreta, Naruki Yoshikawa, Samantha Corapi, Gun Deniz Akkoc, Felix Strieth-Kalthoff, Martin Seifrid, and Alán Aspuru-Guzik. Self-driving laboratories for chemistry and materials science. Chemical Reviews, 124(16):963... 2024. doi: 10.1021/acs.chemrev.4c00055

[47] Rushil Gupta, Jason Hartford, and Bang Liu. LLMs for Bayesian optimization in scientific domains: Are we there yet? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15482–15510, Suzhou, China, November 2025. Association for Computational ... doi: 10.18653/v1/2025.findings-emnlp.838

[48] Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 5... 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/bbb330189ce02be00cf7346167028ab1-Paper-Datasets_and_Benchmarks.pdf

[50] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027

[51] Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Zhenyu Ding, Haoying Wang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, and Daxin Jiang. Predictable scale: Part i, step law – optimal hyperparameter scaling law in large language model pretraining, 2025. URL https://arxiv.org/abs/2503.04715

[52] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs...

[53] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... DAPO: An Open-Source LLM Reinforcement Learning System at Scale. 2025.

[54] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processing Syst... 2024.

[55] Xu Ouyang, Shengzhuang Chen, Michael Arthur Leopold Pearce, Thomas Hartvigsen, and Jonathan Richard Schwarz. ADMIRE-bayesopt: Accelerated data MIxture RE-weighting for language models with bayesian optimization. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=0Euvm9zDpu

[56] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannan... 2025.

[57] learning rate: 0.001953, batch size: 64.0 3. learning rate: 0.001953, batch size: 128.0 ###### ...... ###### Experiment Environment information: the total number of training tokens seen by the model during training is: 8000000000, and the count of trainable model parameters excluding token embedding matrices is: 119992320 In this environment, the Top-3 co...

[58] learning rate: 0.002762, batch size: 128.0 2. learning rate: 0.003906, batch size: 128.0 3. learning rate: 0.005524, batch size: 128.0 ###### Remember: 1. Consider your remaining budget is 2, previous experimental results, best configurations from low-fidelity experiments when making decisions. 2. You MUST have to call "exec_config" tool to query the score of the conf...

[59] kl loss weight: 0.0, learning rate: 5e-06, batch size: 32.0 ###### ...... ###### Experiment Environment information: the dataset is mmlu_history, the model is Qwen2.5-1.5B-Instruct, Note the Training epoch is 15 In this environment, the Top-3 configurations are: 1. kl loss weight: 0.0, learning rate: 1e-06, batch size: 64.0 2. kl loss weight: 0.0, learnin...