AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

Kai Cai; Penghao Yin; Shilin He; Shuzheng Gao; Xiao-Ping Zhang; Zhaojian Yu

arxiv: 2606.31551 · v1 · pith:ELI2SYJ4new · submitted 2026-06-30 · 💻 cs.CL

Zhaojian Yu, Penghao Yin, Shuzheng Gao, Shilin He, Kai Cai, Xiao-Ping Zhang

Pith reviewed 2026-07-01 05:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords AutoTrainess · autonomous post-training · language model agents · PostTrainBench · agent-computer interfaces · LM post-training · workflow-guided agents

The pith

AutoTrainess lets language models autonomously post-train other models by replacing raw command lines with explicit workflows and interfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that autonomous post-training of language models requires more than coding ability: an agent must plan iterations, build data, run stable jobs, evaluate results, and track state over long periods. AutoTrainess addresses this by turning those steps into a set of agent-computer interfaces backed by human-derived rules and constraints instead of leaving the model to improvise in a bare CLI. On PostTrainBench, the system raises the average score from 23.21 to 26.94 with GPT-5.4 (Codex) and from 12.13 to 19.58 with DeepSeek-V4-Flash (OpenCode), so the gain holds across two agent models and two harnesses. A reader would care because the work directly targets the human labor still required to keep frontier models improving after initial pre-training.

Core claim

AutoTrainess is an LM agent that exposes post-training operations as a repository of agent-computer interfaces for planning, data preparation, training, evaluation, and logging; rather than operating in an underspecified CLI environment, it externalizes prior human experience as explicit workflows, rules, and execution constraints that guide the agent toward reliable training behavior.
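The core claim can be pictured with a minimal sketch. It is not the authors' implementation: the five stage names come from the abstract, while the `Interface` and `Workspace` classes and the ordering rule are hypothetical stand-ins for "explicit workflows, rules, and execution constraints".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names follow the abstract; everything else is a hypothetical illustration.
STAGES = ["plan", "data", "train", "eval", "log"]


@dataclass
class Interface:
    name: str
    requires: List[str]           # stages that must already have run
    run: Callable[[dict], dict]   # takes and returns the shared iteration state


@dataclass
class Workspace:
    interfaces: Dict[str, Interface] = field(default_factory=dict)
    done: List[str] = field(default_factory=list)

    def register(self, iface: Interface) -> None:
        self.interfaces[iface.name] = iface

    def call(self, name: str, state: dict) -> dict:
        iface = self.interfaces[name]
        missing = [r for r in iface.requires if r not in self.done]
        if missing:
            # Execution constraint: refuse out-of-order actions
            # instead of letting the agent improvise.
            raise RuntimeError(f"{name} blocked; run {missing} first")
        state = iface.run(state)
        self.done.append(name)
        return state


def build_workspace() -> Workspace:
    ws = Workspace()
    for i, stage in enumerate(STAGES):
        # Each stub only records that its stage ran.
        ws.register(Interface(stage, STAGES[:i],
                              lambda s, stage=stage: {**s, stage: "ok"}))
    return ws
```

In this sketch, calling `train` on a fresh workspace raises an error, whereas a raw CLI would let the command run against missing data. That difference is the kind of guardrail the paper attributes to its interfaces.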

What carries the argument

AutoTrainess, the system of agent-computer interfaces and guiding workflows that externalizes human post-training experience into structured operations for the LM agent.

If this is right

AutoTrainess raises PostTrainBench average scores above CLI-only baselines for GPT-5.4 (Codex) and DeepSeek-V4-Flash (OpenCode).
The same structured interfaces improve results across different language models and agent harnesses.
Encoding human workflows as explicit constraints makes long-horizon autonomous training more stable and repeatable.
The agent can now carry out the full cycle of planning, data construction, training, and evaluation without constant human oversight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the interfaces generalize, teams could run overnight training experiments with far less expert supervision.
The same pattern of externalizing domain workflows might apply to other long-horizon agent tasks such as scientific data pipelines.
Success here suggests that future agents may need rich, task-specific interface layers rather than only raw tool access.

Load-bearing premise

Benchmark score differences between AutoTrainess and CLI baselines reflect real gains in autonomous training ability rather than artifacts of the particular test setup.

What would settle it

A controlled run on a fresh model and benchmark in which AutoTrainess scores no higher than a plain CLI agent, or completes fewer training jobs successfully, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.31551 by Kai Cai, Penghao Yin, Shilin He, Shuzheng Gao, Xiao-Ping Zhang, Zhaojian Yu.

**Figure 2.** AutoTrainess is an LM agent that interacts with training environments through a training…

**Figure 3.** Failure rate under interface ablations. Each bar reports the action failure rate, and the…

**Figure 4.** Ablation on data-related action counts in PostTrainBench trajectories.

| Action | AutoTrainess (full) | AutoTrainess (w/o data) |
|---|---|---|
| Read dataset content | 725 | 579 (-20.1%) |
| Clean dataset | 95 | 71 (-25.3%) |
| Build preference pairs | 44 | 32 (-27.3%) |
| Synthesize Data | 22 | 4 (-81.8%) |
| Find dataset sources | 18 | 17 (-5.6%) |
| Add corrective data | 18 | 16 (-11.1%) |
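The percentages reported under Figure 4 can be reproduced from the raw counts. The short check below assumes each percentage is the change of the "w/o data" count relative to the full system, which matches every row.

```python
# Counts transcribed from Figure 4: (full system, without the data interface).
ACTION_COUNTS = {
    "Read dataset content": (725, 579),
    "Clean dataset": (95, 71),
    "Build preference pairs": (44, 32),
    "Synthesize Data": (22, 4),
    "Find dataset sources": (18, 17),
    "Add corrective data": (18, 16),
}


def relative_change(full: int, ablated: int) -> float:
    """Percent change of the ablated count relative to the full system."""
    return round(100.0 * (ablated - full) / full, 1)


changes = {name: relative_change(full, ablated)
           for name, (full, ablated) in ACTION_COUNTS.items()}
```

The largest drop is in data synthesis (22 to 4 actions, -81.8%), while source discovery barely moves (18 to 17, -5.6%).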

**Figure 5.** Exploration-exploitation balance under interface ablations. Exploration is measured by the number of train-to-eval handoffs, while exploitation is measured by retained improvements.

… it is removed, training commands do not necessarily fail more often, but their artifacts can become more ambiguous or inconsistently referenced by downstream evaluation commands. Overall, the ablation results suggest that eac…
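The two quantities named in the Figure 5 caption can be made concrete with a small sketch. The excerpt does not give the paper's exact definitions, so both functions below are assumptions: a handoff is counted when an `eval` action follows at least one `train` action since the last evaluation, and a retained improvement is counted when a score beats the best score so far.

```python
from typing import List


def count_train_to_eval_handoffs(actions: List[str]) -> int:
    """Count evaluations that consume at least one preceding training run."""
    handoffs = 0
    pending_train = False
    for action in actions:
        if action == "train":
            pending_train = True
        elif action == "eval":
            if pending_train:
                handoffs += 1
            pending_train = False
    return handoffs


def count_retained_improvements(scores: List[float]) -> int:
    """Count scores that exceed the running best; the first score is the baseline."""
    if not scores:
        return 0
    best = scores[0]
    retained = 0
    for score in scores[1:]:
        if score > best:
            retained += 1
            best = score
    return retained
```

Under these assumed definitions, an agent that trains often but rarely beats its own best checkpoint would show high exploration and low exploitation.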

**Figure 6.** Frequency of different agent behaviors in AutoTrainess at different time stages.

**Figure 7.** Examples of the model training process of AutoTrainess across different datasets.

**Figure 8.** Statistics of agent behaviors most and least correlated with performance improvement.

**Figure 9.** Ablation study of data skill on ArenaHard and Qwen3-4B.

**Figure 10.** Ablation study of eval skill on HealthBench and Qwen3-4B.

**Figure 11.** AGENTS.md for AutoTrainess framework.

C.2 Plan Plan --- name: iteration_plan description: Use when defining the goal and action plan for the next experiment iteration. metadata: short-description: Plan the next experiment iteration --- # iteration_plan ## Purpose Define a clear goal and concrete action plan for the current experiment iteration based on real evidence from previous experiments. ## Inp…

**Figure 12.** Instruction of Plan skill.

C.3 Data Process Data Process --- name: data description: Use when preparing training data. metadata: short-description: Prepare training data --- # data ## Purpose Prepare training data that addresses real problems exposed by previous training or evaluation, aligns with the benchmark evaluation interface, and is ready for downstream training. ## Core principles - Drive al…

**Figure 13.** Instruction of Data Process skill.

C.3.1 Selection Data Process → Selection # Data Selection ## Purpose Identify the data needs suggested by observed problems and choose initial source directions for construction. ## Required outputs - The target problems or behaviors the new data should support. - Initial source directions for construction. - Important constraints or risks for construction, such as be…

**Figure 14.** Instruction of Selection in Data Process skill.

C.3.2 Construction AGENTS # Data Construction ## Purpose Turn the selected data needs and initial source directions into a benchmark-aligned training dataset. ## Required outputs - A training dataset that is ready for downstream training. - A concise dataset description covering source origins, transformations, target problems or behaviors, sample format,…

**Figure 15.** Instruction of Construction in Data Process skill.

C.3.3 Validation Data Process → Validation # Data Validation ## Purpose Validate the constructed dataset and dataset description before training, then decide whether the data is ready, needs reconstruction, or requires a new selection decision. ## Required outputs
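The validation excerpt names three outcomes: the data is ready, needs reconstruction, or requires a new selection decision. A minimal sketch of such a gate follows. The checks themselves (empty fields, exact duplicates, a `covers_target` flag) and the `prompt`/`response` field names are hypothetical, since the excerpt is cut off before the actual criteria.

```python
from typing import Dict, List


def validate_dataset(samples: List[Dict[str, str]], covers_target: bool) -> str:
    """Return 'ready', 'reconstruct', or 'reselect' (hypothetical criteria)."""
    if not covers_target:
        # The data does not address the identified need: go back to selection.
        return "reselect"
    if not samples:
        return "reconstruct"
    seen = set()
    for sample in samples:
        prompt = sample.get("prompt", "").strip()
        response = sample.get("response", "").strip()
        if not prompt or not response:
            return "reconstruct"   # broken or empty sample
        key = (prompt, response)
        if key in seen:
            return "reconstruct"   # exact duplicate
        seen.add(key)
    return "ready"
```

Routing selection problems and construction problems to different stages keeps the agent from rebuilding a dataset whose real defect is the choice of source.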

**Figure 17.** Instruction of Training skill.

C.4.1 SFT Training → SFT # SFT Stage ## Purpose Run the minimum valid supervised fine-tuning workflow for the current stage with LlamaFactory. ## Inputs - The training dataset prepared by the data workflow.

**Figure 18.** Instruction of SFT in Training skill.

C.4.2 RL Training → RL # RL Stage ## Purpose Run the minimum valid RL workflow for the current stage with LlamaFactory only when current evidence supports using RL. ## Inputs - Recent evaluation evidence showing why RL is needed. - The current model or base model to continue from. - The minimum reward definition, feedback signal, or RL data required by the selec…

**Figure 19.** Instruction of RL in Training skill.

C.4.3 Shared Instruction Training → Shared # LlamaFactory Workflow ## Purpose Define the shared LlamaFactory workflow for training: installation, environment checks, execution boundary, and failure handling. ## Shared rules - Use `hiyouga/LlamaFactory` for all training work. - Do not replace it with another training framework or a custom training loop. - If `Llam…

**Figure 20.** Shared instruction in Training skill.

C.5 Evaluation Evaluation --- name: eval description: Use when evaluating a model on the current benchmark(s). metadata: short-description: Run benchmark evaluation --- # eval ## Purpose Run the benchmark's real evaluation on `final_model/` and record reproducible evidence needed for the next stage decision. ## Inputs - Workspace repository (current working directo…

**Figure 21.** Instruction of Evaluation skill.

C.6 Log Log --- name: log description: Use when appending an experiment log entry after a completed iteration. metadata: short-description: Append experiment log --- # log ## Task After each completed iteration, append one new entry to `task/experiment_log.md`. ## Rules - If `task/experiment_log.md` does not exist, create it. If it already exists, append a new ent…
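The Log skill's visible rule (create `task/experiment_log.md` if it is missing, otherwise append) maps onto a few lines of file handling. The sketch below is illustrative only: the function name, the entry heading, and the timestamp format are assumptions, and only the file path and the create-or-append behavior come from the excerpt.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_log_entry(workspace: Path, iteration: int, summary: str) -> Path:
    """Append one entry to task/experiment_log.md, creating the file if needed."""
    log_path = workspace / "task" / "experiment_log.md"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Hypothetical entry layout; the paper's actual template is not shown.
    entry = f"## Iteration {iteration} ({stamp})\n\n{summary.strip()}\n\n"
    # Mode "a" creates the file when missing and never rewrites earlier entries.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return log_path
```

An append-only log gives a long-running agent a record of earlier iterations that survives context truncation, which is the state-tracking problem the abstract highlights.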

**Figure 22.** Instruction of Log skill. D Detailed Results on PostTrainBench

Original abstract

Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that autonomous post-training is not just a coding problem: it requires the agent to repeatedly plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state across many hours of interaction. We present AutoTrainess, an LM agent that exposes these operations as a repository of agent-computer interfaces for planning, data preparation, training, evaluation, and logging. Rather than leaving the agent to operate in a raw CLI environment with an underspecified action space, AutoTrainess externalizes prior human experience as explicit workflows, rules, and execution constraints that guide the agent toward effective and reliable training behavior. On PostTrainBench, AutoTrainess consistently outperforms CLI-only baselines, achieving 26.94 average score with GPT-5.4 (Codex) versus 23.21 for CLI-only. It also generalizes across models and harnesses, improving DeepSeek-V4-Flash (OpenCode) from 12.13 to 19.58.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AutoTrainess adds explicit workflows and interfaces for autonomous post-training agents and shows score gains over CLI baselines on PostTrainBench, but the abstract supplies no controls or variance data, so those gains remain hard to interpret.

The main point is that the authors built AutoTrainess to give an agent structured access to the full post-training loop—planning iterations, building benchmark data, launching jobs, evaluating checkpoints, and keeping state—rather than dropping it into a raw CLI. They report clear lifts on PostTrainBench: 26.94 versus 23.21 with GPT-5.4 and 19.58 versus 12.13 with DeepSeek-V4-Flash.

What the work does cleanly is name the gap between ordinary code agents and the repeated, stateful, constraint-heavy nature of actual training runs, then try to close it by baking prior human practice into explicit rules and execution constraints. That framing is reasonable and points in a useful direction.

The soft spot is the complete lack of experimental detail. The abstract gives no information on how the CLI baseline was equipped, whether the reported averages are stable across seeds or prompt changes, how tasks were sampled, or whether the added workflows simply encode the benchmark’s own success criteria. Without those pieces the numerical differences cannot be read as evidence of general capability rather than setup-specific effects.

The paper is aimed at people working on agent systems for ML infrastructure. A reader already thinking about reducing human loops in training pipelines could extract the interface design ideas, but anyone needing reproducible evidence will have to wait for the full methods.

I would send it to peer review so the experiments can be checked directly.

Referee Report

3 major / 1 minor

Summary. The paper introduces AutoTrainess, an LM agent for autonomous post-training of language models. It exposes operations for planning, data preparation, training, evaluation, and logging as agent-computer interfaces, externalizing human experience via explicit workflows, rules, and execution constraints rather than raw CLI. The central empirical claim is that this yields consistent gains on PostTrainBench: 26.94 average score with GPT-5.4 (Codex) versus 23.21 for CLI-only, and generalization to DeepSeek-V4-Flash (OpenCode) from 12.13 to 19.58.

Significance. If the reported gains prove robust to controls, variance, and equivalent baselines, the work could meaningfully advance autonomous ML operations by demonstrating the value of structured interfaces for long-horizon training tasks. The approach of codifying prior human workflows into agent-accessible constraints is a concrete contribution that could reduce reliance on manual intervention, provided the benchmark results generalize beyond the specific harness.

major comments (3)

[Abstract] The 3.73-point (GPT-5.4) and 7.45-point (DeepSeek) gains are presented as evidence that exposed interfaces outperform raw CLI, yet no information is supplied on task distribution within PostTrainBench, number of runs, variance, or statistical significance; without these, it is impossible to rule out that the differences are artifacts of the specific setup rather than general capability gains.
[Abstract] The CLI-only baseline is asserted to be inferior, but the text supplies no description of its action space, state-tracking mechanism, or whether it receives equivalent information to AutoTrainess; this equivalence is load-bearing for the central claim that the improvement stems from the explicit workflows and constraints rather than from differences in available tools.
[Abstract] The generalization claim across models and harnesses is stated without any detail on the concrete workflows, rules, or execution constraints that were implemented, leaving open whether the measured improvement is due to the interface design or to unstated implementation choices that may not transfer.
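The point gains quoted in the first comment follow directly from the abstract's averages. The snippet below recomputes them and adds the relative gain over each baseline; the relative figures are not stated in the source and are shown only as derived arithmetic.

```python
# Average PostTrainBench scores quoted in the abstract: (CLI-only, AutoTrainess).
SCORES = {
    "GPT-5.4 (Codex)": (23.21, 26.94),
    "DeepSeek-V4-Flash (OpenCode)": (12.13, 19.58),
}


def gains(baseline: float, system: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(system - baseline, 2)
    relative = round(100.0 * (system - baseline) / baseline, 1)
    return absolute, relative


summary = {name: gains(base, system) for name, (base, system) in SCORES.items()}
```

This gives 3.73 points (about 16.1% relative) for GPT-5.4 and 7.45 points (about 61.4% relative) for DeepSeek-V4-Flash. The weaker baseline gains far more in relative terms, which makes the missing variance data matter even more.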

minor comments (1)

[Abstract] Model identifiers such as 'GPT-5.4 (Codex)' and 'DeepSeek-V4-Flash (OpenCode)' are non-standard and should be clarified with references to exact versions or citations.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and commit to revisions that will strengthen the clarity and rigor of the empirical claims.

Referee: [Abstract] The 3.73-point (GPT-5.4) and 7.45-point (DeepSeek) gains are presented as evidence that exposed interfaces outperform raw CLI, yet no information is supplied on task distribution within PostTrainBench, number of runs, variance, or statistical significance; without these, it is impossible to rule out that the differences are artifacts of the specific setup rather than general capability gains.

Authors: We agree that the abstract should include or reference these details to allow readers to assess robustness. The full manuscript reports results aggregated over multiple runs on PostTrainBench; in the revision we will explicitly state the number of runs, report variance or standard deviation, note any statistical tests, and summarize the task distribution within the benchmark.
Revision: yes

Referee: [Abstract] The CLI-only baseline is asserted to be inferior, but the text supplies no description of its action space, state-tracking mechanism, or whether it receives equivalent information to AutoTrainess; this equivalence is load-bearing for the central claim that the improvement stems from the explicit workflows and constraints rather than from differences in available tools.

Authors: The CLI-only baseline operates in the same underlying environment but lacks the structured workflows, rules, and constraints provided by AutoTrainess. To make this equivalence explicit, the revision will add a dedicated paragraph describing the action space, state representation, and information available to the CLI-only agent.
Revision: yes

Referee: [Abstract] The generalization claim across models and harnesses is stated without any detail on the concrete workflows, rules, or execution constraints that were implemented, leaving open whether the measured improvement is due to the interface design or to unstated implementation choices that may not transfer.

Authors: We will expand the methods section with concrete descriptions and examples of the workflows, rules, and execution constraints used for each model/harness pair. This will allow readers to evaluate transferability of the interface design.
Revision: yes
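The run-level reporting promised in the first response could take roughly the following shape. This is a sketch under stated assumptions: the function names are hypothetical, no per-run scores appear in the source, and the noise check is a crude heuristic rather than a statistical test.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_runs(scores: List[float]) -> Dict[str, float]:
    """Report run count, mean, and sample standard deviation for one system."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "n": len(scores),
        "mean": round(mean(scores), 2),
        "std": round(stdev(scores), 2),
    }


def gain_exceeds_noise(baseline_runs: List[float], system_runs: List[float]) -> bool:
    """Crude check: is the mean gain larger than the two standard deviations combined?"""
    base = summarize_runs(baseline_runs)
    system = summarize_runs(system_runs)
    return (system["mean"] - base["mean"]) > (base["std"] + system["std"])
```

Reporting `n`, mean, and standard deviation per model and harness pair would let a reader judge whether a 3.73-point gain sits inside or outside run-to-run variation.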

Circularity Check

0 steps flagged

No circularity: empirical system comparison with no derivation chain

The paper presents an agent system (AutoTrainess) and reports direct empirical scores on PostTrainBench against CLI baselines. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The performance claims (e.g., 26.94 vs 23.21) are presented as measured outcomes rather than derived quantities that reduce to their own inputs by construction. This is standard empirical reporting with no identifiable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the text focuses on system description and empirical outcomes without mathematical derivations or new postulated constructs.

pith-pipeline@v0.9.1-grok · 5751 in / 1109 out tokens · 35499 ms · 2026-07-01T05:37:42.758243+00:00 · methodology

Reference graph

Works this paper leans on

82 extracted references · 16 canonical work pages · 10 internal anchors

[1]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[2]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty- first International Conference on Machine Learning, ICML 2024, Vienna, Austria,...

2024
[3]

Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algori...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Alpharesearch: Accelerating new algorithm discovery with language models.CoRR, abs/2511.08522, 2025

Zhaojian Yu, Kaiyue Feng, Yilun Zhao, Shilin He, Xiao-Ping Zhang, and Arman Co- han. Alpharesearch: Accelerating new algorithm discovery with language models.CoRR, abs/2511.08522, 2025

work page arXiv 2025
[5]

SWE-agent: Agent-computer interfaces enable automated soft- ware engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated soft- ware engineering. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024
[6]

Posttrainbench: Can llm agents automate llm post- training?arXiv preprint arXiv:2603.08640, 2026

Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. Posttrainbench: Can LLM agents automate LLM post-training? CoRR, abs/2603.08640, 2026

work page arXiv 2026
[7]

Llamafactory: Unified efficient fine-tuning of 100+ language models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Yixin Cao, Yang Feng, and Deyi Xiong, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), ACL 2024, Bangkok, Thailand, August 11-16, 202...

2024
[8]

Gonzalez, and Ion Stoica

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second Interna...

2025
[9]

Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E

Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): from tool use to agentic evaluation of large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, ed...

2025
[10]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021. 10

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health.CoRR, abs/2505.08775, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Manning, Stefano Ermon, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...

2023
[15]

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.CoRR, abs/2407.01489, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Shuzheng Gao, Cuiyun Gao, Wenchao Gu, and Michael R. Lyu. Search-based llms for code optimization.CoRR, abs/2408.12159, 2024

work page arXiv 2024
[17]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.CoRR, abs/2404.07972, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Riosworld: Benchmarking the risk of multimodal computer-use agents.CoRR, abs/2506.00618, 2025

Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents.CoRR, abs/2506.00618, 2025

work page arXiv 2025
[19]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery.CoRR, abs/2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Openresearcher: Unleashing AI for accelerated scientific research

Yuxiang Zheng, Shichao Sun, Lin Qiu, Dongyu Ru, Cheng Jiayang, Xuefeng Li, Jifan Lin, Binjie Wang, Yun Luo, Renjie Pan, Yang Xu, Qingkai Min, Zizhao Zhang, Yiwen Wang, Wenjie Li, and Pengfei Liu. Openresearcher: Unleashing AI for accelerated scientific research. In Delia Irazú Hernández Farías, Tom Hope, and Manling Li, editors,Proceedings of the 2024 Con...

2024
[21]

Openresearcher: A fully open pipeline for long-horizon deep research trajectory synthesis.CoRR, abs/2603.20278, 2026

Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, and Wenhu Chen. Openresearcher: A fully open pipeline for long-horizon deep research trajectory synthesis.CoRR, abs/2603.20278, 2026

work page arXiv 2026
[22]

Juraj Gottweis, Wei-Hung Weng, Alexander N. Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan

Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. Core- bench: Fostering the credibility of published research through a computational reproducibility agent benchmark.Trans. Mach. Learn. Res., 2024, 2024

2024
[24]

Mle-bench: Evaluating machine learning agents on machine learning engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. Mle-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 2...

2025
[25]

Paperbench: Evaluating ai’s ability to replicate AI research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. Paperbench: Evaluating ai’s ability to replicate AI research. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Mahara...

2025
[26]

Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y . Wong, and Simon See. Newtonbench: Benchmarking generalizable scientific law discovery in LLM agents.CoRR, abs/2510.07172, 2025

work page arXiv 2025
[27]

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Lei Xiong, Kun Luo, Ziyi Xia, Wenbo Zhang, Jin-Ge Yao, Zheng Liu, Jingying Shao, Jianlyu Chen, Hongjin Qian, Xi Yang, Qian Yu, Hao Li, Chen Yue, Xiaan Du, Yuyang Wang, Yesheng Liu, Haiyu Xu, and Zhicheng Dou. Autoresearchbench: Benchmarking AI agents on complex scientific literature discovery.CoRR, abs/2604.25256, 2026. 12 A Case Study on the Effects of D...

work page internal anchor Pith review Pith/arXiv arXiv 2026

Showing first 80 references.

[1] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

[2] Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria,...

[3] Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algori... (arXiv 2025)

[4] Zhaojian Yu, Kaiyue Feng, Yilun Zhao, Shilin He, Xiao-Ping Zhang, and Arman Cohan. Alpharesearch: Accelerating new algorithm discovery with language models. CoRR, abs/2511.08522, 2025.

[5] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[6] Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. Posttrainbench: Can LLM agents automate LLM post-training? CoRR, abs/2603.08640, 2026.

[7] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Yixin Cao, Yang Feng, and Deyi Xiong, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), ACL 2024, Bangkok, Thailand, August 11-16, 202...

[8] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second Interna... (2025)

[9] Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): from tool use to agentic evaluation of large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, ed... (2025)

[10] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023.

[11] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.

[12] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health. CoRR, abs/2505.08775, 2025.

[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian... (arXiv 2021)

[14] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference o... (2023)

[15] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. CoRR, abs/2407.01489, 2024.

[16] Shuzheng Gao, Cuiyun Gao, Wenchao Gu, and Michael R. Lyu. Search-based llms for code optimization. CoRR, abs/2408.12159, 2024.

[17] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. CoRR, abs/2404.07972, 2024.

[18] Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents. CoRR, abs/2506.00618, 2025.

[19] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. CoRR, abs/2408.06292, 2024.

[20] Yuxiang Zheng, Shichao Sun, Lin Qiu, Dongyu Ru, Cheng Jiayang, Xuefeng Li, Jifan Lin, Binjie Wang, Yun Luo, Renjie Pan, Yang Xu, Qingkai Min, Zizhao Zhang, Yiwen Wang, Wenjie Li, and Pengfei Liu. Openresearcher: Unleashing AI for accelerated scientific research. In Delia Irazú Hernández Farías, Tom Hope, and Manling Li, editors, Proceedings of the 2024 Con...

[21] Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, and Wenhu Chen. Openresearcher: A fully open pipeline for long-horizon deep research trajectory synthesis. CoRR, abs/2603.20278, 2026.

[22] Juraj Gottweis, Wei-Hung Weng, Alexander N. Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan,... (arXiv 2025)

[23] Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. Trans. Mach. Learn. Res., 2024, 2024.

[24] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. Mle-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 2...

[25] Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. Paperbench: Evaluating ai’s ability to replicate AI research. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Mahara... (2025)

[26] Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, and Simon See. Newtonbench: Benchmarking generalizable scientific law discovery in LLM agents. CoRR, abs/2510.07172, 2025.

[27] Lei Xiong, Kun Luo, Ziyi Xia, Wenbo Zhang, Jin-Ge Yao, Zheng Liu, Jingying Shao, Jianlyu Chen, Hongjin Qian, Xi Yang, Qian Yu, Hao Li, Chen Yue, Xiaan Du, Yuyang Wang, Yesheng Liu, Haiyu Xu, and Zhicheng Dou. Autoresearchbench: Benchmarking AI agents on complex scientific literature discovery. CoRR, abs/2604.25256, 2026.

A Case Study on the Effects of D...

[28] Prepare the current base model for evaluation.

[29] Run the real benchmark evaluation.

[30] Record the evaluation setup and result.

Decision rules:
- If an explicit target exists and the base model already reaches the target, stop.
- If evaluation fails because of an engineering or environment issue, fix the issue and repeat Stage 1. Otherwise, enter Stage 2.

### Stage 2: Local Diagnosis and Optimization

Run local iterations to establish a r...

[31] Review previous experiment results and identify the main problems.

[32] Decide what the current iteration is mainly trying to improve.

[33] Define the main changes to make in this iteration.

[34] State what outcome will count as success for this iteration.

[35] Provide concise guidance for downstream data and training work.

Figure 12: Instruction of Plan skill.

C.3 Data Process

Data Process

---
name: data
description: Use when preparing training data.
metadata:
  short-description: Prepare training data
---

# data

## Purpose

Prepare training data that addresses real problems exposed by previous training or evaluatio...

[36] Read [shared/conventions.md](./shared/conventions.md) for shared rules.

[37] Run [selection/stage.md](./selection/stage.md) to identify target data needs and initial source directions.

[38] Run [construction/stage.md](./construction/stage.md) to turn those needs and directions into a benchmark-aligned training dataset.

[39] Run [validation/stage.md](./validation/stage.md) for data validation before training.

[40] If validation finds construction issues, return to construction. If validation finds target-need or source-direction issues, return to selection.

## Required outputs

- A final training dataset ready for downstream training.
- A concise dataset description covering target problems, data sources, sample format, known limitations, and validation sta...

[41] Review available evidence from prior training, evaluation, or benchmark misses.

[42] Identify the data needs implied by those problems or required benchmark-facing behaviors.

[43] Choose initial source directions, such as local data, external data, synthetic data, or model-distilled data. If local or external data is substantially different from the benchmark distribution, consider synthetic or model-distilled data as source directions.

[44] Pass unresolved assumptions, source limitations, leakage risks, and construction constraints to the construction stage.

Figure 14: Instruction of Selection in Data Process skill.

C.3.2 Construction

AGENTS

# Data Construction

## Purpose

Turn the selected data needs and initial source directions into a benchmark-aligned training dataset.

## Required outputs...

[45] Review the target problems, source directions, constraints, and risks passed from selection.

[46] Inspect the benchmark evaluation path and render or reconstruct several evaluation-style examples when possible.

[47] Decide the target training sample format from the observed model-facing input, expected output form, answer boundary, and final-answer location.

[48] Inspect candidate sources and decide whether they can support the target data needs. If they are viable, continue construction; if not, return to selection.

[49] Extract, clean, rewrite, restructure, synthesize, or distill samples as needed.

[50] Filter out broken, unreadable, empty, duplicated, misaligned, or clearly low-value samples, then reduce redundant or weakly relevant samples to keep the dataset focused.

[51] Produce the final dataset and dataset description.

## Decision standard

The stage is complete when the dataset is usable for training, aligned with the benchmark-facing task, and described well enough for validation.

Figure 15: Instruction of Construction in Data Process skill.

C.3.3 Validation

Data Process→Validation

# Data Validation

## Purpose

Validate t...

[52] Inspect the constructed dataset and dataset description.

[53] Check structural correctness, including schema, required fields, encoding, and malformed samples.

[54] Compare several constructed training samples against the rendered evaluation-style examples.

[55] Check whether the dataset matches the benchmark evaluation interface and target behaviors.

[56] Review sample quality and look for garbage, corruption, duplication, leakage risk, or unrealistic synthesis.

[57] Decide whether any detected problem belongs to construction or selection.

[58] Produce one of three decisions:
- approve for training
- return to construction
- return to selection

## Decision standard

The stage is complete when the dataset is approved for training or sent back with a clear reason and return target.

Figure 16: Instruction of Validation in Data Process skill.

C.4 Training

Training

---
name: train
description: Use wh...

[59] Read [shared/llamafactory.md](./shared/llamafactory.md).

[60] Decide whether the current stage requires [sft/stage.md](./sft/stage.md) or [rl/stage.md](./rl/stage.md).

[61] Follow the selected stage document.

[62] Run training through the provided script in `scripts/`.

[63] Export `final_model/` for evaluation.

Figure 17: Instruction of Training skill.

C.4.1 SFT

Training→SFT

# SFT Stage

## Purpose

Run the minimum valid supervised fine-tuning workflow for the current stage with LlamaFactory.

## Inputs

- The training dataset prepared by the data workflow.
- The benchmark-facing sample format or schema.
- A valid base model pa...

[64] Review the prepared training data and its benchmark-facing format.

[66] Prepare the minimum SFT dataset assets and verify the LlamaFactory config using `shared/llamafactory.md`.

[67] Run a small validation training with `scripts/run_llamafactory.sh`.

[68] If the validation run is usable, continue the intended SFT run.

[69] Export `final_model/` and leave it ready for evaluation.

## Decision standard

The stage is complete only when the SFT run is reproducible, the exported model is evaluation-ready, and the result is not justified by training loss alone.

Figure 18: Instruction of SFT in Training skill.

C.4.2 RL

Training→RL

# RL Stage

## Purpose

Run the minimum valid RL workf...

[70] Review the latest evaluation evidence and confirm that RL is justified.

[71] Read `shared/llamafactory.md` and confirm that LlamaFactory is usable.

[72] Prepare the minimum reward setup or RL data, and verify the LlamaFactory config using `shared/llamafactory.md`.

[73] Run a small validation RL run with `scripts/run_llamafactory.sh`.

[74] If the validation run is usable, continue the intended RL run.

[75] Export `final_model/` and leave it ready for evaluation.

## Decision standard

The stage is complete only when RL is justified by current evidence, the run is reproducible, and the exported model is ready for real evaluation.

Figure 19: Instruction of RL in Training skill.

C.4.3 Shared Instruction

Training→Shared

# LlamaFactory Workflow

## Purpose

Define th...

[76] Locate the canonical evaluation entrypoint.

[77] If using a limited evaluation, determine the benchmark sample count and choose a limit that satisfies the sample-floor rule.

[78] Run evaluation on `final_model/`.

[79] Save raw outputs, commands, the sample count or limit used, and a concise metrics summary under `eval_results/`.

[80] If evaluation fails, debug it inside the benchmark's real evaluation workflow, then retry with the minimum necessary fix.

[81] Generate `eval_results/sample_summary.md` with 15 random samples including score, input, target, and model output. Use `skills/eval/scripts/summarize_eval_samples.py` when compatible `inspect_ai` logs are available; otherwise, add the minimum benchmark-specific script or logging needed.
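
The sample-summary step in the final evaluation item above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's `summarize_eval_samples.py`: the function name `write_sample_summary`, the record keys (`score`, `input`, `target`, `output`), the markdown layout, and the fixed seed are all hypothetical, and the sketch reads plain dictionaries rather than `inspect_ai` logs.

```python
import random
from pathlib import Path


def write_sample_summary(records, out_path, k=15, seed=0):
    """Write k randomly chosen evaluation records to a markdown file.

    `records` is a list of dicts with the hypothetical keys
    "score", "input", "target", and "output" (an assumed schema).
    Returns the list of records that were written.
    """
    rng = random.Random(seed)  # fixed seed keeps the summary reproducible
    # Never ask for more samples than exist.
    chosen = rng.sample(records, min(k, len(records)))

    lines = ["# Evaluation sample summary", ""]
    for i, rec in enumerate(chosen, start=1):
        lines += [
            f"## Sample {i}",
            f"- score: {rec['score']}",
            f"- input: {rec['input']}",
            f"- target: {rec['target']}",
            f"- model output: {rec['output']}",
            "",
        ]

    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create eval_results/ if missing
    out.write_text("\n".join(lines), encoding="utf-8")
    return chosen
```

Calling `write_sample_summary(records, "eval_results/sample_summary.md")` would produce one section per sampled record. Sampling with a fixed seed is a design choice made here so that repeated runs over the same evaluation output inspect the same 15 samples.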