ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

Hao Liang; Hengyi Feng; Lu Ma; Shengjie Ye; Wentao Zhang; Zhengyang Zhao

arxiv: 2606.01279 · v1 · pith:3W4S4DA4new · submitted 2026-05-31 · 💻 cs.AI

Zhengyang Zhao, Shengjie Ye, Lu Ma, Hao Liang, Hengyi Feng, Wentao Zhang

Pith reviewed 2026-06-28 16:57 UTC · model grok-4.3

classification 💻 cs.AI

keywords: AI agents · data synthesis · post-training alignment · automated alignment · World Tree routing · instruction alignment · agent skills · web data curation

The pith

Equipping weaker agents with Andes lets them synthesize high-quality alignment data and reach state-of-the-art on PostTrainBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Andes as a framework that turns data generation for post-training into a plug-and-play agent skill rather than requiring agents to build complex strategies from scratch. It supplies a self-evolving World Tree routing mechanism plus actionable diagnostic reports so agents can steer synthesis inside noisy web environments through a closed-loop interface. This setup is shown to lift foundationally weaker agents to state-of-the-art automated alignment results on PostTrainBench while delivering robust cross-task generalization under tight compute limits.

Core claim

Andes reimagines data generation as a plug-and-play agent skill. By leveraging a self-evolving World Tree routing mechanism and actionable diagnostic reports, it allows trainer agents to dynamically steer data synthesis through an interactive, closed-loop interface. Equipping foundationally weaker agents with Andes improves automated alignment, securing state-of-the-art performance on PostTrainBench and robust cross-task generalization.
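The closed loop described here can be sketched in a few lines. This is a minimal illustration, not the authors' interface: `SynthesisReport`, `synthesize`, `trainer_loop`, the toy quality score, and the steering rule are all hypothetical names and choices. The only parts taken from the page are the shape of the loop (one call per capability domain, each call returning a dataset plus a report) and the idea that the report configures the next call.

```python
from dataclasses import dataclass


@dataclass
class SynthesisReport:
    """Hypothetical diagnostic report returned alongside each dataset."""
    domain: str
    kept: int
    rejected: int

    @property
    def keep_rate(self) -> float:
        total = self.kept + self.rejected
        return self.kept / total if total else 0.0


def synthesize(domain, n_samples, min_quality):
    """Stand-in for one synthesis call; returns (dataset, report).

    A deterministic toy score replaces real generation and refinement:
    sample i is kept when its score (0..99) reaches `min_quality`.
    """
    dataset = []
    for i in range(n_samples):
        quality = i * 37 % 100  # toy score, not a real quality metric
        if quality >= min_quality:
            dataset.append({"domain": domain, "question": f"{domain}-q{i}"})
    report = SynthesisReport(domain, kept=len(dataset),
                             rejected=n_samples - len(dataset))
    return dataset, report


def trainer_loop(domains, rounds=3, n_samples=100, target_keep=0.5):
    """Closed loop: each report configures the next call for its domain."""
    min_quality = {d: 20 for d in domains}
    corpus, history = [], []
    for _ in range(rounds):
        for d in domains:
            data, report = synthesize(d, n_samples, min_quality[d])
            corpus.extend(data)
            history.append(report)
            # Assumed steering rule: tighten the filter while too much passes.
            if report.keep_rate > target_keep:
                min_quality[d] = min(90, min_quality[d] + 20)
    return corpus, history


if __name__ == "__main__":
    corpus, history = trainer_loop(["math", "code"])
    for r in history:
        print(r.domain, r.kept, round(r.keep_rate, 2))
```

The point of the sketch is the control flow: the trainer never inspects raw samples, only the compact report, which is one plausible reading of how such an interface keeps the agent's context small.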

What carries the argument

The self-evolving World Tree routing mechanism that supplies an abstraction layer for agents to steer data synthesis via diagnostic feedback.
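As a rough illustration of what such routing could look like, the sketch below samples topics in proportion to a weight, nudges that weight with a relevance score, and marks a frequently selected node as refreshed. Everything in it is assumed for illustration: the `Node` fields, the multiplicative update, the `refresh_after` threshold, and the relevance values are not taken from the paper, and the tree is flattened to a plain list of nodes for brevity.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    topic: str
    weight: float = 1.0   # routing preference
    visits: int = 0       # selections since the last refresh
    generation: int = 0   # how many times the node has been refreshed


def route(nodes, rng):
    """Sample one node with probability proportional to its weight."""
    return rng.choices(nodes, weights=[n.weight for n in nodes], k=1)[0]


def update(node, relevance, lr=0.5, refresh_after=5):
    """Feed back a relevance score in [0, 1] and refresh overused nodes."""
    # Scores above 0.5 grow the weight, scores below 0.5 shrink it.
    node.weight *= 1.0 + lr * (relevance - 0.5)
    node.visits += 1
    if node.visits >= refresh_after:
        # Placeholder for "evolution": a real system would rewrite the
        # node's themes and scenarios; here only a counter is bumped.
        node.generation += 1
        node.visits = 0


def run(steps=200, seed=0):
    rng = random.Random(seed)
    nodes = [Node("arithmetic word problems"),
             Node("medieval poetry"),
             Node("unit conversion")]
    # Assumed relevance of each topic to a math-style target benchmark.
    relevance = {"arithmetic word problems": 0.9,
                 "medieval poetry": 0.1,
                 "unit conversion": 0.7}
    for _ in range(steps):
        node = route(nodes, rng)
        update(node, relevance[node.topic])
    return nodes


if __name__ == "__main__":
    for n in run():
        print(f"{n.topic}: weight={n.weight:.3g}, generation={n.generation}")
```

Under these assumptions, nodes that score as relevant accumulate weight and are selected more often, while the refresh counter stands in for the mechanism that keeps heavily used nodes from producing repetitive data.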

If this is right

Weaker agents can now handle long-horizon web data tasks without devising strategies from scratch.
Automated alignment reaches state-of-the-art on PostTrainBench under strict compute constraints.
Cross-task generalization improves because the same interface works across different alignment objectives.
Dataset quality rises because the closed-loop interface filters and balances data dynamically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same routing-plus-diagnostics pattern could be applied to other agent tasks that require long-horizon information gathering.
Widespread adoption might shrink the amount of human-curated seed data needed for alignment pipelines.
If the mechanism scales, it could support more fully autonomous research agents that iterate on their own training data.

Load-bearing premise

The routing mechanism and diagnostic reports let agents steer synthesis effectively in noisy web settings without context overload.

What would settle it

An experiment in which Andes-equipped agents still generate low-quality or unbalanced datasets and fail to beat baselines on PostTrainBench.

Figures

Figures reproduced from arXiv: 2606.01279 by Hao Liang, Hengyi Feng, Lu Ma, Shengjie Ye, Wentao Zhang, Zhengyang Zhao.

**Figure 1.** Andes achieves SOTA performance on PostTrainBench. Compared to the bare execution baseline GLM-4.7 (Scaffold-only), Andes drives a definitive alignment leap to 33.4%, outperforming Opus-4.7 by 4.8%.

**Figure 2.** Overview of Andes. Guided by the Andes skill, a trainer agent decomposes downstream benchmarks into capability domains and invokes Andes once per domain; each call routes sampled topics through a self-evolving world tree, runs a two-stage QA generation and refinement pipeline, and returns a refined dataset together with a synthesis report that drives the trainer agent's configuration of the next call.

**Figure 3.** Visualization of Andes routing and node evolution mechanism. (a) The increasing fusion-data ratio shows that routing gradually shifts toward GSM8K-relevant nodes. (b) Top-routed GSM8K topics contain higher fusion ratios, indicating effective allocation to target-aligned capability regions. (c) The growing number of evolved themes and scenarios shows that Andes refreshes frequently selected nodes to preserve …

**Figure 4.** Experimental results across four base models on PostTrainBench. Different colors denote different benchmarks and the average score. Andes achieves the best average performance under three base-model settings.

**Figure 5.** Decoupled performance on PostTrainBench. Compared to GLM-4.7 (Scaffold-only), which averages 21.56%, Andes drives an 11.83-point gain to 33.39%, achieving multi-dimensional breakthroughs.

**Figure 6.** Visualization of the knowledge coverage of the Andes world tree. It shows the hierarchical distribution of all world-tree nodes across broad macro domains, covering diverse topics, themes, and scenarios.

**Figure 7.** t-SNE visualization of the questions of the Andes world tree. It shows that the world tree provides a broad and general knowledge space that supports diverse target tasks.
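The averages quoted in the Figure 5 caption can be checked with a few lines of arithmetic. The sketch below uses only the two averages given there and distinguishes the absolute gain (in percentage points) from the relative gain over the baseline; the relative figure is derived here and is not a number reported on this page.

```python
baseline = 21.56    # GLM-4.7 (Scaffold-only) average, in percent
with_andes = 33.39  # Andes average, in percent

# Absolute gain, in percentage points
absolute_gain = round(with_andes - baseline, 2)

# Relative gain, as a percentage of the baseline score
relative_gain = round(100 * (with_andes - baseline) / baseline, 1)

print(absolute_gain)  # 11.83
print(relative_gain)  # 54.9
```

So the 11.83 figure is a difference between two percentages, which corresponds to roughly a 55% improvement relative to the scaffold-only baseline.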

Original abstract

AI agents are increasingly being tasked with automating AI research itself, particularly the critical post-training phase that transforms base LLMs into aligned assistants. However, recent evaluations reveal that even frontier agents struggle to perform this task. While the success of post-training fundamentally relies on acquiring high-quality data, relying on agents to autonomously curate targeted training datasets from the open web introduces severe challenges. Executing the long-horizon tasks of searching, filtering, and balancing data within noisy web environments frequently overwhelms an agent's limited context, ultimately leading to degraded dataset quality and suboptimal downstream training performance. To bridge this gap, we introduce Andes (Agent Native Data Evolving Synthesis), a framework that reimagines data generation as a plug-and-play agent skill. Rather than forcing agents to devise complex data-gathering strategies from scratch, Andes provides an intelligent abstraction layer. By leveraging a self-evolving World Tree routing mechanism and actionable diagnostic reports, it allows trainer agents to dynamically steer data synthesis through an interactive, closed-loop interface. We demonstrate that under strict compute constraints, equipping foundationally weaker agents with Andes improves automated alignment, securing state-of-the-art performance on PostTrainBench and robust cross-task generalization. Our project is available at https://github.com/zzy1127/ANDES.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ANDES gives agents a World Tree router and diagnostics to handle web data synthesis without context overload, and the PostTrainBench results support the gains for weaker agents.

The core of this paper is a framework called ANDES that packages data synthesis for alignment as a reusable agent skill. It uses a self-evolving World Tree routing mechanism plus diagnostic reports so trainer agents can steer collection through a closed loop instead of drowning in web noise.

The work shows that weaker base agents equipped with this layer reach state-of-the-art on PostTrainBench and keep cross-task generalization under tight compute budgets. That is the concrete result worth noting. The experiments line up with the stated constraints and do not show internal contradictions in the metrics or setup.

The routing and reporting interface is the new piece; prior agent data work tends to leave the long-horizon search and balancing steps to the model itself. Here the abstraction is explicit and interactive, which matches the problem the authors describe.

One soft spot is that the contribution of the self-evolving component versus a simpler fixed router is not broken out in detail. The paper would be stronger with that ablation, but it is not a load-bearing gap given the overall controls. The citation pattern is normal for the area and does not rely on circular self-reference.

This is aimed at groups building agent systems for automated post-training or data curation. Readers who already run multi-agent loops on benchmarks like PostTrainBench will see immediate use for the interface design.

I would send it to peer review. The mechanism is specific enough and the results are reported cleanly enough that referees can evaluate whether the routing actually drives the reported lift.

Referee Report

0 major / 3 minor

Summary. The paper introduces the ANDES framework, which reimagines data generation for post-training alignment as a plug-and-play agent skill. It employs a self-evolving World Tree routing mechanism and actionable diagnostic reports to enable trainer agents to steer data synthesis in an interactive closed-loop manner, addressing context overload in noisy web environments. The authors claim that this allows foundationally weaker agents to achieve state-of-the-art performance on PostTrainBench with robust cross-task generalization under strict compute constraints.

Significance. If the reported results hold, this work has the potential to significantly advance the field of automated AI alignment by making high-quality data curation accessible to less capable agents. A notable strength is the open availability of the project code on GitHub, which supports reproducibility and further research. The experimental evidence provided in the full manuscript addresses the potential concern regarding the effectiveness of the World Tree mechanism in noisy environments, as the metrics demonstrate successful steering without the expected degradation.

minor comments (3)

[Abstract] The phrase 'strict compute constraints' is used but not quantified; providing specific details such as token limits or hardware specifications in the main text would improve clarity.
[§3] The description of the self-evolving World Tree could include a small example or diagram to illustrate how routing evolves over iterations.
[Table 2] The cross-task generalization results would benefit from additional baseline comparisons to strengthen the robustness claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the potential significance of the work, and recommendation for minor revision. We are pleased that the provided experimental evidence was found to address concerns about the World Tree mechanism in noisy environments, and we appreciate the acknowledgment of the open-source code supporting reproducibility.

Circularity Check

0 steps flagged

No significant circularity

The paper presents Andes as a new agent-native framework for data synthesis, relying on a self-evolving World Tree routing mechanism and diagnostic reports to enable closed-loop steering by trainer agents. All central claims rest on experimental results under stated compute constraints on PostTrainBench and cross-task generalization, with no equations, fitted parameters, or derivations that reduce outputs to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and the framework is introduced as an original abstraction layer rather than a renaming or self-referential fit. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract introduces one new mechanism (World Tree) whose effectiveness is asserted without external benchmarks or prior citations visible here.

invented entities (1)

self-evolving World Tree routing mechanism (no independent evidence)
purpose: Dynamically steer data synthesis through an interactive interface
Presented as the core technical component of ANDES in the abstract.


[1] [1]

Ultraif: Advancing instruction following from the wild, 2025

Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, and Baobao Chang. Ultraif: Advancing instruction following from the wild, 2025

2025

[2] [2]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, JohannesHeidecke,andKaranSinghal. Healthbench: Evaluatinglargelanguagemodelstowards improved human health, 2025

2025

[3] [3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Condor: Enhance llm alignment with knowledge-driven data synthesis and refinement, 2025

Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, and Kai Chen. Condor: Enhance llm alignment with knowledge-driven data synthesis and refinement, 2025

2025

[5] [5]

Mle-bench: Evaluating machine learning agents on machine learning engineering, 2025

JunShernChan, NeilChowdhury, OliverJaffe, JamesAung, DaneSherburn, EvanMays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. Mle-bench: Evaluating machine learning agents on machine learning engineering, 2025

2025

[6] [6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

2021

[7] [7]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

2021

[8] [8]

Self-play with execution feedback: Improving instruction-following capabilities of large language models, 2024

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models, 2024

2024

[9] [9]

Longcli-bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces, 2026

YukangFeng, JianwenSun, ZelaiYang, JiaxinAi, ChuanhaoLi, ZizhenLi, FanruiZhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, and Kaipeng Zhang. Longcli-bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces, 2026

2026

[10] [10]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021

[11] [11]

Mlagentbench: Evaluating language agents on machine learning experimentation, 2024

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation, 2024

2024

[12] [12]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models.arXiv preprint arXiv:2305.08322, 2023

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models.arXiv preprint arXiv:2305.08322, 2023

work page arXiv 2023

[13] [13]

Decif: Improving instruction-following through meta-decomposition, 2025

Tingfeng Hui, Pengyu Zhu, Bowen Ping, Ling Tang, Guanting Dong, Yaqi Zhang, and Sen Su. Decif: Improving instruction-following through meta-decomposition, 2025

2025

[14] [14]

Aide: Ai-driven exploration in the space of code, 2025

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code, 2025. 10

2025

[15] [15]

Wist: Web-grounded iterative self-play tree for domain-targeted reasoning improvement, 2026

Fangyuan Li, Pengfei Li, Shijie Wang, Junqi Gao, Jianxing Liu, Biqing Qi, and Yuqiang Li. Wist: Web-grounded iterative self-play tree for domain-targeted reasoning improvement, 2026

2026

[16] [16]

Traineragent: Customizable and efficient model training through llm-powered multi-agent system, 2023

Haoyuan Li, Hao Jiang, Tianke Zhang, Zhelun Yu, Aoxiong Yin, Hao Cheng, Siming Fu, Yuhao Zhang, and Wanggui He. Traineragent: Customizable and efficient model training through llm-powered multi-agent system, 2023

2023

[17] [17]

Infinity instruct: Scaling instruction selection and synthesis to enhance language models, 2025

Jijie Li, Li Du, Hanyu Zhao, Bo wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity instruct: Scaling instruction selection and synthesis to enhance language models, 2025

2025

[18] [18]

Gonzalez, and Ion Stoica

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024

2024

[19] Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, and Tie-Yan Liu. Autosota: An end-to-end automated research system for state-of-the-art AI model discovery, 2026.

[20] Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowe...

[21] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024.

[22] Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun, Wenran Liu, Kai Chen, and Yining Li. Trex: Automating LLM fine-tuning via agent-driven tree-based exploration, 2026.

[23] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algorithmic dis...

[24] Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025.

[25] Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. PostTrainBench: Can LLM agents automate LLM post-training?, 2026.

[26] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023.

[27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.

[28] Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, and Lijun Wu. Middo: Model-informed dynamic data optimization for enhanced LLM fine-tuning via closed-loop learning, 2025.

[29] Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu, and Shang-Wen Li. Matrix: Peer-to-peer multi-agent synthetic data generation framework, 2025.

[30] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023.

[31] Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of SFT: A reinforcement learning perspective with reward rectification, 2026.

[32] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions, 2025.

[33] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing, 2024.

[34] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025.

[35] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

[36] Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, and Wentao Zhang. Gift: Reconciling post-training objectives via finite-temperature Gibbs initialization, 2026.

[37] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.

[38] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2024.

[39] Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Boris Hanin, Ion Stoica, Joseph E. Gonzalez, and Matei Zaharia. BARE: Leveraging base language models for few-shot synthetic data generation, 2025.

Appendix Content

A Implementation Details
A.1 Sandbox and Hardware
A.2 Agent S...