SEAL: Synergistic Co-Evolution of Agents and Learning Environments

Pan Wang; Wei Wu; Xin Zhang; Xiujin Liu; Yihao Hu; Zhihao Wen

arxiv: 2605.24426 · v1 · pith:HMTXGVNYnew · submitted 2026-05-23 · 💻 cs.CL

Pith reviewed 2026-06-30 13:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM agents · tool use · co-evolution · self-improvement · failure diagnosis · environment adaptation · multi-turn interaction · low-resource learning

The pith

SEAL co-evolves LLM agents and training environments via shared turn-level failure diagnoses to improve tool-use learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies Agent-Environment Misalignment, where static environments lag behind evolving agent capabilities during interactive training. SEAL addresses this with a closed loop: it runs on-policy trajectories, diagnoses failures at the turn level, and feeds those labels as a common signal to adapt the environment's cues and feedback while also reweighting policy advantages. This joint adaptation is tested in multi-turn tool-use settings. A sympathetic reader would care because most prior self-evolution methods change only the policy or only the environment, leaving the supervision signal misaligned. The reported outcome is that 400 training samples suffice for sizable gains and out-of-distribution transfer.

Core claim

SEAL collects on-policy trajectories under executable verification, converts failed rollouts into turn-level failure labels, and uses these labels both to evolve the environment's training interface (clearer tool affordances, constraint information, recovery feedback) and to optimize the policy through diagnosis-guided advantage reweighting, producing consistent gains across backbones in in-distribution and out-of-distribution tool-use evaluations.
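
The data flow in this claim can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Turn` record, the `CUES` table, the label names, and the rule-based `diagnose` function are all hypothetical stand-ins, since the abstract does not say how diagnosis is performed. The only point shown is that one set of turn-level labels feeds both the environment side and the policy side.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    tool_call: str
    passed: bool                         # outcome of executable verification
    failure_label: Optional[str] = None  # filled in by the diagnoser


# Hypothetical mapping from failure label to an interface cue (not from the paper).
CUES = {
    "unknown_tool": "List the available tools and their signatures.",
    "bad_arguments": "Show the argument schema and its constraints.",
}


def diagnose(trajectory):
    """Toy rule-based diagnoser: attach a label to every failed turn."""
    for turn in trajectory:
        if not turn.passed:
            turn.failure_label = (
                "unknown_tool" if turn.tool_call.startswith("?") else "bad_arguments"
            )
    return trajectory


def co_evolve_step(trajectories):
    """One round: the same labels drive the environment and the policy signal."""
    labels = [
        turn.failure_label
        for traj in trajectories
        for turn in diagnose(traj)
        if turn.failure_label
    ]
    counts = Counter(labels)
    env_cues = [CUES[label] for label, _ in counts.most_common()]  # environment side
    policy_signal = dict(counts)                                    # policy side
    return env_cues, policy_signal


rollouts = [
    [Turn("get_weather(city='Paris')", True), Turn("?book_flight()", False)],
    [Turn("get_weather(city=3)", False)],
]
env_cues, policy_signal = co_evolve_step(rollouts)
```

In a real system the diagnoser would likely be far more involved, and the policy side would consume per-turn labels rather than aggregate counts.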

What carries the argument

The shared signal of turn-level failure labels diagnosed from on-policy trajectories, applied simultaneously to environment adaptation and policy optimization.

If this is right

With only 400 training samples SEAL produces average-point gains between +8.25 and +26.25 across three different backbones.
The same training run yields positive transfer on out-of-distribution multi-turn tool-use evaluations.
Environment adaptation consists of exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback.
Policy optimization uses diagnosis-guided advantage reweighting derived from the same failure labels.
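
One way to picture diagnosis-guided advantage reweighting is as a per-turn multiplier chosen from the failure label. The sketch below assumes exactly that; the abstract gives no formula, so the `LABEL_WEIGHTS` table, the label names, and the multiplicative form are illustrative assumptions rather than the authors' scheme.

```python
# Hypothetical per-label multipliers; the abstract does not specify the actual scheme.
LABEL_WEIGHTS = {"wrong_tool": 2.0, "bad_arguments": 1.5}


def reweight_advantages(advantages, labels, weights=LABEL_WEIGHTS, default=1.0):
    """Scale each turn's advantage by a weight looked up from its failure label.

    Turns with no diagnosed failure (label None) keep their original advantage.
    """
    if len(advantages) != len(labels):
        raise ValueError("need exactly one label (or None) per turn")
    return [adv * weights.get(label, default) for adv, label in zip(advantages, labels)]


advantages = [0.5, -1.0, -0.5]
labels = [None, "wrong_tool", "bad_arguments"]
reweighted = reweight_advantages(advantages, labels)
# reweighted == [0.5, -2.0, -0.75]
```

Under this assumption, turns diagnosed as failures receive a stronger gradient signal than undiagnosed turns, which is the general effect the phrase "advantage reweighting" suggests.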

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same shared-label loop might reduce the volume of human-written demonstrations needed for other interactive tasks such as code repair or multi-step planning.
Dynamic learning interfaces that change with the agent's revealed weaknesses could complement static reward models used in conventional reinforcement learning from human feedback.
If the failure-diagnosis step itself can be automated reliably, the method points toward fully closed-loop self-improvement without external supervision at each cycle.

Load-bearing premise

Turn-level failure labels extracted from on-policy trajectories form a reliable shared signal that can drive both environment changes and policy updates without adding new biases or inconsistencies.

What would settle it

Running the same 400-sample training regime on the three backbones but replacing the diagnosed failure labels with random or fixed labels, then observing no average-point gains or loss of out-of-distribution transfer, would falsify the value of the shared-signal mechanism.
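
The two control conditions named here (random labels and fixed labels) are easy to construct from an existing label sequence. The helpers below are a hypothetical sketch of that construction only; the function names and the constant label are assumptions, and the training and evaluation runs themselves are out of scope.

```python
import random


def shuffle_labels(labels, seed=0):
    """Random-label control: permute labels so counts are kept but turn alignment is broken."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def fixed_labels(labels, constant="bad_arguments"):
    """Fixed-label control: replace every diagnosed label with one constant label."""
    return [constant if label is not None else None for label in labels]


diagnosed = [None, "wrong_tool", "bad_arguments", None, "wrong_tool"]
random_control = shuffle_labels(diagnosed, seed=42)
fixed_control = fixed_labels(diagnosed)
```

Keeping the label counts identical in the shuffled condition isolates the effect of label-to-turn alignment from the effect of the label distribution.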

Original abstract

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SEAL's joint adaptation via shared failure diagnoses is a clean framing but the abstract leaves the label quality unaddressed, so the gains stay hard to evaluate.

SEAL's main contribution is the closed-loop setup that lets both the agent policy and the training environment adapt from the same turn-level failure labels extracted from on-policy trajectories. The environment gets updated with clearer affordances and recovery feedback while the policy uses diagnosis-guided advantage reweighting. That shared-signal design is distinct from prior work that tweaks only one side.

The experiments report clear gains: +8.25 to +26.25 points with just 400 samples across three backbones, plus positive out-of-distribution transfer on multi-turn tool-use tasks. The numbers are specific enough to be useful as a starting point for people working on low-resource agent training.

The soft spot is exactly where the stress-test note flags it. The abstract gives no description of how the failure labels are produced, no validation steps, and no ablations on label noise. If the diagnosis step is heuristic or model-mediated without checks, the co-evolution loop could reinforce whatever the initial trajectories already capture rather than expose genuine capability gaps. That makes the central claim rest on an unexamined assumption.

This paper is for researchers focused on interactive LLM agents and self-evolution methods who want ideas for tighter coupling between learner and substrate. A reader looking for a new angle on data-efficient training would get value from the framing and the reported transfer results.

The work is coherent on its own terms and the idea is worth testing further, so it deserves peer review. Referees will likely press for the missing diagnosis details and controls, but the core setup is worth that scrutiny.

Referee Report

2 major / 0 minor

Summary. The paper claims that most self-evolution methods for LLM agents adapt either the policy or the learning environment in isolation, creating 'Agent-Environment Misalignment.' SEAL addresses this via a closed-loop framework that collects on-policy trajectories, diagnoses failed rollouts into turn-level failure labels, and uses these labels as a shared signal: the environment adapts by exposing clearer tool affordances, constraints, and recovery feedback, while the policy is optimized via diagnosis-guided advantage reweighting. With only 400 training samples, SEAL reports +8.25 to +26.25 average-point gains across three backbones on in-distribution and out-of-distribution multi-turn tool-use evaluations, demonstrating positive OOD transfer.

Significance. If the central claim holds after validation of the diagnosis step, the work would be significant for low-resource agent learning: jointly adapting the learner and its training substrate could yield more robust self-improving agents than isolated policy or environment tuning, with the reported OOD transfer and small-sample gains offering a concrete path toward efficient interactive tool-use systems.

major comments (2)

[Abstract] The performance gains (+8.25 to +26.25 points) and OOD transfer rest on the claim that turn-level failure labels from on-policy trajectories provide a reliable shared signal for simultaneous environment adaptation and policy optimization. The abstract provides no description of the diagnosis procedure, no inter-annotator agreement or validation metrics, and no ablation on label noise; if the diagnosis is LLM-mediated or heuristic, errors could create self-reinforcing loops that inflate in-distribution scores without true robustness. This is load-bearing for the closed-loop co-evolution claim.
[Abstract] The methods for executable verification of trajectories and the precise mechanism of diagnosis-guided advantage reweighting are not detailed in the provided abstract; without these, it is impossible to assess whether the reported gains reduce to fitted parameters or introduce new biases in the co-evolution loop.
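
The agreement check the first comment asks for could be as simple as comparing the diagnoser's labels against a second, independent labeler on a sample of failed turns. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, in plain Python. Nothing in the abstract indicates that the authors use this metric; the label names and the example data are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each labeler assigned labels independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


diagnoser = ["wrong_tool", "wrong_tool", "bad_arguments", "bad_arguments"]
human = ["wrong_tool", "bad_arguments", "bad_arguments", "bad_arguments"]
kappa = cohens_kappa(diagnoser, human)
# observed = 0.75, expected = 0.5, so kappa = 0.5
```

A kappa near 1 would support the reliability of the shared signal, while a value near 0 would mean the labels agree with an independent labeler no better than chance.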

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the SEAL framework for low-resource agent learning. The two major comments both concern the level of detail in the abstract regarding the diagnosis procedure, executable verification, and advantage reweighting. The full manuscript elaborates on these elements; we will revise the abstract to incorporate brief descriptions and validation references as part of addressing the major revision request.

Referee: [Abstract] The performance gains (+8.25 to +26.25 points) and OOD transfer rest on the claim that turn-level failure labels from on-policy trajectories provide a reliable shared signal for simultaneous environment adaptation and policy optimization. The abstract provides no description of the diagnosis procedure, no inter-annotator agreement or validation metrics, and no ablation on label noise; if the diagnosis is LLM-mediated or heuristic, errors could create self-reinforcing loops that inflate in-distribution scores without true robustness. This is load-bearing for the closed-loop co-evolution claim.

Authors: We agree that the abstract is too concise on the diagnosis procedure and lacks any mention of validation or noise analysis. The full manuscript details the turn-level diagnosis process and includes supporting analyses. To directly address the concern about reliability and potential self-reinforcing loops, we will revise the abstract to add a short clause that summarizes the diagnosis validation approach and points to the noise ablations. revision: yes

Referee: [Abstract] The methods for executable verification of trajectories and the precise mechanism of diagnosis-guided advantage reweighting are not detailed in the provided abstract; without these, it is impossible to assess whether the reported gains reduce to fitted parameters or introduce new biases in the co-evolution loop.

Authors: We acknowledge that the abstract omits specifics on executable verification and the advantage reweighting mechanism. These components are described in the full manuscript. We will revise the abstract to include concise descriptions of both the verification process and the reweighting approach, enabling readers to better evaluate potential sources of gains or bias. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results with no derivations or self-referential reductions

The paper describes an empirical framework (SEAL) for co-evolving LLM agents and environments via on-policy trajectories and turn-level failure labels, reporting experimental gains (+8.25 to +26.25 points with 400 samples, OOD transfer) across backbones. No equations, mathematical derivations, fitted parameters, or first-principles claims appear in the provided text. The central results are presented as direct experimental outcomes rather than predictions that reduce to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify load-bearing steps. The diagnosis procedure and shared-signal assumption are described at a high level but not derived from prior results within the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework depends on the effectiveness of failure diagnosis as a shared signal; this is a domain assumption rather than a derived result. No free parameters or invented physical entities are mentioned.

axioms (1)

domain assumption: Failed rollouts can be reliably diagnosed into accurate turn-level failure labels that serve as a useful shared signal.
This premise is required for both the environment adaptation and the policy reweighting steps described in the abstract.

invented entities (1)

Agent-Environment Misalignment (no independent evidence)
purpose: To name the structural gap where agent capability changes but the environment does not.
Conceptual term introduced to motivate the co-evolution approach; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5757 in / 1274 out tokens · 38653 ms · 2026-06-30T13:35:15.272688+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 35 canonical work pages · 18 internal anchors

[1]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

2022
[2]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

2023
[3]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023
[4]

Omnivideo-r1: Reinforcing audio-visual reasoning with query intention and modality attention.arXiv preprint arXiv:2602.05847, 2026

Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, et al. Omnivideo-r1: Reinforcing audio-visual reasoning with query intention and modality attention.arXiv preprint arXiv:2602.05847, 2026

work page arXiv 2026
[5]

Dual Latent Memory for Visual Multi-agent System

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, et al. Dual latent memory for visual multi-agent system.arXiv preprint arXiv:2602.00471, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Agentic Reasoning for Large Language Models

Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, et al. Agentic reasoning for large language models.arXiv preprint arXiv:2601.12538, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation

Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jian-Guang Lou, Qingwei Lin, Ping Luo, and Saravan Rajmohan. Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 496–507, 2025

2025
[9]

Counterfactual evolution of multimodal datasets via visual programming.Advances in Neural Information Processing Systems, 38:81947–81976, 2026

Minghe Gao, Zhongqi Yue, Wenjie Yan, Yihao Hu, Wei Ji, Siliang Tang, Jun Xiao, Tat-Seng Chua, Yueting Zhuang, and Juncheng Li. Counterfactual evolution of multimodal datasets via visual programming.Advances in Neural Information Processing Systems, 38:81947–81976, 2026

2026
[10]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Agentevolver: Towards efficient self-evolving agent system.arXiv preprint arXiv:2511.10395, 2025

Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, et al. Agentevolver: Towards efficient self-evolving agent system.arXiv preprint arXiv:2511.10395, 2025

work page arXiv 2025
[12]

Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

work page arXiv 2025
[13]

Seagent: Self-evolving computer use agent with autonomous learning from experience.arXiv preprint arXiv:2508.04700, 2025

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience.arXiv preprint arXiv:2508.04700, 2025

work page arXiv 2025
[14]

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, and Zhihao Wen. Atlasva: Self-evolving visual skill memory for teacher-free vlm agents.arXiv preprint arXiv:2605.17933, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Api-bank: A comprehensive benchmark for tool-augmented llms

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 3102–3116, 2023

2023
[16]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024
[17]

Agent-environment alignment via automated interface generation.arXiv preprint arXiv:2505.21055, 2025

Kaiming Liu, Xuanyu Lei, Ziyue Wang, Peng Li, and Yang Liu. Agent-environment alignment via automated interface generation.arXiv preprint arXiv:2505.21055, 2025

work page arXiv 2025
[18]

Glove: Global verifier for llm memory-environment realignment.arXiv preprint arXiv:2601.19249, 2026

Xingkun Yin and Hongyang Du. Glove: Global verifier for llm memory-environment realignment.arXiv preprint arXiv:2601.19249, 2026. 12

work page arXiv 2026
[19]

Tool execution hallucination in llm-based agents: A unified taxonomy with detection, mitigation, and future directions.TechRxiv, 2026

Hanli Peng, Yongsen Zheng, Ziyao Liu, and Kwok-Yan Lam. Tool execution hallucination in llm-based agents: A unified taxonomy with detection, mitigation, and future directions.TechRxiv, 2026

2026
[20]

Evotool: Self-evolving tool-use policy optimization in llm agents via blame-aware mutation and diversity-aware selection

Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, and Eduard Hovy. Evotool: Self-evolving tool-use policy optimization in llm agents via blame-aware mutation and diversity-aware selection. arXiv preprint arXiv:2603.04900, 2026

work page arXiv 2026
[21]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data.arXiv preprint arXiv:2508.05004, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Agent-r: Training language model agents to reflect via iterative self-training.arXiv preprint arXiv:2501.11425, 2025

Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Training language model agents to reflect via iterative self-training.arXiv preprint arXiv:2501.11425, 2025

work page arXiv 2025
[23]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

A systematic survey of self-evolving agents: From model-centric to environment-driven co-evolution.TechRxiv, 2026

Zhishang Xiang, Chengyi Yang, Zerui Chen, Zhimin Wei, Yunbo Tang, Zongpei Teng, Zexi Peng, Zongxia Li, Chengsong Huang, Yicheng He, et al. A systematic survey of self-evolving agents: From model-centric to environment-driven co-evolution.TechRxiv, 2026

2026
[25]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[26]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th annual international conference on machine learning, pages 41–48, 2009

2009
[27]

Don’t just fine-tune the agent, tune the environment.arXiv preprint arXiv:2510.10197, 2025

Siyuan Lu, Zechuan Wang, Hongxuan Zhang, Qintong Wu, Leilei Gan, Chenyi Zhuang, Jinjie Gu, and Tao Lin. Don’t just fine-tune the agent, tune the environment.arXiv preprint arXiv:2510.10197, 2025

work page arXiv 2025
[28]

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu, Yong Wang, and Xiangxiang Chu. Coevolve: Training llm agents via agent-data mutual evolution.arXiv preprint arXiv:2604.15840, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

From failure to mastery: Generating hard samples for tool-use agents.arXiv preprint arXiv:2601.01498, 2026

Bingguang Hao, Zengzhuang Xu, Yuntao Wen, Xinyi Xu, Yang Liu, Tong Zhao, Maolin Wang, Long Chen, Dong Wang, Yicheng Chen, et al. From failure to mastery: Generating hard samples for tool-use agents.arXiv preprint arXiv:2601.01498, 2026

work page arXiv 2026
[30]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Self-consolidation for self-evolving agents.arXiv preprint arXiv:2602.01966, 2026

Hongzhuo Yu, Fei Zhu, Guo-Sen Xie, and Ling Shao. Self-consolidation for self-evolving agents.arXiv preprint arXiv:2602.01966, 2026

work page arXiv 2026
[32]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:25...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Self-adapting language models.arXiv preprint arXiv:2506.10943, 2025

Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Self-adapting language models.arXiv preprint arXiv:2506.10943, 2025

work page arXiv 2025
[34]

Teacher-student curriculum learning.IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732–3740, 2020

Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning.IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732–3740, 2020

2020
[35]

Automatic curriculum learning for deep RL: A short survey

Rémy Portelas, Cédric Colas, Lilian Weng, Katja Hofmann, and Pierre-Yves Oudeyer. Automatic curriculum learning for deep RL: A short survey. InProceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 4819–4825, 2020

2020
[36]

Self-instruct: Aligning language models with self-generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, 2023. 13

2023
[37]

Wizardlm: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. InThe Twelfth International Conference on Learning Representations, 2024

2024
[38]

Large language models as tool makers.arXiv preprint arXiv:2305.17126, 2023

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers.arXiv preprint arXiv:2305.17126, 2023

work page arXiv 2023
[39]

Creator: Tool creation for disentangling abstract and concrete reasoning of large language models

Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939, 2023

2023
[40]

Craft: Customizing llms by creating and retrieving from specialized toolsets.arXiv preprint arXiv:2309.17428, 2023

Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R Fung, Hao Peng, and Heng Ji. Craft: Customizing llms by creating and retrieving from specialized toolsets.arXiv preprint arXiv:2309.17428, 2023

work page arXiv 2023
[41]

The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

2025
[42]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Agentgym: Evolving large language model-based agents across diverse environments

Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, et al. Agentgym: Evolving large language model-based agents across diverse environments. arXiv preprint arXiv:2406.04151, 2024

work page arXiv 2024
[46]

Autoenv: Automated environments for measuring cross-environment agent learning.arXiv preprint arXiv:2511.19304, 2025

Jiayi Zhang, Yiran Peng, Fanqi Kong, Cheng Yang, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, et al. Autoenv: Automated environments for measuring cross-environment agent learning.arXiv preprint arXiv:2511.19304, 2025

work page arXiv 2025
[47]

Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators.arXiv preprint arXiv:2512.19682, 2025

Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, and Mengdi Wang. Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators.arXiv preprint arXiv:2512.19682, 2025

work page arXiv 2025
[48]

Gemini 3 pro preview model documentation

Google. Gemini 3 pro preview model documentation. Google AI for Developers documentation, 2025. https://ai.google.dev/gemini-api/docs/models/gemini-3-pro-preview

2025
[49]

Claude sonnet 4.5 system card

Anthropic. Claude sonnet 4.5 system card. System card, 2025. https://www.anthropic.com/claude-sonnet-4-5-system-card

2025
[50]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, Aaron Ostrow, Aaron Welihinda, Alex Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, David Bieber, Mike Schaekermann, Panupong Pasupat, Nitish Sachdeva, Inderjit Dhillon, Michael Blistein, Omer Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Glm-4.6 model card

Zhipu AI. Glm-4.6 model card. Hugging Face model card, 2025.https://huggingface.co/zai-org/GLM-4.6

[53]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jinkai Xu, Jing Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Ke...

arXiv, 2025
[54]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi X...

arXiv, 2024
[55]

xlam-2-3b-fc-r model card

Salesforce AI Research. xlam-2-3b-fc-r model card. Hugging Face model card, 2025. https://huggingface.co/Salesforce/xLAM-2-3b-fc-r

[56]

Toolace: Winning the points of llm function calling

Weiwen Liu, Xingshan Zeng, Keqing He, Yongliang Wang, Zhaoyang Yan, Fanya Wang, Jingsheng Cheng, Runlong Wang, Minpeng Shen, Xin Jiang, Yujie Qian, Qun Liu, and Lifeng Shang. Toolace: Winning the points of llm function calling. In International Conference on Learning Representations, 2025

[57]

Bitagent-8b model card

BitAgent. Bitagent-8b model card. Hugging Face model card, 2025. https://huggingface.co/BitAgent/BitAgent-8B

[58]

watt-tool-8b model card

watt-ai. watt-tool-8b model card. Hugging Face model card, 2024. https://huggingface.co/watt-ai/watt-tool-8B

Appendix A: Benchmark and Evaluation Details

BFCL V3. Our in-distribution evaluation uses the multi-turn subset of the Berkeley Function-Calling Leaderboard (BFCL) V3. The benchmark evaluates whether an agent can correctly use executable tools ov...
