arxiv: 2605.24426 · v1 · pith:HMTXGVNYnew · submitted 2026-05-23 · 💻 cs.CL

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

Pith reviewed 2026-06-30 13:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · tool use · co-evolution · self-improvement · failure diagnosis · environment adaptation · multi-turn interaction · low-resource learning

The pith

SEAL co-evolves LLM agents and training environments via shared turn-level failure diagnoses to improve tool-use learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies Agent-Environment Misalignment, where static environments lag behind evolving agent capabilities during interactive training. SEAL addresses this with a closed loop: it runs on-policy trajectories, diagnoses failures at the turn level, and feeds those labels as a common signal to adapt the environment's cues and feedback while also reweighting policy advantages. This joint adaptation is tested in multi-turn tool-use settings. A sympathetic reader would care because most prior self-evolution methods change only the policy or only the environment, leaving the supervision signal misaligned. The reported outcome is that 400 training samples suffice for sizable gains and out-of-distribution transfer.

Core claim

SEAL collects on-policy trajectories under executable verification, converts failed rollouts into turn-level failure labels, and uses these labels both to evolve the environment's training interface (clearer tool affordances, constraint information, recovery feedback) and to optimize the policy through diagnosis-guided advantage reweighting, producing consistent gains across backbones in in-distribution and out-of-distribution tool-use evaluations.
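The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Turn` and `Environment` types, the two failure labels (`bad_arguments`, `wrong_tool`), the keyword-based `diagnose` rule, and the hint text are all hypothetical, since the abstract does not specify how diagnosis or environment adaptation is carried out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Turn:
    tool_call: str
    ok: bool          # did executable verification accept this turn?
    error: str = ""   # verifier message, empty on success


@dataclass
class Environment:
    # Hypothetical training interface: failure label -> cue shown to the agent.
    hints: dict = field(default_factory=dict)


def diagnose(trajectory):
    """Return (turn_index, label) for the first failed turn, or None.

    The label rule here is a toy keyword heuristic (an assumption).
    """
    for i, turn in enumerate(trajectory):
        if not turn.ok:
            label = "bad_arguments" if "argument" in turn.error else "wrong_tool"
            return i, label
    return None


def evolve_environment(env, labels):
    """Attach one recovery-oriented cue per observed failure label."""
    for label, count in Counter(labels).items():
        env.hints[label] = f"Seen {count}x: check the tool schema before calling."
    return env


def co_evolve_step(env, trajectories):
    """One cycle: diagnose failed rollouts, then adapt the environment.

    The returned diagnoses are the shared signal that would also be
    passed to the policy update.
    """
    diagnoses = [d for d in map(diagnose, trajectories) if d is not None]
    env = evolve_environment(env, [label for _, label in diagnoses])
    return env, diagnoses


trajectories = [
    [Turn("search", True), Turn("book", False, "missing argument: date")],
    [Turn("pay", True)],
]
env, diagnoses = co_evolve_step(Environment(), trajectories)
```

The point of the sketch is the data flow: a single list of turn-level diagnoses is produced once per cycle and consumed by both the environment update and (not shown here) the policy update.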

What carries the argument

The shared signal of turn-level failure labels diagnosed from on-policy trajectories, applied simultaneously to environment adaptation and policy optimization.

If this is right

  • With only 400 training samples SEAL produces average-point gains between +8.25 and +26.25 across three different backbones.
  • The same training run yields positive transfer on out-of-distribution multi-turn tool-use evaluations.
  • Environment adaptation consists of exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback.
  • Policy optimization uses diagnosis-guided advantage reweighting derived from the same failure labels.
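The abstract does not give the form of diagnosis-guided advantage reweighting. One plausible reading, shown here purely as an assumption, is that the advantage at the diagnosed turn is scaled up so the policy update concentrates on the step where the rollout went wrong. The function name `reweight_advantages` and the `boost` parameter are hypothetical.

```python
def reweight_advantages(advantages, failed_turn, boost=2.0):
    """Scale the advantage at the diagnosed turn; leave other turns unchanged.

    advantages  : per-turn advantage estimates for one trajectory
    failed_turn : index of the diagnosed failure turn, or None if the
                  rollout passed verification
    boost       : multiplicative weight for the diagnosed turn (assumed)
    """
    if failed_turn is None:
        return list(advantages)
    return [a * boost if t == failed_turn else a
            for t, a in enumerate(advantages)]


# A failed three-turn rollout where the diagnosis points at turn 1.
weighted = reweight_advantages([-1.0, -1.0, -1.0], failed_turn=1)
```

With uniform trajectory-level advantages, every turn of a failed rollout is penalized equally; a turn-level weight of this kind is one way to localize credit, though the paper may use a different scheme.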

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shared-label loop might reduce the volume of human-written demonstrations needed for other interactive tasks such as code repair or multi-step planning.
  • Dynamic learning interfaces that change with the agent's revealed weaknesses could complement static reward models used in conventional reinforcement learning from human feedback.
  • If the failure-diagnosis step itself can be automated reliably, the method points toward fully closed-loop self-improvement without external supervision at each cycle.

Load-bearing premise

Turn-level failure labels extracted from on-policy trajectories form a reliable shared signal that can drive both environment changes and policy updates without adding new biases or inconsistencies.

What would settle it

Running the same 400-sample training regime on the three backbones but replacing the diagnosed failure labels with random or fixed labels, then observing no average-point gains or loss of out-of-distribution transfer, would falsify the value of the shared-signal mechanism.
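A sketch of how the label-replacement control could be set up, assuming diagnoses are stored as `(turn_index, label)` pairs. The helper names and the label set are hypothetical; only the idea of swapping diagnosed labels for random or fixed ones comes from the text above.

```python
import random


def randomize_labels(diagnoses, label_set, seed=0):
    """Keep each diagnosed turn index but draw its label uniformly at random."""
    rng = random.Random(seed)
    return [(turn, rng.choice(label_set)) for turn, _ in diagnoses]


def fix_labels(diagnoses, label):
    """Keep each diagnosed turn index but assign the same label everywhere."""
    return [(turn, label) for turn, _ in diagnoses]


label_set = ["bad_arguments", "wrong_tool", "missing_step"]
diagnoses = [(1, "bad_arguments"), (0, "wrong_tool"), (3, "missing_step")]

random_control = randomize_labels(diagnoses, label_set)
fixed_control = fix_labels(diagnoses, "wrong_tool")
```

Both controls preserve which turns are flagged and change only the label content, so any drop in performance relative to the diagnosed labels would be attributable to the labels themselves rather than to the amount of supervision.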

Original abstract

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that most self-evolution methods for LLM agents adapt either the policy or the learning environment in isolation, creating 'Agent-Environment Misalignment.' SEAL addresses this via a closed-loop framework that collects on-policy trajectories, diagnoses failed rollouts into turn-level failure labels, and uses these labels as a shared signal: the environment adapts by exposing clearer tool affordances, constraints, and recovery feedback, while the policy is optimized via diagnosis-guided advantage reweighting. With only 400 training samples, SEAL reports +8.25 to +26.25 average-point gains across three backbones on in-distribution and out-of-distribution multi-turn tool-use evaluations, demonstrating positive OOD transfer.

Significance. If the central claim holds after validation of the diagnosis step, the work would be significant for low-resource agent learning: jointly adapting the learner and its training substrate could yield more robust self-improving agents than isolated policy or environment tuning, with the reported OOD transfer and small-sample gains offering a concrete path toward efficient interactive tool-use systems.

major comments (2)
  1. [Abstract] The performance gains (+8.25 to +26.25 points) and OOD transfer rest on the claim that turn-level failure labels from on-policy trajectories provide a reliable shared signal for simultaneous environment adaptation and policy optimization. The abstract provides no description of the diagnosis procedure, no inter-annotator agreement or validation metrics, and no ablation on label noise; if the diagnosis is LLM-mediated or heuristic, errors could create self-reinforcing loops that inflate in-distribution scores without true robustness. This is load-bearing for the closed-loop co-evolution claim.
  2. [Abstract] The methods for executable verification of trajectories and the precise mechanism of diagnosis-guided advantage reweighting are not detailed in the provided abstract; without these, it is impossible to assess whether the reported gains reduce to fitted parameters or introduce new biases in the co-evolution loop.
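The agreement metric the first comment asks for could be computed as follows. This is a generic Cohen's kappa between two sets of turn-level labels (for example, an automated diagnoser and a human annotator); the label values are illustrative and nothing here is taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


diagnoser = ["bad_arguments", "bad_arguments", "wrong_tool", "wrong_tool"]
annotator = ["bad_arguments", "bad_arguments", "wrong_tool", "bad_arguments"]
kappa = cohens_kappa(diagnoser, annotator)  # observed 0.75, chance 0.5
```

A kappa near 0 would indicate that the diagnosed labels agree with the reference no better than chance, which is the scenario in which the self-reinforcing-loop concern becomes most serious.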

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the SEAL framework for low-resource agent learning. The two major comments both concern the level of detail in the abstract regarding the diagnosis procedure, executable verification, and advantage reweighting. The full manuscript elaborates on these elements; we will revise the abstract to incorporate brief descriptions and validation references as part of addressing the major revision request.

Point-by-point responses
  1. Referee: [Abstract] The performance gains (+8.25 to +26.25 points) and OOD transfer rest on the claim that turn-level failure labels from on-policy trajectories provide a reliable shared signal for simultaneous environment adaptation and policy optimization. The abstract provides no description of the diagnosis procedure, no inter-annotator agreement or validation metrics, and no ablation on label noise; if the diagnosis is LLM-mediated or heuristic, errors could create self-reinforcing loops that inflate in-distribution scores without true robustness. This is load-bearing for the closed-loop co-evolution claim.

    Authors: We agree that the abstract is too concise on the diagnosis procedure and does not mention validation or noise analysis. The full manuscript details the turn-level diagnosis process and includes supporting analyses. To address the concern about reliability and potential self-reinforcing loops, we will revise the abstract to add a short clause summarizing the diagnosis validation approach and a reference to the noise ablations. revision: yes

  2. Referee: [Abstract] The methods for executable verification of trajectories and the precise mechanism of diagnosis-guided advantage reweighting are not detailed in the provided abstract; without these, it is impossible to assess whether the reported gains reduce to fitted parameters or introduce new biases in the co-evolution loop.

    Authors: We acknowledge that the abstract omits specifics on executable verification and the advantage reweighting mechanism. These components are described in the full manuscript. We will revise the abstract to include concise descriptions of both the verification process and the reweighting approach, enabling readers to better evaluate potential sources of gains or bias. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results with no derivations or self-referential reductions

Full rationale

The paper describes an empirical framework (SEAL) for co-evolving LLM agents and environments via on-policy trajectories and turn-level failure labels, reporting experimental gains (+8.25 to +26.25 points with 400 samples, OOD transfer) across backbones. No equations, mathematical derivations, fitted parameters, or first-principles claims appear in the provided text. The central results are presented as direct experimental outcomes rather than predictions that reduce to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify load-bearing steps. The diagnosis procedure and shared-signal assumption are described at a high level but not derived from prior results within the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework depends on the effectiveness of failure diagnosis as a shared signal; this is a domain assumption rather than a derived result. No free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: Failed rollouts can be reliably diagnosed into accurate turn-level failure labels that serve as a useful shared signal
    This premise is required for both the environment adaptation and the policy reweighting steps described in the abstract.
invented entities (1)
  • Agent-Environment Misalignment (no independent evidence)
    purpose: To name the structural gap where agent capability changes but the environment does not
    Conceptual term introduced to motivate the co-evolution approach; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5757 in / 1274 out tokens · 38653 ms · 2026-06-30T13:35:15.272688+00:00 · methodology


Appendix A Benchmark and Evaluation Details

BFCL V3. Our in-distribution evaluation uses the multi-turn subset of the Berkeley Function-Calling Leaderboard (BFCL) V3. The benchmark evaluates whether an agent can correctly use executable tools ov...