Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

arxiv: 2508.08791 · v3 · submitted 2025-08-12 · 💻 cs.CL · cs.AI


Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen


Pith reviewed 2026-05-18 23:50 UTC · model grok-4.3


keywords: tool use, reinforcement learning, large language models, automated environments, verifiable rewards, feedback-driven training


The pith

An automated pipeline builds stable training environments so LLMs can improve at tool use through reinforcement learning with built-in verifiable rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to construct high-quality environments for training large language models on tool use without depending on external tools or human supervision. It breaks tasks into scenarios, generates supporting documents and functions, scales their complexity, and deploys them locally to supply precise, measurable feedback. A separate reward mechanism scores both the precision of tool use and the completeness of task execution. Standard reinforcement learning then uses trajectories from these environments to update the model. Experiments across model scales report significant gains in tool-use ability without degrading general capabilities, and the paper's analysis attributes the gains to updates in lower-layer MLP parameters.

Core claim

The automated environment construction pipeline, together with a verifiable reward that checks precision and completeness of tool use, produces training data that lets standard RL algorithms improve LLM tool-use performance without harming general capabilities.
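As a rough illustration of a reward that scores both precision and completeness, the sketch below combines two scores with a harmonic mean (an F1-style choice). This is an assumption made for illustration, not the paper's confirmed definition; the function names are hypothetical. Under the further assumption that precision equals q/p and completeness equals q, the harmonic mean simplifies algebraically to 2q/(p+1), which is the expression quoted later on this page. The excerpt does not define p or q, so that reading remains unverified.

```python
def harmonic_reward(precision: float, completeness: float) -> float:
    """F1-style combination of two scores in [0, 1] (illustrative assumption)."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= completeness <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    total = precision + completeness
    if total == 0.0:
        # Both scores are zero, so the combined reward is zero as well.
        return 0.0
    return 2.0 * precision * completeness / total


def quoted_reward(p: float, q: float) -> float:
    """Evaluate the expression R = 2q / (p + 1) exactly as quoted on this page.

    The meaning of p and q is not given in the excerpt; they are treated
    here as plain nonnegative numbers.
    """
    if p < 0 or q < 0:
        raise ValueError("p and q must be nonnegative")
    return 2.0 * q / (p + 1.0)


if __name__ == "__main__":
    # Assumed reading: precision = q / p and completeness = q.
    p, q = 2.0, 0.5
    print(harmonic_reward(q / p, q), quoted_reward(p, q))
```

A harmonic mean penalizes imbalance: a trajectory that completes the task with many wasted tool calls, or one that makes only correct calls but stops early, scores lower than one that does well on both counts.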

What carries the argument

Automated environment construction pipeline that performs scenario decomposition, document generation, function integration, complexity scaling, and localized deployment to supply detailed measurable feedback for RL.
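The five stages named above can be pictured as functions applied in order to a shared environment object. The sketch below is a hypothetical skeleton, not the released FTRL code: the `Environment` class, every stage function, and the placeholder logic inside each stage are assumptions chosen only to show the ordering and the fact that the resulting tools run in-process without external services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Environment:
    """Hypothetical container for one generated training environment."""
    scenario: str
    sub_tasks: List[str] = field(default_factory=list)
    tool_docs: Dict[str, str] = field(default_factory=dict)
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def decompose_scenario(env: Environment) -> Environment:
    # Placeholder: split the scenario text into sub-tasks on ";".
    env.sub_tasks = [s.strip() for s in env.scenario.split(";") if s.strip()]
    env.log.append("scenario decomposition")
    return env


def generate_documents(env: Environment) -> Environment:
    # Placeholder: one tool document per sub-task.
    env.tool_docs = {f"tool_{i}": f"Answers: {t}" for i, t in enumerate(env.sub_tasks)}
    env.log.append("document generation")
    return env


def integrate_functions(env: Environment) -> Environment:
    # Placeholder: bind each document to a local Python callable.
    for name, doc in env.tool_docs.items():
        env.tools[name] = lambda query, _doc=doc: f"{_doc} | query={query}"
    env.log.append("function integration")
    return env


def scale_complexity(env: Environment) -> Environment:
    # Placeholder: only records the stage; a real step would enrich the tools.
    env.log.append("complexity scaling")
    return env


def deploy_locally(env: Environment) -> Environment:
    # Placeholder: tools are already in-process callables, so nothing external is contacted.
    env.log.append("localized deployment")
    return env


STAGES = [decompose_scenario, generate_documents, integrate_functions,
          scale_complexity, deploy_locally]


def build_environment(scenario: str) -> Environment:
    env = Environment(scenario=scenario)
    for stage in STAGES:
        env = stage(env)
    return env
```

Calling `build_environment("find the city; look up its population")` yields two local tools and a log listing the five stages in order.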

If this is right

Tool-use performance rises across LLMs of different sizes.
General capabilities stay intact after the training.
Gains trace to updates in lower-layer MLP parameters that improve context understanding and reasoning.
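One way to probe a claim like the last one is to compare weights before and after training and measure the relative change per layer and module type. The sketch below is a minimal illustration with plain Python lists, not the paper's analysis method. The parameter naming scheme (`layers.<index>.<mlp|attn>`), the assumption of one flattened weight vector per layer and module, and the toy numbers are all hypothetical.

```python
import math
import re
from typing import Dict, List, Tuple


def relative_change(before: List[float], after: List[float]) -> float:
    """L2 norm of the update divided by the L2 norm of the original weights."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    base = math.sqrt(sum(b * b for b in before))
    return diff / base if base > 0 else 0.0


def change_by_layer_and_module(
    before: Dict[str, List[float]],
    after: Dict[str, List[float]],
) -> Dict[Tuple[int, str], float]:
    """Map (layer index, module type) to the relative weight change."""
    pattern = re.compile(r"layers\.(\d+)\.(mlp|attn)")
    changes: Dict[Tuple[int, str], float] = {}
    for name, weights in before.items():
        match = pattern.search(name)
        if match is None:
            continue  # skip parameters outside the assumed naming scheme
        key = (int(match.group(1)), match.group(2))
        changes[key] = relative_change(weights, after[name])
    return changes


if __name__ == "__main__":
    # Toy weights, made up for illustration only.
    before = {"layers.0.mlp": [1.0, 0.0], "layers.0.attn": [1.0, 0.0], "layers.5.mlp": [1.0, 0.0]}
    after = {"layers.0.mlp": [1.0, 0.5], "layers.0.attn": [1.0, 0.0], "layers.5.mlp": [1.0, 0.1]}
    changes = change_by_layer_and_module(before, after)
    print(max(changes, key=changes.get))
```

Normalizing by the original norm keeps layers with large weights from dominating the comparison.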

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be adapted to create training loops for other interactive agent skills such as web navigation or code execution.
Because environments are generated automatically, the method may scale to larger and more diverse task sets than hand-crafted datasets allow.

Load-bearing premise

The environments created by the pipeline supply feedback that is accurate enough for the intended tasks and general enough for the learned improvements to carry over to new situations.

What would settle it

Run the same RL training on models using the constructed environments and observe either no gain in tool-use metrics or clear drops in general capability benchmarks.

Original abstract

Effective tool use is essential for large language models (LLMs) to interact with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models' tool-use performance without degrading their general capabilities. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models. Code and data are available at https://github.com/bytedance/FTRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a concrete automated pipeline for generating RL environments and verifiable rewards for LLM tool use, with reported gains on their setups, but transfer and feedback accuracy outside those setups are not clearly shown.


This paper's main takeaway is a full automated pipeline that breaks down scenarios, generates documents, integrates functions, scales complexity, and deploys locally to create self-contained RL training environments for tool use. They pair it with a reward that scores both precision and completeness, then run standard RL on the resulting trajectories. Experiments across model scales report better tool-use performance without hurting general capabilities, and they release the code and data at the GitHub link.

Referee Report

2 major / 1 minor

Summary. The paper proposes an automated pipeline for constructing self-contained training environments to enable reinforcement learning (RL) for improving tool-use in LLMs. The pipeline incorporates scenario decomposition, document generation, function integration, complexity scaling, and localized deployment to produce environments that supply detailed, measurable feedback without external tools. A verifiable reward mechanism assesses both precision of tool use and completeness of task execution. This is combined with trajectory data to train models via standard RL algorithms. Experiments across LLMs of varying scales report significant gains in tool-use performance without degrading general capabilities, with analysis linking gains to updates in lower-layer MLP parameters. Code and data are released at the provided GitHub link.

Significance. If the central results hold under scrutiny, the work provides a scalable, automated approach to generating high-quality RL environments and rewards for LLM tool use, addressing a practical bottleneck in training agentic systems. The public release of code and data is a clear strength that enables reproducibility and community follow-up.

major comments (2)

[Pipeline description and §4 (Experiments)] The pipeline description (scenario decomposition, document generation, function integration, complexity scaling, localized deployment): the claim that these steps yield accurate, measurable feedback for genuine tool-use gains rests on unvalidated assumptions about the fidelity of LLM-driven generation steps; any systematic error would cause RL to reinforce incorrect trajectories, and no explicit fidelity checks or OOD tool-use tests are reported to rule this out.
[Experiments] Experiments section: positive outcomes are reported on multiple model scales, yet the abstract and results provide no baseline comparisons, statistical tests, or controls for environment-specific artifacts, leaving the robustness of the performance lift and the generalization claim difficult to assess.

minor comments (1)

[Abstract] The abstract could specify the exact RL algorithms employed and the precise definitions/metrics for the 'precision' and 'completeness' components of the verifiable reward.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We have carefully addressed each major comment below, making revisions to strengthen the validation of the pipeline and the experimental analysis where possible.


Referee: [Pipeline description and §4 (Experiments)] The pipeline description (scenario decomposition, document generation, function integration, complexity scaling, localized deployment): the claim that these steps yield accurate, measurable feedback for genuine tool-use gains rests on unvalidated assumptions about the fidelity of LLM-driven generation steps; any systematic error would cause RL to reinforce incorrect trajectories, and no explicit fidelity checks or OOD tool-use tests are reported to rule this out.

Authors: We agree that the fidelity of LLM-driven generation steps is a critical assumption and that systematic errors could propagate through RL training. The verifiable reward mechanism was designed to mitigate this by requiring both precise tool calls and complete task execution, but we acknowledge that this does not fully substitute for direct validation of the generated content. In the revised manuscript, we have added a new subsection under the pipeline description that reports manual fidelity checks on a sampled subset of generated documents and functions against human-authored references, along with quantitative agreement metrics. We have also included OOD tool-use evaluation on held-out scenarios in the experiments to test generalization beyond the training distribution.

revision: yes

Referee: [Experiments] Experiments section: positive outcomes are reported on multiple model scales, yet the abstract and results provide no baseline comparisons, statistical tests, or controls for environment-specific artifacts, leaving the robustness of the performance lift and the generalization claim difficult to assess.

Authors: We appreciate this observation on experimental robustness. The original experiments primarily contrasted RL-trained models against their untuned base versions across scales, which demonstrated consistent gains. To address the referee's points, the revised version now incorporates statistical significance testing (paired t-tests on per-task metrics) and additional controls, including performance breakdowns by environment complexity and explicit evaluation on environments with modified artifacts. These updates help quantify the reliability of the reported improvements and support the generalization claims.

revision: yes
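The rebuttal mentions paired t-tests on per-task metrics. As a minimal sketch of what such a test computes, the function below returns the paired t statistic and its degrees of freedom using only the Python standard library. The function name and the scores in the usage example are made up for illustration and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-task differences.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    if len(before) < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical per-task scores for a base model and its RL-trained version.
    base_scores = [0.40, 0.55, 0.30, 0.62]
    tuned_scores = [0.52, 0.60, 0.41, 0.70]
    t, df = paired_t_statistic(base_scores, tuned_scores)
    print(f"t = {t:.3f}, df = {df}")
```

Pairing by task removes between-task difficulty from the comparison, so the test looks only at whether the per-task improvement is consistently nonzero. The p-value would then be read from a t distribution with the returned degrees of freedom.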

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline and experimental results are self-contained


The paper proposes an automated environment construction pipeline (scenario decomposition, document generation, function integration, complexity scaling, localized deployment) paired with a verifiable reward (precision + completeness) for RL-based tool-use training. Performance gains are reported from experiments on LLMs of varying scales, with code and data released. No equations, fitted parameters, or self-citations reduce the central claims to quantities defined by the inputs themselves. The derivation chain is methodological description followed by external experimental validation, which remains falsifiable via the provided repository and does not rely on self-referential definitions or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that automatically generated environments supply reliable, measurable feedback that supports genuine capability improvement rather than overfitting to synthetic scenarios.

axioms (1)

Domain assumption: The pipeline components (scenario decomposition, document generation, function integration, complexity scaling, localized deployment) collectively produce high-quality training environments that provide detailed and measurable feedback without external tools.
Invoked when the authors state that the pipeline enables creation of environments for feedback-driven training.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution... R = 2q/(p+1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models.CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page 2020
[4]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas 9 Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavari...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

MUC-4 evaluation metrics

Nancy Chinchor. MUC-4 evaluation metrics. In Proceedings of the 4th Conference on Message Understanding, MUC1992, McLean, Virginia, USA, June 16-18, 1992, pages 22–29. ACL, 1992. doi: 10.3115/1072064.1072067. URL https://doi.org/ 10.3115/1072064.1072067

work page doi:10.3115/1072064.1072067 1992
[6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://ar xiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

doi: 10.48550/ARXIV.2501.12948. URL https://doi.org/10.48550/arXiv.2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948
[9]

Gemini 2.5: Pushing the fron- tier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

Google Gemini Team. Gemini 2.5: Pushing the fron- tier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

work page
[10]

URL https://storage.googleapis.com/dee pmind-media/gemini/gemini_v2_5_report.pdf

work page
[11]

Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings

Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advancesin Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, ...

work page 2023
[12]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net,

work page 2021
[13]

URL https://openreview.net/forum?i d=d7KBjmI3GmQ

work page
[14]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021...

work page 2021
[15]

URL https://datasets-benchmarks-pro ceedings.neurips.cc/paper/2021/hash/be83ab 3ecd0db773eb2dc1b0a17836a1-Abstract-round2. html

work page 2021
[16]

Tool documentation enables zero-shot tool-usage with large language models,

Cheng-Yu Hsieh, Si-An Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, and Tomas Pfister. Tool documentation enables zero-shot tool-usage with large language models. CoRR, abs/2308.00675,

work page arXiv
[17]

Tool documentation enables zero-shot tool-usage with large language models,

doi: 10.48550/ARXIV.2308.00675. URL https://doi.org/10.48550/arXiv.2308.00675

work page doi:10.48550/arxiv.2308.00675
[18]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, and Haotian Xu. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and 10 reward models. CoRR, abs/2501.03262, 2025. URL https://arxiv.org/abs/2501.03262

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Metatool benchmark for large language models: Deciding whether to use tools and which to use

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, and Lichao Sun. Metatool benchmark for large language models: Deciding whether to use tools and which to use. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net...

work page 2024
[20]

Controlllm: Augment language models with tools by searching on graphs

Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, and Wenhai Wang. Controlllm: Augment language models with tools by searching on graphs. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Co...

work page doi:10.1007/978-3-031-732 2024
[22]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2 303.08774. URL https://doi.org/10.48550/arX iv.2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2 2023
[24]

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. CoRR, abs/2504.13958, 2025. doi: 10.48550 /ARXIV.2504.13958. URL https://doi.org/10.4 8550/arXiv.2504.13958

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learnin...

work page 2024
[27]

Tool learning with large language models: a survey

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: a survey. FrontiersComput. Sci., 19(8):198343, 2025. doi: 10.1007/S11704-024-40678-2. URL https: //doi.org/10.1007/s11704-024-40678-2

work page doi:10.1007/s11704-024-40678-2 2025
[28]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean- Baptiste Alayrac, Radu Soricut, Angeliki Lazari- dou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuan...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.05530 2024
[29]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing System...

work page 2023
[31]

URL http://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Hybridflow: A flexible and efficient RLHF framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. In Proceedings of the TwentiethEuropean Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297. ACM, 2025. doi: 10.1145/36 8...

work page doi:10.1145/36 2025
[34]

Restgpt: Connecting large language models with real-world applications via restful apis.CoRR, abs/2306.06624,

Yifan Song, Weimin Xiong, Dawei Zhu, Cheng Li, Ke Wang, Ye Tian, and Sujian Li. Restgpt: Connecting large language models with real-world applications via restful apis.CoRR, abs/2306.06624,

work page arXiv
[35]

Restgpt: Connecting large language models with real-world applications via restful apis.CoRR, abs/2306.06624,

doi: 10.48550/ARXIV.2306.06624. URL https://doi.org/10.48550/arXiv.2306.06624

work page doi:10.48550/arxiv.2306.06624
[36]

Le, Ed H

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big- bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan L. Boyd- Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Ling...

work page doi:10.18653/v1/2023.finding 2023
[37]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases.CoRR, abs/2306.05301, 2023. doi: 10.48550/ARXIV.2306.05301. URL https: //doi.org/10.48550/arXiv.2306.05301

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05301 2023
[38]

Introducing claude 4, 2025

Anthropic Team. Introducing claude 4, 2025. URL https://www.anthropic.com/news/claude-4

work page 2025
[39]

Appworld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubra- manian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Associati...

work page doi:10.18653/v1/2024.acl-long.850 2024
[40]

Toolgen: Unified tool retrieval and calling via generation

Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. Toolgen: Unified tool retrieval and calling via generation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=XLMAMmowdY

work page 2025
[41]

The Rise and Potential of Large Language Model Based Agents: A Survey

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, a...

work page internal anchor Pith review Pith/arXiv arXiv
[42]

The Rise and Potential of Large Language Model Based Agents: A Survey

doi: 10.48550/ARXIV.2309.07864. URL https://doi.org/10.48550/arXiv.2309.07864

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.07864
[45]

URL https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=WE_vlu YUL-X

work page 2023
[49]

Toolsword: Unveiling safety issues of large language models in tool learning across three stages

Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, and Xuanjing Huang. Toolsword: Unveiling safety issues of large language models in tool learning across three stages. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics...

work page 2024
[50]

Association for Computational Linguistics,

work page
[51]

URL https://doi.org/10.18653/v1/2024.acl-l ong.119

doi: 10.18653/V1/2024.ACL-LONG.119. URL https://doi.org/10.18653/v1/2024.acl-l ong.119

work page doi:10.18653/v1/2024.acl-long.119 2024
[52]

Rotbench: A multi- level benchmark for evaluating the robustness of large language models in tool learning

Junjie Ye, Yilong Wu, Songyang Gao, Caishuang Huang, SixianLi, GuanyuLi, XiaoranFan, QiZhang, Tao Gui, and Xuanjing Huang. Rotbench: A multi- level benchmark for evaluating the robustness of large language models in tool learning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natu...

work page 2024
[53]

Tl-training: A task-feature-based framework for training large language models in tool use.CoRR, abs/2412.15495, 2024

Junjie Ye, Yilong Wu, Sixian Li, Yuming Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, and Zhengyin Du. Tl-training: A task-feature-based framework for training large language models in tool use.CoRR, abs/2412.15495, 2024. doi: 10.48550/ARXIV.2412. 15495. URL https://doi.org/10.48550/arXiv.2 412.15495

work page doi:10.48550/arxiv.2412 2024
[54]

Toolhop: A query-driven benchmark for evaluating large language models in multi-hop tool use

Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, and Jiecao Chen. Toolhop: A query-driven benchmark for evaluating large language models in multi-hop tool use. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, P...

work page 2025
[55]

MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, and Xuanjing Huang. A multi-dimensional constraint framework for evaluating and improving instruction following in large language models. CoRR, abs/2505.07591, 2025. doi: 10.48550/AR...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arx 2025
[56]

Tooleyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios

Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, and Xuanjing Huang. Tooleyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steve...

work page 2025
[57]

Steptool: A step- grained reinforcement learning framework for tool learning in llms

Yuanqing Yu, Zhefan Wang, Weizhi Ma, Zhicheng Guo, Jingtao Zhan, Shuai Wang, Chuhan Wu, Zhiqiang Guo, and Min Zhang. Steptool: A step- grained reinforcement learning framework for tool learning in llms. CoRR, abs/2410.07745, 2024. doi: 10.48550/ARXIV.2410.07745. URL https: //doi.org/10.48550/arXiv.2410.07745

work page doi:10.48550/arxiv.2410.07745 2024
[58]

Agent-r: Train- ing language model agents to reflect via iterative self-training.arXiv preprint arXiv:2501.11425, 2025

Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Training language model agents to reflect via iterative self- training. CoRR, abs/2501.11425, 2025. doi: 10.485 50/ARXIV.2501.11425. URL https://doi.org/10 .48550/arXiv.2501.11425

work page arXiv 2025
[59]

Toolqa: A dataset for LLM question answering with external tools

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian 13 Sun, and Chao Zhang. Toolqa: A dataset for LLM question answering with external tools. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advancesin Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIP...

work page 2023
[60]

Bursztyn, Ryan A

Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor S. Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, and Chao Zhang. Toolchain*: Efficient ac- tion space navigation in large language models with a* search. InThe TwelfthInternational Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openre...

1. **Analyze the Problem**: Understand the question and determine the type of information required to answer it.
2. **Tool Design**: Design a tool that can solve the problem, considering the complexity and additional functionalities it might need.
3. **Parameter Specification**: Define the parameters for the tool, ensuring they are comprehensive and flexible for various use cases.
4. **Output Construction**: Format the output in JSON, including both the analysis and the tool schema.

# Notes

- Ensure the tool is versatile enough to handle different but similar queries.
- Consider edge cases.

# Output Format

The output should be a JSON object with the following structure **without any other contents**:

- "analysis": A detailed analy...

1. **Parse and Understand**: Begin by parsing each tool document's JSON schema to understand its functionality, inputs, and outputs. Identify key features that define its purpose and operations.
2. **Compare Documents**: Systematically compare each document to identify tools with identical or overlapping functionalities. Look for description of each tool to determine similarities.
3. **Merge Tools**: For each group of functionally identical tools, merge them into a single new schema. Ensure the merged schema accommodates all functionalities from the original tools without loss of essential detail or compatibility.
4. **Compose Analysis**: Draft your reasoning process, describing how the schemas were compared, how conclusions on identical functionalities were reached, and details of how they were merged.

# Output Format

Your output must be valid JSON according to the following structure:

- `"analysis"`: A string detailing your reasoning, including how you compare...

1. **Analyze the Current Tool**: Examine the existing tool's description and parameters to understand its functionality and limitations.
2. **Identify Areas for Refinement**: Determine which aspects of the tool can be improved or expanded to better meet real-world requirements.
3. **Refine the Description**: Refine existing parameters so that each parameter value is an objective entity. Introduce new parameters to increase complexity and utility, but ensure full compatibility with legacy functionality.
4. **Ensure Compatibility**: Verify that the refined version remains compatible with the original tool's purpose and structure.

# Output Format

The output should be in JSON format with the following structure **without any other contents**:

- "analysis": Analysis of ideas about refining the tool.
- "refined_version": The version after refinement, should...

1. **Understand the Tool Document**: Carefully review the tool document to identify the function name, parameter names, and types. Ensure that these details are used as-is in the function implementation.
2. **Analyze Question-Answer Pairs**: Examine these pairs to understand how questions map to function inputs and how answers should be derived from function outputs.
3. **Implement the Function**:
   - Use the tool-specified function name.
   - Define parameters exactly as specified in the tool document.
   - Implement logic to correctly derive answers for questions based on the input parameters.
   - When parameters are assigned default values, Make sure that the function return value contains the complete given answer, i.e., the...
4. **Error Handling**: Develop a comprehensive mechanism to return error messages for incorrect inputs or other issues, ensuring the function operates reliably in all scenarios.

# Output Format

The result should be output in JSON format, adhering to the following structure **without anything else**:

- "analysis": A detailed explanation of the functio...

1. **Understand the Problem**: Read and comprehend the details of the given problem.
2. **Analyze the Code**: Examine the provided function code to ascertain how it addresses the problem.
3. **Confirm Code-to-Problem Suitability**: Determine if the function correctly solves the problem as described.
4. **Derive Function Call**: Craft a function call using the problem's specific details for parameter values.

# Output Format

Output the result in the following JSON format without any additional text:

- "analysis": A description analyzing how the given code relates to and addresses the problem.
- "call": The function call formatted as func(param="value...
