ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

arxiv: 2508.04086 · v2 · submitted 2025-08-06 · 💻 cs.CL

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

Zhongyi Zhou , Kohei Uehara , Haoyu Zhang , Jingtao Zhou , Lin Gu , Ruofei Du , Zheng Xu , Tatsuya Harada This is my paper

Pith reviewed 2026-05-19 01:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords tool-use datasettextual gradientsagentic frameworksynthetic data generationLLM fine-tuningtool callingiterative refinement

0 comments p. Extension

The pith

ToolGrad generates superior tool-use training data by building valid chains first with textual gradients before creating queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard ways to make datasets for teaching language models to use tools start with a user query and then try to add complex tool sequences, which often fails and wastes effort. ToolGrad reverses this by first assembling valid tool-use chains through repeated steps guided by textual gradients that point out and fix problems, then writing user queries to match those chains. The resulting ToolGrad-500 dataset has more intricate tool interactions, costs far less to create, and succeeds almost every time. Models fine-tuned on it beat models trained on pricier conventional datasets and even some closed-source large models. A reader would care because good training data is what lets AI systems reliably handle real tasks that require multiple tools in sequence.

Core claim

ToolGrad is an agentic framework that first constructs valid tool-use chains through an iterative process guided by textual gradients, and then synthesizes corresponding user queries. This answer-first approach produced the ToolGrad-500 dataset with more complex tool use, lower cost, and almost 100 percent pass rate. Models trained on ToolGrad outperform those trained on expensive baseline datasets and proprietary LLMs.

What carries the argument

The iterative process that uses textual gradients to build and refine valid tool-use chains before any user query is generated.

Load-bearing premise

The iterative textual gradient process reliably produces valid and complex tool-use chains that match what real users ask.

What would settle it

Fine-tuning an LLM on ToolGrad data and then measuring no gain or a loss in accuracy on a benchmark of genuine user queries that require tool use would show the central claim is incorrect.

Figures

Figures reproduced from arXiv: 2508.04086 by Haoyu Zhang, Jingtao Zhou, Kohei Uehara, Lin Gu, Ruofei Du, Tatsuya Harada, Zheng Xu, Zhongyi Zhou.

**Figure 2.** Figure 2: ToolGrad Framework. Each iteration starts with (qt, Wt, rt) and a mini-batch of APIs. An API Proposer first predicts up to m APIs, and then m API Executors perform tool calls and return execution reports. An API Selector finds the most valuable API to chain Wt → Wt+1. Lastly, an LLM updater is used to predict qt+1, rt+1. nAI SDK. With this dataset, the model is trained to predict all the tool uses in one s… view at source ↗

**Figure 3.** Figure 3: ToolGrad-5K benchmark on non-reasoning models. Raw data in the figure is available in [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of base and reasoning Gemini / [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: A visualized comparison among standard, ReAct, DFS inference frameworks. Our system cannot self-instruct while generating the data. That being said, “API Executor” cannot learn from the tool-use experiences of other “API Executor” within or even outside a generation session. To further enhance our system, we encourage future work to incorporate a memory system in our current implementation of ToolGrad. B … view at source ↗

read the original abstract

Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-500, a dataset generated with more complex tool use, lower cost, and almost 100% pass rate. Experiments show that ToolGrad models outperform those trained on expensive baseline datasets and proprietary LLMs. The ToolGrad source code, dataset, and models are available at https://github.com/zhongyi-zhou/toolgrad.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ToolGrad flips the usual query-first pipeline by building valid tool chains first with iterative textual feedback, then generating matching queries.

read the letter

ToolGrad flips the usual query-first pipeline by building valid tool chains first with iterative textual feedback, then generating matching queries. Prior work often starts with a user query and tries to annotate complex tool sequences afterward, frequently running into failures with methods like depth-first search. This paper inverts that by constructing the chains first through repeated refinement guided by textual gradients, then creating queries that fit those chains. The result is their ToolGrad-500 dataset, which shows higher complexity in tool use, lower generation cost, and nearly 100% pass rate on checks. Models trained on it reportedly beat both expensive baseline datasets and some proprietary LLMs on tool-use tasks. The public release of code, data, and models is a clear plus, as it lets others test the claims without extra work. The inversion targets a real bottleneck in synthetic data for tool-augmented models, and the reported efficiency gains line up with the design choice. One soft spot is that the abstract stays high-level on how the textual gradients are actually computed and applied during iterations. The full paper will need to include ablations, exact metrics, and controls to confirm the gains are robust rather than tied to specific implementation details. This work is aimed at researchers building datasets for tool-calling LLMs and agent systems. Anyone focused on efficient synthetic data or improving tool-use performance would find the method and artifacts directly useful. It deserves peer review because the core claim is testable, the artifacts support verification, and the approach engages honestly with the limitations of earlier query-first methods.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ToolGrad, an agentic framework for synthesizing tool-use datasets for LLMs. It inverts the conventional pipeline by first iteratively constructing valid tool-use chains guided by textual 'gradients' (rather than generating queries first and then attempting DFS-style annotations), and subsequently synthesizing corresponding user queries. This produces the ToolGrad-500 dataset with greater tool-use complexity, reduced generation cost, and a near-100% pass rate. Experiments show that models trained on ToolGrad data outperform those trained on expensive baseline datasets as well as proprietary LLMs.

Significance. If the results hold under rigorous scrutiny, the work offers a practical advance in efficient, high-validity synthetic data generation for tool-augmented agents. The public release of code, dataset, and models is a clear strength that supports reproducibility and community follow-up. The inversion of the generation order directly targets a known failure mode in prior annotation pipelines.

major comments (2)

[§4.2, Table 2] §4.2, Table 2: The reported outperformance over baselines lacks any statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals on the pass-rate and complexity deltas). Without these, the claim that ToolGrad models 'outperform' remains suggestive rather than conclusive, especially given the modest dataset size of 500 examples.
[§3.2] §3.2: The iterative textual-gradient procedure is described at a high level without an explicit stopping criterion or failure-mode analysis. It is therefore unclear whether the near-100% pass rate is an intrinsic property of the method or an artifact of the particular tool set and prompt templates used in ToolGrad-500.

minor comments (2)

[Abstract] The abstract states 'almost 100% pass rate' while the main text should report the precise figure, the exact definition of a 'pass,' and any edge cases that were filtered.
[Figure 1, §3.1] Figure 1 caption and §3.1: The diagram of the textual-gradient loop would benefit from an accompanying pseudocode listing to make the update rule reproducible from the text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of our inversion of the synthesis pipeline, and the recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses

Referee: [§4.2, Table 2] §4.2, Table 2: The reported outperformance over baselines lacks any statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals on the pass-rate and complexity deltas). Without these, the claim that ToolGrad models 'outperform' remains suggestive rather than conclusive, especially given the modest dataset size of 500 examples.

Authors: We agree that statistical testing would make the outperformance claims more conclusive. In the revised manuscript we will add bootstrap confidence intervals (1,000 resamples) around the pass-rate and complexity deltas reported in Table 2. We will also note the number of evaluation runs used to compute these intervals, directly addressing the modest dataset size concern. revision: yes
Referee: [§3.2] §3.2: The iterative textual-gradient procedure is described at a high level without an explicit stopping criterion or failure-mode analysis. It is therefore unclear whether the near-100% pass rate is an intrinsic property of the method or an artifact of the particular tool set and prompt templates used in ToolGrad-500.

Authors: We will expand §3.2 to include the explicit stopping criterion (termination when the textual gradient reports no further validity improvement or after a hard maximum of five iterations) and a short failure-mode analysis based on our generation logs. While the near-100% pass rate was obtained with the current tool set, the answer-first construction with gradient guidance is intended to enforce validity by design rather than relying on post-hoc repair; we will clarify this distinction in the revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity in ToolGrad derivation chain

full rationale

The paper presents ToolGrad as a new agentic framework that inverts prior data-generation pipelines by first constructing valid tool-use chains via iterative textual gradients and then synthesizing user queries. This methodological choice is described directly in the abstract without reduction to self-citations, fitted parameters renamed as predictions, or ansatzes imported from prior author work. The central claims rest on experimental outperformance and the public release of code, dataset, and models, which are externally verifiable and independent of the derivation itself. No load-bearing step equates outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical method paper in NLP, it likely relies on standard LLM capabilities and prompting techniques without introducing new mathematical axioms or entities. Specific hyperparameters or assumptions are not detailed in the abstract.

pith-pipeline@v0.9.0 · 5685 in / 1007 out tokens · 43749 ms · 2026-05-19T01:07:59.573950+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ToolGrad first constructs valid tool-use chains through an iterative process guided by textual 'gradients', and then synthesizes corresponding user queries.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

[1]

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Curran Associates, Inc. Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. 2025. Octotools: An agentic framework with extensible tools for complex reason- ing. Preprint, arXiv:2502.11271. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

API Proposer

InstructPipe: Building Visual Programming Pipelines With Human Instructions Using LLMs. Preprint, arXiv:2312.09672. Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. ToolQA: A Dataset for LLM Question Answering With External Tools. In Ad- vances in Neural Information Processing Systems , volume 36, pages 50117–50143. Curran Associates, ...

work page arXiv 2023
[3]

verify whether the API-calling result follows the plan

work page
[4]

report `success = False `, if you fail to get the expected result, and explain why

work page
[5]

report `success = True `, if you get the expected result, and provide justification for the success

work page
[6]

API Executor

if you report `success = True `, you should also report which function calling step leads to the success. Chances are that the API may return bad results or fail to execute in one attempt. In such cases, you should do another try by changing the input. If it still fails, you should report `success = False`. The following is the plan: {plan} Notes: - If yo...

work page
[7]

whether any API can be used to augment the current workflow

work page
[8]

if yes, select one API to augment the current workflow

work page
[9]

API Selector

decide whether you want to append the selected API to a api-use chain or create a new api-use chain with this API. 3.1 When the `tool_input` value in ` ToolAgentAction` of this API is dependent on any API execution `response` in an api-use chain, choose the append operation. Examples include the `tool_input` reuse any information in the ` response`. When ...

work page
[10]

The query should be sufficiently detailed to ensure an LLM can trigger all API calls in the provided chains

*Infer the user query* that would have triggered all the API-calling events. The query should be sufficiently detailed to ensure an LLM can trigger all API calls in the provided chains

work page
[11]

Inverse Prediction

*Predict the agent 's response* to the user after executing all API calls in the workflow. The response should reflect the results of the executed APIs in a natural and informative way. Notes: - The inferred user query must be comprehensive enough to guide the LLM in generating all API calls (including the input and the selection of api/tool name) across ...

work page
[12]

- Response Count: Determine how many of these requests the response addresses

Coverage of Requests: - User Requests Count: Identify the number of distinct requests or tasks contained in the user query. - Response Count: Determine how many of these requests the response addresses. - If a request is not addressed at all, that aspect should receive a score of 0

work page
[13]

- If all API calls related to the request are failed, then the score is 0

Quality of Each Response: - For each request/task that the request addresses, rate the quality of the response on a scale from 0 to 100. - If all API calls related to the request are failed, then the score is 0. - If there is successful API call related to the request, then the score can be greater than 0. - socre = 100 means the response is 1) grounded o...

work page
[14]

API Executor

Final Score Calculation: - Compute the final score by averaging the individual scores for each aspect of the query. - For example, if the user query requests 5 tasks, the AI response only does 3 tasks, and the quality of the response is 80, 90, 70, then the final score is (80 + 90 + 70 + 0 + 0) / 5 = 48. Input Data: User query: {query} Tool use trace: {to...

work page 2023
[15]

standard

are trained to incorporate the ReAct (Yao 13 Table 5: LLM Benchmark on ToolGrad-5K. The best score is highlighted in each metric across all models. ToolGrad gpt-4.1 gemini-2.5 flash claude-3.7 sonnet deepseek v3 llama-4 maverick1B 4B 12B Tool recall 98.8 99.3 99.6 84.1 82.4 84.9 83.9 83.4 Success rate 95.5 96.4 96.8 78.6 78.4 79.6 79.4 80.6 QoR 93.7 95.3 ...

work page 2023

[1] [1]

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Curran Associates, Inc. Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. 2025. Octotools: An agentic framework with extensible tools for complex reason- ing. Preprint, arXiv:2502.11271. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

API Proposer

InstructPipe: Building Visual Programming Pipelines With Human Instructions Using LLMs. Preprint, arXiv:2312.09672. Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. ToolQA: A Dataset for LLM Question Answering With External Tools. In Ad- vances in Neural Information Processing Systems , volume 36, pages 50117–50143. Curran Associates, ...

work page arXiv 2023

[3] [3]

verify whether the API-calling result follows the plan

work page

[4] [4]

report `success = False `, if you fail to get the expected result, and explain why

work page

[5] [5]

report `success = True `, if you get the expected result, and provide justification for the success

work page

[6] [6]

API Executor

if you report `success = True `, you should also report which function calling step leads to the success. Chances are that the API may return bad results or fail to execute in one attempt. In such cases, you should do another try by changing the input. If it still fails, you should report `success = False`. The following is the plan: {plan} Notes: - If yo...

work page

[7] [7]

whether any API can be used to augment the current workflow

work page

[8] [8]

if yes, select one API to augment the current workflow

work page

[9] [9]

API Selector

decide whether you want to append the selected API to a api-use chain or create a new api-use chain with this API. 3.1 When the `tool_input` value in ` ToolAgentAction` of this API is dependent on any API execution `response` in an api-use chain, choose the append operation. Examples include the `tool_input` reuse any information in the ` response`. When ...

work page

[10] [10]

The query should be sufficiently detailed to ensure an LLM can trigger all API calls in the provided chains

*Infer the user query* that would have triggered all the API-calling events. The query should be sufficiently detailed to ensure an LLM can trigger all API calls in the provided chains

work page

[11] [11]

Inverse Prediction

*Predict the agent 's response* to the user after executing all API calls in the workflow. The response should reflect the results of the executed APIs in a natural and informative way. Notes: - The inferred user query must be comprehensive enough to guide the LLM in generating all API calls (including the input and the selection of api/tool name) across ...

work page

[12] [12]

- Response Count: Determine how many of these requests the response addresses

Coverage of Requests: - User Requests Count: Identify the number of distinct requests or tasks contained in the user query. - Response Count: Determine how many of these requests the response addresses. - If a request is not addressed at all, that aspect should receive a score of 0

work page

[13] [13]

- If all API calls related to the request are failed, then the score is 0

Quality of Each Response: - For each request/task that the request addresses, rate the quality of the response on a scale from 0 to 100. - If all API calls related to the request are failed, then the score is 0. - If there is successful API call related to the request, then the score can be greater than 0. - socre = 100 means the response is 1) grounded o...

work page

[14] [14]

API Executor

Final Score Calculation: - Compute the final score by averaging the individual scores for each aspect of the query. - For example, if the user query requests 5 tasks, the AI response only does 3 tasks, and the quality of the response is 80, 90, 70, then the final score is (80 + 90 + 70 + 0 + 0) / 5 = 48. Input Data: User query: {query} Tool use trace: {to...

work page 2023

[15] [15]

standard

are trained to incorporate the ReAct (Yao 13 Table 5: LLM Benchmark on ToolGrad-5K. The best score is highlighted in each metric across all models. ToolGrad gpt-4.1 gemini-2.5 flash claude-3.7 sonnet deepseek v3 llama-4 maverick1B 4B 12B Tool recall 98.8 99.3 99.6 84.1 82.4 84.9 83.9 83.4 Success rate 95.5 96.4 96.8 78.6 78.4 79.6 79.4 80.6 QoR 93.7 95.3 ...

work page 2023