
arxiv: 2508.04086 · v2 · submitted 2025-08-06 · 💻 cs.CL

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

Pith reviewed 2026-05-19 01:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords tool-use dataset · textual gradients · agentic framework · synthetic data generation · LLM fine-tuning · tool calling · iterative refinement

The pith

ToolGrad generates superior tool-use training data by building valid tool-use chains with textual gradients first and writing the matching user queries afterwards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard pipelines for building tool-use training datasets start with a user query and then try to annotate it with complex tool sequences (for example, via depth-first search), a step that often fails and wastes effort. ToolGrad reverses this order: it first assembles valid tool-use chains through repeated steps guided by textual gradients that point out and fix problems, then writes user queries to match those chains. The resulting ToolGrad-500 dataset has more intricate tool interactions, costs less to create, and has an almost 100% pass rate. Models fine-tuned on it beat models trained on pricier conventional datasets and also beat proprietary LLMs. A reader would care because good training data is what lets AI systems reliably handle real tasks that require multiple tools in sequence.

Core claim

ToolGrad is an agentic framework that first constructs valid tool-use chains through an iterative process guided by textual gradients, and then synthesizes corresponding user queries. This answer-first approach produced the ToolGrad-500 dataset with more complex tool use, lower cost, and an almost 100 percent pass rate. Models trained on ToolGrad data outperform both models trained on expensive baseline datasets and proprietary LLMs.

What carries the argument

The iterative process that uses textual gradients to build and refine valid tool-use chains before any user query is generated.
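As a rough illustration, one generation session can be sketched from the Figure 2 caption: propose up to m APIs, execute each, select one to extend the workflow, then update the query and response. Every name below (`toolgrad_step`, `Report`, `State`, and the `toy_*` stand-ins for the LLM-backed components) is hypothetical and chosen for this sketch; it is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Report:
    """Execution report from a (hypothetical) API Executor."""
    api: str
    success: bool
    response: str


@dataclass
class State:
    """(q_t, W_t, r_t): user query, workflow of chained calls, agent response."""
    query: str = ""
    workflow: List[Report] = field(default_factory=list)
    response: str = ""


def toolgrad_step(state, api_batch, propose, execute, select, update, m=3):
    """One iteration: propose <= m APIs, execute them, select one, update (q, r)."""
    candidates = propose(state, api_batch)[:m]
    reports = [execute(api) for api in candidates]
    chosen = select(state, reports)
    if chosen is None:  # nothing worth chaining in this mini-batch
        return state
    workflow = state.workflow + [chosen]  # W_t -> W_{t+1}
    query, response = update(workflow)    # predict q_{t+1}, r_{t+1}
    return State(query=query, workflow=workflow, response=response)


# Toy stand-ins for the LLM-backed components (assumptions for illustration).
def toy_propose(state: State, batch: List[str]) -> List[str]:
    used = {r.api for r in state.workflow}
    return [api for api in batch if api not in used]


def toy_execute(api: str) -> Report:
    ok = not api.startswith("broken")
    return Report(api, ok, f"{api}: ok" if ok else f"{api}: error")


def toy_select(state: State, reports: List[Report]) -> Optional[Report]:
    return next((r for r in reports if r.success), None)


def toy_update(workflow: List[Report]) -> Tuple[str, str]:
    names = ", ".join(r.api for r in workflow)
    return f"Please use: {names}", f"Results from: {names}"


state = State()
for batch in [["broken_api", "weather"], ["geocode", "weather"]]:
    state = toolgrad_step(state, batch, toy_propose, toy_execute,
                          toy_select, toy_update)
```

Because the query is written only after a call has executed successfully, every chained step in the toy workflow is valid by construction, which is the property the answer-first ordering is meant to provide.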

Load-bearing premise

The iterative textual gradient process reliably produces valid and complex tool-use chains that match what real users ask.

What would settle it

Fine-tuning an LLM on ToolGrad data and then measuring no gain or a loss in accuracy on a benchmark of genuine user queries that require tool use would show the central claim is incorrect.

Figures

Figures reproduced from arXiv: 2508.04086 by Haoyu Zhang, Jingtao Zhou, Kohei Uehara, Lin Gu, Ruofei Du, Tatsuya Harada, Zheng Xu, Zhongyi Zhou.

Figure 1: Prior art for tool-use dataset generation (top) starts with a user query, followed by an expensive, failure…
Figure 2: ToolGrad Framework. Each iteration starts with (q_t, W_t, r_t) and a mini-batch of APIs. An API Proposer first predicts up to m APIs, and then m API Executors perform tool calls and return execution reports. An API Selector finds the most valuable API to chain W_t → W_{t+1}. Lastly, an LLM updater is used to predict q_{t+1}, r_{t+1}.
Figure 3: ToolGrad-5K benchmark on non-reasoning models. Raw data in the figure is available in…
Figure 4: Comparison of base and reasoning Gemini /…
Figure 5: A visualized comparison among standard, ReAct, DFS inference frameworks.
Our system cannot self-instruct while generating the data. That being said, “API Executor” cannot learn from the tool-use experiences of other “API Executor” within or even outside a generation session. To further enhance our system, we encourage future work to incorporate a memory system in our current implementation of ToolGrad.
Original abstract

Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-500, a dataset generated with more complex tool use, lower cost, and almost 100% pass rate. Experiments show that ToolGrad models outperform those trained on expensive baseline datasets and proprietary LLMs. The ToolGrad source code, dataset, and models are available at https://github.com/zhongyi-zhou/toolgrad.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ToolGrad, an agentic framework for synthesizing tool-use datasets for LLMs. It inverts the conventional pipeline by first iteratively constructing valid tool-use chains guided by textual 'gradients' (rather than generating queries first and then attempting DFS-style annotations), and subsequently synthesizing corresponding user queries. This produces the ToolGrad-500 dataset with greater tool-use complexity, reduced generation cost, and a near-100% pass rate. Experiments show that models trained on ToolGrad data outperform those trained on expensive baseline datasets as well as proprietary LLMs.

Significance. If the results hold under rigorous scrutiny, the work offers a practical advance in efficient, high-validity synthetic data generation for tool-augmented agents. The public release of code, dataset, and models is a clear strength that supports reproducibility and community follow-up. The inversion of the generation order directly targets a known failure mode in prior annotation pipelines.

major comments (2)
  1. [§4.2, Table 2] The reported outperformance over baselines lacks any statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals on the pass-rate and complexity deltas). Without these, the claim that ToolGrad models 'outperform' remains suggestive rather than conclusive, especially given the modest dataset size of 500 examples.
  2. [§3.2] The iterative textual-gradient procedure is described at a high level without an explicit stopping criterion or failure-mode analysis. It is therefore unclear whether the near-100% pass rate is an intrinsic property of the method or an artifact of the particular tool set and prompt templates used in ToolGrad-500.
minor comments (2)
  1. [Abstract] The abstract states 'almost 100% pass rate'; the main text should report the precise figure, the exact definition of a 'pass', and any edge cases that were filtered.
  2. [Figure 1, §3.1] The diagram of the textual-gradient loop would benefit from an accompanying pseudocode listing to make the update rule reproducible from the text alone.
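The bootstrap confidence intervals requested in major comment 1 could be computed along the following lines. This is a minimal sketch of a paired percentile bootstrap, assuming per-example scores for two models on the same evaluation set; the function name `bootstrap_ci` and the score lists are made up for illustration and are not data from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(a: List[float], b: List[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(a) - mean(b), paired by example index."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    deltas = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(a[i] - b[i] for i in idx) / n)
    deltas.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return deltas[lo_idx], deltas[max(lo_idx, hi_idx)]


# Synthetic per-example pass indicators (1 = pass), NOT results from the paper.
model_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
model_b = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
low, high = bootstrap_ci(model_a, model_b)
```

If the resulting interval excludes zero, the difference is unlikely to be resampling noise; with only a few hundred examples the interval can still be wide, which is the referee's concern.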

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of our inversion of the synthesis pipeline, and the recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

  1. Referee: [§4.2, Table 2] The reported outperformance over baselines lacks any statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals on the pass-rate and complexity deltas). Without these, the claim that ToolGrad models 'outperform' remains suggestive rather than conclusive, especially given the modest dataset size of 500 examples.

    Authors: We agree that statistical testing would make the outperformance claims more conclusive. In the revised manuscript we will add bootstrap confidence intervals (1,000 resamples) around the pass-rate and complexity deltas reported in Table 2. We will also note the number of evaluation runs used to compute these intervals, directly addressing the modest dataset size concern.

    Revision: yes

  2. Referee: [§3.2] The iterative textual-gradient procedure is described at a high level without an explicit stopping criterion or failure-mode analysis. It is therefore unclear whether the near-100% pass rate is an intrinsic property of the method or an artifact of the particular tool set and prompt templates used in ToolGrad-500.

    Authors: We will expand §3.2 to include the explicit stopping criterion (termination when the textual gradient reports no further validity improvement or after a hard maximum of five iterations) and a short failure-mode analysis based on our generation logs. While the near-100% pass rate was obtained with the current tool set, the answer-first construction with gradient guidance is intended to enforce validity by design rather than relying on post-hoc repair; we will clarify this distinction in the revision.

    Revision: partial
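The stopping criterion stated in the second response (stop when the textual gradient reports no further improvement, or after a hard maximum of five iterations) can be sketched as a small wrapper. The names `refine`, `gradient_fn`, and `apply_fn`, and the convention that a gradient of `None` means "no further improvement", are assumptions made for this sketch rather than details from the paper.

```python
from typing import Callable, List, Optional, Tuple


def refine(chain: List[str],
           gradient_fn: Callable[[List[str]], Optional[str]],
           apply_fn: Callable[[List[str], str], List[str]],
           max_iters: int = 5) -> Tuple[List[str], int]:
    """Apply textual gradients until none is reported or max_iters is reached.

    Returns the final chain and the number of updates that were applied.
    """
    for applied in range(max_iters):
        gradient = gradient_fn(chain)
        if gradient is None:  # assumed signal for "no further improvement"
            return chain, applied
        chain = apply_fn(chain, gradient)
    return chain, max_iters


# Toy critic: the "gradient" is a text instruction naming the next missing step.
TARGET = ["search", "fetch", "summarize"]


def toy_gradient(chain: List[str]) -> Optional[str]:
    missing = [step for step in TARGET if step not in chain]
    return f"add:{missing[0]}" if missing else None


def toy_apply(chain: List[str], gradient: str) -> List[str]:
    return chain + [gradient.split(":", 1)[1]]


final_chain, n_updates = refine(["search"], toy_gradient, toy_apply)
```

The hard cap matters for the failure-mode analysis the referee asks for: a chain that hits `max_iters` without the critic going quiet is a candidate failure case worth logging.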

Circularity Check

0 steps flagged

No significant circularity in ToolGrad derivation chain


The paper presents ToolGrad as a new agentic framework that inverts prior data-generation pipelines by first constructing valid tool-use chains via iterative textual gradients and then synthesizing user queries. This methodological choice is described directly in the abstract without reduction to self-citations, fitted parameters renamed as predictions, or ansatzes imported from prior author work. The central claims rest on experimental outperformance and the public release of code, dataset, and models, which are externally verifiable and independent of the derivation itself. No load-bearing step equates outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical method paper in NLP, it likely relies on standard LLM capabilities and prompting techniques without introducing new mathematical axioms or entities. Specific hyperparameters or assumptions are not detailed in the abstract.

pith-pipeline@v0.9.0 · 5685 in / 1007 out tokens · 43749 ms · 2026-05-19T01:07:59.573950+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1]

    OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

    Curran Associates, Inc. Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. 2025. Octotools: An agentic framework with extensible tools for complex reasoning. Preprint, arXiv:2502.11271. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 ...

  2. [2]

    API Proposer

    InstructPipe: Building Visual Programming Pipelines With Human Instructions Using LLMs. Preprint, arXiv:2312.09672. Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. ToolQA: A Dataset for LLM Question Answering With External Tools. In Advances in Neural Information Processing Systems, volume 36, pages 50117–50143. Curran Associates, ...

  3. [3]

    verify whether the API-calling result follows the plan

  4. [4]

    report `success = False`, if you fail to get the expected result, and explain why

  5. [5]

    report `success = True`, if you get the expected result, and provide justification for the success

  6. [6]

    API Executor

    if you report `success = True`, you should also report which function calling step leads to the success. Chances are that the API may return bad results or fail to execute in one attempt. In such cases, you should do another try by changing the input. If it still fails, you should report `success = False`. The following is the plan: {plan} Notes: - If yo...

  7. [7]

    whether any API can be used to augment the current workflow

  8. [8]

    if yes, select one API to augment the current workflow

  9. [9]

    API Selector

    decide whether you want to append the selected API to an api-use chain or create a new api-use chain with this API. 3.1 When the `tool_input` value in `ToolAgentAction` of this API is dependent on any API execution `response` in an api-use chain, choose the append operation. Examples include the `tool_input` reusing any information in the `response`. When ...

  10. [10]

    The query should be sufficiently detailed to ensure an LLM can trigger all API calls in the provided chains

    *Infer the user query* that would have triggered all the API-calling events. The query should be sufficiently detailed to ensure an LLM can trigger all API calls in the provided chains

  11. [11]

    Inverse Prediction

    *Predict the agent's response* to the user after executing all API calls in the workflow. The response should reflect the results of the executed APIs in a natural and informative way. Notes: - The inferred user query must be comprehensive enough to guide the LLM in generating all API calls (including the input and the selection of api/tool name) across ...

  12. [12]

    - Response Count: Determine how many of these requests the response addresses

    Coverage of Requests: - User Requests Count: Identify the number of distinct requests or tasks contained in the user query. - Response Count: Determine how many of these requests the response addresses. - If a request is not addressed at all, that aspect should receive a score of 0

  13. [13]

    - If all API calls related to the request are failed, then the score is 0

    Quality of Each Response: - For each request/task that the response addresses, rate the quality of the response on a scale from 0 to 100. - If all API calls related to the request failed, then the score is 0. - If there is a successful API call related to the request, then the score can be greater than 0. - score = 100 means the response is 1) grounded o...

  14. [14]

    API Executor

    Final Score Calculation: - Compute the final score by averaging the individual scores for each aspect of the query. - For example, if the user query requests 5 tasks, the AI response only does 3 tasks, and the quality of the response is 80, 90, 70, then the final score is (80 + 90 + 70 + 0 + 0) / 5 = 48. Input Data: User query: {query} Tool use trace: {to...

  15. [15]

    standard

    are trained to incorporate the ReAct (Yao

    Table 5: LLM Benchmark on ToolGrad-5K. The best score is highlighted in each metric across all models.

    Metric        ToolGrad 1B   ToolGrad 4B   ToolGrad 12B   gpt-4.1   gemini-2.5 flash   claude-3.7 sonnet   deepseek v3   llama-4 maverick
    Tool recall   98.8          99.3          99.6           84.1      82.4               84.9                83.9          83.4
    Success rate  95.5          96.4          96.8           78.6      78.4               79.6                79.4          80.6
    QoR           93.7          95.3          ...
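The final score calculation quoted in entries [12] to [14] (average the per-request quality scores, counting every unaddressed request as 0) can be written out as a short function. The name `final_score` and its argument layout are hypothetical and chosen for this sketch; the worked example is the one given in entry [14].

```python
from typing import Sequence


def final_score(num_requests: int, qualities: Sequence[float]) -> float:
    """Average quality over all requests; unaddressed requests score 0.

    num_requests: number of distinct requests in the user query.
    qualities: 0-100 quality score for each request the response addressed.
    """
    if num_requests <= 0:
        raise ValueError("num_requests must be positive")
    if len(qualities) > num_requests:
        raise ValueError("more quality scores than requests")
    if any(q < 0 or q > 100 for q in qualities):
        raise ValueError("quality scores must lie in [0, 100]")
    # Pad with zeros for requests the response did not address at all.
    padded = list(qualities) + [0.0] * (num_requests - len(qualities))
    return sum(padded) / num_requests


# Worked example from the quoted prompt: 5 requests, 3 addressed.
example = final_score(5, [80, 90, 70])  # (80 + 90 + 70 + 0 + 0) / 5 = 48
```

Padding with zeros rather than averaging only the addressed requests means coverage and quality are folded into one number, so a response that does three tasks well still loses points for the two it skipped.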