ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"
Pith reviewed 2026-05-19 01:07 UTC · model grok-4.3
The pith
ToolGrad generates superior tool-use training data by building valid chains first with textual gradients before creating queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ToolGrad is an agentic framework that first constructs valid tool-use chains through an iterative process guided by textual gradients, and then synthesizes corresponding user queries. This answer-first approach produced the ToolGrad-500 dataset with more complex tool use, lower cost, and an almost 100 percent pass rate. Models trained on ToolGrad data outperform both models trained on expensive baseline datasets and proprietary LLMs.
What carries the argument
The iterative process that uses textual gradients to build and refine valid tool-use chains before any user query is generated.
Load-bearing premise
The iterative textual gradient process reliably produces valid and complex tool-use chains that match what real users ask.
What would settle it
Fine-tuning an LLM on ToolGrad data and then measuring no gain or a loss in accuracy on a benchmark of genuine user queries that require tool use would show the central claim is incorrect.
Original abstract
Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-500, a dataset generated with more complex tool use, lower cost, and almost 100% pass rate. Experiments show that ToolGrad models outperform those trained on expensive baseline datasets and proprietary LLMs. The ToolGrad source code, dataset, and models are available at https://github.com/zhongyi-zhou/toolgrad.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ToolGrad, an agentic framework for synthesizing tool-use datasets for LLMs. It inverts the conventional pipeline by first iteratively constructing valid tool-use chains guided by textual 'gradients' (rather than generating queries first and then attempting DFS-style annotations), and subsequently synthesizing corresponding user queries. This produces the ToolGrad-500 dataset with greater tool-use complexity, reduced generation cost, and a near-100% pass rate. Experiments show that models trained on ToolGrad data outperform those trained on expensive baseline datasets as well as proprietary LLMs.
Significance. If the results hold under rigorous scrutiny, the work offers a practical advance in efficient, high-validity synthetic data generation for tool-augmented agents. The public release of code, dataset, and models is a clear strength that supports reproducibility and community follow-up. The inversion of the generation order directly targets a known failure mode in prior annotation pipelines.
major comments (2)
- [§4.2, Table 2] The reported outperformance over baselines lacks any statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals on the pass-rate and complexity deltas). Without these, the claim that ToolGrad models 'outperform' remains suggestive rather than conclusive, especially given the modest dataset size of 500 examples.
- [§3.2] The iterative textual-gradient procedure is described at a high level without an explicit stopping criterion or failure-mode analysis. It is therefore unclear whether the near-100% pass rate is an intrinsic property of the method or an artifact of the particular tool set and prompt templates used in ToolGrad-500.
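The bootstrap confidence intervals suggested in the first major comment can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `bootstrap_delta_ci`, the default of 1,000 resamples, the 95% level, and the example outcome lists are all assumptions made for the sketch. It computes a percentile bootstrap interval for the paired difference in pass rate between two models scored on the same examples.

```python
import random


def bootstrap_delta_ci(passes_a, passes_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the paired pass-rate difference (a minus b).

    passes_a and passes_b are 0/1 outcomes for the same examples, in the same order.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(passes_a) != len(passes_b) or not passes_a:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(passes_a)
    # Per-example paired differences: +1, 0, or -1.
    diffs = [a - b for a, b in zip(passes_a, passes_b)]
    deltas = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(sum(sample) / n)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical pass/fail outcomes for two models; not data from the paper.
    model_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
    model_b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
    print(bootstrap_delta_ci(model_a, model_b))
```

If the resulting interval excludes zero, the difference is unlikely to be a resampling artifact at the chosen level; an interval that straddles zero would support the referee's reading that the result is only suggestive.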
minor comments (2)
- [Abstract] The abstract states 'almost 100% pass rate'; the main text should report the precise figure, the exact definition of a 'pass,' and any edge cases that were filtered.
- [Figure 1 caption, §3.1] The diagram of the textual-gradient loop would benefit from an accompanying pseudocode listing to make the update rule reproducible from the text alone.
Simulated Author's Rebuttal
We thank the referee for the positive summary, the recognition of our inversion of the synthesis pipeline, and the recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [§4.2, Table 2] The reported outperformance over baselines lacks any statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals on the pass-rate and complexity deltas). Without these, the claim that ToolGrad models 'outperform' remains suggestive rather than conclusive, especially given the modest dataset size of 500 examples.
  Authors: We agree that statistical testing would make the outperformance claims more conclusive. In the revised manuscript we will add bootstrap confidence intervals (1,000 resamples) around the pass-rate and complexity deltas reported in Table 2. We will also note the number of evaluation runs used to compute these intervals, directly addressing the modest dataset size concern.
  Revision: yes
- Referee: [§3.2] The iterative textual-gradient procedure is described at a high level without an explicit stopping criterion or failure-mode analysis. It is therefore unclear whether the near-100% pass rate is an intrinsic property of the method or an artifact of the particular tool set and prompt templates used in ToolGrad-500.
  Authors: We will expand §3.2 to include the explicit stopping criterion (termination when the textual gradient reports no further validity improvement or after a hard maximum of five iterations) and a short failure-mode analysis based on our generation logs. While the near-100% pass rate was obtained with the current tool set, the answer-first construction with gradient guidance is intended to enforce validity by design rather than relying on post-hoc repair; we will clarify this distinction in the revision.
  Revision: partial
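The stopping criterion described in the second response can be sketched as a small loop. This is a hedged illustration only: the `Chain` and `Feedback` types, the `propose` and `critique` callables, and the demo stubs are hypothetical stand-ins for the LLM-driven components, and none of them are taken from the ToolGrad code. Only the two exit conditions (no further validity improvement, or a hard maximum of five iterations) come from the authors' statement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_ITERATIONS = 5  # hard cap quoted in the authors' response


@dataclass
class Chain:
    """Hypothetical container for an ordered list of tool calls."""
    calls: List[str] = field(default_factory=list)


@dataclass
class Feedback:
    """Hypothetical textual 'gradient': a verdict plus a free-text note."""
    improved: bool
    note: str


def build_chain(
    propose: Callable[[Chain], Chain],
    critique: Callable[[Chain, Chain], Feedback],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[Chain, int, str]:
    """Grow a chain until the critique reports no improvement or the cap is hit."""
    chain = Chain()
    for iteration in range(1, max_iterations + 1):
        candidate = propose(chain)             # assumed: extends the current chain
        feedback = critique(chain, candidate)  # assumed: judges the proposed update
        if not feedback.improved:
            # Keep the last accepted chain and stop early.
            return chain, iteration, "no_improvement"
        chain = candidate
    return chain, max_iterations, "max_iterations"


def demo_propose(chain: Chain) -> Chain:
    """Stub proposer: appends one placeholder tool call per step."""
    return Chain(chain.calls + [f"tool_{len(chain.calls) + 1}"])


def demo_critique(current: Chain, candidate: Chain) -> Feedback:
    """Stub critic: accepts updates until the chain holds three calls."""
    improved = len(candidate.calls) <= 3
    return Feedback(improved, "accepted" if improved else "no further validity gain")


if __name__ == "__main__":
    print(build_chain(demo_propose, demo_critique))
```

Returning the exit reason alongside the chain makes it straightforward to count how often generation ends at the cap rather than by convergence, which is the kind of evidence a failure-mode analysis would need.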
Circularity Check
No significant circularity in ToolGrad derivation chain
The paper presents ToolGrad as a new agentic framework that inverts prior data-generation pipelines by first constructing valid tool-use chains via iterative textual gradients and then synthesizing user queries. This methodological choice is described directly in the abstract without reduction to self-citations, fitted parameters renamed as predictions, or ansatzes imported from prior author work. The central claims rest on experimental outperformance and the public release of code, dataset, and models, which are externally verifiable and independent of the derivation itself. No load-bearing step equates outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Theorem: IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "ToolGrad first constructs valid tool-use chains through an iterative process guided by textual 'gradients', and then synthesizes corresponding user queries."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.