pith. machine review for the scientific record.

arxiv: 2602.20426 · v2 · submitted 2026-02-23 · 💻 cs.AI

Recognition: no theorem link

Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 19:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords tool description rewriting · LLM agents · curriculum learning · tool use · scalability · StableToolBench · trace-free deployment · API interfaces

The pith

Trace-Free+ teaches models to rewrite ambiguous tool descriptions so LLM agents stay reliable as catalogs grow past 150 candidates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool descriptions written for humans often contain ambiguities that cause LLM agents to fail when many tools compete for selection. Trace-Free+ addresses this with a curriculum that starts from detailed execution traces and gradually shifts to rewriting descriptions without any traces. The method builds a large dataset of rewritten interfaces from real APIs through a synthesis workflow that avoids per-tool pipelines. In scaling tests on StableToolBench it cuts accuracy degradation by 29.23 percent and raises average query-level success by 60.89 percent while generalizing across domains. The gains add to those from agent fine-tuning and require no retraining on new tool sets.
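
To make the curriculum concrete, here is a minimal sketch of how supervision could shift from trace-rich to trace-free over training. The linear schedule, phase boundary, and field names are illustrative assumptions, not the paper's reported configuration.

```python
import random

def trace_keep_prob(step, total_steps, warmup_frac=0.3):
    """Probability of including execution traces in the supervision input.

    Illustrative schedule only: fully trace-rich for an initial phase,
    then decaying linearly to zero so late training matches the
    trace-free deployment condition.
    """
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return 1.0
    return max(0.0, 1.0 - (step - warmup) / max(1, total_steps - warmup))

def build_training_example(tool_schema, traces, improved_description,
                           step, total_steps):
    """Assemble one supervised example for the description rewriter.

    The field names here are assumptions about the synthesized dataset;
    only the presence of traces is governed by the curriculum schedule.
    """
    parts = [f"Tool schema:\n{tool_schema}"]
    if traces and random.random() < trace_keep_prob(step, total_steps):
        parts.append(f"Execution traces:\n{traces}")
    return {"input": "\n\n".join(parts), "target": improved_description}
```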

Core claim

Trace-Free+ is a curriculum learning framework that progressively moves supervision from trace-rich training to trace-free deployment, allowing a model to internalize reusable patterns of what makes a tool description effective for agents. Supported by a large-scale dataset synthesized from real-world APIs, the approach eliminates the need to rerun multi-stage pipelines for every new tool and avoids optimizing tools in isolation.

What carries the argument

Trace-Free+, a curriculum learning framework that transfers supervision from trace-rich settings to trace-free deployment for rewriting tool descriptions.

If this is right

  • Tool catalogs can expand to 150-plus entries with only modest accuracy loss instead of sharp drops.
  • The rewritten descriptions generalize across domains without any additional training.
  • Performance improves on top of existing agent fine-tuning rather than replacing it.
  • Elimination of per-tool multi-stage pipelines makes large-scale catalog maintenance feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • API providers could adopt automatic description rewriting as a standard preprocessing step before exposing tools to agents.
  • The same curriculum pattern might apply to rewriting prompts for other agent behaviors such as planning or memory management.
  • Over time, agents could iteratively rewrite their own tool interfaces based on observed failures, creating self-improving catalogs.

Load-bearing premise

Patterns learned from synthesized trace-rich data transfer to unseen real-world APIs without overfitting or performance loss.

What would settle it

Run the method on a fresh benchmark containing 200 tools drawn from a domain absent from the training synthesis; if query-level success does not rise by at least 30 percent relative to the baseline while catalog size increases, the central claim fails.
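
A minimal rendering of that decision rule as a relative-improvement check; the threshold follows the statement above, and the example numbers are placeholders rather than reported results.

```python
def settles_claim(baseline_success, method_success, min_relative_gain=0.30):
    """True if rewritten descriptions clear the proposed bar on a fresh
    200-tool, out-of-domain catalog (relative gain in query-level success)."""
    relative_gain = (method_success - baseline_success) / baseline_success
    return relative_gain >= min_relative_gain

# Placeholder numbers, not results from the paper:
print(settles_claim(0.40, 0.55))  # 37.5% relative gain -> True
print(settles_claim(0.40, 0.48))  # 20.0% relative gain -> False
```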

Figures

Figures reproduced from arXiv: 2602.20426 by Kaiwen Dong, Kamalika Das, Ruocheng Guo, Xiang Gao.

Figure 1: An illustration of the proposed tool interface improvement pipeline.

Figure 2: The data synthesis pipeline (collect working tool interfaces, synthesize multi-step queries that expose interface deficiencies, generate improved descriptions).

Figure 3: Scaling experiment results on the more challenging G2-G3 subsets of StableToolBench.
Original abstract

While most efforts to improve LLM-based tool-using agents focus on the agent itself - through larger models, better prompting, or fine-tuning - agent performance increasingly plateaus due to the quality of the tool interfaces these agents consume. Tool descriptions are often written for human developers and tolerate ambiguity that agents cannot resolve, particularly as the number of candidate tools grows. Existing approaches to improving tool interfaces (1) require re-running a multi-stage per-tool pipeline - synthesizing queries, executing an agent to collect trajectories, annotating trajectories, and prompting a strong LLM multiple times - for every API that enters the catalog, and (2) typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to internalize reusable patterns of what makes a tool description effective. To support this approach, we construct a large-scale dataset of high-quality tool interfaces derived from real-world APIs through a principled data synthesis workflow. Experiments on widely adopted benchmarks show that Trace-Free+ improves robustness as tool catalogs scale to 150+ candidates - in scaling experiments, reducing accuracy degradation by 29.23% and improving average query-level success by 60.89% on StableToolBench - generalizes across domains without retraining, and provides complementary gains on top of agent fine-tuning.
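
To make the deployment contrast concrete: under the abstract's description, a trained rewriter sees only a new tool's schema and original description at catalog-ingestion time, with no per-tool query synthesis, agent execution, or trajectory annotation. The sketch below assumes a generic text-to-text `rewriter.generate` interface and illustrative field names; it is not the paper's code.

```python
import json

def rewrite_description_trace_free(rewriter, tool_schema: dict) -> dict:
    """Rewrite one tool's description with no execution traces.

    `rewriter` is any text-to-text model exposing `generate(prompt) -> str`
    (an assumed interface); nothing else is rerun when a new API enters
    the catalog.
    """
    prompt = (
        "Rewrite this tool description so an LLM agent can select and call "
        "the tool reliably. Keep the tool name and schema structure unchanged.\n\n"
        + json.dumps(tool_schema, indent=2)
    )
    improved = rewriter.generate(prompt)
    return {**tool_schema, "description": improved}

# Usage: each tool entering the catalog is rewritten once, trace-free.
# catalog = [rewrite_description_trace_free(rewriter, t) for t in raw_catalog]
```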

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment for rewriting tool descriptions, enabling more reliable LLM-agent tool use. It constructs a large-scale dataset of high-quality tool interfaces from real-world APIs via a principled synthesis workflow, and reports that this approach reduces accuracy degradation by 29.23% and improves average query-level success by 60.89% on StableToolBench as catalogs scale to 150+ candidates, generalizes across domains without retraining, and yields complementary gains atop agent fine-tuning.

Significance. If the central claims hold, the work addresses a key scalability bottleneck in tool-using agents by shifting focus from per-tool pipelines to reusable description patterns learned via curriculum transfer. The reported robustness gains on large catalogs and no-retraining generalization would represent a practical advance over existing multi-stage per-API methods, with potential to improve agent performance plateaus caused by ambiguous human-written interfaces.

major comments (2)
  1. [Experiments] Experiments section (scaling results on StableToolBench): the reported 29.23% reduction in accuracy degradation and 60.89% improvement in query-level success lack accompanying error bars, exact baseline definitions, or data exclusion criteria, making it difficult to assess whether the gains are statistically robust or sensitive to particular splits.
  2. [Method] Curriculum transfer description (trace-rich to trace-free phase): the claim that patterns internalize as reusable effectiveness rules and generalize to unseen tools without retraining is load-bearing for the no-retraining contribution, yet no ablations are described that remove specific synthesis heuristics or evaluate on APIs collected after the dataset construction cutoff; this leaves open the possibility that gains reflect memorization of dataset phrasing rather than transferable patterns.
minor comments (2)
  1. [Abstract] Abstract: the numerical improvements are stated without reference to the precise baseline configurations or number of runs, which should be clarified for reproducibility.
  2. [Data] Dataset construction paragraph: the 'principled data synthesis workflow' is referenced but lacks a high-level diagram or pseudocode summarizing the stages (query synthesis, trajectory collection, annotation), which would aid reader understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [Experiments] Experiments section (scaling results on StableToolBench): the reported 29.23% reduction in accuracy degradation and 60.89% improvement in query-level success lack accompanying error bars, exact baseline definitions, or data exclusion criteria, making it difficult to assess whether the gains are statistically robust or sensitive to particular splits.

    Authors: We agree that the absence of error bars, precise baseline definitions, and data exclusion criteria limits the ability to fully assess statistical robustness. In the revised manuscript, we will include error bars computed over multiple random seeds for all scaling results on StableToolBench, provide explicit definitions of all baselines (including original human-written descriptions and any intermediate variants), and detail the data inclusion/exclusion criteria used in the experiments (a sketch of the seed-level aggregation follows these responses). revision: yes

  2. Referee: [Method] Curriculum transfer description (trace-rich to trace-free phase): the claim that patterns internalize as reusable effectiveness rules and generalize to unseen tools without retraining is load-bearing for the no-retraining contribution, yet no ablations are described that remove specific synthesis heuristics or evaluate on APIs collected after the dataset construction cutoff; this leaves open the possibility that gains reflect memorization of dataset phrasing rather than transferable patterns.

    Authors: We acknowledge that dedicated ablations isolating individual synthesis heuristics would provide stronger evidence against memorization. Our cross-domain generalization results—where the model is applied to entirely new tool catalogs from different domains without any retraining—offer supporting evidence that the internalized patterns are reusable rather than dataset-specific. We did not evaluate on post-cutoff APIs, as the dataset was constructed from the benchmarks available at the time of the study. In the revision, we will expand the discussion to address memorization concerns explicitly and clarify the dataset construction timeline and its relation to the evaluated domains. revision: partial
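
The seed-level error bars promised in response 1 amount to aggregating the scaling metric over repeated runs. A minimal sketch of that aggregation, with placeholder values rather than reported numbers:

```python
from statistics import mean, stdev

def summarize_runs(success_rates):
    """Mean and sample standard deviation of query-level success across seeds."""
    return mean(success_rates), stdev(success_rates)

# Placeholder values for one catalog size, one value per random seed:
m, s = summarize_runs([0.52, 0.49, 0.55, 0.50, 0.53])
print(f"query-level success: {m:.3f} ± {s:.3f}")
```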

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents Trace-Free+ as a curriculum learning framework that transfers patterns from trace-rich training to trace-free inference, backed by a dataset synthesized from real-world APIs via a principled workflow. All reported results (e.g., 29.23% reduced accuracy degradation and 60.89% higher success on StableToolBench) are empirical measurements against external benchmarks rather than quantities derived from internal equations or fitted parameters. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text; the central claims rest on standard ML generalization testing without reduction to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that effective tool descriptions follow learnable, reusable patterns that can be extracted from trajectories and applied without traces; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Tool description quality is the primary bottleneck for agent performance on large catalogs and can be improved via learned rewriting patterns
    Invoked to justify shifting focus from agent improvements to interface rewriting.

pith-pipeline@v0.9.0 · 5547 in / 1230 out tokens · 40641 ms · 2026-05-15T19:59:06.141457+00:00 · methodology

