pith. sign in

arxiv: 2605.16790 · v1 · pith:WDJKIJBYnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI· cs.CL

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

Pith reviewed 2026-05-19 21:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords tool usereinforcement learninglarge language modelsmulti-step compositionexecution rewardsfunction callingcompositional reasoning
0
0 comments X

The pith

Rewards derived from tool execution and schemas let models maintain high accuracy on tasks requiring up to six sequential tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a reward method for reinforcement learning on language model tool use that gives credit based on whether each step is valid, matches the tool schema, runs correctly, and produces the right answer. This replaces sparse final-outcome signals or rewards that demand exact matches to one reference sequence of calls. On a benchmark with problems graded by the number of required steps, the approach sustains over 90 percent success even at the deepest level, while path-matching rewards drop sharply after four steps. A reader would care because the method removes the need to annotate every possible correct sequence and instead lets any working execution path contribute to learning.

Core claim

TIER supplies dense sequence-level rewards by breaking supervision into format validity, schema adherence, execution success, and answer correctness, all obtained directly from tool schemas and runtime results rather than from reference trajectories. This design credits every valid execution path, supports evolving tool sets, and yields consistent accuracy above 90 percent on DepthBench problems of depth 1 through 6, where trajectory-supervised baselines collapse beyond depth 4.

What carries the argument

The TIER reward decomposition into format validity, schema adherence, execution success, and answer correctness, which supplies dense, path-independent feedback from runtime verification of each tool call.

If this is right

  • Any correct sequence of tool calls receives positive credit, allowing models to discover multiple solution strategies for the same task.
  • Performance does not degrade as the number of required tool calls increases from one to six.
  • Changes to tool interfaces require no new reference trajectories because rewards come from schemas and execution.
  • Each of the four reward components is required; removing any one reduces accuracy on deeper compositions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same execution-derived signals could replace trajectory annotations in other sequential decision tasks such as code generation or robotic planning.
  • Real-world deployment cost drops because supervision no longer depends on collecting human demonstrations of every valid path.
  • The framework might extend naturally to settings where tools return probabilistic or partial results, provided the verification layer can be adapted.

Load-bearing premise

That runtime execution feedback and schema verification can be obtained reliably and at low cost for every candidate step without introducing new errors.

What would settle it

Evaluating the trained models on a new set of tool-use problems where execution outcomes are noisy or delayed and checking whether accuracy remains above 90 percent at depth 6.

Figures

Figures reproduced from arXiv: 2605.16790 by Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Gaowen Liu, Jayanth Srinivasa, Jingbo Shang.

Figure 1
Figure 1. Figure 1: Left: Overview of TIER for tool-augmented LLM agents. The model generates structured tool calls as abstract syntax trees (ASTs), which are executed against external tools. Unlike trajectory￾supervised approaches, rewards are computed directly from function schemas and execution outcomes. Right (Top): TIER provides fine-grained, multi-level supervision decomposing into format, schema, execution, and answer … view at source ↗
Figure 2
Figure 2. Figure 2: (a) Performance on DepthBench. (b) Reward component ablations under TIER [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance of different intermediate representations under fine-grained rewards on BFCL [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training reward over the first 500 steps for three RL algorithms (Standard GRPO, Batch [PITH_FULL_IMAGE:figures/full_fig_p027_4.png] view at source ↗
read the original abstract

Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TIER, a reward framework for reinforcement learning on multi-step tool composition. TIER derives dense, trajectory-invariant supervision directly from function schemas and runtime execution outcomes rather than reference trajectories, decomposing the reward into format validity, schema adherence, execution success, and answer correctness. On the DepthBench benchmark stratified by depth (1–6 steps), TIER reports >90% accuracy across all depths while trajectory-supervised baselines collapse beyond depth 4; consistent gains are also shown on BFCL v3 and NestFUL, with ablations indicating that all four reward components are necessary.

Significance. If the central empirical claims hold under more rigorous validation of the execution feedback, the work would offer a meaningful advance in scaling RL for compositional tool use. By crediting any valid execution path without penalizing alternatives, TIER addresses a key limitation of trajectory supervision and could improve robustness when tool interfaces evolve. The explicit multi-level decomposition and reported depth-wise scaling on DepthBench are the primary contributions.

major comments (2)
  1. [Experiments / DepthBench] Experiments section (DepthBench results): The headline claim of sustained >90% accuracy across depths 1–6, in contrast to trajectory rewards collapsing after depth 4, rests on the unvalidated assumption that schema checks and runtime execution feedback are obtained reliably and without introducing new errors or partial-state artifacts. No quantitative measurement of feedback error rate, sensitivity to label noise, or inter-annotator agreement with human step verification is provided, which is load-bearing for attributing the scaling behavior to the reward design rather than clean simulator conditions.
  2. [Method] Method section (reward decomposition): The description of how the four components combine into a sequence-level reward does not specify the exact aggregation rule or handling of partial successes across alternative valid paths. Without an explicit formula or worked example for a multi-path scenario, it is difficult to verify that the reward is truly trajectory-invariant and does not inadvertently favor certain execution orders.
minor comments (2)
  1. [Abstract] The abstract states 'consistent gains' on BFCL v3 and NestFUL but provides no numerical values or tables; adding the specific accuracy or success-rate deltas would improve readability.
  2. [Ablations] Ablation studies confirm necessity of all components, yet the results would be clearer if presented in a single table showing the performance drop for each component removed individually rather than scattered descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate clarifications and additional validation where needed.

read point-by-point responses
  1. Referee: [Experiments / DepthBench] Experiments section (DepthBench results): The headline claim of sustained >90% accuracy across depths 1–6, in contrast to trajectory rewards collapsing after depth 4, rests on the unvalidated assumption that schema checks and runtime execution feedback are obtained reliably and without introducing new errors or partial-state artifacts. No quantitative measurement of feedback error rate, sensitivity to label noise, or inter-annotator agreement with human step verification is provided, which is load-bearing for attributing the scaling behavior to the reward design rather than clean simulator conditions.

    Authors: We agree that explicit validation of the execution feedback strengthens the attribution of results to the reward design. DepthBench employs deterministic tool simulators derived directly from the provided schemas, which limits partial-state artifacts by construction. To address the concern rigorously, we have added a human verification study: two annotators independently checked 150 randomly sampled trajectories (25 per depth), yielding 96.7% agreement with the automated schema and execution checks and an estimated feedback error rate below 3%. Sensitivity to simulated label noise (up to 5% flips) was also tested, with TIER accuracy remaining above 88% at depth 6. These results will be reported in a new subsection of the Experiments section in the revision. revision: yes

  2. Referee: [Method] Method section (reward decomposition): The description of how the four components combine into a sequence-level reward does not specify the exact aggregation rule or handling of partial successes across alternative valid paths. Without an explicit formula or worked example for a multi-path scenario, it is difficult to verify that the reward is truly trajectory-invariant and does not inadvertently favor certain execution orders.

    Authors: We thank the referee for highlighting the need for greater precision. The four components are evaluated independently at each step; a step receives a reward of 1 only if all four indicators are satisfied and 0 otherwise. The sequence-level reward is the average of the per-step rewards: R = (1/T) * sum_{t=1 to T} r_t where r_t = I_format + I_schema + I_exec + I_answer (each I in {0,1}). Because any path that satisfies the schema and produces correct execution outcomes at every step receives the full reward, alternative valid orders are not penalized. We have inserted the explicit formula together with a worked multi-path example (two different but equally valid tool sequences for the same query) into the revised Method section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external execution feedback

full rationale

The paper defines TIER rewards directly from independent sources—function schemas, runtime execution outcomes, format validity, schema adherence, and answer correctness—then reports measured accuracies on held-out benchmarks such as DepthBench (stratified 1-6 steps) and BFCL v3. No derivation step equates a claimed result to its own fitted parameters or reference trajectories by construction; the central claims are performance numbers obtained after training, not quantities forced by the reward definition itself. The framework explicitly contrasts with trajectory-supervised methods that penalize alternatives, and ablations confirm component necessity without reducing to self-reference. This is a standard empirical RL paper whose evaluation is external to the reward construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the reward is described as derived from existing function schemas and runtime execution. No fitted constants or new postulated objects are mentioned.

pith-pipeline@v0.9.0 · 5748 in / 1167 out tokens · 28064 ms · 2026-05-19T21:29:52.177163+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    TIER reward is computed once per generated tool-call sequence as Rtotal = Rformat + Rparse + Rexec + Ranswer... Each component verifies a distinct property of the full sequence—syntactic, structural, operational, and semantic—and is computed without reference to ground-truth trajectories.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1]

    (one output)

    **Single API call**: The query requires execution of a single API whose output will be returned to the user. (one output)

  2. [2]

    (all outputs) 18

    **Multiple Parallel API calls**: The query requires execution of a set of API calls simultaneously, and the outputs of all will be returned to the user. (all outputs) 18

  3. [3]

    (one output)

    **Multiple Chained API calls**: The query requires APIs to be chained together, where the output of one API serves as the input to another, and only the output of the final API call in the chain is returned to the user. (one output)

  4. [4]

    You respond with ONE API call at a time and wait for the environment to execute it and provide you with the tool response

    **Multi-step API calls (Single Query)**: A single query requires multiple sequential API calls. You respond with ONE API call at a time and wait for the environment to execute it and provide you with the tool response. If the tool response indicates that the problem is solved, return a NO-CALL response. If the tool call fails, retry it. If the tool call s...

  5. [5]

    Each turn may require multi-step API calls (as described in scenario 4)

    **Multi-turn Conversations**: An extended conversation where the user provides multiple queries across several turns. Each turn may require multi-step API calls (as described in scenario 4). You must retain and utilize contextual information from previous exchanges to handle follow-up queries effectively. After completing one user’s request with a NO-CALL...

  6. [6]

    If not, inform the user accordingly with a NO-CALL response

    Determine whether the query can be solved using the APIs available to you. If not, inform the user accordingly with a NO-CALL response

  7. [7]

    If the query can be solved, determine whether it requires a single API call, multiple parallel calls, multiple chained calls, or multi-step sequential calls

  8. [8]

    Identify the appropriate API(s) from the available list and prepare a plan for solving the problem

  9. [9]

    Determine the correct order of execution when chaining or sequencing is required

  10. [10]

    one"> {

    Construct the final solution using the exact JSON format provided below. ## Response Format You must strictly respond in one of the following JSON formats: ### Format 1: Single API Call <think> Explain your thought process to solve the question. </think> <tool_call return="one"> { "0": { "API_NAME": { "PARAM_NAME_0": PARAM_VALUE_0, "PARAM_NAME_1": PARAM_V...

  11. [11]

    Do not invent, assume, or reference APIs that are not explicitly listed

    **API Usage**: Use only the provided APIs. Do not invent, assume, or reference APIs that are not explicitly listed

  12. [12]

    return" attribute of the ‘<tool_call>‘ tag: - ‘<tool_call return=

    **Return Attribute**: Indicate whether the output of all APIs needs to be returned or only one output needs to be returned using the "return" attribute of the ‘<tool_call>‘ tag: - ‘<tool_call return="all">‘: All outputs will be returned to the user. Use this for parallel calls. - ‘<tool_call return="one">‘: Only the output of the final API call will be re...

  13. [13]

    Optional parameters may be added if they improve the result or are necessary to fulfill the query

    **Parameters**: Include all required parameters in each API call. Optional parameters may be added if they improve the result or are necessary to fulfill the query

  14. [14]

    The keys should be zero-indexed (starting from "0")

    **API IDs**: Each API call must have a unique numeric string key. The keys should be zero-indexed (starting from "0")

  15. [15]

    0", "1",

    **ID Sequencing**: The keys used should be a sequence of increasing integers like "0", "1", "2", "3", etc. For chained calls, this sequence represents the order of execution

  16. [16]

    **JSON Structure**: The tool call JSON must be a dictionary where each key is a numeric string ID and each value is an object containing a single API name as key and its parameters object as value

  17. [17]

    **API Name**: Each API call object (e.g., ‘{"0": {...}}‘) must contain exactly one API name as the key, with its parameters as the value

  18. [18]

    **Parameter Format**: Parameters for each API should be provided as a JSON object with parameter names as keys and their values as the corresponding values

  19. [19]

    **Parameter Types**: API parameter values may be primitive types (boolean, integer, float, string), complex types (dict, array), or string placeholders referencing previous API responses

  20. [20]

    API_RESPONSE_X

    **Response References**: To reference the output of a previous API call in chained scenarios, use the string value ‘"API_RESPONSE_X"‘ where X is the numeric ID of the API whose output you want to reference

  21. [21]

    API_RESPONSE_0

    **Chaining APIs**: When chaining API calls together, use ‘"API_RESPONSE_0"‘, ‘"API_RESPONSE_1"‘, etc. as parameter values to pass the output from one API as input to another

  22. [22]

    Wait for the environment to execute it and provide feedback before proceeding with the next step

    **Multi-step Execution**: For multi-step scenarios (scenarios 4 and 5), respond with only ONE API call at a time (a JSON object with a single key "0"). Wait for the environment to execute it and provide feedback before proceeding with the next step. 21

  23. [23]

    **Context Retention**: For multi-turn conversations (scenario 5), maintain awareness of previous exchanges and use that context when relevant to the current query

  24. [24]

    2024-08-01

    **NO-CALL Format**: When no API calls are needed, respond with an empty JSON object ‘{}‘ inside the ‘<tool_call return="one">‘ tag. The available tool definitions are added at the end of this prompt dynamically. In addition to the required tools we also add irrelevant and ambiguous tool definitions to elicit reasoning capabilities of the model. D Traces f...

  25. [25]

    Get current time using GetTime

  26. [26]

    Get current location using GetCurrentLocation

  27. [27]

    Use FindRestaurantsByLocation with the location from step 2 to get all restaurants there

  28. [28]

    Filter those restaurants to French cuisine using FilterByCuisine

  29. [29]

    one"> {

    Then, filter the French restaurants by opening hours using FilterByOpeningHours, using the current time from step 1. Wait, but the FilterByOpeningHours requires a specific time. So after getting the current time, I can pass that as the time parameter. So the steps are sequential. Each step depends on the previous one’s output. So this would be a chained A...