Beyond the final answer: Evaluating the reasoning trajectories of tool-augmented agents

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park · 2025 · cs.AI · arXiv 2510.02837

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with a ground-truth trajectory, but annotating all valid ground-truth trajectories is prohibitively expensive. To this end, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.
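
The abstract describes an evidence bank that accumulates knowledge from preceding steps so that a trajectory can be judged without a reference trajectory. The sketch below is only a toy illustration of that general idea, not TRACE's actual scoring: the names `Step`, `EvidenceBank`, and `evaluate`, the substring-based grounding check, and both rate definitions are assumptions made for this example (the paper uses LLM-based judgments, which are not reproduced here).

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One tool call in a trajectory (hypothetical structure)."""
    tool: str
    args: dict
    observation: str


@dataclass
class EvidenceBank:
    """Accumulates text the agent has legitimately seen so far."""
    facts: list = field(default_factory=list)

    def supports(self, value: str) -> bool:
        # Assumption: a value counts as grounded if it appears verbatim
        # in the query or in any earlier observation.
        return any(value in fact for fact in self.facts)

    def add(self, fact: str) -> None:
        self.facts.append(fact)


def evaluate(query: str, steps: list) -> dict:
    """Score a trajectory step by step without a reference trajectory."""
    bank = EvidenceBank([query])
    seen_calls = set()
    ungrounded = 0  # steps using an argument not found in the bank
    redundant = 0   # steps repeating an identical earlier call
    for step in steps:
        if any(not bank.supports(str(v)) for v in step.args.values()):
            ungrounded += 1
        call = (step.tool, tuple(sorted((k, str(v)) for k, v in step.args.items())))
        if call in seen_calls:
            redundant += 1
        seen_calls.add(call)
        # The observation only becomes evidence for later steps.
        bank.add(step.observation)
    n = len(steps) or 1
    return {"hallucination_rate": ungrounded / n, "redundancy_rate": redundant / n}


trajectory = [
    Step("geocode", {"city": "Paris"}, "Paris: lat 48.85, lon 2.35"),
    Step("weather", {"lat": "48.85", "lon": "2.35"}, "18C, cloudy"),
    Step("weather", {"lat": "48.85", "lon": "2.35"}, "18C, cloudy"),  # repeated call
    Step("weather", {"lat": "40.71", "lon": "-74.00"}, "22C, clear"),  # unseen coordinates
]
result = evaluate("What is the weather in Paris?", trajectory)
print(result)  # {'hallucination_rate': 0.25, 'redundancy_rate': 0.25}
```

In this toy run, the third step is flagged as redundant because it repeats an earlier call, and the fourth is flagged as ungrounded because its coordinates appear in neither the query nor any earlier observation. Adaptivity, the third dimension named in the abstract, is not modeled here.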

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Counterfactual Trace Auditing of LLM Agent Skills

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.

citing papers explorer

Counterfactual Trace Auditing of LLM Agent Skills
cs.AI · 2026-05-12 · unverdicted · none · ref 16 · internal anchor

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
cs.AI · 2026-05-18 · unverdicted · none · ref 7 · internal anchor
