pith. machine review for the scientific record.

arxiv: 2510.02837 · v2 · submitted 2025-10-03 · 💻 cs.AI · cs.CL

Recognition: unknown

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Authors on Pith: no claims yet
classification 💻 cs.AI cs.CL
keywords trajectories · tool-augmented · evaluation · trace · trajectory · agent · agents · answer
read the original abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground truth, but annotating all valid ground-truth trajectories is prohibitively expensive. To this end, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank that accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.
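The abstract's core mechanism, scoring each trajectory step against an evidence bank of knowledge accumulated from preceding steps rather than against a reference trajectory, can be sketched roughly as below. All names (`Step`, `EvidenceBank`, `evaluate_trajectory`, the `judge` callable) are illustrative assumptions, not the paper's actual implementation; in TRACE the judge would be an LLM scoring multiple dimensions, stubbed here as a plain function.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step of an agent trajectory: a tool call and its observation."""
    tool: str
    args: dict
    observation: str

@dataclass
class EvidenceBank:
    """Accumulates knowledge from preceding steps (hypothetical structure)."""
    facts: list = field(default_factory=list)

    def update(self, step: Step) -> None:
        # Record what this step established, so later steps are judged
        # against everything the agent already knew.
        self.facts.append(f"{step.tool}({step.args}) -> {step.observation}")

def evaluate_trajectory(trajectory: list[Step], judge) -> float:
    """Reference-free evaluation: score each step given only the evidence
    bank built from prior steps. `judge(step, facts)` returns a score in
    [0, 1]; in TRACE this role is played by an LLM scoring dimensions
    such as efficiency, hallucination, and adaptivity."""
    bank = EvidenceBank()
    scores = []
    for step in trajectory:
        scores.append(judge(step, bank.facts))  # no ground-truth trajectory needed
        bank.update(step)
    return sum(scores) / len(scores)
```

For example, with a trivial judge that flags a step as hallucinated when its observation contradicts nothing in the bank, `evaluate_trajectory` returns the mean per-step score; the key design point is that each step is scored only against prior evidence, avoiding the prohibitive cost of annotating all valid ground-truth trajectories.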

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Counterfactual Trace Auditing of LLM Agent Skills

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    The CTA framework detects 522 skill-influence patterns in LLM agent traces across 49 tasks where the average pass rate shifts by only +0.3%, exposing evaluation gaps for behavioral effects such as template copying and excess planning.