pith. machine review for the scientific record.

arxiv: 2604.10015 · v2 · submitted 2026-04-11 · 💻 cs.AI · cs.CE · cs.CL · cs.MM

Recognition: unknown

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:50 UTC · model grok-4.3

classification 💻 cs.AI · cs.CE · cs.CL · cs.MM
keywords LLM tool calling · financial benchmarks · trajectory evaluation · information utilization · preference optimization · long-horizon tasks

The pith

LLMs select the right financial tools but fail to reason effectively over their outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FinTrace, a benchmark of 800 expert-annotated trajectories covering 34 real-world financial task categories. It applies a nine-metric rubric across four axes to assess entire sequences of tool calls rather than isolated actions. Evaluation of 13 models shows frontier systems choose tools accurately yet all models struggle to integrate returned information into sound reasoning or high-quality final answers. The authors also release a trajectory preference dataset and demonstrate that supervised fine-tuning plus direct preference optimization raises intermediate process scores while end-to-end answer quality stays limited.

Core claim

The paper establishes that LLMs display a pronounced capability split on long-horizon financial tasks: they reliably invoke appropriate tools and execute them correctly, yet they consistently fail to utilize the information returned by those tools for coherent intermediate reasoning or accurate final outputs. This gap appears across all tested models when evaluated at the full trajectory level rather than per call.

What carries the argument

The FinTrace benchmark of 800 trajectories, evaluated by a nine-metric rubric organized into four axes: action correctness, execution efficiency, process quality, and output quality.
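
As a concrete illustration of the scoring structure this rubric implies, here is a minimal Python sketch. The four axis names are the paper's; the nine individual metric names and their grouping below are hypothetical placeholders, since only the axes are reproduced on this page (the 1–5 per-metric scale appears in the paper's Figure 3).

```python
from dataclasses import dataclass, field

# Hypothetical mapping of nine metrics onto the paper's four axes. The axis
# names are the paper's; the metric names are illustrative placeholders.
RUBRIC_AXES = {
    "action_correctness":   ["tool_selection", "argument_validity"],
    "execution_efficiency": ["step_economy", "redundant_calls"],
    "process_quality":      ["information_utilization", "reasoning_coherence"],
    "output_quality":       ["answer_accuracy", "completeness", "faithfulness"],
}

@dataclass
class TrajectoryScore:
    """Rubric scores for one full tool-calling trajectory (1-5 per metric)."""
    metrics: dict[str, int] = field(default_factory=dict)

    def axis_mean(self, axis: str) -> float:
        """Aggregate one axis by averaging its member metrics."""
        names = RUBRIC_AXES[axis]
        return sum(self.metrics[m] for m in names) / len(names)

# A trajectory showing the capability split the review describes: tools are
# chosen well, but their outputs are used poorly downstream.
score = TrajectoryScore(metrics={
    "tool_selection": 5, "argument_validity": 5,
    "step_economy": 4, "redundant_calls": 4,
    "information_utilization": 2, "reasoning_coherence": 2,
    "answer_accuracy": 2, "completeness": 3, "faithfulness": 2,
})
print({axis: round(score.axis_mean(axis), 2) for axis in RUBRIC_AXES})
```

Aggregating per axis is what makes the reported capability split legible: a trajectory can score high on action correctness while its process- and output-quality axes lag.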

If this is right

  • Frontier models achieve strong tool selection but all models struggle with information utilization and final answer quality.
  • Trajectory-level preference training improves intermediate reasoning metrics more effectively than supervised fine-tuning alone.
  • Direct preference optimization is particularly effective at suppressing specific process-level failure modes (a minimal sketch of the DPO objective follows this list).
  • Gains in process quality do not yet translate into proportional improvements in end-to-end answer quality.
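
For readers unfamiliar with the recipe these bullets refer to, below is a minimal sketch of the standard direct preference optimization objective applied to trajectory preference pairs. It assumes the usual DPO formulation with summed trajectory log-probabilities; the paper's hyperparameters and implementation are not shown on this page, so beta and the batch are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over trajectory preference pairs.

    Each tensor holds the summed log-probability a model assigns to a whole
    tool-calling trajectory: 'chosen' is the preferred trajectory of a pair,
    'rejected' the dispreferred one. beta (KL strength) is a placeholder
    value, not the paper's setting.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push preferred trajectories to gain probability mass, relative to the
    # frozen reference model, faster than dispreferred ones.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of 4 preference pairs with random log-probabilities.
lp = lambda: -torch.rand(4) * 10  # log-probs are negative
print(dpo_loss(lp(), lp(), lp(), lp()).item())
```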

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may benefit from separate post-tool synthesis modules rather than relying solely on end-to-end generation.
  • The trajectory evaluation approach could expose similar gaps in other domains that require chained tool use over many steps.
  • Further gains in final answer quality may require training objectives that directly reward output synthesis beyond current preference methods.

Load-bearing premise

The 800 expert-annotated trajectories and nine-metric rubric accurately and comprehensively capture the quality of LLM tool-calling trajectories in real-world financial tasks.

What would settle it

A study in which models achieve high final-answer accuracy on the same financial tasks while still scoring low on information utilization metrics, or in which independent financial experts rate the trajectories differently from the rubric.
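
A hedged sketch of how the first test could be run, assuming per-trajectory results were available in tabular form; the column names are invented for illustration, not the paper's released schema.

```python
import pandas as pd

# Hypothetical per-trajectory results; column names are invented for the
# sketch, not the paper's released schema.
df = pd.DataFrame({
    "final_answer_correct":    [1, 0, 1, 0, 1, 1, 0, 0],
    "information_utilization": [4, 2, 3, 1, 4, 3, 2, 2],  # 1-5 rubric score
})

# The benchmark's claim predicts these two columns move together. Many
# correct answers at utilization <= 2 would undercut the rubric's premise.
rho = df["final_answer_correct"].corr(df["information_utilization"],
                                      method="spearman")
decoupled = df[(df["final_answer_correct"] == 1)
               & (df["information_utilization"] <= 2)]
print(f"Spearman rho = {rho:.2f}; correct-but-low-utilization rows: {len(decoupled)}")
```

Many correct answers co-occurring with low utilization scores would indicate the process metrics are not load-bearing for answer quality; the paper's claim predicts the opposite pattern.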

Figures

Figures reproduced from arXiv: 2604.10015 by Anke Xu, Haohang Li, Jimin Huang, Jordan W. Suchow, K.P. Subbalakshmi, Lingfei Qian, Minxue Tang, Weijin Liu, Wenbo Cao, Xueqing Peng, Yangyang Yu, Yupeng Cao, Zhiyuan Yao, Zining Zhu.

Figure 1. FinTrace benchmark construction pipeline.
Figure 2. Overall performance of 13 LLMs on the FinTrace benchmark.
Figure 3. Distribution of LLM-judged metric scores (1–5) for Qwen-3.5-9B at each training…
Figure 4. Distribution of 15,095 source queries across 12 task buckets encompassing 30+…
Figure 5. Annotation Platform.
read the original abstract

Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes -- action correctness, execution efficiency, process quality, and output quality -- enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool-calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO more effectively suppressing failure modes. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces FinTrace, a benchmark of 800 expert-annotated trajectories spanning 34 real-world financial task categories, evaluated via a nine-metric rubric organized into four axes (action correctness, execution efficiency, process quality, output quality). Evaluation of 13 LLMs shows frontier models perform well on tool selection but struggle with information utilization and final-answer quality. The authors also release FinTrace-Training, a trajectory-level preference dataset of 8,196 examples, and demonstrate that SFT followed by DPO on Qwen-3.5-9B improves intermediate metrics while final answer quality remains a bottleneck.

Significance. If the annotation protocol and metric definitions are reliable, this work provides a useful trajectory-level evaluation framework that exposes a concrete gap in LLM financial tool use beyond call-level metrics. The accompanying preference dataset and training results offer a practical starting point for improving long-horizon financial agents.

minor comments (3)
  1. Abstract and §4: the statement that training 'consistently improves intermediate reasoning metrics' would be strengthened by reporting the magnitude of per-metric gains (e.g., absolute or relative deltas) rather than qualitative description alone.
  2. §3.1: the selection and balancing criteria for the 34 task categories and difficulty levels are not fully detailed; a short table or paragraph on category distribution would clarify coverage.
  3. §5: while per-axis score distributions are mentioned, a single consolidated table comparing all 13 models on the nine metrics would improve readability and support the central claim about the tool-selection vs. utilization gap.
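
On the third comment, a minimal sketch of how such a consolidated table could be assembled from per-trajectory scores, assuming a long-format results file; the model and metric names are illustrative placeholders, and the full version would be 13 rows by 9 columns.

```python
import pandas as pd

# Hypothetical long-format scores: one row per (model, metric) observation.
# Model and metric names are placeholders, not the paper's.
scores = pd.DataFrame({
    "model":  ["model-a"] * 4 + ["model-b"] * 4,
    "metric": ["tool_selection", "information_utilization"] * 4,
    "score":  [5, 2, 5, 3, 4, 2, 4, 1],
})

# One consolidated models-x-metrics table, averaging over trajectories.
table = scores.pivot_table(index="model", columns="metric",
                           values="score", aggfunc="mean")
print(table.round(2))
```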

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of FinTrace and the recommendation for minor revision. The summary accurately captures the benchmark construction, evaluation findings on LLM limitations in information utilization and output quality, and the release of the FinTrace-Training preference dataset with DPO results. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is the introduction of a new benchmark (FinTrace) consisting of 800 expert-annotated trajectories and a nine-metric rubric, followed by empirical evaluation of 13 LLMs and construction of a new preference dataset (FinTrace-Training) for fine-tuning. All load-bearing claims about model performance gaps (tool selection vs. information utilization) are derived directly from these newly collected annotations and standard training procedures (SFT + DPO), without any equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the results to prior inputs by construction. The derivation chain is therefore self-contained and externally falsifiable via the released annotations and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the reliability of expert annotations and the sufficiency of the rubric metrics; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Expert-annotated trajectories accurately represent real-world long-horizon financial tasks and their quality.
    Benchmark construction and evaluation depend on this without stated validation steps (a sketch of one possible agreement check follows this list).
  • domain assumption The nine metrics across four axes capture the relevant dimensions of tool-calling trajectory quality.
    Rubric-based protocol assumes comprehensive coverage of action, efficiency, process, and output.
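
One way the first axiom could be checked, sketched under invented data: measure chance-corrected agreement between two independent expert annotators on the same trajectories. Nothing on this page indicates such a study was run; the scores below are placeholders.

```python
from sklearn.metrics import cohen_kappa_score

# Invented 1-5 rubric scores from two independent expert annotators on the
# same ten trajectories; the page reports no such study.
annotator_a = [5, 4, 2, 3, 5, 1, 2, 4, 3, 2]
annotator_b = [5, 3, 2, 3, 4, 1, 2, 5, 3, 2]

# Quadratic weighting treats the scale as ordinal: a 5-vs-1 disagreement
# costs more than a 5-vs-4 one. High kappa would support the first axiom.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")
```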

pith-pipeline@v0.9.0 · 5625 in / 1444 out tokens · 50594 ms · 2026-05-10T16:50:38.691662+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 16 canonical work pages · 7 internal anchors

  1. [1]

    AgentPro: Enhancing LLM Agents with Automated Process Supervision

    Yuchen Deng, Shichen Fan, Naibo Wang, Xinkui Zhao, and See Kiong Ng. AgentPro: Enhancing LLM agents with automated process supervision. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 9992–10017.

  2. [2]

    LoRA: Low-Rank Adaptation of Large Language Models

    URL: https://arxiv.org/abs/2106.09685. Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, et al. FinSearchComp: Towards a realistic, expert-level evaluation of financial search and reasoning. arXiv preprint arXiv:2509.13160.

  3. [3]

    FinanceBench: A New Benchmark for Financial Question Answering

    Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.

  4. [4]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.

  5. [5]

    Improving Tool Retrieval by Leveraging Large Language Models for Query Generation

    Mohammad Kachuee, Sarthak Ahuja, Vaibhav Kumar, Puyang Xu, and Xiaohu Liu. Improving tool retrieval by leveraging large language models for query generation. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di E… Association for Computational Linguistics. URL: https://aclanthology.org/2025.sigdial-1.32/.

  6. [6]

    LongFuncEval: Measuring the Effectiveness of Long Context Models for Function Calling

    Association for Computational Linguistics. URL: https://aclanthology.org/2025.coling-industry.3/. Kiran Kate, Tejaswini Pedapati, Kinjal Basu, Yara Rizk, Vijil Chenthamarakshan, Subhajit Chaudhury, Mayank Agarwal, and Ibrahim Abdelaziz. LongFuncEval: Measuring the effectiveness of long context models for function calling. arXiv preprint arXiv:2505.10570.

  7. [7]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3102–3116.

  8. [8]

    FinDeepForecast: A Live Multi-Agent System for Benchmarking Deep Research Agents in Financial Forecasting

    Xiangyu Li, Xuan Yao, Guohao Qi, Fengbin Zhu, Kelvin JL Koa, Xiang Yao Ng, Ziyang Liu, Xingyu Ni, Chang Liu, Yonghui Yang, et al. FinDeepForecast: A live multi-agent system for benchmarking deep research agents in financial forecasting. arXiv preprint arXiv:2601.05039.

  9. [9]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688.

  10. [10]

    FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

    Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, et al. FinToolBench: Evaluating LLM agents for real-world financial tool use. arXiv preprint arXiv:2603.08262.

  11. [11]

    LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

    Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, and Le Sun. LiveMCPBench: Can agents navigate an ocean of MCP tools? arXiv preprint arXiv:2508.01780.

  12. [12]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789.

  13. [13]

    ToolGen: Unified Tool Retrieval and Calling via Generation

    Guangyu Wang, Jianhong Liu, Meilin Zhou, Xiaoming Chen, Lihua Zhang, and Zhihao Sun. ToolBench 2.0: Evaluating long-horizon and multi-step tool use in LLMs. Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. ToolGen: Unified tool retrieval and calling via generation. arXiv preprint arXiv:2410.03439, 2024a.

  14. [14]

    LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, et al. LiveBench: A challenging, contamination-free LLM benchmark. arXiv preprint arXiv:2406.19314.

  15. [15]

    TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

    Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. arXiv preprint arXiv:2412.14161.

  16. [16]

    The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration

    Haoyuan Xu, Chang Li, Xinyan Ma, Xianhao Ou, Zihan Zhang, Tao He, Xiangyu Liu, Zixiang Wang, Jiafeng Liang, Zheng Chu, et al. The evolution of tool use in LLM agents: From single-tool call to multi-tool orchestration. arXiv preprint arXiv:2603.22862.

  17. [17]

    Survey on Evaluation of LLM-based Agents

    Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. Survey on evaluation of LLM-based agents. arXiv preprint arXiv:2503.16416.

  18. [18]

    FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

    Fengbin Zhu, Xiang Yao Ng, Ziyang Liu, Chang Liu, Xianwei Zeng, Chao Wang, Tianhui Tan, Xuan Yao, Pengyang Shao, Min Xu, et al. FinDeepResearch: Evaluating deep research agents in rigorous financial analysis. arXiv preprint arXiv:2510.13936.

  19. [19]

    FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

    Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li, Xianyin Zhang, Lifan Guo, Feng Chen, Yong Liu, et al. FinMCP-Bench: Benchmarking LLM agents for real-world financial tool use under the Model Context Protocol. arXiv preprint arXiv:2603.24943.
