
arxiv: 2510.14703 · v2 · submitted 2025-10-16 · 💻 cs.AI

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

Pith reviewed 2026-05-18 06:27 UTC · model grok-4.3

classification 💻 cs.AI
keywords function calling · process reward model · inference scaling · structured outputs · beam search · large language models · tool use · reward modeling

The pith

ToolPRM scores each intra-call decision to scale inference for structured function calling better than outcome or coarse rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a process reward model called ToolPRM that assigns scores to individual choices inside a function call, such as selecting the right function name and populating its arguments correctly. It creates training data for this model by masking parts of functions, running multiple rollouts, and labeling quality at each step rather than only at the end. When combined with beam search, this fine-grained scoring produces more accurate predictions and lifts results across several function-calling test sets. Readers would care because reliable function calling lets language models use external tools without constant human fixes, and scaling performance at inference time offers gains without retraining the base model. The work additionally observes that structured outputs require exploring many early options but keeping fewer later, since an early JSON mistake cannot be repaired.

Core claim

We propose ToolPRM, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows 'explore more but retain less', since early JSON errors are unrecoverable.

What carries the argument

ToolPRM, the process reward model that scores every intra-call decision to guide fine-grained beam search over structured outputs.
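
To make the idea concrete, here is a minimal Python sketch of step-level beam search guided by a per-decision scorer. It is an illustration under stated assumptions, not the paper's implementation: `beam_search`, `toy_scorer`, `TOY_SCORES`, `STEPS`, the `expand` and `retain` parameters, the fixed candidate lists, and the choice to sum step scores are all hypothetical stand-ins for an LLM sampler and a trained process reward model.

```python
from typing import Callable, Dict, List, Tuple

# A partial call is the tuple of decisions made so far,
# e.g. ("name=get_weather", "city=Paris").
Partial = Tuple[str, ...]


def beam_search(
    steps: List[List[str]],
    score_step: Callable[[Partial, str], float],
    expand: int,
    retain: int,
) -> List[Tuple[Partial, float]]:
    """Step-level beam search over a fixed sequence of decision slots.

    steps[i] lists candidate decisions for slot i (a hypothetical stand-in
    for sampling from an LLM). Each beam is extended with up to `expand`
    candidates, every extension is scored, and only the `retain`
    highest-scoring partial calls survive to the next slot.
    """
    beams: List[Tuple[Partial, float]] = [((), 0.0)]
    for candidates in steps:
        extended = []
        for partial, total in beams:
            for cand in candidates[:expand]:
                # Assumption: step scores are aggregated by summation.
                new_total = total + score_step(partial, cand)
                extended.append((partial + (cand,), new_total))
        extended.sort(key=lambda beam: beam[1], reverse=True)
        beams = extended[:retain]
    return beams


# Hypothetical per-decision scores standing in for a trained reward model.
TOY_SCORES: Dict[str, float] = {
    "name=get_weather": 0.9, "name=get_time": 0.2,
    "city=Paris": 0.8, "city=London": 0.1,
    "unit=celsius": 0.7, "unit=kelvin": 0.3,
}


def toy_scorer(partial: Partial, candidate: str) -> float:
    """Score one decision; penalize arguments that follow a wrong name."""
    score = TOY_SCORES[candidate]
    if partial and partial[0] != "name=get_weather":
        score -= 0.5
    return score


STEPS = [
    ["name=get_time", "name=get_weather"],
    ["city=London", "city=Paris"],
    ["unit=kelvin", "unit=celsius"],
]

best = beam_search(STEPS, toy_scorer, expand=2, retain=1)
print(best[0][0])
```

In this toy setup the scorer sees every decision as it is made, so a poor function name can be pruned before any arguments are filled in. An outcome-only scorer would have to wait for the complete call.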

If this is right

  • ToolPRM delivers higher predictive accuracy for good intra-call paths than either final-outcome or coarse-grained reward models.
  • The same model produces steady performance lifts when used for test-time scaling on existing function-calling benchmarks.
  • Structured function outputs obey an explore-more-retain-less rule because an early JSON formatting error cannot be corrected downstream.
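
The last point can be illustrated with a small Python sketch using the standard `json` module. It checks whether a partial JSON string can still be turned into a parseable document by any of a handful of continuations. The helper `completable`, the example prefixes, and the suffix list are hypothetical and chosen only for illustration; a finite suffix list is a demonstration, not a proof.

```python
import json
from typing import Iterable


def completable(prefix: str, suffixes: Iterable[str]) -> bool:
    """Return True if at least one suffix makes prefix + suffix valid JSON."""
    for suffix in suffixes:
        try:
            json.loads(prefix + suffix)
            return True
        except json.JSONDecodeError:
            continue
    return False


# Hypothetical continuations a decoder might try.
SUFFIXES = [
    '}',
    ', "arguments": {"city": "Paris"}}',
    ': "get_weather"}',
    '"}',
]

GOOD_PREFIX = '{"name": "get_weather"'
BAD_PREFIX = '{"name" "get_weather"'  # colon missing after the key

print(completable(GOOD_PREFIX, SUFFIXES))
print(completable(BAD_PREFIX, SUFFIXES))
```

The malformed prefix fails for every suffix because the parser rejects the input at a position inside the prefix itself, before it reaches any appended text. That is the sense in which an early formatting error cannot be repaired by later tokens.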

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-grained supervision approach could be applied to other structured generation settings such as code completion where early syntax mistakes also compound.
  • Increasing the beam width further while using ToolPRM might reduce the performance gap to much larger base models without additional training.
  • The masking-plus-rollout data pipeline offers a template for creating step-level signals in broader tool-use or API-calling scenarios.

Load-bearing premise

The dataset built by masking functions, collecting rollouts, and adding step-level labels supplies rewards that remain accurate and transfer to new test distributions.

What would settle it

Running ToolPRM-guided beam search on a new function-calling benchmark with unseen function schemas and finding no accuracy gain over greedy decoding or an outcome reward model would disprove the consistent test-time improvement claim.

Figures

Figures reproduced from arXiv: 2510.14703 by Bizhe Bai, Fei Huang, Fengshuo Bai, Hairui Wang, Huacan Chai, Jianghao Lin, Renjie Ding, Weinan Zhang, Weixi Song, Xin Peng, Ying Wen, Yuanyuan Shi, Yuxuan Peng.

Figure 1: The illustration of data collection for ToolPRM.
Figure 2: The state transition of function calling (left) and beam search with ToolPRM (right).
Figure 3: The learning curves of different reward models …
Figure 4: The performance on BFCL (Averaged Accuracy) and ToolAlpaca (Averaged F1 Score) w.r.t. different generation budgets
Original abstract

Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an inference-scaling framework for structured outputs that combines fine-grained beam search with ToolPRM, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows "explore more but retain less", since early JSON errors are unrecoverable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes ToolPRM, a process reward model for fine-grained inference scaling of structured outputs in LLM function calling. It constructs the first intra-call supervision dataset through function masking, rollout collection, and step-level annotation, then uses ToolPRM to score individual decisions (function name and argument filling) within calls. The central claims are that ToolPRM outperforms outcome and coarse-grained reward models on predictive accuracy and delivers consistent gains under test-time beam search on function-calling benchmarks; the work also reports that structured generation obeys an “explore more but retain less” pattern because early JSON errors are unrecoverable.

Significance. If the fine-grained rewards prove accurate and transferable, the framework would advance inference-time scaling for tool-use and structured generation tasks by enabling more precise process-level guidance. The new dataset construction pipeline and the empirical observation on exploration/retention trade-offs in JSON-structured outputs are concrete contributions that could inform future work on process supervision.

major comments (1)
  1. Dataset construction section: the step-level annotation procedure (function masking + rollout collection) is not shown to assign labels independently of final rollout outcome. If annotations rely on outcome proxies or simple heuristics, early correct function-name/argument decisions can be mislabeled when later steps fail, directly threatening both the reported predictive-accuracy advantage over outcome/coarse-grained baselines and the claimed generalization to unseen test distributions.
minor comments (1)
  1. Abstract and §4: the slogan “explore more but retain less” is introduced without a forward reference to the section or figure that quantifies the exploration/retention trade-off; a brief parenthetical or citation would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below.

point-by-point responses
  1. Referee: Dataset construction section: the step-level annotation procedure (function masking + rollout collection) is not shown to assign labels independently of final rollout outcome. If annotations rely on outcome proxies or simple heuristics, early correct function-name/argument decisions can be mislabeled when later steps fail, directly threatening both the reported predictive-accuracy advantage over outcome/coarse-grained baselines and the claimed generalization to unseen test distributions.

    Authors: We appreciate the referee highlighting this critical requirement for true process supervision. In our pipeline, function masking generates partial structured outputs and rollouts produce candidate completions, but the subsequent step-level annotation is performed by human experts who judge each intra-call decision (function-name selection or argument filling) solely on its correctness given the query and prior context. Annotators are explicitly instructed to ignore the remainder of the rollout and the final success or failure; a correct early decision receives a positive label even if a later step introduces an error. This protocol was designed to avoid the exact contamination the referee describes. We acknowledge, however, that the current manuscript does not provide sufficient detail on the annotation guidelines or inter-annotator agreement statistics to make this independence fully transparent. In the revised version we will expand the Dataset Construction section with the precise annotation rubric, example decisions, and confirmation that labels were assigned without reference to rollout outcomes. These additions will directly support the validity of the reported predictive-accuracy gains and the generalization claims.

    revision: yes
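
The difference between the two labeling regimes at issue in this exchange can be shown with a short Python sketch. It is a toy illustration only: the function names, the boolean per-step correctness list, and the example rollout are hypothetical and do not come from the paper's dataset or annotation tooling.

```python
from typing import List


def outcome_propagated_labels(step_correct: List[bool]) -> List[int]:
    """Give every step the final outcome: 1 only if the whole call is right."""
    outcome = int(all(step_correct))
    return [outcome] * len(step_correct)


def independent_step_labels(step_correct: List[bool]) -> List[int]:
    """Label each step on its own correctness, ignoring later steps."""
    return [int(ok) for ok in step_correct]


def count_disagreements(step_correct: List[bool]) -> int:
    """Count steps where the two labeling regimes assign different labels."""
    propagated = outcome_propagated_labels(step_correct)
    independent = independent_step_labels(step_correct)
    return sum(a != b for a, b in zip(propagated, independent))


# Hypothetical rollout: correct function name, correct first argument,
# wrong second argument.
rollout = [True, True, False]

print(outcome_propagated_labels(rollout))
print(independent_step_labels(rollout))
print(count_disagreements(rollout))
```

Under outcome propagation, the two correct early decisions in this rollout receive negative labels because the final argument is wrong. That is the mislabeling the referee warns about and the protocol described in the rebuttal is meant to avoid.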

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent dataset construction and external evaluation

full rationale

The paper constructs a new fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation, trains ToolPRM on this data to score individual decisions, and evaluates predictive accuracy plus test-time gains on multiple external function-calling benchmarks. No quoted equations, self-citations, or claims reduce the reported outperformance or the 'explore more but retain less' observation to fitted parameters by construction, self-referential definitions, or load-bearing prior work by the same authors. The central results rest on standard supervised training and benchmark comparison rather than any reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claims rest on the validity of the newly constructed intra-call supervision dataset and the assumption that process-level rewards at this granularity improve beam search outcomes.

invented entities (1)
  • ToolPRM · no independent evidence
    purpose: Process reward model scoring intra-call decisions for function calling
    New model introduced to provide fine-grained supervision.

pith-pipeline@v0.9.0 · 5683 in / 1070 out tokens · 28055 ms · 2026-05-18T06:27:54.227345+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
