ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
Pith reviewed 2026-05-18 06:27 UTC · model grok-4.3
The pith
ToolPRM scores each intra-call decision, which lets inference scaling for structured function calling work better than it does with outcome-level or coarse-grained rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose ToolPRM, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows 'explore more but retain less', since early JSON errors are unrecoverable.
What carries the argument
ToolPRM, the process reward model that scores every intra-call decision to guide fine-grained beam search over structured outputs.
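The mechanism can be pictured with a minimal sketch of step-level beam search. This is not the authors' implementation: the lookup table `TOY_SCORES` is a hypothetical stand-in for a learned process reward model, and the function names, argument strings, and the `keep` parameter are all illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# A partial call is the tuple of decisions made so far, e.g.
# ("get_weather", "city=Paris"). All names here are illustrative.
Path = Tuple[str, ...]


def fine_grained_beam_search(
    step_candidates: Sequence[Sequence[str]],
    score_step: Callable[[Path, str], float],
    keep: int,
) -> List[Tuple[Path, float]]:
    """Expand every beam with every candidate decision, then keep the top `keep`."""
    beams: List[Tuple[Path, float]] = [((), 0.0)]
    for candidates in step_candidates:
        expanded = [
            (path + (cand,), total + score_step(path, cand))
            for path, total in beams
            for cand in candidates
        ]
        # Retain only the highest-scoring partial calls before the next decision.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:keep]
    return beams


# Hypothetical stand-in for a learned process reward model: a fixed
# log-score per decision that ignores the prefix.
TOY_SCORES = {
    "get_weather": -0.1, "get_time": -2.0,
    "city=Paris": -0.2, "city=London": -1.5,
    "unit=celsius": -0.3, "unit=kelvin": -1.0,
}


def toy_score_step(path: Path, candidate: str) -> float:
    return TOY_SCORES[candidate]


steps = [
    ["get_weather", "get_time"],       # function-name decision
    ["city=Paris", "city=London"],     # first argument
    ["unit=celsius", "unit=kelvin"],   # second argument
]
best_path, best_score = fine_grained_beam_search(steps, toy_score_step, keep=2)[0]
print(best_path)  # ('get_weather', 'city=Paris', 'unit=celsius')
```

The point of the sketch is only that scoring happens after every decision rather than once per finished call, so a weak partial call can be dropped before more compute is spent on it.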
If this is right
- ToolPRM predicts which intra-call paths are correct more accurately than either final-outcome or coarse-grained reward models.
- The same model produces steady performance lifts when used for test-time scaling on existing function-calling benchmarks.
- Structured function outputs obey an explore-more-retain-less rule because an early JSON formatting error cannot be corrected downstream.
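One simplified way to read the explore-more-retain-less rule is with back-of-the-envelope arithmetic. The sketch below assumes that each intra-call decision is correct independently with the same probability, that a wrong decision can never be repaired, and that an idealized scorer always picks a correct candidate when one exists. None of these assumptions or numbers come from the paper.

```python
def single_path_success(p_step: float, n_steps: int) -> float:
    """Chance that one sampled path gets all n_steps decisions right, with no recovery."""
    return p_step ** n_steps


def step_has_correct_candidate(p_step: float, width: int) -> float:
    """Chance that at least one of `width` independent candidates for a step is right."""
    return 1.0 - (1.0 - p_step) ** width


# Illustrative numbers only, not taken from the paper.
p, n = 0.9, 5
greedy = single_path_success(p, n)
# Upper bound under an idealized scorer that never discards a correct candidate.
wide = step_has_correct_candidate(p, width=4) ** n
print(round(greedy, 5), round(wide, 4))
```

With these numbers a single path succeeds about 59% of the time, while exploring four candidates per step raises the ceiling to about 99.95%. Under the same assumptions a retained path that already contains an error can never become correct, so keeping it spends budget without adding any chance of success.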
Where Pith is reading between the lines
- The same fine-grained supervision approach could be applied to other structured generation settings such as code completion where early syntax mistakes also compound.
- Increasing the beam width further while using ToolPRM might reduce the performance gap to much larger base models without additional training.
- The masking-plus-rollout data pipeline offers a template for creating step-level signals in broader tool-use or API-calling scenarios.
Load-bearing premise
The dataset built by masking functions, collecting rollouts, and adding step-level labels supplies rewards that remain accurate and transfer to new test distributions.
What would settle it
Running ToolPRM-guided beam search on a new function-calling benchmark with unseen function schemas and finding no accuracy gain over greedy decoding or an outcome reward model would disprove the consistent test-time improvement claim.
Original abstract
Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an inference-scaling framework for structured outputs that combines fine-grained beam search with ToolPRM, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows "explore more but retain less", since early JSON errors are unrecoverable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ToolPRM, a process reward model for fine-grained inference scaling of structured outputs in LLM function calling. It constructs the first intra-call supervision dataset through function masking, rollout collection, and step-level annotation, then uses ToolPRM to score individual decisions (function name and argument filling) within calls. The central claims are that ToolPRM outperforms outcome and coarse-grained reward models on predictive accuracy and delivers consistent gains under test-time beam search on function-calling benchmarks; the work also reports that structured generation obeys an “explore more but retain less” pattern because early JSON errors are unrecoverable.
Significance. If the fine-grained rewards prove accurate and transferable, the framework would advance inference-time scaling for tool-use and structured generation tasks by enabling more precise process-level guidance. The new dataset construction pipeline and the empirical observation on exploration/retention trade-offs in JSON-structured outputs are concrete contributions that could inform future work on process supervision.
major comments (1)
- Dataset construction section: the step-level annotation procedure (function masking + rollout collection) is not shown to assign labels independently of final rollout outcome. If annotations rely on outcome proxies or simple heuristics, early correct function-name/argument decisions can be mislabeled when later steps fail, directly threatening both the reported predictive-accuracy advantage over outcome/coarse-grained baselines and the claimed generalization to unseen test distributions.
minor comments (1)
- Abstract and §4: the slogan “explore more but retain less” is introduced without a forward reference to the section or figure that quantifies the exploration/retention trade-off; a brief parenthetical or citation would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the single major comment below.
Point-by-point responses
- Referee: Dataset construction section: the step-level annotation procedure (function masking + rollout collection) is not shown to assign labels independently of final rollout outcome. If annotations rely on outcome proxies or simple heuristics, early correct function-name/argument decisions can be mislabeled when later steps fail, directly threatening both the reported predictive-accuracy advantage over outcome/coarse-grained baselines and the claimed generalization to unseen test distributions.
  Authors: We appreciate the referee highlighting this critical requirement for true process supervision. In our pipeline, function masking generates partial structured outputs and rollouts produce candidate completions, but the subsequent step-level annotation is performed by human experts who judge each intra-call decision (function-name selection or argument filling) solely on its correctness given the query and prior context. Annotators are explicitly instructed to ignore the remainder of the rollout and the final success or failure; a correct early decision receives a positive label even if a later step introduces an error. This protocol was designed to avoid the exact contamination the referee describes. We acknowledge, however, that the current manuscript does not provide sufficient detail on the annotation guidelines or inter-annotator agreement statistics to make this independence fully transparent. In the revised version we will expand the Dataset Construction section with the precise annotation rubric, example decisions, and confirmation that labels were assigned without reference to rollout outcomes. These additions will directly support the validity of the reported predictive-accuracy gains and the generalization claims.
  Revision: yes
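The difference at stake in this exchange can be made concrete with a small sketch. It contrasts labels assigned per decision with labels copied from the final outcome. Exact match against a hypothetical reference call stands in for the annotator's judgement here; the function names and example values are illustrative assumptions, not the paper's annotation procedure.

```python
from typing import List


def step_labels(predicted: List[str], gold: List[str]) -> List[int]:
    """Label each decision on its own: 1 if it matches the reference, else 0."""
    return [int(p == g) for p, g in zip(predicted, gold)]


def outcome_propagated_labels(predicted: List[str], gold: List[str]) -> List[int]:
    """Copy the final outcome onto every step: 1 only if the whole call is right."""
    outcome = int(predicted == gold)
    return [outcome] * len(predicted)


# Hypothetical reference call and a rollout that fails only at the last argument.
gold = ["get_weather", "city=Paris", "unit=celsius"]
rollout = ["get_weather", "city=Paris", "unit=kelvin"]

independent = step_labels(rollout, gold)               # [1, 1, 0]
propagated = outcome_propagated_labels(rollout, gold)  # [0, 0, 0]
print(independent, propagated)
```

In the second labeling the correct function name and first argument are marked wrong because a later step failed, which is the contamination the referee describes and the rebuttal says the protocol avoids.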
Circularity Check
No significant circularity; derivation relies on independent dataset construction and external evaluation
The paper constructs a new fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation, trains ToolPRM on this data to score individual decisions, and evaluates predictive accuracy plus test-time gains on multiple external function-calling benchmarks. No quoted equations, self-citations, or claims reduce the reported outperformance or the 'explore more but retain less' observation to fitted parameters by construction, self-referential definitions, or load-bearing prior work by the same authors. The central results rest on standard supervised training and benchmark comparison rather than any reduction of outputs to inputs.
Axiom & Free-Parameter Ledger
invented entities (1)
- ToolPRM: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "We construct the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation... ToolPRM outperforms outcome and coarse-grained reward models"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (embed_strictMono_of_one_lt)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "explore more but retain less... due to the unrecoverability characteristics of structured function calling generation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
  DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.