pith. machine review for the scientific record. sign in

arxiv: 2510.26020 · v2 · submitted 2025-10-29 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

Authors on Pith no claims yet
classification 💻 cs.CL cs.AIcs.LG
keywords portoolsteptool-useagentsestimatesimportancereasoningbranching
0
0 comments X
read the original abstract

Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents from outcome-only rewards suffers from credit-assignment ambiguity, obscuring which intermediate tool-use decisions drive success or failure. In this paper, we propose PORTool, an importance-aware policy-optimization algorithm that reinforces agents' tool-use competence from outcome-level supervision while assigning reward at the step level. Specifically, PORTool generates a rewarded rollout tree in which trajectories share prefixes before branching, enabling direct comparisons among alternative tool-use decisions within the same context. It then estimates each step's importance by a correctness-dominant signal, i.e., whether descendants of that step can ultimately produce a correct final answer, plus an auxiliary term indicating whether the step's tool calls satisfy formatting constraints and execute successfully. Using these step-wise importance estimates, PORTool updates the policy to generate efficient tool-call steps, guided by both local comparisons within each branching decision and the overall quality of entire trajectories. Experiments show that PORTool improves final-answer accuracy while reducing tool-call steps compared with state-of-the-art policy-optimization baselines, and ablation studies confirm the robustness of the proposed step-wise importance estimates.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  2. Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    A decision-based agent for KB-VQA learns to dynamically select retrieval or answer actions over multiple steps and achieves state-of-the-art results on InfoSeek and E-VQA after fine-tuning on automatically collected t...