arxiv: 2606.05784 · v1 · pith:3L4KJBNPnew · submitted 2026-06-04 · 💻 cs.AI

TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

Pith reviewed 2026-06-28 02:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords credit assignment · policy optimization · tool-augmented agents · multimodal search · reinforcement learning · GRPO · counterfactual correction

The pith

TAPO corrects credit misassignment in reinforcement learning for tool-using multimodal search agents by transferring advantages between equivalent tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that GRPO and similar methods broadcast a single trajectory-level advantage to every token, so useful tool calls inside failing searches receive the same negative signal as useless ones. It measures that more than half of failing trajectories contain such correctable misassignments. TAPO exploits the observation that information-acquisition tools with matching call parameters produce equivalent actions and should therefore receive comparable credit. It builds counterfactual witnesses inside each training batch and applies a gated correction to restore proper advantage values. The result is consistent gains on search benchmarks for three different RL algorithms with no added models or sampling cost.
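The uniform broadcast described above can be made concrete with a minimal Python sketch. It assumes the standard group-normalized GRPO advantage (reward minus group mean, divided by group standard deviation) and copies that one scalar to every token of a trajectory. The function names, the `eps` stabilizer, the choice of population standard deviation, and the toy rewards and token counts are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: one scalar per sampled trajectory."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast(advantage, num_tokens):
    """Copy the trajectory-level advantage onto every token."""
    return [advantage] * num_tokens


# Toy group of four rollouts for one question: two succeed, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [5, 7, 4, 6]  # hypothetical trajectory lengths

advs = grpo_advantages(rewards)
token_advs = [broadcast(a, n) for a, n in zip(advs, token_counts)]
```

In this toy group, every token of a failing rollout carries the same negative value, including the tokens of any tool call that retrieved the right evidence. That is the pattern the paper calls credit misassignment.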

Core claim

Credit misassignment is a systematic failure mode of GRPO in tool-augmented multimodal search agents because uniform trajectory-level advantage broadcast penalizes valuable tool-use steps identically to valueless ones; over half of failing trajectories exhibit this pattern. TAPO addresses it by constructing counterfactual witnesses from the parameter-determinism property of information-acquisition tools and applying confidence-gated conservative advantage correction, yielding plug-and-play gains on GRPO, GSPO, and SAPO across multiple benchmarks without extra annotation or models.

What carries the argument

The parameter-determinism property of information-acquisition tools: calls with similar parameters are treated as equivalent actions that should share comparable credit. TAPO uses this property to build batch-internal counterfactual witnesses for advantage correction.
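One way to picture the mechanism is the Python sketch below, which matches a failing tool-use step against tool calls from successful trajectories in the same batch and moves its advantage part of the way toward theirs. Everything specific here is an assumption for illustration: the token-overlap similarity, the `sim_threshold` and `min_support` gates, the `beta` step size, and the cap at zero are stand-ins, since this page does not define the paper's two-level gate (θ and αmin) or its exact update rule.

```python
from typing import NamedTuple


class ToolStep(NamedTuple):
    tool: str         # tool name, e.g. "text_search" (hypothetical)
    params: str       # serialized call parameters
    advantage: float  # trajectory-level advantage broadcast to this step


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for the paper's similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def transfer_credit(step, witnesses, sim_threshold=0.8, min_support=1, beta=0.25):
    """Return a corrected advantage for one tool-use step (illustrative rule)."""
    if step.advantage >= 0.0:
        return step.advantage  # only negative credit is compensated
    matches = [
        w for w in witnesses
        if w.tool == step.tool
        and w.advantage > 0.0
        and jaccard(step.params, w.params) >= sim_threshold
    ]
    if len(matches) < min_support:
        return step.advantage  # gate closed: no reliable witness
    target = sum(w.advantage for w in matches) / len(matches)
    moved = step.advantage + beta * (target - step.advantage)
    return min(moved, 0.0)  # conservative: never flip the sign


# Witness from a successful trajectory, plus two steps from failing ones.
witnesses = [ToolStep("text_search", "legoland windsor opening year", 1.0)]
useful = ToolStep("text_search", "Legoland Windsor opening year", -1.0)
useless = ToolStep("text_search", "weather in paris", -1.0)

useful_adv = transfer_credit(useful, witnesses)    # -0.5
useless_adv = transfer_credit(useless, witnesses)  # -1.0, unchanged
```

The cap at zero is one possible reading of "conservative": a step in a failing trajectory is penalized less but never rewarded outright. The sketch needs no extra model or sampling because the witnesses come from rollouts already in the batch, which matches the paper's stated cost profile.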

If this is right

  • TAPO yields consistent gains over GRPO, GSPO, and SAPO on multiple multimodal search benchmarks.
  • The correction requires only existing batch data and adds negligible compute.
  • More than half of observed failing trajectories and tool-use actions become correctable without new supervision.
  • The method works as a plug-and-play addition to the three tested RL algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parameter-determinism idea could be tested in non-search tool-use domains where action equivalence is easy to check.
  • Batch-internal counterfactuals may reduce variance in other credit-assignment problems that currently rely on external value models.
  • If the fraction of correctable misassignments stays high across new benchmarks, credit transfer becomes a first-line fix rather than an optional add-on.

Load-bearing premise

Similar tool call parameters define equivalent information-acquisition actions that should receive comparable credit.

What would settle it

A controlled test in which tool calls with nearly identical parameters produce measurably different information outcomes yet TAPO still improves performance, or a run showing zero gain from the correction on held-out benchmarks.

Figures

Figures reproduced from arXiv: 2606.05784 by Chengqi Dong, Chuhuai Yue, Fenghe Tang, Guojun Yin, Hang He, Jiajun Chai, S Kevin Zhou, Xiaohan Wang, Yandong Liu.

Figure 1: A concrete example of credit misassignment
Figure 2: Prevalence and reliability of correctable credit misassignment in GRPO. A failing action is considered …
Figure 3: Overview of TAPO. ➀ Standard GRPO broadcasts trajectory-level advantages uniformly to all tokens. ➁ Tool-use steps from successful trajectories are clustered into reference groups. ➂ Each failing tool-use step is matched to the most similar reference group, and advantage transfer is applied via a two-level gate (θ and αmin).
Figure 4: Ablation on generality across RL algorithms.
Figure 5: Robustness of TAPO to the transfer coefficient
Figure 6: Training dynamics of Vanilla GRPO vs. TAPO
Figure 7: Wall-clock time breakdown per training step.
Figure 8: Training dynamics of TAPO (β=0.25, Qwen3-VL-4B). Training accuracy increases steadily throughout training, while the fraction of correctable trajectories (Correctable Traj.) and correctable steps (Correctable Steps) remain structurally stable and exhibit a slow upward trend, indicating that credit misassignment persists across the entire training horizon.
Figure 9: Six-dimensional training dynamics of TAPO (…
Figure 10: Per-tool-type training dynamics of TAPO (…
Figure 11: Case 1: Entity Recognition + Knowledge Retrieval. The model identifies Legoland Windsor via image search and retrieves its opening year with a single text query, correctly answering 1996.
Figure 12: Case 2: Multi-Hop Iterative Query Refinement. Starting from a group photo, the model chains four tool calls across three reasoning hops to arrive at South Korea.
Figure 13: Case 3: Persistent Query Refinement for Numerical Retrieval. The model refines its text query three times after receiving non-specific results, ultimately extracting the exact island width of 300 m.
Figure 14: Case 4: Visual Detail Extraction via Zoom + Text Confirmation. Unable to identify the operator from a global image search, the model zooms into the hull text, reads the URL, and confirms Manche Iles Express via a follow-up search.
Original abstract

We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that GRPO in tool-augmented multimodal search agents suffers from systematic credit misassignment because uniform broadcast of trajectory-level advantages penalizes valuable tool-use steps in failing trajectories equally with valueless ones. It empirically quantifies this phenomenon as affecting over half of failing trajectories and failing tool-use actions, and introduces TAPO, which exploits the parameter-determinism of information-acquisition tools to build counterfactual witnesses within the training batch and apply confidence-gated conservative advantage correction, yielding consistent plug-and-play gains over GRPO, GSPO, and SAPO baselines on multiple multimodal search benchmarks with negligible overhead.

Significance. If the empirical scale of correctable misassignment is robustly measured and the parameter-determinism assumption holds without substantial state-dependence violations, TAPO would provide a lightweight, annotation-free improvement to credit assignment for tool-using RL agents. The promised public release of code and models would enable direct reproducibility and extension.

major comments (2)
  1. [Abstract] The central empirical claim that 'over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment' is load-bearing for the motivation and for the assertion that the wasted signal is 'substantial and structurally exploitable,' yet the abstract supplies no measurement procedure, exact definition of 'correctable,' dataset, or counting methodology.
  2. [Abstract] TAPO's counterfactual witnesses and advantage correction rest on the stated property that 'similar call parameters define equivalent information-acquisition actions,' but the manuscript provides no analysis or safeguards for cases in which the same parameters yield different information value depending on trajectory state (early vs. late search), which directly risks transferring advantage between non-equivalent actions.
minor comments (1)
  1. [Abstract] The abstract introduces the acronyms GRPO, GSPO, and SAPO without expansion on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. Below we respond point-by-point to the major comments.

  1. Referee: [Abstract] The central empirical claim that 'over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment' is load-bearing for the motivation and for the assertion that the wasted signal is 'substantial and structurally exploitable,' yet the abstract supplies no measurement procedure, exact definition of 'correctable,' dataset, or counting methodology.

    Authors: We agree that the abstract would be strengthened by including a concise description of the quantification procedure. The full manuscript (Section 4.2) defines correctable credit misassignment as the subset of tool-use actions in failing trajectories for which at least one similar-parameter witness exists among successful trajectories in the same training batch; the count is obtained by enumerating all failing tool-use steps on the evaluation benchmarks described in Section 5.1 and checking parameter similarity against the batch. We will revise the abstract to reference this definition, the dataset family, and the batch-based counting method. revision: yes

  2. Referee: [Abstract] TAPO's counterfactual witnesses and advantage correction rest on the stated property that 'similar call parameters define equivalent information-acquisition actions,' but the manuscript provides no analysis or safeguards for cases in which the same parameters yield different information value depending on trajectory state (early vs. late search), which directly risks transferring advantage between non-equivalent actions.

    Authors: The referee correctly notes that the manuscript states the parameter-determinism assumption without an explicit analysis of state dependence. We will add a dedicated paragraph (and, if space permits, a short empirical table) in the revised manuscript that (a) reports the frequency of early-vs-late parameter matches on our benchmarks and (b) quantifies the residual risk after the existing confidence gate. Should the analysis reveal non-negligible state-dependent divergence, we will either tighten the similarity threshold or flag the limitation. revision: yes
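The counting procedure in the first response (a failing tool-use step is correctable when at least one similar-parameter witness exists among successful trajectories in the same batch) can be sketched in Python as follows. The batch layout, the token-overlap similarity, the `sim_threshold` value, and the toy data are assumptions made for illustration; the page does not specify how the paper measures parameter similarity.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased tokens; a hypothetical similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def correctable_stats(batch, sim_threshold=0.8):
    """Return (fraction of failing trajectories, fraction of failing steps)
    that have at least one similar-parameter witness in the batch."""
    witnesses = [c for t in batch if t["success"] for c in t["calls"]]
    failing = [t for t in batch if not t["success"]]
    step_total = step_hits = traj_hits = 0
    for traj in failing:
        hit_any = False
        for tool, params in traj["calls"]:
            step_total += 1
            if any(wt == tool and token_overlap(params, wp) >= sim_threshold
                   for wt, wp in witnesses):
                step_hits += 1
                hit_any = True
        if hit_any:
            traj_hits += 1
    traj_frac = traj_hits / len(failing) if failing else 0.0
    step_frac = step_hits / step_total if step_total else 0.0
    return traj_frac, step_frac


# Toy batch: one successful rollout and two failing ones.
batch = [
    {"success": True, "calls": [("text_search", "legoland windsor opening year")]},
    {"success": False, "calls": [("text_search", "legoland windsor opening year"),
                                 ("text_search", "theme parks in france")]},
    {"success": False, "calls": [("image_search", "crop 0 0 10 10")]},
]

traj_frac, step_frac = correctable_stats(batch)
```

On this toy batch, one of the two failing trajectories and one of the three failing steps have a witness. Recording the step index of each matched pair in the same loop would give the early-versus-late match frequency that the second response promises to report.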

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper identifies credit misassignment empirically from trajectory data, quantifies its prevalence directly from observations (over half of failing cases), and introduces TAPO as a correction mechanism grounded in the observable parameter-determinism property of information-acquisition tools. No equations, fitted parameters, or self-citations are shown to reduce the central claims or advantages by construction to the inputs themselves. The method is presented as a plug-and-play addition validated on external benchmarks, with no load-bearing steps that equate predictions to their own definitions or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that tool calls with matching parameters are informationally equivalent and therefore merit shared credit; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit
    This premise is invoked to justify constructing counterfactual witnesses and performing confidence-gated conservative advantage correction.

pith-pipeline@v0.9.1-grok · 5758 in / 1275 out tokens · 37497 ms · 2026-06-28T02:08:35.809614+00:00 · methodology

