pith. sign in

arxiv: 2606.30251 · v1 · pith:BOYSYNYJnew · submitted 2026-06-29 · 💻 cs.MA

TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Pith reviewed 2026-06-30 03:50 UTC · model grok-4.3

classification 💻 cs.MA
keywords tool-augmented credit optimizationagentic tool usemultimodal agentscredit assignmentreinforcement learningcode toolsself-supervised rewards
0
0 comments X

The pith

TACO credits each tool call by the difference it makes to final answer accuracy using inserted probe tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic multimodal models apply code operations to images and reason over the results, but outcome rewards alone cannot tell which calls help, which do nothing, or which mislead. TACO addresses this with a GRPO variant that adds two linked advantage channels. Differential Answer-Probe Reward inserts probe tokens to elicit predictions with and without each call, then uses the change in outcome reward as its credit value. Outcome-Gated Advantage Routing sends the final answer advantage only to the reasoning segments that produced that call. A reader would care because the method needs no extra judge model and trains agents that reach higher accuracy while skipping unhelpful tools.

Core claim

TACO is a GRPO variant built on Differential Answer-Probe Reward, which computes a self-supervised advantage for each tool call as the difference in outcome reward between predictions with and without the call, and Outcome-Gated Advantage Routing, which routes the final-answer advantage only to responsible segments when the call is useful. The two channels together allow the model to receive positive credit for helpful calls, negative credit for misleading ones, and zero for redundant ones, all while reusing the existing answer checker.

What carries the argument

Differential Answer-Probe Reward (DAPR) paired with Outcome-Gated Advantage Routing (OGAR) as two coupled advantage channels inside a GRPO training loop.

If this is right

  • Models reach higher accuracy on perception, reasoning, and general multimodal benchmarks.
  • Agents learn to invoke tools only when the calls improve the final answer.
  • Credit assignment requires no auxiliary judge model or added parameters.
  • Training proceeds through a two-stage SFT followed by RL pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same probe-difference idea could be tested on non-code tools if an outcome checker is available.
  • Selective tool invocation may lower average compute cost per query in deployed agent systems.
  • The approach suggests a general route for self-supervised credit assignment in any agent that produces intermediate actions with measurable outcome effects.

Load-bearing premise

The difference in outcome reward produced by probe tokens accurately isolates each tool call's contribution and stays robust to any attempt to manipulate the probes.

What would settle it

An experiment in which accuracy gains disappear once probe tokens are removed from training or in which the model begins receiving positive credit for calls that demonstrably lower answer quality.

Figures

Figures reproduced from arXiv: 2606.30251 by Chuyuan Zhang, Fangrui Lv, Hao Gu, Jianhua Tao, Jinyang Wu, Mingkuan Feng, Ruihan Jin, Zhengqi Wen.

Figure 1
Figure 1. Figure 1: A visual tool call can help or hurt. Example 1: a crop turns a wrong answer right (useful); Example 2: a crop flips a would-be-correct answer to wrong (misleading). trained with reinforcement learning from verifiable rewards (RLVR), the standard recipe for eliciting reasoning in LLMs and VLMs (Guo et al. 2025; Team et al. 2025, 2026; Wu et al. 2026d). However, code operations do not always help (Ma et al. … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of TACO. (a) The accuracy channel A1 scores the final answer; the process channel A2 is the before/after probe difference (the tool-call value), gated into the loss only when ∆≥0. (b) OGAR sends A1 to a segment only when it is responsible for the answer (solid box), gating it out otherwise (dashed box). The unit of credit is thus an action with a real observation, not a token span. 3 Method TACO a… view at source ↗
Figure 3
Figure 3. Figure 3: Training dynamics. (a) Accuracy reward: TACO stays highest, while the additive-probe variant rises fastest early (probe hacking pays off) but plateaus and ends below standard GRPO. (b) Completion length: TACO steadily shortens com￾pletions, ending well below the others, so fewer tool/sandbox rounds mirror its latency advantage ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Reading an occluded sponsor wordmark: the agent crops the backdrop banner and zooms [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: High-resolution perception: the agent crops and zooms a circuit board to read the three small labels printed above a [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Scientific-figure reading: the agent isolates one panel of a six-method precision plot to extract the requested value. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Image rotation: the photo was uploaded sideways, so the agent rotates it before reading the route number on the bus’s [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Chart-grounded math: the agent crops the relevant chart region and computes the requested fraction. [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
read the original abstract

Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, code operations can be useful, redundant, or misleading. Outcome-only rewards cannot precisely distinguish these cases, and existing process rewards either fail to attribute final correctness to individual tool calls, or require an external judge model. To address this, we introduce Tool-Augmented Credit Optimization (TACO), a GRPO variant for code-tool agents built on two coupled advantage channels. The first, Differential Answer-Probe Reward (DAPR), is a self-supervised, judge-free tool-contribution advantage that credits each tool call by its own effect on answering correctly. Probe tokens inserted into the model's reasoning elicit its predictions with and without the tool, and the difference in outcome reward is taken as the call's value: positive for a useful call, negative for a misleading one, and zero for one that changes nothing. This reuses the existing answer checker with no auxiliary judge, and, being a difference rather than an absolute probe score, is naturally robust to probe-hacking. The second is the outcome advantage from the final answer, distributed by Outcome-Gated Advantage Routing (OGAR): a parameter-free rule that, conditioned on the call's outcome, delivers this credit only to the responsible segments, suppressing wasted tool calls without any cost term. We train TACO through a two-stage SFT+RL pipeline. Extensive experiments across perception, reasoning, and general multimodal benchmarks show that it yields consistent accuracy gains and learns to invoke its tools only when they help.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TACO, a GRPO variant for training agentic multimodal models that use code tools on images. It defines two coupled advantage channels: DAPR, a self-supervised tool-contribution signal computed as the difference in outcome reward between probe-elicited predictions with and without each tool call, and OGAR, a parameter-free rule that routes final-answer outcome advantage only to responsible reasoning segments. A two-stage SFT+RL pipeline is used, with claims of consistent accuracy gains and selective tool invocation across perception, reasoning, and multimodal benchmarks.

Significance. If the central claims hold, TACO would offer a judge-free and parameter-free mechanism for precise credit assignment to individual tool calls, addressing a key limitation in outcome-only or external-judge process rewards for tool-augmented agents. The reuse of the existing answer checker and the differencing approach for robustness are notable strengths that could improve efficiency in agentic multimodal systems.

major comments (2)
  1. [DAPR component] DAPR description: the claim that probe insertion yields a clean marginal contribution via differencing assumes that the probe tokens do not independently alter attention patterns, token probabilities, or subsequent reasoning dynamics even in the counterfactual (no-tool) case. If insertion itself changes the trajectory, the reward difference no longer isolates the tool call's effect; this assumption is load-bearing for the tool-value estimation and requires explicit justification or controls.
  2. [OGAR component] OGAR description: the exact conditioning rule and segment-identification logic for routing outcome advantage to 'responsible segments' are unspecified, making it impossible to verify whether the method avoids crediting unrelated reasoning or introduces hidden biases while remaining parameter-free. This is central to the claim of suppressing wasted tool calls.
minor comments (1)
  1. The abstract would benefit from a brief statement of the base GRPO objective and how the two advantage channels are combined in the policy gradient.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [DAPR component] DAPR description: the claim that probe insertion yields a clean marginal contribution via differencing assumes that the probe tokens do not independently alter attention patterns, token probabilities, or subsequent reasoning dynamics even in the counterfactual (no-tool) case. If insertion itself changes the trajectory, the reward difference no longer isolates the tool call's effect; this assumption is load-bearing for the tool-value estimation and requires explicit justification or controls.

    Authors: We agree this assumption requires explicit support. The manuscript emphasizes robustness via differencing rather than absolute scores, but does not provide dedicated controls for probe-induced trajectory shifts. In revision we will add an analysis subsection with empirical checks (e.g., trajectory divergence metrics and isolated-probe ablations) to quantify and bound any such effects, thereby strengthening the justification for the marginal-contribution claim. revision: partial

  2. Referee: [OGAR component] OGAR description: the exact conditioning rule and segment-identification logic for routing outcome advantage to 'responsible segments' are unspecified, making it impossible to verify whether the method avoids crediting unrelated reasoning or introduces hidden biases while remaining parameter-free. This is central to the claim of suppressing wasted tool calls.

    Authors: The referee is correct that the current manuscript leaves the precise conditioning rule and segment-identification procedure underspecified. We will revise the methods section to include the full algorithmic description: outcome advantage is routed exclusively to the contiguous reasoning segment immediately preceding a tool call when the final answer is correct and the tool call is the last one before the answer; otherwise the advantage is zeroed for that call. This rule is parameter-free and will be presented with pseudocode to allow verification that unrelated segments receive no credit. revision: yes

Circularity Check

0 steps flagged

No circularity in TACO derivation chain

full rationale

The paper introduces TACO as a GRPO variant whose two core components (DAPR and OGAR) are explicitly defined as new constructions: DAPR computes tool value via explicit differencing of outcome rewards on probe-inserted trajectories, and OGAR applies a parameter-free routing rule conditioned on call outcome. Neither component is shown to reduce by construction to its own inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The abstract and description treat the answer checker as an external reuse and present robustness via differencing as a direct consequence of the definition, with empirical validation on benchmarks rather than tautological equivalence. No self-citation chains or uniqueness theorems imported from prior author work appear in the load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the effectiveness of two newly defined reward mechanisms whose independent validation is limited to the abstract's summary of experiments.

axioms (1)
  • domain assumption GRPO provides a suitable base algorithm for tool-augmented credit assignment
    TACO is explicitly described as a GRPO variant.
invented entities (2)
  • DAPR (Differential Answer-Probe Reward) no independent evidence
    purpose: Self-supervised, judge-free tool-contribution advantage via probe-token differences
    New mechanism introduced in the paper.
  • OGAR (Outcome-Gated Advantage Routing) no independent evidence
    purpose: Parameter-free routing of outcome advantage to responsible segments only
    New rule introduced in the paper.

pith-pipeline@v0.9.1-grok · 5843 in / 1185 out tokens · 68830 ms · 2026-06-30T03:50:22.600602+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

95 extracted references · 16 linked inside Pith

  1. [1]

    Xiangwei Wang and Wei Wang and Ken Chen and Nanduni Nimalsiri and Saman Halgamuge , journal =

  2. [2]

    Hasegawa-Johnson and Chang D

    Eunseop Yoon and Hee Suk Yoon and Jaehyun Jang and SooHwan Eom and Qi Dai and Chong Luo and Mark A. Hasegawa-Johnson and Chang D. Yoo , journal =

  3. [4]

    Fei Wu and Zhenrong Zhang and Qikai Chang and Jianshu Zhang and Quan Liu and Jun Du , journal =

  4. [5]

    Senkang Hu and Yong Dai and Xudong Han and Zhengru Fang and Yuzhi Zhao and Sam Tak Wu Kwong and Yuguang Fang , journal =

  5. [6]

    Zhicong Lu and Zichuan Lin and Wei Jia and Changyuan Tian and Deheng Ye and Peiguang Li and Li Jin and Nayu Liu and Guangluan Xu and Wei Feng , journal =

  6. [7]

    Keon Kim and Krish Chelikavada , journal =

  7. [8]

    2026 , note =

    Yan Ma and Weiyu Zhang and Tianle Li and Linge Du and Xuyang Shen and Pengfei Liu , journal =. 2026 , note =

  8. [9]

    The Fourteenth International Conference on Learning Representations , year=

    Thyme: Think Beyond Images , author=. The Fourteenth International Conference on Learning Representations , year=

  9. [10]

    Shilin Yan and Jintao Tong and Hongwei Xue and Xiaojun Tang and Yangyang Wang and Kunyu Shi and Guannan Zhang and Ruixuan Li and Yixiong Zou , journal =

  10. [11]

    The Fourteenth International Conference on Learning Representations , year=

    DeepEyes: Incentivizing ''Thinking with Images'' via Reinforcement Learning , author=. The Fourteenth International Conference on Learning Representations , year=

  11. [12]

    The Fourteenth International Conference on Learning Representations , year=

    DeepEyesV2: Toward Agentic Multimodal Model , author=. The Fourteenth International Conference on Learning Representations , year=

  12. [13]

    2025 , eprint=

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning , author=. 2025 , eprint=

  13. [14]

    The Fourteenth International Conference on Learning Representations , year=

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search , author=. The Fourteenth International Conference on Learning Representations , year=

  14. [15]

    2026 , eprint=

    PyVision-RL: Forging Open Agentic Vision Models via RL , author=. 2026 , eprint=

  15. [16]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  16. [17]

    International Conference on Machine Learning , year=

    Reliable Thinking with Images , author=. International Conference on Machine Learning , year=

  17. [18]

    Jiaqi Liu and Kaiwen Xiong and Peng Xia and Yiyang Zhou and Haonian Ji and Lu Feng and Siwei Han and Mingyu Ding and Huaxiu Yao , journal =

  18. [19]

    Jinyang Wu and Chonghua Liao and Mingkuan Feng and Shuai Zhang and Zhengqi Wen and Haoran Luo and Ling Yang and Huazhe Xu and Jianhua Tao , journal =

  19. [20]

    2025 , eprint=

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training , author=. 2025 , eprint=

  20. [24]

    2025 , howpublished =

  21. [25]

    2025 , eprint=

    Qwen2.5-VL Technical Report , author=. 2025 , eprint=

  22. [26]

    International Conference on Machine Learning (ICML) , pages =

    Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , author =. International Conference on Machine Learning (ICML) , pages =

  23. [27]

    Quan Wei and Siliang Zeng and Chenliang Li and William Brown and Oana Frunza and Wei Deng and Anderson Schneider and Yuriy Nevmyvaka and Yang Katie Zhao and Alfredo Garcia and Mingyi Hong , journal =

  24. [28]

    2026 , eprint=

    DeepAgent: A General Reasoning Agent with Scalable Toolsets , author=. 2026 , eprint=

  25. [29]

    2025 , eprint=

    SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement , author=. 2025 , eprint=

  26. [30]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  27. [31]

    International Conference on Learning Representations , volume=

    Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? , author=. International Conference on Learning Representations , volume=

  28. [32]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    V?: Guided visual search as a core mechanism in multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  29. [33]

    Advances in Neural Information Processing Systems , volume=

    Measuring multimodal mathematical reasoning with math-vision dataset , author=. Advances in Neural Information Processing Systems , volume=

  30. [34]

    International Conference on Learning Representations , volume=

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts , author=. International Conference on Learning Representations , volume=

  31. [35]

    European Conference on Computer Vision , pages=

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  32. [36]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    We-math: Does your large multimodal model achieve human-like mathematical reasoning? , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  33. [38]

    Advances in Neural Information Processing Systems , volume=

    Are we on the right way for evaluating large vision-language models? , author=. Advances in Neural Information Processing Systems , volume=

  34. [39]

    Findings of the association for computational linguistics: ACL 2022 , pages=

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning , author=. Findings of the association for computational linguistics: ACL 2022 , pages=

  35. [40]

    European Conference on Computer Vision , pages=

    Blink: Multimodal large language models can see but not perceive , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  36. [44]

    Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Xiao Bi and Haowei Zhang and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo , journal =

  37. [45]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Codev: Code with images for faithful visual reasoning via tool-aware policy optimization , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  38. [46]

    Proceedings of the 32nd ACM International Conference on Multimedia , pages=

    VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models , author=. Proceedings of the 32nd ACM International Conference on Multimedia , pages=

  39. [48]

    Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Two-stage regularization-based structured pruning for llms , author=. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  40. [49]

    Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Beyond Examples: Towards Automated Thought-level In-Context Reasoning for Large Language Models , author=. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  41. [51]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Astar: Boosting multimodal reasoning with automated structured thinking , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  42. [52]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Pandora's box or aladdin's lamp: A comprehensive analysis revealing the role of rag noise in large language models , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  43. [55]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Better, stronger, faster: Tackling the trilemma in mllm-based segmentation with simultaneous textual mask prediction , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  44. [56]

    Bai, S.; Cai, Y.; Chen, R.; Chen, K.; Chen, X.; Cheng, Z.; Deng, L.; Ding, W.; Gao, C.; Ge, C.; et al. 2025 a . Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  45. [57]

    Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; Zhong, H.; Zhu, Y.; Yang, M.; Li, Z.; Wan, J.; Wang, P.; Ding, W.; Fu, Z.; Xu, Y.; Ye, J.; Zhang, X.; Xie, T.; Cheng, Z.; Zhang, H.; Yang, Z.; Xu, H.; and Lin, J. 2025 b . Qwen2.5-VL Technical Report. arXiv:2502.13923

  46. [58]

    Chen, L.; Li, J.; Dong, X.; Zhang, P.; Zang, Y.; Chen, Z.; Duan, H.; Wang, J.; Qiao, Y.; Lin, D.; et al. 2024. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37: 27056--27087

  47. [59]

    Duan, H.; Yang, J.; Qiao, Y.; Fang, X.; Chen, L.; Liu, Y.; Dong, X.; Zang, Y.; Zhang, P.; Wang, J.; Lin, D.; and Chen, K. 2024. VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models. In Proceedings of the 32nd ACM International Conference on Multimedia, 11198--11201

  48. [60]

    Feng, M.; Wu, J.; Liu, S.; Zhang, S.; Fang, H.; Jin, R.; Che, F.; Shao, P.; Wen, Z.; and Tao, J. 2026. Two-stage regularization-based structured pruning for llms. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2996--3012

  49. [61]

    Feng, M.; Wu, J.; Zhang, S.; Shao, P.; Jin, R.; Wen, Z.; Tao, J.; and Che, F. 2025. Dress: Data-driven regularized structured streamlining for large language models. arXiv preprint arXiv:2501.17905

  50. [62]

    A.; Ma, W.-C.; and Krishna, R

    Fu, X.; Hu, Y.; Li, B.; Feng, Y.; Wang, H.; Lin, X.; Roth, D.; Smith, N. A.; Ma, W.-C.; and Krishna, R. 2024. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, 148--166. Springer

  51. [63]

    Guo, D.; Yang, D.; Zhang, H.; Song, J.; Wang, P.; Zhu, Q.; Xu, R.; Zhang, R.; Ma, S.; Bi, X.; et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  52. [64]

    Hong, J.; Zhao, C.; Zhu, C.; Lu, W.; Xu, G.; and XingYu. 2026. DeepEyesV2: Toward Agentic Multimodal Model. In The Fourteenth International Conference on Learning Representations

  53. [65]

    C.; and Wang, B

    Hou, X.; Xu, S.; Biyani, M.; Li, M.; Liu, J.; Hollon, T. C.; and Wang, B. 2026. Codev: Code with images for faithful visual reasoning via tool-aware policy optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21500--21510

  54. [66]

    Hu, S.; Dai, Y.; Han, X.; Fang, Z.; Zhao, Y.; Kwong, S. T. W.; and Fang, Y. 2026. Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers . arXiv preprint arXiv:2605.04984

  55. [67]

    Y.; and Tao, J

    Jin, R.; Shao, P.; Wen, Z.; Wu, J.; Feng, M.; Yang, S.; Zhang, C. Y.; and Tao, J. 2026. Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs. arXiv preprint arXiv:2602.01064

  56. [68]

    Jin, R.; Shao, P.; Wen, Z.; Wu, J.; Feng, M.; Zhang, S.; and Tao, J. 2025. Radialrouter: Structured representation for efficient and robust large language models routing. arXiv preprint arXiv:2506.03880

  57. [69]

    Kim, K.; and Chelikavada, K. 2026. Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines . arXiv preprint arXiv:2604.15376

  58. [70]

    Lai, X.; Li, J.; Li, W.; Liu, T.; Li, T.; and Zhao, H. 2026. Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search. In The Fourteenth International Conference on Learning Representations

  59. [71]

    Li, B.; Zhang, Y.; Guo, D.; Zhang, R.; Li, F.; Zhang, H.; Zhang, K.; Zhang, P.; Li, Y.; Liu, Z.; et al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

  60. [72]

    Li, H.; Yang, Y.; Lin, Y.; Dai, X.; Yang, M.; and Peng, X. 2026 a . Reliable Thinking with Images. In International Conference on Machine Learning

  61. [73]

    Li, X.; Jiao, W.; Jin, J.; Dong, G.; Jin, J.; Wang, Y.; Wang, H.; Zhu, Y.; Wen, J.-R.; Lu, Y.; and Dou, Z. 2026 b . DeepAgent: A General Reasoning Agent with Scalable Toolsets. arXiv:2510.21618

  62. [74]

    Liu, J.; Feng, M.; and Chen, L. 2026. Better, stronger, faster: Tackling the trilemma in mllm-based segmentation with simultaneous textual mask prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 33121--33130

  63. [75]

    Liu, J.; Xiong, K.; Xia, P.; Zhou, Y.; Ji, H.; Feng, L.; Han, S.; Ding, M.; and Yao, H. 2025. Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning . arXiv preprint arXiv:2511.19900

  64. [76]

    Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.-W.; Galley, M.; and Gao, J. 2024. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations, volume 2024, 23439--23554

  65. [77]

    Lu, Z.; Lin, Z.; Jia, W.; Tian, C.; Ye, D.; Li, P.; Jin, L.; Liu, N.; Xu, G.; and Feng, W. 2026. HISR: Hindsight Information Modulated Segmental Process Rewards for Multi-turn Agentic Reinforcement Learning . arXiv preprint arXiv:2603.18683

  66. [78]

    Ma, Y.; Zhang, W.; Li, T.; Du, L.; Shen, X.; and Liu, P. 2026. What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom . arXiv preprint arXiv:2602.01334. ICML 2026

  67. [79]

    L.; Tan, J

    Masry, A.; Do, X. L.; Tan, J. Q.; Joty, S.; and Hoque, E. 2022. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, 2263--2279

  68. [80]

    Y.; Harada, D.; and Russell, S

    Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In International Conference on Machine Learning (ICML), 278--287

  69. [81]

    OpenAI . 2025. Thinking with Images . https://openai.com/index/thinking-with-images/

  70. [82]

    Qi, Y.; Fu, P.; Li, H.; Liu, Y.; Jiang, C.; Qin, B.; Luo, Z.; and Luan, J. 2026. Patchcue: Enhancing vision-language model reasoning with patch-based visual cues. arXiv preprint arXiv:2603.05869

  71. [83]

    Qiao, R.; Tan, Q.; Dong, G.; MinhuiWu, M.; Sun, C.; Song, X.; Wang, J.; Gongque, Z.; Lei, S.; Zhang, Y.; et al. 2025. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 20023--20070

  72. [84]

    K.; Wu, Y.; and Guo, D

    Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y. K.; Wu, Y.; and Guo, D. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models . arXiv preprint arXiv:2402.03300

  73. [85]

    Team, K.; Bai, T.; Bai, Y.; Bao, Y.; Cai, S.; Cao, Y.; Charles, Y.; Che, H.; Chen, C.; Chen, G.; et al. 2026. Kimi K2. 5: Visual Agentic Intelligence. arXiv preprint arXiv:2602.02276

  74. [86]

    Team, K.; Du, A.; Gao, B.; Xing, B.; Jiang, C.; Chen, C.; Li, C.; Xiao, C.; Du, C.; Liao, C.; et al. 2025. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599

  75. [87]

    Wang, H.; Su, A.; Ren, W.; Lin, F.; and Chen, W. 2025 a . Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning. arXiv:2505.15966

  76. [88]

    Wang, K.; Pan, J.; Shi, W.; Lu, Z.; Ren, H.; Zhou, A.; Zhan, M.; and Li, H. 2024. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37: 95095--95169

  77. [89]

    Wang, K.; Pan, J.; Wei, L.; Zhou, A.; Shi, W.; Lu, Z.; Xiao, H.; Yang, Y.; Ren, H.; Zhan, M.; et al. 2025 b . Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, 2505--2534

  78. [90]

    Wang, W.; Ding, L.; Zeng, M.; Zhou, X.; Shen, L.; Luo, Y.; Yu, W.; and Tao, D. 2025 c . Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 7907--7915

  79. [91]

    Wang, X.; Wang, W.; Chen, K.; Nimalsiri, N.; and Halgamuge, S. 2026. Discovering Process-Outcome Credit in Multi-Step LLM Reasoning . arXiv preprint arXiv:2602.01034

  80. [92]

    K.; Garcia, A.; and Hong, M

    Wei, Q.; Zeng, S.; Li, C.; Brown, W.; Frunza, O.; Deng, W.; Schneider, A.; Nevmyvaka, Y.; Zhao, Y. K.; Garcia, A.; and Hong, M. 2025. Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design . arXiv preprint arXiv:2505.11821

Showing first 80 references.