pith. machine review for the scientific record.

arxiv: 2512.03438 · v2 · submitted 2025-12-03 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 03:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal reinforcement learning · adaptive verifier · agentic reasoning · spatial reasoning · embodied AI · reward hacking · visual hallucination

The pith

An adaptive verifier selects scoring functions during multimodal RL to achieve state-of-the-art results on agentic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sparse outcome-based rewards alone are insufficient for training effective multimodal reasoning agents because they provide too little guidance on complex tasks. Argos addresses this by acting as a reward agent that, for each sample, chooses from a pool of teacher-derived and rule-based scoring functions to assess final accuracy, spatiotemporal localization of entities, and the quality of the reasoning process. Applying this verifier both to curate SFT data and throughout RL training produces stronger performance on spatial reasoning, visual hallucination, and embodied robotics benchmarks. The work further shows that models trained only with SFT data still collapse into ungrounded answers once RL begins unless the online verification continues, and that the same mechanism reduces reward hacking.

Core claim

Argos is introduced as a principled reward agent that, for each sample, selects scoring functions to evaluate final response accuracy, spatiotemporal localization of referred entities and actions, and reasoning process quality. Leveraging Argos across both SFT data curation and RL training yields state-of-the-art results on spatial reasoning, visual hallucination, robotics, and embodied AI benchmarks. Sole reliance on SFT post-training proves insufficient because agents collapse to ungrounded solutions during RL without the online verification; the verifier also mitigates reward hacking, with effectiveness justified through Pareto-optimality.

What carries the argument

Argos, the agentic verifier that adaptively selects from a pool of teacher-model-derived and rule-based scoring functions to supply fine-grained rewards on accuracy, localization, and reasoning quality.
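
To make the mechanism concrete, here is a minimal sketch of a per-sample adaptive verifier of this shape. Every name in it (ParsedRollout, score_answer, the gating weights) is an illustrative assumption inferred from the paper's description and the Figure 2 caption, not the authors' code.

    # Minimal sketch of a per-sample adaptive verifier (illustrative, not the paper's code).
    # Assumed: each rollout is parsed into answer text, generated 2D points, temporal
    # segments, and a reasoning trace; scoring functions return floats in [0, 1].
    from dataclasses import dataclass, field

    @dataclass
    class ParsedRollout:
        answer: str
        points: list = field(default_factory=list)    # generated 2D points
        segments: list = field(default_factory=list)  # temporal segments (videos)
        reasoning: str = ""

    def score_answer(r: ParsedRollout, sample) -> float:
        # Rule-based: normalized match against the ground-truth final answer.
        return float(r.answer.strip().lower() == sample["answer"].strip().lower())

    def score_points(r: ParsedRollout, sample) -> float:
        # Placeholder for a teacher-derived localization score over the 2D points.
        return 1.0 if r.points else 0.0

    def score_reasoning(r: ParsedRollout, sample) -> float:
        # Placeholder for a teacher-model judgment of reasoning-trace quality.
        return 0.5

    def adaptive_reward(r: ParsedRollout, sample) -> float:
        # The verifier selects scorers based on what the parsed rollout contains,
        # then gates auxiliary scores on answer correctness so that localization
        # and reasoning quality cannot compensate for a wrong final answer.
        acc = score_answer(r, sample)
        selected = [score_points] if r.points else []
        if r.reasoning:
            selected.append(score_reasoning)
        aux = sum(f(r, sample) for f in selected) / max(len(selected), 1)
        return acc * (0.5 + 0.5 * aux)  # gated aggregation: aux only scales a correct answer

The gated product is one reading of "gated aggregation function" from the Figure 2 caption; the paper may weight or gate the components differently.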

If this is right

  • Models maintain grounded reasoning through the full RL phase rather than reverting to ungrounded outputs after SFT.
  • State-of-the-art performance is reached on spatial reasoning, visual hallucination, and embodied robotics benchmarks.
  • Reward hacking decreases when the adaptive verifier operates online during multimodal RL.
  • The method rests on a Pareto-optimality argument that explains why selected scoring functions outperform single outcome signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Continuous verification during RL appears necessary to preserve grounding once models leave the SFT stage.
  • Adaptive selection of reward signals may offer a general route to richer training in other domains that currently rely on sparse final-answer feedback.

Load-bearing premise

A pool of teacher-model and rule-based scoring functions exists that Argos can select from to deliver consistently more informative rewards than outcome signals alone without introducing offsetting biases or artifacts.

What would settle it

Train an otherwise identical model using only final-answer outcome rewards during the RL stage and check whether it matches the reported benchmark scores or still exhibits collapse to ungrounded solutions on the spatial-reasoning and embodied-AI tasks.
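
A minimal sketch of that control, assuming a generic RL harness in which only the reward function varies. train_rl and evaluate are hypothetical stand-ins for the training loop and benchmark harness, not a real API.

    # Sketch of the proposed control: identical RL runs that differ only in the reward.
    def outcome_only_reward(rollout, sample) -> float:
        # Sparse final-answer signal, the baseline the paper argues against.
        return float(rollout.answer.strip().lower() == sample["answer"].strip().lower())

    def run_control(base_model, rl_data, benchmarks, train_rl, evaluate, verifier_reward):
        runs = {
            "outcome_only": outcome_only_reward,
            "argos_style": verifier_reward,  # adaptive verifier, as sketched above
        }
        results = {}
        for name, reward_fn in runs.items():
            # Same data, same hyperparameters; only reward_fn changes.
            model = train_rl(base_model, rl_data, reward_fn)
            results[name] = {b: evaluate(model, b) for b in benchmarks}
        return results  # compare scores and inspect rollouts for ungrounded answers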

Figures

Figures reproduced from arXiv: 2512.03438 by Andrea Tupini, Baolin Peng, Hao Cheng, Isar Meijier, Jianfeng Gao, Lars Liden, Lijuan Wang, Marc Pollefeys, Oier Mees, Qianhui Wu, Reuben Tan, Sheng Zhang, Theodore Zhao, XiaoDong Liu, Yong Jae Lee, Yu Gu, Yuncong Yang, Zhengyuan Yang.

Figure 1: Multimodal RL with our Argos agentic verifier. We propose to train agentic foundation models using an agentic verifier, Argos, that adaptively selects different scoring tools based on the training sample during the RL stage. Then, we evaluate the resulting model on multiple agentic benchmarks including embodied task planning and completion as well as spatial reasoning. view at source ↗

Figure 2: Verification process. We use the same set of scoring functions for both images and videos. Each response is first parsed to extract information about generated 2D points, temporal segments, reasoning text and answer. Then, the agentic verifier adaptively decides what scoring functions to call based on the extracted information. Finally, we aggregate the scores using a gated aggregation function. view at source ↗

Figure 3: Grounded reasoning generation pipeline. (a) Stage I: we extract object, action and event proposals such as 2D boxes for images and video frames as well as temporal segments for videos. (b) Stage II: we use the overlaid images and video frames to prompt a pretrained LMM to generate grounded reasoning traces that explicitly refer to these points. For videos, we also include the frame numbers and their timestamps. view at source ↗

Figure 4: We run a small-scale comparison to ablate the effective… view at source ↗

Figure 5: Instruction template for extracting entities and interactions from generated image-level rollouts. view at source ↗

Figure 6: Instruction template for extracting entities and interactions from generated video-level rollouts. view at source ↗

Figure 7: Training data mixtures. We use around 85K and 4.6K SFT and RL training samples, respectively. view at source ↗

Figure 8: We use this prompt template to query a LM to extract the generated 2D points and their corresponding object or noun phrase. view at source ↗

Figure 9: We use this prompt template to query a LM to extract the generated 2D points and their corresponding object or noun phrase… view at source ↗

Figure 10: We use this template to prompt the teacher model to extract referenced objects and actions in videos. view at source ↗

Figure 11: We use this template to prompt the teacher model to assign a binary score based on the relevance of the generated frame-level… view at source ↗

Figure 12: We use this template to prompt the teacher model to assign a binary score based on the relevance of the generated segment-level… view at source ↗

Figure 13: Image example 1. view at source ↗

Figure 14: Image example 2. view at source ↗

Figure 15: Image example 3. view at source ↗

Figure 16: Image example 4. view at source ↗

Figure 17: Video example 1. view at source ↗

Figure 18: Video example 2. view at source ↗
read the original abstract

Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed based on the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes since different samples may require different scoring functions and teacher models may provide noisy reward signals too. In this paper, we introduce the Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning, visual hallucination as well as robotics and embodied AI benchmarks. Critically, we demonstrate that just relying on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward-hacking in MMRL. Finally, we also provide a theoretical justification for the effectiveness of Argos through the concept of pareto-optimality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Argos, an adaptive agentic verifier for multimodal reinforcement learning. For each sample, Argos selects from a pool of teacher-model-derived and rule-based scoring functions to evaluate final response accuracy, spatiotemporal localization of referred entities and actions, and reasoning process quality. The authors claim that applying Argos to both SFT data curation and online RL training yields state-of-the-art results on spatial reasoning, visual hallucination, robotics, and embodied AI benchmarks. They further state that SFT on curated data alone is insufficient because agents collapse to ungrounded solutions during RL, that Argos reduces reward-hacking, and that a Pareto-optimality argument justifies its effectiveness.

Significance. If the reported SOTA results hold and the adaptive selection demonstrably supplies more informative rewards than outcome signals or fixed selection without introducing new biases, the work would offer a practical advance in training reliable multimodal agentic models by moving beyond sparse rewards. The observation that online verification during RL is required to prevent collapse would be a useful empirical finding for the community.

major comments (3)
  1. Abstract: the claim of achieving state-of-the-art results across multiple agentic tasks supplies no quantitative metrics, benchmark scores, baseline comparisons, or error bars, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.
  2. Experiments section: the central claim that Argos's adaptive selection from the scoring pool produces strictly more informative and less noisy rewards than outcome-based signals (and prevents collapse) is load-bearing, yet no ablation isolating adaptive selection against fixed or random selection from the identical pool is described; without this, gains could be driven by curation/filtering effects rather than the verifier (a minimal sketch of such an ablation follows this report).
  3. Theoretical justification: the Pareto-optimality argument for Argos's effectiveness is invoked but the derivation is not shown, leaving open whether it provides independent grounding or reduces to a fitted quantity in the finite-sample RL setting.
minor comments (1)
  1. Clarify the exact size and composition of the scoring-function pool and whether Argos's selection policy is itself learned or heuristic.
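
One way to phrase the requested ablation is as three selection policies over one fixed scoring pool, with everything else held constant. A sketch under the assumption that the pool is a list of (applicability predicate, scorer) pairs; none of these names come from the paper.

    # Sketch of the referee-requested ablation: vary only the selection rule.
    import random

    def fixed_selection(pool, rollout):
        # Always apply every scorer in the pool.
        return [score for _, score in pool]

    def random_selection(pool, rollout, k=2, rng=random.Random(0)):
        # Apply k scorers chosen uniformly at random (seeded for reproducibility).
        return [score for _, score in rng.sample(pool, min(k, len(pool)))]

    def adaptive_selection(pool, rollout):
        # Stand-in for the verifier's per-sample choice: keep scorers whose
        # applicability predicate fires on the parsed rollout.
        return [score for pred, score in pool if pred(rollout)]

If the adaptive policy does not beat the fixed and random policies under identical pools and data, the gains would point to curation effects rather than the verifier itself.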

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, agreeing where revisions are warranted and providing clarifications where appropriate. We will incorporate changes to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract: the claim of achieving state-of-the-art results across multiple agentic tasks supplies no quantitative metrics, benchmark scores, baseline comparisons, or error bars, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript, we will update the abstract to report key benchmark scores, baseline comparisons, and references to error bars or statistical significance from the experiments. revision: yes

  2. Referee: Experiments section: the central claim that Argos adaptive selection from the scoring pool produces strictly more informative and less noisy rewards than outcome-based signals (and prevents collapse) is load-bearing, yet no ablation isolating adaptive selection against fixed or random selection from the identical pool is described; without this, gains could be driven by curation/filtering effects rather than the verifier.

    Authors: We acknowledge the importance of this ablation to isolate the contribution of adaptive selection. We will add experiments comparing adaptive selection against fixed and random selection from the same scoring pool in the revised experiments section to demonstrate that the observed benefits stem from the adaptive mechanism rather than curation alone. revision: yes

  3. Referee: Theoretical justification: the Pareto-optimality argument for Argos effectiveness is invoked but the derivation is not shown, leaving open whether it provides independent grounding or reduces to a fitted quantity in the finite-sample RL setting.

    Authors: We thank the referee for this observation. The Pareto-optimality argument is intended to provide independent grounding, but we agree that the derivation should be explicitly presented. In the revised manuscript, we will include the full derivation (with discussion of finite-sample considerations) in the appendix or a dedicated theoretical section. revision: yes
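
For orientation, the generic shape of a Pareto-optimality condition for multi-objective reward aggregation, written in LaTeX. This is a textbook statement of the concept, standing in for the derivation the authors promise; it is not the paper's theorem.

    % Generic Pareto-optimality condition (illustrative placeholder, not the paper's result).
    % Let r(x) = (r_1(x), \dots, r_m(x)) collect the m per-sample scores, and let \pi^* be
    % the verifier's aggregation/selection policy. \pi^* is Pareto-optimal if no policy \pi
    % improves one expected score component without degrading another:
    \[
      \nexists\, \pi \;:\;
      \mathbb{E}_{\pi}[r_i] \ge \mathbb{E}_{\pi^{*}}[r_i] \;\; \forall i
      \quad \text{and} \quad
      \mathbb{E}_{\pi}[r_j] > \mathbb{E}_{\pi^{*}}[r_j] \;\; \text{for some } j.
    \]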

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces Argos as a new adaptive verifier that selects from a pool of teacher-derived and rule-based scorers to provide richer rewards in MMRL, with claims of SOTA results, insufficiency of SFT alone, and a theoretical justification via Pareto-optimality. No equations, derivations, or steps are exhibited that reduce by construction to fitted inputs, self-definitions, or self-citation chains; the adaptive selection mechanism and Pareto argument are presented as independent contributions rather than renamings or forced equivalences. Empirical demonstrations and the verifier's design provide external grounding separate from the target outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the existence of a sufficiently diverse and reliable pool of scoring functions plus the ability of Argos to select among them without introducing systematic bias; these are treated as given rather than derived.

axioms (2)
  • domain assumption A pool of teacher-model-derived and rule-based scoring functions can be defined that jointly cover accuracy, spatiotemporal localization, and reasoning quality for arbitrary agentic samples.
    Invoked when describing how Argos selects scoring functions for each sample.
  • domain assumption Teacher models provide noisy but still useful reward signals that can be improved by adaptive selection.
    Stated as motivation for moving beyond fixed outcome rewards.
invented entities (1)
  • Argos (Agentic Reward for Grounded & Objective Scoring) · no independent evidence
    purpose: Adaptive selection of scoring functions during SFT curation and RL training.
    New system introduced to solve the problem of sample-specific reward computation.

pith-pipeline@v0.9.0 · 5637 in / 1486 out tokens · 74488 ms · 2026-05-17T03:05:01.265138+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 1 Pith paper · 21 internal anchors

  1. [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

  2. [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3] Chang Chen, Yi-Fu Wu, Jaesik Yoon, and Sungjin Ahn. TransDreamer: Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481, 2022.

  4. [4] William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. In Conference on Robot Learning, 2025.

  5. [5] William Chen, Oier Mees, Aviral Kumar, and Sergey Levine. Vision-language models provide promptable representations for reinforcement learning. Transactions on Machine Learning Research, 2025.

  6. [6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.

  7. [7] Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Huang...

  8. [8] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints, pages arXiv–2409, 2024.

  9. [9] Sixun Dong. Tool-LMM: A large multi-modal model for tool agent learning. 2024.

  10. [10] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. PaLM-E: An embodied multimodal language model. 2023.

  11. [11] Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. Agent AI: Surveying the horizons of multimodal interaction. CoRR, 2024.

  12. [12] Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. GRIT: Teaching MLLMs to think with images. arXiv preprint arXiv:2505.15879, 2025.

  13. [13] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  14. [14] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.

  15. [15] Google Gemini Team. Gemini: A family of highly capable multimodal models. 2023.

  16. [16] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

  17. [17] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  18. [18] Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. RegionGPT: Towards region understanding vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13796–13806, 2024.

  19. [19] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

  20. [20] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.

  21. [21] Kyle Beltran Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, and Benjamin Burchfiel. GHIL-GLUE: Hierarchical control with filtered subgoal images. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025.

  22. [22] Tao He, Hao Li, Jingchang Chen, Runxuan Liu, Yixin Cao, Lizi Liao, Zihao Zheng, Zheng Chu, Jiafeng Liang, Ming Liu, et al. Breaking the reasoning barrier: a survey on LLM complex reasoning through the lens of self-evolution. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7377–7417, 2025.

  23. [23] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, pages arXiv–2507, 2025.

  24. [24] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in Neural Information Processing Systems, 36:31096–31116, 2023.

  25. [25] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.

  26. [26] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Multimodal spatial language maps for robot navigation and manipulation. International Journal of Robotics Research (IJRR), 2025.

  27. [27] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.

  28. [28] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint, 2021.

  29. [29] Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Stachowicz, Pieter Abbeel, and Sergey Levine. Beyond sight: Fine-tuning generalist robot policies with heterogeneous sensors via language grounding. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025.

  30. [30] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025.

  31. [31] Somnath Kumar, Yash Gadhia, Tanuja Ganu, and Akshay Nambi. MMCTAgent: Multi-modal critical thinking agent framework for complex visual reasoning. arXiv preprint arXiv:2405.18358, 2024.

  32. [32] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.

  33. [33] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

  34. [34] Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, et al. Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921, 2025.

  35. [35] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

  36. [36] Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023.

  37. [37] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  38. [38] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.

  39. [39] Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents. arXiv preprint arXiv:2504.10458, 2025.

  40. [40] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022.

  41. [41] Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.

  42. [42] Vivek Myers, Bill Chunyuan Zheng, Oier Mees, Sergey Levine, and Kuan Fang. Policy adaptation via language optimization: Decomposing tasks for few-shot imitation. In Conference on Robot Learning, 2024.

  43. [43] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. Conference on Robot Learning (CoRL), 2024.

  44. [44] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  45. [45] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

  46. [46] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. In Proceedings of Robotics: Science and Systems, Los Angeles, USA, 2025.

  47. [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  48. [48] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579, 2024.

  49. [49] Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task agnostic offline reinforcement learning. In Proceedings of the 6th Conference on Robot Learning (CoRL), Auckland, New Zealand, 2022.

  50. [50] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

  51. [51] Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. RoboVQA: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024.

  52. [52] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  53. [53] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.

  54. [54] Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A Plummer, Bryan Russell, and Kate Saenko. Koala: Key frame-conditioned long video-LLM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13581–13591, 2024.

  55. [55] Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025.

  56. [56] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri Iyer, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems, 37:87310–87356, 2024.

  57. [57] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

  58. [58] Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. GenArtist: Multimodal LLM as an agent for unified image generation and editing. Advances in Neural Information Processing Systems, 37:128374–128395, 2024.

  59. [59] Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, et al. BLIP-3: A family of open large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6124–6135.

  60. [60] Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal AI agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025.

  61. [61] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560, 2025.

  62. [62] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In Forty-second International Conference on Machine Learning, 2025.

  63. [63] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

  64. [64] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

  65. [65] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV'25, 2025.

  66. [66] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI. arXiv preprint arXiv:2404.16006, 2024.

  67. [67] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

  68. [68] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Conference on Robot Learning, 2024.

  69. [69] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.

  70. [70] Jianrui Zhang, Mu Cai, Tengyang Xie, and Yong Jae Lee. CounterCurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples. arXiv preprint arXiv:2402.13254, 2024.

  71. [71] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.

  72. [72] Theodore Zhao, Mu Wei, J Preston, and Hoifung Poon. Pareto optimal learning for estimating large language model errors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10513–10529, 2024.

  73. [73] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024.

  74. [74] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

  75. [75] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

  76. [76] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

Showing first 80 references.