pith. machine review for the scientific record.

arxiv: 2512.03438 · v2 · submitted 2025-12-03 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 03:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal reinforcement learning · adaptive verifier · agentic reasoning · spatial reasoning · embodied AI · reward hacking · visual hallucination

The pith

An adaptive verifier selects scoring functions during multimodal RL to achieve state-of-the-art results on agentic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sparse outcome-based rewards alone are insufficient for training effective multimodal reasoning agents because they provide too little guidance on complex tasks. Argos addresses this by acting as a reward agent that, for each sample, chooses from a pool of teacher-derived and rule-based scoring functions to assess final accuracy, spatiotemporal localization of entities, and the quality of the reasoning process. Applying this verifier both to curate SFT data and throughout RL training produces stronger performance on spatial reasoning, visual hallucination, and embodied robotics benchmarks. The work further shows that models trained only with SFT data still collapse into ungrounded answers once RL begins unless the online verification continues, and that the same mechanism reduces reward hacking.

Core claim

Argos is introduced as a principled reward agent that, for each sample, selects scoring functions to evaluate final response accuracy, spatiotemporal localization of referred entities and actions, and reasoning process quality. Leveraging Argos across both SFT data curation and RL training yields state-of-the-art results on spatial reasoning, visual hallucination, robotics, and embodied AI benchmarks. Sole reliance on SFT post-training proves insufficient because agents collapse to ungrounded solutions during RL without the online verification; the verifier also mitigates reward hacking, with effectiveness justified through Pareto-optimality.

What carries the argument

Argos, the agentic verifier that adaptively selects from a pool of teacher-model-derived and rule-based scoring functions to supply fine-grained rewards on accuracy, localization, and reasoning quality.
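
To make the mechanism concrete, here is a minimal sketch of a per-sample adaptive verifier of this shape. Every name in it (ParsedRollout, score_answer, the gating weights) is an illustrative assumption inferred from the paper's description and the Figure 2 caption, not the authors' code.

    # Minimal sketch of a per-sample adaptive verifier (illustrative, not the paper's code).
    # Assumed: each rollout is parsed into answer text, generated 2D points, temporal
    # segments, and a reasoning trace; scoring functions return floats in [0, 1].
    from dataclasses import dataclass, field

    @dataclass
    class ParsedRollout:
        answer: str
        points: list = field(default_factory=list)    # generated 2D points
        segments: list = field(default_factory=list)  # temporal segments (videos)
        reasoning: str = ""

    def score_answer(r: ParsedRollout, sample) -> float:
        # Rule-based: normalized match against the ground-truth final answer.
        return float(r.answer.strip().lower() == sample["answer"].strip().lower())

    def score_points(r: ParsedRollout, sample) -> float:
        # Placeholder for a teacher-derived localization score over the 2D points.
        return 1.0 if r.points else 0.0

    def score_reasoning(r: ParsedRollout, sample) -> float:
        # Placeholder for a teacher-model judgment of reasoning-trace quality.
        return 0.5

    def adaptive_reward(r: ParsedRollout, sample) -> float:
        # The verifier selects scorers based on what the parsed rollout contains,
        # then gates auxiliary scores on answer correctness so that localization
        # and reasoning quality cannot compensate for a wrong final answer.
        acc = score_answer(r, sample)
        selected = [score_points] if r.points else []
        if r.reasoning:
            selected.append(score_reasoning)
        aux = sum(f(r, sample) for f in selected) / max(len(selected), 1)
        return acc * (0.5 + 0.5 * aux)  # gated aggregation: aux only scales a correct answer

The gated product is one reading of "gated aggregation function" from the Figure 2 caption; the paper may weight or gate the components differently.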

If this is right

  • Models maintain grounded reasoning through the full RL phase rather than reverting to ungrounded outputs after SFT.
  • State-of-the-art performance is reached on spatial reasoning, visual hallucination, and embodied robotics benchmarks.
  • Reward hacking decreases when the adaptive verifier operates online during multimodal RL.
  • The method rests on a Pareto-optimality argument that explains why selected scoring functions outperform single outcome signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Continuous verification during RL appears necessary to preserve grounding once models leave the SFT stage.
  • Adaptive selection of reward signals may offer a general route to richer training in other domains that currently rely on sparse final-answer feedback.

Load-bearing premise

A pool of teacher-model and rule-based scoring functions exists that Argos can select from to deliver consistently more informative rewards than outcome signals alone without introducing offsetting biases or artifacts.

What would settle it

Train an otherwise identical model using only final-answer outcome rewards during the RL stage and check whether it matches the reported benchmark scores or still exhibits collapse to ungrounded solutions on the spatial-reasoning and embodied-AI tasks.
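
A minimal sketch of that control, assuming a generic RL harness in which only the reward function varies. train_rl and evaluate are hypothetical stand-ins for the training loop and benchmark harness, not a real API.

    # Sketch of the proposed control: identical RL runs that differ only in the reward.
    def outcome_only_reward(rollout, sample) -> float:
        # Sparse final-answer signal, the baseline the paper argues against.
        return float(rollout.answer.strip().lower() == sample["answer"].strip().lower())

    def run_control(base_model, rl_data, benchmarks, train_rl, evaluate, verifier_reward):
        runs = {
            "outcome_only": outcome_only_reward,
            "argos_style": verifier_reward,  # adaptive verifier, as sketched above
        }
        results = {}
        for name, reward_fn in runs.items():
            # Same data, same hyperparameters; only reward_fn changes.
            model = train_rl(base_model, rl_data, reward_fn)
            results[name] = {b: evaluate(model, b) for b in benchmarks}
        return results  # compare scores and inspect rollouts for ungrounded answers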

Figures

Figures reproduced from arXiv: 2512.03438 by Andrea Tupini, Baolin Peng, Hao Cheng, Isar Meijier, Jianfeng Gao, Lars Liden, Lijuan Wang, Marc Pollefeys, Oier Mees, Qianhui Wu, Reuben Tan, Sheng Zhang, Theodore Zhao, XiaoDong Liu, Yong Jae Lee, Yu Gu, Yuncong Yang, Zhengyuan Yang.

Figure 1: Multimodal RL with our Argos agentic verifier. We propose to train agentic foundation models using an agentic verifier, Argos, that adaptively selects different scoring tools based on the training sample during the RL stage. Then, we evaluate the resulting model on multiple agentic benchmarks including embodied task planning and completion as well as spatial reasoning. view at source ↗

Figure 2: Verification process. We use the same set of scoring functions for both images and videos. Each response is first parsed to extract information about generated 2D points, temporal segments, reasoning text and answer. Then, the agentic verifier adaptively decides what scoring functions to call based on the extracted information. Finally, we aggregate the scores using a gated aggregation function. view at source ↗

Figure 3: Grounded reasoning generation pipeline. (a) Stage I: we extract object, action and event proposals such as 2D boxes for images and video frames as well as temporal segments for videos. (b) Stage II: we use the overlaid images and video frames to prompt a pretrained LMM to generate grounded reasoning traces that explicitly refer to these points. For videos, we also include the frame numbers and their timestamps. view at source ↗

Figure 4: We run a small-scale comparison to ablate the effective… view at source ↗

Figure 5: Instruction template for extracting entities and interactions from generated image-level rollouts. view at source ↗

Figure 6: Instruction template for extracting entities and interactions from generated video-level rollouts. view at source ↗

Figure 7: Training data mixtures. We use around 85K and 4.6K SFT and RL training samples, respectively. view at source ↗

Figure 8: We use this prompt template to query a LM to extract the generated 2D points and their corresponding object or noun phrase. view at source ↗

Figure 9: We use this prompt template to query a LM to extract the generated 2D points and their corresponding object or noun phrase… view at source ↗

Figure 10: We use this template to prompt the teacher model to extract referenced objects and actions in videos. view at source ↗

Figure 11: We use this template to prompt the teacher model to assign a binary score based on the relevance of the generated frame-level… view at source ↗

Figure 12: We use this template to prompt the teacher model to assign a binary score based on the relevance of the generated segment-level… view at source ↗

Figure 13: Image example 1. view at source ↗

Figure 14: Image example 2. view at source ↗

Figure 15: Image example 3. view at source ↗

Figure 16: Image example 4. view at source ↗

Figure 17: Video example 1. view at source ↗

Figure 18: Video example 2. view at source ↗
read the original abstract

Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed based on the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes since different samples may require different scoring functions and teacher models may provide noisy reward signals too. In this paper, we introduce the Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning, visual hallucination as well as robotics and embodied AI benchmarks. Critically, we demonstrate that just relying on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward-hacking in MMRL. Finally, we also provide a theoretical justification for the effectiveness of Argos through the concept of pareto-optimality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Argos, an adaptive agentic verifier for multimodal reinforcement learning. For each sample, Argos selects from a pool of teacher-model-derived and rule-based scoring functions to evaluate final response accuracy, spatiotemporal localization of referred entities and actions, and reasoning process quality. The authors claim that applying Argos to both SFT data curation and online RL training yields state-of-the-art results on spatial reasoning, visual hallucination, robotics, and embodied AI benchmarks. They further state that SFT on curated data alone is insufficient because agents collapse to ungrounded solutions during RL, that Argos reduces reward-hacking, and that a Pareto-optimality argument justifies its effectiveness.

Significance. If the reported SOTA results hold and the adaptive selection demonstrably supplies more informative rewards than outcome signals or fixed selection without introducing new biases, the work would offer a practical advance in training reliable multimodal agentic models by moving beyond sparse rewards. The observation that online verification during RL is required to prevent collapse would be a useful empirical finding for the community.

major comments (3)
  1. Abstract: the claim of achieving state-of-the-art results across multiple agentic tasks supplies no quantitative metrics, benchmark scores, baseline comparisons, or error bars, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.
  2. Experiments section: the central claim that Argos's adaptive selection from the scoring pool produces strictly more informative and less noisy rewards than outcome-based signals (and prevents collapse) is load-bearing, yet no ablation isolating adaptive selection against fixed or random selection from the identical pool is described; without this, gains could be driven by curation/filtering effects rather than the verifier (a minimal sketch of such an ablation follows this report).
  3. Theoretical justification: the Pareto-optimality argument for Argos's effectiveness is invoked but the derivation is not shown, leaving open whether it provides independent grounding or reduces to a fitted quantity in the finite-sample RL setting.
minor comments (1)
  1. Clarify the exact size and composition of the scoring-function pool and whether Argos's selection policy is itself learned or heuristic.
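
One way to phrase the requested ablation is as three selection policies over one fixed scoring pool, with everything else held constant. A sketch under the assumption that the pool is a list of (applicability predicate, scorer) pairs; none of these names come from the paper.

    # Sketch of the referee-requested ablation: vary only the selection rule.
    import random

    def fixed_selection(pool, rollout):
        # Always apply every scorer in the pool.
        return [score for _, score in pool]

    def random_selection(pool, rollout, k=2, rng=random.Random(0)):
        # Apply k scorers chosen uniformly at random (seeded for reproducibility).
        return [score for _, score in rng.sample(pool, min(k, len(pool)))]

    def adaptive_selection(pool, rollout):
        # Stand-in for the verifier's per-sample choice: keep scorers whose
        # applicability predicate fires on the parsed rollout.
        return [score for pred, score in pool if pred(rollout)]

If the adaptive policy does not beat the fixed and random policies under identical pools and data, the gains would point to curation effects rather than the verifier itself.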

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, agreeing where revisions are warranted and providing clarifications where appropriate. We will incorporate changes to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract: the claim of achieving state-of-the-art results across multiple agentic tasks supplies no quantitative metrics, benchmark scores, baseline comparisons, or error bars, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript, we will update the abstract to report key benchmark scores, baseline comparisons, and references to error bars or statistical significance from the experiments. revision: yes

  2. Referee: Experiments section: the central claim that Argos adaptive selection from the scoring pool produces strictly more informative and less noisy rewards than outcome-based signals (and prevents collapse) is load-bearing, yet no ablation isolating adaptive selection against fixed or random selection from the identical pool is described; without this, gains could be driven by curation/filtering effects rather than the verifier.

    Authors: We acknowledge the importance of this ablation to isolate the contribution of adaptive selection. We will add experiments comparing adaptive selection against fixed and random selection from the same scoring pool in the revised experiments section to demonstrate that the observed benefits stem from the adaptive mechanism rather than curation alone. revision: yes

  3. Referee: Theoretical justification: the Pareto-optimality argument for Argos effectiveness is invoked but the derivation is not shown, leaving open whether it provides independent grounding or reduces to a fitted quantity in the finite-sample RL setting.

    Authors: We thank the referee for this observation. The Pareto-optimality argument is intended to provide independent grounding, but we agree that the derivation should be explicitly presented. In the revised manuscript, we will include the full derivation (with discussion of finite-sample considerations) in the appendix or a dedicated theoretical section. revision: yes
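
For orientation, the generic shape of a Pareto-optimality condition for multi-objective reward aggregation, written in LaTeX. This is a textbook statement of the concept, standing in for the derivation the authors promise; it is not the paper's theorem.

    % Generic Pareto-optimality condition (illustrative placeholder, not the paper's result).
    % Let r(x) = (r_1(x), \dots, r_m(x)) collect the m per-sample scores, and let \pi^* be
    % the verifier's aggregation/selection policy. \pi^* is Pareto-optimal if no policy \pi
    % improves one expected score component without degrading another:
    \[
      \nexists\, \pi \;:\;
      \mathbb{E}_{\pi}[r_i] \ge \mathbb{E}_{\pi^{*}}[r_i] \;\; \forall i
      \quad \text{and} \quad
      \mathbb{E}_{\pi}[r_j] > \mathbb{E}_{\pi^{*}}[r_j] \;\; \text{for some } j.
    \]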

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces Argos as a new adaptive verifier that selects from a pool of teacher-derived and rule-based scorers to provide richer rewards in MMRL, with claims of SOTA results, insufficiency of SFT alone, and a theoretical justification via Pareto-optimality. No equations, derivations, or steps are exhibited that reduce by construction to fitted inputs, self-definitions, or self-citation chains; the adaptive selection mechanism and Pareto argument are presented as independent contributions rather than renamings or forced equivalences. Empirical demonstrations and the verifier's design provide external grounding separate from the target outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the existence of a sufficiently diverse and reliable pool of scoring functions plus the ability of Argos to select among them without introducing systematic bias; these are treated as given rather than derived.

axioms (2)
  • domain assumption A pool of teacher-model-derived and rule-based scoring functions can be defined that jointly cover accuracy, spatiotemporal localization, and reasoning quality for arbitrary agentic samples.
    Invoked when describing how Argos selects scoring functions for each sample.
  • domain assumption Teacher models provide noisy but still useful reward signals that can be improved by adaptive selection.
    Stated as motivation for moving beyond fixed outcome rewards.
invented entities (1)
  • Argos (Agentic Reward for Grounded & Objective Scoring) · no independent evidence
    purpose: Adaptive selection of scoring functions during SFT curation and RL training.
    New system introduced to solve the problem of sample-specific reward computation.

pith-pipeline@v0.9.0 · 5637 in / 1486 out tokens · 74488 ms · 2026-05-17T03:05:01.265138+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 1 Pith paper · 21 internal anchors

  1. [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

  2. [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3] Chang Chen, Yi-Fu Wu, Jaesik Yoon, and Sungjin Ahn. TransDreamer: Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481, 2022.

  4. [4] William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. In Conference on Robot Learning, 2025.

  5. [5] William Chen, Oier Mees, Aviral Kumar, and Sergey Levine. Vision-language models provide promptable representations for reinforcement learning. Transactions on Machine Learning Research, 2025.

  6. [6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.

  7. [7] Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Huang...

  8. [8] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints, pages arXiv–2409, 2024.

  9. [9] Sixun Dong. Tool-LMM: A large multi-modal model for tool agent learning. 2024.

  10. [10] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. PaLM-E: An embodied multimodal language model. 2023.

  11. [11] Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. Agent AI: Surveying the horizons of multimodal interaction. CoRR, 2024.

  12. [12] Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. GRIT: Teaching MLLMs to think with images. arXiv preprint arXiv:2505.15879, 2025.

  13. [13] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  14. [14] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.

  15. [15] Google Gemini Team. Gemini: A family of highly capable multimodal models. 2023.

  16. [16] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

  17. [17] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  18. [18] Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. RegionGPT: Towards region understanding vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13796–13806, 2024.

  19. [19] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

  20. [20] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.

  21. [21] Kyle Beltran Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, and Benjamin Burchfiel. GHIL-GLUE: Hierarchical control with filtered subgoal images. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025.

  22. [22] Tao He, Hao Li, Jingchang Chen, Runxuan Liu, Yixin Cao, Lizi Liao, Zihao Zheng, Zheng Chu, Jiafeng Liang, Ming Liu, et al. Breaking the reasoning barrier: a survey on LLM complex reasoning through the lens of self-evolution. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7377–7417, 2025.

  23. [23] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, pages arXiv–2507, 2025.

  24. [24] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in Neural Information Processing Systems, 36:31096–31116, 2023.

  25. [25] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.

  26. [26] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Multimodal spatial language maps for robot navigation and manipulation. International Journal of Robotics Research (IJRR), 2025.

  27. [27] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.

  28. [28] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint, 2021.

  29. [29] Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Stachowicz, Pieter Abbeel, and Sergey Levine. Beyond sight: Fine-tuning generalist robot policies with heterogeneous sensors via language grounding. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025.

  30. [30] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025.

  31. [31] Somnath Kumar, Yash Gadhia, Tanuja Ganu, and Akshay Nambi. MMCTAgent: Multi-modal critical thinking agent framework for complex visual reasoning. arXiv preprint arXiv:2405.18358, 2024.

  32. [32] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.

  33. [33] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

  34. [34] Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, et al. Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921, 2025.

  35. [35] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

  36. [36] Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023.

  37. [37] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  38. [38] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.

  39. [39] Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents. arXiv preprint arXiv:2504.10458, 2025.

  40. [40] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022.

  41. [41] Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.

  42. [42] Vivek Myers, Bill Chunyuan Zheng, Oier Mees, Sergey Levine, and Kuan Fang. Policy adaptation via language optimization: Decomposing tasks for few-shot imitation. In Conference on Robot Learning, 2024.

  43. [43] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. Conference on Robot Learning (CoRL), 2024.

  44. [44] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  45. [45] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

  46. [46] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. In Proceedings of Robotics: Science and Systems, Los Angeles, USA, 2025.

  47. [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  48. [48] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579, 2024.

  49. [49] Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task agnostic offline reinforcement learning. In Proceedings of the 6th Conference on Robot Learning (CoRL), Auckland, New Zealand, 2022.

  50. [50] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

  51. [51] Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. RoboVQA: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024.

  52. [52] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  53. [53] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.

  54. [54] Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A Plummer, Bryan Russell, and Kate Saenko. Koala: Key frame-conditioned long video-LLM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13581–13591, 2024.

  55. [55] Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025.

  56. [56] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri Iyer, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems, 37:87310–87356, 2024.

  57. [57] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

  58. [58] Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. GenArtist: Multimodal LLM as an agent for unified image generation and editing. Advances in Neural Information Processing Systems, 37:128374–128395, 2024.

  59. [59] Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, et al. BLIP-3: A family of open large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6124–6135.

  60. [60] Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal AI agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025.

  61. [61] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560, 2025.

  62. [62] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In Forty-second International Conference on Machine Learning, 2025.

  63. [63] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

  64. [64] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

  65. [65] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV'25, 2025.

  66. [66] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI. arXiv preprint arXiv:2404.16006, 2024.

  67. [67] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

  68. [68] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Conference on Robot Learning, 2024.

  69. [69] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.

  70. [70] Jianrui Zhang, Mu Cai, Tengyang Xie, and Yong Jae Lee. CounterCurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples. arXiv preprint arXiv:2402.13254, 2024.

  71. [71] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.

  72. [72] Theodore Zhao, Mu Wei, J Preston, and Hoifung Poon. Pareto optimal learning for estimating large language model errors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10513–10529, 2024.

  73. [73] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024.

  74. [74] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

  75. [75] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

  76. [76] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

Showing first 80 references.