Robot Critics that Sweat the Small Stuff

Carl Vondrick; Junbang Liang; Pavel Tokmakov; Richard Zemel; Sreehari Rammohan; Sruthi Sudhakar

arxiv: 2606.21572 · v1 · pith:KZWXJSARnew · submitted 2026-06-19 · 💻 cs.RO

Sruthi Sudhakar, Junbang Liang, Sreehari Rammohan, Pavel Tokmakov, Richard Zemel, Carl Vondrick

Pith reviewed 2026-06-26 14:18 UTC · model grok-4.3

classification 💻 cs.RO

keywords: robot manipulation · vision-language models · progress reasoning · failure detection · policy improvement · fine-tuning · action selection · video prediction

The pith

Fine-tuning a vision-language model on pairs of success and failure robot rollouts creates a critic that detects subtle failures and selects better actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models hold useful priors about objects and interactions but often miss the small visual changes that separate robot success from failure during manipulation. The paper trains a critic by labeling pairs of frames drawn from successful and failed executions of an existing policy, giving it direct supervision on progress and failure. This critic then scores candidate next actions that an action-conditioned video model has forecasted, choosing the ones that look likely to succeed. If the pairwise differences in the training rollouts are representative, the method supplies a way to steer policies closed-loop without hand-crafted rewards or new data for every variation. Experiments report an 11 percent gain in real-world task success and a 5.9 percent gain in simulation.

Core claim

By constructing pairwise progress supervision using success and failure rollouts obtained from a policy, the fine-tuned critic excels at fine-grained progress reasoning and subtle failure detection, outperforming prior progress reasoning baselines. When the critic is used to identify successful candidates among actions sampled from a policy and forecasted by an action-conditioned video model, average policy success rate improves by 11 percent across real-world tasks and 5.9 percent across simulation tasks.

What carries the argument

A vision-language model critic fine-tuned on pairwise progress supervision drawn from success and failure policy rollouts, which learns to compare visual states for progress and failure signals.
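
The pairwise supervision can be pictured with a minimal sketch. It assumes, following the Figure 2 caption, that progress pairs come from consecutive frames of successful rollouts and that failure pairs match a success and a failure from the same initial condition. The names `Rollout`, `PairExample`, `init_id`, and the choice to compare final frames are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    init_id: str        # hypothetical key identifying the initial condition
    frames: List[str]   # frame identifiers in temporal order
    success: bool


@dataclass(frozen=True)
class PairExample:
    frame_a: str
    frame_b: str
    better: str         # "A" or "B": which frame shows more task progress


def progress_pairs(rollout: Rollout) -> List[PairExample]:
    """Consecutive frames of a successful rollout; the later frame is 'better'."""
    pairs = []
    for earlier, later in zip(rollout.frames, rollout.frames[1:]):
        pairs.append(PairExample(earlier, later, "B"))
        pairs.append(PairExample(later, earlier, "A"))  # swapped copy balances answer position
    return pairs


def failure_pairs(rollouts: List[Rollout]) -> List[PairExample]:
    """Success vs. failure from the same initial condition.

    Assumption: the final frames are compared; the source does not say which frames are used.
    """
    groups: Dict[str, Dict[bool, List[Rollout]]] = defaultdict(lambda: {True: [], False: []})
    for r in rollouts:
        groups[r.init_id][r.success].append(r)
    pairs = []
    for group in groups.values():
        for good in group[True]:
            for bad in group[False]:
                pairs.append(PairExample(good.frames[-1], bad.frames[-1], "A"))
                pairs.append(PairExample(bad.frames[-1], good.frames[-1], "B"))
    return pairs


def build_dataset(rollouts: List[Rollout]) -> List[PairExample]:
    data: List[PairExample] = []
    for r in rollouts:
        if r.success:
            data.extend(progress_pairs(r))
    data.extend(failure_pairs(rollouts))
    return data
```

Emitting each pair in both orders is a design choice of this sketch, meant to keep a critic from learning a positional bias; whether the paper does the same is not stated in the text available here.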

If this is right

The critic outperforms prior progress reasoning baselines at fine-grained tasks.
The critic correctly identifies successful candidate actions among those forecasted by the video model.
Using the critic to select actions raises average policy success by 11 percent in real-world tasks.
Using the critic to select actions raises average policy success by 5.9 percent in simulation tasks.
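
The selection step behind the last three points can be sketched as a best-of-K loop. This is a minimal illustration under stated assumptions: `sample_actions`, `predict_terminal`, and `prefers_first` are hypothetical callables standing in for the policy, the action-conditioned video model, and the fine-tuned critic, and the single-pass tournament is one plausible way to aggregate pairwise judgments, not necessarily the one the paper uses.

```python
from typing import Callable, Sequence, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")
Frame = TypeVar("Frame")


def select_action(
    observation: Obs,
    sample_actions: Callable[[Obs, int], Sequence[Act]],   # stand-in for the policy
    predict_terminal: Callable[[Obs, Act], Frame],         # stand-in for the video model
    prefers_first: Callable[[Frame, Frame], bool],         # stand-in for the pairwise critic
    k: int = 4,
) -> Act:
    """Pick one of k sampled candidates using only pairwise critic comparisons."""
    candidates = list(sample_actions(observation, k))
    if not candidates:
        raise ValueError("the policy returned no candidates")
    # Forecast the terminal visual state of every candidate.
    predicted = [predict_terminal(observation, a) for a in candidates]
    # Single-pass tournament: the current winner is challenged by each remaining candidate.
    best = 0
    for i in range(1, len(candidates)):
        if not prefers_first(predicted[best], predicted[i]):
            best = i
    return candidates[best]
```

In a closed-loop setting the chosen candidate would be executed and the function called again on the next observation. The tournament needs K-1 critic calls per step; a round-robin with win counting would be more robust to inconsistent judgments at the cost of K(K-1)/2 calls.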

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pairwise training signal could be collected from any policy that already runs in a target domain, lowering the cost of adapting critics to new robots or environments.
If the critic generalizes across object instances, it may reduce the need for task-specific reward functions in long-horizon manipulation.
Pairing the critic with more accurate future-prediction models would tighten the loop between action selection and visual verification.

Load-bearing premise

Success and failure rollouts collected from an existing policy already contain enough pairwise visual differences to train a critic that generalizes to new scenes and subtle failures not seen in the training rollouts.

What would settle it

Evaluating the trained critic on a new manipulation task whose failure modes produce visual differences absent from the original training rollouts and checking whether its detection accuracy falls to the level of an untrained baseline.

Figures

Figures reproduced from arXiv: 2606.21572 by Carl Vondrick, Junbang Liang, Pavel Tokmakov, Richard Zemel, Sreehari Rammohan, Sruthi Sudhakar.

**Figure 1.** Critic in-the-loop. Given an observation, a learned stochastic policy samples K unique candidate action sequences. An action-conditioned generative video model synthesizes visual observations to produce a terminal state per candidate action. The critic performs pairwise progress comparisons to select the best candidate, which is then executed, and re-planning continues.

**Figure 2.** Training Robot Critics. VLMs are fine-tuned with successful and failed rollouts to enable fine-grained task progress and failure detection. To construct the training dataset, consecutive frames from successful trajectories are used to obtain task progress data. Additionally, success and failure frame pairs from the same initial condition are used to obtain policy- and task-specific failure data.

**Figure 3.** Real-world Evaluation Results. (a) Initial states for all eval episodes, overlaid; all methods start with the same set of initial states, matched manually with reference images. (b) Critical decision points where the VLM chooses the correct action candidate to lead to success. (c) Our method succeeds at the task at the end of the rollout. (d) The baseline policy fails at those critical points and therefore …

**Figure 4.** Coarse vs. fine-grained critic performance. Current VLM performance degrades when judging fine-grained visual differences in task performance. Fine-tuning for fine-grained task progress recognition can boost these scores. (The extracted caption also captures the start of Appendix A.2, "Generalization to unseen tasks": "We evaluate the three strongest methods (ROVER, ProgressLM, and ours) on held-out tasks that are unseen for ProgressLM and our method. ROVER remains prom…")

**Figure 5.** Best-of-K sampling reveals substantial headroom for Diffusion Policy, increasing success rate from 31% to 48%. [Plot residue from an adjacent figure: sample variance averaged across all rollouts, over 5 stages of the rollout; panels "PnP Lego To Bowl" and "Push Bowl"; x-axis: rollout cycle; y-axis: avg sample variance; legend: joints j0 to j5 and gripper.]

**Figure 7.** Prompt templates used for baselines and our method.

Original abstract

Large vision-language models contain several priors about the world and object interactions, making them useful critics during inference to steer robot policies towards success. However, closed-loop robot manipulation requires judging small visual differences between success and failure, which remains a challenge for current VLMs. We introduce a method to fine-tune critics by constructing pairwise progress supervision using success and failure rollouts obtained from a policy. Our fine-tuned critic excels at fine-grained progress reasoning and subtle failure detection, outperforming prior progress reasoning baselines. Additionally, we use an action-conditioned video model to predict the visual effect of several candidate actions sampled from a policy, and show that our critic can correctly identify successful candidates to execute, improving the average policy success rate by 11% across real-world tasks and 5.9% across simulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows how to fine-tune a VLM critic on pairwise success/failure rollouts from a policy and pair it with action-conditioned video prediction to pick better actions, with reported 11% real-world and 5.9% sim gains.

The main thing here is a practical recipe for turning rollout data into a critic that handles the small visual distinctions VLMs usually miss in closed-loop manipulation. They build pairwise progress labels directly from success and failure trajectories, fine-tune on that, then feed candidate actions through an action-conditioned video model and let the critic pick the ones that look like they will succeed. That specific loop and the pairwise construction from existing policy data look like the new pieces.

It does what it sets out to do on the numbers given: the fine-tuned critic beats the progress-reasoning baselines they cite, and the selection step lifts average policy success by the stated margins across the real and simulated tasks. The approach is straightforward and targets a real pain point without adding heavy new machinery.

The soft spots are the usual ones for an abstract-heavy view. The abstract gives no information on the number of trials, statistical tests, exact baseline implementations, or how the rollouts were filtered and collected. The central assumption, that the source policy's success/failure pairs already contain enough scene variety and subtle failure modes to train something that generalizes, is stated but not obviously stress-tested in the provided text. If the training distribution is narrow, the 11% lift may not travel. The stress-test note is right to flag this; nothing in the abstract contradicts it.

This is for robotics groups already running policies and looking for cheap ways to improve closed-loop selection. A reader who needs incremental tools for VLM-based critics will find the method and the real-world numbers useful. The work is coherent enough on its own terms to deserve a serious referee rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes fine-tuning vision-language model critics for closed-loop robot manipulation by constructing pairwise progress supervision from success and failure rollouts collected from an existing policy. The fine-tuned critic is claimed to outperform prior progress reasoning baselines on fine-grained progress reasoning and subtle failure detection. The critic is further combined with an action-conditioned video model to predict effects of candidate actions and select successful ones, yielding reported average policy success rate gains of 11% on real-world tasks and 5.9% on simulation tasks.

Significance. If the empirical claims hold under proper controls, the work could provide a practical route to improving VLM-based critics for detecting subtle visual distinctions in manipulation without requiring new data collection beyond existing policy rollouts. The pairwise supervision approach and its integration with forward video prediction represent a concrete empirical contribution worth testing in the robotics community.

major comments (2)

[Abstract] The abstract states clear numerical gains (11% real-world, 5.9% simulation) but supplies no information on baselines, number of trials, statistical significance, rollout collection/filtering procedure, or evaluation protocol. These omissions are load-bearing for the central claim of outperformance and generalization.
[Abstract] The method's generalization to new scenes and subtle failures not seen in training rests on the unverified assumption that success/failure rollouts already contain sufficient pairwise visual differences; no analysis, diversity metrics, or ablation is supplied to substantiate this assumption, which directly underpins the reported gains outside the training distribution.
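
To make the first comment concrete, here is a minimal sketch of why the trial count matters when reading a success-rate gain. It computes a Wilson score interval for a binomial success rate. The episode counts in the usage lines are hypothetical placeholders; the abstract does not report how many trials were run.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the paper: the same observed rates
    # at two different sample sizes.
    for trials in (20, 200):
        base = wilson_interval(round(0.50 * trials), trials)
        ours = wilson_interval(round(0.60 * trials), trials)
        print(trials, base, ours)
```

With few episodes per task the two intervals overlap heavily, which is why the requested trial counts and significance tests bear directly on how much weight the 11% and 5.9% figures can carry.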

minor comments (1)

[Abstract] The abstract refers to 'prior progress reasoning baselines' without naming or citing them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires expansion to better support the central claims and will revise it accordingly. We address each major comment below.

Referee: [Abstract] The abstract states clear numerical gains (11% real-world, 5.9% simulation) but supplies no information on baselines, number of trials, statistical significance, rollout collection/filtering procedure, or evaluation protocol. These omissions are load-bearing for the central claim of outperformance and generalization.

Authors: We agree that the abstract would benefit from additional context. In the revised manuscript, we will expand the abstract to briefly specify the baselines (prior progress reasoning methods), number of trials, statistical significance testing, rollout collection and filtering procedure, and evaluation protocol, while remaining within standard length limits. (Revision: yes)

Referee: [Abstract] The method's generalization to new scenes and subtle failures not seen in training rests on the unverified assumption that success/failure rollouts already contain sufficient pairwise visual differences; no analysis, diversity metrics, or ablation is supplied to substantiate this assumption, which directly underpins the reported gains outside the training distribution.

Authors: The full manuscript contains held-out evaluations on new scenes and subtle failures plus ablations on the pairwise supervision. We will revise the abstract to note that rollouts are drawn from diverse policy executions and explicitly reference the experiments section for diversity metrics and ablations. Additional analysis can be added if specific metrics are suggested. (Revision: partial)

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation procedure is self-contained

The paper presents an empirical pipeline: collect success/failure rollouts from an existing policy, construct pairwise progress supervision, fine-tune a VLM critic, and measure outperformance on held-out tasks plus downstream policy improvement via an action-conditioned video model. No equations, fitted parameters, or derivations are described that would reduce the reported 11%/5.9% success-rate gains or fine-grained reasoning claims to the training inputs by construction. The central results are measured outcomes on separate evaluation tasks, not predictions forced by the fitting process itself. No self-citation chains or uniqueness theorems are invoked as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard supervised fine-tuning works on VLM features and that video prediction is sufficiently accurate for action selection.

[1] [1]

Alayrac, J

J.-B. Alayrac, J. Donahue, P . Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Mil- lican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

2022

[2] [2]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023

[3]

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

[4]

Y. Wu, R. Tian, G. Swamy, and A. Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment. ArXiv, abs/2502.01828, 2025. URL https://api.semanticscholar.org/CorpusID:276107486

[5]

W. Zhao, J. Chen, Z. Meng, D. Mao, R. Song, and W. Zhang. Vlmpc: Vision-language model predictive control for robotic manipulation. ArXiv, abs/2407.09829, 2024. URL https://api.semanticscholar.org/CorpusID:271212525

[6]

S. Bai, K. qin Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report. ArXiv, abs/2502.13923, 2025. URL https://api.semanticscholar.org/CorpusID:276449796

[7]

G. C. et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. ArXiv, abs/2507.06261, 2025. URL https://api.semanticscholar.org/CorpusID:280151524

[8]

P. Schroeder, O. Biza, T. Weng, H. Luo, and J. R. Glass. Rover: Recursive reasoning over videos with vision-language models for embodied tasks. ArXiv, abs/2508.01943, 2025. URL https://api.semanticscholar.org/CorpusID:280422750

[9]

Y. J. Ma, J. Hejna, A. Wahid, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani, P. Xu, D. Driess, T. Xiao, J. Tompson, O. Bastani, D. Jayaraman, W. Yu, T. Zhang, D. Sadigh, and F. Xia. Vision language models are in-context value learners. ArXiv, abs/2411.04549, 2024. URL https://api.semanticscholar.org/CorpusID:273877849

[10]

J. Zhang, C. Qian, H. Sun, H. Lu, D. Wang, L. Xue, and H. Liu. Progresslm: Towards progress reasoning in vision-language models. ArXiv, abs/2601.15224, 2026. URL https://api.semanticscholar.org/CorpusID:284917784

[11]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. ArXiv, abs/2402.10329, 2024. URL https://api.semanticscholar.org/CorpusID:267740127

[12]

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Y. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

[13]

A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, A. Raffin, A. Wahid, B. Burgess-Limerick, B. Kim, B. Schölkopf, B. Ichter, C. Lu, C. Xu, C. Finn, C. Xu, C. Chi, C. Huang, C. Chan, C. Pan, C. Fu, C. Devin, D. Driess, D. Pathak, D. Shah, D. Büchler, D. Kalashnikov, D. Sadigh, E. Johns, F. Ce- o...

[14]

W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. ArXiv, abs/2402.08191, 2024. URL https://api.semanticscholar.org/CorpusID:267636930

[16]

H. Kress-Gazit, K. Hashimoto, N. Kuppuswamy, P. Shah, P. Horgan, G. Richardson, S. Feng, and B. Burchfiel. Robot learning as an empirical science: Best practices for policy evaluation. ArXiv, abs/2409.09491, 2024. URL https://api.semanticscholar.org/CorpusID:272689744

[17]

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. ...

[18]

K. Black, M. Y. Galliker, and S. Levine. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339, 2025

[19]

W. Wu, F. Lu, Y. Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y. Wang, S. Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

[20]

H. Xie, B. Wen, J. Zheng, Z. Chen, F. Hong, H. Diao, and Z. Liu. Dynamicvla: A vision-language-action model for dynamic object manipulation. arXiv preprint arXiv:2601.22153, 2026

[21]

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

[22]

J. Liang, R. Liu, E. Ozguroglu, S. Sudhakar, A. Dave, P. Tokmakov, S. Song, and C. Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation, 2024

[23]

J. Liang, P. Tokmakov, R. Liu, S. Sudhakar, P. Shah, R. Ambrus, and C. Vondrick. Video generators are robot policies, 2025. URL https://arxiv.org/abs/2508.00795

[24]

Y. Wang, L. Wang, Y. Du, B. Sundaralingam, X. Yang, Y.-W. Chao, C. Pérez-D’Arpino, D. Fox, and J. A. Shah. Inference-time policy steering through human interactions. 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15626–15633, 2024. URL https://api.semanticscholar.org/CorpusID:274280942

[25]

M. Du and S. Song. Dynaguide: Steering diffusion polices with active dynamic guidance. ArXiv, abs/2506.13922, 2025. URL https://api.semanticscholar.org/CorpusID:279410894

[26]

M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foundation models via value guidance, 2025. URL https://arxiv.org/abs/2410.13816

[27]

S. Jang, D. Kim, C. Kim, Y. Kim, and J. Shin. Verifier-free test-time sampling for vision language action models. arXiv preprint arXiv:2510.05681, 2025

[28]

W. Guo, G. Lu, H. Deng, Z. Wu, Y. Tang, and Z. Wang. Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search. arXiv preprint arXiv:2509.22643, 2025

[29]

J. Cao, Y. Huang, H. Guo, R. Zhang, M. Nan, W. Mai, J. Wang, H. Cheng, J. Sun, G. Han, et al. Compose your policies! improving diffusion-based or flow-based robot policies via test-time distribution-level composition. arXiv preprint arXiv:2510.01068, 2025

[30]

H. Qi, H. Yin, Y. Du, and H. Yang. Strengthening generative robot policies through predictive world modeling. ArXiv, abs/2502.00622, 2025. URL https://api.semanticscholar.org/CorpusID:276095203

[31]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949, 2026

[32]

J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. R. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500, 2022. URL https://api.semanticscholar.org/CorpusID:252355542

[33]

J. Gao, B. Sarkar, F. Xia, T. Xiao, J. Wu, B. Ichter, A. Majumdar, and D. Sadigh. Physically grounded vision-language models for robotic manipulation. 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12462–12469, 2023. URL https://api.semanticscholar.org/CorpusID:261556939

[34]

Y. J. Ma, W. Liang, V. Som, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:258999195

[35]

Y. Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi. Vision-language models as success detectors. ArXiv, abs/2303.07280, 2023. URL https://api.semanticscholar.org/CorpusID:257496810

[36]

W. Zhou, M. Tao, C. Zhao, H. Guo, H. Dong, M. Tang, and J. Wang. Physvlm: Enabling visual language models to understand robotic physical reachability. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6940–6949, 2025. URL https://api.semanticscholar.org/CorpusID:276929115

[37]

S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu, S. Lin, and J. Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. ArXiv, abs/2509.15937, 2025. URL https://api.semanticscholar.org/CorpusID:281411120

[38]

M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. M. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. C. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Retti...

[39]

Y. Wang, Z. Sun, J. Zhang, Z. Xian, E. Biyik, D. Held, and Z. Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. In International Conference on Machine Learning, 2024. URL https://api.semanticscholar.org/CorpusID:267499679

[40]

C. Finn and S. Levine. Deep visual foresight for planning robot motion. In 2017 IEEE international conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017

[41]

Z. Xue, J. An, X. Yang, and K. Grauman. Progress-aware video frame captioning. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13639–13650, 2024. URL https://api.semanticscholar.org/CorpusID:274446032

[42]

K.-H. Hung, P.-C. Lo, J.-F. Yeh, H.-Y. Hsu, Y.-T. Chen, and W. H. Hsu. Victor: Learning hierarchical vision-instruction correlation rewards for long-horizon manipulation. ArXiv, abs/2405.16545, 2024. URL https://api.semanticscholar.org/CorpusID:270064037

[43]

P. Pacaud, R. Garcia, S. Chen, and C. Schmid. Guardian: Detecting robotic planning and execution errors with vision-language models. ArXiv, abs/2512.01946, 2025. URL https://api.semanticscholar.org/CorpusID:283448847

[44]

A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra. Vqa: Visual question answering. International Journal of Computer Vision, 123:4–31, 2015. URL https://api.semanticscholar.org/CorpusID:3180429

[45]

D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6693–6702, 2019. URL https://api.semanticscholar.org/CorpusID:152282269

[46]

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. 2024 IEEE/CVF Conference on Computer Vision and Patter...

[47]

A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. Huang, L. S. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. 2026. URL https://api.semanticscholar.org/CorpusID:286223261

[48]

C. Xu, T. K. Nguyen, E. Dixon, C. Rodriguez, P. Miller, R. Lee, P. Shah, R. Ambrus, H. Nishimura, and M. Itkina. Can we detect failures without failure data? uncertainty-aware runtime failure detection for imitation learning policies, 2025. URL https://arxiv.org/abs/2503.08558

[49]

J. Duan, W. Pumacay, N. Kumar, Y. R. Wang, S. Tian, W. Yuan, R. Krishna, D. Fox, A. Mandlekar, and Y. Guo. Aha: A vision-language-model for detecting and reasoning over failures in robotic manipulation. ArXiv, abs/2410.00371, 2024. URL https://api.semanticscholar.org/CorpusID:273022765

[50]

Z. Lin, J. Duan, H. Fang, D. Fox, R. Krishna, C. Tan, and B. Wen. Failsafe: Reasoning and recovery from failures in vision-language-action models, 2025. URL https://arxiv.org/abs/2510.01642

[51]

J. Park, J. Yoon, B. Jeon, J. Park, J. Shin, N. Cho, K. Lee, S. Yun, and S. Choi. Hierarchical vision language action model using success and failure demonstrations, 2025. URL https://arxiv.org/abs/2512.03913

[52]

H. Li, K. Lei, S. Zang, K. Hu, Y. Liang, B. An, X. Li, and H. Xu. Failure-aware rl: Reliable offline-to-online reinforcement learning with self-recovery for real-world manipulation, 2026. URL https://arxiv.org/abs/2601.07821

[53]

P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences, 2023. URL https://arxiv.org/abs/1706.03741

[54]

R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952

[55]

C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. ArXiv, abs/1605.07157, 2016. URL https://api.semanticscholar.org/CorpusID:2659157

[56]

S. Yang, Y. Du, K. Ghasemipour, J. Tompson, L. Kaelbling, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators, 2024. URL https://arxiv.org/abs/2310.06114

[57]

B. Wu, C. Zou, C. Li, D. Huang, F. Yang, H. Tan, J. Peng, J. Wu, J. Xiong, J. Jiang, Linus, Patrol, P. Zhang, P. Chen, P. Zhao, Q. Tian, S. Liu, W. Kong, W. Wang, X. He, X. Li, X. Deng, X. Zhe, Y. Li, Y. Long, Y. Peng, Y. Wu, Y. Liu, Z. Wang, Z. Dai, B. Peng, C. Li, G. Gong, G. Xiao, J. Tian, J. Lin, J. Liu, J. Zhang, J. Lian, K. Pan, L. Wang, ... URL https://arxiv.org/abs/2511.18870

[59]

Y. Li and A. Torralba. Multimodal action conditioned video generation, 2025. URL https://arxiv.org/abs/2510.02287
