
arxiv: 2606.21572 · v1 · pith:KZWXJSARnew · submitted 2026-06-19 · 💻 cs.RO

Robot Critics that Sweat the Small Stuff

Pith reviewed 2026-06-26 14:18 UTC · model grok-4.3

classification 💻 cs.RO
keywords robot manipulation · vision-language models · progress reasoning · failure detection · policy improvement · fine-tuning · action selection · video prediction

The pith

Fine-tuning a vision-language model on pairs of success and failure robot rollouts creates a critic that detects subtle failures and selects better actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models hold useful priors about objects and interactions but often miss the small visual changes that separate robot success from failure during manipulation. The paper trains a critic by labeling pairs of frames drawn from successful and failed executions of an existing policy, giving it direct supervision on progress and failure. This critic then scores candidate next actions that an action-conditioned video model has forecasted, choosing the ones that look likely to succeed. If the pairwise differences in the training rollouts are representative, the method supplies a way to steer policies closed-loop without hand-crafted rewards or new data for every variation. Experiments report an 11 percent gain in real-world task success and a 5.9 percent gain in simulation.

Core claim

By constructing pairwise progress supervision using success and failure rollouts obtained from a policy, the fine-tuned critic excels at fine-grained progress reasoning and subtle failure detection, outperforming prior progress reasoning baselines. When the critic is used to identify successful candidates among actions sampled from a policy and forecasted by an action-conditioned video model, average policy success rate improves by 11 percent across real-world tasks and 5.9 percent across simulation tasks.

What carries the argument

A vision-language model critic fine-tuned on pairwise progress supervision drawn from success and failure policy rollouts, which learns to compare visual states for progress and failure signals.
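
The pairwise supervision described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the names `FramePair`, `progress_pairs`, `failure_pairs`, `randomize_order`, and `better_frame` are hypothetical, frames are stood in for by string identifiers, and aligning success and failure rollouts by time index is an assumption made here for simplicity.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FramePair:
    frame_a: str  # identifier of the first frame shown to the critic
    frame_b: str  # identifier of the second frame shown to the critic
    label: int    # 0 if frame_a is the better state, 1 if frame_b is
    kind: str     # "progress" or "failure"


def progress_pairs(success_frames: Sequence[str], stride: int = 1) -> List[FramePair]:
    """Pair frames from one successful rollout; the later frame is labeled better."""
    return [
        FramePair(success_frames[i], success_frames[i + stride], 1, "progress")
        for i in range(0, len(success_frames) - stride, stride)
    ]


def failure_pairs(success_frames: Sequence[str], failure_frames: Sequence[str]) -> List[FramePair]:
    """Pair a success and a failure rollout that share an initial condition.

    Assumption: frames are matched by time index, and the success frame
    is always labeled as the better state.
    """
    n = min(len(success_frames), len(failure_frames))
    return [FramePair(success_frames[t], failure_frames[t], 0, "failure") for t in range(n)]


def randomize_order(pair: FramePair, rng: random.Random) -> FramePair:
    """Swap presentation order half the time so the label is not tied to position."""
    if rng.random() < 0.5:
        return FramePair(pair.frame_b, pair.frame_a, 1 - pair.label, pair.kind)
    return pair


def better_frame(pair: FramePair) -> str:
    """Return the identifier of the frame the label marks as better."""
    return pair.frame_b if pair.label == 1 else pair.frame_a
```

Randomizing the presentation order is a common precaution against a critic learning a positional shortcut; whether the paper does this is not stated in the abstract.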

If this is right

  • The critic outperforms prior progress reasoning baselines at fine-grained tasks.
  • The critic correctly identifies successful candidate actions among those forecasted by the video model.
  • Using the critic to select actions raises average policy success by 11 percent in real-world tasks.
  • Using the critic to select actions raises average policy success by 5.9 percent in simulation tasks.
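
The selection step behind these gains can be sketched as a loop over candidates. This is a hedged sketch under stated assumptions: `select_action` and its three callables (`sample_actions`, `predict_terminal`, `prefers_second`) are hypothetical stand-ins for the policy, the action-conditioned video model, and the fine-tuned critic, and the sequential "keep the winner" comparison is one simple way to aggregate pairwise judgments, not necessarily the one the paper uses.

```python
from typing import Callable, List, TypeVar

Obs = TypeVar("Obs")
Action = TypeVar("Action")
State = TypeVar("State")


def select_action(
    obs: Obs,
    sample_actions: Callable[[Obs, int], List[Action]],   # stand-in for the policy
    predict_terminal: Callable[[Obs, Action], State],     # stand-in for the video model
    prefers_second: Callable[[State, State], bool],       # stand-in for the critic
    k: int,
) -> Action:
    """Pick one of k sampled candidates using pairwise critic comparisons."""
    candidates = sample_actions(obs, k)
    if not candidates:
        raise ValueError("the policy returned no candidate actions")

    # Forecast the terminal state each candidate would lead to.
    predicted = [predict_terminal(obs, action) for action in candidates]

    # Sequential tournament: keep the current best, challenge it with each
    # remaining candidate, and replace it when the critic prefers the challenger.
    best = 0
    for i in range(1, len(candidates)):
        if prefers_second(predicted[best], predicted[i]):
            best = i
    return candidates[best]


if __name__ == "__main__":
    # Toy 1-D example: the state is a position and the goal is position 1.0.
    chosen = select_action(
        obs=0.0,
        sample_actions=lambda o, k: [0.2, 0.9, 0.5][:k],
        predict_terminal=lambda o, a: o + a,
        prefers_second=lambda a, b: abs(1.0 - b) < abs(1.0 - a),
        k=3,
    )
    print(chosen)
```

The tournament uses K-1 critic calls per decision. In a closed-loop setting the chosen action would be executed and the loop repeated from the next observation.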

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairwise training signal could be collected from any policy that already runs in a target domain, lowering the cost of adapting critics to new robots or environments.
  • If the critic generalizes across object instances, it may reduce the need for task-specific reward functions in long-horizon manipulation.
  • Pairing the critic with more accurate future-prediction models would tighten the loop between action selection and visual verification.

Load-bearing premise

Success and failure rollouts collected from an existing policy already contain enough pairwise visual differences to train a critic that generalizes to new scenes and subtle failures not seen in the training rollouts.

What would settle it

Evaluating the trained critic on a new manipulation task whose failure modes produce visual differences absent from the original training rollouts and checking whether its detection accuracy falls to the level of an untrained baseline.

Figures

Figures reproduced from arXiv: 2606.21572 by Carl Vondrick, Junbang Liang, Pavel Tokmakov, Richard Zemel, Sreehari Rammohan, Sruthi Sudhakar.

Figure 1: Critic in-the-loop. Given an observation, a learned stochastic policy samples K unique candidate action sequences. An action-conditioned generative video model synthesizes visual observations to produce a terminal state per candidate action. The critic performs pairwise progress comparisons to select the best candidate, which is then executed and the re-planning continues. multiple candidate action sequen…
Figure 2: Training Robot Critics. VLMs are finetuned with successful and failed rollouts to enable fine-grained task progress/failure detection. To construct the training dataset, consecutive frames are used from successful trajectories to obtain task progress data. Additionally, success and failure frame pairs from the same initial condition are used to obtain policy- and task-specific failure data. Success–Failure…
Figure 3: Real-world Evaluation Results. (a) Initial states for all eval episodes overlaid: all methods start with the same set of initial states, matched manually with reference images. (b) Critical decision points where the VLM chooses the correct action candidate to lead to success. (c) Our method succeeds at the task at the end of the rollout. (d) The baseline policy fails at those critical points and therefore …
Figure 4: Coarse vs fine-grained critic performance. Current VLM performance degrades when judging fine-grained visual differences in task performance. Fine-tuning for fine-grained task progress recognition can boost these scores.
A.2 Generalization to unseen tasks: We evaluate the three strongest methods (ROVER, ProgressLM, and ours) on held-out tasks that are unseen for ProgressLM and our method. ROVER remains prom…
Figure 5: Best-of-K sampling reveals substantial headroom for Diffusion Policy, increasing success rate from 31% to 48%.
[Plot text recovered from the page: panels "PnP Lego To Bowl" and "Push Bowl"; x-axis: rollout cycle; y-axis: avg sample variance; series: joints j0 to j5 and gripper. Sample variance averaged across all rollouts, over 5 stages of the rollout.]
Figure 7: Prompt templates used for baselines and our method.
Original abstract

Large vision-language models contain several priors about the world and object interactions, making them useful critics during inference to steer robot policies towards success. However, closed-loop robot manipulation requires judging small visual differences between success and failure, which remains a challenge for current VLMs. We introduce a method to fine-tune critics by constructing pairwise progress supervision using success and failure rollouts obtained from a policy. Our fine-tuned critic excels at fine-grained progress reasoning and subtle failure detection, outperforming prior progress reasoning baselines. Additionally, we use an action-conditioned video model to predict the visual effect of several candidate actions sampled from a policy, and show that our critic can correctly identify successful candidates to execute, improving the average policy success rate by 11% across real-world tasks and 5.9% across simulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes fine-tuning vision-language model critics for closed-loop robot manipulation by constructing pairwise progress supervision from success and failure rollouts collected from an existing policy. The fine-tuned critic is claimed to outperform prior progress reasoning baselines on fine-grained progress reasoning and subtle failure detection. The critic is further combined with an action-conditioned video model to predict effects of candidate actions and select successful ones, yielding reported average policy success rate gains of 11% on real-world tasks and 5.9% on simulation tasks.

Significance. If the empirical claims hold under proper controls, the work could provide a practical route to improving VLM-based critics for detecting subtle visual distinctions in manipulation without requiring new data collection beyond existing policy rollouts. The pairwise supervision approach and its integration with forward video prediction represent a concrete empirical contribution worth testing in the robotics community.

major comments (2)
  1. [Abstract] The abstract states clear numerical gains (11% real-world, 5.9% simulation) but supplies no information on baselines, number of trials, statistical significance, rollout collection/filtering procedure, or evaluation protocol. These omissions are load-bearing for the central claim of outperformance and generalization.
  2. [Abstract] The method's generalization to new scenes and subtle failures not seen in training rests on the unverified assumption that success/failure rollouts already contain sufficient pairwise visual differences; no analysis, diversity metrics, or ablation is supplied to substantiate this assumption, which directly underpins the reported gains outside the training distribution.
minor comments (1)
  1. [Abstract] The abstract refers to 'prior progress reasoning baselines' without naming or citing them.
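
To make the first major comment concrete, the following sketch shows why the number of trials matters when reading a success rate. It computes a Wilson score interval for a binomial proportion, a standard textbook formula; the function name `wilson_interval` and the trial counts in the demo are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a success rate estimated from `trials` rollouts."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical counts: the same 50% success rate at two sample sizes.
    for successes, trials in [(10, 20), (100, 200)]:
        low, high = wilson_interval(successes, trials)
        print(f"{successes}/{trials}: [{low:.3f}, {high:.3f}]")
```

With few rollouts per task the interval is wide enough that a gain of several points can fall inside it, which is why the referee asks for trial counts alongside the headline numbers.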

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires expansion to better support the central claims and will revise it accordingly. We address each major comment below.

  1. Referee: [Abstract] The abstract states clear numerical gains (11% real-world, 5.9% simulation) but supplies no information on baselines, number of trials, statistical significance, rollout collection/filtering procedure, or evaluation protocol. These omissions are load-bearing for the central claim of outperformance and generalization.

    Authors: We agree that the abstract would benefit from additional context. In the revised manuscript, we will expand the abstract to briefly specify the baselines (prior progress reasoning methods), number of trials, statistical significance testing, rollout collection and filtering procedure, and evaluation protocol, while remaining within standard length limits.

    Revision: yes

  2. Referee: [Abstract] The method's generalization to new scenes and subtle failures not seen in training rests on the unverified assumption that success/failure rollouts already contain sufficient pairwise visual differences; no analysis, diversity metrics, or ablation is supplied to substantiate this assumption, which directly underpins the reported gains outside the training distribution.

    Authors: The full manuscript contains held-out evaluations on new scenes and subtle failures plus ablations on the pairwise supervision. We will revise the abstract to note that rollouts are drawn from diverse policy executions and explicitly reference the experiments section for diversity metrics and ablations. Additional analysis can be added if specific metrics are suggested.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation procedure is self-contained

full rationale

The paper presents an empirical pipeline: collect success/failure rollouts from an existing policy, construct pairwise progress supervision, fine-tune a VLM critic, and measure outperformance on held-out tasks plus downstream policy improvement via an action-conditioned video model. No equations, fitted parameters, or derivations are described that would reduce the reported 11%/5.9% success-rate gains or fine-grained reasoning claims to the training inputs by construction. The central results are measured outcomes on separate evaluation tasks, not predictions forced by the fitting process itself. No self-citation chains or uniqueness theorems are invoked as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard supervised fine-tuning works on VLM features and that video prediction is sufficiently accurate for action selection.

pith-pipeline@v0.9.1-grok · 5676 in / 1170 out tokens · 13575 ms · 2026-06-26T14:18:28.774019+00:00 · methodology

