
arxiv: 2605.17077 · v1 · pith:EIZWMWEUnew · submitted 2026-05-16 · 💻 cs.RO · cs.AI

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

Pith reviewed 2026-05-20 15:13 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords dense language annotations · robot policy learning · vision-language models · DeMiAn · manipulation clips · egocentric videos · learned instructor · scaling robot learning

The pith

Re-labeling existing robot demonstrations with dense multi-aspect language annotations improves policy learning without new data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper explores how to get more out of existing robot demonstration data by adding rich language descriptions. It proposes DeMiAn, which uses vision-language models to annotate clips along four aspects: physical motion, scene composition, arm pose, and reasoning. A separate learned instructor then picks the most useful annotation type based on the task and scene at hand. The approach improves policies trained on more than 1M robot clips and 50K human egocentric videos, raising success on the RoboCasa benchmark by 5 points over a task-only baseline and coming within 3 points of a per-task oracle. The results suggest that choosing the right kind of dense language annotation can scale robot learning more efficiently than collecting fresh demonstrations.

Core claim

DeMiAn first re-labels demonstration segments with VLM-generated annotations along physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description plus initial scene snapshot to the best annotation aspect, running asynchronously to hide latency. This improves vision-language-action policies and video-based world-action models across more than 1M robot clips and 50K egocentric videos, with a 5-point success gain on RoboCasa over a task-only baseline and a result within 3 points of a per-task oracle. No single aspect dominates across tasks. Gains also appear on composite and out-of-distribution tasks, and DeMiAn shifts the compute-performance frontier even after annotation-generation FLOPs are counted.

What carries the argument

DeMiAn's two-stage process of VLM dense multi-aspect annotation followed by a learned instructor that selects the appropriate annotation type for a given task and scene.
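
One way to picture the asynchronous instructor is a background worker that produces the dense annotation while the policy has already started acting on the plain task description. The sketch below is a toy under stated assumptions: `slow_instructor`, `run_episode`, the sleep-based latencies, and the rule of swapping the prompt as soon as the annotation is ready are all hypothetical, not the paper's implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_instructor(task: str, scene_id: str) -> str:
    """Hypothetical stand-in for the learned instructor, with simulated latency."""
    time.sleep(0.1)
    return f"[dense annotation for '{task}' in {scene_id}]"


def run_episode(task: str, scene_id: str, num_steps: int = 20, step_time: float = 0.02):
    """Toy control loop: act on the task text until the annotation arrives."""
    prompts = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_instructor, task, scene_id)  # runs in the background
        instruction = task  # assumption: fall back to the task-only description
        for _ in range(num_steps):
            if pending.done():
                instruction = pending.result()  # swap in the dense annotation
            prompts.append(instruction)  # a real policy would condition on this
            time.sleep(step_time)  # simulated policy step
    return prompts


prompts = run_episode("open the drawer", "kitchen_snapshot_0")
print(prompts[0])
print(prompts[-1])
```

The control loop never blocks on the instructor, so in this toy the generation latency overlaps with the first few policy steps instead of delaying them.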

If this is right

  • Raises success by 5 points over task-only baseline on RoboCasa.
  • Comes within 3 points of a per-task oracle.
  • Improves composite-task and out-of-distribution performance.
  • Shifts the compute-performance frontier in mid-training and post-training after accounting for annotation FLOPs.
  • Positions dense re-annotation as a practical scaling lever for robot policy learning.
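
A minimal sketch of how the reference points in the list above could be computed from a per-task, per-aspect success table. It assumes the per-task oracle means picking the best fixed aspect for each task after the fact; the table layout, the function name, and the numbers are hypothetical and are not results from the paper.

```python
ASPECTS = ["physical_motion", "scene_composition", "arm_pose", "reasoning"]


def summarize(success):
    """success[task][aspect] is a success rate in [0, 1] (hypothetical layout)."""
    tasks = list(success)
    # Average success when one aspect is used for every task.
    fixed = {a: sum(success[t][a] for t in tasks) / len(tasks) for a in ASPECTS}
    # Assumed oracle: choose the best aspect separately for each task.
    best_per_task = {t: max(ASPECTS, key=lambda a: success[t][a]) for t in tasks}
    oracle = sum(success[t][best_per_task[t]] for t in tasks) / len(tasks)
    return fixed, best_per_task, oracle


# Made-up numbers for illustration only.
toy = {
    "task_a": {"physical_motion": 0.6, "scene_composition": 0.4, "arm_pose": 0.5, "reasoning": 0.3},
    "task_b": {"physical_motion": 0.3, "scene_composition": 0.7, "arm_pose": 0.4, "reasoning": 0.5},
}
fixed, best_per_task, oracle = summarize(toy)
print(best_per_task)
print(max(fixed.values()), oracle)
```

When different tasks prefer different aspects, as in the toy table, the oracle average exceeds every fixed-aspect average, which is the gap a learned instructor tries to close.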

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could lower the overall cost of scaling robot learning by reusing existing corpora more effectively.
  • Dynamic selection of annotation aspects might extend to other multimodal learning domains where different description types suit different tasks.
  • Testing the instructor's accuracy in real-time deployment scenarios would reveal if latency hiding holds in practice.
  • The lack of a dominant annotation aspect implies that task-specific customization is key for further gains.

Load-bearing premise

VLM-generated annotations across the four aspects are sufficiently accurate and the learned instructor can reliably select the most effective one without introducing errors or noticeable latency.
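
The accuracy half of this premise could be probed with a small human audit in which two annotators independently label sampled VLM annotations. The sketch below assumes Cohen's kappa as the agreement statistic and a hypothetical three-way label set (faithful, hallucination, omission); neither choice is confirmed by the paper, and the labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    # Agreement expected if both annotators labeled at random with their own rates.
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up audit labels for six annotations (illustration only).
annotator_1 = ["faithful", "faithful", "hallucination", "omission", "faithful", "hallucination"]
annotator_2 = ["faithful", "faithful", "hallucination", "faithful", "faithful", "omission"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Reporting kappa per aspect, rather than pooled, would show whether physical-motion and arm-pose annotations are harder to verify than the other two.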

What would settle it

If experiments on RoboCasa or similar benchmarks show no significant improvement in policy success rates when applying DeMiAn compared to using only task descriptions, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.17077 by Alexander Trevithick, Bosung Kim, Brandon Cui, David Acuna, Jaehun Jung, Prithviraj Ammanabrolu, Ruiyi Wang, Yejin Choi.

Figure 1: Overview of DeMiAn. We re-annotate existing robot and human demonstrations along four aspects: physical motion, scene composition, arm pose, and reasoning. We apply DeMiAn to 11K RoboCasa 365 clips, a 1M-scale MolmoBot dataset, and 50K EgoVerse human-egocentric clips.
…human-egocentric datasets demand similarly heavy capture and curation; EgoVerse 50K [30], for example, was assembled over ∼1,500 hours of in-…
Figure 2: Asynchronous Instruction Injection.
Learned Instructor. The instructor is trained via supervised fine-tuning with reward-weighted target sampling. We first construct a reward table w(τ, k) by running the action policy with each of the four GT fixed-aspect annotations across all training tasks and recording per-task validation SR. For each training episode, the target aspect is sampled from a softmax over …
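
The reward-weighted target sampling mentioned in the Figure 2 excerpt can be illustrated with a short sketch. The source text is cut off before it says what the softmax ranges over, so this assumes it is taken over the per-task rewards w(τ, k) for the four aspects, with a temperature parameter that is purely hypothetical. The reward values are made up.

```python
import math
import random

ASPECTS = ["physical_motion", "scene_composition", "arm_pose", "reasoning"]


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_target_aspect(reward_row, temperature=0.1, rng=random):
    """Sample a training target aspect for one task from softmax(w(task, k) / T)."""
    probs = softmax([reward_row[a] for a in ASPECTS], temperature)
    aspect = rng.choices(ASPECTS, weights=probs, k=1)[0]
    return aspect, probs


# Hypothetical reward table: per-task validation success rate for each aspect.
reward_table = {
    "task_a": {"physical_motion": 0.40, "scene_composition": 0.55, "arm_pose": 0.35, "reasoning": 0.50},
}
rng = random.Random(0)
aspect, probs = sample_target_aspect(reward_table["task_a"], rng=rng)
print(aspect, [round(p, 3) for p in probs])
```

Sampling rather than always taking the top-scoring aspect keeps some training mass on near-ties; a lower temperature moves the targets closer to a hard argmax.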
Figure 3: Overview of the policy-model architectures (DeMiAn VLA and DeMiAn WAM).
Policy models. Our experiments cover two representative robot policy architectures: vision-language-action policy (VLA) and world-model-based action policy (WAM), which span the mainstream paradigms in modern robot policy learning. VLAs inherit broad semantic priors from large-scale image-text pre-training; WAMs instead build their bac…
Figure 4: Action expert attention over prefix tokens at step 42 of a …
Figure 5: DeMiAn under scaling. (A) DeMiAn-WAM mid-training on EgoVerse 50K with dense annotations, evaluated on downstream RoboCasa 365. The x-axis includes both mid-training compute and annotation-generation compute. (B) DeMiAn-VLA post-training on the 1M-scale MolmoBot corpus, evaluated by total success across four MolmoSpaces benchmarks. The x-axis includes annotation-generation and DeMiAn-VLA post-training FLOP…
Original abstract

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DeMiAn, a two-stage method that first uses VLMs to densely re-annotate over 1M robot manipulation clips and 50K EgoVerse videos along four aspects (physical motion, scene composition, arm pose, reasoning), then trains an instructor model to map a task description plus initial scene to the most effective annotation aspect at deployment. The instructor runs asynchronously to hide latency. Experiments claim that this improves both VLA policies and video-based world models on fixed data, yielding a 5-point success-rate gain on RoboCasa over a task-only baseline and coming within 3 points of a per-task oracle, with additional benefits for composite and out-of-distribution tasks.

Significance. If the VLM annotations are shown to be accurate and task-informative, the work demonstrates a practical scaling lever for robot policy learning that extracts more signal from existing corpora without new demonstrations. The observation that no fixed aspect dominates and the reported shifts in the compute-performance frontier after accounting for annotation FLOPs would be notable contributions to data-efficient robot learning.

major comments (3)
  1. §4 (Experiments) and abstract: the 5-point RoboCasa gain and 3-point gap to oracle are reported without error bars, run-to-run variance, statistical tests, or ablations that isolate annotation quality from token-volume or data-filtering effects. This is load-bearing for the central claim that the four-aspect dense labels supply the additional signal.
  2. §3.1 (Annotation generation): no human validation, inter-annotator agreement, or error-rate analysis is provided for the VLM outputs on robot clips. If the generated annotations contain systematic hallucinations or low fidelity on physical motion or arm-pose aspects, the observed policy improvements cannot be attributed to semantic content rather than confounds.
  3. §4.3 (Instructor evaluation): the claim that the learned instructor reliably selects the right aspect without introducing latency or errors lacks quantitative metrics on selection accuracy or failure modes on held-out tasks, which directly affects whether the method generalizes beyond the training annotation distribution.
minor comments (2)
  1. The abstract states that DeMiAn 'shifts the compute-performance frontier after accounting for annotation-generation FLOPs,' but the main text should include an explicit breakdown of those FLOPs and the exact frontier comparison (e.g., which figure or table).
  2. [§3] Notation for the four aspects is introduced in the abstract but should be consistently defined with short names or symbols in §3 to improve readability when results are discussed per aspect.
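
Major comment 1 asks for run-to-run variance and a statistical test behind the 5-point gain. A minimal sketch of a paired t statistic over matched seeds, using only the Python standard library. The function name and the success percentages are hypothetical placeholders, not numbers from the paper.

```python
import math
import statistics


def paired_t(baseline, treatment):
    """Paired t statistic for per-seed success rates measured under two conditions."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need at least two matched seeds")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t_stat, n - 1


# Placeholder success percentages for five seeds (not real results).
task_only = [50, 52, 49, 51, 50]
dense = [55, 56, 55, 57, 54]
mean_diff, sd_diff, t_stat, dof = paired_t(task_only, dense)
print(mean_diff, sd_diff, round(t_stat, 2), dof)
```

With five seeds there are 4 degrees of freedom, so a two-sided test at the 0.05 level compares |t| against the critical value 2.776. Pairing by seed removes seed-level variation that the two conditions share.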

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate where revisions will be made to the manuscript.

  1. Referee: §4 (Experiments) and abstract: the 5-point RoboCasa gain and 3-point gap to oracle are reported without error bars, run-to-run variance, statistical tests, or ablations that isolate annotation quality from token-volume or data-filtering effects. This is load-bearing for the central claim that the four-aspect dense labels supply the additional signal.

    Authors: We agree that error bars, variance, and statistical tests are important for substantiating the central claim. In the revised manuscript we will report results over at least five random seeds with mean and standard deviation, and include a paired t-test for the 5-point gain. For isolating annotation quality from token volume we will add an ablation that matches total token count between dense and task-only conditions by repeating the task description as needed; this will be presented in a new table in §4. We confirm no additional data filtering was applied beyond the original splits and will state this explicitly. revision: yes

  2. Referee: §3.1 (Annotation generation): no human validation, inter-annotator agreement, or error-rate analysis is provided for the VLM outputs on robot clips. If the generated annotations contain systematic hallucinations or low fidelity on physical motion or arm-pose aspects, the observed policy improvements cannot be attributed to semantic content rather than confounds.

    Authors: We acknowledge that direct human validation of the VLM annotations was not reported in the original submission. We will add a human study on a random sample of 300 clips (75 per aspect) with two independent annotators, reporting agreement rates and error categories (hallucination vs. omission). Preliminary internal checks already indicate that physical-motion and arm-pose annotations are largely faithful, but we will include the full analysis and inter-annotator metrics in the revision to strengthen attribution of gains to semantic content. revision: yes

  3. Referee: §4.3 (Instructor evaluation): the claim that the learned instructor reliably selects the right aspect without introducing latency or errors lacks quantitative metrics on selection accuracy or failure modes on held-out tasks, which directly affects whether the method generalizes beyond the training annotation distribution.

    Authors: We agree that quantitative evaluation of the instructor is needed. In the revised §4.3 we will report selection accuracy on a held-out task set (including per-aspect precision/recall and a confusion matrix), together with an analysis of failure modes. We will also measure wall-clock latency of instructor inference and confirm that asynchronous execution keeps it hidden from policy execution. These additions will directly address generalization beyond the training distribution. revision: yes
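
The selection metrics promised in response 3 (accuracy, a confusion matrix, per-aspect precision and recall) can be sketched as follows. This assumes the reference label for each held-out task is the aspect that performed best when used as a fixed annotation; that definition, the function name, and the toy labels are hypothetical.

```python
from collections import Counter

ASPECTS = ["physical_motion", "scene_composition", "arm_pose", "reasoning"]


def selection_report(best_aspects, chosen_aspects):
    """Compare the instructor's chosen aspect with the assumed best aspect per task."""
    confusion = Counter(zip(best_aspects, chosen_aspects))  # keys are (best, chosen)
    accuracy = sum(confusion[(a, a)] for a in ASPECTS) / len(best_aspects)
    per_aspect = {}
    for a in ASPECTS:
        hits = confusion[(a, a)]
        times_chosen = sum(confusion[(b, a)] for b in ASPECTS)
        times_best = sum(confusion[(a, c)] for c in ASPECTS)
        per_aspect[a] = {
            "precision": hits / times_chosen if times_chosen else 0.0,
            "recall": hits / times_best if times_best else 0.0,
        }
    return accuracy, confusion, per_aspect


# Made-up labels for six held-out tasks (illustration only).
best = ["physical_motion", "physical_motion", "scene_composition", "arm_pose", "reasoning", "reasoning"]
chosen = ["physical_motion", "scene_composition", "scene_composition", "arm_pose", "reasoning", "physical_motion"]
accuracy, confusion, per_aspect = selection_report(best, chosen)
print(round(accuracy, 3))
print(per_aspect["scene_composition"])
```

The off-diagonal entries of `confusion` show which aspects the instructor tends to mistake for one another, which is one concrete way to present failure modes.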

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on held-out evaluation


The paper reports empirical gains from VLM-generated dense annotations (physical motion, scene composition, arm pose, reasoning) on fixed corpora of 1M+ robot clips and 50K videos, with a learned instructor selecting aspects at deployment. Performance is measured on held-out RoboCasa tasks, composite tasks, and OOD settings rather than being algebraically or statistically forced by construction from the same fitted annotation parameters. No equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided derivation chain; the central result is an independent experimental outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the quality and utility of VLM-generated annotations and on the instructor's ability to select them correctly; these are domain assumptions without external verification beyond the reported experiments.

axioms (1)
  • domain assumption: Vision-language models produce accurate and task-relevant annotations for robot demonstration segments across physical motion, scene composition, arm pose, and reasoning.
    The entire re-labeling stage rests on this premise about VLM capability.
invented entities (1)
  • DeMiAn instructor (no independent evidence)
    purpose: Maps task description and initial scene to a task-appropriate dense annotation at deployment time.
    New learned component introduced to hide generation latency and select among aspects.

pith-pipeline@v0.9.0 · 5792 in / 1450 out tokens · 66042 ms · 2026-05-20T15:13:21.011414+00:00 · methodology


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 27 internal anchors

  1. [1]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, K...

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language, 2024. URL https://arxiv.org/abs/2403.01823

  4. [4]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visio...

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan C. Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsa...

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashni...

  8. [8]

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, An...

  9. [9]

    Molmob0t: Large-scale simulation enables zero-shot manipulation, 2026

    Abhay Deshpande, Maya Guru, Rose Hendrix, Snehal Jauhri, Ainaz Eftekhar, Rohun Tripathi, Max Argus, Jordi Salvador, Haoquan Fang, Matthew Wallingford, Wilbert Pumacay, Yejin Kim, Quinn Pfeifer, Ying-Chun Lee, Piper Wolters, Omar Rayyan, Mingtong Zhang, Jiafei Duan, Karen Farley, Winson Han, Eli Vanderbilt, Dieter Fox, Ali Farhadi, Georgia Chalvatzaki, Dhr...

  10. [10]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  11. [11]

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel...

  12. [12]

    Thought cloning: Learning to think while acting by imitating human thinking

    Shengran Hu and Jeff Clune. Thought cloning: Learning to think while acting by imitating human thinking, URL https://arxiv.org/abs/2306.00323

  14. [14]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207

  15. [15]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022. URL https://arxiv.org/abs/2207.05608

  16. [16]

    Bc-z: Zero-shot task generalization with robotic imitation learning, 2022

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning, 2022. URL https://arxiv.org/abs/2202.02005

  17. [17]

    VIMA: general robot manipulation with multimodal prompts,

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts, 2023. URL https://arxiv.org/abs/2210.03094

  18. [18]

    Language-driven representation learning for robotics, 2023

    Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics, 2023. URL https://arxiv.org/abs/2302.12766

  19. [19]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You... URL https://arxiv.org/abs/2403.12945

  21. [21]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246

  22. [22]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026. URL https://arxiv.org/abs/2601.16163

  23. [23]

    Molmospaces: A large-scale open ecosystem for robot navigation and manipulation, 2026

    Yejin Kim, Wilbert Pumacay, Omar Rayyan, Max Argus, Winson Han, Eli VanderBilt, Jordi Salvador, Abhay Deshpande, Rose Hendrix, Snehal Jauhri, Shuo Liu, Nur Muhammad Mahi Shafiullah, Maya Guru, Ainaz Eftekhar, Karen Farley, Donovan Clay, Jiafei Duan, Arjun Guru, Piper Wolters, Alvaro Herrasti, Ying-Chun Lee, Georgia Chalvatzaki, Yuchen Cui, Ali Farhadi, Di...

  24. [24]

    Code as Policies: Language Model Programs for Embodied Control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control, 2023. URL https://arxiv.org/abs/2209.07753

  25. [25]

    Liv: Language-image representations and rewards for robotic control, 2023

    Yecheng Jason Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language-image representations and rewards for robotic control, 2023. URL https://arxiv.org/abs/2306.00958

  26. [26]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation, 2022. URL https://arxiv.org/abs/2203.12601

  27. [27]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots, 2024. URL https://arxiv.org/abs/2406.02523

  28. [28]

    Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots, 2026

    Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots, 2026. URL https://arxiv.org/abs/2603.04356

  29. [29]

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    NVIDIA, :, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee,...

  30. [30]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, :, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, Y...

  31. [31]

    NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, ...

  32. [32]

    Ryan Punamiya, Simar Kareer, Zeyi Liu, Josh Citron, Ri-Zhao Qiu, Xiongyi Cai, Alexey Gavryushin, Jiaqi Chen, Davide Liconti, Lawrence Y. Zhu, Patcharapong Aphiwetsa, Baoyu Li, Aniketh Cheluva, Pranav Kuppili, Yangcen Liu, Dhruv Patel, Aidan Gao, Hye-Young Chung, Ryan Co, Renee Zbizika, Jeff Liu, Xiaomeng Xu, Haoyu Xiong, Geng Chen, Sebastiano Oliani, Che...

  33. [33]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  34. [34]

    Cliport: What and where pathways for robotic manipulation, 2021

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation, 2021. URL https://arxiv.org/abs/2109.12098

  35. [35]

    ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models, 2022. URL https://arxiv.org/abs/2209.11302

  36. [36]

    Qwen3.5-Omni Technical Report

    Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804

  37. [37]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023

  38. [38]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  39. [39]

    Unleashing large-scale video generative pre-training for visual robot manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, URL https://arxiv.org/abs/2312.13139

  41. [41]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

  42. [42]

    Scaling Robot Learning with Semantically Imagined Experience

    Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Dee M, Jodilyn Peralta, Brian Ichter, Karol Hausman, and Fei Xia. Scaling robot learning with semantically imagined experience, 2023. URL https://arxiv.org/abs/2302.11550

  43. [43]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning, 2025. URL https://arxiv.org/abs/2407.08693

  44. [44]

    Sanketi, Grecia Salazar, Michael S

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...

  45. [45]

    We train the action head with rectified-flow matching following π0 [5]

    Training. We train the action head with rectified-flow matching following π0 [5]. When action labels are available, this loss is combined with the DiT’s standard video flow-matching objective so the backbone continues to adapt to the target domain; Reason 1 is kept frozen and used as an online prefix encoder. We train with AdamW (learning rate 5×10⁻⁵, wei...