pith. machine review for the scientific record.

arxiv: 2507.15493 · v2 · submitted 2025-07-21 · 💻 cs.RO · cs.AI · cs.CV

Recognition: 2 theorem links · Lean Theorem

GR-3 Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 07:58 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords generalist robot policy · vision-language-action model · robot imitation learning · dexterous manipulation · bi-manual robot · VR trajectory data · long-horizon tasks

The pith

GR-3 is a vision-language-action model that generalizes to novel objects, abstract instructions, and long-horizon dexterous tasks through combined web-scale and robot data training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GR-3 as a large-scale vision-language-action model for generalist robot policies. It claims this model achieves strong generalization to new objects, environments, and abstract concepts by co-training on web-scale vision-language data, fine-tuning with VR-collected human trajectories, and imitation learning from robot trajectories. A sympathetic reader would care because the approach promises robots that adapt quickly to varied tasks with limited additional data, moving closer to practical daily assistance. The work also describes ByteMini, a bi-manual mobile robot that works with GR-3. Real-world tests position GR-3 ahead of the π0 baseline on diverse manipulation challenges.

Core claim

GR-3 is a large-scale VLA model that generalizes to novel objects, environments, and instructions involving abstract concepts. It can be efficiently fine-tuned with minimal human trajectory data and excels in long-horizon and dexterous tasks including bi-manual manipulation and mobile movement. These capabilities come from co-training with web-scale vision-language data, efficient fine-tuning from VR human trajectories, and imitation learning with robot data. Extensive experiments show it surpasses the baseline π0 on a wide variety of challenging tasks.

What carries the argument

The multi-faceted training recipe of web-scale vision-language co-training, VR human trajectory fine-tuning, and robot imitation learning.
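
To make that recipe concrete, here is a minimal sketch of a three-stream co-training step. Everything in it (the stream names, the mix weights, the batch stubs, and both loss placeholders) is an assumption of this review, not the paper's implementation.

```python
# Minimal sketch of a three-stream co-training step. All names, weights,
# and loss stubs below are illustrative assumptions, not the paper's values.
import random

STREAM_WEIGHTS = {"web_vl": 0.5, "vr_human": 0.2, "robot": 0.3}  # assumed mix

def next_batch(stream):
    # Stand-in for real loaders: web_vl would yield image-text pairs,
    # vr_human and robot would yield observation-action trajectories.
    return {"stream": stream, "data": [random.random() for _ in range(8)]}

def compute_loss(batch):
    # Web VL batches would take a vision-language objective; trajectory
    # batches an action-prediction (imitation) objective. Placeholders only.
    if batch["stream"] == "web_vl":
        return sum(batch["data"]) / len(batch["data"])
    return sum(x * x for x in batch["data"]) / len(batch["data"])

def train_step():
    # Sample a stream in proportion to its mix weight, then score one batch.
    streams, weights = zip(*STREAM_WEIGHTS.items())
    stream = random.choices(streams, weights=weights, k=1)[0]
    return stream, compute_loss(next_batch(stream))

if __name__ == "__main__":
    for step in range(5):
        stream, loss = train_step()
        print(f"step {step}: stream={stream:8s} loss={loss:.3f}")
```

The design question the recipe raises, and the sketch makes visible, is how the mix weights trade breadth from web-scale VL data against control fidelity from trajectory data.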

If this is right

  • GR-3 adapts rapidly to new settings after fine-tuning on small amounts of human trajectory data collected via VR.
  • GR-3 reliably performs long-horizon tasks that combine bi-manual manipulation with mobile movement.
  • GR-3 outperforms the π0 baseline across a range of real-world challenging manipulation tasks.
  • Integration of GR-3 with the ByteMini platform enables completion of varied real-world tasks with high reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar multi-modal training mixtures could lower the data barrier for deploying robots in unstructured home settings.
  • The same recipe might transfer to other robot hardware if the VR and imitation stages are adjusted for different kinematics.
  • Models built this way could eventually interpret vague daily commands without requiring precise step-by-step programming.

Load-bearing premise

That the multi-faceted training recipe of web-scale vision-language co-training, VR human trajectory data, and robot imitation learning produces the claimed generalization to novel objects, abstract instructions, and long-horizon dexterous tasks without heavy post-hoc tuning or task-specific overfitting.

What would settle it

Running head-to-head tests of GR-3 and π0 on a suite of previously unseen abstract-instruction tasks in new environments, without any task-specific fine-tuning, and comparing success rates plus failure modes on long-horizon sequences.
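
A hedged sketch of how such a head-to-head could be scored: identical unseen-task suites for both policies, per-task success rates, and simple binomial (Wilson) intervals so overlapping uncertainty stays visible. The task names, trial counts, and outcomes below are hypothetical placeholders.

```python
# Sketch of head-to-head scoring: per-task success rates with 95% Wilson
# intervals. All task names and counts are hypothetical placeholders.
import math

def wilson_interval(successes, trials, z=1.96):
    # 95% Wilson score interval for a binomial success rate.
    if trials == 0:
        return 0.0, 0.0
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# (policy, task) -> (successes, trials); placeholder numbers only.
results = {
    ("GR-3", "pack-unseen-objects"): (17, 20),
    ("pi0", "pack-unseen-objects"): (11, 20),
    ("GR-3", "abstract-instruction"): (14, 20),
    ("pi0", "abstract-instruction"): (9, 20),
}

for (policy, task), (s, n) in results.items():
    lo, hi = wilson_interval(s, n)
    print(f"{policy:5s} {task:22s} {s:2d}/{n} success, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Overlapping intervals on a task would argue for more trials on that task, not more tasks, before declaring a winner.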

read the original abstract

We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GR-3, a large-scale vision-language-action (VLA) model for generalist robot policies. It claims exceptional generalization to novel objects, environments, and abstract instructions, along with strong performance on long-horizon dexterous, bi-manual, and mobile tasks. These capabilities are attributed to a multi-faceted training recipe of web-scale vision-language co-training, efficient fine-tuning on minimal VR-collected human trajectory data, and robot imitation learning. The paper also presents ByteMini, a new bi-manual mobile robot platform, and states that extensive real-world experiments show GR-3 surpasses the baseline method π0 on a wide variety of challenging tasks.

Significance. If the performance claims hold under rigorous evaluation, the work would represent incremental progress toward generalist robots by combining large-scale pretraining with targeted adaptation data. The introduction of ByteMini as a hardware platform is a secondary contribution. The absence of detailed quantitative results, however, makes the magnitude of any improvement over π0 difficult to assess at present.

major comments (2)
  1. [Abstract] The central claim that GR-3 'surpasses the state-of-the-art baseline method, π0' on a wide variety of challenging tasks is stated without any quantitative metrics, success rates, task counts, error bars, or statistical comparisons. This is load-bearing for the empirical contribution and must be supported by concrete data.
  2. [Experiments] The manuscript refers to 'extensive real-world experiments' and 'efficient fine-tuning with minimal human trajectory data' but supplies no task definitions, ablation breakdowns isolating the web-scale VL, VR, and robot data components, or failure-mode analysis (a sketch of such an ablation grid follows the minor comments below). Without these, the generalization claims to novel objects and abstract instructions cannot be verified.
minor comments (2)
  1. [Methods] Provide model architecture details, parameter counts, and training hyperparameters to support reproducibility.
  2. [Training recipe] Clarify the exact data scales and collection protocols for the VR human trajectories and robot imitation datasets.
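
The ablation grid referenced in major comment 2 could be organized as an on/off sweep over the three data sources. A sketch, with component names that are this review's shorthand rather than the paper's:

```python
# Enumerate the eight on/off combinations of the three data sources so
# each training run isolates one component's contribution. Names are
# this review's shorthand, not identifiers from the paper.
from itertools import product

COMPONENTS = ("web_vl_cotrain", "vr_human_trajs", "robot_imitation")

ablations = [
    dict(zip(COMPONENTS, flags))
    for flags in product([True, False], repeat=len(COMPONENTS))
]

for config in ablations:
    enabled = [name for name, on in config.items() if on] or ["none"]
    print("train run with:", " + ".join(enabled))
```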

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review of our manuscript on GR-3. We appreciate the feedback on strengthening the empirical presentation and will revise the manuscript to address the concerns. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] The central claim that GR-3 'surpasses the state-of-the-art baseline method, π0' on a wide variety of challenging tasks is stated without any quantitative metrics, success rates, task counts, error bars, or statistical comparisons. This is load-bearing for the empirical contribution and must be supported by concrete data.

    Authors: We agree that the abstract would benefit from concrete quantitative support for the performance claim. In the revised manuscript, we will incorporate key metrics from our real-world experiments, including success rates on representative tasks, the number of tasks evaluated, and direct comparisons to the π0 baseline. This will make the central empirical contribution more transparent while preserving the abstract's concise style. revision: yes

  2. Referee: [Experiments] The manuscript refers to 'extensive real-world experiments' and 'efficient fine-tuning with minimal human trajectory data' but supplies no task definitions, ablation breakdowns isolating the web-scale VL, VR, and robot data components, or failure-mode analysis. Without these, the generalization claims to novel objects and abstract instructions cannot be verified.

    Authors: We acknowledge that additional experimental details would improve verifiability. We will expand the Experiments section to include explicit task definitions, ablation studies isolating the contributions of web-scale vision-language co-training, VR-collected human trajectory data, and robot imitation learning, as well as a discussion of observed failure modes. These revisions will provide stronger substantiation for the generalization results. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The GR-3 paper is an empirical technical report describing a multi-component training recipe (web-scale VL co-training, VR human trajectories, robot imitation) and real-world experiments on a bi-manual mobile robot. No derivation chain, equations, or first-principles results are presented that could reduce to fitted parameters or self-referential quantities. Claims of generalization and superiority over π0 rest on experimental outcomes rather than any self-definitional or fitted-input structure. The provided text contains no load-bearing self-citations or ansatzes that collapse the central results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard machine-learning assumptions about the benefits of scaling and imitation learning plus the effectiveness of the new hardware platform; no explicit free parameters or invented theoretical entities are stated in the abstract.

axioms (1)
  • domain assumption: Co-training on web-scale vision-language data plus VR human trajectories and robot data produces generalizable policies.
    Invoked in the description of the multi-faceted training recipe.
invented entities (1)
  • ByteMini (independent evidence)
    purpose: Versatile bi-manual mobile robot platform for integration with GR-3
    New hardware introduced to demonstrate the policy in real-world tasks.

pith-pipeline@v0.9.0 · 5586 in / 1406 out tokens · 48028 ms · 2026-05-17T07:58:50.642162+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  2. Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

    cs.RO 2026-05 unverdicted novelty 6.0

    HandITL blends human intent with policy execution to eliminate gesture jumps in dexterous VLA interventions, cutting jitter by 99.8%, grasp failures by 87.5%, and yielding 19% better refined policies.

  3. Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-05 conditional novelty 6.0

    GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

  4. From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

    cs.RO 2026-04 unverdicted novelty 6.0

    Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

  5. Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    DC-QFA trains one supernet over architectures and bit-widths, then runs a fast per-device search plus multi-step distillation to deliver 2-3x faster robotic policies across hardware with negligible success-rate drop.

  6. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  7. ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

    cs.CV 2026-02 unverdicted novelty 6.0

    ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.

  8. A Pragmatic VLA Foundation Model

    cs.RO 2026-01 unverdicted novelty 6.0

    LingBot-VLA is a VLA foundation model trained on massive real robot data that shows superior generalization across tasks and platforms with fast training throughput.

  9. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  10. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  11. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  12. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  13. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  14. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  15. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  16. CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

    cs.RO 2026-04 unverdicted novelty 5.0

    CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...

  17. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  18. AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

    cs.LG 2025-11 unverdicted novelty 5.0

    AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 17 Pith papers · 34 internal anchors

  1. [1]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    A careful examination of large behavior models for multitask dexterous manipulation

    Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  5. [5]

    Track2Act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation

    Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation. arXiv preprint arXiv:2405.01527, 2024

  6. [6]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  8. [8]

    Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

  9. [9]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  10. [10]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  11. [11]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  12. [12]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

  13. [13]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

  14. [14]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023

  15. [15]

    Implementation of the pure pursuit path tracking algorithm

    R Craig Coulter. Implementation of the pure pursuit path tracking algorithm. Technical report, 1992

  16. [16]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018

  17. [17]

    RoboNet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

  18. [18]

    Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. arXiv preprint arXiv:2408.11812, 2024

  19. [19]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  20. [20]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023

  21. [21]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021

  22. [22]

    On the kinematic design of spherical three-degree-of-freedom parallel manipulators

    Clément M Gosselin and Eric Lavoie. On the kinematic design of spherical three-degree-of-freedom parallel manipulators. The International Journal of Robotics Research, 12(4):394–402, 1993

  23. [23]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017

  24. [24]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  25. [25]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

  26. [26]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers. arXiv preprint arXiv:2010.04245, 2020

  27. [27]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025

  28. [28]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

  29. [29]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  30. [30]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  31. [31]

    Few-shot object detection via feature reweighting

    Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8420–8429, 2019

  32. [32]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  33. [33]

    Language-driven representation learning for robotics

    Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023

  34. [34]

    EgoMimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. arXiv preprint arXiv:2410.24221, 2024

  35. [35]

    Mini cheetah: A platform for pushing the limits of dynamic quadruped control

    Benjamin Katz, Jared Di Carlo, and Sangbae Kim. Mini cheetah: A platform for pushing the limits of dynamic quadruped control. In 2019 International Conference on Robotics and Automation (ICRA), pages 6295–6301. IEEE, 2019

  36. [36]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  37. [37]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  38. [38]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

  39. [39]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

  40. [40]

    GR-MG: Leveraging partially-annotated data via multi-modal goal-conditioned policy

    Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, and Tao Kong. Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy. IEEE Robotics and Automation Letters, 2025

  41. [41]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  42. [42]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024

  43. [43]

    Vision-Language Foundation Models as Effective Robot Imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

  44. [44]

    Towards generalist robot policies: What matters in building vision-language-action models

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024

  45. [45]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

  46. [46]

    Data scaling laws in imitation learning for robotic manipulation

    Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024

  47. [47]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  48. [48]

    Rectified flow: A marginal preserving approach to optimal transport

    Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022

  49. [49]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  50. [50]

    Where are we in the search for an artificial visual cortex for embodied intelligence?

    Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Tingfan Wu, Jay Vakil, et al. Where are we in the search for an artificial visual cortex for embodied intelligence? Advances in Neural Information Processing Systems, 36:655–677, 2023

  51. [51]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023

  52. [52]

    Should Robots be Obedient?

    Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? arXiv preprint arXiv:1705.09990, 2017

  53. [53]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022

  54. [54]

    Singularity-consistent kinematic redundancy resolution for the srs manipulator

    Dragomir N Nenchev, Yuichi Tsumaki, and Mitsugu Takahashi. Singularity-consistent kinematic redundancy resolution for the srs manipulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 4, pages 3607–3612. IEEE, 2004

  55. [55]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  56. [56]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  57. [57]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  58. [58]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  59. [59]

    Humanoid policy ~ human policy

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid policy ~ human policy. arXiv preprint arXiv:2503.13441, 2025

  60. [60]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  61. [61]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  62. [62]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024

  63. [63]

    A Generalist Agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  64. [64]

    Multimodal diffusion transformer: Learning versatile behavior from multimodal goals

    Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. arXiv preprint arXiv:2407.05996, 2024

  65. [65]

    A whole-body control framework for humanoids operating in human environments

    Luis Sentis and Oussama Khatib. A whole-body control framework for humanoids operating in human environments. In 2006 IEEE International Conference on Robotics and Automation (ICRA), pages 2641–2648. IEEE, 2006

  66. [66]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

  67. [67]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  68. [68]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  69. [69]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

  70. [70]

    Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

    Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. Advances in neural information processing systems, 37:124420–124450, 2024

  71. [71]

    Generalizing from a few examples: A survey on few-shot learning

    Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur), 53(3):1–34, 2020

  72. [72]

    Any-point Trajectory Modeling for Policy Learning

    Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. arXiv preprint arXiv:2401.00025, 2023

  73. [73]

    Unleashing large-scale video generative pre-training for visual robot manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In International Conference on Learning Representations, 2024

  74. [74]

    Masked visual pre-training for motor control

    Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022

  75. [75]

    xGen-MM (BLIP-3): A family of open large multimodal models

    Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, et al. xgen-mm (blip-3): A family of open large multimodal models. arXiv preprint arXiv:2408.08872, 2024

  76. [76]

    Magma: A foundation model for multimodal ai agents

    Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025

  77. [77]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024

  78. [78]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019