arxiv: 2512.15840 · v2 · submitted 2025-12-17 · 💻 cs.RO · cs.CV

Large Video Planner Enables Generalizable Robot Control

Pith reviewed 2026-05-16 21:24 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords video planning · robot foundation models · zero-shot generalization · generative video models · human activity videos · robot control · action extraction · vision-language-action

The pith

A video model trained on human activity footage generates zero-shot plans that convert into executable robot actions for novel tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that video pretraining offers a stronger foundation for robot decision-making than language or image models because videos naturally encode sequences of states and actions in the physical world. The authors assemble a large dataset of internet videos showing people performing everyday tasks and train an open generative model to predict future video frames from a current scene image plus a text instruction. These predicted videos serve as plans that are post-processed into low-level robot commands, which then run successfully on real hardware for tasks and environments never seen in training. A sympathetic reader would care because this route could let robots acquire broad capabilities from abundant existing video data instead of scarce, expensive robot-specific demonstrations.

Core claim

The authors train a generative video model at foundation-model scale on internet-scale videos of human activities and task demonstrations. Conditioned on a new scene image and natural-language instruction, the model outputs a zero-shot video plan depicting the robot completing the task. Post-processing the plan extracts low-level actions that execute successfully on real robots and on third-party-selected tasks in the wild, without any per-task fine-tuning or additional robot data.

What carries the argument

The generative video model that predicts future frame sequences from a current observation image and task instruction, acting as the planning representation before action extraction.

If this is right

  • Robots follow instructions in previously unseen scenes and tasks without retraining.
  • Physical execution succeeds on real robots for third-party chosen tasks.
  • The same model supports both simulation and hardware deployment.
  • Open release of the model and video dataset allows others to reproduce and extend the results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This video-planning route may reduce dependence on costly robot demonstration datasets by leveraging existing human video corpora.
  • The approach suggests world models learned from video could be adapted for control across multiple robot embodiments with minimal additional work.
  • Future work could test whether longer-horizon or multi-step tasks remain solvable when the video predictor is scaled further.

Load-bearing premise

Converting the generated video plans into low-level robot actions works reliably for new tasks and different robot bodies without extra per-task engineering.

What would settle it

A controlled test in which the model produces visually plausible video plans for a novel task yet the extracted actions fail to complete the task on a real robot.

Figures

Figures reproduced from arXiv: 2512.15840 by Boyuan Chen, Caiyi Zhang, Haoran Geng, Jitendra Malik, Kiwhan Song, Peihao Li, Pieter Abbeel, Russ Tedrake, Tianyuan Zhang, Vincent Sitzmann, William T. Freeman, Yilun Du.

Figure 1
Autonomous Robot Execution with Large Video Planner. Our approach uses video generation as a visual motion planner in pixel space. From a single image and a task instruction, the model generates a video depicting how the task should be completed. The predicted human motion is then retargeted to a robot hand for real-world execution, enabling zero-shot visual planning in diverse scenes. In this work, we pro…
Figure 2
LVP Overview: (a) Overview of the latent video diffusion framework. We first use a temporally causal VAE to encode video clips into compressed 3D latent representations. Then we train a diffusion transformer in this latent space with flow matching objectives. (b) We jointly train image-to-video (I2V) and video-to-video (V2V) with a modified diffusion forcing training strategy. During training, a random con…
Figure 3
(a) Visualization of our eight dataset sources. First row: four robotics datasets. Second row: four human-centric datasets. (b) Illustration of our video diffusion sampling strategy, where scores estimated with and without history are linearly combined. Text conditioning and the diffusion transformer are omitted for clarity. Autoregressive Extension for Multi-Stage Planning. Due to the flexible history con…
Figure 4
Pipeline from Video to Action. Given a generated video depicting a human hand performing a task, we first reconstruct and track the hand in 3D (second column). The reconstructed hand motion is then retargeted to dexterous hands or grippers (third column). Finally, the retargeted trajectory is transformed into the robot’s control frame and executed in the real world (rightmost column). In total, including r…
Figure 5
Baseline Comparison. LVP accurately generates videos of hand interactions in a zero-shot setting, such as pulling out a tissue (left) and opening a gate (right). Baseline models (Wan, Cosmos-Predict 2, Hunyuan) often produce spatial or semantic inconsistencies, highlighted by red circles. The first frame and task instruction shown under each column serve as the generation conditions. Grab the black gas noz…
Figure 6
Visualization of generated video plans. Eight examples in our in-the-wild test set with generated videos by LVP. 4.1 THIRD-PARTY SELECTION OF NOVEL TASKS We believe that true task-level generalization should allow any human to propose a task in any environment, without requiring prior knowledge of the capability of the model. To this end, we crowdsource test data from third-party participants by asking them…
Figure 7
Multi-Stage Video Plans. Our LVP can generate long-horizon video plans by repeatedly extending videos conditioned on the last six latent frames. Each example illustrates a three-stage motion plan obtained through two iterative extensions. [Table columns: Method; Level 1: Correct contact; Level 2: End state; Level 3: Task complete; Level 4: Perfect; each level reports Average (%) and Best@4. Rows begin with Wan…]
Figure 8
Robot Execution Evaluation. Left: Comparison of Task Success Across Methods on (1) Franka Arm with Parallel-Jaw Gripper and (2) G1 with Inspire Hands. ¹ denotes tests on OOD objects; ² denotes scenes that differ substantially from the training videos. Right: Visualization of the robot tasks and experiments. [Panel labels: Video Gen., Robot Exec.; tasks: Tear the Tape, Open the Door, Pick up the Cup, Take out the Coffee Beans …]
Figure 9
Zero Shot Robot Manipulation with LVP. The model generates videos for diverse tasks, enabling zero-shot execution on both a dexterous hand and a parallel-jaw gripper. 4.3 EVALUATING REAL-WORLD ROBOT MANIPULATION The previous experiment demonstrates that our large video planner exhibits strong zero-shot generalization for unseen tasks and novel scenes. We now evaluate the complete pipeline, from video gene…
Figure 10
LVP-1M curation pipelines. Videos are collected from eight public sources, including four teleoperated robotics datasets and four human-centric activity datasets. We apply three processing stages: (1) aggressive filtering for quality and embodiment, (2) action-focused captioning using Gemini, and (3) temporal frequency alignment to match human motion speeds. The final dataset contains 1.4 million clips wi…
Figure 11
Composition of LVP-1M. Distribution of filtered LVP-1M clips across eight sources. Greenish tones indicate teleoperated robot datasets, while reddish tones represent human-centric activity datasets. The rightmost column shows the final sampling ratios after reweighting during first-stage training. D METHOD: DETAILS ABOUT OUR VIDEO PROCESSING MODULES D.1 DATA ACQUISITION AND INPUT Given a monocular RGB se…
Figure 12
Comparison between HaMeR and Our 4D Alignment Module. The red curve shows our 4D Alignment Module; the blue one is from HaMeR. per-frame extrinsics and metrically aligned depth. The first frame is chosen as the world reference system. D.4 ALIGNMENT OF HAMER AND MEGASAM We first use MegaSaM to estimate the intrinsic parameters K, which are then provided to HaMeR for more reliable per-frame wrist localizati…
Figure 13
Visualization of generated video plans with human hand. Each row shows eight uniformly sampled frames from a generated video plan. The first frame depicts the input condition image that defines the scene. The caption below each sequence indicates the task instruction used to generate the video. Footnote 2: https://github.com/unitreerobotics/xr_teleoperate
Figure 14
Visualization of generated video plans with human hand and robot gripper. Each row shows eight uniformly sampled frames from a generated video plan. The first frame depicts the input condition image that defines the scene. The caption below each sequence indicates the task instruction used to generate the video.
Figure 15
Visualization of generated videos and corresponding robot executions – A. The first row presents five uniformly sampled frames from the generated video plan conditioned on the input scene image. The second row illustrates the robot executing the same task in the real world.
Figure 16
Visualization of generated videos and corresponding robot executions – B. The first row presents five uniformly sampled frames from the generated video plan conditioned on the input scene image. The second row illustrates the robot executing the same task in the real world.
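
The captions for Figures 3 and 7 mention two sampling mechanics: scores estimated with and without history are linearly combined, and long plans are built by repeatedly extending a video conditioned on the last six latent frames. The following is a minimal Python sketch of both ideas, not the authors' implementation. It assumes the linear combination takes the familiar classifier-free-guidance form, and the names `combine_scores`, `extend_plan`, `weight`, and `generate_clip` are hypothetical; the page does not state the actual formula, weights, or interfaces.

```python
def combine_scores(score_with_history, score_without_history, weight):
    """Linearly combine two score estimates, element by element.

    Assumed form: s = s_without + weight * (s_with - s_without).
    weight = 0 ignores history, weight = 1 uses the history-conditioned
    score as is, and weight > 1 extrapolates toward it.
    """
    return [
        u + weight * (c - u)
        for c, u in zip(score_with_history, score_without_history)
    ]


def extend_plan(first_clip, generate_clip, num_extensions, history_len=6):
    """Autoregressively extend a plan of latent frames.

    `generate_clip` is a hypothetical callable that takes the most recent
    `history_len` latent frames and returns the next chunk of frames.
    """
    plan = list(first_clip)
    for _ in range(num_extensions):
        history = plan[-history_len:]  # condition only on the latest frames
        plan.extend(generate_clip(history))
    return plan
```

With `num_extensions=2`, `extend_plan` produces a three-stage plan, matching the "two iterative extensions" described for Figure 7.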
Original abstract

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes an alternative to vision-language-action models by training a large open video model on an internet-scale dataset of human activities and task demonstrations. The model generates zero-shot video plans for novel scenes and tasks; these plans are post-processed to extract executable robot actions. The work reports successful physical execution on third-party-selected wild tasks and real-robot experiments, claiming robust instruction following, strong generalization across tasks and environments, and real-world feasibility. The model and dataset are released publicly.

Significance. If the empirical results hold after clarification of the post-processing pipeline, the work provides a promising video-centric foundation-model approach for robotics that leverages natural spatio-temporal alignment from video pretraining. Releasing both the model and dataset is a clear strength for reproducibility. The significance is limited by the current lack of detail on how video outputs are converted to low-level actions, which is load-bearing for the generalization claims.

major comments (1)
  1. [Abstract and §3 (Methods)] The central claim of zero-shot generalizable robot control rests on an unspecified post-processing step that converts generated video plans into executable low-level robot actions. No description is given of the extraction algorithm, its assumptions about camera calibration, embodiment mapping, inverse kinematics, or failure modes. Without this, it is impossible to determine whether the reported physical successes are attributable to the video model alone or to unstated task-specific engineering.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of our video-centric foundation model approach as well as the value of releasing both the model and dataset. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation of our methods.

Point-by-point responses
  1. Referee: [Abstract and §3 (Methods)] The central claim of zero-shot generalizable robot control rests on an unspecified post-processing step that converts generated video plans into executable low-level robot actions. No description is given of the extraction algorithm, its assumptions about camera calibration, embodiment mapping, inverse kinematics, or failure modes. Without this, it is impossible to determine whether the reported physical successes are attributable to the video model alone or to unstated task-specific engineering.

    Authors: We agree that the post-processing pipeline was described at too high a level in the original submission, making it difficult to fully attribute the results to the video model. In the revised manuscript we will add a dedicated subsection in §3 that specifies the full extraction procedure: (1) keypoint tracking via optical flow on the generated video frames to recover 2-D motion trajectories, (2) lifting to 3-D using known camera intrinsics and a fixed extrinsic calibration relative to the robot base, (3) mapping of end-effector velocities to joint commands via a standard Jacobian-based inverse kinematics solver for our particular robot embodiment, and (4) an explicit discussion of failure modes (e.g., tracking drift under occlusion or lighting changes) together with the simple heuristics used to detect and reject invalid plans. These additions will clarify that the post-processing is a lightweight, embodiment-agnostic procedure rather than task-specific engineering, thereby supporting the claim that generalization originates primarily from the video planner. revision: yes
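
    Steps (2) and (3) of the procedure listed in this simulated rebuttal can be illustrated with a toy sketch. Everything below is an assumption made for illustration: a pinhole camera model, a fixed rotation and translation as the extrinsic, a planar two-link arm standing in for a real robot, and damped least squares as one common Jacobian-based solver. None of the function names or parameters come from the paper or its released code.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a camera-frame 3-D point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def camera_to_base(point, rotation, translation):
    """Map a camera-frame point into the robot base frame: p_base = R p_cam + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def jacobian_2link(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm (toy stand-in for a real robot)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def joint_velocities(jac, vx, vy, damping=1e-3):
    """Damped least squares for a 2x2 Jacobian: dq = J^T (J J^T + damping^2 I)^-1 v."""
    (a, b), (c, d) = jac
    # Entries of the symmetric matrix M = J J^T + damping^2 I.
    m00 = a * a + b * b + damping ** 2
    m01 = a * c + b * d
    m11 = c * c + d * d + damping ** 2
    det = m00 * m11 - m01 * m01
    # Solve M y = v with the closed-form 2x2 inverse, then dq = J^T y.
    y0 = (m11 * vx - m01 * vy) / det
    y1 = (m00 * vy - m01 * vx) / det
    return (a * y0 + c * y1, b * y0 + d * y1)
```

    The damping term keeps the solve well-conditioned near singular arm configurations, at the cost of a small tracking error. A real system would also need the tracking and plan-rejection steps the rebuttal mentions, which this sketch does not cover.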

Circularity Check

0 steps flagged

No circularity: empirical training on external video data with no self-referential derivation

full rationale

The paper presents an empirical pipeline: curating an external internet-scale video dataset of human activities, training a generative video model at foundation scale, and post-processing outputs to robot actions for zero-shot evaluation on novel tasks. No equations, fitted parameters defined in terms of target outputs, or load-bearing self-citations appear in the abstract or claims. The central results are reported physical executions on third-party tasks rather than any derivation that reduces to its own inputs by construction. The post-processing step is described only at high level without internal definitions that would create tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical transfer of video pretraining to action planning. No new physical entities are postulated. Training involves standard deep-learning hyperparameters that are not enumerated in the abstract.

free parameters (1)
  • model scale and training hyperparameters
    Standard large-model choices such as learning rate schedule, batch size, and architecture depth are fitted or chosen during training but not listed explicitly.
axioms (1)
  • domain assumption: Internet videos of human activities contain sufficient spatio-temporal structure to support zero-shot transfer to robot control
    Invoked when the authors state that videos are naturally aligned with robotic behavior and that the trained model produces usable plans for novel scenes.

pith-pipeline@v0.9.0 · 5571 in / 1412 out tokens · 27421 ms · 2026-05-16T21:24:31.838408+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

    cs.RO 2026-05 unverdicted novelty 7.0

    WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage trainin...

  2. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  3. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  4. DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

    cs.CV 2026-04 unverdicted novelty 7.0

    DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

  5. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  6. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  7. Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

    cs.RO 2026-03 conditional novelty 7.0

    GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

  8. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

    cs.RO 2026-02 unverdicted novelty 7.0

    DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

  9. World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

    cs.CV 2026-05 unverdicted novelty 6.0

    Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.

  10. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  11. From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

    cs.RO 2026-04 unverdicted novelty 6.0

    Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

  12. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  13. Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

  14. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  15. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  16. GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

    cs.CV 2026-05 unverdicted novelty 5.0

    GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.

  17. Embody4D: A Generalist 4D World Model for Embodied AI

    cs.CV 2026-05 unverdicted novelty 5.0

    Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

  18. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  19. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  20. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 19 Pith papers · 35 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.

  3. [3]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, K...

  4. [4]

    Compositional Foundation Models for Hierarchical Planning

    Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning. Advances in Neural Information Processing Systems, 36: 22304–22325, 2023.

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

  6. [6]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,

  7. [7]

    A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

    Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025.

  8. [8]

    Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024.

  9. [10]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

  10. [11]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  11. [12]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

  12. [13]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash- n...

  13. [14]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Homes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.

  14. [15]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025.

  15. [16]

    Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems (NeurIPS), 2024.

  16. [17]

    Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation

    Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, and Stefan Leutenegger. Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 27661–27672, 2025.

  17. [18]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13320–13331, 2024.

  18. [19]

    Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

    Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023.

  19. [20]

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, ...

  20. [21]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 1

  21. [22]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 2021

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022. URL https://doi.org/10...

  22. [23]

    Open6DOR: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach

    Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, and He Wang. Open6DOR: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7359–7366, 2024. doi: 10.1109/IROS58592.2024.10802733. 3

  23. [24]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied ...

  24. [25]

    Video language planning

    Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023. 3

  25. [26]

    Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023. 2, 3

  26. [27]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets.arXiv preprint arXiv:2109.13396, 2021. 3

  27. [28]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024. 4

  28. [29]

    Graspnet: A large-scale clustered and densely annotated dataset for object grasping, 2020

    Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet: A large-scale clustered and densely annotated dataset for object grasping, 2020. URL https://arxiv.org/abs/1912.13470. 25

  29. [30]

    Integrated task and motion planning. Annual review of control, robotics, and autonomous systems, 4(1):265–293, 2021

    Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems, 4(1):265–293, 2021. 1

  30. [31]

    Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023

    Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, and Leonidas Guibas. Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023. 3

  31. [32]

    Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025

    Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...

  32. [33]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850,

  33. [34]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18995–19012, 2022. 6

  34. [35]

    Learning to see before seeing: Demystifying llm visual priors from language pre-training. arXiv preprint arXiv:2509.26625, 2025

    Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, and Filippos Kokkinos. Learning to see before seeing: Demystifying llm visual priors from language pre-training. arXiv preprint arXiv:2509.26625, 2025. 1

  35. [36]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3, 5

  36. [37]

    Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 3, 4

  37. [38]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 3

  38. [39]

    Slowfast-vgen: Slow-fast learning for action-driven long video generation. arXiv preprint arXiv:2410.23277, 2024

    Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, et al. Slowfast-vgen: Slow-fast learning for action-driven long video generation. arXiv preprint arXiv:2410.23277, 2024. 3

  39. [40]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models, 2023. URL https://arxiv.org/abs/2307.05973. 3

  40. [41]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009,

  41. [42]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 3, 7

  42. [43]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/...

  43. [44]

    Learning to act from actionless videos through dense correspondences

    Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576,

  44. [45]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 6, 11

  45. [46]

    ROSE: Reconstructing objects, scenes, and trajectories from casual videos for robotic manipulation

    Peihao Li, Haoran Geng, Jameson Crate, Yanbing Han, Junyi Zhang, Feishi Wang, Charlie Tianyue Cheng, Runpei Dong, Yen-Jen Wang, Haozhe Lou, et al. ROSE: Reconstructing objects, scenes, and trajectories from casual videos for robotic manipulation. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans. 3

  46. [47]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025. 3

  47. [48]

    ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation, 2023

    Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation, 2023. 3

  48. [49]

    Vision-Language Foundation Models as Effective Robot Imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023. 3

  49. [50]

    Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos, 2024. URL https://arxiv.org/abs/2412.04463. 3, 8, 9, 23

  50. [51]

    Dreamitate: Real-world visuomotor policy learning via video generation

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. arXiv preprint arXiv:2406.16862, 2024. 3

  51. [52]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 3, 4

  52. [53]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 1

  53. [54]

    Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1

  54. [55]

    Imitation from observation: Learning to imitate behaviors from raw video via context translation

    YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1118–1125. IEEE, 2018. 3

  55. [56]

    Solving new tasks by adapting internet video knowledge. arXiv preprint arXiv:2504.15369, 2025

    Calvin Luo, Zilai Zeng, Yilun Du, and Chen Sun. Solving new tasks by adapting internet video knowledge. arXiv preprint arXiv:2504.15369, 2025. 3

  56. [57]

    Grounding video models to actions through goal conditioned exploration. arXiv preprint arXiv:2411.07223, 2024

    Yunhao Luo and Yilun Du. Grounding video models to actions through goal conditioned exploration. arXiv preprint arXiv:2411.07223, 2024. 3

  57. [58]

    Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023. 7

  58. [59]

    Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research, 83, 2025

    Robert McCarthy, Daniel CH Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G Thuruthel, and Zhibin Li. Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research, 83, 2025. 3

  59. [60]

    1x world model

    Jack Monas and Eric Jang. 1x world model. Blog post, 1X Discover, September 2024. URL https://www.1x.tech/discover/1x-world-model. Accessed: 2025-09-22. 3

  60. [61]

    GR00T N1: An open foundation model for generalist humanoid robots

    NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You...

  61. [62]

    Reconstructing hands in 3D with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024. 3, 8, 23

  62. [63]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023. 4

  63. [64]

    The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024

    Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024. 1

  64. [65]

    Strengthening generative robot policies through predictive world modeling

    Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening generative robot policies through predictive world modeling. arXiv preprint arXiv:2502.00622, 2025. 3

  65. [66]

    Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system

    Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. In Robotics: Science and Systems, 2023. 3, 9

  66. [67]

    Embodied hands: Modeling and capturing hands and bodies together

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022. 8

  67. [68]

    Rolling diffusion models

    David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 42818– 4...

  68. [69]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 13

  69. [70]

    Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022. 1

  70. [71]

    Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021

    Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021. 3

  71. [72]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. pmlr, 2015. 3, 4

  72. [73]

    History-guided video diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. 2, 3, 4, 5

  73. [74]

    curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023

    Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023. 9

  74. [75]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. 1

  75. [76]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025. 3

  76. [77]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. 1

  77. [78]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213. 3

  78. [79]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

  79. [80]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pp. 1723–1736. PMLR, 2023. 3, 7

  80. [81]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

Showing first 80 references.