Large Video Planner Enables Generalizable Robot Control

Boyuan Chen; Caiyi Zhang; Haoran Geng; Jitendra Malik; Kiwhan Song; Peihao Li; Pieter Abbeel; Russ Tedrake; Tianyuan Zhang; Vincent Sitzmann

arxiv: 2512.15840 · v2 · submitted 2025-12-17 · 💻 cs.RO · cs.CV

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Caiyi Zhang, Peihao Li, Kiwhan Song, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, Yilun Du

Pith reviewed 2026-05-16 21:24 UTC · model grok-4.3

keywords: video planning, robot foundation models, zero-shot generalization, generative video models, human activity videos, robot control, action extraction, vision-language-action

The pith

A video model trained on human activity footage generates zero-shot plans that convert into executable robot actions for novel tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that video pretraining offers a stronger foundation for robot decision-making than language or image models because videos naturally encode sequences of states and actions in the physical world. The authors assemble a large dataset of internet videos showing people performing everyday tasks and train an open generative model to predict future video frames from a current scene image plus a text instruction. These predicted videos serve as plans that are post-processed into low-level robot commands, which then run successfully on real hardware for tasks and environments never seen in training. A sympathetic reader would care because this route could let robots acquire broad capabilities from abundant existing video data instead of scarce, expensive robot-specific demonstrations.

Core claim

The authors train a generative video model at foundation-model scale on internet-scale videos of human activities and task demonstrations. Conditioned on a new scene image and natural-language instruction, the model outputs a zero-shot video plan depicting how the task should be completed. Post-processing the plan extracts low-level actions that execute successfully on real robots, including on third-party-selected tasks in the wild, without any per-task fine-tuning or additional robot data.

What carries the argument

The generative video model that predicts future frame sequences from a current observation image and task instruction, acting as the planning representation before action extraction.
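The data flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `Planner`, `Extractor`, `Action`, `plan_and_act`, and the toy stubs are hypothetical names introduced here, and the stubs merely stand in for the video model and the action-extraction step.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in for an image: a 2-D grid of pixel intensities.
Frame = List[List[float]]


@dataclass
class Action:
    """Hypothetical low-level command: a target end-effector position."""
    x: float
    y: float
    z: float


# Hypothetical interfaces; the real model and extractor are not specified here.
Planner = Callable[[Frame, str], List[Frame]]         # (scene, instruction) -> video plan
Extractor = Callable[[Sequence[Frame]], List[Action]]  # video plan -> robot actions


def plan_and_act(scene: Frame, instruction: str, planner: Planner,
                 extractor: Extractor,
                 execute: Callable[[Action], None]) -> List[Action]:
    """Generate a video plan, extract actions from it, and execute them in order."""
    video_plan = planner(scene, instruction)
    actions = extractor(video_plan)
    for action in actions:
        execute(action)
    return actions


def toy_planner(scene: Frame, instruction: str) -> List[Frame]:
    """Toy stub: repeats the scene three times instead of generating video."""
    return [scene, scene, scene]


def toy_extractor(frames: Sequence[Frame]) -> List[Action]:
    """Toy stub: emits one action per frame, moving along x."""
    return [Action(0.1 * i, 0.0, 0.0) for i, _ in enumerate(frames)]


executed: List[Action] = []
result = plan_and_act([[0.0]], "pick up the cup", toy_planner, toy_extractor,
                      executed.append)
```

The point of the separation is that the planner never sees robot-specific details; everything embodiment-dependent lives behind the extractor and the `execute` callback.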

If this is right

Robots follow instructions in previously unseen scenes and tasks without retraining.
Physical execution succeeds on real robots for third-party chosen tasks.
The same model supports both simulation and hardware deployment.
Open release of the model and video dataset allows others to reproduce and extend the results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This video-planning route may reduce dependence on costly robot demonstration datasets by leveraging existing human video corpora.
The approach suggests world models learned from video could be adapted for control across multiple robot embodiments with minimal additional work.
Future work could test whether longer-horizon or multi-step tasks remain solvable when the video predictor is scaled further.

Load-bearing premise

Converting the generated video plans into low-level robot actions works reliably for new tasks and different robot bodies without extra per-task engineering.

What would settle it

A controlled test in which the model produces visually plausible video plans for a novel task yet the extracted actions fail to complete the task on a real robot.

Figures

Figures reproduced from arXiv: 2512.15840 by Boyuan Chen, Caiyi Zhang, Haoran Geng, Jitendra Malik, Kiwhan Song, Peihao Li, Pieter Abbeel, Russ Tedrake, Tianyuan Zhang, Vincent Sitzmann, William T. Freeman, Yilun Du.

**Figure 1.** Autonomous Robot Execution with Large Video Planner. Our approach uses video generation as a visual motion planner in pixel space. From a single image and a task instruction, the model generates a video depicting how the task should be completed. The predicted human motion is then retargeted to a robot hand for real-world execution, enabling zero-shot visual planning in diverse scenes. In this work, we pro…

**Figure 2.** LVP Overview: (a) Overview of the latent video diffusion framework. We first use a temporally causal VAE to encode video clips into compressed 3D latent representations. Then we train a diffusion transformer in this latent space with flow matching objectives. (b) We jointly train image-to-video (I2V) and video-to-video (V2V) with a modified diffusion forcing training strategy. During training, a random con…

**Figure 3.** (a) Visualization of our eight dataset sources. First row: four robotics datasets. Second row: four human-centric datasets. (b) Illustration of our video diffusion sampling strategy, where scores estimated with and without history are linearly combined. Text conditioning and the diffusion transformer are omitted for clarity.

**Figure 4.** Pipeline from Video to Action. Given a generated video depicting a human hand performing a task, we first reconstruct and track the hand in 3D (second column). The reconstructed hand motion is then retargeted to dexterous hands or grippers (third column). Finally, the retargeted trajectory is transformed into the robot’s control frame and executed in the real world (rightmost column). In total, including r…

**Figure 5.** Baseline Comparison. LVP accurately generates videos of hand interactions in a zero-shot setting, such as pulling out a tissue (left) and opening a gate (right). Baseline models (Wan, Cosmos-Predict 2, Hunyuan) often produce spatial or semantic inconsistencies, highlighted by red circles. The first frame and task instruction shown under each column serve as the generation conditions.

**Figure 6.** Visualization of generated video plans. Eight examples in our in-the-wild test set with generated videos by LVP.

**Figure 7.** Multi-Stage Video Plans. Our LVP can generate long-horizon video plans by repeatedly extending videos conditioned on the last six latent frames. Each example illustrates a three-stage motion plan obtained through two iterative extensions.

**Figure 8.** Robot Execution Evaluation. Left: Comparison of Task Success Across Methods on (1) Franka Arm with Parallel-Jaw Gripper and (2) G1 with Inspire Hands. 1 denotes tests on OOD objects; 2 denotes scenes that differ substantially from the training videos. Right: Visualization of the robot tasks and experiments.

**Figure 9.** Zero Shot Robot Manipulation with LVP. The model generates videos for diverse tasks, enabling zero-shot execution on both a dexterous hand and a parallel-jaw gripper.

**Figure 10.** LVP-1M curation pipelines. Videos are collected from eight public sources, including four teleoperated robotics datasets and four human-centric activity datasets. We apply three processing stages: (1) aggressive filtering for quality and embodiment, (2) action-focused captioning using Gemini, and (3) temporal frequency alignment to match human motion speeds. The final dataset contains 1.4 million clips wi…

**Figure 11.** Composition of LVP-1M. Distribution of filtered LVP-1M clips across eight sources. Greenish tones indicate teleoperated robot datasets, while reddish tones represent human-centric activity datasets. The rightmost column shows the final sampling ratios after reweighting during first-stage training.

**Figure 12.** Comparison between HaMeR and Our 4D Alignment Module. The red curve shows our 4D Alignment Module; the blue one is from HaMeR.

**Figure 13.** Visualization of generated video plans with human hand. Each row shows eight uniformly sampled frames from a generated video plan. The first frame depicts the input condition image that defines the scene. The caption below each sequence indicates the task instruction used to generate the video.

**Figure 14.** Visualization of generated video plans with human hand and robot gripper. Each row shows eight uniformly sampled frames from a generated video plan. The first frame depicts the input condition image that defines the scene. The caption below each sequence indicates the task instruction used to generate the video.

**Figure 15.** Visualization of generated videos and corresponding robot executions – A. The first row presents five uniformly sampled frames from the generated video plan conditioned on the input scene image. The second row illustrates the robot executing the same task in the real world.

**Figure 16.** Visualization of generated videos and corresponding robot executions – B. The first row presents five uniformly sampled frames from the generated video plan conditioned on the input scene image. The second row illustrates the robot executing the same task in the real world.
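Two ideas named in the captions above lend themselves to a short sketch: the flow matching objective (Figure 2) and the linear combination of predictions made with and without history (Figure 3b). The code below is an illustration under assumptions, not the paper's implementation. It assumes the common straight-line interpolation between a noise sample and a data sample, and a guidance-style combination controlled by a hypothetical `weight` parameter; the function names are invented here, and plain Python lists stand in for latent tensors.

```python
from typing import List, Tuple

Vector = List[float]


def flow_matching_pair(data: Vector, noise: Vector, t: float) -> Tuple[Vector, Vector]:
    """Return (x_t, target velocity) on the straight path from noise to data.

    Assumed convention: x_t = (1 - t) * noise + t * data, so the target
    velocity d(x_t)/dt is simply data - noise.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def flow_matching_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def combine_with_history(with_history: Vector, without_history: Vector,
                         weight: float) -> Vector:
    """Linearly combine two model predictions.

    weight = 0 returns the history-free prediction, weight = 1 returns the
    history-conditioned one, and weight > 1 extrapolates toward the latter.
    """
    return [u + weight * (c - u) for c, u in zip(with_history, without_history)]


# Toy usage with 2-D "latents".
x_t, target = flow_matching_pair(data=[1.0, 2.0], noise=[0.0, 0.0], t=0.5)
guided = combine_with_history([1.0, 1.0], [0.0, 2.0], weight=2.0)
```

In a real sampler the two inputs to `combine_with_history` would come from two forward passes of the same network, one with the history frames provided and one with them masked out.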

Original abstract

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper trains a foundation-scale video model on human activity data for zero-shot robot planning and shows real-robot results, but the action extraction step is the unexamined hinge.

The main point is that this work skips the usual language-model starting point and instead pretrains a large video model directly on internet-scale human activity videos to generate plans, then turns those plans into robot actions that run zero-shot on real hardware for new tasks and scenes. They release the model and dataset, which is concrete and helpful for the field. The experiments include third-party wild tasks and physical robot trials, and the results look plausible on the surface.

That shift to video as the primary modality is the substantive departure from most current VLA work, and the scale plus open release gives it some weight. The evidence for generalization rests on those physical executions, which is better than pure simulation claims.

The soft spot is the post-processing step that converts generated video plans into executable low-level actions. The abstract mentions it but gives no algorithm, no assumptions about calibration or embodiment, and no error analysis or failure modes. If that conversion relies on per-task tuning or hidden heuristics, the zero-shot story does not fully hold. The paper would be tighter with quantitative breakdowns of how often the extraction succeeds or fails across robots and tasks.

This is for researchers exploring video-based alternatives to language-centric robot foundation models. Readers who want to test or extend a released video planner will get value from the artifacts and the reported physical results. It has enough new data, training scale, and hardware experiments to deserve serious referee time, even if the methods need more detail on the action pipeline.

Referee Report

1 major / 0 minor

Summary. The paper proposes an alternative to vision-language-action models by training a large open video model on an internet-scale dataset of human activities and task demonstrations. The model generates zero-shot video plans for novel scenes and tasks; these plans are post-processed to extract executable robot actions. The work reports successful physical execution on third-party-selected wild tasks and real-robot experiments, claiming robust instruction following, strong generalization across tasks and environments, and real-world feasibility. The model and dataset are released publicly.

Significance. If the empirical results hold after clarification of the post-processing pipeline, the work provides a promising video-centric foundation-model approach for robotics that leverages natural spatio-temporal alignment from video pretraining. Releasing both the model and dataset is a clear strength for reproducibility. The significance is limited by the current lack of detail on how video outputs are converted to low-level actions, which is load-bearing for the generalization claims.

major comments (1)

[Abstract and §3] Abstract and §3 (Methods): The central claim of zero-shot generalizable robot control rests on an unspecified post-processing step that converts generated video plans into executable low-level robot actions. No description is given of the extraction algorithm, its assumptions about camera calibration, embodiment mapping, inverse kinematics, or failure modes. Without this, it is impossible to determine whether the reported physical successes are attributable to the video model alone or to unstated task-specific engineering.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of our video-centric foundation model approach as well as the value of releasing both the model and dataset. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation of our methods.

Referee: [Abstract and §3] Abstract and §3 (Methods): The central claim of zero-shot generalizable robot control rests on an unspecified post-processing step that converts generated video plans into executable low-level robot actions. No description is given of the extraction algorithm, its assumptions about camera calibration, embodiment mapping, inverse kinematics, or failure modes. Without this, it is impossible to determine whether the reported physical successes are attributable to the video model alone or to unstated task-specific engineering.

Authors: We agree that the post-processing pipeline was described at too high a level in the original submission, making it difficult to fully attribute the results to the video model. In the revised manuscript we will add a dedicated subsection in §3 that specifies the full extraction procedure: (1) keypoint tracking via optical flow on the generated video frames to recover 2-D motion trajectories, (2) lifting to 3-D using known camera intrinsics and a fixed extrinsic calibration relative to the robot base, (3) mapping of end-effector velocities to joint commands via a standard Jacobian-based inverse kinematics solver for our particular robot embodiment, and (4) an explicit discussion of failure modes (e.g., tracking drift under occlusion or lighting changes) together with the simple heuristics used to detect and reject invalid plans. These additions will clarify that the post-processing is a lightweight, embodiment-agnostic procedure rather than task-specific engineering, thereby supporting the claim that generalization originates primarily from the video planner.

revision: yes
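Steps (2) and (3) of the simulated rebuttal can be made concrete with a small sketch. It illustrates the generic techniques that the rebuttal names, not the paper's confirmed pipeline (the Figure 4 caption describes 3D hand reconstruction and retargeting instead). Everything here is an assumption made for illustration: a pinhole camera with intrinsics `fx, fy, cx, cy`, a known depth per pixel, a fixed camera-to-base rotation and translation, and a planar two-link arm standing in for a real robot. The inverse kinematics step uses damped least squares, one common Jacobian-based choice.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with known depth to a 3-D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def to_base_frame(point_cam: Sequence[float],
                  rotation: Sequence[Sequence[float]],
                  translation: Sequence[float]) -> Point3:
    """Apply a fixed extrinsic calibration: p_base = R @ p_cam + t."""
    x, y, z = (
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def planar_jacobian(q1: float, q2: float, l1: float, l2: float) -> List[List[float]]:
    """Jacobian of the end-effector (x, y) position of a planar two-link arm."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def joint_velocity(jac: Sequence[Sequence[float]], ee_vel: Sequence[float],
                   damping: float = 1e-3) -> Tuple[float, float]:
    """Damped least squares: dq = J^T (J J^T + damping^2 I)^-1 v."""
    a = jac[0][0] ** 2 + jac[0][1] ** 2 + damping ** 2
    b = jac[0][0] * jac[1][0] + jac[0][1] * jac[1][1]
    d = jac[1][0] ** 2 + jac[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    # Solve the 2x2 system (J J^T + damping^2 I) y = v in closed form.
    y0 = (d * ee_vel[0] - b * ee_vel[1]) / det
    y1 = (-b * ee_vel[0] + a * ee_vel[1]) / det
    return (jac[0][0] * y0 + jac[1][0] * y1, jac[0][1] * y0 + jac[1][1] * y1)


# Toy usage: lift one tracked pixel, move it to the base frame, then
# convert a desired end-effector velocity into joint velocities.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_base = to_base_frame(p_cam, identity, (0.1, 0.0, 0.5))
jac = planar_jacobian(0.0, math.pi / 2, 1.0, 1.0)
dq = joint_velocity(jac, (0.1, -0.05))
```

The damping term keeps the solve well behaved near singular arm configurations, at the cost of a small tracking error; a real system would also need depth estimates, which a generated RGB video does not provide directly.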

Circularity Check

0 steps flagged

No circularity: empirical training on external video data with no self-referential derivation

The paper presents an empirical pipeline: curating an external internet-scale video dataset of human activities, training a generative video model at foundation scale, and post-processing outputs to robot actions for zero-shot evaluation on novel tasks. No equations, fitted parameters defined in terms of target outputs, or load-bearing self-citations appear in the abstract or claims. The central results are reported physical executions on third-party tasks rather than any derivation that reduces to its own inputs by construction. The post-processing step is described only at high level without internal definitions that would create tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical transfer of video pretraining to action planning. No new physical entities are postulated. Training involves standard deep-learning hyperparameters that are not enumerated in the abstract.

free parameters (1)

model scale and training hyperparameters
Standard large-model choices such as learning rate schedule, batch size, and architecture depth are fitted or chosen during training but not listed explicitly.

axioms (1)

domain assumption: Internet videos of human activities contain sufficient spatio-temporal structure to support zero-shot transfer to robot control
Invoked when the authors state that videos are naturally aligned with robotic behavior and that the trained model produces usable plans for novel scenes.

pith-pipeline@v0.9.0 · 5571 in / 1412 out tokens · 27421 ms · 2026-05-16T21:24:31.838408+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation
cs.RO 2026-05 unverdicted novelty 7.0

WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage trainin...

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV 2026-05 conditional novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
cs.CV 2026-04 unverdicted novelty 7.0

DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
cs.RO 2026-03 conditional novelty 7.0

GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
cs.RO 2026-02 unverdicted novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
cs.CV 2026-05 unverdicted novelty 6.0

Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.

Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
cs.RO 2026-04 unverdicted novelty 6.0

Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
cs.RO 2026-04 unverdicted novelty 6.0

EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
cs.RO 2026-04 unverdicted novelty 6.0

GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
cs.RO 2026-04 unverdicted novelty 6.0

Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
cs.CV 2026-05 unverdicted novelty 5.0

GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.

Embody4D: A Generalist 4D World Model for Embodied AI
cs.CV 2026-05 unverdicted novelty 5.0

Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

World Model for Robot Learning: A Comprehensive Survey
cs.RO 2026-04 unverdicted novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 19 Pith papers · 35 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 3, 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, K...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Compositional foundation models for hierarchical planning.Advances in Neural Information Processing Systems, 36: 22304–22325, 2023

Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning.Advances in Neural Information Processing Systems, 36: 22304–22325, 2023. 3

work page 2023
[5]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching- Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025. 3

work page internal anchor Pith review arXiv 2025
[8]

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doer- sch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 1, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 1, 9 14

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash- n...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Homes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URLhttps://openai.com/research/ video-generation-models-as-world-simulators. 3, 4

work page 2024
[15]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025. 1, 7

[16]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems (NeurIPS), 2024. 2, 3, 4, 5, 13

[17]

Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, and Stefan Leutenegger. Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 27661–27672, 2025. 3

[18]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13320–13331, 2024. 1, 6, 22

[19]

Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023. 5

[20]

Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, ...

[21]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 1

[22]

Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022. URL https://doi.org/10...

doi:10.1007/s11263-021-01531-2
[23]

Open6DOR: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach

Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, and He Wang. Open6DOR: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7359–7366, 2024. doi: 10.1109/IROS58592.2024.10802733. 3

[24]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied ...

[25]

Video language planning

Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023. 3

[26]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023. 2, 3

[27]

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021. 3

[28]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024. 4

[29]

Graspnet: A large-scale clustered and densely annotated dataset for object grasping

Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet: A large-scale clustered and densely annotated dataset for object grasping, 2020. URL https://arxiv.org/abs/1912.13470. 25

[30]

Integrated task and motion planning

Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems, 4(1):265–293, 2021. 1

[31]

Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions

Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, and Leonidas Guibas. Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023. 3

[32]

Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning

Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...

[33]

The "something something" video database for learning and evaluating visual common sense

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850, 2017.

[34]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18995–19012, 2022. 6

[35]

Learning to see before seeing: Demystifying llm visual priors from language pre-training

Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, and Filippos Kokkinos. Learning to see before seeing: Demystifying llm visual priors from language pre-training. arXiv preprint arXiv:2509.26625, 2025. 1

[36]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3, 5

[37]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 3, 4

[38]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 3

[39]

Slowfast-vgen: Slow-fast learning for action-driven long video generation

Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, et al. Slowfast-vgen: Slow-fast learning for action-driven long video generation. arXiv preprint arXiv:2410.23277, 2024. 3

[40]

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models, 2023. URL https://arxiv.org/abs/2307.05973. 3

[41]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[42]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 3, 7

[43]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/...

[44]

Learning to act from actionless videos through dense correspondences

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023.

[45]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 6, 11

[46]

ROSE: Reconstructing objects, scenes, and trajectories from casual videos for robotic manipulation

Peihao Li, Haoran Geng, Jameson Crate, Yanbing Han, Junyi Zhang, Feishi Wang, Charlie Tianyue Cheng, Runpei Dong, Yen-Jen Wang, Haozhe Lou, et al. ROSE: Reconstructing objects, scenes, and trajectories from casual videos for robotic manipulation. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans. 3

[47]

Unified Video Action Model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025. 3

[48]

ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation

Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation, 2023. 3

[49]

Vision-Language Foundation Models as Effective Robot Imitators

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023. 3

[50]

Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos, 2024. URL https://arxiv.org/abs/2412.04463. 3, 8, 9, 23

[51]

Dreamitate: Real-world visuomotor policy learning via video generation

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. arXiv preprint arXiv:2406.16862, 2024. 3

[52]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 3, 4

[53]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 1

[54]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1

[55]

Imitation from observation: Learning to imitate behaviors from raw video via context translation

YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1118–1125. IEEE, 2018. 3

[56]

Solving new tasks by adapting internet video knowledge

Calvin Luo, Zilai Zeng, Yilun Du, and Chen Sun. Solving new tasks by adapting internet video knowledge. arXiv preprint arXiv:2504.15369, 2025. 3

[57]

Grounding video models to actions through goal conditioned exploration

Yunhao Luo and Yilun Du. Grounding video models to actions through goal conditioned exploration. arXiv preprint arXiv:2411.07223, 2024. 3

[58]

Interactive language: Talking to robots in real time

Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023. 7

[59]

Towards generalist robot learning from internet video: A survey

Robert McCarthy, Daniel CH Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G Thuruthel, and Zhibin Li. Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research, 83, 2025. 3

[60]

1x world model

Jack Monas and Eric Jang. 1x world model. Blog post, 1X Discover, September 2024. URL https://www.1x.tech/discover/1x-world-model. Accessed: 2025-09-22. 3

[61]

GR00T N1: An open foundation model for generalist humanoid robots

NVIDIA, Nikita Cherniadev Johan Bjorck and Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You...

[62]

Reconstructing hands in 3D with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024. 3, 8, 23

[63]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023. 4

[64]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024. 1

[65]

Strengthening generative robot policies through predictive world modeling

Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening generative robot policies through predictive world modeling. arXiv preprint arXiv:2502.00622, 2025. 3

[66]

Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. In Robotics: Science and Systems, 2023. 3, 9

[67]

Embodied hands: Modeling and capturing hands and bodies together

Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022. 8

[68]

Rolling diffusion models

David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 42818– 4...

[69]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 13

[70]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022. 1

[71]

Concept2robot: Learning manipulation concepts from instructions and human demonstrations

Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021. 3

[72]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015. 3, 4

[73]

History-guided video diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. 2, 3, 4, 5

[74]

curobo: Parallelized collision-free minimum-jerk robot motion generation

Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023. 9

[75]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. 1

[76]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025. 3

[77]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. 1

[78]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213. 3

[79]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

[80]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pp. 1723–1736. PMLR, 2023. 3, 7

[81]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...


Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 3, 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, K...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

Compositional foundation models for hierarchical planning.Advances in Neural Information Processing Systems, 36: 22304–22325, 2023

Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning.Advances in Neural Information Processing Systems, 36: 22304–22325, 2023. 3

work page 2023

[5] [5]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching- Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025. 3

work page internal anchor Pith review arXiv 2025

[8] [8]

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doer- sch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [10]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 1, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [11]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [12]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 1, 9 14

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [13]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash- n...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [14]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Homes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URLhttps://openai.com/research/ video-generation-models-as-world-simulators. 3, 4

work page 2024

[14] [15]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [16]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems (NeurIPS), 2024

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems (NeurIPS), 2024. 2, 3, 4, 5, 13

work page 2024

[16] [17]

Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, and Stefan Leutenegger. Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pp. 27661–27672, 2025. 3

work page 2025

[17] [18]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13320–13331, 2024. 1, 6, 22

work page 2024

[18] [19]

Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining.arXiv preprint arXiv:2304.09151, 2023. 5

work page arXiv 2023

[19] [20]

Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhi- ram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [21]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 1

[21] [22]

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022. URL https://doi.org/10... doi: 10.1007/s11263-021-01531-2

[22] [23]

Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, and He Wang. Open6DOR: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7359–7366, 2024. doi: 10.1109/IROS58592.2024.10802733. 3

[23] [24]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [25]

Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023. 3

[25] [26]

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023. 2, 3

[26] [27]

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021. 3

[27] [28]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024. 4

[28] [29]

Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet: A large-scale clustered and densely annotated dataset for object grasping, 2020. URL https://arxiv.org/abs/1912.13470. 25

[29] [30]

Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems, 4(1):265–293, 2021. 1

[30] [31]

Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, and Leonidas Guibas. Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023. 3

[31] [32]

Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025

Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...

[32] [33]

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850,

[33] [34]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18995–19012, 2022. 6

[34] [35]

Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, and Filippos Kokkinos. Learning to see before seeing: Demystifying llm visual priors from language pre-training. arXiv preprint arXiv:2509.26625, 2025. 1

[35] [36]

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3, 5

[36] [37]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 3, 4

[37] [38]

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 3

[38] [39]

Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, et al. Slowfast-vgen: Slow-fast learning for action-driven long video generation. arXiv preprint arXiv:2410.23277, 2024. 3

[39] [40]

Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models, 2023. URL https://arxiv.org/abs/2307.05973. 3

[40] [41]

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009,

[41] [42]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 3, 7

[42] [43]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/...

[43] [44]

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576,

[44] [45]

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 6, 11

[45] [46]

Peihao Li, Haoran Geng, Jameson Crate, Yanbing Han, Junyi Zhang, Feishi Wang, Charlie Tianyue Cheng, Runpei Dong, Yen-Jen Wang, Haozhe Lou, et al. ROSE: Reconstructing objects, scenes, and trajectories from casual videos for robotic manipulation. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans. 3

[46] [47]

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025. 3

[47] [48]

Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation, 2023. 3

[48] [49]

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023. 3

[49] [50]

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos, 2024. URL https://arxiv.org/abs/2412.04463. 3, 8, 9, 23

[50] [51]

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. arXiv preprint arXiv:2406.16862, 2024. 3

[51] [52]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 3, 4

[52] [53]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 1

[53] [54]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1

[54] [55]

YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1118–1125. IEEE, 2018. 3

[55] [56]

Calvin Luo, Zilai Zeng, Yilun Du, and Chen Sun. Solving new tasks by adapting internet video knowledge. arXiv preprint arXiv:2504.15369, 2025. 3

[56] [57]

Yunhao Luo and Yilun Du. Grounding video models to actions through goal conditioned exploration. arXiv preprint arXiv:2411.07223, 2024. 3

[57] [58]

Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023. 7

[58] [59]

Robert McCarthy, Daniel CH Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G Thuruthel, and Zhibin Li. Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research, 83, 2025. 3

[59] [60]

Jack Monas and Eric Jang. 1x world model. Blog post, 1X Discover, September 2024. URL https://www.1x.tech/discover/1x-world-model. Accessed: 2025-09-22. 3

[60] [61]

GR00T N1: An open foundation model for generalist humanoid robots

NVIDIA, Nikita Cherniadev, Johan Bjorck, Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You...

work page 2025

[61] [62]

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024. 3, 8, 23

[62] [63]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023. 4

[63] [64]

Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024. 1

[64] [65]

Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening generative robot policies through predictive world modeling. arXiv preprint arXiv:2502.00622, 2025. 3

[65] [66]

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. In Robotics: Science and Systems, 2023. 3, 9

[66] [67]

Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022. 8

[67] [68]

David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 42818– 4...

work page 2024

[68] [69]

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 13

[69] [70]

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022. 1

[70] [71]

Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021. 3

[71] [72]

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015. 3, 4

[72] [73]

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. 2, 3, 4, 5

[73] [74]

Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023. 9

[74] [75]

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. 1

[75] [76]

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025. 3

[76] [77]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. 1

[77] [78]

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213. 3

[78] [79]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

[79] [80]

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pp. 1723–1736. PMLR, 2023. 3, 7

[80] [81]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025