pith. machine review for the scientific record.

arxiv: 2602.06949 · v1 · submitted 2026-02-06 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 16:57 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords world model · robot learning · human videos · latent actions · dexterous manipulation · generative simulation · foundation models · robot control

The pith

A world model pretrained on 44k hours of human videos transfers to robots with accurate physics and control after minimal fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DreamDojo pretrains a generalist world model on the largest collection of egocentric human videos assembled so far. It treats continuous latent actions extracted from these unlabeled videos as proxy controls for learning interaction dynamics. Post-training on small amounts of robot-specific data then yields accurate simulation of physical outcomes and precise action controllability on out-of-distribution tasks. This setup supports applications such as teleoperation, policy testing, and planning without extensive robot data collection. The approach seeks to overcome data scarcity in robotics by bootstrapping from abundant human video.

Core claim

DreamDojo learns diverse interactions and dexterous controls from 44k hours of egocentric human videos using continuous latent actions as unified proxy actions. After post-training on small-scale target robot data, the model shows strong understanding of physics and precise action controllability on multiple challenging out-of-distribution benchmarks. A distillation pipeline further accelerates the model to real-time operation at 10.81 FPS while enhancing context consistency.

What carries the argument

Continuous latent actions, which serve as unified proxy actions learned from unlabeled videos to bridge human data to robot control.
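
To make this mechanism concrete, here is a minimal sketch, assuming a PyTorch-style setup, of how continuous latent actions could be learned from unlabeled video: an inverse-dynamics encoder compresses each frame transition into a low-dimensional continuous vector, and a forward model must reproduce the next frame from the current frame plus that vector. The module names, dimensions, and losses below are illustrative assumptions, not the paper's architecture.

    # Minimal sketch (assumed, not the paper's architecture): continuous latent
    # actions learned from unlabeled frame pairs via an inverse/forward model pair.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentActionEncoder(nn.Module):
        """Inverse dynamics: encode a transition (o_t, o_t+1) into a continuous latent action z_t."""
        def __init__(self, obs_dim: int, action_dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                nn.Linear(256, action_dim),
            )

        def forward(self, obs_t, obs_next):
            return self.net(torch.cat([obs_t, obs_next], dim=-1))

    class ForwardWorldModel(nn.Module):
        """Forward dynamics: predict o_t+1 from o_t conditioned on the latent action z_t."""
        def __init__(self, obs_dim: int, action_dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + action_dim, 256), nn.ReLU(),
                nn.Linear(256, obs_dim),
            )

        def forward(self, obs_t, z_t):
            return self.net(torch.cat([obs_t, z_t], dim=-1))

    def pretraining_step(encoder, world_model, obs_t, obs_next):
        """One unsupervised step on human video: no action labels are needed because
        the latent action z_t is inferred from the transition itself."""
        z_t = encoder(obs_t, obs_next)
        pred_next = world_model(obs_t, z_t)
        return F.mse_loss(pred_next, obs_next)

The paper's model is a generative video backbone rather than a feature-space MLP, so this sketch only conveys the proxy-action idea: no action labels enter the pretraining loss.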

If this is right

  • Supports live teleoperation of robots using the generative world model.
  • Facilitates policy evaluation in simulated environments.
  • Enables model-based planning for complex robotic tasks.
  • Provides real-time inference at over 10 FPS after distillation.
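
As a rough illustration of the teleoperation application listed above, a real-time rollout loop might look like the sketch below: encode the operator's command, let the world model imagine the next frame, and pace the loop to the distilled model's throughput. The `world_model` and `encode_command` interfaces and the 10 FPS budget are assumptions for illustration, not the paper's API.

    # Hypothetical teleoperation loop driven by a generative world model.
    # `world_model`, `encode_command`, and the reset/step interface are assumed, not the paper's API.
    import time

    def teleoperate(world_model, get_operator_command, encode_command, render, fps: float = 10.0):
        """Roll the world model forward in real time from operator commands."""
        frame = world_model.reset()              # initial observation / context
        period = 1.0 / fps                       # ~0.1 s per step at 10 FPS
        while True:
            start = time.monotonic()
            command = get_operator_command()     # e.g. a hand pose or end-effector delta
            if command is None:                  # operator ended the session
                break
            action = encode_command(command)     # map the command into the model's action space
            frame = world_model.step(frame, action)  # imagined next frame
            render(frame)                        # show the predicted outcome to the operator
            # Sleep off any leftover budget so the loop holds the target rate.
            time.sleep(max(0.0, period - (time.monotonic() - start)))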

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Scaling to even larger video corpora could further improve generalization across environments.
  • The latent action representation might apply to non-robotic control domains with similar data constraints.
  • Integration with existing robot policies could reduce the need for real-world trial-and-error learning.

Load-bearing premise

Latent actions derived from human videos can serve as effective proxies for robot actions without introducing domain gaps that impair accurate physics modeling.
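
Read operationally, the premise suggests something like the following hedged sketch: if the latent actions are good proxies, a small adapter fitted on limited robot data should be enough to map real robot actions into the latent action space the pretrained world model already understands (as in the sketch above). The adapter, dimensions, and training recipe are illustrative assumptions, not the paper's post-training procedure.

    # Hedged sketch of robot post-training: a small adapter maps labeled robot
    # actions into the latent action space learned from human video. Names and
    # dimensions are illustrative; `world_model` is the pretrained dynamics model.
    import torch.nn as nn
    import torch.nn.functional as F

    class ActionAdapter(nn.Module):
        """Small MLP projecting real robot actions into the pretrained latent action space."""
        def __init__(self, robot_action_dim: int, latent_action_dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(robot_action_dim, 128), nn.ReLU(),
                nn.Linear(128, latent_action_dim),
            )

        def forward(self, robot_action):
            return self.net(robot_action)

    def post_training_step(world_model, adapter, obs_t, robot_action, obs_next, optimizer):
        """One supervised step on a small robot dataset; the world model can stay
        frozen or be lightly fine-tuned while the adapter absorbs the embodiment gap."""
        z_t = adapter(robot_action)              # robot action -> latent proxy action
        pred_next = world_model(obs_t, z_t)      # reuse the pretrained dynamics
        loss = F.mse_loss(pred_next, obs_next)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss.detach())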

What would settle it

Demonstrating that post-training fails to produce reliable predictions of contact-rich dynamics on out-of-distribution robot benchmarks would falsify the central claim.
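
One concrete way to run that test, sketched here under assumptions (the metric, episode format, and model interface are illustrative, not the paper's evaluation protocol), is to compare forward-dynamics prediction error on held-out OOD robot rollouts for the post-trained model with and without the human-video pretraining stage.

    # Hedged sketch of the falsification test: forward-dynamics error on OOD robot
    # rollouts, with vs. without human-video pretraining. Interfaces are assumed.
    import numpy as np

    def rollout_error(world_model, episode):
        """Mean per-step prediction error along one ground-truth robot episode.
        episode = {"frames": [o_0, ..., o_T], "actions": [a_0, ..., a_T-1]} as arrays."""
        errors = []
        frame = episode["frames"][0]
        for action, target in zip(episode["actions"], episode["frames"][1:]):
            frame = world_model.step(frame, action)          # predicted next frame
            errors.append(float(np.mean((frame - target) ** 2)))
        return float(np.mean(errors))

    def compare_pretraining(model_pretrained, model_scratch, ood_episodes):
        """If pretraining does not lower error on contact-rich OOD episodes,
        the load-bearing premise is in trouble."""
        return {
            "pretrained": float(np.mean([rollout_error(model_pretrained, ep) for ep in ood_episodes])),
            "from_scratch": float(np.mean([rollout_error(model_scratch, ep) for ep in ood_episodes])),
        }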

read the original abstract

Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DreamDojo, a foundation world model pretrained on the largest reported video dataset (44k hours of egocentric human videos) by learning continuous latent actions as unified proxy controls to overcome the lack of action labels. After post-training on small-scale target robot data, the model is claimed to exhibit strong physics understanding and precise action controllability on multiple challenging out-of-distribution benchmarks. A distillation pipeline is presented to accelerate inference to 10.81 FPS while improving context consistency, enabling applications including live teleoperation, policy evaluation, and model-based planning.

Significance. If the empirical transfer results hold under rigorous controls, the work would mark a meaningful step toward scalable generalist robot world models by showing that large-scale unlabeled human video can supply interaction priors that reduce reliance on robot-specific labeled data. The real-time distillation component adds practical value for deployment. However, the absence of isolated transfer metrics in the abstract leaves the magnitude of the advance difficult to gauge against prior video-pretrained world models.

major comments (3)
  1. Abstract: The central claim of 'strong understanding of physics and precise action controllability' on OOD benchmarks is unsupported by any quantitative metrics, error bars, baseline comparisons, or ablation results; this omission is load-bearing because the abstract is the only location where the post-training transfer performance is summarized.
  2. §4 (Evaluation): No quantitative isolation experiment (e.g., forward-dynamics prediction error or controllability success rate with vs. without the 44k-hour human pretraining stage) is reported, leaving open whether continuous latent actions successfully bridge embodiment gaps or merely overfit to the small robot fine-tuning set.
  3. §3.2 (Latent Action Model): The training objective and architecture for continuous latent actions contain no reported regularization, diversity metrics, or mode-collapse diagnostics; without these, it is impossible to verify that the latent space preserves accurate contact-rich dynamics rather than learning spurious correlations that would invalidate OOD controllability claims.
minor comments (2)
  1. Abstract: '44k hours' should be expanded to '44,000 hours' for immediate readability.
  2. §5 (Distillation): The description of the teacher-student distillation pipeline would benefit from an explicit statement of the loss terms used to preserve context consistency.
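
The abstract does not state the distillation objective, so the following is only a hedged guess at what a teacher-student loss with an explicit context-consistency term could look like; every term, weight, and interface below is an assumption rather than the paper's loss.

    # Hypothetical distillation objective: match the teacher's predictions and keep
    # the student consistent across overlapping context windows. Not the paper's loss;
    # the (frames, actions) -> predictions interface is assumed for illustration.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student, teacher, frames, actions, ctx_weight: float = 0.1):
        # frames: (B, T, ...) ground-truth clip; actions: (B, T-1, ...) conditioning.
        with torch.no_grad():
            teacher_pred = teacher(frames[:, :-1], actions)       # teacher next-frame predictions
        student_full = student(frames[:, :-1], actions)           # student with the full context
        student_short = student(frames[:, 1:-1], actions[:, 1:])  # student with a truncated context

        # (1) Imitation: the fast student matches the slow teacher.
        distill = F.mse_loss(student_full, teacher_pred)
        # (2) Context consistency: predictions for the same target frames should not
        #     drift when the available context window is shortened.
        consistency = F.mse_loss(student_full[:, 1:], student_short)
        return distill + ctx_weight * consistency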

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have revised the paper to strengthen the quantitative support for our claims, including updates to the abstract and additional analyses in the evaluation and method sections. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: Abstract: The central claim of 'strong understanding of physics and precise action controllability' on OOD benchmarks is unsupported by any quantitative metrics, error bars, baseline comparisons, or ablation results; this omission is load-bearing because the abstract is the only location where the post-training transfer performance is summarized.

    Authors: We agree that the abstract should include key quantitative results to support the central claims. In the revised manuscript, we have updated the abstract to report specific metrics from our OOD benchmarks, including success rates with error bars and comparisons against baselines, to better substantiate the claims of physics understanding and action controllability. revision: yes

  2. Referee: §4 (Evaluation): No quantitative isolation experiment (e.g., forward-dynamics prediction error or controllability success rate with vs. without the 44k-hour human pretraining stage) is reported, leaving open whether continuous latent actions successfully bridge embodiment gaps or merely overfit to the small robot fine-tuning set.

    Authors: We acknowledge the importance of an explicit isolation experiment to demonstrate the benefit of large-scale human pretraining. We have added a new ablation study in §4 that quantitatively compares forward-dynamics prediction error and controllability success rates with and without the 44k-hour human pretraining stage on the OOD benchmarks, showing that pretraining provides substantial gains beyond the small robot fine-tuning data alone. revision: yes

  3. Referee: §3.2 (Latent Action Model): The training objective and architecture for continuous latent actions contain no reported regularization, diversity metrics, or mode-collapse diagnostics; without these, it is impossible to verify that the latent space preserves accurate contact-rich dynamics rather than learning spurious correlations that would invalidate OOD controllability claims.

    Authors: We have revised §3.2 to explicitly describe the regularization terms in the training objective for continuous latent actions. We now also report diversity metrics (e.g., latent space entropy) and mode-collapse diagnostics (e.g., reconstruction fidelity on contact-rich sequences) to confirm that the latent space captures meaningful dynamics. revision: yes
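
To illustrate the kind of diagnostics this response describes, one could estimate latent diversity with per-dimension variance (a collapsed dimension signals mode collapse) and a simple nearest-neighbor entropy proxy; the estimator below is a generic sketch, not the metric the authors report.

    # Generic latent-space diagnostics (illustrative, not the authors' metrics):
    # per-dimension variance flags collapsed dimensions, and a Kozachenko-Leonenko
    # style nearest-neighbor estimate serves as a differential-entropy proxy.
    import numpy as np

    def latent_diagnostics(latents: np.ndarray, collapse_tol: float = 1e-4):
        """latents: (N, D) array of continuous latent actions sampled from the encoder."""
        n, d = latents.shape
        per_dim_var = latents.var(axis=0)
        collapsed_dims = int((per_dim_var < collapse_tol).sum())

        # Nearest-neighbor distances (self excluded) for the entropy proxy.
        dists = np.linalg.norm(latents[:, None, :] - latents[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        nn_dist = dists.min(axis=1)
        # Kozachenko-Leonenko estimator, up to additive constants.
        entropy_proxy = d * float(np.mean(np.log(nn_dist + 1e-12))) + float(np.log(n - 1))

        return {
            "collapsed_dims": collapsed_dims,
            "min_dim_variance": float(per_dim_var.min()),
            "entropy_proxy": entropy_proxy,
        }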

Circularity Check

0 steps flagged

No significant circularity; the claims rest on an empirical pretraining and post-training pipeline.

full rationale

The paper describes a standard two-stage training process: unsupervised learning of continuous latent actions from 44k hours of human video as proxy controls, followed by post-training on limited robot data and evaluation on OOD benchmarks. No equations, uniqueness theorems, or fitted parameters are presented that reduce the final physics prediction or controllability claims to the inputs by construction. The central results are framed as experimental outcomes from the data mixture and distillation pipeline rather than self-definitional identities or self-citation chains. The abstract and method summary contain no load-bearing self-citations or ansatzes smuggled from prior author work that would force the reported transfer performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach assumes human egocentric videos contain transferable physics and interaction dynamics for robots; continuous latent actions are introduced as an invented proxy without independent verification outside the model itself.

axioms (1)
  • domain assumption: Human videos provide sufficient coverage of contact-rich dynamics for downstream robot tasks
    Invoked to justify pretraining on 44k hours of unlabeled video as a substitute for robot data
invented entities (1)
  • continuous latent actions: no independent evidence
    purpose: Unified proxy for missing action labels in human videos to enable knowledge transfer
    New representation introduced to bridge unlabeled video and robot control; no external falsifiable prediction given in abstract

pith-pipeline@v0.9.0 · 5672 in / 1316 out tokens · 59393 ms · 2026-05-16T16:57:28.111782+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

    cs.RO 2026-05 unverdicted novelty 7.0

    EgoTouch is a new multi-view egocentric dataset with dense bimanual tactile supervision, and TouchAnything is a baseline framework showing that wrist views improve vision-based tactile prediction over egocentric input alone.

  2. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  3. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  4. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  5. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  6. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  7. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  8. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  9. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  10. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  11. Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

    cs.CV 2026-03 unverdicted novelty 6.0

    A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.

  12. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  13. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  14. STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...

  15. Lifting Embodied World Models for Planning and Control

    cs.CV 2026-04 unverdicted novelty 5.0

    Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...

  16. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  17. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · cited by 17 Pith papers · 29 internal anchors

  1. [1]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World Simulation with Video Foundation Models for Physical AI. arXiv preprint arXiv:2511.00062, 2025. 2, 3, 5, 9

  2. [2]

    Diffusion for World Modeling: Visual Details Matter in Atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 16

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.arXiv preprint arXiv:2506.09985, 2025. 16

  4. [4]

    Whole-Body Conditioned Egocentric Video Prediction.arXiv preprint arXiv:2506.21552, 2025

    Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-Body Conditioned Egocentric Video Prediction.arXiv preprint arXiv:2506.21552, 2025

  5. [5]

    Genie 3: A New Frontier for World Models, 2025

    Philip Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, et al. Genie 3: A New Frontier for World Models, 2025. URL https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/. 2, 15, 16

  6. [6]

    Navigation World Models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation World Models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2025. 16

  7. [7]

    H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation. InProc. of the Conf. on Artificial Intelligence (AAAI), 2025. 4

  8. [8]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.arXiv preprint arXiv:2503.14734, 2025. 13, 15

  9. [9]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164, 2024. 15

  10. [10]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale.arXiv preprint arXiv:2212.06817, 2022

  11. [11]

    Genie: Generative Interactive Environments

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative Interactive Environments. InProc. of the International Conf. on Machine learning (ICML), 2024. 6, 7, 9, 16

  12. [12]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems.arXiv preprint arXiv:2503.06669, 2025. 2

  13. [13]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. InProc. Robotics: Science and Systems (RSS), 2025. 16

  14. [14]

    Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-Time Distribution-Level Composition.arXiv preprint arXiv:2510.01068, 2025

    Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, et al. Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-Time Distribution-Level Composition. arXiv preprint arXiv:2510.01068, 2025. 14

  15. [15]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large Video Planner Enables Generalizable Robot Control.arXiv preprint arXiv:2512.15840, 2025. 4

  16. [16]

    IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI.arXiv preprint arXiv:2411.00785, 2024

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI.arXiv preprint arXiv:2411.00785, 2024. 16

  17. [17]

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. villa-X: Enhancing Latent Action Modeling in Vision-Language- Action Models.arXiv preprint arXiv:2507.23682, 2025. 4

  18. [18]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.arXiv preprint arXiv:2507.06261, 2025. 10

  19. [19]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation. arXiv preprint arXiv:2510.02283, 2025. 17

  20. [20]

    RoboNet: Large-Scale Multi-Robot Learning.arXiv preprint arXiv:1910.11215, 2019

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-Scale Multi-Robot Learning.arXiv preprint arXiv:1910.11215, 2019

  21. [21]

    DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance

    Maximilian Du and Shuran Song. DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 16

  22. [22]

    Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 7, 10

  23. [23]

    AdaWorld: Learning Adaptable World Models with Latent Actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning Adaptable World Models with Latent Actions. InProc. of the International Conf. on Machine learning (ICML), 2025. 2, 6, 7, 16

  24. [24]

    Learning Latent Action World Models In The Wild.arXiv preprint arXiv:2601.05230, 2026

    Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning Latent Action World Models In The Wild.arXiv preprint arXiv:2601.05230, 2026. 16

  25. [25]

    World Models Can Leverage Human Videos for Dexterous Manipulation.arXiv preprint arXiv:2512.13644, 2025

    Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World Models Can Leverage Human Videos for Dexterous Manipulation. arXiv preprint arXiv:2512.13644, 2025. 16

  26. [26]

    MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft. arXiv preprint arXiv:2504.08388, 2025.

  27. [27]

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A Controllable Generative World Model for Robot Manipulation.arXiv preprint arXiv:2510.10125, 2025. 6, 16

  28. [28]

    Recurrent World Models Facilitate Policy Evolution

    David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. InAdvances in Neural Information Processing Systems (NeurIPS), 2018. 16

  29. [29]

    Mastering Diverse Domains through World Models.Nature, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Domains through World Models. Nature, 2025. 16

  30. [30]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training Agents Inside of Scalable World Models. arXiv preprint arXiv:2509.24527, 2025. 16

  31. [31]

    1X World Model: Evaluating Bits, not Atoms,

    Daniel Ho, Jack Monas, Juntao Ren, and Christina Yu. 1X World Model: Evaluating Bits, not Atoms,

  32. [32]

    URL https://www.1x.tech/1x-world-model.pdf. 15

  33. [33]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance.arXiv preprint arXiv:2207.12598,

  34. [34]

    RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025. 15

  35. [35]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video.arXiv preprint arXiv:2505.11709, 2025. 5

  36. [36]

    Image Quality Metrics: PSNR vs

    Alain Hore and Djemel Ziou. Image Quality Metrics: PSNR vs. SSIM. InProc. of the International Conf. on Pattern Recognition (ICPR), 2010. 10

  37. [37]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A Generative World Model for Autonomous Driving.arXiv preprint arXiv:2309.17080, 2023. 16

  38. [38]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. InProc. of the International Conf. on Learning Representations (ICLR), 2022. 15

  39. [39]

    Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis.arXiv preprint arXiv:2312.08782, 2023

    Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, et al. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis.arXiv preprint arXiv:2312.08782, 2023. 2

  40. [40]

    Vid2World: Crafting Video Diffusion Models to Interactive World Models.arXiv preprint arXiv:2505.14357, 2025

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting Video Diffusion Models to Interactive World Models.arXiv preprint arXiv:2505.14357, 2025. 5, 6

  41. [41]

    Towards Video World Models, 2025

    Xun Huang. Towards Video World Models, 2025. URL https://www.xunhuang.me/blogs/world_model.html. 8, 16

  42. [42]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 2, 8, 17

  43. [43]

    A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

    Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, and Gokul Swamy. A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 16

  44. [44]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. InProc. Conf. on Robot Learning (CoRL), 2025. 16

  45. [45]

    EnerVerse-AC: Envisioning Embodied Environments with Action Condition.arXiv preprint arXiv:2505.09723, 2025

    Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. EnerVerse-AC: Envisioning Embodied Environments with Action Condition.arXiv preprint arXiv:2505.09723, 2025. 16

  46. [46]

    World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation.arXiv preprint arXiv:2509.19080, 2025

    Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation. arXiv preprint arXiv:2509.19080, 2025. 16

  47. [47]

    World and Human Action Models Towards Gameplay Ideation.Nature, 2025

    Anssi Kanervisto, Dave Bignell, Linda Yilin Wen, Martin Grayson, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Tabish Rashid, Tim Pearce, Yuhan Cao, et al. World and Human Action Models Towards Gameplay Ideation. Nature, 2025. 16

  48. [48]

    Emergence of Human to Robot Transfer in Vision-Language-Action Models.arXiv preprint arXiv:2512.22414, 2025

    Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, and Suraj Nair. Emergence of Human to Robot Transfer in Vision-Language-Action Models.arXiv preprint arXiv:2512.22414, 2025. 4

  49. [49]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv preprint arXiv:2403.12945, 2024. 2

  50. [50]

    Learning to Simulate Dynamic Environments with GameGAN

    Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to Simulate Dynamic Environments with GameGAN. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020. 16

  51. [51]

    DriveGAN: Towards a Controllable High-Quality Neural Simulation

    Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. DriveGAN: Towards a Controllable High-Quality Neural Simulation. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR),

  52. [52]

    Auto-Encoding Variational Bayes

    Diederik Kingma and Max Welling. Auto-Encoding Variational Bayes.arXiv preprint arXiv:1312.6114,

  53. [53]

    3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996,

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996,

  54. [54]

    A Path Towards Autonomous Machine Intelligence.Open Review, 2022

    Yann LeCun. A Path Towards Autonomous Machine Intelligence.Open Review, 2022. 2

  55. [55]

    VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators.arXiv preprint arXiv:2510.00406, 2025

    Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, et al. VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators.arXiv preprint arXiv:2510.00406, 2025. 16

  56. [56]

    Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos.arXiv preprint arXiv:2510.21571, 2025

    Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, et al. Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos.arXiv preprint arXiv:2510.21571, 2025. 4

  57. [57]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified Video Action Model. InProc. Robotics: Science and Systems (RSS), 2025. 16

  58. [58]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating Real-World Robot Manipulation Policies in Simulation.arXiv preprint arXiv:2405.05941, 2024. 13

  59. [59]

    WorldEval: World Model as Real-World Robot Policies Evaluator.arXiv preprint arXiv:2505.19017, 2025

    Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. WorldEval: World Model as Real-World Robot Policies Evaluator.arXiv preprint arXiv:2505.19017, 2025. 13, 16

  60. [60]

    CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations.arXiv preprint arXiv:2505.04999, 2025

    Anthony Liang, Pavel Czempin, Matthew Hong, Yutai Zhou, Erdem Biyik, and Stephen Tu. CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations.arXiv preprint arXiv:2505.04999, 2025. 16

  61. [61]

    Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

    Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation. In Advances in Neural Information Processing Systems (NeurIPS), 2025. 16

  62. [62]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling.arXiv preprint arXiv:2210.02747, 2022. 3

  63. [63]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time.arXiv preprint arXiv:2509.25161, 2025. 17

  64. [64]

    StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

    Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, and Chunhua Shen. StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation.arXiv preprint arXiv:2510.05057, 2025. 16

  65. [65]

    EgoZero: Robot Learning from Smart Glasses.arXiv preprint arXiv:2505.20290, 2025

    Vincent Liu, Ademi Adeniji, Haotian Zhan, Siddhant Haldar, Raunaq Bhirangi, Pieter Abbeel, and Lerrel Pinto. EgoZero: Robot Learning from Smart Glasses.arXiv preprint arXiv:2505.20290, 2025. 4

  66. [66]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. InProc. of the International Conf. on Learning Representations (ICLR), 2019. 9

  67. [67]

    Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos.arXiv preprint arXiv:2507.15597, 2025

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos.arXiv preprint arXiv:2507.15597, 2025. 4

  68. [68]

    Interactive Language: Talking to Robots in Real Time.IEEE Robotics and Automation Letters (RA-L), 2023

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language: Talking to Robots in Real Time.IEEE Robotics and Automation Letters (RA-L), 2023

  69. [69]

    Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

    Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild. InProc. of the European Conf. on Computer Vision (ECCV), 2024

  70. [70]

    Structured World Models from Human Videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured World Models from Human Videos. In Proc. Robotics: Science and Systems (RSS), 2023. 16

  71. [71]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision.arXiv preprint arXiv:2304.07193, 2023. 20

  72. [72]

    EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses.arXiv preprint arXiv:2511.18173, 2025

    Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, and Juergen Gall. EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses.arXiv preprint arXiv:2511.18173, 2025

  73. [73]

    Genie 2: A Large-Scale Foundation World Model, 2024

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A Large-Scale Foundation World Model, 2024. URL https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/. 5, 16

  74. [74]

    Reconstructing Hands in 3D with Transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing Hands in 3D with Transformers. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024. 6, 11

  75. [75]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. InProc. of the IEEE International Conf. on Computer Vision (ICCV), 2023. 3

  76. [76]

    Strengthening Generative Robot Policies through Predictive World Modeling

    Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening Generative Robot Policies through Predictive World Modeling. arXiv preprint arXiv:2502.00622, 2025. 13, 16

  77. [77]

    Humanoid Policy ~ Human Policy

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid Policy ~ Human Policy. In Proc. Conf. on Robot Learning (CoRL), 2025. 4

  78. [78]

    Evaluating Robot Policies in a World Model.arXiv preprint arXiv:2506.00613, 2025

    Julian Quevedo, Percy Liang, and Sherry Yang. Evaluating Robot Policies in a World Model.arXiv preprint arXiv:2506.00613, 2025. 13, 16

  79. [79]

    General Agents Need World Models

    Jonathan Richens, Tom Everitt, and David Abel. General Agents Need World Models. InProc. of the International Conf. on Machine learning (ICML), 2025. 2, 16

  80. [80]

    Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610, 2022

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied Hands: Modeling and Capturing Hands and Bodies Together.arXiv preprint arXiv:2201.02610, 2022. 11

Showing first 80 references.