pith. machine review for the scientific record.

arxiv: 2602.06949 · v1 · submitted 2026-02-06 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 16:57 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords world model · robot learning · human videos · latent actions · dexterous manipulation · generative simulation · foundation models · robot control

The pith

A world model pretrained on 44k hours of human videos transfers to robots with accurate physics and control after minimal fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DreamDojo pretrains a generalist world model on the largest collection of egocentric human videos assembled so far. It treats continuous latent actions extracted from these unlabeled videos as proxy controls for learning interaction dynamics. Post-training on small amounts of robot-specific data then yields accurate simulation of physical outcomes and precise action controllability on out-of-distribution tasks. This setup supports applications such as teleoperation, policy testing, and planning without extensive robot data collection. The approach seeks to overcome data scarcity in robotics by bootstrapping from abundant human video.

Core claim

DreamDojo learns diverse interactions and dexterous controls from 44k hours of egocentric human videos using continuous latent actions as unified proxy actions. After post-training on small-scale target robot data, the model shows strong understanding of physics and precise action controllability on multiple challenging out-of-distribution benchmarks. A distillation pipeline further accelerates the model to real-time operation at 10.81 FPS while enhancing context consistency.

What carries the argument

Continuous latent actions, which serve as unified proxy actions learned from unlabeled videos to bridge human data to robot control.
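
To make this mechanism concrete, here is a minimal sketch, assuming a PyTorch-style setup, of how continuous latent actions could be learned from unlabeled video: an inverse-dynamics encoder compresses each frame transition into a low-dimensional continuous vector, and a forward model must reproduce the next frame from the current frame plus that vector. The module names, dimensions, and losses below are illustrative assumptions, not the paper's architecture.

    # Minimal sketch (assumed, not the paper's architecture): continuous latent
    # actions learned from unlabeled frame pairs via an inverse/forward model pair.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentActionEncoder(nn.Module):
        """Inverse dynamics: encode a transition (o_t, o_t+1) into a continuous latent action z_t."""
        def __init__(self, obs_dim: int, action_dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                nn.Linear(256, action_dim),
            )

        def forward(self, obs_t, obs_next):
            return self.net(torch.cat([obs_t, obs_next], dim=-1))

    class ForwardWorldModel(nn.Module):
        """Forward dynamics: predict o_t+1 from o_t conditioned on the latent action z_t."""
        def __init__(self, obs_dim: int, action_dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + action_dim, 256), nn.ReLU(),
                nn.Linear(256, obs_dim),
            )

        def forward(self, obs_t, z_t):
            return self.net(torch.cat([obs_t, z_t], dim=-1))

    def pretraining_step(encoder, world_model, obs_t, obs_next):
        """One unsupervised step on human video: no action labels are needed because
        the latent action z_t is inferred from the transition itself."""
        z_t = encoder(obs_t, obs_next)
        pred_next = world_model(obs_t, z_t)
        return F.mse_loss(pred_next, obs_next)

The paper's model is a generative video backbone rather than a feature-space MLP, so this sketch only conveys the proxy-action idea: no action labels enter the pretraining loss.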

If this is right

  • Supports live teleoperation of robots using the generative world model.
  • Facilitates policy evaluation in simulated environments.
  • Enables model-based planning for complex robotic tasks.
  • Provides real-time inference at over 10 FPS after distillation.
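
As a rough illustration of the teleoperation application listed above, a real-time rollout loop might look like the sketch below: encode the operator's command, let the world model imagine the next frame, and pace the loop to the distilled model's throughput. The `world_model` and `encode_command` interfaces and the 10 FPS budget are assumptions for illustration, not the paper's API.

    # Hypothetical teleoperation loop driven by a generative world model.
    # `world_model`, `encode_command`, and the reset/step interface are assumed, not the paper's API.
    import time

    def teleoperate(world_model, get_operator_command, encode_command, render, fps: float = 10.0):
        """Roll the world model forward in real time from operator commands."""
        frame = world_model.reset()              # initial observation / context
        period = 1.0 / fps                       # ~0.1 s per step at 10 FPS
        while True:
            start = time.monotonic()
            command = get_operator_command()     # e.g. a hand pose or end-effector delta
            if command is None:                  # operator ended the session
                break
            action = encode_command(command)     # map the command into the model's action space
            frame = world_model.step(frame, action)  # imagined next frame
            render(frame)                        # show the predicted outcome to the operator
            # Sleep off any leftover budget so the loop holds the target rate.
            time.sleep(max(0.0, period - (time.monotonic() - start)))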

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Scaling to even larger video corpora could further improve generalization across environments.
  • The latent action representation might apply to non-robotic control domains with similar data constraints.
  • Integration with existing robot policies could reduce the need for real-world trial-and-error learning.

Load-bearing premise

Latent actions derived from human videos can serve as effective proxies for robot actions without introducing domain gaps that impair accurate physics modeling.
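
Read operationally, the premise suggests something like the following hedged sketch: if the latent actions are good proxies, a small adapter fitted on limited robot data should be enough to map real robot actions into the latent action space the pretrained world model already understands (as in the sketch above). The adapter, dimensions, and training recipe are illustrative assumptions, not the paper's post-training procedure.

    # Hedged sketch of robot post-training: a small adapter maps labeled robot
    # actions into the latent action space learned from human video. Names and
    # dimensions are illustrative; `world_model` is the pretrained dynamics model.
    import torch.nn as nn
    import torch.nn.functional as F

    class ActionAdapter(nn.Module):
        """Small MLP projecting real robot actions into the pretrained latent action space."""
        def __init__(self, robot_action_dim: int, latent_action_dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(robot_action_dim, 128), nn.ReLU(),
                nn.Linear(128, latent_action_dim),
            )

        def forward(self, robot_action):
            return self.net(robot_action)

    def post_training_step(world_model, adapter, obs_t, robot_action, obs_next, optimizer):
        """One supervised step on a small robot dataset; the world model can stay
        frozen or be lightly fine-tuned while the adapter absorbs the embodiment gap."""
        z_t = adapter(robot_action)              # robot action -> latent proxy action
        pred_next = world_model(obs_t, z_t)      # reuse the pretrained dynamics
        loss = F.mse_loss(pred_next, obs_next)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss.detach())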

What would settle it

Demonstrating that post-training fails to produce reliable predictions of contact-rich dynamics on out-of-distribution robot benchmarks would falsify the central claim.
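
One concrete way to run that test, sketched here under assumptions (the metric, episode format, and model interface are illustrative, not the paper's evaluation protocol), is to compare forward-dynamics prediction error on held-out OOD robot rollouts for the post-trained model with and without the human-video pretraining stage.

    # Hedged sketch of the falsification test: forward-dynamics error on OOD robot
    # rollouts, with vs. without human-video pretraining. Interfaces are assumed.
    import numpy as np

    def rollout_error(world_model, episode):
        """Mean per-step prediction error along one ground-truth robot episode.
        episode = {"frames": [o_0, ..., o_T], "actions": [a_0, ..., a_T-1]} as arrays."""
        errors = []
        frame = episode["frames"][0]
        for action, target in zip(episode["actions"], episode["frames"][1:]):
            frame = world_model.step(frame, action)          # predicted next frame
            errors.append(float(np.mean((frame - target) ** 2)))
        return float(np.mean(errors))

    def compare_pretraining(model_pretrained, model_scratch, ood_episodes):
        """If pretraining does not lower error on contact-rich OOD episodes,
        the load-bearing premise is in trouble."""
        return {
            "pretrained": float(np.mean([rollout_error(model_pretrained, ep) for ep in ood_episodes])),
            "from_scratch": float(np.mean([rollout_error(model_scratch, ep) for ep in ood_episodes])),
        }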

read the original abstract

Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DreamDojo, a foundation world model pretrained on the largest reported video dataset (44k hours of egocentric human videos) by learning continuous latent actions as unified proxy controls to overcome the lack of action labels. After post-training on small-scale target robot data, the model is claimed to exhibit strong physics understanding and precise action controllability on multiple challenging out-of-distribution benchmarks. A distillation pipeline is presented to accelerate inference to 10.81 FPS while improving context consistency, enabling applications including live teleoperation, policy evaluation, and model-based planning.

Significance. If the empirical transfer results hold under rigorous controls, the work would mark a meaningful step toward scalable generalist robot world models by showing that large-scale unlabeled human video can supply interaction priors that reduce reliance on robot-specific labeled data. The real-time distillation component adds practical value for deployment. However, the absence of isolated transfer metrics in the abstract leaves the magnitude of the advance difficult to gauge against prior video-pretrained world models.

major comments (3)
  1. Abstract: The central claim of 'strong understanding of physics and precise action controllability' on OOD benchmarks is unsupported by any quantitative metrics, error bars, baseline comparisons, or ablation results; this omission is load-bearing because the abstract is the only location where the post-training transfer performance is summarized.
  2. §4 (Evaluation): No quantitative isolation experiment (e.g., forward-dynamics prediction error or controllability success rate with vs. without the 44k-hour human pretraining stage) is reported, leaving open whether continuous latent actions successfully bridge embodiment gaps or merely overfit to the small robot fine-tuning set.
  3. §3.2 (Latent Action Model): The training objective and architecture for continuous latent actions contain no reported regularization, diversity metrics, or mode-collapse diagnostics; without these, it is impossible to verify that the latent space preserves accurate contact-rich dynamics rather than learning spurious correlations that would invalidate OOD controllability claims.
minor comments (2)
  1. Abstract: '44k hours' should be expanded to '44,000 hours' for immediate readability.
  2. §5 (Distillation): The description of the teacher-student distillation pipeline would benefit from an explicit statement of the loss terms used to preserve context consistency.
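
The abstract does not state the distillation objective, so the following is only a hedged guess at what a teacher-student loss with an explicit context-consistency term could look like; every term, weight, and interface below is an assumption rather than the paper's loss.

    # Hypothetical distillation objective: match the teacher's predictions and keep
    # the student consistent across overlapping context windows. Not the paper's loss;
    # the (frames, actions) -> predictions interface is assumed for illustration.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student, teacher, frames, actions, ctx_weight: float = 0.1):
        # frames: (B, T, ...) ground-truth clip; actions: (B, T-1, ...) conditioning.
        with torch.no_grad():
            teacher_pred = teacher(frames[:, :-1], actions)       # teacher next-frame predictions
        student_full = student(frames[:, :-1], actions)           # student with the full context
        student_short = student(frames[:, 1:-1], actions[:, 1:])  # student with a truncated context

        # (1) Imitation: the fast student matches the slow teacher.
        distill = F.mse_loss(student_full, teacher_pred)
        # (2) Context consistency: predictions for the same target frames should not
        #     drift when the available context window is shortened.
        consistency = F.mse_loss(student_full[:, 1:], student_short)
        return distill + ctx_weight * consistency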

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have revised the paper to strengthen the quantitative support for our claims, including updates to the abstract and additional analyses in the evaluation and method sections. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: Abstract: The central claim of 'strong understanding of physics and precise action controllability' on OOD benchmarks is unsupported by any quantitative metrics, error bars, baseline comparisons, or ablation results; this omission is load-bearing because the abstract is the only location where the post-training transfer performance is summarized.

    Authors: We agree that the abstract should include key quantitative results to support the central claims. In the revised manuscript, we have updated the abstract to report specific metrics from our OOD benchmarks, including success rates with error bars and comparisons against baselines, to better substantiate the claims of physics understanding and action controllability. revision: yes

  2. Referee: §4 (Evaluation): No quantitative isolation experiment (e.g., forward-dynamics prediction error or controllability success rate with vs. without the 44k-hour human pretraining stage) is reported, leaving open whether continuous latent actions successfully bridge embodiment gaps or merely overfit to the small robot fine-tuning set.

    Authors: We acknowledge the importance of an explicit isolation experiment to demonstrate the benefit of large-scale human pretraining. We have added a new ablation study in §4 that quantitatively compares forward-dynamics prediction error and controllability success rates with and without the 44k-hour human pretraining stage on the OOD benchmarks, showing that pretraining provides substantial gains beyond the small robot fine-tuning data alone. revision: yes

  3. Referee: §3.2 (Latent Action Model): The training objective and architecture for continuous latent actions contain no reported regularization, diversity metrics, or mode-collapse diagnostics; without these, it is impossible to verify that the latent space preserves accurate contact-rich dynamics rather than learning spurious correlations that would invalidate OOD controllability claims.

    Authors: We have revised §3.2 to explicitly describe the regularization terms in the training objective for continuous latent actions. We now also report diversity metrics (e.g., latent space entropy) and mode-collapse diagnostics (e.g., reconstruction fidelity on contact-rich sequences) to confirm that the latent space captures meaningful dynamics. revision: yes
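
To illustrate the kind of diagnostics this response describes, one could estimate latent diversity with per-dimension variance (a collapsed dimension signals mode collapse) and a simple nearest-neighbor entropy proxy; the estimator below is a generic sketch, not the metric the authors report.

    # Generic latent-space diagnostics (illustrative, not the authors' metrics):
    # per-dimension variance flags collapsed dimensions, and a Kozachenko-Leonenko
    # style nearest-neighbor estimate serves as a differential-entropy proxy.
    import numpy as np

    def latent_diagnostics(latents: np.ndarray, collapse_tol: float = 1e-4):
        """latents: (N, D) array of continuous latent actions sampled from the encoder."""
        n, d = latents.shape
        per_dim_var = latents.var(axis=0)
        collapsed_dims = int((per_dim_var < collapse_tol).sum())

        # Nearest-neighbor distances (self excluded) for the entropy proxy.
        dists = np.linalg.norm(latents[:, None, :] - latents[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        nn_dist = dists.min(axis=1)
        # Kozachenko-Leonenko estimator, up to additive constants.
        entropy_proxy = d * float(np.mean(np.log(nn_dist + 1e-12))) + float(np.log(n - 1))

        return {
            "collapsed_dims": collapsed_dims,
            "min_dim_variance": float(per_dim_var.min()),
            "entropy_proxy": entropy_proxy,
        }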

Circularity Check

0 steps flagged

No significant circularity; the claims rest on an empirical pretraining and post-training pipeline.

full rationale

The paper describes a standard two-stage training process: unsupervised learning of continuous latent actions from 44k hours of human video as proxy controls, followed by post-training on limited robot data and evaluation on OOD benchmarks. No equations, uniqueness theorems, or fitted parameters are presented that reduce the final physics prediction or controllability claims to the inputs by construction. The central results are framed as experimental outcomes from the data mixture and distillation pipeline rather than self-definitional identities or self-citation chains. The abstract and method summary contain no load-bearing self-citations or ansatzes smuggled from prior author work that would force the reported transfer performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach assumes human egocentric videos contain transferable physics and interaction dynamics for robots; continuous latent actions are introduced as an invented proxy without independent verification outside the model itself.

axioms (1)
  • domain assumption: Human videos provide sufficient coverage of contact-rich dynamics for downstream robot tasks
    Invoked to justify pretraining on 44k hours of unlabeled video as a substitute for robot data
invented entities (1)
  • continuous latent actions: no independent evidence
    purpose: Unified proxy for missing action labels in human videos to enable knowledge transfer
    New representation introduced to bridge unlabeled video and robot control; no external falsifiable prediction given in abstract

pith-pipeline@v0.9.0 · 5672 in / 1316 out tokens · 59393 ms · 2026-05-16T16:57:28.111782+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

    cs.RO 2026-05 unverdicted novelty 7.0

    EgoTouch is a new multi-view egocentric dataset with dense bimanual tactile supervision, and TouchAnything is a baseline framework showing that wrist views improve vision-based tactile prediction over egocentric input alone.

  2. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  3. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  4. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  5. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  6. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  7. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  8. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  9. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  10. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  11. Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

    cs.CV 2026-03 unverdicted novelty 6.0

    A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.

  12. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  13. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  14. STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...

  15. Lifting Embodied World Models for Planning and Control

    cs.CV 2026-04 unverdicted novelty 5.0

    Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...

  16. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  17. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · cited by 17 Pith papers · 29 internal anchors

  1. [1]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World Simulation with Video Foundation Models for Physical AI. arXiv preprint arXiv:2511.00062, 2025. 2, 3, 5, 9

  2. [2]

    Diffusion for World Modeling: Visual Details Matter in Atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 16

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.arXiv preprint arXiv:2506.09985, 2025. 16

  4. [4]

    Whole-Body Conditioned Egocentric Video Prediction.arXiv preprint arXiv:2506.21552, 2025

    Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-Body Conditioned Egocentric Video Prediction.arXiv preprint arXiv:2506.21552, 2025

  5. [5]

    Genie 3: A New Frontier for World Models, 2025

    Philip Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, et al. Genie 3: A New Frontier for World Models, 2025. URL https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/. 2, 15, 16

  6. [6]

    Navigation World Models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation World Models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2025. 16

  7. [7]

    H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation. InProc. of the Conf. on Artificial Intelligence (AAAI), 2025. 4

  8. [8]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.arXiv preprint arXiv:2503.14734, 2025. 13, 15

  9. [9]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164, 2024. 15

  10. [10]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale.arXiv preprint arXiv:2212.06817, 2022

  11. [11]

    Genie: Generative Interactive Environments

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative Interactive Environments. InProc. of the International Conf. on Machine learning (ICML), 2024. 6, 7, 9, 16

  12. [12]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems.arXiv preprint arXiv:2503.06669, 2025. 2

  13. [13]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. InProc. Robotics: Science and Systems (RSS), 2025. 16

  14. [14]

    Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-Time Distribution-Level Composition.arXiv preprint arXiv:2510.01068, 2025

    Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, et al. Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-Time Distribution-Level Composition. arXiv preprint arXiv:2510.01068, 2025. 14

  15. [15]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large Video Planner Enables Generalizable Robot Control.arXiv preprint arXiv:2512.15840, 2025. 4

  16. [16]

    IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI.arXiv preprint arXiv:2411.00785, 2024

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI.arXiv preprint arXiv:2411.00785, 2024. 16

  17. [17]

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. villa-X: Enhancing Latent Action Modeling in Vision-Language- Action Models.arXiv preprint arXiv:2507.23682, 2025. 4

  18. [18]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.arXiv preprint arXiv:2507.06261, 2025. 10

  19. [19]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation. arXiv preprint arXiv:2510.02283, 2025. 17

  20. [20]

    RoboNet: Large-Scale Multi-Robot Learning.arXiv preprint arXiv:1910.11215, 2019

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-Scale Multi-Robot Learning.arXiv preprint arXiv:1910.11215, 2019

  21. [21]

    DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance

    Maximilian Du and Shuran Song. DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 16

  22. [22]

    Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 7, 10

  23. [23]

    AdaWorld: Learning Adaptable World Models with Latent Actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning Adaptable World Models with Latent Actions. InProc. of the International Conf. on Machine learning (ICML), 2025. 2, 6, 7, 16

  24. [24]

    Learning Latent Action World Models In The Wild.arXiv preprint arXiv:2601.05230, 2026

    Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning Latent Action World Models In The Wild.arXiv preprint arXiv:2601.05230, 2026. 16

  25. [25]

    World Models Can Leverage Human Videos for Dexterous Manipulation.arXiv preprint arXiv:2512.13644, 2025

    Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World Models Can Leverage Human Videos for Dexterous Manipulation. arXiv preprint arXiv:2512.13644, 2025. 16

  26. [26]

    MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft. arXiv preprint arXiv:2504.08388, 2025.

  27. [27]

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A Controllable Generative World Model for Robot Manipulation.arXiv preprint arXiv:2510.10125, 2025. 6, 16

  28. [28]

    Recurrent World Models Facilitate Policy Evolution

    David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. InAdvances in Neural Information Processing Systems (NeurIPS), 2018. 16

  29. [29]

    Mastering Diverse Domains through World Models.Nature, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Domains through World Models. Nature, 2025. 16

  30. [30]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training Agents Inside of Scalable World Models. arXiv preprint arXiv:2509.24527, 2025. 16

  31. [31]

    1X World Model: Evaluating Bits, not Atoms,

    Daniel Ho, Jack Monas, Juntao Ren, and Christina Yu. 1X World Model: Evaluating Bits, not Atoms,

  32. [32]

    URL https://www.1x.tech/1x-world-model.pdf. 15

  33. [33]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance.arXiv preprint arXiv:2207.12598,

  34. [34]

    RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025. 15

  35. [35]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video.arXiv preprint arXiv:2505.11709, 2025. 5

  36. [36]

    Image Quality Metrics: PSNR vs

    Alain Hore and Djemel Ziou. Image Quality Metrics: PSNR vs. SSIM. InProc. of the International Conf. on Pattern Recognition (ICPR), 2010. 10

  37. [37]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A Generative World Model for Autonomous Driving.arXiv preprint arXiv:2309.17080, 2023. 16

  38. [38]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. InProc. of the International Conf. on Learning Representations (ICLR), 2022. 15

  39. [39]

    Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis.arXiv preprint arXiv:2312.08782, 2023

    Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, et al. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis.arXiv preprint arXiv:2312.08782, 2023. 2

  40. [40]

    Vid2World: Crafting Video Diffusion Models to Interactive World Models.arXiv preprint arXiv:2505.14357, 2025

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting Video Diffusion Models to Interactive World Models.arXiv preprint arXiv:2505.14357, 2025. 5, 6

  41. [41]

    Towards Video World Models, 2025

    Xun Huang. Towards Video World Models, 2025. URL https://www.xunhuang.me/blogs/world_model.html. 8, 16

  42. [42]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 2, 8, 17

  43. [43]

    A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

    Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, and Gokul Swamy. A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 16

  44. [44]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. InProc. Conf. on Robot Learning (CoRL), 2025. 16

  45. [45]

    EnerVerse-AC: Envisioning Embodied Environments with Action Condition.arXiv preprint arXiv:2505.09723, 2025

    Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. EnerVerse-AC: Envisioning Embodied Environments with Action Condition.arXiv preprint arXiv:2505.09723, 2025. 16

  46. [46]

    World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation.arXiv preprint arXiv:2509.19080, 2025

    Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation. arXiv preprint arXiv:2509.19080, 2025. 16

  47. [47]

    World and Human Action Models Towards Gameplay Ideation.Nature, 2025

    Anssi Kanervisto, Dave Bignell, Linda Yilin Wen, Martin Grayson, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Tabish Rashid, Tim Pearce, Yuhan Cao, et al. World and Human Action Models Towards Gameplay Ideation. Nature, 2025. 16

  48. [48]

    Emergence of Human to Robot Transfer in Vision-Language-Action Models.arXiv preprint arXiv:2512.22414, 2025

    Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, and Suraj Nair. Emergence of Human to Robot Transfer in Vision-Language-Action Models.arXiv preprint arXiv:2512.22414, 2025. 4

  49. [49]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv preprint arXiv:2403.12945, 2024. 2

  50. [50]

    Learning to Simulate Dynamic Environments with GameGAN

    Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to Simulate Dynamic Environments with GameGAN. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020. 16

  51. [51]

    DriveGAN: Towards a Controllable High-Quality Neural Simulation

    Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. DriveGAN: Towards a Controllable High-Quality Neural Simulation. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR),

  52. [52]

    Auto-Encoding Variational Bayes

    Diederik Kingma and Max Welling. Auto-Encoding Variational Bayes.arXiv preprint arXiv:1312.6114,

  53. [53]

    3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996,

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996,

  54. [54]

    A Path Towards Autonomous Machine Intelligence.Open Review, 2022

    Yann LeCun. A Path Towards Autonomous Machine Intelligence.Open Review, 2022. 2

  55. [55]

    VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators.arXiv preprint arXiv:2510.00406, 2025

    Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, et al. VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators.arXiv preprint arXiv:2510.00406, 2025. 16

  56. [56]

    Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos.arXiv preprint arXiv:2510.21571, 2025

    Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, et al. Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos.arXiv preprint arXiv:2510.21571, 2025. 4

  57. [57]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified Video Action Model. InProc. Robotics: Science and Systems (RSS), 2025. 16

  58. [58]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating Real-World Robot Manipulation Policies in Simulation.arXiv preprint arXiv:2405.05941, 2024. 13

  59. [59]

    WorldEval: World Model as Real-World Robot Policies Evaluator.arXiv preprint arXiv:2505.19017, 2025

    Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. WorldEval: World Model as Real-World Robot Policies Evaluator.arXiv preprint arXiv:2505.19017, 2025. 13, 16

  60. [60]

    CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations.arXiv preprint arXiv:2505.04999, 2025

    Anthony Liang, Pavel Czempin, Matthew Hong, Yutai Zhou, Erdem Biyik, and Stephen Tu. CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations.arXiv preprint arXiv:2505.04999, 2025. 16

  61. [61]

    Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

    Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation. In Advances in Neural Information Processing Systems (NeurIPS), 2025. 16

  62. [62]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling.arXiv preprint arXiv:2210.02747, 2022. 3

  63. [63]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time.arXiv preprint arXiv:2509.25161, 2025. 17

  64. [64]

    StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

    Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, and Chunhua Shen. StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation.arXiv preprint arXiv:2510.05057, 2025. 16

  65. [65]

    EgoZero: Robot Learning from Smart Glasses.arXiv preprint arXiv:2505.20290, 2025

    Vincent Liu, Ademi Adeniji, Haotian Zhan, Siddhant Haldar, Raunaq Bhirangi, Pieter Abbeel, and Lerrel Pinto. EgoZero: Robot Learning from Smart Glasses.arXiv preprint arXiv:2505.20290, 2025. 4

  66. [66]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. InProc. of the International Conf. on Learning Representations (ICLR), 2019. 9

  67. [67]

    Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos.arXiv preprint arXiv:2507.15597, 2025

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos.arXiv preprint arXiv:2507.15597, 2025. 4

  68. [68]

    Interactive Language: Talking to Robots in Real Time.IEEE Robotics and Automation Letters (RA-L), 2023

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language: Talking to Robots in Real Time.IEEE Robotics and Automation Letters (RA-L), 2023

  69. [69]

    Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

    Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild. InProc. of the European Conf. on Computer Vision (ECCV), 2024

  70. [70]

    Structured World Models from Human Videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured World Models from Human Videos. In Proc. Robotics: Science and Systems (RSS), 2023. 16

  71. [71]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision.arXiv preprint arXiv:2304.07193, 2023. 20

  72. [72]

    EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses.arXiv preprint arXiv:2511.18173, 2025

    Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, and Juergen Gall. EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses.arXiv preprint arXiv:2511.18173, 2025

  73. [73]

    Genie 2: A Large-Scale Foundation World Model, 2024

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A Large-Scale Foundation World Model, 2024. URL https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/. 5, 16

  74. [74]

    Reconstructing Hands in 3D with Transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing Hands in 3D with Transformers. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024. 6, 11

  75. [75]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. InProc. of the IEEE International Conf. on Computer Vision (ICCV), 2023. 3

  76. [76]

    Strengthening Generative Robot Policies through Predictive World Modeling

    Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening Generative Robot Policies through Predictive World Modeling. arXiv preprint arXiv:2502.00622, 2025. 13, 16

  77. [77]

    Humanoid Policy ~ Human Policy

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid Policy ~ Human Policy. In Proc. Conf. on Robot Learning (CoRL), 2025. 4

  78. [78]

    Evaluating Robot Policies in a World Model.arXiv preprint arXiv:2506.00613, 2025

    Julian Quevedo, Percy Liang, and Sherry Yang. Evaluating Robot Policies in a World Model.arXiv preprint arXiv:2506.00613, 2025. 13, 16

  79. [79]

    General Agents Need World Models

    Jonathan Richens, Tom Everitt, and David Abel. General Agents Need World Models. InProc. of the International Conf. on Machine learning (ICML), 2025. 2, 16

  80. [80]

    Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610, 2022

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied Hands: Modeling and Capturing Hands and Bodies Together.arXiv preprint arXiv:2201.02610, 2022. 11

Showing first 80 references.