
arxiv: 2606.22136 · v2 · pith:YDYZP2NUnew · submitted 2026-06-20 · 💻 cs.RO

Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data

Pith reviewed 2026-06-26 11:46 UTC · model grok-4.3

classification 💻 cs.RO
keywords generative world models · dexterous manipulation · egocentric human videos · robot learning · visual language action models · data generation · zero-shot generalization

The pith

A generative world model produces 50,000 egocentric human-hand videos that convert into robot supervision and raise zero-shot dexterous task success from 8.3 percent to 38.9 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing data sources for dexterous robot learning force a choice between expensive but aligned teleoperation data, scalable but misaligned simulation, and abundant but embodiment-mismatched real videos. Wh0 instead conditions a generative video world model on language, objects, and scenes to create a large dataset of human manipulation episodes. These videos undergo hand-motion reconstruction and visual editing to produce training signals that align with robot embodiments. When this data is combined with limited real robot demonstrations, pretrained visual-language-action models achieve substantially higher success on unseen real-world tasks.

Core claim

Conditioned on language, objects, and scenes, a generative world model yields the WM-H dataset of 50k egocentric human-object interaction videos; after hand motion reconstruction and visual editing, co-training with a small amount of real robot data adapts pretrained VLA models to dexterous manipulation, raising zero-shot success across 18 real-world tasks from 8.3 percent to 38.9 percent.

What carries the argument

Generative world model that outputs controllable WM-H videos, followed by hand motion reconstruction and visual editing to create robot-trainable supervision.

If this is right

  • Pretrained VLA models gain dexterous capabilities from far less real-robot data once supplemented by the generated episodes.
  • Performance gains are driven by the combination of large-scale video generation and explicit scene/embodiment alignment steps.
  • The method supports generalization across objects, scenes, and tasks that were previously limited by data scale or alignment.
  • Ablations confirm that removing either scalable generation or alignment reduces the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same generative pipeline could be applied to other robot skills that currently lack large aligned datasets.
  • If the world model can be further conditioned on robot-specific kinematics, the domain gap after editing might shrink even more.
  • Open-sourcing the 50k-episode dataset and code allows direct measurement of how much additional real data is still required for target performance levels.

Load-bearing premise

Videos from the generative model, once reconstructed and edited, supply supervision aligned enough with real robot bodies that the resulting training signal transfers without large unmanageable domain gaps.

What would settle it

Retraining the same VLA models on the 18 tasks using only the robot data versus the combined robot-plus-WM-H data and finding no measurable difference in zero-shot success rates.

Figures

Figures reproduced from arXiv: 2606.22136 by Jieqi Shi, Jing Huo, Peiyang Wang, Yang Gao, Yangtao Chen, Yong-Lu Li, Zixuan Chen.

Figure 1
Overview of Wh0. Top: WM-H provides world-model-generated egocentric manipulation videos with diverse objects, layouts, and hand-object interactions. Middle: WM-H uniquely combines scale with low scene & embodiment gap to deployment; Wh0 converts them to robot-trainable supervision and co-trains with limited robot data atop a human-video-pretrained VLA. Bottom: The resulting policy zero-shot generalizes to…
Figure 2
Figure 3
Policy architecture and data composition. Top: A VITRA-style policy denoises actions in the unified MANO space, conditioned on PaliGemma cognition features, FoV, and current hand state. Bottom: Pretraining mixture (VITRA-1M, Ego4D-dominant) and post-training mixtures: Wh0 uses 28% teleop and 68% WM-H, heavily oversampling robot data per-sample given 400 teleop vs. 50k WM-H samples.
Figure 4
Real-world evaluation setup. Unitree G1 with Inspire hands and a head-mounted egocentric camera (teleop via Vision Pro); evaluation spans seen/unseen objects and one seen plus three unseen backgrounds.
Panel labels: Wipe Stains · Bottle in Plate · Grasp Tripod · Cola in Drawer · Glove in Container
Figure 5
Zero-shot rollouts on real-world dexterous tasks, including container-aware placement, small-object grasping, and tool use. None of these object, container, or task combinations appear in the training set.
… mixture of 50k WM-H samples and 400 real teleoperated robot demonstrations. Each batch draws 28% from teleop, 68% from WM-H, and 4% from WM-H EA (WM-H frames after robot-hand editing for embodiment align…
Figure 6
Effect of scene and embodiment alignment. Top: Without scene alignment, generated videos drift from the target workspace (left); with it (ours), they stay anchored. Middle: Embodiment alignment edits selected frames to a robot hand while preserving pose and motion. Right: Action-feature cosine similarity under original vs. edited appearance.
… well under the human-hand appearance, but degrades under the robo…
Figure 7
Qualitative visualization of representative WM-H failure cases. Each panel highlights typical issues such as image editing errors, physically implausible hand-object interactions, temporal inconsistencies, instruction misalignment, and imperfect robot-hand embodiment alignment.
… B Policy Training Details. B.1 Architecture and Conditioning. The policy uses a PaliGemma2-3B vision-language backbone to encode t…
Figure 8
Qualitative Visualization of WM-H
Figure 9
Robot execution rollouts of Wh0 across various dexterous manipulation tasks.
Figure 10
Additional Wh0 rollouts and comparison with baseline VITRA.
Abstract

Scaling dexterous manipulation requires generalization across objects, scenes, and tasks, yet existing data sources face a trade-off between scale and scene/embodiment alignment: teleoperation data is well aligned with robot deployment but expensive to collect; simulation is scalable but limited by the sim-to-real gap; and real egocentric videos scale effectively but remain misaligned with robot deployment. We propose Wh0, a framework that uses generative video world models as scalable and controllable sources of egocentric human-hand manipulation data to unlock the manipulation capabilities of pretrained dexterous VLA models. Conditioned on language, objects, and scenes, Wh0 uses a generative world model to produce WM-H, a 50k-episode dataset of egocentric human-object interaction videos. Wh0 then converts the generated videos into robot-trainable supervision through hand motion reconstruction and visual editing. Co-trained with a limited amount of real robot data, WM-H adapts pretrained VLA models to dexterous manipulation deployment. Across 18 real-world dexterous manipulation tasks, compared with a model post-trained only on robot data, Wh0 improves zero-shot success on unseen tasks from 8.3% to 38.9%. Ablation studies further show that scalable generation and scene/embodiment alignment are key drivers of performance gains. Videos and open-source code can be found on our project website: https://chenyt31.github.io/wh0.github.io/.
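
The co-training step can be made concrete with the numbers quoted in the figure captions: each batch draws 28% from teleop, 68% from WM-H, and 4% from WM-H EA, with 400 teleop demonstrations against 50k WM-H samples. The sketch below is only an illustration of what those fractions imply per sample; the names `MIX`, `SIZES`, `per_sample_weight`, and `draw_batch_sources` are hypothetical, the batch size of 256 is an assumption, and the paper's actual data loader is not described on this page.

```python
import random
from collections import Counter

# Batch fractions and dataset sizes as quoted in the figure captions.
# The size of the WM-H EA subset is not stated on this page, so it is omitted.
MIX = {"teleop": 0.28, "wm_h": 0.68, "wm_h_ea": 0.04}
SIZES = {"teleop": 400, "wm_h": 50_000}


def per_sample_weight(source: str) -> float:
    """Expected share of a batch occupied by one individual sample of `source`."""
    return MIX[source] / SIZES[source]


def draw_batch_sources(batch_size: int, rng: random.Random) -> Counter:
    """Choose a data source for every slot in a batch according to MIX."""
    names = list(MIX)
    picks = rng.choices(names, weights=[MIX[n] for n in names], k=batch_size)
    return Counter(picks)


# How much more often a single teleop demo is drawn than a single WM-H sample.
oversampling = per_sample_weight("teleop") / per_sample_weight("wm_h")
print(f"one teleop sample is drawn about {oversampling:.1f}x as often as one WM-H sample")

# Source counts for one batch (batch size 256 is an assumption for illustration).
print(draw_batch_sources(256, random.Random(0)))
```

Under these fractions a single teleop demonstration is drawn roughly 51 times as often as a single WM-H sample, which is the per-sample oversampling that the Figure 3 caption refers to.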

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Wh0, a framework that uses generative video world models conditioned on language, objects, and scenes to synthesize WM-H, a 50k-episode dataset of egocentric human hand manipulation videos. These videos are converted into robot supervision via hand motion reconstruction and visual editing, then co-trained with limited real robot data to adapt pretrained dexterous VLA models. The central empirical claim is that this yields a zero-shot success rate increase on 18 unseen real-world dexterous manipulation tasks from 8.3% (robot data only) to 38.9%, with ablations attributing gains to generation scale and scene/embodiment alignment. Open-source code and videos are provided.

Significance. If the alignment between converted WM-H data and real robot embodiment holds, the work offers a scalable alternative to teleoperation for dexterous manipulation data, addressing embodiment and sim-to-real gaps in VLA training. The open-source release of code and videos is a clear strength that supports reproducibility and further validation. The approach directly targets the scale-alignment trade-off highlighted in the abstract.

major comments (3)
  1. [Abstract] Abstract: The headline result (8.3% → 38.9% zero-shot success across 18 tasks) rests on the claim that hand motion reconstruction plus visual editing produces supervision whose distribution is sufficiently close to real robot trajectories; however, no reconstruction accuracy numbers, pose error statistics, or distribution-shift metrics (e.g., trajectory statistics or divergence measures) between WM-H and real robot data are supplied to support this.
  2. [Methods] Methods / conversion pipeline description: No controls or quantitative checks are reported for generative artifacts such as inconsistent physics, hand-pose hallucinations, or embodiment mismatches introduced during video generation and editing; these omissions are load-bearing because any systematic bias could inflate the reported delta without reflecting genuine generalization.
  3. [Ablation studies] Ablation studies: While the abstract states that ablations demonstrate the importance of scalable generation and alignment, the manuscript provides no details on how alignment was quantified or how the ablation controls isolate the contribution of the conversion step versus other factors.
minor comments (3)
  1. [Results] Results tables or figures reporting the 8.3% and 38.9% figures do not include error bars or confidence intervals, making it difficult to assess the reliability of the performance delta.
  2. The 18 tasks are referred to only as 'unseen' without a table or appendix listing task descriptions, object sets, or how they differ from any training distribution.
  3. Dataset validation details for WM-H (e.g., diversity metrics or human preference studies on generated video quality) are not mentioned.
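
Minor comment 1 asks for error bars on the two success rates. One common choice for a binomial success rate is the Wilson score interval; the sketch below shows how such an interval could be computed. The trial counts are hypothetical: 3 of 36 and 14 of 36 are used only because they reproduce 8.3% and 38.9%, and this page does not report how many trials were actually run. The function name `wilson_interval` is likewise illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# HYPOTHETICAL counts chosen to match the reported percentages; not from the paper.
for label, successes, trials in [("robot data only", 3, 36), ("Wh0", 14, 36)]:
    lo, hi = wilson_interval(successes, trials)
    print(f"{label}: {successes / trials:.1%} (95% interval {lo:.1%} to {hi:.1%})")
```

With small trial counts the intervals are wide, which is why the referee's request matters for judging how reliable the reported delta is.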

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that additional quantitative support for the data conversion pipeline would strengthen the manuscript and will revise accordingly to address each point.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline result (8.3% → 38.9% zero-shot success across 18 tasks) rests on the claim that hand motion reconstruction plus visual editing produces supervision whose distribution is sufficiently close to real robot trajectories; however, no reconstruction accuracy numbers, pose error statistics, or distribution-shift metrics (e.g., trajectory statistics or divergence measures) between WM-H and real robot data are supplied to support this.

    Authors: We agree that direct quantitative metrics on reconstruction accuracy and distribution shift would provide stronger support for the central claim. In the revised version we will add a dedicated subsection reporting hand-pose reconstruction error statistics (using standard metrics from the hand-pose literature) together with trajectory-level comparisons (e.g., velocity histograms and KL divergence) between the converted WM-H data and the real-robot trajectories used in training. revision: yes

  2. Referee: [Methods] Methods / conversion pipeline description: No controls or quantitative checks are reported for generative artifacts such as inconsistent physics, hand-pose hallucinations, or embodiment mismatches introduced during video generation and editing; these omissions are load-bearing because any systematic bias could inflate the reported delta without reflecting genuine generalization.

    Authors: We acknowledge that explicit controls for generative artifacts are important. While the current manuscript uses downstream real-robot task success as the primary validation, we will add a new paragraph in the Methods section describing the artifact-filtering heuristics applied during dataset curation and will report simple quantitative checks (e.g., fraction of generations discarded for pose inconsistency and visual examples of retained vs. filtered frames). revision: yes

  3. Referee: [Ablation studies] Ablation studies: While the abstract states that ablations demonstrate the importance of scalable generation and alignment, the manuscript provides no details on how alignment was quantified or how the ablation controls isolate the contribution of the conversion step versus other factors.

    Authors: We will expand the Ablation studies section to clarify the alignment quantification (visual feature similarity and per-task success breakdowns) and to detail the experimental controls that isolate the conversion pipeline from generation scale. Additional ablation tables will be included to make these distinctions explicit. revision: yes
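
The first response proposes trajectory-level comparisons such as velocity histograms and KL divergence. A minimal sketch of that kind of check is given below, assuming trajectories are sequences of 3D wrist positions sampled at a fixed time step. The helper names (`speeds`, `histogram`, `kl_divergence`), the bin range, the smoothing constant `eps`, and the toy trajectories are all assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def speeds(trajectory: Sequence[Point], dt: float) -> List[float]:
    """Finite-difference speed between consecutive 3D positions."""
    return [math.dist(a, b) / dt for a, b in zip(trajectory, trajectory[1:])]


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Count values into equal-width bins; out-of-range values go to the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return counts


def kl_divergence(p_counts: Sequence[float], q_counts: Sequence[float], eps: float = 1e-6) -> float:
    """KL(P || Q) between two histograms, with additive smoothing to avoid log(0)."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p = [c + eps for c in p_counts]
    q = [c + eps for c in q_counts]
    p_total, q_total = sum(p), sum(q)
    return sum((a / p_total) * math.log((a / p_total) / (b / q_total)) for a, b in zip(p, q))


# HYPOTHETICAL toy trajectories (metres), sampled every 0.1 s.
wm_h_traj = [(0.0, 0.0, 0.0), (0.005, 0.0, 0.0), (0.020, 0.0, 0.0), (0.045, 0.0, 0.0)]
robot_traj = [(0.0, 0.0, 0.0), (0.015, 0.0, 0.0), (0.030, 0.0, 0.0), (0.045, 0.0, 0.0)]
dt = 0.1

p_hist = histogram(speeds(wm_h_traj, dt), lo=0.0, hi=0.5, bins=5)
q_hist = histogram(speeds(robot_traj, dt), lo=0.0, hi=0.5, bins=5)
print("KL(WM-H || robot) =", kl_divergence(p_hist, q_hist))
```

KL divergence is asymmetric and sensitive to the binning and smoothing choices, so a reported number is only interpretable alongside those settings.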

Circularity Check

0 steps flagged

No significant circularity; central result is independent empirical measurement

Full rationale

The paper's headline claim is an empirical zero-shot success rate improvement (8.3% → 38.9%) measured on 18 separate real-world dexterous manipulation tasks. This evaluation uses physical robot deployment and is not obtained by fitting parameters to the generated WM-H data or by any self-referential equation. The conversion pipeline (hand motion reconstruction + visual editing) is presented as a preprocessing step whose output is then tested externally; no derivation reduces the reported metric to the input data by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Central claim depends on unverified assumptions about generative model fidelity and the effectiveness of the hand reconstruction plus visual editing steps in closing the human-to-robot gap; these are not supported by external benchmarks in the abstract.

free parameters (1)
  • Episode count (50k)
    Specific scale chosen for the generated WM-H dataset; may be tuned to balance compute and performance gains.
axioms (1)
  • domain assumption: Generative world models can produce controllable, realistic egocentric human-object interaction videos when conditioned on language, objects, and scenes.
    Directly invoked to justify creation of the WM-H dataset.
invented entities (1)
  • WM-H dataset (no independent evidence)
    purpose: Scalable, embodiment-aligned source of human manipulation episodes for robot training
    Newly generated collection introduced by the framework; no independent external validation provided.

pith-pipeline@v0.9.1-grok · 5811 in / 1447 out tokens · 25311 ms · 2026-06-26T11:46:18.640458+00:00 · methodology

