Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data

Jieqi Shi; Jing Huo; Peiyang Wang; Yang Gao; Yangtao Chen; Yong-Lu Li; Zixuan Chen

arxiv: 2606.22136 · v2 · pith:YDYZP2NUnew · submitted 2026-06-20 · 💻 cs.RO

Yangtao Chen, Zixuan Chen, Peiyang Wang, Yong-Lu Li, Jing Huo, Jieqi Shi, Yang Gao

Pith reviewed 2026-06-26 11:46 UTC · model grok-4.3

classification 💻 cs.RO

keywords generative world models · dexterous manipulation · egocentric human videos · robot learning · visual language action models · data generation · zero-shot generalization

The pith

A generative world model produces 50,000 egocentric human-hand videos that convert into robot supervision and raise zero-shot dexterous task success from 8.3 percent to 38.9 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing data sources for dexterous robot learning force a choice between expensive but aligned teleoperation data, scalable but misaligned simulation, and abundant but embodiment-mismatched real videos. Wh0 instead conditions a generative video world model on language, objects, and scenes to create a large dataset of human manipulation episodes. These videos undergo hand-motion reconstruction and visual editing to produce training signals that align with robot embodiments. When this data is combined with limited real robot demonstrations, pretrained visual-language-action models achieve substantially higher success on unseen real-world tasks.

Core claim

Conditioned on language, objects, and scenes, a generative world model yields the WM-H dataset of 50k egocentric human-object interaction videos; after hand motion reconstruction and visual editing, co-training with a small amount of real robot data adapts pretrained VLA models to dexterous manipulation, raising zero-shot success across 18 real-world tasks from 8.3 percent to 38.9 percent.

What carries the argument

A generative world model that outputs controllable WM-H videos, followed by hand motion reconstruction and visual editing that turn those videos into robot-trainable supervision.

If this is right

Pretrained VLA models gain dexterous capabilities from far less real-robot data once supplemented by the generated episodes.
Performance gains are driven by the combination of large-scale video generation and explicit scene/embodiment alignment steps.
The method supports generalization across objects, scenes, and tasks that were previously limited by data scale or alignment.
Ablations confirm that removing either scalable generation or alignment reduces the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same generative pipeline could be applied to other robot skills that currently lack large aligned datasets.
If the world model can be further conditioned on robot-specific kinematics, the domain gap after editing might shrink even more.
Open-sourcing the 50k-episode dataset and code allows direct measurement of how much additional real data is still required for target performance levels.

Load-bearing premise

Videos from the generative model, once reconstructed and edited, supply supervision aligned enough with real robot bodies that the resulting training signal transfers without large unmanageable domain gaps.

What would settle it

Retraining the same VLA models once with only the robot data and once with the combined robot-plus-WM-H data, evaluating both on the 18 tasks, and finding no measurable difference in zero-shot success rates.
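One way to make "no measurable difference" concrete is a two-proportion test on per-trial outcomes. The sketch below is a minimal illustration, not the paper's evaluation protocol: the function name and the trial counts are hypothetical, since the source does not state how many rollouts were run. The counts 3 of 36 and 14 of 36 are used only because they round to 8.3% and 38.9%.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: robot-only vs. robot-plus-WM-H training.
    z, p = two_proportion_z_test(3, 36, 14, 36)
    print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

With trial counts this small the normal approximation is rough, so an exact test such as Fisher's would be a reasonable substitute.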

Figures

Figures reproduced from arXiv: 2606.22136 by Jieqi Shi, Jing Huo, Peiyang Wang, Yang Gao, Yangtao Chen, Yong-Lu Li, Zixuan Chen.

**Figure 1.** Overview of Wh0. Top: WM-H provides world-model-generated egocentric manipulation videos with diverse objects, layouts, and hand-object interactions. Middle: WM-H uniquely combines scale with low scene & embodiment gap to deployment; Wh0 converts them to robot-trainable supervision and co-trains with limited robot data atop a human-video-pretrained VLA. Bottom: The resulting policy zero-shot generalizes to…

**Figure 2.** (caption not extracted)

**Figure 3.** Policy architecture and data composition. Top: A VITRA-style policy denoises actions in the unified MANO space, conditioned on PaliGemma cognition features, FoV, and current hand state. Bottom: Pretraining mixture (VITRA-1M, Ego4D-dominant) and post-training mixtures: Wh0 uses 28% teleop and 68% WM-H, heavily oversampling robot data per-sample given 400 teleop vs. 50k WM-H samples.

**Figure 4.** Real-world evaluation setup. Unitree G1 with Inspire hands and a head-mounted egocentric camera (teleop via Vision Pro); evaluation spans seen/unseen objects and one seen plus three unseen backgrounds. Panel labels: Wipe Stains, Bottle in Plate, Grasp Tripod, Cola in Drawer, Glove in Container.

**Figure 5.** Zero-shot rollouts on real-world dexterous tasks, including container-aware placement, small-object grasping, and tool use. None of these object, container, or task combinations appear in the training set.
Accompanying text (truncated): "… mixture of 50k WM-H samples and 400 real teleoperated robot demonstrations. Each batch draws 28% from teleop, 68% from WM-H, and 4% from WM-H EA (WM-H frames after robot-hand editing for embodiment align…"

**Figure 6.** Effect of scene and embodiment alignment. Top: Without scene alignment, generated videos drift from the target workspace (left); with it (ours), they stay anchored. Middle: Embodiment alignment edits selected frames to a robot hand while preserving pose and motion. Right: Action-feature cosine similarity under original vs. edited appearance.
Accompanying text (truncated): "… well under the human-hand appearance, but degrades under the robo…"

**Figure 7.** Qualitative visualization of representative WM-H failure cases. Each panel highlights typical issues such as image editing errors, physically implausible hand-object interactions, temporal inconsistencies, instruction misalignment, and imperfect robot-hand embodiment alignment.
Accompanying text (truncated): "B Policy Training Details. B.1 Architecture and Conditioning. The policy uses a PaliGemma2-3B vision-language backbone to encode t…"

**Figure 8.** Qualitative visualization of WM-H.

**Figure 9.** Robot execution rollouts of Wh0 across various dexterous manipulation tasks.

**Figure 10.** Additional Wh0 rollouts and comparison with baseline VITRA.
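The captions for Figures 3 and 5 give batch shares of 28% teleop, 68% WM-H, and 4% WM-H EA, drawn from 400 teleop demonstrations and 50k WM-H samples. The sketch below works out what that implies per sample and shows one way such a mixture could be sampled. All names are hypothetical and this is not the authors' data loader. The size of the WM-H EA pool is not stated, so it is left out of the per-sample calculation.

```python
import random

# Figures taken from the captions above: batch shares and dataset sizes.
BATCH_SHARE = {"teleop": 0.28, "wm_h": 0.68, "wm_h_ea": 0.04}
DATASET_SIZE = {"teleop": 400, "wm_h": 50_000}  # WM-H EA size is not given


def per_sample_rate(source: str) -> float:
    """Expected share of a batch taken by one individual sample of a source."""
    return BATCH_SHARE[source] / DATASET_SIZE[source]


def oversampling_factor() -> float:
    """How much more often one teleop sample is drawn than one WM-H sample."""
    return per_sample_rate("teleop") / per_sample_rate("wm_h")


def draw_batch_sources(batch_size: int, rng: random.Random) -> list[str]:
    """Pick a data source for each batch slot according to the batch shares."""
    names = list(BATCH_SHARE)
    weights = [BATCH_SHARE[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


if __name__ == "__main__":
    print(f"oversampling factor: {oversampling_factor():.1f}x")
    batch = draw_batch_sources(256, random.Random(0))
    print({name: batch.count(name) for name in BATCH_SHARE})
```

Under these numbers each teleop demonstration is drawn roughly 51 times as often as each WM-H sample, which is the per-sample oversampling the Figure 3 caption refers to.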

Original abstract

Scaling dexterous manipulation requires generalization across objects, scenes, and tasks, yet existing data sources face a trade-off between scale and scene/embodiment alignment: teleoperation data is well aligned with robot deployment but expensive to collect; simulation is scalable but limited by the sim-to-real gap; and real egocentric videos scale effectively but remain misaligned with robot deployment. We propose Wh0, a framework that uses generative video world models as scalable and controllable sources of egocentric human-hand manipulation data to unlock the manipulation capabilities of pretrained dexterous VLA models. Conditioned on language, objects, and scenes, Wh0 uses a generative world model to produce WM-H, a 50k-episode dataset of egocentric human-object interaction videos. Wh0 then converts the generated videos into robot-trainable supervision through hand motion reconstruction and visual editing. Co-trained with a limited amount of real robot data, WM-H adapts pretrained VLA models to dexterous manipulation deployment. Across 18 real-world dexterous manipulation tasks, compared with a model post-trained only on robot data, Wh0 improves zero-shot success on unseen tasks from 8.3% to 38.9%. Ablation studies further show that scalable generation and scene/embodiment alignment are key drivers of performance gains. Videos and open-source code can be found on our project website: https://chenyt31.github.io/wh0.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a concrete lift on real dexterous tasks from generated human-hand data, but the reconstruction and editing step lacks the checks needed to trust the alignment claim.

The main thing to know is that conditioning a generative world model on language, objects, and scenes to produce 50k egocentric human manipulation videos, then reconstructing motions and editing visuals, raises zero-shot success on 18 real robot tasks from 8.3% to 38.9% when added to limited robot data.

The pipeline is new in its end-to-end focus on using world models for scalable, controllable human-hand data that gets converted for VLA co-training. The work does well by targeting the scale-versus-alignment tradeoff directly, running ablations that separate the effects of data volume and alignment, and evaluating on actual hardware rather than simulation. The mention of open-source code is also useful for later checks.

The soft spot is the conversion process itself. The abstract describes hand motion reconstruction and visual editing but gives no numbers on reconstruction accuracy, no distribution-shift metrics between the generated trajectories and real robot ones, and no controls for artifacts like inconsistent physics or pose errors. The stress-test note is right that this is the least-secured link in the 8.3%-to-38.9% result. Without those details it is hard to know whether the gains reflect genuine embodiment alignment or something narrower.

This paper is for people working on data-efficient dexterous manipulation and VLA adaptation. A reader who needs practical ways to augment small robot datasets will find the reported outcome and the ablation structure worth looking at. It deserves a serious referee because the empirical claim is concrete, the problem is central, and the real-world tasks give it weight, even though the methods section will require close examination to confirm the alignment holds.

Referee Report

3 major / 3 minor

Summary. The paper proposes Wh0, a framework that uses generative video world models conditioned on language, objects, and scenes to synthesize WM-H, a 50k-episode dataset of egocentric human hand manipulation videos. These videos are converted into robot supervision via hand motion reconstruction and visual editing, then co-trained with limited real robot data to adapt pretrained dexterous VLA models. The central empirical claim is that this yields a zero-shot success rate increase on 18 unseen real-world dexterous manipulation tasks from 8.3% (robot data only) to 38.9%, with ablations attributing gains to generation scale and scene/embodiment alignment. Open-source code and videos are provided.

Significance. If the alignment between converted WM-H data and real robot embodiment holds, the work offers a scalable alternative to teleoperation for dexterous manipulation data, addressing embodiment and sim-to-real gaps in VLA training. The open-source release of code and videos is a clear strength that supports reproducibility and further validation. The approach directly targets the scale-alignment trade-off highlighted in the abstract.

major comments (3)

[Abstract] The headline result (8.3% → 38.9% zero-shot success across 18 tasks) rests on the claim that hand motion reconstruction plus visual editing produces supervision whose distribution is sufficiently close to real robot trajectories; however, no reconstruction accuracy numbers, pose error statistics, or distribution-shift metrics (e.g., trajectory statistics or divergence measures) between WM-H and real robot data are supplied to support this.
[Methods] Conversion pipeline description: No controls or quantitative checks are reported for generative artifacts such as inconsistent physics, hand-pose hallucinations, or embodiment mismatches introduced during video generation and editing; these omissions are load-bearing because any systematic bias could inflate the reported delta without reflecting genuine generalization.
[Ablation studies] While the abstract states that ablations demonstrate the importance of scalable generation and alignment, the manuscript provides no details on how alignment was quantified or how the ablation controls isolate the contribution of the conversion step versus other factors.
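The first major comment asks for pose error statistics. One standard measure in the hand-pose literature is mean per-joint position error (MPJPE). The sketch below is a minimal version under stated assumptions: the function name is hypothetical, joints are plain 3D points, and reference joints are assumed to come from some ground-truth source that the review does not identify. Nothing here is taken from the paper's pipeline.

```python
import math
from typing import Sequence

Joint = tuple[float, float, float]


def mpjpe(predicted: Sequence[Joint], reference: Sequence[Joint]) -> float:
    """Mean Euclidean distance between matching joints, in the input units."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("joint lists must be non-empty and the same length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


if __name__ == "__main__":
    # Toy example: every reconstructed joint is offset by (3, 4, 0) mm.
    reference = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 5.0, 0.0)]
    predicted = [(x + 3.0, y + 4.0, z) for x, y, z in reference]
    print(f"MPJPE: {mpjpe(predicted, reference):.1f} mm")
```

Averaging this value over frames and episodes would give the kind of summary statistic the comment requests.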

minor comments (3)

[Results] Results tables or figures reporting the 8.3% and 38.9% figures do not include error bars or confidence intervals, making it difficult to assess the reliability of the performance delta.
The 18 tasks are referred to only as 'unseen' without a table or appendix listing task descriptions, object sets, or how they differ from any training distribution.
Dataset validation details for WM-H (e.g., diversity metrics or human preference studies on generated video quality) are not mentioned.
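The first minor comment asks for error bars on the reported success rates. A Wilson score interval is one common choice for a success proportion from a small number of trials. The sketch below is illustrative only: the function name is hypothetical, and so are the trial counts, because the source does not say how many rollouts underlie 8.3% and 38.9%. The counts 3 of 36 and 14 of 36 are picked only because they round to those figures.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    spread = math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    half_width = z * spread / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts chosen to match the reported percentages.
    for label, k, n in [("robot only", 3, 36), ("robot + WM-H", 14, 36)]:
        low, high = wilson_interval(k, n)
        print(f"{label}: {k}/{n} = {k / n:.1%}, interval [{low:.1%}, {high:.1%}]")
```

The interval narrows as the trial count grows, so the actual number of rollouts per task matters for how much weight the headline delta can bear.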

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that additional quantitative support for the data conversion pipeline would strengthen the manuscript and will revise accordingly to address each point.

Referee: [Abstract] The headline result (8.3% → 38.9% zero-shot success across 18 tasks) rests on the claim that hand motion reconstruction plus visual editing produces supervision whose distribution is sufficiently close to real robot trajectories; however, no reconstruction accuracy numbers, pose error statistics, or distribution-shift metrics (e.g., trajectory statistics or divergence measures) between WM-H and real robot data are supplied to support this.

Authors: We agree that direct quantitative metrics on reconstruction accuracy and distribution shift would provide stronger support for the central claim. In the revised version we will add a dedicated subsection reporting hand-pose reconstruction error statistics (using standard metrics from the hand-pose literature) together with trajectory-level comparisons (e.g., velocity histograms and KL divergence) between the converted WM-H data and the real-robot trajectories used in training.

revision: yes

Referee: [Methods] Conversion pipeline description: No controls or quantitative checks are reported for generative artifacts such as inconsistent physics, hand-pose hallucinations, or embodiment mismatches introduced during video generation and editing; these omissions are load-bearing because any systematic bias could inflate the reported delta without reflecting genuine generalization.

Authors: We acknowledge that explicit controls for generative artifacts are important. While the current manuscript uses downstream real-robot task success as the primary validation, we will add a new paragraph in the Methods section describing the artifact-filtering heuristics applied during dataset curation and will report simple quantitative checks (e.g., fraction of generations discarded for pose inconsistency and visual examples of retained vs. filtered frames).

revision: yes

Referee: [Ablation studies] While the abstract states that ablations demonstrate the importance of scalable generation and alignment, the manuscript provides no details on how alignment was quantified or how the ablation controls isolate the contribution of the conversion step versus other factors.

Authors: We will expand the Ablation studies section to clarify the alignment quantification (visual feature similarity and per-task success breakdowns) and to detail the experimental controls that isolate the conversion pipeline from generation scale. Additional ablation tables will be included to make these distinctions explicit.

revision: yes
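The first response proposes comparing velocity histograms with KL divergence. A minimal sketch of that comparison follows, assuming each trajectory is a sequence of 3D wrist positions sampled at a fixed interval. The function names, bin edges, time step, and smoothing constant are all hypothetical choices for illustration and do not come from the paper.

```python
import math


def speeds(positions, dt):
    """Finite-difference speeds between consecutive 3D positions."""
    return [math.dist(a, b) / dt for a, b in zip(positions, positions[1:])]


def histogram(values, bin_edges):
    """Count values per bin; the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) in nats between two histograms, with additive smoothing."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


if __name__ == "__main__":
    edges = [0.0, 0.1, 0.2, 0.4, 0.8]  # m/s, illustrative bins
    generated = histogram([0.05, 0.12, 0.15, 0.30, 0.50], edges)
    robot = histogram([0.05, 0.08, 0.11, 0.15, 0.25], edges)
    print(f"KL(generated || robot) = {kl_divergence(generated, robot):.3f} nats")
```

KL divergence is asymmetric, so reporting both directions (or a symmetric alternative) would avoid ambiguity about which dataset is treated as the reference.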

Circularity Check

0 steps flagged

No significant circularity; central result is independent empirical measurement

The paper's headline claim is an empirical zero-shot success rate improvement (8.3% → 38.9%) measured on 18 separate real-world dexterous manipulation tasks. This evaluation uses physical robot deployment and is not obtained by fitting parameters to the generated WM-H data or by any self-referential equation. The conversion pipeline (hand motion reconstruction + visual editing) is presented as a preprocessing step whose output is then tested externally; no derivation reduces the reported metric to the input data by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Central claim depends on unverified assumptions about generative model fidelity and the effectiveness of the hand reconstruction plus visual editing steps in closing the human-to-robot gap; these are not supported by external benchmarks in the abstract.

free parameters (1)

Episode count (50k)
Specific scale chosen for the generated WM-H dataset; may be tuned to balance compute and performance gains.

axioms (1)

Domain assumption: Generative world models can produce controllable, realistic egocentric human-object interaction videos when conditioned on language, objects, and scenes.
Directly invoked to justify creation of the WM-H dataset.

invented entities (1)

WM-H dataset (no independent evidence)
Purpose: Scalable, embodiment-aligned source of human manipulation episodes for robot training
Newly generated collection introduced by the framework; no independent external validation provided.

Reference graph

Works this paper leans on

63 extracted references · 1 canonical work pages

[1]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, K. Lee, S. Levine, Y . Lu, U. Malla, D. Manju- nath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, ...

2023
[2]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. T. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

2023
[3]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. In P. Agrawal, O. Kroemer, and W. Burgard, editors,Conference on Robot Learning, 6-9 November 202...

2024
[4]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control.CoRR, abs/2410.24164, 2024. 9

Pith/arXiv arXiv 2024
[5]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A. Cheng, X. Zou, Y . Fang, X. Cheng, R. Qiu, H. Yin, S. Liu, S. Han, Y . Lu, and X. Wang. Egovla: Learning vision-language-action models from egocentric human videos.CoRR, abs/2507.12440, 2025

Pith/arXiv arXiv 2025
[6]

Hoque, P

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.CoRR, abs/2505.11709, 2025

Pith/arXiv arXiv 2025
[7]

Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, Y . Zhang, X. Chen, H. Chen, L. Sun, D. Chen, J. Yang, and B. Guo. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

arXiv 2025
[8]

H. Luo, Y . Feng, W. Zhang, S. Zheng, Y . Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu. Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

arXiv 2025
[9]

H. Luo, Y . Wang, W. Zhang, S. Zheng, Z. Xi, C. Xu, H. Xu, H. Yuan, C. Zhang, Y . Wang, et al. Being-h0. 5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

arXiv 2026
[10]

H. Luo, W. Zhang, Y . Feng, S. Zheng, H. Xu, C. Xu, Z. Xi, Y . Fu, and Z. Lu. Being-h0. 7: A latent world-action model from egocentric videos.arXiv preprint arXiv:2605.00078, 2026

Pith/arXiv arXiv 2026
[11]

Gavryushin, X

A. Gavryushin, X. Wang, R. J. Malate, C. Yang, D. Liconti, R. Zurbrügg, R. K. Katzschmann, and M. Pollefeys. Maple: Encoding dexterous robotic manipulation priors learned from ego- centric videos.arXiv preprint arXiv:2504.06084, 2025

arXiv 2025
[12]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos.arXiv preprint arXiv:2503.00779, 2025

Pith/arXiv arXiv 2025
[13]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, V . Cartillier, Z. Chavis, A. Furnari, R. Girdhar, J. Ham- burger, H. Jiang, D. Kukreja, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu,...

2025
[14]

Y . Li, X. Wei, J. Luo, Y . Xiao, Y . Bai, G. Zhou, T. Zou, C. Gui, J. Wen, H. Zhang, et al. Egolive: A large-scale egocentric dataset from real-world human tasks.arXiv preprint arXiv:2604.23570, 2026

Pith/arXiv arXiv 2026
[15]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. InIEEE International Conference on Robotics and Automation, ICRA 2025, Atlanta, GA, USA, May 19-23, 2025, pages 13226– 13233. IEEE, 2025

2025
[16]

J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16939–16947. IEEE, 2025. 10

2025
[17]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. E. Wang, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. Wahid, B. Burgess-Limerick, B. Kim, B. Schölkopf...

2024
[18]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024
[19]

Mittal, P

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zur- brügg, N. Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

Pith/arXiv arXiv 2025
[20]

R. Zhao, S. Xu, R. Jin, Y . Deng, Y . Tai, K. Jia, and G. Liu. Sim2real vla: Zero-shot generaliza- tion of synthesized skills to realistic manipulation. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[21]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018

2018
[22]

Wan Team, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 11

Pith/arXiv arXiv 2025
[23]

Agarwal, A

N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y . Chen, Y . Cui, Y . Ding, D. Dworakowski, J. Fan, M. Fenzi, F. Ferroni, S. Fidler, D. Fox, S. Ge, Y . Ge, J. Gu, S. Gururani, E. He, J. Huang, J. S. Huffman, P. Jannaty, J. Jin, S. W. Kim, G. Klár, G. Lam, S. Lan, L. Leal-Taixé, A. Li, Z. Li, C. Lin, T. Lin, H. Ling, M. Liu, ...

Pith/arXiv arXiv 2025
[24]

Matsuo, Y

Y . Matsuo, Y . LeCun, M. Sahani, D. Precup, D. Silver, M. Sugiyama, E. Uchibe, and J. Mo- rimoto. Deep learning, reinforcement learning, and world models.Neural Networks, 152: 267–275, 2022

2022
[25]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-vla: A 3d vision-language-action generative world model. In R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Forty-first International Confer- ence on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, Proceeding...

2024
[26]

R. G. Goswami, A. Bar, D. Fan, T. Yang, G. Zhou, P. Krishnamurthy, M. Rabbat, F. Khorrami, and Y . LeCun. World models can leverage human videos for dexterous manipulation.CoRR, abs/2512.13644, 2025

arXiv 2025
[27]

G. Lu, B. Jia, P. Li, Y . Chen, Z. Wang, Y . Tang, and S. Huang. Gwm: Towards scalable gaus- sian world models for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9263–9274, 2025

2025
[28]

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y . Lin, L. Magne, A. Mandlekar, A. Narayan, Y . L. Tan, G. Wang, J. Wang, Q. Wang, Y . Xu, X. Zeng, K. Zheng, R. Zheng, M. Liu, L. Zettlemoyer, D. Fox, J. Kautz, S. Reed, Y . Zhu, and L. Fan. Dreamgen: Unlocking generalization in robot learning through neural trajectorie...

Pith/arXiv arXiv 2025
[29]

Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation.Advances in Neural Information Processing Systems, 36:9156–9172, 2023

2023
[30]

Bharadhwaj, D

H. Bharadhwaj, D. Dwibedi, A. Gupta, S. Tulsiani, C. Doersch, T. Xiao, D. Shah, F. Xia, D. Sadigh, and S. Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

Pith/arXiv arXiv 2024
[31]

Liang, R

J. Liang, R. Liu, E. Ozguroglu, S. Sudhakar, A. Dave, P. Tokmakov, S. Song, and C. V on- drick. Dreamitate: Real-world visuomotor policy learning via video generation.arXiv preprint arXiv:2406.16862, 2024

arXiv 2024
[32]

B. Chen, T. Zhang, H. Geng, K. Song, C. Zhang, P. Li, W. T. Freeman, J. Malik, P. Abbeel, R. Tedrake, et al. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025

Pith/arXiv arXiv 2025
[33]

H. Li, L. Sun, Y .-H. Hu, D. Ta, J. Barry, G. Konidaris, and J. Fu. Novaflow: Zero-shot manip- ulation via actionable flow from generated videos.arXiv preprint arXiv:2510.08568, 2025

arXiv 2025
[34]

B. Kim, T. Kim, J. Lee, and H. Joo. Dexterous world models.arXiv preprint arXiv:2512.17907, 2025

arXiv 2025
[35]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026. 12

Pith/arXiv arXiv 2026
[36]

Xiang, F

MotuBrain Team, C. Xiang, F. Bao, H. Liu, H. Tan, H. Bi, J. Li, J. Liu, J. Pang, K. Jing, et al. Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026

Pith/arXiv arXiv 2026
[37]

L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026
[38]

Bjorck, F

J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. LLontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, A. Zhang, ...

Pith/arXiv arXiv 2025
[39]

Black, N

Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π0.5: A vision-language-action model with open-world generalization. InProceedings of The 9th Conference on Robot Learning, 2025

2025
[40]

Physical Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, et al.π 0.7: A steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026
[41]

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. Inconference on Robot Learning, pages 991–1002. PMLR, 2022

2022
[42]

H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V . My- ers, M. J. Kim, M. Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

2023
[43]

Zhong, X

Y . Zhong, X. Huang, R. Li, C. Zhang, Z. Chen, T. Guan, F. Zeng, K. N. Lui, Y . Ye, Y . Liang, Y . Yang, and Y . Chen. Dexgraspvla: A vision-language-action framework towards general dex- terous grasping. In S. Koenig, C. Jenkins, and M. E. Taylor, editors,Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applicat...

2026
[44]

H. Liu, S. Guo, P. Mai, J. Cao, H. Li, and J. Ma. Robodexvlm: Visual language model-enabled task planning and motion control for dexterous robot manipulation. InIEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2025, Hangzhou, China, October 19-25, 2025, pages 1381–1388. IEEE, 2025

2025
[45]

Pavlakos, D

G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3d with transformers. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 9826–9836. IEEE, 2024

2024
[46]

Zhang, J

J. Zhang, J. Deng, C. Ma, and R. A. Potamias. Hawor: World-space hand motion reconstruc- tion from egocentric videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1805–1815, 2025

2025
[47]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

2023
[48]

K. Shaw, S. Bahl, and D. Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning, pages 654–665. PMLR, 2023. 13

2023
[49]

Qin, Y .-H

Y . Qin, Y .-H. Wu, S. Liu, H. Jiang, R. Yang, Y . Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. InEuropean Conference on Computer Vision, pages 570–587. Springer, 2022

2022
[50]

C. Wang, H. Shi, W. Wang, R. Zhang, L. Fei-Fei, and C. K. Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. In D. Kulic, G. Venture, K. E. Bekris, and E. Coronado, editors, Robotics: Science and Systems XX, Delft, The Netherlands, July 15-19, 2024, 2024.

[51]

R. Zheng, D. Niu, Y. Xie, J. Wang, M. Xu, Y. Jiang, F. Castañeda, F. Hu, Y. L. Tan, L. Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026.

[52]

S. Kareer, K. Pertsch, J. Darpinian, J. Hoffman, D. Xu, S. Levine, C. Finn, and S. Nair. Emergence of human to robot transfer in vision-language-action models. arXiv preprint arXiv:2512.22414, 2025.

[53]

S. Patel, S. Mohan, H. Mai, U. Jain, S. Lazebnik, and Y. Li. Robotic manipulation by imitating generated videos without physical demonstrations. arXiv preprint arXiv:2507.00990, 2025.

[54]

K. Dharmarajan, W. Huang, J. Wu, L. Fei-Fei, and R. Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow. arXiv preprint arXiv:2512.24766, 2025.

[55]

G. Team, A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Zhu, K. Li, M. Xu, Q. Deng, S. Wang, W. Qin, X. Chen, X. Wang, Y. Wang, Y. Cao, Y. Chang, Y. Xu, Y. Ye, Y. Wang, Y. Zhou, Z. Zhang, Z. Dong, and Z. Zhu. Gigaworld-0: World models as data engine to empower embodied AI. CoRR, abs/2511.19861, 2025.

[56]

C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-m. Yin, S. Bai, X. Xu, Y. Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[57]

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[58]

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: modeling and capturing hands and bodies together. ACM Trans. Graph., 36(6):245:1–245:17, 2017.

[59]

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024.

[60]

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, E. Byrne, Z. Chavis, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[61]

R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzyńska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pag...

[62]

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Wa...

[63]

Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi. HOI4D: A 4d egocentric dataset for category-level human-object interaction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 20981–20990. IEEE, 2022. doi:10.1109/CVPR52688.2022.02034. 15 Append...
