pith. sign in

arxiv: 2605.15477 · v1 · pith:WV4JHJDGnew · submitted 2026-05-14 · 💻 cs.CV

EgoExo-WM: Unlocking Exo Video for Ego World Models

Pith reviewed 2026-05-19 14:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric world modelsexocentric videovideo transformationbody pose extractionaction-conditioned predictionrobot planningvisual goal reaching
0
0 comments X

The pith

Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address the shortage of egocentric video data for world models by developing a conversion process that turns plentiful exocentric footage into aligned egocentric training examples. It extracts body poses from the exo videos to represent actions and applies a transformation guided by human kinematics to generate ego perspectives. A reader would care because this conversion could expand usable training data dramatically, leading to world models that better predict future states and plan sequences of body poses to reach visual goals. If the approach holds, agents could draw on in-the-wild videos to improve performance in prediction and planning tasks without needing massive new egocentric recordings.

Core claim

Extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior, unlocks the integration of in-the-wild exocentric data for egocentric world model training, with the result that training whole-body action-conditioned egocentric world models on the converted data significantly improves both prediction quality and downstream planning performance where the sequence of body poses needed to achieve a visual goal state is inferred.

What carries the argument

The exocentric-to-egocentric video transformation that extracts body poses and applies a human kinematics prior to produce action-aligned egocentric training data.

If this is right

  • Whole-body action-conditioned egocentric world models achieve higher prediction quality when trained on the converted data.
  • Downstream planning improves by more accurately inferring sequences of body poses that reach a desired visual goal state.
  • Arbitrary in-the-wild videos become usable as sources for building egocentric world models.
  • Applications in robot planning and augmented-reality guidance gain from the expanded training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conversion pipeline could be tested on other partial-observability settings where exo-style footage is easier to obtain than ego-style footage.
  • If the kinematics prior generalizes across body types and camera angles, the method might scale to crowd-sourced video collections without per-video manual alignment.
  • Planning performance gains might translate to real-robot control loops if the inferred pose sequences are executed on hardware with similar kinematics.

Load-bearing premise

The exocentric-to-egocentric video transformation informed by a human kinematics prior produces training data whose action representation and visual statistics remain sufficiently faithful to real egocentric observations that downstream gains are not artifacts of the conversion.

What would settle it

Retraining the world models on the converted data and measuring no gain in future-frame prediction accuracy or in success rate at inferring body-pose sequences for visual goals, relative to models trained only on native egocentric data, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.15477 by Danny Tran, Kristen Grauman, Roberto Mart\'in-Mart\'in.

Figure 1
Figure 1. Figure 1: Overview. Egocentric video provides an embodied view but often hides the body and occludes the hands, while exocentric video often reveals full-body motion (a). EgoExo-WM uses recovered 3D human motion as a bridge for learning an egocentric world-model with exocentric video: it defines the action sequence and guides exo-to-ego synthesis into action￾aligned egocentric observations (b). The learned world mod… view at source ↗
Figure 2
Figure 2. Figure 2: World Model Training. EgoExo-WM unlocks exocentric video for egocentric world model training. Given an exocentric video, we recover 3D human motion which we use alongside the original video to ground our exo-to-ego conversion. The 3D human motion becomes our actions and the converted exocentric video becomes the egocentric observation. We then train EgoExo-WM autoregressively with teacher forcing. We apply… view at source ↗
Figure 3
Figure 3. Figure 3: EgoX-Body Qualitative Compar￾ison. EgoX-Body better grounds generated egocentric video in human motion and inter￾action structure. On the egocentric side, we introduce an egocentric hand kinematics conditioning to directly reflect the in￾teractions that define egocentric video. We condition the model with a drawn hand-skeleton overlay that exposes hand kinematics, helping generate consistent hand motion an… view at source ↗
Figure 4
Figure 4. Figure 4: EgoX-Body Inference Overview. From exocentric videos, we extract body pose and lift the scene into a 3D point cloud. The body skeleton is overlaid onto the exocentric video, while the same pose and geometry are used to render an egocentric prior with predicted hand locations. We form two latent inputs: (1) the clean exocentric latent concatenated with noise, and (2) the body-overlaid exocentric latent conc… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative planning results. From an observation and a visual goal, a trajectory sampler proposes candidate motion sequences, and the world model ranks them to select the one whose predicted outcome best matches the goal. In the first example, the goal is to move left toward the sink, whereas in the second, the goal is to pour cereal. EgoExo-WM chooses trajectories that better match the ground-truth behav… view at source ↗
Figure 6
Figure 6. Figure 6: Examples of failure cases in Internet ego-view videos. We show representative clips with large white or black regions, which commonly arise from videos where the person is directly facing the camera where the egocentric prior is essentially black. These failure cases provide little useful training signal, motivating the automatic filtering criteria described in Section A.3.3. We retain clips satisfying bla… view at source ↗
read the original abstract

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents EgoExo-WM, a method that extracts structured body poses from abundant exocentric videos to represent actions and transforms the videos into egocentric views using a human kinematics prior. This converted data is then used to train whole-body action-conditioned egocentric world models, with the authors claiming significant gains in prediction quality and in downstream planning where the model infers sequences of body poses to reach a specified visual goal state.

Significance. If the conversion process faithfully preserves visual statistics and action alignment with real egocentric observations, the work could meaningfully expand training resources for egocentric world models beyond current data-scarce regimes. This would support stronger embodied prediction and planning systems with applications in robotics and augmented-reality guidance.

major comments (2)
  1. [Video transformation section] Video transformation section (description of exocentric-to-egocentric synthesis via human kinematics prior): The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.
  2. [Experiments and results] Experiments and results (prediction quality and planning sections): The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.
minor comments (2)
  1. [Introduction] Define 'whole-body action-conditioned' and the precise action representation (pose sequences) more explicitly in the introduction to prevent reader ambiguity.
  2. [Figures] Figure captions and pipeline diagrams should include explicit labels for the kinematics prior application and the action extraction step for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the validation of the video transformation and the experimental analysis.

read point-by-point responses
  1. Referee: [Video transformation section] Video transformation section (description of exocentric-to-egocentric synthesis via human kinematics prior): The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.

    Authors: We agree that quantitative fidelity metrics would provide stronger direct evidence that the synthesized views align with real egocentric statistics. The current manuscript relies on downstream task improvements as indirect validation of the kinematics prior. In the revision we will add distribution-matching results (e.g., FID or perceptual metrics) between synthesized and real egocentric videos as well as an ablation comparing models trained on converted data versus real ego corpora. revision: yes

  2. Referee: [Experiments and results] Experiments and results (prediction quality and planning sections): The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.

    Authors: The experiments section already reports quantitative prediction and planning metrics with several baselines. To make the contribution of the exo-to-ego conversion explicit, we will add (i) a real-ego-only baseline, (ii) an ablation that trains the world model with and without the converted data, and (iii) error analysis broken down by action type and prediction horizon in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external evaluation of converted data.

full rationale

The paper describes an exocentric-to-egocentric video conversion step that uses a human kinematics prior to extract body poses and synthesize ego views, then trains action-conditioned world models on the resulting data and reports measured improvements in prediction quality and planning performance. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains to quantities defined by construction within the same paper; the results are presented as outcomes of training and downstream evaluation on the transformed corpus. The derivation chain therefore remains self-contained against the external benchmarks and real-world planning tasks referenced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the method implicitly depends on the existence and applicability of a human kinematics prior and on body pose being a sufficient action representation.

axioms (1)
  • domain assumption A human kinematics prior exists that can accurately map exocentric body poses and visuals into corresponding egocentric observations without introducing systematic bias for world-model training.
    The abstract states that the transformation is 'informed by a human kinematics prior' and treats this step as the bridge that unlocks exo data.

pith-pipeline@v0.9.0 · 5716 in / 1398 out tokens · 56409 ms · 2026-05-19T14:28:18.801641+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Cosmos-Transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492,

    Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025

  3. [3]

    Fiction: 4d future interaction prediction from video

    Kumar Ashutosh, Georgios Pavlakos, and Kristen Grauman. Fiction: 4d future interaction prediction from video. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17613–17625, 2025

  4. [4]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  5. [5]

    Safemimic: Towards safe and autonomous human-to-robot imitation for mobile manipulation.arXiv preprint arXiv:2506.15847, 2025

    Arpit Bahety, Arnav Balaji, Ben Abbatematteo, and Roberto Martín-Martín. Safemimic: Towards safe and autonomous human-to-robot imitation for mobile manipulation.arXiv preprint arXiv:2506.15847, 2025

  6. [6]

    Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

    Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

  7. [7]

    Prentice-hall Englewood Cliffs, NJ, 1977

    Albert Bandura and Richard H Walters.Social learning theory, volume 1. Prentice-hall Englewood Cliffs, NJ, 1977

  8. [8]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025

  9. [9]

    Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, 2021

  10. [10]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

  11. [11]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015

  12. [12]

    Smpler-x: Scaling up expressive human pose and shape estimation

    Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Wang Yanjun, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, et al. Smpler-x: Scaling up expressive human pose and shape estimation. Advances in Neural Information Processing Systems, 36:11454–11468, 2023

  13. [13]

    Augmented reality technologies, systems and applications.Multimedia tools and applications, 51(1): 341–377, 2011

    Julie Carmigniani, Borko Furht, Marco Anisetti, Paolo Ceravolo, Ernesto Damiani, and Misa Ivkovic. Augmented reality technologies, systems and applications.Multimedia tools and applications, 51(1): 341–377, 2011

  14. [14]

    Generative novel view synthesis with 3d-aware diffusion models

    Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3d-aware diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4217–4229, 2023

  15. [15]

    Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

    Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. InProceedings of the IEEE/CVF international conference on computer vision, pages 2416–2425, 2023

  16. [16]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023

  17. [17]

    4diff: 3d-aware diffusion model for third-to-first viewpoint translation

    Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, and Kristen Grauman. 4diff: 3d-aware diffusion model for third-to-first viewpoint translation. InEuropean Conference on Computer Vision, pages 409–427. Springer, 2024. 10

  18. [18]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. InProceedings of the European conference on computer vision (ECCV), pages 720–736, 2018

  19. [19]

    The human activity assistive technology model

    Jessica Dashner, Kerri Morgan, Sue Tucker, Carla Walker, and Sandra Martina Espín Tello. The human activity assistive technology model. InRoutledge Companion to Occupational Therapy, pages 508–517. Routledge, 2025

  20. [20]

    Tokenhmr: Advancing human mesh recovery with a tokenized pose representation

    Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Yao Feng, and Michael J Black. Tokenhmr: Advancing human mesh recovery with a tokenized pose representation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1323–1333, 2024

  21. [21]

    Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

    Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual fore- sight: Model-based deep reinforcement learning for vision-based robotic control.arXiv preprint arXiv:1812.00568, 2018

  22. [22]

    arXiv preprint arXiv:2511.15586 , year=

    Aaron Ferguson, Ahmed AA Osman, Berta Bescos, Carsten Stoll, Chris Twigg, Christoph Lassner, David Otte, Eric Vignola, Fabian Prada, Federica Bogo, et al. Mhr: Momentum human rig.arXiv preprint arXiv:2511.15586, 2025

  23. [23]

    arXiv preprint arXiv:2512.08406 , year=

    Mingqi Gao, Yunqi Miao, and Jungong Han. Sam-body4d: Training-free 4d human body mesh recovery from videos.arXiv preprint arXiv:2512.08406, 2025

  24. [24]

    Lome: Learning human-object manipulation with action-conditioned egocentric world model.arXiv preprint arXiv:2603.27449, 2026

    Quankai Gao, Jiawei Yang, Qiangeng Xu, Le Chen, and Yue Wang. Lome: Learning human-object manipulation with action-conditioned egocentric world model.arXiv preprint arXiv:2603.27449, 2026

  25. [25]

    DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  26. [26]

    Hu- mans in 4d: Reconstructing and tracking humans with transformers

    Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Hu- mans in 4d: Reconstructing and tracking humans with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14783–14794, 2023

  27. [27]

    World models can leverage human videos for dexter- ous manipulation.ArXiv, abs/2512.13644, 2025

    Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexterous manipulation.arXiv preprint arXiv:2512.13644, 2025

  28. [28]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  29. [29]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  30. [30]

    Maskvit: Masked visual pre-training for video prediction

    Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. Maskvit: Masked visual pre-training for video prediction. InThe Eleventh International Conference on Learning Representations, 2022

  31. [31]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

  32. [32]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020

  33. [33]

    Vunet: Dynamic scene view synthesis for traversability estimation using an rgb camera.IEEE Robotics and Automation Letters, 4(2):2062–2069, 2019

    Noriaki Hirose, Amir Sadeghian, Fei Xia, Roberto Martín-Martín, and Silvio Savarese. Vunet: Dynamic scene view synthesis for traversability estimation using an rgb camera.IEEE Robotics and Automation Letters, 4(2):2062–2069, 2019

  34. [34]

    Deep visual mpc-policy learning for navigation.IEEE Robotics and Automation Letters, 4(4):3184–3191, 2019

    Noriaki Hirose, Fei Xia, Roberto Martín-Martín, Amir Sadeghian, and Silvio Savarese. Deep visual mpc-policy learning for navigation.IEEE Robotics and Automation Letters, 4(4):3184–3191, 2019

  35. [35]

    Egolm: Multi-modal language model of egocentric motions

    Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, and Lingni Ma. Egolm: Multi-modal language model of egocentric motions. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5344–5354, 2025. 11

  36. [36]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025

  37. [37]

    Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013

  38. [38]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

  39. [39]

    Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities

    Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-Chun Zhu. Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities. InEuropean Conference on Computer Vision, pages 767–786. Springer, 2020

  40. [40]

    Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

  41. [41]

    Egox: Egocentric video generation from a single exocentric video.arXiv preprint arXiv:2512.08269, 2025

    Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, and Jaegul Choo. Egox: Egocentric video generation from a single exocentric video.arXiv preprint arXiv:2512.08269, 2025

  42. [42]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  43. [43]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

  44. [44]

    H2o: Two hands manipulating objects for first person interaction recognition

    Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. H2o: Two hands manipulating objects for first person interaction recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 10138–10148, 2021

  45. [45]

    Okami: Teaching humanoid robots manipulation skills through single video imitation.arXiv preprint arXiv:2410.11792, 2024

    Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, and Yuke Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation.arXiv preprint arXiv:2410.11792, 2024

  46. [46]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  47. [47]

    Exocentric-to-egocentric video generation.Advances in Neural Information Processing Systems, 37:136149–136172, 2024

    Jia-Wei Liu, Weijia Mao, Zhongcong Xu, Jussi Keppo, and Mike Z Shou. Exocentric-to-egocentric video generation.Advances in Neural Information Processing Systems, 37:136149–136172, 2024

  48. [48]

    Smpl: A skinned multi-person linear model

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. InSeminal Graphics Papers: Pushing the Boundaries, V olume 2, pages 851–866. 2023

  49. [49]

    Put myself in your shoes: Lifting the egocentric perspective from exocentric videos

    Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos. InEuropean Conference on Computer Vision, pages 407–425. Springer, 2024

  50. [50]

    Nymeria: A massive collection of multimodal egocentric daily motion in the wild

    Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. InEuropean Conference on Computer Vision, pages 445–465. Springer, 2024

  51. [51]

    On human motion prediction using recurrent neural networks

    Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2891–2900, 2017

  52. [52]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019

  53. [53]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021. 12

  54. [54]

    Movella, 2021

    Movella.MVN User Manual. Movella, 2021. https://www.movella.com/hubfs/MVN_User_Manual. pdf

  55. [55]

    V-jepa 2.1: Unlocking dense fea- tures in video self-supervised learning,

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

  56. [56]

    Egocontrol: Controllable egocentric video generation via 3d full-body poses

    Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, and Juergen Gall. Egocontrol: Controllable egocentric video generation via 3d full-body poses.arXiv preprint arXiv:2511.18173, 2025

  57. [57]

    Uniegomotion: A unified model for egocentric motion reconstruction, forecasting, and generation

    Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, and Ehsan Adeli. Uniegomotion: A unified model for egocentric motion reconstruction, forecasting, and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10318–10329, 2025

  58. [58]

    Expressive body capture: 3d hands, face, and body from a single image

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019

  59. [59]

    Reconstructing hands in 3d with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024

  60. [60]

    D-nerf: Neural radiance fields for dynamic scenes

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10318–10327, 2021

  61. [61]

    Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

    Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

  62. [62]

    Home action genome: Cooperative compositional action understanding

    Nishant Rai, Haofeng Chen, Jingwei Ji, Rishi Desai, Kazuki Kozuka, Shun Ishizaka, Ehsan Adeli, and Juan Carlos Niebles. Home action genome: Cooperative compositional action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11184– 11193, 2021

  63. [63]

    Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610, 2022

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610, 2022

  64. [64]

    Understanding human hands in contact at internet scale

    Dandan Shan, Jiaqi Geng, Michelle Shu, and David F Fouhey. Understanding human hands in contact at internet scale. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9869–9878, 2020

  65. [65]

    P-stmo: Pre- trained spatial temporal many-to-one model for 3d human pose estimation

    Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre- trained spatial temporal many-to-one model for 3d human pose estimation. InEuropean Conference on Computer Vision, pages 461–478. Springer, 2022

  66. [66]

    Wham: Reconstructing world-grounded humans with accurate 3d motion

    Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070–2080, 2024

  67. [67]

    DINOv3

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

  68. [68]

    Deep sensorimotor control by imitating predictive models of human motion.arXiv preprint arXiv:2508.18691, 2025

    Himanshu Gaurav Singh, Pieter Abbeel, Jitendra Malik, and Antonio Loquercio. Deep sensorimotor control by imitating predictive models of human motion.arXiv preprint arXiv:2508.18691, 2025

  69. [69]

    Mental imagery of self-location during spontaneous and active self–other interactions: An electrical neuroimaging study

    Bérangere Thirioux, Manuel R Mercier, Gérard Jorland, Alain Berthoz, and Olaf Blanke. Mental imagery of self-location during spontaneous and active self–other interactions: An electrical neuroimaging study. Journal of Neuroscience, 30(21):7202–7214, 2010

  70. [70]

    Imitative learning of actions on objects by children, chimpanzees, and enculturated chimpanzees.Child development, 64(6):1688–1705, 1993

    Michael Tomasello, Sue Savage-Rumbaugh, and Ann Cale Kruger. Imitative learning of actions on objects by children, chimpanzees, and enculturated chimpanzees.Child development, 64(6):1688–1705, 1993

  71. [71]

    Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025

    Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, and Hengshuang Zhao. Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025. 13

  72. [72]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  73. [73]

    Egoexo-gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025

    Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Egoexo-gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025

  74. [74]

    Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems, 35:38571–38584, 2022

    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems, 35:38571–38584, 2022

  75. [75]

    Sam 3d body: Robust full-body human mesh recovery

    Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989, 2026

  76. [76]

    Unisim: A neural closed-loop sensor simulator

    Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023

  77. [77]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  78. [78]

    Decoupling human and camera motion from videos in the wild

    Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21222–21232, 2023

  79. [79]

    Smplest-x: Ultimate scaling for expressive human pose and shape estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, et al. Smplest-x: Ultimate scaling for expressive human pose and shape estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  80. [80]

    Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

    Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

Showing first 80 references.