EgoExo-WM: Unlocking Exo Video for Ego World Models
Pith reviewed 2026-05-19 14:28 UTC · model grok-4.3
The pith
Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The method extracts structured body pose from exocentric video as a representation of action and transforms that video into egocentric video, informed by a human kinematics prior. This makes in-the-wild exocentric data usable for egocentric world model training. Training whole-body action-conditioned egocentric world models on the converted data significantly improves both prediction quality and downstream planning performance, where planning means inferring the sequence of body poses needed to reach a visual goal state.
What carries the argument
The exocentric-to-egocentric video transformation that extracts body poses and applies a human kinematics prior to produce action-aligned egocentric training data.
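The geometric step behind such a transformation can be illustrated with a minimal sketch. It assumes that a per-frame head position and gaze direction are already available from the extracted body pose, places a virtual pinhole camera at the head, and projects a world point into that egocentric view. The function names, the axis convention (x right, y down, z forward), the z-up world frame, and the intrinsics are illustrative assumptions; this summary does not specify the paper's actual pipeline at this level of detail.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ego_camera_axes(forward, up=(0.0, 0.0, 1.0)):
    """Camera axes (right, down, forward) for a head looking along `forward`.

    Assumes a z-up world and that `forward` is not parallel to `up`.
    """
    z = _normalize(forward)          # optical axis follows the gaze direction
    x = _normalize(_cross(z, up))    # right-hand side of the wearer
    y = _cross(z, x)                 # points down in the image
    return x, y, z

def project_to_ego(point, head_pos, forward, focal=500.0, cx=320.0, cy=240.0):
    """Project a world point into a hypothetical head-mounted pinhole camera.

    Returns pixel coordinates (u, v), or None if the point is behind the head.
    """
    x_axis, y_axis, z_axis = ego_camera_axes(forward)
    rel = [p - h for p, h in zip(point, head_pos)]
    xc, yc, zc = _dot(x_axis, rel), _dot(y_axis, rel), _dot(z_axis, rel)
    if zc <= 0:
        return None
    return (focal * xc / zc + cx, focal * yc / zc + cy)
```

A point straight ahead of the head lands on the principal point, and a point to the wearer's right lands at a larger u. A real conversion would also have to synthesize appearance, occlusion by the wearer's hands, and head-motion effects, which this sketch does not attempt.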
If this is right
- Whole-body action-conditioned egocentric world models achieve higher prediction quality when trained on the converted data.
- Downstream planning improves by more accurately inferring sequences of body poses that reach a desired visual goal state.
- Arbitrary in-the-wild videos become usable as sources for building egocentric world models.
- Applications in robot planning and augmented-reality guidance gain from the expanded training data.
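The planning setting in the second bullet, inferring an action sequence that drives a world model toward a goal, can be sketched with the cross-entropy method (CEM). This is a swapped-in, commonly used sampling planner, not necessarily the one the paper uses. The world model and cost here are toy stand-ins: a state is a short list of floats, an "action" is an additive pose delta, and the goal is a target state rather than an image. All names are hypothetical.

```python
import random

def rollout(world_model, state, actions):
    """Apply a sequence of actions through the (learned or toy) world model."""
    for a in actions:
        state = world_model(state, a)
    return state

def plan_cem(world_model, cost, start, horizon, dim,
             iters=20, pop=64, n_elite=8, seed=0):
    """Search for an action sequence whose predicted final state has low cost."""
    rng = random.Random(seed)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    best_seq, best_cost = None, float("inf")
    for _ in range(iters):
        scored = []
        for _ in range(pop):
            seq = [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
                   for t in range(horizon)]
            scored.append((cost(rollout(world_model, start, seq)), seq))
        scored.sort(key=lambda cs: cs[0])
        if scored[0][0] < best_cost:
            best_cost, best_seq = scored[0]
        elites = [seq for _, seq in scored[:n_elite]]
        # Refit the sampling distribution to the elite sequences.
        for t in range(horizon):
            for d in range(dim):
                vals = [e[t][d] for e in elites]
                m = sum(vals) / n_elite
                var = sum((v - m) ** 2 for v in vals) / n_elite
                mean[t][d] = m
                std[t][d] = max(1e-3, var ** 0.5)
    return best_seq, best_cost

# Toy stand-ins (assumptions): additive dynamics and a squared-distance goal cost.
GOAL = [2.0, -1.0]

def toy_model(state, action):
    return [s + a for s, a in zip(state, action)]

def toy_cost(state):
    return sum((s - g) ** 2 for s, g in zip(state, GOAL))

plan, plan_cost = plan_cem(toy_model, toy_cost, [0.0, 0.0], horizon=3, dim=2)
```

In the paper's setting the state would be a predicted egocentric observation, the action a whole-body pose, and the cost a distance to the visual goal in some feature space; those choices are not detailed in this summary.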
Where Pith is reading between the lines
- The same conversion pipeline could be tested on other partial-observability settings where exo-style footage is easier to obtain than ego-style footage.
- If the kinematics prior generalizes across body types and camera angles, the method might scale to crowd-sourced video collections without per-video manual alignment.
- Planning performance gains might translate to real-robot control loops if the inferred pose sequences are executed on hardware with similar kinematics.
Load-bearing premise
The exocentric-to-egocentric video transformation informed by a human kinematics prior produces training data whose action representation and visual statistics remain sufficiently faithful to real egocentric observations that downstream gains are not artifacts of the conversion.
What would settle it
Retraining the world models on the converted data and measuring no gain in future-frame prediction accuracy or in success rate at inferring body-pose sequences for visual goals, relative to models trained only on native egocentric data, would falsify the claim.
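One way to make "no gain" operational is a paired comparison over the same evaluation clips. The sketch below is an assumption about how such a check could be run, not a procedure from the paper. It takes per-clip scores for a native-ego-only model and for a model that also saw converted data, and returns the mean gain with a percentile bootstrap interval. It assumes a higher score is better; the function name and inputs are hypothetical.

```python
import random

def paired_bootstrap_gain(baseline, augmented, n_boot=2000, alpha=0.05, seed=0):
    """Mean per-clip gain (augmented - baseline) with a percentile bootstrap CI."""
    if len(baseline) != len(augmented) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(augmented, baseline)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample clips with replacement and record the mean gain each time.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lo, hi
```

An interval that contains zero on both prediction and planning metrics would correspond to the falsifying outcome described above; an interval clearly above zero would not.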
Original abstract
Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EgoExo-WM, a method that extracts structured body poses from abundant exocentric videos to represent actions and transforms the videos into egocentric views using a human kinematics prior. This converted data is then used to train whole-body action-conditioned egocentric world models, with the authors claiming significant gains in prediction quality and in downstream planning where the model infers sequences of body poses to reach a specified visual goal state.
Significance. If the conversion process faithfully preserves visual statistics and action alignment with real egocentric observations, the work could meaningfully expand training resources for egocentric world models beyond current data-scarce regimes. This would support stronger embodied prediction and planning systems with applications in robotics and augmented-reality guidance.
major comments (2)
- [Video transformation section, description of exocentric-to-egocentric synthesis via human kinematics prior] The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.
- [Experiments and results, prediction quality and planning sections] The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.
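A distribution-matching check of the kind the first major comment asks for can be sketched as a Fréchet distance between feature statistics of real and synthesized egocentric frames. The version below is deliberately simplified: it assumes features have already been extracted by some fixed encoder (not shown, and hypothetical here) and treats each feature dimension as independent, so the covariance term reduces to a per-dimension formula. The full FID uses complete covariance matrices and a matrix square root; this diagonal variant is a cheaper stand-in, not the same number.

```python
import math

def _mean_var(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / n
                 for d in range(dim)]
    return means, variances

def diagonal_frechet_distance(real_feats, synth_feats):
    """Frechet distance between two Gaussians with diagonal covariance.

    With diagonal covariances the general formula
    |mu_r - mu_s|^2 + Tr(S_r + S_s - 2 (S_r S_s)^(1/2))
    reduces to sum (mu_r - mu_s)^2 + sum (sigma_r - sigma_s)^2.
    """
    mu_r, var_r = _mean_var(real_feats)
    mu_s, var_s = _mean_var(synth_feats)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_r, var_s))
    return mean_term + cov_term
```

A value near zero means the two feature sets share first and second moments per dimension; it does not by itself rule out artifacts that the chosen encoder fails to capture.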
minor comments (2)
- [Introduction] Define 'whole-body action-conditioned' and the precise action representation (pose sequences) more explicitly in the introduction to prevent reader ambiguity.
- [Figures] Figure captions and pipeline diagrams should include explicit labels for the kinematics prior application and the action extraction step for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the validation of the video transformation and the experimental analysis.
Point-by-point responses
- Referee: [Video transformation section, description of exocentric-to-egocentric synthesis via human kinematics prior] The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.
  Authors: We agree that quantitative fidelity metrics would provide stronger direct evidence that the synthesized views align with real egocentric statistics. The current manuscript relies on downstream task improvements as indirect validation of the kinematics prior. In the revision we will add distribution-matching results (e.g., FID or perceptual metrics) between synthesized and real egocentric videos as well as an ablation comparing models trained on converted data versus real ego corpora.
  Revision: yes
- Referee: [Experiments and results, prediction quality and planning sections] The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.
  Authors: The experiments section already reports quantitative prediction and planning metrics with several baselines. To make the contribution of the exo-to-ego conversion explicit, we will add (i) a real-ego-only baseline, (ii) an ablation that trains the world model with and without the converted data, and (iii) error analysis broken down by action type and prediction horizon in the revised manuscript.
  Revision: yes
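The error analysis by action type and prediction horizon promised in the second response amounts to a grouped average. A minimal sketch, assuming per-sample records of the form (action type, horizon, error); the record layout and the action labels used in the usage note are placeholders, not categories from the paper:

```python
from collections import defaultdict

def error_breakdown(records):
    """Mean error per (action_type, horizon) cell.

    `records` is an iterable of (action_type, horizon, error) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [total error, count]
    for action_type, horizon, error in records:
        cell = sums[(action_type, horizon)]
        cell[0] += error
        cell[1] += 1
    return {key: total / count
            for key, (total, count) in sorted(sums.items())}
```

Calling it on records such as `("reach", 1, 0.2)` and `("reach", 4, 0.9)` returns one mean per cell, which makes it easy to see whether errors concentrate in particular actions or grow with horizon.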
Circularity Check
No significant circularity; empirical gains rest on external evaluation of converted data.
full rationale
The paper describes an exocentric-to-egocentric video conversion step that uses a human kinematics prior to extract body poses and synthesize ego views, then trains action-conditioned world models on the resulting data and reports measured improvements in prediction quality and planning performance. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains to quantities defined by construction within the same paper; the results are presented as outcomes of training and downstream evaluation on the transformed corpus. The derivation chain therefore remains self-contained against the external benchmarks and real-world planning tasks referenced.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A human kinematics prior exists that can accurately map exocentric body poses and visuals into corresponding egocentric observations without introducing systematic bias for world-model training.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel: unclear
  Relation between the paper passage and the cited Recognition theorem.
  wrist-position consistency objective... Lwrist
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.