EgoExo-WM: Unlocking Exo Video for Ego World Models

Danny Tran; Kristen Grauman; Roberto Martín-Martín

arxiv: 2605.15477 · v1 · pith:WV4JHJDGnew · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-05-19 14:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords egocentric world models · exocentric video · video transformation · body pose extraction · action-conditioned prediction · robot planning · visual goal reaching

The pith

Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address the shortage of egocentric video data for world models by developing a conversion process that turns plentiful exocentric footage into aligned egocentric training examples. It extracts body poses from the exo videos to represent actions and applies a transformation guided by human kinematics to generate ego perspectives. A reader would care because this conversion could expand usable training data dramatically, leading to world models that better predict future states and plan sequences of body poses to reach visual goals. If the approach holds, agents could draw on in-the-wild videos to improve performance in prediction and planning tasks without needing massive new egocentric recordings.

Core claim

Extracting structured body pose from exocentric video as a representation of action, and transforming that video into egocentric video under a human kinematics prior, unlocks in-the-wild exocentric data for egocentric world model training. Training whole-body action-conditioned egocentric world models on the converted data significantly improves both prediction quality and downstream planning performance, where planning means inferring the sequence of body poses needed to reach a visual goal state.

What carries the argument

The exocentric-to-egocentric video transformation that extracts body poses and applies a human kinematics prior to produce action-aligned egocentric training data.

If this is right

Whole-body action-conditioned egocentric world models achieve higher prediction quality when trained on the converted data.
Downstream planning improves by more accurately inferring sequences of body poses that reach a desired visual goal state.
Arbitrary in-the-wild videos become usable as sources for building egocentric world models.
Applications in robot planning and augmented-reality guidance gain from the expanded training data.
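
The planning setting referred to above is described in the Figure 5 caption as a sample-and-rank loop: a trajectory sampler proposes candidate motion sequences, and the world model ranks them by how well the predicted outcome matches the goal. A minimal sketch of that loop follows. The vector-valued states, the L2 goal distance, the uniform sampler, and `toy_world_model` are all illustrative assumptions, not the paper's implementation.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
WorldModel = Callable[[Vector, Vector], Vector]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, used here as a stand-in goal-matching score."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_candidates(rng: random.Random, num_candidates: int,
                      horizon: int, dim: int) -> List[List[Vector]]:
    """Hypothetical sampler: uniform random action sequences."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(horizon)]
        for _ in range(num_candidates)
    ]


def plan_by_ranking(observation: Vector, goal: Vector,
                    candidates: Sequence[Sequence[Vector]],
                    world_model: WorldModel) -> int:
    """Roll each candidate through the world model and return the index
    of the one whose final predicted state is closest to the goal.
    Returns -1 when there are no candidates."""
    best_index, best_cost = -1, math.inf
    for index, actions in enumerate(candidates):
        state = list(observation)
        for action in actions:
            state = world_model(state, action)
        cost = l2_distance(state, goal)
        if cost < best_cost:
            best_index, best_cost = index, cost
    return best_index


def toy_world_model(state: Vector, action: Vector) -> Vector:
    """Assumed toy dynamics: the state shifts by the action."""
    return [s + a for s, a in zip(state, action)]


if __name__ == "__main__":
    rng = random.Random(0)
    proposals = sample_candidates(rng, num_candidates=16, horizon=4, dim=2)
    chosen = plan_by_ranking([0.0, 0.0], [1.0, 1.0], proposals, toy_world_model)
    print("selected candidate:", chosen)
```

In a real system the states would be predicted egocentric frames or their latents, and the score would be a perceptual or latent-space distance. The loop structure is the only part this sketch is meant to convey.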

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same conversion pipeline could be tested on other partial-observability settings where exo-style footage is easier to obtain than ego-style footage.
If the kinematics prior generalizes across body types and camera angles, the method might scale to crowd-sourced video collections without per-video manual alignment.
Planning performance gains might translate to real-robot control loops if the inferred pose sequences are executed on hardware with similar kinematics.

Load-bearing premise

The exocentric-to-egocentric video transformation informed by a human kinematics prior produces training data whose action representation and visual statistics remain sufficiently faithful to real egocentric observations that downstream gains are not artifacts of the conversion.

What would settle it

Retraining the world models on the converted data and measuring no gain in future-frame prediction accuracy or in success rate at inferring body-pose sequences for visual goals, relative to models trained only on native egocentric data, would falsify the claim.
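
One way to make "no gain" concrete is a paired bootstrap over per-episode planning outcomes. The sketch below assumes both models are evaluated on the same episodes and that each episode yields a binary success flag; the function name, the resample count, and the 95% interval are illustrative choices, not the paper's protocol.

```python
import random
from typing import List, Tuple


def bootstrap_gain_interval(baseline: List[int], augmented: List[int],
                            num_resamples: int = 2000, alpha: float = 0.05,
                            seed: int = 0) -> Tuple[float, float, float]:
    """Paired bootstrap on per-episode success flags (1 = goal reached).

    Returns (observed_gain, lower_bound, upper_bound), where the gain is
    the augmented success rate minus the baseline success rate.
    """
    if len(baseline) != len(augmented) or not baseline:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    # Paired differences keep each episode's two outcomes together.
    diffs = [a - b for a, b in zip(augmented, baseline)]
    observed = sum(diffs) / n
    means = []
    for _ in range(num_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower_index = int((alpha / 2) * num_resamples)
    upper_index = min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))
    return observed, means[lower_index], means[upper_index]
```

Under these assumptions, an interval that contains zero would be consistent with the falsifying outcome described above, while an interval strictly above zero would support the claimed gain.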

Figures

Figures reproduced from arXiv: 2605.15477 by Danny Tran, Kristen Grauman, Roberto Martín-Martín.

**Figure 1.** Overview. Egocentric video provides an embodied view but often hides the body and occludes the hands, while exocentric video often reveals full-body motion (a). EgoExo-WM uses recovered 3D human motion as a bridge for learning an egocentric world-model with exocentric video: it defines the action sequence and guides exo-to-ego synthesis into action-aligned egocentric observations (b). The learned world mod…

**Figure 2.** World Model Training. EgoExo-WM unlocks exocentric video for egocentric world model training. Given an exocentric video, we recover 3D human motion which we use alongside the original video to ground our exo-to-ego conversion. The 3D human motion becomes our actions and the converted exocentric video becomes the egocentric observation. We then train EgoExo-WM autoregressively with teacher forcing. We apply…

**Figure 3.** EgoX-Body Qualitative Comparison. EgoX-Body better grounds generated egocentric video in human motion and interaction structure. On the egocentric side, we introduce an egocentric hand kinematics conditioning to directly reflect the interactions that define egocentric video. We condition the model with a drawn hand-skeleton overlay that exposes hand kinematics, helping generate consistent hand motion an…

**Figure 4.** EgoX-Body Inference Overview. From exocentric videos, we extract body pose and lift the scene into a 3D point cloud. The body skeleton is overlaid onto the exocentric video, while the same pose and geometry are used to render an egocentric prior with predicted hand locations. We form two latent inputs: (1) the clean exocentric latent concatenated with noise, and (2) the body-overlaid exocentric latent conc…

**Figure 5.** Qualitative planning results. From an observation and a visual goal, a trajectory sampler proposes candidate motion sequences, and the world model ranks them to select the one whose predicted outcome best matches the goal. In the first example, the goal is to move left toward the sink, whereas in the second, the goal is to pour cereal. EgoExo-WM chooses trajectories that better match the ground-truth behav…

**Figure 6.** Examples of failure cases in Internet ego-view videos. We show representative clips with large white or black regions, which commonly arise from videos in which the person directly faces the camera, so the egocentric prior is essentially black. These failure cases provide little useful training signal, motivating the automatic filtering criteria described in Section A.3.3. We retain clips satisfying bla…
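
Figure 6 motivates filtering out clips dominated by large white or black regions. The retention criteria themselves are truncated in the caption, so the following is only a minimal sketch of one plausible check. The grayscale frame layout and the thresholds `dark_level`, `bright_level`, and `max_extreme_fraction` are placeholder assumptions, not the paper's values.

```python
from typing import Sequence

Frame = Sequence[Sequence[int]]  # grayscale rows, pixel values in 0..255


def extreme_fraction(frame: Frame, dark_level: int = 10,
                     bright_level: int = 245) -> float:
    """Fraction of pixels that are near-black or near-white.
    An empty frame is treated as fully degenerate (1.0)."""
    total = 0
    extreme = 0
    for row in frame:
        for value in row:
            total += 1
            if value <= dark_level or value >= bright_level:
                extreme += 1
    return extreme / total if total else 1.0


def keep_clip(frames: Sequence[Frame], max_extreme_fraction: float = 0.5,
              dark_level: int = 10, bright_level: int = 245) -> bool:
    """Keep a clip only if, averaged over frames, the near-black or
    near-white area stays at or below the (hypothetical) threshold."""
    if not frames:
        return False
    mean_fraction = sum(
        extreme_fraction(frame, dark_level, bright_level) for frame in frames
    ) / len(frames)
    return mean_fraction <= max_extreme_fraction
```

Averaging over frames is a design choice made here for brevity; a stricter filter could reject a clip as soon as any single frame exceeds the threshold.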

Original abstract

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper proposes converting exocentric videos to egocentric ones via body pose extraction and a kinematics prior to train better action-conditioned world models, but the abstract asserts gains without any metrics or validation.

The key point is that this work gives a concrete pipeline to pull body poses from abundant exocentric video and warp that video into egocentric views using a human kinematics prior, then uses that data to train whole-body world models that improve prediction and goal-conditioned planning. The new element is this specific exo-to-ego bridging step that aligns action representations and unlocks in-the-wild footage for ego training. It frames the data scarcity problem cleanly and points to practical uses in robot planning and AR guidance. The approach builds on existing pose tools and priors, which keeps the method grounded and potentially reproducible if the implementation details hold up.

The soft spots sit in the evaluation. The abstract states significant improvements in prediction quality and planning performance but supplies no numbers, baselines, ablations, or error breakdowns, so it is impossible to tell whether the gains are real or tied to artifacts from the conversion. The central assumption (that the transformed videos preserve visual statistics, action alignment, and ego-specific effects like camera motion and occlusions) needs direct checks such as distribution matching against real ego data. Without those, the planning results on inferred pose sequences could reflect the synthetic proxy rather than genuine unlocking of exo video.

This paper is aimed at researchers building embodied world models who need more training data. Readers working on cross-view adaptation or data augmentation would get the most from the pipeline description. It shows clear thinking on the bottleneck even if the evidence is still thin.

I would send it to peer review so the authors can add the missing quantitative validation and address the fidelity of the conversion.

Referee Report

2 major / 2 minor

Summary. The paper presents EgoExo-WM, a method that extracts structured body poses from abundant exocentric videos to represent actions and transforms the videos into egocentric views using a human kinematics prior. This converted data is then used to train whole-body action-conditioned egocentric world models, with the authors claiming significant gains in prediction quality and in downstream planning where the model infers sequences of body poses to reach a specified visual goal state.

Significance. If the conversion process faithfully preserves visual statistics and action alignment with real egocentric observations, the work could meaningfully expand training resources for egocentric world models beyond current data-scarce regimes. This would support stronger embodied prediction and planning systems with applications in robotics and augmented-reality guidance.

major comments (2)

[Video transformation section: description of exocentric-to-egocentric synthesis via human kinematics prior] The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.
[Experiments and results: prediction quality and planning sections] The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.

minor comments (2)

[Introduction] Define 'whole-body action-conditioned' and the precise action representation (pose sequences) more explicitly in the introduction to prevent reader ambiguity.
[Figures] Figure captions and pipeline diagrams should include explicit labels for the kinematics prior application and the action extraction step for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the validation of the video transformation and the experimental analysis.

Referee: [Video transformation section: description of exocentric-to-egocentric synthesis via human kinematics prior] The central claim that converted data improves genuine egocentric world models rests on the assumption that the synthesized views match real ego statistics (hand occlusion, head-mounted motion, lighting, depth). No quantitative fidelity metrics, distribution-matching results, or ablation against real ego corpora are reported to rule out systematic artifacts from the prior.

Authors: We agree that quantitative fidelity metrics would provide stronger direct evidence that the synthesized views align with real egocentric statistics. The current manuscript relies on downstream task improvements as indirect validation of the kinematics prior. In the revision we will add distribution-matching results (e.g., FID or perceptual metrics) between synthesized and real egocentric videos as well as an ablation comparing models trained on converted data versus real ego corpora.
revision: yes

Referee: [Experiments and results: prediction quality and planning sections] The abstract asserts 'significant improvements' in prediction and planning performance, yet the manuscript must supply concrete quantitative metrics, baselines (real-ego-only models), ablations isolating the conversion step, and error analysis. Without these, the load-bearing claim that gains derive from unlocked exo data rather than proxy-task artifacts cannot be evaluated.

Authors: The experiments section already reports quantitative prediction and planning metrics with several baselines. To make the contribution of the exo-to-ego conversion explicit, we will add (i) a real-ego-only baseline, (ii) an ablation that trains the world model with and without the converted data, and (iii) error analysis broken down by action type and prediction horizon in the revised manuscript.
revision: yes
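
The first response promises distribution-matching results such as FID. FID is a Fréchet distance between Gaussians fitted to feature embeddings of real and generated samples. The sketch below illustrates the idea under a strong simplifying assumption: covariances are treated as diagonal, which reduces the distance to squared differences of per-dimension means and standard deviations. The feature vectors are assumed to be given (standard FID takes them from a pretrained Inception network), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def column_stats(features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population standard deviation."""
    if not features:
        raise ValueError("need at least one feature vector")
    n = len(features)
    dim = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(dim)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in features) / n)
        for j in range(dim)
    ]
    return means, stds


def diagonal_frechet_distance(real: Sequence[Sequence[float]],
                              synthesized: Sequence[Sequence[float]]) -> float:
    """Squared Fréchet distance between two diagonal Gaussians:
    sum((mu_r - mu_s)^2) + sum((sigma_r - sigma_s)^2).
    This is a simplification of the full-covariance form used by FID."""
    mu_r, sd_r = column_stats(real)
    mu_s, sd_s = column_stats(synthesized)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    spread_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_s))
    return mean_term + spread_term
```

A value near zero would indicate that the two feature distributions share similar first and second moments per dimension. The diagonal form ignores correlations between feature dimensions, so it is not a substitute for a full FID computation.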

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external evaluation of converted data.

full rationale

The paper describes an exocentric-to-egocentric video conversion step that uses a human kinematics prior to extract body poses and synthesize ego views, then trains action-conditioned world models on the resulting data and reports measured improvements in prediction quality and planning performance. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains to quantities defined by construction within the same paper; the results are presented as outcomes of training and downstream evaluation on the transformed corpus. The derivation chain therefore remains self-contained against the external benchmarks and real-world planning tasks referenced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the method implicitly depends on the existence and applicability of a human kinematics prior and on body pose being a sufficient action representation.

axioms (1)

domain assumption: A human kinematics prior exists that can accurately map exocentric body poses and visuals into corresponding egocentric observations without introducing systematic bias for world-model training.
The abstract states that the transformation is 'informed by a human kinematics prior' and treats this step as the bridge that unlocks exo data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: wrist-position consistency objective... Lwrist

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Nymeria: A massive collection of multimodal egocentric daily motion in the wild

Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. InEuropean Conference on Computer Vision, pages 445–465. Springer, 2024

work page 2024
[51]

On human motion prediction using recurrent neural networks

Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2891–2900, 2017

work page 2017
[52]

Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019

work page 2019
[53]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021. 12

work page 2021
[54]

Movella, 2021

Movella.MVN User Manual. Movella, 2021. https://www.movella.com/hubfs/MVN_User_Manual. pdf

work page 2021
[55]

V-JEPA 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

work page arXiv 2026
[56]

Egocontrol: Controllable egocentric video generation via 3d full-body poses

Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, and Juergen Gall. Egocontrol: Controllable egocentric video generation via 3d full-body poses.arXiv preprint arXiv:2511.18173, 2025

work page arXiv 2025
[57]

Uniegomotion: A unified model for egocentric motion reconstruction, forecasting, and generation

Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, and Ehsan Adeli. Uniegomotion: A unified model for egocentric motion reconstruction, forecasting, and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10318–10329, 2025

work page 2025
[58]

Expressive body capture: 3d hands, face, and body from a single image

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019

work page 2019
[59]

Reconstructing hands in 3d with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024

work page 2024
[60]

D-nerf: Neural radiance fields for dynamic scenes

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10318–10327, 2021

work page 2021
[61]

Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

work page arXiv 2025
[62]

Home action genome: Cooperative compositional action understanding

Nishant Rai, Haofeng Chen, Jingwei Ji, Rishi Desai, Kazuki Kozuka, Shun Ishizaka, Ehsan Adeli, and Juan Carlos Niebles. Home action genome: Cooperative compositional action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11184– 11193, 2021

work page 2021
[63]

Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610, 2022

Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610, 2022

work page arXiv 2022
[64]

Understanding human hands in contact at internet scale

Dandan Shan, Jiaqi Geng, Michelle Shu, and David F Fouhey. Understanding human hands in contact at internet scale. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9869–9878, 2020

work page 2020
[65]

P-stmo: Pre- trained spatial temporal many-to-one model for 3d human pose estimation

Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre- trained spatial temporal many-to-one model for 3d human pose estimation. InEuropean Conference on Computer Vision, pages 461–478. Springer, 2022

work page 2022
[66]

Wham: Reconstructing world-grounded humans with accurate 3d motion

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070–2080, 2024

work page 2070
[67]

DINOv3

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[68]

Deep sensorimotor control by imitating predictive models of human motion.arXiv preprint arXiv:2508.18691, 2025

Himanshu Gaurav Singh, Pieter Abbeel, Jitendra Malik, and Antonio Loquercio. Deep sensorimotor control by imitating predictive models of human motion.arXiv preprint arXiv:2508.18691, 2025

work page arXiv 2025
[69]

Mental imagery of self-location during spontaneous and active self–other interactions: An electrical neuroimaging study

Bérangere Thirioux, Manuel R Mercier, Gérard Jorland, Alain Berthoz, and Olaf Blanke. Mental imagery of self-location during spontaneous and active self–other interactions: An electrical neuroimaging study. Journal of Neuroscience, 30(21):7202–7214, 2010

work page 2010
[70]

Imitative learning of actions on objects by children, chimpanzees, and enculturated chimpanzees.Child development, 64(6):1688–1705, 1993

Michael Tomasello, Sue Savage-Rumbaugh, and Ann Cale Kruger. Imitative learning of actions on objects by children, chimpanzees, and enculturated chimpanzees.Child development, 64(6):1688–1705, 1993

work page 1993
[71]

Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025

Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, and Hengshuang Zhao. Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025. 13

work page arXiv 2025
[72]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[73]

Egoexo-gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025

Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Egoexo-gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025

work page arXiv 2025
[74]

Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems, 35:38571–38584, 2022

Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems, 35:38571–38584, 2022

work page 2022
[75]

Sam 3d body: Robust full-body human mesh recovery

Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989, 2026

work page arXiv 2026
[76]

Unisim: A neural closed-loop sensor simulator

Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023

work page 2023
[77]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[78]

Decoupling human and camera motion from videos in the wild

Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21222–21232, 2023

work page 2023
[79]

Smplest-x: Ultimate scaling for expressive human pose and shape estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, et al. Smplest-x: Ultimate scaling for expressive human pose and shape estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[80]

Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

work page arXiv 2026

Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Cosmos-Transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492,

Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025

work page arXiv 2025

[3] [3]

Fiction: 4d future interaction prediction from video

Kumar Ashutosh, Georgios Pavlakos, and Kristen Grauman. Fiction: 4d future interaction prediction from video. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17613–17625, 2025

work page 2025

[4] [4]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Safemimic: Towards safe and autonomous human-to-robot imitation for mobile manipulation.arXiv preprint arXiv:2506.15847, 2025

Arpit Bahety, Arnav Balaji, Ben Abbatematteo, and Roberto Martín-Martín. Safemimic: Towards safe and autonomous human-to-robot imitation for mobile manipulation.arXiv preprint arXiv:2506.15847, 2025

work page arXiv 2025

[6] [6]

Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

work page arXiv 2025

[7] [7]

Prentice-hall Englewood Cliffs, NJ, 1977

Albert Bandura and Richard H Walters.Social learning theory, volume 1. Prentice-hall Englewood Cliffs, NJ, 1977

work page 1977

[8] [8]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025

work page 2025

[9] [9]

Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, 2021

work page 2021

[10] [10]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

work page 2022

[11] [11]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015

work page 2015

[12] [12]

Smpler-x: Scaling up expressive human pose and shape estimation

Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Wang Yanjun, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, et al. Smpler-x: Scaling up expressive human pose and shape estimation. Advances in Neural Information Processing Systems, 36:11454–11468, 2023

work page 2023

[13] [13]

Augmented reality technologies, systems and applications.Multimedia tools and applications, 51(1): 341–377, 2011

Julie Carmigniani, Borko Furht, Marco Anisetti, Paolo Ceravolo, Ernesto Damiani, and Misa Ivkovic. Augmented reality technologies, systems and applications.Multimedia tools and applications, 51(1): 341–377, 2011

work page 2011

[14] [14]

Generative novel view synthesis with 3d-aware diffusion models

Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3d-aware diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4217–4229, 2023

work page 2023

[15] [15]

Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. InProceedings of the IEEE/CVF international conference on computer vision, pages 2416–2425, 2023

work page 2023

[16] [16]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023

work page 2023

[17] [17]

4diff: 3d-aware diffusion model for third-to-first viewpoint translation

Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, and Kristen Grauman. 4diff: 3d-aware diffusion model for third-to-first viewpoint translation. InEuropean Conference on Computer Vision, pages 409–427. Springer, 2024. 10

work page 2024

[18] [18]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. InProceedings of the European conference on computer vision (ECCV), pages 720–736, 2018

work page 2018

[19] [19]

The human activity assistive technology model

Jessica Dashner, Kerri Morgan, Sue Tucker, Carla Walker, and Sandra Martina Espín Tello. The human activity assistive technology model. InRoutledge Companion to Occupational Therapy, pages 508–517. Routledge, 2025

work page 2025

[20] [20]

Tokenhmr: Advancing human mesh recovery with a tokenized pose representation

Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Yao Feng, and Michael J Black. Tokenhmr: Advancing human mesh recovery with a tokenized pose representation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1323–1333, 2024

work page 2024

[21] [21]

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual fore- sight: Model-based deep reinforcement learning for vision-based robotic control.arXiv preprint arXiv:1812.00568, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[22] [22]

arXiv preprint arXiv:2511.15586 , year=

Aaron Ferguson, Ahmed AA Osman, Berta Bescos, Carsten Stoll, Chris Twigg, Christoph Lassner, David Otte, Eric Vignola, Fabian Prada, Federica Bogo, et al. Mhr: Momentum human rig.arXiv preprint arXiv:2511.15586, 2025

work page arXiv 2025

[23] [23]

arXiv preprint arXiv:2512.08406 , year=

Mingqi Gao, Yunqi Miao, and Jungong Han. Sam-body4d: Training-free 4d human body mesh recovery from videos.arXiv preprint arXiv:2512.08406, 2025

work page arXiv 2025

[24] [24]

Lome: Learning human-object manipulation with action-conditioned egocentric world model.arXiv preprint arXiv:2603.27449, 2026

Quankai Gao, Jiawei Yang, Qiangeng Xu, Le Chen, and Yue Wang. Lome: Learning human-object manipulation with action-conditioned egocentric world model.arXiv preprint arXiv:2603.27449, 2026

work page arXiv 2026

[25] [25]

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[26] [26]

Hu- mans in 4d: Reconstructing and tracking humans with transformers

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Hu- mans in 4d: Reconstructing and tracking humans with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14783–14794, 2023

work page 2023

[27] [27]

World models can leverage human videos for dexter- ous manipulation.ArXiv, abs/2512.13644, 2025

Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexterous manipulation.arXiv preprint arXiv:2512.13644, 2025

work page arXiv 2025

[28] [28]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

work page 2022

[29] [29]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

work page 2024

[30] [30]

Maskvit: Masked visual pre-training for video prediction

Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. Maskvit: Masked visual pre-training for video prediction. InThe Eleventh International Conference on Learning Representations, 2022

work page 2022

[31] [31]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[33] [33]

Vunet: Dynamic scene view synthesis for traversability estimation using an rgb camera.IEEE Robotics and Automation Letters, 4(2):2062–2069, 2019

Noriaki Hirose, Amir Sadeghian, Fei Xia, Roberto Martín-Martín, and Silvio Savarese. Vunet: Dynamic scene view synthesis for traversability estimation using an rgb camera.IEEE Robotics and Automation Letters, 4(2):2062–2069, 2019

work page 2062

[34] [34]

Deep visual mpc-policy learning for navigation.IEEE Robotics and Automation Letters, 4(4):3184–3191, 2019

Noriaki Hirose, Fei Xia, Roberto Martín-Martín, Amir Sadeghian, and Silvio Savarese. Deep visual mpc-policy learning for navigation.IEEE Robotics and Automation Letters, 4(4):3184–3191, 2019

work page 2019

[35] [35]

Egolm: Multi-modal language model of egocentric motions

Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, and Lingni Ma. Egolm: Multi-modal language model of egocentric motions. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5344–5354, 2025. 11

work page 2025

[36] [36]

ViPE: Video Pose Engine for 3D Geometric Perception

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013

work page 2013

[38] [38]

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities

Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-Chun Zhu. Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities. InEuropean Conference on Computer Vision, pages 767–786. Springer, 2020

work page 2020

[40] [40]

Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

work page 2023

[41] [41]

Egox: Egocentric video generation from a single exocentric video.arXiv preprint arXiv:2512.08269, 2025

Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, and Jaegul Choo. Egox: Egocentric video generation from a single exocentric video.arXiv preprint arXiv:2512.08269, 2025

work page arXiv 2025

[42] [42]

Egomimic: Scaling imitation learning via egocentric video

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

work page 2025

[43] [43]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[44] [44]

H2o: Two hands manipulating objects for first person interaction recognition

Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. H2o: Two hands manipulating objects for first person interaction recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 10138–10148, 2021

work page 2021

[45] [45]

Okami: Teaching humanoid robots manipulation skills through single video imitation.arXiv preprint arXiv:2410.11792, 2024

Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, and Yuke Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation.arXiv preprint arXiv:2410.11792, 2024

work page arXiv 2024

[46] [46]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

work page 2014

[47] [47]

Exocentric-to-egocentric video generation.Advances in Neural Information Processing Systems, 37:136149–136172, 2024

Jia-Wei Liu, Weijia Mao, Zhongcong Xu, Jussi Keppo, and Mike Z Shou. Exocentric-to-egocentric video generation.Advances in Neural Information Processing Systems, 37:136149–136172, 2024

work page 2024

[48] [48]

Smpl: A skinned multi-person linear model

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. InSeminal Graphics Papers: Pushing the Boundaries, V olume 2, pages 851–866. 2023

work page 2023

[49] [49]

Put myself in your shoes: Lifting the egocentric perspective from exocentric videos

Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos. InEuropean Conference on Computer Vision, pages 407–425. Springer, 2024

work page 2024

[50] [50]

Nymeria: A massive collection of multimodal egocentric daily motion in the wild

Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. InEuropean Conference on Computer Vision, pages 445–465. Springer, 2024

work page 2024

[51] [51]

On human motion prediction using recurrent neural networks

Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2891–2900, 2017

work page 2017

[52] [52]

Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019

work page 2019

[53] [53]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021. 12

work page 2021

[54] [54]

Movella, 2021

Movella.MVN User Manual. Movella, 2021. https://www.movella.com/hubfs/MVN_User_Manual. pdf

work page 2021

[55] [55]

V-JEPA 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

work page arXiv 2026

[56] [56]

Egocontrol: Controllable egocentric video generation via 3d full-body poses

Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, and Juergen Gall. Egocontrol: Controllable egocentric video generation via 3d full-body poses.arXiv preprint arXiv:2511.18173, 2025

work page arXiv 2025

[57] Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, and Ehsan Adeli. Uniegomotion: A unified model for egocentric motion reconstruction, forecasting, and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10318–10329, 2025.

[58] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.

[59] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024.

[60] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.

[61] Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. Worldgym: World model as an environment for policy evaluation. arXiv preprint arXiv:2506.00613, 2025.

[62] Nishant Rai, Haofeng Chen, Jingwei Ji, Rishi Desai, Kazuki Kozuka, Shun Ishizaka, Ehsan Adeli, and Juan Carlos Niebles. Home action genome: Cooperative compositional action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11184–11193, 2021.

[63] Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022.

[64] Dandan Shan, Jiaqi Geng, Michelle Shu, and David F Fouhey. Understanding human hands in contact at internet scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9869–9878, 2020.

[65] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In European Conference on Computer Vision, pages 461–478. Springer, 2022.

[66] Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070–2080, 2024.

[67] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[68] Himanshu Gaurav Singh, Pieter Abbeel, Jitendra Malik, and Antonio Loquercio. Deep sensorimotor control by imitating predictive models of human motion. arXiv preprint arXiv:2508.18691, 2025.

[69] Bérangere Thirioux, Manuel R Mercier, Gérard Jorland, Alain Berthoz, and Olaf Blanke. Mental imagery of self-location during spontaneous and active self–other interactions: An electrical neuroimaging study. Journal of Neuroscience, 30(21):7202–7214, 2010.

[70] Michael Tomasello, Sue Savage-Rumbaugh, and Ann Cale Kruger. Imitative learning of actions on objects by children, chimpanzees, and enculturated chimpanzees. Child Development, 64(6):1688–1705, 1993.

[71] Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, and Hengshuang Zhao. Playerone: Egocentric world simulator. arXiv preprint arXiv:2506.09995, 2025.

[72] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[73] Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Egoexo-gen: Ego-centric video prediction by watching exo-centric videos. arXiv preprint arXiv:2504.11732, 2025.

[74] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022.

[75] Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989, 2026.

[76] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023.

[77] Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026.

[78] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21222–21232, 2023.

[79] Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, et al. Smplest-x: Ultimate scaling for expressive human pose and shape estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

[80] Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026.