pith. machine review for the scientific record

arxiv: 2604.19105 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric motion generation · vision-language models · diffusion models · hierarchical reasoning · 3D human motion · first-person perception · motion synthesis · motion primitives

The pith

Decoupling semantic reasoning into discrete motion primitives before diffusion synthesis resolves gradient conflicts in egocentric motion generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problem of creating 3D human motion sequences from first-person video observations paired with natural language instructions. It observes that jointly optimizing for semantic understanding and physical kinematics produces conflicting training signals that reduce output quality. EgoMotion addresses this through a two-stage process: a vision-language model first converts the inputs into structured discrete motion primitives, then a diffusion model uses those primitives as conditioning to generate smooth, coherent trajectories. This separation yields motions that align better with the given instructions and exhibit improved physical plausibility compared to single-stage approaches. This matters for embodied AI because accurate first-person motion synthesis supports more reliable simulation of human behavior in dynamic settings.

Core claim

EgoMotion is a hierarchical generative framework that decouples cognitive reasoning from kinematic modeling. In the first stage, a vision-language model projects egocentric visual observations and language instructions into a space of discrete motion primitives, forcing acquisition of goal-consistent representations. In the second stage, these primitives condition a diffusion-based generator that performs iterative denoising in continuous latent space to synthesize physically plausible and temporally coherent 3D motion trajectories.
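
To make the decoupled interface concrete, here is a minimal sketch of the two-stage data flow in PyTorch. The module names (EgoVLM, MotionDiffuser), feature dimensions, and pooling choices are illustrative assumptions; the paper describes the architecture but publishes no API, so this shows only the shape of the claim, not the implementation.

```python
# Minimal sketch of the two-stage EgoMotion interface. All module names,
# dimensions, and pooling choices are hypothetical stand-ins, not the
# paper's published implementation.
import torch
import torch.nn as nn

class EgoVLM(nn.Module):
    """Stage 1: map egocentric frames + instruction tokens to discrete primitives."""
    def __init__(self, vocab_size=512, dim=256, num_primitives=64):
        super().__init__()
        self.visual = nn.Linear(768, dim)           # stand-in for a ViT feature projector
        self.text = nn.Embedding(vocab_size, dim)   # stand-in for a language encoder
        self.head = nn.Linear(dim, num_primitives)  # logits over the primitive codebook

    def forward(self, frame_feats, token_ids):
        v = self.visual(frame_feats).mean(dim=1)    # pool over frames
        t = self.text(token_ids).mean(dim=1)        # pool over tokens
        return self.head(v + t)                     # (B, num_primitives) logits

class MotionDiffuser(nn.Module):
    """Stage 2: predict noise on a motion latent, conditioned on primitives."""
    def __init__(self, dim=256, motion_dim=63, num_primitives=64):
        super().__init__()
        self.primitive_emb = nn.Embedding(num_primitives, dim)
        self.denoiser = nn.Sequential(
            nn.Linear(motion_dim + dim + 1, 512), nn.SiLU(),
            nn.Linear(512, motion_dim))

    def forward(self, noisy_motion, t, primitive_ids):
        cond = self.primitive_emb(primitive_ids)    # (B, dim) conditioning signal
        t = t.float().unsqueeze(-1) / 1000.0        # scalar timestep feature
        return self.denoiser(torch.cat([noisy_motion, cond, t], dim=-1))

# Stage 1's discrete output is the only channel into stage 2, so each stage
# can be trained on its own objective -- the decoupling the claim rests on.
vlm, diffuser = EgoVLM(), MotionDiffuser()
frames, tokens = torch.randn(2, 8, 768), torch.randint(0, 512, (2, 12))
primitive_ids = vlm(frames, tokens).argmax(dim=-1)
eps_hat = diffuser(torch.randn(2, 63), torch.tensor([500, 500]), primitive_ids)
```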

What carries the argument

The two-stage hierarchy that maps multimodal inputs to discrete motion primitives via a vision-language model, then feeds those as conditioning to a diffusion motion generator.

If this is right

  • State-of-the-art performance on egocentric vision-language motion generation tasks.
  • Motion sequences that are semantically grounded in both language instructions and first-person visual observations.
  • Kinematically superior trajectories compared to existing joint-optimization approaches.
  • Avoidance of gradient conflicts that systematically degrade multimodal grounding and motion fidelity.
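
The last point is directly measurable. A standard diagnostic from multi-task learning checks the cosine similarity between the two objectives' gradients on shared parameters; a negative value indicates conflict. The sketch below uses a toy shared layer and toy stand-in losses, not the paper's objectives:

```python
# Hedged diagnostic for gradient conflict between a "reasoning" loss and a
# "kinematic" loss on shared parameters. Both losses are toy stand-ins.
import torch
import torch.nn as nn

shared = nn.Linear(32, 32)                     # shared backbone parameters
h = shared(torch.randn(16, 32))

semantic_loss = h.softmax(-1).amax(-1).mean()          # toy reasoning objective
kinematic_loss = (h[:, 1:] - h[:, :-1]).pow(2).mean()  # toy smoothness objective

g_sem = torch.autograd.grad(semantic_loss, shared.weight, retain_graph=True)[0]
g_kin = torch.autograd.grad(kinematic_loss, shared.weight)[0]

cos = nn.functional.cosine_similarity(g_sem.flatten(), g_kin.flatten(), dim=0)
print(f"gradient cosine similarity: {cos.item():+.3f}")  # < 0 means conflict
```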

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The discrete primitive representation could support easier adaptation to new environments or motion styles without full retraining.
  • Similar staged decoupling might address optimization conflicts in other multimodal generation domains such as audio-conditioned video or text-to-3D synthesis.
  • Deployment in real robotic systems could test whether the learned primitives transfer effectively from simulation to physical hardware.
  • Varying the granularity of the discrete motion primitives offers a testable way to measure trade-offs between semantic fidelity and motion detail.

Load-bearing premise

Mapping multimodal inputs to discrete motion primitives reliably bridges the semantic gap and avoids information loss or new conflicts when conditioning the diffusion generator.
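
This premise, together with the granularity question raised above, reduces to something measurable: how much kinematic detail survives quantization into K primitives? A hedged sketch follows, assuming a VQ-style nearest-neighbor bottleneck with toy random codebooks (a trained system would learn them, and would quantize latents rather than raw poses):

```python
# Sketch: reconstruction error of pose and velocity features after
# nearest-neighbor quantization, swept over codebook size K.
import torch

def quantize(feats, codebook):
    """Assign each feature to its nearest codebook entry (VQ bottleneck)."""
    idx = torch.cdist(feats, codebook).argmin(dim=-1)
    return codebook[idx]

torch.manual_seed(0)
motion = torch.randn(256, 48)            # toy per-frame pose features
velocity = motion[1:] - motion[:-1]      # finite-difference velocities

for K in (16, 64, 256):                  # the granularity sweep proposed above
    codebook = torch.randn(K, 48)        # random here; learned in practice
    recon = quantize(motion, codebook)
    pose_err = (recon - motion).pow(2).mean().item()
    vel_err = ((recon[1:] - recon[:-1]) - velocity).pow(2).mean().item()
    print(f"K={K:4d}  pose MSE={pose_err:.3f}  velocity MSE={vel_err:.3f}")
```

If velocity error stays high even as K grows, the premise is in trouble: the primitives would be discarding exactly the kinematic detail the diffusion stage is supposed to recover.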

What would settle it

An end-to-end single-stage model that jointly optimizes the vision-language and diffusion components achieving equal or higher scores on semantic grounding and kinematic quality metrics than the two-stage EgoMotion on the same evaluation benchmarks.

Figures

Figures reproduced from arXiv: 2604.19105 by Bingpeng Ma, Hong Chang, Mingshuang Luo, Mingyue Zhou, Ruibing Hou, Shiguang Shan, Xilin Chen, Yuwei Gui.

Figure 1. We propose a hierarchical framework EgoMotion for Ego-VL Motion Generation, aiming to synthesize motion sequences conditioned jointly on egocentric visual observations and natural language instructions.
Figure 2. Overview of EgoMotion. The framework employs a hierarchical two-stage paradigm.
Figure 3. Qualitative comparison of motion generation across different paradigms.
Original abstract

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical "reasoning-generation entanglement" challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework, EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
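
For orientation, the abstract's "iterative denoising within a continuous latent space" follows the standard conditional diffusion recipe. The loop below is a generic DDPM-style sampler with an untrained stub where the learned noise predictor would sit; the schedule and dimensions are illustrative, not the paper's:

```python
# Generic DDPM-style conditional sampling loop (illustrative, not EgoMotion's).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def denoiser(x_t, t, cond):
    """Stub for the learned noise predictor eps_theta(x_t, t, cond)."""
    return torch.zeros_like(x_t)

def sample(cond, motion_dim=63):
    x = torch.randn(1, motion_dim)          # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)
        # posterior mean of x_{t-1} given the predicted noise (standard DDPM)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                           # re-inject noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

motion = sample(cond=torch.randn(1, 256))   # cond = primitive embedding from stage 1
```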

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EgoMotion, a two-stage hierarchical framework for egocentric vision-language motion generation. The Cognitive Reasoning stage employs a vision-language model to project first-person visual observations and natural language instructions onto a space of discrete motion primitives, aiming to resolve semantic-kinematic grounding issues. The Motion Generation stage then uses these primitives as conditioning signals for a diffusion model to synthesize 3D human motion trajectories. The central claim is that this biologically inspired decoupling avoids gradient conflicts inherent in joint optimization, yielding state-of-the-art performance with semantically grounded and kinematically superior outputs compared to prior approaches.

Significance. If the empirical claims hold, the work could meaningfully advance embodied AI by demonstrating a practical hierarchical solution to multimodal entanglement problems in motion synthesis. The explicit separation of high-level reasoning from low-level generation offers a template that might generalize to other vision-language-action tasks, particularly in dynamic first-person settings where continuous end-to-end models often struggle with fidelity.

major comments (2)
  1. [Abstract] The assertion that EgoMotion 'achieves state-of-the-art performance' and resolves gradient conflicts is presented without any quantitative metrics, ablation studies, error analysis, or comparison tables. This absence is load-bearing for the central claim, as the superiority over existing approaches cannot be evaluated from the provided description alone.
  2. [Abstract] Cognitive Reasoning stage description: The claim that mapping to discrete motion primitives 'forces the VLM to acquire goal-consistent representations' and bridges the semantic gap without new conflicts or information loss is not supported by evidence that the discretization preserves fine-grained egocentric kinematic details (e.g., subtle velocity or viewpoint-specific pose variations). This step is load-bearing for the subsequent diffusion stage's claimed kinematic superiority.
minor comments (1)
  1. [Abstract] The phrase 'reasoning-generation entanglement' is introduced as a novel challenge without a formal definition, mathematical formulation, or citation to related work on gradient conflicts in multimodal generative models.
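
For reference, the multi-task learning literature (e.g., PCGrad) offers a standard formalization the authors could adopt; the definition below is that candidate, not one the paper states:

```latex
% Candidate formalization of "reasoning-generation entanglement" as gradient
% conflict, borrowed from multi-task learning (not the paper's definition).
% For shared parameters $\theta$ and losses $\mathcal{L}_{\mathrm{sem}}$,
% $\mathcal{L}_{\mathrm{kin}}$ with gradients $g_i = \nabla_\theta \mathcal{L}_i$,
% the objectives conflict at $\theta$ when
\[
  \cos\!\big(g_{\mathrm{sem}}, g_{\mathrm{kin}}\big)
    = \frac{\langle g_{\mathrm{sem}}, g_{\mathrm{kin}} \rangle}
           {\lVert g_{\mathrm{sem}} \rVert \, \lVert g_{\mathrm{kin}} \rVert} < 0 .
\]
```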

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications drawn from the full manuscript and indicate the revisions made to strengthen the abstract.

Point-by-point responses
  1. Referee: [Abstract] The assertion that EgoMotion 'achieves state-of-the-art performance' and resolves gradient conflicts is presented without any quantitative metrics, ablation studies, error analysis, or comparison tables. This absence is load-bearing for the central claim, as the superiority over existing approaches cannot be evaluated from the provided description alone.

    Authors: The abstract is a concise summary of the work. The full manuscript contains the requested quantitative support, including comparison tables (Section 4.1), ablation studies on the hierarchical separation and gradient conflict mitigation (Section 4.3), and error analyses (Section 5). To address the referee's concern and render the abstract more self-contained, we have revised it to include key performance metrics and a brief reference to these supporting experiments. revision: yes

  2. Referee: [Abstract] Cognitive Reasoning stage description: The claim that mapping to discrete motion primitives 'forces the VLM to acquire goal-consistent representations' and bridges the semantic gap without new conflicts or information loss is not supported by evidence that the discretization preserves fine-grained egocentric kinematic details (e.g., subtle velocity or viewpoint-specific pose variations). This step is load-bearing for the subsequent diffusion stage's claimed kinematic superiority.

    Authors: We agree that explicit evidence for preservation of fine-grained kinematic details under discretization is important. The manuscript provides this through ablations in Section 4.3 that compare discrete primitives against continuous alternatives, reporting low reconstruction errors on velocity and egocentric pose variations along with improved downstream motion fidelity. We have revised the abstract's description of the Cognitive Reasoning stage to note this supporting evidence and the resulting kinematic advantages. revision: yes

Circularity Check

0 steps flagged

No circularity: standard hierarchical proposal with external evaluation

Full rationale

The paper describes a two-stage architecture (Cognitive Reasoning via VLM to discrete primitives, followed by diffusion-based Motion Generation) to mitigate claimed gradient conflicts in egocentric VL motion synthesis. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps in the provided text. The SOTA performance claim rests on external evaluations rather than any reduction of outputs to inputs by construction, self-definition, or imported uniqueness theorems. This matches the default non-circular case for descriptive ML frameworks without mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract relies on standard assumptions from vision-language modeling and diffusion processes without introducing new free parameters or entities; the key premise is that discrete motion primitives can serve as an effective interface between semantic understanding and kinematic generation.

axioms (2)
  • Domain assumption: A vision-language model can be trained to project multimodal egocentric inputs into a structured space of discrete motion primitives that preserve goal-consistent semantics.
    Invoked in the Cognitive Reasoning stage description as the mechanism that bridges the semantic gap.
  • Domain assumption: Iterative denoising in a continuous latent space conditioned on those primitives will produce physically plausible and temporally coherent 3D trajectories.
    Stated as the operation of the Motion Generation stage.

pith-pipeline@v0.9.0 · 5583 in / 1413 out tokens · 38766 ms · 2026-05-10T03:35:17.539741+00:00 · methodology

