pith. machine review for the scientific record

arxiv: 2604.19105 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric motion generation · vision-language models · diffusion models · hierarchical reasoning · 3D human motion · first-person perception · motion synthesis · motion primitives

The pith

Decoupling semantic reasoning into discrete motion primitives before diffusion synthesis resolves gradient conflicts in egocentric motion generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problem of creating 3D human motion sequences from first-person video observations paired with natural language instructions. It observes that jointly optimizing for semantic understanding and physical kinematics produces conflicting training signals that reduce output quality. EgoMotion addresses this through a two-stage process: a vision-language model first converts the inputs into structured discrete motion primitives, then a diffusion model uses those primitives as conditioning to generate smooth, coherent trajectories. This separation yields motions that align better with the given instructions and exhibit improved physical plausibility compared to single-stage approaches. This matters for embodied AI because accurate first-person motion synthesis supports more reliable simulation of human behavior in dynamic settings.

Core claim

EgoMotion is a hierarchical generative framework that decouples cognitive reasoning from kinematic modeling. In the first stage, a vision-language model projects egocentric visual observations and language instructions into a space of discrete motion primitives, forcing acquisition of goal-consistent representations. In the second stage, these primitives condition a diffusion-based generator that performs iterative denoising in continuous latent space to synthesize physically plausible and temporally coherent 3D motion trajectories.
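
To make the decoupled interface concrete, here is a minimal sketch of the two-stage data flow in PyTorch. The module names (EgoVLM, MotionDiffuser), feature dimensions, and pooling choices are illustrative assumptions; the paper describes the architecture but publishes no API, so this shows only the shape of the claim, not the implementation.

```python
# Minimal sketch of the two-stage EgoMotion interface. All module names,
# dimensions, and pooling choices are hypothetical stand-ins, not the
# paper's published implementation.
import torch
import torch.nn as nn

class EgoVLM(nn.Module):
    """Stage 1: map egocentric frames + instruction tokens to discrete primitives."""
    def __init__(self, vocab_size=512, dim=256, num_primitives=64):
        super().__init__()
        self.visual = nn.Linear(768, dim)           # stand-in for a ViT feature projector
        self.text = nn.Embedding(vocab_size, dim)   # stand-in for a language encoder
        self.head = nn.Linear(dim, num_primitives)  # logits over the primitive codebook

    def forward(self, frame_feats, token_ids):
        v = self.visual(frame_feats).mean(dim=1)    # pool over frames
        t = self.text(token_ids).mean(dim=1)        # pool over tokens
        return self.head(v + t)                     # (B, num_primitives) logits

class MotionDiffuser(nn.Module):
    """Stage 2: predict noise on a motion latent, conditioned on primitives."""
    def __init__(self, dim=256, motion_dim=63, num_primitives=64):
        super().__init__()
        self.primitive_emb = nn.Embedding(num_primitives, dim)
        self.denoiser = nn.Sequential(
            nn.Linear(motion_dim + dim + 1, 512), nn.SiLU(),
            nn.Linear(512, motion_dim))

    def forward(self, noisy_motion, t, primitive_ids):
        cond = self.primitive_emb(primitive_ids)    # (B, dim) conditioning signal
        t = t.float().unsqueeze(-1) / 1000.0        # scalar timestep feature
        return self.denoiser(torch.cat([noisy_motion, cond, t], dim=-1))

# Stage 1's discrete output is the only channel into stage 2, so each stage
# can be trained on its own objective -- the decoupling the claim rests on.
vlm, diffuser = EgoVLM(), MotionDiffuser()
frames, tokens = torch.randn(2, 8, 768), torch.randint(0, 512, (2, 12))
primitive_ids = vlm(frames, tokens).argmax(dim=-1)
eps_hat = diffuser(torch.randn(2, 63), torch.tensor([500, 500]), primitive_ids)
```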

What carries the argument

The two-stage hierarchy that maps multimodal inputs to discrete motion primitives via a vision-language model, then feeds those as conditioning to a diffusion motion generator.

If this is right

  • State-of-the-art performance on egocentric vision-language motion generation tasks.
  • Motion sequences that are semantically grounded in both language instructions and first-person visual observations.
  • Kinematically superior trajectories compared to existing joint-optimization approaches.
  • Avoidance of gradient conflicts that systematically degrade multimodal grounding and motion fidelity.
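
The last point is directly measurable. A standard diagnostic from multi-task learning checks the cosine similarity between the two objectives' gradients on shared parameters; a negative value indicates conflict. The sketch below uses a toy shared layer and toy stand-in losses, not the paper's objectives:

```python
# Hedged diagnostic for gradient conflict between a "reasoning" loss and a
# "kinematic" loss on shared parameters. Both losses are toy stand-ins.
import torch
import torch.nn as nn

shared = nn.Linear(32, 32)                     # shared backbone parameters
h = shared(torch.randn(16, 32))

semantic_loss = h.softmax(-1).amax(-1).mean()          # toy reasoning objective
kinematic_loss = (h[:, 1:] - h[:, :-1]).pow(2).mean()  # toy smoothness objective

g_sem = torch.autograd.grad(semantic_loss, shared.weight, retain_graph=True)[0]
g_kin = torch.autograd.grad(kinematic_loss, shared.weight)[0]

cos = nn.functional.cosine_similarity(g_sem.flatten(), g_kin.flatten(), dim=0)
print(f"gradient cosine similarity: {cos.item():+.3f}")  # < 0 means conflict
```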

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The discrete primitive representation could support easier adaptation to new environments or motion styles without full retraining.
  • Similar staged decoupling might address optimization conflicts in other multimodal generation domains such as audio-conditioned video or text-to-3D synthesis.
  • Deployment in real robotic systems could test whether the learned primitives transfer effectively from simulation to physical hardware.
  • Varying the granularity of the discrete motion primitives offers a testable way to measure trade-offs between semantic fidelity and motion detail.

Load-bearing premise

Mapping multimodal inputs to discrete motion primitives reliably bridges the semantic gap and avoids information loss or new conflicts when conditioning the diffusion generator.
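
This premise, together with the granularity question raised above, reduces to something measurable: how much kinematic detail survives quantization into K primitives? A hedged sketch follows, assuming a VQ-style nearest-neighbor bottleneck with toy random codebooks (a trained system would learn them, and would quantize latents rather than raw poses):

```python
# Sketch: reconstruction error of pose and velocity features after
# nearest-neighbor quantization, swept over codebook size K.
import torch

def quantize(feats, codebook):
    """Assign each feature to its nearest codebook entry (VQ bottleneck)."""
    idx = torch.cdist(feats, codebook).argmin(dim=-1)
    return codebook[idx]

torch.manual_seed(0)
motion = torch.randn(256, 48)            # toy per-frame pose features
velocity = motion[1:] - motion[:-1]      # finite-difference velocities

for K in (16, 64, 256):                  # the granularity sweep proposed above
    codebook = torch.randn(K, 48)        # random here; learned in practice
    recon = quantize(motion, codebook)
    pose_err = (recon - motion).pow(2).mean().item()
    vel_err = ((recon[1:] - recon[:-1]) - velocity).pow(2).mean().item()
    print(f"K={K:4d}  pose MSE={pose_err:.3f}  velocity MSE={vel_err:.3f}")
```

If velocity error stays high even as K grows, the premise is in trouble: the primitives would be discarding exactly the kinematic detail the diffusion stage is supposed to recover.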

What would settle it

An end-to-end single-stage model that jointly optimizes the vision-language and diffusion components achieving equal or higher scores on semantic grounding and kinematic quality metrics than the two-stage EgoMotion on the same evaluation benchmarks.

Figures

Figures reproduced from arXiv: 2604.19105 by Bingpeng Ma, Hong Chang, Mingshuang Luo, Mingyue Zhou, Ruibing Hou, Shiguang Shan, Xilin Chen, Yuwei Gui.

Figure 1. We propose a hierarchical framework EgoMotion for Ego-VL Motion Generation, aiming to synthesize motion sequences conditioned jointly on egocentric visual observations and natural language instructions.
Figure 2. Overview of EgoMotion. The framework employs a hierarchical two-stage paradigm.
Figure 3. Qualitative comparison of motion generation across different paradigms.
Original abstract

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical "reasoning-generation entanglement" challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework, EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
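
For orientation, the abstract's "iterative denoising within a continuous latent space" follows the standard conditional diffusion recipe. The loop below is a generic DDPM-style sampler with an untrained stub where the learned noise predictor would sit; the schedule and dimensions are illustrative, not the paper's:

```python
# Generic DDPM-style conditional sampling loop (illustrative, not EgoMotion's).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def denoiser(x_t, t, cond):
    """Stub for the learned noise predictor eps_theta(x_t, t, cond)."""
    return torch.zeros_like(x_t)

def sample(cond, motion_dim=63):
    x = torch.randn(1, motion_dim)          # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)
        # posterior mean of x_{t-1} given the predicted noise (standard DDPM)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                           # re-inject noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

motion = sample(cond=torch.randn(1, 256))   # cond = primitive embedding from stage 1
```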

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EgoMotion, a two-stage hierarchical framework for egocentric vision-language motion generation. The Cognitive Reasoning stage employs a vision-language model to project first-person visual observations and natural language instructions onto a space of discrete motion primitives, aiming to resolve semantic-kinematic grounding issues. The Motion Generation stage then uses these primitives as conditioning signals for a diffusion model to synthesize 3D human motion trajectories. The central claim is that this biologically inspired decoupling avoids gradient conflicts inherent in joint optimization, yielding state-of-the-art performance with semantically grounded and kinematically superior outputs compared to prior approaches.

Significance. If the empirical claims hold, the work could meaningfully advance embodied AI by demonstrating a practical hierarchical solution to multimodal entanglement problems in motion synthesis. The explicit separation of high-level reasoning from low-level generation offers a template that might generalize to other vision-language-action tasks, particularly in dynamic first-person settings where continuous end-to-end models often struggle with fidelity.

major comments (2)
  1. [Abstract] The assertion that EgoMotion 'achieves state-of-the-art performance' and resolves gradient conflicts is presented without any quantitative metrics, ablation studies, error analysis, or comparison tables. This absence is load-bearing for the central claim, as the superiority over existing approaches cannot be evaluated from the provided description alone.
  2. [Abstract] Cognitive Reasoning stage description: The claim that mapping to discrete motion primitives 'forces the VLM to acquire goal-consistent representations' and bridges the semantic gap without new conflicts or information loss is not supported by evidence that the discretization preserves fine-grained egocentric kinematic details (e.g., subtle velocity or viewpoint-specific pose variations). This step is load-bearing for the subsequent diffusion stage's claimed kinematic superiority.
minor comments (1)
  1. [Abstract] The phrase 'reasoning-generation entanglement' is introduced as a novel challenge without a formal definition, mathematical formulation, or citation to related work on gradient conflicts in multimodal generative models.
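
For reference, the multi-task learning literature (e.g., PCGrad) offers a standard formalization the authors could adopt; the definition below is that candidate, not one the paper states:

```latex
% Candidate formalization of "reasoning-generation entanglement" as gradient
% conflict, borrowed from multi-task learning (not the paper's definition).
% For shared parameters $\theta$ and losses $\mathcal{L}_{\mathrm{sem}}$,
% $\mathcal{L}_{\mathrm{kin}}$ with gradients $g_i = \nabla_\theta \mathcal{L}_i$,
% the objectives conflict at $\theta$ when
\[
  \cos\!\big(g_{\mathrm{sem}}, g_{\mathrm{kin}}\big)
    = \frac{\langle g_{\mathrm{sem}}, g_{\mathrm{kin}} \rangle}
           {\lVert g_{\mathrm{sem}} \rVert \, \lVert g_{\mathrm{kin}} \rVert} < 0 .
\]
```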

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications drawn from the full manuscript and indicate the revisions made to strengthen the abstract.

Point-by-point responses
  1. Referee: [Abstract] The assertion that EgoMotion 'achieves state-of-the-art performance' and resolves gradient conflicts is presented without any quantitative metrics, ablation studies, error analysis, or comparison tables. This absence is load-bearing for the central claim, as the superiority over existing approaches cannot be evaluated from the provided description alone.

    Authors: The abstract is a concise summary of the work. The full manuscript contains the requested quantitative support, including comparison tables (Section 4.1), ablation studies on the hierarchical separation and gradient conflict mitigation (Section 4.3), and error analyses (Section 5). To address the referee's concern and render the abstract more self-contained, we have revised it to include key performance metrics and a brief reference to these supporting experiments. revision: yes

  2. Referee: [Abstract] Cognitive Reasoning stage description: The claim that mapping to discrete motion primitives 'forces the VLM to acquire goal-consistent representations' and bridges the semantic gap without new conflicts or information loss is not supported by evidence that the discretization preserves fine-grained egocentric kinematic details (e.g., subtle velocity or viewpoint-specific pose variations). This step is load-bearing for the subsequent diffusion stage's claimed kinematic superiority.

    Authors: We agree that explicit evidence for preservation of fine-grained kinematic details under discretization is important. The manuscript provides this through ablations in Section 4.3 that compare discrete primitives against continuous alternatives, reporting low reconstruction errors on velocity and egocentric pose variations along with improved downstream motion fidelity. We have revised the abstract's description of the Cognitive Reasoning stage to note this supporting evidence and the resulting kinematic advantages. revision: yes

Circularity Check

0 steps flagged

No circularity: standard hierarchical proposal with external evaluation

Full rationale

The paper describes a two-stage architecture (Cognitive Reasoning via VLM to discrete primitives, followed by diffusion-based Motion Generation) to mitigate claimed gradient conflicts in egocentric VL motion synthesis. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps in the provided text. The SOTA performance claim rests on external evaluations rather than any reduction of outputs to inputs by construction, self-definition, or imported uniqueness theorems. This matches the default non-circular case for descriptive ML frameworks without mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract relies on standard assumptions from vision-language modeling and diffusion processes without introducing new free parameters or entities; the key premise is that discrete motion primitives can serve as an effective interface between semantic understanding and kinematic generation.

axioms (2)
  • Domain assumption: A vision-language model can be trained to project multimodal egocentric inputs into a structured space of discrete motion primitives that preserve goal-consistent semantics.
    Invoked in the Cognitive Reasoning stage description as the mechanism that bridges the semantic gap.
  • Domain assumption: Iterative denoising in a continuous latent space conditioned on those primitives will produce physically plausible and temporally coherent 3D trajectories.
    Stated as the operation of the Motion Generation stage.

pith-pipeline@v0.9.0 · 5583 in / 1413 out tokens · 38766 ms · 2026-05-10T03:35:17.539741+00:00 · methodology

