AtomicMotion: Learning Human Motion From Different Human Parts

Chuhua Xian; Fa-Ting Hong; Runzhen Liu

arxiv: 2605.22631 · v1 · pith:K2OHEBOQnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 06:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords human motion reconstruction · sparse pose estimation · body partitioning · kinematic attention · full-body synthesis · AR/VR telepresence · biomechanical constraints · AMASS dataset

The pith

AtomicMotion reconstructs full-body poses from sparse head and hand data by splitting the skeleton into five functional clusters and embedding kinematic structure into attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem of turning limited head and hand movement data into accurate full-body poses for AR/VR systems. It claims that treating the whole body as one unit causes accumulated errors and unnatural coordination because it misses local movement goals and body structure. To fix this, AtomicMotion splits the skeleton into five clusters chosen by how body parts are meant to move together, trains the model with masked full-body examples so it learns overall constraints, and replaces standard attention with a version that respects the fixed connections between joints. A reader would care if this makes virtual telepresence feel more natural without needing sensors on every joint.

Core claim

AtomicMotion decouples body dynamics in three steps. First, it decomposes the skeleton into five distinct clusters based on functional intent, so that each cluster keeps its internal joint synergies while local motion primitives are isolated. Second, it applies masked full-body pre-conditioning during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Third, it replaces vanilla spatial attention with Kinematic Attention, which embeds the classical kinematic tree structure to ensure biological plausibility. The paper reports higher reconstruction fidelity and superior biomechanical realism on the AMASS dataset as the result.

What carries the argument

AtomicMotion framework built around logical body partitioning into five functional clusters, masked full-body pre-conditioning, and Kinematic Attention that injects the fixed kinematic tree into the attention computation.
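
One way to picture how a fixed kinematic tree can enter an attention computation is a hop-distance mask, where each joint attends only to joints within a few bones of it. The sketch below illustrates that idea only. It is not the paper's Kinematic Attention, which Figure 4 describes as dual-branch and whose details are not given on this page. The 7-joint `PARENTS` list, the function names, and the `max_hops` parameter are all hypothetical.

```python
import math
from collections import deque

# Hypothetical 7-joint toy skeleton: PARENTS[j] is the parent of joint j, -1 marks the root.
# 0 pelvis, 1 spine, 2 head, 3 left arm, 4 right arm, 5 left leg, 6 right leg
PARENTS = [-1, 0, 1, 1, 1, 0, 0]


def hop_distances(parents):
    """All-pairs hop distance along the tree, via BFS from every joint."""
    n = len(parents)
    neighbors = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            neighbors[child].append(parent)
            neighbors[parent].append(child)
    dist = [[math.inf] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src][v] == math.inf:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def kinematic_attention_weights(scores, parents, max_hops=1):
    """Row-wise softmax over scores, restricted to joints within max_hops on the tree."""
    dist = hop_distances(parents)
    weights = []
    for i, row in enumerate(scores):
        allowed = [j for j in range(len(row)) if dist[i][j] <= max_hops]
        peak = max(row[j] for j in allowed)  # subtract the max for numerical stability
        exps = {j: math.exp(row[j] - peak) for j in allowed}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights


# With uniform scores, each joint spreads its attention evenly over its tree neighborhood.
uniform_scores = [[0.0] * len(PARENTS) for _ in PARENTS]
attn = kinematic_attention_weights(uniform_scores, PARENTS, max_hops=1)
```

In this toy setup the head (joint 2) attends only to itself and the spine, so it places no weight on either leg. A soft variant would add a distance-dependent bias to the scores instead of masking them outright.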

If this is right

Local motion primitives can be learned separately in each cluster without breaking overall coordination when the partitions respect functional intent.
Masked pre-conditioning teaches the network to fill in missing body parts from sparse signals by internalizing global constraints.
Kinematic Attention prevents generation of physiologically impossible joint configurations by enforcing the fixed tree structure.
The combination yields measurably higher fidelity and biomechanical realism than monolithic baselines on standard motion datasets.
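
The masked pre-conditioning idea in the list above can be sketched as a data-preparation step: hide a random subset of joints during training while always keeping the tracked head and hands visible. This is a minimal illustration under stated assumptions, not the paper's procedure. The joint indices in `OBSERVED`, the 22-joint layout, the zero-fill choice, and the `mask_ratio` parameter are all hypothetical.

```python
import random

# Hypothetical joint indices for the tracked signals (head and two hands) in a 22-joint layout.
OBSERVED = (15, 20, 21)
NUM_JOINTS = 22


def mask_pose(pose, mask_ratio, rng, observed=OBSERVED):
    """Hide a random subset of unobserved joints; return (masked_pose, visible_flags).

    pose is a list of per-joint feature lists. Hidden joints are zero-filled here;
    a learned mask token would be a common alternative.
    """
    hideable = [j for j in range(len(pose)) if j not in observed]
    num_hidden = round(mask_ratio * len(hideable))
    hidden = set(rng.sample(hideable, num_hidden))
    masked = [[0.0] * len(feat) if j in hidden else list(feat) for j, feat in enumerate(pose)]
    visible = [j not in hidden for j in range(len(pose))]
    return masked, visible


rng = random.Random(0)
full_pose = [[float(j), float(j) + 0.5, 1.0] for j in range(NUM_JOINTS)]

# mask_ratio=1.0 reproduces the sparse-input condition: only head and hands remain visible.
masked_pose, visible = mask_pose(full_pose, mask_ratio=1.0, rng=rng)
```

Lower values of `mask_ratio` leave more of the body visible, which is one way a training schedule could move from easy full-body reconstruction toward the sparse case.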

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cluster-based split could be tested on other sparse-input problems such as full-body prediction from only foot or torso data.
If the method scales, it could lower the sensor count needed for convincing VR telepresence sessions.
Dynamic re-clustering during inference might further improve results for activities that change which body parts move together.
Integration with physics simulators could add collision and balance checks that the current kinematic attention does not explicitly enforce.

Load-bearing premise

That decomposing the skeleton into five clusters based on functional intent preserves internal joint synergies while isolating local motion primitives.
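
To make the premise concrete, the sketch below checks that a candidate five-way grouping covers every joint exactly once and lists the bones that the grouping cuts, which are the places where cross-cluster coordination has to be restored by other means. The cluster membership shown is a hypothetical grouping for illustration, since the paper's actual assignment is not given on this page. The `PARENTS` list is an assumed SMPL-style 22-joint kinematic tree.

```python
# Assumed SMPL-style 22-joint tree: PARENTS[j] is the parent of joint j, -1 marks the root.
PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 12, 13, 14, 16, 17, 18, 19]

# Hypothetical five-way grouping; not the paper's actual cluster membership.
CLUSTERS = {
    "torso_head": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}


def check_partition(clusters, num_joints):
    """True when every joint index appears in exactly one cluster."""
    assigned = sorted(j for members in clusters.values() for j in members)
    return assigned == list(range(num_joints))


def cross_cluster_bones(clusters, parents):
    """Bones (parent, child) whose endpoints fall in different clusters."""
    owner = {j: name for name, members in clusters.items() for j in members}
    return [(p, c) for c, p in enumerate(parents) if p >= 0 and owner[p] != owner[c]]


is_valid = check_partition(CLUSTERS, len(PARENTS))
cut_bones = cross_cluster_bones(CLUSTERS, PARENTS)
```

For this grouping the cut bones are the two hip attachments and the two collar attachments, so any arm-torso or leg-pelvis synergy crosses a cluster boundary.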

What would settle it

If a controlled test on AMASS motions shows that AtomicMotion produces joint angles or coordination patterns that violate known physiological limits more often than a single-model baseline, or if removing the five-cluster split causes no drop in fidelity, the central claim would fail.
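
The first half of that test can be sketched as a violation-rate comparison: count how often predicted joint angles fall outside an allowed range, then compare methods. The limits and angle values below are placeholders for illustration only. They come neither from the paper nor from a biomechanics source, and the joint names are hypothetical.

```python
def violation_rate(angle_frames, limits):
    """Fraction of (frame, joint) angle samples that fall outside their [low, high] limits.

    angle_frames: list of frames, each a dict joint_name -> angle in degrees.
    limits: dict joint_name -> (low, high) in degrees.
    """
    total = 0
    violations = 0
    for frame in angle_frames:
        for joint, angle in frame.items():
            low, high = limits[joint]
            total += 1
            if angle < low or angle > high:
                violations += 1
    return violations / total if total else 0.0


# Placeholder limits and toy predictions, for illustration only.
LIMITS = {"knee_flexion": (0.0, 150.0), "elbow_flexion": (0.0, 145.0)}
baseline = [
    {"knee_flexion": 10.0, "elbow_flexion": 160.0},
    {"knee_flexion": -5.0, "elbow_flexion": 30.0},
]
candidate = [
    {"knee_flexion": 12.0, "elbow_flexion": 140.0},
    {"knee_flexion": -2.0, "elbow_flexion": 35.0},
]
baseline_rate = violation_rate(baseline, LIMITS)
candidate_rate = violation_rate(candidate, LIMITS)
```

A real test would use pose-dependent limits and many sequences. The point of the sketch is only that the claim reduces to a rate that can be compared across methods.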

Figures

Figures reproduced from arXiv: 2605.22631 by Chuhua Xian, Fa-Ting Hong, Runzhen Liu.

**Figure 2.** Illustration of our method framework. Masked Pose Modeling Guidance helps the model [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Illustration of Global Synchronized Modulation and Temporal-Kinematic Block. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Illustration of dual-branch Kinematic Attention. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** Visual comparisons on the AMASS-P1 in the standard mode. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Visual comparisons on the AMASS-P1 in the hand tracking mode. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png]

**Figure 7.** Visual comparison of ablation studies on the AMASS-P1. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]

Original abstract

Accurately reconstructing full-body poses from sparse head and hand trajectories is a foundational challenge for immersive AR/VR telepresence. Current methods often struggle with error accumulation and unnatural joint coordination, primarily because they treat the human body as a monolithic entity, thereby failing to capture the fine-grained "atomic intents" embedded in subtle signal variations and overlooking the inherent structural topology. To bridge this gap, we present AtomicMotion, a framework designed to decouple and re-integrate body dynamics through three core innovations. First, we introduce a logical body partitioning scheme that decomposes the skeleton into five distinct clusters based on functional intent; this ensures that each partition preserves internal joint synergies while isolating local motion primitives. Second, to robustly map sparse inputs to high-dimensional poses, we employ a masked full-body pre-conditioning strategy during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Finally, addressing the limitations of vanilla spatial attention, which often ignores fixed physiological connectivity, we propose Kinematic Attention. By embedding the classical kinematic tree structure into the attention mechanism, we ensure biological plausibility in the synthesized motions. Extensive evaluations on the AMASS dataset demonstrate that AtomicMotion significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AtomicMotion combines a five-cluster functional split of the body with kinematic attention to improve sparse pose reconstruction, but the abstract gives no numbers, so the claimed gains cannot be checked from it.

AtomicMotion tries to reconstruct full-body poses from just head and hand trajectories by splitting the skeleton into five clusters based on functional intent, then using masked full-body pre-conditioning and kinematic attention to glue things back together. The pitch is that treating the body as one piece causes error buildup and bad coordination, so breaking it down should capture local motion primitives better while the other pieces keep global consistency and biological plausibility.

Referee Report

2 major / 2 minor

Summary. The manuscript presents AtomicMotion, a framework for reconstructing full-body human poses from sparse head and hand trajectories. It introduces three innovations: a logical partitioning of the skeleton into five clusters based on functional intent to isolate local motion primitives while preserving internal synergies; a masked full-body pre-conditioning strategy during training to enforce global skeletal topology; and Kinematic Attention that embeds the classical kinematic tree into the attention mechanism for biological plausibility. The authors claim that these yield significantly higher reconstruction fidelity and biomechanical realism than existing baselines on the AMASS dataset.

Significance. If the central claims are substantiated with quantitative evidence, the work could advance AR/VR telepresence by addressing error accumulation and unnatural coordination through explicit incorporation of functional partitioning and kinematic constraints. The Kinematic Attention mechanism represents a concrete attempt to inject domain structure into attention, which is a potential strength. The manuscript does not report machine-checked proofs, reproducible code, or parameter-free derivations.

major comments (2)

[Abstract] Abstract: The claim that AtomicMotion 'significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism' is stated without any quantitative metrics, ablation results, error bars, comparison tables, or figures. This absence is load-bearing because it prevents any evaluation of the central empirical claim.
[Method overview / logical body partitioning] Logical body partitioning (described in the abstract and method overview): The five-cluster decomposition based on functional intent is introduced without derivation, optimality argument, or empirical test demonstrating that cross-cluster synergies (e.g., coordinated arm-torso rotation or hand-to-head reaching) are preserved. Because masked pre-conditioning and Kinematic Attention are applied only after this fixed partition, any loss of dependencies at this step directly undermines the biomechanical-realism claim.

minor comments (2)

[Abstract] The term 'atomic intents' is used without a precise definition or link to the subsequent clustering.
[Experiments] Ensure that all baseline methods referenced in the (unseen) experimental section are accompanied by explicit citations and implementation details for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the comments identify areas needing greater substantiation or clarity, we have revised the manuscript accordingly.

Referee: [Abstract] Abstract: The claim that AtomicMotion 'significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism' is stated without any quantitative metrics, ablation results, error bars, comparison tables, or figures. This absence is load-bearing because it prevents any evaluation of the central empirical claim.

Authors: We agree that the abstract would benefit from including concrete quantitative support for the performance claims. The main body of the manuscript already contains detailed comparisons, ablation studies, error bars, and tables on the AMASS dataset. In the revised version we have updated the abstract to report key metrics (e.g., MPJPE reductions relative to baselines) and to explicitly reference the supporting tables and figures. This change makes the central empirical claims directly evaluable from the abstract.

Revision: yes

Referee: [Method overview / logical body partitioning] Logical body partitioning (described in the abstract and method overview): The five-cluster decomposition based on functional intent is introduced without derivation, optimality argument, or empirical test demonstrating that cross-cluster synergies (e.g., coordinated arm-torso rotation or hand-to-head reaching) are preserved. Because masked pre-conditioning and Kinematic Attention are applied only after this fixed partition, any loss of dependencies at this step directly undermines the biomechanical-realism claim.

Authors: The partitioning into five clusters is motivated by standard functional and anatomical groupings used in biomechanics and animation literature, chosen to isolate local motion primitives while the subsequent global mechanisms (masked full-body pre-conditioning and Kinematic Attention operating over the full kinematic tree) are intended to restore cross-cluster coordination. We acknowledge that the original text provided limited explicit justification and validation for this choice. In the revision we have added a dedicated paragraph deriving the clusters from functional intent, citing relevant biomechanical references, and presenting an ablation study that compares our partitioning against random and alternative clusterings on metrics of coordinated motion (e.g., arm-torso and hand-head correlations). These additions demonstrate that the chosen partition, combined with the global components, preserves the cited synergies.

Revision: yes
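
The two kinds of metric named in these responses, MPJPE and a correlation between coordinated body parts, can be sketched as follows. MPJPE is the standard mean per-joint position error. Using a Pearson correlation between two angle series as the "coordinated motion" metric is an assumption made here for illustration, since the rebuttal does not define it. All data values and variable names are toy placeholders.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over frames and joints."""
    distances = [
        math.dist(p, t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(distances) / len(distances)


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: 2 frames, 2 joints, 3D positions in arbitrary units.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)], [(0.0, 1.0, 0.0), (1.0, 1.0, 2.0)]]
error = mpjpe(pred, target)

# Toy per-frame yaw angles for an arm and the torso (hypothetical series).
arm_yaw = [0.0, 10.0, 20.0, 30.0]
torso_yaw = [1.0, 6.0, 11.0, 16.0]
coordination = pearson(arm_yaw, torso_yaw)
```

An ablation of the kind the authors describe would compare such a correlation between ground truth and predictions for each candidate partitioning, alongside the MPJPE.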

Circularity Check

0 steps flagged

No significant circularity in the claimed derivation chain.

The paper introduces AtomicMotion via three explicit design choices—a logical partitioning of the skeleton into five functional clusters, masked full-body pre-conditioning, and Kinematic Attention—each presented as an innovation to address stated limitations of monolithic modeling. Performance claims rest on empirical results on the AMASS dataset rather than any derivation that reduces the reported gains in fidelity or biomechanical realism to a fitted parameter or self-referential definition. No equations appear that equate a prediction to its own input by construction, and no load-bearing premise is justified solely by a self-citation whose content is itself unverified. The central assertions therefore remain independent of the target outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that functional clustering of joints is meaningful and on the introduction of a new attention variant whose benefits are asserted rather than derived from first principles.

axioms (1)

Domain assumption: The human skeleton can be decomposed into five distinct clusters based on functional intent that preserve internal joint synergies while isolating local motion primitives.
Source: presented as the first core innovation in the abstract.

invented entities (1)

Kinematic Attention (no independent evidence)
Purpose: embedding the classical kinematic tree structure into the attention mechanism to enforce biological plausibility.
Source: introduced to address limitations of vanilla spatial attention.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forcing)

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "we introduce a principled body partitioning scheme that decomposes the skeleton into five functional clusters... Atomic Intent Partition (AIP) scheme, which segments the skeleton into five functional clusters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1] Jiaxi Jiang, Paul Streli, Huajian Qiu, Andreas Fender, Larissa Laich, Patrick Snape, and Christian Holz. Avatarposer: Articulated full-body pose tracking from sparse motion sensing. In ECCV, 2022.

[2] Jiaxi Jiang, Paul Streli, Manuel Meier, and Christian Holz. Egoposer: Robust real-time egocentric pose estimation from sparse and intermittent observations everywhere. In ECCV, 2024.

[3] Xiaozheng Zheng, Zhuo Su, Chao Wen, Zhou Xue, and Xiaojie Jin. Realistic full-body tracking from sparse observations via joint-level modeling. In ICCV, 2023.

[4] Han Feng, Wenchao Ma, Quankai Gao, Xianwei Zheng, Nan Xue, and Huijuan Xu. Stratified avatar generation from sparse observations. In CVPR, 2024.

[5] David A Winter. Biomechanics and motor control of human movement. John Wiley & Sons, 2009.

[6] Roger M Enoka. Neuromechanics of human movement. Human Kinetics, 2008.

[7] Vladimir M Zatsiorsky. Kinetics of human motion. Human Kinetics, 2002.

[8] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1110–1118, 2015.

[9] Linjiang Huang, Yan Huang, Wanli Ouyang, and Liang Wang. Part-level graph convolutional network for skeleton-based action recognition. In AAAI, 2020.

[10] Maosen Li, Siheng Chen, Zijing Zhang, Lingxi Xie, Qi Tian, and Ya Zhang. Skeleton-parted graph scattering networks for 3d human motion prediction. In ECCV, 2022.

[11] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In ICCV, 2019.

[12] Andrew Goldenberg, Beno Benhabib, and Robert Fenton. A complete generalized solution to the inverse kinematics of robots. IEEE Journal on Robotics and Automation, 2003.

[13] RootMotion. Final IK, 2018. URL https://assetstore.unity.com/packages/tools/animation/final-ik-14290

[14] Mathias Parger, Joerg H Mueller, Dieter Schmalstieg, and Markus Steinberger. Human upper-body inverse kinematics for increased embodiment in consumer-grade virtual reality. In Proceedings of the 24th ACM symposium on virtual reality software and technology, 2018.

[15] Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J Black, Otmar Hilliges, and Gerard Pons-Moll. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. TOG, 2018.

[16] Yifeng Jiang, Yuting Ye, Deepak Gopinath, Jungdam Won, Alexander W Winkler, and C Karen Liu. Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In SIGGRAPH Asia, 2022.

[17] Timo von Marcard, Bodo Rosenhahn, Michael J Black, and Gerard Pons-Moll. Sparse inertial poser: Automatic 3d human pose estimation from sparse imus. In Computer Graphics Forum. Wiley Online Library, 2017.

[18] Xinyu Yi, Yuxiao Zhou, and Feng Xu. Transpose: Real-time 3d human translation and pose estimation with six inertial sensors. TOG, 2021.

[19] Xinyu Yi, Yuxiao Zhou, Marc Habermann, Soshi Shimada, Vladislav Golyanik, Christian Theobalt, and Feng Xu. Physical inertial poser (pip): Physics-aware real-time human motion tracking from sparse inertial sensors. In CVPR, 2022.

[20] Karan Ahuja, Eyal Ofek, Mar Gonzalez-Franco, Christian Holz, and Andrew D Wilson. Coolmoves: User motion accentuation in virtual reality. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2021.

[21] Dongseok Yang, Doyeon Kim, and Sung-Hee Lee. Lobstr: Real-time lower-body pose prediction from sparse upper-body tracking signals. In Computer Graphics Forum, 2021.

[22] Andrea Dittadi, Sebastian Dziadzio, Darren Cosker, Ben Lundell, Thomas J Cashman, and Jamie Shotton. Full-body motion from a single head-mounted device: Generating smpl poses from partial observations. In ICCV, 2021.

[23] Sadegh Aliakbarian, Pashmina Cameron, Federica Bogo, Andrew Fitzgibbon, and Thomas J Cashman. Flag: Flow-based 3d avatar generation from sparse observations. In CVPR, 2022.

[24] German Barquero, Nadine Bertsch, Manojkumar Marramreddy, Carlos Chacón, Filippo Arcadu, Ferran Rigual, Nicky Sijia He, Cristina Palmero, Sergio Escalera, Yuting Ye, et al. From sparse signal to smooth motion: Real-time motion generation with rolling prediction models. In CVPR, 2025.

[25] Peng Dai, Yang Zhang, Tao Liu, Zhen Fan, Tianyuan Du, Zhuo Su, Xiaozheng Zheng, and Zeming Li. Hmd-poser: On-device real-time human motion tracking from scalable sparse observations. In CVPR, 2024.

[26] Alexander Winkler, Jungdam Won, and Yuting Ye. Questsim: Human motion tracking from sparse sensors with simulated avatars. In SIGGRAPH Asia, 2022.

[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

[28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.

[29] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.

[30] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In CVPR, 2019.

[31] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS, 2022.

[32] Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. Motionbert: A unified perspective on learning human motion representations. In ICCV, 2023.

[33] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

[34] Yuming Du, Robin Kips, Albert Pumarola, Sebastian Starke, Ali Thabet, and Artsiom Sanakoyeu. Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model. In CVPR, 2023.

[35] Carnegie Mellon University. CMU MoCap Dataset. URL http://mocap.cs.cmu.edu

[36] Nikolaus F Troje. Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision, 2(5):2–2, 2002.

[37] M Müller, T Röder, M Clausen, B Eberhardt, B Krüger, and A Weber. Documentation mocap database hdm05. Technical Report CG-2007-2, Universität Bonn, Germany, 2007.

[38] Advanced Computing Center for the Arts and Design. ACCAD MoCap Dataset. URL https://accad.osu.edu/research/motion-lab/mocap-system-and-data

[39] Saeed Ghorbani, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F. Troje. MoVi: A large multipurpose motion and video dataset. arXiv preprint arXiv:2003.01888, 2020.

[40] Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus Vahrenkamp, and Tamim Asfour. The KIT whole-body human motion database. In ICAR, 2015.

[41] Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus Vahrenkamp, and Tamim Asfour. Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE Transactions on Robotics, 32(4):796–809, 2016.

[42] Franziska Krebs, Andre Meixner, Isabel Patzer, and Tamim Asfour. The KIT bimanual manipulation dataset. In IEEE/RAS International Conference on Humanoid Robots (Humanoids), 2021.

[43] Eyes JAPAN Co. Ltd. Eyes Japan MoCap Dataset. URL http://mocapdata.com

[44] Matthew M. Loper, Naureen Mahmood, and Michael J. Black. MoSh: Motion and shape capture from sparse markers. TOG, 2014.

[45] Ijaz Akhter and Michael J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR, 2015.

[46] Simon Fraser University and National University of Singapore. SFU Motion Capture Database. URL http://mocap.cs.sfu.ca/

[47] Matt Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John Collomosse. Total Capture: 3d human pose estimation fusing video and inertial sensors. In BMVC, 2017.

[48] Leonid Sigal, Alexandru O Balan, and Michael J Black. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1):4–27, 2010.

Appendix A, Implementation Details (extracted fragment): The model stacks 6 TK-Blocks with an embedding dimension of 256 and 8 attention he...

[1] [1]

Avatarposer: Articulated full-body pose tracking from sparse motion sensing

Jiaxi Jiang, Paul Streli, Huajian Qiu, Andreas Fender, Larissa Laich, Patrick Snape, and Christian Holz. Avatarposer: Articulated full-body pose tracking from sparse motion sensing. InECCV, 2022

work page 2022

[2] [2]

Egoposer: Robust real-time egocentric pose estimation from sparse and intermittent observations everywhere

Jiaxi Jiang, Paul Streli, Manuel Meier, and Christian Holz. Egoposer: Robust real-time egocentric pose estimation from sparse and intermittent observations everywhere. InECCV, 2024

work page 2024

[3] [3]

Realistic full-body tracking from sparse observations via joint-level modeling

Xiaozheng Zheng, Zhuo Su, Chao Wen, Zhou Xue, and Xiaojie Jin. Realistic full-body tracking from sparse observations via joint-level modeling. InICCV, 2023. 9 GT Ours w/o MPMG w/o Intent Branch w/o Structual Branch Seq1Seq2Seq3Seq4 Figure 7: Visual comparison of ablation studies on the AMASS-P1

work page 2023

[4] [4]

Stratified avatar generation from sparse observations

Han Feng, Wenchao Ma, Quankai Gao, Xianwei Zheng, Nan Xue, and Huijuan Xu. Stratified avatar generation from sparse observations. InCVPR, 2024

work page 2024

[5] [5]

John wiley & sons, 2009

David A Winter.Biomechanics and motor control of human movement. John wiley & sons, 2009

work page 2009

[6] [6]

Human kinetics, 2008

Roger M Enoka.Neuromechanics of human movement. Human kinetics, 2008

work page 2008

[7] [7]

Human kinetics, 2002

Vladimir M Zatsiorsky.Kinetics of human motion. Human kinetics, 2002

work page 2002

[8] [8]

Hierarchical recurrent neural network for skeleton based action recognition

Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1110–1118, 2015

work page 2015

[9] [9]

Part-level graph convolutional network for skeleton-based action recognition

Linjiang Huang, Yan Huang, Wanli Ouyang, and Liang Wang. Part-level graph convolutional network for skeleton-based action recognition. InAAAI, 2020

work page 2020

[10] [10]

Skeleton-parted graph scattering networks for 3d human motion prediction

Maosen Li, Siheng Chen, Zijing Zhang, Lingxi Xie, Qi Tian, and Ya Zhang. Skeleton-parted graph scattering networks for 3d human motion prediction. InECCV, 2022

work page 2022

[11] [11]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. InICCV, 2019

work page 2019

[12] [12]

A complete generalized solution to the inverse kinematics of robots.IEEE Journal on Robotics and Automation, 2003

Andrew Goldenberg, Beno Benhabib, and Robert Fenton. A complete generalized solution to the inverse kinematics of robots.IEEE Journal on Robotics and Automation, 2003

work page 2003

[13] RootMotion Final IK, 2018. URL https://assetstore.unity.com/packages/tools/animation/final-ik-14290

[14] Mathias Parger, Joerg H Mueller, Dieter Schmalstieg, and Markus Steinberger. Human upper-body inverse kinematics for increased embodiment in consumer-grade virtual reality. In Proceedings of the 24th ACM symposium on virtual reality software and technology, 2018

[15] Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J Black, Otmar Hilliges, and Gerard Pons-Moll. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. TOG, 2018

[16] Yifeng Jiang, Yuting Ye, Deepak Gopinath, Jungdam Won, Alexander W Winkler, and C Karen Liu. Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In SIGGRAPH Asia, 2022

[17] Timo von Marcard, Bodo Rosenhahn, Michael J Black, and Gerard Pons-Moll. Sparse inertial poser: Automatic 3d human pose estimation from sparse imus. In Computer graphics forum. Wiley Online Library, 2017

[18] Xinyu Yi, Yuxiao Zhou, and Feng Xu. Transpose: Real-time 3d human translation and pose estimation with six inertial sensors. TOG, 2021

[19] Xinyu Yi, Yuxiao Zhou, Marc Habermann, Soshi Shimada, Vladislav Golyanik, Christian Theobalt, and Feng Xu. Physical inertial poser (pip): Physics-aware real-time human motion tracking from sparse inertial sensors. In CVPR, 2022

[20] Karan Ahuja, Eyal Ofek, Mar Gonzalez-Franco, Christian Holz, and Andrew D Wilson. Coolmoves: User motion accentuation in virtual reality. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2021

[21] Dongseok Yang, Doyeon Kim, and Sung-Hee Lee. Lobstr: Real-time lower-body pose prediction from sparse upper-body tracking signals. In Computer Graphics Forum, 2021

[22] Andrea Dittadi, Sebastian Dziadzio, Darren Cosker, Ben Lundell, Thomas J Cashman, and Jamie Shotton. Full-body motion from a single head-mounted device: Generating smpl poses from partial observations. In ICCV, 2021

[23] Sadegh Aliakbarian, Pashmina Cameron, Federica Bogo, Andrew Fitzgibbon, and Thomas J Cashman. Flag: Flow-based 3d avatar generation from sparse observations. In CVPR, 2022

[24] German Barquero, Nadine Bertsch, Manojkumar Marramreddy, Carlos Chacón, Filippo Arcadu, Ferran Rigual, Nicky Sijia He, Cristina Palmero, Sergio Escalera, Yuting Ye, et al. From sparse signal to smooth motion: Real-time motion generation with rolling prediction models. In CVPR, 2025

[25] Peng Dai, Yang Zhang, Tao Liu, Zhen Fan, Tianyuan Du, Zhuo Su, Xiaozheng Zheng, and Zeming Li. Hmd-poser: On-device real-time human motion tracking from scalable sparse observations. In CVPR, 2024

[26] Alexander Winkler, Jungdam Won, and Yuting Ye. Questsim: Human motion tracking from sparse sensors with simulated avatars. In SIGGRAPH Asia, 2022

[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017

[28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020

[29] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023

[30] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In CVPR, 2019

[31] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS, 2022

[32] Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. Motionbert: A unified perspective on learning human motion representations. In ICCV, 2023

[33] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008

[34] Yuming Du, Robin Kips, Albert Pumarola, Sebastian Starke, Ali Thabet, and Artsiom Sanakoyeu. Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model. In CVPR, 2023

[35] Carnegie Mellon University. CMU MoCap Dataset. URL http://mocap.cs.cmu.edu

[36] Nikolaus F Troje. Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of vision, 2(5):2–2, 2002

[37] M Müller, T Röder, M Clausen, B Eberhardt, B Krüger, and A Weber. Documentation mocap database HDM05. Technical Report CG-2007-2, Universität Bonn, Germany, 2007

[38] Advanced Computing Center for the Arts and Design. ACCAD MoCap Dataset. URL https://accad.osu.edu/research/motion-lab/mocap-system-and-data

[39] Saeed Ghorbani, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F. Troje. MoVi: A large multipurpose motion and video dataset. arXiv preprint arXiv:2003.01888, 2020

[40] Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus Vahrenkamp, and Tamim Asfour. The KIT whole-body human motion database. In ICAR, 2015

[41] Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus Vahrenkamp, and Tamim Asfour. Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE Transactions on Robotics, 32(4):796–809, 2016

[42] Franziska Krebs, Andre Meixner, Isabel Patzer, and Tamim Asfour. The KIT bimanual manipulation dataset. In IEEE/RAS International Conference on Humanoid Robots (Humanoids), 2021

[43] Eyes JAPAN Co. Ltd. Eyes Japan MoCap Dataset. URL http://mocapdata.com

[44] Matthew M. Loper, Naureen Mahmood, and Michael J. Black. MoSh: Motion and shape capture from sparse markers. TOG, 2014

[45] Ijaz Akhter and Michael J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR, 2015

[46] Simon Fraser University and National University of Singapore. SFU Motion Capture Database. URL http://mocap.cs.sfu.ca/

[47] Matt Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John Collomosse. Total Capture: 3d human pose estimation fusing video and inertial sensors. In BMVC, 2017

[48] Leonid Sigal, Alexandru O Balan, and Michael J Black. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International journal of computer vision, 87(1):4–27, 2010

Appendices

A Implementation Details

The model stacks 6 TK-Blocks with an embedding dimension of 256 and 8 attention he...