MotionMAR: Multi-scale Auto-Regressive Human Motion Reconstruction from Sparse Observations

Chenglu Wen; Cheng Wang; Junsheng Zhang; Lan Xu; Mengyin Liu; Ming Yan; Siqi Shen; Xincheng Lin; Yuhua Luo; Zhudi Chen

arxiv: 2606.23000 · v1 · pith:YN7PWKSRnew · submitted 2026-06-22 · 💻 cs.CV

Pith reviewed 2026-06-26 09:25 UTC · model grok-4.3

keywords human motion reconstruction · sparse observations · autoregressive prediction · VQ-VAE · multi-scale tokenization · motion capture · temporal hierarchy

The pith

A multi-scale autoregressive model in tokenized latent space reconstructs full human motion from sparse observations by building global structure first then adding details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that human motion reconstruction from limited sensor inputs becomes reliable when the process follows the motion's built-in progression from broad low-frequency trajectories to high-frequency details. It implements this through a VQ-VAE that tokenizes motion at several temporal resolutions, an autoregressive network that predicts indices scale by scale, a control module that conditions on the sparse data, and a final refinement step. A sympathetic reader would care because many practical settings supply only a handful of tracking points yet still require complete, usable body poses for animation or interaction. If the hierarchy holds and the staged prediction works, reconstruction quality improves without needing denser or more expensive observations.

Core claim

MotionMAR is a coarse-to-fine framework whose four components—Temporal Multi-scale Tokenization VQ-VAE, Motion Autoregressive Network, Scale-Aware Control module, and Motion Refinement Network—jointly encode motion at multiple temporal resolutions, predict latent indices from coarse global structure to fine details, condition the output on sparse observations, and remove quantization artifacts to reach state-of-the-art accuracy on the AMASS dataset.

What carries the argument

The Motion Autoregressive Network, which predicts motion indices level by level in the multi-scale VQ-VAE latent space, first fixing coarse global indices then generating finer indices for temporal details.
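The level-by-level prediction described here can be pictured with a minimal sketch. It is not the paper's network: `SCALES`, `upsample`, and `toy_predictor` are hypothetical names, and a fixed arithmetic rule stands in for the learned model. The only point is the control flow, in which each scale is predicted after, and conditioned on, all coarser scales.

```python
from typing import Callable, List, Sequence

# Hypothetical schedule: number of tokens per temporal scale, coarse to fine.
SCALES = (1, 2, 4, 8)


def upsample(tokens: Sequence[int], length: int) -> List[int]:
    """Nearest-neighbour upsampling of a token sequence to `length` entries."""
    n = len(tokens)
    return [tokens[i * n // length] for i in range(length)]


def generate_coarse_to_fine(
    predict_scale: Callable[[List[List[int]], int], List[int]],
    scales: Sequence[int] = SCALES,
) -> List[List[int]]:
    """Predict token indices one scale at a time, conditioning on coarser scales."""
    history: List[List[int]] = []
    for length in scales:
        tokens = predict_scale(history, length)  # sees every coarser scale
        if len(tokens) != length:
            raise ValueError("predictor returned the wrong number of tokens")
        history.append(tokens)
    return history


def toy_predictor(history: List[List[int]], length: int) -> List[int]:
    """Stand-in for a learned network (assumption, not the paper's model)."""
    if not history:
        return [3] * length  # arbitrary coarse index for the global structure
    base = upsample(history[-1], length)  # start from the coarser prediction
    return [(b * 2 + (i % 2)) % 16 for i, b in enumerate(base)]


pyramid = generate_coarse_to_fine(toy_predictor)
for level, tokens in enumerate(pyramid, start=1):
    print(f"scale {level}: {tokens}")
```

In a real system the predictor would also receive the sparse observations; that conditioning is omitted here to keep the loop readable.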

If this is right

Sparse tracking data is integrated at each scale through the Scale-Aware Control module so that generated motion stays consistent with the actual observations.
Semantic content is isolated from minor jitters because the VQ-VAE operates at multiple temporal resolutions.
Consecutive poses become smooth and quantization errors are removed by the final Motion Refinement Network.
Overall accuracy reaches state-of-the-art levels on the AMASS benchmark.
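The third point, that a refinement step removes quantization artifacts, can be illustrated with a toy one-dimensional example. This is only a sketch under loud assumptions: a uniform grid stands in for a learned codebook, a three-frame moving average stands in for the Motion Refinement Network, and `quantize`, `smooth`, and `jitter` are hypothetical helper names.

```python
from typing import List, Sequence


def quantize(signal: Sequence[float], step: float) -> List[float]:
    """Snap each sample to a uniform grid (stand-in for codebook lookup)."""
    return [round(x / step) * step for x in signal]


def smooth(signal: Sequence[float]) -> List[float]:
    """Three-frame moving average over interior frames (two frames shorter)."""
    return [
        (signal[i - 1] + signal[i] + signal[i + 1]) / 3
        for i in range(1, len(signal) - 1)
    ]


def jitter(signal: Sequence[float]) -> float:
    """Mean absolute second difference, a simple frame-to-frame jitter proxy."""
    second = [
        signal[i + 1] - 2 * signal[i] + signal[i - 1]
        for i in range(1, len(signal) - 1)
    ]
    return sum(abs(d) for d in second) / len(second)


smooth_motion = [t / 10 for t in range(20)]    # a slow, smooth trajectory
quantized = quantize(smooth_motion, step=0.25)  # staircase artifacts appear
refined = smooth(quantized)                     # artifacts are damped

print(jitter(smooth_motion), jitter(quantized), jitter(refined))
```

The staircase produced by quantization has a much larger jitter than the original ramp, and the smoothing pass reduces it again, which is the qualitative effect the refinement stage is said to provide.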

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same staged prediction could be continued autoregressively to forecast future frames beyond the observed window.
Sensor layouts might be optimized by placing devices to capture the dominant scales identified by the tokenization process.
If the low-to-high frequency separation proves stable, the architecture could transfer to reconstructing trajectories of non-human articulated systems such as robots or animals.

Load-bearing premise

Human motion has a clean temporal hierarchical structure that can be separated into low-frequency global trajectories and high-frequency details and modeled by multi-level autoregressive prediction inside a VQ-VAE latent space.
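To make the premise concrete, the sketch below splits a toy one-dimensional signal into per-scale components, coarse to fine, so that summing more components recovers more detail. It is an assumption-laden illustration rather than the paper's tokenizer: average pooling replaces learned encoding, no quantization is applied, and `avg_pool`, `upsample`, `residual_pyramid`, and `mse` are hypothetical names.

```python
from typing import List, Sequence


def avg_pool(signal: Sequence[float], length: int) -> List[float]:
    """Average-pool to `length` samples; len(signal) must be divisible by it."""
    step = len(signal) // length
    return [sum(signal[i * step:(i + 1) * step]) / step for i in range(length)]


def upsample(signal: Sequence[float], length: int) -> List[float]:
    """Nearest-neighbour upsampling back to the full temporal resolution."""
    n = len(signal)
    return [signal[i * n // length] for i in range(length)]


def residual_pyramid(signal: Sequence[float], scales: Sequence[int]) -> List[List[float]]:
    """Return one full-length component per scale; each encodes what coarser scales missed."""
    residual = list(signal)
    components = []
    for length in scales:
        coarse = upsample(avg_pool(residual, length), len(signal))
        components.append(coarse)
        residual = [r - c for r, c in zip(residual, coarse)]
    return components


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Slow ramp (global trajectory) plus alternating jitter (high-frequency detail).
signal = [0.5 * t + (0.1 if t % 2 else -0.1) for t in range(8)]
components = residual_pyramid(signal, scales=(1, 2, 4, 8))

errors = []
partial = [0.0] * len(signal)
for comp in components:
    partial = [p + c for p, c in zip(partial, comp)]
    errors.append(mse(partial, signal))  # error after adding this scale

print(errors)
```

Because the finest scale matches the signal length, the components sum back to the input up to floating-point rounding, and the error shrinks as each finer scale is added. Whether real motion separates this cleanly is exactly what the premise assumes.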

What would settle it

Ablating the multi-scale tokenization and level-by-level autoregressive prediction on the AMASS dataset and measuring whether reconstruction error rises above that of a single-scale autoregressive baseline.

Figures

Figures reproduced from arXiv: 2606.23000 by Chenglu Wen, Cheng Wang, Junsheng Zhang, Lan Xu, Mengyin Liu, Ming Yan, Siqi Shen, Xincheng Lin, Yuhua Luo, Zhudi Chen.

**Figure 1.** Visualization of MotionMAR’s coarse-to-fine generation strategy. Rather than creating the motion in a single pass, the framework builds it sequentially across different time resolutions. The process starts by predicting a broad trajectory (Scale 1) to map out the general movement envelope. Following this initial outline, the model progressively adds mid-level dynamics (Scale 2) and high-frequency details (…

**Figure 2.** Overview of the Multi-scale Motion Autoregressive Network (MotionMAR). It consists of four core components: a Temporal Multi-scale VQ-VAE, a Scale-aware Control Module, a Motion Autoregressive Network, and a Refinement Network in the final stage.

**Figure 3.** Visual comparison of MotionMAR against baseline methods for Human Motion Reconstruction under setting S1.

**Figure 4.** Visualization results on real data. Blue indicates the predicted results, while white represents the ground truth.

**Figure 5.** Visualization comparison for the Temporal Multi-scale Tokenization. The darker the red color, the greater the deviation between the predicted result and the ground truth.

**Figure 6.** Visualization of representative failure cases under setting S1.

**Figure 3.** We developed a new MotionMAR variant, MotionMAR(Spatial). It replaces the temporal strategy with a spatial hierarchy that generates the skeleton from core to …

Original abstract

Human motion follows a temporal hierarchical structure, transitioning from low-frequency global trajectories to high-frequency details. Inspired by the success of multi-level autoregressive models in computer vision, we propose MotionMAR, a coarse-to-fine framework for motion reconstruction from sparse observations. It first estimates the global trajectory of human motion and then gradually refines the temporal details. This architecture consists of four integrated components. The Temporal Multi-scale Tokenization (TMT) VQ-VAE encodes the data at multiple temporal resolutions, separating semantic motion from minor jitters. The Motion Autoregressive Network (MAN) operates in this latent space, predicting motion across scales. It first establishes the global structure through coarse indices and then generates finer indices to recover specific details. Meanwhile, the Scale-Aware Control (SAC) module integrates sparse tracking data to ensure the generated output aligns with actual observations. The Motion Refinement Network (MRN) subsequently smooths consecutive poses and eliminates quantization artifacts. Experiments show that MotionMAR achieves state-of-the-art accuracy on the AMASS dataset, providing a reliable and structure-aware approach for motion reconstruction. The source code is publicly available at http://www.lidarhumanmotion.net/motionmar/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MotionMAR puts together a multi-scale VQ-VAE plus autoregressive prediction for sparse-to-dense motion reconstruction, but the SOTA claim on AMASS cannot be checked from the abstract alone.

The paper's main contribution is a coarse-to-fine pipeline that first recovers global motion trajectory then adds finer details. It does this with four pieces: a Temporal Multi-scale Tokenization VQ-VAE that encodes at different resolutions, a Motion Autoregressive Network that predicts indices from coarse to fine in latent space, a Scale-Aware Control module that folds in the sparse observations, and a Motion Refinement Network that cleans up the output.

This integration is new for the sparse-observation setting even if the multi-level autoregressive idea itself comes from earlier vision work. The structure makes sense for motion data, where low-frequency paths and high-frequency details are often separable, and the conditioning step on actual sensor inputs is a practical addition.

The code release helps. Anyone can run the model and see how the pieces fit together.

The soft spot is the missing evidence. The abstract claims state-of-the-art accuracy on AMASS but reports no error values, no baseline comparisons, and no ablation results. Without those numbers it is impossible to judge whether the gains are substantial or marginal. The hierarchical-frequency assumption is standard and not obviously wrong, but it would be stronger if the paper tested it against motions that break the pattern.

This work is aimed at people building animation tools or sensor-light tracking systems. A reader already working on autoregressive sequence models or motion capture would get the most out of the architecture details.

It should go to peer review. The method is described clearly enough that referees can evaluate the experiments and ask for the missing numbers.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes MotionMAR, a coarse-to-fine framework for human motion reconstruction from sparse observations. It encodes motion via a Temporal Multi-scale Tokenization (TMT) VQ-VAE that separates semantic content from jitter at multiple temporal resolutions, uses a Motion Autoregressive Network (MAN) to predict coarse-to-fine indices in the latent space, incorporates sparse observations through a Scale-Aware Control (SAC) module, and applies a Motion Refinement Network (MRN) to remove quantization artifacts and ensure smoothness. The central claim is that this architecture achieves state-of-the-art accuracy on the AMASS dataset by exploiting the temporal hierarchical structure of human motion.

Significance. If the performance claims are substantiated, the work contributes a structure-aware autoregressive pipeline that explicitly models global trajectories before local details in a VQ-VAE latent space. Public release of the source code is a positive factor for reproducibility and follow-up work in sparse motion capture.

major comments (1)

[Abstract] The assertion that MotionMAR 'achieves state-of-the-art accuracy on the AMASS dataset' is presented without any quantitative metrics (e.g., MPJPE, acceleration error), baseline comparisons, error bars, or ablation results. This absence prevents verification of the central empirical claim.
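For readers unfamiliar with the two metrics named in this comment, the following sketch shows one common way to compute them. It is a hedged illustration only: the paper's exact definitions (units, root alignment, frame-rate scaling) are not given in this review, and `mpjpe`, `accelerations`, and `accel_error` are hypothetical helper names. MPJPE is taken here as the mean Euclidean distance between predicted and ground-truth joints, and acceleration error as the same distance applied to second finite differences of the joint positions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Motion = Sequence[Sequence[Point]]  # indexed as motion[frame][joint]


def mpjpe(pred: Motion, gt: Motion) -> float:
    """Mean per-joint position error: average Euclidean distance over frames and joints."""
    dists = [
        math.dist(p, g)
        for frame_p, frame_g in zip(pred, gt)
        for p, g in zip(frame_p, frame_g)
    ]
    return sum(dists) / len(dists)


def accelerations(motion: Motion) -> List[List[Point]]:
    """Second finite difference of joint positions (per-frame units, no fps scaling)."""
    acc = []
    for t in range(1, len(motion) - 1):
        frame = []
        for j in range(len(motion[t])):
            prev, cur, nxt = motion[t - 1][j], motion[t][j], motion[t + 1][j]
            frame.append(tuple(n - 2 * c + p for p, c, n in zip(prev, cur, nxt)))
        acc.append(frame)
    return acc


def accel_error(pred: Motion, gt: Motion) -> float:
    """Mean distance between predicted and ground-truth joint accelerations."""
    return mpjpe(accelerations(pred), accelerations(gt))


# Toy data: one joint moving along x; the prediction is offset by 0.1 in y.
gt = [[(float(t), 0.0, 0.0)] for t in range(4)]
pred = [[(float(t), 0.1, 0.0)] for t in range(4)]

print(mpjpe(pred, gt), accel_error(pred, gt))
```

The toy case shows why both numbers matter: a constant offset yields a nonzero position error but zero acceleration error, whereas jittery output would raise the second metric even when the first is small.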

minor comments (1)

[Abstract] The high-level description of the four components (TMT VQ-VAE, MAN, SAC, MRN) and their integration would benefit from an explicit diagram or pseudocode showing data flow between modules.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their comment. We address the concern regarding the abstract below.

Referee: [Abstract] The assertion that MotionMAR 'achieves state-of-the-art accuracy on the AMASS dataset' is presented without any quantitative metrics (e.g., MPJPE, acceleration error), baseline comparisons, error bars, or ablation results. This absence prevents verification of the central empirical claim.

Authors: We agree that the abstract would benefit from including key quantitative results to allow immediate verification of the SOTA claim. The manuscript body contains the requested evaluations (MPJPE, acceleration error, baseline comparisons, error bars, and ablations on AMASS). We will revise the abstract to incorporate the primary metrics and comparisons from the experiments section.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical pipeline (TMT VQ-VAE + MAN + SAC + MRN) whose SOTA claim on AMASS rests on measured reconstruction accuracy rather than any equation that reduces a prediction to a fitted input or self-citation by construction. No load-bearing derivation step is shown to be equivalent to its own inputs; the architecture is presented as a standard coarse-to-fine autoregressive model whose validity is tested externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 4 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented physical entities; the four named modules are architectural inventions whose independent evidence is limited to the reported AMASS performance.

invented entities (4)

Temporal Multi-scale Tokenization (TMT) VQ-VAE · no independent evidence
purpose: Encodes motion at multiple temporal resolutions to separate semantic motion from minor jitters
Introduced as the first component of the framework; no external validation cited in abstract.

Motion Autoregressive Network (MAN) · no independent evidence
purpose: Predicts motion indices across scales in latent space
Core generative component; evidence limited to abstract claim of SOTA.

Scale-Aware Control (SAC) module · no independent evidence
purpose: Integrates sparse tracking data to align output with observations
Control mechanism; no separate validation mentioned.

Motion Refinement Network (MRN) · no independent evidence
purpose: Smooths consecutive poses and removes quantization artifacts
Post-processing step; evidence tied to overall framework performance.

pith-pipeline@v0.9.1-grok · 5769 in / 1309 out tokens · 15247 ms · 2026-06-26T09:25:55.318065+00:00 · methodology


[1] [1]

FirstName LastName , title =

[2] [2]

FirstName Alpher , title =

[3] [3]

Journal of Foo , volume = 13, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =

[4] [4]

Journal of Foo , volume = 14, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =

[5] [5]

FirstName Alpher and FirstName Gamow , title =
