pith. sign in

arxiv: 2606.13364 · v1 · pith:CWIANV4Xnew · submitted 2026-06-11 · 💻 cs.LG · cs.CV

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Pith reviewed 2026-06-27 07:08 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords 3D human motion generationdiffusion models2D supervisionmonocular videoreprojection lossmotion priorsnoisy teacher
0
0 comments X

The pith

A depth-weighted 2D reprojection loss trains 3D motion diffusion models from video alone and nearly matches full 3D supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoMDM, which trains a 3D human motion diffusion model using only accurate 2D keypoints from monocular videos. A pretrained 2D-to-3D lifter supplies noisy 3D sequences that are diffused and denoised in 3D space, then supervised via reprojected 2D loss against the original keypoints. Under mild assumptions this depth-weighted reprojection loss equals direct 3D supervision in expectation, so the model learns a coherent 3D motion manifold during training rather than performing lifting only at test time. The approach adapts velocity and representation regularizers to the 2D setting and reports FID of 0.88 versus 0.54 for a fully 3D-supervised baseline on HumanML3D, plus human preference on real video data.

Core claim

VideoMDM is a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. Under mild assumptions a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and standard 3D motion regularizers are adapted to this 2D setting.

What carries the argument

depth-weighted 2D reprojection loss that equates in expectation to direct 3D supervision

If this is right

  • The model learns a coherent 3D motion manifold throughout training instead of lifting only at inference time.
  • On HumanML3D the FID reaches 0.88, close to the 0.54 of a fully 3D-supervised MDM.
  • Velocity consistency and over-parameterized representation alignment regularizers can be adapted to the 2D supervision setting.
  • On real video datasets the generated motions are consistently preferred by human evaluators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same equivalence could allow scaling motion training to far larger collections of unlabeled video.
  • Other 3D generative tasks that currently require expensive capture data might adopt similar 2D reprojection supervision.
  • If the lifter signal degrades during training, performance would be expected to plateau below the 3D-supervised ceiling.

Load-bearing premise

The mild assumptions hold under which the depth-weighted 2D reprojection loss equals direct 3D supervision in expectation, and the pretrained lifter continues to supply an informative noisy teacher signal.

What would settle it

A mathematical derivation or empirical measurement showing that the expected value of the depth-weighted 2D reprojection loss differs from the expected 3D loss by a non-zero amount when the stated mild assumptions are satisfied.

Figures

Figures reproduced from arXiv: 2606.13364 by Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany.

Figure 1
Figure 1. Figure 1: We demonstrate VideoMDM on monocular videos of human activities. Our framework trains 3D text-to-motion diffusion models using 2D pose sequences extracted from videos. Left: representative training videos. Right: generated motions using the trained model from text prompts. Despite relying solely on 2D supervision, VideoMDM attains motion fidelity approaching that of fully 3D-supervised training. See projec… view at source ↗
Figure 2
Figure 2. Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison on HumanML3D for the prompt “the person walks backwards in a [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 7
Figure 7. Figure 7: Histogram of resulting clip lengths after dynamic-programming segmentation. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: A representative screenshot from the NBA Human Preference Survey interface. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: A representative screenshot from the Fit3D Human Preference Survey interface. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
read the original abstract

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); On real video datasets Fit3D and NBA the method learns to generate motions consistently preferred by humans, with strong quantitative results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VideoMDM, a diffusion-based framework for generating 3D human motions trained directly from accurate 2D keypoints extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter supplies noisy 3D pose sequences that are diffused and denoised in 3D space, with supervision provided via a depth-weighted 2D reprojection loss against the accurate 2D keypoints. The central claim is that this loss is equivalent in expectation to direct 3D supervision under mild assumptions; the method also adapts velocity consistency and representation alignment regularizers to the 2D setting. Experiments report FID 0.88 on HumanML3D (vs. 0.54 for fully 3D-supervised MDM) and human preference on Fit3D and NBA video datasets.

Significance. If the equivalence result and empirical claims hold, the work would be significant for enabling scalable training of 3D motion diffusion models from abundant unlabeled video data rather than scarce 3D motion capture, while learning a coherent 3D manifold during training rather than lifting only at inference.

major comments (2)
  1. [Abstract] Abstract: The claim that 'under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision' is load-bearing for the entire training justification, yet the abstract (and by extension the manuscript) provides neither an explicit list of the assumptions nor any derivation showing how the expectation identity is obtained. Without these, the observed FID gap cannot be attributed to the claimed equivalence, and the training objective's validity as a 3D proxy remains unverified.
  2. [§4] §4 (Experiments): No error bars, standard deviations, or details of the experimental protocol (number of runs, random seeds, exact FID computation) are reported for the HumanML3D FID numbers (0.88 vs 0.54) or the human preference studies on Fit3D/NBA. This makes it impossible to assess whether the gap to MDM is statistically meaningful or reproducible.
minor comments (2)
  1. The manuscript should include a dedicated subsection or appendix deriving the equivalence (with all assumptions stated) so that readers can verify the conditions under which the depth-weighted reprojection loss matches 3D supervision in expectation.
  2. Notation for the depth-weighting term and the lifter noise model should be introduced formally with equations rather than described only in prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight important areas for clarification and rigor. We address each below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision' is load-bearing for the entire training justification, yet the abstract (and by extension the manuscript) provides neither an explicit list of the assumptions nor any derivation showing how the expectation identity is obtained. Without these, the observed FID gap cannot be attributed to the claimed equivalence, and the training objective's validity as a 3D proxy remains unverified.

    Authors: We agree that the abstract and main text should make the assumptions and derivation more explicit to support the central claim. The current manuscript states the result under 'mild assumptions' but does not enumerate them or derive the expectation identity in the main body. In the revision we will (1) list the assumptions explicitly in a new paragraph of Section 3, (2) provide a concise derivation (or clear reference to the supplementary derivation) showing why the depth-weighted 2D reprojection loss equals direct 3D supervision in expectation, and (3) update the abstract to reference these assumptions. This change directly addresses the concern that the training objective's validity remains unverified. revision: yes

  2. Referee: [§4] §4 (Experiments): No error bars, standard deviations, or details of the experimental protocol (number of runs, random seeds, exact FID computation) are reported for the HumanML3D FID numbers (0.88 vs 0.54) or the human preference studies on Fit3D/NBA. This makes it impossible to assess whether the gap to MDM is statistically meaningful or reproducible.

    Authors: We acknowledge that the reported results lack error bars, standard deviations, and full experimental-protocol details, which limits assessment of statistical significance and reproducibility. In the revised manuscript we will add: (i) standard deviations computed over at least three independent runs with distinct random seeds for all HumanML3D FID scores, (ii) explicit description of the FID computation protocol (including feature extractor, number of samples, and distance metric), (iii) the number of runs and seeds used, and (iv) for the human preference studies, the number of raters, inter-rater agreement, and any statistical significance tests. These additions will allow readers to evaluate whether the observed gap is meaningful. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper claims to show an equivalence (under mild assumptions) between a depth-weighted 2D reprojection loss and direct 3D supervision, but this is presented as a derived mathematical result rather than a self-definitional identity or a fitted quantity renamed as a prediction. The loss itself is defined directly from accurate 2D keypoints extracted from video, with no equations or self-citation chains that reduce the reported FID or human-preference results to the inputs by construction. No load-bearing self-citations, uniqueness theorems from the same authors, or smuggled ansatzes are evident. The derivation chain is self-contained against external benchmarks such as comparison to fully 3D-supervised MDM and evaluation on real video datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, no listed axioms beyond the unnamed mild assumptions, and no new invented entities; the equivalence claim rests on an unelaborated domain assumption.

pith-pipeline@v0.9.1-grok · 5747 in / 1369 out tokens · 28650 ms · 2026-06-27T07:08:52.660767+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 7 canonical work pages

  1. [1]

    Alliegro, Y

    A. Alliegro, Y . Siddiqui, T. Tommasi, and M. Nießner. Polydiff: Generating 3d polygonal meshes with diffusion models, 2023. URLhttps://arxiv.org/abs/2312.11417

  2. [2]

    X. Bie, W. Guo, S. Leglaive, L. Girin, F. Moreno-Noguer, and X. Alameda-Pineda. Hit- dvae: Human motion generation via hierarchical transformer dynamical vae, 2022. URL https://arxiv.org/abs/2204.01565

  3. [3]

    Bi´nkowski, D

    M. Bi´nkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. URL https://openreview. net/forum?id=r1lUOzWCW

  4. [4]

    G. Bradski. The OpenCV Library.Dr . Dobb’s Journal of Software Tools, 2000

  5. [5]

    Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y . Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields, 2019. URLhttps://arxiv.org/abs/1812.08008

  6. [6]

    X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023

  7. [7]

    Deitke, R

    M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre, E. VanderBilt, A. Kembhavi, C. V ondrick, G. Gkioxari, K. Ehsani, L. Schmidt, and A. Farhadi. Objaverse-xl: a universe of 10m+ 3d objects. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, R...

  8. [8]

    H.-S. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y . Xiu, Y .-L. Li, and C. Lu. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time.IEEE Trans. Pattern Anal. Mach. Intell., 45(6):7157–7173, June 2023. ISSN 0162-8828. doi: 10.1109/TPAMI.2022.3222784. URLhttps://doi.org/10.1109/TPAMI.2022.3222784

  9. [9]

    Fieraru, M

    M. Fieraru, M. Zanfir, S.-C. Pirlea, V . Olaru, and C. Sminchisescu. Aifit: Automatic 3d human- interpretable feedback models for fitness training. InThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021

  10. [10]

    R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srinivasan, J. T. Barron, and B. Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

  11. [11]

    Grauman, A

    K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, E. Byrne, Z. Chavis, J. Chen, F. Cheng, F.-J. Chu, S. Crane, A. Dasgupta, J. Dong, M. Escobar, C. Forigua, A. Gebreselasie, S. Haresh, J. Huang, M. M. Islam, S. Jain, R. Khirodkar, D. Kukreja, K. J. Liang, J.-W. Liu, S. Majumder, Y . Mao, ...

  12. [12]

    C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng. Action2motion: Conditioned generation of 3d human motions. InProceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020

  13. [13]

    C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5152–5161, 2022

  14. [14]

    C. Guo, Y . Mu, M. G. Javed, S. Wang, and L. Cheng. Momask: Generative masked modeling of 3d human motions, 2023. URLhttps://arxiv.org/abs/2312.00063

  15. [15]

    R. Guo, H. Pi, Z. Shen, Q. Shuai, Z. Hu, Z. Wang, Y . Dong, R. Hu, T. Komura, S. Peng, and X. Zhou. Motion-2-to-3: Leveraging 2d motion data for 3d motion generations. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14305–14316, October 2025

  16. [16]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models, 2020. URL https: //arxiv.org/abs/2006.11239

  17. [17]

    S.-E. Hong, S. Lim, J. Hwang, M. Chang, and H. Kang. Bipo: Bidirectional partial occlusion network for text-to-motion synthesis, 2025. URLhttps://arxiv.org/abs/2412.00112

  18. [18]

    Jiang, P

    T. Jiang, P. Lu, L. Zhang, N. Ma, R. Han, C. Lyu, Y . Li, and K. Chen. Rtmpose: Real-time multi-person pose estimation based on mmpose, 2023. URL https://arxiv.org/abs/2303. 07399

  19. [19]

    Kapon, G

    R. Kapon, G. Tevet, D. Cohen-Or, and A. H. Bermano. Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1965–1974, 2024

  20. [20]

    Karunratanakul, K

    K. Karunratanakul, K. Preechakul, S. Suwajanakorn, and S. Tang. Guided motion diffusion for controllable human motion synthesis, 2023. URLhttps://arxiv.org/abs/2305.12577

  21. [21]

    3D Gaussian Splatting for Real-Time Radiance Field Rendering,

    B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4), July 2023. ISSN 0730-0301. doi: 10.1145/3592433. URLhttps://doi.org/10.1145/3592433

  22. [22]

    Kynkäänniemi, T

    T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved precision and recall metric for assessing generative models.CoRR, abs/1904.06991, 2019

  23. [23]

    J. Li, Y . Yuan, D. Rempe, H. Zhang, C. Lu, J. Kautz, and U. Iqbal. Coin: Control-inpainting diffusion prior for human and camera motion estimation. InEuropean Conference on Computer Vision (ECCV), 2024

  24. [24]

    J. Li, J. Cao, H. Zhang, D. Rempe, J. Kautz, U. Iqbal, and Y . Yuan. Genmo: Generative models for human motion synthesis.arXiv preprint arXiv:2505.01425, 2025

  25. [25]

    J. Li, C. K. Liu, and J. Wu. Lifting motion to the 3d world via 2d diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17518–17528, 2025

  26. [26]

    M. Liu, C. Xu, H. Jin, L. Chen, M. Varma T, Z. Xu, and H. Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization.Advances in Neural Information Processing Systems, 36:22226–22246, 2023. 11

  27. [27]

    M. Liu, R. Shi, L. Chen, Z. Zhang, C. Xu, X. Wei, H. Chen, C. Zeng, J. Gu, and H. Su. One-2- 3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10072–10083, 2024

  28. [28]

    R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

  29. [29]

    Z. Liu, Y . Feng, M. J. Black, D. Nowrouzezahrai, L. Paull, and W. Liu. Meshdiffusion: Score- based generative 3d mesh modeling, 2023. URLhttps://arxiv.org/abs/2303.08133

  30. [30]

    Macario Barros, M

    A. Macario Barros, M. Michel, Y . Moline, G. Corre, and F. Carrel. A comprehensive survey of visual slam algorithms.Robotics, 11(1), 2022. ISSN 2218-6581. doi: 10.3390/robotics11010024. URLhttps://www.mdpi.com/2218-6581/11/1/24

  31. [31]

    Mahmood, N

    N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. Amass: Archive of motion capture as surface shapes, 2019. URLhttps://arxiv.org/abs/1904.03278

  32. [32]

    Y . Mu, X. Zuo, C. Guo, Y . Wang, J. Lu, X. Wu, S. Xu, P. Dai, Y . Yan, and L. Cheng. Gsd: View-guided gaussian splatting diffusion for 3d reconstruction, 2024. URL https://arxiv. org/abs/2407.04237

  33. [33]

    H. Nam, G. Kwon, G. Y . Park, and J. C. Ye. Contrastive denoising score for text-guided latent diffusion image editing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9192–9201, 2024

  34. [34]

    C. Peng, I. Sobol, M. Tomizuka, K. Keutzer, C. Xu, and O. Litany. A lesson in splats: Teacher- guided diffusion for 3d gaussian splats generation with 2d supervision. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  35. [35]

    Pinyoanuntapong, M

    E. Pinyoanuntapong, M. U. Saleem, P. Wang, M. Lee, S. Das, and C. Chen. Bamm: Bidirectional autoregressive motion model, 2024. URLhttps://arxiv.org/abs/2403.19435

  36. [36]

    Poole, A

    B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

  37. [37]

    Ramesh, P

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents, 2022. URLhttps://arxiv.org/abs/2204.06125

  38. [38]

    2024 , isbn =

    B. Roessle, N. Müller, L. Porzi, S. Rota Bulò, P. Kontschieder, A. Dai, and M. Nießner. L3dg: Latent 3d gaussian diffusion. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400711312. doi: 10.1145/3680528.3687699. URLhttps://doi.org/10.1145/3680528.3687699

  39. [39]

    M. Shi, K. Aberman, A. Aristidou, T. Komura, D. Lischinski, D. Cohen-Or, and B. Chen. Motionet: 3d human motion reconstruction from monocular video with skeleton consistency. ACM Transactions on Graphics, 40(1):1–15, Sept. 2020. ISSN 1557-7368. doi: 10.1145/ 3407659. URLhttp://dx.doi.org/10.1145/3407659

  40. [40]

    R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su. Zero123++: a single image to consistent multi-view diffusion base model.arXiv preprint arXiv:2310.15110, 2023

  41. [41]

    S. Shin, J. Kim, E. Halilaj, and M. J. Black. WHAM: Reconstructing world-grounded humans with accurate 3D motion. InIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2024

  42. [42]

    R. C. Smith and P. Cheeseman. On the representation and estimation of spatial uncer- tainty.The International Journal of Robotics Research, 5(4):56–68, 1986. doi: 10.1177/ 027836498600500404. URLhttps://doi.org/10.1177/027836498600500404

  43. [43]

    Sobol, C

    I. Sobol, C. Xu, and O. Litany. Zero-to-hero: Enhancing zero-shot novel view synthesis via attention map filtering, 2024. URLhttps://arxiv.org/abs/2405.18677. 12

  44. [44]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  45. [45]

    Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations, 2021. URL https://arxiv. org/abs/2011.13456

  46. [46]

    Tevet, S

    G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-or, and A. H. Bermano. Human motion diffusion model. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=SJ1kSyO2jwu

  47. [47]

    van den Oord, O

    A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. InProceedings of the 31st International Conference on Neural Information Processing Sys- tems, NIPS’17, page 6309–6318, Red Hook, NY , USA, 2017. Curran Associates Inc. ISBN 9781510860964

  48. [48]

    Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task,

    B. Wandt, J. J. Little, and H. Rhodin. ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses . In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6625–6635, Los Alami- tos, CA, USA, June 2022. IEEE Computer Society. doi: 10.1109/CVPR52688.2022.00652. URLhtt...

  49. [49]

    J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y . Zhao, D. Liu, Y . Mu, M. Tan, X. Wang, W. Liu, and B. Xiao. Deep high-resolution representation learning for visual recognition, 2020. URL https://arxiv.org/abs/1908.07919

  50. [50]

    Y . Wang, Z. Wang, L. Liu, and K. Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos. InEuropean Conference on Computer Vision, pages 467–487. Springer, 2024

  51. [51]

    Z. Wang, C. Lu, Y . Wang, F. Bao, C. LI, H. Su, and J. Zhu. Prolificdreamer: High- fidelity and diverse text-to-3d generation with variational score distillation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 8406–8441. Curran Associates, Inc., 2023. URL ...

  52. [52]

    K. Xie, J. Lorraine, T. Cao, J. Gao, J. Lucas, A. Torralba, S. Fidler, and X. Zeng. Latte3d: Large-scale amortized text-to-enhanced3d synthesis. InEuropean Conference on Computer Vision, pages 305–322. Springer, 2024

  53. [53]

    X. Zeng, A. Vahdat, F. Williams, Z. Gojcic, O. Litany, S. Fidler, and K. Kreis. Lion: Latent point diffusion models for 3d shape generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  54. [54]

    Zhang, Z

    M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu. Motiondiffuse: Text-driven human motion generation with diffusion model, 2022. URL https://arxiv.org/abs/2208. 15001

  55. [55]

    Zhang, D

    M. Zhang, D. Jin, C. Gu, F. Hong, Z. Cai, J. Huang, C. Zhang, X. Guo, L. Yang, Y . He, and Z. Liu. Large motion model for unified multi-modal motion generation.arXiv preprint arXiv:2404.01284, 2024

  56. [56]

    Zhang, B

    S. Zhang, B. L. Bhatnagar, Y . Xu, A. Winkler, P. Kadlecek, S. Tang, and F. Bogo. Rohm: Robust human motion reconstruction via diffusion, 2024. URL https://arxiv.org/abs/ 2401.08570

  57. [57]

    W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y . Wang. Motionbert: A unified perspective on learning human motion representations. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  58. [58]

    x y z # , ˆx=

    Q. Zou, S. Yuan, S. Du, Y . Wang, C. Liu, Y . Xu, J. Chen, and X. Ji. Parco: Part-coordinating text-to-motion synthesis, 2024. URLhttps://arxiv.org/abs/2403.18512. 13 A Weights for 3D to 2D Loss Equivalence In standard DDPM and DDIM training, given a sample x∼p and denoiser output ˆx, the reconstruc- tion loss is the mean squared error: L3d =E x∼p ∥ˆx−x∥ ...

  59. [59]

    J Explicit HumanML Channel Partitioning HumanML3D’s representation [13] is composed of:

    EPnP solver and Levenberg-Marquardt pose refinement, both with default parameters. J Explicit HumanML Channel Partitioning HumanML3D’s representation [13] is composed of:

  60. [60]

    1 channel for angular velocity around the y-axis, 2 channels for root velocity in the XZ plane, 1 channel for root height

  61. [61]

    3 channels per non-root joint, representing X (root coordinate frame) Y (global) and Z (root coordinate frame)

  62. [62]

    6 channels per non-root joint, representing the 6D continuous rotations of the joints in relation to the rest pose angle (T-shape human), each joint rotation is calculated as the normalized displacement from its ancestor

  63. [63]

    3 channels per joint (including root) representing the per-joint velocity

  64. [64]

    For the NBA dataset with only 2 foot joints we replicate these flags per foot

    4 channels representing the 4 foot contact flags. For the NBA dataset with only 2 foot joints we replicate these flags per foot. So in total our x is composed of AJ = 4+(J−1)×3 channels and r of BJ = (J−1)×6+J×3+4 channels. 9