arxiv: 2605.15583 · v1 · pith:VY2EQUP5new · submitted 2026-05-15 · 💻 cs.CV

Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling

Pith reviewed 2026-05-20 19:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D human pose estimation · unsupervised learning · diffusion models · ancestral sampling · multi-view consistency · 2D-3D lifting · cross-domain generalization · motion priors

The pith

Conditional multi-view ancestral sampling recovers 3D human poses from single 2D views by aligning projections with pre-trained 2D motion diffusion manifolds without any 3D supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an unsupervised technique for lifting a 2D human pose seen from a single view to a 3D pose estimate. It adapts ancestral sampling from diffusion models to multiple virtual camera views and conditions the result on the input 2D pose plus anatomical constraints. The key is that a 2D motion diffusion model, pre-trained on large 2D human pose datasets, supplies the prior that guides the search for a plausible 3D configuration. The reported experiments on the Yoga dataset show better cross-domain performance than state-of-the-art supervised and unsupervised methods, including on extreme poses for which 3D supervision is unavailable. This matters for applications where gathering 3D motion capture data is impractical.

Core claim

The paper claims that 3D human poses can be estimated from single-view 2D inputs without 3D supervision by using conditional multi-view ancestral sampling to optimize the pose such that its multi-view 2D projections follow the manifold of a pre-trained 2D motion diffusion model in noise space, while also matching the given 2D pose and anatomical constraints.
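
The multi-view consistency in this claim can be made concrete with a small geometric sketch. The code below projects a 3D pose into several virtual views and recovers it again by linear least squares. The orthographic camera, the yaw-only view placement, the 17-joint layout, and all function names are assumptions made for illustration; this page does not state the paper's camera model.

```python
import numpy as np

def yaw_rotation(angle_rad):
    """Rotation about the vertical (y) axis by angle_rad."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project_orthographic(pose_3d, rotation):
    """Rotate a (J, 3) pose into a camera frame and drop depth, giving (J, 2)."""
    return (pose_3d @ rotation.T)[:, :2]

def triangulate_orthographic(poses_2d, rotations):
    """Least-squares 3D pose from V orthographic views.

    poses_2d has shape (V, J, 2) and rotations has shape (V, 3, 3).
    Each view contributes two linear equations per joint: R[:2] @ X = x.
    """
    A = np.concatenate([R[:2] for R in rotations], axis=0)   # (2V, 3)
    b = np.concatenate([p.T for p in poses_2d], axis=0)      # (2V, J)
    X, *_ = np.linalg.lstsq(A, b, rcond=None)                # (3, J)
    return X.T

# Round trip on a random stand-in pose (assumed 17 joints).
rng = np.random.default_rng(0)
pose = rng.normal(size=(17, 3))
angles = np.deg2rad([0.0, 60.0, 120.0, 240.0])
rotations = np.stack([yaw_rotation(a) for a in angles])
views = np.stack([project_orthographic(pose, R) for R in rotations])
recovered = triangulate_orthographic(views, rotations)
```

With exact, mutually consistent projections the round trip is lossless. The harder case, which the paper targets, is when the virtual-view 2D poses come from a generative model and need not agree with any single 3D pose.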

What carries the argument

Conditional multi-view ancestral sampling (cMAS), which extends multi-view ancestral sampling of diffusion models to 2D-3D lifting: it optimizes the 3D pose so that its multi-view projections stay consistent with the manifold in 2D MDM noise space.
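
For readers unfamiliar with ancestral sampling, the following is a minimal sketch of one standard DDPM reverse step (Ho et al. [14]), not the cMAS procedure itself. The linear noise schedule, the variance choice sigma_t^2 = beta_t, and the `dummy_eps_model` stand-in for a pre-trained 2D MDM are all assumptions for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule."""
    return np.linspace(beta_start, beta_end, num_steps)

def ancestral_step(x_t, t, eps_pred, betas, rng):
    """One DDPM ancestral sampling step x_t -> x_{t-1} (t is 0-indexed)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # Posterior mean computed from the predicted noise.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added on the final step
    noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * noise

def dummy_eps_model(x_t, t):
    """Hypothetical placeholder for a pre-trained noise predictor."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas = linear_beta_schedule(50)
x = rng.standard_normal((17, 2))  # a single 2D pose with an assumed 17 joints
for t in reversed(range(len(betas))):
    x = ancestral_step(x, t, dummy_eps_model(x, t), betas, rng)
```

As the Figure 2 caption describes it, the proposed method interleaves steps of this kind with triangulation and reprojection across views, so that the per-view samples are pulled toward a shared 3D pose.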

Load-bearing premise

The 2D diffusion model's learned manifold acts as a good prior for 3D poses when its noise-space projections are optimized jointly over multiple virtual views.

What would settle it

A direct comparison on the Yoga dataset where the proposed method shows no improvement over state-of-the-art unsupervised 3D pose estimation baselines for extreme poses.

Figures

Figures reproduced from arXiv: 2605.15583 by Fumio Okura, Ryohei Goto, Shunsuke Saruwatari, Takuya Fujihashi.

Figure 1: Compared to the baselines including state-of-the-art supervised (Video-to-Pose3D [31] and MotionBERT [48]) and unsupervised (ElePose [43]) …
Figure 2: (Left) An overview of our proposed framework. We generate 2D poses for V − 1 virtual views from noise using a diffusion model. These generated poses, along with the input 2D pose from a reference view v0, are used to triangulate an initial 3D pose. The 3D pose is then reprojected, and the noise is updated to refine the estimate. This iterative process is repeated for T timesteps to yield the final 3D pose. (Ri…
Figure 3: Visualization of the multi-view 2D poses generated by the proposed …
Figure 4: Qualitative comparison of the proposed method and baseline methods on challenging sequences from the Yoga90 dataset [19]. From left to right: input, Video-to-Pose3D [31], ElePose [43], MotionBERT [48], Ours, and the ground truth. Compared to MotionBERT and ElePose, our method consistently generates accurate and natural 3D poses, especially for poses with significant self-occlusion or unusual joint articula…
Figure 5: Visual examples of challenging poses beyond yoga, without 3D ground truth. Compared to the baseline methods, the proposed method consistently generates accurate and natural 3D poses.
Figure 6: Failure cases due to inherent depth ambiguity from a single view

Original abstract

We propose a method of estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we newly propose a conditional multi-view ancestral sampling (cMAS) that optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is available at: https://github.com/asaa0001/c-MAS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an unsupervised approach to 3D human pose estimation from a single 2D image by proposing conditional multi-view ancestral sampling (cMAS). This method extends ancestral sampling from 2D motion diffusion models (MDMs) to optimize a 3D pose such that its projections onto multiple virtual views lie within the learned 2D manifold, conditioned on the observed 2D pose and anatomical constraints. The authors report improved cross-domain performance on the Yoga dataset compared to existing supervised and unsupervised methods, particularly for extreme poses lacking 3D supervision.

Significance. Should the empirical results hold and the method reliably recover accurate 3D poses, this work offers a promising direction for leveraging large-scale 2D diffusion priors in 3D tasks without requiring 3D training data. The innovation in using multi-view projections to constrain the optimization is noteworthy, and the availability of code enhances reproducibility. However, the significance is tempered by the need to confirm that the approach does not suffer from under-constrained 3D solutions.

major comments (2)
  1. The cMAS procedure optimizes the 3D pose to match the 2D MDM manifold in noise space across virtual views. However, it is not clear from the description whether this optimization guarantees a unique or correct 3D pose, as multiple 3D configurations could project to points on the 2D manifold. This assumption is central to the claim of accurate 3D estimation for out-of-distribution Yoga poses.
  2. The abstract claims better performance on the Yoga dataset, but without detailed quantitative tables, ablation studies on the number of virtual views, or error analysis, it is difficult to assess if the improvements are due to the proposed method or other factors. Specific metrics and comparisons are needed to support the cross-domain superiority.
minor comments (2)
  1. Including key quantitative results, such as MPJPE or PCK values on the Yoga dataset, would strengthen the abstract and provide immediate evidence for the performance claims.
  2. Ensure that the notation for the diffusion model and the optimization objective is clearly defined to avoid ambiguity in the conditional sampling process.
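
The two metrics named in these comments have standard definitions, sketched below. MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PCK is the fraction of joints whose error falls within a threshold. The function names and the toy numbers are illustrative, and the threshold is an assumption because this page does not report the one used in the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pck(pred, gt, threshold):
    """Fraction of joints whose Euclidean error is at most `threshold`."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(errors <= threshold))

# Toy example with three joints; per-joint errors are 5, 0, and 12 units.
gt = np.zeros((3, 3))
pred = np.array([[3.0, 4.0, 0.0],
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 12.0]])
error_mean = mpjpe(pred, gt)               # (5 + 0 + 12) / 3
within_10 = pck(pred, gt, threshold=10.0)  # 2 of 3 joints
```

Both functions also accept batched inputs of shape (N, J, 3), in which case the average runs over all joints of all poses.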

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We have addressed each major comment in detail below and revised the paper to strengthen the presentation of our method and results.

point-by-point responses
  1. Referee: The cMAS procedure optimizes the 3D pose to match the 2D MDM manifold in noise space across virtual views. However, it is not clear from the description whether this optimization guarantees a unique or correct 3D pose, as multiple 3D configurations could project to points on the 2D manifold. This assumption is central to the claim of accurate 3D estimation for out-of-distribution Yoga poses.

    Authors: We appreciate the referee's point regarding potential ambiguities in the 3D solution. The cMAS optimization does not provide a strict mathematical guarantee of uniqueness, as the underlying 2D-to-3D lifting problem remains under-constrained in principle. However, the combination of conditioning on the input 2D pose, enforcing anatomical constraints, and requiring consistency across multiple virtual views on the learned 2D MDM manifold substantially reduces the space of plausible solutions. Our empirical evaluation on the Yoga dataset shows that the recovered poses are both plausible and more accurate than competing methods, especially for extreme poses. In the revised manuscript we have added a dedicated paragraph in Section 3.3 discussing residual ambiguities and how the multi-view and constraint terms mitigate them, along with additional qualitative visualizations of recovered poses and failure cases. revision: partial

  2. Referee: The abstract claims better performance on the Yoga dataset, but without detailed quantitative tables, ablation studies on the number of virtual views, or error analysis, it is difficult to assess if the improvements are due to the proposed method or other factors. Specific metrics and comparisons are needed to support the cross-domain superiority.

    Authors: We agree that expanded quantitative support strengthens the claims. The original manuscript contained comparative results, but we have now included a new table (Table 2) reporting MPJPE and PCK metrics on the Yoga dataset against both supervised and unsupervised baselines. We have also added an ablation study (Table 3) that varies the number of virtual views from 2 to 8, demonstrating consistent gains that plateau after approximately five views. Finally, we provide a per-pose error breakdown separating standard and extreme poses to illustrate where the cross-domain advantage is most pronounced. These additions appear in the revised experimental section. revision: yes
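
The first response leans on anatomical constraints without this page saying what they are. One common constraint of this kind is left-right bone-length symmetry, sketched below as a penalty term. The five-joint skeleton, the bone lists, and the function names are hypothetical, and the paper may use different or additional constraints.

```python
import numpy as np

# Hypothetical 5-joint skeleton: joint 0 is a root, joints 1-2 form the
# left chain, joints 3-4 the right chain. Indices are illustrative only.
LEFT_BONES = [(0, 1), (1, 2)]
RIGHT_BONES = [(0, 3), (3, 4)]

def bone_lengths(pose_3d, bones):
    """Euclidean length of each (parent, child) bone in a (J, 3) pose."""
    parents = pose_3d[[i for i, _ in bones]]
    children = pose_3d[[j for _, j in bones]]
    return np.linalg.norm(children - parents, axis=-1)

def symmetry_penalty(pose_3d):
    """Mean squared difference between matching left and right bone lengths."""
    left = bone_lengths(pose_3d, LEFT_BONES)
    right = bone_lengths(pose_3d, RIGHT_BONES)
    return float(np.mean((left - right) ** 2))

symmetric_pose = np.array([[0.0, 0.0, 0.0],
                           [1.0, 0.0, 0.0],
                           [1.0, -1.0, 0.0],
                           [-1.0, 0.0, 0.0],
                           [-1.0, -1.0, 0.0]])
stretched_pose = symmetric_pose.copy()
stretched_pose[4] = [-1.0, -3.0, 0.0]  # lengthen the lower right bone to 3
```

A penalty like this is zero for any mirror-symmetric skeleton regardless of pose, so it narrows the candidate set without favoring particular joint angles.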

Circularity Check

0 steps flagged

No significant circularity in the derivation or claims.

full rationale

The paper introduces an explicit optimization procedure (cMAS) that extends ancestral sampling from pre-trained 2D MDMs to enforce consistency of projected 2D poses on the learned manifold while conditioning on input observations and anatomical constraints. This is a constructive algorithmic step rather than a parameter fit or definitional loop that encodes the target 3D accuracy by construction. No equations reduce the cross-domain performance claim on Yoga to a tautology or self-citation chain; the evaluation remains an independent empirical test of the joint-optimization hypothesis. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly relies on the existence of a useful 2D pose manifold in the pre-trained MDM and on the validity of multi-view projection consistency as a 3D constraint.

pith-pipeline@v0.9.0 · 5719 in / 1096 out tokens · 34534 ms · 2026-05-20T19:37:55.330056+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  1. [1]

    C. Ahuja and L.-P. Morency. Language2Pose: Natural language grounded pose forecasting. In Proceedings of International Conference on 3D Vision (3DV), 2019

  2. [2]

    Y. Cai, L. Ge, J. Liu, J. Cai, T.-J. Cham, J. Yuan, and N. M. Thalmann. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  3. [3]

    Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  4. [4]

    D. Carbonera Luvizon, H. Tabia, and D. Picard. SSP-Net: Scalable sequential pyramid networks for real-time 3D human pose regression. Pattern Recognition (PR), 142:109714, 2023

  5. [5]

    C.-H. Chen, A. Tyagi, A. Agrawal, D. Drover, R. MV, S. Stojanov, and J. M. Rehg. Unsupervised 3D pose estimation with geometric self-supervision. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  6. [6]

    Y. Cheng, B. Yang, B. Wang, and R. T. Tan. 3D human pose estimation using spatio-temporal networks with explicit occlusion training. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI), 2020

  7. [7]

    H. Ci, C. Wang, X. Ma, and Y. Wang. Optimizing network structure for 3D human pose estimation. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  8. [8]

    H. Ci, M. Wu, W. Zhu, X. Ma, H. Dong, F. Zhong, and Y. Wang. GFPose: Learning 3D human pose prior with gradient fields. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  9. [9]

    D. Drover, R. MV, C.-H. Chen, A. Agrawal, A. Tyagi, and C. Phuoc Huynh. Can 3D pose be learned from 2D projections alone? In Proceedings of European Conference on Computer Vision Workshops (ECCVW), 2018

  10. [10]

    H.-S. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y. Xiu, Y.-L. Li, and C. Lu. AlphaPose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(6):7157–7173, 2022

  11. [11]

    A. Ghosh, R. Dabral, V. Golyanik, C. Theobalt, and P. Slusallek. ReMoS: 3D motion-conditioned reaction synthesis for two-person interactions. In Proceedings of European Conference on Computer Vision (ECCV), 2024

  12. [13]

    C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng. Generating diverse and natural 3D human motions from text. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  13. [14]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  14. [15]

    D. Holden, J. Saito, and T. Komura. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG), 35(4):138:1–138:11, 2016

  15. [16]

    C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(7):1325–1339, 2014

  16. [17]

    R. Kapon, G. Tevet, D. Cohen-Or, and A. H. Bermano. MAS: Multi-view ancestral sampling for 3D motion generation using 2D diffusion. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  17. [18]

    K. Karunratanakul, K. Preechakul, S. Suwajanakorn, and S. Tang. Guided motion diffusion for controllable human motion synthesis. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  18. [19]

    S. Kim. 3DYoga90: A hierarchical video dataset for yoga pose understanding. arXiv preprint arXiv:2310.10131, 2023

  19. [20]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations (ICLR), 2015

  20. [21]

    D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of International Conference on Learning Representations (ICLR), 2014

  21. [22]

    M. Kocabas, S. Karagoz, and E. Akbas. Self-supervised learning of 3D human pose using multi-view geometry. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  22. [23]

    J. Li, H. Hu, J. Li, and X. Zhao. 3D-Yoga: A 3D yoga dataset for visual-based hierarchical sports action analysis. In Proceedings of Asian Conference on Computer Vision (ACCV), 2022

  23. [24]

    J. Li, J. Wu, and C. K. Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):197:1–197:11, 2023

  24. [25]

    W. Li, H. Liu, H. Tang, P. Wang, and L. Van Gool. MHFormer: Multi-hypothesis transformer for 3D human pose estimation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  25. [26]

    H. Liang, W. Zhang, W. Li, J. Yu, and L. Xu. InterGen: Diffusion-based multi-human motion generation under complex interactions. International Journal of Computer Vision (IJCV), 132(9):3463–3483, 2024

  26. [27]

    J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2017

  27. [28]

    D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In Proceedings of International Conference on 3D Vision (3DV), 2017

  28. [29]

    G. Pavlakos, X. Zhou, and K. Daniilidis. Ordinal depth supervision for 3D human pose estimation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  29. [30]

    G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  30. [31]

    D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  31. [32]

    M. Petrovich, M. J. Black, and G. Varol. TEMOS: Generating diverse human motions from textual descriptions. In Proceedings of European Conference on Computer Vision (ECCV), 2022

  32. [33]

    S. Raab, I. Leibovitch, G. Tevet, M. Arar, A. H. Bermano, and D. Cohen-Or. Single motion diffusion. In Proceedings of International Conference on Learning Representations (ICLR), 2024

  33. [34]

    N. D. Reddy, L. Guigues, L. Pishchulin, J. Eledath, and S. G. Narasimhan. TesseTrack: End-to-end learnable multi-person articulated 3D pose tracking. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  34. [35]

    Y. Shafir, G. Tevet, R. Kapon, and A. H. Bermano. Human motion diffusion as a generative prior. In Proceedings of International Conference on Learning Representations (ICLR), 2024

  35. [36]

    W. Shan, Z. Liu, X. Zhang, S. Wang, S. Ma, and W. Gao. P-STMO: Pre-trained spatial temporal many-to-one model for 3D human pose estimation. In Proceedings of European Conference on Computer Vision (ECCV), 2022

  36. [37]

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of International Conference on Machine Learning (ICML), 2015

  37. [38]

    X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2017

  38. [39]

    M. Tanaka and K. Fujiwara. Role-aware interaction generation from textual description. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  39. [40]

    B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3D body poses from motion compensated sequences. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  40. [41]

    G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or. MotionCLIP: Exposing human motion generation to CLIP space. In Proceedings of European Conference on Computer Vision (ECCV), 2022

  41. [42]

    G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano. Human motion diffusion model. In Proceedings of International Conference on Learning Representations (ICLR), 2023

  42. [43]

    B. Wandt, J. J. Little, and H. Rhodin. ElePose: Unsupervised 3D human pose estimation by predicting camera elevation and learning normalizing flows on 2D poses. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  43. [44]

    J. Wang, S. Yan, Y. Xiong, and D. Lin. Motion guided 3D pose estimation from videos. In Proceedings of European Conference on Computer Vision (ECCV), 2020

  44. [45]

    Z. Yu, B. Ni, J. Xu, J. Wang, C. Zhao, and W. Zhang. Towards alleviating the modeling ambiguity of unsupervised monocular 3D human pose estimation. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  45. [46]

    J. Zhang, Z. Tu, J. Yang, Y. Chen, and J. Yuan. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  46. [47]

    C. Zheng, S. Zhu, M. Mendieta, T. Yang, C. Chen, and Z. Ding. 3D human pose estimation with spatial and temporal transformers. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  47. [48]

    W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y. Wang. MotionBERT: A unified perspective on learning human motion representations. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), 2023