FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery
Pith reviewed 2026-05-20 21:01 UTC · model grok-4.3
The pith
Separating deterministic recovery of the torso-root anchor from probabilistic limb completion reduces errors in ambiguous human mesh recovery from video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FactorizedHMR is a hybrid two-stage framework that first applies deterministic regression to obtain a stable torso-root anchor and then uses a probabilistic flow-matching module to recover the non-torso articulations. By incorporating a composite target representation, geometry-aware supervision, and feature-aware classifier-free guidance, the method preserves the anchor during completion and achieves competitive results on benchmarks with notable improvements in occlusion-heavy recovery and drift-sensitive world-space metrics. A synthetic data pipeline provides the necessary paired supervision under varied viewpoints.
What carries the argument
Two-stage factorization with deterministic regression for the torso-root anchor and probabilistic flow-matching for distal articulations, using geometry-aware supervision and classifier-free guidance to maintain anchor consistency.
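The two-stage flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `regress_anchor`, `toy_velocity`, and `complete_limbs` are hypothetical names, the velocity field is an analytic stand-in for a learned network, and the guidance rule is the standard classifier-free guidance combination rather than the paper's feature-aware variant, whose details are not given here.

```python
import numpy as np

def regress_anchor(features: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in (hypothetical): a deterministic torso-root anchor.
    Toy rule: mean-pool the frame features and keep the first 6 dims."""
    return features.mean(axis=0)[:6]

def toy_velocity(x, t, anchor, features=None):
    """Stage 2 stand-in (hypothetical): velocity of a straight-line path
    toward a toy target. features=None plays the unconditional branch."""
    target = np.full_like(x, anchor.mean())      # anchor-conditioned target
    if features is not None:
        target = target + features.mean()        # extra image conditioning
    return (target - x) / (1.0 - t)

def complete_limbs(anchor, features, dim=12, steps=20, guidance=2.0, seed=0):
    """Euler integration from noise (t=0) toward a limb pose (t=1) with
    standard classifier-free guidance on the velocity."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)                 # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                               # t stays below 1, so no division by zero
        v_cond = toy_velocity(x, t, anchor, features)
        v_uncond = toy_velocity(x, t, anchor, None)
        v = v_uncond + guidance * (v_cond - v_uncond)
        x = x + dt * v
    return x

features = np.linspace(0.0, 1.0, 48).reshape(4, 12)  # 4 frames x 12 toy feature dims
anchor = regress_anchor(features)                     # stage 1: fixed from here on
limbs = complete_limbs(anchor, features)              # stage 2: sampled completion
pose = np.concatenate([anchor, limbs])                # anchor is copied through unchanged
```

The point of the sketch is the data flow: the anchor is computed once, conditions every velocity evaluation, and is concatenated into the final pose without being resampled.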
If this is right
- Competitive performance maintained across camera-space and world-space benchmarks.
- Clearer improvements in scenarios with heavy occlusion.
- Reduced drift in long-term world-space tracking metrics.
- Better handling of ambiguity in single-reference recovery of limbs.
Where Pith is reading between the lines
- Independent optimization of the anchor stage and completion stage could lead to further gains without full retraining.
- The idea of fixing certain parts during probabilistic inference may generalize to other body recovery tasks with similar certainty gradients.
- Combining this with multi-view inputs might amplify the benefits in world-space consistency.
Load-bearing premise
The torso pose and root structure are relatively well constrained by image evidence, while distal articulations remain substantially more uncertain.
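One way to probe this premise is to compare mean per-joint position error (MPJPE) between joint groups. The sketch below is a minimal example on synthetic data; the group names and joint indices are placeholders, since the paper's exact torso/distal split is not stated here.

```python
import numpy as np

def per_part_mpjpe(pred, gt, groups):
    """Mean per-joint position error for each joint group.

    pred, gt: arrays of shape (frames, joints, 3), in the same units.
    groups:   dict mapping a group name to a list of joint indices.
    """
    err = np.linalg.norm(pred - gt, axis=-1)     # (frames, joints) Euclidean errors
    return {name: float(err[:, idx].mean()) for name, idx in groups.items()}

# Synthetic check: torso joints are exact, distal joints are offset by (3, 4, 0).
gt = np.zeros((2, 6, 3))
pred = gt.copy()
pred[:, 2:, :] += np.array([3.0, 4.0, 0.0])

groups = {"torso_root": [0, 1], "distal": [2, 3, 4, 5]}  # placeholder indices
result = per_part_mpjpe(pred, gt, groups)
```

If the premise holds on real predictions, the `torso_root` entry should be clearly lower than the `distal` entry.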
What would settle it
Evaluation on a test set of videos with heavy torso occlusion but clearly visible limbs. If the premise fails there, the fixed anchor becomes the weak link and the method should underperform.
Original abstract
Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.
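For readers unfamiliar with flow matching, a common form of the training objective is given below as background. It is the generic conditional flow-matching loss with a linear interpolation path, not the paper's loss: the composite target representation and geometry-aware terms are not reproduced, and the symbols are assumptions (x_1 is the ground-truth non-torso articulation, x_0 is Gaussian noise, a is the torso-root anchor, c is the image feature conditioning).

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\| v_\theta(x_t, t \mid a, c) - (x_1 - x_0) \bigr\|_2^2
```

Classifier-free guidance additionally requires that the conditioning c be dropped with some probability during training, so that the same network can be evaluated with and without it at inference time.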
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FactorizedHMR, a two-stage hybrid framework for video human mesh recovery. It first uses a deterministic regression module to recover a stable torso-root anchor, followed by a probabilistic flow-matching module to complete the non-torso articulations. A synthetic data pipeline is proposed to provide paired image-camera-motion supervision. The method is evaluated on camera-space and world-space benchmarks, claiming competitive performance with gains in occlusion-heavy recovery and drift-sensitive metrics.
Significance. If the empirical results hold, this work could contribute to better handling of inherent ambiguities in human mesh recovery by explicitly separating well-constrained and uncertain parts of the body. The combination of deterministic and probabilistic components, along with the synthetic data generation, represents a thoughtful approach to improving robustness in challenging scenarios. The synthetic data pipeline providing paired supervision is a strength.
major comments (2)
- [§3.1] The central premise that torso pose and root structure are relatively well constrained by image evidence, while distal articulations are more uncertain, is stated qualitatively but lacks supporting quantitative analysis, such as per-part error distributions or visibility statistics from the first-stage regression. This assumption is load-bearing for the two-stage design and the decision to fix the anchor.
- [§4.3] (world-space results) No ablation is presented on the sensitivity of final metrics to first-stage root translation or torso orientation errors. Since the anchor is kept fixed and drift metrics are root-sensitive, this leaves open whether modest first-stage mistakes could dominate the reported gains in occlusion-heavy cases.
minor comments (2)
- [Abstract] The abstract claims competitive performance and targeted gains; the manuscript should ensure that all tables include error bars, dataset splits, and exact baseline versions so that these claims can be verified.
- [Figure 2] Figure captions for the pipeline overview could more explicitly label the conditioning path from the deterministic anchor into the flow-matching stage.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful review. The comments highlight important aspects of our design rationale and evaluation that we have addressed through revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3.1] The central premise that torso pose and root structure are relatively well constrained by image evidence, while distal articulations are more uncertain, is stated qualitatively but lacks supporting quantitative analysis, such as per-part error distributions or visibility statistics from the first-stage regression. This assumption is load-bearing for the two-stage design and the decision to fix the anchor.
  Authors: We agree that quantitative evidence would better substantiate this premise. In the revised manuscript, we have expanded §3.1 with a new analysis of per-part errors and visibility statistics computed from the first-stage deterministic regression on a held-out validation set. The added results show substantially lower average MPJPE for torso and root joints (approximately 42 mm) with higher visibility rates (95%) relative to distal articulations (approximately 88 mm MPJPE and 68% visibility). These statistics are now presented to support the decision to anchor on the torso-root structure.
  Revision: yes
- Referee: [§4.3] (world-space results) No ablation is presented on the sensitivity of final metrics to first-stage root translation or torso orientation errors. Since the anchor is kept fixed and drift metrics are root-sensitive, this leaves open whether modest first-stage mistakes could dominate the reported gains in occlusion-heavy cases.
  Authors: We appreciate this observation regarding potential error propagation. To address it, we have added a sensitivity analysis to the revised §4.3. We injected controlled perturbations to the first-stage root translation and torso orientation (with noise magnitudes matching the observed first-stage error distribution) and re-evaluated the full pipeline on world-space benchmarks. The results indicate that performance degrades gracefully and that the relative gains in occlusion-heavy and drift-sensitive metrics remain intact, suggesting the probabilistic stage provides some robustness. The analysis and corresponding figures have been incorporated into the main paper and supplementary material.
  Revision: yes
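The kind of perturbation study described in the second response can be sketched as follows. This is a simplified illustration, not the authors' protocol: all function names are hypothetical, torso orientation is reduced to a single rotation angle about the vertical axis, and the limbs are held fixed rather than re-sampled by a second stage conditioned on the perturbed anchor.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix about the z axis (taken as vertical in this sketch)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def world_joints(root_trans, root_angle, local_joints):
    """Rotate root-relative joints about the root, then translate to world space.
    root_trans: (T, 3), root_angle: (T,), local_joints: (T, J, 3)."""
    R = np.stack([rot_z(a) for a in root_angle])                  # (T, 3, 3)
    return np.einsum("tij,tkj->tki", R, local_joints) + root_trans[:, None, :]

def sensitivity_sweep(root_trans, root_angle, local_joints,
                      trans_sigmas, angle_sigma=0.0, seed=0):
    """World-space MPJPE caused by Gaussian noise injected into the anchor."""
    rng = np.random.default_rng(seed)
    clean = world_joints(root_trans, root_angle, local_joints)
    out = {}
    for sigma in trans_sigmas:
        noisy_t = root_trans + rng.normal(0.0, sigma, root_trans.shape)
        noisy_a = root_angle + rng.normal(0.0, angle_sigma, root_angle.shape)
        noisy = world_joints(noisy_t, noisy_a, local_joints)
        out[sigma] = float(np.linalg.norm(noisy - clean, axis=-1).mean())
    return out

T, J = 50, 4                                                      # toy sequence size
root_trans = np.zeros((T, 3))
root_angle = np.zeros(T)
local_joints = np.linspace(-0.5, 0.5, T * J * 3).reshape(T, J, 3)
sweep = sensitivity_sweep(root_trans, root_angle, local_joints, [0.0, 0.01, 0.05])
```

Plotting `sweep` against the noise magnitude gives a degradation curve. A full study would repeat this with the completion stage re-run on each perturbed anchor and with drift-sensitive metrics in place of plain MPJPE.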
Circularity Check
No circularity: framework motivated by external observation with independent supervision
The paper motivates its two-stage separation directly from the stated empirical observation that torso-root structure is typically better constrained by image evidence than distal joints, then implements this via a deterministic regression module followed by flow-matching completion with composite targets and geometry-aware losses. No equations or claims reduce a prediction to a fitted parameter by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the synthetic data pipeline is presented as supplying external paired supervision rather than being defined from the model's outputs. The reported benchmarks therefore rest on standard metrics and external data rather than tautological re-derivation of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Torso pose and root structure are relatively well constrained by image evidence while distal articulations are substantially more uncertain.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We propose an uncertainty-aware factorization that separates video HMR into stable structural estimation and ambiguity-prone motion completion.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.