pith. machine review for the scientific record.

arxiv: 2604.03299 · v1 · submitted 2026-03-29 · 💻 cs.CV · cs.AI

MoViD: View-Invariant 3D Human Pose Estimation via Motion-View Disentanglement


Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D human pose estimation · view-invariant estimation · motion-view disentanglement · orthogonal projection · contrastive alignment · real-time edge inference · occlusion robustness

The pith

MoViD disentangles motion features from viewpoint information to enable accurate 3D human pose estimation from any camera angle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MoViD to solve the problem that 3D pose estimation often fails when the camera viewpoint changes from what was seen in training. It extracts viewpoint predictions using a view estimator focused on joint relationships, then uses an orthogonal projection module to separate motion and view features, reinforced by contrastive alignment based on physics principles. This separation leads to better generalization, lower data requirements, and faster inference suitable for edge devices. If successful, it would allow pose estimation systems to work reliably in varied real-world camera setups like drones or security cameras without extensive retraining.

Core claim

MoViD establishes that viewpoint information can be predicted from intermediate pose features using joint relationship modeling and then disentangled from motion features via orthogonal projection and physics-grounded contrastive alignment across views, resulting in a framework that achieves viewpoint invariance in 3D human pose estimation.

What carries the argument

The view estimator, combined with the orthogonal projection module and physics-grounded contrastive alignment, isolates motion features while using viewpoint predictions to guide refinement.
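
The paper's exact formulation is not reproduced on this page, so the following is a minimal sketch of what the disentanglement step could look like, assuming pooled per-clip feature vectors and a Gram-Schmidt-style residual; the two linear heads, the feature dimension, and the module name are illustrative assumptions, not MoViD's actual architecture.

```python
import torch
import torch.nn as nn

class OrthogonalDisentangler(nn.Module):
    """Sketch: remove the view-feature direction from pooled pose features.

    Shapes and heads are assumptions; MoViD's actual view estimator models
    key joint relationships rather than a single linear map.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.view_head = nn.Linear(dim, dim)    # hypothetical view branch
        self.motion_head = nn.Linear(dim, dim)  # hypothetical motion branch

    def forward(self, feats: torch.Tensor):
        view = self.view_head(feats)      # (B, D) view features
        motion = self.motion_head(feats)  # (B, D) motion features
        # Gram-Schmidt residual: subtract the component of the motion
        # feature that lies along the (unit-normalized) view feature.
        v_hat = view / (view.norm(dim=-1, keepdim=True) + 1e-8)
        coeff = (motion * v_hat).sum(dim=-1, keepdim=True)
        motion_perp = motion - coeff * v_hat  # orthogonal to view by construction
        return motion_perp, view
```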

If this is right

  • Pose estimation error drops by more than 24.2 percent relative to existing approaches across tested datasets.
  • Performance holds up under heavy occlusions even when trained on 60 percent less data.
  • Real-time operation at 15 frames per second becomes possible on standard NVIDIA edge hardware through adaptive view-aware processing (sketched after this list).
  • The method generalizes across nine public datasets and additional custom multiview collections from UAVs and gait analysis.
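
The adaptive view-aware processing in the third bullet could plausibly take the shape below: run the extra flip-refinement pass only when the estimated viewpoint falls in a band known to degrade accuracy, such as the elevated side views the paper flags. The view set, the flip utility, and the `pose_net`/`estimate_view` handles are hypothetical stand-ins, not the paper's code.

```python
import numpy as np

ERROR_PRONE_VIEWS = {2, 7}  # elevated side views the paper reports as worst (Fig. 4)

def flip_keypoints(kpts_2d: np.ndarray, width: int) -> np.ndarray:
    """Mirror 2D keypoints horizontally (left/right joint relabeling omitted)."""
    out = kpts_2d.copy()
    out[..., 0] = width - 1 - out[..., 0]
    return out

def view_aware_inference(kpts_2d, width, pose_net, estimate_view):
    """Pay for flip refinement only when the estimated viewpoint is one
    that tends to degrade accuracy; otherwise return the single-pass pose."""
    pose = pose_net(kpts_2d)            # (J, 3) base estimate
    view_id = estimate_view(kpts_2d)    # hypothetical view classifier
    if view_id in ERROR_PRONE_VIEWS:
        flipped = pose_net(flip_keypoints(kpts_2d, width))
        flipped[..., 0] *= -1           # undo the mirror in the 3D x axis
        pose = 0.5 * (pose + flipped)   # average the two estimates
    return pose
```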

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar disentanglement techniques could apply to other 3D reconstruction tasks where camera angle affects feature quality.
  • By reducing data needs, the approach might support training on smaller or synthetic datasets more effectively.
  • Adaptive activation of refinements based on viewpoint could inspire efficiency gains in related video analysis pipelines.

Load-bearing premise

That extracting and projecting out viewpoint features leaves behind all the motion information needed for accurate pose estimation regardless of how the viewpoint changes.

What would settle it

Observing that accuracy does not improve or even degrades on a held-out set of camera angles after applying the disentanglement steps compared to a baseline without them.
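
Concretely, such a test might look like the harness below: mean per-joint position error (MPJPE) on camera views held out of training, compared between a model with the disentanglement stage and a baseline without it. The split, the model handles, and the millimeter units are generic evaluation assumptions rather than the paper's protocol.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error over (N, J, 3) arrays, in mm."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def held_out_view_ablation(models, held_out_batches):
    """Compare models on camera views excluded from training.

    `models` maps a name (e.g. "movid", "baseline") to a callable;
    `held_out_batches` yields (inputs, gt_poses) pairs from unseen views.
    """
    errors = {name: [] for name in models}
    for inputs, gt in held_out_batches:
        for name, model in models.items():
            errors[name].append(mpjpe(model(inputs), gt))
    return {name: float(np.mean(errs)) for name, errs in errors.items()}

# The load-bearing premise fails if the "movid" error is no better than
# (or worse than) the "baseline" error on these held-out views.
```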

Figures

Figures reproduced from arXiv: 2604.03299 by Haoxian Liu, Hengle Jiang, Runxi Huang, Xiaomin Ouyang, Yejia Liu.

Figure 1: Comparison of the traditional framework for 3D …
Figure 2: Pose estimation errors across different views.
Figure 4: Incorporating view information (“View-aware”) reduces pose estimation errors across views. … mm to 81.7 mm, which shows the impact of camera perspective on pose estimation accuracy. Moreover, several viewpoints, such as View 2 and View 7 (the elevated side viewpoints), consistently yield higher errors than other views, indicating a strong correlation between viewpoint and performance degradation. These res…
Figure 5: Typical application scenarios of MoViD: Activity …
Figure 6: MoViD features a view estimator to predict viewpoint information from intermediate pose estimation features. …
Figure 7: Illustration of the view estimator. It computes an …
Figure 8: View-orthogonal contrastive learning. We first disentangle motion and view features through the orthogonal …
Figure 9: MoViD features a new frame-by-frame inference …
Figure 10: The end-to-end system for real-time pose estimation …
Figure 12: Viewpoint estimation performance and its relation …
Figure 13: Effect of flip augmentation on different views.
Figure 15: UAV trajectories including side-follow, horizontal …
Figure 16: Performance on the self-collected UAV dataset …
Figure 17: The self-collected multimodal multi-view dataset …
Original abstract

3D human pose estimation is a key enabling technology for applications such as healthcare monitoring, human-robot collaboration, and immersive gaming, but real-world deployment remains challenged by viewpoint variations. Existing methods struggle to generalize to unseen camera viewpoints, require large amounts of training data, and suffer from high inference latency. We propose MoViD, a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint information from motion features. The key idea is to extract viewpoint information from intermediate pose features and leverage it to enhance both the robustness and efficiency of pose estimation. MoViD introduces a view estimator that models key joint relationships to predict viewpoint information, and an orthogonal projection module to disentangle motion and view features, further enhanced through physics-grounded contrastive alignment across views. For real-time edge deployment, MoViD employs a frame-by-frame inference pipeline with a view-aware strategy that adaptively activates flip refinement based on the estimated viewpoint. Evaluations on nine public datasets and newly collected multiview UAV and gait analysis datasets show that MoViD reduces pose estimation error by over 24.2% compared to state-of-the-art methods, maintains robust performance under severe occlusions with 60% less training data, and achieves real-time inference at 15 FPS on NVIDIA edge devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MoViD, a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint information from motion features via a view estimator module modeling joint relationships, an orthogonal projection module, and physics-grounded contrastive alignment across views. It further employs a view-aware frame-by-frame inference pipeline with adaptive flip refinement for edge deployment. Evaluations across nine public datasets plus new multiview UAV and gait collections report over 24.2% error reduction versus state-of-the-art, robust performance under severe occlusions using 60% less training data, and 15 FPS real-time inference on NVIDIA edge devices.

Significance. If the disentanglement preserves all pose-critical information, the work could meaningfully advance generalization to unseen viewpoints and practical edge deployment in applications such as healthcare monitoring and human-robot interaction. Notable strengths include the breadth of evaluation (nine public datasets plus new collections) and the explicit focus on real-time efficiency with adaptive refinement.

major comments (2)
  1. [Abstract] The 24.2% error reduction and occlusion-robustness claims rest on the orthogonal projection module plus physics-grounded contrastive alignment successfully isolating motion features while preserving view-dependent pose cues required to resolve 3D ambiguities; no verification (e.g., mutual information between projected features and ground-truth camera parameters, or reconstruction of view-specific 2D keypoints from the motion branch) is described.
  2. [Method] (orthogonal projection and contrastive alignment): Enforcing orthogonality in intermediate feature space risks nulling components linearly correlated with camera angle that remain necessary for accurate lifting, given that viewpoint modulates 2D projections nonlinearly via foreshortening, depth scaling, and view-specific occlusions; the contrastive term further assumes motion features are identical across views, which may discard action-specific view-dependent dynamics present in the training distribution.
minor comments (2)
  1. The abstract would benefit from a brief statement of the loss formulation or key hyperparameters used in the physics-grounded contrastive alignment.
  2. Consider adding an ablation table isolating the contribution of the orthogonal projection module versus the contrastive term alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive remarks on the evaluation breadth and real-time focus of MoViD. We address each major comment below with clarifications and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract] The 24.2% error reduction and occlusion-robustness claims rest on the orthogonal projection module plus physics-grounded contrastive alignment successfully isolating motion features while preserving view-dependent pose cues required to resolve 3D ambiguities; no verification (e.g., mutual information between projected features and ground-truth camera parameters, or reconstruction of view-specific 2D keypoints from the motion branch) is described.

    Authors: We agree that direct verification of the disentanglement would strengthen the abstract claims. In the revised manuscript we add two analyses: (1) mutual-information measurements between the motion-branch features and ground-truth camera parameters, confirming that view information is largely removed, and (2) reconstruction of view-specific 2D keypoints from the motion features alone, showing that pose-critical cues are retained. These additions support the reported error reductions and occlusion robustness. A minimal sketch of such a leakage probe appears after these responses. revision: yes

  2. Referee: [Method] (orthogonal projection and contrastive alignment): Enforcing orthogonality in intermediate feature space risks nulling components linearly correlated with camera angle that remain necessary for accurate lifting, given that viewpoint modulates 2D projections nonlinearly via foreshortening, depth scaling, and view-specific occlusions; the contrastive term further assumes motion features are identical across views, which may discard action-specific view-dependent dynamics present in the training distribution.

    Authors: We appreciate the concern about possible loss of necessary components. The orthogonal projection is applied after the view estimator has isolated viewpoint information, and our multi-view ablations demonstrate that the resulting motion features still permit accurate lifting. The contrastive loss aligns features of the same 3D pose observed from different viewpoints; view-dependent dynamics (foreshortening, occlusions) are explicitly routed to the view branch rather than discarded. We have expanded the method section with a discussion of these design choices and additional ablation results quantifying the effect of the orthogonality constraint. A generic form of the cross-view contrastive loss is sketched after these responses. revision: yes
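
A cheap proxy for the mutual-information analysis promised in response 1 is a probe trained to recover a camera parameter from the motion branch: if cross-validated R² stays near zero, the view signal has been removed at least linearly. The ridge regressor and the choice of target parameter are illustrative stand-ins, not the authors' measurement.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def view_leakage_probe(motion_feats: np.ndarray, cam_param: np.ndarray) -> float:
    """Probe for residual view information in the motion branch.

    motion_feats: (N, D) features taken after the orthogonal projection.
    cam_param:    (N,) one ground-truth camera parameter, e.g. sin of the
                  azimuth to avoid angular wrap-around.
    Returns mean cross-validated R^2: near 0 means the view signal is
    gone (at least linearly); near 1 means it leaked through.
    """
    scores = cross_val_score(Ridge(alpha=1.0), motion_feats, cam_param,
                             cv=5, scoring="r2")
    return float(scores.mean())
```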
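
The cross-view alignment defended in response 2 is not spelled out on this page; a generic InfoNCE over motion features of the same pose seen from two views is one plausible form, with the temperature and batch pairing as assumptions rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def cross_view_infonce(motion_a: torch.Tensor, motion_b: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Generic cross-view InfoNCE over motion features.

    motion_a, motion_b: (B, D) motion features of the SAME B poses seen
    from two different camera views; row i of each is a positive pair and
    every other row in the batch serves as a negative.
    """
    a = F.normalize(motion_a, dim=-1)
    b = F.normalize(motion_b, dim=-1)
    logits = a @ b.t() / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric: view A retrieves its match in view B and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```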

Circularity Check

0 steps flagged

No significant circularity: new modules and empirical claims are self-contained

Full rationale

The paper introduces architectural components (view estimator, orthogonal projection module, physics-grounded contrastive alignment) and reports empirical gains on external datasets. No equations, fitted parameters, or self-citations are shown reducing any claimed prediction or result to its own inputs by construction. The derivation chain relies on novel design choices evaluated against public benchmarks rather than tautological re-use of fitted quantities or prior self-results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Limited information available from abstract only; the method relies on standard computer vision assumptions about camera models and joint kinematics plus newly introduced modules whose internal parameters are not detailed.

axioms (1)
  • domain assumption: Standard camera projection and joint kinematic models hold for the input videos
    Implicit in the orthogonal projection module and view estimation step
invented entities (2)
  • View estimator module (no independent evidence)
    purpose: Predict viewpoint information from intermediate pose features by modeling key joint relationships
    New component introduced to extract viewpoint information
  • Orthogonal projection module (no independent evidence)
    purpose: Disentangle motion features from view features
    Core new mechanism for separation of motion and viewpoint

pith-pipeline@v0.9.0 · 5544 in / 1374 out tokens · 39651 ms · 2026-05-14T21:20:00.459826+00:00 · methodology

