Recognition: 2 theorem links
MoViD: View-Invariant 3D Human Pose Estimation via Motion-View Disentanglement
Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3
The pith
MoViD disentangles motion features from viewpoint information to enable accurate 3D human pose estimation from any camera angle.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoViD establishes that viewpoint information can be predicted from intermediate pose features using joint relationship modeling and then disentangled from motion features via orthogonal projection and physics-grounded contrastive alignment across views, resulting in a framework that achieves viewpoint invariance in 3D human pose estimation.
What carries the argument
The view estimator combined with the orthogonal projection module and physics-grounded contrastive alignment, which isolates motion features while using viewpoint predictions to guide refinement.
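To make the disentanglement machinery concrete, the sketch below shows one way an orthogonal projection module could strip view components from pose features. It is a minimal sketch, assuming the view estimator emits a small set of view directions; the tensor names are hypothetical, not taken from the paper.

    import torch

    def remove_view_component(pose_feats: torch.Tensor,
                              view_basis: torch.Tensor) -> torch.Tensor:
        # pose_feats: (B, D) intermediate pose features from the backbone.
        # view_basis: (K, D) view directions predicted by the view estimator.
        # Orthonormalize the view directions (QR on the transpose).
        q, _ = torch.linalg.qr(view_basis.T)   # (D, K), orthonormal columns
        # Component of each feature lying inside the view subspace.
        view_part = (pose_feats @ q) @ q.T     # (B, D)
        # The residual is, by construction, orthogonal to every estimated
        # view direction.
        return pose_feats - view_part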
If this is right
- Pose estimation error drops by more than 24.2 percent relative to existing approaches across tested datasets.
- Performance holds up under heavy occlusions even when trained on 60 percent less data.
- Real-time operation at 15 frames per second becomes possible on standard NVIDIA edge hardware through adaptive view-aware processing (a gating sketch follows this list).
- The method generalizes across nine public datasets and additional custom multiview collections from UAVs and gait analysis.
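The adaptive view-aware processing mentioned above can be pictured as a gate on a second, flipped forward pass. This is a hedged sketch, assuming the viewpoint is summarized by an azimuth estimate; the 30-degree profile band and all interfaces are illustrative assumptions, not the paper's values.

    import numpy as np

    def view_aware_inference(frame, estimate_pose, flip_lr, azimuth_deg,
                             profile_band=30.0):
        # estimate_pose: frame -> (J, 3) joints; flip_lr flips frames or
        # poses horizontally and swaps left/right joints.
        pose = np.asarray(estimate_pose(frame))
        # Pay for the second pass only near profile views (+/-90 deg),
        # where left/right ambiguity is typically worst.
        if min(abs(azimuth_deg - 90.0), abs(azimuth_deg + 90.0)) < profile_band:
            flipped = np.asarray(flip_lr(estimate_pose(flip_lr(frame))))
            pose = 0.5 * (pose + flipped)
        return pose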
Where Pith is reading between the lines
- Similar disentanglement techniques could apply to other 3D reconstruction tasks where camera angle affects feature quality.
- By reducing data needs, the approach might support training on smaller or synthetic datasets more effectively.
- Adaptive activation of refinements based on viewpoint could inspire efficiency gains in related video analysis pipelines.
Load-bearing premise
That extracting and projecting out viewpoint features leaves behind all the motion information needed for accurate pose estimation regardless of how the viewpoint changes.
What would settle it
Observing that accuracy fails to improve, or even degrades, on a held-out set of camera angles when the disentanglement steps are applied, relative to a baseline without them.
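A minimal sketch of that falsification test, assuming a model runner whose disentanglement steps can be toggled (the run_model interface and data layout are hypothetical):

    import numpy as np

    def mpjpe(pred, gt):
        # Mean per-joint position error over (N, J, 3) arrays.
        return float(np.linalg.norm(np.asarray(pred) - np.asarray(gt),
                                    axis=-1).mean())

    def heldout_view_gap(run_model, frames_by_view, gt_by_view):
        # run_model(frames, disentangle: bool) -> (N, J, 3) predictions.
        # Compare both variants on camera angles excluded from training.
        report = {}
        for view, frames in frames_by_view.items():
            gt = gt_by_view[view]
            report[view] = {
                "with_disentanglement": mpjpe(run_model(frames, True), gt),
                "baseline": mpjpe(run_model(frames, False), gt),
            }
        return report

If the "with_disentanglement" numbers fail to beat the baseline on unseen views, the load-bearing premise above is in trouble.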
original abstract
3D human pose estimation is a key enabling technology for applications such as healthcare monitoring, human-robot collaboration, and immersive gaming, but real-world deployment remains challenged by viewpoint variations. Existing methods struggle to generalize to unseen camera viewpoints, require large amounts of training data, and suffer from high inference latency. We propose MoViD, a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint information from motion features. The key idea is to extract viewpoint information from intermediate pose features and leverage it to enhance both the robustness and efficiency of pose estimation. MoViD introduces a view estimator that models key joint relationships to predict viewpoint information, and an orthogonal projection module to disentangle motion and view features, further enhanced through physics-grounded contrastive alignment across views. For real-time edge deployment, MoViD employs a frame-by-frame inference pipeline with a view-aware strategy that adaptively activates flip refinement based on the estimated viewpoint. Evaluations on nine public datasets and newly collected multiview UAV and gait analysis datasets show that MoViD reduces pose estimation error by over 24.2% compared to state-of-the-art methods, maintains robust performance under severe occlusions with 60% less training data, and achieves real-time inference at 15 FPS on NVIDIA edge devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MoViD, a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint information from motion features via a view estimator module modeling joint relationships, an orthogonal projection module, and physics-grounded contrastive alignment across views. It further employs a view-aware frame-by-frame inference pipeline with adaptive flip refinement for edge deployment. Evaluations across nine public datasets plus new multiview UAV and gait collections report over 24.2% error reduction versus state-of-the-art, robust performance under severe occlusions using 60% less training data, and 15 FPS real-time inference on NVIDIA edge devices.
Significance. If the disentanglement preserves all pose-critical information, the work could meaningfully advance generalization to unseen viewpoints and practical edge deployment in applications such as healthcare monitoring and human-robot interaction. Notable strengths include the breadth of evaluation (nine public datasets plus new collections) and the explicit focus on real-time efficiency with adaptive refinement.
major comments (2)
- [Abstract] The 24.2% error reduction and occlusion-robustness claims rest on the orthogonal projection module plus physics-grounded contrastive alignment successfully isolating motion features while preserving view-dependent pose cues required to resolve 3D ambiguities; no verification (e.g., mutual information between projected features and ground-truth camera parameters, or reconstruction of view-specific 2D keypoints from the motion branch) is described. (A probe-based proxy for this check is sketched after these major comments.)
- [Method] Orthogonal projection and contrastive alignment: Enforcing orthogonality in intermediate feature space risks nulling components linearly correlated with camera angle that remain necessary for accurate lifting, given that viewpoint modulates 2D projections nonlinearly via foreshortening, depth scaling, and view-specific occlusions; the contrastive term further assumes motion features are identical across views, which may discard action-specific view-dependent dynamics present in the training distribution.
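A cheap stand-in for the mutual-information verification requested in the first major comment is a cross-validated probe that tries to recover camera parameters from the projected motion features. The sketch below uses a linear probe, which only detects linear dependence; a nonlinear probe or a neural MI estimator would be needed to rule out nonlinear leakage. All interfaces are assumed, not taken from the paper.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def residual_view_information(motion_feats, camera_azimuth):
        # motion_feats: (N, D) features after orthogonal projection.
        # camera_azimuth: (N,) ground-truth camera parameter.
        # Cross-validated R^2 of a linear probe; values near zero are
        # consistent with view information having been removed.
        probe = Ridge(alpha=1.0)
        scores = cross_val_score(probe, motion_feats, camera_azimuth,
                                 scoring="r2", cv=5)
        return float(scores.mean())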
minor comments (2)
- The abstract would benefit from a brief statement of the loss formulation or key hyperparameters used in the physics-grounded contrastive alignment.
- Consider adding an ablation table isolating the contribution of the orthogonal projection module versus the contrastive term alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive remarks on the evaluation breadth and real-time focus of MoViD. We address each major comment below with clarifications and indicate the corresponding revisions.
point-by-point responses
- Referee: [Abstract] The 24.2% error reduction and occlusion-robustness claims rest on the orthogonal projection module plus physics-grounded contrastive alignment successfully isolating motion features while preserving view-dependent pose cues required to resolve 3D ambiguities; no verification (e.g., mutual information between projected features and ground-truth camera parameters, or reconstruction of view-specific 2D keypoints from the motion branch) is described.
Authors: We agree that direct verification of the disentanglement would strengthen the abstract claims. In the revised manuscript we add two analyses: (1) mutual-information measurements between the motion-branch features and ground-truth camera parameters, confirming that view information is largely removed, and (2) reconstruction of view-specific 2D keypoints from the motion features alone, showing that pose-critical cues are retained. These additions support the reported error reductions and occlusion robustness. revision: yes
- Referee: [Method] Orthogonal projection and contrastive alignment: Enforcing orthogonality in intermediate feature space risks nulling components linearly correlated with camera angle that remain necessary for accurate lifting, given that viewpoint modulates 2D projections nonlinearly via foreshortening, depth scaling, and view-specific occlusions; the contrastive term further assumes motion features are identical across views, which may discard action-specific view-dependent dynamics present in the training distribution.
Authors: We appreciate the concern about possible loss of necessary components. The orthogonal projection is applied after the view estimator has isolated viewpoint information, and our multi-view ablations demonstrate that the resulting motion features still permit accurate lifting. The contrastive loss aligns features of the same 3D pose observed from different viewpoints; view-dependent dynamics (foreshortening, occlusions) are explicitly routed to the view branch rather than discarded. We have expanded the method section with a discussion of these design choices and additional ablation results quantifying the effect of the orthogonality constraint (a generic sketch of such a cross-view contrastive term follows below). revision: yes
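For reference, here is a generic cross-view contrastive (InfoNCE-style) alignment term of the kind the rebuttal describes: motion features of the same clip under two views are pulled together while other batch entries are pushed apart. The paper's physics-grounded weighting is not reproduced, and all names are assumptions.

    import torch
    import torch.nn.functional as F

    def cross_view_infonce(z_a, z_b, temperature=0.1):
        # z_a, z_b: (B, D) motion features of the same B clips seen from
        # two different camera views.
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        logits = z_a @ z_b.T / temperature   # (B, B) cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        # Row i should match column i: same clip, different view.
        return F.cross_entropy(logits, targets)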
Circularity Check
No significant circularity: new modules and empirical claims are self-contained
full rationale
The paper introduces architectural components (view estimator, orthogonal projection module, physics-grounded contrastive alignment) and reports empirical gains on external datasets. No equations, fitted parameters, or self-citations are shown reducing any claimed prediction or result to its own inputs by construction. The derivation chain relies on novel design choices evaluated against public benchmarks rather than tautological re-use of fitted quantities or prior self-results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard camera projection and joint kinematic models hold for the input videos.
invented entities (2)
- View estimator module · no independent evidence
- Orthogonal projection module · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Tag rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Matched passage: "physics-enhanced contrastive alignment... SMPL-derived motion dynamics... view-invariant motion representations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.