
arxiv: 2606.20542 · v1 · pith:MWWIFQCTnew · submitted 2026-06-18 · 💻 cs.CV

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Pith reviewed 2026-06-26 17:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords CalTennis dataset · monocular 3D pose estimation · multi-view video benchmark · tennis motion capture · foot contact estimation · depth estimation failure · athletic pose evaluation

The pith

A large multi-view tennis video dataset enables label-free benchmarking of monocular 3D pose estimation on athletic motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CalTennis, a dataset of over 11 million frames from 40 players captured with 2-6 synchronized cameras, to evaluate monocular-to-3D human pose methods on real tennis play. The multi-view recordings support automated calibration that yields 3D ground truth without manual labels or mocap equipment. Benchmarking existing methods on this data shows accurate recovery of 3D joint angles but consistent failures in depth estimation and foot contact. Two new metrics, footwork and stability, plus qualitative checks on body shape, expose these specific shortcomings and suggest directions for better action analysis.

Core claim

CalTennis supplies synchronized multi-view video of expert tennis motion at scale, with fully automated calibration and synchronization that produces 3D ground truth for label-free evaluation of monocular pose estimators; on this benchmark, current methods recover joint angles accurately yet struggle to estimate depth and foot contact consistently, as revealed by the proposed footwork and stability metrics.

What carries the argument

The multi-view synchronized camera protocol with automated video calibration and synchronization that generates reliable 3D ground truth from ordinary recordings.
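
As a rough illustration of how calibrated, overlapping views can serve as a label-free check, the sketch below lifts two per-camera estimates into a shared court frame and splits their disagreement into a translation term and a root-relative pose term, mirroring the quantities named in the Figure 1 caption. The function names, the camera-to-world convention (X_world = R · X_cam + t), the use of joint 0 as the root, and the toy skeleton are all assumptions made for illustration. The paper's own metrics (for example, the PA-MPJPE mentioned in the Figure 6 caption) may align poses differently.

```python
import numpy as np

def to_world(joints_cam, R, t):
    """Lift (J, 3) camera-frame joints into the shared frame.

    Assumed convention: X_world = R @ X_cam + t (camera-to-world).
    """
    return joints_cam @ R.T + t

def cross_view_errors(joints_a, joints_b, root=0):
    """Return (translation error, root-relative pose error).

    Both inputs are (J, 3) arrays already expressed in the shared frame.
    """
    e_trans = np.linalg.norm(joints_a[root] - joints_b[root])
    rel_a = joints_a - joints_a[root]
    rel_b = joints_b - joints_b[root]
    e_pose = np.linalg.norm(rel_a - rel_b, axis=1).mean()
    return e_trans, e_pose

# Toy three-joint skeleton in the shared frame (hypothetical numbers).
truth_world = np.array([[0.0, 0.0, 5.0],
                        [0.0, 0.5, 5.0],
                        [0.3, 1.0, 5.2]])

# Hypothetical extrinsics: camera A at the origin, camera B rotated 90 degrees about y.
R_a, t_a = np.eye(3), np.zeros(3)
R_b = np.array([[0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0],
                [-1.0, 0.0, 0.0]])
t_b = np.array([10.0, 0.0, 5.0])

# Perfect per-camera estimates, then a 0.3 m depth bias along B's optical axis.
est_a = (truth_world - t_a) @ R_a
est_b = (truth_world - t_b) @ R_b + np.array([0.0, 0.0, 0.3])

e_trans, e_pose = cross_view_errors(to_world(est_a, R_a, t_a),
                                    to_world(est_b, R_b, t_b))
print(e_trans, e_pose)
```

In this toy case the depth bias appears entirely in the translation term, while the root-relative pose term stays at zero. That is the kind of separation the benchmark relies on when it reports accurate joint angles alongside inconsistent depth.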

If this is right

  • Monocular 3D pose algorithms can now be tested at scale on in-the-wild athletic sequences without specialized capture hardware.
  • Joint-angle recovery has reached usable accuracy on dynamic sports motion.
  • Depth and foot-contact estimation remain open failure modes that limit applications in sports analysis.
  • Footwork and stability metrics provide concrete, quantitative ways to measure and improve those failure modes.
  • Body-shape inconsistency across frames can be detected and studied directly from the multi-view data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same capture protocol could be replicated for other sports to create comparable benchmarks without new equipment.
  • Training monocular models with explicit depth or contact losses derived from the new metrics might close the observed gaps.
  • Reliable foot-contact detection would directly improve downstream tasks such as injury-risk assessment or performance coaching from video.

Load-bearing premise

Automated calibration and synchronization of the multi-view videos produce 3D ground truth accurate enough to serve as a reliable benchmark without further error checks.

What would settle it

A side-by-side comparison of the automated 3D joint positions against a small set of manually verified or mocap-recorded frames that shows large systematic discrepancies in depth or foot locations.

Figures

Figures reproduced from arXiv: 2606.20542 by Anna Szczuka, Blake Werner, Ilona Demler, Pietro Perona, Xinran Xie.

Figure 1: Overview of CalTennis Setup. (Left): 4-tripod setup, with two overlapping camera views (blue & orange) on each half-court (sections A.2 and 3.1). (Center): Overlapping views enable multi-view consistency evaluation. We measure the difference in 3D position, the difference in pose once the difference in 3D position is removed, as well as body shape and foot contact (section 5). (Right): We collect up to 6 co…
Figure 2: CalTennis complexity compared to other real-world benchmarks. (Left): CalTennis contains 10× more frames than currently used benchmarks. (Middle): CalTennis contains more variation in the distance of people from the camera (top), as well as many more people per video (bottom). (Right): CalTennis has the highest pose space coverage (defined in §3).
Figure 3: Spatiotemporal calibration and synchronization. (Left): Calibrating cameras (intrinsic and extrinsic calibration, outlined in section A.2) allows us to lift model estimates into a shared court coordinate system (§A.2). Discrepancy in depth estimates results in differing 3D translation estimates. (Right): Videos lack identical timestamps, so we align sequences by optimizing a continuous global offset variab…
Figure 4: Multi-View Consistency Analysis. (Left): Median pose error (Epose) versus translation error (Etrans) across SOTA models on CalTennis. Metrics represent cross-view disagreement (§5); error bars denote 25th/75th percentiles. (Right): Translation error versus model size (number of parameters). PromptHMR has the lowest translation inconsistency but is also the heaviest, while TRAM, which is second lowest, is also seco…
Figure 5: Consistency Across Motion Metrics. Multi-view agreement for foot height (Eheight), stability (Estab), and shape estimates (§5). No single model dominates across all dimensions; e.g., while WHAM excels in foot height consistency, it shows high inconsistency in stability and shape metrics. Results highlight the trade-offs between static pose accuracy and temporal/physical consistency (§6.1).
Figure 6: Cross-view pose projections. Histogram of multi-view PA-MPJPE inconsistency for PromptHMR (the best-performing model) on a single video, colored by 10% intervals. We project pose estimates from one camera onto the other view to highlight inconsistencies. Low-disagreement poses are typically stationary and equally visible by both cameras, while high disagreement occurs on distant or dynamic poses with some …
Figure 7: Shape consistency. Models exhibit significant inconsistency in SMPL-X shape parameters (β) across different views. Qualitatively, PromptHMR (§6.1) achieves the highest multi-view consistency, likely due to its conditioning on 2D bounding boxes and keypoints.
Figure 8: Spatiotemporal calibration. Left: We lift model estimates into a shared court coordinate system (§A.2). Discrepancy in depth estimates results in differing 3D translation estimates τ̃^i_t. Right: Videos lack identical timestamps, so we align sequences using a global offset ∆t and linearly interpolate poses for missing timestamps (§A.2) to ensure a precise millisecond-level comparison.
Figure 9: Model Error Correlations. We calculate frame-level Pearson correlation between error measurements of different models for pose error, translation error, and stability. Each square shows the correlation between the error signals (over time) of two models. We find little correlation in model error across different models.
Figure 10: Joint Error Correlations. We notice that upper body (blue box) and lower body (green box) errors do not correlate with each other; erroneous upper-body estimates can correspond to consistent lower-body estimates and vice-versa. WHAM produces SMPL coordinates, which contain fewer upper and lower body joints. Interestingly, the torso joints (left/right hip and pelvis) correlate more with lower body errors f…
Figure 11: Pose Space Uniformity and Coverage of Real-World Datasets. Left: coverage of pose space by CalTennis compared to others. Right: a comparison of pose space coverage and uniformity metrics. Coverage is defined as the number of clusters that points in a dataset visit, divided by the total number of clusters (in this case 500). Joint poses in CalTennis visit 10% more clusters than other benchmarks.…
Figure 12: Joint Angle Histograms. We report the per-joint angular ranges in each dataset, ranging from the 10th to 90th percentiles, and normalize this with the documented angular range from medical literature. A flatter distribution indicates a more even spread over angular mobility.
Figure 13: Model Performance Trade-Offs. Left: we plot model runtime versus input frame counts of the same video. PromptHMR seems to scale quadratically, due to having a temporal transformer module, and GENMO scales linearly, thanks to its diffusion-based architecture. Right: we plot mean translation error versus inference runtime.
Figure 14: Model Runtime Analysis.
Figure 15: Tripod setup.
Figure 16: Inter-model failure analysis (failure = worst 30% per model, MPJPE).
Figure 17: Consensus failure distribution (worst 30% per model, averaged over four metrics).
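
The Figure 11 caption defines coverage as the number of clusters a dataset's points visit, divided by the total number of clusters. A minimal sketch of that ratio follows. It assumes cluster centers are already given and that a point "visits" its nearest center under Euclidean distance. How the paper builds its 500 clusters is not stated here, so the `coverage` function and the toy data are hypothetical.

```python
import numpy as np

def coverage(points, centers):
    """Fraction of cluster centers that are nearest to at least one point.

    points:  (N, D) array of pose descriptors (assumed representation)
    centers: (K, D) array of cluster centers
    """
    # Squared Euclidean distance from every point to every center: (N, K).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    return np.unique(nearest).size / centers.shape[0]

# Toy 1-D example: four centers, three points that land in two of them.
toy_centers = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_points = np.array([[0.1], [0.2], [2.9]])
toy_coverage = coverage(toy_points, toy_centers)
print(toy_coverage)  # 0.5
```

The dense (N, K) distance matrix is fine for a small example. For millions of frames, a chunked or tree-based nearest-neighbor lookup would be the more practical choice.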

The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, as well as qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.
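
The figure captions describe the synchronization step as aligning sequences with a single global time offset and linearly interpolating poses at missing timestamps. The sketch below shows one way such an offset could be recovered from a 1-D signal observed by two cameras, such as a joint height in the shared court frame. It swaps in a plain grid search for whatever optimizer the authors use, and `estimate_offset`, the synthetic signal, and the 0.137 s offset are assumptions for illustration only.

```python
import numpy as np

def estimate_offset(t_a, s_a, t_b, s_b, candidates):
    """Pick the offset dt that minimises the mean squared mismatch
    between s_a(t) and s_b(t + dt), with s_b linearly interpolated."""
    best_dt, best_cost = None, np.inf
    for dt in candidates:
        query = t_a + dt
        valid = (query >= t_b[0]) & (query <= t_b[-1])
        if valid.sum() < 2:
            continue  # not enough temporal overlap for this candidate
        resampled = np.interp(query[valid], t_b, s_b)
        cost = np.mean((s_a[valid] - resampled) ** 2)
        if cost < best_cost:
            best_dt, best_cost = dt, cost
    return best_dt, best_cost

def motion(t):
    """Synthetic stand-in for a tracked quantity (hypothetical)."""
    return np.sin(2 * np.pi * 0.7 * t) + 0.5 * np.sin(2 * np.pi * 0.23 * t)

true_offset = 0.137                  # camera B's clock runs 137 ms ahead
t_a = np.arange(0.0, 5.0, 1 / 60)    # 60 Hz timestamps on camera A's clock
t_b = np.arange(0.0, 5.0, 1 / 60)    # 60 Hz timestamps on camera B's clock
s_a = motion(t_a)
s_b = motion(t_b - true_offset)

dt_hat, cost = estimate_offset(t_a, s_a, t_b, s_b,
                               np.linspace(-0.5, 0.5, 1001))
print(dt_hat)
```

Because the offset is continuous rather than a whole number of frames, interpolation lets the two streams be compared at sub-frame resolution. The grid here is spaced 1 ms apart, and a gradient-based refinement could follow if finer resolution were needed.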

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces CalTennis, a large-scale multi-view tennis video dataset with over 11 million frames (51 hours) from 40 players captured at 60 Hz using 2-6 synchronized cameras. It describes a simple data collection protocol and a fully automated pipeline for video calibration and synchronization to produce 3D ground truth without manual labels. Benchmarking of state-of-the-art monocular-to-3D pose estimation methods on this dataset shows accurate recovery of 3D joint angles but consistent struggles with depth and foot contact; the authors also propose new metrics (footwork and stability) and qualitatively examine body shape inconsistency.

Significance. If the automated calibration produces sufficiently accurate 3D ground truth, the dataset would represent a substantial advance as the largest in-the-wild multi-view benchmark focused on expert athletic motion, enabling scalable, label-free evaluation of monocular methods and exposing underexplored failure modes in depth and contact estimation.

major comments (1)
  1. [Abstract and multi-view setup / benchmarking protocol] The central claim that the multi-view recordings yield reliable 3D ground truth for benchmarking (and for attributing specific failures to depth and foot contact) depends on the accuracy of the automated calibration and synchronization, yet no quantitative validation is provided such as mean reprojection error, synchronization residual statistics, or cross-validation against manual landmarks or known scene geometry. Without these, it is impossible to separate monocular model errors from potential ground-truth noise.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the importance of validating the automated calibration and synchronization pipeline. We address this point directly below.

  1. Referee: [Abstract and multi-view setup / benchmarking protocol] The central claim that the multi-view recordings yield reliable 3D ground truth for benchmarking (and for attributing specific failures to depth and foot contact) depends on the accuracy of the automated calibration and synchronization, yet no quantitative validation is provided such as mean reprojection error, synchronization residual statistics, or cross-validation against manual landmarks or known scene geometry. Without these, it is impossible to separate monocular model errors from potential ground-truth noise.

    Authors: We agree that quantitative validation of the calibration and synchronization is necessary to substantiate the reliability of the 3D ground truth. The current manuscript describes the automated pipeline but does not report explicit accuracy metrics such as mean reprojection error, synchronization residuals, or cross-validation results. In the revised version we will add a new subsection (likely in Section 3 or 4) that reports these quantities computed on the collected sequences, including average reprojection errors across cameras, temporal synchronization residuals, and any available checks against known scene geometry or a small set of manually annotated landmarks. This addition will allow readers to evaluate ground-truth quality independently of the monocular method errors.

    Revision: yes
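
For readers unfamiliar with the first quantity the referee asks for, here is a minimal sketch of a mean reprojection error against known scene geometry. It assumes a distortion-free pinhole camera with world-to-camera extrinsics (x_cam = R · X + t) and uses the corners of a regulation doubles court (23.77 m by 10.97 m) as landmarks. The intrinsics, camera placement, and pixel offsets are made-up values, and nothing here reflects the paper's actual calibration code.

```python
import numpy as np

def project(points_world, K, R, t):
    """Pinhole projection of (N, 3) world points to (N, 2) pixels.

    Assumed convention: x_cam = R @ X_world + t (world-to-camera), no distortion.
    """
    cam = points_world @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def mean_reprojection_error(points_world, observed_px, K, R, t):
    """Mean pixel distance between projected landmarks and observations."""
    residual = project(points_world, K, R, t) - observed_px
    return np.linalg.norm(residual, axis=1).mean()

# Doubles-court corners on the ground plane z = 0 (metres).
half_w, half_l = 10.97 / 2, 23.77 / 2
court_corners = np.array([[-half_w, -half_l, 0.0],
                          [half_w, -half_l, 0.0],
                          [half_w, half_l, 0.0],
                          [-half_w, half_l, 0.0]])

# Hypothetical calibration: 1000 px focal length, 1920x1080 image,
# camera looking straight at the court plane from 30 m away.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 30.0])

# Simulated detections: ideal projections shifted by (3, 4) pixels.
observed = project(court_corners, K, R, t) + np.array([3.0, 4.0])
err = mean_reprojection_error(court_corners, observed, K, R, t)
print(err)  # 5.0 pixels, since every corner is off by a 3-4-5 offset
```

Reported per camera and per session, a number like this would let readers judge how much of the cross-view disagreement could come from calibration rather than from the pose models.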

Circularity Check

0 steps flagged

No circularity: empirical dataset collection and benchmarking with no derivations or self-referential predictions


The paper describes collection of a multi-view tennis video dataset using automated calibration and synchronization, followed by direct benchmarking of existing monocular-to-3D pose methods. No mathematical derivations, fitted parameters presented as predictions, or first-principles results are claimed. The central contribution is the dataset itself and empirical observations on model performance (e.g., accurate joint angles but struggles with depth and foot contact); these do not reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The work is self-contained as an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution rests on empirical data collection rather than new theoretical machinery; the primary unstated premise is that multi-view recordings yield usable ground truth.

axioms (1)
  • Domain assumption: Multi-view synchronized recordings provide accurate label-free 3D ground truth for human pose.
    Invoked to justify the benchmark protocol and label-free evaluation claim.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes

    cs.RO 2026-06 unverdicted novelty 6.0

    TaskNPoint lets humanoid robots learn dynamic skills such as tennis backhands from single short human video demonstrations plus under one hour of single-GPU simulation training, achieving zero-shot generalization to n...
