CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Anna Szczuka; Blake Werner; Ilona Demler; Pietro Perona; Xinran Xie

arxiv: 2606.20542 · v1 · pith:MWWIFQCTnew · submitted 2026-06-18 · 💻 cs.CV

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Ilona Demler , Xinran Xie , Blake Werner , Anna Szczuka , Pietro Perona This is my paper

Pith reviewed 2026-06-26 17:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords CalTennis datasetmonocular 3D pose estimationmulti-view video benchmarktennis motion capturefoot contact estimationdepth estimation failureathletic pose evaluation

0 comments

The pith

A large multi-view tennis video dataset enables label-free benchmarking of monocular 3D pose estimation on athletic motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CalTennis, a dataset of over 11 million frames from 40 players captured with 2-6 synchronized cameras, to evaluate monocular-to-3D human pose methods on real tennis play. The multi-view recordings support automated calibration that yields 3D ground truth without manual labels or mocap equipment. Benchmarking existing methods on this data shows accurate recovery of 3D joint angles but consistent failures in depth estimation and foot contact. Two new metrics, footwork and stability, plus qualitative checks on body shape, expose these specific shortcomings and suggest directions for better action analysis.

Core claim

CalTennis supplies synchronized multi-view video of expert tennis motion at scale, with fully automated calibration and synchronization that produces 3D ground truth for label-free evaluation of monocular pose estimators; on this benchmark, current methods recover joint angles accurately yet struggle to estimate depth and foot contact consistently, as revealed by the proposed footwork and stability metrics.

What carries the argument

The multi-view synchronized camera protocol with automated video calibration and synchronization that generates reliable 3D ground truth from ordinary recordings.

If this is right

Monocular 3D pose algorithms can now be tested at scale on in-the-wild athletic sequences without specialized capture hardware.
Joint-angle recovery has reached usable accuracy on dynamic sports motion.
Depth and foot-contact estimation remain open failure modes that limit applications in sports analysis.
Footwork and stability metrics provide concrete, quantitative ways to measure and improve those failure modes.
Body-shape inconsistency across frames can be detected and studied directly from the multi-view data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same capture protocol could be replicated for other sports to create comparable benchmarks without new equipment.
Training monocular models with explicit depth or contact losses derived from the new metrics might close the observed gaps.
Reliable foot-contact detection would directly improve downstream tasks such as injury-risk assessment or performance coaching from video.

Load-bearing premise

Automated calibration and synchronization of the multi-view videos produce 3D ground truth accurate enough to serve as a reliable benchmark without further error checks.

What would settle it

A side-by-side comparison of the automated 3D joint positions against a small set of manually verified or mocap-recorded frames that shows large systematic discrepancies in depth or foot locations.

Figures

Figures reproduced from arXiv: 2606.20542 by Anna Szczuka, Blake Werner, Ilona Demler, Pietro Perona, Xinran Xie.

**Figure 1.** Figure 1: Overview of CalTennis Setup. (Left): 4-tripod setup, with two overlapping camera views (blue & orange) on each half-court (sections A.2 and 3.1) (Center): Overlapping views enable multi-view consistency evaluation. We measure the difference in 3D position, the difference in pose once the difference in 3D position is removed, as well as body shape and foot contact (section 5). (Right): We collect up to 6 co… view at source ↗

**Figure 2.** Figure 2: CalTennis complexity compared to other real-world benchmarks. (Left): CalTennis contains 10× frames than currently used benchmarks. (Middle): CalTennis contains more variation in the distance of people from the camera (top), as well as many more people per video (bottom). (Right): CalTennis has the highest pose space coverage (defined in §3). not provide 3D pose annotations. A smaller set of recent dataset… view at source ↗

**Figure 3.** Figure 3: Spatiotemporal calibration and synchronization. (Left): Calibrating cameras (intrinsic and extrinsic calibration, outlined in section A.2) allows us to lift model estimates into a shared court coordinate system (§A.2). Discrepancy in depth estimates results in differing 3D translation estimates. (Right): Videos lack identical timestamps, so we align sequences by optimizing a continuous global offset variab… view at source ↗

**Figure 4.** Figure 4: Multi-View Consistency Analysis. (Left): Median pose error versus translation error (Etrans) (Epose) across SOTA models on CalTennis. Metrics represent cross-view disagreement (§5); error bars denote 25th/75th percentiles. (Right): Translation error versus model size (num. parameters). PromptHMR has the lowest translation inconsistency but also the heaviest, while TRAM, which is second lowest, is also seco… view at source ↗

**Figure 5.** Figure 5: Consistency Across Motion Metrics. Multi-view agreement for foot height (Eheight), stability (Estab), and shape estimates (§5). No single model dominates across all dimensions; e.g., while WHAM excels in foot height consistency, it shows high inconsistency in stability and shape metrics. Results highlight the trade-offs between static pose accuracy and temporal/physical consistency (§6.1). PA-MPJPE ↓ (m) … view at source ↗

**Figure 6.** Figure 6: Cross-view pose projections. Histogram of multi-view PA-MPJPE inconsistency for PromptHMR (the best-performing model) on a single video, colored by 10% intervals. We project pose estimates from one camera onto the other view to highlight inconsistencies. Low-disagreement poses are typically stationary and equally visible by both cameras, while high disagreement occurs on distant or dynamic poses with some … view at source ↗

**Figure 7.** Figure 7: Shape consistency. Models exhibit significant inconsistency in SMPL-X shape parameters (β) across different views. Qualitatively, PromptHMR (§6.1) achieves the highest multi-view consistency, likely due to its conditioning on 2D bounding boxes and keypoints. PromptHMR produces the most consistent multi-view shape reconstruction, likely because it takes in additional bounding box and joint information. The… view at source ↗

**Figure 8.** Figure 8: Spatiotemporal calibration. Left: We lift model estimates into a shared court coordinate system (§A.2). Discrepancy in depth estimates results in differing 3D translation estimates τ˜ i t . Right: Videos lack identical timestamps, so we align sequences using a global offset ∆t and linearly interpolate poses for missing timestamps (§A.2) to ensure a precise millisecond-level comparison. express it as: TWmod… view at source ↗

**Figure 9.** Figure 9: Model Error Correlations. We calculate frame-level Pearson correlation between error measurements of different models for pose error, translation error, and stability. Each square shows the correlation between the error (in time) signal of two models. We find little correlation in model error across different models. Upper Torso Lower Upper Lower PromptHMR GVHMR Upper Torso Lower TRAM GENMO WHAM Upper Tors… view at source ↗

**Figure 10.** Figure 10: Joint Error Correlations. We notice that upper body (blue box) and lower body (green box) errors do not correlate with each other; erroneous upper-body estimates can correspond to consistent lower-body estimates and vice-versa. WHAM produces SMPL coordinates, which contain fewer upper and lower body joints. Interestingly, the torso joints (left/right hip and pelvis) correlate more with lower body errors f… view at source ↗

**Figure 11.** Figure 11: Pose Space Uniformity and Coverage of Real-World Datasets. coverage of pose space by CalTennis compared to others. On the right we provide a comparison of pose space coverage and uniformity metrics. Coverage is defined as the number of clusters that points in a dataset visit, divided by the total number of clusters (in this case 500). Joint poses in CalTennis visit 10% more clusters than other benchmarks.… view at source ↗

**Figure 12.** Figure 12: Joint Angle Histograms. We report the per-joint angular ranges in each dataset, ranging from the 10th to 90th percentiles, and normalize this with the documented angular range from medical literature. A flatter distribution indicates a more even spread over angular mobility [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Model Performance Trade-Offs. Left: we plot model runtime versus input frame counts of the same video. PromptHMR seems to scale quadratically, due to having a temporal transformer module, and GENMO scales linearly, thanks to its diffusion-based architecture. Right: we plot mean translation error versus inference runtime. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Model Runtime Analysis. 11” 68” [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Tripod setup. A.8 Camera setup In [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

**Figure 16.** Figure 16: Inter-model failure analysis (failure = worst 30% per model, MPJPE). [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗

**Figure 17.** Figure 17: Consensus failure distribution (worst 30% per model, averaged over four metrics). [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗

read the original abstract

The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, as well as qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CalTennis supplies a genuinely large new multi-view tennis dataset with useful failure observations, but the automated calibration has no reported accuracy numbers to back the depth and contact claims.

read the letter

The main point is that this paper releases CalTennis, a 51-hour multi-view tennis dataset from 40 players that is ten times larger than prior in-the-wild motion collections and the first at this scale for expert athletic motion. It also benchmarks several monocular 3D pose methods and reports that joint angles come out reasonably well while depth and foot contact remain weak, plus it adds footwork and stability metrics to highlight those gaps.

The scale and domain choice are the real additions. Existing datasets either lack synchronized views or focus on everyday actions rather than high-speed tennis. The standardized collection protocol without specialized gear is practical, and the label-free evaluation setup from multi-view is a straightforward idea that fits the goal. The concrete observations on where current models fail give a clear direction for follow-up work.

The soft spot is the ground truth itself. The paper uses fully automated calibration and synchronization but gives no reprojection error, synchronization residual, or cross-check numbers. Without those, it is difficult to separate model errors from possible noise in the 3D reference, especially on the exact dimensions the benchmark flags as problematic. That gap is real and directly affects how much weight the failure-mode claims can carry.

This paper is aimed at groups working on monocular 3D pose or action analysis in sports. The dataset could be worth using if the calibration holds up under scrutiny. It shows straightforward thinking about what is missing in current benchmarks and does not overclaim beyond the data presented.

I would send it for peer review. Reviewers can ask for the missing calibration metrics and check the data release details, which is the right level of engagement for a dataset contribution of this size.

Referee Report

1 major / 0 minor

Summary. The paper introduces CalTennis, a large-scale multi-view tennis video dataset with over 11 million frames (51 hours) from 40 players captured at 60 Hz using 2-6 synchronized cameras. It describes a simple data collection protocol and a fully automated pipeline for video calibration and synchronization to produce 3D ground truth without manual labels. Benchmarking of state-of-the-art monocular-to-3D pose estimation methods on this dataset shows accurate recovery of 3D joint angles but consistent struggles with depth and foot contact; the authors also propose new metrics (footwork and stability) and qualitatively examine body shape inconsistency.

Significance. If the automated calibration produces sufficiently accurate 3D ground truth, the dataset would represent a substantial advance as the largest in-the-wild multi-view benchmark focused on expert athletic motion, enabling scalable, label-free evaluation of monocular methods and exposing underexplored failure modes in depth and contact estimation.

major comments (1)

[Abstract and multi-view setup / benchmarking protocol] Abstract and multi-view setup / benchmarking protocol: The central claim that the multi-view recordings yield reliable 3D ground truth for benchmarking (and for attributing specific failures to depth and foot contact) depends on the accuracy of the automated calibration and synchronization, yet no quantitative validation is provided such as mean reprojection error, synchronization residual statistics, or cross-validation against manual landmarks or known scene geometry. Without these, it is impossible to separate monocular model errors from potential ground-truth noise.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the importance of validating the automated calibration and synchronization pipeline. We address this point directly below.

read point-by-point responses

Referee: [Abstract and multi-view setup / benchmarking protocol] Abstract and multi-view setup / benchmarking protocol: The central claim that the multi-view recordings yield reliable 3D ground truth for benchmarking (and for attributing specific failures to depth and foot contact) depends on the accuracy of the automated calibration and synchronization, yet no quantitative validation is provided such as mean reprojection error, synchronization residual statistics, or cross-validation against manual landmarks or known scene geometry. Without these, it is impossible to separate monocular model errors from potential ground-truth noise.

Authors: We agree that quantitative validation of the calibration and synchronization is necessary to substantiate the reliability of the 3D ground truth. The current manuscript describes the automated pipeline but does not report explicit accuracy metrics such as mean reprojection error, synchronization residuals, or cross-validation results. In the revised version we will add a new subsection (likely in Section 3 or 4) that reports these quantities computed on the collected sequences, including average reprojection errors across cameras, temporal synchronization residuals, and any available checks against known scene geometry or a small set of manually annotated landmarks. This addition will allow readers to evaluate ground-truth quality independently of the monocular method errors. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset collection and benchmarking with no derivations or self-referential predictions

full rationale

The paper describes collection of a multi-view tennis video dataset using automated calibration and synchronization, followed by direct benchmarking of existing monocular-to-3D pose methods. No mathematical derivations, fitted parameters presented as predictions, or first-principles results are claimed. The central contribution is the dataset itself and empirical observations on model performance (e.g., accurate joint angles but struggles with depth and foot contact); these do not reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The work is self-contained as an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper's contribution rests on empirical data collection rather than new theoretical machinery; the primary unstated premise is that multi-view recordings yield usable ground truth.

axioms (1)

domain assumption Multi-view synchronized recordings provide accurate label-free 3D ground truth for human pose
Invoked to justify the benchmark protocol and label-free evaluation claim.

pith-pipeline@v0.9.1-grok · 5777 in / 1247 out tokens · 26556 ms · 2026-06-26T17:37:13.327249+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes
cs.RO 2026-06 unverdicted novelty 6.0

TaskNPoint lets humanoid robots learn dynamic skills such as tennis backhands from single short human video demonstrations plus under one hour of single-GPU simulation training, achieving zero-shot generalization to n...

Reference graph

Works this paper leans on

54 extracted references · cited by 1 Pith paper

[1]

Multi-hmr: Multi-person whole-body human mesh recovery in a single shot, 2024

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, and Thomas Lucas. Multi-hmr: Multi-person whole-body human mesh recovery in a single shot, 2024

2024
[2]

Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. InEuropean Conference on Computer Vision (ECCV), pages 561–578, 2016

2016
[3]

Methodological factors affecting joint moments estimation in clinical gait analysis: a systematic review.BioMedical Engineering OnLine, 16(1):106, aug 2017

Valentina Camomilla, Andrea Cereatti, Andrea Giovanni Cutti, Silvia Fantozzi, Rita Stagni, and Giuseppe Vannozzi. Methodological factors affecting joint moments estimation in clinical gait analysis: a systematic review.BioMedical Engineering OnLine, 16(1):106, aug 2017

2017
[4]

Beyond static features for temporally consistent 3d human pose and shape from a video, 2021

Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video, 2021

2021
[5]

Steffi L Colyer, Murray Evans, Darren P Cosker, and Aki I T Salo. A review of the evolution of vision-based motion analysis and the integration of advanced computer vision methods towards developing a markerless system.Sports Medicine - Open, 2018

2018
[6]

Meva: A large-scale multiview, multimodal video dataset for activity detection, 2020

Kellie Corona, Katie Osterdahl, Roderic Collins, and Anthony Hoogs. Meva: A large-scale multiview, multimodal video dataset for activity detection, 2020

2020
[7]

SportsMOT: A large multi-object tracking dataset in multiple sports scenes

Yutao Cui, Chenkai Zeng, Xiaoyu Zhao, Yichun Yang, Gangshan Wu, and Limin Wang. SportsMOT: A large multi-object tracking dataset in multiple sports scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9921–9931, 2023

2023
[8]

SoccerNet: A scalable dataset for action spotting in soccer videos

Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. SoccerNet: A scalable dataset for action spotting in soccer videos. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018

2018
[9]

Humans in 4d: Reconstructing and tracking humans with transformers, 2023

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4d: Reconstructing and tracking humans with transformers, 2023

2023
[10]

OmniH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. OmniH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. InConference on Robot Learning (CoRL), 2024

2024
[11]

Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J

Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J. Black. Capturing and inferring dense full-body human-scene contact. InProceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 13274–13285, June 2022

2022
[12]

SportsPose — a dynamic 3D sports pose dataset

Christian Keilstrup Ingwersen, Christian Møller Mikkelstrup, Janus Nørtoft Jensen, Morten Rieger Han- nemose, and Anders Bjorholm Dahl. SportsPose — a dynamic 3D sports pose dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 5219–5228, 2023

2023
[13]

Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, jul 2014

2014
[14]

Oswald, Marc Pollefeys, Otmar Hilliges, Manuel Kaufmann, and Jie Song

Tianjian Jiang, Johsan Billingham, Sebastian Müksch, Juan Zarate, Nicolas Evans, Martin R. Oswald, Marc Pollefeys, Otmar Hilliges, Manuel Kaufmann, and Jie Song. WorldPose: A world cup dataset for global 3D human pose estimation. InProceedings of the European Conference on Computer Vision (ECCV), 2024

2024
[15]

Coherent reconstruction of multiple humans from a single image, 2020

Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Coherent reconstruction of multiple humans from a single image, 2020

2020
[16]

Panoptic studio: A massively multiview system for social motion capture

Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3334–3342, 2015

2015
[17]

Black, David W

Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose, 2018

2018
[18]

Zhang, Panna Felsen, and Jitendra Malik

Angjoo Kanazawa, Jason Y . Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video, 2019. 11

2019
[19]

EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild

Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan José Zárate, and Otmar Hilliges. EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. InInternational Conference on Computer Vision (ICCV), 2023

2023
[20]

Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5253–5263, 2020

2020
[21]

Huang, Otmar Hilliges, and Michael J

Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. Pare: Part attention regressor for 3d human body estimation, 2021

2021
[22]

Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal

Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and camera motion estimation from in-the-wild videos, 2023

2023
[23]

Black, and Kostas Daniilidis

Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop, 2019

2019
[24]

Genmo: A generalist model for human motion, 2025

Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A generalist model for human motion, 2025

2025
[25]

Coin: Control-inpainting diffusion prior for human and camera motion estimation, 2024

Jiefeng Li, Ye Yuan, Davis Rempe, Haotian Zhang, Pavlo Molchanov, Cewu Lu, Jan Kautz, and Umar Iqbal. Coin: Control-inpainting diffusion prior for human and camera motion estimation, 2024

2024
[26]

Ross, and Angjoo Kanazawa

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3D dance generation with AIST++. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13401–13412, 2021

2021
[27]

Deep appearance models for face rendering.ACM Transactions on Graphics, 37(4):68:1–68:13, 2018

Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. Deep appearance models for face rendering.ACM Transactions on Graphics, 37(4):68:1–68:13, 2018

2018
[28]

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: a skinned multi-person linear model.ACM Trans. Graph., 34(6), November 2015

2015
[29]

McGhee and A.A

R.B. McGhee and A.A. Frank. On the stability properties of quadruped creeping gaits.Mathematical Biosciences, 3:331–351, 1968

1968
[30]

Morgan Kaufmann, 2 edition, 2011

Alberto Menache.Understanding Motion Capture for Computer Animation. Morgan Kaufmann, 2 edition, 2011

2011
[31]

deface: Video anonymization by face detection, 2026

ORB-HD. deface: Video anonymization by face detection, 2026

2026
[32]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image, 2019

2019
[33]

Amir Rasouli and John K. Tsotsos. Autonomous vehicles that interact with pedestrians: A survey of theory and practice.IEEE Transactions on Intelligent Transportation Systems, 21(3):900–918, 2020

2020
[34]

You only look once: Unified, real-time object detection, 2016

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection, 2016

2016
[35]

Deep gait recognition: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):264–284, 2023

Alireza Sepas-Moghaddam and Ali Etemad. Deep gait recognition: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):264–284, 2023

2023
[36]

FineGym: A hierarchical video dataset for fine-grained action understanding

Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2616–2625, 2020

2020
[37]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, pages 1–11. ACM, December 2024

2024
[38]

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J. Black. Wham: Reconstructing world-grounded humans with accurate 3d motion, 2024

2024
[39]

Aios: All-in-one-stage expressive human pose and shape estimation, 2024

Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, and Zhongang Cai. Aios: All-in-one-stage expressive human pose and shape estimation, 2024. 12

2024
[40]

Moeslund, Peter Carr, and Adrian Hilton

Graham Thomas, Rikke Gade, Thomas B. Moeslund, Peter Carr, and Adrian Hilton. Computer vision for sports: Current applications and research topics.Computer Vision and Image Understanding, 159:3–18, 2017

2017
[41]

Uhlrich, Antoine Falisse, Łukasz Kidzi ´nski, Julie Muccini, Michael Ko, Akshay S

Scott D. Uhlrich, Antoine Falisse, Łukasz Kidzi ´nski, Julie Muccini, Michael Ko, Akshay S. Chaudhari, Jennifer L. Hicks, and Scott L. Delp. OpenCap: Human movement dynamics from smartphone videos. PLOS Computational Biology, 19(10):e1011462, 2023

2023
[42]

Recover- ing accurate 3d human pose in the wild using imus and a moving camera

Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recover- ing accurate 3d human pose in the wild using imus and a moving camera. InEuropean Conference on Computer Vision (ECCV), sep 2018

2018
[43]

Applications and limitations of current markerless motion capture methods for clinical gait biomechanics.PeerJ, 10:e12995, 2022

Logan Wade, Laurie Needham, Polly McGuigan, and James Bilzon. Applications and limitations of current markerless motion capture methods for clinical gait biomechanics.PeerJ, 10:e12995, 2022

2022
[44]

Black, and Muhammed Kocabas

Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, and Muhammed Kocabas. Prompthmr: Promptable human mesh recovery, 2025

2025
[45]

Tram: Global trajectory and motion of 3d humans from in-the-wild videos, 2024

Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos, 2024

2024
[46]

Detectron2.https: //github.com/facebookresearch/detectron2, 2019

Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2.https: //github.com/facebookresearch/detectron2, 2019

2019
[47]

Vitpose: Simple vision transformer baselines for human pose estimation, 2022

Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation, 2022

2022
[48]

Sam 3d body: Robust full-body human mesh recovery, 2026

Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. Sam 3d body: Robust full-body human mesh recovery, 2026

2026
[49]

Decoupling human and camera motion from videos in the wild, 2023

Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild, 2023

2023
[50]

Athletepose3d: A benchmark dataset for 3d human pose estimation and kinematic validation in athletic movements, 2025

Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Zhuoer Yin, and Keisuke Fujii. Athletepose3d: A benchmark dataset for 3d human pose estimation and kinematic validation in athletic movements, 2025

2025
[51]

Hi4d: 4d instance segmentation of close human interaction

Yifei Yin, Chen Guo, Manuel Kaufmann, Juan Zarate, Jie Song, and Otmar Hilliges. Hi4d: 4d instance segmentation of close human interaction. InComputer Vision and Pattern Recognition (CVPR), 2023

2023
[52]

Glamr: Global occlusion-aware human mesh recovery with dynamic cameras, 2022

Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras, 2022

2022
[53]

Derpanis

Weiyu Zhang, Menglong Zhu, and Konstantinos G. Derpanis. From actemes to action: A strongly- supervised representation for detailed action understanding. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2013

2013
[54]

Kucner, Luigi Palmieri, Kai O

Yufei Zhu, Andrey Rudenko, Tomasz P. Kucner, Luigi Palmieri, Kai O. Arras, Achim J. Lilienthal, and Martin Magnusson. Cliff-lhmp: Using spatial dynamics patterns for long-term human motion prediction, 2023. 13 A Technical appendices and supplementary material A.1 Maximum-likelihood consensus pose To establish a single robust 3D joint estimate per timestep...

arXiv 2023

[1] [1]

Multi-hmr: Multi-person whole-body human mesh recovery in a single shot, 2024

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, and Thomas Lucas. Multi-hmr: Multi-person whole-body human mesh recovery in a single shot, 2024

2024

[2] [2]

Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. InEuropean Conference on Computer Vision (ECCV), pages 561–578, 2016

2016

[3] [3]

Methodological factors affecting joint moments estimation in clinical gait analysis: a systematic review.BioMedical Engineering OnLine, 16(1):106, aug 2017

Valentina Camomilla, Andrea Cereatti, Andrea Giovanni Cutti, Silvia Fantozzi, Rita Stagni, and Giuseppe Vannozzi. Methodological factors affecting joint moments estimation in clinical gait analysis: a systematic review.BioMedical Engineering OnLine, 16(1):106, aug 2017

2017

[4] [4]

Beyond static features for temporally consistent 3d human pose and shape from a video, 2021

Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video, 2021

2021

[5] [5]

Steffi L Colyer, Murray Evans, Darren P Cosker, and Aki I T Salo. A review of the evolution of vision-based motion analysis and the integration of advanced computer vision methods towards developing a markerless system.Sports Medicine - Open, 2018

2018

[6] [6]

Meva: A large-scale multiview, multimodal video dataset for activity detection, 2020

Kellie Corona, Katie Osterdahl, Roderic Collins, and Anthony Hoogs. Meva: A large-scale multiview, multimodal video dataset for activity detection, 2020

2020

[7] [7]

SportsMOT: A large multi-object tracking dataset in multiple sports scenes

Yutao Cui, Chenkai Zeng, Xiaoyu Zhao, Yichun Yang, Gangshan Wu, and Limin Wang. SportsMOT: A large multi-object tracking dataset in multiple sports scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9921–9931, 2023

2023

[8] [8]

SoccerNet: A scalable dataset for action spotting in soccer videos

Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. SoccerNet: A scalable dataset for action spotting in soccer videos. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018

2018

[9] [9]

Humans in 4d: Reconstructing and tracking humans with transformers, 2023

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4d: Reconstructing and tracking humans with transformers, 2023

2023

[10] [10]

OmniH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. OmniH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. InConference on Robot Learning (CoRL), 2024

2024

[11] [11]

Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J

Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J. Black. Capturing and inferring dense full-body human-scene contact. InProceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 13274–13285, June 2022

2022

[12] [12]

SportsPose — a dynamic 3D sports pose dataset

Christian Keilstrup Ingwersen, Christian Møller Mikkelstrup, Janus Nørtoft Jensen, Morten Rieger Han- nemose, and Anders Bjorholm Dahl. SportsPose — a dynamic 3D sports pose dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 5219–5228, 2023

2023

[13] [13]

Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, jul 2014

2014

[14] [14]

Oswald, Marc Pollefeys, Otmar Hilliges, Manuel Kaufmann, and Jie Song

Tianjian Jiang, Johsan Billingham, Sebastian Müksch, Juan Zarate, Nicolas Evans, Martin R. Oswald, Marc Pollefeys, Otmar Hilliges, Manuel Kaufmann, and Jie Song. WorldPose: A world cup dataset for global 3D human pose estimation. InProceedings of the European Conference on Computer Vision (ECCV), 2024

2024

[15] [15]

Coherent reconstruction of multiple humans from a single image, 2020

Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Coherent reconstruction of multiple humans from a single image, 2020

2020

[16] [16]

Panoptic studio: A massively multiview system for social motion capture

Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3334–3342, 2015

2015

[17] [17]

Black, David W

Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose, 2018

2018

[18] [18]

Zhang, Panna Felsen, and Jitendra Malik

Angjoo Kanazawa, Jason Y . Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video, 2019. 11

2019

[19] [19]

EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild

Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan José Zárate, and Otmar Hilliges. EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. InInternational Conference on Computer Vision (ICCV), 2023

2023

[20] [20]

Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5253–5263, 2020

2020

[21] [21]

Huang, Otmar Hilliges, and Michael J

Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. Pare: Part attention regressor for 3d human body estimation, 2021

2021

[22] [22]

Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal

Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and camera motion estimation from in-the-wild videos, 2023

2023

[23] [23]

Black, and Kostas Daniilidis

Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop, 2019

2019

[24] [24]

Genmo: A generalist model for human motion, 2025

Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A generalist model for human motion, 2025

2025

[25] [25]

Coin: Control-inpainting diffusion prior for human and camera motion estimation, 2024

Jiefeng Li, Ye Yuan, Davis Rempe, Haotian Zhang, Pavlo Molchanov, Cewu Lu, Jan Kautz, and Umar Iqbal. Coin: Control-inpainting diffusion prior for human and camera motion estimation, 2024

2024

[26] [26]

Ross, and Angjoo Kanazawa

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3D dance generation with AIST++. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13401–13412, 2021

2021

[27] [27]

Deep appearance models for face rendering.ACM Transactions on Graphics, 37(4):68:1–68:13, 2018

Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. Deep appearance models for face rendering.ACM Transactions on Graphics, 37(4):68:1–68:13, 2018

2018

[28] [28]

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: a skinned multi-person linear model.ACM Trans. Graph., 34(6), November 2015

2015

[29] [29]

McGhee and A.A

R.B. McGhee and A.A. Frank. On the stability properties of quadruped creeping gaits.Mathematical Biosciences, 3:331–351, 1968

1968

[30] [30]

Morgan Kaufmann, 2 edition, 2011

Alberto Menache.Understanding Motion Capture for Computer Animation. Morgan Kaufmann, 2 edition, 2011

2011

[31] [31]

deface: Video anonymization by face detection, 2026

ORB-HD. deface: Video anonymization by face detection, 2026

2026

[32] [32]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image, 2019

2019

[33] [33]

Amir Rasouli and John K. Tsotsos. Autonomous vehicles that interact with pedestrians: A survey of theory and practice.IEEE Transactions on Intelligent Transportation Systems, 21(3):900–918, 2020

2020

[34] [34]

You only look once: Unified, real-time object detection, 2016

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection, 2016

2016

[35] [35]

Deep gait recognition: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):264–284, 2023

Alireza Sepas-Moghaddam and Ali Etemad. Deep gait recognition: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):264–284, 2023

2023

[36] [36]

FineGym: A hierarchical video dataset for fine-grained action understanding

Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2616–2625, 2020

2020

[37] [37]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, pages 1–11. ACM, December 2024

2024

[38] [38]

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J. Black. Wham: Reconstructing world-grounded humans with accurate 3d motion, 2024

2024

[39] [39]

Aios: All-in-one-stage expressive human pose and shape estimation, 2024

Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, and Zhongang Cai. Aios: All-in-one-stage expressive human pose and shape estimation, 2024. 12

2024

[40] [40]

Moeslund, Peter Carr, and Adrian Hilton

Graham Thomas, Rikke Gade, Thomas B. Moeslund, Peter Carr, and Adrian Hilton. Computer vision for sports: Current applications and research topics.Computer Vision and Image Understanding, 159:3–18, 2017

2017

[41] [41]

Uhlrich, Antoine Falisse, Łukasz Kidzi ´nski, Julie Muccini, Michael Ko, Akshay S

Scott D. Uhlrich, Antoine Falisse, Łukasz Kidzi ´nski, Julie Muccini, Michael Ko, Akshay S. Chaudhari, Jennifer L. Hicks, and Scott L. Delp. OpenCap: Human movement dynamics from smartphone videos. PLOS Computational Biology, 19(10):e1011462, 2023

2023

[42] [42]

Recover- ing accurate 3d human pose in the wild using imus and a moving camera

Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recover- ing accurate 3d human pose in the wild using imus and a moving camera. InEuropean Conference on Computer Vision (ECCV), sep 2018

2018

[43] [43]

Applications and limitations of current markerless motion capture methods for clinical gait biomechanics.PeerJ, 10:e12995, 2022

Logan Wade, Laurie Needham, Polly McGuigan, and James Bilzon. Applications and limitations of current markerless motion capture methods for clinical gait biomechanics.PeerJ, 10:e12995, 2022

2022

[44] [44]

Black, and Muhammed Kocabas

Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, and Muhammed Kocabas. Prompthmr: Promptable human mesh recovery, 2025

2025

[45] [45]

Tram: Global trajectory and motion of 3d humans from in-the-wild videos, 2024

Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos, 2024

2024

[46] [46]

Detectron2.https: //github.com/facebookresearch/detectron2, 2019

Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2.https: //github.com/facebookresearch/detectron2, 2019

2019

[47] [47]

Vitpose: Simple vision transformer baselines for human pose estimation, 2022

Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation, 2022

2022

[48] [48]

Sam 3d body: Robust full-body human mesh recovery, 2026

Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. Sam 3d body: Robust full-body human mesh recovery, 2026

2026

[49] [49]

Decoupling human and camera motion from videos in the wild, 2023

Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild, 2023

2023

[50] [50]

Athletepose3d: A benchmark dataset for 3d human pose estimation and kinematic validation in athletic movements, 2025

Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Zhuoer Yin, and Keisuke Fujii. Athletepose3d: A benchmark dataset for 3d human pose estimation and kinematic validation in athletic movements, 2025

2025

[51] [51]

Hi4d: 4d instance segmentation of close human interaction

Yifei Yin, Chen Guo, Manuel Kaufmann, Juan Zarate, Jie Song, and Otmar Hilliges. Hi4d: 4d instance segmentation of close human interaction. InComputer Vision and Pattern Recognition (CVPR), 2023

2023

[52] [52]

Glamr: Global occlusion-aware human mesh recovery with dynamic cameras, 2022

Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras, 2022

2022

[53] [53]

Derpanis

Weiyu Zhang, Menglong Zhu, and Konstantinos G. Derpanis. From actemes to action: A strongly- supervised representation for detailed action understanding. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2013

2013

[54] [54]

Kucner, Luigi Palmieri, Kai O

Yufei Zhu, Andrey Rudenko, Tomasz P. Kucner, Luigi Palmieri, Kai O. Arras, Achim J. Lilienthal, and Martin Magnusson. Cliff-lhmp: Using spatial dynamics patterns for long-term human motion prediction, 2023. 13 A Technical appendices and supplementary material A.1 Maximum-likelihood consensus pose To establish a single robust 3D joint estimate per timestep...

arXiv 2023