SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation
Pith reviewed 2026-06-28 22:43 UTC · model grok-4.3
The pith
Finetuned SMPLest-X with RAFT camera tracking and foot-plane anchoring achieves a 38.6 percent lower validation score than the FIFA baseline for soccer pose estimation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SMART finetunes SMPLest-X via stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing to achieve a validation score of 0.647, a 38.6 percent improvement over the FIFA baseline of 1.053, and a test score of 0.593.
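The 38.6 percent figure follows directly from the two validation scores quoted above. A minimal check, assuming only that lower scores are better (the function name is illustrative, not from the paper):

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Percentage reduction of `score` relative to `baseline` (lower is better)."""
    return 100.0 * (baseline - score) / baseline


fifa_baseline = 1.053  # FIFA baseline validation score, as quoted in the abstract
smart_val = 0.647      # SMART validation score, as quoted in the abstract

improvement = relative_improvement(fifa_baseline, smart_val)
print(f"{improvement:.1f}%")  # prints "38.6%"
```

The test score of 0.593 is reported without a matching baseline, so no comparable percentage can be derived for it from the numbers given here.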
What carries the argument
SMPLest-X finetuning combined with RAFT optical flow tracking and foot-plane anchoring to enforce temporal consistency and ground contact in 3D pose estimates from video.
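The paper's exact formulation of foot-plane anchoring and two-pass smoothing is not given on this page. The sketch below shows one plausible reading under stated assumptions: a z-up world frame with the pitch at z = 0, anchoring as a vertical shift that puts the lowest foot joint on the ground, and "two-pass" as exponential smoothing run forward and then backward. All names, joint indices, and the `alpha` value are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres; z-up is an assumption


def anchor_feet_to_plane(joints: Sequence[Point], foot_ids: Sequence[int],
                         ground_z: float = 0.0) -> List[Point]:
    """Shift the whole skeleton vertically so its lowest foot joint lies on the ground plane."""
    lowest = min(joints[i][2] for i in foot_ids)
    dz = ground_z - lowest
    return [(x, y, z + dz) for x, y, z in joints]


def smooth_two_pass(values: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Exponential smoothing applied forward, then backward over a non-empty sequence.

    The backward pass offsets the lag that a single forward pass introduces.
    """
    def one_pass(seq: List[float]) -> List[float]:
        out = [seq[0]]
        for v in seq[1:]:
            out.append(alpha * v + (1.0 - alpha) * out[-1])
        return out

    forward = one_pass(list(values))
    backward = one_pass(forward[::-1])
    return backward[::-1]


# Toy skeleton: pelvis, left foot, right foot (hypothetical joint order).
skeleton = [(0.0, 0.0, 1.05), (0.1, 0.0, 0.15), (-0.1, 0.0, 0.12)]
anchored = anchor_feet_to_plane(skeleton, foot_ids=[1, 2])

# Toy root trajectory along x with one jittery frame.
root_x = [0.0, 0.1, 0.9, 0.3, 0.4]
smoothed = smooth_two_pass(root_x)
```

In this toy case the right foot ends up exactly on the plane and the spike at the third frame is damped. A real pipeline would more likely anchor only on frames where a foot is judged to be in contact, which this sketch does not attempt.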
If this is right
- Global MPJPE of 0.324 m and local MPJPE of 0.054 m on the test set indicate accurate world-space and relative joint positions.
- The method works on standard broadcast video without specialized camera setups.
- Two-pass temporal smoothing yields temporally consistent pose sequences across frames.
- Foot-plane anchoring maintains realistic player-ground interactions.
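To make the global versus local distinction concrete, here is a small sketch of MPJPE (mean per-joint position error) computed in both ways. It assumes that "local" means root-relative error, which is the common convention; the challenge's exact definitions and the way they combine into the overall score are not stated on this page.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding predicted and true joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def root_relative(joints: Sequence[Point], root_id: int = 0) -> List[Point]:
    """Express every joint relative to the root joint (index 0 is an assumption)."""
    rx, ry, rz = joints[root_id]
    return [(x - rx, y - ry, z - rz) for x, y, z in joints]


# Toy ground truth, and a prediction with the right articulation
# but a 0.3 m offset along x in world space.
gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.5), (0.2, 0.0, 0.0)]
pred = [(x + 0.3, y, z) for x, y, z in gt]

global_err = mpjpe(pred, gt)                               # about 0.3 m
local_err = mpjpe(root_relative(pred), root_relative(gt))  # about 0 m
```

The example shows why the two numbers can differ by a wide margin, as they do in the reported test results: a pose can be articulated correctly yet misplaced on the pitch, which raises only the global error.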
Where Pith is reading between the lines
- Similar adaptation techniques could improve pose estimation in other dynamic sports with broadcast footage.
- Large mesh models may generalize better to new domains when combined with dense flow tracking for camera motion.
Load-bearing premise
The improvements are attributable to the described finetuning choices, RAFT tracker, and foot-plane anchoring rather than to undisclosed data leakage, metric-specific tuning, or post-hoc selection of the validation split.
What would settle it
Evaluating the SMART pipeline on an independent soccer video dataset, collected separately from the challenge data, and checking whether the MPJPE remains below 0.4 m.
Original abstract
We present our approach to the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. Our method finetunes SMPLest-X (ViT-H, 687 M parameters) via a stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. Against the FIFA baseline score of 1.053 on the validation set, SMART achieves 0.647, a 38.6% improvement; on the held-out test set, SMART scores 0.593 (Global MPJPE: 0.324 m, Local MPJPE: 0.054 m).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SMART for the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. It finetunes the SMPLest-X ViT-H model (687M parameters) using a stratified clip split, multi-task depth supervision, and broadcast augmentation, and pairs the finetuned model with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. It reports a validation score of 0.647 (38.6% better than the FIFA baseline of 1.053) and a held-out test score of 0.593 (Global MPJPE 0.324 m, Local MPJPE 0.054 m).
Significance. If the reported gains can be independently verified as arising from the listed components rather than data partitioning, the work would demonstrate a practical recipe for adapting large body models to sports broadcast footage with camera motion and ground-plane constraints. The test-set numbers provide a concrete, falsifiable benchmark for future soccer pose methods.
major comments (3)
- [Abstract] The central claim that the 38.6% validation improvement is produced by the finetuning choices, RAFT tracker, and foot-plane anchoring cannot be evaluated because the manuscript supplies no methods section, no implementation details of the multi-task depth loss or broadcast augmentation, and no ablation tables isolating each component's contribution to the MPJPE reductions.
- [Abstract] The stratified clip split is described only at the level of a name; no clip-length statistics, stratification criteria, or quantitative overlap measures between training and validation clips are given, leaving open the possibility that the measured gain partly reflects a more favorable partition rather than the algorithmic contributions.
- [Abstract] No error analysis, per-scene breakdown, or failure-case discussion is provided to support the attribution of the test-set Global MPJPE (0.324 m) and Local MPJPE (0.054 m) specifically to the RAFT tracker and foot-plane anchoring.
minor comments (1)
- [Abstract] The abstract reports both validation and test scores but does not state whether the validation split was fixed before any hyper-parameter search or whether the reported numbers reflect a single run or the best of several runs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for expanded methodological transparency and analysis in our challenge submission. We address each major comment below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Abstract] The central claim that the 38.6% validation improvement is produced by the finetuning choices, RAFT tracker, and foot-plane anchoring cannot be evaluated because the manuscript supplies no methods section, no implementation details of the multi-task depth loss or broadcast augmentation, and no ablation tables isolating each component's contribution to the MPJPE reductions.
  Authors: We agree that the current concise format lacks a dedicated methods section and ablations. In the revised manuscript we will add a full methods section detailing the multi-task depth loss formulation, broadcast augmentation pipeline, and ablation tables that isolate the contribution of each component (stratified finetuning, RAFT tracking, foot-plane anchoring) to the reported MPJPE reductions.
  Revision: yes
- Referee: [Abstract] The stratified clip split is described only at the level of a name; no clip-length statistics, stratification criteria, or quantitative overlap measures between training and validation clips are given, leaving open the possibility that the measured gain partly reflects a more favorable partition rather than the algorithmic contributions.
  Authors: We will expand the data section to report clip-length statistics, explicit stratification criteria (e.g., by camera motion, player density, and scene type), and quantitative overlap measures such as average temporal overlap and feature similarity between train and validation clips to demonstrate that the partition does not artificially inflate performance.
  Revision: yes
- Referee: [Abstract] No error analysis, per-scene breakdown, or failure-case discussion is provided to support the attribution of the test-set Global MPJPE (0.324 m) and Local MPJPE (0.054 m) specifically to the RAFT tracker and foot-plane anchoring.
  Authors: We acknowledge the absence of error analysis. The revision will include a dedicated analysis section with per-scene MPJPE breakdowns, qualitative failure cases, and quantitative attribution experiments showing the incremental effect of the RAFT tracker and foot-plane anchoring on the final Global and Local MPJPE values.
  Revision: yes
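The referee's concern about the partition can be made concrete with a clip-level stratified split followed by an overlap check. The sketch below is an illustration under assumptions, not the authors' procedure: the stratum labels ("pan", "static"), the 20 percent validation fraction, and the function name are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_clip_split(clip_strata: Dict[str, str], val_fraction: float = 0.2,
                          seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split clip ids into train/val, sampling validation clips within each stratum.

    Whole clips are assigned to one side, so no clip contributes frames to both.
    A stratum with a single clip goes entirely to train.
    """
    by_stratum = defaultdict(list)
    for clip_id, stratum in clip_strata.items():
        by_stratum[stratum].append(clip_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for stratum in sorted(by_stratum):
        clips = sorted(by_stratum[stratum])
        rng.shuffle(clips)
        n_val = max(1, round(len(clips) * val_fraction)) if len(clips) > 1 else 0
        val.extend(clips[:n_val])
        train.extend(clips[n_val:])
    return train, val


# Ten hypothetical clips, alternating between two camera-motion strata.
clips = {f"clip_{i:02d}": ("pan" if i % 2 else "static") for i in range(10)}
train_clips, val_clips = stratified_clip_split(clips)

# Clip-level overlap check: should be empty for a leak-free partition.
overlap = set(train_clips) & set(val_clips)
```

A check at the clip-id level only rules out the same clip appearing on both sides. Clips cut from the same match or camera could still be near-duplicates, which is what the temporal-overlap and feature-similarity measures promised in the rebuttal would need to address.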
Circularity Check
No significant circularity in derivation chain.
Full rationale
The paper describes an empirical pipeline (finetuning SMPLest-X with multi-task supervision, augmentation, RAFT tracking, foot-plane anchoring, and smoothing) and reports measured performance on a provided validation set and held-out test set. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The improvement claim is an empirical outcome on challenge data rather than a self-referential logical step; the stratified clip split is stated as a methodological choice without evidence that its definition incorporates the target metric. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Tianjian Jiang, Johsan Billingham, Sebastian Müksch, Juan Zarate, Nicolas Evans, Martin Oswald, Marc Pollefeys, Otmar Hilliges, Manuel Kaufmann, and Jie Song. WorldPose: A world cup dataset for global 3D human pose estimation. In ECCV, 2024.
- [2] Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, and Ziwei Liu. SMPLest-X: Ultimate scaling for expressive human pose and shape estimation. IEEE TPAMI, 48(2):1778–1794, 2026.
- [3] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow, 2020.
- [4] Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4):764–766, 2013.
- [5] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV, 81(2):155–166, 2009.
- [6] Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. SAM 3D body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989, 2026.