SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation

Parthsarthi Rawat

arxiv: 2605.31551 · v1 · pith:XBFJVS25new · submitted 2026-05-29 · 💻 cs.CV


Pith reviewed 2026-06-28 22:43 UTC · model grok-4.3


keywords: soccer, pose estimation, 3D reconstruction, optical flow, human mesh recovery, broadcast video, SMPL


The pith

Finetuned SMPLest-X with RAFT camera tracking and foot-plane anchoring scores 38.6 percent below the FIFA baseline on the challenge validation set for soccer pose estimation (lower is better).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SMART, an approach to the FIFA Skeletal Tracking Challenge, which requires estimating 3D world-space poses of soccer players from broadcast video. It finetunes the SMPLest-X model (ViT-H, 687 M parameters) using a stratified clip split, multi-task depth supervision, and broadcast augmentation. The finetuned model is paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. On the validation set this yields a score of 0.647 against the baseline of 1.053; on the held-out test set it scores 0.593 (Global MPJPE 0.324 m, Local MPJPE 0.054 m).

Core claim

SMART finetunes SMPLest-X via stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing to achieve a validation score of 0.647, a 38.6 percent improvement over the FIFA baseline of 1.053, and a test score of 0.593.
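The 38.6 percent figure can be checked directly from the two validation scores. The short sketch below does that arithmetic; the function name is ours, and the only inputs are the numbers quoted in the abstract, treated as a lower-is-better score.

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Fractional reduction of a lower-is-better score relative to a baseline."""
    return (baseline - score) / baseline

baseline_val = 1.053  # FIFA baseline, validation set (from the abstract)
smart_val = 0.647     # SMART, validation set (from the abstract)

gain = relative_improvement(baseline_val, smart_val)
print(f"{gain:.1%}")  # prints 38.6%
```

The same calculation cannot be repeated for the test score of 0.593, because the abstract does not report a baseline score on the test set.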

What carries the argument

SMPLest-X finetuning combined with RAFT optical flow tracking and foot-plane anchoring to enforce temporal consistency and ground contact in 3D pose estimates from video.
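The abstract names foot-plane anchoring but does not describe how it is done. As a rough illustration only, one simple way to enforce ground contact is to shift each skeleton vertically until its lowest foot joint lies on the pitch plane. The sketch assumes a z-up world frame with the pitch at z = 0, and the joint ordering and the `anchor_to_ground` helper are hypothetical, not taken from the paper.

```python
Joint = tuple[float, float, float]

def anchor_to_ground(joints: list[Joint], foot_ids: list[int],
                     ground_z: float = 0.0) -> list[Joint]:
    """Translate a skeleton along z so its lowest foot joint sits on z = ground_z."""
    lowest = min(joints[i][2] for i in foot_ids)
    dz = ground_z - lowest
    return [(x, y, z + dz) for x, y, z in joints]

# Toy skeleton (assumed order: pelvis, left foot, right foot), floating 0.20 m above the pitch.
skeleton = [(10.0, 5.0, 1.12), (10.1, 5.0, 0.20), (9.9, 5.0, 0.27)]
anchored = anchor_to_ground(skeleton, foot_ids=[1, 2])
```

A rigid vertical shift like this changes only the global position, so it would affect a world-space error but leave root-relative joint errors untouched. It would also be wrong for airborne players, which any real implementation would have to handle.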

If this is right

Global MPJPE of 0.324 m and local MPJPE of 0.054 m on the test set indicate accurate world-space and relative joint positions.
The method works on standard broadcast video without specialized camera setups.
Two-pass temporal smoothing reduces frame-to-frame jitter in the estimated pose sequences.
Foot-plane anchoring maintains realistic player-ground interactions.
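The abstract does not say what the two smoothing passes are. One common reading, shown here purely as an assumption, is a forward pass followed by a backward pass of the same filter, which cancels the lag a single causal pass would introduce. The exponential filter, the `alpha` value, and the helper names are illustrative choices, not the paper's.

```python
def ema(values: list[float], alpha: float) -> list[float]:
    """Causal exponential moving average; alpha = 1.0 means no smoothing."""
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out

def two_pass_smooth(values: list[float], alpha: float = 0.5) -> list[float]:
    """Run the filter forward, then backward over the result."""
    forward = ema(values, alpha)
    backward = ema(forward[::-1], alpha)
    return backward[::-1]

def roughness(xs: list[float]) -> float:
    """Sum of absolute frame-to-frame changes."""
    return sum(abs(b - a) for a, b in zip(xs, xs[1:]))

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # one coordinate of one joint over six frames
smooth = two_pass_smooth(noisy)
```

In a full pipeline this would be applied independently to each coordinate of each joint (or to the root trajectory), and the smoothing strength trades jitter against blurring of fast motions such as kicks.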

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar adaptation techniques could improve pose estimation in other dynamic sports with broadcast footage.
Large mesh models may generalize better to new domains when combined with dense flow tracking for camera motion.

Load-bearing premise

The improvements are attributable to the described finetuning choices, RAFT tracker, and foot-plane anchoring rather than to undisclosed data leakage, metric-specific tuning, or post-hoc selection of the validation split.

What would settle it

Evaluating the SMART pipeline on an independent soccer video dataset collected separately from the challenge data and measuring whether the MPJPE remains below 0.4 m.

Original abstract

We present our approach to the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. Our method finetunes SMPLest-X (ViT-H, 687 M parameters) via a stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. Against the FIFA baseline score of 1.053 on the validation set, SMART achieves 0.647, a 38.6% improvement; on the held-out test set, SMART scores 0.593 (Global MPJPE: 0.324 m, Local MPJPE: 0.054 m).
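The abstract reports Global and Local MPJPE without defining them. The sketch below assumes the usual convention: global MPJPE is the mean joint distance in world coordinates, and local MPJPE is the same quantity after subtracting a root joint from both skeletons. The challenge's exact definition, including which joint serves as root, may differ.

```python
import math

Joint = tuple[float, float, float]

def mpjpe(pred: list[Joint], gt: list[Joint]) -> float:
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def root_relative(joints: list[Joint], root: int = 0) -> list[Joint]:
    """Express joints relative to a root joint (index 0 is an assumption)."""
    rx, ry, rz = joints[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in joints]

# Toy case: a perfect pose placed 0.3 m away from its true position along x.
gt = [(0.0, 0.0, 0.9), (0.0, 0.1, 0.5), (0.0, 0.1, 0.0)]
pred = [(x + 0.3, y, z) for x, y, z in gt]

global_err = mpjpe(pred, gt)                              # about 0.3 m
local_err = mpjpe(root_relative(pred), root_relative(gt))  # 0.0 m
```

Under this reading, the gap between the reported 0.324 m global and 0.054 m local errors would indicate that most of the remaining error is in where the player is placed in the world, not in the articulated pose.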

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competition entry that finetunes SMPLest-X and adds RAFT and standard tricks for soccer pose. It reports clear test-set numbers but adds no new methods and leaves the source of the 38.6% gain unverified.


The paper is a FIFA Skeletal Tracking Challenge submission. It starts from SMPLest-X, applies a stratified clip split plus multi-task depth supervision and broadcast augmentation during finetuning, then layers on RAFT optical flow for camera tracking, foot-plane anchoring, and temporal smoothing. On the validation set it moves the score from the baseline 1.053 to 0.647; on the held-out test set it reaches 0.593 with the reported global and local MPJPE values.
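How the dense flow is turned into a camera estimate is not stated in the abstract. As one hedged possibility, if RAFT has already produced per-pixel flow vectors, a robust global image shift can be taken as the median flow after rejecting outliers (moving players) with a median-absolute-deviation rule. The pure-translation camera model, the threshold `k`, and both helper names are assumptions made for this sketch.

```python
import statistics

Flow = tuple[float, float]

def mad_inliers(values: list[float], k: float = 3.0) -> list[float]:
    """Keep values within k scaled MADs of the median (1.4826 scales MAD to sigma)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) <= k * 1.4826 * mad]

def camera_shift(flow: list[Flow], k: float = 3.0) -> Flow:
    """Estimate a global 2D image shift from dense flow samples."""
    dx = statistics.median(mad_inliers([u for u, _ in flow], k))
    dy = statistics.median(mad_inliers([v for _, v in flow], k))
    return dx, dy

# Synthetic flow: 20 background pixels panning near (-4.0, 0.5), 4 player pixels moving otherwise.
background = [(-4.0 + 0.1 * ((i % 5) - 2), 0.5 + 0.05 * ((i % 3) - 1)) for i in range(20)]
players = [(3.0, -2.0)] * 4
shift = camera_shift(background + players)
```

A plain mean over the same samples would be pulled toward the player motion. A real broadcast camera also rotates and zooms, so a homography or full pose fit would be needed in practice; the translation model here only illustrates the outlier-rejection step.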

Nothing in the method is new. SMPLest-X and RAFT are existing components, and the finetuning steps are standard practice. The work is an empirical application to one narrow domain rather than a new algorithm or derivation.

What it does well is ship concrete numbers on a true held-out test set and break the metric into global and local components. That gives other challenge participants a usable reference point.

The main weakness is the missing link between the listed techniques and the reported gains. The abstract gives no ablation tables, no quantitative description of the clip split (lengths, stratification rules, overlap checks), and no error analysis. Without those, it is impossible to tell how much of the 38.6% validation improvement comes from the RAFT tracker or foot anchoring versus from how the validation clips were chosen. The circularity burden noted in the stress test is real on the evidence supplied.

This paper is for teams already working on the FIFA challenge or for applied sports-vision practitioners who need a practical baseline. A reader looking for new ideas in 3D human modeling or tracking will not find them. It does not contain the formal grounding or evidential sharpness that would justify sending it to serious referees.

Referee Report

3 major / 1 minor

Summary. The paper presents SMART, an entry to the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. It finetunes the SMPLest-X ViT-H model (687M parameters) using a stratified clip split, multi-task depth supervision, and broadcast augmentation, and pairs it with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. It reports a validation score of 0.647 (38.6% better than the FIFA baseline of 1.053) and a held-out test score of 0.593 (Global MPJPE 0.324 m, Local MPJPE 0.054 m).

Significance. If the reported gains can be independently verified as arising from the listed components rather than data partitioning, the work would demonstrate a practical recipe for adapting large body models to sports broadcast footage with camera motion and ground-plane constraints. The test-set numbers provide a concrete, falsifiable benchmark for future soccer pose methods.

major comments (3)

[Abstract] The central claim that the 38.6% validation improvement is produced by the finetuning choices, RAFT tracker, and foot-plane anchoring cannot be evaluated because the manuscript supplies no methods section, no implementation details of the multi-task depth loss or broadcast augmentation, and no ablation tables isolating each component's contribution to the MPJPE reductions.
[Abstract] The stratified clip split is described only at the level of a name; no clip-length statistics, stratification criteria, or quantitative overlap measures between training and validation clips are given, leaving open the possibility that the measured gain partly reflects a more favorable partition rather than the algorithmic contributions.
[Abstract] No error analysis, per-scene breakdown, or failure-case discussion is provided to support the attribution of the test-set Global MPJPE (0.324 m) and Local MPJPE (0.054 m) specifically to the RAFT tracker and foot-plane anchoring.
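For concreteness, the kind of split and overlap check the second comment asks about can be sketched as follows. Whole clips are assigned to train or validation within each stratum, and the check confirms that no clip appears on both sides. The strata labels ("pan", "static"), the 20 percent validation fraction, and the function name are hypothetical; the paper's actual criteria are not given.

```python
import random
from collections import defaultdict

def stratified_clip_split(clips: list[tuple[str, str]], val_fraction: float = 0.2,
                          seed: int = 0) -> tuple[list[str], list[str]]:
    """Split (clip_id, stratum) pairs so each stratum sends ~val_fraction of its clips to validation."""
    by_stratum: dict[str, list[str]] = defaultdict(list)
    for clip_id, stratum in clips:
        by_stratum[stratum].append(clip_id)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])
        rng.shuffle(ids)
        n_val = max(1, round(len(ids) * val_fraction))  # at least one clip per stratum
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val

# Toy data: 20 clips, alternating between two assumed camera-motion strata.
clips = [(f"clip_{i:02d}", "pan" if i % 2 else "static") for i in range(20)]
train, val = stratified_clip_split(clips)
overlap = set(train) & set(val)  # must be empty for a leak-free clip-level split
```

Splitting by clip id rules out identical clips on both sides but not near-duplicates, such as two clips cut from the same match sequence. Reporting a grouping key (match, camera, sequence) alongside the strata would address that.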

minor comments (1)

[Abstract] The abstract reports both validation and test scores but does not state whether the validation split was fixed before any hyper-parameter search or whether the reported numbers reflect a single run or the best of several runs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for expanded methodological transparency and analysis in our challenge submission. We address each major comment below and will revise the manuscript to incorporate the requested details.


Referee: [Abstract] The central claim that the 38.6% validation improvement is produced by the finetuning choices, RAFT tracker, and foot-plane anchoring cannot be evaluated because the manuscript supplies no methods section, no implementation details of the multi-task depth loss or broadcast augmentation, and no ablation tables isolating each component's contribution to the MPJPE reductions.

Authors: We agree that the current concise format lacks a dedicated methods section and ablations. In the revised manuscript we will add a full methods section detailing the multi-task depth loss formulation, broadcast augmentation pipeline, and ablation tables that isolate the contribution of each component (stratified finetuning, RAFT tracking, foot-plane anchoring) to the reported MPJPE reductions. Revision: yes

Referee: [Abstract] The stratified clip split is described only at the level of a name; no clip-length statistics, stratification criteria, or quantitative overlap measures between training and validation clips are given, leaving open the possibility that the measured gain partly reflects a more favorable partition rather than the algorithmic contributions.

Authors: We will expand the data section to report clip-length statistics, explicit stratification criteria (e.g., by camera motion, player density, and scene type), and quantitative overlap measures such as average temporal overlap and feature similarity between train and validation clips to demonstrate that the partition does not artificially inflate performance. Revision: yes

Referee: [Abstract] No error analysis, per-scene breakdown, or failure-case discussion is provided to support the attribution of the test-set Global MPJPE (0.324 m) and Local MPJPE (0.054 m) specifically to the RAFT tracker and foot-plane anchoring.

Authors: We acknowledge the absence of error analysis. The revision will include a dedicated analysis section with per-scene MPJPE breakdowns, qualitative failure cases, and quantitative attribution experiments showing the incremental effect of the RAFT tracker and foot-plane anchoring on the final Global and Local MPJPE values. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain.


The paper describes an empirical pipeline (finetuning SMPLest-X with multi-task supervision, augmentation, RAFT tracking, foot-plane anchoring, and smoothing) and reports measured performance on a provided validation set and held-out test set. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The improvement claim is an empirical outcome on challenge data rather than a self-referential logical step; the stratified clip split is stated as a methodological choice without evidence that its definition incorporates the target metric. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5653 in / 1167 out tokens · 22925 ms · 2026-06-28T22:43:43.254693+00:00 · methodology


Reference graph

Works this paper leans on

6 extracted references · 1 canonical work pages

[1] Tianjian Jiang, Johsan Billingham, Sebastian Müksch, Juan Zarate, Nicolas Evans, Martin Oswald, Marc Pollefeys, Otmar Hilliges, Manuel Kaufmann, and Jie Song. WorldPose: A world cup dataset for global 3D human pose estimation. In ECCV, 2024.

[2] Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, and Ziwei Liu. SMPLest-X: Ultimate scaling for expressive human pose and shape estimation. IEEE TPAMI, 48(2):1778–1794, 2026.

[3] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow, 2020.

[4] Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4):764–766, 2013.

[5] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV, 81(2):155–166, 2009.

[6] Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. SAM 3D body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989, 2026.
