SoccerNet 2026 Player-Centric Ball-Action Spotting: Retraining and Post-Processing Extensions to the FOOTPASS Baselines

Parthsarthi Rawat

arxiv: 2606.09679 · v1 · pith:EOT4HEZBnew · submitted 2026-06-08 · 💻 cs.CV


Pith reviewed 2026-06-27 17:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords SoccerNet · ball-action spotting · player-centric · FOOTPASS · post-processing · class weighting · action recognition · video analysis


The pith

Extensions to FOOTPASS baselines raise Macro F1 to 0.548 on the SoccerNet test set for player-centric ball-action spotting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that four targeted changes to three existing baseline models can improve results on a broadcast soccer video task that requires identifying which player performs which of eight actions and at what exact moment. The changes include enabling full model retraining on limited hardware, combining graph-based and visual features, reweighting rare actions to counter extreme imbalance, and applying a sequence of prediction cleanup steps plus an ensemble. A reader would care because reliable automatic detection of player actions supports downstream uses such as game statistics, scouting, and broadcast enhancement. The work treats the baselines as a workable starting point and shows measurable gains from the listed additions rather than a complete redesign.

Core claim

By applying gradient checkpointing to permit full-backbone fine-tuning, fusing GNN logits into the DST encoder, adopting square-root frequency class weighting, and running a post-processing pipeline of per-class logit gating, temporal frame refinement, jersey re-assignment, and a two-model ensemble on the TAAD, TAAD+GNN, and TAAD+DST baselines, the system reaches 0.548 Macro F1 on the test set and 0.446 on the challenge set.

What carries the argument

The four-part extension pipeline (gradient checkpointing, GNN-to-DST logit fusion, square-root class weighting, and multi-step post-processing with ensemble) applied to the three FOOTPASS baselines.

If this is right

Gradient checkpointing makes full fine-tuning of large visual backbones feasible on a single GPU.
Fusing GNN logits into the DST encoder adds tactical graph context to per-player visual features.
Square-root frequency weighting reduces the dominance of frequent classes such as passes over rare ones such as tackles.
The post-processing steps correct timing errors, re-assign players via jersey numbers, and combine two models to raise final accuracy.
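The gating and timing steps named in the last item can be sketched in a few lines. This page does not give the paper's actual rules, so the sketch swaps in a generic stand-in: a per-class score threshold followed by local-peak picking within a small frame window. The function name `gate_and_refine`, the threshold values, and the window size are all assumptions made for illustration.

```python
def gate_and_refine(scores, thresholds, window):
    """Keep per-class score peaks that pass a class-specific threshold.

    scores:     {class_name: [score for each frame]} for one player (hypothetical layout)
    thresholds: {class_name: minimum score to keep}  (the per-class gate)
    window:     half-width, in frames, of the local-peak search
    Returns a sorted list of (frame, class_name, score).
    """
    events = []
    for cls, series in scores.items():
        for t, s in enumerate(series):
            if s < thresholds[cls]:
                continue  # gate: this class needs a higher score
            lo, hi = max(0, t - window), min(len(series), t + window + 1)
            # Keep frame t only if it is the first maximum in its neighbourhood.
            if s == max(series[lo:hi]) and series.index(s, lo, hi) == t:
                events.append((t, cls, s))
    return sorted(events)


# Made-up scores, for illustration only.
scores = {
    "pass":   [0.1, 0.6, 0.9, 0.7, 0.2],
    "tackle": [0.0, 0.3, 0.1, 0.0, 0.35],
}
thresholds = {"pass": 0.5, "tackle": 0.25}  # a lower gate for the rare class
events = gate_and_refine(scores, thresholds, window=1)
# events == [(1, "tackle", 0.3), (2, "pass", 0.9), (4, "tackle", 0.35)]
```

A lower gate for rare classes trades precision for recall on those classes, which matters under Macro F1 because every class counts equally.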

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same four extensions could be tested on other video action datasets that exhibit similar class imbalance.
The post-processing pipeline might be applied independently to outputs from entirely different spotting models to measure its isolated contribution.
An expanded ensemble that includes additional variants of the baselines could be evaluated to check whether further gains remain available.
The reported scores provide a new reference point for future submissions that wish to compare against these particular extensions rather than the raw baselines.

Load-bearing premise

The three FOOTPASS baselines already supply a workable foundation that the four listed extensions can improve without new core model architectures.

What would settle it

A side-by-side evaluation on the same test set in which the unmodified TAAD+DST baseline alone matches or exceeds 0.548 Macro F1 would show that the extensions add no value.
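For readers unfamiliar with the metric in such a comparison, here is a minimal sketch of Macro F1 computed from per-class true-positive, false-positive, and false-negative counts. The challenge's rule for matching a prediction to a ground-truth event (temporal tolerance, player identity) is not described on this page, so the counts are taken as given. The numbers are invented for illustration.

```python
def macro_f1(tp, fp, fn):
    """Unweighted mean of per-class F1, given per-class TP/FP/FN counts."""
    scores = []
    for cls in tp:
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        scores.append(2 * tp[cls] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented counts: a frequent class done well and a rare class done poorly.
tp = {"pass": 90, "tackle": 1}
fp = {"pass": 10, "tackle": 1}
fn = {"pass": 10, "tackle": 3}
score = macro_f1(tp, fp, fn)  # (0.9 + 1/3) / 2
```

Because the mean is unweighted, the rare class pulls the score down as much as the frequent class lifts it, which is why class imbalance matters so much for this metric.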

Original abstract

We describe our system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge, which requires predicting who performs which action and when, across eight classes in broadcast soccer. Building on the three FOOTPASS baselines [1] (TAAD, TAAD+GNN, and TAAD+DST), we contribute four extensions: (1) gradient checkpointing to enable full-backbone fine-tuning on a single GPU; (2) fusion of GNN logits into the DST encoder, combining graph-based tactical context with per-player visual features; (3) square-root frequency class weighting to address the 213:1 pass-to-tackle imbalance in the training data; and (4) a post-processing pipeline comprising per-class logit gating, temporal frame refinement, jersey re-assignment, and a two-model ensemble. Our system achieves 0.548 Macro F1 on the test set and 0.446 on the challenge set (server evaluation).
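The abstract does not say how the two-model ensemble combines its members. One common choice, shown here only as an assumption, is a weighted average of the two models' softmax probabilities for the same player and frame. The names `softmax` and `ensemble_probs`, the mixing weight, and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(logits_a, logits_b, weight_a=0.5):
    """Weighted average of two models' class probabilities (assumed scheme)."""
    probs_a, probs_b = softmax(logits_a), softmax(logits_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


# Invented logits over three classes for one player at one frame.
probs = ensemble_probs([2.0, 0.0, 0.0], [0.0, 1.0, 0.0])
best_class = probs.index(max(probs))  # class 0 wins after averaging
```

Averaging probabilities rather than raw logits keeps the two models on a comparable scale even if one of them is systematically more confident.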

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a straightforward engineering report extending FOOTPASS baselines for the SoccerNet 2026 challenge with four standard tweaks, reporting server-evaluated F1 gains but without ablations or variance measures.


This paper extends the three FOOTPASS baselines for player-centric ball-action spotting by adding gradient checkpointing for single-GPU fine-tuning, GNN logit fusion into DST, square-root frequency weighting for the pass-tackle imbalance, and a post-processing chain of gating, temporal refinement, jersey re-assignment, and ensembling. The reported results are 0.548 Macro F1 on the test set and 0.446 on the challenge server.

The extensions are sensible applications of known techniques. Checkpointing addresses a real hardware constraint, the weighting targets the documented 213:1 skew, and the post-processing steps are the kind of refinements that commonly lift spotting performance. The use of an external server evaluation adds some credibility to the numbers.

The soft spots are the absence of ablations showing the contribution of each change, no error bars or repeated runs, and high-level descriptions that leave implementation details unclear. The work stays inside the existing baselines rather than adding new core components, so the gains rest on empirical tuning.

This is mainly for teams already in the SoccerNet challenge or working on broadcast sports video. A reader focused on general action spotting might pick up practical tips on imbalance handling and post-processing, but the scope is narrow.

I would send it to peer review for a challenge or workshop track because the held-out server results are concrete and the methods are described enough to be reproducible by others, even if more analysis would help.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge by extending the three FOOTPASS baselines (TAAD, TAAD+GNN, TAAD+DST) with four modifications: gradient checkpointing for full-backbone fine-tuning, fusion of GNN logits into the DST encoder, square-root frequency class weighting to handle 213:1 class imbalance, and a four-stage post-processing pipeline (per-class logit gating, temporal frame refinement, jersey re-assignment, two-model ensemble). It reports achieving 0.548 Macro F1 on the test set and 0.446 on the challenge set via server evaluation.

Significance. If reproducible, the work offers incremental engineering improvements on a challenging player-centric temporal action spotting task with severe class imbalance. The techniques are standard and directly address the stated problem constraints, but the lack of ablations or validation details limits assessment of which extensions drive the reported gains over the cited baselines.

major comments (2)

[Abstract] The central claims rest on the reported Macro F1 scores (0.548 test, 0.446 challenge) with no accompanying error bars, ablation studies, implementation details, or validation procedure, making it impossible to verify that the four listed extensions produce the stated improvements over the FOOTPASS baselines.
[Methods] No section provides the precise formulation or loss-function integration of the square-root frequency class weighting, despite its identification as a load-bearing extension for the 213:1 imbalance; without this, the contribution cannot be assessed or reproduced.

minor comments (2)

A results table comparing each baseline to the extended system (with and without individual extensions) would clarify the incremental gains.
The citation to the FOOTPASS baselines [1] should include the exact reference details for the three variants (TAAD, TAAD+GNN, TAAD+DST).
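The ablation table requested in the first minor comment implies a set of configurations to train and score. A minimal sketch of how those could be enumerated follows. The extension identifiers are hypothetical labels for the paper's four changes, and no scores are attached because none are reported per configuration.

```python
from itertools import product

# Hypothetical identifiers for the four extensions listed in the abstract.
EXTENSIONS = (
    "grad_checkpointing",
    "gnn_dst_fusion",
    "sqrt_class_weights",
    "post_processing",
)


def ablation_grid(extensions=EXTENSIONS):
    """Every on/off combination of the extensions (2**n configurations)."""
    return [
        dict(zip(extensions, flags))
        for flags in product((False, True), repeat=len(extensions))
    ]


def leave_one_out(extensions=EXTENSIONS):
    """The cheaper alternative: full system minus one extension at a time."""
    return [{name: name != dropped for name in extensions} for dropped in extensions]


grid = ablation_grid()    # 16 configurations, from all-off to all-on
loo = leave_one_out()     # 4 configurations
```

Under a tight GPU budget, the four leave-one-out runs plus the unmodified baseline would already show which extension the final score depends on most.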

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review. We address the two major comments point-by-point below, indicating where revisions will be made to improve clarity and reproducibility.


Referee: [Abstract] The central claims rest on the reported Macro F1 scores (0.548 test, 0.446 challenge) with no accompanying error bars, ablation studies, implementation details, or validation procedure, making it impossible to verify that the four listed extensions produce the stated improvements over the FOOTPASS baselines.

Authors: The manuscript is a concise system-description paper for a fixed challenge deadline rather than a full research article. Implementation details for all four extensions appear in the Methods section. The reported scores are single-run server evaluations on the organizers' fixed test and challenge sets; no error bars are possible without multiple independent runs, which were not performed. Ablation studies were omitted due to the challenge timeline and GPU-hour limits. We will revise the abstract to explicitly state that scores come from single server submissions and to reference the Methods section for extension details. revision: partial
Referee: [Methods] No section provides the precise formulation or loss-function integration of the square-root frequency class weighting, despite its identification as a load-bearing extension for the 213:1 imbalance; without this, the contribution cannot be assessed or reproduced.

Authors: We agree that the current manuscript lacks the explicit formula. In the revised version we will insert the precise definition (square-root inverse-frequency weights applied to the cross-entropy loss) together with the integration equation and the resulting per-class weight values computed from the training-set statistics. revision: yes
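Until the revised manuscript supplies the exact formula, one plausible reading of "square-root inverse-frequency weights applied to the cross-entropy loss" is w_c proportional to 1/sqrt(n_c), where n_c is the training count of class c. The sketch below follows that reading. The normalisation to mean 1 and the class counts are assumptions, with the counts chosen only to reproduce the 213:1 ratio quoted in the abstract.

```python
import math


def sqrt_inverse_frequency_weights(counts):
    """Weights proportional to 1/sqrt(count), rescaled to mean 1 (assumed)."""
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {cls: w / mean for cls, w in raw.items()}


def weighted_cross_entropy(logits, target, weights, classes):
    """Single-sample loss: -w[target] * log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_p = logits[classes.index(target)] - log_z
    return -weights[target] * log_p


# Hypothetical counts with the 213:1 pass-to-tackle ratio.
counts = {"pass": 21300, "tackle": 100}
weights = sqrt_inverse_frequency_weights(counts)
ratio = weights["tackle"] / weights["pass"]  # sqrt(213), about 14.6
loss = weighted_cross_entropy([0.0, 0.0], "tackle", weights, ["pass", "tackle"])
```

The square root softens the correction: plain inverse-frequency weighting would upweight tackles 213 times relative to passes, whereas this scheme upweights them by roughly 14.6 times.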

Circularity Check

0 steps flagged

No significant circularity


The paper reports purely empirical results obtained by applying four standard engineering extensions (gradient checkpointing, GNN-logit fusion, sqrt-frequency weighting, and a four-stage post-processor) to three externally cited FOOTPASS baselines, then measuring Macro F1 on held-out test and challenge-server sets. No equations, derivations, or fitted parameters are present that could reduce to the reported scores by construction. The single citation to the baselines is not load-bearing for any internal claim; the central result is an observed performance number on external data, not a self-referential prediction or renamed input.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence and quality of the three FOOTPASS baselines plus the stated class imbalance ratio; no new entities or free parameters beyond the choice of sqrt weighting.

free parameters (1)

square-root frequency class weighting
Applied to counter the 213:1 pass-to-tackle imbalance stated in the abstract.

axioms (1)

domain assumption: The training data exhibits a 213:1 pass-to-tackle imbalance
Invoked to justify the weighting choice.



Reference graph

Works this paper leans on

10 extracted references · 3 canonical work pages

[1]

Jérémie Ochin, Raphaël Chekroun, Bogdan Stanciulescu, and Sotiris Manitsaris. FOOTPASS: A multi-modal multi-agent tactical context dataset for play-by-play action spotting in soccer broadcast videos. Computer Vision and Image Understanding, 269:104790, 2026. ISSN 1077-3142. doi: 10.1016/j.cviu.2026.104790

[2]

Jeremie Ochin, Guillaume Devineau, Bogdan Stanciulescu, and Sotiris Manitsaris. Game state and spatio-temporal action detection in soccer using graph neural networks and 3d convolutional networks. In Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM), pages 636–646. INSTICC, SciTePress, 2025. ISBN 97... doi: 10.5220/0013161100003905

[3]

Gurkirt Singh, Vasileios Choutas, Suman Saha, Fisher Yu, and Luc Van Gool. Spatio-temporal action detection under large motion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6009–6018, January 2023

[4]

Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 200–210, 2020. doi: 10.1109/CVPR42600.2020.00028

[5]

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN, 2018. URL https://arxiv.org/abs/1703.06870

[6]

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds, 2019. URL https://arxiv.org/abs/1801.07829

[7]

Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, and Sotiris Manitsaris. Beyond pixels: Leveraging the language of soccer to improve spatio-temporal action detection in broadcast videos. In Proceedings of the 22nd International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS),

[8]

Scheduled for publication by Springer on 24th November 2025

[9]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

[10]

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection, 2018. URL https://arxiv.org/abs/1708.02002
