arxiv: 2511.12606 · v3 · submitted 2025-11-16 · 💻 cs.CV

Pixels or Positions? Benchmarking Modalities in Group Activity Recognition

Pith reviewed 2026-05-17 21:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords group activity recognition · player tracking · graph neural networks · soccer · multimodal dataset · sports video analysis · role-aware modeling

The pith

Player position tracking outperforms video pixels for recognizing group activities in soccer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces SoccerNet-GAR, a dataset that pairs broadcast video with player tracking data for 87,939 group activities from the 64 matches of the 2022 football World Cup. It compares standard video classifiers with a new graph-based model that uses player positions and roles to recognize team activities. The position-based model reaches 77.8 percent balanced accuracy while the best video model reaches only 60.9 percent, and it trains with about 7 times fewer GPU hours and 479 times fewer parameters. A sympathetic reader would see this as evidence that explicit spatial tracking can be more effective than raw pixel data for understanding collective behavior in sports. The work establishes a benchmark for comparing these modalities directly.

Core claim

The authors introduce SoccerNet-GAR, a multimodal dataset built from the 64 matches of the 2022 World Cup, with synchronized broadcast video and player tracking for 87,939 group activities annotated with 10 categories. They define a unified evaluation protocol and propose a novel role-aware graph architecture for tracking-based GAR that encodes tactical structure through positional edges connecting players by their on-pitch roles. Their tracking model achieves 77.8% balanced accuracy compared to 60.9% for the best video baseline, while training with 7 times fewer GPU hours and 479 times fewer parameters.
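
The headline metric is balanced accuracy, which matters here because the classes are heavily skewed (the Figure 2 caption reports PASS at 63.3% of events and GOAL at 0.2%). The sketch below assumes the standard definition, the mean of per-class recall, and uses made-up toy labels; none of it comes from the paper's code, and `balanced_accuracy` is a hypothetical helper name.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (standard definition; assumed to match the paper's metric)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy, illustrative labels: 8 PASS events and 2 GOAL events.
y_true = ["PASS"] * 8 + ["GOAL"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["PASS"] * 10

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
```

On this toy input the majority-class predictor looks strong on plain accuracy but scores at chance level on balanced accuracy, which is why the latter is the more informative number on a skewed label set.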

What carries the argument

The role-aware graph neural network that directly encodes tactical structure by connecting players with edges based on their assigned on-pitch roles.
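
To make the idea of role-based edges concrete, here is a minimal sketch. The adjacency rule used (same-team players are linked when their roles are identical or neighbouring in the order goalkeeper, defender, midfielder, forward) is an assumption inferred from the Figure 3 caption; the paper's actual edge construction may differ. `Player`, `ROLE_ORDER`, and `role_edges` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations

# Role lines ordered from own goal outward (role names follow the Figure 3 caption).
ROLE_ORDER = ["goalkeeper", "defender", "midfielder", "forward"]

@dataclass(frozen=True)
class Player:
    pid: int
    team: str
    role: str
    x: float  # pitch coordinates (illustrative values)
    y: float

def role_edges(players):
    """Return sorted undirected edges (pid_a, pid_b) between same-team players
    whose roles are identical or adjacent in ROLE_ORDER.
    This rule is an assumption, not the paper's specification."""
    rank = {role: i for i, role in enumerate(ROLE_ORDER)}
    edges = set()
    for a, b in combinations(players, 2):
        if a.team != b.team:
            continue  # edges stay within a team
        if abs(rank[a.role] - rank[b.role]) <= 1:
            edges.add((min(a.pid, b.pid), max(a.pid, b.pid)))
    return sorted(edges)

players = [
    Player(1, "home", "goalkeeper", 5.0, 34.0),
    Player(2, "home", "defender", 20.0, 20.0),
    Player(3, "home", "midfielder", 45.0, 34.0),
    Player(4, "home", "forward", 70.0, 34.0),
    Player(5, "away", "forward", 30.0, 30.0),
]
edges = role_edges(players)
```

The resulting edge list could then be converted to whatever adjacency format a graph library expects; the node features would carry the pitch coordinates.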

If this is right

  • Tracking data can serve as a more efficient alternative to video for group activity recognition in team sports.
  • The inclusion of role information in graph models improves the capture of spatial interactions among agents.
  • Unified multimodal benchmarks like SoccerNet-GAR enable fair comparisons that reveal the strengths of each modality.
  • Compact position signals may reduce the need for resource-intensive video processing in sports analytics applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar position-based approaches could be tested in other domains like pedestrian crowd analysis or robotics swarm behaviors where spatial data is available.
  • Hybrid models that fuse video and tracking might yield further gains by combining visual cues with positional precision.
  • Practitioners in sports coaching could adopt tracking models for real-time activity monitoring due to their lower computational demands.

Load-bearing premise

The player tracking data is assumed to be accurate, complete, and perfectly synchronized with the broadcast video annotations without errors in role assignment or activity labeling.
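
One way to probe this premise is to summarize the temporal offset between each annotated event's video timestamp and its matched tracking timestamp. The sketch below is illustrative only: the timestamps, the `offset_stats` helper, and the 40 ms tolerance (one frame at an assumed 25 fps) are assumptions, not values from the paper.

```python
def offset_stats(video_ts, tracking_ts, tolerance_s=0.040):
    """Summarize absolute time offsets (in seconds) between paired timestamps."""
    if len(video_ts) != len(tracking_ts) or not video_ts:
        raise ValueError("need two non-empty, equal-length timestamp lists")
    offsets = [abs(v - t) for v, t in zip(video_ts, tracking_ts)]
    return {
        "mean": sum(offsets) / len(offsets),
        "max": max(offsets),
        # Number of events whose offset exceeds the assumed tolerance.
        "n_over_tolerance": sum(o > tolerance_s for o in offsets),
    }

# Hypothetical paired timestamps (seconds from kickoff) for four events.
video_ts = [10.0, 25.5, 61.25, 90.0]
tracking_ts = [10.0, 25.5, 61.5, 90.125]
stats = offset_stats(video_ts, tracking_ts)
```

A large maximum offset or a high count over tolerance would suggest that some tracking windows do not cover the annotated activity, which would weaken the modality comparison.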

What would settle it

Demonstrating that a video-based model trained on the aligned SoccerNet-GAR data can exceed 77.8% balanced accuracy, or revealing significant synchronization errors in the tracking data that alter activity labels, would undermine the superiority claim.

Figures

Figures reproduced from arXiv: 2511.12606 by Anthony Cioppa, Bernard Ghanem, Drishya Karki, Merey Ramazanova, Silvio Giancola.

Figure 1. Overview of Our Group Activity Recognition Benchmark. Broadcast video and agent tracking modalities are processed through modality-specific backbones. The resulting representations are evaluated for group activity recognition.

Figure 2. Class Distribution Across Each Split. The dataset is heavily skewed toward PASS (63.3%) while GOAL is rare (0.2%). The y-axis shows event counts; percentages denote proportions.

Figure 3. Positional Graph Representation at a Single Frame. Nodes represent players at their pitch coordinates, colored by tactical role (red: goalkeeper, blue: defenders, purple: midfielders, pink: forwards). Edges follow formation structure, connecting positionally adjacent roles within each team.

Figure 5. Performance vs. Number of Training Matches for Video and Tracking Modalities. Tracking achieves higher accuracies across all data regimes and plateaus earlier (≈ 35 matches), while video does not. The gap narrows from 15.8% (5 matches) to 8.7% (45 matches).

Original abstract

Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e. tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities leads to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the 64 matches of the football World Cup 2022. Specifically, the broadcast videos and player tracking modalities for 87,939 group activities are synchronized and annotated with 10 categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) competitive video-based classifiers and (ii) tracking-based classifiers leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges connecting players by their on-pitch roles. Our tracking model achieves 77.8% balanced accuracy compared to 60.9% for the best video baseline, while training with 7× less GPU hours and 479× fewer parameters (180K vs. 86.3M). This study provides new insights into the relative strengths of pixels and positions for group activity recognition in sports.
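
The parameter ratio and the accuracy gap can be checked directly from the figures quoted in the abstract. This is plain arithmetic on the reported numbers; since "180K" is itself rounded, the ratio is approximate.

```python
# Parameter counts as quoted in the abstract (180K vs. 86.3M).
video_params = 86.3e6
tracking_params = 180e3
param_ratio = video_params / tracking_params  # roughly 479

# Balanced accuracy figures as quoted in the abstract, in percent.
tracking_bal_acc = 77.8
video_bal_acc = 60.9
gap_points = tracking_bal_acc - video_bal_acc  # roughly 16.9 percentage points
```

Both results are consistent with the "479×" and "17-point" figures used elsewhere on this page.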

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SoccerNet-GAR, a multimodal dataset derived from the 64 matches of the 2022 FIFA World Cup, providing synchronized broadcast video and player tracking data for 87,939 group activities annotated with 10 categories. It defines a unified evaluation protocol and benchmarks competitive video-based classifiers against tracking-based classifiers that use a novel role-aware graph neural network to encode tactical structure via positional edges connecting players by on-pitch roles. The central empirical result is that the tracking model reaches 77.8% balanced accuracy compared to 60.9% for the best video baseline, while using 7× fewer GPU hours and 479× fewer parameters (180K vs. 86.3M).

Significance. If the performance gap and efficiency claims prove robust after verification of data quality and implementation details, the work would offer a valuable new benchmark for group activity recognition in sports, demonstrating that compact positional tracking signals can substantially outperform pixel-based video approaches in this domain. The creation of a large-scale synchronized multimodal dataset and the explicit resource comparisons are concrete strengths that could guide future modality-specific or hybrid research.

major comments (3)
  1. [§3] Dataset construction and synchronization: The manuscript must provide explicit details on the synchronization process between broadcast video and tracking data across all 87,939 instances, including quantitative checks for temporal alignment, missing players, or drift. Without this, systematic errors could invalidate the apples-to-apples modality comparison and contribute to the reported 17-point accuracy gap.
  2. [§4.1] Role-aware GNN architecture: The method for assigning on-pitch roles to players (e.g., fixed formation lookup, heuristic from positions, or separate annotation) is not sufficiently specified. If role labels correlate with the 10 activity categories or were derived using label information, the graph edges inject semantic structure unavailable to the video baselines, making the 77.8% vs. 60.9% result non-comparable and undermining the central modality claim.
  3. [§5] Experiments and baselines: Full specification of data splits, baseline re-implementations, hyperparameter search, and any post-hoc decisions is required to confirm that the balanced accuracy numbers and resource metrics (GPU hours, parameter counts) were obtained without leakage or unequal tuning. The current description leaves open the possibility that implementation differences, rather than modality, drive the gap.
minor comments (2)
  1. [Table 1] Resource table (Table 1 or equivalent): clarify whether the 7× GPU-hour and 479× parameter reductions cover only training or also inference and data preprocessing.
  2. [Figure 3] Graph visualization: add an explicit legend explaining edge types and role labels to improve readability of the role-aware architecture.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to provide the requested clarifications and specifications.

  1. Referee: [§3] Dataset construction and synchronization: The manuscript must provide explicit details on the synchronization process between broadcast video and tracking data across all 87,939 instances, including quantitative checks for temporal alignment, missing players, or drift. Without this, systematic errors could invalidate the apples-to-apples modality comparison and contribute to the reported 17-point accuracy gap.

    Authors: We agree that more explicit details on synchronization are needed to fully substantiate the modality comparison. In the revised manuscript we will expand §3 with a dedicated description of the synchronization pipeline, including how broadcast video timestamps are aligned to the official FIFA tracking data, quantitative alignment statistics (e.g., mean and max temporal offset across instances), and the handling of any drift or missing-player cases. These additions will confirm that systematic misalignment does not explain the observed performance difference. revision: yes

  2. Referee: [§4.1] Role-aware GNN architecture: The method for assigning on-pitch roles to players (e.g., fixed formation lookup, heuristic from positions, or separate annotation) is not sufficiently specified. If role labels correlate with the 10 activity categories or were derived using label information, the graph edges inject semantic structure unavailable to the video baselines, making the 77.8% vs. 60.9% result non-comparable and undermining the central modality claim.

    Authors: Roles are assigned via a deterministic heuristic that uses only player coordinates and standard soccer formation templates (goalkeeper, defenders, midfielders, forwards) and does not incorporate activity labels or any other semantic information from the ground-truth annotations. This choice is intended to reflect tactical structure that is naturally present in the tracking modality. We will revise §4.1 to include the precise assignment algorithm and an explicit statement that label information is never used, thereby preserving the fairness of the modality comparison. revision: yes

  3. Referee: [§5] Experiments and baselines: Full specification of data splits, baseline re-implementations, hyperparameter search, and any post-hoc decisions is required to confirm that the balanced accuracy numbers and resource metrics (GPU hours, parameter counts) were obtained without leakage or unequal tuning. The current description leaves open the possibility that implementation differences, rather than modality, drive the gap.

    Authors: We will augment §5 with complete experimental details: match-level train/validation/test splits that prevent temporal or team leakage, exact re-implementation settings and hyperparameter ranges for every baseline (including the search procedure), and the precise protocols used to measure GPU hours and parameter counts under identical hardware conditions. These additions will allow independent verification that the reported gap arises from the modalities themselves rather than from unequal tuning or implementation artifacts. revision: yes
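
A match-level split of the kind the authors describe in their third response can be sketched as follows. The event records, the `match_id` field, the `split_by_match` helper, and the split sizes are hypothetical; the point is only that every event from a given match lands in exactly one split.

```python
import random

def split_by_match(events, n_train, n_val, seed=0):
    """Assign whole matches to train/val/test so that no match spans two splits.
    Matches left over after train and val go to test."""
    match_ids = sorted({e["match_id"] for e in events})
    rng = random.Random(seed)
    rng.shuffle(match_ids)
    groups = {
        "train": set(match_ids[:n_train]),
        "val": set(match_ids[n_train:n_train + n_val]),
        "test": set(match_ids[n_train + n_val:]),
    }
    return {
        name: [e for e in events if e["match_id"] in ids]
        for name, ids in groups.items()
    }

# Toy data: 20 matches with 3 events each (labels are placeholders).
events = [{"match_id": m, "label": "PASS"} for m in range(20) for _ in range(3)]
splits = split_by_match(events, n_train=14, n_val=3)
matches_in = {name: {e["match_id"] for e in evs} for name, evs in splits.items()}
```

Splitting on the match identifier rather than on individual events is what prevents near-duplicate moments from the same game appearing on both sides of the train/test boundary.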

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation circularity

This is a dataset introduction and empirical benchmarking paper that measures performance of trained models on held-out test data for video vs. tracking modalities in group activity recognition. The central claims (77.8% vs 60.9% balanced accuracy, parameter/GPU efficiency) are direct experimental outcomes on the new SoccerNet-GAR dataset and do not reduce via any equations, fitted parameters renamed as predictions, or self-citation chains to the paper's own inputs. No self-definitional steps, uniqueness theorems, or ansatzes are present in the provided text; the role-aware GNN is described as a novel architecture but its performance is evaluated externally rather than derived tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the new dataset being representative and the graph model correctly capturing interactions via role-based edges; no free parameters are fitted specifically to produce the headline accuracy numbers beyond standard model training.

axioms (2)
  • Domain assumption: Player positions and trajectories provide sufficient signal to recognize coordinated group activities in soccer without pixel-level visual cues.
    Invoked when claiming the tracking modality is competitive or superior.
  • Domain assumption: Graph neural networks can effectively model spatial interactions when edges are defined by on-pitch roles.
    Core assumption behind the novel architecture.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy

    cs.CV 2026-05 unverdicted novelty 7.0

    SoccerLens benchmark shows state-of-the-art soccer VLMs achieve strong classification accuracy yet fail to exceed 50% grounding performance on annotated visual cues and underutilize temporal information.

  2. SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy

    cs.CV 2026-05 unverdicted novelty 7.0

    SoccerLens benchmark shows state-of-the-art soccer VLMs achieve high classification accuracy yet fail to exceed 50% visual grounding performance and underutilize temporal information.

  3. Towards Athlete Fatigue Assessment from Association Football Videos

    cs.CV 2026-04 unverdicted novelty 4.0

    Monocular broadcast videos can produce acceleration-speed profiles compatible with fatigue analysis in football, though sensitive to trajectory noise and calibration errors.
