Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
Pith reviewed 2026-05-17 21:41 UTC · model grok-4.3
The pith
Player position tracking outperforms video pixels for recognizing group activities in soccer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce SoccerNet-GAR, a multimodal dataset built from the 64 matches of the 2022 World Cup, with synchronized broadcast video and player tracking for 87,939 group activities annotated with 10 categories. They define a unified evaluation protocol and propose a novel role-aware graph architecture for tracking-based GAR that encodes tactical structure through positional edges connecting players by their on-pitch roles. Their tracking model achieves 77.8% balanced accuracy compared to 60.9% for the best video baseline, while training with 7 times fewer GPU hours and 479 times fewer parameters.
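The headline numbers are balanced accuracies. The text here does not define the metric, so the sketch below assumes the usual meaning: the unweighted mean of per-class recall. The class names and labels are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy, made-up labels: "pass" dominates and "shot" is rare.
y_true = ["pass"] * 8 + ["shot"] * 2
y_pred = ["pass"] * 8 + ["pass", "shot"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

In this toy case plain accuracy looks high because the frequent class is always right, while balanced accuracy drops because half of the rare class is missed. That is the reason such a metric is a natural choice when the 10 categories are unevenly represented.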
What carries the argument
The role-aware graph neural network that directly encodes tactical structure by connecting players with edges based on their assigned on-pitch roles.
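As a rough illustration of what "edges based on roles" can mean, the sketch below builds an undirected edge list from per-player role labels. The role names (`GK`, `DEF`, `MID`, `FWD`) and the linking rule in `LINKED` are assumptions chosen for the example; the paper's actual edge construction is not specified in the text reviewed here.

```python
from itertools import combinations

def role_edges(roles, linked_roles):
    """Return undirected edges (i, j) between players whose roles are linked.

    roles: list of role strings, one per player (index = player id).
    linked_roles: set of frozensets naming which role pairs get an edge.
    """
    edges = []
    for i, j in combinations(range(len(roles)), 2):
        if frozenset((roles[i], roles[j])) in linked_roles:
            edges.append((i, j))
    return edges

# Hypothetical rule: connect players on the same line and on adjacent lines.
LINKED = {
    frozenset(("GK", "DEF")),
    frozenset(("DEF",)),         # defender to defender
    frozenset(("DEF", "MID")),
    frozenset(("MID",)),         # midfielder to midfielder
    frozenset(("MID", "FWD")),
    frozenset(("FWD",)),         # forward to forward
}

roles = ["GK", "DEF", "DEF", "MID", "FWD"]
print(role_edges(roles, LINKED))
# [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
```

The point of the sketch is only that the graph topology comes from roles rather than from distance thresholds or a fully connected layout.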
If this is right
- Tracking data can serve as a more efficient alternative to video for group activity recognition in team sports.
- The inclusion of role information in graph models improves the capture of spatial interactions among agents.
- Unified multimodal benchmarks like SoccerNet-GAR enable fair comparisons that reveal the strengths of each modality.
- Compact position signals may reduce the need for resource-intensive video processing in sports analytics applications.
Where Pith is reading between the lines
- Similar position-based approaches could be tested in other domains like pedestrian crowd analysis or robotics swarm behaviors where spatial data is available.
- Hybrid models that fuse video and tracking might yield further gains by combining visual cues with positional precision.
- Practitioners in sports coaching could adopt tracking models for real-time activity monitoring due to their lower computational demands.
Load-bearing premise
The player tracking data is assumed to be accurate, complete, and perfectly synchronized with the broadcast video annotations without errors in role assignment or activity labeling.
What would settle it
Demonstrating that a video-based model trained on the aligned SoccerNet-GAR data can exceed 77.8% balanced accuracy, or revealing significant synchronization errors in the tracking data that alter activity labels, would undermine the superiority claim.
Original abstract
Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e. tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities leads to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the $64$ matches of the football World Cup 2022. Specifically, the broadcast videos and player tracking modalities for $87{,}939$ group activities are synchronized and annotated with $10$ categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) competitive video-based classifiers and (ii) tracking-based classifiers leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges connecting players by their on-pitch roles. Our tracking model achieves $77.8\%$ balanced accuracy compared to $60.9\%$ for the best video baseline, while training with $7 \times$ less GPU hours and $479 \times$ fewer parameters ($180K$ vs. $86.3M$). This study provides new insights into the relative strengths of pixels and positions for group activity recognition in sports.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SoccerNet-GAR, a multimodal dataset derived from 64 matches of the 2022 FIFA World Cup, providing synchronized broadcast video and player tracking data for 87,939 group activities annotated with 10 categories. It defines a unified evaluation protocol and benchmarks competitive video-based classifiers against tracking-based classifiers that use a novel role-aware graph neural network to encode tactical structure via positional edges connecting players by on-pitch roles. The central empirical result is that the tracking model reaches 77.8% balanced accuracy compared to 60.9% for the best video baseline, while using 7× fewer GPU hours and 479× fewer parameters (180K vs. 86.3M).
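The reported ratios can be checked against the quoted figures with a few lines of arithmetic. The sketch uses only numbers stated above; the variable names are illustrative. The 7× GPU-hour ratio is not checked because the underlying hour counts for both models are not quoted in the summary.

```python
video_params = 86.3e6      # best video baseline, as quoted
tracking_params = 180e3    # tracking model, as quoted

param_ratio = video_params / tracking_params
accuracy_gap = 77.8 - 60.9  # balanced accuracy, percentage points

print(round(param_ratio, 1))   # 479.4
print(round(accuracy_gap, 1))  # 16.9
```

Both values agree with the "479×" parameter claim and with the roughly 17-point gap cited in the referee comments.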
Significance. If the performance gap and efficiency claims prove robust after verification of data quality and implementation details, the work would offer a valuable new benchmark for group activity recognition in sports, demonstrating that compact positional tracking signals can substantially outperform pixel-based video approaches in this domain. The creation of a large-scale synchronized multimodal dataset and the explicit resource comparisons are concrete strengths that could guide future modality-specific or hybrid research.
major comments (3)
- [§3] §3 (Dataset construction and synchronization): The manuscript must provide explicit details on the synchronization process between broadcast video and tracking data across all 87,939 instances, including quantitative checks for temporal alignment, missing players, or drift. Without this, systematic errors could invalidate the apples-to-apples modality comparison and contribute to the reported 17-point accuracy gap.
- [§4.1] §4.1 (Role-aware GNN architecture): The method for assigning on-pitch roles to players (e.g., fixed formation lookup, heuristic from positions, or separate annotation) is not sufficiently specified. If role labels correlate with the 10 activity categories or were derived using label information, the graph edges inject semantic structure unavailable to the video baselines, making the 77.8% vs. 60.9% result non-comparable and undermining the central modality claim.
- [§5] §5 (Experiments and baselines): Full specification of data splits, baseline re-implementations, hyperparameter search, and any post-hoc decisions is required to confirm that the balanced accuracy numbers and resource metrics (GPU hours, parameter counts) were obtained without leakage or unequal tuning. The current description leaves open the possibility that implementation differences, rather than modality, drive the gap.
minor comments (2)
- [Table 1] Table 1 or equivalent resource table: clarify whether the 7× GPU hours and 479× parameter reductions include only training or also inference and data preprocessing.
- [Figure 3] Figure 3 (graph visualization): add explicit legend explaining edge types and role labels to improve readability of the role-aware architecture.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to provide the requested clarifications and specifications.
-
Referee: [§3] §3 (Dataset construction and synchronization): The manuscript must provide explicit details on the synchronization process between broadcast video and tracking data across all 87,939 instances, including quantitative checks for temporal alignment, missing players, or drift. Without this, systematic errors could invalidate the apples-to-apples modality comparison and contribute to the reported 17-point accuracy gap.
Authors: We agree that more explicit details on synchronization are needed to fully substantiate the modality comparison. In the revised manuscript we will expand §3 with a dedicated description of the synchronization pipeline, including how broadcast video timestamps are aligned to the official FIFA tracking data, quantitative alignment statistics (e.g., mean and max temporal offset across instances), and the handling of any drift or missing-player cases. These additions will confirm that systematic misalignment does not explain the observed performance difference. revision: yes
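A minimal sketch of the kind of alignment statistic mentioned in this response (mean and maximum temporal offset), assuming each instance has one video timestamp and one tracking timestamp in seconds. The function name and the sample values are hypothetical; the actual pipeline is not described in the text.

```python
from statistics import mean

def offset_stats(video_ts, tracking_ts):
    """Mean and max absolute offset (seconds) between paired timestamps."""
    if len(video_ts) != len(tracking_ts):
        raise ValueError("timestamp lists must have the same length")
    offsets = [abs(v - t) for v, t in zip(video_ts, tracking_ts)]
    return mean(offsets), max(offsets)

# Made-up timestamps for three annotated instances.
video_ts = [10.0, 20.5, 30.25]
tracking_ts = [10.0, 20.0, 30.5]

mean_offset, max_offset = offset_stats(video_ts, tracking_ts)
print(mean_offset, max_offset)  # 0.25 0.5
```

Reporting these two numbers over all 87,939 instances, possibly per match, would let a reader judge whether drift is large relative to the duration of an activity.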
-
Referee: [§4.1] §4.1 (Role-aware GNN architecture): The method for assigning on-pitch roles to players (e.g., fixed formation lookup, heuristic from positions, or separate annotation) is not sufficiently specified. If role labels correlate with the 10 activity categories or were derived using label information, the graph edges inject semantic structure unavailable to the video baselines, making the 77.8% vs. 60.9% result non-comparable and undermining the central modality claim.
Authors: Roles are assigned via a deterministic heuristic that uses only player coordinates and standard soccer formation templates (goalkeeper, defenders, midfielders, forwards) and does not incorporate activity labels or any other semantic information from the ground-truth annotations. This choice is intended to reflect tactical structure that is naturally present in the tracking modality. We will revise §4.1 to include the precise assignment algorithm and an explicit statement that label information is never used, thereby preserving the fairness of the modality comparison. revision: yes
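To make "a deterministic heuristic that uses only player coordinates and standard soccer formation templates" concrete, here is one possible instance. It is an assumption-laden sketch, not the authors' algorithm: it supposes the team attacks toward increasing x, treats the deepest player as goalkeeper, and fills a hypothetical 4-4-2 template by depth.

```python
def assign_roles(positions, formation=(4, 4, 2)):
    """Assign GK/DEF/MID/FWD from coordinates only.

    positions: list of (x, y) for one team's players, attacking toward +x.
    formation: hypothetical template (defenders, midfielders, forwards).
    Returns a list of role strings aligned with positions.
    """
    if len(positions) != 1 + sum(formation):
        raise ValueError("formation does not match number of players")
    # Rank players by depth; sorted() is stable, so ties resolve by index.
    order = sorted(range(len(positions)), key=lambda i: positions[i][0])
    labels = (["GK"] + ["DEF"] * formation[0]
              + ["MID"] * formation[1] + ["FWD"] * formation[2])
    roles = [None] * len(positions)
    for rank, player in enumerate(order):
        roles[player] = labels[rank]
    return roles

# Made-up pitch coordinates in metres for 11 players.
positions = [(75, 30), (5, 34), (20, 10), (22, 25), (21, 43), (23, 58),
             (45, 12), (50, 28), (48, 40), (47, 55), (70, 38)]
print(assign_roles(positions))
# ['FWD', 'GK', 'DEF', 'DEF', 'DEF', 'DEF', 'MID', 'MID', 'MID', 'MID', 'FWD']
```

No activity label enters the function, which is the property the referee asks the authors to demonstrate for their own procedure.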
-
Referee: [§5] §5 (Experiments and baselines): Full specification of data splits, baseline re-implementations, hyperparameter search, and any post-hoc decisions is required to confirm that the balanced accuracy numbers and resource metrics (GPU hours, parameter counts) were obtained without leakage or unequal tuning. The current description leaves open the possibility that implementation differences, rather than modality, drive the gap.
Authors: We will augment §5 with complete experimental details: match-level train/validation/test splits that prevent temporal or team leakage, exact re-implementation settings and hyperparameter ranges for every baseline (including the search procedure), and the precise protocols used to measure GPU hours and parameter counts under identical hardware conditions. These additions will allow independent verification that the reported gap arises from the modalities themselves rather than from unequal tuning or implementation artifacts. revision: yes
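A match-level split can be sketched as follows: whole matches, not individual activity instances, are assigned to train, validation, or test. The 70/15/15 fractions and the seed are placeholders, not the authors' settings. The sketch covers only match-level grouping; it does not address team-level leakage, where the same team appears in matches on both sides of a split.

```python
import random

def split_by_match(match_ids, seed=0, fractions=(0.7, 0.15, 0.15)):
    """Assign whole matches to train/val/test so no match spans two splits."""
    unique = sorted(set(match_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = int(len(unique) * fractions[0])
    n_val = int(len(unique) * fractions[1])
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),  # remainder goes to test
    }

# 64 matches, identified here by hypothetical integer ids 0..63.
splits = split_by_match(range(64))
print({name: len(ids) for name, ids in splits.items()})
# {'train': 44, 'val': 9, 'test': 11}
```

Each activity instance then inherits the split of its match, so clips from the same game never appear in both training and evaluation.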
Circularity Check
Empirical benchmark study with no derivation circularity
This is a dataset introduction and empirical benchmarking paper that measures performance of trained models on held-out test data for video vs. tracking modalities in group activity recognition. The central claims (77.8% vs 60.9% balanced accuracy, parameter/GPU efficiency) are direct experimental outcomes on the new SoccerNet-GAR dataset and do not reduce via any equations, fitted parameters renamed as predictions, or self-citation chains to the paper's own inputs. No self-definitional steps, uniqueness theorems, or ansatzes are present in the provided text; the role-aware GNN is described as a novel architecture but its performance is evaluated externally rather than derived tautologically.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Player positions and trajectories provide sufficient signal to recognize coordinated group activities in soccer without pixel-level visual cues.
- domain assumption: Graph neural networks can effectively model spatial interactions when edges are defined by on-pitch roles.
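For readers unfamiliar with how a graph neural network uses such edges, the sketch below shows one GIN-style aggregation step, in which each node adds its neighbours' features to its own. GIN is named in a table row quoted later on this page; whether the best-performing model uses it is not stated here. The learned MLP that normally follows the sum is omitted, and the features and edges are made up.

```python
def gin_aggregate(features, edges, eps=0.0):
    """One GIN-style aggregation: (1 + eps) * h_v + sum of neighbour h_u.

    features: list of per-node feature vectors (lists of floats).
    edges: undirected edges as (i, j) index pairs.
    """
    dim = len(features[0])
    out = [[(1.0 + eps) * x for x in h] for h in features]
    for i, j in edges:
        # Undirected edge: each endpoint receives the other's features.
        for d in range(dim):
            out[i][d] += features[j][d]
            out[j][d] += features[i][d]
    return out

# Three players with 2-D made-up features, connected in a chain 0-1-2.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
edges = [(0, 1), (1, 2)]
print(gin_aggregate(features, edges))
# [[1.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
```

With role-defined edges, the choice of which players exchange information in this step is fixed by tactical structure rather than learned or distance-based.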
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges connecting players by their on-pitch roles"
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Tracking (GIN + Attention + Pos) 197K 67.2% 4 GPU hours"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
SoccerLens benchmark shows state-of-the-art soccer VLMs achieve strong classification accuracy yet fail to exceed 50% grounding performance on annotated visual cues and underutilize temporal information.
-
SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
SoccerLens benchmark shows state-of-the-art soccer VLMs achieve high classification accuracy yet fail to exceed 50% visual grounding performance and underutilize temporal information.
-
Towards Athlete Fatigue Assessment from Association Football Videos
Monocular broadcast videos can produce acceleration-speed profiles compatible with fatigue analysis in football, though sensitive to trajectory noise and calibration errors.