SoccerMaster: A Vision Foundation Model for Soccer Understanding
2 Pith papers cite this work.
abstract
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection and identification) to high-level semantic reasoning (e.g., event classification). Concretely, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline, SoccerFactory, to generate scalable spatial annotations, and integrate multiple existing soccer video datasets as a comprehensive pretraining data resource for multi-task pretraining; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.
citation-role summary
citation-polarity summary
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: background (2)
polarities: background (2)
representative citing papers
SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
citing papers explorer
- SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
  SoccerLens benchmark shows state-of-the-art soccer VLMs achieve high classification accuracy yet fail to exceed 50% visual grounding performance and underutilize temporal information.
- Towards Temporal Compositional Reasoning in Long-Form Sports Videos
  SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.