arxiv: 2512.11016 · v2 · submitted 2025-12-11 · 💻 cs.CV · cs.AI

SoccerMaster: A Vision Foundation Model for Soccer Understanding

keywords: model, soccer, data, diverse, pretraining, SoccerMaster, tasks, understanding

Soccer understanding has recently attracted growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work proposes a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection and identification) to high-level semantic reasoning (e.g., event classification). Concretely, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline, SoccerFactory, to generate scalable spatial annotations, and integrate multiple existing soccer video datasets into a comprehensive data resource for multi-task pretraining; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy

    cs.CV 2026-05 unverdicted novelty 7.0

    SoccerLens benchmark shows state-of-the-art soccer VLMs achieve strong classification accuracy yet fail to exceed 50% grounding performance on annotated visual cues and underutilize temporal information.

  2. Towards Temporal Compositional Reasoning in Long-Form Sports Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
