pith. machine review for the scientific record.

arxiv: 2604.23392 · v1 · submitted 2026-04-25 · 💻 cs.AI

Recognition: unknown

SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords soccer refereeing · multi-agent systems · multimodal large language models · cross-modal retrieval · AI in sports · explainable AI · video understanding · knowledge base
0 comments

The pith

Multi-agent AI framework outperforms general models at soccer refereeing decisions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SoccerRef-Agents, a multi-agent decision-making framework for soccer refereeing that integrates analysis of foul videos with a knowledge base drawn from the laws of the game. It targets the shortcoming of prior AI tools that perform only isolated video perception without rule-based reasoning about fouls. The authors build a benchmark containing over 1,200 referee theory questions and 600 foul clips, a vector knowledge base of regulations and cases, and an architecture in which agents collaborate through cross-modal retrieval-augmented generation. If the approach works, it produces decisions and explanations that are more accurate and grounded than those from general-purpose multimodal large language models. A reader would care because consistent, transparent AI assistance could support fairer officiating in a sport where small judgment errors affect outcomes.
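Read mechanically, the described division of labor reduces to an orchestration loop: specialist agents produce partial analyses and a chief agent fuses them into a grounded decision. A minimal Python sketch, assuming hypothetical agent names and signatures (the paper's actual interfaces are not reproduced here):

```python
from dataclasses import dataclass

# Illustrative sketch only: names, signatures, and the toy keyword
# retrieval are guesses, not the authors' released API.

@dataclass
class Decision:
    verdict: str
    explanation: str

def video_agent(clip_id: str) -> str:
    # Stand-in for an MLLM converting the clip into a textual analysis.
    return "defender pulling the attacker's shirt inside the penalty area"

def rule_agent(analysis: str, laws: dict) -> str:
    # Toy retrieval: pick the law whose keywords overlap the analysis most
    # (the paper's system uses a vector knowledge base instead).
    best = max(laws, key=lambda key: sum(w in analysis for w in key.split()))
    return laws[best]

def chief_referee(analysis: str, law: str) -> Decision:
    # Fuse the visual analysis with the retrieved law into a decision.
    return Decision(verdict="penalty kick",
                    explanation=f"{analysis}. Grounded in: {law}")

laws = {
    "holding pulling shirt": "Law 12: holding an opponent is a direct free kick offence",
    "offside position": "Law 11: offside position penalised only in active play",
}
analysis = video_agent("clip_0042")
decision = chief_referee(analysis, rule_agent(analysis, laws))
```

The point of the sketch is the data flow, not the components: each agent's output is plain text that the next stage can condition on.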

Core claim

The authors claim that a multi-agent architecture collaborating via cross-modal RAG to link visual foul content with the Laws of the Game and a case database bridges the semantic gap between video and regulatory texts, and that evaluations on SoccerRefBench show this yields significantly higher decision accuracy and explanation quality than general-purpose MLLMs.

What carries the argument

The multi-agent architecture that collaborates via cross-modal RAG to bridge the semantic gap between visual content and regulatory texts.
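That cross-modal bridge is, concretely, retrieval with the video agent's textual output as the query over an index of rule passages. A toy sketch using bag-of-words cosine similarity as a stand-in for the (unnamed) embedding model:

```python
from collections import Counter
from math import sqrt

# Bag-of-words cosine is only a stand-in for a learned text embedding;
# the rule texts below are paraphrases for illustration.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list, k: int = 1) -> list:
    # Rank rule passages by similarity to the video analysis text.
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

laws = [
    "Law 12: holding an opponent is penalised with a direct free kick",
    "Law 11: a player in an offside position is penalised only when involved in active play",
]
analysis = "the defender is holding the attacker by the shirt"
top = retrieve(analysis, laws)
```

Everything downstream depends on this step returning the right passage; the referee report's second major comment is precisely about when it does not.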

If this is right

  • The system handles complex foul scenarios by combining video perception with explicit rule knowledge rather than relying on model intuition alone.
  • Decisions come with explanations directly tied to the official laws and precedent cases.
  • The released benchmark and knowledge base provide a standard testbed for comparing future AI refereeing methods.
  • The design suggests that domain-specific agent specialization plus retrieval can lift performance on tasks requiring both seeing and rule application.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern of agents plus cross-modal retrieval from a rules database could be tested on refereeing tasks in other sports such as basketball or rugby.
  • Live-match trials would reveal whether the accuracy gains survive real-time constraints and variable camera conditions not present in the benchmark clips.
  • The benchmark enables head-to-head testing of new models against this multi-agent baseline on the same foul scenarios.

Load-bearing premise

The multi-agent architecture with cross-modal RAG can reliably bridge the semantic gap between visual foul content and regulatory texts without introducing new errors or hallucinations.

What would settle it

A new collection of foul video clips on which the system produces decisions or explanations that conflict with expert human referee judgments on a substantial fraction of cases.

Figures

Figures reproduced from arXiv: 2604.23392 by Gang Chen, Jiayuan Rao, Wanli Song, Yi Hu, Zi Meng.

Figure 1
Figure 1. Overview of SoccerRef-Agents. The system mimics a professional officiating team by decomposing the task into perception (Video Agent), background analysis (Context Agent), legal interpretation (Rule Agent), and precedent retrieval (Case Agent), culminating in a final decision by the Chief Referee Agent. view at source ↗
Figure 2
Figure 2. Overview of the data collection pipeline for SoccerRefBench. The dataset integrates 1,218 textual theory questions from diverse international sources and 600 video-based judgment scenarios derived from the SoccerNet-MVFoul dataset; the raw data are curated into a unified multiple-choice question format. view at source ↗
Figure 3
Figure 3. Overview of the system's dual-pathway reasoning workflow. Depending on the input modality identified by the Router, the system executes either the Text-Mode or Video-Mode pipeline. The Text-Mode Pipeline directly utilizes the input query for retrieval. The Video-Mode Pipeline features a specialized Cross-Modal RAG mechanism, where the Video Agent converts visual information into a textual Video Analysis. view at source ↗
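The routing logic this caption describes can be sketched in a few lines (function names are illustrative, not the authors' code):

```python
# Hypothetical sketch of the dual-pathway dispatch: the router inspects
# the input modality and picks the matching pipeline.

def video_agent(clip: str) -> str:
    # Video-Mode cross-modal step: turn visual content into a textual
    # analysis that can serve as a retrieval query.
    return f"textual analysis of {clip}"

def retrieve(query: str) -> str:
    # Stand-in for retrieval over the rule/case knowledge base.
    return f"passages retrieved for: {query}"

def route(query: dict) -> str:
    if "video" in query:
        # Video-Mode: convert the clip to text before retrieval.
        return retrieve(video_agent(query["video"]))
    # Text-Mode: use the query text directly.
    return retrieve(query["text"])
```

The design choice the figure highlights is that both pathways converge on the same text-based retrieval, which is what lets one knowledge base serve both modalities.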
Figure 4
Figure 4. Qualitative examples of SoccerRef-Agents on SoccerRefBench. The figure illustrates the step-by-step reasoning chains for both textual (left two columns) and video-based (right two columns) queries. Key intermediate outputs from the Rule Agent, Case Agent, and Video Agent are explicitly logged in agent_traces, providing legally grounded justifications for the final decision. view at source ↗
Figure 5
Figure 5. Human Evaluation Interface view at source ↗
Figure 6
Figure 6. Human Evaluation Interface view at source ↗
Figure 7
Figure 7. Extended qualitative results. "A defender begins pulling an attacker outside the penalty area but continues the pulling until they are inside the penalty area. What is the decision?" A) Award a penalty kick B) Award a direct free kick at the spot where the pulling started C) Award an indirect free kick inside the penalty area D) Allow play to continue. Prediction: Award a penalty kick. Explanation: According … view at source ↗
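The example above follows the benchmark's unified multiple-choice format. A hypothetical schema and exact-match scorer (field names are guesses, not the released data format):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical SoccerRefBench entry layout; the released format may differ.

@dataclass
class BenchEntry:
    question: str
    options: dict                # option id -> option text
    answer: str                  # gold option id
    video: Optional[str] = None  # clip path for video-based entries

entry = BenchEntry(
    question="A defender pulls an attacker from outside into the penalty "
             "area and the pulling continues inside it. What is the decision?",
    options={"A": "Award a penalty kick",
             "B": "Direct free kick where the pulling started",
             "C": "Indirect free kick inside the penalty area",
             "D": "Allow play to continue"},
    answer="A",
)

def accuracy(preds: dict, entries: list) -> float:
    # Exact-match decision accuracy over gold option ids.
    hits = sum(preds[e.question] == e.answer for e in entries)
    return hits / len(entries)
```

A unified option-id format like this is what makes decision accuracy directly comparable across text-only and video-based queries.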
Figure 8
Figure 8. Figure 8: Extended qualitative results view at source ↗
read the original abstract

Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI-assisted approaches remain preliminary. Existing research mostly focuses on isolated video perception tasks and lacks the ability to understand and reason about foul scenarios. To fill this gap, we propose SoccerRef-Agents, a holistic and explainable multi-agent decision-making framework for soccer refereeing. The main contributions are: (i) constructing the multimodal benchmark SoccerRefBench with over 1,200 referee theory questions and 600 foul video clips; (ii) building a vector-based knowledge base RefKnowledgeDB using the latest "Laws of the Game" and a classic case database for precise, knowledge-driven reasoning; (iii) designing a novel multi-agent architecture that collaborates via cross-modal RAG to bridge the semantic gap between visual content and regulatory texts. This work explores the technical capability of integrating MLLMs with refereeing expertise, and evaluations show our system significantly outperforms general-purpose MLLMs in decision accuracy and explanation quality. All databases, benchmarks, and code will be made available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SoccerRef-Agents, a multi-agent decision-making framework for soccer refereeing that integrates MLLMs with a vector-based knowledge base (RefKnowledgeDB) of the Laws of the Game and case examples via cross-modal RAG. It introduces the SoccerRefBench benchmark containing over 1,200 referee theory questions and 600 foul video clips. The central claim is that this architecture enables more accurate and explainable foul decisions than general-purpose MLLMs, with all resources to be released publicly.

Significance. If the empirical claims hold under rigorous scrutiny, the work would advance AI applications in sports officiating by moving beyond isolated perception tasks to holistic, knowledge-grounded reasoning. The construction and planned release of a dedicated multimodal benchmark and regulatory knowledge base represent concrete contributions that could support reproducible follow-on research in multimodal agents and domain-specific RAG.

major comments (2)
  1. [Abstract and Experiments section] The assertion of significant outperformance in decision accuracy and explanation quality over general-purpose MLLMs is presented without any description of the evaluation protocol, specific baseline models, metrics for explanation quality, statistical tests, or error analysis across the 600 foul clips. This information is load-bearing for the central claim that the multi-agent cross-modal RAG architecture delivers net gains.
  2. [Section 4 (Multi-agent architecture)] The cross-modal RAG mechanism is described at a high level but provides no analysis or ablation of retrieval performance on ambiguous foul semantics (e.g., intent or subtle contact cases common in the benchmark). Without evidence that retrieval errors are not amplified by agent collaboration, the claim that the system reliably bridges visual content to regulatory text remains unsubstantiated.
minor comments (1)
  1. [Abstract] The abstract uses approximate figures ('over 1,200' questions, '600' clips); exact counts and any filtering criteria should be stated in the benchmark description section for precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional clarity and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns.

read point-by-point responses
  1. Referee: [Abstract and Experiments section] The assertion of significant outperformance in decision accuracy and explanation quality over general-purpose MLLMs is presented without any description of the evaluation protocol, specific baseline models, metrics for explanation quality, statistical tests, or error analysis across the 600 foul clips. This information is load-bearing for the central claim that the multi-agent cross-modal RAG architecture delivers net gains.

    Authors: We agree that the abstract is high-level and that the Experiments section would benefit from more explicit description of the full evaluation protocol, including the precise list of baseline MLLMs, the human-evaluation rubric for explanation quality, and the statistical significance testing. We will revise the abstract to summarize the protocol and key metrics in one sentence. In the Experiments section we will add a dedicated error-analysis subsection with per-category breakdowns across the 600 clips and report the results of paired statistical tests. These changes will make the supporting evidence for the central claim fully transparent. revision: yes

  2. Referee: [Section 4 (Multi-agent architecture)] The cross-modal RAG mechanism is described at a high level but provides no analysis or ablation of retrieval performance on ambiguous foul semantics (e.g., intent or subtle contact cases common in the benchmark). Without evidence that retrieval errors are not amplified by agent collaboration, the claim that the system reliably bridges visual content to regulatory text remains unsubstantiated.

    Authors: We acknowledge that the current description of the cross-modal RAG component lacks quantitative ablation on retrieval quality for ambiguous cases. In the revised manuscript we will insert a new subsection in Section 4 that reports retrieval precision/recall on a manually annotated subset of ambiguous foul clips (intent, subtle contact, etc.). We will also present an ablation comparing end-to-end decision accuracy with and without the RAG module, together with a qualitative analysis of how the referee-agent verification step mitigates retrieval errors. This will directly substantiate the reliability claim. revision: yes

Circularity Check

0 steps flagged

Empirical system proposal with no derivation chain

full rationale

The paper describes an applied multi-agent architecture for soccer refereeing, including construction of a benchmark (SoccerRefBench), a knowledge base (RefKnowledgeDB), and a cross-modal RAG collaboration mechanism, followed by empirical evaluations against general MLLMs. No equations, parameter fitting, predictions, or formal derivations are present that could reduce to inputs by construction. All load-bearing claims rest on experimental accuracy and explanation quality metrics rather than self-referential logic or self-citation chains, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that current MLLMs plus RAG can perform reliable rule-based reasoning when supplied with curated soccer knowledge; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption Multimodal large language models can meaningfully interpret soccer video clips when augmented with retrieved rule text
    Invoked by the use of MLLMs and cross-modal RAG for foul analysis
  • domain assumption The compiled Laws of the Game and case database constitute sufficient and accurate knowledge for referee decisions
    Basis for constructing RefKnowledgeDB
invented entities (1)
  • SoccerRef-Agents multi-agent architecture (no independent evidence)
    purpose: Collaborative decision-making via cross-modal RAG for foul scenarios
    New system proposed by the authors

pith-pipeline@v0.9.0 · 5508 in / 1342 out tokens · 22998 ms · 2026-05-08T08:00:19.230794+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    https://www.anthropic.com (2025)

    Anthropic: Claude 4.5 sonnet. https://www.anthropic.com (2025)

  2. [2]

    Journal of the Operational Research Society 74(7), 1690–1711 (2023)

    Bai, L., Gedik, R., Egilmez, G.: What does it take to win or lose a soccer game? A machine learning approach to understand the impact of game and team statistics. Journal of the Operational Research Society 74(7), 1690–1711 (2023)

  3. [3]

    Hallucination of Multimodal Large Language Models: A Survey

    Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)

  4. [4]

    ArXiv abs/2409.10587 (2024), https://api.semanticscholar.org/CorpusID:272693834

    Cioppa, A., Giancola, S., Somers, V., Joos, V., Magera, F., Held, J., Ghasemzadeh, S.A., Zhou, X., Seweryn, K., Kowalczyk, M., Mróz, Z., Lukasik, S., Haloń, M., Mkhallati, H., Deliège, A., Hinojosa, C., Sanchez, K., Mansourian, A.M., Miralles, P., Barnich, O., Vleeschouwer, C.D., Alahi, A., Ghanem, B., Droogenbroeck, M.V., Gorski, A., Clapés, A., Boiaro...

  5. [5]

    Mirage: Assessing hallucination in multimodal reasoning chains of MLLM. arXiv preprint arXiv:2505.24238, 2025

    Dong, B., Ni, M., Huang, Z., Yang, G., Zuo, W., Zhang, L.: Mirage: Assessing hallucination in multimodal reasoning chains of MLLM. arXiv preprint arXiv:2505.24238 (2025)

  6. [6]

    In: 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC)

    Elmiligi, H., Saad, S.: Predicting the outcome of soccer matches using machine learning and statistical analysis. In: 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC). pp. 1–8. IEEE (2022)

  7. [7]

    Inside FIFA (July 17 2023), https://inside.fifa.com/innovation/world-cup-2022/semi-automated-offside-technology, accessed: 2026-01-13

    FIFA: Semi-automated offside technology. Inside FIFA (July 17 2023), https://inside.fifa.com/innovation/world-cup-2022/semi-automated-offside-technology, accessed: 2026-01-13

  8. [8]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24108–24118 (2025)

  9. [9]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops

    Giancola, S., Amine, M., Dghaily, T., Ghanem, B.: Soccernet: A scalable dataset for action spotting in soccer videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 1711–1721 (2018)

  10. [10]

    ArXiv abs/2508.19182 (2025), https://api.semanticscholar.org/CorpusID:280870241

    Giancola, S., Cioppa, A., Gutiérrez-Pérez, M., Held, J., Hinojosa, C., Joos, V., Leduc, A., Magera, F., Sanchez, K., Somers, V., Xarles, A., Agudo, A., Alahi, A., Barnich, O., Clapés, A., Vleeschouwer, C.D., Escalera, S., Ghanem, B., Moeslund, T.B., Droogenbroeck, M.V., Abe, T., Alotaibi, S.G., Altawijri, F.S., Araujo, S., Bai, X., Bi, X., Cao, J., Chao, V., Czar...

  11. [11]

    https://deepmind.google (2025)

    Google: Gemini 2.5 flash. https://deepmind.google (2025)

  12. [12]

    International Journal of Operations Management 2(3), 7–15 (2022)

    Gottschalk, C., Tewes, S., Niestroj, B., Jäger, C., Drees, J., Ernst, A.: Innovation in elite refereeing through AI technological support for DOGSO decisions. International Journal of Operations Management 2(3), 7–15 (2022)

  13. [13]

    arXiv preprint arXiv:2406.08407 (2024)

    He, X., Feng, W., Zheng, K., Lu, Y., Zhu, W., Li, J., Fan, Y., Wang, J., Li, L., Yang, Z., et al.: Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos. arXiv preprint arXiv:2406.08407 (2024)

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Held, J., Cioppa, A., Giancola, S., Hamdi, A., Ghanem, B., Van Droogenbroeck, M.: Vars: Video assistant referee system for automated soccer decision making from multiple views. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5086–5097 (2023)

  15. [15]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Held, J., Itani, H., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: X-vars: Introducing explainability in football refereeing with multi-modal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3267–3279 (2024)

  16. [16]

    ACM Computing Surveys 57(8), 1–36 (2025)

    Kuang, J., Shen, Y., Xie, J., Luo, H., Xu, Z., Li, R., Li, Y., Cheng, X., Lin, X., Han, Y.: Natural language understanding and inference with MLLM in visual question answering: A survey. ACM Computing Surveys 57(8), 1–36 (2025)

  17. [17]

    Frontiers in Sports and Active Living 4, 807198 (2022)

    Kubayi, A., Larkin, P.: Match-related statistics differentiating winning and losing teams at the 2019 Africa Cup of Nations soccer championship. Frontiers in Sports and Active Living 4, 807198 (2022)

  18. [18]

    PeerJ Computer Science 8, e853 (2022)

    Lee, G.J., Jung, J.J.: DNN-based multi-output model for predicting soccer team tactics. PeerJ Computer Science 8, e853 (2022)

  19. [19]

    arXiv preprint arXiv:2401.01505 (2024)

    Li, H., Deng, A., Ke, Q., Liu, J., Rahmani, H., Guo, Y., Schiele, B., Chen, C.: Sports-qa: A large-scale video question answering benchmark for complex and professional sports. arXiv preprint arXiv:2401.01505 (2024)

  20. [20]

    In: 2024 IEEE International Conference on Big Data (BigData)

    Li, Q., Chiu, T.C., Huang, H.W., Sun, M.T., Ku, W.S.: Videobadminton: a video dataset for badminton action recognition. In: 2024 IEEE International Conference on Big Data (BigData). pp. 1387–1392. IEEE (2024)

  21. [21]

    DeepSeek-V3 Technical Report

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Mkhallati, H., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: Soccernet-caption: Dense video captioning for soccer broadcasts commentaries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5074–5085 (2023)

  23. [23]

    https://openai.com (2024)

    OpenAI: Gpt-4o. https://openai.com (2024)

  24. [24]

    Plakias, S., Kokkotis, C., Giakas, G., Tsaopoulos, D., Moustakidis, S.: Can artificial intelligence revolutionize soccer tactical analysis? Trends in Sport Sciences 31(3) (2024)

  25. [25]

    IEEE Transactions on Circuits and Systems for Video Technology 30(8), 2617–2633 (2019)

    Qi, M., Wang, Y., Li, A., Luo, J.: Sports video captioning via attentive motion representation and group relationship modeling. IEEE Transactions on Circuits and Systems for Video Technology 30(8), 2617–2633 (2019)

  26. [26]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Rao, J., Li, Z., Wu, H., Zhang, Y., Wang, Y., Xie, W.: Multi-agent system for comprehensive soccer understanding. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 3654–3663 (2025)

  27. [27]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Rao, J., Wu, H., Jiang, H., Zhang, Y., Wang, Y., Xie, W.: Towards universal soccer video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8384–8394 (2025)

  28. [28]

    arXiv preprint arXiv:2406.18530 (2024)

    Rao, J., Wu, H., Liu, C., Wang, Y., Xie, W.: Matchtime: Towards automatic soccer game commentary generation. arXiv preprint arXiv:2406.18530 (2024)

  29. [29]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Shao, D., Zhao, Y., Dai, B., Lin, D.: Finegym: A hierarchical video dataset for fine-grained action understanding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2616–2625 (2020)

  30. [30]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Somers, V., Joos, V., Cioppa, A., Giancola, S., Ghasemzadeh, S.A., Magera, F., Standaert, B., Mansourian, A.M., Zhou, X., Kasaei, S., et al.: Soccernet game state reconstruction: End-to-end athlete tracking and identification on a minimap. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3293–3305 (2024)

  31. [31]

    Journal of Sports Sciences 39(2), 147–153 (2021)

    Spitz, J., Wagemans, J., Memmert, D., Williams, A.M., Helsen, W.F.: Video assistant referees (VAR): The impact of technology on decision making in association football referees. Journal of Sports Sciences 39(2), 147–153 (2021)

  32. [32]

    https://github.com/QwenLM/Qwen3-VL (2025)

    Team, Q.: Qwen3-vl. https://github.com/QwenLM/Qwen3-VL (2025)

  33. [33]

    The International Football Association Board, Zurich, Switzerland (2025), https://www.theifab.com, effective from 1st July 2025

    The International Football Association Board: Laws of the Game 2025/26. The International Football Association Board, Zurich, Switzerland (2025), https://www.theifab.com, effective from 1st July 2025

  34. [34]

    PLoS ONE 20(6), e0322889 (2025)

    Wang, J., Li, L.: A method for feature division of soccer foul actions based on salience image semantics. PLoS ONE 20(6), e0322889 (2025)

  35. [35]

    arXiv preprint arXiv:2410.08474 (2024)

    Xia, H., Yang, Z., Zou, J., Tracy, R., Wang, Y., Lu, C., Lai, C., He, Y., Shao, X., Xie, Z., et al.: Sportu: A comprehensive sports understanding benchmark for multimodal large language models. arXiv preprint arXiv:2410.08474 (2024)

  36. [36]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Xu, J., Rao, Y., Yu, X., Chen, G., Zhou, J., Lu, J.: Finediving: A fine-grained dataset for procedure-aware action quality assessment. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2949–2958 (2022)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xu, J., Zhao, G., Yin, S., Zhou, W., Peng, Y.: Finesports: A multi-person hierarchical sports video dataset for fine-grained action understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21773–21782 (2024)

  38. [38]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    You, L., Huang, W., Xie, X., Wei, X., Li, B., Lin, S., Li, Y., Wang, C.: TimeSoccer: An end-to-end multimodal large language model for soccer commentary generation. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 3418–3427 (2025)

  39. [39]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Yuan, H., Ni, D., Wang, M.: Spatio-temporal dynamic inference network for group activity recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7476–7485 (2021)

  40. [40]

    IEEE Transactions on Instrumentation and Measurement 73, 1–11 (2024)

    Zahan, S., Hassan, G.M., Mian, A.: Learning sparse temporal video mapping for action quality assessment in floor gymnastics. IEEE Transactions on Instrumentation and Measurement 73, 1–11 (2024)

  41. [41]

    Retos: nuevas tendencias en educación física, deporte y recreación (61), 1162–1170 (2024)

    Zhekambayeva, M., Yerekesheva, M., Ramashov, N., Seidakhmetov, Y., Kulambayev, B.: Designing an artificial intelligence-powered video assistant referee system for team sports using computer vision. Retos: nuevas tendencias en educación física, deporte y recreación (61), 1162–1170 (2024)

  42. [42]

    arXiv e-prints (2024)

    Zhou, J., Shu, Y., Zhao, B., Wu, B., Xiao, S., Yang, X., Xiong, Y., Zhang, B., Huang, T., Liu, Z.: MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv e-prints (2024)

  43. [43]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Zhu, K., Wong, A., McPhee, J.: Fencenet: Fine-grained footwork recognition in fencing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3589–3598 (2022)

    Visual Evidence (Video Agent): – Video Agent’s Choice Explanation:{desc} – Video Agent’s Initial Intuition:{pred} === INSTRUCTIONS === –Analyze the provided input text and subordinate reports carefully. –Select the most correct ONE option ID. –Provide a brief explanation in English. OUTPUT FORMAT: Prediction: [Option ID] Explanation: [Reasoning] D Details...