RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
arXiv preprint arXiv:2410.08474 (2024)
4 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
  RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
- BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
  BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing that state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
- TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
  Introduces the TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding, from stroke-level to rally-level tasks, with automated pipelines and human verification.
- SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing
  SoccerRef-Agents is a multi-agent framework using MLLMs, cross-modal RAG, and a custom knowledge base that outperforms general MLLMs on soccer foul decisions and explanations.