
arxiv: 2509.15602 · v5 · submitted 2025-09-19 · 💻 cs.CV

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

Pith reviewed 2026-05-18 16:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords Tennis video understanding · Multimodal large language models · Video benchmarks · Temporal grounding · Frame sampling · Sports video analysis · Rally understanding · Stroke event sequences

The pith

Multimodal large language models struggle with tennis rallies because they lack sufficient temporal grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TennisTV as a benchmark that treats each tennis rally as a sequence of ordered stroke events and generates questions across eight tasks from individual strokes to full rallies. Evaluation of seventeen models reveals consistent underperformance on these short but dense clips. The results indicate that frame sampling density must be chosen differently for different tasks and that stronger temporal grounding would improve reasoning about the sequence of events. A sympathetic reader would see this as evidence that general video capabilities do not automatically extend to fast, information-dense sports footage. The benchmark supplies the first systematic testbed for measuring exactly where the gaps appear.

Core claim

TennisTV models rallies as temporally ordered sequences of consecutive stroke events, built with automated video filtering and question generation. It then shows that current multimodal models fall short on stroke-to-rally reasoning tasks, primarily because they fail to maintain accurate temporal grounding across sampled frames.

What carries the argument

The TennisTV benchmark, which represents each rally as a temporally ordered sequence of consecutive stroke events and supplies 2527 human-verified questions across eight tasks.
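
To make the rally-as-sequence representation concrete, here is a minimal Python sketch. It is an illustration only, not the authors' data format: the class names, field names, stroke labels, and question template are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StrokeEvent:
    """One stroke in a rally. All field names are illustrative assumptions."""
    time_s: float  # hit time in seconds from the start of the clip
    player: str    # e.g. "near" or "far"
    stroke: str    # e.g. "serve", "forehand", "backhand"


@dataclass
class Rally:
    """A rally held as a list of strokes kept in temporal order."""
    strokes: List[StrokeEvent]

    def __post_init__(self) -> None:
        # Enforce the temporal ordering that the representation relies on.
        self.strokes = sorted(self.strokes, key=lambda s: s.time_s)

    def stroke_count(self) -> int:
        return len(self.strokes)

    def nth_stroke(self, n: int) -> StrokeEvent:
        """Return the n-th stroke, counting from 1."""
        return self.strokes[n - 1]


def make_count_question(rally: Rally) -> dict:
    """Build a toy rally-level question; the template is hypothetical."""
    return {
        "question": "How many strokes are played in this rally?",
        "answer": rally.stroke_count(),
    }
```

Once strokes are stored in order, stroke-level questions (what was the n-th stroke?) and rally-level questions (how many strokes?) can both be read off the same structure, which is one plausible reason to model rallies this way.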

If this is right

  • Task-specific frame sampling rates become necessary for reliable video reasoning in high-frequency domains.
  • Improvements in temporal grounding would directly raise accuracy on stroke identification and rally outcome prediction.
  • The same modeling approach can be reused to create comparable benchmarks for other fast sports.
  • General video understanding benchmarks may systematically underestimate difficulties that appear only in temporally dense clips.
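
The first point above, task-specific sampling rates, can be sketched as follows. This is a hedged illustration: the function name, the task names, and the per-task rates in `TASK_FPS` are assumptions made for the example, not values reported in the paper.

```python
from typing import Dict, List


def sample_frame_indices(num_frames: int, native_fps: float,
                         target_fps: float) -> List[int]:
    """Pick frame indices at roughly target_fps from a clip at native_fps."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames and both rates must be positive")
    # Never sample above the native rate: the step is at least one frame.
    step = max(native_fps / target_fps, 1.0)
    indices: List[int] = []
    position = 0.0
    while int(position) < num_frames:
        indices.append(int(position))
        position += step
    return indices


# Hypothetical per-task sampling rates (frames per second), for illustration.
TASK_FPS: Dict[str, float] = {
    "rally_outcome": 2.0,  # coarse, rally-level question
    "stroke_type": 10.0,   # fine-grained, stroke-level question
}


def frames_for_task(task: str, num_frames: int, native_fps: float) -> List[int]:
    """Look up the assumed rate for a task and sample accordingly."""
    return sample_frame_indices(num_frames, native_fps, TASK_FPS[task])
```

For a 10-second clip at 30 fps, the coarse setting yields 20 frames and the fine setting 100, so the denser setting costs five times as many frames on the same clip.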

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The findings point to a broader need for explicit event-tracking modules inside multimodal models rather than relying on uniform frame sampling.
  • Similar temporal-density problems likely exist in other domains such as surgical videos or traffic monitoring.
  • Future work could test whether adding optical-flow or pose features as auxiliary inputs closes the observed gap without increasing frame count.

Load-bearing premise

The automated pipelines that filter videos and generate questions correctly capture the true sequence of stroke events without introducing major selection bias or factual errors.

What would settle it

A side-by-side check in which human annotators re-label a random sample of the benchmark questions against the original video frames and find a high rate of mismatches between the stated stroke order and what actually occurs.
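
Such an audit could be summarised with a mismatch rate and a confidence interval. The sketch below uses only the Python standard library and a Wilson score interval. The function names, the seed, and the audit size of 200 mentioned afterwards are assumptions for illustration; the paper does not describe this procedure.

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_audit_sample(question_ids: Sequence[int], k: int,
                      seed: int = 0) -> List[int]:
    """Draw k distinct question ids uniformly at random for re-labelling."""
    rng = random.Random(seed)
    return rng.sample(list(question_ids), k)


def wilson_interval(mismatches: int, n: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a mismatch rate (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= mismatches <= n:
        raise ValueError("need n > 0 and 0 <= mismatches <= n")
    p = mismatches / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, `draw_audit_sample(range(2527), 200)` selects 200 of the 2527 questions, and `wilson_interval(mismatches, 200)` then bounds the benchmark-wide mismatch rate. The Wilson interval is used here because it stays well behaved when the observed rate is close to zero.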

Original abstract

Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporal-ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 8 tasks from the stroke level to the rally level and includes 2527 human-verified questions. Evaluating 17 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TennisTV, which it presents as the first comprehensive benchmark for evaluating multimodal large language models (MLLMs) on tennis video understanding. It models each rally as a temporally ordered sequence of consecutive stroke events via automated pipelines for video filtering and question generation, covers 8 tasks from the stroke level to the rally level, and includes 2527 human-verified questions. Evaluation of 17 representative MLLMs yields two main insights: frame-sampling density should be tailored and balanced across tasks, and improving temporal grounding is essential for stronger reasoning in this domain.

Significance. If the automated pipelines produce accurate temporal labels without substantial undetected biases, the work offers a valuable new resource for probing MLLM limitations in fast-paced, information-dense video domains like sports. The systematic coverage of 8 tasks, scale of human-verified questions, and evaluation across 17 models are clear strengths that could guide future improvements in temporal modeling for video understanding.

major comments (2)
  1. [§4] §4 (Benchmark Construction): The central claims about insufficient temporal grounding and the need for task-tailored frame sampling rest on the assumption that the automated pipelines correctly identify and order stroke events without systematic errors. The manuscript describes human verification only for the final 2527 questions; it does not report quantitative accuracy metrics, error rates, or validation for the upstream stroke detection, boundary identification, or sequencing steps. This is load-bearing because differential performance across the 8 tasks could reflect pipeline artifacts rather than genuine model deficits.
  2. [§5] §5 (Experiments and Results): No statistical significance tests, confidence intervals, or error analysis (e.g., per-task variance or inter-model comparisons) are reported despite the 2527-question scale. This weakens support for the two key insights, as observed performance gaps could be within noise.
minor comments (2)
  1. Add a table explicitly defining the 8 tasks, their input formats, and expected outputs for clarity.
  2. [Abstract] The abstract's phrasing of 'first and most comprehensive' should be softened or supported by explicit comparison to prior sports-video benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing the TennisTV benchmark. We address each major comment below and outline the revisions we will make to strengthen the work.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark Construction): The central claims about insufficient temporal grounding and the need for task-tailored frame sampling rest on the assumption that the automated pipelines correctly identify and order stroke events without systematic errors. The manuscript describes human verification only for the final 2527 questions; it does not report quantitative accuracy metrics, error rates, or validation for the upstream stroke detection, boundary identification, or sequencing steps. This is load-bearing because differential performance across the 8 tasks could reflect pipeline artifacts rather than genuine model deficits.

    Authors: We appreciate the referee's emphasis on validating the automated pipelines. The human verification step was applied to the final questions to ensure quality, but we acknowledge that separate quantitative metrics for stroke detection, boundary identification, and sequencing would provide stronger assurance against potential artifacts. In the revised manuscript, we will add a dedicated validation subsection in §4 reporting accuracy, precision, and error rates on a manually annotated subset of rallies for these upstream components.
    Revision: yes

  2. Referee: [§5] §5 (Experiments and Results): No statistical significance tests, confidence intervals, or error analysis (e.g., per-task variance or inter-model comparisons) are reported despite the 2527-question scale. This weakens support for the two key insights, as observed performance gaps could be within noise.

    Authors: We agree that statistical analysis would better support the reported insights. In the revised version of §5, we will incorporate statistical significance tests (e.g., paired t-tests or McNemar's test for model comparisons), bootstrap confidence intervals, and per-task variance/error analysis to quantify the reliability of performance differences and strengthen the evidence for task-tailored frame sampling and the importance of temporal grounding.
    Revision: yes
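
Two of the analyses named in the second response, McNemar's test and bootstrap confidence intervals, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: each model's results are assumed to be a list of per-question booleans aligned on the same questions, the function names are hypothetical, and nothing here reflects the authors' actual analysis code.

```python
import math
import random
from typing import Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two models on the same questions."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same questions")
    # Only discordant pairs (one model right, the other wrong) are informative.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_accuracy_ci(correct: Sequence[bool], n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and recompute accuracy each time.
    accs = sorted(sum(rng.choices(correct, k=n)) / n
                  for _ in range(n_resamples))
    low = accs[int((alpha / 2) * n_resamples)]
    high = accs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high
```

McNemar's test suits this setting because every model answers the same questions, so comparisons are paired. The percentile bootstrap is the simplest variant and could be applied per task to show whether gaps between sampling settings exceed resampling noise.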

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or self-referential reductions

Full rationale

The paper creates the TennisTV benchmark using automated filtering and question-generation pipelines followed by human verification of 2527 questions, then evaluates 17 MLLMs across 8 tasks to derive two empirical insights on frame sampling and temporal grounding. No equations, fitted parameters, predictions, or mathematical derivations are present that could reduce to inputs by construction. The central claims rest on direct experimental outcomes against an externally tested benchmark rather than any self-definitional, self-citation load-bearing, or ansatz-smuggling steps, making the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on standard assumptions from multimodal video evaluation and sports analysis domains rather than new free parameters or invented entities.

axioms (1)
  • Domain assumption: Tennis rallies can be reliably represented as temporally ordered sequences of consecutive stroke events using automated filtering and question generation.
    This modeling choice underpins the entire benchmark construction and task design as described in the abstract.


