pith. machine review for the scientific record.

arxiv: 2604.12320 · v2 · submitted 2026-04-14 · 💻 cs.CV · cs.AI · cs.MM

Recognition: unknown

EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords egocentric video · Video-LLM · esports · benchmark · perception · reasoning · question answering · first-person shooter

The pith

A new benchmark shows Video-LLMs reach only 71.58 percent accuracy on fast esports egocentric videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoEsportsQA as a collection of 1,745 question-answer pairs drawn from professional first-person shooter matches. Questions are organized along two axes, one for cognitive levels from basic visual perception to tactical reasoning and one for domain-specific esports knowledge. Evaluations of leading Video-LLMs find that performance tops out at 71.58 percent, with clearer shortfalls in deep reasoning than in surface perception and in precise actions than in overall game flow. This matters because it supplies a concrete test for how these models handle high-velocity virtual scenes that differ sharply from the slower real-world videos they already process well. The results point to specific architectural limits that future designs must address to work reliably in information-dense environments.

Core claim

By running a six-stage curation process on matches from three esports titles and structuring the resulting questions into a two-dimensional taxonomy of eleven cognitive sub-tasks and six knowledge sub-tasks, the paper shows that current Video-LLMs fall well short of satisfactory performance. The strongest model reaches only 71.58 percent accuracy. Models prove stronger at basic visual perception than at deep tactical reasoning and better at macro game progression than at fine-grained micro operations, exposing intrinsic weaknesses in how existing architectures process rapid virtual egocentric input.
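
The paper does not publish its data schema here; as a minimal sketch of how a QA pair under this decoupled taxonomy could be represented and how per-cell accuracy could be tallied (field names are hypothetical and a multiple-choice format is assumed):

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical schema; field names are illustrative, not taken from the paper's release.
@dataclass
class QAItem:
    qid: str                # unique question id
    question: str
    options: list[str]      # assuming a multiple-choice format
    answer: str             # ground-truth option label, e.g. "B"
    cognitive_subtask: str  # one of the 11 cognitive sub-tasks (perception or reasoning)
    knowledge_subtask: str  # one of the 6 esports-knowledge sub-tasks

def accuracy_by_cell(items: list[QAItem], preds: dict[str, str]) -> dict[tuple[str, str], float]:
    """Accuracy per (cognitive, knowledge) cell of the decoupled taxonomy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        cell = (item.cognitive_subtask, item.knowledge_subtask)
        total[cell] += 1
        if preds.get(item.qid) == item.answer:
            correct[cell] += 1
    return {cell: correct[cell] / total[cell] for cell in total}
```

Scoring each (cognitive, knowledge) cell separately is what lets the perception-versus-reasoning and macro-versus-micro gaps be read off directly rather than inferred from a single aggregate number.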

What carries the argument

The two-dimensional decoupled taxonomy that separates cognitive capability levels from esports knowledge domains, used to structure the QA pairs and isolate specific performance gaps.

If this is right

  • Current Video-LLM architectures contain intrinsic weaknesses that ablation experiments trace to handling of rapid temporal sequences.
  • The benchmark data reveals measurable connections between real-world and virtual egocentric understanding.
  • Optimizing models on this dataset supplies guidance for downstream esports applications such as real-time analysis tools.
  • Models consistently handle macro-level game progression more reliably than micro-level operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the taxonomy holds across domains, similar decoupled tests could diagnose perception-versus-reasoning gaps in other high-speed video settings.
  • Training pipelines that add high-velocity virtual clips may close the observed performance difference between real and esports scenes.
  • The benchmark could serve as a diagnostic for whether future models have overcome limits in fine-grained action recognition.

Load-bearing premise

The six-stage curation pipeline produces high-quality, unbiased QA pairs that validly measure perception and reasoning without introducing artifacts from the selection or annotation process.
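
One way to probe this premise without the curation details is a text-only baseline that never sees the video: if answers can be guessed from option statistics alone, the selection or annotation process has leaked artifacts. A minimal sketch, assuming question records with letter-labeled answer and options fields (as in the schema sketched under the core claim); this is an editorial suggestion, not a check reported in the paper:

```python
from collections import Counter

def blind_baselines(items):
    """Two video-free heuristics over hypothetical question records.
    Accuracy near chance (1 / number of options) suggests little answer-prior leakage."""
    # Heuristic 1: always predict the most common ground-truth letter.
    answers = [item.answer for item in items]
    majority_letter, _ = Counter(answers).most_common(1)[0]
    majority_acc = sum(a == majority_letter for a in answers) / len(items)

    # Heuristic 2: always predict the longest option (a classic QA artifact),
    # assuming options are listed in A, B, C, D order.
    def longest_label(item):
        idx = max(range(len(item.options)), key=lambda k: len(item.options[k]))
        return "ABCD"[idx]

    longest_acc = sum(longest_label(it) == it.answer for it in items) / len(items)
    return {"majority_letter": majority_acc, "longest_option": longest_acc}
```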

What would settle it

A Video-LLM reaching above 90 percent accuracy on the benchmark while showing no similar improvement on real-world egocentric videos would indicate that the reported gaps are not fundamental to current model designs.

Figures

Figures reproduced from arXiv: 2604.12320 by Jianzhe Ma, Qin Jin, Shangkui Chen, Wenxuan Wang, Yichen Xu, Zhonghao Cao.

Figure 1. Examples from EgoEsportsQA. The benchmark requires high-frequency visual perception (left) and expert …
Figure 2. The six-stage data construction pipeline of EgoEsportsQA.
Figure 3. Statistical overview of the EgoEsportsQA benchmark. The dataset is systematically categorized along …
Figure 4. Performance breakdown of 8 Video-LLMs across the cognitive capability and esports knowledge dimensions. In contrast, performance decreases considerably on micro-level categories, including planned tactics, adaptive coordination, and player mechanics. These sub-domains require the model to capture split-second, fine-grained mechanical executions. The difficulty of extracting such transient, pixel-level int…
Original abstract

While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EgoEsportsQA, a benchmark of 1,745 QA pairs curated from professional egocentric esports videos in three FPS games via a six-stage pipeline. Questions follow a two-dimensional taxonomy with 11 cognitive sub-tasks (perception and reasoning) and 6 esports-knowledge sub-tasks. Evaluations of state-of-the-art Video-LLMs report a best-model accuracy of 71.58%, with gaps showing stronger perception than reasoning and macro-progression over micro-operations; ablations and analysis link the dataset to real-world egocentric domains and downstream applications.

Significance. If the benchmark quality holds, the work is significant for filling a gap in Video-LLM evaluation for high-velocity virtual environments, where existing daily-activity benchmarks fall short. The decoupled taxonomy, new dataset, and systematic model comparisons provide concrete evidence of architectural limitations and guidance for optimization. The scalable curation pipeline and explicit connection between virtual and real egocentric domains are notable strengths.

major comments (2)
  1. [§3] §3 (Dataset Curation): The six-stage pipeline is asserted to yield 'high-quality' and unbiased QA pairs that validly separate perception from reasoning, yet no inter-annotator agreement, expert difficulty ratings, or cross-axis calibration statistics are reported. This is load-bearing for the headline claim that models exhibit 'stronger capabilities in basic visual perception than in deep tactical reasoning' and the 71.58% ceiling, because systematic bias in question selection or annotation could artifactually produce the observed perception > reasoning and macro > micro gaps.
  2. [§4] §4 (Experiments and Results): The performance differentials and ablation conclusions rest on the taxonomy being equitably difficult across sub-tasks, but the manuscript supplies no error analysis, per-sub-task difficulty breakdowns, or validation that reasoning questions are not inadvertently easier/harder than perception ones. Without these, the central finding that 'current models still fail to achieve satisfactory performance' cannot be confidently attributed to model architecture rather than benchmark construction.
minor comments (2)
  1. [Abstract] Abstract and §1: The term 'pioneering' is subjective and should be replaced with a factual statement about the benchmark's novelty relative to prior esports or egocentric video QA work.
  2. [§3] Taxonomy description: The mapping from the 11 cognitive sub-tasks to the perception/reasoning axis is not accompanied by explicit decision criteria or examples, which would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and for highlighting areas where additional validation would strengthen our claims. We address each major comment below and will revise the manuscript accordingly to improve transparency on dataset quality and experimental analysis.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Curation): The six-stage pipeline is asserted to yield 'high-quality' and unbiased QA pairs that validly separate perception from reasoning, yet no inter-annotator agreement, expert difficulty ratings, or cross-axis calibration statistics are reported. This is load-bearing for the headline claim that models exhibit 'stronger capabilities in basic visual perception than in deep tactical reasoning' and the 71.58% ceiling, because systematic bias in question selection or annotation could artifactually produce the observed perception > reasoning and macro > micro gaps.

    Authors: We agree that quantitative metrics such as inter-annotator agreement would provide stronger evidence for annotation quality. The six-stage pipeline incorporates domain-expert review by professional esports players and coaches at multiple stages, with explicit guidelines to enforce the decoupled taxonomy for separating perception from reasoning. However, these agreement and calibration statistics were not computed or reported in the original manuscript. In revision, we will expand §3 to detail the annotator pool, consensus process, and expert-assessed difficulty ratings for a sampled subset of questions, along with any available cross-axis calibration notes. This will directly support the validity of the reported gaps. revision: partial

  2. Referee: [§4] §4 (Experiments and Results): The performance differentials and ablation conclusions rest on the taxonomy being equitably difficult across sub-tasks, but the manuscript supplies no error analysis, per-sub-task difficulty breakdowns, or validation that reasoning questions are not inadvertently easier/harder than perception ones. Without these, the central finding that 'current models still fail to achieve satisfactory performance' cannot be confidently attributed to model architecture rather than benchmark construction.

    Authors: The taxonomy was constructed to balance cognitive and knowledge dimensions, and the consistent perception > reasoning pattern across multiple models supports attribution to architectural limitations. That said, the manuscript lacks the requested per-sub-task difficulty metrics and systematic error analysis. We will revise §4 to include: per-sub-task accuracy tables, proxy difficulty measures (e.g., question complexity indicators), and a qualitative error categorization of model failures. These additions will allow readers to assess whether the observed gaps arise from benchmark construction or from Video-LLM shortcomings in tactical reasoning. revision: yes
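
Neither promised addition requires new data collection; both follow directly from per-question annotations and predictions. A minimal sketch, assuming a doubly annotated sample and a per-question results table with hypothetical column names (an editorial illustration, not code from the paper):

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Response 1: inter-annotator agreement on a doubly annotated sample.
# Labels here are hypothetical; any categorical annotation (sub-task, answer) works.
annotator_a = ["perception", "reasoning", "perception", "reasoning"]
annotator_b = ["perception", "reasoning", "reasoning", "reasoning"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Response 2: per-sub-task accuracy for one model, from per-question correctness flags.
results = pd.DataFrame({
    "cognitive_subtask": ["basic perception", "basic perception", "tactical reasoning"],
    "knowledge_subtask": ["macro progression", "micro operations", "micro operations"],
    "correct":           [1, 0, 0],
})
print(results.groupby("cognitive_subtask")["correct"].mean())   # cognitive axis
print(results.groupby("knowledge_subtask")["correct"].mean())   # knowledge axis
print(results.groupby(["cognitive_subtask", "knowledge_subtask"])["correct"].mean())  # joint cells
```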

Circularity Check

0 steps flagged

No circularity: new benchmark dataset and direct model evaluations

Full rationale

The paper introduces EgoEsportsQA as a fresh dataset of 1,745 QA pairs curated via a six-stage pipeline from esports videos, then reports direct accuracy numbers (e.g., best Video-LLM at 71.58%) from running existing models on that data. No equations, fitted parameters, or self-citations are used to derive the headline results; the performance gaps are empirical outputs rather than reductions to prior author work or internal definitions. The curation pipeline is presented as an input process whose quality is asserted but not mathematically self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms beyond standard video QA practices, or invented entities are introduced; the contribution rests on curation of existing match footage and standard model evaluation.

pith-pipeline@v0.9.0 · 5584 in / 1101 out tokens · 35774 ms · 2026-05-10T15:38:58.692947+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  2. [2]

    Anthropic . 2025. https://www.anthropic.com/news/claude-sonnet-4-5 Introducing Claude sonnet 4.5 . Anthropic Blog. Accessed: 2026-03-18

  3. [3]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  4. [4]

    Fanni Bányai, Mark D Griffiths, Orsolya Király, and Zsolt Demetrovics. 2019. The psychology of esports: A systematic literature review. Journal of gambling studies, 35(2):351--365

  5. [5]

    Andrzej Białecki, Natalia Jakubowska, Paweł Dobrowolski, Piotr Białecki, Leszek Krupiński, Andrzej Szczap, Robert Białecki, and Jan Gajewski. 2023. Sc2egset: Starcraft ii esport replay and game-state dataset. Scientific Data, 10(1):600

  6. [6]

    Blizzard Entertainment . 2022. https://overwatch.blizzard.com Overwatch 2 . Video game. Accessed: 2026-03-18

  7. [7]

    ByteDance Seed . 2025. https://seed.bytedance.com/en/seed1_8 Seed1.8: A generalized agentic model that can efficiently and accurately accomplish complex tasks in real-world scenarios . ByteDance Official Website. Accessed: 2026-03-18

  8. [8]

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. 2025. Livecc: Learning video llm with streaming speech transcription at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29083--29095

  9. [9]

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and 1 others. 2024. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056--27087

  10. [10]

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and 1 others. 2018. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720--736

  11. [11]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088--10115

  12. [12]

    Brianna Duffy, Jonathan Gallagher, Jocelyn Rego, Wael Fatnassi, and Michael Warren. 2025. Cdops: Complex dynamics of online professional squads. In 2025 IEEE Conference on Games (CoG), pages 1--8. IEEE

  13. [13]

    David Durst, Feng Xie, Vishnu Sarukkai, Brennan Shacklett, Iuri Frosio, Chen Tessler, Joohwan Kim, Carly Taylor, Gilbert Bernstein, Sanjiban Choudhury, and 1 others. 2024. Learning to move like professional counter-strike players. In Computer Graphics Forum, volume 43, page e15173. Wiley Online Library

  14. [14]

    Chenyou Fan. 2019. Egovqa-an egocentric video question answering benchmark dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0--0

  15. [15]

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, and 1 others. 2025. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108--24118

  16. [16]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, and 1 others. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995--19012

  17. [17]

    Juho Hamari and Max Sjöblom. 2017. What is esports and why do people watch it? Internet research, 27(2):211--232

  18. [18]

    Nirai Hayakawa, Kazumasa Shimari, Kazuma Yamasaki, Hirotatsu Hoshikawa, Rikuto Tsuchida, and Kenichi Matsumoto. 2025. Round outcome prediction in valorant using tactical features from video analysis. In 2025 IEEE Conference on Games (CoG), pages 1--4. IEEE

  19. [19]

    Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. 2025. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  20. [20]

    Masaharu Hirota. 2024. Predicting win conditions of counter-strike: Global offensive for analyzing round progression. In 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), pages 1287--1288. IEEE

  21. [21]

    Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, and Xiaodan Liang. 2025. Hires-llava: Restoring fragmentation input in high-resolution large vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29814--29824

  22. [22]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  23. [23]

    JaidedAI. 2020. https://github.com/JaidedAI/EasyOCR Easyocr: Ready-to-use ocr with 80+ supported languages and all popular writing scripts . GitHub repository. Accessed: 2026-03-18

  24. [24]

    Wooyoung William Jang and Kevin K Byon. 2020. Antecedents of esports gameplay intention: Genre as a moderator. Computers in Human Behavior, 109:106336

  25. [25]

    Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. 2022. Egotaskqa: Understanding human tasks in egocentric videos. Advances in Neural Information Processing Systems, 35:3343--3360

  26. [26]

    Md Tanbeer Jubaer, Mayeesha Farjana, Barisha Chowdhury, Md Shahid Uz Zaman, Azmain Yakin Srizon, and Md Minhazul Islam. 2024. Analyzing audience engagement in esports: Sentiment and llm-based topic insights from live chats in south asia. In 2024 27th International Conference on Computer and Information Technology (ICCIT), pages 1351--1356. IEEE

  27. [27]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and 1 others. 2025 a . Llava-onevision: Easy visual task transfer. Transactions on Machine Learning Research

  28. [28]

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024 a . Llava-interleave: Tackling multi-image, video, and 3d in large multimodal models. In The Thirteenth International Conference on Learning Representations

  29. [29]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, and 1 others. 2024 b . Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195--22206

  30. [30]

    Yanjun Li, Yuqian Fu, Tianwen Qian, Qi'ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, and Xiaoling Wang. 2025 b . Egocross: Benchmarking multimodal large language models for cross-domain egocentric video question answering. arXiv preprint arXiv:2508.10729

  31. [31]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

  32. [32]

    Shuming Liu, Chen Zhao, Tianqi Xu, and Bernard Ghanem. 2025. Bolt: Boost large vision-language model without training for long-form video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3318--3327

  33. [33]

    Jiaying Lu, Yongchen Qian, Shifan Zhao, Yuanzhe Xi, and Carl Yang. 2023. Mug: A multimodal classification benchmark on game data with tabular, textual, and visual fields. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5332--5346

  34. [34]

    Weiyu Ma, Yuqian Fu, Zecheng Zhang, Bernard Ghanem, and Guohao Li. 2025. Ava: Attentive vlm agent for mastering starcraft ii. arXiv preprint arXiv:2503.05383

  35. [35]

    Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Runji Lin, Yuqiao Wu, Jun Wang, and Haifeng Zhang. 2024 a . Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Advances in Neural Information Processing Systems, 37:133386--133442

  36. [36]

    Weiyu Ma, Dongyu Xu, Shu Lin, Haifeng Zhang, and Jun Wang. 2024 b . Adaptive command: Real-time policy adjustment via language models in starcraft ii. In Proceedings of the 2024 6th International Conference on Distributed Artificial Intelligences, pages 22--30

  37. [37]

    DLS Mamoru, AD Panditha, WASSJ Perera, and GU Ganegoda. 2022. Conceptual representation and evaluation of an fps game commentary generator. In 2022 2nd International Conference on Image Processing and Robotics (ICIPRob), pages 1--6. IEEE

  38. [38]

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2023. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212--46244

  39. [39]

    Thye Shan Ng, Feiqi Cao, and Soyeon Caren Han. 2025. 3m-game: Multi-modal multi-task multi-teacher learning for game event detection (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29448--29450

  40. [40]

    OpenAI . 2025. https://openai.com/index/introducing-gpt-5 Introducing GPT-5 . OpenAI Blog. Accessed: 2026-03-18

  41. [41]

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. 2025. https://blog.google/products/gemini/gemini-3 A new era of intelligence with gemini 3 . Google Blog. Accessed: 2026-03-18

  42. [42]

    Jason G Reitman, Maria J Anderson-Coto, Minerva Wu, Je Seok Lee, and Constance Steinkuehler. 2020. Esports research: A literature review. Games and Culture, 15(1):32--50

  43. [43]

    Charles Ringer, James Alfred Walker, and Mihalis A Nicolaou. 2019. Multimodal joint emotion and game context recognition in league of legends livestreams. In 2019 IEEE Conference on Games (CoG), pages 1--8. IEEE

  44. [44]

    Riot Games . 2020. https://playvalorant.com Valorant . Video game. Accessed: 2026-03-18

  45. [45]

    Tsunehiko Tanaka and Edgar Simo-Serra. 2021. Lol-v2t: Large-scale esports video description dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4557--4566

  46. [46]

    Valve Corporation . 2023. https://www.counter-strike.net Counter-strike 2 . Video game. Accessed: 2026-03-18

  47. [47]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 1 others. 2025 a . Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265

  48. [48]

    Yunzhe Wang, Soham Hans, and Volkan Ustun. 2025 b . X-ego: Acquiring team-level tactical situational awareness via cross-egocentric contrastive video representation learning. arXiv preprint arXiv:2510.19150

  49. [49]

    Yunzhe Wang, Volkan Ustun, and Chris McGroarty. 2025 c . A data-driven discretized cs: Go simulation environment to facilitate strategic multi-agent planning research. In 2025 Winter Simulation Conference (WSC), pages 2419--2430. IEEE

  50. [50]

    Zihan Wang and Naoki Yoshinaga. 2024. Commentary generation from data records of multiplayer strategy esports game. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 263--271

  51. [51]

    Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, and Mike Zheng Shou. 2022. Assistq: Affordance-centric question-driven task completion for egocentric assistant. In European Conference on Computer Vision, pages 485--501. Springer

  52. [52]

    Peter Xenopoulos, William Robert Freeman, and Claudio Silva. 2022. Analyzing the differences between professional and amateur esports through win probability. In Proceedings of the ACM Web Conference 2022, pages 3418--3427

  53. [53]

    Peter Xenopoulos and Claudio Silva. 2022. Esta: An esports trajectory and action dataset. arXiv preprint arXiv:2209.09861

  54. [54]

    Junjie H Xu, Hong Huang, Xiaoling Ling, and Pujana Paliyawan. 2022. Toward collaborative game commentating utilizing pre-trained generative language models. In 2022 IEEE International Conference on Consumer Electronics (ICCE), pages 1--4. IEEE

  55. [55]

    Junjie H Xu, Yu Nakano, Lingrong Kong, and Kojiro Iizuka. 2023. Cs-lol: A dataset of viewer comment with scene in e-sports live-streaming. In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pages 422--426

  56. [56]

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. 2026. https://openreview.net/forum?id=gVbPWbA97s Streaming VLM : Real-time understanding for infinite video streams . In The Fourteenth International Conference on Learning Representations

  57. [57]

    Yichen Xu, Jianzhe Ma, Chuhan Wang, Zhonghao Cao, Liangyu Chen, Wenxuan Wang, and Qin Jin. 2025. A survey of large models in sports

  58. [58]

    Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, and 1 others. 2025. Egolife: Towards egocentric life assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28885--28900

  59. [59]

    Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, and 1 others. 2025. Mmego: Towards building egocentric multimodal llms for video qa. In The Thirteenth International Conference on Learning Representations

  60. [60]

    Ari Yu, Jinwoo Hyun, Hyeong-Gyu Jang, Sung-Yun Park, and Sang-Kwang Lee. 2025 a . Single-anchored multi-modal dense video captioning for esports broadcasts commentaries. In Proceedings of the 8th International ACM Workshop on Multimedia Content Analysis in Sports, pages 31--38

  61. [61]

    Sicheng Yu, CHENGKAI JIN, Huanyu Wang, Zhenghao Chen, Sheng Jin, ZHONGRONG ZUO, XU XIAOLEI, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, and Qianru Sun. 2025 b . https://openreview.net/forum?id=LNL7zKvm7e Frame-voyager: Learning to query frames for video large language models . In The Thirteenth International Conference on Learning Representations

  62. [62]

    Dawei Zhang, Sixing Wu, Yao Guo, and Xiangqun Chen. 2022. Moba-e2c: Generating moba game commentaries via capturing highlight events from the meta-data. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4545--4556

  63. [63]

    Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, pages 543--553

  64. [64]

    Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, and Jian Luan. 2025 a . Q-frame: Query-aware frame selection and multi-resolution adaptation for video-llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22056--22065

  65. [65]

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun MA, Ziwei Liu, and Chunyuan Li. 2025 b . Llava-video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research

  66. [66]

    Zhonghan Zhao, Wenhao Chai, Shengyu Hao, Wenhao Hu, Guanhong Wang, Shidong Cao, Mingli Song, Jenq-Neng Hwang, and Gaoang Wang. 2025. A survey of deep learning in sports applications: Perception, comprehension, and decision. IEEE Transactions on Visualization and Computer Graphics


    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...