pith. machine review for the scientific record.

arxiv: 2604.12320 · v2 · submitted 2026-04-14 · 💻 cs.CV · cs.AI · cs.MM

Recognition: unknown

EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords egocentric video · Video-LLM · esports · benchmark · perception · reasoning · question answering · first-person shooter

The pith

A new benchmark shows Video-LLMs reach only 71.58 percent accuracy on fast esports egocentric videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoEsportsQA as a collection of 1,745 question-answer pairs drawn from professional first-person shooter matches. Questions are organized along two axes, one for cognitive levels from basic visual perception to tactical reasoning and one for domain-specific esports knowledge. Evaluations of leading Video-LLMs find that performance tops out at 71.58 percent, with clearer shortfalls in deep reasoning than in surface perception and in precise actions than in overall game flow. This matters because it supplies a concrete test for how these models handle high-velocity virtual scenes that differ sharply from the slower real-world videos they already process well. The results point to specific architectural limits that future designs must address to work reliably in information-dense environments.

Core claim

By running a six-stage curation process on matches from three esports titles and structuring the resulting questions into a two-dimensional taxonomy of eleven cognitive sub-tasks and six knowledge sub-tasks, the paper shows that current Video-LLMs fall well short of satisfactory performance. The strongest model reaches only 71.58 percent accuracy. Models prove stronger at basic visual perception than at deep tactical reasoning and better at macro game progression than at fine-grained micro operations, exposing intrinsic weaknesses in how existing architectures process rapid virtual egocentric input.
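
The paper does not publish its data schema here; as a minimal sketch of how a QA pair under this decoupled taxonomy could be represented and how per-cell accuracy could be tallied (field names are hypothetical and a multiple-choice format is assumed):

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical schema; field names are illustrative, not taken from the paper's release.
@dataclass
class QAItem:
    qid: str                # unique question id
    question: str
    options: list[str]      # assuming a multiple-choice format
    answer: str             # ground-truth option label, e.g. "B"
    cognitive_subtask: str  # one of the 11 cognitive sub-tasks (perception or reasoning)
    knowledge_subtask: str  # one of the 6 esports-knowledge sub-tasks

def accuracy_by_cell(items: list[QAItem], preds: dict[str, str]) -> dict[tuple[str, str], float]:
    """Accuracy per (cognitive, knowledge) cell of the decoupled taxonomy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        cell = (item.cognitive_subtask, item.knowledge_subtask)
        total[cell] += 1
        if preds.get(item.qid) == item.answer:
            correct[cell] += 1
    return {cell: correct[cell] / total[cell] for cell in total}
```

Scoring each (cognitive, knowledge) cell separately is what lets the perception-versus-reasoning and macro-versus-micro gaps be read off directly rather than inferred from a single aggregate number.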

What carries the argument

The two-dimensional decoupled taxonomy that separates cognitive capability levels from esports knowledge domains, used to structure the QA pairs and isolate specific performance gaps.

If this is right

  • Current Video-LLM architectures contain intrinsic weaknesses that ablation experiments trace to handling of rapid temporal sequences.
  • The benchmark data reveals measurable connections between real-world and virtual egocentric understanding.
  • Optimizing models on this dataset supplies guidance for downstream esports applications such as real-time analysis tools.
  • Models consistently handle macro-level game progression more reliably than micro-level operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the taxonomy holds across domains, similar decoupled tests could diagnose perception-versus-reasoning gaps in other high-speed video settings.
  • Training pipelines that add high-velocity virtual clips may close the observed performance difference between real and esports scenes.
  • The benchmark could serve as a diagnostic for whether future models have overcome limits in fine-grained action recognition.

Load-bearing premise

The six-stage curation pipeline produces high-quality, unbiased QA pairs that validly measure perception and reasoning without introducing artifacts from the selection or annotation process.
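
One way to probe this premise without the curation details is a text-only baseline that never sees the video: if answers can be guessed from option statistics alone, the selection or annotation process has leaked artifacts. A minimal sketch, assuming question records with letter-labeled answer and options fields (as in the schema sketched under the core claim); this is an editorial suggestion, not a check reported in the paper:

```python
from collections import Counter

def blind_baselines(items):
    """Two video-free heuristics over hypothetical question records.
    Accuracy near chance (1 / number of options) suggests little answer-prior leakage."""
    # Heuristic 1: always predict the most common ground-truth letter.
    answers = [item.answer for item in items]
    majority_letter, _ = Counter(answers).most_common(1)[0]
    majority_acc = sum(a == majority_letter for a in answers) / len(items)

    # Heuristic 2: always predict the longest option (a classic QA artifact),
    # assuming options are listed in A, B, C, D order.
    def longest_label(item):
        idx = max(range(len(item.options)), key=lambda k: len(item.options[k]))
        return "ABCD"[idx]

    longest_acc = sum(longest_label(it) == it.answer for it in items) / len(items)
    return {"majority_letter": majority_acc, "longest_option": longest_acc}
```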

What would settle it

A Video-LLM reaching above 90 percent accuracy on the benchmark while showing no similar improvement on real-world egocentric videos would indicate that the reported gaps are not fundamental to current model designs.

Figures

Figures reproduced from arXiv: 2604.12320 by Jianzhe Ma, Qin Jin, Shangkui Chen, Wenxuan Wang, Yichen Xu, Zhonghao Cao.

Figure 1. Examples from EgoEsportsQA. The benchmark requires high-frequency visual perception (left) and expert …
Figure 2. The six-stage data construction pipeline of EgoEsportsQA.
Figure 3. Statistical overview of the EgoEsportsQA benchmark. The dataset is systematically categorized along …
Figure 4. Performance breakdown of 8 Video-LLMs across the cognitive capability and esports knowledge dimensions. In contrast, performance decreases considerably on micro-level categories, including planned tactics, adaptive coordination, and player mechanics. These sub-domains require the model to capture split-second, fine-grained mechanical executions. The difficulty of extracting such transient, pixel-level int…
Original abstract

While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EgoEsportsQA, a benchmark of 1,745 QA pairs curated from professional egocentric esports videos in three FPS games via a six-stage pipeline. Questions follow a two-dimensional taxonomy with 11 cognitive sub-tasks (perception and reasoning) and 6 esports-knowledge sub-tasks. Evaluations of state-of-the-art Video-LLMs report a best-model accuracy of 71.58%, with gaps showing stronger perception than reasoning and macro-progression over micro-operations; ablations and analysis link the dataset to real-world egocentric domains and downstream applications.

Significance. If the benchmark quality holds, the work is significant for filling a gap in Video-LLM evaluation for high-velocity virtual environments, where existing daily-activity benchmarks fall short. The decoupled taxonomy, new dataset, and systematic model comparisons provide concrete evidence of architectural limitations and guidance for optimization. The scalable curation pipeline and explicit connection between virtual and real egocentric domains are notable strengths.

major comments (2)
  1. [§3] §3 (Dataset Curation): The six-stage pipeline is asserted to yield 'high-quality' and unbiased QA pairs that validly separate perception from reasoning, yet no inter-annotator agreement, expert difficulty ratings, or cross-axis calibration statistics are reported. This is load-bearing for the headline claim that models exhibit 'stronger capabilities in basic visual perception than in deep tactical reasoning' and the 71.58% ceiling, because systematic bias in question selection or annotation could artifactually produce the observed perception > reasoning and macro > micro gaps.
  2. [§4] §4 (Experiments and Results): The performance differentials and ablation conclusions rest on the taxonomy being equitably difficult across sub-tasks, but the manuscript supplies no error analysis, per-sub-task difficulty breakdowns, or validation that reasoning questions are not inadvertently easier/harder than perception ones. Without these, the central finding that 'current models still fail to achieve satisfactory performance' cannot be confidently attributed to model architecture rather than benchmark construction.
minor comments (2)
  1. [Abstract] Abstract and §1: The term 'pioneering' is subjective and should be replaced with a factual statement about the benchmark's novelty relative to prior esports or egocentric video QA work.
  2. [§3] Taxonomy description: The mapping from the 11 cognitive sub-tasks to the perception/reasoning axis is not accompanied by explicit decision criteria or examples, which would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and for highlighting areas where additional validation would strengthen our claims. We address each major comment below and will revise the manuscript accordingly to improve transparency on dataset quality and experimental analysis.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Curation): The six-stage pipeline is asserted to yield 'high-quality' and unbiased QA pairs that validly separate perception from reasoning, yet no inter-annotator agreement, expert difficulty ratings, or cross-axis calibration statistics are reported. This is load-bearing for the headline claim that models exhibit 'stronger capabilities in basic visual perception than in deep tactical reasoning' and the 71.58% ceiling, because systematic bias in question selection or annotation could artifactually produce the observed perception > reasoning and macro > micro gaps.

    Authors: We agree that quantitative metrics such as inter-annotator agreement would provide stronger evidence for annotation quality. The six-stage pipeline incorporates domain-expert review by professional esports players and coaches at multiple stages, with explicit guidelines to enforce the decoupled taxonomy for separating perception from reasoning. However, these agreement and calibration statistics were not computed or reported in the original manuscript. In revision, we will expand §3 to detail the annotator pool, consensus process, and expert-assessed difficulty ratings for a sampled subset of questions, along with any available cross-axis calibration notes. This will directly support the validity of the reported gaps. revision: partial

  2. Referee: [§4] §4 (Experiments and Results): The performance differentials and ablation conclusions rest on the taxonomy being equitably difficult across sub-tasks, but the manuscript supplies no error analysis, per-sub-task difficulty breakdowns, or validation that reasoning questions are not inadvertently easier/harder than perception ones. Without these, the central finding that 'current models still fail to achieve satisfactory performance' cannot be confidently attributed to model architecture rather than benchmark construction.

    Authors: The taxonomy was constructed to balance cognitive and knowledge dimensions, and the consistent perception > reasoning pattern across multiple models supports attribution to architectural limitations. That said, the manuscript lacks the requested per-sub-task difficulty metrics and systematic error analysis. We will revise §4 to include: per-sub-task accuracy tables, proxy difficulty measures (e.g., question complexity indicators), and a qualitative error categorization of model failures. These additions will allow readers to assess whether the observed gaps arise from benchmark construction or from Video-LLM shortcomings in tactical reasoning. revision: yes
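
Neither promised addition requires new data collection; both follow directly from per-question annotations and predictions. A minimal sketch, assuming a doubly annotated sample and a per-question results table with hypothetical column names (an editorial illustration, not code from the paper):

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Response 1: inter-annotator agreement on a doubly annotated sample.
# Labels here are hypothetical; any categorical annotation (sub-task, answer) works.
annotator_a = ["perception", "reasoning", "perception", "reasoning"]
annotator_b = ["perception", "reasoning", "reasoning", "reasoning"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Response 2: per-sub-task accuracy for one model, from per-question correctness flags.
results = pd.DataFrame({
    "cognitive_subtask": ["basic perception", "basic perception", "tactical reasoning"],
    "knowledge_subtask": ["macro progression", "micro operations", "micro operations"],
    "correct":           [1, 0, 0],
})
print(results.groupby("cognitive_subtask")["correct"].mean())   # cognitive axis
print(results.groupby("knowledge_subtask")["correct"].mean())   # knowledge axis
print(results.groupby(["cognitive_subtask", "knowledge_subtask"])["correct"].mean())  # joint cells
```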

Circularity Check

0 steps flagged

No circularity: new benchmark dataset and direct model evaluations

Full rationale

The paper introduces EgoEsportsQA as a fresh dataset of 1,745 QA pairs curated via a six-stage pipeline from esports videos, then reports direct accuracy numbers (e.g., best Video-LLM at 71.58%) from running existing models on that data. No equations, fitted parameters, or self-citations are used to derive the headline results; the performance gaps are empirical outputs rather than reductions to prior author work or internal definitions. The curation pipeline is presented as an input process whose quality is asserted but not mathematically self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms beyond standard video QA practices, or invented entities are introduced; the contribution rests on curation of existing match footage and standard model evaluation.

pith-pipeline@v0.9.0 · 5584 in / 1101 out tokens · 35774 ms · 2026-05-10T15:38:58.692947+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  2. [2]

    Anthropic . 2025. https://www.anthropic.com/news/claude-sonnet-4-5 Introducing Claude sonnet 4.5 . Anthropic Blog. Accessed: 2026-03-18

  3. [3]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  4. [4]

    Fanni Bányai, Mark D Griffiths, Orsolya Király, and Zsolt Demetrovics. 2019. The psychology of esports: A systematic literature review. Journal of gambling studies, 35(2):351--365

  5. [5]

    Andrzej Białecki, Natalia Jakubowska, Paweł Dobrowolski, Piotr Białecki, Leszek Krupiński, Andrzej Szczap, Robert Białecki, and Jan Gajewski. 2023. Sc2egset: Starcraft ii esport replay and game-state dataset. Scientific Data, 10(1):600

  6. [6]

    Blizzard Entertainment . 2022. https://overwatch.blizzard.com Overwatch 2 . Video game. Accessed: 2026-03-18

  7. [7]

    ByteDance Seed . 2025. https://seed.bytedance.com/en/seed1_8 Seed1.8: A generalized agentic model that can efficiently and accurately accomplish complex tasks in real-world scenarios . ByteDance Official Website. Accessed: 2026-03-18

  8. [8]

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. 2025. Livecc: Learning video llm with streaming speech transcription at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29083--29095

  9. [9]

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and 1 others. 2024. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056--27087

  10. [10]

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and 1 others. 2018. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720--736

  11. [11]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088--10115

  12. [12]

    Brianna Duffy, Jonathan Gallagher, Jocelyn Rego, Wael Fatnassi, and Michael Warren. 2025. Cdops: Complex dynamics of online professional squads. In 2025 IEEE Conference on Games (CoG), pages 1--8. IEEE

  13. [13]

    David Durst, Feng Xie, Vishnu Sarukkai, Brennan Shacklett, Iuri Frosio, Chen Tessler, Joohwan Kim, Carly Taylor, Gilbert Bernstein, Sanjiban Choudhury, and 1 others. 2024. Learning to move like professional counter-strike players. In Computer Graphics Forum, volume 43, page e15173. Wiley Online Library

  14. [14]

    Chenyou Fan. 2019. Egovqa-an egocentric video question answering benchmark dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0--0

  15. [15]

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, and 1 others. 2025. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108--24118

  16. [16]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, and 1 others. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995--19012

  17. [17]

    Juho Hamari and Max Sjöblom. 2017. What is esports and why do people watch it? Internet research, 27(2):211--232

  18. [18]

    Nirai Hayakawa, Kazumasa Shimari, Kazuma Yamasaki, Hirotatsu Hoshikawa, Rikuto Tsuchida, and Kenichi Matsumoto. 2025. Round outcome prediction in valorant using tactical features from video analysis. In 2025 IEEE Conference on Games (CoG), pages 1--4. IEEE

  19. [19]

    Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. 2025. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  20. [20]

    Masaharu Hirota. 2024. Predicting win conditions of counter-strike: Global offensive for analyzing round progression. In 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), pages 1287--1288. IEEE

  21. [21]

    Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, and Xiaodan Liang. 2025. Hires-llava: Restoring fragmentation input in high-resolution large vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29814--29824

  22. [22]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  23. [23]

    JaidedAI. 2020. https://github.com/JaidedAI/EasyOCR Easyocr: Ready-to-use ocr with 80+ supported languages and all popular writing scripts . GitHub repository. Accessed: 2026-03-18

  24. [24]

    Wooyoung William Jang and Kevin K Byon. 2020. Antecedents of esports gameplay intention: Genre as a moderator. Computers in Human Behavior, 109:106336

  25. [25]

    Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. 2022. Egotaskqa: Understanding human tasks in egocentric videos. Advances in Neural Information Processing Systems, 35:3343--3360

  26. [26]

    Md Tanbeer Jubaer, Mayeesha Farjana, Barisha Chowdhury, Md Shahid Uz Zaman, Azmain Yakin Srizon, and Md Minhazul Islam. 2024. Analyzing audience engagement in esports: Sentiment and llm-based topic insights from live chats in south asia. In 2024 27th International Conference on Computer and Information Technology (ICCIT), pages 1351--1356. IEEE

  27. [27]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and 1 others. 2025 a . Llava-onevision: Easy visual task transfer. Transactions on Machine Learning Research

  28. [28]

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024 a . Llava-interleave: Tackling multi-image, video, and 3d in large multimodal models. In The Thirteenth International Conference on Learning Representations

  29. [29]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, and 1 others. 2024 b . Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195--22206

  30. [30]

    Yanjun Li, Yuqian Fu, Tianwen Qian, Qi'ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, and Xiaoling Wang. 2025 b . Egocross: Benchmarking multimodal large language models for cross-domain egocentric video question answering. arXiv preprint arXiv:2508.10729

  31. [31]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

  32. [32]

    Shuming Liu, Chen Zhao, Tianqi Xu, and Bernard Ghanem. 2025. Bolt: Boost large vision-language model without training for long-form video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3318--3327

  33. [33]

    Jiaying Lu, Yongchen Qian, Shifan Zhao, Yuanzhe Xi, and Carl Yang. 2023. Mug: A multimodal classification benchmark on game data with tabular, textual, and visual fields. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5332--5346

  34. [34]

    Weiyu Ma, Yuqian Fu, Zecheng Zhang, Bernard Ghanem, and Guohao Li. 2025. Ava: Attentive vlm agent for mastering starcraft ii. arXiv preprint arXiv:2503.05383

  35. [35]

    Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Runji Lin, Yuqiao Wu, Jun Wang, and Haifeng Zhang. 2024 a . Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Advances in Neural Information Processing Systems, 37:133386--133442

  36. [36]

    Weiyu Ma, Dongyu Xu, Shu Lin, Haifeng Zhang, and Jun Wang. 2024 b . Adaptive command: Real-time policy adjustment via language models in starcraft ii. In Proceedings of the 2024 6th International Conference on Distributed Artificial Intelligences, pages 22--30

  37. [37]

    DLS Mamoru, AD Panditha, WASSJ Perera, and GU Ganegoda. 2022. Conceptual representation and evaluation of an fps game commentary generator. In 2022 2nd International Conference on Image Processing and Robotics (ICIPRob), pages 1--6. IEEE

  38. [38]

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2023. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212--46244

  39. [39]

    Thye Shan Ng, Feiqi Cao, and Soyeon Caren Han. 2025. 3m-game: Multi-modal multi-task multi-teacher learning for game event detection (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29448--29450

  40. [40]

    OpenAI . 2025. https://openai.com/index/introducing-gpt-5 Introducing GPT-5 . OpenAI Blog. Accessed: 2026-03-18

  41. [41]

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. 2025. https://blog.google/products/gemini/gemini-3 A new era of intelligence with gemini 3 . Google Blog. Accessed: 2026-03-18

  42. [42]

    Jason G Reitman, Maria J Anderson-Coto, Minerva Wu, Je Seok Lee, and Constance Steinkuehler. 2020. Esports research: A literature review. Games and Culture, 15(1):32--50

  43. [43]

    Charles Ringer, James Alfred Walker, and Mihalis A Nicolaou. 2019. Multimodal joint emotion and game context recognition in league of legends livestreams. In 2019 IEEE Conference on Games (CoG), pages 1--8. IEEE

  44. [44]

    Riot Games . 2020. https://playvalorant.com Valorant . Video game. Accessed: 2026-03-18

  45. [45]

    Tsunehiko Tanaka and Edgar Simo-Serra. 2021. Lol-v2t: Large-scale esports video description dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4557--4566

  46. [46]

    Valve Corporation . 2023. https://www.counter-strike.net Counter-strike 2 . Video game. Accessed: 2026-03-18

  47. [47]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 1 others. 2025 a . Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265

  48. [48]

    Yunzhe Wang, Soham Hans, and Volkan Ustun. 2025 b . X-ego: Acquiring team-level tactical situational awareness via cross-egocentric contrastive video representation learning. arXiv preprint arXiv:2510.19150

  49. [49]

    Yunzhe Wang, Volkan Ustun, and Chris McGroarty. 2025 c . A data-driven discretized cs: Go simulation environment to facilitate strategic multi-agent planning research. In 2025 Winter Simulation Conference (WSC), pages 2419--2430. IEEE

  50. [50]

    Zihan Wang and Naoki Yoshinaga. 2024. Commentary generation from data records of multiplayer strategy esports game. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 263--271

  51. [51]

    Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, and Mike Zheng Shou. 2022. Assistq: Affordance-centric question-driven task completion for egocentric assistant. In European Conference on Computer Vision, pages 485--501. Springer

  52. [52]

    Peter Xenopoulos, William Robert Freeman, and Claudio Silva. 2022. Analyzing the differences between professional and amateur esports through win probability. In Proceedings of the ACM Web Conference 2022, pages 3418--3427

  53. [53]

    Peter Xenopoulos and Claudio Silva. 2022. Esta: An esports trajectory and action dataset. arXiv preprint arXiv:2209.09861

  54. [54]

    Junjie H Xu, Hong Huang, Xiaoling Ling, and Pujana Paliyawan. 2022. Toward collaborative game commentating utilizing pre-trained generative language models. In 2022 IEEE International Conference on Consumer Electronics (ICCE), pages 1--4. IEEE

  55. [55]

    Junjie H Xu, Yu Nakano, Lingrong Kong, and Kojiro Iizuka. 2023. Cs-lol: A dataset of viewer comment with scene in e-sports live-streaming. In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pages 422--426

  56. [56]

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. 2026. https://openreview.net/forum?id=gVbPWbA97s Streaming VLM : Real-time understanding for infinite video streams . In The Fourteenth International Conference on Learning Representations

  57. [57]

    Yichen Xu, Jianzhe Ma, Chuhan Wang, Zhonghao Cao, Liangyu Chen, Wenxuan Wang, and Qin Jin. 2025. A survey of large models in sports

  58. [58]

    Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, and 1 others. 2025. Egolife: Towards egocentric life assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28885--28900

  59. [59]

    Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, and 1 others. 2025. Mmego: Towards building egocentric multimodal llms for video qa. In The Thirteenth International Conference on Learning Representations

  60. [60]

    Ari Yu, Jinwoo Hyun, Hyeong-Gyu Jang, Sung-Yun Park, and Sang-Kwang Lee. 2025 a . Single-anchored multi-modal dense video captioning for esports broadcasts commentaries. In Proceedings of the 8th International ACM Workshop on Multimedia Content Analysis in Sports, pages 31--38

  61. [61]

    Sicheng Yu, CHENGKAI JIN, Huanyu Wang, Zhenghao Chen, Sheng Jin, ZHONGRONG ZUO, XU XIAOLEI, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, and Qianru Sun. 2025 b . https://openreview.net/forum?id=LNL7zKvm7e Frame-voyager: Learning to query frames for video large language models . In The Thirteenth International Conference on Learning Representations

  62. [62]

    Dawei Zhang, Sixing Wu, Yao Guo, and Xiangqun Chen. 2022. Moba-e2c: Generating moba game commentaries via capturing highlight events from the meta-data. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4545--4556

  63. [63]

    Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, pages 543--553

  64. [64]

    Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, and Jian Luan. 2025 a . Q-frame: Query-aware frame selection and multi-resolution adaptation for video-llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22056--22065

  65. [65]

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun MA, Ziwei Liu, and Chunyuan Li. 2025 b . Llava-video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research

  66. [66]

    Zhonghan Zhao, Wenhao Chai, Shengyu Hao, Wenhao Hu, Guanhong Wang, Shidong Cao, Mingli Song, Jenq-Neng Hwang, and Gaoang Wang. 2025. A survey of deep learning in sports applications: Perception, comprehension, and decision. IEEE Transactions on Visualization and Computer Graphics


    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...