EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3
The pith
A new benchmark shows that the best current Video-LLM reaches only 71.58 percent accuracy on fast-paced, egocentric esports video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By running a six-stage curation process on matches from three esports titles and structuring the resulting questions into a two-dimensional taxonomy of eleven cognitive sub-tasks and six knowledge sub-tasks, the paper shows that current Video-LLMs fall well short of satisfactory performance. The strongest model reaches only 71.58 percent accuracy. Models prove stronger at basic visual perception than at deep tactical reasoning and better at macro game progression than at fine-grained micro operations, exposing intrinsic weaknesses in how existing architectures process rapid virtual egocentric input.
What carries the argument
The two-dimensional decoupled taxonomy that separates cognitive capability levels from esports knowledge domains, used to structure the QA pairs and isolate specific performance gaps.
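In evaluation terms, the decoupled taxonomy amounts to tagging each QA pair with two independent labels and aggregating accuracy along either axis. A minimal sketch in Python, using hypothetical sub-task names (the review does not enumerate the paper's actual 11 cognitive and 6 knowledge labels):

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class QAItem:
    question: str
    answer: str
    cognitive_subtask: str   # hypothetical, e.g. "perception" or "reasoning"
    knowledge_subtask: str   # hypothetical, e.g. "macro" or "micro"

def accuracy_by_axis(items, predictions, axis):
    """Aggregate accuracy along one taxonomy dimension
    ("cognitive_subtask" or "knowledge_subtask")."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        key = getattr(item, axis)
        total[key] += 1
        correct[key] += int(pred == item.answer)
    return {k: correct[k] / total[k] for k in total}
```

Because each item carries both labels, the same prediction set yields a perception-vs-reasoning breakdown and a macro-vs-micro breakdown without re-running the model, which is what lets the paper isolate the two gaps independently.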
If this is right
- Current Video-LLM architectures contain intrinsic weaknesses that ablation experiments trace to handling of rapid temporal sequences.
- The benchmark data reveals measurable connections between real-world and virtual egocentric understanding.
- Optimizing models on this dataset supplies guidance for downstream esports applications such as real-time analysis tools.
- Models consistently handle macro-level game progression more reliably than micro-level operations.
Where Pith is reading between the lines
- If the taxonomy holds across domains, similar decoupled tests could diagnose perception-versus-reasoning gaps in other high-speed video settings.
- Training pipelines that add high-velocity virtual clips may close the observed performance difference between real and esports scenes.
- The benchmark could serve as a diagnostic for whether future models have overcome limits in fine-grained action recognition.
Load-bearing premise
The six-stage curation pipeline produces high-quality, unbiased QA pairs that validly measure perception and reasoning without introducing artifacts from the selection or annotation process.
What would settle it
A Video-LLM reaching above 90 percent accuracy on the benchmark while showing no similar improvement on real-world egocentric videos would indicate that the reported gaps are not fundamental to current model designs.
Original abstract
While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EgoEsportsQA, a benchmark of 1,745 QA pairs curated from professional egocentric esports videos in three FPS games via a six-stage pipeline. Questions follow a two-dimensional taxonomy with 11 cognitive sub-tasks (perception and reasoning) and 6 esports-knowledge sub-tasks. Evaluations of state-of-the-art Video-LLMs report a best-model accuracy of 71.58%, with gaps showing stronger perception than reasoning and macro-progression over micro-operations; ablations and analysis link the dataset to real-world egocentric domains and downstream applications.
Significance. If the benchmark quality holds, the work is significant for filling a gap in Video-LLM evaluation for high-velocity virtual environments, where existing daily-activity benchmarks fall short. The decoupled taxonomy, new dataset, and systematic model comparisons provide concrete evidence of architectural limitations and guidance for optimization. The scalable curation pipeline and explicit connection between virtual and real egocentric domains are notable strengths.
Major comments (2)
- [§3] §3 (Dataset Curation): The six-stage pipeline is asserted to yield 'high-quality' and unbiased QA pairs that validly separate perception from reasoning, yet no inter-annotator agreement, expert difficulty ratings, or cross-axis calibration statistics are reported. This is load-bearing for the headline claim that models exhibit 'stronger capabilities in basic visual perception than in deep tactical reasoning' and the 71.58% ceiling, because systematic bias in question selection or annotation could artifactually produce the observed perception > reasoning and macro > micro gaps.
- [§4] §4 (Experiments and Results): The performance differentials and ablation conclusions rest on the taxonomy being equitably difficult across sub-tasks, but the manuscript supplies no error analysis, per-sub-task difficulty breakdowns, or validation that reasoning questions are not inadvertently easier/harder than perception ones. Without these, the central finding that 'current models still fail to achieve satisfactory performance' cannot be confidently attributed to model architecture rather than benchmark construction.
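The inter-annotator agreement statistics the first comment asks for are standardly reported as Cohen's kappa. A minimal sketch, assuming two annotators assign a label (e.g. perception vs. reasoning) to the same set of questions:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators
    who label the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed raw agreement rate.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)
```

A kappa near 1 would support the claim that the perception/reasoning split is consistently applied; a low kappa would suggest the observed perception-vs-reasoning gap could partly reflect annotation noise.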
Minor comments (2)
- [Abstract] Abstract and §1: The term 'pioneering' is subjective and should be replaced with a factual statement about the benchmark's novelty relative to prior esports or egocentric video QA work.
- [§3] Taxonomy description: The mapping from the 11 cognitive sub-tasks to the perception/reasoning axis is not accompanied by explicit decision criteria or examples, which would improve reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive review and for highlighting areas where additional validation would strengthen our claims. We address each major comment below and will revise the manuscript accordingly to improve transparency on dataset quality and experimental analysis.
Point-by-point responses
Referee: [§3] §3 (Dataset Curation): The six-stage pipeline is asserted to yield 'high-quality' and unbiased QA pairs that validly separate perception from reasoning, yet no inter-annotator agreement, expert difficulty ratings, or cross-axis calibration statistics are reported. This is load-bearing for the headline claim that models exhibit 'stronger capabilities in basic visual perception than in deep tactical reasoning' and the 71.58% ceiling, because systematic bias in question selection or annotation could artifactually produce the observed perception > reasoning and macro > micro gaps.
Authors: We agree that quantitative metrics such as inter-annotator agreement would provide stronger evidence for annotation quality. The six-stage pipeline incorporates domain-expert review by professional esports players and coaches at multiple stages, with explicit guidelines to enforce the decoupled taxonomy for separating perception from reasoning. However, these agreement and calibration statistics were not computed or reported in the original manuscript. In revision, we will expand §3 to detail the annotator pool, consensus process, and expert-assessed difficulty ratings for a sampled subset of questions, along with any available cross-axis calibration notes. This will directly support the validity of the reported gaps. revision: partial
Referee: [§4] §4 (Experiments and Results): The performance differentials and ablation conclusions rest on the taxonomy being equitably difficult across sub-tasks, but the manuscript supplies no error analysis, per-sub-task difficulty breakdowns, or validation that reasoning questions are not inadvertently easier/harder than perception ones. Without these, the central finding that 'current models still fail to achieve satisfactory performance' cannot be confidently attributed to model architecture rather than benchmark construction.
Authors: The taxonomy was constructed to balance cognitive and knowledge dimensions, and the consistent perception > reasoning pattern across multiple models supports attribution to architectural limitations. That said, the manuscript lacks the requested per-sub-task difficulty metrics and systematic error analysis. We will revise §4 to include: per-sub-task accuracy tables, proxy difficulty measures (e.g., question complexity indicators), and a qualitative error categorization of model failures. These additions will allow readers to assess whether the observed gaps arise from benchmark construction or from Video-LLM shortcomings in tactical reasoning. revision: yes
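The per-sub-task accuracy tables and qualitative error categorization promised for the revised §4 could be computed from per-question records along these lines; the error-category names here are hypothetical placeholders, not the paper's scheme:

```python
from collections import Counter

def categorize_errors(records):
    """records: iterable of (sub_task, is_correct, error_type_or_None).
    Returns per-sub-task accuracy and a tally of failure modes --
    the two additions the authors promise for the revised section."""
    correct, total, errors = Counter(), Counter(), Counter()
    for task, ok, err in records:
        total[task] += 1
        correct[task] += int(ok)
        if not ok and err is not None:
            errors[err] += 1   # e.g. "temporal", "spatial", "rule_knowledge"
    accuracy = {task: correct[task] / total[task] for task in total}
    return accuracy, errors
```

If the error tally concentrates in temporal categories across models, that would support attributing the gap to architecture (rapid-sequence handling) rather than to benchmark construction.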
Circularity Check
No circularity: new benchmark dataset and direct model evaluations
Full rationale
The paper introduces EgoEsportsQA as a fresh dataset of 1,745 QA pairs curated via a six-stage pipeline from esports videos, then reports direct accuracy numbers (e.g., best Video-LLM at 71.58%) from running existing models on that data. No equations, fitted parameters, or self-citations are used to derive the headline results; the performance gaps are empirical outputs rather than reductions to prior author work or internal definitions. The curation pipeline is presented as an input process whose quality is asserted but not mathematically self-referential.