Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Pith reviewed 2026-05-10 18:44 UTC · model grok-4.3
The pith
Video-MME-v2 shows current models lag human experts because errors in visual aggregation and temporal modeling block higher-level reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Video-MME-v2 establishes that the leading model falls substantially short of human experts on comprehensive video understanding. Mistakes in visual information aggregation and temporal dynamics modeling propagate upward, limiting performance on complex multimodal reasoning. Thinking-based reasoning improves performance when subtitles are available but sometimes degrades it in purely visual settings.
What carries the argument
Two devices carry it: the progressive tri-level hierarchy, which incrementally raises complexity from multi-point visual information aggregation through temporal dynamics modeling to complex multimodal reasoning, and the group-based non-linear evaluation strategy, which enforces consistency across related queries and withholds credit from fragmented or guess-based answers.
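The paper states the intent of the group-based scoring but this review does not reproduce its exact formula. A minimal sketch, assuming an all-or-nothing rule per question group (the function names, data, and the specific rule are illustrative assumptions, not taken from the paper), shows how such a scheme diverges from per-question accuracy:

```python
def per_question_accuracy(answers, gold):
    """Conventional metric: fraction of individually correct answers."""
    correct = sum(a == g for a, g in zip(answers, gold))
    return correct / len(gold)

def group_score(answers, gold, groups):
    """Assumed non-linear rule: a group of related questions earns
    credit only when every answer in it is correct, so fragmented
    or guess-based correctness receives no credit."""
    hits = 0
    for idxs in groups:
        if all(answers[i] == gold[i] for i in idxs):
            hits += 1
    return hits / len(groups)

# Example: 3 of 4 questions correct individually, but the second
# group (questions 2 and 3) is broken by a single error.
answers = ["A", "C", "B", "D"]
gold    = ["A", "C", "D", "D"]
groups  = [[0, 1], [2, 3]]
print(per_question_accuracy(answers, gold))  # 0.75
print(group_score(answers, gold, groups))    # 0.5
```

Under this assumed rule, one wrong answer forfeits credit for its entire group, which is one concrete way to penalize fragmented or guess-based correctness.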
If this is right
- Advancing video understanding requires targeted gains in visual information aggregation and temporal modeling before complex reasoning can improve.
- Current models rely on textual cues such as subtitles to support thinking-based reasoning, with performance sometimes dropping when those cues are absent.
- Standard per-question accuracy overestimates capabilities by crediting answers that lack coherence across related questions.
- Future model development should prioritize architectures that maintain fidelity across visual details and time rather than compensating at the reasoning stage alone.
Where Pith is reading between the lines
- Training objectives that explicitly supervise lower-level visual and temporal tasks may produce faster gains on high-level reasoning benchmarks than end-to-end reasoning training alone.
- The same hierarchical structure could be applied to audio-video or long-horizon video tasks to diagnose whether similar error propagation occurs.
- Architectures that preserve fine-grained visual information over extended sequences would be a direct test of whether the observed bottlenecks can be narrowed.
Load-bearing premise
The group-based non-linear evaluation and tri-level hierarchy accurately measure genuine video understanding without introducing their own biases or inconsistencies.
What would settle it
A result in which models reach human-level scores on the highest reasoning level while still failing the visual aggregation and temporal modeling levels under the same evaluation protocol would undermine the claimed hierarchical bottleneck.
Original abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Besides, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model Gemini-3-Pro and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Video-MME-v2, a new video understanding benchmark featuring a progressive tri-level hierarchy (multi-point visual aggregation, temporal dynamics modeling, and complex multimodal reasoning) and a group-based non-linear evaluation strategy that enforces consistency across related queries and coherence in multi-step reasoning. It is built via a controlled human annotation pipeline (12 annotators, 50 reviewers, 3,300 human-hours, up to 5 QA rounds) and reports experiments showing a substantial performance gap between Gemini-3-Pro and human experts, with lower-level visual/temporal errors propagating to limit high-level reasoning, plus dependence on textual cues.
Significance. If the tri-level hierarchy and non-linear scoring are shown to be reliable, the benchmark could meaningfully advance evaluation of video MLLMs by exposing real limitations in visual-temporal integration and reasoning chains that saturated per-question accuracy metrics obscure. The scale and rigor of the human annotation pipeline are a clear strength that supports the data quality claims.
major comments (2)
- [evaluation strategy (abstract and methods)] The abstract and evaluation strategy description claim that the group-based non-linear evaluator 'penalizes fragmented or guess-based correctness' and 'assigns credit only to answers supported by valid reasoning,' yet no quantitative validation is reported (e.g., inter-annotator agreement, correlation with standard per-question accuracy, or stability of model rankings when recomputed with conventional metrics). This is load-bearing for the central claim of hierarchical error propagation, as the observed bottlenecks could be produced by the scoring rules themselves.
- [experiments and results] The headline experimental result (Gemini-3-Pro vs. humans + propagation from visual/temporal errors to reasoning failures) is measured exclusively with the new tri-level question sets and non-linear evaluator. Without an ablation comparing these scores to ordinary accuracy on the identical data and questions, it is unclear whether the propagation pattern reflects model capabilities or an artifact of the metric design.
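To make the concern concrete: under an assumed all-or-nothing group rule (a stand-in, since the paper's exact scheme is not reproduced in this review), two models with identical per-question accuracy can receive very different group-based scores depending on where their errors fall, which is why recomputing both metrics on the same answers is needed to separate model capability from metric design:

```python
def per_question_accuracy(answers, gold):
    """Conventional metric: fraction of individually correct answers."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def group_score(answers, gold, groups):
    """Assumed all-or-nothing rule over related-question groups
    (illustrative, not the paper's published formula)."""
    return sum(all(answers[i] == gold[i] for i in idxs)
               for idxs in groups) / len(groups)

gold   = ["A", "B", "C", "D"]
groups = [[0, 1], [2, 3]]
# Both models answer exactly 2 of 4 questions correctly...
model_x = ["A", "B", "X", "X"]   # errors concentrated in one group
model_y = ["A", "X", "C", "X"]   # errors spread across both groups
assert per_question_accuracy(model_x, gold) == per_question_accuracy(model_y, gold) == 0.5
# ...yet the group metric separates them sharply.
print(group_score(model_x, gold, groups))  # 0.5: one group left intact
print(group_score(model_y, gold, groups))  # 0.0: no intact group
```

This divergence is exactly what an ablation on identical data would surface: if the propagation pattern persists under plain accuracy, it reflects the models; if it appears only under the group metric, it may be an artifact of the scoring.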
minor comments (2)
- [abstract] The abstract refers to 'Gemini-3-Pro' without specifying the exact model version or release date; this should be clarified for reproducibility.
- [introduction] The paper states the benchmark 'aims to serve as one of the most authoritative video benchmarks' but provides no direct comparison table against prior video benchmarks (e.g., Video-MME-v1, ActivityNet, or others) on question count, duration coverage, or annotation effort.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects of validating our proposed evaluation strategy and ensuring the robustness of our experimental claims. We address each major comment below and will incorporate revisions to strengthen the paper.
Point-by-point responses
-
Referee: [evaluation strategy (abstract and methods)] The abstract and evaluation strategy description claim that the group-based non-linear evaluator 'penalizes fragmented or guess-based correctness' and 'assigns credit only to answers supported by valid reasoning,' yet no quantitative validation is reported (e.g., inter-annotator agreement, correlation with standard per-question accuracy, or stability of model rankings when recomputed with conventional metrics). This is load-bearing for the central claim of hierarchical error propagation, as the observed bottlenecks could be produced by the scoring rules themselves.
Authors: We agree that quantitative validation of the group-based non-linear evaluator is essential to substantiate our claims and rule out metric artifacts. The initial submission focused on describing the design and its intended properties but did not include explicit supporting analyses. In the revised manuscript, we will add: (1) inter-annotator agreement statistics for the group scoring decisions, drawing on the multi-reviewer quality assurance process (50 reviewers); (2) Pearson/Spearman correlations between the non-linear group scores and conventional per-question accuracy across models; and (3) a comparison of model rankings under both scoring schemes to assess stability. These additions will directly address whether the observed hierarchical propagation is robust or scoring-dependent. revision: yes
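The proposed ranking-stability check can be sketched with a plain Spearman rank correlation between the two scoring schemes. The scores below are made-up placeholders for illustration, not results from the paper, and the tie-free ranking is a simplifying assumption:

```python
def ranks(scores):
    """Rank items by score, highest score = rank 1 (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho for tie-free score lists:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for four models under both metrics.
acc   = [0.81, 0.76, 0.70, 0.64]   # conventional per-question accuracy
group = [0.55, 0.58, 0.41, 0.30]   # group-based non-linear score
print(spearman(acc, group))
```

A rho near 1 would indicate the non-linear metric preserves the accuracy-based ordering of models; a low rho would mean the two schemes rank models differently and the choice of metric is load-bearing.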
-
Referee: [experiments and results] The headline experimental result (Gemini-3-Pro vs. humans + propagation from visual/temporal errors to reasoning failures) is measured exclusively with the new tri-level question sets and non-linear evaluator. Without an ablation comparing these scores to ordinary accuracy on the identical data and questions, it is unclear whether the propagation pattern reflects model capabilities or an artifact of the metric design.
Authors: We acknowledge that presenting results solely under the new metric leaves open the possibility of metric-specific effects. To resolve this, the revised version will include a dedicated ablation section that recomputes all primary results—including the Gemini-3-Pro vs. human gap and the visual/temporal-to-reasoning error propagation—using both the group-based non-linear evaluator and standard per-question accuracy on the exact same question sets and videos. This will allow direct comparison of patterns and demonstrate that the bottlenecks are not an artifact of the evaluation design. revision: yes
Circularity Check
No circularity: benchmark construction and empirical results are independent of self-referential derivations.
Full rationale
The paper introduces Video-MME-v2 via explicit design choices (tri-level hierarchy from visual aggregation to multimodal reasoning, plus group-based non-linear scoring that penalizes inconsistency). These are presented as definitional construction steps, not derived from equations or prior fitted values. The headline claims (Gemini-3-Pro gap, error propagation) are direct empirical outputs from running the benchmark on models and humans; they do not reduce to the metric definition by construction, nor rely on self-citation chains for their validity. No fitted parameters are renamed as predictions, no uniqueness theorems are imported, and no ansatz is smuggled. The evaluation rules are stated upfront and applied externally, satisfying the self-contained benchmark criterion for score 0.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: human annotation with multiple reviewers produces reliable ground-truth labels for complex video reasoning tasks.
Forward citations
Cited by 8 Pith papers
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs
GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025.
- [3] ByteDance Seed Team. Seed2.0 model card: Towards intelligence frontier for real-world complexity. Model card, February 2026.
- [4] Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, and Ziwei Liu. Insight-V++: Towards advanced long-chain visual reasoning with multimodal large language models. arXiv preprint arXiv:2603.18118, 2026.
- [5] Yuhao Dong, Shulin Tian, Shuai Liu, Shuangrui Ding, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, and Ziwei Liu. Demo-ICL: In-context learning for procedural video knowledge acquisition. arXiv preprint arXiv:2602.08439, 2026.
- [6] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.
- [7] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- [8] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24108–24118, 2025.
- [9] Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. VITA-1.5: Towards GPT-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025.
- [10] Google DeepMind. Introducing Gemini 3: our most intelligent model that helps you bring any idea to life. Google Blog, 2025.
- [11] Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. MotionBench: Benchmarking and improving fine-grained video motion understanding for vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8450–8460, 2025.
- [12] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.
- [13] Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025.
- [14] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. MME-CoT: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. arXiv preprint arXiv:2502.09621, 2025.
- [15] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.
- [16] Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.
- [17] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
- [18] Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y Charles, Xinyu Zhou, and Xu Sun. VideoReasonBench: Can MLLMs perform vision-centric complex video reasoning? arXiv preprint arXiv:2505.23359, 2025.
- [19] Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. VCR-Bench: A comprehensive evaluation framework for video chain-of-thought reasoning. arXiv preprint arXiv:2504.07956, 2025.
- [20] Qwen Team. Qwen3.5: Towards native multimodal agents. Technical report, Alibaba Cloud, February 2026.
- [21] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [22] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
- [23] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.
- [24] Qwen Team. Qwen3.5-Omni: Scaling up, toward native omni-modal AGI, March 2026.
- [25] Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025.
- [26] Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. LVBench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025.
- [27] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [28] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024.
- [29] LLM-Core-Team Xiaomi. MiMo-VL technical report, 2025.
- [30] Xiaomi Corporation. Xiaomi MiMo-V2-Omni: See, hear, act in the agentic era. https://mimo.xiaomi.com/mimo-v2-omni, 2026.
- [31] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765, 2025.
- [32] Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. VideoChat-R1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025.
- [33] Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai Keye-VL 1.5 technical report. arXiv preprint arXiv:2509.01563, 2025.
- [34] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.
- [35] Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, and Ziwei Liu. Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20626–20636, 2025.
- [36] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.
- [37] Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. MMVU: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025.