Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Anyi Rao; Hohin Kwan; Hongyu Li; Jiahao Xie; Manyuan Zhang; Ray Zhang; Si Liu; Xianghao Kong

arxiv: 2606.27828 · v1 · pith:MP3LYDLSnew · submitted 2026-06-26 · 💻 cs.CV

Hohin Kwan, Hongyu Li, Ray Zhang, Manyuan Zhang, Xianghao Kong, Anyi Rao, Jiahao Xie, Si Liu

Pith reviewed 2026-06-29 04:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video temporal reasoning · multimodal large language models · benchmark · logical reasoning · state tracking · temporal ordering · MLLM evaluation · controlled generation

The pith

Video-MME-Logical creates a controlled benchmark around five temporal-logical operations to measure how MLLMs maintain and compose evidence across video frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video temporal-logical reasoning can be isolated from static recognition and scene complexity by organizing tasks around state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. It generates 25 fine-grained categories with explicit control over object states, transitions, temporal dependencies, and logical compositions, allowing evaluation at varying horizons and complexities. Experiments on state-of-the-art MLLMs show a human-model performance gap that widens with increased reasoning demands. Even supervised fine-tuning on up to 500K generated samples raises scores but leaves the gap open, and the benchmark includes checks for whether models recover the required intermediate logical trace before answering.

Core claim

The paper claims that a benchmark built on five temporal-logical operations with controlled generation accurately diagnoses video temporal-logical reasoning in MLLMs, that current models exhibit a substantial and complexity-dependent gap relative to humans, and that supervised fine-tuning on hundreds of thousands of samples narrows but does not eliminate this gap.

What carries the argument

The five temporal-logical operations (state tracking, sequential counting, temporal ordering, dynamic spatiality, structural composition) together with controlled generation of tasks that vary temporal horizon and reasoning complexity.
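
To make "controlled generation" concrete, here is a minimal sketch of how one state-tracking sample could be scripted so that the temporal horizon is an explicit parameter and the ground-truth trace falls out of the program itself. The cup-swap scenario, the `Sample` fields, and the function name `generate_state_tracking` are all illustrative assumptions; the paper's actual generator is not described on this page.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    horizon: int    # number of swaps, used here as the temporal horizon
    swaps: list     # scripted transitions, in order, as (cup_a, cup_b)
    trace: list     # ball position before any swap and after each swap
    question: str
    answer: int


def generate_state_tracking(horizon: int, n_cups: int = 3, seed: int = 0) -> Sample:
    """Hypothetical generator: a hidden ball follows a sequence of cup swaps."""
    rng = random.Random(seed)          # seeded so each sample is reproducible
    pos = rng.randrange(n_cups)        # initial hidden-ball position
    swaps, trace = [], [pos]
    for _ in range(horizon):
        a, b = rng.sample(range(n_cups), 2)   # two distinct cups trade places
        swaps.append((a, b))
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
        trace.append(pos)              # intermediate state after this swap
    question = f"After {horizon} swaps, which cup (0-{n_cups - 1}) hides the ball?"
    return Sample(horizon, swaps, trace, question, pos)


sample = generate_state_tracking(horizon=5, seed=1)
print(sample.question, sample.answer)
```

Because the answer and every intermediate state come from the same script that would drive the renderer, no human annotation is needed and difficulty can be raised simply by increasing `horizon`.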

If this is right

Gaps between models and humans increase as temporal horizon and logical complexity rise.
Supervised fine-tuning on 500K samples improves scores but leaves a remaining gap.
The benchmark supports diagnostic checks on whether models recover the required reasoning trace before the final answer.
The construction positions the benchmark as a scalable testbed for further analysis of temporal-logical capabilities.
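
The trace check mentioned above can be pictured as scoring the intermediate states separately from the final answer. The sketch below is one possible scoring rule, not the paper's protocol: it assumes traces are lists of discrete states compared position by position, and the `lucky_guess` flag is a hypothetical label for a correct answer reached through an incorrect trace.

```python
def score_with_trace(pred_trace, pred_answer, gold_trace, gold_answer):
    """Score a prediction on both its final answer and its intermediate states."""
    answer_ok = pred_answer == gold_answer
    # Count gold steps that the prediction matches at the same position.
    matched = sum(p == g for p, g in zip(pred_trace, gold_trace))
    step_acc = matched / len(gold_trace) if gold_trace else 0.0
    trace_ok = list(pred_trace) == list(gold_trace)
    return {
        "answer_correct": answer_ok,
        "trace_exact": trace_ok,
        "step_accuracy": step_acc,
        # Right answer without the right reasoning path.
        "lucky_guess": answer_ok and not trace_ok,
    }


result = score_with_trace([0, 1, 1], 1, [0, 2, 1], 1)
print(result)
```

Separating the two scores is what lets a diagnostic distinguish a model that tracked the states from one that landed on the right option by chance.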

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures may need explicit mechanisms for updating and composing states across frames rather than relying on next-token prediction alone.
The same controlled-generation approach could be applied to isolate other forms of dynamic reasoning such as causal or counterfactual inference in video.
Persistent gaps after large-scale fine-tuning suggest that data scale alone may not suffice and that new objectives or memory structures warrant testing.

Load-bearing premise

The five operations and controlled generation process isolate temporal-logical reasoning without being affected by static object recognition or uncontrolled scene factors.

What would settle it

A result in which models reach human-level accuracy on the benchmark yet still fail to maintain or compose evidence correctly on matched real-world videos that require the same operations.

Figures

Figures reproduced from arXiv: 2606.27828 by Anyi Rao, Hohin Kwan, Hongyu Li, Jiahao Xie, Manyuan Zhang, Ray Zhang, Si Liu, Xianghao Kong.

**Figure 1.** VIDEO-MME-LOGICAL is a controllable benchmark for video temporal-logical reasoning with 25 tasks. It evaluates whether models can reason over dynamic visual worlds through State Tracking, Structural Composition, Dynamic Spatiality, Temporal Ordering and Sequential Counting, spanning final-answer tasks, intermediate-state diagnostics, and difficulty-controlled settings.

**Figure 2.** Taxonomy of VIDEO-MME-LOGICAL. The inner ring separates direct-answer tasks from the intermediate-state diagnostic subset, while the outer rings group fine-grained task categories under five temporal-logical operation groups.

**Figure 3.** VIDEO-MME-LOGICAL combines controllable video generation, structured metadata, and diversified reasoning templates to build a 25-task temporal logical reasoning benchmark.

**Figure 4.** A qualitative example of intermediate-state evaluation on a state-tracking task.

**Figure 5.** Word cloud of VIDEO-MME-LOGICAL.

**Figure 6.** Additional visual examples from VIDEO-MME-LOGICAL.

**Figure 7.** Additional visual examples from VIDEO-MME-LOGICAL.

**Figure 8.** Additional visual examples from VIDEO-MME-LOGICAL.

**Figure 9.** Additional visual examples from VIDEO-MME-LOGICAL.

**Figure 10.** Additional visual examples from VIDEO-MME-LOGICAL.

**Figure 11.** Additional visual examples from VIDEO-MME-LOGICAL.

**Figure 12.** Additional visual examples from VIDEO-MME-LOGICAL.

Original abstract

Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Video-MME-Logical sets up a controlled benchmark around five operations and trace diagnostics, but the abstract supplies almost no validation details so the isolation claim stays untested.

The paper introduces Video-MME-Logical, a benchmark organized around state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition, with 25 task categories generated under controlled states and transitions. It also adds intermediate trace checks alongside final-answer scoring and varies horizon and complexity for difficulty control.

What stands out is the explicit attempt to separate temporal-logical reasoning from static recognition and scene clutter, plus the dual diagnostic format. That structure is cleaner than many existing video benchmarks that mix everything together.

The experiments section claims a human-model gap that grows with complexity and that supervised fine-tuning on up to 500K samples narrows but does not close it. Without numbers, tables, or generation statistics in the abstract, those claims cannot be checked yet.

The main soft spot is the lack of any reported validation that the controlled generation actually removes confounds. The abstract states the intent but gives no data on object diversity, transition coverage, or inter-annotator agreement on the logical traces. If those steps are missing or weak in the full paper, the isolation claim does not hold. The human baseline also needs a clear description of how participants were tested and scored.

This work is aimed at groups building or evaluating video MLLMs who want diagnostic tools rather than end-to-end accuracy numbers. A reader already working on temporal reasoning benchmarks would find the task taxonomy and trace evaluation worth looking at.

I would send it to peer review. The core idea is coherent and the motivation is sound; the methods and results sections need external scrutiny to confirm the controls work as described.

Referee Report

0 major / 4 minor

Summary. The paper introduces Video-MME-Logical, a controlled diagnostic benchmark for video temporal-logical reasoning in MLLMs. It organizes evaluation around five operations (state tracking, sequential counting, temporal ordering, dynamic spatiality, structural composition) realized in 25 fine-grained task categories via controlled generation of object states, transitions, and logical compositions. The benchmark supports difficulty scaling by temporal horizon and complexity, plus intermediate-state diagnostics. Experiments on SOTA MLLMs show a substantial human-model gap that widens with complexity; supervised fine-tuning on up to 500K generated samples yields gains but does not close the gap.

Significance. If the controlled generation and diagnostic protocol successfully isolate temporal-logical reasoning without confounding by static recognition or scene complexity, the benchmark would supply a scalable, falsifiable testbed for a capability that current video benchmarks largely conflate with other skills. The reported persistence of the gap after large-scale SFT would indicate a genuine limitation worth targeted architectural or training research.

minor comments (4)

The abstract and introduction refer to 'controlled object states, transitions, temporal dependencies, and logical compositions' and 'intermediate-state diagnostics' but the manuscript should supply explicit pseudocode or a figure detailing the generation pipeline and the exact verification procedure for the reasoning trace (e.g., how intermediate states are extracted and scored).
A table or figure reporting per-operation and per-complexity human vs. model accuracies (with standard errors) is needed to substantiate the claim that the gap 'especially' increases with temporal-logical complexity; aggregate numbers alone are insufficient.
The SFT protocol (data volume, sampling strategy, base model, training hyperparameters, and whether the 500K samples overlap with the benchmark) should be described in a dedicated subsection so that the 'remains insufficient' conclusion can be reproduced or extended.
Clarify the exact definition of 'temporal horizon' and 'reasoning complexity' used for difficulty control, and report how many videos fall into each bin.
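
As an illustration of the per-operation, per-complexity reporting requested in the second comment, the following sketch bins results and attaches a binomial standard error to each accuracy. The record layout `(operation, complexity, correct)` and the function name are assumptions made for this example, and the data shown is made up purely to exercise the code.

```python
import math
from collections import defaultdict


def accuracy_by_bin(records):
    """records: iterable of (operation, complexity, correct) tuples."""
    counts = defaultdict(lambda: [0, 0])      # bin -> [n_correct, n_total]
    for operation, complexity, correct in records:
        counts[(operation, complexity)][0] += int(correct)
        counts[(operation, complexity)][1] += 1
    table = {}
    for key, (k, n) in counts.items():
        p = k / n
        se = math.sqrt(p * (1 - p) / n)       # binomial standard error
        table[key] = (p, se, n)
    return table


# Toy records, not results from the paper.
toy = [("state_tracking", 1, True)] * 3 + [("state_tracking", 1, False)]
for key, (p, se, n) in accuracy_by_bin(toy).items():
    print(key, f"acc={p:.2f} se={se:.3f} n={n}")
```

Reporting the bin size `n` alongside each accuracy also addresses the fourth comment, since it shows how many items fall into each difficulty bin.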

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of Video-MME-Logical and the recommendation for minor revision. The summary accurately reflects the benchmark's design around the five operations, controlled generation, and experimental findings on the human-model gap.

Circularity Check

0 steps flagged

Benchmark construction with no derivations or fitted predictions

The paper introduces Video-MME-Logical as a controlled benchmark organized around five temporal-logical operations, generated with explicit states, transitions, and compositions. No equations, parameter fittings, predictions derived from prior fits, or self-citation chains appear in the abstract or described methods. The central claims concern benchmark isolation of capabilities and experimental reporting of model gaps after SFT; these do not reduce to self-referential inputs by construction. The work is self-contained as diagnostic data generation and evaluation, with no load-bearing steps matching the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Benchmark construction paper with no mathematical derivations; the central assumptions concern the validity of the chosen operations as isolators of temporal-logical reasoning.

axioms (1)

Domain assumption: The five operations (state tracking, sequential counting, temporal ordering, dynamic spatiality, structural composition) isolate the target capability.
The benchmark is organized around these operations, as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5784 in / 1175 out tokens · 39068 ms · 2026-06-29T04:51:41.005277+00:00 · methodology

