{"total":13,"items":[{"citing_arxiv_id":"2606.03920","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Visual State Tracking in Multimodal Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-02T17:12:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03890","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-06-02T16:51:32+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02522","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:32:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27705","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?","primary_cat":"cs.CR","submitted_at":"2026-05-26T21:27:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgenticVBench evaluates frontier VLMs on 100 real-world video post-production tasks across four families, with the best agent stack scoring just over 30% versus human experts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25979","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence","primary_cat":"cs.CV","submitted_at":"2026-05-25T15:54:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22907","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T18:00:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17260","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-17T05:02:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiteFrame is an efficient vision encoder backbone trained with Compressed Token Distillation and Language Model Adaptation to scale frame count in Video LLMs while cutting latency and raising accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15764","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:24:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10762","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:57:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09904","ref_index":14,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T02:47:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"object-level continuity needed for temporal object consistency. 2.2 Video QA Benchmarks and Temporal Evaluation Video QA benchmarks have evolved from short-video QA datasets such as MSVD-QA [46], MSRVTT- QA [47], ActivityNet-QA [52], TGIF-QA [16], NExT-QA [44], and STAR [41], to broader Video- LLM evaluation suites such as EgoSchema [28], MVBench [22], Video-MME [13], Video-MME- v2 [14], MMBench-Video [11], LongVideoBench [42], and LVBench [37]. These benchmarks cover activity understanding, long-form comprehension, multi-domain QA, and broad temporal reasoning, but most are built from video-level annotations, captions, or human-written questions rather than explicit object identity tracks. Recent benchmarks have also examined whether Video-LLMs rely on shortcuts when answering tem-"},{"citing_arxiv_id":"2605.07593","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos","primary_cat":"cs.CV","submitted_at":"2026-05-08T11:06:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[70] Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Mmr-life: Piecing together real-life scenes for multimodal multi-image reasoning.arXiv preprint arXiv:2603.02024, 2026. [71] Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, and Zhe Gan. Prism- bench: A benchmark of puzzle-based visual tasks with cot error detection.arXiv preprint arXiv:2510.23594, 2025. [72] Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026. [73] Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli,"},{"citing_arxiv_id":"2605.06094","ref_index":10,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"aspects of video reasoning, as summarized in Figure 1. As a result, VISD not only improves the 2 quality and faithfulness of reasoning, but also significantly enhances learning efficiency by enabling faster convergence with more informative and structured training signals. Training VideoLLMs further poses challenges due to long-horizon dependencies and heterogeneous reward signals [10, 30, 9]. VISD addresses these issues through a set of principled optimization strategies. We adopt a curriculum that gradually transitions from structured self-distillation to reinforcement learning and maintain an exponential moving average teacher to stabilize token-level supervision. Together, these designs enable robust and scalable training under complex video"},{"citing_arxiv_id":"2604.16893","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EasyVideoR1: Easier RL for Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-18T07:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"any single long sequence from monopolizing compute. Taking LVBench as an example, our pipeline achieves approximately6∼7×speedup over vanilla inference frameworks. 7 Table 2Video understanding benchmarks supported by the evaluation framework. Benchmark Task Type Number Metric General Video Understanding Video-MME [11] Multiple Choice 2,700 Accuracy Video-MME-v2 [12] Multiple Choice 3,200 Accuracy MVBench [19] Multiple Choice 3,586 Accuracy TempCompass [23] Multiple Choice 7,540 Accuracy MotionBench [14] Multiple Choice 3,715 Accuracy Long Video Understanding LVBench [37] Multiple Choice 1,492 Accuracy LongVideoBench [39] Multiple Choice 1,337 Accuracy MLVU [58] Multiple Choice 502 Accuracy Video Reasoning Video-Holmes [6] Multiple Choice 1,837 Accuracy"}],"limit":50,"offset":0}