{"total":15,"items":[{"citing_arxiv_id":"2606.30288","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context","primary_cat":"cs.CV","submitted_at":"2026-06-29T13:30:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22907","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T18:00:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 
16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15764","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:24:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13803","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvoGround: Self-Evolving Video Agents for Video Temporal Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:25:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09904","ref_index":49,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T02:47:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TOC-Bench is a new diagnostic benchmark that reveals major 
weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"9 6.0 91.3 58.2 48.1 16.2 38.0 30.1 39.6 28.7 41.2 GPT-5.4-mini 32 33.1 46.1 21.3 69.6 47.8 49.2 11.4 34.1 21.8 36.3 28.7 28.4 Mimo-V2-Omni [45] 32 31.6 47.5 4.7 69.6 57.5 47.6 7.6 30.3 26.9 42.1 27.0 35.1 Open-source thinking / reasoning models Qwen3-VL-8B-Thinking [1] 32 32.1 35.6 14.0 100.0 35.1 50.8 10.5 34.9 24.7 40.6 30.3 43.8 VideoChat-R1.5-7B [49] 32 29.4 45.2 7.0 100.0 47.0 47.1 9.5 29.3 17.0 40.6 23.6 33.1 Video-R1-7B [12] 32 25.1 34.2 0.0 100.0 32.8 46.5 14.3 25.6 15.1 35.3 24.7 32.5 Open-source standard models Qwen2.5-VL-72B [2] 32 34.0 47.5 3.0 95.7 49.3 50.3 15.2 38.2 30.1 40.6 27.0 42.6 InternVL3-8B [59] 32 30.1 42.5 18.7 91.3 42.5 43.3 5.7 32.0 20.2 31.7 26.4 27.8 LLaV A-Video-72B [58] 32 28."},{"citing_arxiv_id":"2605.09422","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs","primary_cat":"cs.CL","submitted_at":"2026-05-10T08:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The data elements are defined as: •a. Real Events:Events that actually occurred in the video and are described in textual form. •b. Real Video:Video that authentically contain multiple causal events (i.e., video clip form). • c. Fake Events:The textual event set is constructed by inserting fabricated cause eventsEf ={e ca l } into the original event sequence. 
We use GPT-5 [27] to generate candidate fabricated causes and retain those with higher language-only log-probability than the ground-truth causes, making them statistically plausible distractors rather than model-specific adversarial examples. • d. Fake Video: Constructed by replacing cause event clips in real videos with fabricated ones generated by Google VEO-3 [28], conditioned on the corresponding cause event text descriptions."},{"citing_arxiv_id":"2605.06094","ref_index":45,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28114-28128, 2025. [44] Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, and Xudong Jiang. Video-ktr: Reinforcing video reasoning via key token attribution. arXiv preprint arXiv:2601.19686, 2026. [45] Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025. 
12 [46] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan."},{"citing_arxiv_id":"2604.21718","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Building a Precise Video Language with Human-AI Oversight","primary_cat":"cs.CV","submitted_at":"2026-04-22T09:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Method Caption Generation Reward Modeling Critique Generation SubjectSceneMotionSpatialCameraAvg Subject Scene MotionSpatialCamera Avg SubjectSceneMotionSpatialCameraAvg Open-source models PerceptionLM [15] 8.2 4.8 5.0 7.0 7.5 6.5 38.2 32.4 29.9 34.9 39.9 35.1 2.5 1.5 2.0 1.8 2.2 2.0 OmniVinci [79] 2.8 5.2 3.5 5.5 3.0 4.0 35.9 42.9 37.5 32.8 34.7 36.8 1.2 2.2 1.8 2.5 1.3 1.8 VideoChat-R1.5 [74]6.5 5.8 4.2 2.8 5.0 4.9 42.5 44.3 41.0 43.3 49.3 44.1 1.0 2.5 1.8 2.0 2.2 1.9 SkyReels-V2 [11] 1.8 4.0 2.5 4.5 3.2 3.2 52.7 58.0 55.2 51.0 59.9 55.4 2.2 1.0 1.5 2.8 1.5 1.8 OwlCap [90] 4.8 5.5 3.8 5.2 2.5 4.4 48.4 51.3 49.7 47.4 55.2 50.4 1.5 2.5 2.0 1.2 1.8 1.8 video-SALMONN-2 [58]2.5 1.5 2.0 3.5 3.8 2.7 53.1 61.9 57.8 56.1 61.2 58.0 1.8 1.0 2.5 1."},{"citing_arxiv_id":"2604.05015","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video-MME-v2 is a new 
benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04379","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-06T03:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"mPLUG-Owl3-8B [48] - - 53.5 - 54.5 - - 59.8 InternVL2.5-8B [6] 41.6 - 63.7 68.7 70.5 - - - Qwen2.5-VL-7B [2] 37.4 47.4 65.1 69.2 67.5 51.3 42.0 56.0 RL-based LMMs Video-R1 [9] 35.8 52.3 59.3 73.2 63.9 - - - STAR-R1 [23] 34.1 49.2 56.6 72.4 67.8 - - - TinyLLaV A-Video-R1 [55] - - 46.6 49.5 - - - - VideoChat-R1 [22] - 50.0 58.8 73.9 67.9 - - - VideoChat-R1.5 [43] - 51.4 67.1 - 70.6 - 48.4 - VideoRFT [37] 36.8 51.1 59.8 73.7 62.1 - - - MOSS-ChatV [36] - 50.2 60.0 72.9 67.6 - - - RLER (Ours) 43.3 54.2 68.5 76.2 72.9 57.5 50.7 63.0 The frame-sensitive scores fs(o)follows Eq. (7) on parsed<keyframes>with deduplication and range checks. 
Think-transparency maps the reasoning length L(o), clipped to [64, 1024] tokens, to eL(o) ∈ [0, 1] and uses"},{"citing_arxiv_id":"2604.04372","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-06T02:43:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13754-13765, 2025. 3 [50] Zhucun Xue, Jiangning Zhang, Xurong Xie, Yong Liu, Xiangtai Li, Dacheng Tao, et al. Adavideorag: Omni-contextual adaptive retrieval-augmented efficient long video understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 3 [51] Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025. 
2 [52] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen"},{"citing_arxiv_id":"2602.02994","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation","primary_cat":"cs.CV","submitted_at":"2026-02-03T02:05:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.15693","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning","primary_cat":"cs.CV","submitted_at":"2025-12-17T18:48:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"proven to be effective in various visual tasks [60, 63], duplicating its success to our tasks is not naive. The resulting model achieves even worse performance than (1). Considering the base model's inability in AI-generated video detection tasks, purely RL can hardly equip the model with sufficient artifacts identifying capability without our cold-start initialization process [80]. (3) Without the RL stage: our reinforcement training stage further boosts the detection performance of the supervised finetuned model. Effect of Reward Design: Direct real-fake binary reward yields suboptimal performance. In our reward score, we introduce two special designs: asymmetric accuracy reward and inspection reward. 
When setting the accuracy reward to"},{"citing_arxiv_id":"2512.03963","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-12-03T16:57:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.13026","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-11-17T06:25:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}