{"total":24,"items":[{"citing_arxiv_id":"2605.22109","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?","primary_cat":"cs.AI","submitted_at":"2026-05-21T07:42:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19506","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:01:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19218","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference","primary_cat":"cs.CV","submitted_at":"2026-05-19T00:45:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in 
VLMs compared to prior token or channel methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18547","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation","primary_cat":"cs.AI","submitted_at":"2026-05-18T15:27:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13403","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RotVLA: Rotational Latent Action for Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-05-13T11:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[2] Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision- language-action models: Concepts, progress, applications and challenges.arXiv preprint arXiv:2505.04769, 2025. [3] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892-34916, 2023. 
[4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. [5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang,"},{"citing_arxiv_id":"2605.12624","ref_index":32,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12481","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents","primary_cat":"cs.AI","submitted_at":"2026-05-12T17:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 
[5] Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024. [6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. [7] Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui"},{"citing_arxiv_id":"2605.12237","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:07:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Such evidence can include vehicles, vessels, small facilities, or narrow infrastructure [35]. This extreme scale disparity requires VLMs to preserve global scene context while performing precise micro-level perception [22]. Standard VLM pipelines are usually constrained by fixed visual token budgets. When UHR images are resized [15], patched [13], or compressed into a limited set of visual tokens [6], small visual cues can be weakened before reasoning begins [43, 27]. 
We refer to this empirical gap as a resolution illusion: higher nominal resolution suggests richer visual evidence, yet measured performance remains poor on tasks requiring spatially small evidence. In Earth observation, such failures can lead to missed targets, incorrect spatial grounding, or answers based on coarse scene context rather than localized"},{"citing_arxiv_id":"2605.12056","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T12:42:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19-35. Springer, 2024. [5] Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025. [6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. 
[7] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu,"},{"citing_arxiv_id":"2605.11723","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:08:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sampling, or visual rechecking [54, 50] to improve evaluation depth. However, localized structural and physical anomalies can still be overwhelmed by predominantly normal context. In contrast, CaC introduces a coarse-to-fine spatiotemporal grounding mechanism, effectively concentrating on sparse anomalies amid normal context. Reinforcement Learning.The integration of RL [ 25] into LLMs and MLLMs [ 8, 49, 2] has significantly advanced their reasoning capabilities [58, 31, 39]. While early implementations relied on PPO [34], GRPO [35, 12] simplifies advantage estimation via group-relative baselines and leverages verifiable rewards to elicit long-chain reasoning without extensive preference data. 
Recently, GRPO has been further applied to visual understanding [53, 60, 50, 52, 37] and visual generation [61, 44, 26]."},{"citing_arxiv_id":"2605.11462","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images","primary_cat":"cs.CV","submitted_at":"2026-05-12T03:20:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Human Level - - - 67.3 91.3 94.4 92.6 - Random 25.0 25.0 25.0 32.7 25.0 33.6 24.9 32.4 Proprietary Models Gemini-2.0-Flash-Thinking [2] - - - - 34.3 47.4 44.0 47.1 GPT-4o [24] 69.4 81.3 75.4 36.4 58.3 51.7 47.8 - Claude-3.7-Sonnet [50] - - - 21.8 46.0 48.3 47.5 - Open-Source General Models LLaV A-OneVision-7B [23] 53.2 63.5 58.3 31.245.240.2 35.747.4 InternVL3-2B [51] - - - - 44.2 41.2 38.0 37.5 Qwen2.5-VL-3B [3] 69.1 72.2 70.6 24.6 31.7 41.2 40.3 33.2 Qwen2.5-VL-7B [3]75.083.179.039.2 37.4 45.0 39.2 38.8 Qwen3-VL-2B [3] 73.7 83.4 78.6 42.6 35.6 41.2 35.7 32.2 Open-sourced Specialized Models SpaceQwen2.5-VL-3B-Instruct[29] 54.9 60.7 57.8 36.9 32.047.440.3 33.3 Spatial-MLLM-4B [9] - - - 31.5 - - - 32.1 SpaceR-7B [30] 49."},{"citing_arxiv_id":"2605.10426","ref_index":3,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-11T12:01:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision 
and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. [2] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286-26296, 2024. [3] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. 
[4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo"},{"citing_arxiv_id":"2605.10106","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:20:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"If I am standing by the stove and facing the tv, is the sofa to my front-left, front- right, back-left, or back-right?\\nThe directions refer to the quadrants of a Cartesian plane (if I am standing at the origin and facing along the positive y-axis). Options: A. back-left B. front-right C. back-right D. front-left Cognitive Map Picture Format {\"cabinet\":[[8,2],[6,1],[5,2],[4,2],[9 ,3],[7,0],[6,0],[5,2],[6,1]], \"sink\":[[8,1]], \"oven\":[[6,1]], \"stove\":[[6,1]], \"sofa\":[[3,9]], \"table\":[[2,4],[0,5],[2,7],[5,8]], \"chair\":[[4,3],[2,5]], \"tv_monitor\":[[1,5]]} Cognitive Map Json Format Facing the TV from the stove, the sofa is located in front-right direction. 
The TV is on the wall to the right, and the sofa is positioned"},{"citing_arxiv_id":"2605.09982","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning","primary_cat":"cs.CV","submitted_at":"2026-05-11T04:50:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"compared to previous methods, preserving higher accuracy even under aggressive pruning ratios while significantly reducing end-to-end latency in high-resolution settings. 9 References [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. [2] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. 
[3] Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui"},{"citing_arxiv_id":"2605.09904","ref_index":6,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T02:47:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Temporalbench: Towards fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024. [5] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025. [6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. 
[7] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu,"},{"citing_arxiv_id":"2605.08985","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?","primary_cat":"cs.CV","submitted_at":"2026-05-09T15:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"in Neural Information Processing Systems, 37:27056-27087, 2024. [8] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024. [9] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. 
[10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N"},{"citing_arxiv_id":"2605.08802","ref_index":4,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-09T08:47:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoLVR uses latent contrastive objectives with angle-based perturbation and RL trajectory rewards to increase exploratory visual reasoning in MLLMs, delivering 5-8% gains on VSP, Jigsaw, and MMStar benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024. [3] Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, and Hangjie Yuan. Cogflow: Bridging perception and reasoning through knowledge internalization for visual mathematical problem solving, 2026. [4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185-24198, 2024. 
[5] Geonmo Gu, Byeongho Heo, Jaemyung Yu, Jaehui Hwang, Taekyung Kim, Sangmin Lee, HeeJae Jun,"},{"citing_arxiv_id":"2605.07568","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-08T10:40:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Modern Video-LLMs typically couple a vision encoder with an LLM through a projector. Their visual backbones range from frame-centric encoders, which process videos as individual frame sequences [42, 2], to video-centric encoders that explicitly model temporal dynamics across frames [44, 4, 1]. These models have achieved promising performance on broad video understanding benchmarks [11, 32, 47, 15]. However, a growing body of work shows that current Video-LLMs remain weak at temporal reasoning, a key prerequisite for understanding real-world dynamics [16, 13]. 
A particularly striking instance of this weakness arises in the Arrow of Time (AoT), the implicit assumption that events unfold irreversibly from past to future, constrained by gravity, entropy, and"},{"citing_arxiv_id":"2605.07338","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02735","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs","primary_cat":"cs.LG","submitted_at":"2026-05-04T15:36:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13710","ref_index":7,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-15T10:39:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SLQ adapts frozen MLLMs for multimodal retrieval by appending 
shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"their ability to model deep cross-modal interactions and struggling with queries requiring complex reasoning or world knowledge. 2.2 Multimodal Large Language Models MLLMs extend LLM reasoning capabilities to vision by mapping image features into the language model's embedding space [26, 49, 8]. Recent large-scale models like GPT-4V [ 1], Gemini [ 36], Qwen-VL [3, 37, 5, 4], and InternVL [7, 50, 39] demonstrate exceptional multimodal understanding through unified transformer architectures and massive pre-training. However, they are primarily optimized for autoregressive text generation rather than discriminative retrieval. Extracting high- quality, compact embeddings from these generative backbones without compromising their reasoning"},{"citing_arxiv_id":"2603.02210","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images","primary_cat":"cs.CV","submitted_at":"2026-03-02T18:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.21334","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Streaming Video Instruction 
Tuning","primary_cat":"cs.CV","submitted_at":"2025-12-24T18:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.05271","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DeepEyesV2: Toward Agentic Multimodal Model","primary_cat":"cs.CV","submitted_at":"2025-11-07T14:31:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}