{"total":10,"items":[{"citing_arxiv_id":"2604.21444","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration","primary_cat":"cs.AI","submitted_at":"2026-04-23T09:04:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20760","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Exploring High-Order Self-Similarity for Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-22T16:48:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"through explicit motion representations such as optical flows [70] or through 3D convolutional networks [10,18,19,79,81]. With the rise of vision transformers [15], subsequent methods extend pre-trained image encoders to the video domain via end-to-end finetuning with spatio-temporal attention mechanisms [1,5,17, 40,42,52]. 
With the emergence of large vision foundation models [62,73,74,94], several methods further explore efficient image-to-video transfer by adapting frozen image encoders either with lightweight temporal modules [49,58,59,84,92] or side networks [46,61,93], avoiding costly end-to-end finetuning. In this work, we follow this efficient adaptation paradigm and demonstrate that high-order"},{"citing_arxiv_id":"2604.12391","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-04-14T07:26:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12033","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarking Deflection and Hallucination in Large Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-13T20:22:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLM-DeflectionBench is a new benchmark showing that current large vision-language models rarely deflect and instead hallucinate when given conflicting or insufficient multimodal evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04969","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MG$^2$-RAG: Multi-Granularity Graph 
for Multimodal Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-04-04T07:14:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MG²-RAG proposes a multi-granularity graph RAG framework that constructs hierarchical multimodal nodes via entity-driven visual grounding and performs structured retrieval, delivering SOTA results on four multimodal tasks with 43.3× faster graph construction.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"reflecting the reliability of the localized visual evidence. These three edge types integrate contextual hierarchy, semantic structure, and visual grounding into a unified heterogeneous topology, laying the foundation for structured multi-granularity retrieval in Section 3.2. In parallel, all textual and visual elements are embedded into a shared space via a unified encoder Ψ(·) (i.e., EVA-CLIP-8B [64]), which maps any elemental input x to a d-dimensional vector z_x = Ψ(x) ∈ R^d. 
We construct four embedding matrices corresponding to distinct levels of granularity: (1)Sentence matrix ZS ∈R |S|×d encoded from global sentencesS; (2)Chunk matrixZC ∈R |VC |×d encoded from document chunksVC; (3)Image matrix ZI ∈R |VI |×d encoded from source imagesVI; and (4)Object matrix ZO ∈R |VO|×d encoded from"},{"citing_arxiv_id":"2604.03231","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","primary_cat":"cs.CV","submitted_at":"2026-04-03T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"understanding and grounding, with notable gains on counting and pointing tasks (@3px/5px). - denotes not supported. 
Model Chart Diagrams Tables Others Counting Pointing LLAVA1.5-7B [37] 20.31 28.32 20.80 29.68 78.27 - LLAVA1.5-13B [37] 23.33 33.59 23.24 34.86 70.95 - LLAVA-Mistral 7B [37] 22.16 33.98 26.95 48.14 77.82 - Intern-VL2 8B [60] 57.71 72.85 73.82 90.62 74.05 - QWEN2-VL 7B [58] 45.21 64.25 61.13 86.91 57.42 - Pixtral-12B [2] 38.28 54.00 63.96 64.94 71.66 - Paligemma-3B [6] 16.50 26.26 20.80 20.11 8.57 - Kosmos-2 8B [46] 7.81 8.88 12.50 8.00 26.19 - Instruct-BLIP 7B [14] 13.28 17.08 16.60 10.54 36.19 - Phi3 7B [1] 10.54 9.37 9.57 7.22 12.61 - GLM-4V 9B [22] 40.23 58.65 54.12 84.37 84.76 - Molmo2 7B [15] 52.39 62.41 66.25 76."},{"citing_arxiv_id":"2603.11689","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks","primary_cat":"cs.AI","submitted_at":"2026-03-12T08:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces Explicit Logic Channel (ELC) with LLM, VFM and probabilistic inference for validating, selecting and enhancing MLLMs on zero-shot tasks using Consistency Rate and cross-channel integration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.13856","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QKVQA: Question-Focused Filtering for Knowledge-based VQA","primary_cat":"cs.IR","submitted_at":"2026-01-20T11:08:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QKVQA proposes a question-focused filtering method with QFF and CDA modules that boosts accuracy by 3.2 points on Encyclopedic-VQA and 2.2 points on InfoSeek over prior 
state-of-the-art.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24943","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents","primary_cat":"cs.CV","submitted_at":"2025-09-29T15:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.13181","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perception Encoder: The best visual embeddings are not at the output of the network","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"0 96.2 87.2 96.4 93.7 67.8 45.6 77.4 75.7 78.8 57.1 75.9 85.5 96.6 Unbounded Scale DFN-H+† [33] 0.6B 378 5B 81.6 84.3 78.3 79.6 79.6 93.6 73.3 80.5 96.2 91.6 96.8 96.0 72.5 37.9 77.4 75.9 75.8 55.6 71.8 82.1 93.6 InternVL-C [19] 5.5B 224 5B 82.5 83.2 77.3 80.6 83.8 95.7 74.3 76.4 95.3 85.8 96.3 94.4 53.3 35.1 76.3 74.4 78.6 58.6 74.9 85.0 95.7 EV A 18B[130] 17.5B 224 2B 83.6 83.8 77.9 82.2 87.3 95.7 74.7 78.8 95.8 86.0 96.1 94.9 
59.7 43.1 77.7 76.9 77.5 56.2 73.6 83.3 96.7 EV A 18B+[130] 17.5B 336 2B 84.1 83.9 78.2 83.6 88.9 95.6 74.3 - - - - - - - - - - - - - - SigLIP2-g-opt† [138] 1.1B 384 10B 86.2 85.0 79.8 88.0 90.5 96.6 77.4 81.0 97.0 91.5 97.8 95.9 73.6 40.1 76.3 75.9 78.0 56.1 72.8 86.0 95.4 PEcoreG (image only) 1."}],"limit":50,"offset":0}