{"total":11,"items":[{"citing_arxiv_id":"2606.31599","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-06-30T12:47:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31467","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience","primary_cat":"cs.CV","submitted_at":"2026-06-30T10:46:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29808","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework","primary_cat":"cs.HC","submitted_at":"2026-06-29T05:40:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical 
accuracy with a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21734","ref_index":122,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-19T20:43:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17417","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models","primary_cat":"cs.SD","submitted_at":"2026-06-16T01:57:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark for mechanistic analysis of temporal failures in LALMs and shows attention scaling at bottleneck layers improves accuracy from 55.9% to 59.1%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01612","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Improving Small Object Grounding in LVLMs","primary_cat":"cs.CV","submitted_at":"2026-06-01T03:01:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention maps in LVLMs enable an IoU regressor (Pearson r > 0.67) and a training-free entropy-based selector that improves small-object localization by up to 
19% on COCO and Objects365.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01558","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-01T02:02:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12586","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"3D Primitives are a Spatial Language for VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:57:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07148","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T02:32:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy 
regularization to improve spatial task performance by up to 12.1%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Infinibench: Infinite benchmarking for visual spatial reasoning with customizable scene complexity. arXiv preprint arXiv:2511.18200, 2025. [12] Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 397-406, 2021. [13] Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773, 2025. [14] Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Beyond semantics: Rediscovering spatial"},{"citing_arxiv_id":"2602.17419","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-02-19T14:50:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.14837","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language 
Models","primary_cat":"cs.CL","submitted_at":"2025-09-18T10:58:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}