{"total":318,"items":[{"citing_arxiv_id":"2606.22339","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"T-IMPACT: A Severity-Aware Benchmark for Contextual Image-Text Manipulation","primary_cat":"cs.CV","submitted_at":"2026-06-21T05:07:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"T-IMPACT is a new benchmark dataset and pipeline that supplies nearly 99k manipulated image-text pairs together with a human-calibrated continuous severity signal for contextual interpretation change.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02569","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AdaCodec: A Predictive Visual Code for Video MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:56:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00959","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition","primary_cat":"cs.AI","submitted_at":"2026-05-31T02:29:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PID applied to MLLMs identifies task-specific modality interaction profiles that generalize across models, extend to tri-modal cases, and yield initial performance gains via 
reweighting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00640","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Attribute-Based Measure of Video Complexity","primary_cat":"cs.CV","submitted_at":"2026-05-30T09:30:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00562","ref_index":114,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepLatent: Think with Images via Parallel Latent Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-30T06:33:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31457","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-29T15:51:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VisionPulse is a step-wise visual token pruning method for LMMs that retains 5% of tokens per step, shortens reasoning traces by 11.2%, and maintains 
accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30231","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30140","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection","primary_cat":"cs.CV","submitted_at":"2026-05-28T16:05:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnomalyAgent is a training-free agentic framework that equips MLLMs with anomaly-centric tools and a memory module to outperform VLM-based methods on both simple and complex contextual anomalies in zero- and few-shot settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23655","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception","primary_cat":"cs.CV","submitted_at":"2026-05-22T14:07:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CVSearch proposes an 
Assess-then-Search workflow combining expert-assisted search with Semantic Guided Adaptive Patching and Dynamic Bottom-Up Search to improve efficiency and accuracy on high-resolution image tasks for MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23508","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DrawVideo: Generating Long Video from Storyboard Keyframe Sketches","primary_cat":"cs.GR","submitted_at":"2026-05-22T11:16:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DrawVideo is a sketch-guided framework that decomposes long videos into controllable shots using keyframe sketches, appearance prompts, and motion prompts, supported by a new SketchLongVideo dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22907","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T18:00:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 
16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22678","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Swift Sampling: Selecting Temporal Surprises via Taylor Series","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:20:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22269","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering","primary_cat":"cs.CV","submitted_at":"2026-05-21T10:13:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22158","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs","primary_cat":"cs.AI","submitted_at":"2026-05-21T08:27:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-SimDiff is a training-free method 
using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22078","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding","primary_cat":"cs.AI","submitted_at":"2026-05-21T07:16:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22036","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation","primary_cat":"cs.CV","submitted_at":"2026-05-21T06:20:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21988","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:38:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRPO applies 
counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21919","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals","primary_cat":"cs.CV","submitted_at":"2026-05-21T02:44:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21652","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming","primary_cat":"cs.CV","submitted_at":"2026-05-20T19:06:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21625","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:36:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly 
videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21487","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21479","ref_index":125,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:58:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21059","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal LLMs under Pairwise Modalities","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new 
modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20950","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T09:37:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20914","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RISE: Reliable Improvement in Self-Evolving Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:57:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20682","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools","primary_cat":"cs.CV","submitted_at":"2026-05-20T03:52:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localization, and reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"under zero-shot object 
and defect categories. To address this, we propose tool-grounded multimodal CoT reasoning via RL, explicitly linking intermediate reasoning steps to external tool observations to reduce hallucination while preserving zero-shot generalization. Tool-Augmented Agentic Systems. Tool use has become an effective way to enhance multimodal reasoning. MVoT [41] incorporates visual evidence into reasoning chains as multimodal thoughts, while LLaVA-Plus [53] and VPD [37] enable tool learning via supervised training or program-derived data; more recent works like TACO [56] and PyVision [110] further extend this with RL. However, most rely on static tool-use pipelines or objectives rewarding tool invocation without considering"},{"citing_arxiv_id":"2605.20035","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stage-adaptive Token Selection for Efficient Omni-modal LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-19T15:55:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SEATS adaptively selects and removes non-text tokens before and inside the LLM layers of omni-modal models, yielding 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while keeping 96.3% performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19852","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Are Tools Always Beneficial? 
Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T13:44:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19846","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-19T13:40:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19660","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond","primary_cat":"cs.LG","submitted_at":"2026-05-19T10:53:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19506","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:01:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion 
cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19322","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-19T04:02:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18734","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:54:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18714","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Semantic Generative Tuning for Unified Multimodal 
Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:46:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":56,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"1 Multimodal Large Language Models Multimodal large language models (MLLMs) have become the dominant paradigm for image and video understanding by aligning pretrained visual encoders with powerful language backbones. Representative early systems include Flamingo [2], IDEFICS [56], and InstructBLIP [20], while later open-source families such as LLaVA [57, 71-73], Qwen-VL [3-5, 114], and InternVL [15, 16, 32, 115] further improve instruction following, high-resolution perception, and long-context multimodal reasoning. 
This line of work mainly follows the LLaVA paradigm [71], in which visual inputs are first encoded by a vision encoder [94, 108] and then concatenated with text tokens for joint modeling by a language model decoder."},{"citing_arxiv_id":"2605.18603","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:13:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18287","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StableVLA: Towards Robust Vision-Language-Action Models without Extra Data","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableVLA adds an Information Bottleneck Adapter to VLA models that improves robustness to visual corruptions by 30% on average with under 10M extra parameters and no extra data, even when using a much smaller backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18214","ref_index":26,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and 
Anticipation","primary_cat":"cs.CV","submitted_at":"2026-05-18T10:58:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks for temporal action segmentation, next-active object detection, interaction Anticip","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18115","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens","primary_cat":"cs.CV","submitted_at":"2026-05-18T09:24:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18018","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in 
MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17921","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Efficient Streaming Video Understanding Framework with Agentic Control","primary_cat":"cs.CV","submitted_at":"2026-05-18T06:29:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17260","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-17T05:02:52+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17093","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-16T17:33:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15951","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level 
Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15735","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UAM: A Dual-Stream Perspective on Forgetting in VLA Training","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15621","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs","primary_cat":"cs.CV","submitted_at":"2026-05-15T05:09:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% accuracy 
retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14475","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:15:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2312.11805, 2023. [9] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. ArXiv, abs/2304.08485, 2023. [10] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286-26296, 2023. [11] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. ArXiv, abs/2408.03326, 2024. [12] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 
Qwen-vl: A versatile vision-language model for understanding,"},{"citing_arxiv_id":"2605.14310","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T03:22:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14070","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language","primary_cat":"cs.NI","submitted_at":"2026-05-13T19:47:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13831","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:52:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further 
training.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"sequence by using the rendered page images as visual input and the parsed text elements as the target output. Table 1: We compare long-document VQA data with OCR transcription data under the same setting. The models are evaluated on the document category of MMLongBench [5] at 64K and 128K, which contains three datasets: MMLongBench-Doc [47], LongDocURL [48], and SlideVQA [55]. We abbreviate them as MMLB-D, LD-URL, and SLIDE, respectively. SFT means an extra 5B-token SFT stage. 64K MMLongBench 128K MMLongBench AVG. Training data MMLB-D LD-URL SLIDE AVG. MMLB-D LD-URL SLIDE AVG. Qwen2.5-VL-7B 32.17 49.57 75.00 52.24 26.96 51.85 68.00 48.94 50.59 extract-single 33.85 59.73 77.00 56.86 30.89 55.69 77.00 54.53 55.69 +5.1 extract-multi 32."}],"limit":50,"offset":0}