Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

2024 · cs.CV · arXiv 2409.12191

Mixed citation behavior. Most common role is background (60%).

664 Pith papers citing it

Background 60% of classified citations

abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
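As a rough illustration of what "different numbers of visual tokens" can mean in practice, the sketch below counts tokens for images of several sizes. The abstract does not give the patch size or any token-merging factor, so `patch_size=14` and `merge=2` are assumptions chosen for illustration, and `visual_token_count` is a hypothetical helper, not part of the released code. Resizing rules and special tokens are ignored.

```python
import math

def visual_token_count(height, width, patch_size=14, merge=2):
    """Estimate how many visual tokens an image of the given size yields.

    patch_size and merge are illustrative assumptions; the abstract
    does not state them.
    """
    # Split the image into a grid of patch_size x patch_size patches.
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    # Assume each merge x merge group of neighbouring patches becomes one token.
    return math.ceil(rows / merge) * math.ceil(cols / merge)

# Larger images yield more tokens instead of being squeezed to a fixed size.
for h, w in [(224, 224), (448, 672), (1080, 1920)]:
    print(f"{h}x{w}: {visual_token_count(h, w)} tokens")
```

Under these assumptions the token count grows roughly in proportion to pixel area, which is the contrast with a fixed-resolution encoder that always emits the same number of tokens.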

citation-role summary

background 103 · baseline 28 · method 26 · dataset 6 · other 2

citation-polarity summary

background 99 · baseline 28 · use method 26 · use dataset 6 · unclear 5 · support 1
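The two tallies above can be turned into percentage shares with a few lines of arithmetic. The sketch below uses only the counts listed on this page. Both tallies sum to 165 classified citations; the headline figure of 60% background matches the polarity tally (99 of 165), while the role tally gives about 62%. Which tally the page uses for its headline is not stated, so treat that reading as an assumption. The `shares` helper is hypothetical.

```python
# Counts copied from the two summaries above.
role_counts = {"background": 103, "baseline": 28, "method": 26,
               "dataset": 6, "other": 2}
polarity_counts = {"background": 99, "baseline": 28, "use method": 26,
                   "use dataset": 6, "unclear": 5, "support": 1}

def shares(counts):
    """Return each label's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(sum(role_counts.values()), role_shares)
print(sum(polarity_counts.values()), polarity_shares)
```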

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

cs.CV · 2026-05-15 · conditional · novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 8.0

CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.

SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

cs.NE · 2026-04-13 · unverdicted · novelty 8.0

SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

Seek to Segment: Active Perception for Panoramic Referring Segmentation

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.

Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing

cs.CV · 2026-07-02 · conditional · novelty 7.0

Proposes WUICC task and WUICC-bench dataset, then evaluates 11 image difference captioning methods plus 2 LLMs on web UI changes.

Disentangling Pictorial Cue Understanding from Language Bias in VLMs via Depth Ordering Task

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

VLMs show chance-level depth ordering performance (47-56%) on controlled images, driven by language bias rather than pictorial cues, with no improvement from CoT or ICL.

MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.

Generative Lane Topology Reasoning via Autoregressive Model with Geometry Prior

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

TopoGPT pre-trains an autoregressive transformer on serialized lane graphs from 3.3M scenes to learn geometry priors and uses a perception adapter to apply it to BEV features for improved lane graph prediction on OpenLane-V2.

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

cs.RO · 2026-06-30 · accept · novelty 7.0

RCT dataset with sequence-preserving splits demonstrates that tactile-to-text models achieve only 25.1% Recall@1 on held-out materials, exposing generalization as the core challenge.

Personalizing MLLMs via Reinforced Multimodal Reference Game

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

RRG trains MLLMs via a reinforced multimodal reference game with contrastive rewards on hard positives and negatives to produce accurate, discriminative concept descriptions, achieving SOTA on personalization benchmarks.

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

cs.CL · 2026-06-26 · conditional · novelty 7.0

VLMs default to visual grounding but a sparse circuit of 2.5-4.8% attention heads in later layers mediates prior-knowledge overrides, identified causally via patching and ablation across three model families.

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

cs.RO · 2026-06-26 · accept · novelty 7.0

VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds

cs.AI · 2026-06-25 · unverdicted · novelty 7.0

Look-Before-Move is a framework that converts narrative intent into Semantic Observation Contracts, uses Monte Carlo Viewpoint Search for feasible viewpoints, and applies Semantic Trajectory Grounding for coherent camera motion in dynamic 3D story worlds.

citing papers explorer

Showing 50 of 58 citing papers after filters.

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds cs.AI · 2026-06-25 · unverdicted · none · ref 37 · internal anchor
Look-Before-Move is a framework that converts narrative intent into Semantic Observation Contracts, uses Monte Carlo Viewpoint Search for feasible viewpoints, and applies Semantic Trajectory Grounding for coherent camera motion in dynamic 3D story worlds.
OctoT2I: A Self-Evolving Agentic Text-to-Image Router cs.AI · 2026-06-01 · unverdicted · none · ref 45 · internal anchor
OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.
Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion cs.AI · 2026-05-28 · unverdicted · none · ref 85 · internal anchor
Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? cs.AI · 2026-05-21 · unverdicted · none · ref 68 · internal anchor
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization cs.AI · 2026-05-20 · unverdicted · none · ref 47 · internal anchor
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain cs.AI · 2026-05-18 · unverdicted · none · ref 28 · 2 links · internal anchor
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
Allegory of the Cave: Measurement-Grounded Vision-Language Learning cs.AI · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines cs.AI · 2026-04-26 · unverdicted · none · ref 59 · internal anchor
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and linking success to specific architectural properties.
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding cs.AI · 2026-04-21 · unverdicted · none · ref 55 · internal anchor
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 43 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing cs.AI · 2026-07-02 · unverdicted · none · ref 61 · internal anchor
ScopeEdit decomposes MLLM edits into modality-local and evidence-gated shared branches using orthogonal low-rank spaces and recursive updates to improve scoped cross-modal transfer while preserving locality and efficiency.
Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning cs.AI · 2026-06-28 · unverdicted · none · ref 50 · internal anchor
Mixture of Debaters uses MoE to enable dynamic self-debate inside one model, claiming better accuracy than multi-agent systems at 3.7x lower latency and 87% fewer tokens on multimodal benchmarks.
ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection cs.AI · 2026-06-17 · unverdicted · none · ref 35 · internal anchor
ThinkDeception introduces MLLMs, a multimodal CoT dataset, and VAC-GRPO progressive RL to convert deception detection into interpretable reasoning and claims new SOTA accuracy plus rationale quality.
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach cs.AI · 2026-06-11 · unverdicted · none · ref 35 · internal anchor
Introduces UXBench benchmark for MLLM UI UX reasoning and UI-UX model achieving 0.7963 accuracy via RL enhancements on Qwen3-VL base.
Rethinking RAG in Long Videos: What to Retrieve and How to Use It? cs.AI · 2026-06-11 · unverdicted · none · ref 61 · internal anchor
Introduces V-RAGBench benchmark and CARVE method that selects per-chunk retrieval configurations via parallel retrievers and adaptive reranking, outperforming eight VideoRAG baselines.
Benchmark Everything Everywhere All at Once cs.AI · 2026-06-04 · unverdicted · none · ref 45 · internal anchor
Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.
Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models cs.AI · 2026-06-01 · unverdicted · none · ref 32 · internal anchor
Stopping large reasoning models at the first correct reasoning prefix improves accuracy up to 21% by avoiding harmful overthinking that destabilizes correct trajectories.
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees cs.AI · 2026-05-19 · unverdicted · none · ref 28 · internal anchor
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation cs.AI · 2026-05-18 · unverdicted · none · ref 35 · internal anchor
VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.
Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs cs.AI · 2026-05-18 · unverdicted · none · ref 49 · 2 links · internal anchor
Generative Visual Grounding creates instance-specific visual proxy images from EEG signals to enhance MLLM understanding of brain activity beyond text-only alignment.
How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study cs.AI · 2026-05-16 · unverdicted · none · ref 12 · 2 links · internal anchor
EEG study reveals distinct ERP patterns for AI hallucinations, with misjudged ones failing to trigger standard neurocognitive verification pathways.
Revealing Interpretable Failure Modes of VLMs cs.AI · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
REVELIO uncovers interpretable failure modes in VLMs by searching combinatorial concept spaces with diversity-aware beam search and Gaussian-process Thompson sampling, revealing vulnerabilities in autonomous driving and indoor robotics.
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 42 · internal anchor
SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer? cs.AI · 2026-05-11 · unverdicted · none · ref 35 · internal anchor
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric cs.AI · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning cs.AI · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.
Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing cs.AI · 2026-05-06 · unverdicted · none · ref 6 · 2 links · internal anchor
EBM-RL applies a GRPO-based RL method with decomposed rewards for scene alignment, perceptual utility, faithfulness, and format to improve video-grounded role-playing dialogue over text-only baselines.
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits cs.AI · 2026-05-05 · unverdicted · none · ref 27 · internal anchor
Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models cs.AI · 2026-05-05 · unverdicted · none · ref 38 · internal anchor
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems cs.AI · 2026-05-03 · unverdicted · none · ref 40 · 3 links · internal anchor
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems cs.AI · 2026-04-09 · unverdicted · none · ref 40 · internal anchor
MONETA is the first multimodal benchmark for industry classification using text and geographic sources, with MLLM baselines at 62-74% accuracy and up to 22.8% gains from multi-turn context enrichment and explanations.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models cs.AI · 2026-04-07 · unverdicted · none · ref 34 · internal anchor
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy cs.AI · 2026-03-02 · unverdicted · none · ref 57 · internal anchor
Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration cs.AI · 2025-12-22 · unverdicted · none · ref 22 · internal anchor
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 52 · internal anchor
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners cs.AI · 2025-04-19 · unverdicted · none · ref 52 · internal anchor
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
SmolVLM: Redefining small and efficient multimodal models cs.AI · 2025-04-07 · unverdicted · none · ref 32 · internal anchor
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models cs.AI · 2026-06-28 · unverdicted · none · ref 50 · 2 links · internal anchor
FADE attenuates FFN outputs at critical layers in LVLMs to curb language-prior dominance and cut hallucinations, shown effective on POPE, CHAIR, and MME across three models.
MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning cs.AI · 2026-06-16 · unverdicted · none · ref 46 · internal anchor
MathVis-Fine proposes a dataset with fine-grained visual annotations and dependency ratings plus a progressive two-stage training paradigm to align visual supervision with sample-specific necessity in multimodal mathematical reasoning.
The Hidden Power of Scaling Factor in LoRA Optimization cs.AI · 2026-06-11 · unverdicted · none · ref 6 · internal anchor
Alpha in LoRA outperforms learning-rate scaling, follows a square-root law with rank, and enables a minimalist LoRA-alpha method that improves performance across tasks.
Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition cs.AI · 2026-05-31 · unverdicted · none · ref 66 · internal anchor
PID applied to MLLMs identifies task-specific modality interaction profiles that generalize across models, extend to tri-modal cases, and yield initial performance gains via reweighting.
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding cs.AI · 2026-05-15 · unverdicted · none · ref 36 · internal anchor
DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making cs.AI · 2026-05-10 · unverdicted · none · ref 41 · internal anchor
SKG-VLA models each complaint as a structured scene via a Scene Knowledge Graph to improve policy-grounded multimodal reasoning and decision accuracy.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 21 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model cs.AI · 2026-04-13 · unverdicted · none · ref 61 · internal anchor
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning cs.AI · 2025-08-27 · unverdicted · none · ref 19 · internal anchor
InquireMobile applies two-stage reinforcement fine-tuning and pre-action reasoning to VLM mobile agents, raising inquiry success rate by 46.8% on the introduced InquireBench benchmark.
Xiaomi-GUI-0 Technical Report cs.AI · 2026-06-30 · unverdicted · none · ref 40 · 2 links · internal anchor
Xiaomi-GUI-0 reports 72.0% success on RealMobile and 78.9% on AndroidWorld via real-device closed-loop training with multi-source data and three-stage RL pipeline.
Vision Language Model Helps Private Information De-Identification in Vision Data cs.AI · 2026-06-08 · unverdicted · none · ref 4 · internal anchor
VisShield with OPTIC dataset enables VLMs to localize and mask private text in vision data via instruction tuning for privacy preservation.
