Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

2024 · cs.CV · arXiv 2409.12191

Mixed citation behavior. Most common role is background (60%).

620 Pith papers citing it

abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.

citation-role summary

background 103 · baseline 28 · method 26 · dataset 6 · other 2

citation-polarity summary

background 99 · baseline 28 · use method 26 · use dataset 6 · unclear 5 · support 1

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

cs.CV · 2026-05-15 · conditional · novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 8.0

CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.

SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

cs.NE · 2026-04-13 · unverdicted · novelty 8.0

SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.

Generative Lane Topology Reasoning via Autoregressive Model with Geometry Prior

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

TopoGPT pre-trains an autoregressive transformer on serialized lane graphs from 3.3M scenes to learn geometry priors and uses a perception adapter to apply it to BEV features for improved lane graph prediction on OpenLane-V2.

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

cs.RO · 2026-06-30 · accept · novelty 7.0

RCT dataset with sequence-preserving splits demonstrates that tactile-to-text models achieve only 25.1% Recall@1 on held-out materials, exposing generalization as the core challenge.

Personalizing MLLMs via Reinforced Multimodal Reference Game

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

RRG trains MLLMs via a reinforced multimodal reference game with contrastive rewards on hard positives and negatives to produce accurate, discriminative concept descriptions, achieving SOTA on personalization benchmarks.

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

cs.CL · 2026-06-26 · conditional · novelty 7.0

VLMs default to visual grounding but a sparse circuit of 2.5-4.8% attention heads in later layers mediates prior-knowledge overrides, identified causally via patching and ablation across three model families.

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

cs.RO · 2026-06-26 · accept · novelty 7.0

VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds

cs.AI · 2026-06-25 · unverdicted · novelty 7.0

Look-Before-Move is a framework that converts narrative intent into Semantic Observation Contracts, uses Monte Carlo Viewpoint Search for feasible viewpoints, and applies Semantic Trajectory Grounding for coherent camera motion in dynamic 3D story worlds.

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

cs.AI · 2026-06-06 · accept · novelty 7.0

MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.

Closed-Form Spectral Regularization for Multi-Task Model Merging

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

Iterative solvers in layer-wise model merging act as spectral regularizers on an ill-posed interference operator; closed-form SWUDI and adaptive SWUDI-A match or exceed SOTA merging accuracy with 28-72x wall-clock speedup.

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

DisasterBench is a new multi-stage multimodal reasoning benchmark for UAV disaster response with 14 scenes and 9 tasks; the accompanying 2B DisasterVL model outperforms open-source MLLMs and approaches GPT-4o efficiency.

citing papers explorer

Showing 8 of 8 citing papers after filters.

ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation

cs.CR · 2026-06-02 · unverdicted · none · ref 37 · internal anchor

ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

cs.CR · 2026-05-15 · unverdicted · none · ref 59 · internal anchor

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

cs.CR · 2025-10-11 · unverdicted · none · ref 42 · internal anchor

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.

RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

cs.CR · 2026-06-04 · unverdicted · none · ref 25 · internal anchor

RedEdit finds that fewer than two photo edits on average let 76.2% of unsafe images evade detectors while retaining 93.0% of malicious semantics.

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

cs.CR · 2026-05-02 · conditional · none · ref 30 · internal anchor

Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

cs.CR · 2025-07-29 · unverdicted · none · ref 35 · internal anchor

PRISM decomposes harmful instructions into benign visual gadgets and directs LVLMs via prompts to compose them through reasoning into harmful outputs, achieving ASR over 0.90 on SafeBench.

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

cs.CR · 2026-05-16 · unverdicted · none · ref 24 · internal anchor

The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.

LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

cs.CR · 2025-07-13 · conditional · none · ref 16 · internal anchor

LaSM is a layer-wise scaling mechanism that amplifies attention and MLP modules in critical layers to defend GUI agents against pop-up attacks by correcting attention misalignment.
