NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Mixed citation behavior. Most common role is background (60%).
abstract
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
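The abstract's claim that images of varying resolutions map to different numbers of visual tokens can be illustrated with a rough sketch. This is not the model's actual preprocessing: the function name `visual_token_count`, the patch size of 14 pixels, the 2x2 merging of adjacent patches, and the round-up behavior are all illustrative assumptions that do not appear in the abstract above.

```python
import math


def visual_token_count(height: int, width: int,
                       patch_size: int = 14, merge: int = 2) -> int:
    """Estimate how many visual tokens an image would produce.

    Assumptions (hypothetical, for illustration only):
    - the image is cut into square patches of `patch_size` pixels;
    - each `merge` x `merge` group of adjacent patches becomes one token;
    - each side is rounded up to a whole number of merged units.
    """
    unit = patch_size * merge          # pixels per merged token along one side
    rows = math.ceil(height / unit)    # merged tokens along the height
    cols = math.ceil(width / unit)     # merged tokens along the width
    return rows * cols


if __name__ == "__main__":
    # Larger images yield more tokens instead of being resized to one fixed size.
    for h, w in [(224, 224), (448, 672), (1080, 1920)]:
        print(f"{h}x{w}: {visual_token_count(h, w)} visual tokens")
```

Under these assumptions the token count grows roughly in proportion to pixel area, which is the contrast the abstract draws with a predetermined-resolution approach, where every image would produce the same number of tokens.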
representative citing papers
MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.
SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving a 60% success rate across replicates in cantilever and phone-stand tasks.
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.
Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.
EgoTraj is a new open multimodal dataset of 75 long-horizon egocentric human navigation sequences in urban environments with head pose, gaze, and scene data, plus benchmarks of trajectory prediction methods.
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.
TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.
citing papers explorer
- EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning
- Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
- RAVE: Re-Allocating Visual Attention in Large Multimodal Models
- Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
- Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
- How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
- Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
- FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
- Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning
- Fast Image Super-Resolution via Consistency Rectified Flow
- Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
- CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
- Follow the Mean: Reference-Guided Flow Matching
- jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
- A Systematic Investigation of RL-Jailbreaking in LLMs
- D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
- Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging
- Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing
- GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models
- Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
- Let ViT Speak: Generative Language-Image Pre-training
- MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
- SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
- Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
- Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
- Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
- Identifying Topological Invariants of Non-Hermitian Systems via Domain-Adaptive Multimodal Model for Mathematics
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
- To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
- MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval
- MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
- FreeRet: MLLMs as Training-Free Retrievers
- HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement