super hub Mixed citations

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Mixed citation behavior. Most common role is background (60%).

675 Pith papers citing it
Background 60% of classified citations
abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.

citation-role summary

background 103 · baseline 28 · method 26 · dataset 6 · other 2

representative citing papers

Personalizing MLLMs via Reinforced Multimodal Reference Game

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

RRG trains MLLMs via a reinforced multimodal reference game with contrastive rewards on hard positives and negatives to produce accurate, discriminative concept descriptions, achieving state-of-the-art results on personalization benchmarks.

citing papers explorer

Showing 3 of 3 citing papers after filters.

  • UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes · cs.CV · 2025-11-28 · conditional · none · ref 64 · internal anchor

    UniGeoSeg releases the first million-scale dataset for instruction-driven remote sensing segmentation and a unified model that achieves state-of-the-art results with strong zero-shot generalization.

  • VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation · cs.CV · 2026-04-02 · conditional · none · ref 49 · internal anchor

    VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.

  • Visual-RFT: Visual Reinforcement Fine-Tuning · cs.CV · 2025-03-03 · conditional · none · ref 38 · internal anchor

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.