OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.
Qwen3-VL Technical Report
Mixed citation behavior. Most common role is background (48%).
abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
representative citing papers
A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations follow priors, and depth relations are inverted.
The DataComp-VLM benchmark shows that instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Phone-use agents on real devices complete harmful tasks such as procuring toxic precursors at an average rate of 68.8% with few refusals, including a documented case of deceiving a doctor to obtain poison ingredients.
Introduces MedLayXPlain-122K, the first large-scale multimodal benchmark of its kind, showing that medical VLMs suffer significant lay-register degradation while general VLMs lack clinical precision.
LOCUS is a released corpus of nearly all US municipal and county ordinance codes, processed via OCR and paired with ModernBERT classifiers for dimensions such as opacity and paternalism.
A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.
ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
The MedHorizon benchmark reveals that current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
The ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
citing papers explorer
- IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams
IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance across settings.
- Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning
VGID constructs an intervention-induced teacher distribution via visual perturbation plus textual in-context unlearning and distills it into the student MLLM to achieve parameter-level forgetting.
- REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization
REVERSE uses tool-grounded trajectories and process rewards on visual grounding, query utility, and evidence discrimination to train a 4B model that outperforms retrieval-augmented baselines on Im2GPS3k and YFCC4k.
- DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
DynFrame introduces tokenized learnable span-density retrieval and Segment-Decoupled GRPO in video MLLMs, achieving competitive or SOTA results on six benchmarks with 4B and 8B models.
- O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding
O-MARC is a compression distillation framework that lets compact omnimodal models maintain or exceed full-token performance on video QA while cutting latency and memory by about 35%.
- InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward
InterSketch improves long-horizon visual-textual chain-of-thought in VLMs by dynamically generating and interleaving self-correcting visual sketches with text, using a synthesized dataset plus reflection in cold-start followed by stepwise-reward RL, and reports outperforming Gemini-3-Pro on benchmar
- AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation
AnchorDiff performs training-free concept grounding in multi-modal diffusion transformers by anchor selection followed by graph propagation on attention-derived graphs, reducing concept leakage on a new multi-concept dataset.
- InstructSAM: Segment Any Instance with Any Instructions
InstructSAM uses learnable queries in a VLM to condition SAM3 for single-pass multi-instance segmentation from arbitrary instructions, with a new Inst2Seg benchmark.
- StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering
StreamOV proposes evidence-guided long-short term memory and a hidden-state-driven trigger for efficient online audio-visual reasoning in streaming videos, along with the SOVBench benchmark for multi-turn evaluation.
- Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking
Empirical studies reveal instruction-level generalization as the main bottleneck in scribble-guided editing; three strategies (curriculum, multi-task mosaicking, edit-focused loss) achieve SOTA on the VIBE benchmark.
- ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs
ProSR adds a Counterfactual Invariance Penalty and a Tail Drift Penalty to shape VLM reasoning trajectories for better visual dependence and stability on spatial tasks.
- MetaphorVU: Towards Metaphorical Video Understanding
Introduces the first benchmark for metaphorical video understanding, identifies MLLM weaknesses in cross-domain mapping, and proposes an inference-time enhancement using a knowledge graph.
- Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation
Introduces Layout-as-Policy (LaP) to turn 3D layout estimation into an iterative policy-learning refinement process for better physical coherence.
- ClueAegis: Heuristic-to-Reasoning Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection
ClueAegis introduces a heuristic-to-reasoning cognitive skill framework and ClueAegis-Bench for evidence-based synthetic image detection that outperforms end-to-end classifiers in generalization and explainability.
- Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing
RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.
- FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis
The FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with the best model scoring 0.36.
- D2-V2X: Depth-Driven Cooperative V2X Reasoning for Autonomous Driving
D2-V2X benchmark and LiDAR-aligned VLM baseline raise occluded hazard recall to 24.4% and cut spatial estimation error by 77% versus zero-shot models in cooperative V2X settings.
- PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.
- EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.
- Cambrian-P: Pose-Grounded Video Understanding
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
- Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following
A method combining head-conditioned local LoRA adaptation and out-of-cone penalty improves gaze reasoning in vision foundation models, yielding state-of-the-art results on GazeFollow and VAT datasets especially for non-salient targets.
- GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.
- From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding
ReceiptBench provides a 10k-receipt benchmark with four hierarchical VIE subtasks and a GRPO-based training framework that achieves SOTA results on receipt reasoning and parsing over proprietary models.
- Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
- MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.
- EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.
- Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
- SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.
- VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment
VersusQ introduces a pairwise margin reasoning framework using large multimodal models to predict signed continuous quality margins between video pairs, claiming improved cross-domain generalization over pointwise scoring methods.
- TextSculptor: Training and Benchmarking Scene Text Editing
TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing performance.
- Multimodal LLMs under Pairwise Modalities
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
- Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
ILVAD is a plug-and-play method that builds a saliency map from inter-layer attention discrepancies on early tokens to enhance visual evidence focus and ground generated text, reducing hallucinations in LVLMs.
- Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding
A graph-grounded Combined Road Substrate framework generates traceable QA pairs from road maps to improve small VLMs on compositional road reasoning tasks.
- ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve a +7.9% average improvement over the Qwen3-VL baseline across six benchmarks.
- World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
- P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation
P2DNav proposes a three-part hierarchical framework (panorama-to-downview reasoning, sliding-window dialogue memory, and reflective reorientation) that reports large success-rate gains on the R2R-CE zero-shot VLN benchmark.
- Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
Artifact-Bench supplies a three-level artifact taxonomy and three evaluation tasks that show 19 MLLMs perform near or below random on AI-video realism detection and reasoning.
- Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory
IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.
- Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
- MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
- SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning
SpatioRoute introduces dynamic prompt routing that improves zero-shot spatial VQA accuracy by up to 5% on the SQA3D benchmark across VLMs without 3D inputs or fine-tuning.
- SurgLQA: Scalable Long-Horizon Surgical Video Question Answering
SurgLQA introduces FTC for compact long-range video representations and TMS for adaptive test-time scaling, reporting gains on restructured Colon-LQA and REAL-Colon-VQA benchmarks.
- Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding
A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.
- Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
IC-Seg is a multi-turn clarification framework with hierarchical GRPO optimization that resolves ambiguous queries in referring video object segmentation and introduces the Ambi-RVOS benchmark.
- Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
- GeoWorld-VLM: Geometry from World Models for Vision-Language Models
GeoWorld-VLM aligns VLM image features with intermediate representations from camera-conditioned world models via fine-tuning only the encoder and projector, yielding ~4% gains on What'sUp and VSR spatial benchmarks across two VLM backbones.
- VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.
- Unlocking Dense Metric Depth Estimation in VLMs
DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.
- Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models
Proposes AGSR and the FAB-G supervised multi-agent framework that predicts attribute salience from human annotations to constrain MLLM emotion reasoning, yielding gains on EmoArt and cross-dataset tests.
- SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.