OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.
Qwen3-VL Technical Report
Mixed citation behavior. Most common role is background (48%).
abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
representative citing papers
A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations reflect priors, and depth is inverted.
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.
Introduces the first large-scale multimodal benchmark MedLayXPlain-122K showing medical VLMs suffer significant lay-register degradation while general VLMs lack clinical precision.
LOCUS is a released corpus of nearly all US municipal and county ordinance codes, processed via OCR and paired with ModernBERT classifiers for dimensions such as opacity and paternalism.
A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.
ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
citing papers explorer
-
MotionAtlas: Detailed Region Captioning for Motion-Centric Videos
MotionAtlas supplies a 2,073-question benchmark, a self-bootstrap pipeline yielding 159k captions, and fine-tuned Video-MLLMs that deliver 5.2-point gains over Qwen3-VL-4B on motion tasks.
-
The Platonic Defense: Backdoor Defense for Self-Supervised Encoders in the Era of Large Scale Pre-training
Introduces an attack-agnostic black-box defense for SSL encoders that trains a conditional energy function via NCE and DSM to detect and purify representations, with an energy gap lower-bounded by mutual information.
-
HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector
HKVLM trains only an alignment hook to bind frozen LM query embeddings to frozen detector proposals via contrastive retrieval and bipartite assignment, yielding 50-90x grounding gains and reduced hallucinations on RefCOCO and POPE.
-
Detecting Clinical Hallucinations in LVLMs via Counterfactual Visual Grounding Uncertainty
A counterfactual visual grounding uncertainty method detects hallucinations in LVLMs on medical images, improving over baselines with interpretable evidence and cross-model transfer.
-
HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration
HAT-4D presents an agentic VLM-plus-human-in-the-loop pipeline for monocular 4D multi-object interaction reconstruction and releases the MVOIK-4D benchmark.
-
Toward Robust In-Context Segmentation via Concept Guidance
CG-ICS improves ICS robustness by using MLLM-proposed textual concepts scored via SAM3 and tree search plus visual exemplars to activate a frozen SAM3, claiming SOTA accuracy and lower variance across references.
-
ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering
ProMSA is a progressive multimodal search agent for KB-VQA that iteratively selects search tools under budgets, trained via rejection-sampling SFT then TN-GSPO RL, reporting gains on E-VQA and InfoSeek over RAG baselines.
-
Understanding How MLLMs Describe Artworks Using Token Activation Maps
Token Activation Maps applied to MLLM art descriptions reveal that visual grounding strength varies by token category, with better artist identification than title prediction.
-
Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models
VISE is an unsupervised self-evolving method for LMMs that uses invariance rewards to improve visual conditioning, reporting gains on captioning and reduced hallucination across multiple models.
-
RoPEMover: Depth-Aware Object Relocation via Positional Embeddings
RoPEMover extends 2D RoPE to a depth-aware version in diffusion transformers to enable consistent object relocation in single images, trained mostly on synthetic data with minimal real supervision and claiming SOTA results.
-
HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
HarmVideoBench is a multi-layered benchmark for harmful video understanding in LVLMs with three hierarchical dimensions, and BCR is a method that raises average model performance from 61.7% to 84.4%.
-
Unison: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation
Unison is a new benchmark with unified and decoupled tracks plus Unison-Judge to measure synergy between understanding and generation in multimodal models.
-
PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation
PortraitGen integrates real-image exemplars into GRPO sampling and applies dual rewards (OmniReward and AI-Portrait) to improve photorealism, claiming better results than baselines on a new PortraitBench.
-
SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing
SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.
-
From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP
CRISP diagnoses a systematic perception-reasoning disconnect in VLMs, showing proprietary models have latent reasoning but poor metric estimation while open-source models lack compositional reasoning.
-
Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs
ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.
-
Text Over Image: Auditing Multimodal Robustness in Synthetic Medical Image Detection
VLMs for synthetic medical image detection overweight text metadata, flipping authenticity judgments on the same image and dropping accuracy on authentic images by 61.1% on average when an explicit AI-origin tag is present.
-
Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity
Presents the Invoice Haystack benchmark for homogeneous document retrieval and a VL-RAG hybrid framework that achieves 60% Recall@1 and gains of up to 13.5 points over prior methods.
-
SteerVTE: Seamless Video Text Editing with Style and Glyph Control
SteerVTE adds lightweight style and dual-granularity glyph adapters to a frozen video diffusion model, introduces a glyph-aware loss and progressive training, and releases a 1M synthetic dataset to enable accurate video text editing.
-
Compression and Retrieval: Implicit Memory Retrieval for Video World Models
CaR uses attention with viewpoint positional encoding and context compression for flexible memory retrieval in video world models, backed by a new SceneFly dataset, and reports SOTA results with open-domain generalization.
-
READ More than What You See: Reinforcement Learning for Accurate and Coherent Audio Description Generations
READ is the first reinforcement-learning framework for training audio-description generators, using sequence-level rewards for reference match, length, format, and context-aware coherence.
-
OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs
OmniSpace is a plug-and-play method that improves spatial reasoning in MLLMs for AV by injecting camera pose, using epipolar attention across views, and distilling 3D geometric knowledge to overcome weak cross-view correspondence and depth estimation.
-
Training-Free Semantic Correction for Autoregressive Visual Models
Gazer uses MLLM feedback in two stages to diagnose semantic errors in intermediate AVM states and rewind/rectify the generation trajectory, improving alignment on compositional benchmarks without training.
-
CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming
CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.
-
T-IMPACT: A Severity-Aware Benchmark for Contextual Image-Text Manipulation
T-IMPACT is a new benchmark dataset and pipeline that supplies nearly 99k manipulated image-text pairs together with a human-calibrated continuous severity signal for contextual interpretation change.
-
CapRiCorn-1K: A Comprehensive Benchmark for Video Captioning and Subject Referential Consistency Across Temporal Scales
CapRiCorn-1K benchmark shows current video captioning models produce inaccurate and inconsistent captions that worsen with longer videos, with proposed metrics correlating to downstream task performance.
-
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
-
UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating
UnityShots uses fixed LTM and STM memory slots with boundary-conditioned gating and speaker tokens to achieve coherent multi-shot audio-video generation, leading open-source baselines on cross-shot coherence metrics.
-
A Neurosymbolic Framework for Interpretable Skeleton-Based Seizure Detection via Concept-Driven Logical Reasoning
Neurosymbolic framework detects seizures from video skeletons by activating clinical concepts and composing them with differentiable logic into interpretable rules, evaluated on two benchmarks with public code release.
-
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.
-
PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
PerceptionDLM enables parallel region captioning in multimodal diffusion language models via prompting and attention masking, introduces ParaDLC-Bench, and claims first parallel region perception with DLMs.
-
OneCanvas: 3D Scene Understanding via Panoramic Reprojection
OneCanvas aggregates multi-view 3D patches onto one panoramic canvas with continuous angular placement and 3D embeddings, enabling pretrained VLMs to achieve SOTA on SQA3D and VSI-Bench with an order of magnitude less compute via a new spatial pretraining curriculum.
-
AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model
Introduces AMALIA-VL, the first open-source instruction-tuned LVLM for European Portuguese, using a high-resolution vision encoder, pt-PT language model, learned connector, and three-stage training on a custom data mix.
-
Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning
Visual-OPSD distills reasoning from a privileged visual-thought teacher to a text-only student using on-policy JSD, delivering +3.40pp accuracy gain and 14.3x speedup over the generative teacher on nine benchmarks.
-
VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers
VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.
-
Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
UniAR uses a shared context-visual tokenizer with bitwise quantization and parallel prediction in an autoregressive framework to unify visual understanding and generation, claiming SOTA on generation and editing tasks.
-
EventDrive: Event Cameras for Vision-Language Driving Intelligence
EventDrive supplies a multi-task benchmark and EventDrive-VLM architecture that fuses event data, RGB, and language supervision, reporting gains in temporal precision and motion awareness for driving intelligence.
-
MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias
MLLMs show late-layer textual override of correct visual predictions, with a directional signature enabling a simple inference-time recovery method that improves conflict benchmarks by up to 9.4%.
-
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.
-
NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis
NeRD framework generates efficient ontology-grounded reasoning chains for medical image diagnosis via neuro-symbolic rule distillation, shown on skin datasets with expert validation.
-
Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation
Prompt2Effect is a weight-driven hypernetwork that synthesizes LoRA adapters for I2V models from prompts and base weights via SVD parameterization, matching fine-tuned quality at 3.3s inference instead of 56 GPU hours.
-
OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.
-
Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback
IVT teaches VLMs iterative spatial self-correction via visual feedback from rendered bounding boxes, improving Acc@0.5 by 2.4pp on referring expression benchmarks using 2400 samples and GRPO.
-
VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving
VLGA introduces geometry as a fourth modality in VLA models via pointmap regression loss, reporting SOTA open-loop and closed-loop driving metrics on nuScenes and Bench2Drive.
-
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
ReRe boosts open-source MLLMs on spatial reasoning benchmarks VSI-Bench and STI-Bench to rival proprietary SOTA by using a two-phase Reason then Re-reason process with Geometry-to-Video novel view synthesis.
-
ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation
ARGUS converts MLLM-selected identity evidence into a synchronized 3x3 mosaic injected as negative-time memory in a diffusion model, plus supporting training techniques, to achieve SOTA subject preservation on human video benchmarks.
-
CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence
CoCoSI is a training-free multi-agent system for collaborative cognitive map construction that improves spatial understanding in arbitrary pretrained MLLMs.
-
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?
Introduces Ego-MC-Bench benchmark and Ego-CoMist synthetic dataset showing that fine-tuning video LLMs on proactive mistake corrections improves performance especially for smaller models.
-
Temporal-Aware Reasoning Optimization for Video Temporal Grounding
TaRO improves video temporal grounding in MLLMs via constructive reasoning exploration from dense captions and a temporal-sensitivity reward that uses logit drops on disrupted event boundaries, followed by curriculum learning, and reports SOTA results.
-
HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging
HDRAgent is the first agent-driven framework for multi-exposure HDR imaging that uses MLLM scene perception, contextual knowledge matching, and perception-distortion feedback to reduce ghosting artifacts.