ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
Qwen3-VL Technical Report
Mixed citation behavior. Most common role is background (47%).
abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
representative citing papers
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
The MedHorizon benchmark reveals that current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
The WindowsWorld benchmark shows that leading GUI agents achieve under 21% success on multi-application professional tasks, failing especially on conditional judgment across three or more apps and executing inefficiently.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content; the paper addresses this with the VisualTextTrap benchmark (6,057 human-validated samples) and the VTHM-MoE dual-encoder framework, which uses dimension-specific experts and adaptive routing.
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
GaussDet enables open-vocabulary and referring segmentation in 3D Gaussians by learning instance features and aggregating votes from 2D detectors, improving referential grounding by 16.7% mIoU in the zero-shot setting.
Goku supplies a 2M-scale dataset, synthesis pipeline, decoupled dual-branch model, and 1000-case benchmark for multi-task instruction-based video editing, reporting up to 8% gains in instruction following.
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
citing papers explorer
-
Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality
Head-wise modality specialization via attention constraints and unimodal knowledge retention in MLLMs improves robustness to missing modalities in fake news detection while preserving full multimodal performance.
-
Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents
EchoTrust is an evidence-driven actor-verifier framework that produces structured intermediate representations for more reliable and interpretable reasoning in echocardiography visual language models.
-
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
Patch-level analysis of token attention patterns and semantic alignment detects LVLM hallucinations at up to 90% accuracy by identifying diffuse, non-localized grounding that global methods miss.
-
LOGER: Local-Global Ensemble for Robust Deepfake Detection in the Wild
LOGER ensembles heterogeneous global vision models with selective local patch aggregation via multiple instance learning to achieve robust deepfake detection across varied manipulations and degradations.
-
Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
CogAlign uses hierarchical supervised fine-tuning on clinical cognition data plus counterfactual RL to align MLLMs with expert diagnostic pathways and enforce causal lesion grounding for GI endoscopy diagnosis.
-
Gym-V: A Unified Vision Environment System for Agentic Vision Research
Gym-V supplies 179 visual environments showing that observation scaffolding like captions and rules matters more for training success than the choice of RL algorithm.
-
Exploring a Multimodal Chatbot as a Facilitator in Therapeutic Art Activity
The authors built and expert-tested a multimodal chatbot that analyzes drawings in real time and holds reflective conversations to aid therapeutic art activities.
-
JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication
JARVIS combines hybrid retrieval and evidence graphs with LLMs to raise deceptive-review detection precision from 0.953 to 0.988 and recall from 0.830 to 0.901 on a custom dataset while cutting manual inspection time by 75% in production.
-
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.
-
Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning
Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
-
SAM3-I: Segment Anything with Instructions
SAM3-I extends SAM3 with cascaded instruction adaptation and a new dataset to enable direct segmentation from rich natural-language instructions while retaining concept-level performance.
-
OneThinker: All-in-one Reasoning Model for Image and Video
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation
ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.
-
Latent-CURE for Breast Cancer Diagnosis
Latent-CURE introduces latent-space chain-of-thought reasoning and dual-asymmetric optimization to produce transparent, robust breast cancer diagnoses in imbalanced cohorts.
-
Consistency as Inductive Bias: Learning Cross-View Invariance for Robust Multimodal Reasoning
ConsistRoll enforces cross-view consistency during RLVR training for MLLMs by joint rewards on grouped original and augmented views, yielding robustness gains on math, general, and hallucination benchmarks.
-
SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants
SkillChain automates skill lifecycle for e-commerce image AI assistants via creator, optimizer, and refiner stages, leading to improved response quality and user engagement in production A/B tests.
-
On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making
REIS reduces inference redundancy in embodied robotic planning via lightweight gating and routing while preserving task performance on ALFRED and real robots.
-
Multi-Stage VLM Pipeline for Zero-Shot Traffic Accident Understanding
A multi-stage VLM pipeline built on Qwen3-VL models wins the ACCIDENT challenge for zero-shot traffic accident understanding from video, achieving public/private leaderboard scores of 0.55469/0.57080.
-
ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement
ViASNet applies a 3D U-Net architecture augmented with audio and semantic inputs to predict dynamic saliency in video ads and uses frame-wise entropy to diagnose low-engagement scenes on eye-tracked data from 151 ads.
-
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration
OmniVerifier-M1 is a generalist visual verifier using symbolic outputs for meta-verification and decoupled RL to outperform joint optimization for robust verification and agentic self-correction.
-
Audio-Mind: An Auditable Agentic Framework for Audio Understanding
Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.
-
Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning
Mags-RL uses agentic RL and a super-resolution agent for two-round reasoning in MLLMs, claiming gains on VSR, TallyQA, and GQA with a curriculum needing only 40 samples.
-
Darwin Mobile Agent: A Roadmap for Self-Evolution
Introduces an open-source mobile GUI agent training framework and a roadmap for autonomous self-evolution via removal of human priors in three pillars.
-
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.
-
SIREN: Unified Multi-Granularity Semantic Interaction for Multi-Modal Lifelong User Interest Modeling
SIREN unifies multi-modal and collaborative features for lifelong user interest modeling via semantic ID retrieval and target-aware transformer interactions, reporting SOTA GAUC and positive GMV gains in production.
-
Tracing the ongoing emergence of human-like reasoning in Large Language Models
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
-
iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment
iDiff is a dual-branch framework with an Answer Model for robust pairwise preference prediction via view decomposition and ensembles, and a Thinking Model for structured rationale generation using templates and answer-aware supervision, winning first place in the NTIRE 2026 RAIM challenge.
-
Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents
Proposes a knowledge-adaptive edge expert agent architecture for sustainable biodiversity monitoring that separates visual perception from reasoning with an explicit knowledge base.
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
-
Fine-tuning a vision-language model for fracture-surface morphology recognition
Fine-tuning Qwen3-VL-32B-Instruct on a curated set of 13k fracture images yields a specialist model achieving 0.92 precision on morphology recognition, outperforming the base model and several proprietary VLMs on a 100-image manual benchmark.
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top results on video generation and editing benchmarks, with 4-step inference up to 95.9x faster than baselines.
-
LLM-Enhanced Topical Trend Detection at Snapchat
Snapchat's deployed system detects emerging topical trends in short videos via multimodal extraction, time-series burst detection, and LLM consolidation, achieving high precision over six months of human evaluation and improving content freshness in production.
-
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
-
A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
A multistage extraction pipeline with page-level retrieval improves field-level accuracy by up to 31.9 percentage points over direct VLM application on 3000 pages of real multilingual KYC documents, reaching 87.27% with PaddleOCR and MiniCPM2.6.
-
Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference
VIBES uses Bayesian inference to trigger focused VLM reasoning on localized far-field regions in expressway videos, improving anomaly detection accuracy and efficiency.
-
EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks
EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstrained environments.
-
Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
Geometric Reward Credit Assignment disentangles rewards to geometric tokens and adds reprojection consistency to boost 3D keypoint accuracy from 0.64 to 0.93 and bounding box IoU to 0.686 on a ShapeNetCore benchmark while preserving 2D performance.
-
PLaMo 2.1-VL Technical Report
PLaMo 2.1-VL reports 61.5 ROUGE-L on JA-VG-VQA-500, 85.2% on Japanese Ref-L4, 53.9% zero-shot factory accuracy, and raises anomaly detection F1 from 39.7 to 64.9 after fine-tuning.
-
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
-
Real-Time Visual Attribution Streaming in Thinking Model
An amortized estimator trained on attention features provides real-time faithful visual attributions for multimodal reasoning models, matching the faithfulness of exhaustive causal methods.
-
The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results
The NTIRE 2026 CD-FSOD Challenge report details innovative methods and performance results from 19 teams on cross-domain few-shot object detection in open- and closed-source tracks.
-
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.