super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

887 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 887 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 121 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 114 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Attend, Transform, or Silence: Operator-Level Visual Skipping for Efficient Multimodal LLM Inference

cs.CV · 2026-06-30 · conditional · novelty 7.0

The paper proposes an operator-level visual-token skipping framework for MLLMs that reduces TFLOPs by 33.7% on Qwen3-VL while retaining 99.5% performance across VQA benchmarks.

SENSE-VAD: Sentient and Semantic Video Anomaly Detection for Autonomous Driving

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

SENSE-VAD introduces the first synthetic benchmark dataset with per-frame labels for socially complex anomalies in autonomous driving scenes and shows existing video anomaly detectors fail on them.

citing papers explorer

Showing 50 of 70 citing papers after filters.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings cs.CV · 2026-05-08 · conditional · none · ref 11 · internal anchor
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models cs.CV · 2026-04-19 · unverdicted · none · ref 3 · internal anchor
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters cs.CV · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs cs.CV · 2026-05-12 · unverdicted · none · ref 24 · internal anchor
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
Dynamic Execution Commitment of Vision-Language-Action Models cs.CV · 2026-05-12 · unverdicted · none · ref 19 · 3 links · internal anchor
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CV · 2026-05-11 · conditional · none · ref 1 · 2 links · internal anchor
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 2 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 3 · internal anchor
STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 18 · 3 links · internal anchor
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 1 · 2 links · internal anchor
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs cs.CV · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Visual Preference Optimization with Rubric Rewards cs.CV · 2026-04-14 · unverdicted · none · ref 5 · internal anchor
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 6 · internal anchor
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CV · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation cs.CV · 2026-04-04 · unverdicted · none · ref 1 · internal anchor
PortraitCraft supplies a new 50k-image dataset and two tasks for evaluating AI on fine-grained portrait composition understanding and constrained generation.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV · 2026-03-31 · unverdicted · none · ref 2 · internal anchor
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 4 · 2 links · internal anchor
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
LDDR proposes a linear DPP-based dynamic-resolution frame sampler that achieves 3x speedup and up to 2.5-point gains on video MLLM benchmarks by selecting non-redundant frames and allocating tokens accordingly.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer cs.CV · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
Reinforcing Multimodal Reasoning Against Visual Degradation cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with negligible hallucination.
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing cs.CV · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring cs.CV · 2026-05-08 · unverdicted · none · ref 1 · 2 links · internal anchor
ChartREG++ benchmark for multi-target chart referring expression grounding with diverse clues plus a plotting-code synthesis pipeline for accurate masks that improves model performance.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models cs.CV · 2026-05-06 · unverdicted · none · ref 2 · 3 links · internal anchor
D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by having the model act as both teacher (with multimodal context) and student (with text-only context) on its own roll-outs.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV · 2026-05-01 · unverdicted · none · ref 5 · 2 links · internal anchor
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
Let ViT Speak: Generative Language-Image Pre-training cs.CV · 2026-05-01 · unverdicted · none · ref 7 · 2 links · internal anchor
GenLIP pretrains ViTs to generate language tokens from images via LM objective without contrastive batches or extra decoders, matching baselines on less data and improving on OCR after multi-resolution continued pretraining.
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation cs.CV · 2026-04-30 · unverdicted · none · ref 7 · internal anchor
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
ViPO: Visual Preference Optimization at Scale cs.CV · 2026-04-27 · unverdicted · none · ref 2 · internal anchor
Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch cs.CV · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image cs.CV · 2026-04-15 · unverdicted · none · ref 39 · internal anchor
Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running substantially faster.
Boosting Visual Instruction Tuning with Self-Supervised Guidance cs.CV · 2026-04-14 · unverdicted · none · ref 6 · internal anchor
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning cs.CV · 2026-04-09 · unverdicted · none · ref 31 · internal anchor
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 28 · internal anchor
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV · 2026-04-05 · conditional · none · ref 15 · internal anchor
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 3 · internal anchor
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer