super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

988 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 988 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 121 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 114 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents

cs.MM · 2026-06-26 · unverdicted · novelty 8.0

Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

cs.CL · 2026-06-01 · conditional · novelty 8.0

FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

citing papers explorer

Showing 31 of 81 citing papers after filters.

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA cs.RO · 2026-03-31 · unverdicted · none · ref 4 · internal anchor
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons cs.RO · 2026-03-02 · unverdicted · none · ref 124 · internal anchor
Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning cs.RO · 2026-02-09 · unverdicted · none · ref 18 · internal anchor
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding cs.RO · 2025-12-27 · conditional · none · ref 16 · internal anchor
OBEYED-VLA improves VLA robustness in cluttered real-world manipulation by disentangling perception into VLM-based object-centric grounding and geometry-aware stages, then fine-tuning the policy only on single-object demonstrations.
FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation cs.RO · 2026-06-29 · unverdicted · none · ref 79 · internal anchor
FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.
OpenSPM: An Environment-Transferable Robotic Key Spatial Pose Memory and Closed-Loop High-Frequency Flow-Matching Action Generation Model cs.RO · 2026-06-29 · unverdicted · none · ref 28 · internal anchor
OpenSPM extracts key spatial poses from demonstrations as transferable memory and generates high-frequency actions via flow-matching, achieving 85.6% success rate on ten LIBERO-GOAL tasks at 1033.3 Hz.
S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation cs.RO · 2026-06-26 · unverdicted · none · ref 24 · internal anchor
S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models cs.RO · 2026-06-09 · unverdicted · none · ref 3 · internal anchor
Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation cs.RO · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
DeMaVLA is a VLA foundation model using a pruned action expert and flow matching, pre-trained on 5000 hours of real demonstrations and post-trained on multi-task folding data with human-in-the-loop correction, reporting competitive benchmark and real-world folding performance.
TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues cs.RO · 2026-05-29 · unverdicted · none · ref 21 · internal anchor
TARIC maintains traversability-consistent guidance using 3D cue memory during semantic cue interruptions in outdoor VLN, improving success rates on long routes.
SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving cs.RO · 2026-05-19 · unverdicted · none · ref 56 · internal anchor
SafeAlign-VLA uses counterfactual safety pairing and anchor-based group relative policy optimization to incorporate negative data for safer VLA-based autonomous driving.
Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation cs.RO · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
A vision-language model outputs dual heatmaps for navigation affordance and facing to ground semantic instructions into executable free space, achieving higher affordance rates than waypoint regression across simulated robot embodiments.
AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment cs.RO · 2026-05-17 · unverdicted · none · ref 27 · internal anchor
AffordVLA improves VLA models for robotic manipulation by implicitly injecting affordance perception through feature alignment with a zero-shot teacher, claiming SOTA results in simulation and real-world tests.
PhysBrain 1.0 Technical Report cs.RO · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.
Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action cs.RO · 2026-05-14 · unverdicted · none · ref 3 · 2 links · internal anchor
A unified embodied foundation model uses one VLM for understanding and reasoning plus a joint video-action future generator, reporting competitive scores on VLM, world modeling, and robot benchmarks without apparent compromise.
Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control cs.RO · 2026-05-14 · unverdicted · none · ref 25 · 2 links · internal anchor
DAJI is a hierarchical framework using distillation and autoregressive generation to learn future-aware joint intents for language-conditioned humanoid robot control.
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 72 · internal anchor
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation cs.RO · 2026-05-09 · unverdicted · none · ref 4 · internal anchor
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts cs.RO · 2026-05-07 · unverdicted · none · ref 1 · 3 links · internal anchor
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models cs.RO · 2026-04-21 · unverdicted · none · ref 4 · internal anchor
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing cs.RO · 2026-04-15 · unverdicted · none · ref 34 · internal anchor
SpaceMind is a self-evolving modular VLM agent framework that achieves 90-100% navigation success in nominal conditions and recovers from failures via experience distillation, with zero-code transfer to physical robots for on-orbit tasks.
Task-Aware Bimanual Affordance Prediction via VLM-Guided Semantic-Geometric Reasoning cs.RO · 2026-04-09 · unverdicted · none · ref 9 · internal anchor
A VLM-guided method for joint bimanual affordance localization and arm allocation achieves higher real-world task success rates than geometric or semantic baselines across nine manipulation tasks.
GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language Navigation cs.RO · 2026-06-02 · unverdicted · none · ref 10 · internal anchor
GN0 curates GN-Matrix dataset, builds 3DGS simulator and GN-Bench, and trains BAE model via supervised learning plus DAgger and RL to unify VLN tasks and outperform prior methods on GN-Bench and VLN-CE.
On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making cs.RO · 2026-05-29 · unverdicted · none · ref 60 · internal anchor
REIS reduces inference redundancy in embodied robotic planning via lightweight gating and routing while preserving task performance on ALFRED and real robots.
RLDX-1 Technical Report cs.RO · 2026-05-05 · unverdicted · none · ref 7 · 2 links · internal anchor
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks cs.RO · 2026-04-26 · unverdicted · none · ref 3 · internal anchor
EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstrained environments.
RhinoVLA Technical Report cs.RO · 2026-06-05 · unverdicted · none · ref 1 · internal anchor
RhinoVLA cuts VLM tokens with a Qwen3-VL backbone and continuous action expert, adds a unified cross-robot interface, and reaches real-time 11.69 Hz on Huixi R1 while matching π0.5 downstream performance.
3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance cs.RO · 2026-06-30 · unreviewed · ref 2 · internal anchor
Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision cs.RO · 2026-06-29 · unreviewed · ref 5 · internal anchor
AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly cs.RO · 2026-04-10 · unreviewed · ref 2 · internal anchor
SkillWrapper: Generative Predicate Invention for Task-level Robot Planning cs.RO · 2025-11-22 · unreviewed · ref 6 · internal anchor

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer