super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

787 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 787 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 121 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 114 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception, Collaboration, and Motion

cs.CV · 2026-06-26 · accept · novelty 7.0

SpatialUAV is a new real-world benchmark dataset and evaluation suite exposing large gaps between vision-language models and human performance on spatial tasks for low-altitude UAVs.

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

citing papers explorer

Showing 50 of 787 citing papers.

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos cs.CV · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
Artifact-Bench supplies a three-level artifact taxonomy and three evaluation tasks that show 19 MLLMs perform near or below random on AI-video realism detection and reasoning.
A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability cs.DC · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.
Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory cs.CV · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 4 · 2 links · internal anchor
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents cs.CV · 2026-05-18 · conditional · none · ref 3 · internal anchor
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning cs.CV · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
SpatioRoute introduces dynamic prompt routing that improves zero-shot spatial VQA accuracy by up to 5% on the SQA3D benchmark across VLMs without 3D inputs or fine-tuning.
SurgLQA: Scalable Long-Horizon Surgical Video Question Answering cs.CV · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
SurgLQA introduces FTC for compact long-range video representations and TMS for adaptive test-time scaling, reporting gains on restructured Colon-LQA and REAL-Colon-VQA benchmarks.
Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 30 · internal anchor
A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.
Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation cs.IR · 2026-05-17 · unverdicted · none · ref 3 · internal anchor
TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models cs.CV · 2026-05-17 · unverdicted · none · ref 5 · internal anchor
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 78 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning cs.RO · 2026-05-16 · unverdicted · none · ref 2 · internal anchor
DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation cs.CV · 2026-05-15 · unverdicted · none · ref 1 · internal anchor
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.
Unlocking Dense Metric Depth Estimation in VLMs cs.CV · 2026-05-15 · unverdicted · none · ref 3 · 2 links · internal anchor
DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.
Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models cs.CV · 2026-05-15 · unverdicted · none · ref 23 · internal anchor
Proposes AGSR and the FAB-G supervised multi-agent framework that predicts attribute salience from human annotations to constrain MLLM emotion reasoning, yielding gains on EmoArt and cross-dataset tests.
SEED: Targeted Data Selection by Weighted Independent Set cs.LG · 2026-05-15 · unverdicted · none · ref 4 · internal anchor
SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods on instruction tuning and segmentation tasks.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 6 · internal anchor
CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
Towards Continuous Sign Language Conversation from Isolated Signs cs.CV · 2026-05-14 · unverdicted · none · ref 3 · internal anchor
Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model cs.CR · 2026-05-14 · unverdicted · none · ref 3 · internal anchor
MMGuard generates unlearnable multimodal examples via perturbations that exploit LVLM optimization shortcuts and disrupt cross-modal bindings, providing robust protection against unauthorized fine-tuning across threat models.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning cs.CV · 2026-05-13 · unverdicted · none · ref 38 · internal anchor
CAVE is a GRPO-based process-reward method that improves VLMs on fragmented visual reasoning by crediting intermediate actions via belief update, evidence acquisition, and adaptive focus, shown on TRACER-Bench and public benchmarks.
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 1 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World cs.CV · 2026-05-13 · unverdicted · none · ref 1 · 2 links · internal anchor
PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
SECOND-Grasp: Semantic Contact-guided Dexterous Grasping cs.RO · 2026-05-13 · conditional · none · ref 1 · internal anchor
SECOND-Grasp integrates semantic contact proposals from vision-language reasoning with geometric refinement to achieve 98%+ lifting success and improved intent-aware grasping on seen and unseen objects.
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 128 · internal anchor
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency cs.CV · 2026-05-13 · conditional · none · ref 3 · internal anchor
VLMs exhibit size, center, and saliency biases in scene understanding, relying less on people than humans do, with size bias as a key driver of divergence.
When Vision Speaks for Sound cs.CV · 2026-05-13 · unverdicted · none · ref 57 · internal anchor
Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention performance by 28 points.
DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging cs.CL · 2026-05-13 · conditional · none · ref 60 · 2 links · internal anchor
DiM3 is a direction- and magnitude-aware merging method that composes heterogeneous multilingual and multimodal updates in LLM backbones, outperforming baselines on 57-language benchmarks while retaining multimodal performance.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Reinforcing VLAs in Task-Agnostic World Models cs.AI · 2026-05-12 · unverdicted · none · ref 2 · 2 links · internal anchor
RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification cs.AI · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation cs.CV · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.
LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
LDDR proposes a linear DPP-based dynamic-resolution frame sampler that achieves 3x speedup and up to 2.5-point gains on video MLLM benchmarks by selecting non-redundant frames and allocating tokens accordingly.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
Personal Visual Context Learning in Large Multimodal Models cs.CV · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer cs.CV · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 3 · 2 links · internal anchor
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
Accelerating Compound LLM Training Workloads with Maestro cs.DC · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving cs.CV · 2026-05-11 · unverdicted · none · ref 1 · 2 links · internal anchor
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection cs.CV · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
The paper releases the Sens-VisualNews dataset of 9,576 annotated news images for sensational image detection and benchmarks open multimodal LLMs on zero-shot and fine-tuned performance.
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 6 · 3 links · internal anchor
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
ReCoVR: Closing the Loop in Interactive Composed Video Retrieval cs.IR · 2026-05-11 · unverdicted · none · ref 53 · internal anchor
ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification cs.CL · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
Reinforcing Multimodal Reasoning Against Visual Degradation cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer