super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

767 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 767 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 120 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 113 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception, Collaboration, and Motion

cs.CV · 2026-06-26 · accept · novelty 7.0

SpatialUAV is a new real-world benchmark dataset and evaluation suite exposing large gaps between vision-language models and human performance on spatial tasks for low-altitude UAVs.

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

citing papers explorer

Showing 50 of 767 citing papers.

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling cs.CV · 2026-02-11 · unverdicted · none · ref 1 · internal anchor
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension cs.CV · 2026-02-10 · unverdicted · none · ref 1 · internal anchor
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL · 2026-02-02 · unverdicted · none · ref 13 · internal anchor
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning cs.CV · 2026-01-30 · unverdicted · none · ref 2 · internal anchor
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
VPTracker: Global Vision-Language Tracking via Visual Prompt cs.CV · 2025-12-28 · conditional · none · ref 5 · internal anchor
VPTracker enables global object tracking in videos by using multimodal large language models with location-aware visual prompts to search entire images while reducing distractions from similar objects.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence cs.CV · 2025-05-22 · conditional · none · ref 5 · internal anchor
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature cs.CV · 2026-05-04 · accept · none · ref 25
PubMed-Ophtha is a new hierarchical dataset of 102k ophthalmological image-caption pairs from 15k+ PubMed articles, with full-resolution PDF extraction, panel splitting, modality annotations, and released extraction models plus pipeline.
Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval cs.CV · 2026-05-04 · unverdicted · none · ref 5
Generalized Moment Retrieval (GMR) is introduced as a unified task with the Soccer-GMR benchmark and adapter models that retrieve multiple or zero matching moments from videos.
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval cs.CL · 2026-05-04 · unverdicted · none · ref 8
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.
Act2See: Emergent Active Visual Perception for Video Reasoning cs.CV · 2026-05-03 · unverdicted · none · ref 1
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL cs.CV · 2026-06-26 · unverdicted · none · ref 1 · internal anchor
TempAct applies hierarchical planner-executor RL with group exploration and multi-level rewards to improve temporal consistency in autoregressive video models.
ProMSA:Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering cs.CV · 2026-06-26 · unverdicted · none · ref 2 · internal anchor
ProMSA is a progressive multimodal search agent for KB-VQA that iteratively selects search tools under budgets, trained via rejection-sampling SFT then TN-GSPO RL, reporting gains on E-VQA and InfoSeek over RAG baselines.
Understanding How MLLMs Describe Artworks Using Token Activation Maps cs.CV · 2026-06-26 · unverdicted · none · ref 5 · internal anchor
Token Activation Maps applied to MLLM art descriptions reveal that visual grounding strength varies by token category, with better artist identification than title prediction.
LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation cs.RO · 2026-06-26 · unverdicted · none · ref 48 · internal anchor
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization cs.RO · 2026-06-26 · unverdicted · none · ref 33 · internal anchor
Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.
LA4VLA: Learning to Act without Seeing via Language-Action Pretraining cs.RO · 2026-06-25 · unverdicted · none · ref 38 · internal anchor
LA4VLA creates a 33K language-action dataset from existing demos and shows that pretraining on language-action pairs before or alongside vision-language-action training boosts success rates in sim and real robot tasks.
HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models cs.CV · 2026-06-25 · unverdicted · none · ref 15 · internal anchor
HarmVideoBench is a multi-layered benchmark for harmful video understanding in LVLMs with three hierarchical dimensions, and BCR is a method that raises average model performance from 61.7% to 84.4%.
SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing cs.CV · 2026-06-25 · unverdicted · none · ref 1 · internal anchor
SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.
Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity cs.CV · 2026-06-24 · unverdicted · none · ref 9 · internal anchor
Presents Invoice Haystack benchmark for homogeneous document retrieval and VL-RAG hybrid framework achieving 60% Recall@1 and up to 13.5 point gains over prior methods.
VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers cs.CV · 2026-06-17 · unverdicted · none · ref 9 · internal anchor
VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.
SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning cs.RO · 2026-06-08 · unverdicted · none · ref 67 · internal anchor
SpaceVLN proposes a stagewise closed-loop framework using Spatial Cognitive Memory and Spatial-CoT for zero-shot vision-and-language navigation and object-goal navigation, reporting SOTA results on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON plus real-robot tests.
AdaCodec: A Predictive Visual Code for Video MLLMs cs.CV · 2026-06-01 · unverdicted · none · ref 51 · internal anchor
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
Visual Graph Scaffolds for Structural Reasoning in Large Language Models cs.AI · 2026-06-01 · unverdicted · none · ref 5 · internal anchor
Visual graph mind maps outperform text-flattened versions as internal reasoning scaffolds for LLMs on multi-hop QA, with the advantage holding after fine-tuning and distillation.
Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue cs.CL · 2026-05-31 · unverdicted · none · ref 51 · internal anchor
RefMem-Bench benchmarks reflective memory in dialogue with 26K instances across eight dimensions, and REMIND improves model accuracy via hierarchical evidence retrieval, grounding, and abstraction.
RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes cs.CV · 2026-05-30 · unverdicted · none · ref 1 · internal anchor
RoboStressBench decomposes visual stress into four physically grounded dimensions to benchmark VLM robustness in embodied scenes and proposes a stress-aware solver.
FlowNar: Scalable Streaming Narration for Long-Form Videos cs.CV · 2026-05-30 · unverdicted · none · ref 2 · internal anchor
FlowNar achieves bounded memory and 3x higher throughput for streaming narration on Ego4D, EgoExo4D, and EpicKitchens100 by combining dynamic historical context removal with a Cross Linear Attentive Memory module.
Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion cs.CV · 2026-05-30 · unverdicted · none · ref 1 · internal anchor
Introduces pause-and-think-T dataset and pause-and-think-B benchmark for video-grounded assistive action suggestion, enabling a 4B VLM to match larger models on reasoning tasks and generalize to EgoThink and TempCompass.
Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning cs.CV · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.
Task-Focused Memorization for Multimodal Agents cs.CV · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring cs.RO · 2026-05-29 · unverdicted · none · ref 66 · internal anchor
Hide-and-Seek uses contrastive objectives on trajectories to localize failure signals in VLA models from trajectory-level supervision alone.
VLM3: Vision Language Models Are Native 3D Learners cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
GPIC: A Giant Permissive Image Corpus for Visual Generation cs.CV · 2026-05-28 · unverdicted · none · ref 19 · internal anchor
GPIC is a new 28-trillion-pixel permissively licensed image corpus with 100M training examples for visual generative modeling.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion cs.CV · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.
Unveiling the Visual Counting Bottleneck in Vision-Language Models cs.MM · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
VLMs fail at visual counting extrapolation because they cannot project visual magnitudes onto symbolic tokens, despite intact perceptual representations, supporting a fractured magnitude hypothesis.
MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding cs.LG · 2026-05-28 · unverdicted · none · ref 39 · internal anchor
MIRAGE uses adaptive multimodal gating on native multimodal backbones plus a transformer encoder to achieve state-of-the-art whole-brain fMRI prediction for naturalistic audiovisual stimuli, outperforming post-hoc unimodal aggregation.
AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
AgentCVR uses coordinated multi-agent active search and script-simulated RL to improve cross-video reasoning in MLLMs over single-pass baselines.
PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning cs.LG · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
PEARL is a pedagogically aligned RL framework using a controllable student simulator, generative reward model, and stable multi-objective scheme to train Socratic tutors that outperform other open-source models on benchmarks.
Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning cs.CV · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
Inverse dynamics prediction is added as an auxiliary task to reduce state aliasing in VLA models by directly supervising the vision encoder on action-relevant visual distinctions using only standard observation-action pairs.
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
AnyMo is a masked-modeling framework for any-modality human motion generation trained on the new OmniHuMo dataset of 5,000+ hours of multimodal motion sequences.
ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control cs.AI · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
ReasonLight uses multimodal foundation models to refine RL-proposed traffic signal phases based on camera images and sensor data, enabling zero-shot adaptation to unseen events such as emergency vehicle priority.
Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.
Agent Explorative Policy Optimization for Multimodal Agentic Reasoning cs.CL · 2026-05-27 · unverdicted · none · ref 21 · internal anchor
AXPO addresses the Thinking-Acting Gap in agentic RL training by targeted resampling of tool calls in all-wrong subgroups, delivering +1.8pp gains over GRPO on nine multimodal benchmarks with an 8B model beating a 32B baseline on Pass@4.
Self-Prophetic Decoding to Unlock Visual Search in LVLMs cs.CV · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.
PrimitiveVLA: Learning Reusable Motion Primitives for Efficient and Generalizable Robotic Manipulation cs.RO · 2026-05-27 · unverdicted · none · ref 37 · internal anchor
PrimitiveVLA introduces a primitive-centric framework that disassembles demonstrations into reusable motion primitives during fine-tuning and assembles them at inference via VLM planner and LLM switch for improved data efficiency and zero-shot generalization in robotic manipulation.
DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving cs.CV · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
DriveWAM converts video generative priors into a unified video-action policy for driving, reporting strong benchmark performance and positive scaling from 4k to 100k clips.
Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation cs.CV · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.
MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing cs.AI · 2026-05-27 · unverdicted · none · ref 29 · internal anchor
MACReD is a multi-agent collaborative reasoning framework for reaction diagram parsing that reports state-of-the-art F1 scores of 75.2% and 84.6% on the RxnScribe benchmark.
Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization cs.AI · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
RC-DPO adds a CoT-conditioned preference term to DPO and pairs it with MCTS-based positive CoT generation plus attention-guided pruning for negatives, yielding lower hallucination rates on multimodal benchmarks.
SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control cs.CV · 2026-05-27 · unverdicted · none · ref 15 · internal anchor
SmartDirector generates cinematic videos via Director-Gen for low-res keyframe-conditioned output followed by Director-SR refinement using high-res keyframes, trained on curated movie sequences.
Reflective Dialogue between Teacher and Solver Agents for Video Question Answering cs.CV · 2026-05-27 · unverdicted · none · ref 15 · internal anchor
A multi-turn reflective dialogue between Teacher and Solver agents constructs richer context from support examples than standard in-context learning, improving video QA on the EgoCross benchmark.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer