
Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

806 Pith papers citing it

Background 47% of classified citations


abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.


citation-role summary

background: 121 · method: 61 · baseline: 50 · dataset: 5 · other: 4

citation-polarity summary

background: 114 · use method: 61 · baseline: 50 · unclear: 10 · use dataset: 5 · support: 1
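The page does not say how the headline "47%" is derived. A minimal sketch, assuming it is the background share of all classified citations in the polarity summary, reproduces the arithmetic from the counts listed above. The names `polarity_counts` and `shares` are illustrative, not part of any Pith tool.

```python
# Citation-polarity counts as listed on this page.
polarity_counts = {
    "background": 114,
    "use method": 61,
    "baseline": 50,
    "unclear": 10,
    "use dataset": 5,
    "support": 1,
}


def shares(counts):
    """Return each label's share of the total as a percentage."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Assumption: the headline figure is computed over these 241 citations.
polarity_shares = shares(polarity_counts)
for label, pct in polarity_shares.items():
    print(f"{label}: {pct:.1f}%")
```

Under this assumption, background comes to 114 of 241 classified citations, which rounds to the 47% shown in the header.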


representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

GaussDet enables open-vocabulary and referring segmentation in 3D Gaussians by learning instance features and aggregating votes from 2D detectors, improving referential grounding by 16.7% mIoU in zero-shot setting.

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Goku supplies a 2M-scale dataset, synthesis pipeline, decoupled dual-branch model, and 1000-case benchmark for multi-task instruction-based video editing, reporting up to 8% gains in instruction following.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

citing papers explorer

Showing 50 of 806 citing papers.

Accelerating Compound LLM Training Workloads with Maestro cs.DC · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving cs.CV · 2026-05-11 · unverdicted · none · ref 1 · 2 links · internal anchor
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection cs.CV · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
The paper releases the Sens-VisualNews dataset of 9,576 annotated news images for sensational image detection and benchmarks open multimodal LLMs on zero-shot and fine-tuned performance.
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 6 · 3 links · internal anchor
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
ReCoVR: Closing the Loop in Interactive Composed Video Retrieval cs.IR · 2026-05-11 · unverdicted · none · ref 53 · internal anchor
ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification cs.CL · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
Reinforcing Multimodal Reasoning Against Visual Degradation cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A cs.CV · 2026-05-09 · unverdicted · none · ref 1 · internal anchor
F^3A is a training-free visual token pruning router that treats pruning as task-conditioned evidence search and allocates a fixed vision token budget using question cues and frozen sparse heads without extra LLM passes.
CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference cs.CV · 2026-05-09 · unverdicted · none · ref 1 · internal anchor
CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with negligible hallucination.
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing cs.CV · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation cs.CV · 2026-05-08 · conditional · none · ref 4 · internal anchor
LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding cs.CV · 2026-05-08 · unverdicted · none · ref 44 · 2 links · internal anchor
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning cs.AI · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment cs.CV · 2026-05-08 · conditional · none · ref 34 · internal anchor
Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents cs.LG · 2026-05-08 · unverdicted · none · ref 3 · 2 links · internal anchor
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation cs.RO · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries cs.AI · 2026-05-07 · unverdicted · none · ref 46 · 3 links · internal anchor
ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling cs.CV · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading cs.LG · 2026-05-07 · unverdicted · none · ref 4 · internal anchor
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs cs.AI · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop reasoning tasks.
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning cs.CV · 2026-05-06 · unverdicted · none · ref 1 · internal anchor
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
RemoteZero: Geospatial Reasoning with Zero Human Annotations cs.CV · 2026-05-06 · unverdicted · none · ref 1 · internal anchor
RemoteZero replaces coordinate supervision with intrinsic semantic verification to enable box-free GRPO training and self-evolution for geospatial reasoning.
Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion cs.CV · 2026-05-06 · unverdicted · none · ref 70 · internal anchor
DiLAST optimizes 3D latents via guidance from a 2D diffusion model to enable generalizable style transfer for OOD styles in 3D asset generation.
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 32 · internal anchor
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing cs.RO · 2026-05-05 · unverdicted · none · ref 42 · internal anchor
ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.
CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing cs.CL · 2026-05-05 · unverdicted · none · ref 21 · internal anchor
CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe cs.LG · 2026-05-05 · unverdicted · none · ref 4 · internal anchor
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
MolmoAct2: Action Reasoning Models for Real-world Deployment cs.RO · 2026-05-04 · unverdicted · none · ref 1 · 2 links · internal anchor
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking cs.CV · 2026-05-04 · unverdicted · none · ref 38 · internal anchor
ViewSAM achieves state-of-the-art weakly supervised performance on cross-view referring multi-object tracking by refining SAM tracklets via affinity-guided re-prompting and modeling view-induced variations as learnable conditions on SAM2.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV · 2026-05-01 · unverdicted · none · ref 5 · 2 links · internal anchor
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
Affordance Agent Harness: Verification-Gated Skill Orchestration cs.RO · 2026-05-01 · unverdicted · none · ref 6 · 2 links · internal anchor
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors cs.CV · 2026-05-01 · unverdicted · none · ref 134 · internal anchor
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 105 · internal anchor
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness q-bio.NC · 2026-04-30 · unverdicted · none · ref 2 · internal anchor
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction cs.RO · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x gains on benchmarks and zero-shot transfer to novel scenes.
MotuBrain: An Advanced World Action Model for Robot Control cs.RO · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new robots with 50-100 trajectories.
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning cs.AI · 2026-04-30 · unverdicted · none · ref 3 · 3 links · internal anchor
A 4B-parameter vision-language model trained on rubric-guided synthetic wafer defect data reaches 6.493 LLM-Judge score, nearly matching Gemini-3-Flash at 7.149 for on-premise industrial use.
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation cs.CV · 2026-04-30 · unverdicted · none · ref 7 · internal anchor
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization cs.CV · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.
Co-Evolving Policy Distillation cs.LG · 2026-04-29 · unverdicted · none · ref 41 · internal anchor
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading cs.CV · 2026-04-29 · unverdicted · none · ref 1 · internal anchor
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CV · 2026-04-29 · unverdicted · none · ref 7 · internal anchor
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding cs.CV · 2026-04-29 · unverdicted · none · ref 2 · internal anchor
MCM-VG achieves state-of-the-art zero-shot 3D visual grounding on ScanRefer and Nr3D by creating consistent 2D-3D mappings across semantic, geometric, and viewpoint dimensions using LLMs and VLMs.
