super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

737 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 737 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 120 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 113 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

citing papers explorer

Showing 50 of 737 citing papers.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models cs.CV · 2026-04-27 · unverdicted · none · ref 2 · internal anchor
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents cs.CL · 2026-04-27 · unverdicted · none · ref 89 · internal anchor
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
Don't Pause! Every prediction matters in a streaming video cs.CV · 2026-04-27 · unverdicted · none · ref 7 · internal anchor
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models cs.HC · 2026-04-27 · unverdicted · none · ref 47 · internal anchor
IntentVLM uses forward-inverse modeling in a two-stage video-language setup to reach up to 80% accuracy on open-vocabulary intention recognition benchmarks, beating baselines by 30% and matching human performance.
Hindsight Preference Optimization for Financial Time Series Advisory cs.LG · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
Hindsight Preference Optimization lets a 4B model outperform a 235B model on S&P 500 advisory accuracy and quality by generating DPO preference pairs from outcome-based LLM judgments on time series predictions.
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation cs.CV · 2026-04-26 · unverdicted · none · ref 2 · 2 links · internal anchor
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs cs.CV · 2026-04-25 · unverdicted · none · ref 1 · internal anchor
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding eess.AS · 2026-04-24 · unverdicted · none · ref 35 · internal anchor
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
Grounding Video Reasoning in Physical Signals cs.CV · 2026-04-23 · unverdicted · none · ref 2 · internal anchor
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.
Near-Future Policy Optimization cs.LG · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
Evaluating Remote Sensing Image Captions Beyond Metric Biases cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis cs.CV · 2026-04-22 · unverdicted · none · ref 5 · internal anchor
X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark cs.CV · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Knowledge-Clue-Answer.
WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring cs.CV · 2026-04-22 · unverdicted · none · ref 5 · internal anchor
WildFireVQA is a new large-scale visual question answering benchmark that pairs RGB imagery with radiometric thermal measurements for aerial wildfire monitoring across six task categories.
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training cs.CV · 2026-04-21 · unverdicted · none · ref 2 · internal anchor
EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation cs.RO · 2026-04-21 · unverdicted · none · ref 4 · 2 links · internal anchor
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts cs.CL · 2026-04-20 · unverdicted · none · ref 66 · internal anchor
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes cs.CV · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
E3VS-Bench supplies 99 3D Gaussian Splatting scenes and 2,014 episodes to test whether embodied agents can use unrestricted 5-DoF viewpoint control to answer questions that depend on fine-grained visual details visible only from specific angles.
DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
DanceCrafter generates high-fidelity, text-controlled dance sequences using a new Choreographic Syntax framework and a large fine-grained motion dataset.
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification cs.AI · 2026-04-18 · unverdicted · none · ref 5 · internal anchor
Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planning to reduce constraint violations by 19.26% and improve task completion.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs cs.CV · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
S-GRPO: Unified Post-Training for Large Vision-Language Models cs.LG · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI cs.CV · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows cs.CL · 2026-04-17 · conditional · none · ref 50 · internal anchor
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
AnimationBench: Are Video Models Good at Character-Centric Animation? cs.CV · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID cs.CV · 2026-04-16 · unverdicted · none · ref 56 · internal anchor
STFER uses LVLM-generated identity-consistent semantic text to drive visual token filtering and expert routing for improved any-time person re-identification under clothing changes and modality shifts.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation cs.CV · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
Visual Preference Optimization with Rubric Rewards cs.CV · 2026-04-14 · unverdicted · none · ref 5 · internal anchor
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts cs.CL · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports cs.CV · 2026-04-14 · unverdicted · none · ref 3 · internal anchor
EgoEsportsQA is a new egocentric video QA benchmark from esports matches that shows state-of-the-art Video-LLMs reach only 71.58% accuracy and struggle more with tactical reasoning than basic perception.
Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models cs.CV · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
VLMs display semantic fixation, with higher accuracy on standard rule mappings than inverse ones across 14 models, narrowed by neutral prompts but widened by loaded ones and affected by post-training alignment.
Online Reasoning Video Object Segmentation cs.CV · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 6 · internal anchor
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video cs.CV · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates cs.CV · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling cs.CV · 2026-04-13 · unverdicted · none · ref 4 · internal anchor
AmodalSVG produces semantically separate and geometrically complete SVG layers from natural images by using VLM-guided semantic layer peeling for amodal completion followed by adaptive vectorization.
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories cs.SD · 2026-04-12 · unverdicted · none · ref 2 · internal anchor
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly cs.CV · 2026-04-12 · accept · none · ref 3 · internal anchor
IMPACT is a synchronized five-view RGB-D dataset of 112 real industrial assembly trials with multi-granularity annotations, anomaly taxonomy, and compliance tracking.
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models cs.CV · 2026-04-12 · unverdicted · none · ref 6 · internal anchor
GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping cs.CV · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
PlantXpert benchmark shows fine-tuned VLMs reach up to 78% accuracy on plant phenotyping but scaling gains plateau and quantitative biological reasoning remains weak.
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-horizon visual reasoning benchmarks.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation cs.CV · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% better accuracy than prior methods.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos cs.CV · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer