super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

732 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 732 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 120 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 113 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

citing papers explorer

Showing 50 of 50 citing papers after filters.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 10 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
PhyGround: Benchmarking Physical Reasoning in Generative World Models cs.CV · 2026-05-11 · accept · none · ref 1 · internal anchor
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents cs.CL · 2026-05-11 · unverdicted · none · ref 41 · internal anchor
TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 7 · internal anchor
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 2 · internal anchor
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents cs.AI · 2026-05-09 · unverdicted · none · ref 38 · 3 links · internal anchor
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere cs.CV · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frechet mean centering, holistic scene attention, and spherical geodesic pulling.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning cs.CV · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
Don't Pause! Every prediction matters in a streaming video cs.CV · 2026-04-27 · unverdicted · none · ref 7 · internal anchor
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark cs.CV · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Knowledge-Clue-Answer.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation cs.CV · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts cs.CL · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions cs.CV · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space cs.LG · 2026-04-03 · unverdicted · none · ref 4 · internal anchor
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
Self-Distilled RLVR cs.LG · 2026-04-03 · unverdicted · none · ref 19 · internal anchor
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence cs.CV · 2025-05-22 · conditional · none · ref 5 · internal anchor
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification cs.AI · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
Personal Visual Context Learning in Large Multimodal Models cs.CV · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 6 · 3 links · internal anchor
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation cs.CV · 2026-05-08 · conditional · none · ref 4 · internal anchor
LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents cs.LG · 2026-05-08 · unverdicted · none · ref 3 · 2 links · internal anchor
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness q-bio.NC · 2026-04-30 · unverdicted · none · ref 2 · internal anchor
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CV · 2026-04-29 · unverdicted · none · ref 7 · internal anchor
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation cs.CL · 2026-04-23 · conditional · none · ref 10 · internal anchor
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Latent Denoising Improves Visual Alignment in Large Multimodal Models cs.CV · 2026-04-23 · unverdicted · none · ref 6 · internal anchor
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model cs.CV · 2026-04-22 · unverdicted · none · ref 5 · internal anchor
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization cs.AI · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation cs.CV · 2026-04-19 · unverdicted · none · ref 2 · internal anchor
RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.
Narrative-Driven Paper-to-Slide Generation via ArcDeck cs.AI · 2026-04-13 · unverdicted · none · ref 41 · internal anchor
ArcDeck models paper-to-slide generation as narrative reconstruction using discourse parsing and multi-agent refinement, plus a new ArcBench benchmark, to improve flow and coherence over direct summarization.
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs cs.CV · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA cs.RO · 2026-03-31 · unverdicted · none · ref 4 · internal anchor
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
Context Unrolling in Omni Models cs.CV · 2026-04-23 · unverdicted · none · ref 1 · internal anchor
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model cs.AI · 2026-04-13 · unverdicted · none · ref 8 · internal anchor
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images cs.CV · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks cs.CV · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
LOGER: Local--Global Ensemble for Robust Deepfake Detection in the Wild cs.CV · 2026-04-04 · unverdicted · none · ref 1 · internal anchor
LOGER ensembles heterogeneous global vision models with selective local patch aggregation via multiple instance learning to achieve robust deepfake detection across varied manipulations and degradations.
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning cs.CV · 2026-05-04 · unverdicted · none · ref 4
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating cs.CV · 2026-05-12 · unreviewed · ref 2 · internal anchor
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution cs.CV · 2026-04-29 · unreviewed · ref 2 · 2 links · internal anchor
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning cs.CV · 2026-03-31 · unreviewed · ref 10 · internal anchor
HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models cs.CL · 2026-03-31 · unreviewed · ref 9 · internal anchor

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer