super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

734 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 734 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 120 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 113 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

citing papers explorer

Showing 50 of 734 citing papers.

Active Testing of Large Language Models via Approximate Neyman Allocation cs.AI · 2026-05-11 · unverdicted · none · ref 1 · 2 links · internal anchor
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings versus uniform sampling.
BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training cs.DC · 2026-05-11 · unverdicted · none · ref 7 · 2 links · internal anchor
BatchWeave delivers an object-store-native data plane for distributed large foundation model training via transactional global batches and a decentralized adaptive commit algorithm.
StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video cs.CV · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents cs.CL · 2026-05-11 · unverdicted · none · ref 41 · internal anchor
TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CV · 2026-05-11 · conditional · none · ref 1 · 2 links · internal anchor
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 2 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 7 · internal anchor
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 2 · internal anchor
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs cs.CV · 2026-05-10 · conditional · none · ref 18 · internal anchor
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 2 · 2 links · internal anchor
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 3 · internal anchor
STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.
PPI2Text: Captioning Protein-Protein Interactions with Coordinate-Aligned Pair-Map Decoding cs.CE · 2026-05-09 · unverdicted · none · ref 42 · internal anchor
PPI2Text generates natural-language captions for protein-protein interactions from sequences by encoding each protein with ESM3, building a residue-pair map, and decoding with Qwen3 using coordinate-aligned positional encoding.
FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence cs.CV · 2026-05-09 · unverdicted · none · ref 30 · internal anchor
FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while producing false positives on real damage.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents cs.AI · 2026-05-09 · unverdicted · none · ref 38 · 3 links · internal anchor
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection cs.CL · 2026-05-09 · accept · none · ref 5 · internal anchor
CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere cs.CV · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frechet mean centering, holistic scene attention, and spherical geodesic pulling.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning cs.CV · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents cs.RO · 2026-05-08 · unverdicted · none · ref 51 · 2 links · internal anchor
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
StrLoRA is a regularized two-stage expert routing method for streaming CVIT that selects experts via textual instructions and applies token-wise cross-modal weighting with historical routing alignment.
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment cs.CV · 2026-05-07 · unverdicted · none · ref 1 · 2 links · internal anchor
FuScore uses MLLMs to output continuous quality scores for IVIF images, constructs per-image soft labels from four sub-dimensions, and applies a tripartite objective with Thurstone fidelity to achieve higher correlation with human preferences than prior metrics.
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance cs.CV · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs cs.AI · 2026-05-07 · unverdicted · none · ref 4 · 2 links · internal anchor
CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.
OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning q-bio.GN · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real expression data.
TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity cs.CL · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
TableVista benchmark finds foundation models maintain performance across visual styles but degrade sharply on complex table structures and vision-only settings.
Evaluating Non-English Developer Support in Machine Learning for Software Engineering cs.SE · 2026-05-07 · unverdicted · none · ref 77 · internal anchor
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models cs.CV · 2026-05-07 · unverdicted · none · ref 4 · internal anchor
ContactPrompt uses part-wise vertex grids and multi-stage part-conditioned reasoning in MLLMs to achieve training-free dense hand contact estimation that outperforms prior supervised methods.
VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA cs.CV · 2026-05-06 · unverdicted · none · ref 1 · internal anchor
VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks cs.AI · 2026-05-05 · unverdicted · none · ref 13 · internal anchor
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation cs.CV · 2026-05-05 · unverdicted · none · ref 1 · internal anchor
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 18 · 3 links · internal anchor
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition cs.CV · 2026-05-04 · unverdicted · none · ref 40 · internal anchor
VideoNet is a new large-scale benchmark and training dataset for domain-specific action recognition that exposes limitations in VLMs and enables smaller fine-tuned models to surpass larger open-weight ones.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 236 · internal anchor
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion cs.RO · 2026-05-02 · unverdicted · none · ref 34 · internal anchor
Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
How Historians Use Visualization: A Corpus-Backed Taxonomy and Analysis for Cross-Disciplinary Practice cs.GR · 2026-05-02 · unverdicted · none · ref 100 · internal anchor
Historians use visualizations in five roles—primary-source, evidence-synthesis, communicative, confirmative, and exploratory—but epistemological and practical barriers limit broader adoption.
GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment cs.CV · 2026-05-02 · unverdicted · none · ref 23 · internal anchor
GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 61 · internal anchor
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
Faithful Mobile GUI Agents with Guided Advantage Estimator cs.AI · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding cs.AI · 2026-05-01 · accept · none · ref 1 · 2 links · internal anchor
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images cs.AI · 2026-04-30 · unverdicted · none · ref 5 · internal anchor
SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling method that improves results.
TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions cs.CV · 2026-04-30 · unverdicted · none · ref 3 · internal anchor
TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.
Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG cs.IR · 2026-04-30 · unverdicted · none · ref 4 · internal anchor
FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts cs.CV · 2026-04-30 · unverdicted · none · ref 4 · 2 links · internal anchor
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 1 · 2 links · internal anchor
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
FCMBench-Video: Benchmarking Document Video Intelligence cs.CV · 2026-04-28 · unverdicted · none · ref 16 · internal anchor
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models cs.CV · 2026-04-27 · unverdicted · none · ref 2 · internal anchor
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer