Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

768 Pith papers citing it

Background 47% of classified citations

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

citation-role summary

background: 120 · method: 61 · baseline: 50 · dataset: 5 · other: 4

citation-polarity summary

background: 113 · use method: 61 · baseline: 50 · unclear: 10 · use dataset: 5 · support: 1
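The percentage shares behind the two summaries can be checked with a few lines of arithmetic. The sketch below copies the counts from this page; the variable names and the `shares` helper are illustrative, not part of any hub tooling. Both summaries total 240 classified citations. The headline 47% background figure matches the polarity counts (113 of 240), whereas the role counts give 50% (120 of 240). Which summary the headline is drawn from is an assumption, since the page does not say.

```python
# Counts copied from the citation-role and citation-polarity summaries on this page.
role_counts = {"background": 120, "method": 61, "baseline": 50, "dataset": 5, "other": 4}
polarity_counts = {
    "background": 113, "use method": 61, "baseline": 50,
    "unclear": 10, "use dataset": 5, "support": 1,
}

def shares(counts):
    """Return each label's percentage share of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Print the total and the background share for each summary.
print(sum(role_counts.values()), role_shares["background"])
print(sum(polarity_counts.values()), polarity_shares["background"])
```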

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception, Collaboration, and Motion

cs.CV · 2026-06-26 · accept · novelty 7.0

SpatialUAV is a new real-world benchmark dataset and evaluation suite exposing large gaps between vision-language models and human performance on spatial tasks for low-altitude UAVs.

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

citing papers explorer

Showing 50 of 768 citing papers.

ViMU: Benchmarking Video Metaphorical Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 2 · internal anchor
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding cs.CV · 2026-05-11 · unverdicted · none · ref 101 · internal anchor
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation cs.AI · 2026-05-08 · unverdicted · none · ref 29 · internal anchor
RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings cs.CV · 2026-05-08 · conditional · none · ref 11 · internal anchor
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild cs.CV · 2026-05-07 · unverdicted · none · ref 59 · internal anchor
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments cs.AI · 2026-04-30 · accept · none · ref 1 · internal anchor
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
Lost in Translation: Do LVLM Judges Generalize Across Languages? cs.CL · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models cs.CV · 2026-04-19 · unverdicted · none · ref 3 · internal anchor
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees cs.CV · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning cs.CV · 2026-04-03 · conditional · none · ref 1 · internal anchor
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision cs.CV · 2026-02-15 · conditional · none · ref 1 · internal anchor
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents cs.CR · 2026-01-26 · unverdicted · none · ref 106 · internal anchor
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
Common to Whom? Regional Cultural Commonsense and LLM Bias in India cs.CL · 2026-01-22 · unverdicted · none · ref 5 · internal anchor
Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 10 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding cs.CV · 2026-01-01 · unverdicted · none · ref 14 · internal anchor
S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos cs.CV · 2025-12-03 · accept · none · ref 2 · internal anchor
ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception, Collaboration, and Motion cs.CV · 2026-06-26 · accept · none · ref 31 · internal anchor
SpatialUAV is a new real-world benchmark dataset and evaluation suite exposing large gaps between vision-language models and human performance on spatial tasks for low-altitude UAVs.
Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning cs.CV · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.
Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering cs.CL · 2026-06-15 · unverdicted · none · ref 20 · internal anchor
Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.
End-to-End Text Line Detection and Ordering cs.CV · 2026-06-02 · unverdicted · none · ref 4 · internal anchor
Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.
Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing cs.CV · 2026-05-31 · unverdicted · none · ref 38 · internal anchor
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models cs.CV · 2026-05-30 · unverdicted · none · ref 5 · internal anchor
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue cs.CV · 2026-05-30 · unverdicted · none · ref 26 · internal anchor
MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 97 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction cs.CV · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.
YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models cs.CV · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.
ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models cs.CV · 2026-05-29 · accept · none · ref 2 · internal anchor
ERGeoBench is a new diagnostic benchmark evaluating MLLMs on four capabilities in three progressive embodied geo-localization settings, finding that models handle high-level semantics but struggle with fine-grained perception and metric localization.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 79 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence cs.CV · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
EAGLE is a new evidence-aligned framework that improves multi-agent VQA by enforcing consistency in visual grounding across agents, achieving best average performance on six benchmarks.
PInVerify: An Offline Embodied Benchmark for Active Instance Verification cs.CV · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.
Why Far Looks Up: Probing Spatial Representation in Vision-Language Models cs.CV · 2026-05-28 · conditional · none · ref 3 · internal anchor
VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.
RAISE: RAG Design as an Architecture Search Problem cs.AI · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
RAISE is a standardized benchmark for RAG hyperparameter optimization that evaluates 13 search algorithms across seven datasets and finds performance is highly task-dependent.
Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning cs.CV · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
VisHarness learns a reinforcement-learned policy to harness specialized visual experts via multi-turn interactions and dynamic visual memory archiving, outperforming general models on four visual reasoning benchmarks.
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields cs.AI · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
OmniMatBench is a new human-calibrated benchmark for multimodal materials-science reasoning that reveals the best evaluated MLLM scores only 0.372 overall.
CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations cs.CV · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.
Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation cs.CV · 2026-05-28 · unverdicted · none · ref 5 · internal anchor
Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation cs.RO · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
POINav-Bench provides the first high-fidelity real-world benchmark for POI-goal VLN using 3DGS reconstructions of 126k m² with 163 POIs, supported by a Brain-Action framework and 70K real signage-entrance dataset.
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications cs.CV · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning cs.CV · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
View Dropout forces reliance on intermediate thinking images in unified multimodal models, with panoramic renderings proving most effective for out-of-domain cross-view spatial reasoning.
EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization cs.CL · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
EpiCurveBench supplies 1,000 epidemic curve images and ECS metric shows top VLMs reach only 52.3% while correlating 1.5-3.6 times more strongly than DTW with downstream epidemiological statistics.
LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations? cs.AI · 2026-05-26 · unverdicted · none · ref 11 · internal anchor
LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.
METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition cs.CV · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
METATR is a new benchmark dataset and evaluation framework for ATR covering 29 languages, multiple scripts and layouts, with standardized prompting and a dynamic extensible protocol.
OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following cs.CV · 2026-05-26 · unverdicted · none · ref 42 · internal anchor
OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.
ETCHR: Editing To Clarify and Harness Reasoning cs.CV · 2026-05-22 · unverdicted · none · ref 1 · internal anchor
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval cs.CV · 2026-05-22 · unverdicted · none · ref 2 · internal anchor
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models cs.CV · 2026-05-22 · unverdicted · none · ref 4 · internal anchor
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
