Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

760 Pith papers citing it

Background 47% of classified citations

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

citation-role summary

background 120 · method 61 · baseline 50 · dataset 5 · other 4

citation-polarity summary

background 113 · use method 61 · baseline 50 · unclear 10 · use dataset 5 · support 1
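
The 47% background share quoted at the top of the page can be checked against the counts in the two summaries above. The sketch below is only an arithmetic check: the dictionary names and the `shares` helper are illustrative assumptions, not part of any Pith tooling, and the page does not state which summary the headline figure is derived from.

```python
# Counts copied from the citation-role and citation-polarity summaries above.
role_counts = {
    "background": 120,
    "method": 61,
    "baseline": 50,
    "dataset": 5,
    "other": 4,
}
polarity_counts = {
    "background": 113,
    "use method": 61,
    "baseline": 50,
    "unclear": 10,
    "use dataset": 5,
    "support": 1,
}


def shares(counts):
    """Return each label's share of the total, as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(sum(role_counts.values()), role_shares["background"])          # 240 50
print(sum(polarity_counts.values()), polarity_shares["background"])  # 240 47
```

Both summaries cover the same 240 classified citations. The background share is 50% under the role counts and 47% under the polarity counts, so the headline figure appears to match the polarity summary.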

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

citing papers explorer

Showing 50 of 760 citing papers.

GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models cs.CV · 2026-04-12 · unverdicted · none · ref 6 · internal anchor
GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping cs.CV · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
PlantXpert benchmark shows fine-tuned VLMs reach up to 78% accuracy on plant phenotyping but scaling gains plateau and quantitative biological reasoning remains weak.
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-horizon visual reasoning benchmarks.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation cs.CV · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% better accuracy than prior methods.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos cs.CV · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos cs.CV · 2026-04-10 · unverdicted · none · ref 2 · 2 links · internal anchor
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.
Large-Scale Universal Defect Generation: Foundation Models and Datasets cs.CV · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
ParseBench: A Document Parsing Benchmark for AI Agents cs.CV · 2026-04-09 · accept · none · ref 5 · internal anchor
ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.
WildDet3D: Scaling Promptable 3D Detection in the Wild cs.CV · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning cs.RO · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to 0.2 seconds.
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions cs.CV · 2026-04-09 · unverdicted · none · ref 51 · internal anchor
ESOM is a training-free streaming model for open-world video anomaly detection with dynamic definitions that achieves real-time single-GPU efficiency and state-of-the-art results on a new benchmark.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 3 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation cs.CV · 2026-04-08 · conditional · none · ref 35 · internal anchor
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CV · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions cs.CV · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents cs.CL · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
JailWAM: Jailbreaking World Action Models in Robot Control cs.RO · 2026-04-07 · unverdicted · none · ref 1 · internal anchor
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing cs.CV · 2026-04-06 · unverdicted · none · ref 31 · internal anchor
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing cs.CV · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables cs.AI · 2026-04-04 · conditional · none · ref 2 · internal anchor
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation cs.CV · 2026-04-04 · unverdicted · none · ref 1 · internal anchor
PortraitCraft supplies a new 50k-image dataset and two tasks for evaluating AI on fine-grained portrait composition understanding and constrained generation.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space cs.LG · 2026-04-03 · unverdicted · none · ref 4 · internal anchor
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
Self-Distilled RLVR cs.LG · 2026-04-03 · unverdicted · none · ref 19 · internal anchor
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight cs.RO · 2026-04-03 · unverdicted · none · ref 32 · internal anchor
QuadAgent uses an asynchronous multi-agent architecture with an Impression Graph for scene memory and vision-based avoidance to enable training-free vision-language guided agile quadrotor flight, outperforming baselines in simulations and achieving real-world speeds up to 5 m/s.
XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis cs.CV · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
XrayClaw deploys cooperative-competitive multi-agent alignment and Competitive Preference Optimization to raise diagnostic accuracy, reasoning fidelity, and generalization on chest X-ray benchmarks.
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models cs.CV · 2026-03-31 · unverdicted · none · ref 3 · internal anchor
Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
Internalized Reasoning for Long-Context Visual Document Understanding cs.CV · 2026-03-31 · unverdicted · none · ref 44 · internal anchor
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV · 2026-03-31 · unverdicted · none · ref 2 · internal anchor
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
Gen-Searcher: Reinforcing Agentic Search for Image Generation cs.CV · 2026-03-30 · unverdicted · none · ref 36 · 2 links · internal anchor
Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.
MeshTailor: Cutting Seams via Generative Mesh Traversal cs.GR · 2026-03-28 · unverdicted · none · ref 41 · internal anchor
MeshTailor is a mesh-native generative model that uses ChainingSeams serialization and a dual-stream transformer with pointer layers to trace coherent seams vertex-by-vertex on 3D surfaces.
RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation cs.CV · 2026-03-28 · unverdicted · none · ref 46 · internal anchor
RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset cs.CV · 2026-03-24 · unverdicted · none · ref 7 · internal anchor
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off cs.CV · 2026-03-23 · unverdicted · none · ref 3 · internal anchor
Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.
When Negation Is a Geometry Problem in Vision-Language Models cs.CV · 2026-03-20 · conditional · none · ref 2 · internal anchor
A direction associated with negation exists in CLIP embedding space and can be steered at test time via representation engineering to produce negation-aware outputs without fine-tuning.
SPRITE: From Static Mockups to Engine-Ready Game UI cs.HC · 2026-03-18 · unverdicted · none · ref 2 · internal anchor
SPRITE converts static game UI screenshots into editable engine-ready assets by using VLMs to parse complex layouts into a YAML intermediate representation.
Towards Generalizable Robotic Manipulation in Dynamic Environments cs.CV · 2026-03-16 · unverdicted · none · ref 1 · internal anchor
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
Visual-ERM: Reward Modeling for Visual Equivalence cs.CV · 2026-03-13 · unverdicted · none · ref 2 · internal anchor
Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.
Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence cs.CV · 2026-03-13 · unverdicted · none · ref 2 · internal anchor
VAEX-BENCH shows state-of-the-art MLLMs perform substantially worse on abstractive spatiotemporal reasoning tasks than on matched extractive tasks in video understanding.
SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents cs.CV · 2026-03-09 · unverdicted · none · ref 2 · internal anchor
SPIRAL is a closed-loop think-act-reflect framework using PlanAgent, VideoGenerator, and CriticAgent plus GRPO self-evolution to improve long-horizon action-conditioned video generation, with new dataset and benchmark showing gains over open-loop baselines.
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning cs.RO · 2026-02-23 · unverdicted · none · ref 4 · internal anchor
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework cs.CL · 2026-02-23 · unverdicted · none · ref 1 · internal anchor
Prune-then-Merge combines adaptive pruning of low-signal patches with hierarchical merging to achieve higher compression rates and better performance than prior single-stage methods in visual document retrieval.
EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution cs.HC · 2026-02-20 · unverdicted · none · ref 1 · internal anchor
EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling cs.CV · 2026-02-11 · unverdicted · none · ref 1 · internal anchor
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension cs.CV · 2026-02-10 · unverdicted · none · ref 1 · internal anchor
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
