super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

846 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 846 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 121 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 114 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

GaussDet enables open-vocabulary and referring segmentation in 3D Gaussians by learning instance features and aggregating votes from 2D detectors, improving referential grounding by 16.7% mIoU in zero-shot setting.

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Goku supplies a 2M-scale dataset, synthesis pipeline, decoupled dual-branch model, and 1000-case benchmark for multi-task instruction-based video editing, reporting up to 8% gains in instruction following.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

citing papers explorer

Showing 50 of 846 citing papers.

Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination cs.AI · 2026-05-19 · unverdicted · none · ref 1 · internal anchor
Causal path-patching analysis across five MLLMs identifies distributed hallucination-driving attention heads and localized resisting heads whose imbalance biases generation toward erroneous text over visual evidence; a conditional intervention MACI suppresses the driving heads and cuts hallucination
FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models cs.CV · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
FAGER is a new agentic framework that creates structured factual rubrics to evaluate and refine text-to-image outputs for implicit factual correctness across science, history, products, and culture.
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark cs.CV · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
CrossView Suite supplies a 1.6M-sample dataset, scene-disjoint benchmark, and explicit-alignment framework to advance MLLMs from single-view perception to cross-view spatial intelligence.
What's Holding Back Latent Visual Reasoning? cs.CV · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful representations.
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency cs.CV · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.
SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening cs.CV · 2026-05-17 · unverdicted · none · ref 66 · internal anchor
SafeLens presents a fast-and-slow video guardrail framework that filters the SafeWatch dataset to 2.4% and adds Chain-of-Thought traces to achieve state-of-the-art moderation performance at reduced inference cost.
AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment cs.RO · 2026-05-17 · unverdicted · none · ref 27 · internal anchor
AffordVLA improves VLA models for robotic manipulation by implicitly injecting affordance perception through feature alignment with a zero-shot teacher, claiming SOTA results in simulation and real-world tests.
PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning cs.MA · 2026-05-16 · unverdicted · none · ref 27 · internal anchor
PyraVid is a hierarchical multimodal memory system that structures long videos into pyramids to improve long-horizon reasoning and evidence aggregation.
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing cs.CV · 2026-05-16 · unverdicted · none · ref 2 · internal anchor
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.
NGM: A Plug-and-Play Training-Free Memory Module for LLMs cs.AI · 2026-05-16 · unverdicted · none · ref 1 · internal anchor
NGM is a plug-and-play n-gram memory module that encodes n-grams from pretrained embeddings and gates their injection to improve LLM performance by 0.5-1.2 points on average across eight benchmarks.
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers cs.CV · 2026-05-16 · unverdicted · none · ref 1 · internal anchor
EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.
WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes cs.CV · 2026-05-15 · unverdicted · none · ref 1 · internal anchor
WorldAct activates monolithic 3D worlds into interactive scenes via multimodal agent-guided decomposition, geometrically aligned mesh reconstruction, and 3D inpainting.
VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following cs.CV · 2026-05-15 · unverdicted · none · ref 3 · internal anchor
VLMs frequently switch away from a target visual path to nearby similar distractors in controlled tracing tasks, with standard scaling, reasoning, and instruction interventions providing only partial mitigation.
PhysBrain 1.0 Technical Report cs.RO · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.
Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action cs.RO · 2026-05-14 · unverdicted · none · ref 3 · 2 links · internal anchor
A unified embodied foundation model uses one VLM for understanding and reasoning plus a joint video-action future generator, reporting competitive scores on VLM, world modeling, and robot benchmarks without apparent compromise.
EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control cs.RO · 2026-05-14 · unverdicted · none · ref 25 · 2 links · internal anchor
DAJI is a hierarchical framework using distillation and autoregressive generation to learn future-aware joint intents for language-conditioned humanoid robot control.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 72 · internal anchor
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models cs.CL · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or beat undistilled 8B baselines.
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis cs.CL · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph cs.CV · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning cs.CV · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CV · 2026-05-11 · unverdicted · none · ref 7 · 2 links · internal anchor
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
A neuro-symbolic engine generates GeoSym127K, a 127K-question dataset with symbolic ground truths and verified CoT pairs, yielding +22.21% gains on MathVerse Vision-Only after SFT on Qwen3-VL-8B.
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering cs.CV · 2026-05-10 · unverdicted · none · ref 19 · internal anchor
LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
Fre-Res compresses video tokens by preserving spatial anchors and representing temporal dynamics with low-frequency residual tokens derived from 1D-DCT on inter-frame residuals, plus a Spatial-Guided Absorber to reinject the information.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization cs.AI · 2026-05-09 · unverdicted · none · ref 1 · 2 links · internal anchor
An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation cs.AI · 2026-05-09 · unverdicted · none · ref 31 · internal anchor
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution cs.AI · 2026-05-09 · unverdicted · none · ref 1 · internal anchor
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to smaller models.
Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models cs.CV · 2026-05-09 · conditional · none · ref 15 · internal anchor
A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation cs.RO · 2026-05-09 · unverdicted · none · ref 4 · internal anchor
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference eess.SP · 2026-05-08 · unverdicted · none · ref 58 · internal anchor
TOAU compresses human motion videos to 9 bits per frame with pose estimation and VQ-VAE, then aligns the tokens to a vision-language model via a lightweight projector, achieving 1% transmission payload and 20% latency of video codecs while maintaining comparable action understanding accuracy.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision cs.CV · 2026-05-07 · unverdicted · none · ref 5 · internal anchor
SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts cs.RO · 2026-05-07 · unverdicted · none · ref 1 · 3 links · internal anchor
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore) cs.CV · 2026-05-06 · conditional · none · ref 3 · internal anchor
The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models cs.LG · 2026-05-06 · unverdicted · none · ref 2 · internal anchor
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection cs.CV · 2026-05-05 · unverdicted · none · ref 1 · 3 links · internal anchor
VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CV · 2026-05-02 · unverdicted · none · ref 46 · internal anchor
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations cs.CV · 2026-05-01 · unverdicted · none · ref 3 · internal anchor
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
Echo-{\alpha}: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation cs.CV · 2026-04-30 · unverdicted · none · ref 2 · internal anchor
Echo-α integrates organ-specific detectors with global visual context via an invoke-and-reason agentic loop, trained on a nine-task curriculum plus sequential RL, to achieve superior grounding (56.73%/43.78% F1@0.5) and diagnosis (74.90%/49.20% accuracy) on cross-center renal and breast ultrasound.
Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models cs.LG · 2026-04-29 · unverdicted · none · ref 9 · internal anchor
A Meta AutoEncoder framework enables adaptive, progressive compression of visual features for low-latency edge-cloud VLM inference without model fine-tuning.
A Systematic Post-Train Framework for Video Generation cs.CV · 2026-04-28 · unverdicted · none · ref 50 · internal anchor
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation cs.CV · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference cs.LG · 2026-04-24 · unverdicted · none · ref 3 · internal anchor
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.
Context Unrolling in Omni Models cs.CV · 2026-04-23 · unverdicted · none · ref 1 · internal anchor
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 2 · internal anchor
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer