super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

850 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 850 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 121 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 114 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

GaussDet enables open-vocabulary and referring segmentation in 3D Gaussians by learning instance features and aggregating votes from 2D detectors, improving referential grounding by 16.7% mIoU in zero-shot setting.

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Goku supplies a 2M-scale dataset, synthesis pipeline, decoupled dual-branch model, and 1000-case benchmark for multi-task instruction-based video editing, reporting up to 8% gains in instruction following.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

citing papers explorer

Showing 50 of 850 citing papers.

Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference eess.SP · 2026-05-08 · unverdicted · none · ref 58 · internal anchor
TOAU compresses human motion videos to 9 bits per frame with pose estimation and VQ-VAE, then aligns the tokens to a vision-language model via a lightweight projector, achieving 1% transmission payload and 20% latency of video codecs while maintaining comparable action understanding accuracy.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision cs.CV · 2026-05-07 · unverdicted · none · ref 5 · internal anchor
SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts cs.RO · 2026-05-07 · unverdicted · none · ref 1 · 3 links · internal anchor
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore) cs.CV · 2026-05-06 · conditional · none · ref 3 · internal anchor
The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models cs.LG · 2026-05-06 · unverdicted · none · ref 2 · internal anchor
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection cs.CV · 2026-05-05 · unverdicted · none · ref 1 · 3 links · internal anchor
VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CV · 2026-05-02 · unverdicted · none · ref 46 · internal anchor
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations cs.CV · 2026-05-01 · unverdicted · none · ref 3 · internal anchor
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
Echo-{\alpha}: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation cs.CV · 2026-04-30 · unverdicted · none · ref 2 · internal anchor
Echo-α integrates organ-specific detectors with global visual context via an invoke-and-reason agentic loop, trained on a nine-task curriculum plus sequential RL, to achieve superior grounding (56.73%/43.78% F1@0.5) and diagnosis (74.90%/49.20% accuracy) on cross-center renal and breast ultrasound.
Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models cs.LG · 2026-04-29 · unverdicted · none · ref 9 · internal anchor
A Meta AutoEncoder framework enables adaptive, progressive compression of visual features for low-latency edge-cloud VLM inference without model fine-tuning.
A Systematic Post-Train Framework for Video Generation cs.CV · 2026-04-28 · unverdicted · none · ref 50 · internal anchor
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation cs.CV · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference cs.LG · 2026-04-24 · unverdicted · none · ref 3 · internal anchor
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.
Context Unrolling in Omni Models cs.CV · 2026-04-23 · unverdicted · none · ref 1 · internal anchor
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 2 · internal anchor
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models cs.RO · 2026-04-21 · unverdicted · none · ref 4 · internal anchor
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling cs.AI · 2026-04-21 · unverdicted · none · ref 2 · internal anchor
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
Environmental Understanding Vision-Language Model for Embodied Agent cs.CV · 2026-04-21 · unverdicted · none · ref 3 · internal anchor
EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning cs.CV · 2026-04-19 · unverdicted · none · ref 2 · internal anchor
SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 2 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement cs.CV · 2026-04-18 · unverdicted · none · ref 1 · internal anchor
Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual refinement in image generation.
Why Do Vision Language Models Struggle To Recognize Human Emotions? cs.CV · 2026-04-16 · unverdicted · none · ref 6 · internal anchor
VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.
Touching Space: Accessible Map Exploration Through Conversational Audio-Haptic Interaction cs.HC · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
Touching Space is an audio-haptic conversational system that enables BLV users to explore map data and construct cognitive maps of unfamiliar places before travel.
SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing cs.RO · 2026-04-15 · unverdicted · none · ref 34 · internal anchor
SpaceMind is a self-evolving modular VLM agent framework that achieves 90-100% navigation success in nominal conditions and recovers from failures via experience distillation, with zero-code transfer to physical robots for on-orbit tasks.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 1 · 2 links · internal anchor
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 200 · internal anchor
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural Retargeting cs.GR · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
A rectified-flow DiT model unifies motion generation, editing, and intra-structural retargeting as conditional transport tasks distinguished only by semantic versus structural conditioning.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference cs.CV · 2026-04-13 · unverdicted · none · ref 8 · internal anchor
Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model cs.AI · 2026-04-13 · unverdicted · none · ref 8 · internal anchor
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment: Methods and Results cs.CV · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
The LoViF 2026 Challenge creates the SeIQA dataset and benchmark for human-oriented semantic image quality assessment, with six submitted solutions reaching state-of-the-art performance.
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images cs.CV · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
LVSum: A Benchmark for Timestamp-Aware Long Video Summarization cs.CV · 2026-04-11 · unverdicted · none · ref 4 · internal anchor
LVSum is a new benchmark for timestamp-aware long video summarization that exposes systematic temporal gaps in existing multimodal large language models.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion cs.LG · 2026-04-10 · unverdicted · none · ref 3 · 2 links · internal anchor
ECHO introduces one-step block diffusion via Direct Conditional Distillation and Response-Asymmetric Diffusion to generate chest X-ray reports faster than autoregressive models while improving clinical metrics.
Task-Aware Bimanual Affordance Prediction via VLM-Guided Semantic-Geometric Reasoning cs.RO · 2026-04-09 · unverdicted · none · ref 9 · internal anchor
A VLM-guided method for joint bimanual affordance localization and arm allocation achieves higher real-world task success rates than geometric or semantic baselines across nine manipulation tasks.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks cs.CV · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
A scalable pipeline generates an intra-consistent, inter-diverse 1.4M style image dataset from text-to-image models and uses it to train a style encoder and generalizable style transfer model.
Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpretability.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL · 2026-04-08 · unverdicted · none · ref 3 · internal anchor
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality cs.CV · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
Head-wise modality specialization via attention constraints and unimodal knowledge retention in MLLMs improves robustness to missing modalities in fake news detection while preserving full multimodal performance.
Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents cs.CV · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
EchoTrust is an evidence-driven actor-verifier framework that produces structured intermediate representations for more reliable and interpretable reasoning in echocardiography visual language models.
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations cs.CV · 2026-04-06 · unverdicted · none · ref 3 · internal anchor
Patch-level analysis of token attention patterns and semantic alignment detects LVLM hallucinations at up to 90% accuracy by identifying diffuse, non-localized grounding that global methods miss.
LOGER: Local--Global Ensemble for Robust Deepfake Detection in the Wild cs.CV · 2026-04-04 · unverdicted · none · ref 1 · internal anchor
LOGER ensembles heterogeneous global vision models with selective local patch aggregation via multiple instance learning to achieve robust deepfake detection across varied manipulations and degradations.
Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs cs.CV · 2026-03-21 · unverdicted · none · ref 3 · internal anchor
CogAlign uses hierarchical supervised fine-tuning on clinical cognition data plus counterfactual RL to align MLLMs with expert diagnostic pathways and enforce causal lesion grounding for GI endoscopy diagnosis.
Gym-V: A Unified Vision Environment System for Agentic Vision Research cs.CV · 2026-03-16 · unverdicted · none · ref 2 · internal anchor
Gym-V supplies 179 visual environments showing that observation scaffolding like captions and rules matters more for training success than the choice of RL algorithm.
Exploring a Multimodal Chatbot as a Facilitator in Therapeutic Art Activity cs.HC · 2026-02-15 · unverdicted · none · ref 11 · internal anchor
The authors built and expert-tested a multimodal chatbot that analyzes drawings in real time and holds reflective conversations to aid therapeutic art activities.
JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication cs.IR · 2026-02-13 · unverdicted · none · ref 3 · internal anchor
JARVIS combines hybrid retrieval and evidence graphs with LLMs to raise deceptive-review detection precision from 0.953 to 0.988 and recall from 0.830 to 0.901 on a custom dataset while cutting manual inspection time by 75% in production.
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics cs.LG · 2026-02-11 · unverdicted · none · ref 5 · internal anchor
UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.
Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning cs.CV · 2026-02-10 · unverdicted · none · ref 1 · internal anchor
Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer