Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. The most common role is background (48% of classified citations).

1171 Pith papers cite it.

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

citation-role summary

background 122 · method 61 · baseline 50 · dataset 5 · other 4

citation-polarity summary

background 115 · use method 61 · baseline 50 · unclear 10 · use dataset 5 · support 1
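The 48% background share quoted at the top of the page can be checked against these counts. The sketch below is a minimal illustration, assuming the percentage is the background count divided by the sum of all polarity counts; the page does not state how it is computed, and the names `polarity_counts` and `share_percent` are hypothetical.

```python
# Counts copied from the citation-polarity summary on this page.
polarity_counts = {
    "background": 115,
    "use method": 61,
    "baseline": 50,
    "unclear": 10,
    "use dataset": 5,
    "support": 1,
}


def share_percent(counts, label):
    """Return the share of `label` among all counts, as a percentage."""
    total = sum(counts.values())
    return 100.0 * counts[label] / total


# Assumption: the denominator is every classified citation in the summary.
total = sum(polarity_counts.values())
background_share = share_percent(polarity_counts, "background")
print(total, round(background_share))
```

Under this assumption the polarity counts reproduce the quoted figure after rounding. Applying the same calculation to the role summary (122 background out of the same total) gives roughly 50%, so the headline number appears to follow the polarity counts rather than the role counts.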

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations follow priors, and depth relations are inverted.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents

cs.MM · 2026-06-26 · unverdicted · novelty 8.0

Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.

MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

cs.CV · 2026-06-19 · unverdicted · novelty 8.0

Introduces the first large-scale multimodal benchmark MedLayXPlain-122K showing medical VLMs suffer significant lay-register degradation while general VLMs lack clinical precision.

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

LOCUS is a released corpus of nearly all US municipal and county ordinance codes, processed via OCR and paired with ModernBERT classifiers for dimensions such as opacity and paternalism.

Vision-language models for chest radiography do not always need the image

cs.CV · 2026-06-16 · accept · novelty 8.0

A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

cs.CL · 2026-06-01 · conditional · novelty 8.0

FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

citing papers explorer

Showing 50 of 754 citing papers after filters.

MotionAtlas: Detailed Region Captioning for Motion-Centric Videos cs.CV · 2026-06-28 · unverdicted · none · ref 3 · internal anchor
MotionAtlas supplies a 2,073-question benchmark, a self-bootstrap pipeline yielding 159k captions, and fine-tuned Video-MLLMs that deliver 5.2-point gains over Qwen3-VL-4B on motion tasks.
The Platonic Defense: Backdoor Defense for Self-Supervised Encoders in the Era of Large Scale Pre-training cs.CV · 2026-06-28 · unverdicted · none · ref 2 · internal anchor
Introduces an attack-agnostic black-box defense for SSL encoders that trains a conditional energy function via NCE and DSM to detect and purify representations, with an energy gap lower-bounded by mutual information.
HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector cs.CV · 2026-06-27 · unverdicted · none · ref 24 · internal anchor
HKVLM trains only an alignment hook to bind frozen LM query embeddings to frozen detector proposals via contrastive retrieval and bipartite assignment, yielding 50-90x grounding gains and reduced hallucinations on RefCOCO and POPE.
Detecting Clinical Hallucinations in LVLMs via Counterfactual Visual Grounding Uncertainty cs.CV · 2026-06-26 · unverdicted · none · ref 1 · internal anchor
A counterfactual visual grounding uncertainty method detects hallucinations in LVLMs on medical images, improving over baselines with interpretable evidence and cross-model transfer.
HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration cs.CV · 2026-06-26 · unverdicted · none · ref 1 · internal anchor
HAT-4D presents an agentic VLM-plus-human-in-the-loop pipeline for monocular 4D multi-object interaction reconstruction and releases the MVOIK-4D benchmark.
Toward Robust In-Context Segmentation via Concept Guidance cs.CV · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
CG-ICS improves ICS robustness by using MLLM-proposed textual concepts scored via SAM3 and tree search plus visual exemplars to activate a frozen SAM3, claiming SOTA accuracy and lower variance across references.
ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering cs.CV · 2026-06-26 · unverdicted · none · ref 2 · internal anchor
ProMSA is a progressive multimodal search agent for KB-VQA that iteratively selects search tools under budgets, trained via rejection-sampling SFT then TN-GSPO RL, reporting gains on E-VQA and InfoSeek over RAG baselines.
Understanding How MLLMs Describe Artworks Using Token Activation Maps cs.CV · 2026-06-26 · unverdicted · none · ref 5 · internal anchor
Token Activation Maps applied to MLLM art descriptions reveal that visual grounding strength varies by token category, with better artist identification than title prediction.
Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models cs.CV · 2026-06-25 · unverdicted · none · ref 3 · internal anchor
VISE is an unsupervised self-evolving method for LMMs that uses invariance rewards to improve visual conditioning, reporting gains on captioning and reduced hallucination across multiple models.
RoPEMover: Depth-Aware Object Relocation via Positional Embeddings cs.CV · 2026-06-25 · unverdicted · none · ref 1 · internal anchor
RoPEMover extends 2D RoPE to a depth-aware version in diffusion transformers to enable consistent object relocation in single images, trained mostly on synthetic data with minimal real supervision and claiming SOTA results.
HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models cs.CV · 2026-06-25 · unverdicted · none · ref 14 · 2 links · internal anchor
HarmVideoBench is a multi-layered benchmark for harmful video understanding in LVLMs with three hierarchical dimensions, and BCR is a method that raises average model performance from 61.7% to 84.4%.
Unison: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation cs.CV · 2026-06-25 · unverdicted · none · ref 1 · internal anchor
Unison is a new benchmark with unified and decoupled tracks plus Unison-Judge to measure synergy between understanding and generation in multimodal models.
PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation cs.CV · 2026-06-25 · unverdicted · none · ref 1 · internal anchor
PortraitGen integrates real-image exemplars into GRPO sampling and applies dual rewards (OmniReward and AI-Portrait) to improve photorealism, claiming better results than baselines on a new PortraitBench.
SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing cs.CV · 2026-06-25 · unverdicted · none · ref 1 · 2 links · internal anchor
SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.
From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP cs.CV · 2026-06-25 · unverdicted · none · ref 2 · internal anchor
CRISP diagnoses a systematic perception-reasoning disconnect in VLMs, showing proprietary models have latent reasoning but poor metric estimation while open-source models lack compositional reasoning.
Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs cs.CV · 2026-06-24 · unverdicted · none · ref 4 · internal anchor
ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.
Text Over Image: Auditing Multimodal Robustness in Synthetic Medical Image Detection cs.CV · 2026-06-24 · unverdicted · none · ref 28 · internal anchor
VLMs for synthetic medical image detection overweight text metadata, flipping authenticity judgments on the same image and dropping accuracy on authentic images by 61.1% on average when an explicit AI-origin tag is present.
Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity cs.CV · 2026-06-24 · unverdicted · none · ref 9 · internal anchor
Presents Invoice Haystack benchmark for homogeneous document retrieval and VL-RAG hybrid framework achieving 60% Recall@1 and up to 13.5 point gains over prior methods.
SteerVTE: Seamless Video Text Editing with Style and Glyph Control cs.CV · 2026-06-22 · unverdicted · none · ref 2 · internal anchor
SteerVTE adds lightweight style and dual-granularity glyph adapters to a frozen video diffusion model, introduces a glyph-aware loss and progressive training, and releases a 1M synthetic dataset to enable accurate video text editing.
Compression and Retrieval: Implicit Memory Retrieval for Video World Models cs.CV · 2026-06-22 · unverdicted · none · ref 1 · internal anchor
CaR uses attention with viewpoint positional encoding and context compression for flexible memory retrieval in video world models, backed by a new SceneFly dataset, and reports SOTA results with open-domain generalization.
READ More than What You See: Reinforcement Learning for Accurate and Coherent Audio Description Generations cs.CV · 2026-06-22 · unverdicted · none · ref 26 · internal anchor
READ is the first reinforcement-learning framework for training audio-description generators, using sequence-level rewards for reference match, length, format, and context-aware coherence.
OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs cs.CV · 2026-06-21 · unverdicted · none · ref 1 · internal anchor
OmniSpace is a plug-and-play method that improves spatial reasoning in MLLMs for AV by injecting camera pose, using epipolar attention across views, and distilling 3D geometric knowledge to overcome weak cross-view correspondence and depth estimation.
Training-Free Semantic Correction for Autoregressive Visual Models cs.CV · 2026-06-21 · unverdicted · none · ref 41 · internal anchor
Gazer uses MLLM feedback in two stages to diagnose semantic errors in intermediate AVM states and rewind/rectify the generation trajectory, improving alignment on compositional benchmarks without training.
CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming cs.CV · 2026-06-21 · unverdicted · none · ref 4 · internal anchor
CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.
T-IMPACT: A Severity-Aware Benchmark for Contextual Image-Text Manipulation cs.CV · 2026-06-21 · unverdicted · none · ref 3 · internal anchor
T-IMPACT is a new benchmark dataset and pipeline that supplies nearly 99k manipulated image-text pairs together with a human-calibrated continuous severity signal for contextual interpretation change.
CapRiCorn-1K: A Comprehensive Benchmark for Video Captioning and Subject Referential Consistency Across Temporal Scales cs.CV · 2026-06-20 · unverdicted · none · ref 1 · internal anchor
CapRiCorn-1K benchmark shows current video captioning models produce inaccurate and inconsistent captions that worsen with longer videos, with proposed metrics correlating to downstream task performance.
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning cs.CV · 2026-06-19 · unverdicted · none · ref 9 · internal anchor
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating cs.CV · 2026-06-19 · unverdicted · none · ref 2 · internal anchor
UnityShots uses fixed LTM and STM memory slots with boundary-conditioned gating and speaker tokens to achieve coherent multi-shot audio-video generation, leading open-source baselines on cross-shot coherence metrics.
A Neurosymbolic Framework for Interpretable Skeleton-Based Seizure Detection via Concept-Driven Logical Reasoning cs.CV · 2026-06-19 · unverdicted · none · ref 1 · 2 links · internal anchor
Neurosymbolic framework detects seizures from video skeletons by activating clinical concepts and composing them with differentiable logic into interpretable rules, evaluated on two benchmarks with public code release.
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies cs.CV · 2026-06-18 · unverdicted · none · ref 15 · 2 links · internal anchor
EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.
PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models cs.CV · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
PerceptionDLM enables parallel region captioning in multimodal diffusion language models via prompting and attention masking, introduces ParaDLC-Bench, and claims first parallel region perception with DLMs.
OneCanvas: 3D Scene Understanding via Panoramic Reprojection cs.CV · 2026-06-17 · unverdicted · none · ref 1 · internal anchor
OneCanvas aggregates multi-view 3D patches onto one panoramic canvas with continuous angular placement and 3D embeddings, enabling pretrained VLMs to achieve SOTA on SQA3D and VSI-Bench with an order of magnitude less compute via a new spatial pretraining curriculum.
AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model cs.CV · 2026-06-17 · unverdicted · none · ref 47 · 3 links · internal anchor
Introduces AMALIA-VL, the first open-source instruction-tuned LVLM for European Portuguese, using a high-resolution vision encoder, pt-PT language model, learned connector, and three-stage training on a custom data mix.
Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning cs.CV · 2026-06-17 · unverdicted · none · ref 19 · internal anchor
Visual-OPSD distills reasoning from a privileged visual-thought teacher to a text-only student using on-policy JSD, delivering +3.40pp accuracy gain and 14.3x speedup over the generative teacher on nine benchmarks.
VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers cs.CV · 2026-06-17 · unverdicted · none · ref 9 · internal anchor
VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.
Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification cs.CV · 2026-06-16 · unverdicted · none · ref 1 · internal anchor
UniAR uses a shared context-visual tokenizer with bitwise quantization and parallel prediction in an autoregressive framework to unify visual understanding and generation, claiming SOTA on generation and editing tasks.
EventDrive: Event Cameras for Vision-Language Driving Intelligence cs.CV · 2026-06-16 · unverdicted · none · ref 2 · internal anchor
EventDrive supplies a multi-task benchmark and EventDrive-VLM architecture that fuses event data, RGB, and language supervision, reporting gains in temporal precision and motion awareness for driving intelligence.
MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias cs.CV · 2026-06-16 · unverdicted · none · ref 1 · internal anchor
MLLMs show late-layer textual override of correct visual predictions, with a directional signature enabling a simple inference-time recovery method that improves conflict benchmarks by up to 9.4%.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models cs.CV · 2026-06-16 · unverdicted · none · ref 45 · internal anchor
SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.
NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis cs.CV · 2026-06-14 · unverdicted · none · ref 1 · internal anchor
NeRD framework generates efficient ontology-grounded reasoning chains for medical image diagnosis via neuro-symbolic rule distillation, shown on skin datasets with expert validation.
Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation cs.CV · 2026-06-11 · unverdicted · none · ref 3 · internal anchor
Prompt2Effect is a weight-driven hypernetwork that synthesizes LoRA adapters for I2V models from prompts and base weights via SVD parameterization, matching fine-tuned quality at 3.3s inference instead of 56 GPU hours.
OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data cs.CV · 2026-06-11 · unverdicted · none · ref 60 · internal anchor
OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.
Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback cs.CV · 2026-06-11 · unverdicted · none · ref 1 · internal anchor
IVT teaches VLMs iterative spatial self-correction via visual feedback from rendered bounding boxes, improving Acc@0.5 by 2.4pp on referring expression benchmarks using 2400 samples and GRPO.
VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving cs.CV · 2026-06-10 · unverdicted · none · ref 1 · internal anchor
VLGA introduces geometry as a fourth modality in VLA models via pointmap regression loss, reporting SOTA open-loop and closed-loop driving metrics on nuScenes and Bench2Drive.
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning cs.CV · 2026-06-10 · unverdicted · none · ref 39 · internal anchor
ReRe boosts open-source MLLMs on spatial reasoning benchmarks VSI-Bench and STI-Bench to rival proprietary SOTA by using a two-phase Reason then Re-reason process with Geometry-to-Video novel view synthesis.
ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation cs.CV · 2026-06-10 · unverdicted · none · ref 39 · internal anchor
ARGUS converts MLLM-selected identity evidence into a synchronized 3x3 mosaic injected as negative-time memory in a diffusion model, plus supporting training techniques, to achieve SOTA subject preservation on human video benchmarks.
CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence cs.CV · 2026-06-09 · unverdicted · none · ref 30 · internal anchor
CoCoSI is a training-free multi-agent system for collaborative cognitive map construction that improves spatial understanding in arbitrary pretrained MLLMs.
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur? cs.CV · 2026-06-08 · unverdicted · none · ref 13 · internal anchor
Introduces Ego-MC-Bench benchmark and Ego-CoMist synthetic dataset showing that fine-tuning video LLMs on proactive mistake corrections improves performance especially for smaller models.
Temporal-Aware Reasoning Optimization for Video Temporal Grounding cs.CV · 2026-06-08 · unverdicted · none · ref 93 · internal anchor
TaRO improves video temporal grounding in MLLMs via constructive reasoning exploration from dense captions and a temporal-sensitivity reward that uses logit drops on disrupted event boundaries, followed by curriculum learning, and reports SOTA results.
HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging cs.CV · 2026-06-08 · unverdicted · none · ref 49 · internal anchor
HDRAgent is the first agent-driven framework for multi-exposure HDR imaging that uses MLLM scene perception, contextual knowledge matching, and perception-distortion feedback to reduce ghosting artifacts.