mega hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

1130 Pith papers citing it

Background 47% of classified citations

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

citation-role summary

background 121 · method 61 · baseline 50 · dataset 5 · other 4

citation-polarity summary

background 114 · use method 61 · baseline 50 · unclear 10 · use dataset 5 · support 1
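
The two summaries above give raw counts, while the headline reports a percentage. As a rough check, the sketch below converts each set of counts into shares of its total. The helper name `shares` is hypothetical, and the page does not state which summary the 47% figure is computed from; that the polarity counts reproduce it is an observation from the arithmetic, not a documented rule.

```python
def shares(counts):
    """Return each label's share of the total, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {
    "background": 121, "method": 61, "baseline": 50, "dataset": 5, "other": 4,
}
polarity_counts = {
    "background": 114, "use method": 61, "baseline": 50,
    "unclear": 10, "use dataset": 5, "support": 1,
}

role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(sum(role_counts.values()), sum(polarity_counts.values()))
print(role_shares["background"], polarity_shares["background"])
```

Both summaries total 241 classified citations. Background is about 50.2% of the role counts and about 47.3% of the polarity counts, so the 47% headline figure is consistent with the polarity summary.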

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents

cs.MM · 2026-06-26 · unverdicted · novelty 8.0

Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

LOCUS is a released corpus of nearly all US municipal and county ordinance codes, processed via OCR and paired with ModernBERT classifiers for dimensions such as opacity and paternalism.

Vision-language models for chest radiography do not always need the image

cs.CV · 2026-06-16 · accept · novelty 8.0

A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

cs.CL · 2026-06-01 · conditional · novelty 8.0

FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

citing papers explorer

Showing 50 of 726 citing papers after filters.

EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs cs.CV · 2026-04-25 · unverdicted · none · ref 1 · internal anchor
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
Grounding Video Reasoning in Physical Signals cs.CV · 2026-04-23 · unverdicted · none · ref 2 · internal anchor
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
Evaluating Remote Sensing Image Captions Beyond Metric Biases cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis cs.CV · 2026-04-22 · unverdicted · none · ref 5 · internal anchor
X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark cs.CV · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Knowledge-Clue-Answer.
WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring cs.CV · 2026-04-22 · unverdicted · none · ref 5 · internal anchor
WildFireVQA is a new large-scale visual question answering benchmark that pairs RGB imagery with radiometric thermal measurements for aerial wildfire monitoring across six task categories.
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training cs.CV · 2026-04-21 · unverdicted · none · ref 2 · internal anchor
EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.
DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
DanceCrafter generates high-fidelity, text-controlled dance sequences using a new Choreographic Syntax framework and a large fine-grained motion dataset.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs cs.CV · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI cs.CV · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
AnimationBench: Are Video Models Good at Character-Centric Animation? cs.CV · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID cs.CV · 2026-04-16 · unverdicted · none · ref 56 · internal anchor
STFER uses LVLM-generated identity-consistent semantic text to drive visual token filtering and expert routing for improved any-time person re-identification under clothing changes and modality shifts.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation cs.CV · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
Visual Preference Optimization with Rubric Rewards cs.CV · 2026-04-14 · unverdicted · none · ref 5 · internal anchor
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports cs.CV · 2026-04-14 · unverdicted · none · ref 3 · internal anchor
EgoEsportsQA is a new egocentric video QA benchmark from esports matches that shows state-of-the-art Video-LLMs reach only 71.58% accuracy and struggle more with tactical reasoning than basic perception.
Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models cs.CV · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
VLMs display semantic fixation, with higher accuracy on standard rule mappings than inverse ones across 14 models, narrowed by neutral prompts but widened by loaded ones and affected by post-training alignment.
Online Reasoning Video Object Segmentation cs.CV · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 6 · internal anchor
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video cs.CV · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates cs.CV · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling cs.CV · 2026-04-13 · unverdicted · none · ref 4 · internal anchor
AmodalSVG produces semantically separate and geometrically complete SVG layers from natural images by using VLM-guided semantic layer peeling for amodal completion followed by adaptive vectorization.
IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly cs.CV · 2026-04-12 · accept · none · ref 3 · internal anchor
IMPACT is a synchronized five-view RGB-D dataset of 112 real industrial assembly trials with multi-granularity annotations, anomaly taxonomy, and compliance tracking.
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models cs.CV · 2026-04-12 · unverdicted · none · ref 6 · internal anchor
GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping cs.CV · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
PlantXpert benchmark shows fine-tuned VLMs reach up to 78% accuracy on plant phenotyping but scaling gains plateau and quantitative biological reasoning remains weak.
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-horizon visual reasoning benchmarks.
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation cs.CV · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% better accuracy than prior methods.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos cs.CV · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos cs.CV · 2026-04-10 · unverdicted · none · ref 2 · 2 links · internal anchor
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.
Large-Scale Universal Defect Generation: Foundation Models and Datasets cs.CV · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
ParseBench: A Document Parsing Benchmark for AI Agents cs.CV · 2026-04-09 · accept · none · ref 5 · internal anchor
ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.
WildDet3D: Scaling Promptable 3D Detection in the Wild cs.CV · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to 0.2 seconds.
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions cs.CV · 2026-04-09 · unverdicted · none · ref 51 · internal anchor
ESOM is a training-free streaming model for open-world video anomaly detection with dynamic definitions that achieves real-time single-GPU efficiency and state-of-the-art results on a new benchmark.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 3 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation cs.CV · 2026-04-08 · conditional · none · ref 35 · internal anchor
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CV · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions cs.CV · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing cs.CV · 2026-04-06 · unverdicted · none · ref 31 · internal anchor
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing cs.CV · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation cs.CV · 2026-04-04 · unverdicted · none · ref 1 · internal anchor
PortraitCraft supplies a new 50k-image dataset and two tasks for evaluating AI on fine-grained portrait composition understanding and constrained generation.
XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis cs.CV · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
XrayClaw deploys cooperative-competitive multi-agent alignment and Competitive Preference Optimization to raise diagnostic accuracy, reasoning fidelity, and generalization on chest X-ray benchmarks.