Qwen2.5-VL Technical Report

Jialin Wang, Keqin Chen, Shuai Bai, Sibo Song, Wenbin Ge, Xuejing Liu · 2025 · cs.CV · arXiv 2502.13923

Mixed citation behavior. The most common role is background (53% of classified citations).

Cited by 1259 Pith papers.

abstract

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

citation-role summary

background 153 · baseline 57 · method 57 · dataset 5 · other 3

citation-polarity summary

background 147 · use method 59 · baseline 56 · unclear 6 · use dataset 5 · support 2
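
As a rough cross-check, the shares implied by the two summaries above can be computed directly from the listed counts. This is a minimal sketch: it assumes the counts shown are the complete set of classified citations, and the names `ROLE_COUNTS`, `POLARITY_COUNTS`, and `shares` are illustrative, not part of any Pith export format.

```python
# Counts copied from the citation-role and citation-polarity summaries above.
ROLE_COUNTS = {
    "background": 153, "baseline": 57, "method": 57,
    "dataset": 5, "other": 3,
}
POLARITY_COUNTS = {
    "background": 147, "use method": 59, "baseline": 56,
    "unclear": 6, "use dataset": 5, "support": 2,
}


def shares(counts):
    """Return each label's share of the total, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


role_shares = shares(ROLE_COUNTS)
polarity_shares = shares(POLARITY_COUNTS)
print(role_shares)
print(polarity_shares)
```

Under that assumption both summaries total 275 classified citations. Background comes to about 55.6% by role and about 53.5% by polarity; the polarity figure is the one closest to the 53% quoted near the top of the page, though the page does not state which count that percentage is derived from.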

authors

Jialin Wang, Keqin Chen, Shuai Bai, Sibo Song, Wenbin Ge, Xuejing Liu

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations are driven by priors, and depth relations are inverted.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

cs.CV · 2026-06-19 · unverdicted · novelty 8.0

Introduces MedLayXPlain-122K, the first large-scale multimodal benchmark of its kind, showing that medical VLMs suffer significant lay-register degradation while general VLMs lack clinical precision.

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

cs.RO · 2026-05-19 · conditional · novelty 8.0

The paper presents RoboAbstention, a new benchmark showing frontier VLMs and embodied planners abstain on only 16.5-39% of 6,069 instructions grounded in robotics images, with prompting interventions raising rates to 88-93% but not solving the problem.

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

cs.CV · 2026-05-15 · conditional · novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

cs.CV · 2026-05-11 · unverdicted · novelty 8.0 · 2 refs

Hilbert-Geo creates the first unified formal language for solid geometry and a two-step parsing-then-reasoning method that reaches SOTA accuracy on solid geometry benchmarks.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

cs.CV · 2026-05-10 · accept · novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

cs.LG · 2025-12-16 · conditional · novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment

cs.CV · 2025-06-02 · conditional · novelty 8.0

FLEX is the first large-scale multimodal multiview dataset for fitness AQA, featuring RGB, 3D pose, sEMG and physiological data plus a Fitness Knowledge Graph for structured annotations and a VideoQA benchmark.

ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

ReQuest introduces an uncertainty-driven question-adaptive keyframe selector with rethinking routing and adaptive NMS that boosts long-form video QA accuracy on Video-MME, MLVU, and LongVideoBench without fine-tuning the base MLLM.

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.

MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.

LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.

TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

TrajLoc enforces per-object trajectory constraints in I2V generation via attention-layer Gaussian heatmap substitution, yielding +4.3 dB PSNR and 51% lower endpoint error on datasets with up to 20 objects across two backbones.

citing papers explorer

Showing 50 of 1132 citing papers after filters.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning cs.CV · 2026-06-30 · unverdicted · none · ref 3 · internal anchor
A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations are driven by priors, and depth relations are inverted.
MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models cs.CV · 2026-06-19 · unverdicted · none · ref 6 · internal anchor
Introduces MedLayXPlain-122K, the first large-scale multimodal benchmark of its kind, showing that medical VLMs suffer significant lay-register degradation while general VLMs lack clinical precision.
Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation cs.CV · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts cs.CV · 2026-05-12 · unverdicted · none · ref 5 · 2 links · internal anchor
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning cs.CV · 2026-05-11 · unverdicted · none · ref 4 · 2 links · internal anchor
Hilbert-Geo creates the first unified formal language for solid geometry and a two-step parsing-then-reasoning method that reaches SOTA accuracy on solid geometry benchmarks.
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation cs.CR · 2026-05-11 · unverdicted · none · ref 53 · internal anchor
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 48 · internal anchor
Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation cs.AI · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild cs.CV · 2026-05-07 · unverdicted · none · ref 58 · internal anchor
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
EgoSound: Benchmarking Sound Understanding in Egocentric Videos cs.CV · 2026-02-15 · unverdicted · none · ref 2 · internal anchor
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing cs.CV · 2026-02-04 · unverdicted · none · ref 2 · internal anchor
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA cs.CV · 2026-07-02 · unverdicted · none · ref 3 · internal anchor
ReQuest introduces an uncertainty-driven question-adaptive keyframe selector with rethinking routing and adaptive NMS that boosts long-form video QA accuracy on Video-MME, MLVU, and LongVideoBench without fine-tuning the base MLLM.
Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning cs.CV · 2026-07-01 · unverdicted · none · ref 11 · internal anchor
P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.
MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models cs.CV · 2026-07-01 · unverdicted · none · ref 4 · internal anchor
MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.
LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models cs.CV · 2026-07-01 · unverdicted · none · ref 5 · internal anchor
LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.
TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control cs.CV · 2026-07-01 · unverdicted · none · ref 1 · internal anchor
TrajLoc enforces per-object trajectory constraints in I2V generation via attention-layer Gaussian heatmap substitution, yielding +4.3 dB PSNR and 51% lower endpoint error on datasets with up to 20 objects across two backbones.
Learning to Watch: Active Video Anomaly Understanding via Interleaved Policy Optimization cs.CV · 2026-07-01 · unverdicted · none · ref 37 · internal anchor
Introduces the Anom-π framework for active video anomaly understanding via interleaved policy optimization and iDPO under weak supervision, claiming that a 2B model outperforms larger SOTA VAU models.
No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs cs.CV · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
Introduces the VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs, generated via the PairFlow pipeline, to evaluate hallucination in LVMs.
Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist cs.AI · 2026-06-30 · unverdicted · none · ref 33 · internal anchor
Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium and FLUX.1-dev.
Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity? cs.CV · 2026-06-30 · unverdicted · none · ref 2 · internal anchor
VSE perturbs only the images to probe visual ambiguity in VLMs, clusters the outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.
Learning to Deny: Action Denial in Multimodal Large Language Models cs.CV · 2026-06-30 · unverdicted · none · ref 5 · internal anchor
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning cs.CV · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
OrthoTryOn: Geometric Orthogonalization for Conflict-Free Unified Fashion Generation cs.CV · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
OrthoTryOn uses Orthogonal Subspace Projection on shared LoRA and Fisher-guided Negative Guidance to enable conflict-free unified fashion generation, outperforming task-specific models on benchmarks.
Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning cs.CV · 2026-06-26 · unverdicted · none · ref 2 · internal anchor
Introduces the Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.
DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues cs.CV · 2026-06-25 · unverdicted · none · ref 4 · internal anchor
DiCoBench is a new high-resolution multi-image benchmark exposing large gaps between top MLLMs and human performance (98.3%) on differential and commonality visual cue perception.
ChartWalker: Benchmarking the Cross-Chart RAG Task with Hierarchical Knowledge Graphs cs.IR · 2026-06-22 · unverdicted · none · ref 2 · internal anchor
ChartWalker provides a hierarchical knowledge graph construction method and structure-aware sampling to generate cross-chart RAG benchmarks, releasing ChartWalker-Bench that exposes performance gaps across RAG paradigms.
Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping cs.CV · 2026-06-22 · unverdicted · none · ref 2 · internal anchor
Sparse Context achieves 2-4x faster inference in reference-conditioned diffusion models by fine-tuning with random token dropping and applying task-aware selection at inference time, without loss of visual quality.
Flow as Flow: Modeling Robot Velocity Fields as Probability Velocity Fields for Flow-Based Object Manipulation cs.RO · 2026-06-22 · unverdicted · none · ref 84 · internal anchor
Flow as Flow models robot flows as probability flows using flow matching to generate velocity fields more efficiently than prior sparse keypoint approaches.
Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation cs.CV · 2026-06-22 · unverdicted · none · ref 42 · internal anchor
JudgeFit produces per-VLM physical video evaluation taxonomies that improve held-out accuracy by a mean 32% relative to a single global schema across 16 models from eight families.
When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents cs.LG · 2026-06-22 · unverdicted · none · ref 3 · internal anchor
High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.
FleetAgent: Teleoperation Assistant for Autonomous Fleets via Vectorized V2N Messages cs.RO · 2026-06-19 · unverdicted · none · ref 36 · internal anchor
FleetAgent pairs a vector-to-embedding interface (VecFormer) with an MLLM to turn compact V2N messages into structured natural-language teleoperation assistance, cutting uplink payload 625x and improving Lingo-Judge score 16.8% on a new nuScenes-derived dataset.
HERO: Hypothesis-Driven Evidence Retrieval from Omics for Multi-Task Breast Cancer Analysis cs.CV · 2026-06-19 · unverdicted · none · ref 2 · internal anchor
HERO maps DNA methylation and miRNA to a 16-dimensional intent vector for TF-IDF caption retrieval and cosine-gated repair in VLM-based multi-task breast cancer prediction, claiming SOTA on TCGA-BRCA.
MammoExpert: Benchmarking Chain-of-Thought Reasoning in Mammography Diagnosis cs.CV · 2026-06-19 · unverdicted · none · ref 2 · internal anchor
MammoExpert is a dataset of 2,379 mammography images with CoT annotations covering 67 subtypes; combining it with existing data improves lesion classification accuracy by 7.1% and CoT training adds another 4%.
Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models cs.CV · 2026-06-18 · unverdicted · none · ref 23 · internal anchor
PRISM shows video diffusion models inherently encode preference information in noisy latents, achieving SOTA accuracy and enabling noise-robust early-stage sampling with a correlation to generative performance.
Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs cs.CV · 2026-06-18 · unverdicted · none · ref 2 · internal anchor
Remote sensing MLLMs perform poorly on negation tasks with hallucinations and accuracy drops, but the NeFo test-time learning method substantially improves negation understanding and generalizes to unseen tasks using ~5% unlabeled test samples.
Online Dynamic Batching with Formal Guarantees for LLM Training cs.DC · 2026-06-18 · unverdicted · none · ref 3 · internal anchor
ODB is an online batching system for distributed LLM training that forms batches post-preprocessing, provides formal deadlock-free guarantees via the Distributed Group Alignment Problem, and reports 1.58-3.78x throughput gains versus fixed-batch baselines.
Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks cs.CR · 2026-06-17 · unverdicted · none · ref 21 · internal anchor
First study of image prompt reconstruction attacks on distributed MLLM inference, proposing MPAA for pixel-level and IEDA for semantic reconstruction with 100% embedding extraction accuracy on four model families.
OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation cs.CV · 2026-06-16 · unverdicted · none · ref 2 · internal anchor
DRIVE-CHOREO uses three LLM agents to create a unified position-aware token sequence co-compressed with multi-view video, achieving SOTA BEV mAP of 21.6 and +2.4 NDS improvement on nuScenes.
Enhancing Pathological VLMs with Cross-scale Reasoning cs.CV · 2026-06-16 · unverdicted · none · ref 3 · internal anchor
Presents the Scale-VQA benchmark for cross-scale pathology VQA and the RL-trained ScaleReasoner-R1 model, which reaches SOTA on the new benchmark and on existing single-scale tasks.
Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering cs.CL · 2026-06-15 · unverdicted · none · ref 1 · internal anchor
Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.
InterleaveThinker: Reinforcing Agentic Interleaved Generation cs.CV · 2026-06-11 · unverdicted · none · ref 17 · internal anchor
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning cs.AI · 2026-06-11 · unverdicted · none · ref 1 · internal anchor
ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.
GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models cs.CV · 2026-06-10 · unverdicted · none · ref 1 · internal anchor
GRIP uses contrastive training on LMM feedback to retrieve beneficial in-context examples for multimodal tasks, outperforming similarity-based methods and transferring across models including GPT-4o.
From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning cs.CV · 2026-06-10 · unverdicted · none · ref 33 · internal anchor
BridgeVLM internalizes causal supervision in VLMs via causal graph induction, Causal Tokens, and RAMP layers with M3S training, raising intervention accuracy on CausalVLBench from 33.2% to 54.4% and structure learning F1 from 33.4% to 75.1%.
Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks cs.CV · 2026-06-09 · unverdicted · none · ref 21 · internal anchor
Earth-OneVision is a unified 2B-parameter RS-MLLM supporting six modalities and nine tasks via FGVLA, SLIS, and PCMA mechanisms plus a 34M QA-pair dataset, reporting competitive or superior benchmark results versus larger models.
3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis cs.CV · 2026-06-09 · unverdicted · none · ref 4 · internal anchor
3D-CoS represents 3D objects as Blender code generated by VLMs, with workflows for planning, RAG, and agents, showing better edit fidelity than point-cloud baselines.
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks cs.AI · 2026-06-08 · unverdicted · none · ref 4 · internal anchor
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions cs.CL · 2026-06-08 · unverdicted · none · ref 58 · internal anchor
H2HMem is a multimodal memory benchmark evaluating LLM agents on recall, reasoning, and application in dyadic and multi-party human-human conversations with phenomena such as anaphora and deixis.
Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning cs.CV · 2026-06-08 · unverdicted · none · ref 2 · internal anchor
Rea2Seg turns image segmentation into candidate mask discovery from MLLM attention followed by MLLM-based comparative scoring and selection, plus a new multi-dimensional reasoning benchmark ReasonSeg-SGDR.
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining cs.CV · 2026-06-07 · unverdicted · none · ref 41 · internal anchor
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
