SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
super hub Mixed citations
LLaVA-OneVision: Easy Visual Task Transfer
Mixed citation behavior. Most common role is background (55%).
abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particu
authors
co-cited works
representative citing papers
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
Chartographer generates seed-controlled counterfactual charts from existing QA datasets to expose generalization failures in VLMs that single-chart benchmarks miss.
Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.
STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.
MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.
EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.
EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks for temporal action segmentation, next-active object detection, interaction Anticip
R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.
HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.
citing papers explorer
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
-
From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.
-
Latent Noise Mask for Reducing Visual Redundancy in Multimodal Large Language Models
Lens purifies visual evidence in MLLMs via question-conditioned latent noise masking with a LET token, yielding 2.4-6.4 point gains on VQA and grounding tasks.
-
Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking
A language dependency parsing mechanism combined with Qwen-VL enables adaptive updates to textual descriptions for improved vision-language tracking performance on benchmarks like TNL2K and LaSOT.
-
MVPruner: Dynamic Token Pruning for Accelerating Multi-view Vision-Language Models in Autonomous Driving
MVPruner is a two-stage dynamic token pruning technique that uses view diversity for initial budget allocation and instruction text for task-aligned selection, delivering 87.3% FLOPs reduction and 4.97x prefilling speedup while retaining 98.5% accuracy on DriveLM.
-
T-IMPACT: A Severity-Aware Benchmark for Contextual Image-Text Manipulation
T-IMPACT is a new benchmark dataset and pipeline that supplies nearly 99k manipulated image-text pairs together with a human-calibrated continuous severity signal for contextual interpretation change.
-
AdaCodec: A Predictive Visual Code for Video MLLMs
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
-
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.
-
AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection
AnomalyAgent is a training-free agentic framework that equips MLLMs with anomaly-centric tools and a memory module to outperform VLM-based methods on both simple and complex contextual anomalies in zero- and few-shot settings.
-
Self-Prophetic Decoding to Unlock Visual Search in LVLMs
SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.
-
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.
-
Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning
Q-GeoMem uses question-guided scoring to maintain a Fine-Grained Context Bank and Semantic-Geometric Evidence Bank, achieving SOTA on VSI-Bench and VSTI-Bench.
-
IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams
IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance across settings.
-
DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
DynFrame introduces tokenized learnable span-density retrieval and Segment-Decoupled GRPO in video MLLMs, achieving competitive or SOTA results on six benchmarks with 4B and 8B models.
-
O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding
O-MARC is a compression distillation framework that lets compact omnimodal models maintain or exceed full-token performance on video QA while cutting latency and memory by about 35%.
-
Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation
Introduces Layout-as-Policy (LaP) to turn 3D layout estimation into an iterative policy-learning refinement process for better physical coherence.
-
Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models
DCI decomposes large-scale visual recognition into simpler subproblems with dynamic pruning to raise MLLM accuracy on datasets like ImageNet-1K and 21K.
-
DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
DrawVideo is a sketch-guided framework that decomposes long videos into controllable shots using keyframe sketches, appearance prompts, and motion prompts, supported by a new SketchLongVideo dataset.
-
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
-
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.
-
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
-
Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
Introduces Zoom-then-Diagnose paradigm and uncertainty-aware reward in GRPO for confidence-aware ultrasound VQA, reporting 39.3% improvement in lesion localization across liver, breast, and thyroid datasets.
-
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
-
Multimodal LLMs under Pairwise Modalities
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
-
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
-
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
AutoTool uses dual-mode RL to let MLLMs adaptively choose tool use or text-only reasoning, reporting 21.8% accuracy gain on V* and 44.9% efficiency gain on POPE versus baselines.
-
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
FineBench is a large-scale human-centric VQA benchmark exposing weaknesses in open VLMs for fine-grained activity understanding, with FineAgent providing a practical enhancement method.
-
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
-
DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs
DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.
-
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
-
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
-
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
-
UAM: A Dual-Stream Perspective on Forgetting in VLA Training
UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.
-
LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs
LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% accuracy retention.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoning benchmarks.
-
OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.
-
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
-
Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.
-
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection
The paper releases the Sens-VisualNews dataset of 9,576 annotated news images for sensational image detection and benchmarks open multimodal LLMs on zero-shot and fine-tuned performance.
-
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.
-
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
-
How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A
F^3A is a training-free visual token pruning router that treats pruning as task-conditioned evidence search and allocates a fixed vision token budget using question cues and frozen sparse heads without extra LLM passes.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.