A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.
super hub Mixed citations
LLaVA-OneVision: Easy Visual Task Transfer
Mixed citation behavior. Most common role is background (55%).
abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particu
authors
co-cited works
representative citing papers
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
ReQuest introduces an uncertainty-driven question-adaptive keyframe selector with rethinking routing and adaptive NMS that boosts long-form video QA accuracy on Video-MME, MLVU, and LongVideoBench without fine-tuning the base MLLM.
P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.
LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.
Imprint compresses egocentric observations into interaction patterns via online memory compression, raising QA accuracy from 31.0% to 35.8% while cutting memory 2.3× and latency 11.8× on a seven-day benchmark.
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.
PRISM shows video diffusion models inherently encode preference information in noisy latents, achieving SOTA accuracy and enabling noise-robust early-stage sampling with a correlation to generative performance.
Hour-long video temporal grounding is a search problem, shown by a new benchmark where all Video-LLMs collapse, frame retrieval outperforms them, 85% of failures are search-related, and a retrieve-then-ground hybrid improves results 6.7x.
A closed-loop self-evolving training system for spatial reasoning in MLLMs that iteratively generates QA pairs matched to the model's current capabilities via confidence feedback, achieving gains with an order of magnitude less data.
3D-CoS represents 3D objects as Blender code generated by VLMs, with workflows for planning, RAG, and agents, showing better edit fidelity than point-cloud baselines.
VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.
AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.
Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
citing papers explorer
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
Chartographer generates seed-controlled counterfactual charts from existing QA datasets to expose generalization failures in VLMs that single-chart benchmarks miss.
-
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
-
M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.
-
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.
-
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach without labeled data.
-
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
VIDA provides 2,500 visually-dependent ambiguous translation examples and span-level disambiguation metrics; CoT-SFT on LVLMs improves out-of-distribution performance over standard SFT.
-
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
-
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
-
Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding
Intent Projection decomposes literal and pragmatic signals in multimodal memes via orthogonal projection and contrastive objectives in LVLMs, outperforming baselines on six benchmarks especially for high-divergence posts.
-
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
AutoTool uses dual-mode RL to let MLLMs adaptively choose tool use or text-only reasoning, reporting 21.8% accuracy gain on V* and 44.9% efficiency gain on POPE versus baselines.
-
Training-Free Multimodal Large Language Model Orchestration
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
-
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.
-
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
-
The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
-
Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models
MLLMs generate verbose, comprehensive, and repetitive aesthetic critiques unlike selective human ones, and reference-based metrics fail to detect this because they capture model house style instead of image-specific content.
-
PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining
PMC-InterCPT builds a context-grounded biomedical interleaved corpus from PMC literature and shows it improves multimodal performance on Qwen3.5-4B-Base after CPT and SFT while using fewer tokens.
-
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
Explicit Representation Alignment for Multimodal Sentiment Analysis
Converting visuals to text descriptions with VLMs for explicit representation alignment, combined with token selection and uniformity regularization, yields state-of-the-art results on multimodal sentiment benchmarks.