SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
super hub Mixed citations
LLaVA-OneVision: Easy Visual Task Transfer
Mixed citation behavior. Most common role is background (55%).
abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particu
authors
co-cited works
representative citing papers
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.
EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.
EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks for temporal action segmentation, next-active object detection, interaction Anticip
HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.
citing papers explorer
-
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
-
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.
-
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.
-
SCP: Spatial Causal Prediction in Video
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
-
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
-
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
-
VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.
-
Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution
Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.
-
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
-
DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes
DarkQA is a new benchmark that measures vision-language model performance on basic visual questions under controlled low-light degradations modeled from real camera physics.
-
SpatialMosaic: A Multiview VLM Dataset for Partial Visibility
SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.
-
VPTracker: Global Vision-Language Tracking via Visual Prompt
VPTracker enables global object tracking in videos by using multimodal large language models with location-aware visual prompts to search entire images while reducing distractions from similar objects.
-
M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.
-
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
-
Training Multi-Image Vision Agents via End2End Reinforcement Learning
IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing tool-use effects on attention.
-
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
-
CaptionQA: Is Your Caption as Useful as the Image Itself?
CaptionQA is a new benchmark with 33,027 questions across natural, document, e-commerce, and embodied AI domains that measures how much utility model-generated captions retain compared to original images when used by LLMs for downstream tasks.
-
POMA-3D: The Point Map Way to 3D Scene Understanding
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
-
VIDEOP2R: Video Understanding from Perception to Reasoning
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
-
Latent Visual Reasoning
Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
-
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.
-
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
-
MMSearch-R1: Incentivizing LMMs to Search
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
-
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
-
Transfer between Modalities with MetaQueries
MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach without labeled data.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance
GuideDog supplies 22K egocentric image-description pairs from 46 countries and an 818-sample QA benchmark showing that current multimodal models still struggle with depth perception and BLV-specific guidance rules.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
-
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
-
LVBench: An Extreme Long Video Understanding Benchmark
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
-
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
-
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
-
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
T-IMPACT: A Severity-Aware Benchmark for Contextual Image-Text Manipulation
T-IMPACT is a new benchmark dataset and pipeline that supplies nearly 99k manipulated image-text pairs together with a human-calibrated continuous severity signal for contextual interpretation change.
-
AdaCodec: A Predictive Visual Code for Video MLLMs
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
-
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.
-
AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection
AnomalyAgent is a training-free agentic framework that equips MLLMs with anomaly-centric tools and a memory module to outperform VLM-based methods on both simple and complex contextual anomalies in zero- and few-shot settings.
-
DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
DrawVideo is a sketch-guided framework that decomposes long videos into controllable shots using keyframe sketches, appearance prompts, and motion prompts, supported by a new SketchLongVideo dataset.
-
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.