Kwai keye-vl 1.5 technical report
Mixed citation behavior. Most common role is background (50%).
representative citing papers
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.
citing papers explorer
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
  Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
- MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs
  MuseBench shows that state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding, versus 87.18% for human experts.
- Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
  The Moment-Video benchmark shows that the top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
- Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker
  OpenRef is a benchmark for open-world referring expression comprehension (REC) with F1 and N3R metrics, accompanied by a training-free Multi-task Consistency Checker (MCC) that improves existing models in complex scenarios.
- Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning
  VisReason is a new benchmark exposing substantial gaps in vision-centric reasoning for MLLMs versus humans, with limited gains from test-time reasoning strategies.
- Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
  LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this, and ADPO training reduces reliance on priors.
- Visual Preference Optimization with Rubric Rewards
  rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
- ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
  ESOM is a training-free streaming model for open-world video anomaly detection with dynamic definitions that achieves real-time single-GPU efficiency and state-of-the-art results on a new benchmark.
- AdaCodec: A Predictive Visual Code for Video MLLMs
  AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only when the predictive cost is high and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
- LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
  LocateAnything proposes Parallel Box Decoding for unified generative visual grounding and detection, paired with a 138M-sample dataset, to raise both speed and high-IoU accuracy.
- IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams
  IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance across settings.
- Building a Precise Video Language with Human-AI Oversight
  The CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones such as Gemini-3.1-Pro and improving fine-grained control in video generation models.
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
  POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens, and supports streaming via a detachable KV-cache.
- Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
  Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
- SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
  SciVQR is a new multimodal benchmark covering 54 scientific subfields that evaluates MLLMs on visual comprehension and multi-step reasoning, revealing significant limitations in leading models.
- Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
  MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
- LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
  LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.