hub Mixed citations

Kwai keye-vl 1.5 technical report

Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al · 2025 · arXiv 2509.01563

Mixed citation behavior. Most common role is background (50%).

11 Pith papers citing it

Background 50% of classified citations

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 2 other 1

citation-polarity summary

background 3 baseline 2 unclear 1

representative citing papers

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

ESOM is a training-free streaming model for open-world video anomaly detection with dynamic definitions that achieves real-time single-GPU efficiency and state-of-the-art results on a new benchmark.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

cs.CV · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

SciVQR is a new multimodal benchmark covering 54 scientific subfields that evaluates MLLMs on visual comprehension and multi-step reasoning, revealing significant limitations in leading models.

Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

cs.LG · 2024-08-14 · accept · novelty 4.0

The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.

citing papers explorer

Showing 11 of 11 citing papers.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 170
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 25
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
Visual Preference Optimization with Rubric Rewards cs.CV · 2026-04-14 · unverdicted · none · ref 4
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions cs.CV · 2026-04-09 · unverdicted · none · ref 49
ESOM is a training-free streaming model for open-world video anomaly detection with dynamic definitions that achieves real-time single-GPU efficiency and state-of-the-art results on a new benchmark.
Building a Precise Video Language with Human-AI Oversight cs.CV · 2026-04-22 · unverdicted · none · ref 76
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 98
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding cs.CV · 2026-04-06 · unverdicted · none · ref 33
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
Streaming Video Instruction Tuning cs.CV · 2025-12-24 · unverdicted · none · ref 19
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation cs.CV · 2026-05-11 · unverdicted · none · ref 97 · 2 links
SciVQR is a new multimodal benchmark covering 54 scientific subfields that evaluates MLLMs on visual comprehension and multi-step reasoning, revealing significant limitations in leading models.
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration cs.CV · 2026-05-01 · unverdicted · none · ref 25
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities cs.LG · 2024-08-14 · accept · none · ref 267
The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.

Kwai keye-vl 1.5 technical report

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer