StreamingVLM: Real-Time Understanding for Infinite Video Streams

Guangxuan Xiao; Kelly Peng; Liuning He; Ruyi Xu; Song Han; Yao Lu; Yukang Chen

arXiv: 2510.09608 · v1 · pith:LWVTT7QDnew · submitted 2025-10-10 · 💻 cs.CV · cs.AI · cs.CL

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

Pith reviewed 2026-05-17 11:47 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords: streaming video · vision language models · real-time processing · key value cache · infinite streams · supervised fine tuning · video understanding

The pith

A vision-language model achieves stable real-time understanding of arbitrarily long video streams through a streaming attention cache aligned with training on short clips.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StreamingVLM to solve the problem of processing near-infinite video streams without quadratic costs or loss of coherence. It aligns supervised fine-tuning on short overlapped video chunks with a streaming inference setup that reuses key-value states from attention sinks, recent vision tokens, and recent text tokens. This allows the model to maintain constant memory and latency while processing videos averaging over two hours. If successful, such models could power real-time video assistants and agents that respond continuously to live visual input.
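
The constant-memory claim can be made concrete with a short cost comparison. The symbols below are assumptions introduced for illustration and do not appear in this summary: T is the number of tokens seen so far, S the number of attention-sink entries, W_v the vision window size, and W_t the text window size.

```latex
\begin{align}
\text{full attention over } T \text{ tokens:}\quad
  & \text{total cost} = O(T^2), \qquad \text{KV memory} = O(T), \\
\text{streaming cache:}\quad
  & |\mathrm{KV}| \le S + W_v + W_t, \qquad
    \text{cost per new token} = O(S + W_v + W_t).
\end{align}
```

With S, W_v, and W_t fixed, the streaming total is O(T (S + W_v + W_t)), which is linear in T, and the cache size does not grow with stream length.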

Core claim

StreamingVLM maintains a compact KV cache consisting of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. The streaming behavior is trained by applying full attention to short overlapped chunks during supervised fine-tuning, which mimics the inference pattern without requiring long-context training.
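
As a rough illustration of the cache layout described above, the following Python sketch keeps a fixed number of sink entries, a short window of vision entries, and a longer window of text entries. The class name `StreamingKVCache`, its parameters, and the window sizes are hypothetical and chosen only for illustration. The paper's actual sizes and implementation are not given in this summary, and real KV states are tensors rather than strings.

```python
from collections import deque


class StreamingKVCache:
    """Toy cache: fixed sinks + short vision window + long text window.

    All names and sizes are illustrative assumptions, not the authors' code.
    """

    def __init__(self, num_sinks=2, vision_window=4, text_window=8):
        self.num_sinks = num_sinks
        self.sinks = []                             # earliest entries, never evicted
        self.vision = deque(maxlen=vision_window)   # recent vision entries
        self.text = deque(maxlen=text_window)       # recent text entries

    def append(self, kind, state):
        """Add one entry; `kind` must be "vision" or "text"."""
        if kind not in ("vision", "text"):
            raise ValueError(f"unknown kind: {kind!r}")
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((kind, state))        # first entries become sinks
        elif kind == "vision":
            self.vision.append(state)               # a full deque drops its oldest entry
        else:
            self.text.append(state)

    def __len__(self):
        return len(self.sinks) + len(self.vision) + len(self.text)

    def max_size(self):
        return self.num_sinks + self.vision.maxlen + self.text.maxlen


# Feed a long interleaved stream; the cache size stays bounded.
cache = StreamingKVCache()
for step in range(1000):
    cache.append("vision", f"v{step}")
    if step % 2 == 0:
        cache.append("text", f"t{step}")
```

After the loop, `len(cache)` equals `cache.max_size()` no matter how many entries were appended, which is the property that keeps memory constant.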

What carries the argument

The streaming KV cache combined with SFT on overlapped short video chunks that replicates the sparse attention pattern used at inference time.
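
A minimal sketch of how a long video might be cut into short overlapped chunks for SFT is given below. The function name `overlapped_chunks` and the chunk length and overlap values are assumptions for illustration; this summary does not state the values the paper uses.

```python
def overlapped_chunks(num_frames, chunk_len, overlap):
    """Return (start, end) frame ranges, end exclusive.

    Consecutive chunks share `overlap` frames. Hypothetical helper,
    not taken from the paper's code.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("need 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end == num_frames:       # last chunk reached the end of the video
            break
        start += stride
    return chunks


chunks = overlapped_chunks(num_frames=20, chunk_len=8, overlap=2)
```

Each chunk would then be trained with ordinary full attention, so no training sample is longer than `chunk_len` frames.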

If this is right

On the Inf-Streams-Eval benchmark, whose videos average over two hours, the model achieves a 66.18% win rate against GPT-4O mini.
It runs stably at up to 8 FPS on a single NVIDIA H100 GPU.
The same SFT approach improves performance on LongVideoBench by 4.30 points and OVOBench Realtime by 5.96 points without any VQA-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the alignment holds, the method could generalize to other vision-language models beyond the one tested.
Applications in continuous monitoring or live event analysis become feasible at low computational cost.
Future work might test the approach on even longer streams or different modalities to verify scalability.

Load-bearing premise

The premise is that training with full attention on short overlapped chunks transfers to coherent outputs once the model switches to the streaming KV cache on long continuous streams.

What would settle it

A direct comparison where the same model is run once with full attention on a long video and once with the streaming cache, checking if coherence or accuracy drops significantly on the streaming version.
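
To make the two settings in that comparison concrete, the sketch below lists which past positions a query token can attend to under full causal attention and under a streaming cache. The function `visible_positions`, the token-kind encoding ("v" for vision, "t" for text), and the window sizes are hypothetical. The sketch only enumerates positions and does not measure coherence or accuracy.

```python
def visible_positions(kinds, query, num_sinks, vision_window, text_window):
    """Positions token `query` may attend to under a streaming cache.

    `kinds[i]` is "v" (vision) or "t" (text). Illustrative assumption only.
    """
    if vision_window < 1 or text_window < 1:
        raise ValueError("window sizes must be at least 1")
    past = range(query + 1)                          # causal: self and earlier
    sinks = [i for i in past if i < num_sinks]       # first tokens are always kept
    rest = [i for i in past if i >= num_sinks]
    vision = [i for i in rest if kinds[i] == "v"][-vision_window:]
    text = [i for i in rest if kinds[i] == "t"][-text_window:]
    return sorted(sinks + vision + text)


kinds = list("tt" + "vvvt" * 3)      # 2 prompt tokens, then 3 frames + 1 text token, repeated
query = len(kinds) - 1
streaming = visible_positions(kinds, query, num_sinks=2, vision_window=3, text_window=2)
full = list(range(query + 1))        # full causal attention sees every earlier position
dropped = [i for i in full if i not in streaming]
```

The positions in `dropped` are exactly what the streaming run cannot see, so any accuracy gap between the two runs would be attributable to information carried only by those positions.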

Original abstract

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces StreamingVLM for real-time understanding of infinite video streams. It maintains a compact KV cache using attention sinks plus a short recent vision window and long recent text window at inference time. Training aligns to this pattern via supervised fine-tuning with full attention on short overlapped video chunks rather than long contexts. A new benchmark Inf-Streams-Eval is introduced consisting of videos averaging over two hours that require dense per-second frame-text alignment. On this benchmark the model reports a 66.18% win rate versus GPT-4O mini while sustaining up to 8 FPS on one NVIDIA H100; the same SFT also yields gains of +4.30 on LongVideoBench and +5.96 on OVOBench Realtime.

Significance. If the training-inference alignment and benchmark results prove robust, the work would meaningfully advance practical deployment of VLMs for continuous, long-horizon video input in real-time assistants and agents. The decision to release code supports reproducibility, and the new benchmark addresses a clear evaluation gap for streaming scenarios.

major comments (2)

[§3] §3 (training-inference alignment): the central claim that full-attention SFT on short overlapped chunks instills the exact restricted attention pattern (sinks + short vision window + long text window) used at inference is load-bearing, yet the manuscript provides no direct validation such as attention-map comparisons or coherence metrics on streams longer than the training chunks; without this, the reported stability on arbitrarily long non-overlapped streams remains an untested assumption.
[§5.1] §5.1 (Inf-Streams-Eval results): the 66.18% win rate is presented as the primary evidence of superiority, but the manuscript does not report the number of pairwise comparisons, tie-handling procedure, or inter-annotator agreement, making it impossible to assess whether the margin is statistically reliable or sensitive to prompt variations.
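
The statistical details requested in the second comment could be reported along the lines of the sketch below, which computes a win rate and a Wilson score interval from pairwise outcomes. The function name, the convention of counting a tie as half a win, and the example counts are all assumptions for illustration. The counts are not taken from the paper, and with ties scored as half wins the interval is only approximate.

```python
import math


def win_rate_with_interval(wins, losses, ties, z=1.96):
    """Win rate (ties count as half a win) with a Wilson score interval.

    Hypothetical helper; the paper's tie-handling is not stated in this review.
    """
    n = wins + losses + ties
    if n == 0:
        raise ValueError("no comparisons")
    p = (wins + 0.5 * ties) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half


# Made-up counts, used only to show how the interval narrows with more comparisons.
small = win_rate_with_interval(wins=60, losses=30, ties=10)
large = win_rate_with_interval(wins=600, losses=300, ties=100)
```

Both calls yield the same point estimate, but the larger sample gives a narrower interval, which is why the number of comparisons matters for judging a figure such as 66.18%.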

minor comments (2)

[§3] The exact token counts or frame counts for the 'short window of recent vision tokens' and 'long window of recent text tokens' are described only qualitatively; numerical values and ablation on these hyperparameters should be added for reproducibility.
[Figure 3] Figure 3 (or equivalent streaming diagram) would benefit from an explicit legend distinguishing the maintained cache regions from discarded tokens.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will update the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses

Referee: [§3] §3 (training-inference alignment): the central claim that full-attention SFT on short overlapped chunks instills the exact restricted attention pattern (sinks + short vision window + long text window) used at inference is load-bearing, yet the manuscript provides no direct validation such as attention-map comparisons or coherence metrics on streams longer than the training chunks; without this, the reported stability on arbitrarily long non-overlapped streams remains an untested assumption.

Authors: We agree that explicit validation of the attention pattern transfer would strengthen the central claim. The overlapped-chunk SFT is intended to induce the desired behavior: applying full attention within each short chunk is meant to mimic the inference-time attention pattern, and the overlap between consecutive chunks is meant to preserve coherence across chunk boundaries. However, the current manuscript relies on downstream performance metrics rather than direct attention visualizations. In the revision we will add attention-map comparisons and coherence metrics evaluated on video streams substantially longer than the training chunks to directly test the assumption. revision: yes
Referee: [§5.1] §5.1 (Inf-Streams-Eval results): the 66.18% win rate is presented as the primary evidence of superiority, but the manuscript does not report the number of pairwise comparisons, tie-handling procedure, or inter-annotator agreement, making it impossible to assess whether the margin is statistically reliable or sensitive to prompt variations.

Authors: We acknowledge that these statistical details are necessary for readers to evaluate the reliability of the reported win rate. The current manuscript omits them. In the revised version we will report the total number of pairwise comparisons, the exact tie-handling procedure, and inter-annotator agreement metrics so that the robustness of the 66.18% figure can be properly assessed. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results measured externally

Full rationale

The paper's core contribution is an empirical training-inference alignment: SFT with full attention on short overlapped video chunks is used to instill a streaming KV cache pattern (attention sinks + recent vision/text windows) for long non-overlapped streams. Reported metrics (66.18% win rate vs. GPT-4O mini on Inf-Streams-Eval, up to 8 FPS on H100, gains on LongVideoBench/OVOBench) are direct measurements against external models and a new benchmark, not quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes reduce the claims to inputs; the coherence-transfer assumption is a testable hypothesis, not a tautology. Self-citations, if present for attention-sink mechanisms, are not load-bearing for the central empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard transformer KV caching and attention mechanisms plus the empirical assumption that short-chunk SFT transfers to long streaming inference.

axioms (1)

[standard math] Standard transformer attention and KV-cache mechanics function as previously established in the literature. The streaming design directly reuses existing attention-sink and windowed-cache techniques.

pith-pipeline@v0.9.0 · 5619 in / 1334 out tokens · 40226 ms · 2026-05-17T11:47:01.498125+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
EgoSAT: A Comprehensive Benchmark of Egocentric Streaming Interaction Understanding
cs.CV 2026-06 unverdicted novelty 7.0

EgoSAT is the first benchmark unifying retrospective, online, and prospective reasoning tasks in egocentric streaming video to evaluate VLMs, revealing struggles with temporal modeling and mis-calibration.
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
cs.CV 2026-06 accept novelty 7.0

OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
cs.CV 2026-06 unverdicted novelty 7.0

X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
cs.CV 2026-06 unverdicted novelty 7.0

X-Stream benchmark shows state-of-the-art MLLMs achieve only about 50% on multi-stream video tasks and exhibit poor proactive ability.
EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision
cs.CV 2026-05 unverdicted novelty 7.0

Egostream introduces a diagnostic benchmark that expands 2,250 questions into 8,528 recall-conditioned evaluations to measure streaming episodic memory performance across detail, spatial, temporal, event, social, caus...
An Efficient Streaming Video Understanding Framework with Agentic Control
cs.CV 2026-05 unverdicted novelty 7.0

R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.
StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video
cs.CV 2026-05 unverdicted novelty 7.0

StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
Don't Pause! Every prediction matters in a streaming video
cs.CV 2026-04 unverdicted novelty 7.0

SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
Online Reasoning Video Object Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
cs.CV 2026-04 unverdicted novelty 7.0

VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
cs.CV 2026-04 unverdicted novelty 7.0

BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
cs.CV 2026-02 unverdicted novelty 7.0

Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse
cs.DC 2026-06 unverdicted novelty 6.0

Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?
cs.CV 2026-06 unverdicted novelty 6.0

Introduces Ego-MC-Bench benchmark and Ego-CoMist synthetic dataset showing that fine-tuning video LLMs on proactive mistake corrections improves performance especially for smaller models.
Harnessing Streaming Video in the Wild
cs.CV 2026-06 unverdicted novelty 6.0

Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
ProactiveLLM: Learning Active Interaction for Streaming Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.
SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
cs.CV 2026-05 unverdicted novelty 6.0

SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.
An Efficient Streaming Video Understanding Framework with Agentic Control
cs.CV 2026-05 unverdicted novelty 6.0

R3-Streaming uses cascaded control, age-aware memory forgetting, and TB-GRPO reinforcement learning to reach SOTA scores on streaming video benchmarks while cutting visual token usage by 95-96%.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC 2026-04 unverdicted novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
cs.CV 2026-01 unverdicted novelty 6.0

HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
Streaming Video Instruction Tuning
cs.CV 2025-12 unverdicted novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
cs.CV 2025-11 conditional novelty 6.0

OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
Linear Scaling Video VLMs for Long Video Understanding
cs.CV 2026-05 unverdicted novelty 5.0

StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks...
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
cs.CV 2026-05 conditional novelty 5.0

MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
cs.CV 2026-05 unverdicted novelty 5.0

Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
Decouple and Cache: KV Cache Construction for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 5.0

DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.
Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
cs.CV 2026-02 unverdicted novelty 4.0

An edge-deployed multimodal LLM pipeline for online episodic memory QA reaches 51.76% accuracy on an 8 GB GPU and 54.40% on a local server, within 4-5 points of a 56% cloud baseline.
