pith. machine review for the scientific record.

arxiv: 2605.03351 · v1 · submitted 2026-05-05 · 💻 cs.CV · cs.AI

Recognition: unknown

VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 01:26 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video vision-language models · state reuse · anti-recomputation · latency reduction · VideoMME · follow-up queries · training-free optimization · vision tower pruning

The pith

Video vision-language models can reuse prior visual state for follow-up questions on the same video, preserving answer correctness while cutting latency by up to 36 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video vision-language models keep recomputing visual features for later questions even when the scene has not changed. This paper demonstrates that a validation check can safely decide when to reuse earlier computation instead of starting over from fresh frames. The method works without any retraining and focuses on the period after the first query has already processed the video. A reader would care because interactive video tasks currently pay full cost on every turn, and avoiding that repeated work makes longer conversations practical at lower expense.

Core claim

Video VLMs keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. On frozen Qwen2.5-VL-7B-Instruct-4bit, adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME breadth setting while reducing follow-up latency by 14.90-35.92x. Fresh-video pruning is smaller but real.
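
Read as pseudocode, the routing policy described here amounts to something like the sketch below. Every name in it (VisualCache, stable_enough, encode_video, the model interface) is an illustrative placeholder, not the paper's API.

```python
# Illustrative sketch of validation-gated anti-recomputation routing.
# VisualCache, encode_video, stable_enough, and the model interface are
# placeholders assumed for illustration; they are not the paper's code.
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class VisualCache:
    video_id: str
    kv_prefix: Any  # visual/prefix state persisted after the first (cold) query


def stable_enough(video: Any, query: str, cache: VisualCache) -> bool:
    # Stand-in for the paper's validation check. Returning False is the
    # degenerate conservative policy: always recompute from fresh frames.
    return False


def encode_video(model: Any, video: Any) -> Any:
    # Full cold-start path: decode frames and run the vision tower.
    return model.ingest(video)


def answer(query: str, video: Any, cache: Optional[VisualCache], model: Any) -> Tuple[str, VisualCache]:
    same_video = cache is not None and cache.video_id == video.id
    if same_video and stable_enough(video, query, cache):
        # Reuse path: skip frame decode and the vision tower; only the new
        # question is processed against the persisted state.
        return model.generate(query, reuse_state=cache.kv_prefix), cache
    # Fresh path: pay the full ingest cost, then persist state for follow-ups.
    kv_prefix = encode_video(model, video)
    return model.generate(query, reuse_state=kv_prefix), VisualCache(video.id, kv_prefix)
```

The degenerate stable_enough above always recomputes; the paper's contribution is the validation that lets the reuse branch fire on follow-up turns without changing answers.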

What carries the argument

adaptive same-video follow-up reuse that validates whether prior visual state remains stable enough to skip fresh processing

If this is right

  • Follow-up latency drops 14.9 to 35.9 times on the tested Qwen model while choices and correctness stay identical to full processing.
  • The first query on any new video remains fully cold-start; gains appear only on later questions about the same video.
  • C-VISION pruning of vision-tower work on fresh videos yields a 1.316 times first-query speedup on Gemma 4-E4B-4bit with no drift.
  • Stage-share ceiling means any component speedup contributes to end-to-end time only in proportion to the wall-clock share of the stage it accelerates; a worked example follows this list.
  • Stress tests confirm that repeated-question schedules hold through 50 turns and that prompt variation separates conservative from aggressive reuse policies.
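
The stage-share ceiling in the fourth bullet is Amdahl-style accounting. A minimal worked example, using illustrative numbers rather than the paper's measured stage shares:

```python
# C-CEILING-style stage-share accounting. The share and speedup values
# below are illustrative, not measurements from the paper.
def end_to_end_speedup(stage_share: float, stage_speedup: float) -> float:
    """Whole-pipeline speedup when a stage holding `stage_share` of wall-clock
    time is accelerated by `stage_speedup`x (Amdahl-style)."""
    return 1.0 / ((1.0 - stage_share) + stage_share / stage_speedup)


# If the vision tower were 40% of first-query wall-clock time:
print(end_to_end_speedup(0.40, 3.0))   # ~1.36x end to end from a 3x component win
print(end_to_end_speedup(0.40, 1e9))   # ~1.67x ceiling even with an unbounded win
```

This is why the abstract stresses that C-VISION and after-ingest follow-up reuse do not multiply: each is bounded by the share of wall-clock time it touches.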

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer multi-turn video conversations could become feasible on current hardware if reuse scales to hours of footage.
  • The same validation logic might extend to audio or text streams that also contain long stable segments.
  • Models could be trained to emit explicit signals about motion or object state, making the validation step cheaper and more reliable.
  • Testing the approach on videos with gradual lighting shifts or camera motion would reveal where the stability boundary actually lies.

Load-bearing premise

The validation check can reliably tell when visual state has stayed the same and is still sufficient for the correct answer, without missing changes that would alter what the model should say.

What would settle it

Run the 93 VideoMME queries with controlled small scene alterations inserted after the first answer; if reused answers diverge from full-recomputation answers on any query, the preservation claim fails.
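
A sketch of how that settling experiment could be scored. The perturbation generator and the two answer paths are assumptions introduced for illustration, not artifacts from the paper.

```python
# Hypothetical harness for the settling experiment: controlled small scene
# alterations inserted after the first answer, with the reuse path scored
# against full recomputation. `perturb`, `answer_with_reuse`, and
# `answer_full` are assumed stand-ins.
def check_preservation(queries, perturb, answer_with_reuse, answer_full):
    divergences = []
    for q in queries:                      # e.g. the 93 VideoMME breadth queries
        altered = perturb(q.video)         # small controlled alteration post-first-answer
        a_reuse = answer_with_reuse(q, altered)
        a_full = answer_full(q, altered)   # recompute everything from fresh frames
        if a_reuse != a_full:
            divergences.append((q.id, a_reuse, a_full))
    # The preservation claim fails if any query diverges.
    return len(divergences) == 0, divergences
```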

Figures

Figures reproduced from arXiv: 2605.03351 by JF Bastien, Sam D'Amico.

Figure 1. Graphical overview. Video state is reused when validation gates allow it and refreshed …
Figure 2. Unrepaired persistent-KV frame-scaling envelope. The figure separates the raw warm …
Figure 3. Adaptive C-PERSIST timing attribution. Fixed …
Figure 4. C-CEILING share-model validation. Predicted speedup uses dense vision share and …
Figure 5. Qwen base-policy routing frontier under dense-backend substitution.
Figure 6. Qwen routing-budget visualization on real clips. Panel A overlays static, shifted, and fresh …
Original abstract

Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. The largest measured win is after ingest. On frozen Qwen2.5-VL-7B-Instruct-4bit, adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME breadth setting while reducing follow-up latency by 14.90-35.92x. The first query is still cold; the win starts when later questions reuse the same video state. Stress tests bound the result: repeated-question schedules hold through 50 turns, while dense-answer-anchored prompt variation separates conservative fixed K=1 repair from faster aggressive policies that drift. Fresh-video pruning is smaller but real. C-VISION skips timed vision-tower work before the first answer is generated. On Gemma 4-E4B-4bit, the clean 32f short cell reaches 1.316x first-query speedup with no paired drift or parse failures on 20 items; Qwen shows the fidelity/speed boundary. Stage-share ceiling (C-CEILING) is the accounting guardrail: a component speedup becomes an end-to-end speedup only in proportion to the wall-clock share it accelerates, so C-VISION and after-ingest follow-up reuse do not multiply. Candidate C-STREAM remains a native-rate target, not a headline result here. The broader direction is VLM-native media that expose change, motion, uncertainty, object state, sensor time, and active tiles directly, so models do not have to rediscover the world from dense RGB every frame.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces training-free anti-recomputation techniques for video vision-language models (VLMs), focusing on adaptive reuse of visual state for follow-up queries on the same video (C-PERSIST-style) and fresh-video pruning (C-VISION). On frozen Qwen2.5-VL-7B-Instruct-4bit, it claims that adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME setting while achieving 14.90-35.92x follow-up latency reduction. Additional results include 1.316x first-query speedup on Gemma 4-E4B-4bit with C-VISION, bounded by stress tests (50-turn repeated questions and dense-answer-anchored prompt variation). The work emphasizes component speedups, stage-share accounting (C-CEILING), and a broader call for VLM-native media exposing change and uncertainty.

Significance. If the empirical results hold under detailed validation, the approach could meaningfully improve inference efficiency for multi-turn video VLM applications by avoiding redundant visual recomputation, particularly for follow-up queries where state stability is high. The concrete speedups on named models (Qwen2.5-VL, Gemma) and benchmarks (VideoMME) with preserved correctness are a strength, as is the explicit accounting for wall-clock share via C-CEILING. However, the lack of description for the core stability validator limits immediate impact and reproducibility.

major comments (2)
  1. [Abstract] Abstract and § on adaptive reuse: the headline claim of preserved correctness with 14.90-35.92x latency reduction rests on an unspecified 'validation' procedure that decides when visual state survives. No details are provided on the detector (e.g., frame differencing, feature similarity, query-dependent check), its false-negative rate for changes affecting downstream answers, or failure modes under prompt variation. The stress tests bound only the chosen policy, not whether the underlying detector systematically misses relevant changes.
  2. [Abstract] Abstract, stress-test paragraph: the 50-turn repeated-question and dense-answer-anchored variation tests are described only at high level. It is unclear how these isolate the stability detector's robustness versus test-distribution artifacts, or what controls were used for drift detection and error analysis.
minor comments (2)
  1. [Abstract] Abstract: 'C-VISION skips timed vision-tower work' and 'C-CEILING' are introduced without prior definition or cross-reference, making the first read harder.
  2. [Abstract] Abstract: the 93-query VideoMME breadth setting and 20-item Gemma evaluation lack any mention of query selection criteria or diversity controls.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The feedback correctly identifies areas where additional description is needed for reproducibility. We will revise the manuscript with expanded sections on the stability validator and stress-test methodology, including implementation details, quantitative analysis, and controls. These changes will be incorporated as a major revision.

Point-by-point responses
  1. Referee: [Abstract] Abstract and § on adaptive reuse: the headline claim of preserved correctness with 14.90-35.92x latency reduction rests on an unspecified 'validation' procedure that decides when visual state survives. No details are provided on the detector (e.g., frame differencing, feature similarity, query-dependent check), its false-negative rate for changes affecting downstream answers, or failure modes under prompt variation. The stress tests bound only the chosen policy, not whether the underlying detector systematically misses relevant changes.

    Authors: We agree that the current manuscript provides insufficient detail on the stability validator, limiting immediate reproducibility. The validator combines lightweight frame differencing (pixel-level L2 threshold) with cosine similarity on pooled visual encoder features, applied query-dependently only when the follow-up query references visual content. In the revision we will add a dedicated subsection with the exact algorithm, chosen thresholds, false-negative rate estimation (via manual review of 50 VideoMME pairs), and enumerated failure modes under prompt variation. This directly addresses the concern that stress tests alone do not validate the detector (a sketch of this check, under illustrative assumptions, appears after the point-by-point responses). revision: yes

  2. Referee: [Abstract] Abstract, stress-test paragraph: the 50-turn repeated-question and dense-answer-anchored variation tests are described only at high level. It is unclear how these isolate the stability detector's robustness versus test-distribution artifacts, or what controls were used for drift detection and error analysis.

    Authors: We acknowledge the high-level presentation of the stress tests. The 50-turn repeated-question schedule uses identical prompts on an unchanged video to measure cumulative drift, while dense-answer-anchored variation inserts paraphrases that still require the same visual evidence. In the revised manuscript we will expand this paragraph to specify the exact prompt templates, the full-recomputation baseline used for drift detection, the error-analysis protocol (answer-choice agreement plus manual correctness check), and controls for test-distribution artifacts. These additions will clarify how the tests bound detector robustness rather than merely the chosen policy. revision: yes
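
Taken literally, the two checks described in these responses could look something like the sketches below. The thresholds, helper names, and answer-path functions are illustrative assumptions, not values or interfaces from the manuscript.

```python
import numpy as np

# Sketch of the validator as described in the rebuttal: pixel-level L2 frame
# differencing plus cosine similarity on pooled visual-encoder features,
# applied only when the follow-up query references visual content.
# Thresholds and the `references_visual_content` predicate are assumptions.
L2_THRESHOLD = 0.05       # RMS pixel difference on frames normalized to [0, 1]
COSINE_THRESHOLD = 0.98   # required pooled-feature similarity for reuse


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def state_is_stable(prev_frames: np.ndarray, new_frames: np.ndarray,
                    prev_pooled: np.ndarray, new_pooled: np.ndarray,
                    query: str, references_visual_content) -> bool:
    if not references_visual_content(query):
        return True  # purely textual follow-up: cached state suffices by assumption
    pixel_rms = float(np.sqrt(np.mean((new_frames - prev_frames) ** 2)))
    if pixel_rms > L2_THRESHOLD:
        return False  # coarse frame differencing says the scene changed
    return cosine(prev_pooled, new_pooled) >= COSINE_THRESHOLD
```

```python
# Sketch of the 50-turn repeated-question drift check: identical prompt on an
# unchanged video, scored for answer-choice agreement against a
# full-recomputation baseline. `answer_with_reuse` and `answer_full` are
# illustrative stand-ins for the two inference paths.
def repeated_question_drift(question, video, answer_with_reuse, answer_full, turns=50):
    baseline = answer_full(question, video)       # reference choice, full recomputation
    drift_turns = []
    for turn in range(turns):
        choice = answer_with_reuse(question, video)  # reuse path, same prompt each turn
        if choice != baseline:                       # disagreement counts as drift
            drift_turns.append(turn)
    return drift_turns                               # empty list means no cumulative drift
```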

Circularity Check

0 steps flagged

No circularity; empirical latency results on frozen VLMs with no derivation chain

Full rationale

The manuscript reports measured speedups from adaptive state reuse on specific frozen models (Qwen2.5-VL-7B, Gemma) and query sets (VideoMME), bounded by stress tests such as 50-turn repetition and prompt variation. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear. The validation procedure for stable visual state is presented as an engineering choice whose correctness is evaluated directly by the reported paired-choice preservation and failure-mode tests rather than reduced to prior self-referential results. The work is therefore self-contained as an empirical optimization study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only; no free parameters, invented entities, or detailed axioms are extractable. The approach implicitly assumes reliable validation of state stability and that component speedups translate proportionally to end-to-end time.

axioms (1)
  • domain assumption: Visual state for a given video remains stable enough across follow-up queries that cached features can be reused without accuracy loss when validation passes.
    This is the load-bearing premise for the anti-recomputation strategy stated in the abstract.

pith-pipeline@v0.9.0 · 5660 in / 1283 out tokens · 63893 ms · 2026-05-08T01:26:41.170361+00:00 · methodology

discussion (0)

