arXiv: 2604.06036 · v3 · submitted 2026-04-07 · cs.DC · cs.CV · cs.LG

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov

Pith reviewed 2026-05-10 18:41 UTC · model grok-4.3

classification cs.DC · cs.CV · cs.LG

keywords video streaming analytics, vision-language models, inference optimization, video codecs, patch pruning, key-value cache, temporal redundancy, spatial redundancy

The pith

CodecSight uses video codec metadata as a free signal to prune patches and refresh LLM caches, boosting streaming VLM throughput by up to 3x while cutting GPU compute by up to 87%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that metadata produced by standard video codecs already encodes the temporal and spatial redundancy present in live streams. CodecSight reads this metadata at runtime to skip redundant image patches before they reach the vision transformer and to refresh only the necessary entries in the LLM's key-value cache. These steps run entirely online, require no offline profiling or training, and act across decoding, vision encoding, and language-model prefilling in one unified pass. The result is substantially lower inference cost for real-time video analytics without meaningful accuracy loss.

Core claim

CodecSight demonstrates that codec metadata, generated as a byproduct of compression, functions as a reliable low-overhead proxy for visual redundancy. Using this proxy, the system performs codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling. Both operations are fully online and span the full pipeline from compressed bitstream to final output. Relative to state-of-the-art baselines, they yield up to 3x higher throughput and up to 87% lower GPU compute, at the cost of a 0-8% F1 drop.

What carries the argument

Codec metadata serving as a runtime redundancy signal that directly drives patch pruning in visual encoding and selective KV-cache updates in LLM prefilling.
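
A minimal sketch of how such a signal could gate patch encoding, assuming per-patch motion-vector magnitudes and residual energies have already been extracted from the bitstream. The function name, the input layout, and both threshold values are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Sequence


def select_dynamic_patches(
    mv_magnitude: Sequence[float],      # hypothetical: motion-vector magnitude per patch (pixels)
    residual_energy: Sequence[float],   # hypothetical: residual energy per patch
    mv_threshold: float = 1.0,          # assumed value, not taken from the paper
    residual_threshold: float = 10.0,   # assumed value, not taken from the paper
) -> List[int]:
    """Return indices of patches that look changed and should be re-encoded."""
    if len(mv_magnitude) != len(residual_energy):
        raise ValueError("per-patch inputs must have the same length")
    keep = []
    for idx, (mv, res) in enumerate(zip(mv_magnitude, residual_energy)):
        # A patch counts as dynamic if either codec signal exceeds its threshold.
        if mv > mv_threshold or res > residual_threshold:
            keep.append(idx)
    return keep


mv = [0.0, 0.2, 3.5, 0.1]
res = [1.0, 25.0, 4.0, 0.5]
dynamic = select_dynamic_patches(mv, res)
print(dynamic)  # patch 1 has a large residual, patch 2 has large motion
```

Patches not returned by the function would be skipped before the vision transformer, which is where the compute saving would come from in this toy model.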

If this is right

Throughput rises by up to 3 times compared with state-of-the-art baselines.
GPU compute falls by up to 87 percent.
Accuracy stays competitive, with F1 score dropping only 0-8 percent.
All optimizations run online with no offline training or profiling required.
Transmission volume decreases automatically because processing stays on compressed bitstreams.
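
The headline ratios in the list above can be related with simple arithmetic. The sketch assumes a fixed number of requests in flight, so that a 3x latency reduction and a 3x throughput gain coincide; the helper names and the 3-second baseline are illustrative.

```python
def latency_after_speedup(baseline_latency_s: float, speedup: float) -> float:
    """Latency after applying a speedup factor such as 3x."""
    return baseline_latency_s / speedup


def remaining_compute_fraction(reduction_pct: float) -> float:
    """Fraction of baseline GPU compute left after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0


# Upper-bound figures quoted above: 3x throughput, 87% less GPU compute.
print(latency_after_speedup(3.0, 3.0))              # 1.0 (seconds, from a 3-second baseline)
print(round(remaining_compute_fraction(87.0), 2))   # 0.13 of the baseline compute
```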

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same codec-derived signal could be applied to other real-time vision workloads such as detection or segmentation to reduce their inference cost.
Savings may compound on longer or higher-resolution streams, enabling larger VLMs to run in latency-sensitive settings.
Combining codec metadata with lightweight learned predictors could further refine redundancy decisions when codec signals are ambiguous.
Widespread use would lower both operational cost and energy demand for cloud-based video analytics services.

Load-bearing premise

Video codec metadata remains a faithful, low-overhead indicator of actual visual redundancy across diverse live streams without any per-stream adjustment.

What would settle it

A live video sequence in which the codec metadata does not track scene changes, causing either accuracy to fall more than 8% below baseline or the claimed compute savings to vanish.

Figures

Figures reproduced from arXiv: 2604.06036 by Dmitrii Ustiugov, Francisco Romero, JooYoung Park, Luo Tao, Shivaraman Nitin, Wenyan Chen, Yan Chen, Yulin Zou.

**Figure 1.** End-to-end serving pipeline for video streaming analytics: video compression, bitstream transmission, decompression, preprocessing, and VLM inference (vision feature extractor (ViT) and semantic inference engine (LLM)).
Adjacent text: "… implement CodecSight on top of vLLM [28] and evaluate it on two representative VLMs, showing up to 3× latency reduction (equivalently, 3× throughput improvement) and up to 87% GPU comput…"

**Figure 2.** Statistics [14, 43, 44] of the imbalance between CCTVs and GPUs in different regions.
Chart residue from an adjacent figure: models InternVL3-14B and Qwen3-VL-32B; axis Latency (s); stages Trans, Preproc, ViT, LLM.

**Figure 6.** SM utilization trend of video stream inference with InternVL3-14B on 2 A100 GPUs with TP=2.
Adjacent text: "… frames, overlooking the signals that the codec already derives to precisely capture this redundancy and could guide optimization across all stages. 2.4.1 Unlocking Optimization Opportunities with Video Codecs. Video codecs [54] offer a natural mechanism for addressing the bottlenecks above. As shown in …"

**Figure 7.** Illustration of overlapping redundancy in sliding-window VLM inference. When the window slides over the video stream (from time t to t + 1), a significant number of frames (F5 to F12) overlap between adjacent windows.
Adjacent text: "… 77%∼94% similar when evaluated under motion and residual thresholds. The high redundancy in streaming video imposes substantial GPU overhead on the ViT encoder. As shown in …"
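
The overlap described in the Figure 7 caption can be reproduced with a small sketch. It assumes a 12-frame window that advances by 4 frames, which is one configuration consistent with frames F5 to F12 being shared; the actual window size and stride are not stated here.

```python
from typing import List


def window_frames(start: int, size: int) -> List[int]:
    """Frame indices covered by a window starting at `start`."""
    return list(range(start, start + size))


def shared_frames(prev_start: int, size: int, stride: int) -> List[int]:
    """Frames of the next window that were already present in the previous one."""
    previous = set(window_frames(prev_start, size))
    following = window_frames(prev_start + stride, size)
    return [frame for frame in following if frame in previous]


# Assumed configuration: window F1..F12 at time t, F5..F16 at time t + 1.
shared = shared_frames(prev_start=1, size=12, stride=4)
print(shared)             # [5, 6, 7, 8, 9, 10, 11, 12]
print(len(shared) / 12)   # share of the new window that repeats earlier frames
```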

**Figure 8.** System architecture overview of CodecSight.
Adjacent text: "… vectors and residuals. This metadata provides essential cues for subsequent visual feature extraction. Specifically, the Motion Analyzer (❷ in …"

**Figure 10.** Example of selective KVC refresh with codec metadata.
Adjacent text: "… is marked dynamic, it remains active until the next I-frame resets the mask. I-frames are always fully encoded and provide the reference visual context for subsequent P-frames. This policy is illustrated in …"
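
The mask policy quoted next to Figure 10 can be sketched as a small state machine. This is one reading of the truncated text, assuming the "active" set of a P-frame is the set of patches to re-encode; the frame representation and function name are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def patches_to_encode(
    frames: Iterable[Tuple[str, Set[int]]],  # ("I" or "P", patches the codec signal flags as dynamic)
    num_patches: int,
) -> List[Set[int]]:
    """Per frame, the set of patch indices that would be encoded."""
    active: Set[int] = set()
    plan: List[Set[int]] = []
    for frame_type, flagged in frames:
        if frame_type == "I":
            # An I-frame resets the mask and is always fully encoded.
            active = set()
            plan.append(set(range(num_patches)))
        else:
            # Once a patch is marked dynamic it stays active until the next I-frame.
            active |= flagged
            plan.append(set(active))
    return plan


stream = [("I", set()), ("P", {1}), ("P", {3}), ("I", set()), ("P", {0})]
plan = patches_to_encode(stream, num_patches=4)
print(plan)
```

In this toy stream, patch 1 is still encoded on the second P-frame even though only patch 3 was newly flagged, and the second I-frame clears both.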

**Figure 11.** Latency speedup of CodecSight. HW-Dec refers to the stage with hardware-accelerated codec decoding.
Adjacent text: "… contribution of each component, conduct a sensitivity analysis for key parameters, and measure the runtime overhead of CodecSight to ensure that these optimizations do not offset the overall gains. Unless otherwise specified, the end-to-end and component-level experiments use the parameter configuration s…"

**Figure 14.** Performance across video motion intensity levels with InternVL3.
Chart residue from an adjacent figure: panels (a) Accuracy (Precision, Recall, F1) and (b) Latency Speedup (Norm. Latency; stages HW-Dec, Preproc, ViT, LLM); series Full Comp, Token Prune, KV Reuse, CodecSight.

**Figure 13.** Memory and compute resource savings of CodecSight with InternVL3.
Adjacent text: "… report the average Precision, Recall, and F1 scores across all crime categories for InternVL3 and Qwen3-VL, respectively. CodecSight maintains accuracy close to Full-Comp on both models, with only modest F1 degradation, from 0.89 to 0.81 for InternVL3 and even zero degradation for Qwen3-VL. CodecSight also remains competitive with VLCach…"
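
The F1 figures quoted next to Figure 13 can be checked against the 0-8% range. The sketch below suggests the range is stated in absolute F1 points, since the relative drop comes out slightly higher; which convention the paper intends is an assumption here.

```python
from typing import Tuple


def f1_drop(baseline_f1: float, optimized_f1: float) -> Tuple[float, float]:
    """Return the (absolute, relative) F1 drop from baseline to optimized."""
    absolute = baseline_f1 - optimized_f1
    relative = absolute / baseline_f1
    return absolute, relative


# InternVL3 figures from the text above: Full-Comp 0.89, CodecSight 0.81.
absolute, relative = f1_drop(0.89, 0.81)
print(round(absolute, 2))        # 0.08, i.e. 8 absolute points
print(round(relative * 100, 1))  # 9.0 percent relative to the baseline
```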

**Figure 17.** Sensitivity analysis of MV threshold.
Chart residue from an adjacent figure: panels (a) Accuracy (Precision, Recall, F1) and (b) Norm. Latency (stages HW-Dec, Preproc, ViT, LLM); x-axis GOP Size with values 4, 8, 16.

**Figure 18.** Sensitivity analysis of GOP.
Adjacent text: "… 6.3.2 MV Threshold. To evaluate the effect of the MV threshold on the accuracy-latency trade-off, we vary it from 0.25 to 5.0 pixels. The MV threshold controls the aggressiveness of codec-guided token pruning by determining which regions are treated as static. As shown in …"
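
A small sketch of why the MV threshold controls pruning aggressiveness: raising it classifies more regions as static. The motion magnitudes are made-up sample values, and the three thresholds are picked from the 0.25 to 5.0 pixel range mentioned above; neither comes from the paper's data.

```python
from typing import Sequence


def static_fraction(mv_magnitudes: Sequence[float], threshold: float) -> float:
    """Share of regions whose motion magnitude is at or below the threshold."""
    static = sum(1 for mv in mv_magnitudes if mv <= threshold)
    return static / len(mv_magnitudes)


# Hypothetical per-region motion magnitudes, in pixels.
mvs = [0.0, 0.1, 0.3, 0.8, 1.5, 2.0, 4.0, 6.0]
sweep = {t: static_fraction(mvs, t) for t in (0.25, 1.0, 5.0)}
print(sweep)  # {0.25: 0.25, 1.0: 0.5, 5.0: 0.875}
```

A larger static fraction means more tokens are pruned, which lowers latency but risks discarding regions with small yet meaningful motion.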

**Figure 19.** System overheads.
Adjacent text: "… incur average/max overheads of 48.9/50.8 ms and 0.6/0.8 ms per request, respectively; for Qwen3-VL, the corresponding overheads are 49.1/51.4 ms and 0.6/0.8 ms. Notably, scaling to the much larger Qwen3-VL increases overhead only marginally. In both cases, the combined overhead of about 50 ms accounts for just 3.9% and 4.5% of CodecSight's optimized end-to-end latency, respectively. E…"
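
The overhead shares quoted next to Figure 19 imply an approximate optimized end-to-end latency. The sketch below back-computes it from the average overheads, assuming the two percentages map to the two models in the order mentioned; the result is an inference from the quoted numbers, not a figure reported here.

```python
from typing import Sequence


def implied_latency_ms(overheads_ms: Sequence[float], share_pct: float) -> float:
    """End-to-end latency implied by a combined overhead and its percentage share."""
    return sum(overheads_ms) / (share_pct / 100.0)


first_model = implied_latency_ms((48.9, 0.6), 3.9)   # first model mentioned in the excerpt
second_model = implied_latency_ms((49.1, 0.6), 4.5)  # Qwen3-VL
print(round(first_model), round(second_model))       # 1269 1104 (milliseconds)
```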

Original abstract

Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CodecSight, a codec-guided streaming video analytics system, built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecSight treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CodecSight achieves an improvement in throughput of up to 3×, and a reduction of up to 87% in GPU compute over state-of-the-art baselines, maintaining competitive accuracy with only 0∼8% F1 drop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces CodecSight, a codec-guided system for efficient streaming VLM inference. It leverages video codec metadata such as motion vectors, residuals, and macroblock information as low-cost signals to prune redundant ViT patches before encoding and to selectively refresh LLM KV-cache entries during prefilling. The system is designed to be fully online and training-free, unifying optimizations across decoding, visual processing, and language model components, with inherent transmission benefits from operating on compressed streams. Experiments demonstrate up to 3× throughput improvement and up to 87% reduction in GPU compute compared to state-of-the-art baselines, with only 0-8% F1 score drop in accuracy.
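
To make the selective refresh idea in the summary concrete, here is a toy sketch in which only the cache entries for changed visual tokens are recomputed. The dictionary cache, the string values, and the `recompute` callback are stand-ins chosen for illustration; a real KV cache holds tensors per layer, and a real system has to account for attention dependencies between tokens, which this sketch ignores.

```python
from typing import Callable, Dict, Iterable


def refresh_kv_cache(
    cache: Dict[int, str],              # hypothetical: visual-token position -> cached KV entry
    changed_positions: Iterable[int],   # positions whose patches the codec signal marks as changed
    recompute: Callable[[int], str],    # stand-in for the real prefill computation
) -> int:
    """Recompute only the changed entries in place; return how many were refreshed."""
    refreshed = 0
    for pos in changed_positions:
        cache[pos] = recompute(pos)
        refreshed += 1
    return refreshed


cache = {0: "kv0@t", 1: "kv1@t", 2: "kv2@t", 3: "kv3@t"}
count = refresh_kv_cache(cache, [1, 3], lambda pos: f"kv{pos}@t+1")
print(count, cache)
```

Entries 0 and 2 are reused untouched, so in this toy model the prefill work scales with the number of changed positions rather than with the window length.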

Significance. If the central claims hold, this work has significant implications for scalable video streaming analytics using VLMs. By exploiting pre-existing codec computations without requiring offline profiling, training, or per-stream tuning, CodecSight addresses key limitations of prior approaches that either focus narrowly on ViT or LLM or incur high overhead for redundancy detection. The training-free, online nature is a notable strength that could facilitate real-time deployment in dynamic environments. Credit is due for the unified end-to-end optimization and the reported efficiency gains.

major comments (2)

Abstract: The abstract reports concrete performance numbers (3× throughput, 87% GPU reduction, 0∼8% F1 drop) but provides no error bars, details on the number of runs, or ablations on the pruning thresholds and cache refresh criteria. These details are load-bearing for verifying the soundness of the efficiency claims.
Experimental evaluation: The key assumption that codec-derived signals reliably proxy visual and semantic redundancy for VLM tasks across diverse real-time streams without per-stream tuning requires stronger empirical support. Tests should include scenarios where perceptual motion (high codec activity) does not align with semantic change (e.g., camera panning over static scenes or sudden semantic events in low-motion video), to confirm that accuracy remains within the claimed bound and savings do not degrade.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for strengthening the presentation and empirical validation of CodecSight. We address each major comment below and will incorporate the requested details and experiments in the revised manuscript.

Point-by-point responses

Referee: Abstract: The abstract reports concrete performance numbers (3× throughput, 87% GPU reduction, 0∼8% F1 drop) but provides no error bars, details on the number of runs, or ablations on the pruning thresholds and cache refresh criteria. These details are load-bearing for verifying the soundness of the efficiency claims.

Authors: We agree that error bars, run counts, and threshold ablations are necessary to substantiate the reported figures. In the revised manuscript we will (1) add standard-deviation error bars to the abstract numbers, (2) state that all results are averaged over five independent runs, and (3) include a new ablation subsection (Section 5.4) that sweeps pruning thresholds and KV-cache refresh criteria, confirming that the 3× throughput and 87% GPU reduction remain stable within the stated accuracy bounds. Revision: yes
Referee: Experimental evaluation: The key assumption that codec-derived signals reliably proxy visual and semantic redundancy for VLM tasks across diverse real-time streams without per-stream tuning requires stronger empirical support. Tests should include scenarios where perceptual motion (high codec activity) does not align with semantic change (e.g., camera panning over static scenes or sudden semantic events in low-motion video), to confirm that accuracy remains within the claimed bound and savings do not degrade.

Authors: We acknowledge that the current evaluation, while spanning multiple real-world streams, does not explicitly isolate cases of codec-semantic misalignment. In the revision we will add two targeted experiments: (i) camera panning over largely static scenes and (ii) low-motion videos containing sudden semantic events. These will be run under the same online, training-free protocol and will demonstrate that the F1 drop stays within 0–8% while the reported throughput and GPU savings are preserved, thereby providing direct empirical support for the proxy assumption without per-stream tuning. Revision: yes
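
The error-bar reporting promised in the first response amounts to a mean and sample standard deviation over the five runs. A minimal sketch follows; the five speedup values are invented placeholders, not measurements from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize(runs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(runs), statistics.stdev(runs)


# Hypothetical throughput speedups from five independent runs.
throughput_speedups = [2.9, 3.0, 3.1, 3.0, 3.0]
mean, std = summarize(throughput_speedups)
print(f"{mean:.2f} ± {std:.2f}")  # 3.00 ± 0.07
```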

Circularity Check

0 steps flagged

No circularity: empirical system with external baselines and no self-referential derivations

Full rationale

The paper presents CodecSight as an online, training-free system that uses existing video codec metadata (motion vectors, residuals, macroblocks) as a runtime signal for patch pruning and KV-cache decisions. All performance claims (up to 3× throughput, 87% GPU reduction, 0-8% F1 drop) are measured against external state-of-the-art baselines rather than derived from fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are introduced that reduce to the paper's own inputs by construction. The design choices are justified by the observation that codecs already compute temporal/spatial structure, which is an independent property of standard video compression pipelines, not a self-defined proxy. This is a standard systems paper with empirical validation; the derivation chain does not collapse into its own fitted values or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that codec metadata is always available and sufficiently informative for pruning decisions without additional computation or training. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Video codecs reliably encode temporal and spatial redundancy that can be directly reused for inference optimization.
Invoked in the key observation that codec metadata serves as a low-cost runtime signal.

pith-pipeline@v0.9.0 · 5565 in / 1256 out tokens · 37743 ms · 2026-05-10T18:41:39.518975+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 5.0

Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

Reference graph

Works this paper leans on

103 extracted references · 34 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1] Neil Agarwal and Ravi Netravali. 2023. Boggart: Towards General-Purpose Acceleration of Retrospective Video Analytics. In Proceedings of the 20th Symposium on Networked Systems Design and Implementation (NSDI). 933–951.
[2] Asma Baobaid and Mahmoud Méribout. 2025. Edge-GPU Based Face Tracking for Face Detection and Recognition Acceleration. CoRR abs/2505.04524 (2025).
[3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? In The 38th Annual Conference on Machine Learning (ICML). 813–824.
[4] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. 2023. Token Merging: Your ViT But Faster. In The International Conference on Learning Representations 2023 (ICLR).
[5] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. CoRR abs/2406.02069 (2024).
[6] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2017 (CVPR). 4724–4733.
[7] Amine Chaabouni, Yann Gaudeau, Julien Lambert, J-M Moureaux, and Patrice Gallet. 2016. H.264 medical video compression for telemedicine: A performance analysis. IRBM 37, 1 (2016), 40–48.
[8] Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. 2025. LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR). 29083–29095.
[9] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. 2024. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. In ECCV (81). 19–35.
[10] Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. 2025. StreamingTOM: Streaming Token Compression for Efficient Video Understanding. CoRR abs/2510.18269 (2025).
[11] Yue Chen, Debargha Mukherjee, Jingning Han, Adrian Grange, Yaowu Xu, Zoe Liu, Sarah Parker, Cheng Chen, Hui Su, Urvang Joshi, Ching-Han Chiang, Yunqing Wang, Paul Wilkins, Jim Bankoski, Luc N. Trudeau, Nathan E. Egge, Jean-Marc Valin, Thomas Davies, Steinar Midtskogen, Andrey Norkin, and Peter De Rivaz. 2018. An Overview of Core Coding Tools in the AV1 Vi...
[12] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2023. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. CoRR abs/2312.14238 (2023).
[13] Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang. 2025. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. CoRR abs/2510.09665 (2025).
[14] Comparitech. 2025. Surveillance Camera Statistics: Which City has the Most CCTV? https://www.comparitech.com/vpn-privacy/the-worlds-most-surveilled-cities/. Accessed 2026-03-26.
[15] Khaled Waleed Dawoud, Zaigham Zaheer, Mustaqeem Khan, Karthik Nandakumar, Abdulmotaleb Elsaddik, and Muhammad Haris Khan. FusedVision: A Knowledge-Infusing Approach for Practical Anomaly Detection in Real-world Surveillance Videos. In CVPR Workshops. 4036–4046.
[16] Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. 2025. Streaming Video Question-Answering with In-context Video KV-Cache Retrieval. In The International Conference on Learning Representations 2025 (ICLR).
[17] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. 2024. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. CoRR abs/2407.11550 (2024).
[18] Edward Fish and Andrew Gilbert. 2025. PLOT-TAL: Prompt-Learning with Optimal Transport for Few-Shot Temporal Action Localization. In The International Conference on Computer Vision Workshops 2025 (ICCVW). 5912–5921.
[19] Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang, Pengfei Guo, Marc Edgar, Daguang Xu, Bernhard Kainz, and Bjoern H. Menze. 2025. Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging. CoRR abs/2510.20639 (2025).
[20] Zelin He, Sarah Alnegheimish, and Matthew Reimherr. 2025. Harnessing Vision-Language Models for Time Series Anomaly Detection. CoRR abs/2506.06836 (2025).
[21] Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodík, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. In Proceedings of the 13th Symposium on Operating System Design and Implementation (OSDI). 269–286.
[22] Jinwoo Hwang, Daeun Kim, Sangyeop Lee, Yoonsung Kim, Guseul Heo, Hojoon Kim, Yunseok Jeong, Tadiwos Meaza, Eunhyeok Park, Jeongseob Ahn, and Jongse Park. 2025. Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse. Proc. VLDB Endow. 18, 10 (2025), 3284–3298.
[23] Jinwoo Hwang, Minsu Kim, Daeun Kim, Seungho Nam, Yoonsung Kim, Dohee Kim, Hardik Sharma, and Jongse Park. 2022. CoVA: Exploiting Compressed-Domain Analysis to Accelerate Video Analytics. In Proceedings of the 2022 USENIX Annual Technical Conference (ATC). 707–722.
[24] Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu, Andrew Tao, Pavlo Molchanov, Jan Kautz, and Wonmin Byeon. 2026. Stateful Token Reduction for Long-Video Hybrid VLMs. arXiv preprint arXiv:2603.00198 (2026).
[25] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. Noscope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529 (2017).
[26] Brendan Klare and Mark Burge. 2010. Assessment of H.264 video compression on automated face recognition performance in surveillance and mobile video scenarios. In Biometric Technology for Human Identification VII, Vol. 7667. SPIE, 325–332.
[27] Vaibhav Kurrey, Sivakalyan Pujari, and Gagan Raj Gupta. 2025. Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill. CoRR abs/2510.26684 (2025).
[28] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). 611–626.
[29] Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. 2024. Video Token Merging for Long Video Understanding. In The 38th Annual Conference on Neural Information Processing Systems (NeurIPS).
[30] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. CoRR abs/2407.07895 (2024).
[31] Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, and Yu Kong. 2025. Window Token Concatenation for Efficient Visual Large Language Models. In CVPR Workshops. 3187–3197.
[32] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. In The 38th Annual Conference on Neural Information Processing Systems (NeurIPS).
[33] Yuanqi Li, Arthi Padmanabhan, Pengzhan Zhao, Yufei Wang, Guoqing Harry Xu, and Ravi Netravali. 2020. Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics. In Proceedings of the ACM SIGCOMM 2020 Conference. 359–376.
[34] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Reza Haffari, and Bohan Zhuang. 2024. MiniCache: KV Cache Compression in Depth Dimension for Large Language Models. In The 38th Annual Conference on Neural Information Processing Systems (NeurIPS).
[35] Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, and José M. Álvarez. 2024. StreamChat: Chatting with Streaming Video. CoRR abs/2412.08646 (2024).
[36] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. In Proceedings of the ACM SIGCOMM 2024 Conference. 38–56.
[37] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. In The 37th Annual Conference on Neural Information Processing Systems (NeurIPS).
[38] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In The 41st Annual Conference on Machine Learning (ICML). 32332–32344.
[39] Market Reports World. 2026. CCTV Cameras Market Size, Share, Growth, and Industry Analysis, Forecast to 2034. https://www.marketreportsworld.com/market-reports/cctv-cameras-market-14721988.
[40] Debargha Mukherjee, Jim Bankoski, Adrian Grange, Jingning Han, John Koleszar, Paul Wilkins, Yaowu Xu, and Ronald Bultje. 2013. The latest open-source video codec VP9 - An overview and preliminary results. In 2013 Picture Coding Symposium (PCS). https://doi.org/10.1109/PCS.2013.6737765.
[41] Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. 2025. LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval. CoRR abs/2505.15269 (2025).
[42] Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, and Jiaqi Wang. 2025. OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? In The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 ...
[43] NVIDIA. 2025. Europe Builds AI Infrastructure With NVIDIA to Fuel Region's Next Industrial Transformation. https://nvidianews.nvidia.com/news/europe-ai-infrastructure. Accessed 2026-03-26.
[44] NVIDIA. 2025. NVIDIA and Partners Build America's AI Infrastructure and Create Blueprint to Power the Next Industrial Revolution. https://nvidianews.nvidia.com/news/nvidia-partners-ai-infrastructure-america. Accessed 2026-03-26.
[45] NVIDIA. 2026. VDEC Application Note. https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-application-note/index.html.
[46] Tsung-Yin Ou, Andrés Ponce, Cody Lee, and Areoll Wu. 2025. Real-time retail planogram compliance application using computer vision and virtual shelves. Scientific Reports 15, 1 (2025), 43898.
[47] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. 2022. ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning. In The 36th Annual Conference on Neural Information Processing Systems (NeurIPS).
[48] Persistence Market Research. 2024. CCTV Cameras Market Size, Share & Forecast to 2032. https://www.persistencemarketresearch.com/market-research/cctv-cameras-market.asp.
[49] Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, and Ashish Panwar. 2024. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. CoRR abs/2405.04437 (2024).
[50] Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Junjie Zhou, Shitao Xiao, Bo Zhao, and Zheng Liu. 2025. Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification. CoRR abs/2506.19225 (2025).
[51] Shengling Qin, Hao Yu, Chenxin Wu, Zheng Li, Yizhong Cao, Zhengyang Zhuge, Yuxin Zhou, Wentao Yao, Yi Zhang, Zhengheng Wang, Shuai Bai, Jianwei Zhang, and Junyang Lin. 2025. VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference. CoRR abs/2512.12977 (2025).
[52] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In The 35th Annual Conference on Neural Information Processing Systems (NeurIPS). 13937–13949.
[53] António Gouveia Ribeiro, Luís Vilaça, Carlos Costa, Tiago Soares da Costa, and Pedro Miguel Carvalho. 2025. Automatic Visual Inspection for Industrial Application. J. Imaging 11, 10 (2025), 350.
[54] Iain E Richardson. 2024. Coding video: A practical guide to HEVC and beyond. John Wiley & Sons.
[55] Francisco Romero, Johann Hauswald, Aditi Partap, Daniel Kang, Matei Zaharia, and Christos Kozyrakis. 2022. Optimizing Video Analytics with Declarative Model Relationships. Proc. VLDB Endow. 16, 3 (2022), 447–460.
[56] Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, and Mihai Dusmanu. 2026. CoPE-VideoLM: Codec Primitives For Efficient Video Language Models. CoRR abs/2602.13191 (2026).
[57] Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, and Wenhu Chen. 2025. QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design. CoRR abs/2505.16175 (2025).
[58] Akash Sharma, Pranjal Naman, Roopkatha Banerjee, Priyanshu Pansari, Sankalp Gawali, Mayank Arya, Sharath Chandra, Arun Josephraj, Rakshit Ramesh, Punit Rathore, et al. 2026. Scaling Real-Time Traffic Analytics on Edge-Cloud Fabrics for City-Scale Camera Networks. arXiv preprint arXiv:2603.05217 (2026).
[59] Zhuoran Song, Chunyu Qi, Fangxin Liu, Naifeng Jing, and Xiaoyao Liang. 2024. CMC: Video Transformer Acceleration via CODEC Assisted Matrix Condensing. In ASPLOS (2). 201–215.
[60] Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. 2025. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR). 18992–19001.
[61] LMCache Team. 2026. LMCache. https://github.com/lmcache/lmcache.
[62] Qwen Team. 2025. Qwen3 Technical Report. CoRR abs/2505.09388 (2025).
[63] Qwen Team. 2025. Qwen3-VL Technical Report. CoRR abs/2511.21631 (2025).
[64] vLLM Team. 2026. Easy, fast, and cheap LLM serving for everyone. https://github.com/vllm-project/vllm.
[65] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In ECCV (8). 20–36.
[66] Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, and Yiran Chen. 2025. CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models. In The 42nd Annual Conference on Machine Learning (ICML).
[67] Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. 2003. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 13, 7 (2003), 560–576.
[68] Jiangkai Wu, Liming Liu, Yunpeng Tan, Junlin Hao, and Xinggong Zhang. 2024. Promptus: Can Prompts Streaming Replace Video Streaming with Stable Diffusion. CoRR abs/2405.20032 (2024).
[69] Jiang Wu, Sichao Wu, Yinsong Ma, Guangyuan Yu, Haoyuan Xu, Lifang Zheng, and Jingliang Duan. 2025. MonitorVLM: A Vision Language Framework for Safety Violation Detection in Mining Operations. CoRR abs/2510.03666 (2025).
[70] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. 2024. DeepSeek-VL2: Mixture-of-Experts...
[71] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. CoRR abs/2410.10819 (2024).
[72] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient Streaming Language Models with Attention Sinks. CoRR abs/2309.17453 (2023).
[73] Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. 2024. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. CoRR abs/2410.17247 (2024).

Showing the first 73 references.