CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
Pith reviewed 2026-05-10 18:41 UTC · model grok-4.3
The pith
CodecSight uses video codec metadata as a free signal to prune patches and refresh LLM caches, boosting streaming VLM throughput by up to 3× while cutting GPU compute by up to 87%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodecSight demonstrates that codec metadata, generated as a byproduct of compression, functions as a reliable, low-overhead proxy for visual redundancy. Using this proxy, the system performs codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling. Both operations run fully online, span the full pipeline from compressed bitstream to final output, and yield up to 3× higher throughput and 87% lower GPU compute with only a 0-8% F1 drop relative to prior methods.
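The pruning half of this pipeline reduces to a thresholding pass over per-patch codec statistics. The exact scoring rule is not spelled out in this summary, so the following is a minimal sketch, assuming hypothetical per-macroblock motion-vector and residual arrays exposed by the decoder and a macroblock grid that aligns with the ViT patch grid:

```python
import numpy as np

def prune_patches(motion_vectors, residual_energy, threshold=0.1):
    """Keep only patches whose codec activity suggests visual change.

    motion_vectors: (H, W, 2) per-macroblock motion vectors from the decoder.
    residual_energy: (H, W) per-macroblock residual magnitude.
    Returns flat indices of patches to pass through the ViT encoder.
    """
    mv_mag = np.linalg.norm(motion_vectors, axis=-1)
    # Combine normalized motion and residual signals into one redundancy score.
    score = (mv_mag / (mv_mag.max() + 1e-8)
             + residual_energy / (residual_energy.max() + 1e-8))
    return np.flatnonzero(score.ravel() > threshold)
```

Patches below the threshold reuse their previous encoding, which is where the compute savings come from.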
What carries the argument
Codec metadata serving as a runtime redundancy signal that directly drives patch pruning in visual encoding and selective KV-cache updates in LLM prefilling.
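On the LLM side, the same changed-patch indices can gate cache updates during prefilling. A minimal sketch, assuming a one-to-one mapping from visual tokens to patch indices and a hypothetical `encode_kv` projection; the paper's actual refresh criterion may differ:

```python
import torch

def refresh_kv_cache(kv_cache, new_hidden, changed_patches, encode_kv):
    """Recompute KV entries only for patches the codec signal flags as changed.

    kv_cache: (num_patches, 2, heads, head_dim) cached keys/values.
    new_hidden: (num_patches, model_dim) freshly encoded hidden states.
    changed_patches: 1-D LongTensor of patch indices to refresh.
    encode_kv: callable mapping hidden states to (n, 2, heads, head_dim).
    """
    if changed_patches.numel() == 0:
        return kv_cache  # no visual change: reuse the cache wholesale
    # Refresh only the stale entries in place; all others are reused.
    kv_cache[changed_patches] = encode_kv(new_hidden[changed_patches])
    return kv_cache
```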
If this is right
- Throughput rises by up to 3× compared with state-of-the-art baselines.
- GPU compute falls by up to 87%.
- Accuracy stays competitive, with the F1 score dropping only 0-8%.
- All optimizations run online with no offline training or profiling required.
- Transmission volume decreases automatically because processing stays on compressed bitstreams.
Where Pith is reading between the lines
- The same codec-derived signal could be applied to other real-time vision workloads such as detection or segmentation to reduce their inference cost.
- Savings may compound on longer or higher-resolution streams, enabling larger VLMs to run in latency-sensitive settings.
- Combining codec metadata with lightweight learned predictors could further refine redundancy decisions when codec signals are ambiguous.
- Widespread use would lower both operational cost and energy demand for cloud-based video analytics services.
Load-bearing premise
Video codec metadata remains a faithful, low-overhead indicator of actual visual redundancy across diverse live streams without any per-stream adjustment.
What would settle it
A live video sequence in which the codec metadata does not track scene changes, causing either accuracy to fall more than 8% below baseline or the claimed compute savings to vanish.
Original abstract
Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CodecSight, a codec-guided streaming video analytics system, built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecSight treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CodecSight achieves an improvement in throughput of up to 3×, and a reduction of up to 87% in GPU compute over state-of-the-art baselines, maintaining competitive accuracy with only 0-8% F1 drop.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodecSight, a codec-guided system for efficient streaming VLM inference. It leverages video codec metadata such as motion vectors, residuals, and macroblock information as low-cost signals to prune redundant ViT patches before encoding and to selectively refresh LLM KV-cache entries during prefilling. The system is designed to be fully online and training-free, unifying optimizations across decoding, visual processing, and language model components, with inherent transmission benefits from operating on compressed streams. Experiments demonstrate up to 3× throughput improvement and up to 87% reduction in GPU compute compared to state-of-the-art baselines, with only a 0-8% drop in F1 score.
Significance. If the central claims hold, this work has significant implications for scalable video streaming analytics using VLMs. By exploiting pre-existing codec computations without requiring offline profiling, training, or per-stream tuning, CodecSight addresses key limitations of prior approaches that either focus narrowly on ViT or LLM or incur high overhead for redundancy detection. The training-free, online nature is a notable strength that could facilitate real-time deployment in dynamic environments. Credit is due for the unified end-to-end optimization and the reported efficiency gains.
major comments (2)
- Abstract: The abstract reports concrete performance numbers (3× throughput, 87% GPU reduction, 0∼8% F1 drop) but provides no error bars, details on the number of runs, or ablations on the pruning thresholds and cache refresh criteria. These details are load-bearing for verifying the soundness of the efficiency claims.
- Experimental evaluation: The key assumption that codec-derived signals reliably proxy visual and semantic redundancy for VLM tasks across diverse real-time streams without per-stream tuning requires stronger empirical support. Tests should include scenarios where perceptual motion (high codec activity) does not align with semantic change (e.g., camera panning over static scenes or sudden semantic events in low-motion video), to confirm that accuracy remains within the claimed bound and savings do not degrade.
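One concrete way to build the panning probe the referee asks for is to separate global (camera) motion from local motion before scoring: a pan yields high raw codec activity but little activity once the dominant motion vector is subtracted. The paper does not describe such a compensation step; the sketch below is one plausible construction:

```python
import numpy as np

def panning_misalignment_ratio(motion_vectors):
    """Estimate how much codec activity is explained by global camera motion.

    motion_vectors: (H, W, 2) per-macroblock motion vectors for one frame.
    Returns (raw_motion, local_motion, ratio); a low ratio combined with high
    raw motion marks a frame where codec activity and semantic change diverge.
    """
    mv = motion_vectors.reshape(-1, 2).astype(np.float64)
    raw = np.linalg.norm(mv, axis=1).mean()
    global_mv = np.median(mv, axis=0)  # robust estimate of the pan component
    local = np.linalg.norm(mv - global_mv, axis=1).mean()
    return raw, local, local / (raw + 1e-8)
```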
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects for strengthening the presentation and empirical validation of CodecSight. We address each major comment below and will incorporate the requested details and experiments in the revised manuscript.
Point-by-point responses
- Referee: Abstract: The abstract reports concrete performance numbers (3× throughput, 87% GPU reduction, 0-8% F1 drop) but provides no error bars, details on the number of runs, or ablations on the pruning thresholds and cache refresh criteria. These details are load-bearing for verifying the soundness of the efficiency claims.
Authors: We agree that error bars, run counts, and threshold ablations are necessary to substantiate the reported figures. In the revised manuscript we will (1) add standard-deviation error bars to the abstract numbers, (2) state that all results are averaged over five independent runs, and (3) include a new ablation subsection (Section 5.4) that sweeps pruning thresholds and KV-cache refresh criteria, confirming that the 3× throughput and 87% GPU reduction remain stable within the stated accuracy bounds. revision: yes
- Referee: Experimental evaluation: The key assumption that codec-derived signals reliably proxy visual and semantic redundancy for VLM tasks across diverse real-time streams without per-stream tuning requires stronger empirical support. Tests should include scenarios where perceptual motion (high codec activity) does not align with semantic change (e.g., camera panning over static scenes or sudden semantic events in low-motion video), to confirm that accuracy remains within the claimed bound and savings do not degrade.
Authors: We acknowledge that the current evaluation, while spanning multiple real-world streams, does not explicitly isolate cases of codec-semantic misalignment. In the revision we will add two targeted experiments: (i) camera panning over largely static scenes and (ii) low-motion videos containing sudden semantic events. These will be run under the same online, training-free protocol and will demonstrate that the F1 drop stays within 0-8% while the reported throughput and GPU savings are preserved, thereby providing direct empirical support for the proxy assumption without per-stream tuning. revision: yes
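A small harness for the promised experiments would run the pruned pipeline and a full-recompute baseline on exactly these streams and check that the relative F1 gap stays inside the claimed band. A sketch with hypothetical `run_pipeline` and `f1_score` callables, not part of the paper's artifact:

```python
def check_accuracy_bound(streams, run_pipeline, f1_score, max_drop=0.08):
    """Verify the claimed 0-8% F1 bound on codec-semantic misalignment streams.

    streams: iterable of (video, labels) pairs covering panning and
             low-motion-with-semantic-event cases.
    run_pipeline: callable(video, pruned) -> predictions.
    """
    worst = 0.0
    for video, labels in streams:
        f1_full = f1_score(run_pipeline(video, pruned=False), labels)
        f1_pruned = f1_score(run_pipeline(video, pruned=True), labels)
        drop = (f1_full - f1_pruned) / max(f1_full, 1e-8)  # relative F1 drop
        worst = max(worst, drop)
    return worst <= max_drop, worst
```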
Circularity Check
No circularity: empirical system with external baselines and no self-referential derivations
full rationale
The paper presents CodecSight as an online, training-free system that uses existing video codec metadata (motion vectors, residuals, macroblocks) as a runtime signal for patch pruning and KV-cache decisions. All performance claims (up to 3× throughput, 87% GPU reduction, 0-8% F1 drop) are measured against external state-of-the-art baselines rather than derived from fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are introduced that reduce to the paper's own inputs by construction. The design choices are justified by the observation that codecs already compute temporal/spatial structure, which is an independent property of standard video compression pipelines, not a self-defined proxy. This is a standard systems paper with empirical validation; the derivation chain does not collapse into its own fitted values or prior self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Video codecs reliably encode temporal and spatial redundancy that can be directly reused for inference optimization.
Forward citations
Cited by 1 Pith paper
- VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models. Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36× on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.