Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi , Shuo Yang , Yilong Zhao , Muyang Li , Han Cai , Xingyang Li , Yujun Lin , Zhuoyang Zhang

show 8 more authors

Jintao Zhang Xiuyu Li Zhiying Xu Jun Wu Chenfeng Xu Ion Stoica Song Han Kurt Keutzer

Authors on Pith no claims yet

classification 💻 cs.LG

keywords memoryvideocachegenerationquantizationautoregressivequalityavailable

0 comments

read the original abstract

Despite rapid progress in autoregressive video diffusion, an emerging system algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV cache budgets restrict the effective working memory, directly degrading long horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training free KV cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic Aware Smoothing, producing low magnitude, quantization friendly residuals. It further introduces Progressive Residual Quantization, a coarse to fine multi stage scheme that reduces quantization error while enabling a smooth quality memory trade off. Across LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end to end latency overhead while consistently outperforming existing baselines in generation quality. Code is available at: https://github.com/svg-project/Quant-VideoGen

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.