Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion
Pith reviewed 2026-05-20 11:16 UTC · model grok-4.3
The pith
Focused Forcing selects per-frame and per-head KV caches to accelerate autoregressive video diffusion without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Focused Forcing is a training-free method that, for each generated frame, keeps the most relevant and distinctive historical frames by merging their attention scores with diversity scores, then gives larger cache budgets to heads whose masking causes greater generation degradation.
What carries the argument
Focused Forcing: per-generated-frame selection that combines attention scores with diversity scores of historical frames, paired with explicit estimation of per-head importance to set unequal cache budgets.
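One way to picture the per-frame selection described above is a greedy pick that trades attention against redundancy. The sketch below is an illustration only: the source gives no formula, so the additive combination, the weight `lam`, the cosine-based diversity term, and the names `select_history`, `attn`, and `feats` are all assumptions rather than the paper's method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_history(attn, feats, budget, lam=0.5):
    """Greedily keep `budget` historical frames for ONE generated frame.

    attn[i]  : attention mass the generated frame places on historical frame i
    feats[i] : feature vector summarising historical frame i (hypothetical input)
    lam      : assumed weight of the diversity term; lam=0 is attention-only
    """
    chosen = []
    remaining = set(range(len(attn)))
    while remaining and len(chosen) < budget:

        def combined(i):
            # Diversity = 1 - similarity to the closest frame already kept.
            redundancy = max((cosine(feats[i], feats[j]) for j in chosen), default=0.0)
            return attn[i] + lam * (1.0 - redundancy)

        best = max(sorted(remaining), key=combined)
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)
```

Calling `select_history` once per generated frame, each time with that frame's own attention row, is what makes the choice per-frame rather than shared across the chunk. With `lam=0` the sketch reduces to plain top-k by attention, which is the attention-only baseline the review contrasts against.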
If this is right
- Up to 1.48× end-to-end acceleration is obtained across multiple autoregressive video diffusion paradigms.
- Visual quality and text alignment improve rather than degrade.
- The method works without any retraining or fine-tuning.
- Selection decisions are made separately for each frame inside a generation chunk instead of once for the whole chunk.
Where Pith is reading between the lines
- The same per-frame and per-head logic could be tested on autoregressive image or audio diffusion models that also maintain growing context caches.
- Further speed gains might appear if the diversity scoring step is approximated with cheaper features for very long videos.
- Combining Focused Forcing with existing quantization or pruning of the kept KV entries could compound the efficiency benefit.
Load-bearing premise
Combining attention scores with diversity scores and assigning budgets by estimated head importance will preserve quality better than uniform or attention-only selection.
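The budget side of the premise can be sketched in the same spirit. This is a hypothetical allocation, not the paper's rule: it assumes head importance is a non-negative number per head (for example, the quality drop measured when that head is masked) and splits a fixed total of cached frames in proportion to it, with a floor of one frame per head. The function name, the floor, and the largest-remainder rounding are illustrative choices.

```python
def allocate_head_budgets(importance, total, floor=1):
    """Split `total` cache slots across heads in proportion to `importance`.

    importance : one non-negative score per head (assumed to come from masking tests)
    total      : total number of cached frames to distribute over all heads
    floor      : minimum slots every head receives
    """
    n = len(importance)
    if total < floor * n:
        raise ValueError("total budget too small for the per-head floor")
    spare = total - floor * n
    mass = sum(importance)
    if mass <= 0:
        # No usable importance signal: fall back to a uniform split.
        shares = [spare / n] * n
    else:
        shares = [spare * w / mass for w in importance]
    budgets = [floor + int(s) for s in shares]
    # Hand out slots lost to rounding, largest fractional part first.
    leftover = total - sum(budgets)
    order = sorted(range(n), key=lambda h: shares[h] - int(shares[h]), reverse=True)
    for h in order[:leftover]:
        budgets[h] += 1
    return budgets
```

Passing equal importance scores recovers a near-uniform split, so the same function can serve as the uniform baseline when comparing against importance-weighted budgets.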
What would settle it
Running the same long video generation task with Focused Forcing and with uniform attention-based selection, then finding worse scores under Focused Forcing (lower PSNR, higher LPIPS, or lower text alignment), would show the claim is false.
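Because these metrics point in different directions (PSNR and text alignment are better when higher, LPIPS when lower), a comparison script has to track the direction of each one. The following is a minimal sketch of such a check; the metric keys, the `tolerance` parameter, and the function name are assumptions introduced for illustration, and the numbers in any call are placeholders rather than results from the paper.

```python
# Direction of each metric: True means a larger value is better.
HIGHER_IS_BETTER = {"psnr": True, "lpips": False, "text_alignment": True}


def worse_metrics(focused, baseline, tolerance=0.0):
    """Return the metrics on which `focused` is worse than `baseline`.

    focused, baseline : dicts mapping metric name to a mean score
    tolerance         : gap to ignore, e.g. to absorb seed-to-seed noise
    """
    worse = []
    for name, higher in HIGHER_IS_BETTER.items():
        gain = focused[name] - baseline[name]
        if not higher:
            gain = -gain  # flip so that a positive gain always means "better"
        if gain < -tolerance:
            worse.append(name)
    return worse
```

An empty result is consistent with the claim; a non-empty result on a matched setup, beyond the chosen tolerance, would count against it.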
Original abstract
Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to 1.48× end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Focused Forcing, a training-free KV cache compression method for autoregressive video diffusion. Motivated by observations that frames within a generation chunk depend on distinct history, attention scores vary with relative temporal distance, and heads degrade unequally under masking, the method selects historical frames per generated frame by combining attention scores with diversity scores and allocates per-head budgets according to estimated importance. It reports up to 1.48× end-to-end acceleration while improving visual quality and text alignment across multiple autoregressive paradigms.
Significance. If the empirical claims hold under rigorous controls, the work would offer a practical, training-free route to scaling long-horizon autoregressive video generation by reducing KV cache footprint without quality regression. The per-frame, per-head granularity and explicit use of diversity alongside attention distinguish it from prior uniform or attention-only selection heuristics.
major comments (2)
- [Abstract and results] The headline performance claim (1.48× acceleration with quality gains) is presented in the abstract without any description of experimental setup, datasets, baselines, chunk sizes, or statistical testing. This information is load-bearing for assessing whether the attention-plus-diversity selection plus head-importance budgeting actually outperforms uniform/attention-only alternatives or merely reflects particular test conditions.
- [Motivation and method] The motivation section establishes that frames in the same chunk can depend on distinct history and that heads show unequal degradation, yet the manuscript does not quantify whether the chosen diversity metric (feature variance or similar) correlates with downstream generation impact better than attention alone. Without an ablation isolating this combination, the superiority claim risks being an artifact of the evaluation protocol.
minor comments (2)
- [Method] Notation for the diversity score and head-importance estimator should be introduced with explicit formulas rather than descriptive text only.
- [Figures] Figure captions should state the exact metrics (e.g., FID, CLIP score) and number of samples used for the reported quality improvements.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback, which has helped us identify areas for improvement in the presentation of our work. Below, we respond to each major comment and indicate the revisions we plan to make.
-
Referee: [Abstract and results] The headline performance claim (1.48× acceleration with quality gains) is presented in the abstract without any description of experimental setup, datasets, baselines, chunk sizes, or statistical testing. This information is load-bearing for assessing whether the attention-plus-diversity selection plus head-importance budgeting actually outperforms uniform/attention-only alternatives or merely reflects particular test conditions.
Authors: We thank the referee for this observation. The abstract is designed as a concise summary, while comprehensive details on datasets, baselines, chunk sizes, and statistical testing (including multiple random seeds with reported variance) are provided in Section 4. To improve accessibility, we will revise the abstract to include a brief clause referencing the evaluation across standard video benchmarks and multiple autoregressive paradigms. This provides necessary context without exceeding typical abstract length constraints.
Revision: partial
-
Referee: [Motivation and method] The motivation section establishes that frames in the same chunk can depend on distinct history and that heads show unequal degradation, yet the manuscript does not quantify whether the chosen diversity metric (feature variance or similar) correlates with downstream generation impact better than attention alone. Without an ablation isolating this combination, the superiority claim risks being an artifact of the evaluation protocol.
Authors: We agree that further quantification would strengthen the motivation. The section already includes quantitative masking experiments demonstrating unequal head degradation and frame-specific history dependencies. In the revision, we will add a dedicated ablation study comparing attention-only, diversity-only, and combined selection strategies, along with correlation analysis between diversity scores and downstream quality metrics (e.g., visual and alignment scores). This will substantiate that the combination provides benefits beyond attention alone.
Revision: yes
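The correlation analysis promised in this response could take many forms. As one hedged possibility, a rank correlation between a per-frame selection score and the quality drop observed when that frame is evicted would show whether the score tracks downstream impact. The sketch below implements Spearman's rank correlation with the standard library only; the function names and the idea of pairing scores with eviction-induced quality drops are assumptions, not a procedure described in the source.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def spearman(scores, quality_drop):
    """Spearman rank correlation = Pearson correlation of the rank vectors."""
    return pearson(ranks(scores), ranks(quality_drop))
```

Computing `spearman` separately for attention-only, diversity-only, and combined scores against the same quality drops would give three comparable numbers for the ablation the referee requested.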
Circularity Check
No significant circularity; derivation is self-contained and empirically motivated
The paper reports direct observations on frame-specific dependencies within chunks, distance-dependent attention scores, and unequal head degradation under masking. It then defines Focused Forcing as a training-free heuristic that combines per-frame attention scores with diversity scores and allocates budgets according to estimated head importance. No equations, fitted parameters, or self-citation chains are shown that reduce the selection rule or the reported 1.48× speedup-plus-quality claim back to the motivating observations by construction. The central results are presented as outcomes of empirical evaluation across multiple autoregressive paradigms, leaving the method independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: attention scores and diversity metrics can be computed from the model's existing forward pass without additional training.