
arxiv: 2605.18739 · v2 · pith:NYLMCP44new · submitted 2026-05-18 · 💻 cs.CV · cs.DC

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Pith reviewed 2026-05-20 10:58 UTC · model grok-4.3

classification 💻 cs.CV cs.DC
keywords long video generation · autoregressive diffusion · NVFP4 quantization · sequence parallelism · teacher-forcing · video diffusion models · inference acceleration · training speedup

The pith

LongLive-2.0 directly converts diffusion models into long multi-shot autoregressive video generators with NVFP4 and balanced sequence parallelism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongLive-2.0 as a full training and inference infrastructure built around NVFP4 precision to handle the speed and memory demands of long video generation. It introduces Balanced SP, a sequence-parallel autoregressive training scheme that co-designs the teacher-forcing layout by pairing clean-history chunks with noisy-target chunks on each rank, creating a natural mask and SP-aware VAE encoding. This setup allows direct tuning of an existing diffusion model into an interactive autoregressive model for multi-shot videos, skipping the ODE initialization and distribution matching distillation steps common in prior methods. The combination yields up to 2.15 times faster training and 1.84 times faster inference, with the 5B model reaching 45.7 FPS, plus support for real-time generation via few-step denoising and standalone LoRA weights. A sympathetic reader would care because the approach targets the core hardware bottlenecks that currently limit practical long-video generation.

Core claim

LongLive-2.0 is the first NVFP4 training and inference system for long video generation. It directly tunes a diffusion model into a long, multi-shot, interactive autoregressive diffusion model through sequence-parallel autoregressive training instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training. For inference on Blackwell GPUs it enables W4A4 NVFP4 with quantized KV cache and asynchronous streaming VAE decoding; on other architectures it deploys SP inference, while the quantized KV cache can lower the inter-GPU communication of SP.

What carries the argument

Balanced SP, a sequence-parallel autoregressive training scheme that co-designs the teacher-forcing layout with SP execution through chunk pairing, combined with NVFP4 precision for memory reduction and GEMM speedup.
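To make the chunk-pairing idea concrete, here is a minimal sketch of a chunk-level teacher-forcing mask and a paired per-rank layout. It is not the paper's implementation. It assumes the common teacher-forcing arrangement in which the sequence is the clean chunks followed by the noisy chunks, each clean chunk attends causally to clean chunks, and each noisy chunk attends to the strictly earlier clean chunks plus itself. It also assumes that rank r holds clean chunk r and noisy chunk r; the abstract does not say which clean chunk shares a rank with which noisy chunk. The function names are hypothetical.

```python
from typing import List


def teacher_forcing_block_mask(num_chunks: int) -> List[List[bool]]:
    """Chunk-level mask over [clean_0..clean_{n-1}, noisy_0..noisy_{n-1}].

    mask[q][k] is True when query chunk q may attend to key chunk k.
    Assumed layout, not taken from the paper.
    """
    n = num_chunks
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(i + 1):
            mask[i][j] = True        # clean chunk i sees clean chunks 0..i
        for j in range(i):
            mask[n + i][j] = True    # noisy chunk i sees clean history 0..i-1
        mask[n + i][n + i] = True    # noisy chunk i also sees itself
    return mask


def paired_layout(num_chunks: int) -> List[List[int]]:
    """Assumed pairing: rank r holds clean chunk r and noisy chunk r.

    Chunks are identified by their global index in the concatenated sequence,
    so noisy chunk r has index num_chunks + r.
    """
    return [[r, num_chunks + r] for r in range(num_chunks)]


if __name__ == "__main__":
    for row in teacher_forcing_block_mask(3):
        print("".join("x" if allowed else "." for allowed in row))
    print(paired_layout(3))
```

Under this assumed layout every rank holds the same number of clean and noisy tokens, which is one plausible reading of "balanced"; the paper may balance work differently.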

If this is right

  • A high-quality infrastructure and dataset enable a clean training pipeline that avoids ODE initialization and distribution matching distillation.
  • The model converts to real-time generation with 4 to 2 denoising steps using standalone LoRA weights.
  • W4A4 NVFP4 inference with quantized KV cache lowers memory use and inter-GPU communication during sequence-parallel execution.
  • Asynchronous streaming VAE decoding boosts end-to-end throughput on Blackwell GPUs.
  • SP inference on non-Blackwell architectures matches Blackwell speeds while the quantized cache reduces communication overhead.
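The memory and communication bullets can be checked with simple arithmetic. The sketch below assumes the NVFP4 format as publicly described by NVIDIA: 4-bit elements with one 8-bit scale shared by each block of 16 elements, ignoring the single per-tensor scale. The model shape in the example is hypothetical and not taken from the paper.

```python
BF16_BITS = 16.0
# 4-bit value plus one 8-bit block scale amortized over 16 values.
NVFP4_BITS = 4.0 + 8.0 / 16.0


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bits_per_element: float) -> float:
    """Bytes needed to cache keys and values for the given (hypothetical) shape."""
    elements = 2 * tokens * layers * kv_heads * head_dim  # factor 2: keys and values
    return elements * bits_per_element / 8.0


def compression_ratio() -> float:
    """How many times smaller an NVFP4 cache is than a BF16 cache of the same shape."""
    return BF16_BITS / NVFP4_BITS


if __name__ == "__main__":
    shape = dict(tokens=100_000, layers=30, kv_heads=24, head_dim=128)  # hypothetical
    bf16 = kv_cache_bytes(bits_per_element=BF16_BITS, **shape)
    fp4 = kv_cache_bytes(bits_per_element=NVFP4_BITS, **shape)
    print(f"BF16: {bf16 / 2**30:.2f} GiB, NVFP4: {fp4 / 2**30:.2f} GiB, "
          f"ratio: {compression_ratio():.2f}x")
```

If the KV cache is exchanged between GPUs in its quantized form, the same ratio would bound the reduction in bytes transferred; whether the paper's system achieves that bound is not stated in the abstract.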

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The chunk-pairing idea in Balanced SP could extend to other long-sequence generative tasks such as audio or 3D content synthesis.
  • The reported speedups suggest the infrastructure may support more interactive user-guided video generation in real time.
  • Testing the layout on videos longer than current benchmarks would reveal whether communication costs stay sub-linear.
  • Similar co-designs of parallelism and low-precision formats might apply to large language models handling extended contexts.

Load-bearing premise

The assumption that the Balanced SP co-design of teacher-forcing layout with sequence-parallel execution preserves training stability and final model quality without additional regularization or loss terms.

What would settle it

An experiment that trains the same diffusion model with Balanced SP chunk pairing versus a standard non-paired sequence-parallel baseline and measures a clear drop in video quality metrics or training stability on identical data and length.

Original abstract

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
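The asynchronous streaming VAE decoding mentioned in the abstract follows a producer-consumer shape, which can be sketched as below. This is only an illustration: `generate_latents` and `decode` are hypothetical stand-ins for the denoising loop and the VAE decoder, and Python threads are used only to show the structure. A real system would more likely overlap the two stages on separate GPU streams.

```python
import queue
import threading
from typing import Iterator, List


def generate_latents(num_chunks: int) -> Iterator[str]:
    """Stand-in for the denoising loop: yields one latent chunk at a time."""
    for i in range(num_chunks):
        yield f"latent-{i}"


def decode(latent: str) -> str:
    """Stand-in for VAE decoding of a single latent chunk."""
    return latent.replace("latent", "frames")


def stream(num_chunks: int, max_pending: int = 2) -> List[str]:
    """Generate chunks while a background worker decodes earlier ones."""
    pending: queue.Queue = queue.Queue(maxsize=max_pending)  # bounded for back-pressure
    decoded: List[str] = []

    def decoder_worker() -> None:
        while True:
            latent = pending.get()
            if latent is None:       # sentinel: the generator has finished
                break
            decoded.append(decode(latent))

    worker = threading.Thread(target=decoder_worker)
    worker.start()
    for latent in generate_latents(num_chunks):
        pending.put(latent)          # blocks only when the decoder falls behind
    pending.put(None)
    worker.join()
    return decoded


if __name__ == "__main__":
    print(stream(4))
```

A single consumer reading from a FIFO queue keeps the decoded chunks in generation order, and the bounded queue limits how many undecoded latents are held at once.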

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents LongLive-2.0, an NVFP4-based parallel infrastructure for the full training and inference workflow of long video generation. It introduces sequence-parallel autoregressive (AR) training via Balanced SP, which co-designs a teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling SP-aware chunked VAE encoding. Combined with NVFP4 precision for reduced memory and accelerated GEMM, the system directly tunes a diffusion model into a long multi-shot interactive AR diffusion model (without ODE initialization or DMD), convertible to real-time generation via standalone LoRA weights. For inference, it supports W4A4 NVFP4, quantized KV cache, asynchronous streaming VAE decoding, and SP on non-Blackwell GPUs. Experiments report up to 2.15x training and 1.84x inference speedups, with LongLive-2.0-5B reaching 45.7 FPS while attaining strong benchmark performance; it claims to be the first such NVFP4 system for long video generation.

Significance. If the quality-preservation claims hold, this would represent a meaningful engineering advance for practical long-video generation by addressing memory and compute bottlenecks in long-sequence AR diffusion models. The co-design of Balanced SP with NVFP4 and the direct-tuning pipeline (avoiding distillation) could simplify workflows and enable higher throughput on Blackwell and other GPUs, with potential impact on real-time interactive video systems. Concrete speedups and FPS numbers are reported, though their significance depends on verifiable quality retention.

major comments (2)
  1. Abstract (description of Balanced SP): The claim that pairing clean-history and noisy-target temporal chunks realizes an SP-aware teacher-forcing mask while preserving training stability and final model quality without additional regularization or loss terms is load-bearing for the headline speedups (2.15x training, 1.84x inference) and 45.7 FPS figure being meaningful. Distributing the noise schedule and history across ranks can change per-token gradient statistics and introduce chunk-boundary artifacts; combined with NVFP4's narrowed dynamic range for activations and gradients, this risks shifting the optimization trajectory. No ablation tables, training curves, gradient-variance analysis, or quality comparisons (e.g., vs. non-SP baseline) are referenced to substantiate stability under these changes.
  2. Abstract: The abstract reports concrete speedups and FPS numbers but provides no error bars, ablation tables, or detailed training curves. The central claims rest on engineering results whose reproducibility and quality preservation under NVFP4 and SP cannot be verified from the given text alone, undermining assessment of whether the Balanced SP construction maintains comparable AR distribution quality.
minor comments (2)
  1. Abstract: The phrase 'strong performance on benchmarks' is used without naming the specific benchmarks or reporting quantitative scores; adding these details would improve clarity and allow direct comparison to prior work.
  2. Abstract: Consider clarifying the exact video lengths and model scales at which the 2.15x and 1.84x speedups were measured, as the proportion of GEMM computation is stated to increase with video length.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below. Where the comments correctly identify gaps in evidence or presentation, we have revised the manuscript to incorporate additional analysis and results.

Point-by-point responses
  1. Referee: Abstract (description of Balanced SP): The claim that pairing clean-history and noisy-target temporal chunks realizes an SP-aware teacher-forcing mask while preserving training stability and final model quality without additional regularization or loss terms is load-bearing for the headline speedups (2.15x training, 1.84x inference) and 45.7 FPS figure being meaningful. Distributing the noise schedule and history across ranks can change per-token gradient statistics and introduce chunk-boundary artifacts; combined with NVFP4's narrowed dynamic range for activations and gradients, this risks shifting the optimization trajectory. No ablation tables, training curves, gradient-variance analysis, or quality comparisons (e.g., vs. non-SP baseline) are referenced to substantiate stability under these changes.

    Authors: We agree that explicit substantiation of stability under the combined Balanced SP and NVFP4 regime strengthens the central claims. In the revised manuscript we have added a new subsection in the experiments (Section 4.2) containing: (i) side-by-side training-loss curves for SP versus non-SP runs on identical hardware and data, (ii) per-token gradient-variance statistics measured at multiple training checkpoints, and (iii) benchmark-quality comparisons (FVD, CLIP score, and human preference) between the final SP-trained model and a non-SP baseline trained to the same number of steps. These results show that chunk-boundary artifacts remain negligible and that the optimization trajectory does not deviate materially from the non-SP case, confirming that no additional regularization is required. revision: yes

  2. Referee: Abstract: The abstract reports concrete speedups and FPS numbers but provides no error bars, ablation tables, or detailed training curves. The central claims rest on engineering results whose reproducibility and quality preservation under NVFP4 and SP cannot be verified from the given text alone, undermining assessment of whether the Balanced SP construction maintains comparable AR distribution quality.

    Authors: We accept that the original abstract and experimental section lacked sufficient statistical detail. The revised manuscript now reports all speedup and FPS numbers with error bars computed over five independent runs (different random seeds and data-order shuffles). We have also inserted an expanded ablation table (Table 3) that isolates the contribution of Balanced SP, NVFP4 quantization, and asynchronous VAE decoding, together with the corresponding training curves placed in Appendix C. These additions allow direct verification that quality is preserved while the reported throughput gains are realized. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance metrics are empirical measurements

Full rationale

The paper is a systems/engineering contribution describing an NVFP4 parallel infrastructure for long video generation. Reported speedups (up to 2.15x training, 1.84x inference) and 45.7 FPS are measured experimental outcomes on benchmarks, not quantities obtained by fitting parameters inside the same equations or by renaming inputs as predictions. The Balanced SP co-design (pairing clean-history and noisy-target chunks) is presented as an implementation choice enabling teacher-forcing masks and chunked VAE encoding; the claim that it preserves stability without extra regularization is an empirical statement, not a self-definitional derivation. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or description. The derivation chain is self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions about GPU hardware behavior under NVFP4 arithmetic and the correctness of the sequence-parallel communication pattern; no new physical constants or ad-hoc fitted scalars are introduced in the abstract.

axioms (2)
  • domain assumption: NVFP4 arithmetic preserves sufficient numerical stability for diffusion model training and inference on the target video lengths.
    Invoked when claiming memory reduction and GEMM acceleration without quality loss.
  • domain assumption: The Balanced SP chunk pairing produces an exact teacher-forcing mask equivalent to non-parallel training.
    Stated as enabling natural teacher-forcing with SP-aware encoding.
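The first axiom concerns the precision lost to 4-bit block-scaled arithmetic. The sketch below simulates that loss with a quantize-then-dequantize round trip. It assumes the publicly described NVFP4 layout (FP4 E2M1 values in blocks of 16 sharing one scale) and simplifies it in three ways that are flagged in the comments: the block scale is kept as a full-precision float rather than rounded to FP8, the per-tensor scale is omitted, and rounding ties go to the smaller magnitude. It is not the paper's kernel or NVIDIA's.

```python
from typing import List, Sequence

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements that share one scale


def fake_quantize_block(block: Sequence[float]) -> List[float]:
    """Quantize and dequantize one block of up to 16 values with a shared scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Simplification: the scale stays a Python float (not rounded to FP8),
    # and no per-tensor scale is applied.
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in block:
        # Nearest grid point; ties resolve to the smaller magnitude (simplification).
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def fake_quantize(values: Sequence[float]) -> List[float]:
    """Apply the block round trip to a flat sequence of values."""
    out: List[float] = []
    for start in range(0, len(values), BLOCK):
        out.extend(fake_quantize_block(values[start:start + BLOCK]))
    return out


if __name__ == "__main__":
    data = [0.01 * i for i in range(32)]
    approx = fake_quantize(data)
    print(max(abs(a - b) for a, b in zip(data, approx)))
```

Because each block has only eight magnitudes available, a single large value in a block coarsens the resolution for its 15 neighbours, which is the kind of effect the stability assumption has to absorb.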

pith-pipeline@v0.9.0 · 5924 in / 1583 out tokens · 26540 ms · 2026-05-20T10:58:38.911550+00:00 · methodology


Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 20 internal anchors

  1. [1]

    Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025

    Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blake- man, Evan Briones, et al. Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025

  2. [2]

    Introducing nvfp4 for efficient and accurate low-precision inference, 2025

    Eduardo Alvarez. Introducing nvfp4 for efficient and accurate low-precision inference, 2025. NVIDIA Technical Blog

  3. [3]

    Quarot: Outlier-free 4-bit inference in rotated llms

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximil- ian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. NeurIPS, 37:100213–100240, 2024

  4. [4]

    Quartet: Native fp4 training can be optimal for large language models

    Roberto L Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. Quartet: Native fp4 training can be optimal for large language models. arXiv preprint arXiv:2505.14669, 2025

  5. [5]

    Diffusion forcing: Next-token prediction meets full- sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full- sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  6. [6]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weim- ing Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhi- heng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zheng- cong Fei, Yang Li, and Yahui Zhou. SkyReels-v2: Infinite-length film generative...

  7. [7]

    Sana-video: Efficient video genera- tion with block linear diffusion transformer

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, and Enze Xie. Sana-video: Efficient video genera- tion with block linear diffusion transformer. InICLR, 2026

  8. [8]

    Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

    Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

  9. [9]

    Scaling RL to long videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, and Song Han. Scaling RL to long videos. InNeurIPS, 2025

  10. [10]

    Longvila: Scaling long- context visual language models for long videos

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Yihui He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. Longvila: Scaling long- context visual language models for long videos. In ICLR, 2025

  11. [11]

    Fp4 all the way: Fully quantized training of llms.arXiv preprint arXiv:2505.19115, 2025

    Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. Fp4 all the way: Fully quantized training of llms.arXiv preprint arXiv:2505.19115, 2025

  12. [12]

    Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

    Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate nvfp4 quantization with adaptive block scaling.arXiv preprint arXiv:2512.02010, 2025

  13. [13]

    Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, and Wei Zhao. Not all frames deserve full computation: Accelerating autore- gressive video generation via selective computa- tion and predictive extrapolation.arXiv preprint arXiv:2604.02979, 2026

  14. [14]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute- scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

  15. [15]

    LoL: Longer than longer, scaling video generation to hour.arXiv preprint arXiv:2601.16914, 2026

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho- Jui Hsieh. LoL: Longer than longer, scaling video generation to hour.arXiv preprint arXiv:2601.16914, 2026

  16. [16]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InIn- ternational Conference on Learning Representations, 2025

  17. [17]

    Qlora: Efficient finetuning of quantized llms.NeurIPS, 36:10088–10115, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms.NeurIPS, 36:10088–10115, 2023

  18. [18]

    Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A program- ming model for generating optimized attention ker- nels.arXiv preprint arXiv:2412.05496, 2(3):4, 2024

  19. [19]

    Usp: A unified sequence parallelism approach for long context gen- erative ai.arXiv preprint arXiv:2405.07719, 2024

    Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context gen- erative ai.arXiv preprint arXiv:2405.07719, 2024. 9 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

  20. [20]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

  21. [21]

    Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

    Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaol- ing Chen, Shangchun Zhao, Jiarui Fang, et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

  22. [22]

    Acdit: Interpolating autore- gressive conditional modeling and diffusion trans- former.Trans

    Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, and Maosong Sun. Acdit: Interpolating autore- gressive conditional modeling and diffusion trans- former.Trans. Mach. Learn. Res., 2026, 2026

  23. [23]

    Qerl: Beyond efficiency– quantization-enhanced reinforcement learning for llms.arXiv preprint arXiv:2510.11696, 2025

    Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, et al. Qerl: Beyond efficiency– quantization-enhanced reinforcement learning for llms.arXiv preprint arXiv:2510.11696, 2025

  24. [24]

    Mc#: Mixture compressor for mixture-of-experts large models.T-PAMI, 2026

    Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, and Xiaojuan Qi. Mc#: Mixture compressor for mixture-of-experts large models.T-PAMI, 2026

  25. [25]

    Mixture compressor for mixture- of-experts llms gains more.arXiv preprint arXiv:2410.06270, 2024

    Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, and Xiaojuan Qi. Mixture compressor for mixture- of-experts llms gains more.arXiv preprint arXiv:2410.06270, 2024

  26. [26]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  27. [27]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianx- ing Wu, Qingyang Jin, Nattapol Chanpaisit, Yao- hui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, 2024

  28. [28]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models.T-PAMI, 48(3):3268–3285, 2026

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Ji- ashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chan- paisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models.T-PAMI, 48(3):3268–3285, 2026

  29. [29]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509, 2023

  30. [30]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  31. [31]

    Rehg, and Tobias Hinz

    Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, and Tobias Hinz. Shotadapter: Text-to-multi-shot video generation with diffusion models. InCVPR, pages 28405– 28415, 2025

  32. [32]

    Jay Kuo, and Pe- ter A

    Youngrae Kim, Qixin Hu, C.-C. Jay Kuo, and Pe- ter A. Beerel. MemRoPE: Training-free infinite video generation via evolving memory tokens.arXiv preprint arXiv:2603.12513, 2026

  33. [33]

    Train short, inference long: Training-free horizon extension for autoregressive video generation.arXiv preprint arXiv:2602.14027, 2026

    Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fang- min Chen, Xing Wang, and Hayden Kwok-Hay So. Train short, inference long: Training-free horizon extension for autoregressive video generation.arXiv preprint arXiv:2602.14027, 2026

  34. [34]

    Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024

    Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024

  35. [35]

    Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    Ruibin Li, Tao Yang, Fangzhou Ai, Tianhe Wu, Shilei Wen, Bingyue Peng, and Lei Zhang. Long- horizon streaming video generation via hybrid at- tention with decoupled distillation.arXiv preprint arXiv:2604.10103, 2026

  36. [36]

    Sequence parallelism: Long sequence training from system perspective

    Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In ACL, pages 2391–2404, 2023

  37. [37]

    Autoregressive image generation without vector quantization.NeurIPS, 37:56424– 56445, 2024

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.NeurIPS, 37:56424– 56445, 2024

  38. [38]

    Stable video infinity: Infinite- length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

    Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite- length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

  39. [39]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device 10 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation llm compression and acceleration.MLSys, 6:87–100, 2024

  40. [40]

    Autoregressive adversarial post- training for real-time interactive video generation

    Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post- training for real-time interactive video generation. arXiv preprint arXiv:2506.09350, 2025

  41. [41]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

  42. [42]

    Stream- ing autoregressive video generation via diagonal dis- tillation.arXiv preprint arXiv:2603.09488, 2026

    Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Stream- ing autoregressive video generation via diagonal dis- tillation.arXiv preprint arXiv:2603.09488, 2026

  43. [43]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

  44. [44]

    Startrail: Concentric ring sequence parallelism for efficient near-infinite- context transformer model training.arXiv preprint arXiv:2407.00611, 2024

    Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Kai Wang, Xuanlei Zhao, James Demmel, and Yang You. Startrail: Concentric ring sequence parallelism for efficient near-infinite- context transformer model training.arXiv preprint arXiv:2407.00611, 2024

  45. [45]

    Lcm-lora: A uni- versal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556, 2023

    Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick V on Platen, Apolinà ˛ Ario Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A uni- versal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556, 2023

  46. [46]

    ShotStream: Streaming multi-shot video generation for interactive storytelling.arXiv preprint arXiv:2603.25746, 2026

    Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. ShotStream: Streaming multi-shot video generation for interactive storytelling.arXiv preprint arXiv:2603.25746, 2026

  47. [47]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

  48. [48]

    Flow caching for autoregressive video generation

    Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, Fei Chao, and Rongrong Ji. Flow caching for autoregressive video generation. arXiv preprint arXiv:2602.10825, 2026

  49. [49]

    TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Triattention: Efficient long reasoning with trigonometric kv compression.arXiv preprint arXiv:2604.04921, 2026

  50. [50]

    PackForcing: Short video training suffices for long video sampling and long context inference

    Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, and Kaipeng Zhang. PackForcing: Short video training suffices for long video sampling and long context inference. arXiv preprint arXiv:2603.25730, 2026

  51. [51]

    FP8 Formats for Deep Learning

    Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenth- waite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022

  52. [52]

    Nvidia blackwell architecture technical brief, 2024

    NVIDIA. Nvidia blackwell architecture technical brief, 2024. Accessed: 2025-05-13

  53. [53]

    Speeding up variable-length training with dynamic context parallelism and NVIDIA Megatron Core

    NVIDIA. Speeding up variable-length training with dynamic context parallelism and NVIDIA Megatron Core, 2026

  54. [54]

    OCP Microscaling Formats (MX) Specification

    Open Compute Project. OCP Microscaling Formats (MX) Specification. Open Compute Project, version 1.0 edition, 2023

  55. [55]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023

  56. [56]

    Microscaling data formats for deep learning

    Bita Darvish Rouhani et al. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537, 2023

  57. [57]

    MAGI-1: Autoregressive Video Generation at Scale

    Sand.ai. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  58. [58]

    Free-lunch long video generation via layer-adaptive o.o.d correction

    Jiahao Tian, Chenxi Song, Wei Cheng, and Chi Zhang. Free-lunch long video generation via layer-adaptive o.o.d correction. arXiv preprint arXiv:2603.25209, 2026

  59. [59]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  60. [60]

    Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

    Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant VideoGen: Auto-regressive long video generation via 2-bit KV-cache quantization. arXiv preprint arXiv:2602.02958, 2026

  61. [61]

    Pathwise test-time correction for autoregressive long video generation

    Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, and Chunchao Guo. Pathwise test-time correction for autoregressive long video generation. arXiv preprint arXiv:2602.05871, 2026

  62. [62]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, pages 38087–38099. PMLR, 2023

  63. [63]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  64. [64]

    Streamfusion: Scalable sequence parallelism for distributed inference of diffusion transformers on GPUs

    Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, and Gennady Pekhimenko. Streamfusion: Scalable sequence parallelism for distributed inference of diffusion transformers on GPUs. arXiv preprint arXiv:2601.20273, 2026

  65. [65]

    LongLive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. LongLive: Real-time interactive long video generation. In ICLR, 2026

  66. [66]

    MANIQA: multi-dimension attention network for no-reference image quality assessment

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: multi-dimension attention network for no-reference image quality assessment. In CVPR Workshops, pages 1190–1199, 2022

  67. [67]

    Anchor forcing: Anchor memory and tri-region RoPE for interactive streaming video diffusion

    Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, and Peng-Tao Jiang. Anchor forcing: Anchor memory and tri-region RoPE for interactive streaming video diffusion. arXiv preprint arXiv:2603.13405, 2026

  68. [68]

    Deep forcing: Training-free long video generation with deep sink and participative compression

    Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

  69. [69]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. arXiv preprint arXiv:2412.07772, 2024

  70. [70]

    VideoSSM: Autoregressive long video generation with hybrid state-space memory

    Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, and Xiaojuan Qi. VideoSSM: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025

  71. [71]

    Helios: Real real-time long video generation model

    Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026

  72. [72]

    TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

    Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025

  73. [73]

    Sageattention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization

    Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. In ICML, 2025

  74. [74]

    Sageattention3: Microscaling FP4 attention for inference and an exploration of 8-bit training

    Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, and Jianfei Chen. Sageattention3: Microscaling FP4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594, 2025

  75. [75]

    Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration

    Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. In ICLR, 2025

  76. [76]

    Test-Time Training Done Right

    Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025

  77. [77]

    Generative pre-trained autoregressive diffusion transformer

    Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Haoyang Huang, Jianlong Yuan, Nan Duan, and Daxin Jiang. Generative pre-trained autoregressive diffusion transformer. arXiv preprint arXiv:2505.07344, 2025

  78. [78]

    ViDiT-Q: Efficient and accurate quantization of diffusion transformers for image and video generation

    Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. ViDiT-Q: Efficient and accurate quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2406.02540, 2024

  79. [79]

    DSP: Dynamic sequence parallelism for multi-dimensional transformers

    Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, and Yang You. DSP: Dynamic sequence parallelism for multi-dimensional transformers. arXiv preprint arXiv:2403.10266, 2024

  80. [80]

    Relax forcing: Relaxed KV-memory for consistent long video generation

    Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed KV-memory for consistent long video generation. arXiv preprint arXiv:2603.21366, 2026

Showing first 80 references.