Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Pith reviewed 2026-05-16 11:11 UTC · model grok-4.3
The pith
Rolling Forcing generates multi-minute streaming videos in real time by jointly denoising frames with rising noise levels and anchoring attention to early frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rolling Forcing enables streaming long videos with minimal error accumulation through three components: a joint denoising scheme that processes multiple frames simultaneously with progressively increasing noise levels; an attention sink that retains the key-value states of initial frames as a global context anchor; and an efficient training algorithm that performs few-step distillation over extended non-overlapping windows, reducing the exposure bias that arises from conditioning on self-generated histories.
What carries the argument
Joint denoising scheme with progressively increasing noise levels across multiple frames, combined with an attention sink for long-term context and windowed distillation training.
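The rolling-window mechanics can be sketched in a few lines. This is a toy illustration of the progressive-noise idea, not the authors' implementation: `rolling_denoise` and `denoise_step` are hypothetical names, and a simple shrink-toward-zero callable stands in for the diffusion model.

```python
import numpy as np

def rolling_denoise(denoise_step, num_frames, window=4, frame_dim=8, seed=0):
    """Toy rolling-window joint denoising.

    The window holds `window` frames at progressively increasing noise
    levels: the head is one step from clean, the tail is pure noise.
    Each iteration jointly denoises every frame by one step, emits the
    now-clean head frame, shifts the window, and appends fresh noise.
    """
    rng = np.random.default_rng(seed)
    step = 1.0 / window
    levels = np.linspace(step, 1.0, window)       # head -> tail noise levels
    buf = [rng.standard_normal(frame_dim) for _ in range(window)]
    out = []
    while len(out) < num_frames:
        # joint step: the frame at level `lvl` is denoised toward `lvl - step`
        buf = [denoise_step(x, lvl, lvl - step) for x, lvl in zip(buf, levels)]
        out.append(buf.pop(0))                      # head reaches level ~0
        buf.append(rng.standard_normal(frame_dim))  # new pure-noise tail
    return out

# stand-in "model": shrink toward the clean signal (here, zeros)
frames = rolling_denoise(lambda x, lvl, nxt: x * (nxt / lvl), num_frames=6)
```

Because every frame is visited once per window shift at a strictly lower noise level, the strict frame-by-frame causality of autoregressive sampling is relaxed, which is the mechanism the paper credits for suppressing error growth.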
If this is right
- Real-time streaming generation of multi-minute videos becomes feasible on a single GPU.
- Error accumulation is substantially reduced compared with standard frame-by-frame autoregressive diffusion.
- Temporal coherence is maintained across long horizons through the global context anchor.
- Few-step inference is enabled without the exposure bias that arises from conditioning on self-generated histories.
- The approach supports applications in interactive world models and neural game engines that require low-latency video streams.
Where Pith is reading between the lines
- The attention sink technique could be tested in other long-horizon autoregressive tasks such as audio or 3D scene generation to check whether the same anchoring reduces drift.
- Applying the windowed distillation to existing video diffusion backbones might require only modest retraining, making adoption easier than full model redesigns.
- If the progressive noise schedule proves robust, it could be combined with variable frame-rate inputs to handle mixed slow and fast motion without retraining.
- Scaling the method to higher resolutions would likely depend on whether the joint denoising window size can grow without exceeding single-GPU memory limits.
Load-bearing premise
The combination of joint denoising with rising noise levels, attention sink, and windowed distillation actually prevents error accumulation over long sequences without creating new artifacts or quality loss.
What would settle it
Generating a multi-minute video with Rolling Forcing and observing clear temporal inconsistencies, blurring, or new artifacts after the first minute would falsify the reduced error accumulation claim.
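One way to operationalize that test is to track a drift proxy over the generated stream and check whether it keeps rising. The sketch below uses a naive distance-to-reference proxy on flattened frame features (the paper would use perceptual metrics such as VFID); `drift_curve` and the synthetic stream are illustrative.

```python
import numpy as np

def drift_curve(frames, ref_len=8):
    """Toy drift proxy: distance of each frame's flattened features
    from the mean of the first `ref_len` frames. A curve that keeps
    rising over minutes suggests error accumulation; a flat curve is
    consistent with the paper's claim."""
    frames = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    ref = frames[:ref_len].mean(axis=0)
    return np.linalg.norm(frames - ref, axis=1)

# a synthetic stream that drifts linearly produces a rising curve
stream = np.arange(20.0)[:, None] * np.ones(4)
curve = drift_curve(stream, ref_len=8)
```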
Original abstract
Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated stream videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
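The attention-sink design described in the abstract amounts to a cache-eviction policy: the key/value states of the first few frames are pinned as a global anchor while the rest of the context rolls. A minimal sketch, with hypothetical names (`SinkKVCache`; strings stand in for per-frame KV tensors):

```python
from collections import deque

class SinkKVCache:
    """Toy attention-sink cache: pin the key/value states of the first
    `sink` frames as a global anchor, keep a rolling window of the most
    recent `recent` frames, and evict everything in between. The
    paper's actual cache layout may differ."""

    def __init__(self, sink=2, recent=6):
        self.sink = sink
        self.sink_kv = []                      # anchor frames, never evicted
        self.recent_kv = deque(maxlen=recent)  # rolling window, oldest evicted

    def append(self, kv):
        if len(self.sink_kv) < self.sink:
            self.sink_kv.append(kv)
        else:
            self.recent_kv.append(kv)

    def context(self):
        # attention context = global anchor + local window
        return self.sink_kv + list(self.recent_kv)

cache = SinkKVCache(sink=2, recent=3)
for t in range(10):
    cache.append(f"kv{t}")
```

Memory stays constant regardless of stream length, while the anchor gives every later frame attention access to the initial scene, which is how the mechanism supports long-term global consistency.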
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Rolling Forcing, a technique for autoregressive long video diffusion that enables real-time streaming generation of multi-minute videos. It proposes three designs: a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels to relax strict causality and suppress error accumulation; an attention sink mechanism that retains key-value states from initial frames as a global context anchor for long-term consistency; and a windowed distillation training algorithm operating on non-overlapping windows to enable few-step inference while mitigating exposure bias from self-generated histories. The central claim is that these elements together allow high-quality, low-latency multi-minute video streams on a single GPU with substantially reduced error accumulation compared to prior autoregressive diffusion approaches.
Significance. If the empirical claims are substantiated, the work would be significant for video generation and interactive world models, as it targets the longstanding problem of error drift in long-horizon autoregressive sampling. The engineering combination of progressive noise scheduling, attention sinks, and non-overlapping distillation could enable practical real-time systems for neural game engines and streaming applications. The absence of parameter fitting or self-referential derivations is a strength in that the method is presented as a practical design rather than a closed-form derivation.
major comments (3)
- [Abstract] Abstract: the claims that the joint denoising scheme is 'effectively suppressing error growth' and that the method achieves 'substantially reduced error accumulation' over multi-minute horizons are unsupported by any quantitative error metrics, ablation studies, or baseline comparisons; the central contribution therefore rests on unverified assertions rather than demonstrated results.
- [§3] §3 (joint denoising and attention sink): the description of progressively increasing noise levels relaxing causality lacks a recurrence relation or variance bound showing how error growth is controlled beyond the training window; without this, it remains unclear whether the scheme prevents drift or merely shifts artifacts when conditioned on self-generated frames.
- [§4] §4 (experiments): no quantitative results on temporal coherence, FID/VFID over long sequences, or real-time FPS measurements on single-GPU multi-minute streams are provided, which is load-bearing for the headline claim of practical deployment.
minor comments (2)
- [§3] Clarify the exact noise schedule parameters and window sizes used in the joint denoising and distillation stages so that the method can be reproduced.
- Add captions and axis labels to all figures showing generated video frames to indicate the temporal horizon and any visible drift.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the empirical support and clarity of the manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claims that the joint denoising scheme is 'effectively suppressing error growth' and that the method achieves 'substantially reduced error accumulation' over multi-minute horizons are unsupported by any quantitative error metrics, ablation studies, or baseline comparisons; the central contribution therefore rests on unverified assertions rather than demonstrated results.
Authors: We agree the abstract claims require quantitative backing. In revision we will add VFID scores, temporal coherence metrics, and baseline comparisons over multi-minute sequences, plus ablations isolating the joint denoising contribution. These results will be summarized in the abstract and detailed in §4. revision: yes
-
Referee: [§3] §3 (joint denoising and attention sink): the description of progressively increasing noise levels relaxing causality lacks a recurrence relation or variance bound showing how error growth is controlled beyond the training window; without this, it remains unclear whether the scheme prevents drift or merely shifts artifacts when conditioned on self-generated frames.
Authors: The progressive noise schedule is presented as an empirical design that relaxes frame-wise causality. We will expand §3 with a qualitative analysis of error propagation under self-generated conditioning and additional plots showing reduced drift beyond the training window. A formal recurrence bound is outside the paper's practical scope, but the mechanism will be clarified. revision: partial
-
Referee: [§4] §4 (experiments): no quantitative results on temporal coherence, FID/VFID over long sequences, or real-time FPS measurements on single-GPU multi-minute streams are provided, which is load-bearing for the headline claim of practical deployment.
Authors: We will augment §4 with quantitative tables reporting VFID, temporal coherence, and single-GPU FPS for multi-minute streams, including direct comparisons to prior autoregressive baselines. These metrics will directly support the real-time deployment claims. revision: yes
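The promised single-GPU FPS numbers reduce to a simple wall-clock harness. A sketch, where `generate_frame` is a hypothetical stand-in for one autoregressive model step:

```python
import time

def measure_stream_fps(generate_frame, num_frames=60):
    """Wall-clock throughput of a streaming frame generator.
    Real-time playback at 24 fps requires a result >= 24."""
    start = time.perf_counter()
    for _ in range(num_frames):
        generate_frame()
    return num_frames / (time.perf_counter() - start)

fps = measure_stream_fps(lambda: None, num_frames=100)
```

For a fair deployment claim the harness would time the full pipeline, including VAE decoding and host-device transfers, not just the denoising steps.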
Circularity Check
No significant circularity; method is an engineering design without self-referential derivations
full rationale
The paper introduces Rolling Forcing through three explicit design choices (joint denoising with increasing noise, attention sink, and non-overlapping window distillation) presented as novel engineering contributions rather than mathematical derivations. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description that would reduce claims to inputs by construction. The central claim of reduced error accumulation is framed as an empirical outcome of these designs, not a self-definitional or uniqueness-theorem result. This is the common case of a self-contained technical proposal.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 19 Pith papers
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
LPM 1.0: Video-based Character Performance Model
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
-
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
-
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
-
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
-
[3]
Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[6]
Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
-
[7]
Photorealistic video generation with diffusion models. In European Conference on Computer Vision, 2024.
-
[12]
Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning.
-
[17]
Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems.
-
[23]
Art-v: Auto-regressive text-to-video generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[24]
From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference.
-
[28]
Fifo-diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems.
-
[29]
Rolling diffusion models. In International Conference on Machine Learning.
-
[31]
Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference.
-
[32]
Progressive autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference.
-
[34]
Videostudio: Generating consistent-content and multi-scene videos. In European Conference on Computer Vision, 2024.
-
[45]
Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
-
[46]
One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[47]
Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems.
-
[48]
Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
-
[50]
Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems.
-
[51]
Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[53]
Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference.
-
[55]
WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception. arXiv preprint arXiv:2508.15720.
-
[57]
Talc: Time-aligned captions for multi-scene text-to-video generation
Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, and Kai-Wei Chang. Talc: Time-aligned captions for multi-scene text-to-video generation. arXiv preprint arXiv:2405.04682, 2024
-
[58]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a
-
[59]
Align your latents: High-resolution video synthesis with latent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22563–22575, 2023b
-
[60]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024
-
[61]
Diffusion forcing: Next-token prediction meets full-sequence diffusion
Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024
-
[62]
SkyReels-V2: Infinite-length Film Generative Model
Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025
-
[63]
Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325, 2025
-
[64]
Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025
Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. arXiv preprint arXiv:2503.10589, 2025
-
[65]
Photorealistic video generation with diffusion models
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In European Conference on Computer Vision, pp. 393–411. Springer, 2024
-
[66]
Streamingt2v: Consistent, dynamic, and extendable long video generation from text
Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 2568–2577, 2025
-
[67]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022
-
[68]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022
-
[69]
Storyagent: Customized storytelling video generation via multi-agent collaboration
Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang. Storyagent: Customized storytelling video generation via multi-agent collaboration. arXiv preprint arXiv:2411.04925, 2024
-
[70]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025
-
[71]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818, 2024
-
[72]
Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,
Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024
-
[73]
Fifo-diffusion: Generating infinite videos from text without training
Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems, 37:89834–89868, 2024
-
[74]
Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745,
Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. arXiv preprint arXiv:2507.03745, 2025
-
[75]
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023
-
[76]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024
-
[77]
Arlon: Boosting diffusion transformers with autoregressive models for long video generation
Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. arXiv preprint arXiv:2410.20502, 2024
-
[78]
Diffusion adversarial post-training for one-step video generation
Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025 a
-
[79]
Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post-training for real-time interactive video generation. arXiv preprint arXiv:2506.09350, 2025 b
-
[80]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
-
[81]
Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024
-
[82]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022
-
[83]
Videostudio: Generating consistent-content and multi-scene videos
Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videostudio: Generating consistent-content and multi-scene videos. In European Conference on Computer Vision, pp. 468–485. Springer, 2024
-
[84]
OpenAI. Sora. https://openai.com/sora, 2024
-
[85]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023
-
[86]
Movie Gen: A Cast of Media Foundation Models
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024
-
[87]
David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In International Conference on Machine Learning, 2024
-
[88]
Generalization in generation: A closer look at exposure bias
Florian Schmidt. Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292, 2019
-
[89]
Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025
-
[90]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022
-
[91]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
-
[92]
Ar-diffusion: Asynchronous video generation with auto-regressive diffusion
Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7364–7373, 2025
-
[93]
MAGI-1: Autoregressive Video Generation at Scale
Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025
-
[94]
Diffusion Models Are Real-Time Game Engines
Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024
-
[95]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
-
[96]
Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models
Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems, 37:65618--65642, 2024
-
[97]
Loong: Generating minute-level long videos with autoregressive language models
Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024
-
[98]
-
[99]
Art-v: Auto-regressive text-to-video generation with diffusion models
Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to-video generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7395--7405, 2024
-
[100]
Macro-from-micro planning for high-quality and parallelized autoregressive long video generation
Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, et al. Macro-from-micro planning for high-quality and parallelized autoregressive long video generation. arXiv preprint arXiv:2508.03334, 2025
-
[101]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023
-
[102]
Progressive autoregressive video diffusion models
Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 6322--6332, 2025
-
[103]
Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework
Zhifei Xie, Daniel Tang, Dingwei Tan, Jacques Klein, Tegawendé F Bissyandé, and Saad Ezzini. Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework. arXiv preprint arXiv:2408.11788, 2024
-
[104]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021
-
[105]
Synchronized video storytelling: Generating video narrations with structured storyline
Dingyi Yang, Chunru Zhan, Ziheng Wang, Biao Wang, Tiezheng Ge, Bo Zheng, and Qin Jin. Synchronized video storytelling: Generating video narrations with structured storyline. arXiv preprint arXiv:2405.14040, 2024
-
[106]
Improved distribution matching distillation for fast image synthesis
Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems, 37:47455--47487, 2024a
-
[107]
One-step diffusion with distribution matching distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6613--6623, 2024b
-
[108]
From slow bidirectional to fast autoregressive video diffusion models
Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 22963--22974, 2025
-
[109]
Packing input frame context in next-frame prediction models for video generation
Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626, 2025
-
[110]
Test-time training done right
Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025
-
[111]
Moviedreamer: Hierarchical generation for coherent long visual sequence
Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, and Chunhua Shen. Moviedreamer: Hierarchical generation for coherent long visual sequence. arXiv preprint arXiv:2407.16655, 2024