Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Pith reviewed 2026-05-16 06:59 UTC · model grok-4.3
The pith
Rolling Sink lets autoregressive video models trained on five-second clips generate consistent videos lasting many minutes at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rolling Sink is a training-free technique derived from a systematic analysis of autoregressive cache maintenance. Applied to models such as Self Forcing that were trained only on five-second clips, it scales video synthesis to open-ended durations (five to thirty minutes at sixteen frames per second) while maintaining consistent subjects, stable colors, coherent structures, and smooth motions.
What carries the argument
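Rolling Sink: a cache-maintenance rule for autoregressive generation. At each autoregressive step it refreshes the content of the cache's sink blocks with a rolling segment drawn from the within-duration history, which is intended to limit error propagation during long-horizon generation.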
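A minimal toy sketch of a bounded cache with rolling sink content may help make the idea concrete. It is not the paper's implementation: the class name `RollingSinkCache`, its parameters, the use of integer block ids in place of cached key/value tensors, and especially the rule that the sink segment advances by one block per step and wraps around are all assumptions made for illustration.

```python
from collections import deque


class RollingSinkCache:
    """Toy bounded cache: sink slots plus a sliding window of recent blocks.

    Hypothetical illustration only; block ids stand in for cached tensors.
    """

    def __init__(self, capacity, num_sink, train_blocks):
        if not 0 < num_sink < capacity:
            raise ValueError("need 0 < num_sink < capacity")
        if train_blocks < num_sink:
            raise ValueError("need train_blocks >= num_sink")
        self.num_sink = num_sink
        self.train_blocks = train_blocks  # blocks that fit in the training duration
        self.history = []                 # within-duration history only
        self.recent = deque(maxlen=capacity - num_sink)
        self.step = 0

    def append(self, block):
        """Record a newly generated block."""
        if len(self.history) < self.train_blocks:
            self.history.append(block)
        self.recent.append(block)
        self.step += 1

    def sink(self):
        """Sink content for the current step: a rolling segment of the history."""
        n = len(self.history)
        if n == 0:
            return []
        size = min(self.num_sink, n)
        # Assumption: the segment start advances one block per step and wraps.
        start = self.step % n
        return [self.history[(start + i) % n] for i in range(size)]

    def context(self):
        """Blocks visible to the next step; never more than `capacity`."""
        return self.sink() + list(self.recent)


cache = RollingSinkCache(capacity=6, num_sink=5, train_blocks=8)
for block_id in range(20):
    cache.append(block_id)
print(cache.context())  # [4, 5, 6, 7, 0, 19]
```

The values `capacity=6` and `num_sink=5` are chosen only so that the sink share is roughly 83%, the ratio quoted from the paper further down this page; the actual block counts are not stated here. In this toy version a block can appear in both the sink and the recent window during the first few steps, which a real cache would presumably avoid.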
If this is right
- Models trained on five-second clips can now produce five-to-thirty-minute videos at sixteen frames per second with stable visual quality.
- Long-horizon fidelity and temporal consistency exceed those of current state-of-the-art baselines on the same short-trained models.
- No additional training or longer data is required to reach open-ended generation lengths.
- Subject identity, color constancy, and motion smoothness remain intact across the extended sequence.
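For scale, the durations in the list above can be converted into frame counts. The snippet below is plain arithmetic on the quoted figures (five-second training clips, sixteen frames per second); the helper name `frame_count` is only illustrative.

```python
FPS = 16  # frame rate quoted in the review


def frame_count(seconds, fps=FPS):
    """Number of frames in a clip of the given length."""
    return seconds * fps


train_frames = frame_count(5)        # five-second training clip
short_test = frame_count(5 * 60)     # five-minute generation
long_test = frame_count(30 * 60)     # thirty-minute generation

print(train_frames, short_test, long_test)                    # 80 4800 28800
print(short_test // train_frames, long_test // train_frames)  # 60 360
```

In other words, the claimed test horizons are 60 to 360 times longer than the training horizon.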
Where Pith is reading between the lines
- The same cache-rolling principle might apply to autoregressive generation in other domains such as audio waveforms or long text sequences.
- Combining Rolling Sink with occasional fine-tuning on medium-length clips could further reduce residual drift.
- The method implies that cache state management is a primary bottleneck when scaling autoregressive diffusion beyond training horizons.
- Testing the approach at higher frame rates or resolutions would show whether the cache rules remain sufficient.
Load-bearing premise
The assumption that a fixed set of cache-maintenance rules derived from short-horizon analysis will continue to prevent degradation at arbitrary test lengths without introducing fresh artifacts.
What would settle it
Run a thirty-minute generation with Rolling Sink and compare frame-by-frame consistency metrics against the same model without the cache adjustment; persistent degradation equal to the baseline would falsify the claim.
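One way to phrase this check as code, assuming per-frame consistency scores are already available for both runs. The function names, the choice of the final 20% of frames as the "late" window, and the tolerance are all hypothetical; the source does not specify a metric or a threshold.

```python
from statistics import mean


def late_window_mean(scores, fraction=0.2):
    """Mean of the final `fraction` of a per-frame score series."""
    if not scores:
        raise ValueError("scores must be non-empty")
    k = max(1, int(len(scores) * fraction))
    return mean(scores[-k:])


def degradation_matches_baseline(with_sink, baseline, tolerance=0.02):
    """True when the run with the cache adjustment ends up no better
    than the baseline, within `tolerance` (higher score = more consistent)."""
    gap = late_window_mean(with_sink) - late_window_mean(baseline)
    return gap <= tolerance


# Synthetic scores for illustration only, not measured results.
stable = [0.9] * 100
drifting = [0.9 - 0.005 * i for i in range(100)]
print(degradation_matches_baseline(stable, drifting))    # False
print(degradation_matches_baseline(drifting, drifting))  # True
```

A `True` result on real thirty-minute runs would correspond to the falsifying outcome described above.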
Original abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Rolling Sink, a training-free technique for autoregressive video diffusion models. Building on Self Forcing (trained only on 5-second clips), the method performs a systematic analysis of AR cache maintenance to derive a sink rule that purportedly bridges the train-test gap for open-ended testing horizons. It claims to enable generation of ultra-long videos (5–30 minutes at 16 FPS) while preserving subject consistency, color stability, structural coherence, and motion smoothness, outperforming SOTA baselines in long-horizon fidelity and temporal consistency.
Significance. If the empirical results and generalization hold, the work would represent a meaningful contribution to long-form video synthesis by eliminating the need for computationally prohibitive long-horizon training. The training-free character and grounding in cache-behavior analysis are notable strengths; successful scaling from 5 s to thousands of frames without new degradations would have clear practical value for applications requiring extended coherent video.
major comments (2)
- [Abstract / Methods] The central claim, that the Rolling Sink rule bounds cumulative denoising error and conditioning drift over arbitrary horizons (5–30 min), rests on an unstated premise: the manuscript provides no explicit error-bound derivation or invariant. This concern is load-bearing, because nothing in the manuscript guarantees the behavior once the finite training window is exceeded.
- [Experiments] The abstract asserts 'superior long-horizon visual fidelity and temporal consistency' and cites 'extensive experiments', yet it supplies no quantitative metrics (e.g., FVD, subject-consistency scores, or long-horizon ablations) and no details on how consistency is measured over thousands of frames. Without these, the superiority claim cannot be evaluated.
minor comments (2)
- [Methods] Notation for the sink rule and cache-maintenance operations should be defined more explicitly with equations to allow reproduction.
- [Experiments] The project page link is given but the manuscript should include a brief summary of the qualitative examples shown there.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing Rolling Sink. We address each major comment point by point below, providing clarifications on our empirical approach and committing to revisions where the manuscript can be strengthened.
- Referee: [Abstract / Methods] The central claim, that the Rolling Sink rule bounds cumulative denoising error and conditioning drift over arbitrary horizons (5–30 min), rests on an unstated premise: the manuscript provides no explicit error-bound derivation or invariant. This concern is load-bearing, because nothing in the manuscript guarantees the behavior once the finite training window is exceeded.
  Authors: We appreciate the referee's emphasis on the distinction between empirical derivation and formal guarantees. Our manuscript does not claim or provide a rigorous mathematical error bound, invariant, or proof that Rolling Sink guarantees bounded drift for arbitrary horizons. The method is instead derived from a systematic analysis of observed cache behaviors and error accumulation patterns in autoregressive video diffusion, building directly on the Self Forcing framework. We identify practical rules that mitigate the train-test gap beyond the 5-second training horizon and validate them through long-horizon generations. While a theoretical bound would strengthen the work, deriving one for stochastic diffusion processes in this setting is an open research question and outside the current scope; the contribution lies in the training-free, analysis-driven solution that enables practical ultra-long synthesis.
  Revision: no
- Referee: [Experiments] The abstract asserts 'superior long-horizon visual fidelity and temporal consistency' and cites 'extensive experiments', yet it supplies no quantitative metrics (e.g., FVD, subject-consistency scores, or long-horizon ablations) and no details on how consistency is measured over thousands of frames. Without these, the superiority claim cannot be evaluated.
  Authors: We agree that the current manuscript version prioritizes qualitative visual results and comparisons in the main text, which limits the ability to fully evaluate the superiority claims. In the revision, we will incorporate quantitative metrics into the main paper, including FVD scores computed on long sequences, subject consistency via averaged CLIP embedding similarities across sampled frames, color stability via histogram distances, and motion smoothness via optical flow metrics. We will also add details on the evaluation protocol: metrics are computed by sampling frames at fixed intervals (e.g., every 50–100 frames) over the full generation length and averaging across multiple independent long videos (5–30 minutes at 16 FPS). Long-horizon ablations will be included to isolate the effect of the Rolling Sink rule.
  Revision: yes
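The sampling protocol described in the rebuttal could look roughly like the sketch below. It assumes that per-frame embeddings (for example from a CLIP image encoder) have already been computed and are passed in as plain lists of floats. Comparing every sampled frame to the first sampled frame is an assumption; the rebuttal only says that similarities are averaged across sampled frames. All function names are hypothetical.

```python
import math


def sample_indices(num_frames, stride):
    """Frame indices taken at a fixed interval over the whole video."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return list(range(0, num_frames, stride))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def subject_consistency(embeddings, stride):
    """Mean similarity of sampled frames to the first sampled frame."""
    idx = sample_indices(len(embeddings), stride)
    if len(idx) < 2:
        raise ValueError("need at least two sampled frames")
    reference = embeddings[idx[0]]
    sims = [cosine(reference, embeddings[i]) for i in idx[1:]]
    return sum(sims) / len(sims)


# A five-minute video at 16 FPS sampled every 100 frames gives 48 frames.
print(len(sample_indices(4800, 100)))  # 48
```

Averaging `subject_consistency` over several independent long videos would then give one number per method, in line with the protocol the authors describe.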
Circularity Check
No significant circularity; derivation rests on empirical cache analysis independent of target outcome
The paper's central derivation proceeds from a systematic analysis of AR cache maintenance during inference (beyond the 5s training horizon of the base Self Forcing model) to the design of the Rolling Sink rule. No equation or claim reduces the long-horizon fidelity result to a fitted parameter, a self-citation that itself assumes the result, or a renaming of an input pattern. The generalization to 5-30 minute videos is presented as an empirical outcome of the cache rule rather than a quantity forced by construction from the limited-horizon training data. Self-citation to Self Forcing is present but serves only as the base model; it is not invoked as a uniqueness theorem or load-bearing justification for the unbounded-horizon claim. The approach therefore remains self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink... rolling the sink content (i.e., at each AR step, we update the sink blocks' semantic content with a rolling segment from the within-duration history)"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery and 8-tick orbit (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the total cache capacity K is strictly bounded for streaming efficiency... S/K = 83%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
- Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
  Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video ...
- World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
  Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
- Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
  Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
- Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
  Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
- Stream-T1: Test-Time Scaling for Streaming Video Generation
  Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
- Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
  Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
- One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
  A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.