arxiv: 2602.07775 · v6 · submitted 2026-02-08 · 💻 cs.CV

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Pith reviewed 2026-05-16 06:59 UTC · model grok-4.3

keywords: autoregressive video diffusion · long video generation · training-free method · cache maintenance · temporal consistency · video synthesis · open-ended generation · Self Forcing

The pith

Rolling Sink lets autoregressive video models trained on five-second clips generate consistent videos lasting many minutes at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive video diffusion models suffer rapid degradation when generating videos much longer than their short training clips because errors accumulate in the model's internal state over extended sequences. The paper analyzes how the autoregressive cache is maintained across generation steps and derives Rolling Sink as a training-free adjustment that keeps the cache from drifting. This enables open-ended testing horizons of five to thirty minutes while preserving subject identity, color stability, structural coherence, and motion smoothness. A reader would care because it removes the computational barrier of training on long videos and makes practical long-form video synthesis feasible with existing short-clip models. The core insight is that targeted cache management during inference can close the train-test gap without retraining.

Core claim

Rolling Sink is a training-free technique obtained from systematic analysis of autoregressive cache maintenance; when applied to models such as Self Forcing that were trained only on five-second clips, it scales video synthesis to open-ended durations of five to thirty minutes at sixteen frames per second while maintaining consistent subjects, stable colors, coherent structures, and smooth motions.
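
To make the scale of the claim concrete, the durations above can be converted into frame counts. This is a minimal arithmetic sketch that assumes the stated 16 FPS applies to output (pixel) frames; the helper name `frames_for` is hypothetical, and the count of latent frames inside the model may differ.

```python
FPS = 16  # frame rate stated in the abstract


def frames_for(seconds: float, fps: int = FPS) -> int:
    """Number of output frames in a clip of the given duration."""
    return int(round(seconds * fps))


train_frames = frames_for(5)        # 5-second training clips
short_test = frames_for(5 * 60)     # 5-minute generation
long_test = frames_for(30 * 60)     # 30-minute generation

print(train_frames, short_test, long_test)  # 80 4800 28800
print(short_test // train_frames, long_test // train_frames)  # 60 360
```

Under that assumption, a thirty-minute video is 360 times longer than anything the model saw during training.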

What carries the argument

Rolling Sink, a periodic adjustment to the autoregressive cache that rolls forward and resets accumulating state to limit error propagation during long-horizon generation.
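
For readers unfamiliar with cache policies of this family, the toy sketch below shows a generic "sink plus sliding window" cache: the earliest entries are pinned and the rest of the cache rolls forward. It is an illustration only, not the paper's Rolling Sink rule, whose details this page does not give; the class name `SinkWindowCache` and its parameters are hypothetical, and integers stand in for per-frame cache entries.

```python
from collections import deque
from typing import Deque, List


class SinkWindowCache:
    """Toy cache: pins the first `sink_size` entries, rolls a window over the rest."""

    def __init__(self, sink_size: int, window_size: int) -> None:
        self.sink_size = sink_size
        self.sink: List[int] = []
        # A bounded deque discards its oldest element once it is full.
        self.recent: Deque[int] = deque(maxlen=window_size)

    def append(self, frame_id: int) -> None:
        """Store a new entry, filling the sink first and then the rolling window."""
        if len(self.sink) < self.sink_size:
            self.sink.append(frame_id)
        else:
            self.recent.append(frame_id)

    def contents(self) -> List[int]:
        """Entries the model would attend to at the current step."""
        return self.sink + list(self.recent)


cache = SinkWindowCache(sink_size=2, window_size=3)
for t in range(10):
    cache.append(t)
print(cache.contents())  # [0, 1, 7, 8, 9]
```

The cache size stays fixed at `sink_size + window_size` however long generation runs, which is the property that makes open-ended horizons affordable in this style of policy.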

If this is right

  • Models trained on five-second clips can now produce five-to-thirty-minute videos at sixteen frames per second with stable visual quality.
  • Long-horizon fidelity and temporal consistency exceed those of current state-of-the-art baselines on the same short-trained models.
  • No additional training or longer data is required to reach open-ended generation lengths.
  • Subject identity, color constancy, and motion smoothness remain intact across the extended sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cache-rolling principle might apply to autoregressive generation in other domains such as audio waveforms or long text sequences.
  • Combining Rolling Sink with occasional fine-tuning on medium-length clips could further reduce residual drift.
  • The method implies that cache state management is a primary bottleneck when scaling autoregressive diffusion beyond training horizons.
  • Testing the approach at higher frame rates or resolutions would show whether the cache rules remain sufficient.

Load-bearing premise

The assumption that a fixed set of cache-maintenance rules derived from short-horizon analysis will continue to prevent degradation at arbitrary test lengths without introducing fresh artifacts.

What would settle it

Run a thirty-minute generation with Rolling Sink and compare frame-by-frame consistency metrics against the same model without the cache adjustment; persistent degradation equal to the baseline would falsify the claim.

Figures

Figures reproduced from arXiv: 2602.07775 by Haodong Li, Manmohan Chandraker, Shaoteng Liu, Zhe Lin.

Figure 1: Rolling Sink unlocks open-ended AR video generation. Despite a 5s training duration, Rolling Sink effectively scales the AR video synthesis to minutes long during testing, e.g., 5-minute and 30-minute (please see Fig. S28, S29 in our Supp).
Figure 2: Bridging the gap between limited-horizon training and open-ended testing. Self Forcing [39] studies the train-test gap when testing within the training window (i.e., 5s at 16 FPS), while we extend the focus to the train-test gap that emerges when testing beyond the training window. Generating a long video (e.g., a movie) typically requires a “multi-shot” input, i.e., a sequence of prompts. Each shot typi…
Figure 3: Overview of our analysis and the proposed
Figure 4: Evaluation results during the systematic analysis
Figure 5: Visual comparisons across various sink sizes.
Figure 6: Visual comparisons of sliding indices and sliding semantics (when S K = 83%). Incorporating sliding indices and then sliding semantics consistently mitigates the artifacts (or AR drift).
Figure 7: Qualitative comparisons of Rolling Sink with SOTA AR video synthesis baselines. When extrapolating beyond the training horizon, SOTA baselines often exhibit rapid AR drift, leading to noticeable visual degradation (e.g., over-saturated colors, collapsed structures, etc.). In contrast, Rolling Sink substantially reduces the AR drift, preserving stable identities and scene structure while maintaining cohere…
Figure 8: Radar charts of quantitative comparisons

Original abstract

Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Rolling Sink, a training-free technique for autoregressive video diffusion models. Building on Self Forcing (trained only on 5-second clips), the method performs a systematic analysis of AR cache maintenance to derive a sink rule that purportedly bridges the train-test gap for open-ended testing horizons. It claims to enable generation of ultra-long videos (5–30 minutes at 16 FPS) while preserving subject consistency, color stability, structural coherence, and motion smoothness, outperforming SOTA baselines in long-horizon fidelity and temporal consistency.

Significance. If the empirical results and generalization hold, the work would represent a meaningful contribution to long-form video synthesis by eliminating the need for computationally prohibitive long-horizon training. The training-free character and grounding in cache-behavior analysis are notable strengths; successful scaling from 5 s to thousands of frames without new degradations would have clear practical value for applications requiring extended coherent video.

major comments (2)
  1. [Abstract / Methods] Abstract and Methods: The central claim that the Rolling Sink rule bounds cumulative denoising error and conditioning drift for arbitrary horizons (5–30 min) rests on an unstated premise with no explicit error-bound derivation or invariant provided; the skeptic concern is therefore load-bearing because the manuscript supplies no mathematical guarantee once the finite training window is exceeded.
  2. [Experiments] Experiments: The abstract asserts 'superior long-horizon visual fidelity and temporal consistency' and 'extensive experiments' yet supplies no quantitative metrics (e.g., FVD, subject-consistency scores, or long-horizon ablations) or details on how consistency is measured over thousands of frames; without these, the superiority claim cannot be evaluated.
minor comments (2)
  1. [Methods] Notation for the sink rule and cache-maintenance operations should be defined more explicitly with equations to allow reproduction.
  2. [Experiments] The project page link is given but the manuscript should include a brief summary of the qualitative examples shown there.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing Rolling Sink. We address each major comment point by point below, providing clarifications on our empirical approach and committing to revisions where the manuscript can be strengthened.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: The central claim that the Rolling Sink rule bounds cumulative denoising error and conditioning drift for arbitrary horizons (5–30 min) rests on an unstated premise with no explicit error-bound derivation or invariant provided; the skeptic concern is therefore load-bearing because the manuscript supplies no mathematical guarantee once the finite training window is exceeded.

    Authors: We appreciate the referee's emphasis on the distinction between empirical derivation and formal guarantees. Our manuscript does not claim or provide a rigorous mathematical error bound, invariant, or proof that Rolling Sink guarantees bounded drift for arbitrary horizons. The method is instead derived from a systematic analysis of observed cache behaviors and error accumulation patterns in autoregressive video diffusion, building directly on the Self Forcing framework. We identify practical rules that mitigate the train-test gap beyond the 5-second training horizon and validate them through long-horizon generations. While a theoretical bound would strengthen the work, deriving one for stochastic diffusion processes in this setting is an open research question and outside the current scope; the contribution lies in the training-free, analysis-driven solution that enables practical ultra-long synthesis. revision: no

  2. Referee: [Experiments] Experiments: The abstract asserts 'superior long-horizon visual fidelity and temporal consistency' and 'extensive experiments' yet supplies no quantitative metrics (e.g., FVD, subject-consistency scores, or long-horizon ablations) or details on how consistency is measured over thousands of frames; without these, the superiority claim cannot be evaluated.

    Authors: We agree that the current manuscript version prioritizes qualitative visual results and comparisons in the main text, which limits the ability to fully evaluate the superiority claims. In the revision, we will incorporate quantitative metrics into the main paper, including FVD scores computed on long sequences, subject consistency via averaged CLIP embedding similarities across sampled frames, color stability via histogram distances, and motion smoothness via optical flow metrics. We will also add details on the evaluation protocol: metrics are computed by sampling frames at fixed intervals (e.g., every 50–100 frames) over the full generation length and averaging across multiple independent long videos (5–30 minutes at 16 FPS). Long-horizon ablations will be included to isolate the effect of the Rolling Sink rule. revision: yes
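
The evaluation protocol described in the rebuttal (sampling frames at a fixed interval and comparing color histograms) can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions: the function names are hypothetical, frames are represented as flat lists of 8-bit intensities, and the L1 distance between normalized histograms is one possible choice, since the rebuttal does not say which histogram distance would be used.

```python
from typing import List, Sequence


def sample_indices(total_frames: int, stride: int) -> List[int]:
    """Frame indices taken at a fixed interval over the whole video."""
    return list(range(0, total_frames, stride))


def histogram(values: Sequence[int], bins: int = 8, max_value: int = 256) -> List[float]:
    """Normalized intensity histogram of one frame (values in [0, max_value))."""
    if not values:
        raise ValueError("frame must contain at least one value")
    counts = [0] * bins
    for v in values:
        counts[min(v * bins // max_value, bins - 1)] += 1
    return [c / len(values) for c in counts]


def l1_histogram_distance(a: Sequence[int], b: Sequence[int], bins: int = 8) -> float:
    """L1 distance between two frames' histograms: 0.0 (identical) to 2.0 (disjoint)."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb))


# A 30-minute video at 16 FPS, sampled every 100 frames.
indices = sample_indices(30 * 60 * 16, 100)
print(len(indices))  # 288

dark_frame = [0, 0, 0, 0]
bright_frame = [255, 255, 255, 255]
print(l1_histogram_distance(dark_frame, dark_frame))    # 0.0
print(l1_histogram_distance(dark_frame, bright_frame))  # 2.0
```

Tracking this distance between the first sampled frame and each later one would give a simple color-drift curve over the generation length.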

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical cache analysis independent of target outcome

full rationale

The paper's central derivation proceeds from a systematic analysis of AR cache maintenance during inference (beyond the 5s training horizon of the base Self Forcing model) to the design of the Rolling Sink rule. No equation or claim reduces the long-horizon fidelity result to a fitted parameter, a self-citation that itself assumes the result, or a renaming of an input pattern. The generalization to 5-30 minute videos is presented as an empirical outcome of the cache rule rather than a quantity forced by construction from the limited-horizon training data. Self-citation to Self Forcing is present but serves only as the base model; it is not invoked as a uniqueness theorem or load-bearing justification for the unbounded-horizon claim. The approach therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method builds on existing autoregressive diffusion frameworks without introducing new postulated components.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video ...

  2. World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

    cs.CV 2026-05 unverdicted novelty 6.0

    Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.

  3. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  4. Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  5. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  6. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  7. One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

    cs.CV 2026-05 unverdicted novelty 5.0

    A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · cited by 7 Pith papers · 43 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 30

  2. [2]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025) 30

  3. [3]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025) 30

  4. [4]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., Arnaud, S., Gejji, A., Martin, A., Robert Hogan, F., Dugas, D., Bojanowski, P., Khalidov, V., Labatut, P., Massa, F., Szafraniec, M., Krishnakumar, K., Li, Y., Ma, X., Chandar, S., Meier, F., LeCun, Y., Rabbat, M., Ballas, N.: ...

  5. [5]

    Ball, P.J., Bauer, J., Belletti, F., Brownfield, B., Ephrat, A., Fruchter, S., Gupta, A., Holsheimer, K., Holynski, A., Hron, J., Kaplanis, C., Limont, M., McGill, M., Oliveira, Y., Parker-Holder, J., Perbet, F., Scully, G., Shar, J., Spencer, S., Tov, O., Villegas, R., Wang, E., Yung, J., Baetu, C., Berbel, J., Bridson, D., Bruce, J., Buttimore, G., Chak...

  6. [6]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems 28 (2015) 3

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023) 30

  8. [8]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023) 30

  9. [9]

    Video generation models as world simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024) 30

  10. [10]

    Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) 30

  11. [11]

    Genie: Generative interactive environments

    Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: Forty-first International Conference on Machine Learning (2024) 30

  12. [12]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 37, 24081–24125 (2024) 5, 30

  13. [13]

    SkyReels-V2: Infinite-length Film Generative Model

    Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025) 5, 30

  14. [14]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023) 30

  15. [15]

    Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

    Chung, H.W., Constant, N., Garcia, X., Roberts, A., Tay, Y., Narang, S., Firat, O.: Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151 (2023) 6

  16. [16]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025) 3, 5, 30

  17. [17]

    Autoregressive Video Generation without Vector Quantization

    Deng, H., Pan, T., Diao, H., Luo, Z., Cui, Y., Lu, H., Shan, S., Qi, Y., Wang, X.: Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169 (2024) 30

  18. [18]

    Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing

    Gao, K., Shi, J., Zhang, H., Wang, C., Xiao, J., Chen, L.: Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing. arXiv preprint arXiv:2411.16375 (2024) 5, 30

  19. [19]

    Extending the context of pretrained llms by dropping their positional embeddings

    Gelberg, Y., Eguchi, K., Akiba, T., Cetin, E.: Extending the context of pretrained llms by dropping their positional embeddings. arXiv preprint arXiv:2512.12167 (2025) 44

  20. [20]

    Emu video: Factorizing text-to-video generation by explicit image conditioning

    Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S.S., Shah, A., Yin, X., Parikh, D., Misra, I.: Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709 (2023) 30

  21. [21]

    Introducing veo 3, our video generation model with expanded creative controls – including native audio and extended videos

    Google: Introducing veo 3, our video generation model with expanded creative controls – including native audio and extended videos. https://deepmind.google/models/veo/ (2025) 2, 30

  22. [22]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024) 7

  23. [23]

    Dome: Taming diffusion model into high-fidelity controllable occupancy world model

    Gu, S., Yin, W., Jin, B., Guo, X., Wang, J., Li, H., Zhang, Q., Long, X.: Dome: Taming diffusion model into high-fidelity controllable occupancy world model. arXiv preprint arXiv:2410.10429 (2024) 30

  24. [24]

    When Attention Sink Emerges in Language Models: An Empirical View

    Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., Lin, M.: When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781 (2024) 7

  25. [25]

    Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    Gu, Y., Mao, W., Shou, M.Z.: Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325 (2025) 5, 30

  26. [26]

    End-to-end training for autoregressive video diffusion via self-resampling

    Guo, Y., Yang, C., He, H., Zhao, Y., Wei, M., Yang, Z., Huang, W., Lin, D.: End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702 (2025) 30

  27. [27]

    Long context tuning for video generation

    Guo, Y., Yang, C., Yang, Z., Ma, Z., Lin, Z., Yang, Z., Lin, D., Jiang, L.: Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025) 30

  28. [28]

    Photorealistic video generation with diffusion models

    Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: European Conference on Computer Vision. pp. 393–411. Springer (2024) 30

  29. [29]

    LTX-Video: Realtime Video Latent Diffusion

    HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024) 30

  30. [30]

    Disenvisioner: Disentangled and enriched visual prompt for customized image generation

    He, J., Li, H., Hu, Y., Shen, G., Cai, Y., Qiu, W., Chen, Y.C.: Disenvisioner: Disentangled and enriched visual prompt for customized image generation. arXiv preprint arXiv:2410.02067 (2024) 30

  31. [31]

    Lotus-2: Advancing geometric dense prediction with powerful image generative model

    He, J., Li, H., Sheng, M., Chen, Y.C.: Lotus-2: Advancing geometric dense prediction with powerful image generative model. arXiv preprint arXiv:2512.01030 (2025) 30

  32. [32]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction

    He, J., Li, H., Yin, W., Liang, Y., Li, L., Zhou, K., Zhang, H., Liu, B., Chen, Y.C.: Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124 (2024) 30

  33. [33]

    Streamingt2v: Consistent, dynamic, and extendable long video generation from text

    Henschel, R., Khachatryan, L., Hayrapetyan, D., Poghosyan, H., Tadevosyan, V., Wang, Z., Navasardyan, S., Shi, H.: Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773 (2024) 30

  34. [34]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022) 30

  35. [35]

    Denoising diffusion probabilistic models

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020) 30

  36. [36]

    Video diffusion models

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems 35, 8633–8646 (2022) 30

  37. [37]

    Relic: Interactive video world model with long-horizon memory

    Hong, Y., Mei, Y., Ge, C., Xu, Y., Zhou, Y., Bi, S., Hold-Geoffroy, Y., Roberts, M., Fisher, M., Shechtman, E., et al.: Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040 (2025) 5, 30

  38. [38]

    Acdit: Interpolating autoregressive conditional modeling and diffusion transformer

    Hu, J., Hu, S., Song, Y., Huang, Y., Wang, M., Zhou, H., Liu, Z., Ma, W.Y., Sun, M.: Acdit: Interpolating autoregressive conditional modeling and diffusion transformer. arXiv preprint arXiv:2412.07720 (2024) 5, 30

  39. [39]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025) 2, 3, 4, 5, 6, 7, 10, 13, 14, 30, 31, 43, 44

  40. [40]

    VBench: Comprehensive benchmark suite for video generative models

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4, 7, 10, 13, 31

  41. [41]

    VBench++: Comprehensive and versatile benchmark suite for video generative models

    Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., Wang, Y., Chen, X., Chen, Y.C., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025). https://doi.org/10.1109/TPAMI.2025.36338904...

  42. [42]

    Memflow: Flowing adaptive memory for consistent and efficient long video narratives

    Ji, S., Chen, X., Yang, S., Tao, X., Wan, P., Zhao, H.: Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699 (2025) 7, 30

  43. [43]

    Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.A., Stock, P., Scao, T.L., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mistral 7b (2025) 7

  44. [44]

    Mixtral of Experts

    Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024) 7

  45. [45]

    Pyramidal flow matching for efficient video generative modeling

    Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., Lin, Z.: Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954 (2024) 5, 30

  46. [46]

    World and human action models towards gameplay ideation

    Kanervisto, A., Bignell, D., Wen, L.Y., Grayson, M., Georgescu, R., Valcarcel Macua, S., Tan, S.Z., Rashid, T., Pearce, T., Cao, Y., et al.: World and human action models towards gameplay ideation. Nature 638(8051), 656–663 (2025) 30

  47. [47]

    Kling video 2.6 – kling’s first “native audio” model official launched!

    Kling: Kling video 2.6 – kling’s first “native audio” model official launched! https://app.klingai.com/global/release-notes/c605hp1tzd (2025) 2, 30

  48. [48]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023) 30

  49. [49]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 30

  50. [50]

    Kubrick, S.: The shining.https://en.wikipedia.org/wiki/The_Shining_ (film)(1980) 2

  51. [51]

    Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 30

  52. [52]

    Labs, B.F.: Flux.2: Frontier visual intelligence.https://bfl.ai/blog/flux-2 (2025) 30

  53. [53]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dock- horn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint ...

  54. [54]

    Advances in neural information processing systems29(2016) 3

    Lamb, A.M., ALIAS PARTH GOYAL, A.G., Zhang, Y., Zhang, S., Courville, A.C., Bengio, Y.: Professor forcing: A new algorithm for training recurrent net- works. Advances in neural information processing systems29(2016) 3

  55. [55]

    arXiv preprint arXiv:2512.11423 (2025) 30

    Li, C., Wang, R., Zhou, L., Feng, J., Luo, H., Zhang, H., Wu, Y., He, X.: Joya- vatar: Real-time and infinite audio-driven avatar generation with autoregressive diffusion. arXiv preprint arXiv:2512.11423 (2025) 30

  56. [56]

    Da 2: Depth anything in any direction,

    Li, H., Zheng, W., He, J., Liu, Y., Lin, X., Yang, X., Chen, Y.C., Guo, C.: Da2: Depth anything in any direction. arXiv preprint arXiv:2509.26618 (2025) 30

  57. [57]

    Alleviating exposure bias in diffusion mod- els through sampling with shifted time steps.arXiv preprint arXiv:2305.15583, 2023

    Li, M., Qu, T., Yao, R., Sun, W., Moens, M.F.: Alleviating exposure bias in diffusion models through sampling with shifted time steps. arXiv preprint arXiv:2305.15583 (2023) 3

  58. [58]

    In: SIGGRAPH Asia 2024 Conference Papers

    Li, X.L., Li, H., Chen, H.X., Mu, T.J., Hu, S.M.: Discene: Object decoupling and interaction modeling for complex scene generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–12 (2024) 30

  59. [59]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6517–6526 (2024) 30 Rolling Sink53

  60. [60]

    Autoregressive adversarial post- training for real-time interactive video generation

    Lin, S., Yang, C., He, H., Jiang, J., Ren, Y., Xia, X., Zhao, Y., Xiao, X., Jiang, L.: Autoregressiveadversarialpost-trainingforreal-timeinteractivevideogeneration. arXiv preprint arXiv:2506.09350 (2025) 3

  61. [61]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 4, 30

  62. [62]

    Mardini: Masked autoregressive diffusion for video generation at scale,

    Liu, H., Liu, S., Zhou, Z., Xu, M., Xie, Y., Han, X., Pérez, J.C., Liu, D., Ka- hatapitiya, K., Jia, M., et al.: Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280 (2024) 30

  63. [63]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025) 3, 30

  64. [64]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022) 4, 30

  65. [65]

    Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models,

    Low, C., Wang, W.: Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099 (2025) 7

  66. [66]

    Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., et al.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025) 3, 5, 30

  67. [67]

    Latte: Latent Diffusion Transformer for Video Generation

    Ma, X., Wang, Y., Chen, X., Jia, G., Liu, Z., Li, Y.F., Chen, C., Qiao, Y.: Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024) 30

  68. [68]

    McQueen, S.: Hunger.https://en.wikipedia.org/wiki/Hunger_(2008_film) (2008) 2

  69. [69]

    , author Li, M

    Ning, M., Li, M., Su, J., Salah, A.A., Ertugrul, I.O.: Elucidating the exposure bias in diffusion models. arXiv preprint arXiv:2308.15321 (2023) 3

  70. [70]

    OpenAI: Sora 2 is here.https://openai.com/index/sora-2/(2025) 2, 30

  71. [71]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 2, 10, 30

  72. [72]

    R., Chen, C., and Wetzstein, G

    Po, R., Chan, E.R., Chen, C., Wetzstein, G.: Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080 (2025) 3, 5, 30

  73. [73]

    Movie Gen: A Cast of Media Foundation Models

    Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024) 30

  74. [74]

    Histream: Efficient high-resolution video generation via redundancy-eliminated streaming.arXiv preprint arXiv:2512.21338,

    Qiu, H., Liu, S., Zhou, Z., An, Z., Ren, W., Liu, Z., Schult, J., He, S., Chen, S., Cong, Y., et al.: Histream: Efficient high-resolution video generation via redundancy-eliminated streaming. arXiv preprint arXiv:2512.21338 (2025) 30

  75. [75]

    OpenAI blog1(8), 9 (2019) 30

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Lan- guage models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019) 30

  76. [76]

    Sequence Level Training with Recurrent Neural Networks

    Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015) 3

  77. [77]

    arXiv preprint arXiv:2502.07737 (2025) 30

    Ren, S., Ma, S., Sun, X., Wei, F.: Next block prediction: Video generation via semi-autoregressive modeling. arXiv preprint arXiv:2502.07737 (2025) 30

  78. [78]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 30

  79. [79]

    Li et al

    Runway: Introducing runway gen-4.5: A new frontier for video generation.https: //runwayml.com/research/introducing-runway-gen-4.5(2025) 2, 30 54 H. Li et al

  80. [80]

    arXiv preprint arXiv:1910.00292 , year=

    Schmidt, F.: Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292 (2019) 3

Showing first 80 references.