Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Haodong Li; Manmohan Chandraker; Shaoteng Liu; Zhe Lin

arxiv: 2602.07775 · v6 · submitted 2026-02-08 · 💻 cs.CV


Pith reviewed 2026-05-16 06:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords autoregressive video diffusion · long video generation · training-free method · cache maintenance · temporal consistency · video synthesis · open-ended generation · Self Forcing


The pith

Rolling Sink lets autoregressive video models trained on five-second clips generate consistent videos lasting many minutes at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive video diffusion models suffer rapid degradation when generating videos much longer than their short training clips because errors accumulate in the model's internal state over extended sequences. The paper analyzes how the autoregressive cache is maintained across generation steps and derives Rolling Sink as a training-free adjustment that keeps the cache from drifting. This enables open-ended testing horizons of five to thirty minutes while preserving subject identity, color stability, structural coherence, and motion smoothness. A reader would care because it removes the computational barrier of training on long videos and makes practical long-form video synthesis feasible with existing short-clip models. The core insight is that targeted cache management during inference can close the train-test gap without retraining.

Core claim

Rolling Sink is a training-free technique obtained from systematic analysis of autoregressive cache maintenance; when applied to models such as Self Forcing that were trained only on five-second clips, it scales video synthesis to open-ended durations of five to thirty minutes at sixteen frames per second while maintaining consistent subjects, stable colors, coherent structures, and smooth motions.
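To make the scale of this extrapolation concrete, the short sketch below converts the stated durations into frame counts at 16 FPS. The helper name `frames` is illustrative, and the calculation counts output frames only; it ignores any temporal compression in the model's latent space, which this page does not specify.

```python
FPS = 16  # frame rate stated in the core claim

def frames(seconds: float, fps: int = FPS) -> int:
    """Number of video frames in a clip of the given duration."""
    return int(round(seconds * fps))

train_frames = frames(5)        # 5-second training clips
short_test = frames(5 * 60)     # 5-minute test horizon
long_test = frames(30 * 60)     # 30-minute test horizon

print(train_frames, short_test, long_test)                     # 80 4800 28800
print(short_test // train_frames, long_test // train_frames)   # 60 360
```

Under these assumptions the test horizon is 60 to 360 times the training horizon, which is the sense in which the generation runs for thousands of frames beyond anything seen in training.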

What carries the argument

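Rolling Sink is a training-free adjustment to the autoregressive cache. In the paper's words, at each AR step the sink blocks' semantic content is updated with a rolling segment from the within-duration history; the aim is to limit error propagation during long-horizon generation.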
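A toy sketch of the kind of cache bookkeeping this describes is shown below. It is not the authors' implementation: the function `build_cache`, the string block labels, the cyclic choice of the rolling offset, and the sizes (a capacity of 6 blocks with 5 sink blocks, picked only because 5/6 is about 83%, the S/K ratio mentioned in the Figure 6 caption) are all assumptions made for illustration. The only detail taken from the paper is that the sink content is refreshed at each AR step with a rolling segment from the within-duration history.

```python
from typing import List

def build_cache(blocks: List[str], step: int, capacity: int,
                sink_size: int, train_blocks: int) -> List[str]:
    """Toy cache used to condition the generation of block `step`.

    blocks       : labels of all blocks generated so far (hypothetical)
    capacity     : total cache size K, in blocks
    sink_size    : number of sink blocks S (must be smaller than capacity)
    train_blocks : number of blocks that fit inside the training duration
    """
    if not 0 < sink_size < capacity:
        raise ValueError("need 0 < sink_size < capacity")
    if train_blocks <= 0:
        raise ValueError("train_blocks must be positive")
    past = blocks[:step]
    if len(past) <= capacity:
        return list(past)  # everything still fits, no eviction needed
    recent = past[-(capacity - sink_size):]  # sliding window of newest blocks
    history = past[:train_blocks]            # within-duration history
    # Assumed rule: the sink is a window over `history` whose start advances
    # by one block per AR step and wraps around cyclically.
    start = (step - capacity) % len(history)
    sink = [history[(start + i) % len(history)] for i in range(sink_size)]
    return sink + recent

blocks = [f"b{i}" for i in range(40)]
for step in (3, 10, 11, 30):
    print(step, build_cache(blocks, step, capacity=6, sink_size=5, train_blocks=7))
```

A static attention sink would instead pin the same first blocks for the entire generation; in this sketch the sink window keeps moving while the cache size stays fixed at `capacity`.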

If this is right

Models trained on five-second clips can now produce five-to-thirty-minute videos at sixteen frames per second with stable visual quality.
Long-horizon fidelity and temporal consistency exceed those of current state-of-the-art baselines on the same short-trained models.
No additional training or longer data is required to reach open-ended generation lengths.
Subject identity, color constancy, and motion smoothness remain intact across the extended sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same cache-rolling principle might apply to autoregressive generation in other domains such as audio waveforms or long text sequences.
Combining Rolling Sink with occasional fine-tuning on medium-length clips could further reduce residual drift.
The method implies that cache state management is a primary bottleneck when scaling autoregressive diffusion beyond training horizons.
Testing the approach at higher frame rates or resolutions would show whether the cache rules remain sufficient.

Load-bearing premise

The assumption that a fixed set of cache-maintenance rules derived from short-horizon analysis will continue to prevent degradation at arbitrary test lengths without introducing fresh artifacts.

What would settle it

Run a thirty-minute generation with Rolling Sink and compare frame-by-frame consistency metrics against the same model without the cache adjustment; persistent degradation equal to the baseline would falsify the claim.
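One way such a comparison could be scored is sketched below. The function names, the choice to average over the final 20% of the video, the 0.05 margin, and the two score curves are all hypothetical; the curves are synthetic placeholders, not results from the paper.

```python
from statistics import mean
from typing import Sequence

def tail_mean(scores: Sequence[float], tail_fraction: float = 0.2) -> float:
    """Mean per-frame consistency score over the final part of a video."""
    n = max(1, int(len(scores) * tail_fraction))
    return mean(scores[-n:])

def claim_survives(baseline: Sequence[float], with_sink: Sequence[float],
                   margin: float = 0.05) -> bool:
    """True if the adjusted run ends clearly above the unadjusted run."""
    return tail_mean(with_sink) - tail_mean(baseline) > margin

# Synthetic curves for illustration only: higher means more consistent.
baseline = [1.0 - 0.009 * i for i in range(100)]   # drifts quickly
with_sink = [1.0 - 0.001 * i for i in range(100)]  # drifts slowly

print(claim_survives(baseline, with_sink))  # True
print(claim_survives(baseline, baseline))   # False: no gap over the baseline
```

Comparing only the tail of the sequence targets the stated failure mode, degradation that persists late in the generation, rather than averaging it away against the well-behaved early frames.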

Figures

Figures reproduced from arXiv: 2602.07775 by Haodong Li, Manmohan Chandraker, Shaoteng Liu, Zhe Lin.

**Figure 1.** Rolling Sink unlocks open-ended AR video generation. Despite a 5s training duration, Rolling Sink effectively scales the AR video synthesis to minutes long during testing, e.g., 5-minute and 30-minute (please see Fig. S28, S29 in our Supp).

**Figure 2.** Bridging the gap between limited-horizon training and open-ended testing. Self Forcing [39] studies the train-test gap when testing within the training window (i.e., 5s at 16 FPS), while we extend the focus to the train-test gap that emerges when testing beyond the training window. Generating a long video (e.g., a movie) typically requires a “multi-shot” input, i.e., a sequence of prompts. Each shot typi…

**Figure 3.** Overview of our analysis and the proposed…

**Figure 4.** Evaluation results during the systematic analysis

**Figure 5.** Visual comparisons across various sink sizes.

**Figure 6.** Visual comparisons of sliding indices and sliding semantics (when S/K = 83%). Incorporating sliding indices and then sliding semantics consistently mitigates the artifacts (or AR drift). Following…

**Figure 7.** Qualitative comparisons of Rolling Sink with SOTA AR video synthesis baselines. When extrapolating beyond the training horizon, SOTA baselines often exhibit rapid AR drift, leading to noticeable visual degradation (e.g., over-saturated colors, collapsed structures, etc.). In contrast, Rolling Sink substantially reduces the AR drift, preserving stable identities and scene structure while maintaining cohere…

**Figure 8.** Radar charts of quantitative comparisons

Abstract

Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Rolling Sink gives a practical training-free way to stretch short-trained AR video models to multi-minute outputs via cache rules, but the claims need numbers and bounds to hold up.


Rolling Sink is a training-free method that analyzes how the autoregressive cache is maintained so that models trained on short clips can generate much longer videos without rapid degradation. The authors start from Self Forcing, which already tackles the gap inside the training horizon, and extend the idea to open-ended testing by deriving a rolling sink rule from cache behavior. This leads to claims of consistent subjects, stable colors, and smooth motions out to 5-30 minutes at 16 FPS from a model trained on 5s clips. The approach is sensible because retraining on long videos is expensive, so an inference-only solution has real appeal.

The paper does well in identifying the specific train-test gap beyond the training length and in presenting a concrete strategy based on systematic cache analysis. If the full experiments show clear improvements over baselines in fidelity and consistency, this could be a useful addition for practical long-form video synthesis.

The main soft spot is the absence of quantitative metrics or detailed ablations in the provided abstract, which makes the superiority claims hard to evaluate precisely. The stress-test concern holds some weight here: no derivation or bound is shown for why the sink rule prevents cumulative denoising errors and conditioning drift over thousands of frames. The method seems to rely on empirical results rather than a guarantee that it works for arbitrary horizons without new degradations creeping in.

This paper is for researchers working on autoregressive video diffusion who are dealing with length limitations in generation. Readers focused on inference techniques and cache management in diffusion models would get the most value, particularly if they want to try extending existing short-trained models. It deserves peer review because the problem is central to scaling these models and the proposed solution is straightforward enough to test and refine.

Referee Report

2 major / 2 minor

Summary. The paper introduces Rolling Sink, a training-free technique for autoregressive video diffusion models. Building on Self Forcing (trained only on 5-second clips), the method performs a systematic analysis of AR cache maintenance to derive a sink rule that purportedly bridges the train-test gap for open-ended testing horizons. It claims to enable generation of ultra-long videos (5–30 minutes at 16 FPS) while preserving subject consistency, color stability, structural coherence, and motion smoothness, outperforming SOTA baselines in long-horizon fidelity and temporal consistency.

Significance. If the empirical results and generalization hold, the work would represent a meaningful contribution to long-form video synthesis by eliminating the need for computationally prohibitive long-horizon training. The training-free character and grounding in cache-behavior analysis are notable strengths; successful scaling from 5 s to thousands of frames without new degradations would have clear practical value for applications requiring extended coherent video.

major comments (2)

[Abstract / Methods] Abstract and Methods: The central claim that the Rolling Sink rule bounds cumulative denoising error and conditioning drift for arbitrary horizons (5–30 min) rests on an unstated premise with no explicit error-bound derivation or invariant provided; the skeptic concern is therefore load-bearing because the manuscript supplies no mathematical guarantee once the finite training window is exceeded.
[Experiments] Experiments: The abstract asserts 'superior long-horizon visual fidelity and temporal consistency' and 'extensive experiments' yet supplies no quantitative metrics (e.g., FVD, subject-consistency scores, or long-horizon ablations) or details on how consistency is measured over thousands of frames; without these, the superiority claim cannot be evaluated.

minor comments (2)

[Methods] Notation for the sink rule and cache-maintenance operations should be defined more explicitly with equations to allow reproduction.
[Experiments] The project page link is given but the manuscript should include a brief summary of the qualitative examples shown there.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing Rolling Sink. We address each major comment point by point below, providing clarifications on our empirical approach and committing to revisions where the manuscript can be strengthened.


Referee: [Abstract / Methods] Abstract and Methods: The central claim that the Rolling Sink rule bounds cumulative denoising error and conditioning drift for arbitrary horizons (5–30 min) rests on an unstated premise with no explicit error-bound derivation or invariant provided; the skeptic concern is therefore load-bearing because the manuscript supplies no mathematical guarantee once the finite training window is exceeded.

Authors: We appreciate the referee's emphasis on the distinction between empirical derivation and formal guarantees. Our manuscript does not claim or provide a rigorous mathematical error bound, invariant, or proof that Rolling Sink guarantees bounded drift for arbitrary horizons. The method is instead derived from a systematic analysis of observed cache behaviors and error accumulation patterns in autoregressive video diffusion, building directly on the Self Forcing framework. We identify practical rules that mitigate the train-test gap beyond the 5-second training horizon and validate them through long-horizon generations. While a theoretical bound would strengthen the work, deriving one for stochastic diffusion processes in this setting is an open research question and outside the current scope; the contribution lies in the training-free, analysis-driven solution that enables practical ultra-long synthesis. revision: no
Referee: [Experiments] Experiments: The abstract asserts 'superior long-horizon visual fidelity and temporal consistency' and 'extensive experiments' yet supplies no quantitative metrics (e.g., FVD, subject-consistency scores, or long-horizon ablations) or details on how consistency is measured over thousands of frames; without these, the superiority claim cannot be evaluated.

Authors: We agree that the current manuscript version prioritizes qualitative visual results and comparisons in the main text, which limits the ability to fully evaluate the superiority claims. In the revision, we will incorporate quantitative metrics into the main paper, including FVD scores computed on long sequences, subject consistency via averaged CLIP embedding similarities across sampled frames, color stability via histogram distances, and motion smoothness via optical flow metrics. We will also add details on the evaluation protocol: metrics are computed by sampling frames at fixed intervals (e.g., every 50–100 frames) over the full generation length and averaging across multiple independent long videos (5–30 minutes at 16 FPS). Long-horizon ablations will be included to isolate the effect of the Rolling Sink rule. revision: yes
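The evaluation protocol outlined in this simulated response can be sketched in a few lines. This is a hedged illustration, not the authors' evaluation code: the function names are hypothetical, the embeddings and histograms are plain lists of floats standing in for CLIP features and color histograms, and using the first sampled frame as the reference for subject consistency is an assumption, since the response does not say which frames are compared.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def sample_indices(num_frames: int, stride: int) -> List[int]:
    """Frame indices taken at a fixed interval over the whole video."""
    return list(range(0, num_frames, stride))

def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def subject_consistency(embeddings: Sequence[Vector], stride: int) -> float:
    """Mean similarity of each sampled frame to the first sampled frame."""
    idx = sample_indices(len(embeddings), stride)
    if len(idx) < 2:
        raise ValueError("need at least two sampled frames")
    ref = embeddings[idx[0]]
    sims = [cosine_similarity(ref, embeddings[i]) for i in idx[1:]]
    return sum(sims) / len(sims)

def histogram_distance(h1: Vector, h2: Vector) -> float:
    """L1 distance between two normalized color histograms."""
    return sum(abs(p - q) for p, q in zip(h1, h2))

def average_over_videos(scores: Sequence[float]) -> float:
    """Average one metric across independently generated videos."""
    return sum(scores) / len(scores)

# A 5-minute video at 16 FPS sampled every 100 frames gives 48 samples.
print(len(sample_indices(5 * 60 * 16, 100)))  # 48
```

Sampling at a fixed stride keeps the cost of a 30-minute evaluation manageable, at the price of missing artifacts shorter than the stride; an optical-flow smoothness metric would need consecutive frames and is left out of this sketch.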

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical cache analysis independent of target outcome

full rationale

The paper's central derivation proceeds from a systematic analysis of AR cache maintenance during inference (beyond the 5s training horizon of the base Self Forcing model) to the design of the Rolling Sink rule. No equation or claim reduces the long-horizon fidelity result to a fitted parameter, a self-citation that itself assumes the result, or a renaming of an input pattern. The generalization to 5-30 minute videos is presented as an empirical outcome of the cache rule rather than a quantity forced by construction from the limited-horizon training data. Self-citation to Self Forcing is present but serves only as the base model; it is not invoked as a uniqueness theorem or load-bearing justification for the unbounded-horizon claim. The approach therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method builds on existing autoregressive diffusion frameworks without introducing new postulated components.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink... rolling the sink content (i.e., at each AR step, we update the sink blocks’ semantic content with a rolling segment from the within-duration history)

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and 8-tick orbit · relation to the paper passage: unclear

Paper passage: the total cache capacity K is strictly bounded for streaming efficiency... S/K = 83%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video ...
World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
cs.CV 2026-05 unverdicted novelty 6.0

Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
Stream-T1: Test-Time Scaling for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV 2026-03 unverdicted novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
cs.CV 2026-05 unverdicted novelty 5.0

A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · cited by 7 Pith papers · 43 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 30

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Cosmos World Foundation Model Platform for Physical AI

Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025) 30

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

World Simulation with Video Foundation Models for Physical AI

Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025) 30

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., Arnaud, S., Gejji, A., Martin, A., Robert Hogan, F., Dugas, D., Bojanowski, P., Khalidov, V., Labatut, P., Massa, F., Szafraniec, M., Krishnakumar, K., Li, Y., Ma, X., Chandar, S., Meier, F., LeCun, Y., Rabbat, M., Ballas, N.: ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Ball, P.J., Bauer, J., Belletti, F., Brownfield, B., Ephrat, A., Fruchter, S., Gupta, A., Holsheimer, K., Holynski, A., Hron, J., Kaplanis, C., Limont, M., McGill, M., Oliveira, Y., Parker-Holder, J., Perbet, F., Scully, G., Shar, J., Spencer, S., Tov, O., Villegas, R., Wang, E., Yung, J., Baetu, C., Berbel, J., Bridson, D., Bruce, J., Buttimore, G., Chak...

work page 2025
[6]

Advances in neural information pro- cessing systems28(2015) 3

Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information pro- cessing systems28(2015) 3

work page 2015
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023) 30

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023) 30

work page 2023
[9]

OpenAI Blog1(8), 1 (2024) 30

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog1(8), 1 (2024) 30

work page 2024
[10]

Advances in neural information processing systems33, 1877–1901 (2020) 30 50 H

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few- shot learners. Advances in neural information processing systems33, 1877–1901 (2020) 30 50 H. Li et al

work page 1901
[11]

In: Forty-first International Conference on Machine Learning (2024) 30

Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interac- tive environments. In: Forty-first International Conference on Machine Learning (2024) 30

work page 2024
[12]

Advances in Neural Information Processing Systems37, 24081–24125 (2024) 5, 30

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2024) 5, 30

work page 2024
[13]

SkyReels-V2: Infinite-length Film Generative Model

Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025) 5, 30

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023) 30

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

Chung, H.W., Constant, N., Garcia, X., Roberts, A., Tay, Y., Narang, S., Firat, O.: Unimax: Fairer and more effective language sampling for large-scale multilin- gual pretraining. arXiv preprint arXiv:2304.09151 (2023) 6

work page arXiv 2023
[16]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025) 3, 5, 30

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Autoregressive Video Generation without Vector Quantization

Deng, H., Pan, T., Diao, H., Luo, Z., Cui, Y., Lu, H., Shan, S., Qi, Y., Wang, X.: Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169 (2024) 30

work page internal anchor Pith review arXiv 2024
[18]

Ca2-vdm: Efficient autore- gressive video diffusion model with causal generation and cache sharing,

Gao, K., Shi, J., Zhang, H., Wang, C., Xiao, J., Chen, L.: Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing. arXiv preprint arXiv:2411.16375 (2024) 5, 30

work page arXiv 2024
[19]

arXiv preprint arXiv:2512.12167 (2025) 44

Gelberg, Y., Eguchi, K., Akiba, T., Cetin, E.: Extending the context of pretrained llms by dropping their positional embeddings. arXiv preprint arXiv:2512.12167 (2025) 44

work page arXiv 2025
[20]

Emu video: Factorizing text-to-video generation by explicit image conditioning

Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S.S., Shah, A., Yin, X., Parikh, D., Misra, I.: Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709 (2023) 30

work page arXiv 2023
[21]

google/models/veo/(2025) 2, 30

Google: Introducing veo 3, our video generation model with expanded creative controls – including native audio and extended videos.https://deepmind. google/models/veo/(2025) 2, 30

work page 2025
[22]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024) 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model

Gu, S., Yin, W., Jin, B., Guo, X., Wang, J., Li, H., Zhang, Q., Long, X.: Dome: Taming diffusion model into high-fidelity controllable occupancy world model. arXiv preprint arXiv:2410.10429 (2024) 30

work page arXiv 2024
[24]

When Attention Sink Emerges in Language Models: An Empirical View

Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., Lin, M.: When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781 (2024) 7

work page internal anchor Pith review arXiv 2024
[25]

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Gu, Y., Mao, W., Shou, M.Z.: Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325 (2025) 5, 30

work page internal anchor Pith review arXiv 2025
[26]

End-to-end training for autoregressive video diffusion via self-resampling.arXiv preprint arXiv:2512.15702,

Guo, Y., Yang, C., He, H., Zhao, Y., Wei, M., Yang, Z., Huang, W., Lin, D.: End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702 (2025) 30

work page arXiv 2025
[27]

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al

Guo, Y., Yang, C., Yang, Z., Ma, Z., Lin, Z., Yang, Z., Lin, D., Jiang, L.: Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025) 30 Rolling Sink51

work page arXiv 2025
[28]

In: European Conference on Computer Vision

Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: European Conference on Computer Vision. pp. 393–411. Springer (2024) 30

work page 2024
[29]

LTX-Video: Realtime Video Latent Diffusion

HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024) 30

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Disenvisioner: Disentangled and enriched visual prompt for customized image generation,

He, J., Li, H., Hu, Y., Shen, G., Cai, Y., Qiu, W., Chen, Y.C.: Disenvisioner: Disentangled and enriched visual prompt for customized image generation. arXiv preprint arXiv:2410.02067 (2024) 30

work page arXiv 2024
[31]

arXiv preprint arXiv:2512.01030 (2025) 30

He, J., Li, H., Sheng, M., Chen, Y.C.: Lotus-2: Advancing geometric dense pre- diction with powerful image generative model. arXiv preprint arXiv:2512.01030 (2025) 30

work page internal anchor Pith review arXiv 2025
[32]

Lotus: Diffusion-based visual foundation model for high-quality dense prediction

He, J., Li, H., Yin, W., Liang, Y., Li, L., Zhou, K., Zhang, H., Liu, B., Chen, Y.C.: Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124 (2024) 30

work page arXiv 2024
[33]

Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Henschel, R., Khachatryan, L., Hayrapetyan, D., Poghosyan, H., Tadevosyan, V., Wang, Z., Navasardyan, S., Shi, H.: Streamingt2v: Consistent, dynamic, and ex- tendable long video generation from text. arXiv preprint arXiv:2403.14773 (2024) 30

work page arXiv 2024
[34]

Imagen Video: High Definition Video Generation with Diffusion Models

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022) 30

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Advances in neural information processing systems33, 6840–6851 (2020) 30

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020) 30

work page 2020
[36]

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems 35, 8633–8646 (2022)
[37]

Hong, Y., Mei, Y., Ge, C., Xu, Y., Zhou, Y., Bi, S., Hold-Geoffroy, Y., Roberts, M., Fisher, M., Shechtman, E., et al.: Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040 (2025)
[38]

Hu, J., Hu, S., Song, Y., Huang, Y., Wang, M., Zhou, H., Liu, Z., Ma, W.Y., Sun, M.: Acdit: Interpolating autoregressive conditional modeling and diffusion transformer. arXiv preprint arXiv:2412.07720 (2024)
[39]

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
[40]

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
[41]

Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., Wang, Y., Chen, X., Chen, Y.C., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025). https://doi.org/10.1109/TPAMI.2025.36338904...
[42]

Ji, S., Chen, X., Yang, S., Tao, X., Wan, P., Zhao, H.: Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699 (2025)
[43]

Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.A., Stock, P., Scao, T.L., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mistral 7b (2025)
[44]

Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)
[45]

Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., Lin, Z.: Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954 (2024)
[46]

Kanervisto, A., Bignell, D., Wen, L.Y., Grayson, M., Georgescu, R., Valcarcel Macua, S., Tan, S.Z., Rashid, T., Pearce, T., Cao, Y., et al.: World and human action models towards gameplay ideation. Nature 638(8051), 656–663 (2025)
[47]

Kling: Kling video 2.6 – kling’s first “native audio” model official launched! https://app.klingai.com/global/release-notes/c605hp1tzd (2025)
[48]

Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023)
[49]

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)
[50]

Kubrick, S.: The shining. https://en.wikipedia.org/wiki/The_Shining_(film) (1980)
[51]

Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)
[52]

Labs, B.F.: Flux.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2 (2025)
[53]

Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint ...
[54]

Lamb, A.M., ALIAS PARTH GOYAL, A.G., Zhang, Y., Zhang, S., Courville, A.C., Bengio, Y.: Professor forcing: A new algorithm for training recurrent networks. Advances in neural information processing systems 29 (2016)
[55]

Li, C., Wang, R., Zhou, L., Feng, J., Luo, H., Zhang, H., Wu, Y., He, X.: Joyavatar: Real-time and infinite audio-driven avatar generation with autoregressive diffusion. arXiv preprint arXiv:2512.11423 (2025)
[56]

Li, H., Zheng, W., He, J., Liu, Y., Lin, X., Yang, X., Chen, Y.C., Guo, C.: Da2: Depth anything in any direction. arXiv preprint arXiv:2509.26618 (2025)
[57]

Li, M., Qu, T., Yao, R., Sun, W., Moens, M.F.: Alleviating exposure bias in diffusion models through sampling with shifted time steps. arXiv preprint arXiv:2305.15583 (2023)
[58]

Li, X.L., Li, H., Chen, H.X., Mu, T.J., Hu, S.M.: Discene: Object decoupling and interaction modeling for complex scene generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–12 (2024)
[59]

Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6517–6526 (2024)
[60]

Lin, S., Yang, C., He, H., Jiang, J., Ren, Y., Xia, X., Zhao, Y., Xiao, X., Jiang, L.: Autoregressive adversarial post-training for real-time interactive video generation. arXiv preprint arXiv:2506.09350 (2025)
[61]

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
[62]

Liu, H., Liu, S., Zhou, Z., Xu, M., Xie, Y., Han, X., Pérez, J.C., Liu, D., Kahatapitiya, K., Jia, M., et al.: Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280 (2024)
[63]

Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)
[64]

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
[65]

Low, C., Wang, W.: Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099 (2025)
[66]

Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., et al.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025)
[67]

Ma, X., Wang, Y., Chen, X., Jia, G., Liu, Z., Li, Y.F., Chen, C., Qiao, Y.: Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024)
[68]

McQueen, S.: Hunger. https://en.wikipedia.org/wiki/Hunger_(2008_film) (2008)
[69]

Ning, M., Li, M., Su, J., Salah, A.A., Ertugrul, I.O.: Elucidating the exposure bias in diffusion models. arXiv preprint arXiv:2308.15321 (2023)
[70]

OpenAI: Sora 2 is here. https://openai.com/index/sora-2/ (2025)
[71]

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)
[72]

Po, R., Chan, E.R., Chen, C., Wetzstein, G.: Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080 (2025)
[73]

Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024)
[74]

Qiu, H., Liu, S., Zhou, Z., An, Z., Ren, W., Liu, Z., Schult, J., He, S., Chen, S., Cong, Y., et al.: Histream: Efficient high-resolution video generation via redundancy-eliminated streaming. arXiv preprint arXiv:2512.21338 (2025)
[75]

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
[76]

Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015)
[77]

Ren, S., Ma, S., Sun, X., Wei, F.: Next block prediction: Video generation via semi-autoregressive modeling. arXiv preprint arXiv:2502.07737 (2025)
[78]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[79]

Runway: Introducing runway gen-4.5: A new frontier for video generation. https://runwayml.com/research/introducing-runway-gen-4.5 (2025)
[80]

Schmidt, F.: Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292 (2019)

[30] [30]

Disenvisioner: Disentangled and enriched visual prompt for customized image generation,

He, J., Li, H., Hu, Y., Shen, G., Cai, Y., Qiu, W., Chen, Y.C.: Disenvisioner: Disentangled and enriched visual prompt for customized image generation. arXiv preprint arXiv:2410.02067 (2024) 30

work page arXiv 2024

[31] [31]

arXiv preprint arXiv:2512.01030 (2025) 30

He, J., Li, H., Sheng, M., Chen, Y.C.: Lotus-2: Advancing geometric dense pre- diction with powerful image generative model. arXiv preprint arXiv:2512.01030 (2025) 30

work page internal anchor Pith review arXiv 2025

[32] [32]

Lotus: Diffusion-based visual foundation model for high-quality dense prediction

He, J., Li, H., Yin, W., Liang, Y., Li, L., Zhou, K., Zhang, H., Liu, B., Chen, Y.C.: Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124 (2024) 30

work page arXiv 2024

[33] [33]

Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Henschel, R., Khachatryan, L., Hayrapetyan, D., Poghosyan, H., Tadevosyan, V., Wang, Z., Navasardyan, S., Shi, H.: Streamingt2v: Consistent, dynamic, and ex- tendable long video generation from text. arXiv preprint arXiv:2403.14773 (2024) 30

work page arXiv 2024

[34] [34]

Imagen Video: High Definition Video Generation with Diffusion Models

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022) 30

work page internal anchor Pith review Pith/arXiv arXiv 2022

[35] [35]

Advances in neural information processing systems33, 6840–6851 (2020) 30

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020) 30

work page 2020

[36] [36]

Advances in neural information processing systems35, 8633– 8646 (2022) 30

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems35, 8633– 8646 (2022) 30

work page 2022

[37] [37]

Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040,

Hong, Y., Mei, Y., Ge, C., Xu, Y., Zhou, Y., Bi, S., Hold-Geoffroy, Y., Roberts, M., Fisher, M., Shechtman, E., et al.: Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040 (2025) 5, 30

work page arXiv 2025

[38] [38]

Acdit: Interpolating autoregressive conditional modeling and diffusion transformer.arXiv preprint arXiv:2412.07720,

Hu, J., Hu, S., Song, Y., Huang, Y., Wang, M., Zhou, H., Liu, Z., Ma, W.Y., Sun, M.: Acdit: Interpolating autoregressive conditional modeling and diffusion transformer. arXiv preprint arXiv:2412.07720 (2024) 5, 30

work page arXiv 2024

[39] [39]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025) 2, 3, 4, 5, 6, 7, 10, 13, 14, 30, 31, 43, 44

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4, 7, 10, 13, 31

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4, 7, 10, 13, 31

work page 2024

[41] [41]

VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., Wang, Y., Chen, X., Chen, Y.C., Wang, L., Lin, D., Qiao, Y., Liu, Z.:VBench++:Comprehensiveandversatilebenchmarksuiteforvideogenerative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025). https://doi.org/10.1109/TPAMI.2025.36338904...

work page doi:10.1109/tpami.2025.36338904 2025

[42] [42]

Memflow: Flowing adaptive memory for consistent and efficient long video narratives,

Ji, S., Chen, X., Yang, S., Tao, X., Wan, P., Zhao, H.: Memflow: Flowing adap- tive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699 (2025) 7, 30 52 H. Li et al

work page arXiv 2025

[43] [43]

Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.A., Stock, P., Scao, T.L., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mistral 7b (2025) 7

work page 2025

[44] [44]

Mixtral of Experts

Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024) 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., Lin, Z.: Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954 (2024) 5, 30

work page arXiv 2024

[46] [46]

Nature638(8051), 656–663 (2025) 30

Kanervisto, A., Bignell, D., Wen, L.Y., Grayson, M., Georgescu, R., Valcar- cel Macua, S., Tan, S.Z., Rashid, T., Pearce, T., Cao, Y., et al.: World and human action models towards gameplay ideation. Nature638(8051), 656–663 (2025) 30

work page 2025

[47] [47]

native audio

Kling: Kling video 2.6 – kling’s first “native audio” model official launched!https: //app.klingai.com/global/release-notes/c605hp1tzd(2025) 2, 30

work page 2025

[48] [48]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023) 30

work page internal anchor Pith review Pith/arXiv arXiv 2023

[49] [49]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 30

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

Kubrick, S.: The shining.https://en.wikipedia.org/wiki/The_Shining_ (film)(1980) 2

work page 1980

[51] [51]

Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024) 30

work page 2024

[52] [52]

Labs, B.F.: Flux.2: Frontier visual intelligence.https://bfl.ai/blog/flux-2 (2025) 30

work page 2025

[53] [53]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dock- horn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Advances in neural information processing systems29(2016) 3

Lamb, A.M., ALIAS PARTH GOYAL, A.G., Zhang, Y., Zhang, S., Courville, A.C., Bengio, Y.: Professor forcing: A new algorithm for training recurrent net- works. Advances in neural information processing systems29(2016) 3

work page 2016

[55] [55]

arXiv preprint arXiv:2512.11423 (2025) 30

Li, C., Wang, R., Zhou, L., Feng, J., Luo, H., Zhang, H., Wu, Y., He, X.: Joya- vatar: Real-time and infinite audio-driven avatar generation with autoregressive diffusion. arXiv preprint arXiv:2512.11423 (2025) 30

work page arXiv 2025

[56] [56]

Da 2: Depth anything in any direction,

Li, H., Zheng, W., He, J., Liu, Y., Lin, X., Yang, X., Chen, Y.C., Guo, C.: Da2: Depth anything in any direction. arXiv preprint arXiv:2509.26618 (2025) 30

work page arXiv 2025

[57] [57]

Alleviating exposure bias in diffusion mod- els through sampling with shifted time steps.arXiv preprint arXiv:2305.15583, 2023

Li, M., Qu, T., Yao, R., Sun, W., Moens, M.F.: Alleviating exposure bias in diffusion models through sampling with shifted time steps. arXiv preprint arXiv:2305.15583 (2023) 3

work page arXiv 2023

[58] [58]

In: SIGGRAPH Asia 2024 Conference Papers

Li, X.L., Li, H., Chen, H.X., Mu, T.J., Hu, S.M.: Discene: Object decoupling and interaction modeling for complex scene generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–12 (2024) 30

work page 2024

[59] [59]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6517–6526 (2024) 30 Rolling Sink53

work page 2024

[60] [60]

Autoregressive adversarial post- training for real-time interactive video generation

Lin, S., Yang, C., He, H., Jiang, J., Ren, Y., Xia, X., Zhao, Y., Xiao, X., Jiang, L.: Autoregressiveadversarialpost-trainingforreal-timeinteractivevideogeneration. arXiv preprint arXiv:2506.09350 (2025) 3

[61]

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 4, 30

[62]

Liu, H., Liu, S., Zhou, Z., Xu, M., Xie, Y., Han, X., Pérez, J.C., Liu, D., Kahatapitiya, K., Jia, M., et al.: Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280 (2024) 30

[63]

Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025) 3, 30

[64]

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022) 4, 30

[65]

Low, C., Wang, W.: Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099 (2025) 7

[66]

Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., et al.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025) 3, 5, 30

[67]

Ma, X., Wang, Y., Chen, X., Jia, G., Liu, Z., Li, Y.F., Chen, C., Qiao, Y.: Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024) 30

[68]

McQueen, S.: Hunger. https://en.wikipedia.org/wiki/Hunger_(2008_film) (2008) 2

[69]

Ning, M., Li, M., Su, J., Salah, A.A., Ertugrul, I.O.: Elucidating the exposure bias in diffusion models. arXiv preprint arXiv:2308.15321 (2023) 3

[70]

OpenAI: Sora 2 is here. https://openai.com/index/sora-2/ (2025) 2, 30

[71]

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 2, 10, 30

[72]

Po, R., Chan, E.R., Chen, C., Wetzstein, G.: Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080 (2025) 3, 5, 30

[73]

Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024) 30

[74]

Qiu, H., Liu, S., Zhou, Z., An, Z., Ren, W., Liu, Z., Schult, J., He, S., Chen, S., Cong, Y., et al.: Histream: Efficient high-resolution video generation via redundancy-eliminated streaming. arXiv preprint arXiv:2512.21338 (2025) 30

[75]

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019) 30

[76]

Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015) 3

[77]

Ren, S., Ma, S., Sun, X., Wei, F.: Next block prediction: Video generation via semi-autoregressive modeling. arXiv preprint arXiv:2502.07737 (2025) 30

[78]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 30

[79]

Runway: Introducing runway gen-4.5: A new frontier for video generation. https://runwayml.com/research/introducing-runway-gen-4.5 (2025) 2, 30

[80]

Schmidt, F.: Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292 (2019) 3
