pith. sign in

arxiv: 2606.22370 · v1 · pith:Y77KF723new · submitted 2026-06-21 · 💻 cs.CV

Towards Error-Free Long Video Generation

Pith reviewed 2026-06-26 10:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video generationdiffusion modelscausal attentionerror accumulationattribute driftrectified flowautoregressive videovideo synthesis
0
0 comments X

The pith

Finetuning a diffusion model first on short videos then with causal attention across clips plus T-RFlow produces minute-long videos without error accumulation or attribute drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome error buildup and identity drift that limit current video diffusion models to short clips. It starts by training a model on abundant short video data to generate coherent extensions autoregressively. It then switches to causal attention so that each new clip sees prior clips only in one direction while still using full bidirectional attention inside each clip, and adds a truncation-rectified flow step to dampen remaining drift. A reader would care because the method claims to reach consistent single-shot videos at minute scale using mostly short data plus limited long data.

Core claim

The framework first finetunes a diffusion model on large-scale short video data to serve as an autoregressive video extension model. It then continues finetuning with causal attention between clips on long video data so that tokens inside a clip receive bidirectional attention while tokens across clips receive only unidirectional attention. A KV cache keeps memory constant at inference time. The truncation-rectified flow technique is introduced to further reduce error accumulation, yielding high-quality, dynamic, identity-consistent videos at minute length.

What carries the argument

Causal attention computation between successive clips together with the truncation-rectified flow (T-RFlow) technique, which together preserve long-range context while retaining bidirectional modeling inside each short clip.

If this is right

  • Causal attention lets the model exploit long-video data for training while keeping the strengths of modern diffusion models on short clips.
  • KV caching makes inference memory independent of total video length.
  • T-RFlow supplies an additional error-suppression mechanism beyond the attention change.
  • The pipeline needs only short-video corpora for the first stage and modest long-video data for the second stage.
  • Identity and motion consistency are maintained across minute-scale single-shot outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same causal-attention split might reduce drift in other autoregressive generators such as long audio or text sequences.
  • Testing the method on multi-shot or variable-resolution videos would reveal whether the clip-boundary handling generalizes.
  • An end-to-end training regime on mixed short and long data could be compared directly against the staged approach.
  • Minute-scale coherent video opens direct use in simulation loops or narrative generation where current methods still require manual stitching.

Load-bearing premise

That the two-stage finetuning sequence with causal attention and T-RFlow will suppress error accumulation and attribute drift without introducing new inconsistencies that the model cannot correct.

What would settle it

Measure attribute consistency and perceptual quality on generated videos of steadily increasing length; if quality drops sharply after roughly 30-60 seconds the central claim is falsified.

Figures

Figures reproduced from arXiv: 2606.22370 by Bohan Zhuang, Fan Wang, Hangjie Yuan, Hao Xu, Jiasheng Tang, Ruigang Niu, Shuning Chang, Weihua Chen, Yi Yang, Yu Lu, Zeyu Zhang.

Figure 1
Figure 1. Figure 1: 30-second videos generated by both the Kling model and our model using the same initial frames. accumulation and attribute drift are the two most critical problems in long video generation, especially single-shot long video generation. Error accumu￾lation refers to the iterative degradation visual quality over time, whereas at￾tribute drift refers to the fading of memory as the model struggles to remember … view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of error accumulation trend along denoising process and video generation process. In the dimension of denoising process, the error accumulation, such as oversaturation, exists in the late timesteps. In the dimension of video generation process, the errors are irreversibly amplified over time. denoising network, modeled as uθ(z 1:M t , c, t). In the inference phase, we generate the video clip … view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our framework. The video extension model is from fine-tuning a video foundation model. After causal attention fine-tuning, this model generates long videos on a clip-by-clip paradigm, conditioned on KV cache and the last frames of the previous clip, thereby enhancing identity consistency and reducing error accumulation. To manage GPU memory efficiently, the KV cache is curated by our KV selecti… view at source ↗
Figure 4
Figure 4. Figure 4: A long video generated with T-RFlow. StoryDiffusion [53], to prevent from overfitting, we inject KV only into the latter half of the Transformer blocks. Consequently, LoRA is applied solely to the self￾attention layers in those blocks, and only the LoRA parameters are updated during training. Inference During inference, the video is generated clip-wisely. The generation of each clip is conditioned on indiv… view at source ↗
Figure 5
Figure 5. Figure 5: Visualizations of our method. Each video has a 30-second duration and is generated using multiple prompts. The specific prompts used can be found in our supplementary material [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: 30-second videos under three dif￾ferent settings [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
read the original abstract

Recent advances in video generation have made minute-level synthesis possible; however, generating long videos remains challenging due to error accumulation, attribute drift, and the limited availability of long video data. In this paper, we introduce an infinite-length video generation framework that focusing on addressing these issues and produces high-quality, dynamic, and identity-consistent single-shot long videos. We first finetune a diffusion model as a video extension model on large-scale short video data to autoregressively generate temporally coherent clips. Inspired by the success of large language models (LLMs), we adopt causal attention computation between clips to further finetune this model on long video data. In this way, the tokens in one clip (short video) are computed by bidirectional attention while tokens among clips are computed by unidirectional attention. This design leverages the strengths of modern diffusion models while preserving long-term context information, effectively mitigating error accumulation and attribute drift. To achieve memory efficiency during inference, we adopt a key-value (KV) caching mechanism to maintain a constant KV memory. Furthermore, we introduce truncation-rectified flow (T-RFlow) technique to further suppress error accumulation. Experimental results demonstrate the effectiveness of our method. Our framework establishes a new benchmark for realistic and coherent minute-level video synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces an infinite-length video generation framework based on two-stage finetuning of a diffusion model: first on short-video data to enable autoregressive clip generation, then on long-video data using causal attention between clips (while retaining bidirectional attention within each clip). It adds KV caching for constant memory and a truncation-rectified flow (T-RFlow) technique, claiming these steps mitigate error accumulation and attribute drift to produce identity-consistent minute-scale videos and establish a new benchmark.

Significance. If the two-stage finetuning plus T-RFlow demonstrably keeps drift below coherence-destroying thresholds, the work would advance practical long-video synthesis by adapting LLM-style causal attention to diffusion models. The KV-caching mechanism for inference efficiency is a practical contribution. However, the absence of any quantitative metrics, baselines, or stability analysis in the provided text makes it impossible to evaluate whether the central claim holds.

major comments (3)
  1. [Abstract] Abstract: the claim that the method 'establishes a new benchmark for realistic and coherent minute-level video synthesis' rests on the assertion that 'experimental results demonstrate the effectiveness,' yet no quantitative metrics, baselines, datasets, error bars, or ablation results are supplied; this is load-bearing for the central claim of error-free generation.
  2. [Abstract] Abstract (method description): the architecture retains full bidirectional attention inside each clip while making only inter-clip attention causal; no derivation, contraction mapping, or empirical bound is given showing that intra-clip denoising steps cannot re-introduce attribute drift that compounds across autoregressive steps, leaving the weakest assumption unaddressed.
  3. [Abstract] Abstract: the T-RFlow technique is introduced to 'further suppress error accumulation,' but no equation, algorithm, or ablation quantifies its effect relative to standard rectified flow, making its contribution to the 'error-free' claim impossible to assess.
minor comments (2)
  1. [Abstract] Abstract contains a grammatical error: 'framework that focusing on' should be 'framework that focuses on.'
  2. [Title] The title uses 'Error-Free,' which is stronger than the body text's more qualified language about 'mitigating' drift; this mismatch should be reconciled.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and method details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method 'establishes a new benchmark for realistic and coherent minute-level video synthesis' rests on the assertion that 'experimental results demonstrate the effectiveness,' yet no quantitative metrics, baselines, datasets, error bars, or ablation results are supplied; this is load-bearing for the central claim of error-free generation.

    Authors: We agree that the abstract would be strengthened by referencing specific results. The full manuscript includes quantitative evaluations (coherence metrics, baseline comparisons, and user studies) in the experiments section; we will revise the abstract to concisely cite key metrics and datasets while preserving length constraints. revision: yes

  2. Referee: [Abstract] Abstract (method description): the architecture retains full bidirectional attention inside each clip while making only inter-clip attention causal; no derivation, contraction mapping, or empirical bound is given showing that intra-clip denoising steps cannot re-introduce attribute drift that compounds across autoregressive steps, leaving the weakest assumption unaddressed.

    Authors: The hybrid attention design follows the LLM paradigm to preserve per-clip generation quality while limiting error propagation across clips. We acknowledge the absence of a formal bound; the revised manuscript will add a dedicated discussion paragraph explaining the empirical stability observed during two-stage finetuning and the role of causal inter-clip conditioning in containing drift. revision: partial

  3. Referee: [Abstract] Abstract: the T-RFlow technique is introduced to 'further suppress error accumulation,' but no equation, algorithm, or ablation quantifies its effect relative to standard rectified flow, making its contribution to the 'error-free' claim impossible to assess.

    Authors: We will add the full mathematical definition of truncation-rectified flow, the corresponding algorithm, and ablation results quantifying its reduction in error accumulation versus standard rectified flow in the revised methods and experiments sections. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical finetuning procedure with no self-referential derivations or load-bearing self-citations

full rationale

The paper presents an empirical framework consisting of two-stage finetuning of a diffusion model (first on short videos with bidirectional attention, then on long videos with causal inter-clip attention) plus KV caching and the T-RFlow technique. No equations, uniqueness theorems, or derivations appear that reduce the claimed mitigation of error accumulation to a fitted quantity or self-citation by construction. The central claims rest on experimental results rather than any mathematical reduction to inputs. No self-citations are invoked as load-bearing premises, and the method does not rename known results or smuggle ansatzes via prior work. The derivation chain is self-contained as a description of architectural and training choices whose effectiveness is asserted via evaluation, not by definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full manuscript would be required to enumerate any fitted scales, background assumptions, or new postulated components such as the precise definition of T-RFlow.

pith-pipeline@v0.9.1-grok · 5777 in / 1008 out tokens · 24424 ms · 2026-06-26T10:47:19.663408+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 19 linked inside Pith

  1. [1]

    In: SIGGRAPH Asia 2024 Conference Papers

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

  2. [2]

    arXiv preprint arXiv:2311.15127 (2023)

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  3. [3]

    arXiv preprint arXiv:2508.21058 (2025)

    Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., et al.: Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058 (2025)

  4. [4]

    Advances in Neural Information Processing Systems38, 91456–91483 (2026)

    Chang, S., Wang, P., Tang, J., Wang, F., Yang, Y.: Sparsedit: Token sparsifica- tion for efficient diffusion transformer. Advances in Neural Information Processing Systems38, 91456–91483 (2026)

  5. [5]

    arXiv preprint arXiv:2504.13074 (2025)

    Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)

  6. [6]

    arXiv preprint arXiv:2310.19512 (2023)

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)

  7. [7]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y.: Videocrafter2: Overcoming data limitations for high-quality video diffusion mod- els. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7310–7320 (2024) 14 Authors Suppressed Due to Excessive Length

  8. [8]

    In: The Twelfth International Conference on Learning Representations (2023)

    Chen, X., Wang, Y., Zhang, L., Zhuang, S., Ma, X., Yu, J., Wang, Y., Lin, D., Qiao, Y., Liu, Z.: Seine: Short-to-long video diffusion model for generative transition and prediction. In: The Twelfth International Conference on Learning Representations (2023)

  9. [9]

    In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference

    Dalal, K., Koceja, D., Xu, J., Zhao, Y., Han, S., Cheung, K.C., Kautz, J., Choi, Y., Sun, Y., Wang, X.: One-minute video generation with test-time training. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference. pp. 17702– 17711 (2025)

  10. [10]

    In: Forty-first international conference on machine learning (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

  11. [11]

    arXiv preprint arXiv:2503.19325 (2025)

    Gu, Y., Mao, W., Shou, M.Z.: Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325 (2025)

  12. [12]

    arXiv preprint arXiv:2307.04725 (2023)

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  13. [13]

    arXiv preprint arXiv:2503.10589 (2025)

    Guo, Y., Yang, C., Yang, Z., Ma, Z., Lin, Z., Yang, Z., Lin, D., Jiang, L.: Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025)

  14. [14]

    Advances in Neural Information Processing Systems 35, 27953–27965 (2022)

    Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., Wood, F.: Flexible diffu- sion modeling of long videos. Advances in Neural Information Processing Systems 35, 27953–27965 (2022)

  15. [15]

    arXiv preprint arXiv:2211.13221 (2022)

    He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221 (2022)

  16. [16]

    Advances in neural information processing systems33, 6840–6851 (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

  17. [17]

    arXiv preprint arXiv:2205.15868 (2022)

    Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretrain- ing for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  18. [18]

    arXiv preprint arXiv:2410.23277 (2024)

    Hong, Y., Liu, B., Wu, M., Zhai, Y., Chang, K.W., Li, L., Lin, K., Lin, C.C., Wang, J., Yang, Z., et al.: Slowfast-vgen: Slow-fast learning for action-driven long video generation. arXiv preprint arXiv:2410.23277 (2024)

  19. [19]

    ICLR1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

  20. [20]

    arXiv preprint arXiv:2506.08009 (2025)

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  22. [22]

    arXiv preprint arXiv:2412.03603 (2024)

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  23. [23]

    kuaishou.com(2024)

    Kuaishou: Kling ai: Next-generation ai creative studio.https : / / klingai . kuaishou.com(2024)

  24. [24]

    arXiv preprint arXiv:2412.00131 (2024)

    Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al.: Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131 (2024)

  25. [25]

    arXiv preprint arXiv:2210.02747 (2022) Abbreviated paper title 15

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) Abbreviated paper title 15

  26. [26]

    arXiv preprint arXiv:2209.03003 (2022)

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  27. [27]

    Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

  28. [28]

    arXiv preprint arXiv:2503.09642 (2025)

    Peng, X., Zheng, Z., Shen, C., Young, T., Guo, X., Wang, B., Xu, H., Liu, H., Jiang, M., Li, W., Wang, Y., Ye, A., Ren, G., Ma, Q., Liang, W., Lian, X., Wu, X., Zhong, Y., Li, Z., Gong, C., Lei, G., Cheng, L., Zhang, L., Li, M., Zhang, R., Hu, S., Huang, S., Wang, X., Zhao, Y., Wang, Y., Wei, Z., You, Y.: Open-sora 2.0: Training a commercial-level video g...

  29. [29]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  30. [30]

    arXiv preprint arXiv:1908.10084 (2019)

    Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert- networks. arXiv preprint arXiv:1908.10084 (2019)

  31. [31]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  32. [32]

    Neurocomputing568, 127063 (2024)

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced trans- former with rotary position embedding. Neurocomputing568, 127063 (2024)

  33. [33]

    Team, G.: Mochi 1.https://github.com/genmoai/models(2024)

  34. [34]

    arXiv preprint arXiv:2505.13211 (2025)

    Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)

  35. [35]

    Advances in neural information processing systems35, 23371–23385 (2022)

    Voleti, V., Jolicoeur-Martineau, A., Pal, C.: Mcvd-masked conditional video diffu- sion for prediction, generation, and interpolation. Advances in neural information processing systems35, 23371–23385 (2022)

  36. [36]

    arXiv preprint arXiv:2503.20314 (2025)

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  37. [37]

    arXiv preprint arXiv:2410.02757 (2024)

    Wang, Y., Xiong, T., Zhou, D., Lin, Z., Zhao, Y., Kang, B., Feng, J., Liu, X.: Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757 (2024)

  38. [38]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wu, W., Liu, M., Zhu, Z., Xia, X., Feng, H., Wang, W., Lin, K.Q., Shen, C., Shou, M.Z.: Moviebench: A hierarchical movie level dataset for long video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28984–28994 (2025)

  39. [39]

    arXiv preprint arXiv:2507.18634 (2025)

    Xiao, J., Yang, C., Zhang, L., Cai, S., Zhao, Y., Guo, Y., Wetzstein, G., Agrawala, M., Yuille, A., Jiang, L.: Captain cinema: Towards short movie generation. arXiv preprint arXiv:2507.18634 (2025)

  40. [40]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Xie, D., Xu, Z., Hong, Y., Tan, H., Liu, D., Liu, F., Kaufman, A., Zhou, Y.: Progressive autoregressive video diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6322–6332 (2025) 16 Authors Suppressed Due to Excessive Length

  41. [41]

    In: European Conference on Computer Vision

    Xing,J.,Xia,M.,Zhang,Y.,Chen,H.,Yu,W.,Liu,H.,Liu,G.,Wang,X.,Shan,Y., Wong, T.T.: Dynamicrafter: Animating open-domain images with video diffusion priors. In: European Conference on Computer Vision. pp. 399–417. Springer (2024)

  42. [42]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yan, X., Cai, Y., Wang, Q., Zhou, Y., Huang, W., Yang, H.: Long video diffusion generation with segmented cross-attention and content-rich video data curation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3184–3194 (2025)

  43. [43]

    arXiv preprint arXiv:2509.22622 (2025)

    Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)

  44. [44]

    arXiv preprint arXiv:2306.02562 (2023)

    Yang, S., Zhang, L., Liu, Y., Jiang, Z., He, Y.: Video diffusion models with local- global context guidance. arXiv preprint arXiv:2306.02562 (2023)

  45. [45]

    arXiv preprint arXiv:2408.06072 (2024)

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  46. [46]

    In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Yin, S., Wu, C., Yang, H., Wang, J., Wang, X., Ni, M., Yang, Z., Li, L., Liu, S., Yang, F., et al.: Nuwa-xl: Diffusion over diffusion for extremely long video generation. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1309–1320 (2023)

  47. [47]

    International Journal of Computer Vision133(4), 1879–1893 (2025)

    Zhang, D.J., Wu, J.Z., Liu, J.W., Zhao, R., Ran, L., Gu, Y., Gao, D., Shou, M.Z.: Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision133(4), 1879–1893 (2025)

  48. [48]

    arXiv preprint arXiv:2504.12626 (2025)

    Zhang, L., Agrawala, M.: Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626 (2025)

  49. [49]

    arXiv preprint arXiv:2311.04145 (2023)

    Zhang, S., Wang, J., Zhang, Y., Zhao, K., Yuan, H., Qin, Z., Wang, X., Zhao, D., Zhou, J.: I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145 (2023)

  50. [50]

    Zhang, Z., Chang, S., He, Y., Han, Y., Tang, J., Wang, F., Zhuang, B.: Blockvid: Block diffusion for high-fidelity and coherent minute-long video generation (2025), https://openreview.net/forum?id=k8KIwW4f7P

  51. [51]

    arXiv preprint arXiv:2407.16655 (2024)

    Zhao, C., Liu, M., Wang, W., Chen, W., Wang, F., Chen, H., Zhang, B., Shen, C.: Moviedreamer: Hierarchical generation for coherent long visual sequence. arXiv preprint arXiv:2407.16655 (2024)

  52. [52]

    arXiv preprint arXiv:2502.15894 (2025)

    Zhao, M., He, G., Chen, Y., Zhu, H., Li, C., Zhu, J.: Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894 (2025)

  53. [53]

    Advances in Neural In- formation Processing Systems37, 110315–110340 (2024)

    Zhou, Y., Zhou, D., Cheng, M.M., Feng, J., Hou, Q.: Storydiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural In- formation Processing Systems37, 110315–110340 (2024)