pith. machine review for the scientific record.

arxiv: 2605.09442 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI

Recognition: unknown

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long video generation · prompt-adaptive memory · semantic injection cache · causal video diffusion · multi-prompt video · training-free framework · temporal coherence · adaptive windowing

The pith

SWIFT uses prompt-adaptive memory to let causal video diffusion models switch semantics efficiently without rebuilding caches or losing quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current cache-rebuild methods for long video generation waste computation at prompt boundaries because fixed memory cannot adapt to new semantics. SWIFT counters this mismatch with a lightweight Semantic Injection Cache that augments existing video history instead of replacing it, plus head-wise injection so only aligned attention heads receive the prompt update. An Adaptive Dynamic Window expands context near switches and shrinks it during stable segments, while segment-level semantic anchors keep long-range consistency under the compressed attention. If these mechanisms hold, multi-prompt streaming video becomes practical at 22.6 FPS on a single H100 GPU while matching the visual quality of slower baselines.
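
The contrast is easiest to see in pseudocode. Below is a minimal sketch, not the paper's implementation: a rebuild-style pipeline re-encodes the whole cached history at a prompt switch, while a SWIFT-style pipeline keeps the cache and applies a cheap additive update derived from the prompt transition. All names (rebuild_cache, inject_semantics, the strength parameter) are illustrative assumptions, and the sketch omits the motion-orthogonal projection and head-wise gating the paper describes.

```python
def rebuild_cache(cached_frames, encode_frame):
    # Baseline behaviour at a prompt boundary: re-encode every cached frame
    # under the new prompt context, paying a full pass over the history.
    return [encode_frame(frame) for frame in cached_frames]

def inject_semantics(kv_cache, old_prompt_emb, new_prompt_emb, strength=0.5):
    # SWIFT-style behaviour (as described in the abstract): derive a transition
    # signal from the prompt change and add it to the existing cache in place,
    # instead of reconstructing the cache from scratch.
    delta = new_prompt_emb - old_prompt_emb        # prompt transition signal
    return [kv + strength * delta for kv in kv_cache]
```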

Core claim

SWIFT is a training-free framework that augments cached video memory via a Semantic Injection Cache and head-wise injection, allocates temporal context through an Adaptive Dynamic Window, and maintains consistency with segment-level semantic anchors, thereby enabling efficient semantic switching at prompt boundaries in causal video diffusion models without full cache reconstruction or quality loss.

What carries the argument

The Semantic Injection Cache that augments rather than rebuilds video memory, combined with head-wise prompt injection, an Adaptive Dynamic Window sized to prompt phase, and compact segment-level semantic anchors.

If this is right

  • Average inference cost drops because smaller temporal windows are used during stable prompt segments (a window-sizing sketch follows this list).
  • Long-range semantic consistency is retained by reintroducing summarized anchors as compact tokens.
  • Prompt boundaries incur only proportional head-wise updates instead of complete memory resets.
  • Generation reaches 22.6 FPS on one H100 GPU while matching the output quality of prior state-of-the-art methods.
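
A hedged sketch of the window schedule behind these bullets: the local attention span is widest right after a prompt switch and decays back to a small floor during stable segments. The schedule shape and the values of w_min, w_max, and decay are assumptions for illustration, not figures taken from the paper.

```python
def window_size(frames_since_switch: int, w_min: int = 4, w_max: int = 16,
                decay: float = 0.5) -> int:
    """Number of past frames attended to at the current generation step."""
    span = w_min + (w_max - w_min) * (decay ** frames_since_switch)
    return max(w_min, round(span))

# Wide right after a switch, then contracting toward the floor:
# [window_size(t) for t in range(6)] -> [16, 10, 7, 6, 5, 4]
```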

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same injection and windowing pattern could reduce memory pressure in other autoregressive generative tasks that must handle changing conditioning signals.
  • Real-time interactive video tools become feasible if the per-frame cost stays low enough for user-driven prompt edits.
  • Scaling the anchors to even longer sequences would test whether the compressed memory still prevents drift over minutes of video.

Load-bearing premise

Augmenting cached memory with lightweight semantic injection and dynamic windows preserves temporal coherence and visual quality across prompt changes without needing full reconstruction.
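
As a reading aid for this premise, here is a minimal sketch of the anchor side of the mechanism under assumed tensor shapes: each finished prompt segment is pooled into a handful of compact tokens that are prepended to the local window, so long-range context survives even when the attention span is compressed. Function names and the pooling rule are illustrative, not the paper's.

```python
import torch

def summarize_segment(segment_feats: torch.Tensor, n_anchor: int = 4) -> torch.Tensor:
    """Pool a (frames, tokens, dim) segment into (n_anchor, dim) anchor tokens."""
    flat = segment_feats.flatten(0, 1)                   # (frames*tokens, dim)
    chunks = flat.chunk(n_anchor, dim=0)                 # split the flattened sequence
    return torch.stack([c.mean(dim=0) for c in chunks])  # one mean token per chunk

def build_context(anchors: list, local_window: torch.Tensor) -> torch.Tensor:
    """Concatenate all past segment anchors with the current local-window tokens."""
    if anchors:
        return torch.cat([torch.cat(anchors, dim=0), local_window], dim=0)
    return local_window
```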

What would settle it

Generate videos with frequent prompt switches using SWIFT and measure both perceptual quality scores and frame rate against a full-cache-rebuild baseline under identical generation settings; degradation in either metric relative to the baseline would falsify the claim.
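
That test is mechanical to run. A hedged sketch of the harness, with generate and quality_score standing in for whatever model wrapper and perceptual metric (e.g. a VBench-style scorer) one actually uses:

```python
import time

def benchmark(generate, quality_score, prompts, seconds_per_prompt=10):
    start = time.perf_counter()
    frames = generate(prompts, seconds_per_prompt)   # list of generated frames
    elapsed = time.perf_counter() - start
    return {"fps": len(frames) / elapsed, "quality": quality_score(frames)}

# swift_stats   = benchmark(swift_generate, scorer, prompts)
# rebuild_stats = benchmark(full_rebuild_generate, scorer, prompts)
# The claim fails if SWIFT is worse on either axis at matched settings.
```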

Figures

Figures reproduced from arXiv: 2605.09442 by Hao Li, Jingtao Zhang, Shanwen Tan, Shaofeng Zhang, Xiaosong Jia, Xue Yang, Yanyong Zhang.

Figure 1: Streaming interactive T2V models, e.g., LongLive […]
Figure 2: Illustration of Semantic Injection Cache. Instead of rebuilding the full video cache at every prompt boundary, SWIFT constructs a lightweight semantic bridge from the prompt transition signal. The transition is first projected onto a motion-orthogonal subspace to avoid interfering with local temporal dynamics, and is then injected into memory through head-wise alignment with recent and sink summaries. […]
Figure 3: Illustration of Adaptive Dynamic Window. SWIFT allocates temporal memory according to prompt phase rather than using a fixed local attention span throughout generation. The effective window expands around prompt transitions for stable semantic handover and shrinks inside stable intervals for efficient rollout. Segment-level semantic anchors compensate for the reduced local context by preserving compact pro…
Figure 4: Qualitative comparison of different memory variants under the multi-prompt 60-second setting. Fixed uses a constant local window. No-Memory removes additional transition-aware memory. Sink retains only sink memory. Sink+SIC adds the Semantic Injection Cache on top of sink memory. Ours denotes the full SWIFT model, which achieves more coherent prompt transitions and better long-range visual consistency.
Figure 5: Sensitivity analysis of the adaptive dynamic window. The figure reports the variation of representative visual quality metrics under different hyperparameter settings of the adaptive window, including the minimum window size and the phase scheduling factors.
Figure 6: Qualitative examples of temporal injection schedules. We visualize generated frames from one-shot, constant, and continuously decayed injection under the same multi-prompt sequence. One-shot injection causes abrupt semantic changes, constant injection shows weaker adaptation after prompt switches, and continuously decayed injection provides smoother semantic transitions while preserving temporal consistency.
Figure 7: Per-block efficiency of SWIFT. Gray bands mark prompt-switch boundaries, and dashed lines denote fixed-window references. SWIFT shows only mild latency increases at prompt transitions, while its adaptive memory schedule expands the read budget near boundaries and contracts it in stable segments, with nearly constant GPU memory usage.
Figure 8: Qualitative example of 30-second multi-prompt video generation. We visualize representative frames generated by SWIFT and LongLive under the same six-prompt sequence.
Figure 9: Qualitative example of 60-second multi-prompt video generation. We visualize representative frames generated by SWIFT under a six-prompt sequence. A woman is writing in a journal at a cafe table by a window. It is raining outside, and the window is streaked with water. The woman writing in her journal at the cafe is approached by a waiter who brings her a steaming cup of hot chocolate. The rain is coming d…
Figure 10: Qualitative example of 60-second multi-prompt video generation. We visualize representative frames generated by SWIFT under a six-prompt sequence.
Figure 11: Failure case of SWIFT under complex multi-prompt generation. Although SWIFT preserves the laboratory scene and maintains a coherent visual layout across prompt transitions, it can still inherit limitations from the pretrained video diffusion backbone. As a result, difficult cases involving rare objects, fine-grained physical interactions, crowded scenes, or large camera motion may still produce …
read the original abstract

Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SWIFT, a training-free framework for efficient multi-prompt long-video generation in causal video diffusion models. It proposes a Semantic Injection Cache augmented with head-wise semantic injection to handle prompt switches without full cache reconstruction, an Adaptive Dynamic Window that varies local context size based on prompt phase, and segment-level semantic anchors to maintain long-range consistency under compressed attention. The central claim is that these components enable 22.6 FPS on a single H100 GPU while preserving generation quality relative to existing state-of-the-art methods.

Significance. If the empirical claims are substantiated, the work would offer a practical advance for interactive long-video synthesis by reducing redundant computation at semantic boundaries without requiring model retraining. The public code release and focus on causal diffusion models align with current needs in efficient video generation pipelines.

major comments (2)
  1. [Abstract] Abstract: The headline performance claim (22.6 FPS with quality preservation versus SOTA) is presented without any quantitative metrics, ablation tables, video-length statistics, baseline comparisons, or experimental protocol. This absence directly undermines verification of the efficiency-quality tradeoff that constitutes the paper's main contribution.
  2. [Method] Method description (presumed §3): The head-wise injection and Adaptive Dynamic Window mechanisms are described at a high level but lack explicit equations or pseudocode for alignment-score computation, threshold selection, or the precise memory-token update rule. Without these, it is impossible to confirm that the approach avoids quality degradation at prompt boundaries or to reproduce the reported speed-up.
minor comments (2)
  1. [Abstract] Abstract: The expansion of the acronym SWIFT ('Semantic Windowing and Injection for Flexible Transitions') is clear, but the title emphasizes 'Prompt-Adaptive Memory'; a brief reconciliation of the two phrasings would improve consistency.
  2. [Method] The manuscript lists free parameters (local context window sizes, head-wise injection alignment thresholds) yet repeatedly stresses a 'training-free' design. A short clarification on how these hyperparameters are chosen in practice would remove potential reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance claim (22.6 FPS with quality preservation versus SOTA) is presented without any quantitative metrics, ablation tables, video-length statistics, baseline comparisons, or experimental protocol. This absence directly undermines verification of the efficiency-quality tradeoff that constitutes the paper's main contribution.

    Authors: The abstract provides a high-level summary of the contribution. Detailed quantitative metrics (including FPS, FVD, and CLIP scores), ablation tables, video-length statistics, baseline comparisons, and the full experimental protocol are presented in Section 4 and the supplementary material. We agree that the abstract would benefit from a brief inclusion of key results to better substantiate the efficiency-quality tradeoff, and we will revise it accordingly in the next version. revision: yes

  2. Referee: [Method] Method description (presumed §3): The head-wise injection and Adaptive Dynamic Window mechanisms are described at a high level but lack explicit equations or pseudocode for alignment-score computation, threshold selection, or the precise memory-token update rule. Without these, it is impossible to confirm that the approach avoids quality degradation at prompt boundaries or to reproduce the reported speed-up.

    Authors: We agree that the current high-level description in §3 would benefit from greater formalism to support reproducibility and verification. In the revised manuscript we will add explicit equations for the alignment-score computation (cosine similarity between head-wise features), the adaptive threshold selection rule, and the memory-token update procedure, along with pseudocode in a new Algorithm 1. These additions will clarify how selective head-wise injection and dynamic windowing preserve quality at prompt boundaries while achieving the reported speed-up. revision: yes
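
To make the promised formalism concrete, here is a minimal sketch under the rebuttal's stated assumption that head-wise alignment is a cosine similarity between each head's cached state and the prompt transition direction: every attention head receives an update scaled by its own alignment score, so poorly aligned heads are left nearly untouched. Shapes, the temperature, and the clamping rule are illustrative guesses, not the paper's Algorithm 1.

```python
import torch
import torch.nn.functional as F

def headwise_inject(head_states: torch.Tensor,   # (heads, tokens, dim) cached memory
                    transition: torch.Tensor,    # (dim,) prompt transition direction
                    temperature: float = 1.0) -> torch.Tensor:
    summaries = head_states.mean(dim=1)                                      # (heads, dim)
    align = F.cosine_similarity(summaries, transition.unsqueeze(0), dim=-1)  # (heads,)
    gate = torch.clamp(align / temperature, min=0.0)  # only aligned heads receive the update
    return head_states + gate[:, None, None] * transition
```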

Circularity Check

0 steps flagged

No significant circularity; training-free framework is self-contained

full rationale

The paper presents SWIFT as a training-free method for multi-prompt long-video generation, motivated by observed mismatches between cached video history and prompt updates in prior approaches. It introduces new components (Semantic Injection Cache, head-wise injection, Adaptive Dynamic Window, segment-level semantic anchors) as practical engineering solutions without any equations, predictions, or derivations that reduce to fitted parameters, self-definitions, or self-citation chains. Central efficiency claims (22.6 FPS on H100) are empirical and supported by public code for verification, with no load-bearing steps that equate outputs to inputs by construction. This is a standard non-circular finding for a methods paper focused on implementation rather than theoretical derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 3 invented entities

The central claim rests on the effectiveness of newly introduced algorithmic components whose performance is asserted without independent prior evidence or formal derivation; the approach assumes standard properties of causal attention in diffusion models.

free parameters (2)
  • local context window sizes
    Larger windows near prompt boundaries and smaller during stable segments are allocated according to prompt phase; specific sizes or thresholds are not derived from first principles.
  • head-wise injection alignment thresholds
    Proportion of prompt update per attention head depends on alignment with video state; selection or scaling parameters are introduced to implement this (a toy configuration sketch follows the ledger).
axioms (1)
  • domain assumption: Causal video diffusion models support external cache augmentation without retraining or loss of core generation capability.
    The entire framework is presented as training-free and compatible with existing causal models.
invented entities (3)
  • Semantic Injection Cache (no independent evidence)
    purpose: Augments existing cached video memory with prompt updates at boundaries instead of full reconstruction.
    New cache structure introduced to address the stated mismatch between memory and prompt switches.
  • Adaptive Dynamic Window (no independent evidence)
    purpose: Allocates varying temporal context sizes based on prompt phase to reduce average inference cost.
    New memory allocation mechanism not standard in prior fixed-budget approaches.
  • segment-level semantic anchors (no independent evidence)
    purpose: Compact summary tokens that preserve long-range consistency when local attention is compressed.
    New memory token type introduced to compensate for reduced context windows.
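
Illustrating the two free-parameter families listed above, a toy configuration sketch; the field names and default values are assumptions for readability, not settings reported in the paper.

```python
from dataclasses import dataclass

@dataclass
class SwiftMemoryConfig:
    window_min: int = 4            # local context during stable prompt segments
    window_max: int = 16           # local context around prompt boundaries
    align_threshold: float = 0.1   # minimum head alignment before injection applies
    inject_strength: float = 0.5   # global scale on the head-wise prompt update
```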

pith-pipeline@v0.9.0 · 5594 in / 1629 out tokens · 93140 ms · 2026-05-12T04:52:50.800817+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

  1. [1] T. Brooks, J. Hellsten, M. Aittala, T.-C. Wang, T. Aila, J. Lehtinen, M.-Y. Liu, A. Efros, and T. Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022.

  2. [2] M. Cai, X. Cun, X. Li, W. Liu, Z. Zhang, Y. Zhang, Y. Shan, and X. Yue. Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7763–7772, 2025.

  3. [3] S. Cai, C. Yang, L. Zhang, Y. Guo, J. Xiao, Z. Yang, Y. Xu, Z. Yang, A. Yuille, L. Guibas, et al. Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058, 2025.

  4. [4] J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. In The Fourteenth International Conference on Learning Representations, 2026.

  5. [5] W. Feng, C. Liu, S. Liu, W. Y. Wang, A. Vahdat, and W. Nie. Blobgen-vid: Compositional text-to-video generation with blob video representations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12989–12998, 2025.

  6. [6] K. Gao, J. Shi, H. Zhang, C. Wang, J. Xiao, and L. Chen. Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing. In International Conference on Machine Learning, pages 18550–18565. PMLR, 2025.

  7. [7] Y. Gu, W. Mao, and M. Z. Shou. Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325, 2025.

  8. [8] Y. He, T. Yang, Y. Zhang, Y. Shan, and Q. Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.

  9. [9] X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

  10. [10] Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

  11. [11] S. Ji, X. Chen, S. Yang, X. Tao, P. Wan, and H. Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699, 2025.

  12. [12] X. Ju, W. Ye, Q. Liu, Q. Wang, X. Wang, P. Wan, D. Zhang, K. Gai, and Q. Xu. Fulldit: Multi-task video generative foundation model with full attention. arXiv preprint arXiv:2503.19907, 2025.

  13. [13] J. Kim, J. Kang, J. Choi, and B. Han. Fifo-diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems, 37:89834–89868, 2024.

  14. [14] S. Kim, S. W. Oh, J.-H. Wang, J.-Y. Lee, and J. Shin. Tuning-free multi-event long video generation via synchronized coupled sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6418–6429, 2025.

  15. [15] T. Lee, S. Kwon, and T. Kim. Grid diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8734–8743, 2024.

  16. [16] R. Li, T. Yang, F. Ai, T. Wu, S. Wen, B. Peng, and L. Zhang. Long-horizon streaming video generation via hybrid attention with decoupled distillation. arXiv preprint arXiv:2604.10103, 2026.

  17. [17] Z. Li, H. Rahmani, Q. Ke, and J. Liu. Longdiff: Training-free long video generation in one go. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17789–17798, 2025.

  18. [18] S. Lin, C. Yang, H. He, J. Jiang, Y. Ren, X. Xia, Y. Zhao, X. Xiao, and L. Jiang. Autoregressive adversarial post-training for real-time interactive video generation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

  19. [19] Z. Liu, X. Deng, S. Chen, A. Wang, Q. Guo, M. Han, Z. Xue, M. Chen, P. Luo, and L. Yang. Worldweaver: Generating long-horizon video worlds via rich perception. arXiv preprint arXiv:2508.15720, 2025.

  20. [20] Z. Qing, S. Zhang, J. Wang, X. Wang, Y. Wei, Y. Zhang, C. Gao, and N. Sang. Hierarchical spatio-temporal decoupling for text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6635–6645, 2024.

  21. [21] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  22. [22] K. Song, B. Chen, M. Simchowitz, Y. Du, R. Tedrake, and V. Sitzmann. History-guided video diffusion. arXiv preprint arXiv:2502.06764, 2025.

  23. [23] Y. Tian, L. Yang, H. Yang, Y. Gao, Y. Deng, J. Chen, X. Wang, Z. Yu, X. Tao, P. Wan, et al. Videotetris: Towards compositional text-to-video generation. Advances in Neural Information Processing Systems, 37:29489–29513, 2024.

  24. [24] R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.

  25. [25] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  26. [26] H. Wang, C.-Y. Ma, Y.-C. Liu, J. Hou, T. Xu, J. Wang, F. Juefei-Xu, Y. Luo, P. Zhang, T. Hou, et al. Lingen: Towards high-resolution minute-length text-to-video generation with linear computational complexity. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2578–2588, 2025.

  27. [27] T. Wu, S. Yang, R. Po, Y. Xu, Z. Liu, D. Lin, and G. Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025.

  28. [28] Z. Wu, A. Siarohin, W. Menapace, I. Skorokhodov, Y. Fang, V. Chordia, I. Gilitschenski, and S. Tulyakov. Mind the time: Temporally-controlled multi-event video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23989–24000, 2025.

  29. [29] H. Xi, S. Yang, Y. Zhao, C. Xu, M. Li, X. Li, Y. Lin, H. Cai, J. Zhang, D. Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. In International Conference on Machine Learning, pages 68208–68224. PMLR, 2025.

  30. [30] D. Xie, Z. Xu, Y. Hong, H. Tan, D. Liu, F. Liu, A. Kaufman, and Y. Zhou. Progressive autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6322–6332, 2025.

  31. [31] F. Xie, D. Zeng, Q. Shen, and B. Tang. A comprehensive survey on text-to-video generation. Chinese Journal of Electronics, 34(4):1009–1036, 2025.

  32. [32] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T...

  33. [33] S. Yang, W. Huang, R. Chu, Y. Xiao, Y. Zhao, X. Wang, M. Li, E. Xie, Y.-C. Chen, Y. Lu, S. Han, and Y. Chen. Longlive: Real-time interactive long video generation. In The Fourteenth International Conference on Learning Representations, 2026.

  34. [34] T. Yin, M. Gharbi, R. Zhang, E. Shechtman, F. Durand, W. T. Freeman, and T. Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024.

  35. [35] T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025.

  36. [36] Z. Yin, K. Chen, X. Bai, R. Jiang, J. Li, H. Li, J. Liu, Y. Xiang, J. Yu, and M. Zhang. A survey: spatiotemporal consistency in video generation. ACM Computing Surveys, 2025.

  37. [37] J. Yu, J. Bai, Y. Qin, Q. Liu, X. Wang, P. Wan, D. Zhang, and X. Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.

  38. [38] Z. Yuan, Y. Liu, Y. Cao, W. Sun, H. Jia, R. Chen, Z. Li, B. Lin, L. Yuan, L. He, et al. Mora: Enabling generalist video generation via a multi-agent framework. arXiv preprint arXiv:2403.13248, 2024.

  39. [39] K. Zhang, L. Jiang, A. Wang, J. Z. Fang, T. Zhi, Q. Yan, H. Kang, X. Lu, and X. Pan. Storymem: Multi-shot long video storytelling with memory. arXiv preprint arXiv:2512.19539, 2025.

  40. [40] L. Zhang, S. Cai, M. Li, G. Wetzstein, and M. Agrawala. Frame context packing and drift prevention in next-frame-prediction video diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  41. [41] P. Zhang, Y. Chen, R. Su, H. Ding, I. Stoica, Z. Liu, and H. Zhang. Fast video generation with sliding tile attention. In International Conference on Machine Learning, pages 74714–74731. PMLR, 2025.

  42. [42] S. Zhou, P. Yang, J. Wang, Y. Luo, and C. C. Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2545, 2024.

  43. [43] Y. Zhou, D. Zhou, M.-M. Cheng, J. Feng, and Q. Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  44. [44] T. Zhu, S. Zhang, Z. Sun, J. Tian, and Y. Tang. Memorize-and-generate: Towards long-term consistency in real-time video generation. arXiv preprint arXiv:2512.18741, 2025.

  45. [45] When the finite-difference motion estimate has very small magnitude, this denominator may become numerically unstable. We therefore use the stabilized implementation $\widehat{\Delta p}^{(m)}_{\perp} = \Delta p^{(m)} - \frac{\langle \Delta p^{(m)}, m \rangle}{\|m\|_2^2 + \epsilon}\, m$, $\epsilon > 0$ (36). This update converges to the exact projection as $\epsilon \to 0$: $\lim_{\epsilon \to 0} \widehat{\Delta p}^{(m)}_{\perp} = \Delta p^{(m)}_{\perp}$ (37). Its residual first-order motion response is …