pith. machine review for the scientific record. sign in

arxiv: 2502.06764 · v2 · submitted 2025-02-10 · 💻 cs.LG · cs.CV

Recognition: no theorem link

History-Guided Video Diffusion

Authors on Pith no claims yet

Pith reviewed 2026-05-16 11:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords video diffusionhistory guidanceDiffusion Forcing Transformertemporal consistencyclassifier-free guidancelong video generationconditional generation
0
0 comments X

The pith

Diffusion Forcing Transformer lets video models condition on any number of past frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Diffusion Forcing Transformer, an architecture paired with a training objective that supports conditioning on a variable number of history frames instead of requiring fixed-size inputs. It then defines History Guidance as a family of methods that use this flexibility to steer generation. Vanilla history guidance already raises sample quality and temporal consistency, while the time-and-frequency variant strengthens motion, supports compositional generalization to unseen history lengths, and permits stable generation of very long videos. A reader would care because most video diffusion pipelines currently struggle with flexible context, limiting coherence over extended sequences.

Core claim

The central claim is that the DFoT architecture and its associated training objective jointly remove the fixed-history restriction in video diffusion, and that the resulting History Guidance techniques measurably improve generation quality, temporal consistency, motion dynamics, out-of-distribution history handling, and long-horizon rollout stability.

What carries the argument

The Diffusion Forcing Transformer (DFoT) is a video diffusion architecture with a theoretically grounded training objective that enables conditioning on an arbitrary number of history frames, which in turn unlocks the History Guidance family of methods.

If this is right

  • Vanilla history guidance already raises video quality and temporal consistency over standard conditioning.
  • History guidance across time and frequency further improves motion dynamics and compositional generalization to out-of-distribution history.
  • The same methods permit stable generation of extremely long videos without drift.
  • The architecture removes the need to choose a single fixed context length in advance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on non-video domains such as audio or point-cloud sequences where variable-length history is also natural.
  • If the training objective proves stable, it might reduce reliance on large fixed context windows in other diffusion settings.
  • Long-rollout results suggest the method could be combined with existing autoregressive or hierarchical video models for further length scaling.

Load-bearing premise

That DFoT truly supports arbitrary-length history without hidden performance costs or instability and that the proposed guidance methods generalize beyond the tested datasets and sequence lengths.

What would settle it

A controlled experiment showing that DFoT performance or stability degrades sharply once history length exceeds the training distribution, or that history guidance produces no measurable improvement on a new dataset or longer rollout.

read the original abstract

Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Project website: https://boyuan.space/history-guidance

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. It further proposes History Guidance (vanilla and time-frequency variants) as a family of methods that improve video generation quality and temporal consistency, enhance motion dynamics, enable compositional generalization to out-of-distribution history, and support stable rollouts of extremely long videos.

Significance. If the empirical claims hold, the work would advance video diffusion by overcoming fixed-context limitations and extending guidance techniques beyond standard classifier-free guidance, with potential benefits for applications requiring long-term consistency and generalization.

major comments (2)
  1. [Abstract] Abstract: the central claims of significant improvements in quality, consistency, motion dynamics, compositional generalization, and stable long rollouts rest on unshown experiments; no quantitative tables, ablation details, or error analysis are provided to substantiate the magnitude or reliability of these gains.
  2. [Abstract] Abstract: the assertion that the DFoT objective and architecture support arbitrary-length history conditioning without hidden performance costs or instability is not accompanied by any analysis, bounds, or discussion of potential issues such as attention dilution, gradient variance, or distribution shift for lengths far beyond the training distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments. The experimental results supporting the claims are presented in the main body (Sections 4 and 5) with quantitative tables, ablations, and rollout analyses; we have revised the abstract to reference these sections explicitly. We have also added discussion of potential scaling issues for long histories.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims of significant improvements in quality, consistency, motion dynamics, compositional generalization, and stable long rollouts rest on unshown experiments; no quantitative tables, ablation details, or error analysis are provided to substantiate the magnitude or reliability of these gains.

    Authors: The abstract summarizes results from the full paper. Quantitative comparisons (PSNR, FVD, temporal consistency metrics), ablations on history length and guidance strength, and error analysis of failure modes appear in Section 4 (Tables 1-3, Figures 3-5) and the supplementary material. We have revised the abstract to include explicit pointers to these sections and added a brief mention of the evaluation protocol. revision: yes

  2. Referee: [Abstract] Abstract: the assertion that the DFoT objective and architecture support arbitrary-length history conditioning without hidden performance costs or instability is not accompanied by any analysis, bounds, or discussion of potential issues such as attention dilution, gradient variance, or distribution shift for lengths far beyond the training distribution.

    Authors: Our experiments demonstrate stable rollouts up to 200 frames (Section 5.1, Figure 6) with no observed degradation in the tested regime, supported by the diffusion forcing objective that decouples per-frame noise prediction. We agree a dedicated analysis of edge cases is valuable and have added Section 5.2 discussing attention dilution, empirical gradient statistics, and distribution shift, including bounds derived from the training objective and suggestions for future regularization. revision: yes

Circularity Check

0 steps flagged

No circularity: DFoT architecture and objective introduced independently

full rationale

The paper defines the Diffusion Forcing Transformer (DFoT) via a new architecture and a theoretically grounded training objective that together support variable-length history conditioning. No equations or claims reduce the central improvements (flexible history support, History Guidance) to reparameterized inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained; the new objective and guidance family are presented as direct consequences of the proposed architecture rather than tautological restatements of prior results or data fits.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on the assumption that a new transformer-based diffusion architecture can be trained to accept variable-length history without architectural changes, plus standard diffusion assumptions.

axioms (2)
  • domain assumption Classifier-free guidance can be extended to variable-length conditioning in diffusion models
    Invoked when stating that CFG-style history dropout performs poorly and a new method is needed.
  • domain assumption Diffusion models admit a theoretically grounded training objective for flexible history
    Stated as part of the DFoT proposal.
invented entities (2)
  • Diffusion Forcing Transformer (DFoT) no independent evidence
    purpose: Video diffusion architecture enabling flexible history conditioning
    Newly proposed component whose properties are not independently verified outside the paper.
  • History Guidance (vanilla and time-frequency variants) no independent evidence
    purpose: Guidance methods for steering video generation using variable history
    New family of methods introduced without prior external validation.

pith-pipeline@v0.9.0 · 5494 in / 1347 out tokens · 32650 ms · 2026-05-16T11:56:30.341937+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

  2. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  3. FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

  4. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  5. SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...

  6. Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.

  7. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  8. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.

  9. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  10. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  11. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  12. Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation

    cs.LG 2026-03 unverdicted novelty 6.0

    EAD is an equivariant diffusion model with adaptive asynchronous denoising that achieves state-of-the-art 3D molecular conformation generation.

  13. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  14. Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

    cs.LG 2026-02 unverdicted novelty 6.0

    Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4...

  15. LongLive: Real-time Interactive Long Video Generation

    cs.CV 2025-09 conditional novelty 6.0

    LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.

  16. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  17. Test-Time Training Done Right

    cs.LG 2025-05 conditional novelty 6.0

    Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.

  18. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

  19. Reward-Forcing: Autoregressive Video Generation with Reward Feedback

    cs.CV 2026-01 unverdicted novelty 5.0

    Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.

  20. EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

    cs.CV 2026-02 unverdicted novelty 4.0

    EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 19 Pith papers · 20 internal anchors

  1. [1]

    All are worth words: A vit backbone for diffusion models

    Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., and Zhu, J. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 22669--22679, 2023

  2. [2]

    Bellec, P. C. Optimal exponential bounds for aggregation of density estimators. Bernoulli, 23 0 (1): 0 219--248, 2017

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023 a

  4. [4]

    W., Fidler, S., and Kreis, K

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 22563--22575, 2023 b

  5. [5]

    Video generation models as world simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators. OpenAI Blog, 1: 0 8, 2024

  6. [6]

    and Zisserman, A

    Carreira, J. and Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 6299--6308, 2017

  7. [7]

    Chan, S. et al. Tutorial on diffusion models for imaging and vision. Foundations and Trends in Computer Graphics and Vision , 16 0 (4): 0 322--471, 2024

  8. [8]

    M., Du, Y., Simchowitz, M., Tedrake, R., and Sitzmann, V

    Chen, B., Monso, D. M., Du, Y., Simchowitz, M., Tedrake, R., and Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 2024

  9. [9]

    On the importance of noise scheduling for diffusion models

    Chen, T. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp.\ 02783649241273668, 2023

  11. [11]

    and Nichol, A

    Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34: 0 8780--8794, 2021

  12. [12]

    Diffusion is spectral autoregression, 2024

    Dieleman, S. Diffusion is spectral autoregression, 2024. URL https://sander.ai/2024/09/02/spectral-autoregression.html

  13. [13]

    and Kaelbling, L

    Du, Y. and Kaelbling, L. Compositional generative modeling: A single model is not all you need. arXiv preprint arXiv:2402.01103, 2024

  14. [14]

    B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W

    Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International conference on machine learning, pp.\ 8489--8510. PMLR, 2023

  15. [15]

    P., Barron, J

    Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P. P., Barron, J. T., and Poole, B. Cat3d: Create anything in 3d with multi-view diffusion models. Advances in Neural Information Processing Systems, 2024

  16. [16]

    Act3d: 3d feature field transformers for multi-task robotic manipulation

    Gervet, T., Xian, Z., Gkanatsios, N., and Fragkiadaki, K. Act3d: 3d feature field transformers for multi-task robotic manipulation. In Conference on Robot Learning, pp.\ 3949--3965. PMLR, 2023

  17. [17]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., and Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  18. [18]

    Photorealistic video generation with diffusion models

    Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.-F., Essa, I., Jiang, L., and Lezama, J. Photorealistic video generation with diffusion models. In European Conference on Computer Vision, pp.\ 393--411. Springer, 2024

  19. [19]

    Efficient diffusion training via min-snr weighting strategy

    Hang, T., Gu, S., Li, C., Bao, J., Chen, D., Hu, H., Geng, X., and Guo, B. Efficient diffusion training via min-snr weighting strategy. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 7441--7451, 2023

  20. [20]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022

  21. [21]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  22. [22]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  23. [23]

    Denoising diffusion probabilistic models

    Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 0 6840--6851, 2020

  24. [24]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022 a

  25. [25]

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. Advances in Neural Information Processing Systems, 35: 0 8633--8646, 2022 b

  26. [26]

    simple diffusion: End-to-end diffusion for high resolution images

    Hoogeboom, E., Heek, J., and Salimans, T. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pp.\ 13213--13232. PMLR, 2023

  27. [27]

    Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion

    Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., and Salimans, T. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324, 2024

  28. [28]

    Diffusion-based generation, optimization, and planning in 3d scenes

    Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., and Zhu, S.-C. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16750--16761, 2023

  29. [29]

    Vbench: Comprehensive benchmark suite for video generative models

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 21807--21818, 2024

  30. [30]

    Pyramidal flow matching for efficient video generative modeling

    Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., and Lin, Z. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  31. [31]

    Analyzing and improving the training dynamics of diffusion models

    Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., and Laine, S. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 24174--24184, 2024

  32. [32]

    The Kinetics Human Action Video Dataset

    Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  33. [33]

    Kingma, D. P. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  34. [34]

    Kingma, D. P. and Gao, R. Understanding the diffusion objective as a weighted integral of elbos. Advances in Neural Information Processing Systems, 2023

  35. [35]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  36. [36]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024 a

  37. [37]

    Common diffusion noise schedules and sample steps are flawed

    Lin, S., Liu, B., Li, J., and Yang, X. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp.\ 5404--5411, 2024 b

  38. [38]

    Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp.\ 423--439. Springer, 2022

  39. [39]

    Decoupled Weight Decay Regularization

    Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  40. [40]

    Latte: Latent Diffusion Transformer for Video Generation

    Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y.-F., Chen, C., and Qiao, Y. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

  41. [41]

    P., Tancik, M., Barron, J

    Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65 0 (1): 0 99--106, 2021

  42. [42]

    Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp.\ 8162--8171. PMLR, 2021

  43. [43]

    and Xie, S

    Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 4195--4205, 2023

  44. [44]

    Film: Visual reasoning with a general conditioning layer

    Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  45. [45]

    W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

  46. [46]

    and Tsybakov, A

    Rigollet, P. and Tsybakov, A. B. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16: 0 260--280, 2007

  47. [47]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

  48. [48]

    Rolling diffusion models

    Ruhe, D., Heek, J., Salimans, T., and Hoogeboom, E. Rolling diffusion models. In International Conference on Machine Learning, pp.\ 42818--42835. PMLR, 2024

  49. [49]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  50. [50]

    Animating rotation with quaternion curves

    Shoemake, K. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pp.\ 245--254, 1985

  51. [51]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

  52. [52]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), 2015

  53. [53]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  54. [54]

    P., Kumar, A., Ermon, S., and Poole, B

    Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  55. [55]

    Roformer: Enhanced transformer with rotary position embedding, 2023

    Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding, 2023

  56. [56]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  57. [57]

    A connection between score matching and denoising autoencoders

    Vincent, P. A connection between score matching and denoising autoencoders. Neural computation, 23 0 (7): 0 1661--1674, 2011

  58. [58]

    ModelScope Text-to-Video Technical Report

    Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., and Zhang, S. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023

  59. [59]

    Novel view synthesis with diffusion models

    Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., and Norouzi, M. Novel view synthesis with diffusion models. International Conference on Learning Representations, 2023

  60. [60]

    Watson, D., Saxena, S., Li, L., Tagliasacchi, A., and Fleet, D. J. Controlling space and time with diffusion models. International Conference on Learning Representations, 2025

  61. [61]

    Efficient streaming language models with attention sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. International Conference on Learning Representations, 2024

  62. [62]

    Dynamicrafter: Animating open-domain images with video diffusion priors

    Xing, J., Xia, M., Zhang, Y., Chen, H., Yu, W., Liu, H., Wang, X., Wong, T.-T., and Shan, Y. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023

  63. [63]

    Temporally consistent transformers for video generation

    Yan, W., Hafner, D., James, S., and Abbeel, P. Temporally consistent transformers for video generation. In International Conference on Machine Learning, pp.\ 39062--39098. PMLR, 2023

  64. [64]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  65. [65]

    T., Durand, F., Shechtman, E., and Huang, X

    Yin, T., Zhang, Q., Zhang, R., Freeman, W. T., Durand, F., Shechtman, E., and Huang, X. From slow bidirectional to fast causal video generators. arXiv preprint arXiv:2412.07772, 2024

  66. [66]

    G., Yang, M.-H., Hao, Y., Essa, I., et al

    Yu, L., Cheng, Y., Sohn, K., Lezama, J., Zhang, H., Chang, H., Hauptmann, A. G., Yang, M.-H., Hao, Y., Essa, I., et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10459--10469, 2023 a

  67. [67]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Yu, L., Lezama, J., Gundavarapu, N. B., Versari, L., Sohn, K., Minnen, D., Cheng, Y., Birodkar, V., Gupta, A., Gu, X., et al. Language model beats diffusion--tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023 b

  68. [68]

    A., Shechtman, E., and Wang, O

    Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 586--595, 2018

  69. [69]

    Open-Sora: Democratizing Efficient Video Production for All

    Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  70. [70]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Zhou, T., Tucker, R., Flynn, J., Fyffe, G., and Snavely, N. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018