arxiv: 2502.06764 · v2 · submitted 2025-02-10 · 💻 cs.LG · cs.CV

Recognition: no theorem link

History-Guided Video Diffusion

Kiwhan Song , Boyuan Chen , Max Simchowitz , Yilun Du , Russ Tedrake , Vincent Sitzmann

Authors on Pith no claims yet

Pith reviewed 2026-05-16 11:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords video diffusionhistory guidanceDiffusion Forcing Transformertemporal consistencyclassifier-free guidancelong video generationconditional generation

0 comments

The pith

Diffusion Forcing Transformer lets video models condition on any number of past frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Diffusion Forcing Transformer, an architecture paired with a training objective that supports conditioning on a variable number of history frames instead of requiring fixed-size inputs. It then defines History Guidance as a family of methods that use this flexibility to steer generation. Vanilla history guidance already raises sample quality and temporal consistency, while the time-and-frequency variant strengthens motion, supports compositional generalization to unseen history lengths, and permits stable generation of very long videos. A reader would care because most video diffusion pipelines currently struggle with flexible context, limiting coherence over extended sequences.

Core claim

The central claim is that the DFoT architecture and its associated training objective jointly remove the fixed-history restriction in video diffusion, and that the resulting History Guidance techniques measurably improve generation quality, temporal consistency, motion dynamics, out-of-distribution history handling, and long-horizon rollout stability.

What carries the argument

The Diffusion Forcing Transformer (DFoT) is a video diffusion architecture with a theoretically grounded training objective that enables conditioning on an arbitrary number of history frames, which in turn unlocks the History Guidance family of methods.

If this is right

Vanilla history guidance already raises video quality and temporal consistency over standard conditioning.
History guidance across time and frequency further improves motion dynamics and compositional generalization to out-of-distribution history.
The same methods permit stable generation of extremely long videos without drift.
The architecture removes the need to choose a single fixed context length in advance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on non-video domains such as audio or point-cloud sequences where variable-length history is also natural.
If the training objective proves stable, it might reduce reliance on large fixed context windows in other diffusion settings.
Long-rollout results suggest the method could be combined with existing autoregressive or hierarchical video models for further length scaling.

Load-bearing premise

That DFoT truly supports arbitrary-length history without hidden performance costs or instability and that the proposed guidance methods generalize beyond the tested datasets and sequence lengths.

What would settle it

A controlled experiment showing that DFoT performance or stability degrades sharply once history length exceeds the training distribution, or that history guidance produces no measurable improvement on a new dataset or longer rollout.

read the original abstract

Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Project website: https://boyuan.space/history-guidance

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DFoT solves the variable history problem in video diffusion with a new forcing objective and guidance family, but the long-sequence stability claims need tighter validation.

read the letter

The main point is that this paper introduces the Diffusion Forcing Transformer to let video diffusion models condition on any number of history frames instead of being locked to fixed lengths. They pair it with a family of History Guidance methods that improve on standard classifier-free guidance for this setting. Vanilla history guidance already lifts quality and consistency, while the time-frequency version adds better motion control, compositional generalization to new histories, and stable long rollouts.

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. It further proposes History Guidance (vanilla and time-frequency variants) as a family of methods that improve video generation quality and temporal consistency, enhance motion dynamics, enable compositional generalization to out-of-distribution history, and support stable rollouts of extremely long videos.

Significance. If the empirical claims hold, the work would advance video diffusion by overcoming fixed-context limitations and extending guidance techniques beyond standard classifier-free guidance, with potential benefits for applications requiring long-term consistency and generalization.

major comments (2)

[Abstract] Abstract: the central claims of significant improvements in quality, consistency, motion dynamics, compositional generalization, and stable long rollouts rest on unshown experiments; no quantitative tables, ablation details, or error analysis are provided to substantiate the magnitude or reliability of these gains.
[Abstract] Abstract: the assertion that the DFoT objective and architecture support arbitrary-length history conditioning without hidden performance costs or instability is not accompanied by any analysis, bounds, or discussion of potential issues such as attention dilution, gradient variance, or distribution shift for lengths far beyond the training distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments. The experimental results supporting the claims are presented in the main body (Sections 4 and 5) with quantitative tables, ablations, and rollout analyses; we have revised the abstract to reference these sections explicitly. We have also added discussion of potential scaling issues for long histories.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of significant improvements in quality, consistency, motion dynamics, compositional generalization, and stable long rollouts rest on unshown experiments; no quantitative tables, ablation details, or error analysis are provided to substantiate the magnitude or reliability of these gains.

Authors: The abstract summarizes results from the full paper. Quantitative comparisons (PSNR, FVD, temporal consistency metrics), ablations on history length and guidance strength, and error analysis of failure modes appear in Section 4 (Tables 1-3, Figures 3-5) and the supplementary material. We have revised the abstract to include explicit pointers to these sections and added a brief mention of the evaluation protocol. revision: yes
Referee: [Abstract] Abstract: the assertion that the DFoT objective and architecture support arbitrary-length history conditioning without hidden performance costs or instability is not accompanied by any analysis, bounds, or discussion of potential issues such as attention dilution, gradient variance, or distribution shift for lengths far beyond the training distribution.

Authors: Our experiments demonstrate stable rollouts up to 200 frames (Section 5.1, Figure 6) with no observed degradation in the tested regime, supported by the diffusion forcing objective that decouples per-frame noise prediction. We agree a dedicated analysis of edge cases is valuable and have added Section 5.2 discussing attention dilution, empirical gradient statistics, and distribution shift, including bounds derived from the training objective and suggestions for future regularization. revision: yes

Circularity Check

0 steps flagged

No circularity: DFoT architecture and objective introduced independently

full rationale

The paper defines the Diffusion Forcing Transformer (DFoT) via a new architecture and a theoretically grounded training objective that together support variable-length history conditioning. No equations or claims reduce the central improvements (flexible history support, History Guidance) to reparameterized inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained; the new objective and guidance family are presented as direct consequences of the proposed architecture rather than tautological restatements of prior results or data fits.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on the assumption that a new transformer-based diffusion architecture can be trained to accept variable-length history without architectural changes, plus standard diffusion assumptions.

axioms (2)

domain assumption Classifier-free guidance can be extended to variable-length conditioning in diffusion models
Invoked when stating that CFG-style history dropout performs poorly and a new method is needed.
domain assumption Diffusion models admit a theoretically grounded training objective for flexible history
Stated as part of the DFoT proposal.

invented entities (2)

Diffusion Forcing Transformer (DFoT) no independent evidence
purpose: Video diffusion architecture enabling flexible history conditioning
Newly proposed component whose properties are not independently verified outside the paper.
History Guidance (vanilla and time-frequency variants) no independent evidence
purpose: Guidance methods for steering video generation using variable history
New family of methods introduced without prior external validation.

pith-pipeline@v0.9.0 · 5494 in / 1347 out tokens · 32650 ms · 2026-05-16T11:56:30.341937+00:00 · methodology

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
cs.CV 2026-05 unverdicted novelty 7.0

3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
cs.CV 2026-03 unverdicted novelty 7.0

FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
cs.CV 2026-05 unverdicted novelty 6.0

M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
Motion-Aware Caching for Efficient Autoregressive Video Generation
cs.CV 2026-05 conditional novelty 6.0

MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
cs.CV 2026-04 unverdicted novelty 6.0

Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation
cs.LG 2026-03 unverdicted novelty 6.0

EAD is an equivariant diffusion model with adaptive asynchronous denoising that achieves state-of-the-art 3D molecular conformation generation.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
cs.CV 2026-02 unverdicted novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
cs.LG 2026-02 unverdicted novelty 6.0

Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4...
LongLive: Real-time Interactive Long Video Generation
cs.CV 2025-09 conditional novelty 6.0

LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
cs.CV 2025-06 unverdicted novelty 6.0

Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
Test-Time Training Done Right
cs.LG 2025-05 conditional novelty 6.0

Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.
Motion-Aware Caching for Efficient Autoregressive Video Generation
cs.CV 2026-05 unverdicted novelty 5.0

MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
Reward-Forcing: Autoregressive Video Generation with Reward Feedback
cs.CV 2026-01 unverdicted novelty 5.0

Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
cs.CV 2026-02 unverdicted novelty 4.0

EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 19 Pith papers · 20 internal anchors

[1]

All are worth words: A vit backbone for diffusion models

Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., and Zhu, J. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 22669--22679, 2023

work page 2023
[2]

Bellec, P. C. Optimal exponential bounds for aggregation of density estimators. Bernoulli, 23 0 (1): 0 219--248, 2017

work page 2017
[3]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

W., Fidler, S., and Kreis, K

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 22563--22575, 2023 b

work page 2023
[5]

Video generation models as world simulators

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators. OpenAI Blog, 1: 0 8, 2024

work page 2024
[6]

and Zisserman, A

Carreira, J. and Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 6299--6308, 2017

work page 2017
[7]

Chan, S. et al. Tutorial on diffusion models for imaging and vision. Foundations and Trends in Computer Graphics and Vision , 16 0 (4): 0 322--471, 2024

work page 2024
[8]

M., Du, Y., Simchowitz, M., Tedrake, R., and Sitzmann, V

Chen, B., Monso, D. M., Du, Y., Simchowitz, M., Tedrake, R., and Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 2024

work page 2024
[9]

On the importance of noise scheduling for diffusion models

Chen, T. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023

work page arXiv 2023
[10]

Diffusion policy: Visuomotor policy learning via action diffusion

Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp.\ 02783649241273668, 2023

work page 2023
[11]

and Nichol, A

Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34: 0 8780--8794, 2021

work page 2021
[12]

Diffusion is spectral autoregression, 2024

Dieleman, S. Diffusion is spectral autoregression, 2024. URL https://sander.ai/2024/09/02/spectral-autoregression.html

work page 2024
[13]

and Kaelbling, L

Du, Y. and Kaelbling, L. Compositional generative modeling: A single model is not all you need. arXiv preprint arXiv:2402.01103, 2024

work page arXiv 2024
[14]

B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W

Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International conference on machine learning, pp.\ 8489--8510. PMLR, 2023

work page 2023
[15]

P., Barron, J

Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P. P., Barron, J. T., and Poole, B. Cat3d: Create anything in 3d with multi-view diffusion models. Advances in Neural Information Processing Systems, 2024

work page 2024
[16]

Act3d: 3d feature field transformers for multi-task robotic manipulation

Gervet, T., Xian, Z., Gkanatsios, N., and Fragkiadaki, K. Act3d: 3d feature field transformers for multi-task robotic manipulation. In Conference on Robot Learning, pp.\ 3949--3965. PMLR, 2023

work page 2023
[17]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., and Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Photorealistic video generation with diffusion models

Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.-F., Essa, I., Jiang, L., and Lezama, J. Photorealistic video generation with diffusion models. In European Conference on Computer Vision, pp.\ 393--411. Springer, 2024

work page 2024
[19]

Efficient diffusion training via min-snr weighting strategy

Hang, T., Gu, S., Li, C., Bao, J., Chen, D., Hu, H., Geng, X., and Guo, B. Efficient diffusion training via min-snr weighting strategy. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 7441--7451, 2023

work page 2023
[20]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

work page 2017
[22]

Classifier-Free Diffusion Guidance

Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Denoising diffusion probabilistic models

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 0 6840--6851, 2020

work page 2020
[24]

Imagen Video: High Definition Video Generation with Diffusion Models

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022 a

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. Advances in Neural Information Processing Systems, 35: 0 8633--8646, 2022 b

work page 2022
[26]

simple diffusion: End-to-end diffusion for high resolution images

Hoogeboom, E., Heek, J., and Salimans, T. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pp.\ 13213--13232. PMLR, 2023

work page 2023
[27]

Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion

Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., and Salimans, T. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324, 2024

work page arXiv 2024
[28]

Diffusion-based generation, optimization, and planning in 3d scenes

Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., and Zhu, S.-C. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16750--16761, 2023

work page 2023
[29]

Vbench: Comprehensive benchmark suite for video generative models

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 21807--21818, 2024

work page 2024
[30]

Pyramidal flow matching for efficient video generative modeling

Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., and Lin, Z. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

work page arXiv 2024
[31]

Analyzing and improving the training dynamics of diffusion models

Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., and Laine, S. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 24174--24184, 2024

work page 2024
[32]

The Kinetics Human Action Video Dataset

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[33]

Kingma, D. P. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[34]

Kingma, D. P. and Gao, R. Understanding the diffusion objective as a weighted integral of elbos. Advances in Neural Information Processing Systems, 2023

work page 2023
[35]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Open-Sora Plan: Open-Source Large Video Generation Model

Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Common diffusion noise schedules and sample steps are flawed

Lin, S., Liu, B., Li, J., and Yang, X. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp.\ 5404--5411, 2024 b

work page 2024
[38]

Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp.\ 423--439. Springer, 2022

work page 2022
[39]

Decoupled Weight Decay Regularization

Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

Latte: Latent Diffusion Transformer for Video Generation

Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y.-F., Chen, C., and Qiao, Y. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

P., Tancik, M., Barron, J

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65 0 (1): 0 99--106, 2021

work page 2021
[42]

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp.\ 8162--8171. PMLR, 2021

work page 2021
[43]

and Xie, S

Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 4195--4205, 2023

work page 2023
[44]

Film: Visual reasoning with a general conditioning layer

Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018
[45]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

work page 2021
[46]

and Tsybakov, A

Rigollet, P. and Tsybakov, A. B. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16: 0 260--280, 2007

work page 2007
[47]

High-resolution image synthesis with latent diffusion models

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

work page 2022
[48]

Rolling diffusion models

Ruhe, D., Heek, J., Salimans, T., and Hoogeboom, E. Rolling diffusion models. In International Conference on Machine Learning, pp.\ 42818--42835. PMLR, 2024

work page 2024
[49]

Progressive Distillation for Fast Sampling of Diffusion Models

Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[50]

Animating rotation with quaternion curves

Shoemake, K. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pp.\ 245--254, 1985

work page 1985
[51]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[52]

Deep unsupervised learning using nonequilibrium thermodynamics

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), 2015

work page 2015
[53]

Denoising Diffusion Implicit Models

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[54]

P., Kumar, A., Ermon, S., and Poole, B

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

work page 2021
[55]

Roformer: Enhanced transformer with rotary position embedding, 2023

Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding, 2023

work page 2023
[56]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[57]

A connection between score matching and denoising autoencoders

Vincent, P. A connection between score matching and denoising autoencoders. Neural computation, 23 0 (7): 0 1661--1674, 2011

work page 2011
[58]

ModelScope Text-to-Video Technical Report

Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., and Zhang, S. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Novel view synthesis with diffusion models

Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., and Norouzi, M. Novel view synthesis with diffusion models. International Conference on Learning Representations, 2023

work page 2023
[60]

Watson, D., Saxena, S., Li, L., Tagliasacchi, A., and Fleet, D. J. Controlling space and time with diffusion models. International Conference on Learning Representations, 2025

work page 2025
[61]

Efficient streaming language models with attention sinks

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. International Conference on Learning Representations, 2024

work page 2024
[62]

Dynamicrafter: Animating open-domain images with video diffusion priors

Xing, J., Xia, M., Zhang, Y., Chen, H., Yu, W., Liu, H., Wang, X., Wong, T.-T., and Shan, Y. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023

work page arXiv 2023
[63]

Temporally consistent transformers for video generation

Yan, W., Hafner, D., James, S., and Abbeel, P. Temporally consistent transformers for video generation. In International Conference on Machine Learning, pp.\ 39062--39098. PMLR, 2023

work page 2023
[64]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

T., Durand, F., Shechtman, E., and Huang, X

Yin, T., Zhang, Q., Zhang, R., Freeman, W. T., Durand, F., Shechtman, E., and Huang, X. From slow bidirectional to fast causal video generators. arXiv preprint arXiv:2412.07772, 2024

work page arXiv 2024
[66]

G., Yang, M.-H., Hao, Y., Essa, I., et al

Yu, L., Cheng, Y., Sohn, K., Lezama, J., Zhang, H., Chang, H., Hauptmann, A. G., Yang, M.-H., Hao, Y., Essa, I., et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10459--10469, 2023 a

work page 2023
[67]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Yu, L., Lezama, J., Gundavarapu, N. B., Versari, L., Sohn, K., Minnen, D., Cheng, Y., Birodkar, V., Gupta, A., Gu, X., et al. Language model beats diffusion--tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

A., Shechtman, E., and Wang, O

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 586--595, 2018

work page 2018
[69]

Open-Sora: Democratizing Efficient Video Production for All

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Zhou, T., Tucker, R., Flynn, J., Fyffe, G., and Snavely, N. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018