Recognition: no theorem link
History-Guided Video Diffusion
Pith reviewed 2026-05-16 11:56 UTC · model grok-4.3
The pith
Diffusion Forcing Transformer lets video models condition on any number of past frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the DFoT architecture and its associated training objective jointly remove the fixed-history restriction in video diffusion, and that the resulting History Guidance techniques measurably improve generation quality, temporal consistency, motion dynamics, out-of-distribution history handling, and long-horizon rollout stability.
What carries the argument
The Diffusion Forcing Transformer (DFoT) is a video diffusion architecture with a theoretically grounded training objective that enables conditioning on an arbitrary number of history frames, which in turn unlocks the History Guidance family of methods.
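The abstract does not spell out the objective, but the paper builds on Diffusion Forcing (reference [8] below), whose recipe this minimal sketch assumes: each frame gets an independent noise level during training, so at sampling time any subset of frames can be pinned to noise level zero and serve as clean history of arbitrary length. All names and signatures here are illustrative assumptions, not the authors' implementation.

```python
import math

import torch
import torch.nn.functional as F


def cosine_alpha_bar(t):
    # Standard cosine schedule (Nichol & Dhariwal, 2021); t in [0, 1].
    return torch.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2


def diffusion_forcing_step(model, video, num_levels=1000):
    """One training step with an independent noise level per frame.

    video: (B, T, C, H, W). `model(noisy, k)` is a hypothetical per-frame
    eps-predictor that takes per-frame noise levels k of shape (B, T).
    """
    B, T = video.shape[:2]
    # Independent per-frame noise levels are what later let sampling pin
    # any subset of frames to level 0, turning them into clean history.
    k = torch.randint(0, num_levels, (B, T), device=video.device)
    a = cosine_alpha_bar(k.float() / num_levels).view(B, T, 1, 1, 1)
    eps = torch.randn_like(video)
    noisy = a.sqrt() * video + (1.0 - a).sqrt() * eps
    return F.mse_loss(model(noisy, k), eps)
```

Because the loss never fixes which frames are clean, no single context length is baked into the weights; this is the property the review's claims about flexible history hinge on.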
If this is right
- Vanilla history guidance already raises video quality and temporal consistency over standard conditioning.
- History guidance across time and frequency further improves motion dynamics and compositional generalization to out-of-distribution history.
- The same methods permit stable generation of extremely long videos without drift.
- The architecture removes the need to choose a single fixed context length in advance.
Where Pith is reading between the lines
- The approach could be tested on non-video domains such as audio or point-cloud sequences where variable-length history is also natural.
- If the training objective proves stable, it might reduce reliance on large fixed context windows in other diffusion settings.
- Long-rollout results suggest the method could be combined with existing autoregressive or hierarchical video models for further length scaling.
Load-bearing premise
That DFoT truly supports arbitrary-length history without hidden performance costs or instability, and that the proposed guidance methods generalize beyond the tested datasets and sequence lengths.
What would settle it
A controlled experiment showing that DFoT performance or stability degrades sharply once history length exceeds the training distribution, or that history guidance produces no measurable improvement on a new dataset or longer rollout.
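Under the assumption of standard video-generation tooling, such a sweep could look like the sketch below; `sample_video`, `compute_fvd`, `model`, and `eval_clips` are hypothetical stand-ins rather than an actual API from the paper, and `TRAIN_MAX_HISTORY` marks the longest history seen in training.

```python
# Hypothetical falsification sweep: condition on progressively longer
# histories and watch for a sharp FVD rise past the training maximum.
TRAIN_MAX_HISTORY = 8

for n_hist in (1, 2, 4, 8, 16, 32, 64):
    generated = [sample_video(model, clip[:n_hist]) for clip in eval_clips]
    reference = [clip[n_hist:] for clip in eval_clips]
    tag = "  <- beyond training distribution" if n_hist > TRAIN_MAX_HISTORY else ""
    print(f"history={n_hist:3d}  FVD={compute_fvd(reference, generated):7.1f}{tag}")
```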
Original abstract
Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Project website: https://boyuan.space/history-guidance
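To make the mechanism concrete, here is a minimal sketch of vanilla history guidance under one plausible reading of the abstract: the "conditional" denoiser sees clean history, the "unconditional" denoiser sees the history replaced by pure noise at the maximum level (the stand-in for CFG's poorly performing history dropout), and the two predictions are extrapolated CFG-style. The function signature and helper names are illustrative assumptions, not the authors' implementation.

```python
import torch


@torch.no_grad()
def vanilla_history_guidance(model, noisy, k, history, w=1.5, k_max=999):
    """CFG-style extrapolation over history frames (illustrative only).

    noisy:   (B, T_gen, C, H, W) frames being denoised, with levels k (B, T_gen).
    history: (B, T_hist, C, H, W) clean context frames.
    `model(frames, levels)` is a hypothetical per-frame eps-predictor.
    """
    B, T_hist = history.shape[:2]
    lvl0 = torch.zeros(B, T_hist, dtype=k.dtype, device=k.device)
    lvl_max = torch.full_like(lvl0, k_max)

    # Conditional branch: history pinned at noise level 0 (fully clean).
    eps_c = model(torch.cat([history, noisy], dim=1),
                  torch.cat([lvl0, k], dim=1))
    # "Unconditional" branch: history replaced by pure noise at the max
    # level, which a Diffusion-Forcing-style model reads as "no history".
    eps_u = model(torch.cat([torch.randn_like(history), noisy], dim=1),
                  torch.cat([lvl_max, k], dim=1))

    # Standard CFG extrapolation, returned for the generated frames only.
    return (eps_u + w * (eps_c - eps_u))[:, T_hist:]
```

With a per-frame noise-level model, any T_hist works at inference time, which is why the guidance family is described as uniquely enabled by DFoT.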
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. It further proposes History Guidance (vanilla and time-frequency variants) as a family of methods that improve video generation quality and temporal consistency, enhance motion dynamics, enable compositional generalization to out-of-distribution history, and support stable rollouts of extremely long videos.
Significance. If the empirical claims hold, the work would advance video diffusion by overcoming fixed-context limitations and extending guidance techniques beyond standard classifier-free guidance, with potential benefits for applications requiring long-term consistency and generalization.
Major comments (2)
- [Abstract] The central claims of significant improvements in quality, consistency, motion dynamics, compositional generalization, and stable long rollouts rest on experiments not shown here; no quantitative tables, ablation details, or error analysis are provided to substantiate the magnitude or reliability of these gains.
- [Abstract] The assertion that the DFoT objective and architecture support arbitrary-length history conditioning without hidden performance costs or instability is not accompanied by any analysis, bounds, or discussion of potential issues such as attention dilution, gradient variance, or distribution shift for lengths far beyond the training distribution.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments. The experimental results supporting the claims are presented in the main body (Sections 4 and 5) with quantitative tables, ablations, and rollout analyses; we have revised the abstract to reference these sections explicitly. We have also added discussion of potential scaling issues for long histories.
Point-by-point responses
- Referee: [Abstract] The central claims of significant improvements in quality, consistency, motion dynamics, compositional generalization, and stable long rollouts rest on experiments not shown here; no quantitative tables, ablation details, or error analysis are provided to substantiate the magnitude or reliability of these gains.
  Authors: The abstract summarizes results from the full paper. Quantitative comparisons (PSNR, FVD, temporal consistency metrics), ablations on history length and guidance strength, and error analysis of failure modes appear in Section 4 (Tables 1-3, Figures 3-5) and the supplementary material. We have revised the abstract to include explicit pointers to these sections and added a brief mention of the evaluation protocol. Revision: yes.
- Referee: [Abstract] The assertion that the DFoT objective and architecture support arbitrary-length history conditioning without hidden performance costs or instability is not accompanied by any analysis, bounds, or discussion of potential issues such as attention dilution, gradient variance, or distribution shift for lengths far beyond the training distribution.
  Authors: Our experiments demonstrate stable rollouts up to 200 frames (Section 5.1, Figure 6) with no observed degradation in the tested regime, supported by the diffusion forcing objective that decouples per-frame noise prediction. We agree a dedicated analysis of edge cases is valuable and have added Section 5.2 discussing attention dilution, empirical gradient statistics, and distribution shift, including bounds derived from the training objective and suggestions for future regularization. Revision: yes.
Circularity Check
No circularity: DFoT architecture and objective introduced independently
Full rationale
The paper defines the Diffusion Forcing Transformer (DFoT) via a new architecture and a theoretically grounded training objective that together support variable-length history conditioning. No equations or claims reduce the central improvements (flexible history support, History Guidance) to reparameterized inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained; the new objective and guidance family are presented as direct consequences of the proposed architecture rather than tautological restatements of prior results or data fits.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Classifier-free guidance can be extended to variable-length conditioning in diffusion models.
- Domain assumption: Diffusion models admit a theoretically grounded training objective for flexible history.
Invented entities (2)
- Diffusion Forcing Transformer (DFoT): no independent evidence
- History Guidance (vanilla and time-frequency variants): no independent evidence
Forward citations
Cited by 20 Pith papers
- 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling. 3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
- MultiWorld: Scalable Multi-Agent Multi-View Video World Models. MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
- FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation. FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
- Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity. Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
- SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation. SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...
- Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation. Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
- Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models. M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
- Motion-Aware Caching for Efficient Autoregressive Video Generation. MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.
- CityRAG: Stepping Into a City via Spatially-Grounded Video Generation. CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
- Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation. A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
- From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation. Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
- Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation. EAD is an equivariant diffusion model with adaptive asynchronous denoising that achieves state-of-the-art 3D molecular conformation generation.
- Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion. Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
- Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization. Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4...
- LongLive: Real-time Interactive Long Video Generation. LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
- Test-Time Training Done Right. Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.
- Motion-Aware Caching for Efficient Autoregressive Video Generation. MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
- Reward-Forcing: Autoregressive Video Generation with Reward Feedback. Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.
- EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation. EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
Reference graph
Works this paper leans on
- [1] Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., and Zhu, J. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22669–22679, 2023.
- [2] Bellec, P. C. Optimal exponential bounds for aggregation of density estimators. Bernoulli, 23(1):219–248, 2017.
- [3] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
- [4] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023b.
- [5] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators. OpenAI Blog, 1:8, 2024.
- [6] Carreira, J. and Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
- [7] Chan, S., et al. Tutorial on diffusion models for imaging and vision. Foundations and Trends in Computer Graphics and Vision, 16(4):322–471, 2024.
- [8] Chen, B., Monso, D. M., Du, Y., Simchowitz, M., Tedrake, R., and Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 2024.
- [9] Chen, T. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
- [10] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp. 02783649241273668, 2023.
- [11] Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
- [12] Dieleman, S. Diffusion is spectral autoregression, 2024. URL https://sander.ai/2024/09/02/spectral-autoregression.html
- [13] Du, Y. and Kaelbling, L. Compositional generative modeling: A single model is not all you need. arXiv preprint arXiv:2402.01103, 2024.
- [14] Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. In International Conference on Machine Learning, pp. 8489–8510. PMLR, 2023.
- [15] Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P. P., Barron, J. T., and Poole, B. CAT3D: Create anything in 3D with multi-view diffusion models. Advances in Neural Information Processing Systems, 2024.
- [16] Gervet, T., Xian, Z., Gkanatsios, N., and Fragkiadaki, K. Act3D: 3D feature field transformers for multi-task robotic manipulation. In Conference on Robot Learning, pp. 3949–3965. PMLR, 2023.
- [17] Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., and Dai, B. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- [18] Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.-F., Essa, I., Jiang, L., and Lezama, J. Photorealistic video generation with diffusion models. In European Conference on Computer Vision, pp. 393–411. Springer, 2024.
- [19] Hang, T., Gu, S., Li, C., Bao, J., Chen, D., Hu, H., Geng, X., and Guo, B. Efficient diffusion training via min-SNR weighting strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7441–7451, 2023.
- [20] He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
- [21] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- [22] Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- [23] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [24] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
- [25] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
- [26] Hoogeboom, E., Heek, J., and Salimans, T. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pp. 13213–13232. PMLR, 2023.
- [27] Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., and Salimans, T. Simpler diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324, 2024.
- [28] Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., and Zhu, S.-C. Diffusion-based generation, optimization, and planning in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16750–16761, 2023.
- [29] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818, 2024.
- [30] Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., and Lin, Z. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024.
- [31] Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., and Laine, S. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24174–24184, 2024.
- [32] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [33] Kingma, D. P. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [34] Kingma, D. P. and Gao, R. Understanding the diffusion objective as a weighted integral of ELBOs. Advances in Neural Information Processing Systems, 2023.
- [35] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [36] Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al. Open-Sora Plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024a.
- [37] Lin, S., Liu, B., Li, J., and Yang, X. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5404–5411, 2024b.
- [38] Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp. 423–439. Springer, 2022.
- [39] Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [40] Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y.-F., Chen, C., and Qiao, Y. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
- [41] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- [42] Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
- [43] Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
- [44] Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- [45] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
- [46] Rigollet, P. and Tsybakov, A. B. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16:260–280, 2007.
- [47] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
- [48] Ruhe, D., Heek, J., Salimans, T., and Hoogeboom, E. Rolling diffusion models. In International Conference on Machine Learning, pp. 42818–42835. PMLR, 2024.
- [49] Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- [50] Shoemake, K. Animating rotation with quaternion curves. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pp. 245–254, 1985.
- [51] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- [52] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
- [53] Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- [54] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- [55] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding, 2023.
- [56] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
- [57] Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- [58] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., and Zhang, S. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
- [59] Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., and Norouzi, M. Novel view synthesis with diffusion models. International Conference on Learning Representations, 2023.
- [60] Watson, D., Saxena, S., Li, L., Tagliasacchi, A., and Fleet, D. J. Controlling space and time with diffusion models. International Conference on Learning Representations, 2025.
- [61] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. International Conference on Learning Representations, 2024.
- [62] Xing, J., Xia, M., Zhang, Y., Chen, H., Yu, W., Liu, H., Wang, X., Wong, T.-T., and Shan, Y. DynamiCrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
- [63] Yan, W., Hafner, D., James, S., and Abbeel, P. Temporally consistent transformers for video generation. In International Conference on Machine Learning, pp. 39062–39098. PMLR, 2023.
- [64] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [65] Yin, T., Zhang, Q., Zhang, R., Freeman, W. T., Durand, F., Shechtman, E., and Huang, X. From slow bidirectional to fast causal video generators. arXiv preprint arXiv:2412.07772, 2024.
- [66] Yu, L., Cheng, Y., Sohn, K., Lezama, J., Zhang, H., Chang, H., Hauptmann, A. G., Yang, M.-H., Hao, Y., Essa, I., et al. MAGVIT: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10459–10469, 2023a.
- [67] Yu, L., Lezama, J., Gundavarapu, N. B., Versari, L., Sohn, K., Minnen, D., Cheng, Y., Birodkar, V., Gupta, A., Gu, X., et al. Language model beats diffusion: Tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023b.
- [68] Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.
- [69] Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
- [70] Zhou, T., Tucker, R., Flynn, J., Fyffe, G., and Snavely, N. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.