Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3
The pith
Self Forcing trains autoregressive video diffusion models on their own generated outputs to close the exposure bias gap and enable real-time streaming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value caching during training. This enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives, and supports efficient inference via few-step diffusion, stochastic gradient truncation, and a rolling KV cache mechanism.
What carries the argument
Self Forcing: a training paradigm that performs autoregressive rollout with KV caching, conditions each frame on the model's own earlier outputs, and applies a video-level loss to the full generated sequence.
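The difference between conditioning on ground-truth context and conditioning on self-generated context can be illustrated with a toy rollout. This is a minimal sketch, not the paper's implementation: `toy_denoise`, its fixed bias, and the use of a plain list in place of a real key-value cache are all assumptions made for illustration.

```python
def toy_denoise(context):
    """Hypothetical stand-in for a few-step denoiser: predicts the next frame
    as the mean of the context plus a fixed model bias."""
    bias = 0.1  # deliberate imperfection so model outputs drift from the data
    return sum(context) / len(context) + bias


def teacher_forced(ground_truth):
    """Predict each frame from the ground-truth frames that precede it."""
    return [toy_denoise(ground_truth[:i]) for i in range(1, len(ground_truth))]


def self_forced(first_frame, num_frames):
    """Predict each frame from the model's own earlier outputs.

    The list `cache` plays the role of the KV cache: it grows by one
    entry per generated frame and is the only context the model sees.
    """
    cache = [first_frame]
    for _ in range(num_frames - 1):
        cache.append(toy_denoise(cache))
    return cache[1:]


ground_truth = [1.0, 1.0, 1.0, 1.0]
tf_preds = teacher_forced(ground_truth)
sf_preds = self_forced(ground_truth[0], len(ground_truth))
```

In this toy setting the teacher-forced error stays constant at every frame, while the self-forced error grows frame by frame because each prediction inherits the bias of the previous ones. Training on the self-forced rollout exposes the model to that accumulated drift, which is the condition it faces at inference.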
If this is right
- Real-time streaming video generation with sub-second latency on a single GPU
- Generation quality that matches or surpasses significantly slower non-causal diffusion models
- Efficient autoregressive video extrapolation through the rolling KV cache
- Holistic video-level supervision instead of per-frame objectives
Where Pith is reading between the lines
- The same self-conditioning idea could reduce error accumulation in other long-horizon autoregressive tasks such as audio or 3D scene generation.
- Rolling KV caches may allow extension to substantially longer output sequences without proportional memory growth.
- Stochastic gradient truncation could be combined with other efficiency techniques to scale the method to higher-resolution video.
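The bounded-memory behaviour suggested for the rolling KV cache can be sketched with a fixed-size window. The page does not describe the eviction rule, so the class below assumes the simplest one (drop the oldest entry when the window is full); `RollingKVCache` and its `window` parameter are hypothetical names, and the string entries stand in for real key and value tensors.

```python
from collections import deque


class RollingKVCache:
    """Fixed-size cache that evicts the oldest frame's entries first (assumed policy)."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque(maxlen=...) discards items from the left once it is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        """Store the key/value pair for the newest generated frame."""
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = RollingKVCache(window=4)
for frame_idx in range(10):
    cache.append(key=f"k{frame_idx}", value=f"v{frame_idx}")
```

After ten frames the cache still holds only four entries, so memory stays constant however long the rollout runs. Whether quality holds up once early frames are evicted is a separate question that the sketch does not address.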
Load-bearing premise
That autoregressive rollout with KV caching during training using a few-step diffusion model and stochastic gradient truncation accurately simulates inference conditions without introducing substantial new biases or quality degradation.
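The premise depends on what stochastic gradient truncation actually does, which this page does not define. One plausible reading, assumed here purely for illustration, is that a random exit step is sampled for the few-step denoiser, the earlier steps run without gradient tracking, and only the final executed step is backpropagated through. The helper `plan_gradient_steps` is hypothetical.

```python
import random


def plan_gradient_steps(num_steps, rng):
    """Return one flag per executed denoising step.

    A flag is True only for the step whose gradients are kept. Under the
    assumed scheme that is the randomly sampled exit step, and no steps
    are executed after it.
    """
    exit_step = rng.randint(1, num_steps)  # inclusive on both ends
    return [step == exit_step for step in range(1, exit_step + 1)]


plan = plan_gradient_steps(num_steps=4, rng=random.Random(0))
```

Under this reading, memory for activations scales with one denoising step rather than with the full chain, at the cost of a gradient that ignores how earlier steps shaped the input to the last one. That ignored dependence is the kind of bias the premise asks the reader to accept.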
What would settle it
A side-by-side evaluation, run under identical inference settings, in which Self Forcing models either score lower on perceptual quality than non-causal diffusion models or fail to keep latency below one second on a single GPU.
Original abstract
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Self Forcing, a training paradigm for autoregressive video diffusion models that mitigates exposure bias by performing autoregressive rollout with KV caching during training, conditioning each frame on previously self-generated outputs rather than ground-truth context. It employs a few-step diffusion model and stochastic gradient truncation to maintain training efficiency, introduces a rolling KV cache for extrapolation, and supervises training with a holistic video-level loss. The experiments are reported to show real-time streaming video generation with sub-second latency on a single GPU while matching or surpassing the quality of slower, non-causal diffusion baselines.
Significance. If the training-time approximations faithfully reproduce inference conditions, the approach could enable practical causal autoregressive video generation for low-latency applications. The KV-caching and rolling-cache mechanisms provide concrete efficiency gains, and the shift to self-conditioned training with video-level supervision is a direct procedural response to exposure bias.
major comments (2)
- [Section 3] Training procedure (Section 3): The few-step diffusion approximation combined with stochastic gradient truncation is presented as sufficient to simulate full inference-time error accumulation and KV-cache evolution, yet no quantitative analysis (e.g., comparison of noise schedules, drift metrics, or cache-state divergence) is provided to bound the discrepancy; this directly underpins the headline claim that Self Forcing closes the train-test gap without quality degradation.
- [Section 4] Experimental validation (Section 4): The reported sub-second latency and quality parity with non-causal models rely on the truncated training procedure, but the manuscript lacks ablations isolating the effects of step count and truncation probability on long-horizon consistency and cache behavior; without these, it is unclear whether the performance gains are robust or artifacts of the efficiency shortcuts.
minor comments (2)
- [Abstract] The abstract states that supervision occurs 'through a holistic loss at the video level,' but the precise formulation of this loss relative to the standard per-frame diffusion objective is not shown as an equation; adding it would clarify the difference from prior frame-wise training.
- Figure captions and method diagrams would benefit from explicit annotation of the stochastic truncation points and the rolling KV-cache update rule to improve reproducibility.
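The contrast raised in the first minor comment can be written down generically. The notation below is an assumption and is not taken from the paper: $x^{1:N}$ is a video of $N$ frames, $x^{<i}$ the ground-truth frames before frame $i$, $\epsilon_\theta$ a noise-prediction network, and $D$ an unspecified divergence between distributions over whole videos. The first expression is the standard per-frame denoising objective under ground-truth context; the second is one possible form of a video-level objective evaluated on sequences sampled from the model's own rollout.

```latex
\mathcal{L}_{\text{frame}}(\theta)
  = \sum_{i=1}^{N} \mathbb{E}_{t,\epsilon}
    \left[ \left\| \epsilon_\theta\!\left(x_t^{i}, t \mid x^{<i}\right) - \epsilon \right\|_2^2 \right]
\qquad
\mathcal{L}_{\text{video}}(\theta)
  = D\!\left( p_{\text{data}}\!\left(x^{1:N}\right) \,\middle\|\, p_\theta\!\left(x^{1:N}\right) \right),
\quad
p_\theta\!\left(x^{1:N}\right) = \prod_{i=1}^{N} p_\theta\!\left(x^{i} \mid x^{<i}\right)
```

The point of the comparison is that the first loss never evaluates the model on its own outputs, while the second is defined on sequences the model generates autoregressively. Which divergence the paper uses is not stated on this page.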
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We appreciate the focus on the training approximations and experimental rigor. Below we address each major comment point by point and describe the revisions we will make to strengthen the paper.
Point-by-point responses
-
Referee: [Section 3] Training procedure (Section 3): The few-step diffusion approximation combined with stochastic gradient truncation is presented as sufficient to simulate full inference-time error accumulation and KV-cache evolution, yet no quantitative analysis (e.g., comparison of noise schedules, drift metrics, or cache-state divergence) is provided to bound the discrepancy; this directly underpins the headline claim that Self Forcing closes the train-test gap without quality degradation.
Authors: We agree that explicit quantitative bounds on the discrepancy would strengthen the justification for the few-step diffusion and stochastic gradient truncation. The current manuscript demonstrates effectiveness through end-to-end video-level quality metrics, latency results, and comparisons to non-causal baselines, which indirectly support that the approximations preserve the benefits of self-forcing. To directly address the concern, we will add a new analysis subsection in Section 3 that includes quantitative comparisons such as cache-state divergence (measured via L2 distance on KV tensors) and drift metrics (e.g., accumulated noise schedule deviation) between truncated and full rollouts on short sequences. This will provide explicit bounds and better support the claim that the train-test gap is closed without quality degradation.
Revision: yes
-
Referee: [Section 4] Experimental validation (Section 4): The reported sub-second latency and quality parity with non-causal models rely on the truncated training procedure, but the manuscript lacks ablations isolating the effects of step count and truncation probability on long-horizon consistency and cache behavior; without these, it is unclear whether the performance gains are robust or artifacts of the efficiency shortcuts.
Authors: We acknowledge that dedicated ablations isolating step count and truncation probability would improve clarity on robustness. The existing experiments already vary sequence lengths and report consistent quality across different video durations, with the rolling KV cache enabling extrapolation. However, to isolate these hyperparameters, we will expand Section 4 with new ablation tables that vary diffusion steps (1, 2, 4, 8) and truncation probabilities (0.1, 0.3, 0.5), reporting metrics for long-horizon consistency (e.g., temporal coherence scores) and cache behavior (e.g., cache hit rates and state divergence over 100+ frames). These additions will confirm that the gains are not artifacts of the shortcuts.
Revision: yes
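The two measurements promised in the responses above are simple to state in code. The sketch below assumes that a KV cache can be flattened to a list of per-frame value lists and that the ablation is a full grid over the quoted values. `kv_l2_divergence` and `ablation_grid` are hypothetical helper names, not functions from the paper's code.

```python
import itertools
import math


def kv_l2_divergence(cache_a, cache_b):
    """L2 distance between two caches given as equal-shape nested lists
    (one inner list of floats per cached frame)."""
    if len(cache_a) != len(cache_b):
        raise ValueError("caches must hold the same number of frames")
    total = 0.0
    for frame_a, frame_b in zip(cache_a, cache_b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same number of values")
        total += sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))
    return math.sqrt(total)


def ablation_grid(steps=(1, 2, 4, 8), truncation_probs=(0.1, 0.3, 0.5)):
    """Enumerate every (diffusion steps, truncation probability) pair."""
    return [
        {"steps": s, "truncation_prob": p}
        for s, p in itertools.product(steps, truncation_probs)
    ]


# Toy caches from a "full" and a "truncated" rollout of two frames each.
full_rollout = [[0.0, 0.0], [0.0, 0.0]]
truncated_rollout = [[3.0, 0.0], [0.0, 4.0]]
divergence = kv_l2_divergence(full_rollout, truncated_rollout)
configs = ablation_grid()
```

The grid over the quoted values contains twelve configurations. Reporting the divergence per frame index, rather than as a single total, would show whether the discrepancy grows with rollout length, which is the behaviour the referee asks the authors to bound.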
Circularity Check
No significant circularity; training procedure is a self-contained procedural change.
Full rationale
The paper presents Self Forcing as a training strategy that performs autoregressive rollout with KV caching to address exposure bias, supplemented by few-step diffusion and stochastic gradient truncation for tractability. No equations, fitted parameters, or self-citations are shown to reduce the claimed performance gains (real-time causal generation matching non-causal baselines) to the inputs by construction. The derivation chain consists of standard diffusion objectives with modified conditioning and rollout, evaluated externally against baselines. This is the most common honest finding for method papers without mathematical self-reference.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of diffusion steps
axioms (1)
- domain assumption Few-step diffusion approximates the full multi-step denoising process sufficiently for training the autoregressive objective
Forward citations
Cited by 55 Pith papers
-
ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
ReconPhys is the first feedforward neural network that jointly reconstructs 3D geometry and appearance via Gaussian Splatting while estimating physical attributes from a single monocular video using self-supervised training.
-
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
-
Discrete Stochastic Localization for Non-autoregressive Generation
Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference
X-Cache achieves 71% block skip rate and 2.6x wall-clock speedup in few-step autoregressive multi-camera driving world models via cross-chunk residual caching with dual-metric gating and forced KV updates.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
Speculative Decoding for Autoregressive Video Generation
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
-
Quantitative Video World Model Evaluation for Geometric-Consistency
PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
-
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...
-
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
-
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
-
FlashMol: High-Quality Molecule Generation in as Few as Four Steps
FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
Repurposing 3D Generative Model for Autoregressive Layout Generation
LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
-
Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.
-
LPM 1.0: Video-based Character Performance Model
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
LongLive: Real-time Interactive Long Video Generation
LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
-
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.
-
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.
-
MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering
MuSteerNet generates realistic 3D human reactions from videos by mutually steering visual observations and reaction motions to reduce content mismatch.
-
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
-
[1]
Block diffusion: Interpolating between autoregressive and diffusion language models
Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InICLR, 2025
work page 2025
-
[2]
Abdelhak Bentaleb, May Lim, Mehmet N Akcay, Ali C Begen, Sarra Hammoudi, and Roger Zimmermann. Toward one-second latency: Evolution of live media streaming.IEEE Communications Surveys & Tutorials, 2025. 10
work page 2025
-
[3]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Align your latents: High-resolution video synthesis with latent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InCVPR, 2023
work page 2023
-
[5]
Generating long videos of dynamic scenes.NeurIPS, 2022
Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes.NeurIPS, 2022
work page 2022
-
[6]
Video generation models as world simulators, 2024
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024
work page 2024
-
[7]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InICML, 2024
work page 2024
-
[8]
Diffusion forcing: Next-token prediction meets full-sequence diffusion
Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024
work page 2024
-
[9]
Feng Chen, Zhen Yang, Bohan Zhuang, and Qi Wu. Streaming video diffusion: Online video editing with diffusion models.arXiv preprint arXiv:2405.19726, 2024
-
[10]
SkyReels-V2: Infinite-length Film Generative Model
Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Juncheng Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025
work page internal anchor Pith review arXiv 2025
-
[11]
Oasis: A universe in a transformer, 2024
Julian Decart, Quinn Quevedo, Spruce McIntyre, Xinlei Campbell, Robert Chen, and Wachen. Oasis: A universe in a transformer, 2024
work page 2024
-
[12]
arXiv preprint arXiv:2412.12095 , year=
Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, and Haoqi Fan. Causal diffusion transformers for generative modeling.arXiv preprint arXiv:2412.12095, 2024
-
[13]
Autoregressive video generation without vector quantization
Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InICLR, 2025
work page 2025
-
[14]
Unsupervised learning of disentangled representations from video
Emily L Denton et al. Unsupervised learning of disentangled representations from video. InNeurIPS, 2017
work page 2017
-
[15]
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A program- ming model for generating optimized attention kernels.ArXiv, abs/2412.05496, 2024
-
[16]
arXiv preprint arXiv:2411.16375 (2024)
Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, and Long Chen. Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing.arXiv preprint arXiv:2411.16375, 2024
-
[17]
Long video generation with time-agnostic vqgan and time-sensitive transformer
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. InECCV, 2022
work page 2022
-
[18]
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InNeurIPS, 2014
work page 2014
-
[19]
Mamba: Linear-time sequence modeling with selective state spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InCOLM, 2024
work page 2024
-
[20]
Long-context autoregressive video modeling with next-frame prediction
Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025
-
[21]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Long context tuning for video generation
Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025
-
[23]
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In ECCV, 2024
-
[24]
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024
-
[25]
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. ArXiv, abs/2210.02303, 2022
-
[26]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In NeurIPS, 2022
-
[27]
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023
-
[28]
Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, and Maosong Sun. Acdit: Interpolating autoregressive conditional modeling and diffusion transformer. arXiv preprint arXiv:2412.07720, 2024
-
[29]
Nick Huang, Aaron Gokaslan, Volodymyr Kuleshov, and James Tompkin. The gan is dead; long live the gan! a modern gan baseline. In NeurIPS, 2024
-
[30]
Zemin Huang, Zhengyang Geng, Weijian Luo, and Guo-jun Qi. Flow generator matching. arXiv preprint arXiv:2410.19310, 2024
-
[31]
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024
-
[32]
Simon Jenni and Paolo Favaro. On stabilizing generative adversarial training with noise. In CVPR, 2019
-
[33]
Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In ICLR, 2025
-
[34]
Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard gan. In ICLR, 2019
-
[35]
Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training. In NeurIPS, 2024
-
[36]
Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In NeurIPS, 2021
-
[37]
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014
-
[38]
Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. In ICML, 2024
-
[39]
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024
-
[40]
Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NeurIPS, 2016
-
[41]
Qing Li, Xun Tang, Junkun Peng, Yuanzheng Tan, and Yong Jiang. Latency reducing in real-time internet video transport: A survey. SSRN 4654242, 2023
-
[42]
Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025
-
[43]
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024
-
[44]
Zhengqi Li, Qianqian Wang, Noah Snavely, and Angjoo Kanazawa. Infinitenature-zero: Learning perpetual view generation of natural scenes from single images. In ECCV, 2022
-
[45]
Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. In ICLR, 2025
-
[46]
Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, and Diana Marculescu. Looking backward: Streaming video-to-video translation with feature banks. In ICLR, 2025
-
[47]
Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025
-
[48]
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023
-
[49]
Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. In ICCV, 2021
-
[50]
Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024
-
[51]
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023
-
[52]
Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H Chan, and Jean-michel Morel. Redefining temporal modeling in video diffusion: The vectorized timestep approach. arXiv preprint arXiv:2410.03160, 2024
-
[53]
Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, and Haizhou Li. Autoregressive diffusion transformer for text-to-speech synthesis. arXiv preprint arXiv:2406.05551, 2024
-
[54]
Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. In NeurIPS, 2023
-
[55]
Weijian Luo, Zemin Huang, Zhengyang Geng, J Zico Kolter, and Guo-jun Qi. One-step diffusion distillation through score implicit matching. NeurIPS, 2024
-
[56]
Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang, and Wenhan Luo. Osv: One step is enough for high-quality image to video generation. In CVPR, 2025
-
[57]
William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. TACL, 2023
-
[58]
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In ICML, 2018
-
[59]
Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, et al. X-fusion: Introducing new modality to frozen large language models. arXiv preprint arXiv:2504.20996, 2025
-
[60]
Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, and Itir Onal Ertugrul. Elucidating the exposure bias in diffusion models. In ICLR, 2024
-
[61]
Genie 2: A large-scale foundation world model, 2024
Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...
-
[62]
William S Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023
-
[63]
Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models. arXiv preprint arXiv:2505.20171, 2025
-
[64]
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024
-
[65]
Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016
-
[66]
Shuhuai Ren, Shuming Ma, Xu Sun, and Furu Wei. Next block prediction: Video generation via semi-auto-regressive modeling. arXiv preprint arXiv:2502.07737, 2025
-
[67]
David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In ICML, 2024
-
[68]
Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017
-
[69]
Sand-AI. Magi-1: Autoregressive video generation at scale, 2025
-
[70]
Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024
-
[71]
Florian Schmidt. Generalization in generation: A closer look at exposure bias. EMNLP-IJCNLP 2019, page 157, 2019
-
[72]
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In NeurIPS, 2024
-
[73]
Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. arXiv preprint arXiv:2502.06764, 2025
-
[74]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023
-
[75]
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In NeurIPS, 2021
-
[76]
Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. In CVPR, 2025
-
[77]
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In CVPR, 2018
-
[78]
Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In ICLR, 2025
-
[79]
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017
-
[80]
R Villegas, H Moraldo, S Castro, M Babaeizadeh, H Zhang, J Kunze, PJ Kindermans, MT Saffar, and D Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023