arXiv: 2506.08009 · v2 · submitted 2025-06-09 · 💻 cs.CV · cs.AI · cs.LG

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman

Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords: autoregressive video diffusion · exposure bias · self forcing · KV caching · real-time video generation · train-test gap · streaming video

The pith

Self Forcing trains autoregressive video diffusion models on their own generated outputs to close the exposure bias gap and enable real-time streaming.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self Forcing to address exposure bias in autoregressive video diffusion models, where training uses ground-truth context but inference must rely on the model's own imperfect outputs. It performs autoregressive rollout with key-value caching during training so each frame is conditioned on previously self-generated frames, then applies a holistic loss over the full video sequence. Efficiency comes from using a few-step diffusion process together with stochastic gradient truncation. The approach also adds a rolling KV cache for extrapolation. This yields real-time streaming video generation at sub-second latency on a single GPU while matching or exceeding the quality of slower non-causal models.
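
The exposure-bias gap described above can be illustrated with a toy scalar system. The sketch below is not from the paper: it assumes made-up dynamics x[t+1] = a_true * x[t] and a slightly wrong model coefficient a_model, and compares the error when the model is fed ground-truth states (teacher forcing) with the error when it is fed its own outputs (free running). All names and numbers are hypothetical.

```python
def rollout_errors(a_true, a_model, x0, steps):
    """Return (teacher_forced, free_running) absolute errors per step.

    Toy scalar dynamics x[t+1] = a_true * x[t]; the "model" uses a_model.
    Purely illustrative; not the paper's model or metric.
    """
    teacher_forced, free_running = [], []
    x_true = x0   # ground-truth state
    x_self = x0   # state obtained by feeding the model its own outputs
    for _ in range(steps):
        next_true = a_true * x_true
        # Teacher forcing: the model is conditioned on the ground-truth state.
        teacher_forced.append(abs(a_model * x_true - next_true))
        # Free running: the model is conditioned on its own previous output.
        x_self = a_model * x_self
        free_running.append(abs(x_self - next_true))
        x_true = next_true
    return teacher_forced, free_running


tf, fr = rollout_errors(a_true=1.0, a_model=1.05, x0=1.0, steps=20)
```

With these made-up numbers the teacher-forced error stays flat while the free-running error compounds over the rollout, which is the mismatch that training on self-generated context is meant to remove.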

Core claim

Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value caching during training. This enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives, and supports efficient inference via few-step diffusion, stochastic gradient truncation, and a rolling KV cache mechanism.
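
The control flow of such a training rollout might look roughly like the sketch below. It is a schematic under stated assumptions, not the authors' implementation: `denoise` is a hypothetical stand-in for one step of a few-step diffusion model, the cache is a plain list rather than real key-value tensors, the truncation rule (one randomly chosen gradient-carrying step per frame) is an assumed form since this summary does not give the exact rule, and `video_loss` is a placeholder rather than the paper's holistic loss.

```python
import random


def self_forcing_rollout(denoise, num_frames, num_steps, rng):
    """Roll out frames autoregressively, conditioning on self-generated frames.

    Returns the generated frames and, per frame, the single denoising step
    that would keep gradients in a real autograd setup (assumed rule).
    """
    cache = []        # stand-in for the KV cache: holds self-generated frames
    grad_steps = []   # which denoising step would carry gradients, per frame
    for _ in range(num_frames):
        keep = rng.randrange(num_steps)   # stochastic gradient truncation (assumed form)
        x = rng.gauss(0.0, 1.0)           # each frame starts from noise
        for step in range(num_steps):
            x = denoise(x, step, cache)   # context is the model's own earlier output
        cache.append(x)                   # later frames are conditioned on this frame
        grad_steps.append(keep)
    return cache, grad_steps


def toy_denoise(x, step, context):
    # Hypothetical denoiser: pull the sample toward the mean of the context.
    target = sum(context) / len(context) if context else 0.0
    return 0.5 * (x + target)


frames, grad_steps = self_forcing_rollout(
    toy_denoise, num_frames=5, num_steps=4, rng=random.Random(0)
)
# Placeholder video-level loss computed over the whole generated sequence.
video_loss = sum(f * f for f in frames) / len(frames)
```

The point of the sketch is only the ordering: every frame is generated from the cache of earlier self-generated frames, and the loss is computed once over the full sequence rather than per frame.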

What carries the argument

Self Forcing, the training paradigm of autoregressive rollout with KV caching that conditions each frame on self-generated prior outputs and applies video-level loss.

If this is right

Real-time streaming video generation with sub-second latency on a single GPU
Generation quality that matches or surpasses significantly slower non-causal diffusion models
Efficient autoregressive video extrapolation through the rolling KV cache
Holistic video-level supervision instead of per-frame objectives
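
A rolling cache of the kind mentioned in the list above can be sketched with a fixed-capacity queue. This is a minimal illustration that assumes first-in, first-out eviction of the oldest frame; the paper's actual eviction and update rule is not described in this summary, and the class name and string placeholders for keys and values are hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Fixed-capacity cache of per-frame (key, value) entries (illustrative only)."""

    def __init__(self, max_frames):
        # A deque with maxlen drops the oldest entry once capacity is reached.
        self.entries = deque(maxlen=max_frames)

    def append(self, key, value):
        self.entries.append((key, value))

    def keys(self):
        return [k for k, _ in self.entries]

    def __len__(self):
        return len(self.entries)


cache = RollingKVCache(max_frames=3)
for frame_idx in range(5):
    cache.append(key=f"k{frame_idx}", value=f"v{frame_idx}")
```

After five appends the cache still holds three entries, so memory stays bounded however long the extrapolated video becomes.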

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-conditioning idea could reduce error accumulation in other long-horizon autoregressive tasks such as audio or 3D scene generation.
Rolling KV caches may allow extension to substantially longer output sequences without proportional memory growth.
Stochastic gradient truncation could be combined with other efficiency techniques to scale the method to higher-resolution video.

Load-bearing premise

That autoregressive rollout with KV caching during training using a few-step diffusion model and stochastic gradient truncation accurately simulates inference conditions without introducing substantial new biases or quality degradation.

What would settle it

A side-by-side evaluation in which Self Forcing models produce lower perceptual quality scores or exceed sub-second latency on a single GPU compared with non-causal diffusion models run under identical inference settings.

Original abstract

We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Self Forcing trains autoregressive video diffusion with KV-cached rollout and video-level loss to cut exposure bias, delivering claimed real-time single-GPU streaming while the few-step and truncation approximations remain the main open question.

Self Forcing trains the model by actually rolling out autoregressive generations with KV caching during training and applying a loss over the full video sequence instead of frame by frame. This directly exposes the network to its own outputs at train time, which is the core fix for exposure bias in these causal video models. The rolling KV cache extension for extrapolation is a practical addition that keeps memory use manageable for longer clips.

The paper reports that the resulting models run in real time with sub-second latency on one GPU and match or beat the quality of slower non-causal baselines, which is the result that matters for streaming or interactive uses. Those outcomes rest on standard diffusion objectives plus the modified conditioning and supervision, so the method stays grounded in existing machinery rather than introducing new equations.

The few-step diffusion schedule and stochastic gradient truncation are presented as necessary engineering choices to keep training tractable, and the abstract indicates they preserve performance. Still, those shortcuts are the soft spot: they risk under-sampling the exact noise schedule and error accumulation that full inference produces, so the training distribution may not match test conditions as closely as claimed. Without detailed ablations on step count or truncation rate, it is hard to separate genuine bias reduction from optimization artifacts.

The experiments appear to use reasonable baselines and report both quality and speed metrics, which is enough to make the work worth refereeing. This paper is aimed at researchers building causal video generators who need practical speed without sacrificing coherence. A reader working on exposure bias or streaming diffusion would find the training procedure and efficiency numbers useful even if they later tighten the approximation analysis.

I would send it to peer review because the central training change is concrete, the performance claims are testable, and the practical gains are large enough to justify the effort of checking the details.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Self Forcing, a training paradigm for autoregressive video diffusion models that mitigates exposure bias by performing autoregressive rollout with KV caching during training, conditioning each frame on previously self-generated outputs rather than ground-truth context. It employs a few-step diffusion model and stochastic gradient truncation to maintain training efficiency, introduces a rolling KV cache for extrapolation, and supervises training with a holistic video-level loss. The experiments are claimed to show real-time streaming video generation with sub-second latency on a single GPU while matching or surpassing the quality of slower, non-causal diffusion baselines.

Significance. If the training-time approximations faithfully reproduce inference conditions, the approach could enable practical causal autoregressive video generation for low-latency applications. The KV-caching and rolling-cache mechanisms provide concrete efficiency gains, and the shift to self-conditioned training with video-level supervision is a direct procedural response to exposure bias.

major comments (2)

[Section 3] Training procedure (Section 3): The few-step diffusion approximation combined with stochastic gradient truncation is presented as sufficient to simulate full inference-time error accumulation and KV-cache evolution, yet no quantitative analysis (e.g., comparison of noise schedules, drift metrics, or cache-state divergence) is provided to bound the discrepancy; this directly underpins the headline claim that Self Forcing closes the train-test gap without quality degradation.
[Section 4] Experimental validation (Section 4): The reported sub-second latency and quality parity with non-causal models rely on the truncated training procedure, but the manuscript lacks ablations isolating the effects of step count and truncation probability on long-horizon consistency and cache behavior; without these, it is unclear whether the performance gains are robust or artifacts of the efficiency shortcuts.

minor comments (2)

[Abstract] The abstract states that supervision occurs 'through a holistic loss at the video level,' but the precise formulation of this loss relative to the standard per-frame diffusion objective is not shown as an equation; adding it would clarify the difference from prior frame-wise training.
Figure captions and method diagrams would benefit from explicit annotation of the stochastic truncation points and the rolling KV-cache update rule to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We appreciate the focus on the training approximations and experimental rigor. Below we address each major comment point by point and describe the revisions we will make to strengthen the paper.

Referee: [Section 3] Training procedure (Section 3): The few-step diffusion approximation combined with stochastic gradient truncation is presented as sufficient to simulate full inference-time error accumulation and KV-cache evolution, yet no quantitative analysis (e.g., comparison of noise schedules, drift metrics, or cache-state divergence) is provided to bound the discrepancy; this directly underpins the headline claim that Self Forcing closes the train-test gap without quality degradation.

Authors: We agree that explicit quantitative bounds on the discrepancy would strengthen the justification for the few-step diffusion and stochastic gradient truncation. The current manuscript demonstrates effectiveness through end-to-end video-level quality metrics, latency results, and comparisons to non-causal baselines, which indirectly support that the approximations preserve the benefits of self-forcing. To directly address the concern, we will add a new analysis subsection in Section 3 that includes quantitative comparisons such as cache-state divergence (measured via L2 distance on KV tensors) and drift metrics (e.g., accumulated noise schedule deviation) between truncated and full rollouts on short sequences. This will provide explicit bounds and better support the claim that the train-test gap is closed without quality degradation. revision: yes
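
The cache-state divergence proposed in this response could be computed along the following lines. This is a hedged sketch, not an analysis from the paper: KV tensors are assumed to be flattened into equal-length lists of floats, the function names are hypothetical, and the sample values are made up purely to exercise the code.

```python
import math


def l2_divergence(kv_a, kv_b):
    """L2 distance between two equally sized, flattened KV tensors."""
    if len(kv_a) != len(kv_b):
        raise ValueError("KV tensors must have the same number of elements")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(kv_a, kv_b)))


def divergence_per_frame(full_rollout, truncated_rollout):
    """Per-frame L2 divergence between two rollouts of flattened KV tensors."""
    return [l2_divergence(a, b) for a, b in zip(full_rollout, truncated_rollout)]


full = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]       # made-up values
truncated = [[0.0, 0.0], [1.0, 1.5], [2.3, 2.4]]  # made-up values
drift = divergence_per_frame(full, truncated)
```

Tracking this quantity frame by frame would show whether the gap between truncated and full rollouts grows with sequence length or stays bounded.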
Referee: [Section 4] Experimental validation (Section 4): The reported sub-second latency and quality parity with non-causal models rely on the truncated training procedure, but the manuscript lacks ablations isolating the effects of step count and truncation probability on long-horizon consistency and cache behavior; without these, it is unclear whether the performance gains are robust or artifacts of the efficiency shortcuts.

Authors: We acknowledge that dedicated ablations isolating step count and truncation probability would improve clarity on robustness. The existing experiments already vary sequence lengths and report consistent quality across different video durations, with the rolling KV cache enabling extrapolation. However, to isolate these hyperparameters, we will expand Section 4 with new ablation tables that vary diffusion steps (1, 2, 4, 8) and truncation probabilities (0.1, 0.3, 0.5), reporting metrics for long-horizon consistency (e.g., temporal coherence scores) and cache behavior (e.g., cache hit rates and state divergence over 100+ frames). These additions will confirm that the gains are not artifacts of the shortcuts. revision: yes
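
For reference, the ablation grid named in this response can be enumerated as below. Only the step counts and truncation probabilities come from the text; the dictionary keys are hypothetical, and no results are implied.

```python
from itertools import product

diffusion_steps = [1, 2, 4, 8]        # step counts named in the response
truncation_probs = [0.1, 0.3, 0.5]    # truncation probabilities named in the response

# One configuration per (steps, probability) pair; key names are illustrative.
ablation_grid = [
    {"steps": s, "truncation_prob": p}
    for s, p in product(diffusion_steps, truncation_probs)
]
```

The full grid has twelve configurations, each of which would need its own long-horizon evaluation run.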

Circularity Check

0 steps flagged

No significant circularity; training procedure is a self-contained procedural change.

The paper presents Self Forcing as a training strategy that performs autoregressive rollout with KV caching to address exposure bias, supplemented by few-step diffusion and stochastic gradient truncation for tractability. No equations, fitted parameters, or self-citations are shown to reduce the claimed performance gains (real-time causal generation matching non-causal baselines) to the inputs by construction. The derivation chain consists of standard diffusion objectives with modified conditioning and rollout, evaluated externally against baselines. This is the most common honest finding for method papers without mathematical self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that few-step diffusion plus stochastic gradient truncation preserves sufficient training signal for the autoregressive objective, plus standard diffusion model assumptions about noise schedules and conditioning.

free parameters (1)

number of diffusion steps
Few-step diffusion model chosen to balance training speed and quality; exact count not specified in abstract.

axioms (1)

domain assumption: Few-step diffusion approximates the full multi-step denoising process sufficiently for training the autoregressive objective
Invoked to enable efficient autoregressive rollout during training.

pith-pipeline@v0.9.0 · 5502 in / 1215 out tokens · 30872 ms · 2026-05-11T01:30:50.959758+00:00 · methodology

Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
cs.CV 2026-04 unverdicted novelty 8.0

ReconPhys is the first feedforward neural network that jointly reconstructs 3D geometry and appearance via Gaussian Splatting while estimating physical attributes from a single monocular video using self-supervised training.
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
cs.CV 2026-05 unverdicted novelty 7.0

KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG 2026-05 unverdicted novelty 7.0

Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
cs.CV 2026-05 unverdicted novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
cs.RO 2026-05 unverdicted novelty 7.0

DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 7.0

AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
cs.CV 2026-04 unverdicted novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference
cs.CV 2026-04 unverdicted novelty 7.0

X-Cache achieves 71% block skip rate and 2.6x wall-clock speedup in few-step autoregressive multi-camera driving world models via cross-chunk residual caching with dual-metric gating and forced KV updates.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
Speculative Decoding for Autoregressive Video Generation
cs.CV 2026-04 conditional novelty 7.0

A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
eess.IV 2026-04 unverdicted novelty 7.0

DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
cs.CV 2026-03 unverdicted novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
Quantitative Video World Model Evaluation for Geometric-Consistency
cs.CV 2026-05 unverdicted novelty 6.0

PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
cs.CV 2026-05 unverdicted novelty 6.0

Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
cs.CV 2026-05 unverdicted novelty 6.0

ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
FlashMol: High-Quality Molecule Generation in as Few as Four Steps
cs.LG 2026-05 unverdicted novelty 6.0

FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
cs.CV 2026-05 unverdicted novelty 6.0

RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Stream-T1: Test-Time Scaling for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
cs.CV 2026-05 unverdicted novelty 6.0

M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
cs.CV 2026-04 unverdicted novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
Repurposing 3D Generative Model for Autoregressive Layout Generation
cs.CV 2026-04 unverdicted novelty 6.0

LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
cs.CV 2026-04 unverdicted novelty 6.0

Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
cs.CV 2026-04 unverdicted novelty 6.0

RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
Lyra 2.0: Explorable Generative 3D Worlds
cs.CV 2026-04 unverdicted novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
cs.CV 2026-04 conditional novelty 6.0

Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.
LPM 1.0: Video-based Character Performance Model
cs.CV 2026-04 unverdicted novelty 6.0

LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
cs.CV 2026-04 unverdicted novelty 6.0

INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
LongLive: Real-time Interactive Long Video Generation
cs.CV 2025-09 conditional novelty 6.0

LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
cs.CV 2026-05 unverdicted novelty 5.0

Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...
A Systematic Post-Train Framework for Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
cs.CV 2026-04 unverdicted novelty 5.0

PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
cs.CV 2026-04 unverdicted novelty 5.0

TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.
MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering
cs.CV 2026-03 unverdicted novelty 5.0

MuSteerNet generates realistic 3D human reactions from videos by mutually steering visual observations and reaction motions to reduce content mismatch.
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
cs.CV 2026-04 unverdicted novelty 4.0

HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
cs.CV 2026-04 unverdicted novelty 4.0

Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 53 Pith papers · 11 internal anchors
