pith · machine review for the scientific record

arxiv: 2509.22622 · v2 · submitted 2025-09-26 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

LongLive: Real-time Interactive Long Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords: long video generation · autoregressive model · real-time video · interactive generation · KV cache · video synthesis · causal attention

The pith

LongLive turns a short-clip autoregressive model into a real-time system that generates up to 240-second videos at 20.7 FPS while accepting streaming prompt changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongLive as a frame-level autoregressive framework that addresses the efficiency and consistency problems of long video generation. It uses KV-recache to refresh cached states when the prompt changes, streaming long tuning to match training and inference lengths, and short-window attention plus a frame sink to keep visual coherence without full attention. These designs let the authors fine-tune a 1.3B-parameter short-clip model for minute-scale generation in 32 GPU-days. At inference the system runs at 20.7 frames per second on a single H100 GPU and handles interactive prompt streams without major drift.
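
A minimal sketch of that attention pattern in PyTorch, assuming one token block per frame; the window and sink sizes here are illustrative, since the paper's exact values are not stated in the abstract:

    import torch

    # Causal frame-level attention mask: each frame attends to itself, the
    # previous frames inside `window`, and the first `sink` frames (frame sink).
    def frame_sink_mask(num_frames: int, window: int, sink: int) -> torch.Tensor:
        q = torch.arange(num_frames).unsqueeze(1)  # query frame index
        k = torch.arange(num_frames).unsqueeze(0)  # key frame index
        causal = k <= q
        in_window = (q - k) < window
        in_sink = k < sink
        return causal & (in_window | in_sink)

    # Frame 11 with window=4, sink=2 attends to frames 0, 1 and 8..11:
    # long-range anchoring from the sink, local coherence from the window.
    print(frame_sink_mask(12, window=4, sink=2).int())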

Core claim

LongLive is a causal frame-level autoregressive model that integrates a KV-recache mechanism to update cached states for prompt switches, streaming long tuning to enforce train-long-test-long alignment, and short-window attention paired with a frame sink to maintain long-range consistency while speeding generation. With these designs the model supports minute-long videos, real-time interaction, and INT8 inference on a single GPU.
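
The INT8 claim is easiest to picture as weight quantization of the model's linear layers. The sketch below uses PyTorch's stock dynamic quantization as a stand-in; the paper's actual INT8 pipeline is not described in the abstract and may differ:

    import torch
    import torch.nn as nn

    # Stand-in transformer MLP: quantize only the Linear weights to INT8.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 1024)
    # "Marginal quality loss" corresponds to a small numeric gap like this one.
    print((model(x) - qmodel(x)).abs().max().item())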

What carries the argument

KV-recache combined with short-window attention and frame sink inside a causal frame-level autoregressive architecture.
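
Our reading of the recache step, as a hedged sketch: on a prompt switch, the retained frame latents are re-projected into keys and values under the new prompt conditioning, rather than the cache being cleared (which would lose history) or kept stale (which would ignore the new prompt). `project_kv` is a hypothetical stand-in for the model's prompt-conditioned KV projection:

    import torch

    def project_kv(frame_latent, prompt_emb):
        # Hypothetical prompt-conditioned KV projection; placeholder math only.
        return frame_latent + prompt_emb, frame_latent * 0.5 + prompt_emb

    def kv_recache(frame_latents, new_prompt_emb):
        # Rebuild the KV cache from the kept frames under the new prompt, so
        # history survives the switch while cached states adhere to it.
        return [project_kv(z, new_prompt_emb) for z in frame_latents]

    cache = kv_recache([torch.randn(16) for _ in range(8)], torch.randn(16))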

If this is right

  • Real-time interactive video creation becomes practical on consumer GPUs.
  • Training cost for long-video capability drops to tens of GPU-days instead of hundreds.
  • The same model produces both short clips and full-minute sequences at high speed.
  • INT8 quantization preserves quality, enabling lower-memory deployment.
  • VBench scores remain strong for both short and long outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could transfer to other autoregressive sequence tasks such as long audio or motion capture.
  • If memory scaling improves, the same mechanisms might support hour-long coherent videos.
  • Interactive control opens uses in live editing, simulation, or educational content.
  • Quantized real-time inference suggests deployment on edge devices for on-the-fly video synthesis.

Load-bearing premise

The KV-recache and frame-sink pair keeps visual consistency and semantic adherence across prompt changes and long sequences without cumulative drift or artifacts.

What would settle it

A 240-second video with repeated prompt transitions that shows visible object distortion, color shift, or loss of prompt adherence after the first minute.
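
One way to run that test, sketched with CLIP as the adherence metric (our choice; the paper does not commit to a specific score): sample one frame per second, score each frame against the prompt active at that timestamp, and look for a sustained drop after the first minute or after switches:

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def adherence_curve(frames, prompt_at):
        # frames: PIL images sampled once per second; prompt_at(t) returns the
        # prompt active at second t. Returns per-second image-text similarity.
        scores = []
        for t, frame in enumerate(frames):
            inputs = processor(text=[prompt_at(t)], images=frame, return_tensors="pt")
            with torch.no_grad():
                out = model(**inputs)
            img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
            txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
            scores.append(float((img * txt).sum()))
        return scores  # a sustained downward trend over 240 s would falsify the premise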

read the original abstract

We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents LongLive, a causal frame-level autoregressive framework for real-time interactive long video generation. It integrates KV-recache to refresh states on prompt switches, streaming long tuning to align train and inference distributions, and short-window attention with a frame sink to preserve consistency. Starting from a 1.3B-parameter short-clip model, the approach enables minute-long generation after 32 GPU-days of fine-tuning, delivering 20.7 FPS inference on a single H100 GPU, support for up to 240-second videos, strong VBench scores on both short and long clips, and INT8 quantization with marginal quality loss.

Significance. If the reported efficiency and consistency results are substantiated, the work would constitute a meaningful step toward practical interactive long-video synthesis. The causal AR design with targeted long-sequence mechanisms addresses the efficiency bottlenecks of bidirectional diffusion models and the quality degradation typical of standard KV-cached autoregressive video models, potentially enabling real-time applications that were previously infeasible on single-GPU hardware.

major comments (3)
  1. [Experimental Results] The headline claim that KV-recache combined with short-window attention and frame sink prevents cumulative drift and maintains semantic adherence across prompt transitions rests on untested premises. No ablation removing KV-recache (or the other components) is reported, nor are transition-specific metrics such as per-switch CLIP/DINO consistency or drift curves over 240 s provided in the results.
  2. [Results] The reported 20.7 FPS, VBench scores, and 240 s duration figures are given as point estimates without error bars, standard deviations across runs, or details on the number of evaluation seeds. This weakens confidence in the reliability of the efficiency and quality assertions that underpin the central contribution.
  3. [Methods] The streaming long tuning procedure is described as aligning training and inference, yet no quantitative comparison (e.g., train-long vs. train-short test-long performance gap) is supplied to demonstrate that the alignment actually reduces the quality degradation that the introduction attributes to standard AR long-video training.
minor comments (2)
  1. [Abstract] The abstract states 'strong performance on VBench' without quoting the numerical scores or naming the exact baselines used for comparison; these details should appear in the abstract or be cross-referenced to a table.
  2. [Methods] Implementation specifics of the frame sink (exact sink size, how it interacts with the short window) are only sketched; a precise algorithmic description or pseudocode would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the detailed and insightful review of our manuscript. The comments highlight important aspects that can further strengthen the presentation of our work on LongLive. We address each major comment below and outline the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [Experimental Results] The headline claim that KV-recache combined with short-window attention and frame sink prevents cumulative drift and maintains semantic adherence across prompt transitions rests on untested premises. No ablation removing KV-recache (or the other components) is reported, nor are transition-specific metrics such as per-switch CLIP/DINO consistency or drift curves over 240 s provided in the results.

    Authors: We appreciate this observation. While the main results demonstrate the overall effectiveness through VBench scores and qualitative examples of prompt transitions, we agree that dedicated ablations isolating the contribution of KV-recache, short-window attention, and frame sink would provide stronger evidence. Additionally, we will include transition-specific metrics such as per-switch CLIP and DINO consistency scores, as well as drift curves over extended sequences. These will be added to the revised manuscript; a sketch of one such per-switch metric follows these responses. revision: yes

  2. Referee: [Results] The reported 20.7 FPS, VBench scores, and 240 s duration figures are given as point estimates without error bars, standard deviations across runs, or details on the number of evaluation seeds. This weakens confidence in the reliability of the efficiency and quality assertions that underpin the central contribution.

    Authors: We acknowledge the importance of statistical reliability in reporting. In the revised version, we will provide error bars, standard deviations, and specify the number of evaluation seeds used for the FPS, VBench, and duration metrics. This will be based on multiple runs to better substantiate the claims. revision: yes

  3. Referee: [Methods] The streaming long tuning procedure is described as aligning training and inference, yet no quantitative comparison (e.g., train-long vs. train-short test-long performance gap) is supplied to demonstrate that the alignment actually reduces the quality degradation that the introduction attributes to standard AR long-video training.

    Authors: Thank you for pointing this out. To demonstrate the benefit of streaming long tuning, we will add a quantitative comparison in the methods or experiments section, showing performance gaps between models trained with train-long vs. train-short on long test sequences. This will include metrics highlighting the reduction in quality degradation. revision: yes
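
A sketch of the per-switch consistency metric the responses promise, under our assumptions about its form: embed the frames just before and just after each prompt switch with a frozen encoder (DINO ViT-S/16 via torch.hub, one plausible choice; the authors may use another backbone) and report the cosine similarity, so that visual identity is checked across the switch even as semantics change:

    import torch

    # Frozen DINO ViT-S/16 as the feature extractor.
    dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
    dino.eval()

    def per_switch_consistency(frames, switch_idx, span=4):
        # frames: [T, 3, 224, 224] normalized tensor; switch_idx: frame indices
        # where the prompt changes. Compares mean features over `span` frames
        # on each side of every switch; low similarity flags identity breaks.
        with torch.no_grad():
            feats = dino(frames)  # [T, D] CLS features
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = []
        for s in switch_idx:
            before = feats[max(0, s - span):s].mean(0)
            after = feats[s:s + span].mean(0)
            scores.append(float(torch.dot(before / before.norm(), after / after.norm())))
        return scores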

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical method (KV-recache for prompt transitions, streaming long tuning for train-long-test-long alignment, and short-window attention plus frame sink for consistency) whose performance claims rest on measured quantities: 32 GPU-days of fine-tuning a 1.3B model, 20.7 FPS inference on H100, VBench scores, and support for 240-second videos. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce these results to the inputs by construction. The derivation is self-contained and externally falsifiable via the reported benchmarks and throughput measurements.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the KV-recache mechanism for prompt adherence and the frame sink for long-range consistency; these are introduced without independent theoretical proof and are validated only through the reported training and inference runs.

free parameters (2)
  • short window size
    Chosen to trade off memory and consistency; exact value not stated in abstract but required for the attention design.
  • frame sink size
    Hyperparameter controlling how many early frames remain visible; fitted or tuned during development.
axioms (2)
  • domain assumption: Causal attention permits efficient KV caching without quality loss relative to bidirectional attention for video sequences.
    Invoked to justify the speed advantage of the AR design.
  • domain assumption: Streaming long tuning aligns training and inference distributions sufficiently to prevent degradation on long outputs.
    Central to the claim that minute-long generation is achievable from short-clip pretraining; a schematic sketch follows this ledger.
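
A schematic of the train-long-test-long idea, under our assumptions about its form (the abstract names the procedure but not its loss): unroll the model autoregressively over a long horizon exactly as at inference, carrying the KV cache across chunks, and apply the training loss to the long rollout rather than to isolated short clips. `generate_chunk` and `rollout_loss` are hypothetical placeholders:

    import torch

    def streaming_long_tuning_step(model, prompt_emb, horizon, chunk, rollout_loss, opt):
        # Unroll as at inference: the KV cache is carried across frame chunks.
        cache, frames = None, []
        for _ in range(horizon // chunk):
            out, cache = model.generate_chunk(prompt_emb, cache, n_frames=chunk)
            frames.append(out)
        rollout = torch.cat(frames, dim=0)
        loss = rollout_loss(rollout)  # supervision on the long rollout itself
        opt.zero_grad()
        loss.backward()
        opt.step()
        return float(loss)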

pith-pipeline@v0.9.0 · 5614 in / 1589 out tokens · 45255 ms · 2026-05-15T03:48:41.959401+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

  2. EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

    cs.CV 2026-05 conditional novelty 7.0

    EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.

  3. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  4. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  5. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  6. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  7. Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

  8. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  9. Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.

  10. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  11. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  12. RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

    cs.CV 2026-05 unverdicted novelty 6.0

    RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

  13. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  14. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.

  15. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  16. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  17. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  18. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  19. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    cs.CV 2025-12 unverdicted novelty 6.0

    WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.

  20. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  21. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

  22. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  23. TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · cited by 22 Pith papers · 7 internal anchors

  1. [1]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. In NeurIPS, 2024 a

  2. [2]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

  3. [3]

    Sana-video: Efficient video generation with block linear diffusion transformer

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, and Enze Xie. Sana-video: Efficient video generation with block linear diffusion transformer, 2025 b . URL https://arxiv.org/a...

  4. [4]

    SEINE: short-to-long video diffusion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. SEINE: short-to-long video diffusion model for generative transition and prediction. In ICLR, 2024 b

  5. [5]

    Longlora: Efficient fine-tuning of long-context large language models

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. In ICLR, 2024 c

  6. [6]

    One-minute video generation with test-time training

    Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. In CVPR, pp. 17702-17711, 2025

  7. [7]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. In ICLR, 2025

  8. [8]

    The matrix: Infinite-horizon world generation with real-time moving control

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. CoRR, abs/2412.03568, 2024

  9. [9]

    Longvie: Multimodal-guided controllable ultra-long video generation

    Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, and Ziwei Liu. Longvie: Multimodal-guided controllable ultra-long video generation. CoRR, abs/2508.03694, 2025

  10. [10]

    Long-context autoregressive video modeling with next-frame prediction

    Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. CoRR, abs/2503.19325, 2025

  11. [11]

    Long context tuning for video generation

    Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. CoRR, abs/2503.10589, 2025

  12. [12]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion. CoRR, abs/2501.00103, 2025

  13. [13]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. CoRR, abs/2211.13221, 2022

  14. [14]

    Streamingt2v: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In CVPR, pp. 2568-2577, 2025

  15. [15]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. CoRR, abs/2506.08009, 2025

  16. [16]

    VBench : Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench : Comprehensive benchmark suite for video generative models. In CVPR, 2024 a

  17. [17]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models. CoRR, abs/2411.13503, 2024 b

  18. [18]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In ICLR, 2025

  19. [19]

    Streamdit: Real-time streaming text-to-video generation

    Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. CoRR, abs/2507.03745, 2025

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  21. [21]

    Kling AI: Next-generation AI creative studio, 2024

    Kuaishou. Kling AI: Next-generation AI creative studio, 2024

  22. [22]

    Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models

    Muyang Li*, Yujun Lin*, Zhekai Zhang*, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. In The Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    Freelong++: Training-free long video generation via multi-band spectralfusion

    Yu Lu and Yi Yang. Freelong++: Training-free long video generation via multi-band spectralfusion. CoRR, abs/2507.00162, 2025

  24. [24]

    Freelong: Training-free long video generation with spectralblend temporal attention

    Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. Freelong: Training-free long video generation with spectralblend temporal attention. In NeurIPS, 2024

  25. [25]

    Yume: An interactive world generation model

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. CoRR, abs/2507.17744, 2025

  26. [26]

    Sora: Creating video from text, 2024

    OpenAI. Sora: Creating video from text, 2024

  27. [27]

    Introducing GPT-5, August 2025

    OpenAI. Introducing GPT-5, August 2025. Accessed: 2025-09-21

  28. [28]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pp. 4172-4182, 2023

  29. [29]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. In ICLR, 2024

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139, pp. 8748-8763, 2021

  31. [31]

    History-guided video diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. CoRR, abs/2502.06764, 2025

  32. [32]

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng Yin, Sir...

  33. [33]

    Phenaki: Variable length video generation from open domain textual descriptions

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023

  34. [35]

    Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models

    Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. 2024

  35. [36]

    Lavie: High-quality video generation with cascaded latent diffusion models

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. Lavie: High-quality video generation with cascaded latent diffusion models. Int. J. Comput. Vis., 133(5): ...

  36. [37]

    Mocha: Towards movie-grade talking character synthesis

    Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, and Wenhu Chen. Mocha: Towards movie-grade talking character synthesis. CoRR, abs/2503.23307, 2025

  37. [38]

    Qwen2 technical report

    An Yang, Jinze Bai, et al. Qwen2 technical report. arXiv, 2024 a

  38. [41]

    Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies

    Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12):1551-1558, 2021

  39. [42]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

  40. [43]

    NUWA-XL: diffusion over diffusion for extremely long video generation

    Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Ming Gong, Lijuan Wang, Zicheng Liu, Houqiang Li, and Nan Duan. NUWA-XL: diffusion over diffusion for extremely long video generation. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), ACL, pp. ...

  41. [44]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. In NeurIPS, volume 37, 2024 a

  42. [45]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, pp. 6613-6623, 2024 b

  43. [46]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

  44. [47]

    Lumos-1: On autoregressive video generation from a unified model perspective

    Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, and Yi Yang. Lumos-1: On autoregressive video generation from a unified model perspective. CoRR, abs/2507.08801, 2025

  45. [48]

    Packing input frame context in next-frame prediction models for video generation

    Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. CoRR, abs/2504.12626, 2025

  46. [49]

    Matrix-game: Interactive world foundation model

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model. CoRR, abs/2506.18701, 2025

  47. [50]

    Riflex: A free lunch for length extrapolation in video diffusion transformers

    Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. CoRR, abs/2502.15894, 2025

  48. [51]

    Taming teacher forcing for masked autoregressive video generation

    Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, and Heung-Yeung Shum. Taming teacher forcing for masked autoregressive video generation, 2025. URL https://arxiv.org/abs/2501.12389

  49. [52]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, MIT Press, 2007

  50. [53]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006

  51. [54]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016


  52. [57]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
