LongLive: Real-time Interactive Long Video Generation
Pith reviewed 2026-05-15 03:48 UTC · model grok-4.3
The pith
LongLive turns a short-clip autoregressive model into a real-time system that generates up to 240-second videos at 20.7 FPS while accepting streaming prompt changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongLive is a causal, frame-level autoregressive model that integrates three mechanisms: a KV-recache that refreshes cached states on prompt switches, streaming long tuning that enforces train-long-test-long alignment, and short-window attention paired with a frame sink that maintains long-range consistency while accelerating generation. With these designs the model supports minute-long videos, real-time interaction, and INT8 inference on a single GPU.
What carries the argument
KV-recache combined with short-window attention and frame sink inside a causal frame-level autoregressive architecture.
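The mechanism named above can be sketched as toy cache bookkeeping. The class below is an illustrative assumption, not the paper's implementation: it keeps a small frame sink plus a short rolling window of per-frame KV entries, and a `recache` hook stands in for re-encoding the cached frames under a new prompt instead of discarding history.

```python
from collections import deque

# Toy sketch (not the paper's code): per-layer KV cache that always keeps the
# first `sink` frames ("frame sink") plus a short rolling window of recent
# frames, and rebuilds ("recaches") entries on a prompt switch. The names
# `recache`, `window`, and `reencode` are illustrative assumptions.
class FrameKVCache:
    def __init__(self, sink=1, window=8):
        self.sink_frames = []                # first frames, never evicted
        self.window = deque(maxlen=window)   # recent frames, rolling
        self.sink = sink

    def append(self, frame_kv):
        if len(self.sink_frames) < self.sink:
            self.sink_frames.append(frame_kv)
        else:
            self.window.append(frame_kv)

    def recache(self, reencode):
        # On a prompt switch, refresh cached states under the new prompt
        # rather than clearing them; `reencode` stands in for a forward pass
        # that recomputes KV for the cached frames.
        self.sink_frames = [reencode(f) for f in self.sink_frames]
        self.window = deque((reencode(f) for f in self.window),
                            maxlen=self.window.maxlen)

    def context(self):
        return self.sink_frames + list(self.window)

cache = FrameKVCache(sink=1, window=4)
for t in range(10):
    cache.append(("kv", t))
print([f[1] for f in cache.context()])  # sink frame 0 plus the last 4 frames
```

The point of the sketch is the attention-context policy: memory stays bounded by `sink + window` regardless of video length, while the sink anchors long-range appearance.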
If this is right
- Real-time interactive video creation becomes practical on consumer GPUs.
- Training cost for long-video capability drops to tens of GPU-days instead of hundreds.
- The same model produces both short clips and full-minute sequences at high speed.
- INT8 quantization preserves quality, enabling lower-memory deployment.
- VBench scores remain strong for both short and long outputs.
Where Pith is reading between the lines
- The approach could transfer to other autoregressive sequence tasks such as long audio or motion capture.
- If memory scaling improves, the same mechanisms might support hour-long coherent videos.
- Interactive control opens uses in live editing, simulation, or educational content.
- Quantized real-time inference suggests deployment on edge devices for on-the-fly video synthesis.
Load-bearing premise
The KV-recache and frame-sink pair keeps visual consistency and semantic adherence across prompt changes and long sequences without cumulative drift or artifacts.
What would settle it
A 240-second video with repeated prompt transitions that shows visible object distortion, color shift, or loss of prompt adherence after the first minute would falsify the load-bearing premise.
original abstract
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LongLive, a causal frame-level autoregressive framework for real-time interactive long video generation. It integrates KV-recache to refresh states on prompt switches, streaming long tuning to align train and inference distributions, and short-window attention with a frame sink to preserve consistency. Starting from a 1.3B-parameter short-clip model, the approach enables minute-long generation after 32 GPU-days of fine-tuning, delivering 20.7 FPS inference on a single H100 GPU, support for up to 240-second videos, strong VBench scores on both short and long clips, and INT8 quantization with marginal quality loss.
Significance. If the reported efficiency and consistency results are substantiated, the work would constitute a meaningful step toward practical interactive long-video synthesis. The causal AR design with targeted long-sequence mechanisms addresses the efficiency bottlenecks of bidirectional diffusion models and the quality degradation typical of standard KV-cached autoregressive video models, potentially enabling real-time applications that were previously infeasible on single-GPU hardware.
major comments (3)
- [Experimental Results] The headline claim that KV-recache combined with short-window attention and frame sink prevents cumulative drift and maintains semantic adherence across prompt transitions rests on untested premises. No ablation removing KV-recache (or the other components) is reported, nor are transition-specific metrics such as per-switch CLIP/DINO consistency or drift curves over 240 s provided in the results.
- [Results] The reported 20.7 FPS, VBench scores, and 240 s duration figures are given as point estimates without error bars, standard deviations across runs, or details on the number of evaluation seeds. This weakens confidence in the reliability of the efficiency and quality assertions that underpin the central contribution.
- [Methods] The streaming long tuning procedure is described as aligning training and inference, yet no quantitative comparison (e.g., train-long vs. train-short test-long performance gap) is supplied to demonstrate that the alignment actually reduces the quality degradation that the introduction attributes to standard AR long-video training.
minor comments (2)
- [Abstract] The abstract states 'strong performance on VBench' without quoting the numerical scores or naming the exact baselines used for comparison; these details should appear in the abstract or be cross-referenced to a table.
- [Methods] Implementation specifics of the frame sink (exact sink size, how it interacts with the short window) are only sketched; a precise algorithmic description or pseudocode would aid reproducibility.
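For the sink-window interaction this comment flags, one hedged way to picture the pattern is as a causal attention mask in which each frame attends to the first `sink` frames plus a short trailing window. The sizes and the `sink_window_mask` helper are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

# Illustrative causal mask for short-window attention with a frame-level sink:
# frame t may attend to the first `sink` frames and to the last `window`
# frames up to and including t. Sizes here are assumptions for illustration.
def sink_window_mask(n_frames, sink=2, window=3):
    mask = np.zeros((n_frames, n_frames), dtype=bool)
    for t in range(n_frames):
        mask[t, :min(sink, t + 1)] = True             # frame sink, always visible
        mask[t, max(0, t - window + 1):t + 1] = True  # short causal window
    return mask

m = sink_window_mask(6, sink=1, window=2)
# row 5 attends to frame 0 (sink) and frames 4-5 (window)
print(m[5].astype(int))  # [1 0 0 0 1 1]
```

A precise version of this mask, with the actual sink and window sizes, is the kind of algorithmic description the comment asks the authors to provide.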
Simulated Author's Rebuttal
We sincerely thank the referee for the detailed and insightful review of our manuscript. The comments highlight important aspects that can further strengthen the presentation of our work on LongLive. We address each major comment below and outline the revisions we plan to incorporate.
point-by-point responses
-
Referee: [Experimental Results] The headline claim that KV-recache combined with short-window attention and frame sink prevents cumulative drift and maintains semantic adherence across prompt transitions rests on untested premises. No ablation removing KV-recache (or the other components) is reported, nor are transition-specific metrics such as per-switch CLIP/DINO consistency or drift curves over 240 s provided in the results.
Authors: We appreciate this observation. While the main results demonstrate the overall effectiveness through VBench scores and qualitative examples of prompt transitions, we agree that dedicated ablations isolating the contribution of KV-recache, short-window attention, and frame sink would provide stronger evidence. Additionally, we will include transition-specific metrics such as per-switch CLIP and DINO consistency scores, as well as drift curves over extended sequences. These will be added to the revised manuscript. revision: yes
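The promised transition metrics could take roughly this shape: embed each frame (in practice with a CLIP or DINO image encoder; random vectors stand in here) and track cosine similarity of every frame to a reference frame taken just after a prompt switch. A falling curve would indicate cumulative drift. The `drift_curve` helper is an illustrative assumption, not the authors' evaluation code.

```python
import numpy as np

# Sketch of a per-switch consistency / drift metric: cosine similarity of each
# frame embedding to a reference frame. Real embeddings would come from a CLIP
# or DINO encoder; random vectors stand in for them here.
def drift_curve(frame_embs, ref_idx=0):
    ref = frame_embs[ref_idx]
    ref = ref / np.linalg.norm(ref)
    embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return embs @ ref  # cosine similarity of every frame to the reference

rng = np.random.default_rng(0)
embs = rng.normal(size=(8, 512))   # 8 stand-in frame embeddings
curve = drift_curve(embs)
print(round(float(curve[0]), 3))   # frame 0 vs itself -> 1.0
```

Averaging such curves over many prompt switches, and plotting them out to 240 s, would directly address the referee's request.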
-
Referee: [Results] The reported 20.7 FPS, VBench scores, and 240 s duration figures are given as point estimates without error bars, standard deviations across runs, or details on the number of evaluation seeds. This weakens confidence in the reliability of the efficiency and quality assertions that underpin the central contribution.
Authors: We acknowledge the importance of statistical reliability in reporting. In the revised version, we will provide error bars, standard deviations, and specify the number of evaluation seeds used for the FPS, VBench, and duration metrics. This will be based on multiple runs to better substantiate the claims. revision: yes
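The seed-level reporting promised here reduces to a small calculation: run the benchmark under several seeds and report mean and sample standard deviation. The per-seed FPS values below are hypothetical placeholders, not measurements from the paper.

```python
import statistics

# Minimal sketch of multi-seed reporting. The FPS values are made up for
# illustration only; the real numbers would come from repeated benchmark runs.
fps_runs = [20.5, 20.9, 20.7, 20.6, 20.8]  # hypothetical per-seed throughput
mean = statistics.mean(fps_runs)
std = statistics.stdev(fps_runs)           # sample standard deviation
print(f"{mean:.2f} +/- {std:.2f} FPS over {len(fps_runs)} seeds")
```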
-
Referee: [Methods] The streaming long tuning procedure is described as aligning training and inference, yet no quantitative comparison (e.g., train-long vs. train-short test-long performance gap) is supplied to demonstrate that the alignment actually reduces the quality degradation that the introduction attributes to standard AR long-video training.
Authors: Thank you for pointing this out. To demonstrate the benefit of streaming long tuning, we will add a quantitative comparison in the methods or experiments section, showing performance gaps between models trained with train-long vs. train-short on long test sequences. This will include metrics highlighting the reduction in quality degradation. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical method (KV-recache for prompt transitions, streaming long tuning for train-long-test-long alignment, and short-window attention plus frame sink for consistency) whose performance claims rest on measured quantities: 32 GPU-days of fine-tuning a 1.3B model, 20.7 FPS inference on H100, VBench scores, and support for 240-second videos. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce these results to the inputs by construction. The derivation is self-contained and externally falsifiable via the reported benchmarks and throughput measurements.
Axiom & Free-Parameter Ledger
free parameters (2)
- short window size
- frame sink size
axioms (2)
- domain assumption: Causal attention permits efficient KV caching without quality loss relative to bidirectional attention for video sequences.
- domain assumption: Streaming long tuning aligns training and inference distributions sufficiently to prevent degradation on long outputs.
Forward citations
Cited by 23 Pith papers
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
Motion-Aware Caching for Efficient Autoregressive Video Generation
MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Motion-Aware Caching for Efficient Autoregressive Video Generation
MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.
Reference graph
Works this paper leans on
-
[1]
Diffusion forcing: Next-token prediction meets full-sequence diffusion
Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. In NeurIPS, 2024 a
work page 2024
-
[2]
SkyReels-V2: Infinite-length Film Generative Model
Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...
work page 2025
-
[3]
Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, and Enze Xie. Sana-video: Efficient video generation with block linear diffusion transformer, 2025 b . URL https://arxiv.org/a...
-
[4]
SEINE: short-to-long video diffusion model for generative transition and prediction
Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. SEINE: short-to-long video diffusion model for generative transition and prediction. In ICLR, 2024 b
work page 2024
-
[5]
Longlora: Efficient fine-tuning of long-context large language models
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. In ICLR, 2024 c
work page 2024
-
[6]
One-minute video generation with test-time training
Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. In CVPR, pp. 17702-17711, 2025
work page 2025
-
[7]
Autoregressive video generation without vector quantization
Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. In ICLR, 2025
work page 2025
-
[8]
The matrix: Infinite-horizon world generation with real-time moving control
Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. CoRR, abs/2412.03568, 2024
-
[9]
Longvie: Multimodal-guided controllable ultra-long video generation
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, and Ziwei Liu. Longvie: Multimodal-guided controllable ultra-long video generation. CoRR, abs/2508.03694, 2025
-
[10]
Long-context autoregressive video modeling with next-frame prediction
Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. CoRR, abs/2503.19325, 2025
-
[11]
Long context tuning for video generation
Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. CoRR, abs/2503.10589, 2025
-
[12]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion. CoRR, abs/2501.00103, 2025
work page 2025
-
[13]
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. CoRR, abs/2211.13221, 2022
work page 2022
-
[14]
Streamingt2v: Consistent, dynamic, and extendable long video generation from text
Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In CVPR, pp. 2568-2577, 2025
work page 2025
-
[15]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. CoRR, abs/2506.08009, 2025
work page 2025
-
[16]
VBench : Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench : Comprehensive benchmark suite for video generative models. In CVPR, 2024 a
work page 2024
-
[17]
Vbench++: Comprehensive and versatile benchmark suite for video generative models
Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models. CoRR, abs/2411.13503, 2024 b
-
[18]
Pyramidal flow matching for efficient video generative modeling
Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In ICLR, 2025
work page 2025
-
[19]
Streamdit: Real-time streaming text-to-video generation
Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. CoRR, abs/2507.03745, 2025
-
[20]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...
work page 2024
-
[21]
Kling ai: Next-generation ai creative studio, 2024
Kuaishou. Kling AI: Next-generation AI creative studio, 2024
work page 2024
-
[22]
Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models
Muyang Li*, Yujun Lin*, Zhekai Zhang*, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[23]
Freelong++: Training-free long video generation via multi-band spectralfusion
Yu Lu and Yi Yang. Freelong++: Training-free long video generation via multi-band spectralfusion. CoRR, abs/2507.00162, 2025
-
[24]
Freelong: Training-free long video generation with spectralblend temporal attention
Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. Freelong: Training-free long video generation with spectralblend temporal attention. In NeurIPS, 2024
work page 2024
-
[25]
Yume: An interactive world generation model
Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. CoRR, abs/2507.17744, 2025
-
[27]
OpenAI. Introducing GPT-5, August 2025. Accessed 2025-09-21
work page 2025
-
[28]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pp. 4172-4182, 2023
work page 2023
-
[29]
Freenoise: Tuning-free longer video diffusion via noise rescheduling
Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. In ICLR, 2024
work page 2024
-
[30]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139, pp. 8748-8763, 2021
work page 2021
-
[31]
History-guided video diffusion
Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. CoRR, abs/2502.06764, 2025
-
[32]
Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng Yin, Sir...
work page 2025
-
[33]
Phenaki: Variable length video generation from open domain textual descriptions
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023
work page 2023
-
[35]
Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models
Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. 2024
work page 2024
-
[36]
Lavie: High-quality video generation with cascaded latent diffusion models
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. Lavie: High-quality video generation with cascaded latent diffusion models. Int. J. Comput. Vis., 133(5), 2025
work page 2025
-
[37]
Mocha: Towards movie-grade talking character synthesis
Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, and Wenhu Chen. Mocha: Towards movie-grade talking character synthesis. CoRR, abs/2503.23307, 2025
-
[38]
An Yang, Jinze Bai, et al. Qwen2 technical report. arXiv, 2024 a
work page 2024
-
[41]
Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12):1551-1558, 2021
work page 2021
-
[42]
Cogvideox: Text-to-video diffusion models with an expert transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025
work page 2025
-
[43]
NUWA-XL: diffusion over diffusion for extremely long video generation
Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Ming Gong, Lijuan Wang, Zicheng Liu, Houqiang Li, and Nan Duan. NUWA-XL: diffusion over diffusion for extremely long video generation. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), ACL, pp. ...
work page 2023
-
[44]
Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. In NeurIPS, volume 37, 2024 a
work page 2024
-
[45]
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, pp. 6613-6623, 2024 b
work page 2024
-
[46]
From slow bidirectional to fast autoregressive video diffusion models
Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025
work page 2025
-
[47]
Lumos-1: On autoregressive video generation from a unified model perspective
Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, and Yi Yang. Lumos-1: On autoregressive video generation from a unified model perspective. CoRR, abs/2507.08801, 2025
-
[48]
Packing input frame context in next-frame prediction models for video generation
Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. CoRR, abs/2504.12626, 2025
-
[49]
Matrix-game: Interactive world foundation model
Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model. CoRR, abs/2506.18701, 2025
-
[50]
Riflex: A free lunch for length extrapolation in video diffusion transformers
Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. CoRR, abs/2502.15894, 2025
-
[51]
Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, and Heung-Yeung Shum. Taming teacher forcing for masked autoregressive video generation, 2025. URL https://arxiv.org/abs/2501.12389
- [57] Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314, 2025.
- [58] Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, et al. MoCha: Towards Movie-Grade Talking Character Synthesis. CoRR, 2025.
- [59] Kling AI: Next-Generation AI Creative Studio.
- [60] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent Video Diffusion Models for High-Fidelity Long Video Generation. CoRR, abs/2211.13221, 2022.
- [61] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al. Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions. ICLR, 2023.
- [62] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction. ICLR, 2024.
- [63] Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers. CoRR, abs/2502.15894, 2025.
- [64] Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling. ICLR, 2024.
- [65] Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention. NeurIPS, 2024.
- [66]
- [67] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal Flow Matching for Efficient Video Generative Modeling. ICLR, 2025.
- [68] Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion. NeurIPS, 2024.
- [69] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. SkyReels-V2: Infinite-Length Film Generative Model. CoRR, abs/2504.13074, 2025.
- [70] Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Ming Gong, Lijuan Wang, Zicheng Liu, Houqiang Li, and Nan Duan. NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation. ACL, 2023.
- [71] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models. Int. J. Comput. Vis., 2024.
- [72] Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-Minute Video Generation with Test-Time Training. CVPR, 2025.
- [73] Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long Context Tuning for Video Generation. CoRR, abs/2503.10589, 2025.
- [74] Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text. CVPR, 2025.
- [75] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models. CVPR, 2025.
- [76] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. CoRR, abs/2506.08009, 2025.
- [77]
- [78] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, et al. MAGI-1: Autoregressive Video Generation at Scale. CoRR, abs/2505.13211, 2025.
- [79]
- [80] Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control. CoRR, abs/2412.03568, 2024.
- [81] Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-Game: Interactive World Foundation Model. CoRR, abs/2506.18701, 2025.
- [82] Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An Interactive World Generation Model. CoRR, 2025.
- [83] Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, and Ziwei Liu. CoRR, 2025.