pith · machine review for the scientific record

arxiv: 2509.22622 · v2 · submitted 2025-09-26 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

LongLive: Real-time Interactive Long Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords: long video generation · autoregressive model · real-time video · interactive generation · KV cache · video synthesis · causal attention

The pith

LongLive turns a short-clip autoregressive model into a real-time system that generates up to 240-second videos at 20.7 FPS while accepting streaming prompt changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongLive as a frame-level autoregressive framework that addresses the efficiency and consistency problems of long video generation. It uses KV-recache to refresh cached states when the prompt changes, streaming long tuning to match training and inference lengths, and short-window attention plus a frame sink to keep visual coherence without full attention. These designs let the authors fine-tune a 1.3B-parameter short-clip model for minute-scale generation in 32 GPU-days. At inference the system runs at 20.7 frames per second on a single H100 GPU and handles interactive prompt streams without major drift.
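
A minimal sketch of that attention pattern in PyTorch, assuming one token block per frame; the window and sink sizes here are illustrative, since the paper's exact values are not stated in the abstract:

    import torch

    # Causal frame-level attention mask: each frame attends to itself, the
    # previous frames inside `window`, and the first `sink` frames (frame sink).
    def frame_sink_mask(num_frames: int, window: int, sink: int) -> torch.Tensor:
        q = torch.arange(num_frames).unsqueeze(1)  # query frame index
        k = torch.arange(num_frames).unsqueeze(0)  # key frame index
        causal = k <= q
        in_window = (q - k) < window
        in_sink = k < sink
        return causal & (in_window | in_sink)

    # Frame 11 with window=4, sink=2 attends to frames 0, 1 and 8..11:
    # long-range anchoring from the sink, local coherence from the window.
    print(frame_sink_mask(12, window=4, sink=2).int())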

Core claim

LongLive is a causal frame-level autoregressive model that integrates a KV-recache mechanism to update cached states for prompt switches, streaming long tuning to enforce train-long-test-long alignment, and short-window attention paired with a frame sink to maintain long-range consistency while speeding generation. With these designs the model supports minute-long videos, real-time interaction, and INT8 inference on a single GPU.
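
The INT8 claim is easiest to picture as weight quantization of the model's linear layers. The sketch below uses PyTorch's stock dynamic quantization as a stand-in; the paper's actual INT8 pipeline is not described in the abstract and may differ:

    import torch
    import torch.nn as nn

    # Stand-in transformer MLP: quantize only the Linear weights to INT8.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 1024)
    # "Marginal quality loss" corresponds to a small numeric gap like this one.
    print((model(x) - qmodel(x)).abs().max().item())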

What carries the argument

KV-recache combined with short-window attention and frame sink inside a causal frame-level autoregressive architecture.
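
Our reading of the recache step, as a hedged sketch: on a prompt switch, the retained frame latents are re-projected into keys and values under the new prompt conditioning, rather than the cache being cleared (which would lose history) or kept stale (which would ignore the new prompt). `project_kv` is a hypothetical stand-in for the model's prompt-conditioned KV projection:

    import torch

    def project_kv(frame_latent, prompt_emb):
        # Hypothetical prompt-conditioned KV projection; placeholder math only.
        return frame_latent + prompt_emb, frame_latent * 0.5 + prompt_emb

    def kv_recache(frame_latents, new_prompt_emb):
        # Rebuild the KV cache from the kept frames under the new prompt, so
        # history survives the switch while cached states adhere to it.
        return [project_kv(z, new_prompt_emb) for z in frame_latents]

    cache = kv_recache([torch.randn(16) for _ in range(8)], torch.randn(16))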

If this is right

  • Real-time interactive video creation becomes practical on consumer GPUs.
  • Training cost for long-video capability drops to tens of GPU-days instead of hundreds.
  • The same model produces both short clips and full-minute sequences at high speed.
  • INT8 quantization preserves quality, enabling lower-memory deployment.
  • VBench scores remain strong for both short and long outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could transfer to other autoregressive sequence tasks such as long audio or motion capture.
  • If memory scaling improves, the same mechanisms might support hour-long coherent videos.
  • Interactive control opens uses in live editing, simulation, or educational content.
  • Quantized real-time inference suggests deployment on edge devices for on-the-fly video synthesis.

Load-bearing premise

The KV-recache and frame-sink pair keeps visual consistency and semantic adherence across prompt changes and long sequences without cumulative drift or artifacts.

What would settle it

A 240-second video with repeated prompt transitions that shows visible object distortion, color shift, or loss of prompt adherence after the first minute.
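
One way to run that test, sketched with CLIP as the adherence metric (our choice; the paper does not commit to a specific score): sample one frame per second, score each frame against the prompt active at that timestamp, and look for a sustained drop after the first minute or after switches:

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def adherence_curve(frames, prompt_at):
        # frames: PIL images sampled once per second; prompt_at(t) returns the
        # prompt active at second t. Returns per-second image-text similarity.
        scores = []
        for t, frame in enumerate(frames):
            inputs = processor(text=[prompt_at(t)], images=frame, return_tensors="pt")
            with torch.no_grad():
                out = model(**inputs)
            img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
            txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
            scores.append(float((img * txt).sum()))
        return scores  # a sustained downward trend over 240 s would falsify the premise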

read the original abstract

We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents LongLive, a causal frame-level autoregressive framework for real-time interactive long video generation. It integrates KV-recache to refresh states on prompt switches, streaming long tuning to align train and inference distributions, and short-window attention with a frame sink to preserve consistency. Starting from a 1.3B-parameter short-clip model, the approach enables minute-long generation after 32 GPU-days of fine-tuning, delivering 20.7 FPS inference on a single H100 GPU, support for up to 240-second videos, strong VBench scores on both short and long clips, and INT8 quantization with marginal quality loss.

Significance. If the reported efficiency and consistency results are substantiated, the work would constitute a meaningful step toward practical interactive long-video synthesis. The causal AR design with targeted long-sequence mechanisms addresses the efficiency bottlenecks of bidirectional diffusion models and the quality degradation typical of standard KV-cached autoregressive video models, potentially enabling real-time applications that were previously infeasible on single-GPU hardware.

major comments (3)
  1. [Experimental Results] The headline claim that KV-recache combined with short-window attention and frame sink prevents cumulative drift and maintains semantic adherence across prompt transitions rests on untested premises. No ablation removing KV-recache (or the other components) is reported, nor are transition-specific metrics such as per-switch CLIP/DINO consistency or drift curves over 240 s provided in the results.
  2. [Results] The reported 20.7 FPS, VBench scores, and 240 s duration figures are given as point estimates without error bars, standard deviations across runs, or details on the number of evaluation seeds. This weakens confidence in the reliability of the efficiency and quality assertions that underpin the central contribution.
  3. [Methods] The streaming long tuning procedure is described as aligning training and inference, yet no quantitative comparison (e.g., train-long vs. train-short test-long performance gap) is supplied to demonstrate that the alignment actually reduces the quality degradation that the introduction attributes to standard AR long-video training.
minor comments (2)
  1. [Abstract] The abstract states 'strong performance on VBench' without quoting the numerical scores or naming the exact baselines used for comparison; these details should appear in the abstract or be cross-referenced to a table.
  2. [Methods] Implementation specifics of the frame sink (exact sink size, how it interacts with the short window) are only sketched; a precise algorithmic description or pseudocode would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the detailed and insightful review of our manuscript. The comments highlight important aspects that can further strengthen the presentation of our work on LongLive. We address each major comment below and outline the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [Experimental Results] The headline claim that KV-recache combined with short-window attention and frame sink prevents cumulative drift and maintains semantic adherence across prompt transitions rests on untested premises. No ablation removing KV-recache (or the other components) is reported, nor are transition-specific metrics such as per-switch CLIP/DINO consistency or drift curves over 240 s provided in the results.

    Authors: We appreciate this observation. While the main results demonstrate the overall effectiveness through VBench scores and qualitative examples of prompt transitions, we agree that dedicated ablations isolating the contribution of KV-recache, short-window attention, and frame sink would provide stronger evidence. Additionally, we will include transition-specific metrics such as per-switch CLIP and DINO consistency scores, as well as drift curves over extended sequences. These will be added to the revised manuscript; a sketch of one such per-switch metric follows these responses. revision: yes

  2. Referee: [Results] The reported 20.7 FPS, VBench scores, and 240 s duration figures are given as point estimates without error bars, standard deviations across runs, or details on the number of evaluation seeds. This weakens confidence in the reliability of the efficiency and quality assertions that underpin the central contribution.

    Authors: We acknowledge the importance of statistical reliability in reporting. In the revised version, we will provide error bars, standard deviations, and specify the number of evaluation seeds used for the FPS, VBench, and duration metrics. This will be based on multiple runs to better substantiate the claims. revision: yes

  3. Referee: [Methods] The streaming long tuning procedure is described as aligning training and inference, yet no quantitative comparison (e.g., train-long vs. train-short test-long performance gap) is supplied to demonstrate that the alignment actually reduces the quality degradation that the introduction attributes to standard AR long-video training.

    Authors: Thank you for pointing this out. To demonstrate the benefit of streaming long tuning, we will add a quantitative comparison in the methods or experiments section, showing performance gaps between models trained with train-long vs. train-short on long test sequences. This will include metrics highlighting the reduction in quality degradation. revision: yes
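
A sketch of the per-switch consistency metric the responses promise, under our assumptions about its form: embed the frames just before and just after each prompt switch with a frozen encoder (DINO ViT-S/16 via torch.hub, one plausible choice; the authors may use another backbone) and report the cosine similarity, so that visual identity is checked across the switch even as semantics change:

    import torch

    # Frozen DINO ViT-S/16 as the feature extractor.
    dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
    dino.eval()

    def per_switch_consistency(frames, switch_idx, span=4):
        # frames: [T, 3, 224, 224] normalized tensor; switch_idx: frame indices
        # where the prompt changes. Compares mean features over `span` frames
        # on each side of every switch; low similarity flags identity breaks.
        with torch.no_grad():
            feats = dino(frames)  # [T, D] CLS features
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = []
        for s in switch_idx:
            before = feats[max(0, s - span):s].mean(0)
            after = feats[s:s + span].mean(0)
            scores.append(float(torch.dot(before / before.norm(), after / after.norm())))
        return scores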

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical method (KV-recache for prompt transitions, streaming long tuning for train-long-test-long alignment, and short-window attention plus frame sink for consistency) whose performance claims rest on measured quantities: 32 GPU-days of fine-tuning a 1.3B model, 20.7 FPS inference on H100, VBench scores, and support for 240-second videos. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce these results to the inputs by construction. The derivation is self-contained and externally falsifiable via the reported benchmarks and throughput measurements.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the KV-recache mechanism for prompt adherence and the frame sink for long-range consistency; these are introduced without independent theoretical proof and are validated only through the reported training and inference runs.

free parameters (2)
  • short window size
    Chosen to trade off memory and consistency; exact value not stated in abstract but required for the attention design.
  • frame sink size
    Hyperparameter controlling how many early frames remain visible; fitted or tuned during development.
axioms (2)
  • domain assumption: Causal attention permits efficient KV caching without quality loss relative to bidirectional attention for video sequences.
    Invoked to justify the speed advantage of the AR design.
  • domain assumption: Streaming long tuning aligns training and inference distributions sufficiently to prevent degradation on long outputs.
    Central to the claim that minute-long generation is achievable from short-clip pretraining; a schematic sketch follows this ledger.
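
A schematic of the train-long-test-long idea, under our assumptions about its form (the abstract names the procedure but not its loss): unroll the model autoregressively over a long horizon exactly as at inference, carrying the KV cache across chunks, and apply the training loss to the long rollout rather than to isolated short clips. `generate_chunk` and `rollout_loss` are hypothetical placeholders:

    import torch

    def streaming_long_tuning_step(model, prompt_emb, horizon, chunk, rollout_loss, opt):
        # Unroll as at inference: the KV cache is carried across frame chunks.
        cache, frames = None, []
        for _ in range(horizon // chunk):
            out, cache = model.generate_chunk(prompt_emb, cache, n_frames=chunk)
            frames.append(out)
        rollout = torch.cat(frames, dim=0)
        loss = rollout_loss(rollout)  # supervision on the long rollout itself
        opt.zero_grad()
        loss.backward()
        opt.step()
        return float(loss)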

pith-pipeline@v0.9.0 · 5614 in / 1589 out tokens · 45255 ms · 2026-05-15T03:48:41.959401+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

  2. EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

    cs.CV 2026-05 conditional novelty 7.0

    EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.

  3. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  4. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  5. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  6. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  7. Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

  8. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  9. Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.

  10. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  11. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  12. RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

    cs.CV 2026-05 unverdicted novelty 6.0

    RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

  13. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  14. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.

  15. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  16. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  17. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  18. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  19. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    cs.CV 2025-12 unverdicted novelty 6.0

    WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.

  20. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  21. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

  22. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  23. TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · cited by 22 Pith papers · 7 internal anchors

  1. [1]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. In NeurIPS, 2024 a

  2. [2]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

  3. [3]

    Sana-video: Efficient video generation with block linear diffusion transformer

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, and Enze Xie. Sana-video: Efficient video generation with block linear diffusion transformer, 2025 b . URL https://arxiv.org/a...

  4. [4]

    SEINE: short-to-long video diffusion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. SEINE: short-to-long video diffusion model for generative transition and prediction. In ICLR, 2024 b

  5. [5]

    Longlora: Efficient fine-tuning of long-context large language models

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. In ICLR, 2024 c

  6. [6]

    One-minute video generation with test-time training

    Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. In CVPR, pp. 17702-17711, 2025

  7. [7]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. In ICLR, 2025

  8. [8]

    The matrix: Infinite-horizon world generation with real-time moving control

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. CoRR, abs/2412.03568, 2024

  9. [9]

    Longvie: Multimodal-guided controllable ultra-long video generation

    Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, and Ziwei Liu. Longvie: Multimodal-guided controllable ultra-long video generation. CoRR, abs/2508.03694, 2025

  10. [10]

    Long-context autoregressive video modeling with next-frame prediction

    Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. CoRR, abs/2503.19325, 2025

  11. [11]

    Long context tuning for video generation

    Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. CoRR, abs/2503.10589, 2025

  12. [12]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion. CoRR, abs/2501.00103, 2025

  13. [13]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. CoRR, abs/2211.13221, 2022

  14. [14]

    Streamingt2v: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In CVPR, pp. 2568-2577, 2025

  15. [15]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. CoRR, abs/2506.08009, 2025

  16. [16]

    VBench : Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench : Comprehensive benchmark suite for video generative models. In CVPR, 2024 a

  17. [17]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models. CoRR, abs/2411.13503, 2024 b

  18. [18]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In ICLR, 2025

  19. [19]

    Streamdit: Real-time streaming text-to-video generation

    Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. CoRR, abs/2507.03745, 2025

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  21. [21]

    Kling AI: Next-generation AI creative studio, 2024

    Kuaishou. Kling AI: Next-generation AI creative studio, 2024

  22. [22]

    Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models

    Muyang Li*, Yujun Lin*, Zhekai Zhang*, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. In The Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    Freelong++: Training-free long video generation via multi-band spectralfusion

    Yu Lu and Yi Yang. Freelong++: Training-free long video generation via multi-band spectralfusion. CoRR, abs/2507.00162, 2025

  24. [24]

    Freelong: Training-free long video generation with spectralblend temporal attention

    Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. Freelong: Training-free long video generation with spectralblend temporal attention. In NeurIPS, 2024

  25. [25]

    Yume: An interactive world generation model

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. CoRR, abs/2507.17744, 2025

  26. [26]

    Sora: Creating video from text, 2024

    OpenAI. Sora: Creating video from text, 2024

  27. [27]

    Introducing GPT-5, August 2025

    OpenAI. Introducing GPT-5, August 2025. Accessed: 2025-09-21

  28. [28]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pp. 4172-4182, 2023

  29. [29]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. In ICLR, 2024

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139, pp. 8748-8763, 2021

  31. [31]

    History-guided video diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. CoRR, abs/2502.06764, 2025

  32. [32]

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng Yin, Sir...

  33. [33]

    Phenaki: Variable length video generation from open domain textual descriptions

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023

  34. [35]

    Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models

    Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. 2024

  35. [36]

    Lavie: High-quality video generation with cascaded latent diffusion models

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. Lavie: High-quality video generation with cascaded latent diffusion models. Int. J. Comput. Vis., 133(5): ...

  36. [37]

    Mocha: Towards movie-grade talking character synthesis

    Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, and Wenhu Chen. Mocha: Towards movie-grade talking character synthesis. CoRR, abs/2503.23307, 2025

  37. [38]

    Qwen2 technical report

    An Yang, Jinze Bai, et al. Qwen2 technical report. arXiv, 2024 a

  38. [41]

    Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies

    Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12):1551-1558, 2021

  39. [42]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

  40. [43]

    NUWA-XL: diffusion over diffusion for extremely long video generation

    Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Ming Gong, Lijuan Wang, Zicheng Liu, Houqiang Li, and Nan Duan. NUWA-XL: diffusion over diffusion for extremely long video generation. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), ACL, pp. ...

  41. [44]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. In NeurIPS, volume 37, 2024 a

  42. [45]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, pp. 6613-6623, 2024 b

  43. [46]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

  44. [47]

    Lumos-1: On autoregressive video generation from a unified model perspective

    Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, and Yi Yang. Lumos-1: On autoregressive video generation from a unified model perspective. CoRR, abs/2507.08801, 2025

  45. [48]

    Packing input frame context in next-frame prediction models for video generation

    Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. CoRR, abs/2504.12626, 2025

  46. [49]

    Matrix-game: Interactive world foundation model

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model. CoRR, abs/2506.18701, 2025

  47. [50]

    Riflex: A free lunch for length extrapolation in video diffusion transformers

    Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. CoRR, abs/2502.15894, 2025

  48. [51]

    Taming teacher forcing for masked autoregressive video generation

    Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, and Heung-Yeung Shum. Taming teacher forcing for masked autoregressive video generation, 2025. URL https://arxiv.org/abs/2501.12389

  49. [52]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, MIT Press, 2007

  50. [53]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006

  51. [54]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016


  52. [57]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
