arxiv: 2604.17625 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation

Hovhannes Margaryan , Quentin Bammey , Christian Sandor

Authors on Pith no claims yet

Pith reviewed 2026-05-10 05:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords videocurrentframessucceedingchunkscontinuationcouplingsevaluations

0 comments

The pith

FlowC2S flows directly from current video frames to succeeding ones, halving input size and outperforming prior methods with five evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FlowC2S, a fine-tuned flow model for generating video continuations from text. Instead of combining current frames with noise, it learns a direct vector field from current to next video chunks. This approach uses adjacent chunks as proxies for optimal couplings to create straighter flows and adds target inversion for better fidelity. The result is a method that requires half the input dimensionality, runs efficiently with few neural evaluations, and achieves better FID and FVD scores than existing techniques. A sympathetic reader would care because video generation often demands high compute and memory, so reducing these while improving quality opens practical applications in editing and extension tasks.

Core claim

FlowC2S learns a vector field directly between the current and succeeding video chunks by fine-tuning pre-trained text-to-video flow models. Using temporally adjacent chunks as inherent optimal couplings produces straighter flows, and injecting the inverted latent of the target chunk strengthens the mapping. This direct flow reduces the model input dimensionality by a factor of two compared to standard current-plus-noise inputs, enabling fast continuation with as few as five function evaluations while surpassing state-of-the-art FID and FVD scores.

What carries the argument

The direct vector field from current to succeeding video chunks, facilitated by inherent optimal couplings from adjacent frames and target inversion.

Load-bearing premise

Temporally adjacent video chunks can serve as a practical proxy for true optimal couplings to produce straighter flows, and target inversion improves correspondences without adding artifacts.

What would settle it

An experiment showing that a baseline model using current frames plus noise achieves equal or better FID and FVD scores than FlowC2S when both are fine-tuned similarly and evaluated on the same video continuation benchmarks.

Figures

Figures reproduced from arXiv: 2604.17625 by Christian Sandor, Hovhannes Margaryan, Quentin Bammey.

**Figure 1.** Figure 1: FlowC2S generates video continuations starting the generation directly from the given frames. We achieve this by training [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 3.** Figure 3: Optimal Transport (OT) plan heatmaps between video chunks. We compute pairwise OT plans between a batch of video chunks, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Training loss (left), validation FID (middle), and FVD (right) across four experimental set-ups. Training from scratch with [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Visual comparison across four settings; frames shown with a stride of 13. Training from scratch w/ OC+TI shows visual artifacts, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Ablations on NFE and number of frames: (a) With inherent OC+TI, 5–10 NFEs equate or surpass 40 NFEs on FID/FVD and [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Per-category FID vs. NFE comparing w/ inherent OC, w/o TI (blue) and w/ inherent, OC w/ TI (red). The benefit of TI is [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Per-category FVD vs. NFE comparing w/ inherent OC, w/o TI (blue) and w/ inherent, OC w/ TI (red). FVD is substantially [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Additional visual results on OpenVid (val). FlowC2S, fine-tuned from LTXV, generates video continuations that are both [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Additional visual results on ablation across four training setups (frames shown with stride 13). Training from scratch with [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Ablation on neural function evaluations (NFEs). Frames are shown with a stride of 13. 5–10 NFEs yield quality comparable to [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Long video continuation. The number of input and future frames is 113, and the frames are visualized with a stride of 28. [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: Failure cases for very long continuation. Shown are 129 input and generated frames (visualized with a stride of 28). Beyond [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

read the original abstract

This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FlowC2S halves the input size for flow-based video continuation by conditioning only on current frames and using adjacent chunks as a proxy for better couplings, but the abstract gives no ablations or metrics to back the straighter-flows claim.

read the letter

The main thing to know is that this paper fine-tunes existing text-to-video flow models to map directly from one video chunk to the next instead of mixing current frames with noise. That cuts the input dimensionality in half and they say it delivers stronger FID and FVD numbers with only five function evaluations after starting from LTXV and Wan. The two concrete moves are treating temporally adjacent chunks as a stand-in for optimal couplings during training and adding target inversion by feeding in the inverted latent of the target chunk. Both are presented as ways to get straighter paths and tighter visual matches. The efficiency angle is practical and the fine-tuning route avoids training everything from scratch, which is a reasonable engineering choice for this kind of work. The soft spots sit in the missing checks. The abstract asserts straighter flows and SOTA results from the adjacent-chunk proxy and inversion, yet it supplies no path-length numbers, velocity-norm diagnostics, or ablations against standard noise conditioning to show those choices actually produce the claimed benefits. Without those, it is difficult to tell whether the proxy holds for real video dynamics or whether the inversion introduces artifacts on longer clips. The quantitative wins are stated without naming the exact baselines or reporting variance, so the strength of the improvement is hard to gauge from what is given. This is aimed at groups already working on efficient inference for flow or diffusion video models. Someone looking for input-size tricks in continuation tasks could extract the core idea if the full experiments confirm it. I would send it to peer review so the experiments and ablations can be examined directly.

Referee Report

3 major / 1 minor

Summary. The paper proposes FlowC2S, which fine-tunes pre-trained text-to-video flow models (LTXV and Wan) to learn a vector field directly between current and succeeding video chunks for continuation. Key elements include using temporally adjacent chunks as a proxy for optimal couplings to produce straighter flows, target inversion by injecting the inverted latent of the target chunk into the input, and a resulting factor-of-two reduction in input dimensionality versus standard current-plus-noise conditioning. The method is claimed to achieve state-of-the-art FID and FVD scores with as few as five neural function evaluations.

Significance. If the core design choices prove robust, the dimensionality reduction and low-NFE performance would represent a practical advance for memory-efficient video continuation, with potential benefits for downstream tasks such as editing and streaming. The empirical fine-tuning strategy from existing flow models is a clear strength, as is the explicit focus on straighter flows via adjacent-frame couplings; however, the absence of supporting metrics or controls limits evaluation of whether these choices deliver the claimed advantages over noise-based baselines.

major comments (3)

[Abstract] Abstract: The central claim that temporally adjacent video chunks serve as a practical proxy for true optimal couplings (producing straighter flows and enabling the factor-of-two dimensionality reduction) is load-bearing for both the efficiency argument and the reported FID/FVD gains, yet the manuscript provides no quantitative checks such as path-length statistics, velocity-norm distributions on the learned vector field, or ablations comparing adjacent-chunk couplings against noise-based alternatives.
[Abstract] Abstract: Superiority on FID and FVD is asserted after fine-tuning from LTXV and Wan, but no experimental details are supplied on datasets, baseline implementations, evaluation protocols, sample counts, or variance estimates; this absence prevents verification that the gains are attributable to the proposed couplings and inversion rather than other factors.
[Abstract] Abstract: Target inversion is presented as strengthening correspondences and improving fidelity without introducing artifacts, but the text contains no ablation isolating its contribution or measuring its effect on flow straightness or visual quality, leaving a load-bearing component of the method unverified.

minor comments (1)

[Abstract] Abstract: The phrase 'inherent optimal couplings' is used without a formal definition or citation to optimal-transport literature in the flow-matching context, which could confuse readers unfamiliar with the distinction from learned couplings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify opportunities to strengthen the empirical support for our design choices. We will revise the manuscript to incorporate additional quantitative analyses, experimental details, and ablations as outlined below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that temporally adjacent video chunks serve as a practical proxy for true optimal couplings (producing straighter flows and enabling the factor-of-two dimensionality reduction) is load-bearing for both the efficiency argument and the reported FID/FVD gains, yet the manuscript provides no quantitative checks such as path-length statistics, velocity-norm distributions on the learned vector field, or ablations comparing adjacent-chunk couplings against noise-based alternatives.

Authors: We agree that direct quantitative validation of the straighter-flow hypothesis would strengthen the paper. In the revised manuscript we will add path-length statistics and velocity-norm distributions computed on the learned vector field, together with an explicit ablation that compares adjacent-chunk couplings against standard noise-based conditioning on the same backbone models. These additions will be placed in the Experiments and Ablation sections. revision: yes
Referee: [Abstract] Abstract: Superiority on FID and FVD is asserted after fine-tuning from LTXV and Wan, but no experimental details are supplied on datasets, baseline implementations, evaluation protocols, sample counts, or variance estimates; this absence prevents verification that the gains are attributable to the proposed couplings and inversion rather than other factors.

Authors: The full manuscript already contains the requested information in the Experiments section (datasets, fine-tuning protocol, baseline re-implementations, evaluation metrics, and number of samples). To address the referee’s concern about verifiability, we will (i) expand the abstract with a concise statement of the evaluation protocol and (ii) add per-metric standard deviations and exact sample counts to the main results tables. These changes will make the attribution of gains to the proposed components explicit. revision: yes
Referee: [Abstract] Abstract: Target inversion is presented as strengthening correspondences and improving fidelity without introducing artifacts, but the text contains no ablation isolating its contribution or measuring its effect on flow straightness or visual quality, leaving a load-bearing component of the method unverified.

Authors: We acknowledge the value of an isolated ablation for target inversion. The revised version will include a dedicated ablation study that removes target inversion while keeping all other components fixed, reporting its impact on FID, FVD, flow straightness metrics, and qualitative visual quality. This will be added to the Ablation Studies subsection. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning from external pre-trained models

full rationale

The paper presents FlowC2S as a fine-tuning procedure applied to independent pre-trained text-to-video flow models (LTXV and Wan). It adopts temporally adjacent chunks as a practical proxy for couplings and adds target inversion as an input modification, then reports empirical FID/FVD gains at low NFEs. No equations, derivations, or self-citations are shown that reduce the claimed dimensionality reduction or performance gains to fitted parameters by construction, to a self-referential uniqueness theorem, or to an ansatz smuggled from prior author work. The central claims rest on external model initialization and quantitative evaluation against external benchmarks, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of suitable pre-trained flow models (LTXV, Wan) and the assumption that adjacent video chunks approximate optimal transport couplings. No explicit free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Pre-trained text-to-video flow models can be fine-tuned to learn direct vector fields between adjacent chunks
Invoked when stating fine-tuning from LTXV and Wan
ad hoc to paper Temporally adjacent chunks serve as practical proxies for optimal couplings
Stated as first key design choice

pith-pipeline@v0.9.0 · 5473 in / 1378 out tokens · 37679 ms · 2026-05-10T05:31:26.627692+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

86 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Albergo and Eric Vanden-Eijnden

Michael S. Albergo and Eric Vanden-Eijnden. Building nor- malizing flows with stochastic interpolants, 2023. 2, 3, 19

2023
[2]

Albergo, Nicholas M

Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden- Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions, 2025. 14

2025
[3]

All are worth words: A vit backbone for diffusion models, 2023

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models, 2023. 2

2023
[4]

Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. 2, 13

2023
[5]

Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving, 2020. 2, 5, 13

2020
[6]

Diffusion forcing: Next-token prediction meets full-sequence diffu- sion, 2024

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Sim- chowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffu- sion, 2024. 3

2024
[7]

Skyreels- v2: Infinite-length film generative model, 2025

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, De- bang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels- v2: Infinite-length film generative mo...

2025
[8]

Videocrafter1: Open diffusion models for high-quality video generation, 2023

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023. 2

2023
[9]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024. 2

2024
[10]

Goku: Flow based video generative foundation models,

Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Goku: Flow based video generative foundation models,
[11]

Emu: Enhancing image generation models using photogenic nee- dles in a haystack, 2023

Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xi- aofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Mot- wani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ra- manathan, Zijian He, Peter Vajda...

2023
[12]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdvances in Neural Infor- mation Processing Systems, pages 8780–8794. Curran Asso- ciates, Inc., 2021. 2, 19

2021
[13]

Taming transformers for high-resolution image synthesis, 2021

Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image synthesis, 2021. 2

2021
[14]

Scaling rectified flow trans- formers for high-resolution image synthesis, 2024

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yan- nik Marek, and Robin Rombach. Scaling rectified flow trans- formers for high-resolution image synthesis, 2024. 2, 14, 18

2024
[15]

Vista: A generalizable driving world model with high fidelity and versatile controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yi- hang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2024. 2, 3, 5, 6, 13

2024
[16]

Ltx-video: Realtime video latent diffusion, 2024

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion, 2024. 2, 3, 4, 5, 6, 13, 18

2024
[17]

Gem: A generaliz- able ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition con- trol, 2024

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Br ¨uggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, and Alexandre Alahi. Gem: A generaliz- able ego-vision multimodal ...

2024
[18]

Cameractrl: Enabling camera control for text-to-video generation, 2025

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation, 2025. 8, 19

2025
[19]

Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text, 2025

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text, 2025. 2

2025
[20]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium, 2018

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium, 2018. 5

2018
[21]

Classifier-free diffusion guidance, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 13

2022
[22]

Denoising dif- fusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. InAdvances in Neural Infor- 9 mation Processing Systems, pages 6840–6851. Curran Asso- ciates, Inc., 2020. 2, 19

2020
[23]

Training-free camera control for video generation, 2025

Chen Hou and Zhibo Chen. Training-free camera control for video generation, 2025. 19

2025
[24]

VBench: Com- prehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Reco...

2024
[25]

Vace: All-in-one video creation and editing, 2025

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing, 2025. 8, 19

2025
[26]

Pyramidal flow matching for effi- cient video generative modeling, 2025

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for effi- cient video generative modeling, 2025. 2

2025
[27]

L. V . Kantorovich. On a problem of monge.Uspekhi Matem- aticheskikh Nauk, 3(2):225–226, 1948. In Russian. 4

1948
[28]

Analyzing and improving the training dynamics of diffusion models, 2024

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2024. 2

2024
[29]

Text2video-zero: Text-to- image diffusion models are zero-shot video generators, 2023

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to- image diffusion models are zero-shot video generators, 2023. 2

2023
[30]

Fifo-diffusion: Generating infinite videos from text without training, 2024

Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training, 2024. 3

2024
[31]

Auto-encoding varia- tional bayes, 2022

Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes, 2022. 2, 4, 13

2022
[32]

Hunyuanvideo: A systematic framework for large video generative models, 2025

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

2025
[33]

Optimal flow matching: Learning straight trajectories in just one step, 2024

Nikita Kornilov, Petr Mokrov, Alexander Gasnikov, and Alexander Korotin. Optimal flow matching: Learning straight trajectories in just one step, 2024. 3

2024
[34]

Animateanything: Consistent and con- trollable animation for video generation

Guojun Lei, Chi Wang, Rong Zhang, Yikai Wang, Hong Li, and Weiwei Xu. Animateanything: Consistent and con- trollable animation for video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27946–27956, 2025. 19

2025
[35]

Movideo: Motion-aware video generation with diffusion models, 2024

Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, and Rakesh Ranjan. Movideo: Motion-aware video generation with diffusion models, 2024. 19

2024
[36]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matt Le. Flow matching for generative modeling, 2023. 2, 3, 19

2023
[37]

Generative video bi-flow

Chen Liu and Tobias Ritschel. Generative video bi-flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19363–19372, 2025. 13, 14, 16

2025
[38]

Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 2, 3, 19

2022
[39]

Chan, and Jean michel Morel

Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, and Jean michel Morel. Redefining temporal modeling in video diffu- sion: The vectorized timestep approach, 2024. 3

2024
[40]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. 6, 13

2019
[41]

Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025

Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiao- niu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qil- ing Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Cheng...

2025
[42]

Latte: La- tent diffusion transformer for video generation, 2025

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: La- tent diffusion transformer for video generation, 2025. 2

2025
[43]

Controllable video generation: A survey, 2025

Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, Zeyu Wang, Zhifeng Li, Xiu Li, Wei Liu, Dan Xu, Linfeng Zhang, and Qifeng Chen. Controllable video generation: A survey, 2025. 8, 19

2025
[44]

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhen- heng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to- video generation.arXiv preprint arXiv:2407.02371, 2024. 2, 4, 5, 13, 15

work page internal anchor Pith review arXiv 2024
[45]

Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, 10 Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Pa...

2023
[46]

Scalable diffusion models with transformers, 2023

William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. 2

2023
[47]

Non-denoising forward-time diffusions,

Stefano Peluchetti. Non-denoising forward-time diffusions,
[48]

Open-sora 2.0: Train- ing a commercial-level video generation model in $200k,

Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yu...
[49]

Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 2, 18

2023
[50]

Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petro- vic, and Yuming Du

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sin...

2025
[51]

Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky T. Q. Chen. Multisample flow matching: Straightening flows with minibatch couplings, 2023. 3, 4

2023
[52]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 2

2022
[53]

Rolling diffusion models, 2024

David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models, 2024. 3

2024
[54]

Seaweed-7b: Cost-effective training of video generation foundation model, 2025

Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Meng Wei, Zhiwu Qing, Fei Xiao, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu...

2025
[55]

Diffusion schr ¨odinger bridge matching, 2023

Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Ar- naud Doucet. Diffusion schr ¨odinger bridge matching, 2023. 14

2023
[56]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InProceedings of the 32nd International Conference on Machine Learning, pages 2256–2265, Lille, France, 2015. PMLR. 2, 19

2015
[57]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InInternational Conference on Learning Representations, 2021

2021
[58]

Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions, 2021. 2, 19

2021
[59]

Improving and generalizing flow-based gen- erative models with minibatch optimal transport, 2024

Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based gen- erative models with minibatch optimal transport, 2024. 3, 4

2024
[60]

To- wards accurate generative models of video: A new metric & challenges, 2019

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges, 2019. 5

2019
[61]

Neural discrete representation learning,

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning,
[62]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. 2

2023
[63]

Wan: Open and advanced large-scale video generative models, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

2025
[64]

Modelscope text-to-video technical report, 2023

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report, 2023. 2

2023
[65]

Taming rectified flow for inversion and editing, 2025

Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing, 2025. 4, 5

2025
[66]

Lavie: High-quality video gen- eration with cascaded latent diffusion models, 2023

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yum- ing Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. Lavie: High-quality video gen- eration with cascaded latent diffusion models, 2023. 2

2023
[67]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023. 2

2023
[68]

Omnigen: Unified image genera- tion, 2024

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xin- grun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image genera- tion, 2024. 19

2024
[69]

Progressive autoregres- sive video diffusion models, 2025

Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregres- sive video diffusion models, 2025. 3

2025
[70]

Sana: Efficient high-resolution im- age synthesis with linear diffusion transformers, 2024

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution im- age synthesis with linear diffusion transformers, 2024. 2, 18

2024
[71]

Camco: Camera- controllable 3d-consistent image-to-video generation, 2024

Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera- controllable 3d-consistent image-to-video generation, 2024. 8, 19

2024
[72]

Cogvideox: Text-to-video diffusion models with an expert transformer, 2025

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer, 2025. 2, 18

2025
[73]

Tianwei Yin, Micha ¨el Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Im- proved distribution matching distillation for fast image syn- thesis, 2024. 3

2024
[74]

Freeman, and Taesung Park

Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation,
[75]

From slow bidirectional to fast autoregressive video diffusion mod- els

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Free- man, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion mod- els. 2025. 2, 3, 4, 5, 6, 13, 16

2025
[76]

Adding conditional control to text-to-image diffusion models, 2023

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 8, 19

2023
[77]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018. 5

2018
[78]

Controlvideo: Training-free controllable text-to-video generation, 2023

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation, 2023. 8, 19

2023
[79]

Unipc: A unified predictor-corrector framework for fast sampling of diffusion models, 2023

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models, 2023. 14

2023
[80]

Vidcraft3: Camera, object, and lighting control for image-to-video generation,

Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, and Yanwei Fu. Vidcraft3: Camera, object, and lighting control for image-to-video generation,

Showing first 80 references.